Postgres - Sneak Peek
- Prashant Kumar
- Aug 9, 2022
- 6 min read
Updated: Aug 9, 2022
SQL database systems in general aim to solve for a structured representation of data sets which our fragile minds can perceive and process with relative ease. Database systems also aspire to provide efficient and simple APIs to perform different read and write operations on this data.

Imagine the task of getting a list of all the students who scored above 95 in the Maths subject from a dataset of school results stored across multiple files. It's feasible, but not as convenient as:
select student_names from results where subject = 'Maths' and score > 95;
Let's see how a SQL database server works in general, using Postgres as the example. This would help in giving two important insights, among others:
Intuition on the internal considerations and thinking behind building such a complex system, which works so efficiently and reliably and provides a pretty neat API to perform different actions. You can use some of these ideas in your own implementations for similar problem sets.
Help with your decisions while building systems which interact with any SQL database.
High level Architecture
For any given query, the client makes a call to the Postgres server via some network library, like an RPC connection, a JDBC connection, etc. Clients can be Postgres client tools, the Postgres terminal client or any backend service. Once the Postgres server receives an initial connection request, the Postmaster daemon spawns a new process for further execution of requests from that client. Once the initial connection is established, the client sends a query request to the server, which goes through multiple execution steps which we will discuss later.
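You can observe this process-per-connection model yourself. As a quick sketch (assuming a running Postgres server), pg_backend_pid() returns the OS process ID of the backend serving your session; running it from two different client sessions returns two different PIDs:
-- each connection is served by its own backend process
SELECT pg_backend_pid();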
While processing a query on a table, there can be multiple executions running in parallel on that table, or there can be a lock on a row in the table. To handle such cases, each execution process needs to be aware of any other process running on the record it is trying to read or update. Postgres uses shared memory to communicate the current states of processes and other inter-process dependencies. Shared disk buffers and the kernel's disk buffer are used as temporary buffers / storage while moving data from disk to memory and vice versa.
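You can peek at this shared state from SQL. As a small illustrative sketch, pg_stat_activity lists every backend process and what it is running, and pg_locks shows the locks those processes currently hold or are waiting for:
-- one row per backend process and the query it is running
SELECT pid, state, query FROM pg_stat_activity;
-- locks held or awaited by those processes, visible via shared memory
SELECT pid, locktype, mode, granted FROM pg_locks;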

Query Processing
Let's take the example of a table results in the school database. id is the primary key and we have an index on student_id.
id | student_id | subject | score
 1 |         10 | Maths   |    80
 2 |         30 | Science |    90
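For reference, a minimal schema matching this example could look like the following (the exact column types and the index name are assumptions for illustration):
CREATE TABLE results (
    id         serial PRIMARY KEY,
    student_id integer,
    subject    text,
    score      integer
);
-- the index on student_id mentioned above
CREATE INDEX results_student_id_idx ON results (student_id);
INSERT INTO results (student_id, subject, score)
VALUES (10, 'Maths', 80), (30, 'Science', 90);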
Let's say you type the following query in your SQL client.
select student_id from results where subject = 'Maths' and score > 90 limit 20;
From receiving the query string to returning the output rows, Postgres backend operations can be categorised into the following three modules:
Parser - The query string at this point is just a sentence of text. It needs to be converted into a format Postgres understands. The parser goes through each keyword in the query and decides whether it understands the keyword based on a predefined grammar. This action breaks down into the following parts:
Syntax checking - If the syntax is wrong, the response is returned from that point itself. The parser is defined in postgres/src/backend/parser/gram.y
Lexical Analysis - Lexing and parsing are done using the flex and bison tools. The lexer is defined in postgres/src/backend/parser/scan.l
Parse Tree - This is sort of an in-memory representation of a query which postgres understands. A parse tree for our example query can look something like this:

Even if you don't want to completely understand the above tree and the keywords used in each node, the important thing to notice here is that each keyword from our query string is encapsulated in one of the nodes, be it an operator, a table field, or a constant value.
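If you are curious, one way to peek at the raw parse tree is the debug_print_parse setting, which dumps the tree at LOG level (a quick sketch; the output is verbose and goes to the server log by default):
SET client_min_messages = log;  -- also surface LOG output in this session
SET debug_print_parse = on;     -- dump the parse tree for subsequent queries
select student_id from results where subject = 'Maths' and score > 90 limit 20;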
Optimiser - The task of the optimiser is to come up with the most optimised execution plan. The optimiser takes the parse tree from above as input and generates a plan tree, which is immutable. Besides the parse tree, two more inputs are used here - table statistics and cost parameters. Cost parameters are different configuration values which a DB admin can set to tune the behaviour of the optimiser.
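For example, the main cost parameters can be inspected (and tuned) like any other setting:
-- cost parameters the optimiser uses to price candidate plans
SELECT name, setting FROM pg_settings
WHERE name IN ('seq_page_cost', 'random_page_cost',
               'cpu_tuple_cost', 'cpu_index_tuple_cost', 'cpu_operator_cost');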
You can see the query plan tree representation just by prepending EXPLAIN to the query.
EXPLAIN select student_id from results where subject = 'Maths' and score > 90 limit 20;
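The output might look something like the following (illustrative only - the chosen plan and cost numbers depend entirely on your data, statistics and settings; with a tiny table the planner simply scans it sequentially):
 Limit  (cost=0.00..1.56 rows=20 width=4)
   ->  Seq Scan on results  (cost=0.00..35.50 rows=457 width=4)
         Filter: ((subject = 'Maths'::text) AND (score > 90))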
Another example, with student_id in the where clause. Notice the use of a Bitmap Index Scan, as there's an index on the student_id field.
EXPLAIN select student_id from results where student_id = 1;
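Again illustrative only (the index name here assumes the schema sketch above), the plan could look like:
 Bitmap Heap Scan on results  (cost=4.18..12.64 rows=4 width=4)
   Recheck Cond: (student_id = 1)
   ->  Bitmap Index Scan on results_student_id_idx  (cost=0.00..4.18 rows=4 width=0)
         Index Cond: (student_id = 1)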
The optimiser considers the following to generate a plan tree:
Optimise the query - The optimiser figures out the best way to execute the given query using the Paths data structure. A path is a representation of one candidate execution plan; comparing paths helps the optimiser come up with the final execution plan.
Optimise joins - If there are multiple joins, it optimises the join order for the cheapest execution.
Using indexes and scans - It figures out which access strategy to use among an nbtree or hash index, a sequential scan, etc.
Executor - The executor (postgres/src/backend/executor) takes the plan tree and executes the query. The executor code recursively iterates through each node and executes from the bottom up to return the output rows; the execution results from each node's child subtree provide the input for its own execution. The executor also performs the following tasks:
Access Methods - The executor doesn't have context of how indexes work internally, which makes sense from a decoupling point of view. So the executor calls access methods like nbtree, hash, gin, brin, etc., defined in postgres/src/backend/access, as needed based on the inputs from the plan tree (you can list them yourself, as shown after this list).
Transaction and memory management - As the executor does the actual reading from and writing to the files, it has to handle transactions, handle race conditions and manage memory utilisation.
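As a quick sketch, the pg_am catalog lists the access methods available in your installation (on recent Postgres versions):
-- 'i' = index access method (btree, hash, gin, brin, ...),
-- 't' = table access method (heap)
SELECT amname, amtype FROM pg_am;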
Where is the Data?
Postgres uses one directory per database and one file per object to write and read the data. For example data/school/12345 and data/school/18746, where school is the database directory under the data directory, and 12345 and 18746 can be an index object or a table object. Each file is divided into page blocks of 8 kB each. Each page block has mainly two components: tuples and item pointers. Each tuple can represent one single logical row from the table or an index entry. If you refer to the figure below, tuples are added from the bottom right of the block and item pointers are added from the top left. At the point where no further item pointers and corresponding tuples can be added, a new 8 kB page block is allocated and tuples are added there along with their item pointers.

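You can locate these files yourself. A small sketch (the exact path returned depends on your installation; on a stock build it looks like base/<database-oid>/<relfilenode>):
SHOW data_directory;                      -- root of the data directory
SELECT pg_relation_filepath('results');   -- file backing the results table
SELECT current_setting('block_size');     -- page size, 8192 bytes by default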
Updating Database Files
Postgres provides multiple configuration values to control writes to the database. Directionally, most of these configurations are choices between improving performance by sacrificing how promptly changes become durable in the database files, or compromising on performance to make database updates more reliable and in sync.
The parameter synchronous_commit can be used to specify how much WAL processing must be completed before a success response is returned to the client. For each commit request, the executor ensures that a log for the change is written to the WAL buffer and, in parallel, written to the shared buffer pool. If the synchronous_commit option is off, Postgres returns a success response to the client before writing the buffer contents to the WAL file on disk. If synchronous_commit is on, Postgres waits for the local buffer changes to be applied to the WAL file on disk. Writing to a log file is still faster than writing to database files, as it is an append-only operation and doesn't need the other operations that writing to database files might require, like creating or updating indexes, creating new item pointers, allocating new 8 kB pages, updating tuple headers, etc.
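As a small sketch of this tradeoff (the inserted row is hypothetical), synchronous_commit can even be relaxed for a single transaction:
SHOW synchronous_commit;               -- 'on' by default
BEGIN;
SET LOCAL synchronous_commit = off;    -- applies to this transaction only
INSERT INTO results (student_id, subject, score)
VALUES (10, 'History', 77);
COMMIT;  -- returns before the WAL record is flushed to disk;
         -- a crash in that small window may lose this transaction,
         -- but never corrupts the database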

WAL Writer
The WAL writer is a background process which keeps appending logs from the WAL buffer to the WAL file. WAL files are also used to recover the system after a crash by replaying the events or logs from the WAL file.
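A couple of related knobs and functions, shown as a sketch (the names are standard Postgres; exact behaviour depends on your version):
SHOW wal_writer_delay;         -- how often the WAL writer wakes up (200ms by default)
SELECT pg_current_wal_lsn();   -- current end position of the WAL stream (Postgres 10+)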
Check Pointer
In the case of a system crash, one approach would be to replay logs from the beginning of the WAL file, but this would be inefficient as many of those changes would already be persisted in the database files. So Postgres has the concept of checkpoints. A background task, the Check Pointer, periodically flushes all the dirty records from the shared buffer pool to the database files and adds a checkpoint entry to the WAL. The next time Postgres has to use the WAL for recovery, in most cases it replays events only from the latest checkpoint, since all previous entries are already synced to the database files. Depending on the data size of dirty records in the shared buffer pool, the Check Pointer can get overloaded. A new background task, the Background Writer, was later introduced to assist the Check Pointer; it keeps writing records from the shared buffer pool to the database files.
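The checkpoint cadence is configurable, and you can even force one manually. A quick sketch:
SHOW checkpoint_timeout;   -- maximum time between automatic checkpoints (5min by default)
SHOW max_wal_size;         -- WAL volume that triggers a checkpoint (1GB by default)
CHECKPOINT;                -- force an immediate checkpoint (requires superuser)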
Not the End
We have covered some fundamental structures and functionalities, but there are still multiple nuances which Postgres implements. Some of the important ones are Full Page Writes, HOT updates, the usage of bloom filters, etc. We will discuss these topics in our next posts. A couple of takeaways here:
Think through which steps should be synchronous and what can run in the background. Postgres has multiple background processes, as we discussed, like the Check Pointer, Background Writer, WAL Writer, etc., which run as async processes periodically.
Decouple tasks into multiple logical steps, like Postgres does with the parser, optimiser and executor just to execute a query string. Even the executor doesn't have context of indexes and has to call the respective access methods for their implementation, reflecting segregation of responsibilities.
Make your system robust to handle failures and system crashes. Postgres implements WAL to reinstate the system after failures so that users don't lose their data.
References
https://www.youtube.com/watch?v=ysG3x2QOu0c&t=2109s