Table of Contents
- Client/Server Architecture
- Process Architecture
- Memory Architecture
- Storage Architecture
- Transaction Management
- Query Processing Pipeline
- Indexing: Speeding Up Data Retrieval
- Conclusion
- References
1. Client/Server Architecture
PostgreSQL follows a classic client/server model, where:
- Clients (e.g., psql, Python apps using psycopg2, or BI tools) send SQL queries over a network.
- Server (the PostgreSQL daemon) listens for connections, processes queries, and returns results.
How It Works:
- The server runs on a host machine, bound to a port (default: 5432).
- Clients connect via TCP/IP or Unix domain sockets, authenticate (using passwords, SSL, or peer authentication), and send queries.
- The server assigns a dedicated backend process to each client connection (more on this in Process Architecture).
Example: When you run psql -h localhost -U myuser mydb, psql acts as the client, connecting to the PostgreSQL server on localhost:5432 with user myuser and database mydb.
2. Process Architecture
PostgreSQL uses a multi-process architecture (unlike some databases that use threads). At startup, a parent process called the postmaster initializes the system and spawns child processes to handle clients and background tasks.
2.1 Postmaster: The “Gatekeeper”
The postmaster is the first process launched when PostgreSQL starts. Its responsibilities include:
- Managing the server lifecycle (startup, shutdown, crash recovery).
- Listening for incoming client connections on configured ports.
- Forking (creating) a new backend process for each client connection.
- Monitoring child processes and restarting critical background processes if they fail.
2.2 Backend Processes: “Worker Bees”
Each client connection gets its own backend process (sometimes called a “postgres” process). Backend processes:
- Parse, analyze, plan, and execute client queries.
- Manage transaction state for their client.
- Communicate with shared memory (e.g., to read/write data from shared_buffers).
Key Note: Backend processes are isolated—if one crashes, others (and the postmaster) continue running.
2.3 Background Processes: “Support Crew”
PostgreSQL relies on several long-running background processes to maintain stability and performance:
| Process | Role |
|---|---|
| Checkpointer | Periodically writes dirty pages (modified data in shared_buffers) to disk, reducing WAL replay time during recovery. |
| WAL Writer | Writes WAL (Write-Ahead Log) data from memory to disk, ensuring durability. |
| Autovacuum | Automatically cleans up expired tuples (from MVCC) and updates table statistics for the query planner. |
| Stats Collector | Tracks runtime activity statistics (e.g., table access counts) exposed via the pg_stat_* views; autovacuum uses them to decide which tables need attention. (Replaced by in-memory cumulative statistics in PostgreSQL 15.) |
| Archiver | Archives old WAL files (if WAL archiving is enabled, e.g., for replication). |
3. Memory Architecture
PostgreSQL uses two main memory pools: shared memory (shared across all processes) and local memory (private to each backend process).
3.1 Shared Memory: The “Communal Cache”
Shared memory is allocated at server startup and used by all processes. Key components:
- Shared Buffers: A cache for frequently accessed database pages (default: 128MB, but often tuned to 25-50% of system RAM for production). Think of it as PostgreSQL’s “first stop” for data: if a page is in shared_buffers, the backend avoids reading from slow disk.
- WAL Buffers: A small buffer (default: auto-sized to 1/32 of shared_buffers, capped at 16MB) for WAL records before they’re flushed to disk. Reduces disk I/O by batching WAL writes.
- CLOG (Commit Log): Tracks the commit status of all transactions (e.g., committed, aborted, in-progress). Used by MVCC to determine tuple visibility.
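The caching behavior of shared_buffers can be sketched as a fixed-capacity page cache. This is a simplification: the names below are illustrative, and PostgreSQL actually uses a clock-sweep algorithm rather than strict LRU eviction.

```python
from collections import OrderedDict

class BufferCache:
    """Toy page cache: fixed capacity, least-recently-used eviction.
    (PostgreSQL's real shared_buffers uses a clock-sweep algorithm.)"""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_id -> page contents
        self.hits = 0
        self.misses = 0

    def read_page(self, page_id, read_from_disk):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # mark as recently used
            self.hits += 1
            return self.pages[page_id]
        self.misses += 1                     # cache miss: go to "disk"
        page = read_from_disk(page_id)
        self.pages[page_id] = page
        if len(self.pages) > self.capacity:  # evict the coldest page
            self.pages.popitem(last=False)
        return page

cache = BufferCache(capacity=2)
disk = lambda pid: f"page-{pid}-data"
cache.read_page(1, disk)   # miss: read from disk
cache.read_page(1, disk)   # hit: served from cache
cache.read_page(2, disk)   # miss
cache.read_page(3, disk)   # miss; evicts page 1 (coldest)
print(cache.hits, cache.misses)  # 1 3
```

The same trade-off applies in PostgreSQL: a larger cache means fewer misses, but memory taken from the OS page cache and other processes.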
3.2 Local Memory: “Private Workspaces”
Each backend process allocates local memory for query-specific tasks. Configurable via postgresql.conf:
- work_mem: Memory for per-operation tasks like sorting (e.g., ORDER BY), hashing (e.g., hash joins), or bitmap scans (default: 4MB). Too small → slow sorts (spills to disk temp files); too large → memory bloat with many concurrent queries.
- maintenance_work_mem: Memory for maintenance operations like VACUUM, CREATE INDEX, or ALTER TABLE (default: 64MB).
- Temp Buffers: For temporary tables used by a single backend (default: 8MB).
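In postgresql.conf, these settings might look like the following. The values are illustrative (sized for a machine with roughly 16GB of RAM); tune them for your own workload.

```ini
# Shared memory (allocated once, used by all processes)
shared_buffers = 4GB            # ~25% of system RAM
wal_buffers = -1                # auto: 1/32 of shared_buffers, max 16MB

# Local memory (per backend / per operation)
work_mem = 16MB                 # per sort/hash operation, not per query
maintenance_work_mem = 512MB    # VACUUM, CREATE INDEX
temp_buffers = 8MB              # per-backend temporary tables
```

Note that work_mem applies per operation: a single query with four sorts can use 4 × work_mem, multiplied again by the number of concurrent queries.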
4. Storage Architecture
PostgreSQL stores data on disk in a hierarchical structure: tablespaces → databases → relations (tables/indexes) → pages → tuples (rows).
4.1 Tablespaces: “File System Mapping”
A tablespace is a directory on the host’s file system that PostgreSQL uses to store database files. By default, PostgreSQL uses a pg_default tablespace, but you can create custom tablespaces to:
- Separate data (e.g., large tables on fast SSDs, archives on slower HDDs).
- Manage disk space across multiple mount points.
Example: CREATE TABLESPACE fast_data LOCATION '/mnt/ssd/postgres_data';
4.2 Database Directories and Files
Each database in PostgreSQL is stored as a subdirectory under its tablespace. For example, the mydb database in pg_default lives at:
$PGDATA/base/<db_oid>/ (where $PGDATA is the PostgreSQL data directory, and <db_oid> is a unique numeric ID for the database).
4.3 Pages: The “Atomic Unit” of Storage
PostgreSQL reads/writes data in fixed-size blocks called pages (default: 8KB, configurable at compile time). Pages are the smallest unit of I/O—even a 1-byte change requires reading/writing an entire page.
A page’s structure includes:
- Page Header: Metadata (e.g., page number, checksum, free space pointer).
- Item Pointers (Line Pointers): Array of pointers to tuples stored in the page.
- Tuple Data: The actual row data (tuples).
- Free Space: Unused space for new tuples (managed by PostgreSQL).
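The layout above can be modeled in a heavily simplified way. Real pages are 8KB binary structures (defined in PostgreSQL's bufpage.h); the class and field names below are illustrative only.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 8192  # default PostgreSQL block size

@dataclass
class Page:
    """Toy 8KB page: a header, line pointers growing from the front,
    tuple data growing from the back, free space in between."""
    header_size: int = 24                               # rough size of the page header
    line_pointers: list = field(default_factory=list)   # one 4-byte pointer per tuple
    tuples: list = field(default_factory=list)          # raw tuple bytes

    def free_space(self):
        used = self.header_size + 4 * len(self.line_pointers)
        used += sum(len(t) for t in self.tuples)
        return PAGE_SIZE - used

    def add_tuple(self, data: bytes):
        if len(data) + 4 > self.free_space():
            return None                      # page full: caller picks another page
        self.tuples.append(data)
        self.line_pointers.append(len(self.tuples) - 1)
        return len(self.line_pointers) - 1   # item number within the page

page = Page()
item = page.add_tuple(b"alice,30")
print(item, page.free_space())  # 0 8156
```

The (page number, item number) pair is exactly what an index stores to locate a tuple, as discussed in the indexing section.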
4.4 Heap Files and TOAST: Storing Tuples
Table data is stored in heap files (so named because tuples are unordered, like a “heap” of data); each index is stored in its own file with a type-specific structure. Every relation file is split into pages.
For large tuples (e.g., a TEXT column with 1GB of data), PostgreSQL uses TOAST (The Oversized-Attribute Storage Technique):
- When a tuple exceeds ~2KB (roughly a quarter of a page), PostgreSQL compresses its large values and, if they are still too big, moves them out-of-line into a separate TOAST table.
- The original heap tuple then stores only a pointer to the TOASTed data.
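The TOAST decision can be sketched as: if a value is over a threshold, try compression first; if the result is still too large, move it out of line and keep only a pointer. The threshold, storage layout, and function names below are invented for illustration and do not mirror PostgreSQL's internal API.

```python
import os
import zlib

TOAST_THRESHOLD = 2000  # bytes; roughly a quarter of an 8KB page
toast_table = {}        # out-of-line storage: chunk_id -> compressed bytes

def store_value(value: bytes):
    """Return what the heap tuple would hold for this value."""
    if len(value) <= TOAST_THRESHOLD:
        return ("inline", value)                # small: keep in the heap tuple
    compressed = zlib.compress(value)
    if len(compressed) <= TOAST_THRESHOLD:
        return ("inline-compressed", compressed)
    chunk_id = len(toast_table)                 # still big: move out of line
    toast_table[chunk_id] = compressed
    return ("toast-pointer", chunk_id)          # heap keeps only a pointer

def fetch_value(stored):
    kind, payload = stored
    if kind == "inline":
        return payload
    if kind == "inline-compressed":
        return zlib.decompress(payload)
    return zlib.decompress(toast_table[payload])  # follow the TOAST pointer

big_compressible = b"x" * 100_000        # compresses far below the threshold
stored1 = store_value(big_compressible)
big_random = os.urandom(100_000)         # incompressible: goes out of line
stored2 = store_value(big_random)
print(stored1[0], stored2[0])  # inline-compressed toast-pointer
```

Either way, reading the value back is transparent to the client: the executor detoasts on access.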
5. Transaction Management
PostgreSQL guarantees ACID properties (Atomicity, Consistency, Isolation, Durability) for transactions, ensuring reliable data handling.
5.1 ACID Properties
- Atomicity: Transactions either complete fully (“commit”) or have no effect (“rollback”).
- Consistency: Transactions transform the database from one valid state to another (e.g., foreign key constraints are enforced).
- Isolation: Concurrent transactions don’t interfere with each other (PostgreSQL uses MVCC for isolation).
- Durability: Committed changes survive crashes (via WAL).
5.2 MVCC: “Snapshot Isolation”
PostgreSQL uses Multi-Version Concurrency Control (MVCC) to allow read/write concurrency without locking (in most cases). Here’s how it works:
- Tuples with Versions: When you update or delete a row, PostgreSQL doesn’t overwrite the old tuple; it marks it as expired and, for updates, creates a new version stamped with the updating transaction’s ID (XID).
- Snapshot Visibility: Each transaction sees a consistent “snapshot” of the database as it existed at the start of the transaction. A tuple is visible to a transaction only if:
- Its creating XID committed before the transaction’s snapshot was taken, and
- It hasn’t been marked as expired by a transaction that the snapshot also sees.
(This is a simplification; the full visibility rules also account for in-progress and aborted transactions.)
Example: If Alice updates a row (XID=100), Bob (in a transaction started before XID=100) will still see the old version of the row until Bob’s transaction ends.
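The Alice/Bob scenario can be simulated in a few lines of Python. This is a simplification of the real visibility rules (real snapshots also track in-progress XIDs), and the names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HeapTuple:
    value: str
    xmin: int                   # XID of the creating transaction
    xmax: Optional[int] = None  # XID of the expiring transaction, if any

def visible(t: HeapTuple, snapshot_xid: int) -> bool:
    """Simplified MVCC rule: created before the snapshot was taken,
    and not expired by a transaction the snapshot can see."""
    created_before = t.xmin <= snapshot_xid
    expired_before = t.xmax is not None and t.xmax <= snapshot_xid
    return created_before and not expired_before

# Alice (XID=100) updates a row: old version expired, new version created.
versions = [
    HeapTuple("old value", xmin=50, xmax=100),
    HeapTuple("new value", xmin=100),
]

bob_snapshot = 99      # Bob's transaction started before Alice's update
alice_snapshot = 100

print([t.value for t in versions if visible(t, bob_snapshot)])    # ['old value']
print([t.value for t in versions if visible(t, alice_snapshot)])  # ['new value']
```

Note that both versions coexist on disk; Autovacuum later reclaims the expired one once no snapshot can still see it.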
5.3 WAL: Write-Ahead Logging for Durability
To ensure durability (ACID’s “D”), PostgreSQL uses Write-Ahead Logging (WAL):
- Before modifying data pages in shared_buffers, PostgreSQL writes a log entry to the WAL.
- WAL is flushed to disk before the modified data pages, so if the server crashes, PostgreSQL can replay the WAL to recover committed changes.
WAL records include:
- Transaction commits/rollbacks.
- Data modifications (inserts, updates, deletes).
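The write-ahead rule (log first, apply later, replay on crash) can be sketched as follows. Here the durable “disk” is a plain Python list and the record format is invented for illustration.

```python
wal_on_disk = []      # durable: flushed before a commit returns
data_on_disk = {}     # durable, but written lazily by the checkpointer
data_in_memory = {}   # shared_buffers: latest state, lost on crash

def update(key, value, xid):
    wal_on_disk.append(("update", xid, key, value))  # 1. log the change first
    data_in_memory[key] = value                      # 2. then modify the page

def commit(xid):
    wal_on_disk.append(("commit", xid))              # flushed before "done"

def crash_and_recover():
    """Lose memory, then replay WAL: redo only committed transactions."""
    data_in_memory.clear()
    committed = {r[1] for r in wal_on_disk if r[0] == "commit"}
    recovered = dict(data_on_disk)
    for record in wal_on_disk:
        if record[0] == "update" and record[1] in committed:
            _, _, key, value = record
            recovered[key] = value
    return recovered

update("balance", 100, xid=1)
commit(1)                        # xid 1 commits: must survive a crash
update("balance", 999, xid=2)    # xid 2 never commits
state = crash_and_recover()
print(state)  # {'balance': 100}
```

The key property: sequential WAL appends are cheap, so PostgreSQL can defer the expensive random writes of data pages to the checkpointer without risking committed data.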
5.4 Commit and Rollback
- Commit: When a transaction commits:
- A “commit record” is written to WAL.
- The WAL buffer is flushed to disk (ensuring durability).
- The transaction’s XID is marked as committed in the CLOG.
- Rollback: The transaction discards its changes (no WAL flush needed), and expired tuples are eventually cleaned up by Autovacuum.
6. Query Processing Pipeline
When a client sends a query (e.g., SELECT * FROM users WHERE id=1), PostgreSQL follows a 4-step pipeline to return results:
6.1 Parser: “Understanding the Query”
The parser converts the raw SQL string into a parse tree (a structured representation of the query). It checks syntax (e.g., SELECT vs. SELCT), but not semantics (e.g., if users table exists).
Example: For SELECT name FROM users WHERE id=1, the parse tree might represent: “Select column ‘name’ from relation ‘users’ where ‘id’ equals 1”.
6.2 Analyzer/Rewriter: “Validating and Rewriting”
The analyzer (or “semantic analyzer”) validates the parse tree:
- Checks if tables/columns exist and the user has permissions.
- Resolves data types (e.g., in id=1, id is an integer column and 1 is an integer literal).
The rewriter then applies transformation rules (e.g., expanding views into their underlying queries, or applying rules defined with CREATE RULE).
6.3 Planner: “Choosing the Best Path”
The planner (or “optimizer”) generates an execution plan—a step-by-step strategy to retrieve the data. It uses cost-based optimization (CBO), choosing the plan with the lowest estimated cost (e.g., I/O, CPU time).
For example, for SELECT * FROM users WHERE id=1, the planner might:
- Use a B-tree index on id (fast, low cost) if the index exists.
- Do a sequential scan (slow, high cost) if no index exists.
The planner relies on statistics (e.g., table size, data distribution) gathered by ANALYZE (run manually or by autovacuum) to estimate costs.
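A toy version of the comparison the planner makes is shown below. The two cost constants mirror PostgreSQL's seq_page_cost=1.0 and random_page_cost=4.0 defaults, but the formulas are heavily simplified stand-ins for the real cost model.

```python
SEQ_PAGE_COST = 1.0      # matches PostgreSQL's seq_page_cost default
RANDOM_PAGE_COST = 4.0   # matches PostgreSQL's random_page_cost default

def btree_depth(table_rows, fanout=100):
    """Levels needed if each B-tree page holds ~`fanout` keys."""
    depth, capacity = 1, fanout
    while capacity < table_rows:
        depth += 1
        capacity *= fanout
    return depth

def seq_scan_cost(table_pages):
    return table_pages * SEQ_PAGE_COST    # read every page, sequentially

def index_scan_cost(table_rows, matching_rows):
    # Descend the tree, then fetch each matching row from the heap (random I/O).
    return (btree_depth(table_rows) + matching_rows) * RANDOM_PAGE_COST

# A 10,000-page table holding 1,000,000 rows; WHERE id = 1 matches one row.
seq = seq_scan_cost(10_000)                        # 10000.0
idx = index_scan_cost(1_000_000, matching_rows=1)  # (3 + 1) * 4.0 = 16.0
print("planner picks:", "index scan" if idx < seq else "seq scan")
```

The same logic explains why the planner prefers a sequential scan when a query matches most of a table: thousands of random heap fetches would cost more than reading every page in order.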
6.4 Executor: “Running the Plan”
The executor runs the execution plan, fetching data from storage (or shared_buffers) and returning results to the client. It uses iterator-based execution: each node in the plan (e.g., “Index Scan”, “Hash Join”) returns rows one at a time to the next node.
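Iterator-based (“Volcano-style”) execution maps naturally onto Python generators, where each plan node pulls rows from its child one at a time. The table data and node names below are invented for illustration.

```python
def seq_scan(table):
    """Leaf node: yield every row of the table."""
    for row in table:
        yield row

def filter_node(child, predicate):
    """Pull rows from the child node; pass along only the matches."""
    for row in child:
        if predicate(row):
            yield row

def project(child, columns):
    """Keep only the requested columns of each row."""
    for row in child:
        yield {c: row[c] for c in columns}

users = [
    {"id": 1, "name": "alice", "age": 30},
    {"id": 2, "name": "bob", "age": 25},
]

# Roughly: SELECT name FROM users WHERE id = 1
plan = project(filter_node(seq_scan(users), lambda r: r["id"] == 1), ["name"])
print(list(plan))  # [{'name': 'alice'}]
```

Because each node is lazy, rows stream through the plan one at a time: a `LIMIT 10` at the top stops pulling after ten rows, and no node materializes the whole table.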
7. Indexing: Speeding Up Data Retrieval
Indexes are special data structures that allow PostgreSQL to find tuples without scanning entire tables. The most common (and default) index type is B-tree.
7.1 B-Tree Indexes: The Workhorse
A B-tree (balanced tree) index organizes keys in sorted order, enabling fast lookups, range queries (e.g., id BETWEEN 1 AND 100), and sorts.
B-tree structure:
- Root Page: Top-level node pointing to internal pages.
- Internal Pages: Hold keys that direct a search to the correct child page (like a directory).
- Leaf Pages: Store key-value pairs, where the “key” is the indexed column(s), and the “value” is the tuple’s location (page number + item pointer).
7.2 How B-Trees Accelerate Queries
Without an index, PostgreSQL does a sequential scan (reads every page in the table). With a B-tree index:
- Traverse from root → internal pages → leaf page to find the key.
- Use the leaf page’s tuple pointer to fetch the full row from the heap.
This reduces I/O from O(n) (sequential scan) to O(log n) (B-tree lookup), where n is the number of tuples.
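The two access paths are easy to contrast with Python's bisect module standing in for a B-tree's sorted leaf level. This is a sketch: a real B-tree is paged and multi-level, and the table data here is invented.

```python
import bisect

# "Heap": rows in arbitrary insertion order.
heap = [{"id": i * 7 % 1000, "name": f"user{i}"} for i in range(1000)]

# "Index": (key, row_position) pairs kept in sorted key order.
index = sorted((row["id"], pos) for pos, row in enumerate(heap))
keys = [k for k, _ in index]

def seq_scan(target):
    """O(n): inspect every row in the heap."""
    return [row for row in heap if row["id"] == target]

def index_scan(target):
    """O(log n) to locate the key, then fetch matches from the heap."""
    i = bisect.bisect_left(keys, target)
    rows = []
    while i < len(index) and index[i][0] == target:
        rows.append(heap[index[i][1]])   # follow the tuple pointer
        i += 1
    return rows

print(seq_scan(42) == index_scan(42))  # True
```

For 1,000 rows the binary search touches ~10 keys instead of 1,000 rows; at a million rows the gap is ~20 versus 1,000,000, which is why a missing index on a hot predicate is the most common PostgreSQL performance problem.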
8. Conclusion
PostgreSQL’s power lies in its sophisticated, modular architecture. From the client/server model and multi-process design to MVCC, WAL, and the query planner, each component works in harmony to deliver reliability, performance, and flexibility.
By understanding these core components, you can:
- Tune memory/disk settings for your workload.
- Write more efficient queries (e.g., avoiding sequential scans).
- Troubleshoot issues (e.g., slow queries, WAL disk full).
- Design schemas that leverage PostgreSQL’s strengths (e.g., B-tree indexes for frequent lookups).
9. References
- PostgreSQL Official Documentation: Internals
- PostgreSQL 16 Administration Cookbook by Simon Riggs et al.
- Inside the PostgreSQL Query Planner (PostgreSQL Docs)
- Write-Ahead Logging (WAL) (PostgreSQL Docs)
- MVCC in PostgreSQL (PostgreSQL Docs)