cyberangles guide

PostgreSQL Replication: Setting Up a Reliable Data System

In today’s data-driven world, ensuring high availability, fault tolerance, and scalability of databases is critical for businesses. Downtime or data loss can lead to lost revenue, damaged reputation, and operational disruptions. PostgreSQL, one of the most popular open-source relational databases, addresses these challenges through its robust **replication** capabilities. Replication in PostgreSQL involves copying data from a primary database server (where write operations occur) to one or more standby servers (which handle read traffic and act as failover targets). This setup enhances system reliability, enables load balancing for read-heavy workloads, and simplifies disaster recovery. In this blog, we’ll dive deep into PostgreSQL replication: explaining its core concepts, exploring different replication types, and providing step-by-step guides to set up both physical (streaming) and logical replication. We’ll also cover monitoring, failover, and best practices to ensure your replication setup remains robust.

Table of Contents

  1. Understanding PostgreSQL Replication
  2. Types of PostgreSQL Replication
  3. Setting Up Streaming (Physical) Replication
    • 3.1 Prerequisites
    • 3.2 Configuring the Primary Server
    • 3.3 Setting Up the Standby Server
    • 3.4 Verifying Replication
  4. Setting Up Logical Replication
    • 4.1 Use Cases for Logical Replication
    • 4.2 Configuration Steps
  5. Monitoring Replication Health
  6. Failover and Maintenance
  7. Best Practices for Reliable Replication
  8. Conclusion
  9. References

Understanding PostgreSQL Replication

At its core, PostgreSQL replication is the process of synchronizing data from a primary server to one or more standby servers. Here’s how it works:

  • Primary Server: The main database server that handles all write operations (INSERT, UPDATE, DELETE) and commits transactions. It generates a write-ahead log (WAL) to record changes before applying them to the database.
  • Standby Server: A read-only copy of the primary. Standbys replay the WAL from the primary to stay synchronized. They can serve read traffic to offload the primary and act as a failover target if the primary fails.

The Role of WAL (Write-Ahead Logging)

PostgreSQL uses WAL to ensure data integrity. All changes are first written to the WAL before being applied to the database (the “write-ahead” part). For replication, the primary streams WAL records to standbys, which replay these records to replicate changes. This makes WAL the backbone of PostgreSQL replication.

Types of PostgreSQL Replication

PostgreSQL supports multiple replication methods, each suited to different use cases. We’ll focus on the two most widely used:

1. Physical (Streaming) Replication

Physical replication creates an exact byte-for-byte copy of the primary’s database cluster. It replicates the entire database, including schema, indexes, and data files.

Key Features:

  • Synchronizes at the block level (low overhead).
  • Supports failover (standby can be promoted to primary).
  • Ideal for high availability and disaster recovery.
  • Requires identical PostgreSQL versions on primary and standby.

Limitations:

  • No selective replication (replicates the entire database).
  • Schema changes on the primary may break replication if not applied to standbys first.

2. Logical Replication

Logical replication replicates data at the row level, using logical WAL records (decoded from physical WAL). It allows replicating specific tables, rows, or columns.

Key Features:

  • Replicates changes as logical events (e.g., “INSERT INTO table X VALUES Y”).
  • Supports selective replication (e.g., replicate only a specific table).
  • Allows schema changes (e.g., adding a column) without breaking replication.
  • Works across different PostgreSQL versions (minor releases).

Use Cases:

  • Data sharing between databases (e.g., replicating a sales table to a reporting database).
  • Migrating data between servers with different schemas.
  • Offloading read traffic for specific tables.

Other Replication Methods

  • File-Based Replication: Uses archive_command to copy WAL files to a standby (older, less efficient than streaming).
  • Synchronous Replication: Ensures WAL is written to at least one standby before a transaction commits (higher latency, but zero data loss).

Setting Up Streaming (Physical) Replication

Let’s walk through setting up streaming replication with two servers: a primary (primary.example.com) and a standby (standby.example.com).

3.1 Prerequisites

  • Two servers (physical or virtual) running PostgreSQL 12+ (we’ll use PostgreSQL 16).
  • Network connectivity: Port 5432 (PostgreSQL) open between servers.
  • A postgres OS user with SSH access (for backups).
  • Sufficient disk space on the standby (equal to or larger than the primary).

3.2 Configuring the Primary Server

Step 1: Update postgresql.conf

Edit the primary’s configuration file (usually in /var/lib/postgresql/16/main/postgresql.conf):

# Allow connections from the standby
listen_addresses = '*'  # Or specify the standby's IP (e.g., '192.168.1.100,192.168.1.101')

# Enable replication (WAL level must be 'replica' or higher)
wal_level = replica  # 'logical' if using logical replication later
max_wal_senders = 3  # Max number of standby connections
wal_keep_size = 1GB  # Keep WAL files for 1GB (prevents standby from falling too far behind)
max_replication_slots = 3  # For replication slots (prevents WAL deletion)

Step 2: Allow Replication in pg_hba.conf

Edit /var/lib/postgresql/16/main/pg_hba.conf to permit the standby to connect for replication:

# Allow standby (IP: 192.168.1.101) to connect as replication user
host    replication     replicator      192.168.1.101/32        scram-sha-256

Replace 192.168.1.101/32 with the standby’s IP and scram-sha-256 with your authentication method (e.g., md5 for older setups).

Step 3: Restart PostgreSQL and Create a Replication User

Restart the primary to apply changes:

sudo systemctl restart postgresql@16-main

Create a dedicated replication user with the REPLICATION privilege:

-- On primary:
CREATE USER replicator WITH REPLICATION PASSWORD 'secure_password';

3.3 Setting Up the Standby Server

Step 1: Stop PostgreSQL on the Standby

If PostgreSQL is running on the standby, stop it to replace the data directory:

sudo systemctl stop postgresql@16-main
sudo rm -rf /var/lib/postgresql/16/main/*  # CAUTION: Deletes existing data!

Step 2: Take a Base Backup of the Primary

Use pg_basebackup to copy the primary’s data directory to the standby. Run this on the standby:

sudo -u postgres pg_basebackup \
  -h primary.example.com \
  -U replicator \
  -D /var/lib/postgresql/16/main \
  -Fp \  # Plain format (directory)
  -Xs \  # Stream WAL during backup (no need for archive)
  -Pv    # Progress and verbose output

Enter the replicator password when prompted. This creates an initial copy of the primary’s database on the standby.

Step 3: Configure Standby to Replicate from Primary

On the standby, create a postgresql.auto.conf file (PostgreSQL 12+; older versions use recovery.conf):

# /var/lib/postgresql/16/main/postgresql.auto.conf
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com port=5432 user=replicator password=secure_password application_name=standby1'
recovery_target_timeline = 'latest'  # Replay all WAL from primary

Step 4: Start the Standby

Start PostgreSQL on the standby:

sudo systemctl start postgresql@16-main

3.4 Verifying Replication

On the Primary

Check replication status with pg_stat_replication:

-- On primary:
SELECT usename, application_name, client_addr, state, sync_state FROM pg_stat_replication;

You should see a row for the standby with state=streaming and sync_state=async (default).

On the Standby

Check if the standby is replaying WAL:

-- On standby:
SELECT pg_is_in_recovery();  -- Should return 't' (true, in recovery mode)
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;  -- Time since last replay

If replication_lag is low (seconds), replication is working!

Setting Up Logical Replication

Logical replication requires PostgreSQL 10+. We’ll replicate a sales table from the primary to the standby.

4.1 Use Cases for Logical Replication

  • Replicating a subset of tables (e.g., only sales and customers).
  • Migrating data to a new server with a modified schema.
  • Sharing data with a PostgreSQL instance in the cloud.

4.2 Configuration Steps

Step 1: Prepare the Primary Server

Enable logical replication by setting wal_level=logical in postgresql.conf (requires restart):

wal_level = logical  # In postgresql.conf on primary
max_replication_slots = 5  # For logical slots
max_wal_senders = 5

Restart the primary:

sudo systemctl restart postgresql@16-main

Step 2: Create a Publication on the Primary

A publication defines the set of tables to replicate. Create one for the sales table:

-- On primary:
CREATE TABLE sales (
  id SERIAL PRIMARY KEY,
  product_id INT,
  amount NUMERIC,
  sale_date TIMESTAMP DEFAULT NOW()
);

-- Create publication for 'sales' table:
CREATE PUBLICATION sales_publication FOR TABLE sales;

Grant SELECT on the table to the replication user (required for logical replication):

GRANT SELECT ON sales TO replicator;

Step 3: Create a Subscription on the Standby

A subscription connects the standby to the primary’s publication. First, create the sales table on the standby (schema must match the primary’s):

-- On standby:
CREATE TABLE sales (
  id SERIAL PRIMARY KEY,
  product_id INT,
  amount NUMERIC,
  sale_date TIMESTAMP DEFAULT NOW()
);

Create the subscription to sales_publication:

-- On standby:
CREATE SUBSCRIPTION sales_subscription
  CONNECTION 'host=primary.example.com port=5432 user=replicator password=secure_password dbname=postgres'
  PUBLICATION sales_publication;

Step 4: Test Replication

Insert a row on the primary and verify it appears on the standby:

-- On primary:
INSERT INTO sales (product_id, amount) VALUES (101, 99.99);

-- On standby:
SELECT * FROM sales;  -- Should return the new row!

Monitoring Replication Health

To ensure replication remains reliable, monitor key metrics:

Key Metrics to Track

  • Replication Lag: Time between when a change occurs on the primary and when it’s replayed on the standby.
    • On primary: pg_stat_replication.replay_lag (time since standby replayed the last WAL).
    • On standby: now() - pg_last_xact_replay_timestamp().
  • WAL Generation/Rate: Ensure the primary isn’t generating WAL faster than the standby can replay it.
  • Replication Slot Status: Check for unused slots (can cause WAL bloat).

Tools for Monitoring

  • pg_stat_replication: Built-in view on the primary to track standby connections.
  • pg_replication_slots: View slot status (e.g., active or inactive).
  • Prometheus + Grafana: Use the postgres_exporter to scrape metrics and visualize lag, WAL rate, etc.
  • pgBadger: Analyze PostgreSQL logs for replication errors.

Failover and Maintenance

Manual Failover

If the primary fails, promote the standby to take over:

  1. On the standby:

    sudo -u postgres pg_ctl promote -D /var/lib/postgresql/16/main
  2. Update your application’s database connection string to point to the new primary (standby’s IP).

  3. Reconfigure the old primary as a standby for the new primary (repeat the base backup step).

Automated Failover

For production, use tools like:

  • Patroni: A HA solution with automatic failover, REST API, and integration with etcd/Consul.
  • repmgr: Simplifies replication setup, monitoring, and failover.

Maintenance Tips

  • Update PostgreSQL: Always update standbys first, then the primary (to avoid version mismatches).
  • Clean Up WAL: Use pg_archivecleanup to delete old WAL files (set archive_cleanup_command on standbys).
  • Test Failover Regularly: Simulate primary failures to ensure standbys promote correctly (e.g., pg_ctl stop on primary).

Best Practices for Reliable Replication

  1. Use Replication Slots: Prevent WAL deletion on the primary before standbys receive it (avoids “WAL file not found” errors).

    -- On primary: Create a slot for the standby
    SELECT * FROM pg_create_physical_replication_slot('standby1_slot');

    Update the standby’s primary_conninfo to use the slot:

    primary_conninfo = '... replication_slot=standby1_slot'
  2. Encrypt Replication Traffic: Use SSL to secure WAL streaming between primary and standby.

  3. Monitor Lag Aggressively: Set alerts for lag exceeding 30 seconds (indicates a problem).

  4. Avoid Overloading Standbys: Limit read traffic on standbys to prevent replay lag.

  5. Document Your Setup: Record IPs, replication users, and failover steps for your team.

Conclusion

PostgreSQL replication is a cornerstone of building reliable, highly available data systems. Whether you need a failover target (streaming replication) or selective data sharing (logical replication), PostgreSQL offers flexible tools to meet your needs.

By following the steps in this guide, you can set up a robust replication setup, monitor its health, and ensure your data remains accessible even during outages. Remember: replication is not a substitute for backups—always combine it with regular backups for full disaster recovery.

References