Replication
What is it?
Replication is PostgreSQL’s mechanism for maintaining one or more copies (replicas/standbys) of a database. The primary server streams WAL records to standby servers, which replay them to stay synchronized.
Core Concepts
Streaming Replication (Physical Replication)
PostgreSQL’s streaming replication replicates at the physical level (WAL records), creating exact byte-for-byte copies of the primary database.
How it works:
┌─────────────────────────────────────────────────────────────────────┐
│ PRIMARY SERVER │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Client writes data │
│ ↓ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Transactions │────────→│ pg_wal/ │ │
│ │ │ WAL │ (WAL files) │ │
│ └──────────────┘ └───────┬──────┘ │
│ │ │
│ │ WAL stream │
└───────────────────────────────────┼─────────────────────────────────┘
│
┌───────────────────┼───────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐
│ STANDBY 1 │ │ STANDBY 2 │ │ STANDBY 3 │
├──────────────────┤ ├──────────────────┤ ├──────────────────┤
│ │ │ │ │ │
│ WAL Receiver │ │ WAL Receiver │ │ WAL Receiver │
│ ↓ │ │ ↓ │ │ ↓ │
│ Replay WAL │ │ Replay WAL │ │ Replay WAL │
│ ↓ │ │ ↓ │ │ ↓ │
│ Stay in sync │ │ Stay in sync │ │ Stay in sync │
│ │ │ │ │ │
│ (Read-only) │ │ (Read-only) │ │ (Read-only) │
└──────────────────┘ └──────────────────┘ └──────────────────┘
Key characteristics:
- Standbys are read-only (hot standby mode)
- Exact copies of primary (same PostgreSQL version, architecture)
- Block-level replication (copies physical pages, not logical changes)
- Can have multiple standbys from single primary
WAL transmission:
- WAL files from pg_wal/ are sent to standby servers using PostgreSQL's streaming replication protocol
- Primary periodically recycles old WAL files (after checkpoints, when no longer needed for crash recovery)
- Problem: Primary may recycle WAL files before standby receives them
- Network delays, standby downtime, or slow standby can cause standby to fall behind
- If primary recycles WAL that standby hasn’t received yet, standby cannot catch up
- Standby would need full resynchronization (pg_basebackup) to recover
- Solution: Replication slots prevent this problem
Replication Slots
What are replication slots?
- Named markers on the primary that guarantee WAL retention for specific standbys
- Prevent primary from recycling WAL files that standbys still need
- Persist across restarts
Why they matter:
- Without slots: Primary may recycle WAL before standby consumes it (standby falls behind, needs full resync)
- With slots: Primary retains WAL until standby confirms receipt (safer, but can fill disk if standby disconnected)
Types:
- Physical slots: For streaming replication (physical standbys)
- Logical slots: For logical replication (subscribers)
Critical consideration:
- Slots prevent WAL deletion even if standby is offline
- Can cause pg_wal/ to fill the disk if a standby is down for an extended period
- Must monitor slot lag and disk usage
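As a concrete sketch (the slot name standby1_slot is illustrative, not from this document), physical slots are managed with plain SQL on the primary; the standby then references the slot via primary_slot_name in its connection settings:
-- On PRIMARY: create a named physical slot for one standby
SELECT pg_create_physical_replication_slot('standby1_slot');
-- On the STANDBY, reference it in postgresql.conf / postgresql.auto.conf:
--   primary_slot_name = 'standby1_slot'
-- On PRIMARY: list slots and whether a standby is currently using them
SELECT slot_name, slot_type, active FROM pg_replication_slots;
-- On PRIMARY: drop a slot once its standby is permanently gone, so WAL can be recycled
SELECT pg_drop_replication_slot('standby1_slot');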
Synchronous vs Asynchronous Replication
Asynchronous replication (default)
- Primary commits immediately after writing WAL locally
- WAL streamed to standby asynchronously (no waiting)
- Pros: Better performance (no network wait), standby failure doesn’t affect primary
- Cons: Potential data loss if primary crashes before standby receives WAL
Synchronous replication
- Primary waits for standby confirmation before COMMIT returns to client
- What "confirmation" means depends on the synchronous_commit level (see below)
- Pros: Zero data loss on primary failure (when using on or remote_apply)
- Cons: Higher latency (one network round-trip per commit), standby failure can block commits
Configuration:
# On primary server (postgresql.conf)
synchronous_commit = on # Commit level (see table below)
synchronous_standby_names = 'standby1' # Which standbys to wait for
synchronous_commit levels explained:
| Level | Client waits for | Standby status | Data loss risk | Use case |
|---|---|---|---|---|
| off | Nothing (returns immediately) | WAL sent async | High (even on local crash) | Non-critical data, maximum speed |
| local | Local WAL fsync | WAL sent async | On crash before replication | Local durability only |
| remote_write | WAL sent to standby OS | In OS buffer (not on disk yet) | On standby OS crash | Better than async, faster than on |
| on | WAL flushed to standby disk | On disk, not yet applied | None (with sync standby) | Default; production HA, zero data loss |
| remote_apply | WAL applied on standby | Applied and queryable | None, plus read consistency | Strongest guarantee, read-your-writes |
Minimum: Use synchronous_commit = local or higher (never off in production). Recommended: Use synchronous_commit = on if the performance penalty is acceptable - this provides zero data loss on primary failure. The latency increase (one network round-trip per commit) is usually acceptable for most workloads compared to the data protection benefit.
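Because synchronous_commit is an ordinary setting that can be changed per session or per transaction, critical and non-critical writes can use different durability levels. A minimal sketch (the audit_log table is purely illustrative):
-- Cluster-wide default (takes effect on reload; assumes synchronous_standby_names is configured)
ALTER SYSTEM SET synchronous_commit = 'on';
SELECT pg_reload_conf();
-- Relax durability for a single low-value transaction only
BEGIN;
SET LOCAL synchronous_commit = 'off';
INSERT INTO audit_log (message) VALUES ('non-critical entry');
COMMIT;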
Detailed explanation:
Transaction commit flow with synchronous_commit = 'on':
Client Primary Standby
│ │ │
│ COMMIT; │ │
├─────────────────────>│ │
│ │ │
│ │ 1. Write WAL local │
│ │ (to disk) │
│ │ │
│ │ 2. Send WAL ──────> │
│ │ │
│ │ │ 3. Write WAL to disk
│ │ │
│ │ 4. ACK <────────────┤
│ │ │
│ 5. SUCCESS │ │
│ <────────────────────┤ │
│ │ │
│ Client can proceed │ │
│ (transaction durable)│ │
With remote_apply:
Steps 1-2 same, then:
3. Write WAL to disk
4. Apply WAL to database (UPDATE visible on standby)
5. ACK back to primary
6. Primary returns SUCCESS to client
synchronous_commit = off is risky even without replication: commits return before the local WAL flush, so a server or OS crash can lose recently committed transactions (data stays consistent, but those commits are gone). Use with extreme caution.
Multiple synchronous standbys (quorum):
# Wait for ANY 1 of 3 standbys
synchronous_standby_names = 'ANY 1 (standby1, standby2, standby3)'
# Wait for FIRST 2 of 3 standbys (most common for HA)
synchronous_standby_names = 'FIRST 2 (standby1, standby2, standby3)'
# Priority-based: wait for standby1, fallback to standby2 if standby1 fails
synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
What happens when a synchronous standby fails?
With FIRST 2 (standby1, standby2, standby3):
Scenario 1: All standbys healthy
- Primary waits for the 2 highest-priority connected standbys (standby1 and standby2) to confirm
- Commit completes normally
Scenario 2: One standby (standby3) fails
- Primary waits for standby1 + standby2 (still 2 healthy standbys)
- Commit completes normally
- standby3 reconnects later and replays missed WAL asynchronously
- No data inconsistency - all commits were confirmed by required quorum
Scenario 3: Two standbys fail (only 1 healthy)
- Primary CANNOT commit (needs 2, has only 1)
- Transactions HANG waiting for standbys to recover
- Client applications will timeout
- Fix: reduce synchronous_standby_names or restore failed standbys
PostgreSQL never acknowledges a commit to the client unless the required number of standbys confirm. If a commit hangs because standbys are down, the client never receives a success response, so the application can treat the transaction as uncertain and retry once replication is restored or the configuration is relaxed.
Best practice for HA:
Provision N+1 standbys for resilience:
- Need 2 synchronous confirmations? Provision 3 standbys
- Configure: synchronous_standby_names = 'FIRST 2 (standby1, standby2, standby3)'
- Can lose 1 standby without blocking commits
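Both settings can be applied without a restart; a sketch using the example standby names above, followed by a check of which standbys currently count as synchronous:
-- On PRIMARY: apply the quorum configuration and reload
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 2 (standby1, standby2, standby3)';
SELECT pg_reload_conf();
-- Verify which standbys are currently treated as synchronous
SELECT application_name, sync_priority, sync_state
FROM pg_stat_replication
ORDER BY sync_priority;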
Failover and Promotion
Failover is the process of switching to a standby when the primary fails.
Promotion:
- Converting a standby to a new primary
- Standby stops replaying WAL and starts accepting writes
- One-way operation: promoted standby cannot rejoin as standby without rebuild
Basic promotion:
# On the standby server
pg_ctl promote -D /var/lib/postgresql/data
# Or from SQL (PG 12+): SELECT pg_promote();
# Or trigger file method:
# the path is set via trigger_file in recovery.conf (PG < 12)
# or promote_trigger_file in postgresql.conf (PG 12-15; removed in PG 16)
# There is no default path - check your configuration
touch /var/lib/postgresql/data/promote.trigger # Example path
After promotion:
- Update application connection strings to new primary
- Old primary (if recoverable) must be rebuilt as standby
- Other standbys may need reconfiguration to follow new primary
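A quick sanity check after promotion, assuming you can connect to the promoted server:
-- Should now return false (server is no longer in recovery)
SELECT pg_is_in_recovery();
-- The timeline should have advanced beyond the old primary's timeline
SELECT timeline_id FROM pg_control_checkpoint();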
Rebuilding Old Primary as Standby with pg_rewind
After failover, the old primary has diverged from the new primary (different timeline). To convert it to a standby, you have two options:
Option 1: pg_rewind (faster, recommended)
pg_rewind synchronizes the old primary with the new primary by copying only the changed blocks, much faster than full rebuild.
Requirements:
- wal_log_hints = on on the old primary, or data checksums enabled when the cluster was initialized (full_page_writes must also remain on); a quick check is sketched after this list
- Old primary must have been shut down cleanly OR can be shut down now
- Both servers accessible
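It is worth verifying these prerequisites before a failover ever happens; a minimal check on the primary:
-- Either of the first two must be satisfied for pg_rewind to work later
SHOW wal_log_hints;    -- should be 'on', or ...
SHOW data_checksums;   -- ... checksums enabled at initdb time
SHOW full_page_writes; -- must remain 'on' in both cases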
Steps:
# 1. Stop old primary (if still running)
pg_ctl stop -D /var/lib/postgresql/data
# 2. Run pg_rewind from old primary, pointing to new primary
pg_rewind \
--target-pgdata=/var/lib/postgresql/data \
--source-server="host=new-primary-host port=5432 user=replication password=xxx dbname=postgres"
# 3. Configure old primary as standby
# PG 12+: Create standby.signal file
touch /var/lib/postgresql/data/standby.signal
# 4. Configure connection to new primary in postgresql.conf or postgresql.auto.conf
cat >> /var/lib/postgresql/data/postgresql.auto.conf <<EOF
primary_conninfo = 'host=new-primary-host port=5432 user=replication password=xxx'
EOF
# 5. Start old primary (now as standby)
pg_ctl start -D /var/lib/postgresql/data
Option 2: pg_basebackup (slower, always works)
If pg_rewind cannot be used (wal_log_hints was off, or timeline divergence too complex), do full rebuild:
# 1. Stop old primary
pg_ctl stop -D /var/lib/postgresql/data
# 2. Remove old data directory
rm -rf /var/lib/postgresql/data/*
# 3. Take new base backup from new primary
pg_basebackup \
-h new-primary-host \
-U replication \
-D /var/lib/postgresql/data \
-Fp -Xs -P -R
# 4. Start new standby
pg_ctl start -D /var/lib/postgresql/data
When to use each:
- pg_rewind: When possible - much faster (minutes vs hours for large databases)
- pg_basebackup: When wal_log_hints was disabled, or pg_rewind fails
High Availability with Patroni
The manual failover and rebuild processes described above are complex and error-prone. Patroni is a popular open-source HA solution that automates all of this.
What Patroni does:
- Automatic failover: Detects primary failure and promotes standby automatically
- Automatic pg_rewind: Rebuilds old primary as standby without manual intervention
- Consensus-based: Uses etcd, Consul, or ZooKeeper for distributed consensus
- Health checks: Continuous monitoring of primary and standbys
- Switchover: Planned maintenance with zero downtime
- REST API: Easy integration with monitoring and orchestration tools
Key benefits:
- Eliminates manual failover steps (no touching promote.trigger or pg_ctl promote)
- Automatic timeline management and pg_rewind execution
- Built-in fencing to prevent split-brain scenarios
- Integrates well with Kubernetes (via Zalando PostgreSQL Operator)
Example scenario with Patroni:
Without Patroni (manual):
1. Primary fails
2. DBA notices (minutes/hours)
3. DBA manually promotes standby
4. DBA updates connection strings
5. DBA runs pg_rewind on old primary
6. DBA reconfigures old primary as standby
Total time: 15-60 minutes (depending on DBA availability)
With Patroni (automatic):
1. Primary fails
2. Patroni detects failure (seconds)
3. Patroni promotes standby automatically
4. Patroni updates DCS (etcd/Consul) - clients discover new primary
5. When old primary recovers, Patroni runs pg_rewind automatically
6. Old primary rejoins as standby
Total time: 30-60 seconds (fully automated)
Configuration example:
See patroni-config-example.yml for a complete Patroni configuration example.
Logical Replication (Briefly)
Unlike physical replication (entire database), logical replication replicates specific tables or changes based on replication identity. It uses a publish/subscribe model where one or more publishers send changes to one or more subscribers.
How it works:
- Publisher decodes WAL records into logical changes (INSERT, UPDATE, DELETE)
- Changes are sent to subscriber as SQL-like operations (not raw WAL bytes)
- Subscriber applies these changes to its own tables
- This allows cross-version replication and selective table replication
Example flow:
Publisher (Primary):
UPDATE users SET name='Alice' WHERE id=1
↓
WAL record written (physical)
↓
Logical decoding plugin decodes WAL
↓
Logical change: UPDATE users (id=1, name='Alice')
↓
Send to subscriber
↓
Subscriber:
Receives logical change
↓
Executes: UPDATE users SET name='Alice' WHERE id=1
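A minimal publish/subscribe setup sketch; the publication name, subscription name, table, and connection string below are illustrative, not taken from this document:
-- On the PUBLISHER: publish changes to one table
CREATE PUBLICATION users_pub FOR TABLE users;
-- On the SUBSCRIBER: the target table must already exist with a compatible schema
CREATE SUBSCRIPTION users_sub
    CONNECTION 'host=publisher-host port=5432 dbname=appdb user=replication password=xxx'
    PUBLICATION users_pub;
-- On the SUBSCRIBER: confirm the subscription exists and is enabled
SELECT subname, subenabled FROM pg_subscription;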
Key differences from physical replication:
- Table-level granularity (not whole database)
- Only native way to replicate between different PostgreSQL versions (e.g., PG 15 to PG 16, PG 16 to PG 18)
- Allows filtering and transformation
- Higher overhead (decoding + re-executing operations)
- No automatic failover (not covered in detail here)
Why it matters
High Availability (HA)
- Failover: A standby takes over if the primary fails (automated with tooling such as Patroni)
- Minimize downtime: Seconds to minutes vs hours for restore from backup
- Read scaling: Distribute read queries across standbys (write queries still go to primary)
Disaster Recovery (DR)
- Geographic redundancy: Standby in different datacenter/region
- Protection against site failure: Fire, power outage, network partition
- Faster recovery: Promote standby vs restore from backup
Load Distribution
- Read replicas: Offload read-only queries to standbys
- Reporting queries: Run expensive analytics on standby without affecting primary
- Backup from standby: Take backups from standby instead of primary (reduces primary load)
Data Protection
- Synchronous replication: Zero data loss guarantee
- Multiple standbys: Redundancy (can lose one standby and still have protection)
How to monitor
Check Replication Status (Primary)
-- On PRIMARY: See connected standbys and their lag
SELECT
application_name,
client_addr,
state,
sync_state,
pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS send_lag_bytes,
pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) AS write_lag_bytes,
pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn) AS flush_lag_bytes,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
write_lag,
flush_lag,
replay_lag
FROM pg_stat_replication;
Example output:
application_name | client_addr | state | sync_state | send_lag_bytes | write_lag_bytes | flush_lag_bytes | replay_lag_bytes | write_lag | flush_lag | replay_lag
------------------+-------------+-----------+------------+----------------+-----------------+-----------------+------------------+------------+------------+------------
standby1 | 10.0.1.5 | streaming | sync | 0 | 0 | 0 | 8192 | 00:00:00.2 | 00:00:00.2 | 00:00:00.5
standby2 | 10.0.2.8 | streaming | async | 524288 | 524288 | 524288 | 1048576 | 00:00:05 | 00:00:05 | 00:00:10
What to look for:
- state = 'streaming': Standby is actively receiving WAL
- state = 'catchup': Standby is behind and catching up
- sync_state = 'sync': Synchronous standby (primary waits for this one)
- sync_state = 'async': Asynchronous standby (primary doesn't wait)
- replay_lag: How far behind the standby is in applying WAL (most important metric)
- Large lag values (>1GB or >1min): Investigate network, standby performance, or load
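For alerting it often helps to reduce this to a single number; a small sketch that returns the worst replay lag (in bytes) across all connected standbys:
-- On PRIMARY: worst-case replay lag across standbys, suitable for a monitoring check
SELECT COALESCE(MAX(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)), 0) AS max_replay_lag_bytes
FROM pg_stat_replication;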
Check Replication Status (Standby)
-- On STANDBY: Check if in recovery mode and replication status
SELECT
pg_is_in_recovery() AS is_standby,
pg_last_wal_receive_lsn() AS receive_lsn,
pg_last_wal_replay_lsn() AS replay_lsn,
pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS replay_lag_bytes,
pg_last_xact_replay_timestamp() AS last_replay_time,
NOW() - pg_last_xact_replay_timestamp() AS replication_lag_time;
Example output:
is_standby | receive_lsn | replay_lsn | replay_lag_bytes | last_replay_time | replication_lag_time
------------+-------------+-------------+------------------+---------------------------+---------------------
t | 0/6A000000 | 0/69FFF000 | 4096 | 2025-01-15 10:23:45+00 | 00:00:01.5
What to look for:
- is_standby = true: Server is in recovery mode (standby)
- replay_lag_bytes: How many bytes behind in WAL replay
- replication_lag_time: Time since last transaction replay
- Growing over time = standby falling behind
- NULL = no transactions being replayed (idle primary or severe lag)
Monitor Replication Slots
-- On PRIMARY: Check replication slot status
SELECT
slot_name,
slot_type,
active,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
Example output:
slot_name | slot_type | active | retained_wal_bytes | retained_wal
-----------+-----------+--------+--------------------+--------------
standby1 | physical | t | 16777216 | 16 MB
standby2 | physical | f | 10737418240 | 10 GB
What to look for:
- active = false: Slot exists but standby not connected (WAL piling up!)
- retained_wal > 10GB: Excessive WAL retention, check standby status
- Inactive slots with high retained WAL = disk space danger
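A follow-up query that works well as an alert condition (the 10 GB threshold is just an example):
-- On PRIMARY: inactive slots retaining a dangerous amount of WAL
SELECT slot_name, slot_type,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE NOT active
  AND pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > pg_size_bytes('10 GB');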
Check pg_wal Disk Usage
# Check pg_wal directory size
du -sh /var/lib/postgresql/data/pg_wal
# Count WAL files
ls -1 /var/lib/postgresql/data/pg_wal | wc -l
-- Check WAL directory size from SQL
SELECT pg_size_pretty(
SUM(size)
) AS wal_size
FROM pg_ls_waldir();
Monitor Synchronous Commit Settings
-- Check current synchronous commit configuration
SHOW synchronous_commit;
SHOW synchronous_standby_names;
-- Check which standbys are sync vs async
SELECT
application_name,
sync_state,
sync_priority
FROM pg_stat_replication
ORDER BY sync_priority;
Common problems
Problem: Replication lag increasing
Symptom: Standby falling behind primary, replay_lag growing
Diagnosis:
-- On PRIMARY: Check lag for each standby
SELECT
application_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) / 1024 / 1024 AS lag_mb,
replay_lag
FROM pg_stat_replication;
-- On STANDBY: Check if CPU/IO bottleneck
SELECT * FROM pg_stat_activity WHERE backend_type = 'walreceiver' OR backend_type = 'startup';
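If replay appears stalled rather than merely slow, a long-running query on the standby may be holding it up (see the hot standby conflict problem below); a quick sketch to spot candidates:
-- On STANDBY: longest-running client queries that may be delaying WAL replay
SELECT pid, now() - query_start AS runtime, state, left(query, 60) AS query
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND state <> 'idle'
ORDER BY runtime DESC
LIMIT 10;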
Common causes:
- High write load on primary: Standby can’t keep up with WAL generation
- Network bandwidth: Slow connection between primary and standby
- Standby resource constraints: CPU, disk I/O, memory
- Long-running queries on standby: Blocking WAL replay (hot standby conflict)
- Checkpoint settings: Standby checkpointing too frequently
Solutions:
- Increase standby resources (CPU, disk speed)
- Improve network bandwidth/latency
- Tune standby checkpoint settings: checkpoint_timeout = 15min, max_wal_size = 4GB
- Enable hot_standby_feedback on the standby to reduce conflicts: hot_standby_feedback = on
- Consider cascading replication (a standby replicates from another standby)
Problem: pg_wal directory filling disk
Symptom: Disk space alerts, pg_wal directory growing uncontrollably
Diagnosis:
-- Check inactive replication slots
SELECT
slot_name,
active,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE active = false;
-- Check WAL size
SELECT pg_size_pretty(SUM(size)) FROM pg_ls_waldir();
Causes:
- Inactive replication slot (standby disconnected but slot remains)
- Standby down for extended period
- archive_command failing (if archiving enabled)
- Very high write load exceeding max_wal_size
Solutions:
- Drop inactive slots (if the standby is permanently gone): SELECT pg_drop_replication_slot('slot_name');
- Fix standby connection: Investigate why the standby disconnected
- Increase max_wal_size temporarily: ALTER SYSTEM SET max_wal_size = '10GB'; SELECT pg_reload_conf();
- Fix archive_command if archiving is stuck
- Emergency: Manually remove old WAL files (DANGEROUS - only if disk full and no other option):
# DO NOT DO THIS unless you understand the risks
# This will break replication - standbys will need a full resync
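On PostgreSQL 13 and later, a safer preventive measure is to cap how much WAL replication slots may retain; a lagging or orphaned slot is then invalidated (its standby must be rebuilt) instead of filling the disk. A sketch, with an example size:
-- Cap slot WAL retention (PG 13+); '50GB' is an example value
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();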
Problem: Synchronous standby blocking commits
Symptom: Transactions hanging, slow commits, timeouts
Diagnosis:
-- Check if sync standby is connected
SELECT
application_name,
sync_state,
state
FROM pg_stat_replication
WHERE sync_state = 'sync';
-- Check synchronous settings
SHOW synchronous_commit;
SHOW synchronous_standby_names;
-- Check for blocked sessions
SELECT
pid,
wait_event_type,
wait_event,
state,
query
FROM pg_stat_activity
WHERE wait_event = 'SyncRep';
Causes:
- Synchronous standby disconnected or crashed
- Network partition between primary and sync standby
- Sync standby overloaded (can’t keep up)
Solutions:
- Immediate: Disable synchronous replication (accept risk):
  ALTER SYSTEM SET synchronous_standby_names = ''; SELECT pg_reload_conf();
- Fix standby: Restart the standby, investigate why it stopped responding
- Add more sync standbys: Use ANY or FIRST syntax for redundancy: synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
- Lower synchronous_commit temporarily: SET synchronous_commit = local; (per-session, accepts some data-loss risk)
Problem: Hot standby conflicts (queries canceled on standby)
Symptom: Read queries on standby canceled with “conflict with recovery” errors
Example error:
ERROR: canceling statement due to conflict with recovery
DETAIL: User query might have needed to see row versions that must be removed.
Cause:
- Primary vacuums or updates rows that standby queries are reading
- Standby WAL replay needs to remove row versions that queries need
Diagnosis:
-- On STANDBY: Check conflict statistics
SELECT
datname,
confl_tablespace,
confl_lock,
confl_snapshot,
confl_bufferpin,
confl_deadlock
FROM pg_stat_database_conflicts;
Solutions:
- Enable hot_standby_feedback (primary won't vacuum rows the standby still needs): hot_standby_feedback = on in the standby's postgresql.conf
  - Tradeoff: Can cause bloat on the primary if long queries run on the standby
- Increase max_standby_streaming_delay (give queries more time): max_standby_streaming_delay = 300s on the standby (default: 30s)
- Run long queries on a dedicated standby: Keep one standby for reporting queries
- Use logical replication: For reporting workloads that need no conflicts
Problem: Standby won’t start after promotion failure
Symptom: Tried to promote standby, something went wrong, now it won’t start as standby or primary
Diagnosis:
# Check PostgreSQL logs
tail -100 /var/log/postgresql/postgresql-*.log
# Check for standby.signal or recovery.signal file
ls -la /var/lib/postgresql/data/*.signal
Causes:
- Partial promotion (standby.signal removed but timeline not advanced)
- WAL divergence (standby has different timeline than primary)
- Corruption during promotion
Solutions:
- If promoting: Complete the promotion:
  pg_ctl promote -D /var/lib/postgresql/data
- If reverting to standby: Rebuild from the new primary (pg_basebackup)
- Check timeline: Ensure standby timeline matches primary or is direct successor
References
- PostgreSQL Documentation: High Availability, Load Balancing, and Replication
- PostgreSQL Documentation: Replication
- PostgreSQL Documentation: pg_stat_replication
- PostgreSQL Documentation: Replication Slots
- PostgreSQL Documentation: Logical Replication
- High Availability and Scalable Reads in PostgreSQL
- Best Practices for Postgres Database Replication
- Bulletproofing Your Database With Multiple PostgreSQL Replicas