Backup and Recovery

What is it?

PostgreSQL backup and recovery is the practice of creating copies of your database and being able to restore them when needed. The fundamental principle: backups are useless until you’ve successfully tested recovery.

Core Concepts

Important

PostgreSQL backups (pg_dump and pg_basebackup) reflect the database state at backup completion, not start. The database continues accepting writes during backup. Example: backup runs 10:00 AM → 2:00 PM → restored data will be as of 2:00 PM.

Backup Types

Logical Backups (pg_dump)

Export database as SQL statements or custom format
Platform-independent (works across PostgreSQL versions, architectures)
Slower for large databases (must replay all SQL)
Good for: migrations, selective restore, smaller databases

pg_dump formats:

Format	Flag	Characteristics	Best for
Plain	`-Fp`	SQL text, human-readable	Version control, manual editing
Custom	`-Fc`	Compressed binary, selective restore	General purpose, single-threaded
Directory	`-Fd`	Multiple files, parallel dump/restore	Large databases, faster backups
Tar	`-Ft`	Tar archive, no compression	Archiving, but use custom instead

Tip

Use directory format (-Fd) with -j flag for parallel backup on large databases. Example: pg_dump -Fd -j 4 uses 4 parallel workers, significantly faster than single-threaded custom format.

Physical Backups (pg_basebackup)

Copy of raw data files (PGDATA directory)
Faster for large databases (file-level copy)
Same PostgreSQL major version required (e.g., 15.x → 15.y)
Same platform required (Linux → Linux, not Linux → Windows)
Architecture: x86_64 ↔ ARM64 works (both little-endian, 64-bit)
Includes all databases in the cluster
Good for: large databases, PITR, production systems

Note

Physical backups can be restored between compatible architectures on the same platform. Linux x86_64 → Linux ARM64 works because both are little-endian¹ 64-bit. Cross-platform (Linux → Windows) or cross-endianness (x86 → SPARC) does not work.

pg_basebackup is not required:

pg_basebackup is a convenient wrapper, but you can create physical backups manually:

-- Method 1: Using pg_basebackup
pg_basebackup -D /backup -Ft -z -P -X stream

-- Method 2: Manual backup (more control)
-- Step 1: Mark backup start
SELECT pg_backup_start('manual_backup_label');

-- Step 2: Copy PGDATA with any tool (in a separate shell)
rsync -av --exclude='pg_wal' /var/lib/postgresql/data/ /backup/
-- OR: tar czf /backup/base.tar.gz -C /var/lib/postgresql/data .
-- OR: filesystem snapshot (LVM, ZFS, etc)

-- Step 3: Mark backup complete and get WAL info
SELECT * FROM pg_backup_stop();

WAL handling during backup:

Method	WAL during backup	Requires archive_mode?
`pg_basebackup -X stream`	Streamed and included in backup	No (for backup itself)
`pg_basebackup -X fetch`	Fetched at end, included in backup	No (for backup itself)
Manual (pg_backup_start/stop)	Must archive WAL separately	Yes (required)

Important

All methods require WAL archiving (archive_mode = on) for PITR - to restore to a point in time AFTER the backup. The backup itself contains WAL only during the backup period. For recovery to any later point, you need archived WAL files.

When to use manual method:

Filesystem snapshots (LVM, ZFS, Btrfs)
Cloud provider snapshots (EBS, persistent disk)
Enterprise backup software integration
Custom backup orchestration

Point-in-Time Recovery (PITR)

What is PITR?

PITR allows restoring a database to any specific moment in time, not just the backup time. This requires:

Base backup (physical backup via pg_basebackup)
WAL archive (continuous archiving of WAL files)
Recovery target (timestamp, transaction ID, or named restore point)

How it works:

┌─────────────────────────────────────────────────────────────────────┐
│                          TIMELINE                                   │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  10:00 AM          12:00 PM          2:00 PM          2:30 PM       │
│     │                 │                 │                │          │
│     ▼                 │                 ▼                ▼          │
│  [Backup]             │           [Want this!]    [Corruption]      │
│   taken               │                                             │
│     │                 │                 ▲                           │
│     │                 │                 │                           │
│     │  ┌──────────────┴─────────────────┘                           │
│     │  │  Continuous WAL archiving                                  │
│     │  │                                                            │
│     ▼  ▼                                                            │
│  ┌─────────┐      ┌─────────────────────────────┐                   │
│  │ Base    │      │ Archived WAL files          │                   │
│  │ Backup  │      │ 10:00 → 10:15 → 10:30 → ... │                   │
│  │         │      │ ... → 1:45 → 2:00 → 2:15    │                   │
│  │ State:  │      └─────────────────────────────┘                   │
│  │ 10:00AM │                                                        │
│  └─────────┘                                                        │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

RECOVERY PROCESS:

Step 1: Restore base backup
┌─────────────────────────┐
│ Database state: 10:00AM │
└─────────────────────────┘

Step 2: Replay WAL files (10:00 AM → 2:00 PM)
┌─────────────────────────┐
│ Apply: 10:15 WAL        │
│ Apply: 10:30 WAL        │
│ Apply: 10:45 WAL        │
│ ...                     │
│ Apply: 1:45 WAL         │
│ Apply: 2:00 WAL ← STOP  │
└─────────────────────────┘

Step 3: Result
┌──────────────────────────────────────┐
│ ✓ Database state: exactly 2:00 PM   │
│ ✓ All data up to corruption point   │
│ ✓ Zero data loss (10 AM → 2 PM)     │
└──────────────────────────────────────┘

WITHOUT PITR (backup only):
┌──────────────────────────────────────┐
│ ✗ Can only restore to 10:00 AM      │
│ ✗ Lose 4 hours of data (10AM → 2PM) │
└──────────────────────────────────────┘

Recovery targets:

-- Recover to specific timestamp
recovery_target_time = '2026-01-27 14:00:00'

-- Recover to specific transaction
recovery_target_xid = '12345678'

-- Recover to named restore point
recovery_target_name = 'before_migration'

-- Recover to end of available WAL (default)
recovery_target = 'immediate'

WAL Archiving

archive_command copies WAL files to safe storage before they’re recycled:

# postgresql.conf
wal_level = replica                    # Minimum for archiving (see below)
archive_mode = on                      # Enable archiving
archive_command = 'cp %p /mnt/archive/%f'  # Copy to archive location

wal_level values:

Level	WAL Information	Use case	Can archive?
`minimal`	Crash recovery only	Single instance, no backups	No
`replica`	+ streaming replication, archiving	Production default	Yes
`logical`	+ logical replication decoding	Logical replication needed	Yes

Tip

Use wal_level = replica for production systems. It enables archiving, PITR, and streaming replication with minimal overhead. Only use logical if you need logical replication (e.g., cross-version replication, selective table replication).

Archive command variables:

%p = Full path of file to archive
%f = File name only

Archive destinations:

Local mount (NFS, SMB)
Object storage (S3, GCS, Azure Blob)
Remote server (rsync, scp)

Important

archive_command must return 0 (success) only when the file is safely stored. Returning 0 prematurely causes data loss risk. PostgreSQL will retry failed archives until success.

Backup Tools Comparison

Tool	Type	Best for	Pros	Cons
pg_dump	Logical	Small DBs, migrations	Portable, selective	Slow for large DBs
pg_basebackup	Physical	Basic PITR	Built-in, simple	Manual WAL management
pgBackRest	Physical	Production systems	Parallel backup/restore, compression, S3 support	Learning curve
Barman	Physical	Enterprise	Full PITR, retention policies	Python dependency
WAL-G	Physical	Cloud-native	S3/GCS/Azure native, compression	Less mature

Recovery Point Objective (RPO) vs Recovery Time Objective (RTO)

RPO (Recovery Point Objective): How much data loss is acceptable?

Logical backup only: RPO = time since last backup (could be 24 hours)
PITR with WAL archiving: RPO = ~minutes (only lose uncommitted transactions)
Synchronous replication: RPO = 0 (zero data loss)

RTO (Recovery Time Objective): How long can recovery take?

Small database (< 100 GB): Minutes to hours
Large database (> 1 TB): Hours to days (without streaming restore)
With standby server: Seconds (failover)

Why it matters

Data Loss Prevention

Human error:

Accidental DROP TABLE or DELETE without WHERE
Wrong data in migration script
Application bug corrupting data

Hardware failure:

Disk failure (RAID isn’t backup)
Complete server failure
Datacenter disaster

Malicious activity:

Ransomware encryption
Malicious DELETE/UPDATE
Compromised credentials

Compliance Requirements

Many industries require:

Retention period (7 years for financial data)
Point-in-time recovery capability
Verified restore procedures
Off-site backups (geographic redundancy)

Operational Confidence

Tested backups enable:

Major schema migrations (can rollback via PITR)
Application deployments (restore if bad code corrupts data)
Testing environments (restore production to staging)
Analysis without affecting production

How to monitor

Check Archiving Status

-- Check if archiving is enabled
SHOW archive_mode;
SHOW archive_command;

-- Check archiver process activity
SELECT
    archived_count,
    last_archived_wal,
    last_archived_time,
    failed_count,
    last_failed_wal,
    last_failed_time,
    stats_reset
FROM pg_stat_archiver;

Example output:

 archived_count |     last_archived_wal      |       last_archived_time       | failed_count | last_failed_wal | last_failed_time | stats_reset
----------------+----------------------------+--------------------------------+--------------+-----------------+------------------+-------------
         245623 | 0000000100000A2B0000004C   | 2026-01-27 14:23:45.678+00     |            0 |                 |                  |

What to look for:

failed_count > 0: Archive command failing (check logs)
last_archived_time old: Archiving stuck or slow
archived_count not increasing: No WAL being generated (idle system) or archiving broken

Verify Archive Directory

# Check archive directory exists and has recent files
ls -lht /mnt/archive/ | head -20

# Count archived WAL files
ls -1 /mnt/archive/ | wc -l

# Check disk usage
du -sh /mnt/archive/

# Find oldest and newest archived files
ls -lt /mnt/archive/ | tail -1  # oldest
ls -lt /mnt/archive/ | head -2  # newest

What to look for:

No recent files: Archiving stopped
Disk full: Archive destination out of space
Gaps in sequence: Missing WAL files (data loss risk)

Note

Archive command failures are also logged to PostgreSQL logs. Check with grep "archive command failed" /var/log/postgresql/postgresql-*.log for detailed error messages.

Monitor Backup Age

# Check age of last pg_basebackup
ls -lht /backups/base/ | head -5

# Check backup size trends
du -sh /backups/base/* | sort -h

-- Create monitoring table (run once)
CREATE TABLE IF NOT EXISTS backup_history (
    backup_id SERIAL PRIMARY KEY,
    backup_type TEXT,
    backup_start TIMESTAMP,
    backup_end TIMESTAMP,
    backup_size_bytes BIGINT,
    wal_start_lsn PG_LSN,
    wal_end_lsn PG_LSN
);

-- Example: Record pg_dump backup
INSERT INTO backup_history (
    backup_type,
    backup_start,
    backup_end,
    backup_size_bytes,
    wal_start_lsn,
    wal_end_lsn
) VALUES (
    'pg_dump',
    '2026-01-27 10:00:00',
    '2026-01-27 10:15:00',
    5368709120,  -- Size of dump file in bytes
    NULL,        -- pg_dump doesn't use LSN
    NULL
);

-- Example: Record pg_basebackup (capture LSN before/after)
-- Step 1: Before backup
SELECT pg_current_wal_lsn() AS start_lsn, now() AS start_time \gset

-- Step 2: Run pg_basebackup (in shell)
-- pg_basebackup -D /backup/base_20260127 -Ft -z -P

-- Step 3: After backup, record it
INSERT INTO backup_history (
    backup_type,
    backup_start,
    backup_end,
    backup_size_bytes,
    wal_start_lsn,
    wal_end_lsn
) VALUES (
    'pg_basebackup',
    :'start_time',
    now(),
    (SELECT pg_size_bytes(pg_tablespace_size('pg_default'))),
    :'start_lsn',
    pg_current_wal_lsn()
);

-- Query backup history
SELECT
    backup_type,
    backup_start,
    backup_end,
    backup_end - backup_start AS duration,
    pg_size_pretty(backup_size_bytes) AS size,
    NOW() - backup_end AS age
FROM backup_history
ORDER BY backup_start DESC
LIMIT 10;

Automated tracking with script:

#!/bin/bash
# backup_with_tracking.sh

DB_HOST=localhost
DB_NAME=mydb
BACKUP_DIR=/backups
BACKUP_FILE="${BACKUP_DIR}/mydb_$(date +%Y%m%d_%H%M%S).dump"

# Capture start time and LSN
START_TIME=$(date -Iseconds)
START_LSN=$(psql -h $DB_HOST -d $DB_NAME -tAc "SELECT pg_current_wal_lsn()")

# Run backup
pg_dump -h $DB_HOST -Fc -f $BACKUP_FILE $DB_NAME

# Capture end time and size
END_TIME=$(date -Iseconds)
BACKUP_SIZE=$(stat -f%z "$BACKUP_FILE" 2>/dev/null || stat -c%s "$BACKUP_FILE")

# Record in database
psql -h $DB_HOST -d $DB_NAME <<EOF
INSERT INTO backup_history (
    backup_type, backup_start, backup_end,
    backup_size_bytes, wal_start_lsn, wal_end_lsn
) VALUES (
    'pg_dump',
    '$START_TIME'::timestamp,
    '$END_TIME'::timestamp,
    $BACKUP_SIZE,
    NULL, NULL
);
EOF

echo "✓ Backup completed and recorded: $BACKUP_FILE"

Test Recovery (Critical!)

Caution

Never assume backups work. Schedule regular recovery tests to verify:

Backup files are not corrupted
Archive command is storing files correctly
Recovery procedures are documented and work
RTO meets business requirements

Test recovery checklist:

Monthly: Restore latest backup to test environment
Quarterly: Full PITR test with specific recovery target
After major changes: Test restore after PostgreSQL upgrades or config changes
Document: Record time taken, issues encountered, procedure updates

Common backup strategies

Strategy 1: Logical Backups Only (Small Databases)

Best for: Development, staging, databases < 50 GB

# Single-threaded backup (custom format)
pg_dump -h localhost -U postgres -Fc -f /backups/mydb_$(date +%Y%m%d).dump mydb

# Parallel backup (directory format) - faster for larger DBs
pg_dump -h localhost -U postgres -Fd -j 4 -f /backups/mydb_$(date +%Y%m%d) mydb

# Plain SQL format (human-readable, version control)
pg_dump -h localhost -U postgres -Fp -f /backups/mydb_$(date +%Y%m%d).sql mydb

# Keep last 7 days
find /backups/ -name "mydb_*" -mtime +7 -delete

# Restore from custom format
pg_restore -h localhost -U postgres -d mydb_restored /backups/mydb_20260127.dump

# Restore from directory format (parallel)
pg_restore -h localhost -U postgres -j 4 -d mydb_restored /backups/mydb_20260127/

# Restore specific table only
pg_restore -h localhost -U postgres -d mydb_restored -t users /backups/mydb_20260127.dump

# Restore from plain SQL
psql -h localhost -U postgres -d mydb_restored -f /backups/mydb_20260127.sql

Directory format structure:

$ ls -lh /backups/mydb_20260127/
total 2.5G
-rw------- 1 postgres postgres  127 Jan 27 10:00 toc.dat        # Table of contents
-rw------- 1 postgres postgres  850M Jan 27 10:02 3456.dat.gz   # Table data (compressed)
-rw------- 1 postgres postgres  420M Jan 27 10:03 3457.dat.gz   # Another table
-rw------- 1 postgres postgres  1.2G Jan 27 10:05 3458.dat.gz   # Large table

Pros:

Simple, no WAL archiving needed
Portable across versions
Can restore specific tables
Directory format: parallel dump/restore (faster)

Cons:

No PITR (only restore to backup time)
Slow for large databases (even with parallelization)
RPO = backup frequency (typically 24 hours)

Strategy 2: Physical Backup + WAL Archiving (PITR)

Best for: Production systems, databases > 50 GB

# 1. Enable WAL archiving (postgresql.conf)
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /mnt/archive/%f && cp %p /mnt/archive/%f'

# 2. Weekly base backup
pg_basebackup -h localhost -U replication -D /backups/base_$(date +%Y%m%d) -Ft -z -P

# 3. WAL files archived continuously by PostgreSQL

# 4. PITR restore
# - Restore base backup
# - Create recovery.signal file
# - Configure recovery.conf or postgresql.conf
# - Start PostgreSQL (replays WAL to target time)

Pros:

PITR capability (restore to any point)
RPO in minutes
Faster restore than logical backups

Cons:

Requires WAL archiving setup
More complex to manage
Needs monitoring

Strategy 3: Modern Tools (pgBackRest/Barman)

Best for: Large production systems, multiple databases

pgBackRest example:

# Configure pgbackrest.conf
[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=4
repo1-retention-diff=4
process-max=4

[main]
pg1-path=/var/lib/postgresql/data
pg1-port=5432

# Create full backup
pgbackrest --stanza=main backup --type=full

# Create differential backup (faster, based on last full)
pgbackrest --stanza=main backup --type=diff

# Restore to specific time
pgbackrest --stanza=main restore --type=time --target="2026-01-27 14:00:00"

Pros:

Parallel backup/restore (faster)
Incremental backups
Automatic retention management
Built-in compression and encryption
S3/GCS/Azure support

Cons:

Additional tool to learn and maintain
Configuration complexity

Strategy 4: Continuous Protection (Streaming Replication)

Best for: Critical systems requiring near-zero downtime

Primary ───(streaming replication)──> Standby
  │                                      │
  ├─> WAL Archive ───────────────────────┤
  │                                      │
  └─> pg_basebackup (weekly)             └─> Promoted on failover

Characteristics:

RPO ≈ 0 (with synchronous replication)
RTO = seconds to minutes (promote standby)
Combines replication with PITR backups
Standby can be used for backups (offload from primary)

Tip

Take backups from standby server to avoid impacting primary performance. Use pg_basebackup -h standby-host or configure backup tools to connect to standby.

Common problems

Problem: Archive command failing

Symptom: pg_stat_archiver shows failed_count > 0, logs show archive errors

Diagnosis:

SELECT failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;

# Check PostgreSQL logs
grep "archive command failed" /var/log/postgresql/postgresql-*.log

Common causes:

Archive destination disk full
Permission denied (postgres user can’t write)
Network mount disconnected
Archive command syntax error

Solutions:

Fix disk space:

df -h /mnt/archive/
# Clean old archives or expand storage

Fix permissions:

chown -R postgres:postgres /mnt/archive/
chmod 755 /mnt/archive/

Test archive command manually:

su - postgres
cp /var/lib/postgresql/data/pg_wal/000000010000000000000001 /mnt/archive/

Caution

Always test archive commands as the postgres user, not as root. Permission issues that work as root may fail when PostgreSQL (running as postgres) tries to execute the same command. Use su - postgres before testing.

Check archive_command syntax:

SHOW archive_command;
-- Common mistake: missing 'test ! -f' check
-- Correct: test ! -f /mnt/archive/%f && cp %p /mnt/archive/%f

Problem: pg_basebackup taking too long

Symptom: Backup takes hours, impacts production performance

Diagnosis:

# Monitor backup progress
pg_basebackup -D /backup -P  # -P shows progress

# Check I/O wait on server
iostat -x 5

Solutions:

Take backup from standby (offload primary):

pg_basebackup -h standby-host -D /backup -P

Use compression (reduce I/O if network is bottleneck):

pg_basebackup -D /backup -Ft -z -P  # gzip compression

Use modern tools with parallel processing:

pgbackrest --stanza=main backup --type=full  # parallel by default

Schedule during low-traffic periods:

# Cron: 2 AM Sunday
0 2 * * 0 /usr/bin/pg_basebackup -D /backup/$(date +\%Y\%m\%d) -P

Problem: Cannot restore - “requested timeline is not in this server’s history”

Symptom: PITR restore fails with timeline error

Cause: WAL archive contains files from multiple timelines (after previous recovery/promotion)

Diagnosis:

# Check timeline in backup label
cat /backup/backup_label
# Look for: START TIMELINE: 1

# Check available WAL files and their timelines
ls -1 /mnt/archive/ | head -20
# Format: 00000001... = timeline 1
# Format: 00000002... = timeline 2

Solution:

Ensure recovery_target_timeline is set:

# recovery.conf or postgresql.conf
recovery_target_timeline = 'latest'  # Follow timeline history

Clean old timeline files if needed (CAREFUL - can cause data loss):

# Only remove if you're certain they're from old, unwanted timelines

Use restore_command that understands timeline files:

restore_command = 'cp /mnt/archive/%f %p'
# Ensure timeline history files (.history) are also archived

Problem: Recovery stopping before target time

Symptom: PITR stops earlier than recovery_target_time, missing recent transactions

Diagnosis:

# Check available WAL files
ls -lt /mnt/archive/ | head -20

# Check PostgreSQL recovery logs
grep "recovery stopping" /var/log/postgresql/postgresql-*.log

Causes:

Missing WAL files (gaps in archive)
recovery_target_action = 'pause' (stops but doesn’t promote)
Archive incomplete at recovery time

Solutions:

Verify WAL continuity:

# Check for gaps in WAL sequence
ls -1 /mnt/archive/ | sort | awk '
  NR==1 {prev=$0; next}
  {
    split(prev, a, ""); split($0, b, "")
    # Compare sequence numbers (simplified)
    if ($0 != prev) print "Gap: " prev " -> " $0
    prev=$0
  }
'

Check recovery_target_action:

recovery_target_action = 'promote'  # Auto-promote at target
# NOT 'pause' (requires manual promotion)

Ensure archive_command completed before failure:

-- On original server (if accessible)
SELECT last_archived_wal FROM pg_stat_archiver;

Problem: Backup files corrupted or unusable

Symptom: Restore fails with “invalid backup” or corruption errors

Prevention:

Verify backups after creation:

# Test pg_dump backups
pg_restore -l /backups/mydb_20260127.dump > /dev/null

# Verify basebackup integrity
tar -tzf /backups/base.tar.gz > /dev/null

# Use checksums
sha256sum /backups/mydb_20260127.dump > /backups/mydb_20260127.dump.sha256

Test recovery regularly:

# Automated monthly recovery test
#!/bin/bash
BACKUP=/backups/latest.dump
TEST_DB=restore_test_$(date +%Y%m%d)

dropdb --if-exists $TEST_DB
createdb $TEST_DB
pg_restore -d $TEST_DB $BACKUP

if [ $? -eq 0 ]; then
  echo "✓ Backup verified successfully"
  dropdb $TEST_DB
else
  echo "✗ Backup verification FAILED - investigate!"
  exit 1
fi

Store backups redundantly:
- Local disk + remote location
- Multiple cloud regions
- Different storage classes (S3 + Glacier)

Problem: Backup retention filling disk

Symptom: Backup directory filling disk, old backups never deleted

Solution:

Implement retention policy:

# Keep 7 daily, 4 weekly, 12 monthly

# Daily: delete older than 7 days
find /backups/daily/ -name "*.dump" -mtime +7 -delete

# Weekly: delete older than 28 days
find /backups/weekly/ -name "*.dump" -mtime +28 -delete

# Monthly: delete older than 365 days
find /backups/monthly/ -name "*.dump" -mtime +365 -delete

Use backup tools with built-in retention:

# pgBackRest
repo1-retention-full=4      # Keep 4 full backups
repo1-retention-diff=4      # Keep 4 differential per full

# Barman
retention_policy = 'RECOVERY WINDOW OF 30 DAYS'

Monitor archive growth:

# Alert if archive directory > 80% full
df -h /mnt/archive/ | awk 'NR==2 {if ($5+0 > 80) print "ALERT: Archive disk usage " $5}'

References

Wikipedia: Endianness - Byte order in computer memory. Little-endian stores least significant byte first (x86, ARM). Big-endian stores most significant byte first (SPARC, PowerPC). Physical backups cannot be restored between different endianness. ↩︎

Edit this page