Tuần 07 — Backup, PITR & Disaster Recovery

“Có một sự khác biệt giữa ‘có backup’ và ‘có thể restore’. Tuần này dạy bạn tránh ảo tưởng đầu để có thực tại sau. Backup chưa từng test = không có backup.”

Tags: database postgresql backup pitr disaster-recovery operations Thời lượng: 7 ngày (4-6h/ngày) Prerequisites: Tuan-01-DB-Internals-Refresh (WAL, checkpoint) Liên quan: Tuan-08-Zero-Downtime-Migration · Tuan-Bonus-DB-Security-Compliance

1. Context & Why

1.1 Backup mindset

graph LR
    A[Backup taken] --> B{Tested restore?}
    B -->|No| C[Not a backup<br/>Just file on disk]
    B -->|Yes| D{Tested recently?}
    D -->|>3 months| E[Probably broken]
    D -->|<1 month| F[Real backup]

    style C fill:#ef9a9a
    style E fill:#ffccbc
    style F fill:#c8e6c9

Iron rule: Backup không phải file. Backup là process: take + verify + monitor + drill.

1.2 Câu hỏi quan trọng trước khi thiết kế backup

RPO (Recovery Point Objective) — chịu mất tối đa bao nhiêu data?
- “5 phút data cuối có thể mất” → continuous WAL archiving
- “1 ngày data có thể mất” → nightly snapshot
RTO (Recovery Time Objective) — bao lâu để restore xong?
- “10 phút” → hot standby ready
- “4 giờ” → restore từ snapshot S3
Retention — giữ backup bao lâu?
- 7 ngày để rollback gần?
- 30 ngày compliance?
- 7 năm tax/SOX?
Threat model — bảo vệ chống cái gì?
- Hardware failure → 1 replica đủ
- Datacenter disaster → cross-region backup
- Ransomware → immutable storage (S3 Object Lock)
- Bug in code corrupts data → PITR đến trước khi bug deploy

1.3 Mục tiêu tuần

Phân biệt logical vs physical backup, khi nào pick mỗi cái
Setup WAL archiving + base backup chained
Thực hiện PITR thực tế: recover đến exact transaction trước accident
Master tool: pgBackRest hoặc WAL-G
Cross-region DR setup
Restore drill — practice nguyên lý quan trọng nhất
Encryption at rest + immutability
Hiểu cloud managed: RDS, Aurora, Cloud SQL backup features

1.4 Tham chiếu

PostgreSQL docs Ch.26 (Backup and Restore) — https://www.postgresql.org/docs/current/backup.html
pgBackRest documentation — https://pgbackrest.org/
WAL-G documentation — https://github.com/wal-g/wal-g
Barman — https://pgbarman.org/
AWS RDS Backup — https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_CommonTasks.BackupRestore.html

2. Backup Types

2.1 Logical vs Physical

graph TB
    subgraph "Logical Backup"
        L1[pg_dump / pg_dumpall]
        L2[Output: SQL statements / custom format]
        L3[Restore: pg_restore]
        L4[Cross-version OK]
        L5[Selective: table/schema]
        L6[Slow on large DB]
    end

    subgraph "Physical Backup"
        P1[pg_basebackup / pgBackRest / WAL-G]
        P2[Output: binary files copy]
        P3[Restore: file copy + recovery]
        P4[Same major version required]
        P5[Whole cluster only]
        P6[Fast even on large DB]
        P7[Required for PITR]
    end

    style L4 fill:#c8e6c9
    style L5 fill:#c8e6c9
    style P6 fill:#c8e6c9
    style P7 fill:#c8e6c9
    style L6 fill:#ffccbc
    style P4 fill:#ffccbc
    style P5 fill:#ffccbc

2.2 Khi nào dùng cái nào

Scenario	Logical	Physical
DB <10GB	✓	✓
DB >100GB	✗ slow	✓
Cross-version migration (PG14→16)	✓	✗
PITR (5-min RPO)	✗	✓
Restore single table	✓	✗ (full cluster)
Move data across cloud providers	✓	✓ (with WAL-G S3 → GCS)
Schema-only backup	✓	✗

Production pattern 2024: Cả hai.

Physical (pgBackRest/WAL-G) — daily full + continuous WAL → primary backup strategy
Logical (pg_dump) weekly — schema + selective dump → for surgical restore, migration

3. Logical Backup — pg_dump

3.1 Formats

# Plain SQL (default)
pg_dump appdb > backup.sql
 
# Custom format (recommended)
pg_dump -Fc appdb -f backup.dump
 
# Directory format (parallel)
pg_dump -Fd appdb -f backup_dir -j 4
 
# Tar format
pg_dump -Ft appdb -f backup.tar

Format	Parallel	Selective restore	Notes
Plain	No	No	Just SQL, human-readable
Custom (`-Fc`)	Restore parallel	Yes	Recommended
Directory (`-Fd`)	Dump + restore parallel	Yes	Best for huge DBs
Tar (`-Ft`)	No	Yes	Compatibility

3.2 Options critical

pg_dump \
    --format=custom \
    --jobs=4 \                    # parallel (directory format)
    --no-owner --no-acl \         # for restore on different cluster
    --schema=public \             # specific schema
    --table='public.users' \      # specific table
    --exclude-table='audit.*' \   # exclude pattern
    --serializable-deferrable \   # consistent snapshot
    --section=pre-data \          # schema only, then data, then post-data
    -f backup.dump \
    appdb

3.3 Restore

# Custom format
pg_restore -d new_db backup.dump
 
# Parallel restore
pg_restore -d new_db -j 4 backup.dump
 
# Selective restore
pg_restore -d new_db --table=users backup.dump
 
# List contents
pg_restore --list backup.dump
 
# Generate edit-able TOC for selective restore
pg_restore --list backup.dump > toc.txt
# Edit toc.txt, comment out unwanted lines
pg_restore -d new_db --use-list=toc.txt backup.dump

3.4 pg_dumpall — cluster-wide

# Roles, tablespaces, all databases
pg_dumpall > cluster_backup.sql
 
# Globals only (roles + tablespace), per-DB dump elsewhere
pg_dumpall --globals-only > globals.sql

Pattern: pg_dumpall --globals-only + pg_dump per DB for fine control.

3.5 Limitations

Slow on large DB — 100GB có thể mất giờ
Lock: SHARE-mode acquired, không block normal operations nhưng block DDL
Consistent snapshot: dùng REPEATABLE READ → cần đủ work_mem
Replication slot: không capture WAL → không PITR

3.6 Best practice

pg_dump -Fc -Z9 (max compression)
Test restore monthly
Store off-site (S3, GCS) + cross-region
Encrypt at rest (age, gpg, server-side encryption)

4. Physical Backup — Foundation for PITR

4.1 pg_basebackup — built-in

pg_basebackup \
    -h localhost -U replicator \
    -D /backup/base \
    -Ft -z \                  # tar + gzip
    -P \                      # progress
    -X stream \               # include WAL stream
    -c fast                   # immediate checkpoint

Creates a binary copy of the data directory. Plus WAL needed to make it consistent.

Limitations:

Single-threaded (slow on 1TB+)
No incremental
No retention management
Manual everything

→ For production: use pgBackRest or WAL-G.

4.2 WAL archiving setup

The foundation of PITR.

# postgresql.conf
wal_level = replica            # or logical
archive_mode = on
archive_command = 'test ! -f /archive/%f && cp %p /archive/%f'
# Or to S3:
archive_command = 'wal-g wal-push %p'
 
archive_timeout = 60           # force WAL switch every 60s if no traffic
max_wal_senders = 10
wal_keep_size = 2GB            # safety net for replica lag

%p — path of WAL segment in pg_wal/ %f — filename to archive as

4.3 archive_command guarantee

Postgres considers WAL segment archived ONLY after archive_command returns 0. If failure:

Segment kept in pg_wal/
Postgres retries
pg_wal/ grows unbounded if archive broken → eventually disk full → DB stops

→ Monitor archive_command failure! pg_stat_archiver shows last_failed_time.

SELECT * FROM pg_stat_archiver;
-- archived_count, last_archived_wal, failed_count, last_failed_wal, last_failed_time

4.4 archive_command idempotent

Critical: archive_command MUST be idempotent. If S3 upload fails after data uploaded, retry must not corrupt.

# BAD: not idempotent
archive_command = 'aws s3 cp %p s3://backup/%f'
 
# GOOD: check existence first
archive_command = '
  set -e
  if aws s3api head-object --bucket backup --key wal/%f 2>/dev/null; then
    if aws s3api get-object --bucket backup --key wal/%f /tmp/check
       && cmp /tmp/check %p; then
      exit 0  # already uploaded same content
    else
      exit 1  # different content - alert!
    fi
  fi
  aws s3 cp %p s3://backup/wal/%f
'

In practice: use WAL-G / pgBackRest which handle this.

5. pgBackRest — Production Gold Standard

5.1 Overview

pgbackrest.conf:

[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2
repo1-retention-diff=7
repo1-cipher-pass=secret-passphrase
repo1-cipher-type=aes-256-cbc
 
repo1-s3-bucket=mycorp-pgbackup
repo1-s3-endpoint=s3.amazonaws.com
repo1-s3-region=us-east-1
repo1-s3-key=AKIA...
repo1-s3-key-secret=...
repo1-type=s3
 
compress-type=zst
compress-level=3
process-max=4
log-level-console=info
log-level-file=detail
 
[appdb-prod]
pg1-path=/var/lib/postgresql/16/main
pg1-port=5432

postgresql.conf:

archive_command = 'pgbackrest --stanza=appdb-prod archive-push %p'

Init:

sudo -u postgres pgbackrest --stanza=appdb-prod stanza-create
sudo -u postgres pgbackrest --stanza=appdb-prod check

5.2 Backup types

# Full backup (everything)
pgbackrest --stanza=appdb-prod backup --type=full
 
# Differential (since last full)
pgbackrest --stanza=appdb-prod backup --type=diff
 
# Incremental (since last backup any type)
pgbackrest --stanza=appdb-prod backup --type=incr

gantt
    title pgBackRest backup chain
    dateFormat YYYY-MM-DD
    section Strategy
    Full backup       :f1, 2026-05-10, 1d
    Incr Mon          :i1, after f1, 1d
    Incr Tue          :i2, after i1, 1d
    Incr Wed          :i3, after i2, 1d
    Incr Thu          :i4, after i3, 1d
    Incr Fri          :i5, after i4, 1d
    Incr Sat          :i6, after i5, 1d
    Full backup       :f2, after i6, 1d

5.3 Restore

# Latest
pgbackrest --stanza=appdb-prod restore
 
# Specific backup
pgbackrest --stanza=appdb-prod restore --set=20260516-020000F
 
# PITR
pgbackrest --stanza=appdb-prod restore \
    --type=time \
    --target='2026-05-16 14:30:00'
 
# To different host (delta restore)
pgbackrest --stanza=appdb-prod restore --delta

5.4 Parallel & resume

process-max=4 → 4 parallel processes. Backup 1TB DB on 10Gbps network in ~30 minutes.

If restore interrupted, --delta resumes — only copies files different from current state. Game-changer for huge DBs.

5.5 Schedule

# Sunday 02:00 - full
0 2 * * 0 sudo -u postgres pgbackrest --stanza=appdb-prod backup --type=full
 
# Daily 02:00 - incremental
0 2 * * 1-6 sudo -u postgres pgbackrest --stanza=appdb-prod backup --type=incr
 
# Hourly verify (lightweight)
0 * * * * sudo -u postgres pgbackrest --stanza=appdb-prod check

5.6 Encryption + immutability

repo1-cipher-type=aes-256-cbc → server-side encrypt before upload to S3.

Plus S3 Object Lock (write-once-read-many) → ransomware-proof:

aws s3api put-object-lock-configuration \
    --bucket mycorp-pgbackup \
    --object-lock-configuration '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Days":35}}}'

6. WAL-G — Cloud-Native Alternative

6.1 Why WAL-G

Pros:

Cloud-native (S3, GCS, Azure Blob, Yandex Object Storage)
Built-in encryption (libsodium)
Delta backups (PG12+)
Catchup mode for replicas
Single binary, easy deploy

Cons (vs pgBackRest):

Less feature complete (no diff backup chain)
Smaller community

6.2 Setup

# Environment
export WALG_S3_PREFIX=s3://mycorp-pgbackup/appdb
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export WALG_COMPRESSION_METHOD=brotli
export WALG_DELTA_MAX_STEPS=5
export WALG_LIBSODIUM_KEY=...  # encryption key

# postgresql.conf
archive_command = 'wal-g wal-push %p'
restore_command = 'wal-g wal-fetch %f %p'

6.3 Operations

# Full backup
wal-g backup-push /var/lib/postgresql/16/main
 
# Delta backup (PG12+)
wal-g backup-push /var/lib/postgresql/16/main --delta-from-name=base_xxxxx
 
# List
wal-g backup-list
wal-g wal-show
 
# Restore
wal-g backup-fetch /var/lib/postgresql/16/main LATEST
# Or specific backup name
 
# Restore for PITR
# Set restore_command in recovery.conf / postgresql.auto.conf
# Set recovery_target_time

7. Point-In-Time Recovery (PITR)

7.1 The promise

“Restore database to any moment in past, down to the second.”

Pre-requisites:

Base backup + continuous WAL archive from then
Recovery target reachable in WAL chain

7.2 PITR mechanics

sequenceDiagram
    participant T as Time
    participant B as Base Backup
    participant W as WAL Archive
    participant R as Recovery

    Note over B,W: 2026-05-15 02:00 Base backup taken
    Note over W: Continuous WAL pushed
    Note over W: 2026-05-16 14:30:01 Bad DELETE happens
    Note over W: Continuous WAL pushed
    Note over R: Disaster recovery needed
    R->>B: Restore base from 2026-05-15
    loop Replay WAL
        R->>W: Fetch WAL segment
        W-->>R: Segment
        Note over R: Replay until target time
    end
    Note over R: Stop at 2026-05-16 14:29:00
    Note over R: 1 second before disaster

7.3 PITR setup PG12+

In postgresql.conf (or postgresql.auto.conf):

restore_command = 'pgbackrest --stanza=appdb-prod archive-get %f %p'
recovery_target_time = '2026-05-16 14:29:00'
recovery_target_action = 'promote'   # or 'pause', 'shutdown'

Create signal file:

touch /var/lib/postgresql/16/main/recovery.signal

Then start Postgres. It enters recovery mode, replays WAL until target.

7.4 Recovery targets

Options for recovery_target_*:

recovery_target_time = 'TIMESTAMP' — most common
recovery_target_xid = '12345' — specific transaction
recovery_target_name = 'foobar' — restore point (created via SELECT pg_create_restore_point('foobar'))
recovery_target_lsn = '0/3000028' — specific LSN
recovery_target = 'immediate' — first consistent point after start

recovery_target_inclusive = false — stop BEFORE the target.

7.5 Recovery actions

After target reached:

promote — Postgres becomes primary, no further replay
pause — pause, allow you to run SELECT pg_wal_replay_resume() or pg_promote()
shutdown — clean shutdown after recovery

Pattern recommended: pause, verify with read-only queries, then promote.

7.6 Worked example — recover bad DELETE

# Disaster: someone ran "DELETE FROM users;" at 14:30:01
 
# 1. Stop application (don't write to corrupted state)
kubectl scale deployment app --replicas=0
 
# 2. Identify exact time
# Look at pg_stat_statements or audit log or app log
# Recovery target = 14:30:00 (1s before)
 
# 3. Restore to new directory
sudo -u postgres pg_ctl stop -D /var/lib/postgresql/16/main
 
# Move corrupt cluster aside (don't delete - might still be useful)
sudo mv /var/lib/postgresql/16/main /var/lib/postgresql/16/main.corrupt
 
sudo -u postgres pgbackrest --stanza=appdb-prod restore \
    --type=time --target='2026-05-16 14:30:00' \
    --target-action=pause \
    --pg1-path=/var/lib/postgresql/16/main
 
# 4. Start, in recovery mode
sudo -u postgres pg_ctl start -D /var/lib/postgresql/16/main
 
# 5. Verify
psql -c "SELECT count(*) FROM users;"
# Should show pre-disaster count
 
# 6. Promote
psql -c "SELECT pg_promote();"
 
# 7. Restart app pointing to recovered cluster
kubectl scale deployment app --replicas=10

7.7 RPO calculation

With continuous WAL archiving:

archive_timeout = 60s → max 60s of data lost on archive failure
Realistic RPO with healthy archive: <1 second

If archive_command fails for hours, RPO degrades. Monitor!

8. Replication for High Availability

8.1 Streaming replication (sync vs async)

graph LR
    Primary[(Primary)] --WAL stream--> Replica1[(Hot Standby Sync)]
    Primary --WAL stream--> Replica2[(Hot Standby Async)]
    Primary --WAL archive--> S3[(S3 backup)]

    style Replica1 fill:#c8e6c9
    style Replica2 fill:#fff9c4

# Primary
synchronous_standby_names = 'FIRST 1 (replica1, replica2)'
# = wait for ack from first of replica1, replica2 before commit

Sync mode:

✅ RPO=0 (zero data loss)
❌ Latency increase ~1ms per replica round-trip
❌ Primary blocks if replica down

Async mode:

✅ Fast commit
❌ Replica lag possible (seconds-minutes during heavy write)
❌ Failover may lose data

Pattern 2024: 1 sync replica in same AZ (low latency, zero RPO) + 1 async replica cross-region (DR).

8.2 Replica setup

# On standby
sudo -u postgres pg_basebackup \
    -h primary.host -U replicator \
    -D /var/lib/postgresql/16/main \
    -X stream -P -W
 
# postgresql.conf on standby:
primary_conninfo = 'host=primary.host user=replicator password=...'
restore_command = 'pgbackrest --stanza=appdb-prod archive-get %f %p'
 
# standby.signal file
touch /var/lib/postgresql/16/main/standby.signal

8.3 Logical replication

PG10+. Replicate selected tables instead of whole cluster.

-- Primary
CREATE PUBLICATION pub_users FOR TABLE users, orders;
 
-- Replica
CREATE SUBSCRIPTION sub_users
CONNECTION 'host=primary user=replicator password=...'
PUBLICATION pub_users;

Use cases:

Cross-version migration (PG14 → PG16)
Selective replication
Different schemas on replica
Real-time data integration

Limitations:

No DDL replication (manual sync schema)
Sequence values not replicated
Truncate replication optional
Single-threaded apply (worse for high write)

8.4 Replica lag

-- On replica
SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;
 
-- On primary
SELECT
    client_addr,
    application_name,
    state,
    sync_state,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes,
    pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024.0 / 1024.0 AS lag_mb
FROM pg_stat_replication;

Alert when lag > 30s. Causes:

Heavy write on primary (replica replay single-threaded)
Conflict with queries on replica
Network slow
Replica disk slow

9. Cross-Region DR

9.1 Architecture

graph TB
    subgraph "Region US-EAST"
        Primary[(Primary)]
        SyncRep[(Sync Replica)]
        S3East[(S3 us-east)]
    end

    subgraph "Region EU-WEST"
        AsyncRep[(Async Replica)]
        S3EU[(S3 eu-west)]
    end

    Primary -.sync.-> SyncRep
    Primary -.async.-> AsyncRep
    Primary -.WAL archive.-> S3East
    S3East -.cross-region replicate.-> S3EU

9.2 Strategies

Strategy	RPO	RTO	Cost
Daily snapshot cross-region	1 day	hours	$
Hourly snapshot + WAL ship	1 hour	hour	$$
Async replica cross-region	seconds	minutes	$$$
Sync replica cross-region	0	minutes	$$$$ + latency cost

Pattern 2024: Async replica + WAL archive to cross-region S3.

9.3 S3 cross-region replication

aws s3api put-bucket-replication \
    --bucket mycorp-pgbackup-us-east \
    --replication-configuration file://replication.json

{
  "Role": "arn:aws:iam::ACCT:role/s3-replication",
  "Rules": [{
    "Status": "Enabled",
    "Priority": 1,
    "Filter": {"Prefix": ""},
    "Destination": {
      "Bucket": "arn:aws:s3:::mycorp-pgbackup-eu-west",
      "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
      "Metrics": {"Status": "Enabled"}
    },
    "DeleteMarkerReplication": {"Status": "Enabled"}
  }]
}

Auto-replicate, ~15-min lag. Backup safe across region failure.

9.4 DR runbook

Critical: have a tested runbook. Sample:

## DR Runbook
 
### Trigger
- Region US-EAST unreachable >15 min, confirmed via Slack #incidents
 
### Step 1 — Failover decision (5 min)
- Engineering Manager + SRE on-call decide
- Communicate to Slack #status
 
### Step 2 — Promote EU replica (5 min)
```bash
ssh eu-replica
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/16/main

Step 3 — Update DNS (5 min)

aws route53 change-resource-record-sets \
    --hosted-zone-id Z123 \
    --change-batch file://failover-dns.json

Step 4 — App reconfigure (5 min)

App reads DB endpoint from DNS → automatic
Verify connections via pgbouncer logs

Step 5 — Verify (10 min)

App health checks green
Synthetic test: place order, check
Communicate to customers

Total RTO: ~30 min


---

## 10. Restore Drill — quan trọng nhất

### 10.1 Why drill

Backup không tested = backup không tồn tại. Drill:
- Verify backup completeness
- Verify procedure documented & followable
- Measure actual RTO (not theoretical)
- Train team — không ai panic vào hôm thật

### 10.2 Drill cadence

| Frequency | Activity |
|-----------|----------|
| Daily | Automated: restore latest backup to test cluster, run sanity SQL |
| Weekly | Restore + run app test suite against restored DB |
| Monthly | Manual full DR drill — promote replica, failover |
| Quarterly | Cross-region failover, app-level smoke test |
| Yearly | Tabletop exercise: simulate ransomware, full re-build |

### 10.3 Automated daily drill

```bash
#!/bin/bash
# /opt/scripts/backup_drill.sh
set -e

DRILL_HOST=drill-postgres.internal
TARGET_DATE=$(date -d 'yesterday' +'%Y-%m-%d 23:59:59')

# Restore latest backup to drill cluster
ssh $DRILL_HOST "
    sudo -u postgres pg_ctl stop -D /var/lib/postgresql/16/drill || true
    sudo rm -rf /var/lib/postgresql/16/drill
    sudo -u postgres pgbackrest --stanza=appdb-prod restore \
        --type=time --target='$TARGET_DATE' \
        --pg1-path=/var/lib/postgresql/16/drill
    sudo -u postgres pg_ctl start -D /var/lib/postgresql/16/drill
"

# Sanity SQL
ssh $DRILL_HOST "psql -p 5433 -c '
    SELECT count(*) FROM users;
    SELECT max(placed_at) FROM orders;
    SELECT pg_database_size(current_database());
'"

# Report to monitoring
curl -X POST https://metrics.internal/backup_drill_success \
    -d "duration_seconds=$SECONDS&target=$TARGET_DATE"

Run nightly. Alert if fails.

10.4 What to verify

Restore completes within RTO target
pg_database_size matches expected
App-side: representative queries succeed
Specific known records exist (e.g., user 1, oldest order)
No corruption: pg_amcheck on critical indexes

11. Cloud Managed Services

11.1 AWS RDS / Aurora

Automated backups:

Daily snapshot
Continuous WAL → PITR up to 35 days (configurable)
Cross-region snapshot copy

Restore:

Console: “Restore to point in time” → pick timestamp → new cluster
API: RestoreDBInstanceToPointInTime

Aurora specific:

Storage-level continuous backup (no separate WAL archive)
1-second PITR granularity
Fast clone — copy-on-write, instant cluster spin-up
Backtrack (Aurora MySQL only) — rewind without restore

Cost: ~equal to storage cost; cross-region copy egress charges.

11.2 Google Cloud SQL

Similar features:

Automated backups
PITR
Cross-region replica

11.3 Limitations of managed

Less control over WAL archive policy
Can’t run pgBackRest yourself (no shell access)
Vendor lock-in
Restore creates new instance — IP changes, DNS update needed
Cost predictable but not optimized

11.4 Self-managed vs managed

flowchart TD
    A[Choose backup strategy] --> B{Team size?}
    B -->|<5 engineers| C[Managed - RDS/Cloud SQL]
    B -->|10+ engineers| D{Compliance/control needed?}
    D -->|Yes| E[Self-manage on EC2/k8s with pgBackRest]
    D -->|No| F[Managed]
    A --> G{Cost-sensitive at scale?}
    G -->|Yes, >$10K/month| H[Self-manage]

    style C fill:#c8e6c9
    style F fill:#c8e6c9
    style E fill:#fff9c4
    style H fill:#fff9c4

12. Encryption & Compliance

12.1 Encryption at rest

Three layers:

Disk — LUKS, AWS EBS encryption (most common, transparent)
Backup — pgBackRest cipher-pass, WAL-G libsodium
TDE (Transparent Data Encryption) — NOT in Postgres core (even PG17/18). Available trong fork: EDB Advanced Server, Cybertec TDE patches, Percona Postgres. Community proposal long-running. → Real-world solution: dùng disk-level encryption + column-level cho PII

12.2 Encryption in transit

# postgresql.conf
ssl = on
ssl_cert_file = '/path/to/server.crt'
ssl_key_file = '/path/to/server.key'
ssl_ca_file = '/path/to/ca.crt'
 
# pg_hba.conf
hostssl all all 0.0.0.0/0 scram-sha-256

12.3 Compliance considerations

GDPR: right to be forgotten — backup retention limits
HIPAA: encryption everywhere
PCI-DSS: 12-month audit log retention
SOX: 7-year retention

→ Retention policy reflect compliance, not “as long as possible”.

12.4 Backup ransomware protection

Threats: ransomware encrypts your backup too.

Defense:

Immutable storage — S3 Object Lock (compliance mode)
Air-gapped copy — periodic offline copy
MFA on delete — S3 versioning + MFA delete
Separate credentials — backup write-only, restore read-only

# S3 Object Lock - cannot delete for 30 days
aws s3api put-object-lock-configuration \
    --bucket prod-pgbackup \
    --object-lock-configuration '{
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}}
    }'

13. Cost Optimization

13.1 Storage tiering

S3 lifecycle for backup:

{
  "Rules": [{
    "Id": "BackupLifecycle",
    "Status": "Enabled",
    "Transitions": [
      {"Days": 30, "StorageClass": "STANDARD_IA"},
      {"Days": 90, "StorageClass": "GLACIER"},
      {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
    ],
    "Expiration": {"Days": 2555}
  }]
}

Saves 60-80% storage cost on old backups.

13.2 Compression

Method	Ratio	Speed
gzip	2-3x	Medium
zstd	3-4x	Fast
brotli	4-5x	Slow
lz4	1.5x	Very fast

pgBackRest default zstd. WAL-G brotli.

13.3 Incremental + diff backup

Daily full backup of 1TB = 1TB × 30 days = 30TB stored. Weekly full + daily diff:

4 fulls (4TB) + 28 diffs (~50GB each = 1.4TB) = 5.4TB total

→ 80% storage save.

14. Anti-patterns

Pattern	Tại sao tệ	Fix
Backup chưa test restore	Có thể không restore được khi cần	Monthly restore drill
Backup cùng region với primary	Region disaster wipes cả 2	Cross-region replicate
Retention = forever	Cost run away	Lifecycle policy + compliance-driven retention
archive_command fail silently	pg_wal grows, eventually disk full	Monitor pg_stat_archiver, alert
Backup user = superuser	Compromise = data leak + delete	Limited role, S3 write-only
Backup encrypted with key on same server	Server compromise = both lost	KMS / separate vault
Logical backup as primary for 500GB DB	pg_dump too slow	Physical backup
pg_basebackup as production tool	No retention, no incremental	pgBackRest/WAL-G
Restore in production cluster	Risk overwriting good data	Always restore to test cluster first
No standby replica	RTO measured in hours	At least 1 hot standby

15. Lab — 7 days

15.1 Day 1: pg_dump basics

# Setup PG 16 with sample data (e-commerce from Tuần 02)
docker compose up -d postgres
 
# Backup
pg_dump -h localhost -U postgres -Fc appdb -f appdb.dump
ls -lh appdb.dump
 
# Restore to new DB
createdb appdb_restored
pg_restore -h localhost -U postgres -d appdb_restored -j 4 appdb.dump
 
# Verify
psql -d appdb_restored -c "SELECT count(*) FROM orders;"

15.2 Day 2: WAL archiving manual

# Add to postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /tmp/wal_archive/%f && cp %p /tmp/wal_archive/%f'
archive_timeout = 60
 
mkdir /tmp/wal_archive
# Restart Postgres
 
# Force WAL switch
psql -c "SELECT pg_switch_wal();"
ls -la /tmp/wal_archive
# Should see WAL segment file

15.3 Day 3: pgBackRest setup

docker compose up -d  # postgres + pgbackrest container
pgbackrest --stanza=appdb stanza-create
pgbackrest --stanza=appdb backup --type=full
 
# View backup info
pgbackrest --stanza=appdb info

15.4 Day 4: PITR exercise

# State 1: insert known data
psql -c "INSERT INTO users (email, name) VALUES ('[email protected]', 'Alice');"
 
# Note time
RECOVERY_TIME=$(psql -t -c "SELECT now();")
 
# State 2: do bad thing
psql -c "DELETE FROM users WHERE email = '[email protected]';"
 
# Stop Postgres, restore to RECOVERY_TIME
pg_ctl stop
pgbackrest --stanza=appdb restore --type=time --target="$RECOVERY_TIME"
 
# Start, check Alice is back
pg_ctl start
psql -c "SELECT * FROM users WHERE email = '[email protected]';"

15.5 Day 5: Streaming replication

# Setup primary + replica via docker-compose
# Create replication user
psql -c "CREATE USER replicator REPLICATION LOGIN PASSWORD 'rep';"
 
# On replica
pg_basebackup -h primary -U replicator -D /var/lib/postgresql/data -P -X stream
touch /var/lib/postgresql/data/standby.signal
 
# Verify
psql -c "SELECT * FROM pg_stat_replication;" -h primary
psql -c "SELECT pg_is_in_recovery();" -h replica

15.6 Day 6: Failover simulation

# Stop primary (simulate failure)
docker stop pg_primary
 
# Promote replica
docker exec pg_replica pg_ctl promote
 
# Verify
docker exec pg_replica psql -c "SELECT pg_is_in_recovery();"  # false now

15.7 Day 7: Drill automation

Write daily_drill.sh that:

Restores latest backup to test cluster
Runs sanity queries
Reports success/fail to log
Cleanup test cluster

Cron it to run nightly.

16. Self-check

RPO vs RTO — định nghĩa? Backup strategy ảnh hưởng cái nào nhiều hơn?
Logical vs physical backup — khi nào dùng cái nào?
archive_command fail → hệ quả? Monitor metric nào?
PITR cần gì? Step-by-step recover từ disaster.
Sync vs async replica — pros/cons? Pattern 2024 thường dùng?
Cross-region DR strategy: 4 options sorted by RPO+RTO+cost?
Restore drill — vì sao quan trọng? Frequency?
Ransomware threat — backup defense?
RDS auto backup vs self-managed pgBackRest — pick khi nào?
archive_command idempotency — tại sao critical?

17. Tiếp theo

Bài tiếp: Tuan-08-Zero-Downtime-Migration — migration không xuống.

Tuần 07 hoàn thành. Tested backup = real backup. Cập nhật: 2026-05-16

lthieu's notes

Explorer

Tuan-07-Backup-PITR-DR