Tuần 07 — Backup, PITR & Disaster Recovery
“Có một sự khác biệt giữa ‘có backup’ và ‘có thể restore’. Tuần này dạy bạn tránh ảo tưởng đầu để có thực tại sau. Backup chưa từng test = không có backup.”
Tags: database postgresql backup pitr disaster-recovery operations Thời lượng: 7 ngày (4-6h/ngày) Prerequisites: Tuan-01-DB-Internals-Refresh (WAL, checkpoint) Liên quan: Tuan-08-Zero-Downtime-Migration · Tuan-Bonus-DB-Security-Compliance
1. Context & Why
1.1 Backup mindset
graph LR A[Backup taken] --> B{Tested restore?} B -->|No| C[Not a backup<br/>Just file on disk] B -->|Yes| D{Tested recently?} D -->|>3 months| E[Probably broken] D -->|<1 month| F[Real backup] style C fill:#ef9a9a style E fill:#ffccbc style F fill:#c8e6c9
Iron rule: Backup không phải file. Backup là process: take + verify + monitor + drill.
1.2 Câu hỏi quan trọng trước khi thiết kế backup
-
RPO (Recovery Point Objective) — chịu mất tối đa bao nhiêu data?
- “5 phút data cuối có thể mất” → continuous WAL archiving
- “1 ngày data có thể mất” → nightly snapshot
-
RTO (Recovery Time Objective) — bao lâu để restore xong?
- “10 phút” → hot standby ready
- “4 giờ” → restore từ snapshot S3
-
Retention — giữ backup bao lâu?
- 7 ngày để rollback gần?
- 30 ngày compliance?
- 7 năm tax/SOX?
-
Threat model — bảo vệ chống cái gì?
- Hardware failure → 1 replica đủ
- Datacenter disaster → cross-region backup
- Ransomware → immutable storage (S3 Object Lock)
- Bug in code corrupts data → PITR đến trước khi bug deploy
1.3 Mục tiêu tuần
- Phân biệt logical vs physical backup, khi nào pick mỗi cái
- Setup WAL archiving + base backup chained
- Thực hiện PITR thực tế: recover đến exact transaction trước accident
- Master tool: pgBackRest hoặc WAL-G
- Cross-region DR setup
- Restore drill — practice nguyên lý quan trọng nhất
- Encryption at rest + immutability
- Hiểu cloud managed: RDS, Aurora, Cloud SQL backup features
1.4 Tham chiếu
- PostgreSQL docs Ch.26 (Backup and Restore) — https://www.postgresql.org/docs/current/backup.html
- pgBackRest documentation — https://pgbackrest.org/
- WAL-G documentation — https://github.com/wal-g/wal-g
- Barman — https://pgbarman.org/
- AWS RDS Backup — https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_CommonTasks.BackupRestore.html
2. Backup Types
2.1 Logical vs Physical
graph TB subgraph "Logical Backup" L1[pg_dump / pg_dumpall] L2[Output: SQL statements / custom format] L3[Restore: pg_restore] L4[Cross-version OK] L5[Selective: table/schema] L6[Slow on large DB] end subgraph "Physical Backup" P1[pg_basebackup / pgBackRest / WAL-G] P2[Output: binary files copy] P3[Restore: file copy + recovery] P4[Same major version required] P5[Whole cluster only] P6[Fast even on large DB] P7[Required for PITR] end style L4 fill:#c8e6c9 style L5 fill:#c8e6c9 style P6 fill:#c8e6c9 style P7 fill:#c8e6c9 style L6 fill:#ffccbc style P4 fill:#ffccbc style P5 fill:#ffccbc
2.2 Khi nào dùng cái nào
| Scenario | Logical | Physical |
|---|---|---|
| DB <10GB | ✓ | ✓ |
| DB >100GB | ✗ slow | ✓ |
| Cross-version migration (PG14→16) | ✓ | ✗ |
| PITR (5-min RPO) | ✗ | ✓ |
| Restore single table | ✓ | ✗ (full cluster) |
| Move data across cloud providers | ✓ | ✓ (with WAL-G S3 → GCS) |
| Schema-only backup | ✓ | ✗ |
Production pattern 2024: Cả hai.
- Physical (pgBackRest/WAL-G) — daily full + continuous WAL → primary backup strategy
- Logical (pg_dump) weekly — schema + selective dump → for surgical restore, migration
3. Logical Backup — pg_dump
3.1 Formats
# Plain SQL (default)
pg_dump appdb > backup.sql
# Custom format (recommended)
pg_dump -Fc appdb -f backup.dump
# Directory format (parallel)
pg_dump -Fd appdb -f backup_dir -j 4
# Tar format
pg_dump -Ft appdb -f backup.tar| Format | Parallel | Selective restore | Notes |
|---|---|---|---|
| Plain | No | No | Just SQL, human-readable |
Custom (-Fc) | Restore parallel | Yes | Recommended |
Directory (-Fd) | Dump + restore parallel | Yes | Best for huge DBs |
Tar (-Ft) | No | Yes | Compatibility |
3.2 Options critical
pg_dump \
--format=custom \
--jobs=4 \ # parallel (directory format)
--no-owner --no-acl \ # for restore on different cluster
--schema=public \ # specific schema
--table='public.users' \ # specific table
--exclude-table='audit.*' \ # exclude pattern
--serializable-deferrable \ # consistent snapshot
--section=pre-data \ # schema only, then data, then post-data
-f backup.dump \
appdb3.3 Restore
# Custom format
pg_restore -d new_db backup.dump
# Parallel restore
pg_restore -d new_db -j 4 backup.dump
# Selective restore
pg_restore -d new_db --table=users backup.dump
# List contents
pg_restore --list backup.dump
# Generate edit-able TOC for selective restore
pg_restore --list backup.dump > toc.txt
# Edit toc.txt, comment out unwanted lines
pg_restore -d new_db --use-list=toc.txt backup.dump3.4 pg_dumpall — cluster-wide
# Roles, tablespaces, all databases
pg_dumpall > cluster_backup.sql
# Globals only (roles + tablespace), per-DB dump elsewhere
pg_dumpall --globals-only > globals.sqlPattern: pg_dumpall --globals-only + pg_dump per DB for fine control.
3.5 Limitations
- Slow on large DB — 100GB có thể mất giờ
- Lock: SHARE-mode acquired, không block normal operations nhưng block DDL
- Consistent snapshot: dùng REPEATABLE READ → cần đủ work_mem
- Replication slot: không capture WAL → không PITR
3.6 Best practice
pg_dump -Fc -Z9(max compression)- Test restore monthly
- Store off-site (S3, GCS) + cross-region
- Encrypt at rest (
age,gpg, server-side encryption)
4. Physical Backup — Foundation for PITR
4.1 pg_basebackup — built-in
pg_basebackup \
-h localhost -U replicator \
-D /backup/base \
-Ft -z \ # tar + gzip
-P \ # progress
-X stream \ # include WAL stream
-c fast # immediate checkpointCreates a binary copy of the data directory. Plus WAL needed to make it consistent.
Limitations:
- Single-threaded (slow on 1TB+)
- No incremental
- No retention management
- Manual everything
→ For production: use pgBackRest or WAL-G.
4.2 WAL archiving setup
The foundation of PITR.
# postgresql.conf
wal_level = replica # or logical
archive_mode = on
archive_command = 'test ! -f /archive/%f && cp %p /archive/%f'
# Or to S3:
archive_command = 'wal-g wal-push %p'
archive_timeout = 60 # force WAL switch every 60s if no traffic
max_wal_senders = 10
wal_keep_size = 2GB # safety net for replica lag%p — path of WAL segment in pg_wal/
%f — filename to archive as
4.3 archive_command guarantee
Postgres considers WAL segment archived ONLY after archive_command returns 0. If failure:
- Segment kept in pg_wal/
- Postgres retries
- pg_wal/ grows unbounded if archive broken → eventually disk full → DB stops
→ Monitor archive_command failure! pg_stat_archiver shows last_failed_time.
SELECT * FROM pg_stat_archiver;
-- archived_count, last_archived_wal, failed_count, last_failed_wal, last_failed_time4.4 archive_command idempotent
Critical: archive_command MUST be idempotent. If S3 upload fails after data uploaded, retry must not corrupt.
# BAD: not idempotent
archive_command = 'aws s3 cp %p s3://backup/%f'
# GOOD: check existence first
archive_command = '
set -e
if aws s3api head-object --bucket backup --key wal/%f 2>/dev/null; then
if aws s3api get-object --bucket backup --key wal/%f /tmp/check
&& cmp /tmp/check %p; then
exit 0 # already uploaded same content
else
exit 1 # different content - alert!
fi
fi
aws s3 cp %p s3://backup/wal/%f
'In practice: use WAL-G / pgBackRest which handle this.
5. pgBackRest — Production Gold Standard
5.1 Overview
pgbackrest.conf:
[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2
repo1-retention-diff=7
repo1-cipher-pass=secret-passphrase
repo1-cipher-type=aes-256-cbc
repo1-s3-bucket=mycorp-pgbackup
repo1-s3-endpoint=s3.amazonaws.com
repo1-s3-region=us-east-1
repo1-s3-key=AKIA...
repo1-s3-key-secret=...
repo1-type=s3
compress-type=zst
compress-level=3
process-max=4
log-level-console=info
log-level-file=detail
[appdb-prod]
pg1-path=/var/lib/postgresql/16/main
pg1-port=5432postgresql.conf:
archive_command = 'pgbackrest --stanza=appdb-prod archive-push %p'Init:
sudo -u postgres pgbackrest --stanza=appdb-prod stanza-create
sudo -u postgres pgbackrest --stanza=appdb-prod check5.2 Backup types
# Full backup (everything)
pgbackrest --stanza=appdb-prod backup --type=full
# Differential (since last full)
pgbackrest --stanza=appdb-prod backup --type=diff
# Incremental (since last backup any type)
pgbackrest --stanza=appdb-prod backup --type=incrgantt title pgBackRest backup chain dateFormat YYYY-MM-DD section Strategy Full backup :f1, 2026-05-10, 1d Incr Mon :i1, after f1, 1d Incr Tue :i2, after i1, 1d Incr Wed :i3, after i2, 1d Incr Thu :i4, after i3, 1d Incr Fri :i5, after i4, 1d Incr Sat :i6, after i5, 1d Full backup :f2, after i6, 1d
5.3 Restore
# Latest
pgbackrest --stanza=appdb-prod restore
# Specific backup
pgbackrest --stanza=appdb-prod restore --set=20260516-020000F
# PITR
pgbackrest --stanza=appdb-prod restore \
--type=time \
--target='2026-05-16 14:30:00'
# To different host (delta restore)
pgbackrest --stanza=appdb-prod restore --delta5.4 Parallel & resume
process-max=4 → 4 parallel processes. Backup 1TB DB on 10Gbps network in ~30 minutes.
If restore interrupted, --delta resumes — only copies files different from current state. Game-changer for huge DBs.
5.5 Schedule
# Sunday 02:00 - full
0 2 * * 0 sudo -u postgres pgbackrest --stanza=appdb-prod backup --type=full
# Daily 02:00 - incremental
0 2 * * 1-6 sudo -u postgres pgbackrest --stanza=appdb-prod backup --type=incr
# Hourly verify (lightweight)
0 * * * * sudo -u postgres pgbackrest --stanza=appdb-prod check5.6 Encryption + immutability
repo1-cipher-type=aes-256-cbc → server-side encrypt before upload to S3.
Plus S3 Object Lock (write-once-read-many) → ransomware-proof:
aws s3api put-object-lock-configuration \
--bucket mycorp-pgbackup \
--object-lock-configuration '{"ObjectLockEnabled":"Enabled","Rule":{"DefaultRetention":{"Mode":"COMPLIANCE","Days":35}}}'6. WAL-G — Cloud-Native Alternative
6.1 Why WAL-G
Pros:
- Cloud-native (S3, GCS, Azure Blob, Yandex Object Storage)
- Built-in encryption (libsodium)
- Delta backups (PG12+)
- Catchup mode for replicas
- Single binary, easy deploy
Cons (vs pgBackRest):
- Less feature complete (no diff backup chain)
- Smaller community
6.2 Setup
# Environment
export WALG_S3_PREFIX=s3://mycorp-pgbackup/appdb
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export WALG_COMPRESSION_METHOD=brotli
export WALG_DELTA_MAX_STEPS=5
export WALG_LIBSODIUM_KEY=... # encryption key# postgresql.conf
archive_command = 'wal-g wal-push %p'
restore_command = 'wal-g wal-fetch %f %p'6.3 Operations
# Full backup
wal-g backup-push /var/lib/postgresql/16/main
# Delta backup (PG12+)
wal-g backup-push /var/lib/postgresql/16/main --delta-from-name=base_xxxxx
# List
wal-g backup-list
wal-g wal-show
# Restore
wal-g backup-fetch /var/lib/postgresql/16/main LATEST
# Or specific backup name
# Restore for PITR
# Set restore_command in recovery.conf / postgresql.auto.conf
# Set recovery_target_time7. Point-In-Time Recovery (PITR)
7.1 The promise
“Restore database to any moment in past, down to the second.”
Pre-requisites:
- Base backup + continuous WAL archive from then
- Recovery target reachable in WAL chain
7.2 PITR mechanics
sequenceDiagram participant T as Time participant B as Base Backup participant W as WAL Archive participant R as Recovery Note over B,W: 2026-05-15 02:00 Base backup taken Note over W: Continuous WAL pushed Note over W: 2026-05-16 14:30:01 Bad DELETE happens Note over W: Continuous WAL pushed Note over R: Disaster recovery needed R->>B: Restore base from 2026-05-15 loop Replay WAL R->>W: Fetch WAL segment W-->>R: Segment Note over R: Replay until target time end Note over R: Stop at 2026-05-16 14:29:00 Note over R: 1 second before disaster
7.3 PITR setup PG12+
In postgresql.conf (or postgresql.auto.conf):
restore_command = 'pgbackrest --stanza=appdb-prod archive-get %f %p'
recovery_target_time = '2026-05-16 14:29:00'
recovery_target_action = 'promote' # or 'pause', 'shutdown'Create signal file:
touch /var/lib/postgresql/16/main/recovery.signalThen start Postgres. It enters recovery mode, replays WAL until target.
7.4 Recovery targets
Options for recovery_target_*:
recovery_target_time = 'TIMESTAMP'— most commonrecovery_target_xid = '12345'— specific transactionrecovery_target_name = 'foobar'— restore point (created viaSELECT pg_create_restore_point('foobar'))recovery_target_lsn = '0/3000028'— specific LSNrecovery_target = 'immediate'— first consistent point after start
recovery_target_inclusive = false — stop BEFORE the target.
7.5 Recovery actions
After target reached:
promote— Postgres becomes primary, no further replaypause— pause, allow you to runSELECT pg_wal_replay_resume()orpg_promote()shutdown— clean shutdown after recovery
Pattern recommended: pause, verify with read-only queries, then promote.
7.6 Worked example — recover bad DELETE
# Disaster: someone ran "DELETE FROM users;" at 14:30:01
# 1. Stop application (don't write to corrupted state)
kubectl scale deployment app --replicas=0
# 2. Identify exact time
# Look at pg_stat_statements or audit log or app log
# Recovery target = 14:30:00 (1s before)
# 3. Restore to new directory
sudo -u postgres pg_ctl stop -D /var/lib/postgresql/16/main
# Move corrupt cluster aside (don't delete - might still be useful)
sudo mv /var/lib/postgresql/16/main /var/lib/postgresql/16/main.corrupt
sudo -u postgres pgbackrest --stanza=appdb-prod restore \
--type=time --target='2026-05-16 14:30:00' \
--target-action=pause \
--pg1-path=/var/lib/postgresql/16/main
# 4. Start, in recovery mode
sudo -u postgres pg_ctl start -D /var/lib/postgresql/16/main
# 5. Verify
psql -c "SELECT count(*) FROM users;"
# Should show pre-disaster count
# 6. Promote
psql -c "SELECT pg_promote();"
# 7. Restart app pointing to recovered cluster
kubectl scale deployment app --replicas=107.7 RPO calculation
With continuous WAL archiving:
archive_timeout = 60s→ max 60s of data lost on archive failure- Realistic RPO with healthy archive: <1 second
If archive_command fails for hours, RPO degrades. Monitor!
8. Replication for High Availability
8.1 Streaming replication (sync vs async)
graph LR Primary[(Primary)] --WAL stream--> Replica1[(Hot Standby Sync)] Primary --WAL stream--> Replica2[(Hot Standby Async)] Primary --WAL archive--> S3[(S3 backup)] style Replica1 fill:#c8e6c9 style Replica2 fill:#fff9c4
# Primary
synchronous_standby_names = 'FIRST 1 (replica1, replica2)'
# = wait for ack from first of replica1, replica2 before commitSync mode:
- ✅ RPO=0 (zero data loss)
- ❌ Latency increase ~1ms per replica round-trip
- ❌ Primary blocks if replica down
Async mode:
- ✅ Fast commit
- ❌ Replica lag possible (seconds-minutes during heavy write)
- ❌ Failover may lose data
Pattern 2024: 1 sync replica in same AZ (low latency, zero RPO) + 1 async replica cross-region (DR).
8.2 Replica setup
# On standby
sudo -u postgres pg_basebackup \
-h primary.host -U replicator \
-D /var/lib/postgresql/16/main \
-X stream -P -W
# postgresql.conf on standby:
primary_conninfo = 'host=primary.host user=replicator password=...'
restore_command = 'pgbackrest --stanza=appdb-prod archive-get %f %p'
# standby.signal file
touch /var/lib/postgresql/16/main/standby.signal8.3 Logical replication
PG10+. Replicate selected tables instead of whole cluster.
-- Primary
CREATE PUBLICATION pub_users FOR TABLE users, orders;
-- Replica
CREATE SUBSCRIPTION sub_users
CONNECTION 'host=primary user=replicator password=...'
PUBLICATION pub_users;Use cases:
- Cross-version migration (PG14 → PG16)
- Selective replication
- Different schemas on replica
- Real-time data integration
Limitations:
- No DDL replication (manual sync schema)
- Sequence values not replicated
- Truncate replication optional
- Single-threaded apply (worse for high write)
8.4 Replica lag
-- On replica
SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;
-- On primary
SELECT
client_addr,
application_name,
state,
sync_state,
pg_wal_lsn_diff(sent_lsn, replay_lsn) AS lag_bytes,
pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024.0 / 1024.0 AS lag_mb
FROM pg_stat_replication;Alert when lag > 30s. Causes:
- Heavy write on primary (replica replay single-threaded)
- Conflict with queries on replica
- Network slow
- Replica disk slow
9. Cross-Region DR
9.1 Architecture
graph TB subgraph "Region US-EAST" Primary[(Primary)] SyncRep[(Sync Replica)] S3East[(S3 us-east)] end subgraph "Region EU-WEST" AsyncRep[(Async Replica)] S3EU[(S3 eu-west)] end Primary -.sync.-> SyncRep Primary -.async.-> AsyncRep Primary -.WAL archive.-> S3East S3East -.cross-region replicate.-> S3EU
9.2 Strategies
| Strategy | RPO | RTO | Cost |
|---|---|---|---|
| Daily snapshot cross-region | 1 day | hours | $ |
| Hourly snapshot + WAL ship | 1 hour | hour | $$ |
| Async replica cross-region | seconds | minutes | $$$ |
| Sync replica cross-region | 0 | minutes | $$$$ + latency cost |
Pattern 2024: Async replica + WAL archive to cross-region S3.
9.3 S3 cross-region replication
aws s3api put-bucket-replication \
--bucket mycorp-pgbackup-us-east \
--replication-configuration file://replication.json{
"Role": "arn:aws:iam::ACCT:role/s3-replication",
"Rules": [{
"Status": "Enabled",
"Priority": 1,
"Filter": {"Prefix": ""},
"Destination": {
"Bucket": "arn:aws:s3:::mycorp-pgbackup-eu-west",
"ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
"Metrics": {"Status": "Enabled"}
},
"DeleteMarkerReplication": {"Status": "Enabled"}
}]
}Auto-replicate, ~15-min lag. Backup safe across region failure.
9.4 DR runbook
Critical: have a tested runbook. Sample:
## DR Runbook
### Trigger
- Region US-EAST unreachable >15 min, confirmed via Slack #incidents
### Step 1 — Failover decision (5 min)
- Engineering Manager + SRE on-call decide
- Communicate to Slack #status
### Step 2 — Promote EU replica (5 min)
```bash
ssh eu-replica
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/16/mainStep 3 — Update DNS (5 min)
aws route53 change-resource-record-sets \
--hosted-zone-id Z123 \
--change-batch file://failover-dns.jsonStep 4 — App reconfigure (5 min)
- App reads DB endpoint from DNS → automatic
- Verify connections via pgbouncer logs
Step 5 — Verify (10 min)
- App health checks green
- Synthetic test: place order, check
- Communicate to customers
Total RTO: ~30 min
---
## 10. Restore Drill — quan trọng nhất
### 10.1 Why drill
Backup không tested = backup không tồn tại. Drill:
- Verify backup completeness
- Verify procedure documented & followable
- Measure actual RTO (not theoretical)
- Train team — không ai panic vào hôm thật
### 10.2 Drill cadence
| Frequency | Activity |
|-----------|----------|
| Daily | Automated: restore latest backup to test cluster, run sanity SQL |
| Weekly | Restore + run app test suite against restored DB |
| Monthly | Manual full DR drill — promote replica, failover |
| Quarterly | Cross-region failover, app-level smoke test |
| Yearly | Tabletop exercise: simulate ransomware, full re-build |
### 10.3 Automated daily drill
```bash
#!/bin/bash
# /opt/scripts/backup_drill.sh
set -e
DRILL_HOST=drill-postgres.internal
TARGET_DATE=$(date -d 'yesterday' +'%Y-%m-%d 23:59:59')
# Restore latest backup to drill cluster
ssh $DRILL_HOST "
sudo -u postgres pg_ctl stop -D /var/lib/postgresql/16/drill || true
sudo rm -rf /var/lib/postgresql/16/drill
sudo -u postgres pgbackrest --stanza=appdb-prod restore \
--type=time --target='$TARGET_DATE' \
--pg1-path=/var/lib/postgresql/16/drill
sudo -u postgres pg_ctl start -D /var/lib/postgresql/16/drill
"
# Sanity SQL
ssh $DRILL_HOST "psql -p 5433 -c '
SELECT count(*) FROM users;
SELECT max(placed_at) FROM orders;
SELECT pg_database_size(current_database());
'"
# Report to monitoring
curl -X POST https://metrics.internal/backup_drill_success \
-d "duration_seconds=$SECONDS&target=$TARGET_DATE"
Run nightly. Alert if fails.
10.4 What to verify
- Restore completes within RTO target
pg_database_sizematches expected- App-side: representative queries succeed
- Specific known records exist (e.g., user 1, oldest order)
- No corruption:
pg_amcheckon critical indexes
11. Cloud Managed Services
11.1 AWS RDS / Aurora
Automated backups:
- Daily snapshot
- Continuous WAL → PITR up to 35 days (configurable)
- Cross-region snapshot copy
Restore:
- Console: “Restore to point in time” → pick timestamp → new cluster
- API:
RestoreDBInstanceToPointInTime
Aurora specific:
- Storage-level continuous backup (no separate WAL archive)
- 1-second PITR granularity
- Fast clone — copy-on-write, instant cluster spin-up
- Backtrack (Aurora MySQL only) — rewind without restore
Cost: ~equal to storage cost; cross-region copy egress charges.
11.2 Google Cloud SQL
Similar features:
- Automated backups
- PITR
- Cross-region replica
11.3 Limitations of managed
- Less control over WAL archive policy
- Can’t run pgBackRest yourself (no shell access)
- Vendor lock-in
- Restore creates new instance — IP changes, DNS update needed
- Cost predictable but not optimized
11.4 Self-managed vs managed
flowchart TD A[Choose backup strategy] --> B{Team size?} B -->|<5 engineers| C[Managed - RDS/Cloud SQL] B -->|10+ engineers| D{Compliance/control needed?} D -->|Yes| E[Self-manage on EC2/k8s with pgBackRest] D -->|No| F[Managed] A --> G{Cost-sensitive at scale?} G -->|Yes, >$10K/month| H[Self-manage] style C fill:#c8e6c9 style F fill:#c8e6c9 style E fill:#fff9c4 style H fill:#fff9c4
12. Encryption & Compliance
12.1 Encryption at rest
Three layers:
- Disk — LUKS, AWS EBS encryption (most common, transparent)
- Backup — pgBackRest
cipher-pass, WAL-G libsodium - TDE (Transparent Data Encryption) — NOT in Postgres core (even PG17/18). Available trong fork: EDB Advanced Server, Cybertec TDE patches, Percona Postgres. Community proposal long-running. → Real-world solution: dùng disk-level encryption + column-level cho PII
12.2 Encryption in transit
# postgresql.conf
ssl = on
ssl_cert_file = '/path/to/server.crt'
ssl_key_file = '/path/to/server.key'
ssl_ca_file = '/path/to/ca.crt'
# pg_hba.conf
hostssl all all 0.0.0.0/0 scram-sha-25612.3 Compliance considerations
- GDPR: right to be forgotten — backup retention limits
- HIPAA: encryption everywhere
- PCI-DSS: 12-month audit log retention
- SOX: 7-year retention
→ Retention policy reflect compliance, not “as long as possible”.
12.4 Backup ransomware protection
Threats: ransomware encrypts your backup too.
Defense:
- Immutable storage — S3 Object Lock (compliance mode)
- Air-gapped copy — periodic offline copy
- MFA on delete — S3 versioning + MFA delete
- Separate credentials — backup write-only, restore read-only
# S3 Object Lock - cannot delete for 30 days
aws s3api put-object-lock-configuration \
--bucket prod-pgbackup \
--object-lock-configuration '{
"ObjectLockEnabled": "Enabled",
"Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}}
}'13. Cost Optimization
13.1 Storage tiering
S3 lifecycle for backup:
{
"Rules": [{
"Id": "BackupLifecycle",
"Status": "Enabled",
"Transitions": [
{"Days": 30, "StorageClass": "STANDARD_IA"},
{"Days": 90, "StorageClass": "GLACIER"},
{"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
],
"Expiration": {"Days": 2555}
}]
}Saves 60-80% storage cost on old backups.
13.2 Compression
| Method | Ratio | Speed |
|---|---|---|
| gzip | 2-3x | Medium |
| zstd | 3-4x | Fast |
| brotli | 4-5x | Slow |
| lz4 | 1.5x | Very fast |
pgBackRest default zstd. WAL-G brotli.
13.3 Incremental + diff backup
Daily full backup of 1TB = 1TB × 30 days = 30TB stored. Weekly full + daily diff:
- 4 fulls (4TB) + 28 diffs (~50GB each = 1.4TB) = 5.4TB total
→ 80% storage save.
14. Anti-patterns
| Pattern | Tại sao tệ | Fix |
|---|---|---|
| Backup chưa test restore | Có thể không restore được khi cần | Monthly restore drill |
| Backup cùng region với primary | Region disaster wipes cả 2 | Cross-region replicate |
| Retention = forever | Cost run away | Lifecycle policy + compliance-driven retention |
| archive_command fail silently | pg_wal grows, eventually disk full | Monitor pg_stat_archiver, alert |
| Backup user = superuser | Compromise = data leak + delete | Limited role, S3 write-only |
| Backup encrypted with key on same server | Server compromise = both lost | KMS / separate vault |
| Logical backup as primary for 500GB DB | pg_dump too slow | Physical backup |
| pg_basebackup as production tool | No retention, no incremental | pgBackRest/WAL-G |
| Restore in production cluster | Risk overwriting good data | Always restore to test cluster first |
| No standby replica | RTO measured in hours | At least 1 hot standby |
15. Lab — 7 days
15.1 Day 1: pg_dump basics
# Setup PG 16 with sample data (e-commerce from Tuần 02)
docker compose up -d postgres
# Backup
pg_dump -h localhost -U postgres -Fc appdb -f appdb.dump
ls -lh appdb.dump
# Restore to new DB
createdb appdb_restored
pg_restore -h localhost -U postgres -d appdb_restored -j 4 appdb.dump
# Verify
psql -d appdb_restored -c "SELECT count(*) FROM orders;"15.2 Day 2: WAL archiving manual
# Add to postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /tmp/wal_archive/%f && cp %p /tmp/wal_archive/%f'
archive_timeout = 60
mkdir /tmp/wal_archive
# Restart Postgres
# Force WAL switch
psql -c "SELECT pg_switch_wal();"
ls -la /tmp/wal_archive
# Should see WAL segment file15.3 Day 3: pgBackRest setup
docker compose up -d # postgres + pgbackrest container
pgbackrest --stanza=appdb stanza-create
pgbackrest --stanza=appdb backup --type=full
# View backup info
pgbackrest --stanza=appdb info15.4 Day 4: PITR exercise
# State 1: insert known data
psql -c "INSERT INTO users (email, name) VALUES ('[email protected]', 'Alice');"
# Note time
RECOVERY_TIME=$(psql -t -c "SELECT now();")
# State 2: do bad thing
psql -c "DELETE FROM users WHERE email = '[email protected]';"
# Stop Postgres, restore to RECOVERY_TIME
pg_ctl stop
pgbackrest --stanza=appdb restore --type=time --target="$RECOVERY_TIME"
# Start, check Alice is back
pg_ctl start
psql -c "SELECT * FROM users WHERE email = '[email protected]';"15.5 Day 5: Streaming replication
# Setup primary + replica via docker-compose
# Create replication user
psql -c "CREATE USER replicator REPLICATION LOGIN PASSWORD 'rep';"
# On replica
pg_basebackup -h primary -U replicator -D /var/lib/postgresql/data -P -X stream
touch /var/lib/postgresql/data/standby.signal
# Verify
psql -c "SELECT * FROM pg_stat_replication;" -h primary
psql -c "SELECT pg_is_in_recovery();" -h replica15.6 Day 6: Failover simulation
# Stop primary (simulate failure)
docker stop pg_primary
# Promote replica
docker exec pg_replica pg_ctl promote
# Verify
docker exec pg_replica psql -c "SELECT pg_is_in_recovery();" # false now15.7 Day 7: Drill automation
Write daily_drill.sh that:
- Restores latest backup to test cluster
- Runs sanity queries
- Reports success/fail to log
- Cleanup test cluster
Cron it to run nightly.
16. Self-check
- RPO vs RTO — định nghĩa? Backup strategy ảnh hưởng cái nào nhiều hơn?
- Logical vs physical backup — khi nào dùng cái nào?
- archive_command fail → hệ quả? Monitor metric nào?
- PITR cần gì? Step-by-step recover từ disaster.
- Sync vs async replica — pros/cons? Pattern 2024 thường dùng?
- Cross-region DR strategy: 4 options sorted by RPO+RTO+cost?
- Restore drill — vì sao quan trọng? Frequency?
- Ransomware threat — backup defense?
- RDS auto backup vs self-managed pgBackRest — pick khi nào?
- archive_command idempotency — tại sao critical?
17. Tiếp theo
Bài tiếp: Tuan-08-Zero-Downtime-Migration — migration không xuống.
Tuần 07 hoàn thành. Tested backup = real backup. Cập nhật: 2026-05-16