Tuần 09 — DB Observability & Performance Tuning

“Production DB chạy chậm. Bạn thấy CPU 80%, IOPS 5K/s, connection 200. Bạn vẫn không biết tại sao. Vì observability không phải số liệu — observability là khả năng trả lời câu hỏi ‘tại sao’. Tuần này dạy bạn xây capability đó.”

Tags: database postgresql observability performance tuning operations Thời lượng: 7 ngày (5-7h/ngày) Prerequisites: Tuan-04-Query-Optimization-EXPLAIN · Tuan-01-DB-Internals-Refresh Liên quan: Tuan-06-Connection-Pooling-PgBouncer · Tuan-13-Monitoring-Observability (SD course)

1. Context & Why

1.1 USE method cho DB

Brendan Gregg’s USE method:

Utilization — % time resource was busy
Saturation — degree resource has work it can’t service
Errors — error events

Áp dụng cho DB:

Resource	Utilization	Saturation	Errors
CPU	%busy	run queue length	crashes
Memory	%used	swap activity, OOM	OOM kills
Disk I/O	%busy, IOPS	queue length	I/O errors
Network	bandwidth used	drops, retransmits	connection errors
DB connections	active/max	clients waiting	connection refused
Buffer cache	hit ratio	eviction rate	—
WAL	write throughput	archive backlog	archive failures
Replication	replay rate	lag	replication slot drop
Locks	held count	wait queue length	deadlocks
Vacuum	tables/sec	dead tup ratio	xid wraparound risk

1.2 4 Golden signals của DB

Throughput — TPS/QPS
Latency — P50/P95/P99
Errors — failed queries, deadlocks, OOM
Saturation — connection pool, IOPS, CPU

1.3 Mục tiêu tuần

Master pg_stat_statements — query attribution
Đọc pg_stat_user_tables để spot bloat, sequential scan abuse
pg_stat_io (PG16+) — phân tích I/O patterns
Tune autovacuum cho high-write workload
Setup Prometheus + Grafana với postgres_exporter
Capacity planning thực dụng
Build dashboard “Top 10 slow queries” + alerts
Compare commercial tools: pganalyze, Datadog DBM, AWS Performance Insights

1.4 Tham chiếu

PostgreSQL Documentation: Monitoring Database Activity — Ch. 28
pg_stat_statements docs — https://www.postgresql.org/docs/current/pgstatstatements.html
postgres_exporter — https://github.com/prometheus-community/postgres_exporter
pgwatch2 — https://github.com/cybertec-postgresql/pgwatch2
Cybertec, Crunchy Data, EDB blogs — postgresql performance
PgAnalyze blog — https://pganalyze.com/blog (deep dives)
Brendan Gregg — Systems Performance (book) cho USE method

2. pg_stat_statements — Query attribution

2.1 Setup

# postgresql.conf
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.max = 10000
pg_stat_statements.track = all  # or 'top'
pg_stat_statements.track_utility = on
pg_stat_statements.save = on

CREATE EXTENSION pg_stat_statements;

2.2 Core queries

Top 10 by total time (the most important query):

SELECT
    substring(query, 1, 80) AS query,
    calls,
    round(total_exec_time::numeric, 0) AS total_ms,
    round(mean_exec_time::numeric, 1) AS mean_ms,
    round(stddev_exec_time::numeric, 1) AS stddev_ms,
    round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 1) AS pct_total
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

Top 10 by call frequency (often-called slow queries):

SELECT substring(query, 1, 80), calls, mean_exec_time
FROM pg_stat_statements
WHERE mean_exec_time > 10  -- >10ms
ORDER BY calls DESC
LIMIT 10;

Top 10 by I/O:

SELECT
    substring(query, 1, 80),
    calls,
    shared_blks_read,
    shared_blks_hit,
    round((100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0))::numeric, 1) AS cache_hit_pct
FROM pg_stat_statements
ORDER BY shared_blks_read DESC
LIMIT 10;

2.3 Other useful columns

PG14+ separates plan time and execution time:

total_plan_time, mean_plan_time — planner overhead
total_exec_time, mean_exec_time — execution
rows — total rows returned
shared_blks_* — pages read/hit/dirtied
temp_blks_read/written — temp file (sort spill, hash spill)
wal_records, wal_bytes — WAL generated
jit_* (PG14+) — JIT compilation stats

2.4 Reset

-- Reset all
SELECT pg_stat_statements_reset();
 
-- Reset for specific user
SELECT pg_stat_statements_reset(userid, dbid);

Reset weekly to keep stats fresh (snapshot perspective).

2.5 Query normalization

pg_stat_statements normalizes literals:

-- Original
SELECT * FROM users WHERE id = 42;
 
-- Normalized in pg_stat_statements
SELECT * FROM users WHERE id = $1

This means many similar queries aggregate. But:

IN (...) with different lengths = different queries
DDL not normalized
Comments NOT stripped (PG13+: track_utility setting)

2.6 Anti-patterns to find

Query taking 80% total time → that’s your target.

-- Pattern: high calls × moderate time = compound problem
SELECT query, calls * mean_exec_time AS total_impact
FROM pg_stat_statements
ORDER BY total_impact DESC LIMIT 10;

3. pg_stat_user_tables — Table-level stats

3.1 Sequential scan abuse

SELECT
    schemaname, relname,
    seq_scan, seq_tup_read,
    idx_scan, idx_tup_fetch,
    seq_tup_read / nullif(seq_scan, 0) AS avg_seq_tup,
    round(100.0 * seq_scan / nullif(seq_scan + idx_scan, 0), 1) AS pct_seq
FROM pg_stat_user_tables
WHERE seq_scan > 100
ORDER BY seq_tup_read DESC
LIMIT 20;

seq_tup_read >> idx_tup_fetch and table is large → missing index.

3.2 Bloat detection

SELECT
    schemaname || '.' || relname AS table,
    n_live_tup,
    n_dead_tup,
    round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
    pg_size_pretty(pg_total_relation_size(schemaname || '.' || relname)) AS size,
    last_autovacuum,
    last_vacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY dead_pct DESC
LIMIT 20;

dead_pct > 30% → autovacuum behind. Investigate.

3.3 Hot tables — write rate

SELECT
    relname,
    n_tup_ins, n_tup_upd, n_tup_del,
    n_tup_hot_upd,
    round(100.0 * n_tup_hot_upd / nullif(n_tup_upd, 0), 1) AS hot_pct
FROM pg_stat_user_tables
WHERE n_tup_upd > 1000
ORDER BY n_tup_upd DESC
LIMIT 10;

n_tup_hot_upd / n_tup_upd — HOT (Heap-Only Tuple) update %. High = good (no index update needed). Low = many index updates → more WAL, more lock contention.

Improve HOT %:

Reduce indexes on hot-updated columns
fillfactor < 100 leaves space in page for HOT updates

ALTER TABLE hot_table SET (fillfactor = 80);
-- 20% free space in each page for HOT updates

3.4 Index usage

SELECT
    schemaname || '.' || relname AS table,
    indexrelname AS index,
    idx_scan,
    pg_size_pretty(pg_relation_size(indexrelid)) AS size,
    pg_relation_size(indexrelid) AS size_bytes
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY size_bytes DESC
LIMIT 20;

Find unused (Tuần 03 deep dive):

SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid))
FROM pg_stat_user_indexes
WHERE idx_scan = 0 AND indexrelname NOT LIKE '%_pkey'
ORDER BY pg_relation_size(indexrelid) DESC;

3.5 Cache hit ratio per table

SELECT
    schemaname || '.' || relname AS table,
    heap_blks_read,
    heap_blks_hit,
    round(100.0 * heap_blks_hit / nullif(heap_blks_hit + heap_blks_read, 0), 1) AS hit_pct
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 20;

Hot tables should have >99% hit. Low hit → cold cache or table too big for buffer pool.

4. pg_stat_io (PG16+) — I/O analytics

4.1 What it shows

SELECT
    backend_type,
    object,         -- relation/temp_relation
    context,        -- normal/vacuum/bulkread/bulkwrite
    reads, read_time,
    writes, write_time,
    extends, extend_time,
    hits,
    evictions,
    reuses,
    fsyncs, fsync_time
FROM pg_stat_io;

Powerful: see what’s doing I/O. Backend type:

client backend — app queries
autovacuum worker
background writer
checkpointer
walwriter

Context:

normal — typical operations
vacuum — vacuum activity
bulkread — large sequential scans (use ring buffer)
bulkwrite — COPY/CTAS

4.2 Use cases

Who’s reading from disk most?

SELECT backend_type, sum(reads) AS reads
FROM pg_stat_io
GROUP BY backend_type
ORDER BY reads DESC;

Vacuum I/O cost:

SELECT
    sum(reads) AS reads,
    sum(read_time) / 1000 AS read_time_sec,
    sum(writes) AS writes
FROM pg_stat_io
WHERE backend_type = 'autovacuum worker';

5. Autovacuum — The Most Important Background Process

5.1 What autovacuum does

graph TB
    A[Autovacuum launcher<br/>checks every minute] --> B{Any table due?}
    B -->|No| A
    B -->|Yes| C[Spawn worker]
    C --> D{Per table: dead_tuples vs threshold}
    D -->|VACUUM needed| E[Run VACUUM]
    D -->|ANALYZE needed| F[Run ANALYZE]
    D -->|VACUUM FREEZE needed| G[Run aggressive vacuum]
    E --> H[Update FSM/VM]
    E --> I[Mark dead tuples reusable]

Triggers:

autovacuum_vacuum_threshold + scale_factor × n_live_tup rows changed → VACUUM
autovacuum_analyze_threshold + scale_factor × n_live_tup → ANALYZE
autovacuum_freeze_max_age exceeded → forced VACUUM FREEZE

5.2 Default settings

autovacuum = on                          # NEVER disable
autovacuum_max_workers = 3
autovacuum_naptime = 1min
autovacuum_vacuum_threshold = 50         # base rows
autovacuum_vacuum_scale_factor = 0.2     # 20%
autovacuum_analyze_threshold = 50
autovacuum_analyze_scale_factor = 0.1    # 10%
autovacuum_vacuum_cost_delay = 2ms       # throttle
autovacuum_vacuum_cost_limit = 200       # throttle budget
autovacuum_freeze_max_age = 200000000    # 200M xid age

5.3 Tuning for high-write tables

Default scale_factor = 0.2 means vacuum waits until 20% bloat. On 1B row table = 200M dead tuples → vacuum runs forever.

Per-table tuning:

ALTER TABLE hot_orders SET (
    autovacuum_vacuum_scale_factor = 0.05,           -- 5% trigger
    autovacuum_vacuum_threshold = 1000,
    autovacuum_analyze_scale_factor = 0.02,           -- 2% trigger
    autovacuum_vacuum_cost_delay = 0,                 -- no throttle
    autovacuum_vacuum_cost_limit = 2000               -- fast vacuum
);

PG13+ added insert-trigger (for append-mostly tables):

ALTER TABLE event_log SET (
    autovacuum_vacuum_insert_scale_factor = 0.05,    -- vacuum after 5% inserts (for VM bit setup)
    autovacuum_vacuum_insert_threshold = 1000
);

5.4 Detect autovacuum behind

-- Long-running autovacuum
SELECT
    pid, datname, relid::regclass AS table,
    phase,                              -- 'scanning heap', 'cleaning indexes', etc
    heap_blks_total, heap_blks_scanned,
    round(100.0 * heap_blks_scanned / nullif(heap_blks_total, 0), 1) AS pct_done,
    pg_size_pretty(heap_blks_total * 8192) AS table_size
FROM pg_stat_progress_vacuum
JOIN pg_stat_activity USING (pid)
WHERE backend_type = 'autovacuum worker';

PG14+: pg_stat_progress_vacuum includes index_vacuum_count, much detail.

5.5 Anti-correlated indicators

Autovacuum struggle:

n_dead_tup grows faster than dropping
last_autovacuum long ago (>1 day for hot tables)
Bloat grows
Replication lag spikes during vacuum

Causes:

Long-running transactions hold xmin horizon → vacuum can’t remove
autovacuum_vacuum_cost_limit too low → throttled
Too few autovacuum_max_workers for table count
IO bottleneck

5.6 Per-table tuning matrix

Workload	scale_factor	threshold	cost_delay	Notes
Read-mostly	0.2 default	50 default	2ms default	OK
Mostly insert (event log)	0.05	1000	0	+ insert_scale_factor 0.02
Heavy update (queue, counter)	0.02	100	0	+ cost_limit 5000
Append-then-update (orders)	0.1	500	0	+ insert_scale_factor 0.05
Bulk load then read	autovacuum off during load	—	—	VACUUM ANALYZE after
Soft-delete heavy	0.05	500	0	Consider partition + drop

-- Apply per-table
ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor = 0.05,
    autovacuum_vacuum_threshold = 1000,
    autovacuum_analyze_scale_factor = 0.02,
    autovacuum_vacuum_cost_delay = 0,
    autovacuum_vacuum_cost_limit = 5000,
    autovacuum_vacuum_insert_scale_factor = 0.05,    -- PG13+
    autovacuum_vacuum_insert_threshold = 1000        -- PG13+
);

5.7 Vacuum throttling math

Default budget:

vacuum_cost_limit = 200 units / vacuum_cost_delay

Cost per page (default):

Shared buffer hit: 1 unit
Buffer miss: 10 units
Dirty page write: 20 units

Vacuum scanning 1M pages, 50% miss rate, 10% dirty:
  500K × 1 + 500K × 10 + 100K × 20 = 7.5M units
  At 200 units / 2ms delay = 100K units/sec
  Total time: 75 seconds throttled

For background priority: increase cost_limit + remove delay:

ALTER TABLE hot_table SET (autovacuum_vacuum_cost_limit = 5000, autovacuum_vacuum_cost_delay = 0);

5.8 Manual VACUUM with options (PG16+)

-- PG16: parallel vacuum index, limit RAM
VACUUM (PARALLEL 4, ANALYZE, VERBOSE, BUFFER_USAGE_LIMIT '1GB') orders;
 
-- Quick freeze for wraparound prevention
VACUUM (FREEZE, ANALYZE) orders;
 
-- Disable analyze if just space reclaim
VACUUM (DISABLE_PAGE_SKIPPING) orders;

5.9 Verify VM accuracy

CREATE EXTENSION pg_visibility;
 
-- Find pages not all-visible (where index-only scan would heap-fetch)
SELECT count(*) FROM pg_visibility_map('orders'::regclass) WHERE NOT all_visible;
 
-- Force VM rebuild if suspect
VACUUM (DISABLE_PAGE_SKIPPING) orders;

5.10 Diagnose xmin horizon block

SELECT
    pid,
    age(backend_xmin) AS xmin_age,
    now() - xact_start AS xact_duration,
    state,
    query
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC NULLS LAST
LIMIT 10;

Top entry with xmin_age in millions → long transaction blocking vacuum globally.

Setting:

idle_in_transaction_session_timeout = 30s   # kill idle transactions
statement_timeout = 60s                      # kill runaway queries

5.7 XID wraparound monitor

SELECT
    datname,
    age(datfrozenxid) AS xid_age,
    2^31 - age(datfrozenxid) AS wraparound_distance,
    pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY xid_age DESC;

Alert when xid_age > 1.5B (75% threshold). Forced VACUUM FREEZE runs at 200M default but can be triggered earlier.

Per-table:

SELECT relname,
       age(relfrozenxid) AS xid_age,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS size
FROM pg_class c
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 20;

6. Slow Query Log

6.1 Setup

log_min_duration_statement = 1000   # log queries >1s
log_duration = off                   # don't log all (noise)
log_lock_waits = on                  # log lock waits >deadlock_timeout
log_temp_files = 0                   # log all temp file creation
log_checkpoints = on
log_connections = on                  # for audit
log_disconnections = on
log_autovacuum_min_duration = 1000   # log vacuum >1s
 
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '

6.2 Parse with pgBadger

pgbadger /var/log/postgresql/postgresql-16-main.log -o report.html

Gives:

Slow query distribution
Hourly traffic
Lock waits
Connection activity
Top queries by various criteria

6.3 auto_explain — capture plan for slow queries

shared_preload_libraries = 'auto_explain, pg_stat_statements'
auto_explain.log_min_duration = 1000
auto_explain.log_analyze = on
auto_explain.log_buffers = on
auto_explain.log_format = 'json'
auto_explain.log_triggers = on

When query >1s, full EXPLAIN ANALYZE plan logged. Diagnose ad-hoc slow queries without reproducing.

⚠️ Cost: ~5-20% overhead due to per-row timing. Use selectively.

7. Prometheus + Grafana — DIY observability

7.1 Stack

graph LR
    PG[(Postgres)] -.queries.-> Exporter[postgres_exporter]
    PGB[PgBouncer] -.SHOW STATS.-> PGBExporter[pgbouncer_exporter]
    Node[Linux node] -.metrics.-> NodeExp[node_exporter]

    Exporter --> Prom[Prometheus]
    PGBExporter --> Prom
    NodeExp --> Prom

    Prom --> Graf[Grafana]
    Prom --> AM[Alertmanager]
    AM --> Slack[Slack/PagerDuty]

7.2 postgres_exporter setup

# docker-compose
services:
  postgres_exporter:
    image: prometheuscommunity/postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://monitor_user:pass@postgres:5432/appdb?sslmode=disable"
    ports: ["9187:9187"]

Create monitor user:

CREATE USER monitor_user WITH PASSWORD '...';
GRANT pg_monitor TO monitor_user;
GRANT CONNECT ON DATABASE appdb TO monitor_user;

7.3 Custom metrics queries

queries.yaml:

pg_stat_statements_top:
  query: |
    SELECT substring(query, 1, 100) AS query_text, calls, total_exec_time, mean_exec_time
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC LIMIT 20
  metrics:
    - query_text:
        usage: "LABEL"
    - calls:
        usage: "COUNTER"
    - total_exec_time:
        usage: "COUNTER"
    - mean_exec_time:
        usage: "GAUGE"

Mount and pass to exporter via --extend.query-path=/queries.yaml.

7.4 Grafana dashboard essentials

Panel 1: TPS — rate(pg_stat_database_xact_commit[1m]) per DB Panel 2: Latency P99 — from pg_stat_statements mean+stddev approximations Panel 3: Connection count — pg_stat_database_numbackends Panel 4: Replication lag — pg_replication_lag_seconds Panel 5: Cache hit ratio — rate(pg_stat_database_blks_hit) / rate(pg_stat_database_blks_hit + blks_read) Panel 6: Locks waiting — pg_locks_count{mode='ExclusiveLock', granted='false'} Panel 7: Autovacuum — pg_stat_activity_count{backend_type='autovacuum worker'} Panel 8: WAL generation rate — rate(pg_stat_wal_records[1m]) Panel 9: Slow query top 10 — from custom query Panel 10: Bloat top 10 — from custom query

Public dashboards on Grafana.com: search “postgres exporter”.

7.5 Alerts

groups:
- name: postgres
  rules:
  - alert: PostgresDown
    expr: pg_up == 0
    for: 1m
  - alert: PostgresHighConnections
    expr: pg_stat_database_numbackends / pg_settings_max_connections > 0.8
    for: 5m
  - alert: PostgresReplicationLag
    expr: pg_replication_lag_seconds > 30
    for: 5m
  - alert: PostgresXIDWraparound
    expr: pg_database_xid_age > 1500000000
    for: 5m
    annotations:
      summary: "XID wraparound risk"
      runbook: "https://wiki.internal/postgres/xid-wraparound"
  - alert: PostgresHighDiskIO
    expr: rate(node_disk_io_time_seconds_total[1m]) > 0.8
    for: 10m

8. Tuning Common Issues

8.1 High CPU

-- Top CPU queries (PG14+ has total_exec_time)
SELECT substring(query, 1, 80), calls, total_exec_time / sum(total_exec_time) OVER () AS pct
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 10;
 
-- Currently active expensive queries
SELECT pid, now() - query_start AS duration, state, substring(query, 1, 80)
FROM pg_stat_activity
WHERE state = 'active' AND query NOT ILIKE '%pg_stat_activity%'
ORDER BY duration DESC LIMIT 10;

Fixes:

Add indexes (Tuần 03)
Rewrite queries (Tuần 04)
Increase planner cost params correctly
Scale CPU

8.2 High Disk I/O

SELECT backend_type, sum(reads), sum(writes)
FROM pg_stat_io
GROUP BY backend_type
ORDER BY sum(reads) DESC;

Causes:

Cold cache → tune shared_buffers
Big sequential scans → add index
Vacuum throttle → tune
Checkpoint flood → tune checkpoint_timeout, max_wal_size

shared_buffers = 8GB           # ~25-35% RAM
effective_cache_size = 24GB     # OS cache + shared_buffers
checkpoint_timeout = 15min
max_wal_size = 8GB
checkpoint_completion_target = 0.9

8.3 High Latency Specific Query

EXPLAIN ANALYZE (Tuần 04), then targeted fix.

8.4 Connection saturation

Tuần 06. Pooler.

8.5 Replica Lag — Deep Diagnose

Three lag types

-- On primary
SELECT
    application_name,
    state,
    pg_size_pretty(pg_wal_lsn_diff(sent_lsn, write_lsn)) AS network_lag,
    pg_size_pretty(pg_wal_lsn_diff(write_lsn, flush_lsn)) AS disk_lag,
    pg_size_pretty(pg_wal_lsn_diff(flush_lsn, replay_lsn)) AS apply_lag,
    write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

Lag	Meaning	Cause	Fix
`write_lag`	Sent → received by replica	Network slow	Bandwidth, geographically closer
`flush_lag`	Received → fsync’d to disk	Replica disk slow	Better disk, async (`synchronous_commit=off`)
`replay_lag`	fsync’d → applied	Replica CPU/apply bottleneck	Single-threaded apply is the limit

On replica side

SELECT
    pg_is_in_recovery() AS is_replica,
    pg_last_wal_receive_lsn() AS received,
    pg_last_wal_replay_lsn() AS applied,
    now() - pg_last_xact_replay_timestamp() AS replay_age;

replay_age is time-based lag. Goes UP when no writes happen on primary (no new WAL to apply, looks like idle).

Causes — root cause analysis

Bulk DML on primary — UPDATE huge_table SET x=y generates massive WAL → single-threaded apply lags. Fix: chunk DML (Tuần 08).
Network bottleneck — write_lag high. Check bandwidth, latency. Co-locate if possible.
Replica disk slow — flush_lag high. Bigger IOPS or use synchronous_commit = local (trades durability for speed).
Long queries on replica — Replay paused for queries (snapshot conflict). Check:

SELECT * FROM pg_stat_database_conflicts WHERE datname = current_database();
-- confl_tablespace, confl_lock, confl_snapshot, confl_bufferpin, confl_deadlock

Cascading replica behind another replica — debug all hops.

hot_standby_feedback trade-off

# On replica
hot_standby_feedback = on

Replica tells primary “I’m reading xid X, don’t vacuum yet”
Pro: replica queries don’t get cancelled
Con: primary bloat accumulates while replica has long query

For analytics replica with hour-long queries: BAD (massive bloat). For OLTP read replica with sub-second queries: OK.

max_standby_streaming_delay

# On replica
max_standby_streaming_delay = 30s

How long replica delays WAL apply to let queries finish before cancelling. Trade-off:

High value: queries safer, lag grows
Low value: queries cancelled more, lag stays low

Parallel apply (PG16+)

max_logical_replication_workers = 4

For logical replication. Physical replication still single-threaded apply (as of PG17).

PG14+ has max_parallel_workers parallel apply for specific operations (large CREATE INDEX from primary), but day-to-day WAL replay single thread.

This is the fundamental bottleneck: write rate on primary must be ⇐ apply rate on replica. If primary 5GB/hour WAL → replica can apply ~2-5GB/hour realistically.

Replication slot lag (logical replication)

SELECT slot_name, active,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
FROM pg_replication_slots;

If lag growing → consumer stuck. Safety valve max_slot_wal_keep_size (Bonus CDC).

Quick fixes

Avoid bulk DML on primary (or chunk it)
Move heavy queries off replica (or use logical replication to dedicated analytics replica)
Faster network + disk on replica
More replicas (read-only fanout) instead of overloading one
Accept eventual consistency, design app for stale reads

Application impact

Read-after-write inconsistency: app writes to primary, reads from replica, doesn’t see write.

Patterns:

Read from primary for short window after write
Sticky session to primary
Causal consistency: pass LSN, wait on replica until applied
Accept (UI shows “saving…” until refresh)

8.6 Bloat

Tuần 01 + 03. VACUUM tune + monitor + pg_repack.

8.7 Lock waits

-- Currently waiting
SELECT pid, wait_event, wait_event_type, now() - state_change AS waited, query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;
 
-- Blocking tree
SELECT blocked.pid AS blocked_pid, blocked.query AS blocked_query,
       blocking.pid AS blocking_pid, blocking.query AS blocking_query,
       blocking.state AS blocking_state
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.wait_event_type = 'Lock';

9. Capacity Planning

9.1 Growth metrics

Daily snapshot:

SELECT now() AS captured_at,
       sum(pg_total_relation_size(oid)) AS total_size,
       sum(pg_indexes_size(oid)) AS index_size
FROM pg_class WHERE relkind = 'r';

Plot weekly → linear/exponential? Forecast 6 months.

9.2 Connection growth

Monitor pg_stat_database.numbackends peak per hour. Trend:

Linear with users? → scale by adding pooler / instances
Spike-driven? → tune pool

9.3 IOPS

Cloud (AWS): CloudWatch ReadIOPS, WriteIOPS. RDS DiskQueueDepth.

Required IOPS = TPS × queries_per_transaction × pages_per_query × hit_miss_ratio

Example: 1000 TPS × 5 queries × 10 pages × 5% miss = 2500 IOPS (read). Write IOPS = write rate × write amplification (~3-5x).

9.4 RAM

RAM needed = shared_buffers + max_connections × work_mem × ops + maintenance_work_mem × autovacuum_workers + OS cache target

For 100 connections, 16MB work_mem, 3 ops/query avg, 64MB maint_work_mem × 3 workers, 25% OS cache:

8GB + 100 × 16MB × 3 + 64MB × 3 + 8GB OS = 8 + 4.8 + 0.2 + 8 ≈ 21GB

Round up: 24-32GB instance.

9.5 Storage

DB size + bloat headroom (20%) + WAL retention (1-2 days) + backup space

For 500GB DB: 500 × 1.2 + 50 (WAL) + 500 (backup local) = ~1.2TB.

9.6 Forecasting

-- Growth via pg_stat tables daily snapshot
CREATE TABLE db_size_history (
    captured_at timestamptz DEFAULT now(),
    db_size_bytes bigint,
    total_relation_bytes bigint
);
 
INSERT INTO db_size_history (db_size_bytes, total_relation_bytes)
SELECT pg_database_size(current_database()),
       sum(pg_total_relation_size(oid)) FROM pg_class WHERE relkind='r';

Cron daily. Plot linear regression. Alert when projected > available.

10. Commercial Tools 2024-2026

10.1 pganalyze

https://pganalyze.com

Features:

Auto-collected stats
Index advisor (suggests indexes based on workload)
Query insights
Schema advisor

Cost: $200 -$ 2000/month.

10.2 Datadog Database Monitoring

Integrated with Datadog APM. Strong if already using Datadog.

10.3 AWS Performance Insights

Free for RDS. Limited customization but solid baseline.

10.4 Azure SQL Insights, GCP SQL Insights

Cloud-native equivalents.

10.5 Self-host vs SaaS

Option	Best for
Prometheus + Grafana	DIY-friendly, small team
pgwatch2	Self-host but managed UI
pganalyze	Advanced advisor needed
Datadog DBM	Already on Datadog
RDS PI	On AWS RDS, basic OK

11. Production Patterns

11.1 Weekly review checklist

## DB Weekly Review
 
### Performance
- [ ] Top 10 slow queries (pg_stat_statements) — any new?
- [ ] Query P99 trend — increasing?
- [ ] Cache hit ratio — <99%? Why?
 
### Health
- [ ] Replication lag P99 — <5s?
- [ ] Bloat ratio — any table >30%?
- [ ] Autovacuum keeping up — last_autovacuum < 1 day for hot tables?
- [ ] XID wraparound distance — >1B safe?
 
### Capacity
- [ ] Disk usage trend — projected full date?
- [ ] Connection peak — within budget?
- [ ] IOPS peak — within instance limits?
 
### Operations
- [ ] Backup last successful — <24h?
- [ ] Restore drill — last quarter?
- [ ] Schema changes pending — any blocked?

11.2 Runbooks

Document common scenarios:

## Runbook: Postgres High CPU
 
### Detection
- Alert: pg_cpu_usage > 80% for 10m
 
### Diagnose
1. Connect to DB
2. Run top 10 active queries
3. Run pg_stat_statements top 10 by total time last 1h
4. Check connection count
 
### Mitigation
- Kill runaway query: pg_terminate_backend(pid)
- Throttle ingress (rate limit at LB/API gateway)
- Scale up CPU (cloud autoscale or manual)
 
### Resolution
- Add index for slow query
- Rewrite query
- Tune planner stats

12. Lab — 7 days

12.1 Day 1: pg_stat_statements

-- Enable, generate workload (pgbench)
pgbench -c 10 -j 4 -T 60 testdb
 
-- Analyze
SELECT * FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;

12.2 Day 2: Autovacuum tuning

-- Create high-write table, disable autovacuum
CREATE TABLE hot_log AS SELECT i, md5(i::text) FROM generate_series(1,10000) i;
ALTER TABLE hot_log SET (autovacuum_enabled = false);
 
-- Generate updates
DO $$ FOR i IN 1..10 LOOP UPDATE hot_log SET md5 = md5||'x'; END LOOP; END $$;
-- Check bloat
SELECT n_dead_tup, n_live_tup FROM pg_stat_user_tables WHERE relname='hot_log';
 
-- Tune
ALTER TABLE hot_log SET (
    autovacuum_enabled = true,
    autovacuum_vacuum_scale_factor = 0.05,
    autovacuum_vacuum_cost_delay = 0
);
-- Wait, check

12.3 Day 3: Slow query log + pgBadger

Setup log_min_duration_statement = 100, run mixed workload, parse with pgBadger.

12.4 Day 4: Prometheus + Grafana

Docker compose: postgres + postgres_exporter + prometheus + grafana. Import community dashboard.

12.5 Day 5: Capacity planning

Insert 1M rows daily simulated, track pg_database_size. Plot, forecast.

12.6 Day 6: pg_stat_io (PG16+)

SELECT backend_type, sum(reads), sum(writes) FROM pg_stat_io GROUP BY backend_type;
-- Run various workloads, compare

12.7 Day 7: Build runbook

Pick scenario (e.g., replica lag spike), document:

How to detect
How to diagnose
3 mitigations
Long-term fix

13. Self-check

pg_stat_statements — 3 top queries bạn check?
USE method cho DB — define cho 3 resource.
Autovacuum default scale_factor 0.2 — sao tệ cho hot table?
xmin horizon block vacuum — diagnose thế nào?
HOT update — là gì, vì sao tốt?
Cache hit ratio target? Khi nào thấp = problem?
pg_stat_io context bulkread — khi nào appear?
log_min_duration_statement vs auto_explain — khác nhau, khi nào dùng?
Capacity plan storage — 4 components?
Top 5 metric Grafana dashboard bạn build?

14. Tiếp theo

Phase 2 hoàn thành. Bạn đã có Operations skills. Bài tiếp: Tuan-10-Redis-Mastery — bước sang NoSQL.

Tuần 09 hoàn thành. You cannot improve what you cannot measure. Cập nhật: 2026-05-16

lthieu's notes

Explorer

Tuan-09-DB-Observability-Tuning