Tuần 01 — Database Internals Refresh

“Bạn không thể debug cái mà bạn không hiểu. Postgres không phải hộp đen — nó là 30 năm engineering decisions, mỗi cái đều có lý do, và 90% bug production xuất phát từ việc dev không hiểu một trong những decisions đó.”

Tags: database postgresql internals storage-engine mvcc wal Học viên: Hieu (Fullstack 2+ năm → Senior với chiều sâu DB) Thời lượng: 7 ngày (4-6h/ngày) Liên quan: Tuan-Foundations-Database-Internals (concept layer) · Tuan-02-Schema-Design-Normalization · Tuan-05-Transactions-Locking · Tuan-09-DB-Observability-Tuning

1. Context & Why

1.1 Khóa Foundations đã dạy gì, và tuần này thêm gì?

Tuan-Foundations-Database-Internals trong khóa System Design dạy concept layer: B-tree vs LSM nói chung, MVCC khái niệm, columnar storage là gì. Đó là kiến thức để thiết kế hệ thống.

Tuần 01 này dạy engineering layer: cụ thể Postgres lưu byte trên disk như thế nào, vì sao shared_buffers = 25% RAM là rule-of-thumb sai trên server >32GB, vì sao xid wraparound từng làm sập production Sentry, làm thế nào để đọc một row trong pg_class và biết được nó đang ở page nào, offset nào.

Khác biệt:

Khía cạnh	Foundations (SD course)	Tuần 01 (DB course)
Mục tiêu	Hiểu để chọn DB	Hiểu để debug + tune
Code	Pseudocode	SQL thật + `pg_*` system views
Lab	Không có	Docker compose + thí nghiệm reproduce
Scope	B-tree vs LSM vs Columnar	Postgres-first, so sánh khác khi cần

1.2 Tại sao Postgres-first?

Trong khóa này, Postgres là DB reference vì:

Open source thật sự — đọc source code được, không bị vendor lock-in tài liệu
Tài liệu chính thức tốt nhất ngành — postgresql.org/docs
Internals được document hóa chi tiết qua “The Internals of PostgreSQL” (Hironobu Suzuki)
Phổ biến 2024-2026 — RDS, Aurora, Supabase, Neon, CrunchyData… đều build trên Postgres
Concept transfer được — hiểu Postgres MVCC sẽ hiểu Oracle, hiểu InnoDB nhanh hơn

Khi nào nói về MySQL/InnoDB: ở Tuan-Bonus-MySQL-Internals hoặc khi có khác biệt quan trọng (vd: clustered index, gap lock).

1.3 Mục tiêu kết thúc tuần

Bạn sẽ:

Vẽ được page layout của Postgres (header + line pointer + tuple)
Giải thích WAL flow: từ INSERT trong client → byte ghi vào WAL file → page bị modified → checkpoint flush
Tự reproduce bloat trên một bảng, đo bằng pg_stat_user_tables, fix bằng VACUUM
Biết xmin, xmax, ctid là gì và dùng chúng để debug visibility
Hiểu vì sao shared_buffers không nên >40% RAM dù máy có nhiều RAM
Tự config được Postgres cluster local cho lab, có monitoring cơ bản

1.4 Tham chiếu chính

The Internals of PostgreSQL — Hironobu Suzuki — https://www.interdb.jp/pg/
PostgreSQL Documentation — Chapter 65 (Internals) — https://www.postgresql.org/docs/current/internals.html
Database Internals — Alex Petrov (2019) — Chương 1-5
PostgreSQL 14 Internals — Egor Rogov (free PDF) — https://postgrespro.com/community/books/internals
Bruce Momjian’s slides — https://momjian.us/main/presentations/internals.html
Use The Index, Luke! — Markus Winand — https://use-the-index-luke.com/

2. Storage Layer — Postgres lưu byte như thế nào

2.1 Hierarchy: Cluster → Database → Tablespace → Object

PostgreSQL Cluster (initdb tạo)
├── pg_default tablespace
│   ├── Database "appdb" (oid=16384)
│   │   ├── Table "users" (oid=16385)
│   │   │   ├── Main fork:   16385       (heap data)
│   │   │   ├── FSM fork:    16385_fsm   (free space map)
│   │   │   ├── VM fork:     16385_vm    (visibility map)
│   │   │   └── Init fork:   16385_init  (unlogged tables only)
│   │   └── Index "users_pkey" (oid=16386)
│   │       └── Main fork:   16386
│   └── ...
└── pg_global tablespace (shared catalogs)

Forks là gì: Mỗi table có nhiều “fork” — file riêng cho data, free space map, visibility map. Đây là lý do bạn thấy nhiều file số ở $PGDATA/base/<dboid>/.

-- Xem oid và path file
SELECT
    c.oid,
    c.relname,
    pg_relation_filepath(c.oid) AS filepath,
    pg_size_pretty(pg_relation_size(c.oid, 'main')) AS main_size,
    pg_size_pretty(pg_relation_size(c.oid, 'fsm'))  AS fsm_size,
    pg_size_pretty(pg_relation_size(c.oid, 'vm'))   AS vm_size
FROM pg_class c
WHERE c.relname = 'users';

2.2 Page (Block) — đơn vị I/O cơ bản

Mọi I/O của Postgres đều theo page 8KB (compile-time constant BLCKSZ, có thể đổi nhưng không nên). Đây là điểm xuất phát của mọi performance reasoning.

Page layout (8192 bytes):

┌──────────────────────────────────────────┐ offset 0
│ PageHeaderData (24 bytes)                │
│ ├─ pd_lsn (LSN của WAL record cuối)      │
│ ├─ pd_checksum                           │
│ ├─ pd_flags                              │
│ ├─ pd_lower (offset cuối line pointer)   │
│ ├─ pd_upper (offset đầu tuple data)      │
│ ├─ pd_special                            │
│ ├─ pd_pagesize_version                   │
│ └─ pd_prune_xid                          │
├──────────────────────────────────────────┤
│ Line Pointer 1 (4 bytes: offset, length) │
│ Line Pointer 2                           │
│ Line Pointer 3                           │
│ ...                                       │
├──────────────────────────────────────────┤ pd_lower
│ (free space — grows from both ends)      │
├──────────────────────────────────────────┤ pd_upper
│ ...                                       │
│ Tuple 3                                   │
│ Tuple 2                                   │
│ Tuple 1                                   │
├──────────────────────────────────────────┤ pd_special
│ Special space (index only)               │
└──────────────────────────────────────────┘ offset 8192

Insight quan trọng:

Line pointer (ItemId) grow từ trên xuống, tuple grow từ dưới lên → meet in middle
Khi update, nếu tuple mới fit trong page cũ → HOT update; nếu không → tuple mới ở page khác, line pointer của tuple cũ thành “redirect”
pd_lsn cho phép Postgres biết: nếu crash, page này phải replay WAL từ LSN nào trở lên

2.3 Tuple — một row trong page

Tuple layout:

┌─────────────────────────────────┐
│ HeapTupleHeaderData (23 bytes)  │
│ ├─ t_xmin (xid inserted)         │
│ ├─ t_xmax (xid deleted, 0 nếu live) │
│ ├─ t_cid (command id)            │
│ ├─ t_ctid (self pointer or next)│
│ ├─ t_infomask                    │
│ ├─ t_infomask2                   │
│ └─ t_hoff (offset to data)       │
├─────────────────────────────────┤
│ NULL bitmap (nếu có column NULL)│
├─────────────────────────────────┤
│ User data (column values)        │
└─────────────────────────────────┘

Overhead tối thiểu cho một row: 23 bytes header + 4 bytes line pointer = 27 bytes.

Có nghĩa: một bảng (id int) với 1 row chiếm bao nhiêu byte trên disk? Không phải 4 byte. Là ~31 byte (27 overhead + 4 data + padding). Hiểu điều này → biết vì sao “small table” không bao giờ nhỏ như bạn nghĩ.

Thí nghiệm:

CREATE TABLE t_overhead (id int);
INSERT INTO t_overhead SELECT generate_series(1, 1);
SELECT pg_relation_size('t_overhead');
-- Returns: 8192 (1 page minimum)
 
INSERT INTO t_overhead SELECT generate_series(2, 1000);
SELECT pg_relation_size('t_overhead');
-- Roughly 40-48 KB for 1000 rows (~40 bytes/row including overhead)

2.4 TOAST — The Oversized-Attribute Storage Technique

Tuple không được vượt quá 1 page (8KB). Vậy lưu TEXT 1MB thế nào?

TOAST = mechanism để store large value out-of-line:

┌─ Main table row ────────┐
│ id, name, content_ref   │
└────┬────────────────────┘
     │ (TOAST pointer 18 bytes)
     ▼
┌─ pg_toast_16385 ────────┐
│ chunk_id, chunk_seq,    │
│ chunk_data (2KB each)   │
└─────────────────────────┘

4 strategies cho column lớn:

PLAIN — không nén, không TOAST (fixed-length types)
EXTENDED — default cho TEXT/JSONB/BYTEA — nén + TOAST nếu cần
EXTERNAL — TOAST nhưng không nén (tốt cho substring queries)
MAIN — nén nhưng cố giữ inline

ALTER TABLE articles ALTER COLUMN content SET STORAGE EXTERNAL;
-- Useful nếu bạn hay query substring(content, ...) — không nén → đọc partial nhanh

Production gotcha: Bảng có nhiều JSONB lớn → table size show nhỏ nhưng pg_total_relation_size() lớn → TOAST table chiếm phần lớn. Đừng quên check toast khi debug disk usage:

SELECT
    pg_size_pretty(pg_relation_size('articles'))           AS main,
    pg_size_pretty(pg_total_relation_size('articles'))     AS total,
    pg_size_pretty(pg_total_relation_size('articles')
                  - pg_relation_size('articles'))          AS toast_plus_indexes;

2.5 Free Space Map (FSM) & Visibility Map (VM)

FSM (_fsm fork): B-tree-like structure tracking page nào còn free space, để insert biết đặt tuple mới vào đâu mà không scan hết.

VM (_vm fork): 2 bits/page:

all-visible — mọi tuple trong page này visible to all transactions → có thể skip MVCC check → enable index-only scan
all-frozen — mọi tuple đã frozen → skip khi VACUUM FREEZE → quan trọng cho xid wraparound prevention

Insight: Index-only scan chỉ hoạt động khi VM bit set. Nếu bạn VACUUM không thường xuyên → VM stale → query tưởng dùng index-only scan nhưng vẫn fetch heap → chậm. Đây là 1 root cause của “tôi thêm index rồi mà vẫn chậm”.

-- Check VM coverage
SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    100 * pg_stat_get_live_tuples(c.oid)::float
        / NULLIF(pg_stat_get_tuples_returned(c.oid), 0) AS visibility_pct
FROM pg_stat_user_tables s
JOIN pg_class c ON s.relid = c.oid
WHERE relname = 'your_table';

3. WAL — Write-Ahead Log Deep Dive

3.1 Tại sao WAL?

Vấn đề: DB transaction phải durable (D trong ACID). Cách naive: fsync mọi page bị modify. Nhưng:

1 transaction modify 100 pages → 100 fsync → cực chậm
fsync là phép ghi ngẫu nhiên (random write) → HDD chết, SSD cũng tệ

Giải pháp: Write-Ahead Logging — ghi log của thay đổi trước khi modify data file.

Log là sequential append → nhanh
Chỉ fsync log, không fsync data pages
Khi crash, replay log để recover

3.2 WAL structure

$PGDATA/pg_wal/
├── 000000010000000000000001    (16MB segment)
├── 000000010000000000000002
├── 000000010000000000000003
└── archive_status/

Mỗi segment 16MB (default), tên = <TimelineId><Hi32 LSN><Lo32 LSN>.

LSN (Log Sequence Number) — 64-bit pointer vào WAL. Mọi page có pd_lsn = LSN của WAL record cuối modify nó. Recovery so sánh pd_lsn với current LSN để biết phải replay tới đâu.

SELECT pg_current_wal_lsn();
-- 0/3A1B2C40 (hex: high 32 / low 32)
 
SELECT pg_walfile_name(pg_current_wal_lsn());
-- 00000001000000000000003A

3.3 WAL flow chi tiết

Client: INSERT INTO users VALUES (...)
        │
        ▼
┌─────────────────────────────────┐
│ 1. Lock page (buffer)           │
│ 2. Modify tuple in buffer       │
│ 3. Generate WAL record          │
│ 4. Copy WAL record → WAL buffer │
│ 5. Mark page DIRTY              │
│ 6. Unlock                       │
└─────────────────────────────────┘
        │
        ▼ COMMIT
┌─────────────────────────────────┐
│ 7. Flush WAL buffer → WAL file  │
│ 8. fsync(WAL file)              │  ← durability guarantee
│ 9. Send "COMMIT OK" to client   │
└─────────────────────────────────┘
        │
        ▼ ASYNC (background)
┌─────────────────────────────────┐
│ Background writer / checkpoint:  │
│ - Flush dirty pages to data file │
│ - fsync(data file)               │
└─────────────────────────────────┘

Key insight: Khi client nhận “COMMIT OK”, data page chưa ghi xuống disk. Chỉ WAL được fsync. Nếu crash ngay sau commit → recovery sẽ replay WAL → state đúng.

3.4 Checkpoint — “marker phòng khi crash”

Checkpoint = “tại điểm này, mọi dirty page đã được flush xuống disk”. Khi crash, recovery chỉ cần replay WAL từ checkpoint cuối, không phải từ đầu.

Khi nào trigger checkpoint:

checkpoint_timeout (default 5 phút)
WAL size vượt max_wal_size (default 1GB Postgres 14+)
Manual: CHECKPOINT;

Trade-off:

Checkpoint thường xuyên → recovery nhanh nhưng I/O spike thường xuyên
Checkpoint thưa → recovery chậm nhưng I/O smoother

Production setting điển hình:

checkpoint_timeout = 15min
max_wal_size = 8GB
checkpoint_completion_target = 0.9   # Spread I/O across 90% of interval

3.5 fsync, full_page_writes, và partial-write hell

Vấn đề partial write: Khi OS đang ghi 8KB page xuống disk, mất điện → 4KB ghi thành, 4KB không. Page hỏng → corrupt → DB không recover được dù có WAL.

Giải pháp full_page_writes = on (default): Lần đầu page bị modify sau checkpoint, ghi full page image vào WAL. Recovery thấy WAL có full page → khôi phục page nguyên vẹn.

Trade-off: WAL volume to hơn (vì có full image). Nhưng không tắt option này trên production. Modern filesystems (ZFS, btrfs) có atomic write thì có thể tắt, nhưng cực hiếm trường hợp dùng.

synchronous_commit:

on (default) — fsync WAL trước khi return commit
off — return commit ngay, fsync sau (mất tối đa wal_writer_delay data nếu crash)
local — fsync local, không chờ replica
remote_write, remote_apply — chờ replica (HA setup)

Khi nào tắt synchronous_commit: Workload OK với mất 200ms data cuối khi crash (vd: logging, metrics, analytics). Đừng tắt với payment, order, user data.

-- Per-transaction
BEGIN;
SET LOCAL synchronous_commit = off;
INSERT INTO event_log ...;
COMMIT;

3.6 WAL archiving (cho PITR, sẽ học sâu Tuần 07)

archive_mode = on
archive_command = 'aws s3 cp %p s3://backup/wal/%f'

Mỗi WAL segment đầy sẽ được copy ra S3. Kết hợp với base backup → có thể restore tại bất kỳ điểm thời gian nào (Point-In-Time Recovery).

4. Buffer Pool — RAM management

4.1 shared_buffers

Shared buffer pool = RAM Postgres dành để cache pages. Mọi page đọc/ghi đều qua đây.

Architecture:

┌─────────────────────────────────────┐
│ shared_buffers (e.g. 8GB)            │
│ ┌────────────────────────────────┐  │
│ │ Buffer descriptor array        │  │
│ │ [buf_id, tag, refcount, flags] │  │
│ └────────────────────────────────┘  │
│ ┌────────────────────────────────┐  │
│ │ Buffer blocks (actual 8KB pages)│ │
│ └────────────────────────────────┘  │
└─────────────────────────────────────┘

Cấp phát: Khi process cần page X:

Tính hash của (rel, fork, blocknum) → tìm trong buffer table
Nếu hit → tăng refcount → return
Nếu miss → tìm buffer free (clock-sweep algorithm) → đọc từ disk → return

Clock-sweep: Không phải LRU thực sự (LRU strict cần lock toàn bộ). Mỗi buffer có “usage count” 0-5. Background process quét vòng tròn, giảm count. Khi count=0 và refcount=0 → có thể evict.

4.2 Vì sao shared_buffers không nên = 100% RAM?

Postgres dựa vào OS page cache như layer cache thứ 2:

┌─────────────────┐
│ shared_buffers  │  ← 25-40% RAM
└────────┬────────┘
         │ miss
         ▼
┌─────────────────┐
│ OS page cache   │  ← phần RAM còn lại
└────────┬────────┘
         │ miss
         ▼
┌─────────────────┐
│ Disk            │
└─────────────────┘

Nếu set shared_buffers = 100% RAM → không còn RAM cho OS cache → mọi miss của shared_buffers = đọc disk thật. Tệ hơn.

Rule of thumb hiện đại (2024-2026):

< 32GB RAM: shared_buffers = 25% RAM
32GB - 128GB RAM: shared_buffers = 25-35% RAM
128GB RAM: shared_buffers = 20-30% RAM (cận trên giảm vì OS cache vẫn quan trọng cho large datasets)
Cloud (RDS, Aurora): theo recommendation của vendor, thường thấp hơn

4.3 Đo hit ratio

SELECT
    datname,
    blks_hit,
    blks_read,
    round(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 2) AS cache_hit_pct
FROM pg_stat_database
WHERE datname = current_database();

Target: >99% cho OLTP. <95% → cần tăng shared_buffers hoặc thêm RAM.

Nhưng cẩn thận: hit ratio cao không có nghĩa fast. Nếu working set tự nhiên nhỏ → hit ratio cao dễ dàng. Đừng làm metric duy nhất.

4.4 Per-process memory: work_mem & maintenance_work_mem

shared_buffers — toàn bộ Postgres dùng chung
work_mem — per-operation (sort, hash join). Default 4MB.
maintenance_work_mem — cho VACUUM, CREATE INDEX. Default 64MB.

Trap kinh điển: work_mem là per-operation, không phải per-query, không phải per-connection. Một query có 5 sort + 2 hash → 7 × work_mem. 100 connections → tối đa 700 × work_mem tổng RAM dùng.

Công thức an toàn:

total_work_mem_max = work_mem × max_connections × avg_ops_per_query

Với max_connections = 200, work_mem = 64MB, avg_ops = 3:

Tổng tối đa: 200 × 64MB × 3 = 38GB

Trên máy 32GB RAM, set work_mem = 64MB với 200 connections là bom hẹn giờ OOM.

Pattern thực tế: work_mem thấp toàn cục, override per-session khi cần:

-- Default global: work_mem = 8MB
-- Session cần sort lớn:
SET LOCAL work_mem = '256MB';
SELECT ... ORDER BY ...;

5. MVCC — Multi-Version Concurrency Control

5.1 Vì sao MVCC?

Vấn đề cũ: Reader và writer block lẫn nhau.

Reader giữ shared lock → writer phải chờ
Writer giữ exclusive lock → reader phải chờ

Giải pháp MVCC: Mỗi UPDATE/DELETE không xóa row cũ, tạo version mới. Reader thấy version “đúng” với transaction của mình. Reader không bao giờ block writer, writer không bao giờ block reader.

5.2 Cách Postgres làm MVCC

Mỗi tuple có 2 system column:

xmin — Transaction ID đã INSERT tuple này
xmax — Transaction ID đã DELETE/UPDATE tuple này (0 nếu vẫn live)

Khi UPDATE:

Tuple cũ: set xmax = current_xid
Tuple mới: tạo trong page (có thể cùng page = HOT update, hoặc page khác)
Tuple cũ trỏ tới tuple mới qua t_ctid

Khi DELETE: chỉ set xmax, không xóa thật.

-- Xem hidden columns
SELECT xmin, xmax, ctid, * FROM users LIMIT 3;
--   xmin  | xmax |  ctid  | id | name
-- --------+------+--------+----+-------
--    742  |  0   | (0,1)  | 1  | Alice
--    743  |  745 | (0,2)  | 2  | Bob (deleted by tx 745, still on disk)
--    744  |  0   | (0,3)  | 3  | Carol

5.3 Visibility rules — ai thấy ai

Một transaction T với snapshot S thấy tuple nếu:

xmin < S.xmin_horizon AND xmin đã committed
AND (xmax = 0 OR xmax chưa committed OR xmax > S.xmax_horizon)

Đơn giản: “tuple được tạo trước snapshot và chưa bị xóa tại thời điểm snapshot”.

Insight: Nếu transaction T2 update row, transaction T1 (bắt đầu trước T2) vẫn thấy version cũ. Đây là cơ sở của REPEATABLE READ và SERIALIZABLE isolation.

5.4 Bloat — cái giá của MVCC

Tuple cũ không bị xóa ngay → table grow theo write rate. Đó là bloat.

Ai dọn? → VACUUM (autovacuum chạy nền).

VACUUM:

Đánh dấu dead tuple là removable (line pointer thành LP_UNUSED)
Update FSM, VM
Không trả disk space lại OS (vẫn giữ trong file để reuse cho INSERT mới)

VACUUM FULL:

Rewrite toàn bộ table → trả disk lại OS
Lock exclusive → block mọi query → không dùng trên production
Thay bằng pg_repack (online) nếu cần

5.5 Đo bloat

-- Quick estimate from pg_stat_user_tables
SELECT
    schemaname, relname,
    n_live_tup, n_dead_tup,
    round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 2) AS dead_pct,
    last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY dead_pct DESC NULLS LAST
LIMIT 20;

Threshold:

dead_pct < 10% — healthy
10-30% — bình thường giữa các VACUUM cycles
> 30% — autovacuum không kịp, cần tune

Tool chính xác hơn: pgstattuple extension hoặc query của Igor Neyman (https://wiki.postgresql.org/wiki/Show_database_bloat).

5.6 Transaction ID (XID) wraparound — vấn đề kinh điển

XID là 32-bit. Sau ~4 tỷ transactions, wraparound. Nếu không xử lý → DB stops to prevent corruption.

Cơ chế phòng ngừa: VACUUM FREEZE đánh dấu old tuples là “frozen” (special XID) → wraparound không ảnh hưởng.

autovacuum_freeze_max_age (default 200M) — khi table age đến threshold này, forced autovacuum chạy bất kể settings khác.

Production incident kinh điển — Sentry 2015:

Forced autovacuum chạy đột ngột trên huge table → massive I/O → site down
Bài học: monitor age thường xuyên, schedule manual VACUUM FREEZE trong off-peak

-- Monitor xid age
SELECT
    datname,
    age(datfrozenxid) AS xid_age,
    2^31 - age(datfrozenxid) AS xid_until_wraparound
FROM pg_database
ORDER BY xid_age DESC;
 
-- Per table
SELECT
    schemaname || '.' || relname AS table,
    age(relfrozenxid) AS xid_age,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS size
FROM pg_class c
JOIN pg_namespace n ON c.relnamespace = n.oid
WHERE relkind = 'r' AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY age(relfrozenxid) DESC
LIMIT 20;

Action item: Set up alert khi xid_age > 1B (50% threshold).

6. Query Execution Pipeline

6.1 Từ SQL string đến result set

SQL string
    │
    ▼ ┌──────────────┐
      │ Parser        │  → parse tree
      └──────────────┘
    │
    ▼ ┌──────────────┐
      │ Analyzer      │  → query tree (semantic check, type resolve)
      └──────────────┘
    │
    ▼ ┌──────────────┐
      │ Rewriter      │  → modified tree (views, rules)
      └──────────────┘
    │
    ▼ ┌──────────────┐
      │ Planner       │  → plan tree (THE smart part)
      │   - Cost-based │
      │   - Statistics │
      └──────────────┘
    │
    ▼ ┌──────────────┐
      │ Executor      │  → tuples to client
      └──────────────┘

6.2 Planner cost model (sneak peek — Tuần 04 đào sâu)

Cost = abstract unit. Formula đơn giản:

cost = (pages_to_read × seq_page_cost)
     + (rows_to_process × cpu_tuple_cost)
     + (operators × cpu_operator_cost)

Default values:

seq_page_cost = 1.0
random_page_cost = 4.0 (HDD assumption — modern SSD nên giảm về 1.1)
cpu_tuple_cost = 0.01
cpu_operator_cost = 0.0025

SSD adjustment:

random_page_cost = 1.1   # SSD random ≈ sequential
effective_io_concurrency = 200   # SSD parallel I/O

Đặt sai → planner chọn sequential scan thay vì index scan dù index nhanh hơn 100x.

6.3 Plan node types (preview)

Seq Scan — đọc toàn bộ table
Index Scan — đọc index → fetch heap
Index Only Scan — đọc chỉ từ index (cần VM bit)
Bitmap Heap Scan — combine nhiều index, deduplicate bằng bitmap
Nested Loop / Hash Join / Merge Join — 3 join algorithms
Sort, Aggregate, Hash, Materialize — operators trung gian

Tuần 04 sẽ đào sâu, đây chỉ preview.

7. Lab — Hands-on (4-6 giờ)

7.1 Setup Docker Postgres với observability

docker-compose.yml:

version: '3.8'
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: lab
      POSTGRES_DB: labdb
    ports:
      - "5432:5432"
    volumes:
      - ./pgdata:/var/lib/postgresql/data
      - ./init.sql:/docker-entrypoint-initdb.d/init.sql
    command: >
      postgres
      -c shared_buffers=512MB
      -c work_mem=16MB
      -c maintenance_work_mem=128MB
      -c wal_level=replica
      -c log_min_duration_statement=100
      -c log_checkpoints=on
      -c log_lock_waits=on
      -c log_temp_files=0
      -c track_io_timing=on
      -c shared_preload_libraries=pg_stat_statements
 
  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: lab@local
      PGADMIN_DEFAULT_PASSWORD: lab
    ports:
      - "5050:80"

init.sql:

CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
CREATE EXTENSION IF NOT EXISTS pgstattuple;
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

7.2 Thí nghiệm 1 — Page layout

CREATE TABLE lab_page (id int, name text);
INSERT INTO lab_page SELECT i, 'name_' || i FROM generate_series(1, 100) i;
 
-- pageinspect extension (install: CREATE EXTENSION pageinspect;)
CREATE EXTENSION pageinspect;
 
SELECT * FROM page_header(get_raw_page('lab_page', 0));
-- See: lsn, lower, upper, special, pagesize
 
SELECT lp, lp_off, lp_len, t_xmin, t_xmax, t_ctid
FROM heap_page_items(get_raw_page('lab_page', 0))
LIMIT 5;

Câu hỏi cho bạn: Làm sao tìm row có id=50 nằm ở page nào, offset bao nhiêu?

SELECT id, ctid FROM lab_page WHERE id = 50;
-- ctid = (page_number, line_pointer_index)

7.3 Thí nghiệm 2 — Reproduce bloat

CREATE TABLE lab_bloat AS SELECT i, md5(i::text) FROM generate_series(1, 100000) i;
ALTER TABLE lab_bloat ADD PRIMARY KEY (i);
 
-- Disable autovacuum on this table to reproduce bloat cleanly
ALTER TABLE lab_bloat SET (autovacuum_enabled = false);
 
-- Trước update
SELECT pg_size_pretty(pg_relation_size('lab_bloat')) AS size,
       n_live_tup, n_dead_tup
FROM pg_stat_user_tables WHERE relname = 'lab_bloat';
 
-- Bloat 1: Update all rows 10 times (mỗi update tạo dead tuple)
DO $$ BEGIN FOR i IN 1..10 LOOP
    UPDATE lab_bloat SET md5 = md5(md5 || 'salt');
END LOOP; END $$;
 
-- Sau update
SELECT pg_size_pretty(pg_relation_size('lab_bloat')) AS size,
       n_live_tup, n_dead_tup,
       round(100.0 * n_dead_tup / (n_live_tup + n_dead_tup), 2) AS dead_pct
FROM pg_stat_user_tables WHERE relname = 'lab_bloat';
 
-- Bloat đo chính xác bằng pgstattuple
SELECT * FROM pgstattuple('lab_bloat');
-- tuple_count, dead_tuple_count, free_percent
 
-- Fix
VACUUM lab_bloat;
SELECT pg_size_pretty(pg_relation_size('lab_bloat')); -- không giảm! vì VACUUM không trả disk
VACUUM FULL lab_bloat;
SELECT pg_size_pretty(pg_relation_size('lab_bloat')); -- giảm rõ rệt

Bài học: VACUUM thường (autovacuum) đủ cho hầu hết trường hợp. VACUUM FULL chỉ khi bloat cực nặng và OK với downtime, hoặc dùng pg_repack/pg_squeeze (online).

7.4 Thí nghiệm 3 — MVCC visibility

Mở 2 psql session:

Session 1:

BEGIN;
SELECT * FROM lab_page WHERE id = 1;  -- name: name_1
-- DON'T COMMIT YET

Session 2:

UPDATE lab_page SET name = 'CHANGED' WHERE id = 1;
COMMIT;
SELECT name FROM lab_page WHERE id = 1;  -- name: CHANGED

Session 1 (vẫn trong transaction):

SELECT name FROM lab_page WHERE id = 1;  -- name: name_1  (vẫn thấy version cũ!)
SHOW transaction_isolation;  -- read committed (default)
 
-- Wait, default là READ COMMITTED, sao vẫn thấy cũ?
-- → Vì statement của Session 1 ABOVE chạy trước Session 2 commit.
-- Run lại:
SELECT name FROM lab_page WHERE id = 1;  -- name: CHANGED (giờ thấy mới)
COMMIT;

Lặp lại với REPEATABLE READ:

BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT name FROM lab_page WHERE id = 1;
-- Session 2 update + commit lại
SELECT name FROM lab_page WHERE id = 1;  -- vẫn name cũ
COMMIT;
SELECT name FROM lab_page WHERE id = 1;  -- giờ mới

7.5 Thí nghiệm 4 — WAL volume

SELECT pg_current_wal_lsn() AS lsn_before;
-- 0/AB1234
 
CREATE TABLE lab_wal AS SELECT i, repeat('x', 1000) FROM generate_series(1, 100000) i;
 
SELECT pg_current_wal_lsn() AS lsn_after,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), '0/AB1234'::pg_lsn)
       ) AS wal_generated;
-- wal_generated: ~100 MB
 
-- So sánh COPY vs INSERT (COPY ít WAL hơn nếu bulk + table empty)
TRUNCATE lab_wal;
SELECT pg_current_wal_lsn() AS lsn_before;
COPY lab_wal FROM PROGRAM 'seq 1 100000 | awk ''{print $1 "\t" "yyyyy"}''';
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), '<previous>')) AS wal;

7.6 Thí nghiệm 5 — Buffer cache inspection

SELECT
    c.relname,
    count(*) AS buffers,
    pg_size_pretty(count(*) * 8192) AS size_in_cache,
    round(100.0 * count(*) / (SELECT setting::int FROM pg_settings WHERE name='shared_buffers'), 2) AS pct_of_buffer
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
WHERE b.reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database())
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;

Bạn sẽ thấy: hot tables (vd: catalog pg_class) chiếm cache nhiều. Sau VACUUM hoặc large query, cache pattern thay đổi.

8. Production Gotchas — Bài học từ máu

8.1 “Tôi VACUUM rồi sao table vẫn lớn?”

→ VACUUM (không FULL) không trả disk space lại OS. Nó chỉ mark removable cho reuse. Để recover space:

VACUUM FULL (block writes/reads — đừng làm trên production)
pg_repack (online, requires extension)
Đợi bảng grow tự nhiên đến mức cũ thì OK (space sẽ reuse)

8.2 “Autovacuum chạy mãi không hết bloat”

Triệu chứng: n_dead_tup tăng nhanh hơn autovacuum xóa.

Nguyên nhân thường gặp:

autovacuum_vacuum_scale_factor quá lớn (default 0.2 = 20% bloat mới trigger)
autovacuum_max_workers quá ít (default 3)
autovacuum_vacuum_cost_limit quá thấp (default 200) → vacuum throttled
Long-running transaction giữ xmin horizon → vacuum không thể remove dead tuples mới hơn xid đó

Diagnose:

SELECT pid, xact_start, age(backend_xid), age(backend_xmin), state, query
FROM pg_stat_activity
WHERE backend_xid IS NOT NULL OR backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC NULLS LAST;

Nếu có transaction chạy hàng giờ với old backend_xmin → đó là thủ phạm. Kill nó hoặc fix code.

Tune cho high-write table:

ALTER TABLE busy_table SET (
    autovacuum_vacuum_scale_factor = 0.05,   -- trigger sớm hơn (5%)
    autovacuum_vacuum_cost_limit = 2000,     -- vacuum nhanh hơn
    autovacuum_vacuum_cost_delay = 0         -- không throttle
);

8.3 “Replica lag tăng đột ngột”

Replica replay WAL sequentially trên 1 process. Nếu primary có:

Bulk operation (large UPDATE/DELETE) → WAL volume cao → replica replay không kịp
Conflict với queries đang chạy trên replica → wait

Diagnose trên replica:

SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;
 
-- pg_stat_database_conflicts
SELECT * FROM pg_stat_database_conflicts WHERE datname = current_database();

Solutions:

hot_standby_feedback = on (replica báo primary “tôi đang đọc xid X, đừng vacuum sớm hơn”) — tradeoff: bloat primary
max_standby_streaming_delay = 30s — give replica queries 30s grace
Avoid bulk DML on primary (batch + sleep)

8.4 “fsync error → DB stops”

Postgres < 12 có bug: nếu OS fsync trả lỗi (vd: disk full), Postgres không retry → data có thể lost mà process không biết. Postgres 12+ panic và stop để protect.

→ Always monitor disk space. Set alert ở 80% used.

8.5 Connection storm sau restart

Sau khi restart Postgres, app reconnect đồng loạt → connection storm. Mỗi connection = ~10MB RAM + ~1ms setup. 500 connections cùng lúc → 5GB RAM spike + cold cache.

→ Dùng connection pooler (PgBouncer) ngay từ đầu, không phải khi vấn đề xuất hiện. Sẽ học sâu Tuần 06.

9. Anti-patterns

Anti-pattern	Tại sao tệ	Thay bằng
`shared_buffers = 75% RAM`	OS cache bị bóp → miss cost cao hơn	25-35% RAM
`work_mem = 256MB` toàn cục	OOM khi concurrent queries	`work_mem = 8-16MB` global, override per session
Tắt `full_page_writes`	Risk corrupt khi crash giữa write	Để on (mặc định)
`VACUUM FULL` định kỳ	Block toàn bộ table	`pg_repack` hoặc tune autovacuum
Bỏ qua xid age	Wraparound → DB stop	Monitor `age(datfrozenxid)`, alert ở 1B
Long transaction `BEGIN; ... <hours> ... COMMIT;`	Block vacuum, replica feedback	Batch nhỏ + commit
Không pooler	Connection storm, OOM	PgBouncer/RDS Proxy từ ngày 1
`synchronous_commit = off` cho payment	Mất data khi crash	Để on cho critical, off cho logs

10. Self-check — Bạn đã hiểu chưa?

Sau tuần này, bạn phải trả lời được những câu sau không cần Google:

Postgres lưu một row 50-byte cần bao nhiêu byte trên disk (gồm overhead)?
Vẽ flow từ INSERT của client đến lúc client nhận “COMMIT OK”. Khi nào fsync? Cái gì được fsync?
Vì sao VACUUM FULL block production?
xmin, xmax, ctid là gì? Cho ví dụ truy vấn để xem chúng.
shared_buffers và work_mem khác nhau ra sao? Setting nào dangerous khi đặt cao?
Index-only scan cần điều kiện gì? Tại sao đôi khi không hoạt động dù index covering?
Bạn thấy n_dead_tup 30% và đang tăng. Bước đầu tiên debug là gì?
XID wraparound là gì? Postgres làm gì để ngăn?
Tại sao random_page_cost = 4.0 là default sai trên modern SSD?
WAL khác với data file thế nào? Cái nào được fsync trước khi return commit?

Nếu trả lời <7/10 → đọc lại + làm lab. Nếu >=8/10 → sẵn sàng Tuần 02.

11. Tiếp theo & cross-links

Bài tiếp theo: Tuan-02-Schema-Design-Normalization — sau khi hiểu storage, bắt đầu thiết kế schema tốt.

Đọc song song:

Tuan-Foundations-Database-Internals — refresh concept (B-tree vs LSM)
Tuan-Foundations-OS-Essentials — fsync, page cache, mmap

Trong khóa SD: Tuan-07-Database-Sharding-Replication — sau khi hiểu single-node, lên distributed.

Tuần 01 hoàn thành. Built on bytes, not vibes.

Cập nhật: 2026-05-16

lthieu's notes

Explorer

Tuan-01-DB-Internals-Refresh