Foundations: Database Internals
“Backend dev: ‘Postgres slow, add index’. Senior: ‘Slow because the index doesn’t fit in memory’. Architect: ‘B-tree vs LSM-tree trade-off, columnar vs row, WAL fsync rate, MVCC bloat — those are the 4 root causes of 90% of performance issues’. Understanding DB internals = understanding why Cassandra is optimized for writes, why ClickHouse is 100x for analytics, why Postgres VACUUM kills production when you forget it.”
Tags: cs-foundations database storage-engine fundamentals Student: Hieu (Backend Dev → Architect) Related: Tuan-07-Database-Sharding-Replication · Tuan-Bonus-Consistency-Models-Isolation · Case-Design-Modern-Data-Lakehouse · Tuan-Foundations-OS-Essentials
1. Context & Why
Why do you need to understand Database Internals?
| Production problem | DB internals concept |
|---|---|
| Postgres VACUUM kill performance | MVCC bloat, dead tuples |
| Cassandra write fast, read slow | LSM-tree compaction |
| Index thêm nhưng query vẫn slow | Buffer pool hit rate, query plan |
| Iceberg time travel | Append-only manifest log |
| Aurora 5x faster than RDS | Disaggregated storage, log-is-DB |
| ClickHouse 100x analytics | Columnar storage, vectorized exec |
| Postgres connection limit | Process-per-connection |
| Replica lag spike | WAL apply throughput |
Key insight: Every database = a collection of trade-off decisions across 4 layers:
- Storage layer (B-tree vs LSM)
- Transaction layer (MVCC vs locks)
- Query layer (optimizer, execution)
- Replication layer (sync vs async, primary vs multi-primary)
Primary references
- Database Internals (Alex Petrov, 2019) — the bible for this topic
- Designing Data-Intensive Applications Ch.3 (Storage), Ch.7 (Transactions) — Kleppmann
- Readings in Database Systems (Red Book) — http://www.redbook.io/
- CMU 15-445 Database Systems (Andy Pavlo) — https://15445.courses.cs.cmu.edu/
- Use The Index, Luke! (free) — Markus Winand — https://use-the-index-luke.com/
2. Deep Dive — Storage Engines
2.1 The two families: B-tree vs LSM-tree
Every DB picks one of the two (or a hybrid):
| | B-tree | LSM-tree |
|---|---|---|
| Used by | PostgreSQL, MySQL InnoDB, SQLite, SQL Server | Cassandra, RocksDB, LevelDB, ScyllaDB, HBase |
| Write | In-place (random I/O) | Append-only (sequential I/O) |
| Read | O(log N) lookups, mostly 1 disk I/O | Multiple SST files merged |
| Write throughput | Lower (random I/O) | Higher (sequential I/O) |
| Read throughput | Higher | Lower (worse cache, multiple files) |
| Space amplification | Low | Higher (multiple versions, compaction debt) |
| Write amplification | Lower | Higher (compaction rewrites) |
| Best for | Read-heavy, OLTP | Write-heavy, time-series |
2.2 B-tree Deep Dive
2.2.1 Structure
```
                [50]
               /    \
        [20, 35]    [70, 85]
       /   |   \    /   |   \
    [10] [25] [40][55] [75] [90]
    data data data data data data
```
- Balanced: All leaves at same depth
- Page-based: Each node = 1 page (4-16 KB)
- High fan-out: 100s-1000s children per node
- Depth: Typically 3-4 levels for billions of rows
Lookup cost: ~3-4 disk I/O for any key (with caching, 0-1).
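A back-of-envelope sketch in Python (illustrative numbers only) of why high fan-out keeps lookups cheap:
```python
import math

def btree_depth(n_rows: int, fanout: int) -> int:
    """Levels needed so that fanout^depth >= n_rows."""
    return math.ceil(math.log(n_rows, fanout))

# A 16 KB page holding ~100-byte entries gives a fan-out of a few hundred.
for fanout in (100, 500):
    for n in (1_000_000, 1_000_000_000):
        print(f"fanout={fanout:>3} rows={n:>13,} -> depth={btree_depth(n, fanout)}")
# fanout=100: 1M rows -> 3 levels, 1B rows -> 5 levels
# fanout=500: 1B rows -> 4 levels (root + top levels usually cached, so ~3 I/Os)
```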
2.2.2 B+ Tree (most common variant)
- All data in leaf nodes
- Internal nodes only have keys (for navigation)
- Leaves linked: efficient range scan
```
Internal:   [20 | 50 | 80]
             ↓    ↓    ↓
Leaves:  [..15, 18][20, 35, 45][50, 65][80, 95...]
              ←──→        ←──→
          linked list for range scan
```
Why used: Better range queries, more keys per internal node → shallower tree.
2.2.3 Page splits
When inserting and page full:
- Split into 2 half-full pages
- Promote middle key to parent
- May cascade up
Cost: Random I/O for splits → why writes slower than LSM.
Fragmentation: Over time, pages get half-empty → bloat. Fix: REINDEX, VACUUM FULL.
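A sketch of the split step in Python (simplified to keys only, no disk pages; `split_page` is a made-up helper for the demo):
```python
import bisect

def split_page(page: list[int], insert_key: int, capacity: int):
    """Insert a key; if the page overflows, split and promote the middle key."""
    bisect.insort(page, insert_key)
    if len(page) <= capacity:
        return page, None, None          # fits: in-place update, no split
    mid = len(page) // 2
    left, promote, right = page[:mid], page[mid], page[mid + 1:]
    return left, promote, right          # parent must now absorb `promote`

page = [10, 20, 30, 40]
left, promote, right = split_page(page, 25, capacity=4)
print(left, promote, right)  # [10, 20] 25 [30, 40] -> two half-full pages
```
If the parent is also full, the same split repeats upward, which is the cascade mentioned above.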
2.2.4 Concurrency in B-tree
Latches (lightweight locks) protect pages:
- Latch crabbing (a.k.a. lock coupling): latch parent → latch child → release parent once the child is known safe
- B-link tree: right-sibling pointers allow simpler, lower-contention concurrency
2.3 LSM-tree Deep Dive
Log-Structured Merge-tree (O’Neil et al., 1996).
2.3.1 Architecture
```
┌─────────────┐
│  Memtable   │ ← In-memory sorted (skip list / red-black tree)
│  (current)  │   Write here
└──────┬──────┘
       │ Flush when full
       ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ SSTable L0  │ │ SSTable L0  │ │ SSTable L0  │ ← Newer
└─────────────┘ └─────────────┘ └─────────────┘
       │               │               │
       └───────────────┴───────┬───────┘   Compaction
                               ▼
┌─────────────────────────────────────┐
│  SSTable L1 (larger, sorted)        │
└─────────────────────────────────────┘
                │
                ▼ Compaction
┌─────────────────────────────────────┐
│  SSTable L2 (even larger)           │ ← Older
└─────────────────────────────────────┘
```
SSTable = Sorted String Table = immutable file with sorted (key, value) pairs.
2.3.2 Write path
- Append to WAL (durability)
- Insert into memtable (in-memory sorted)
- When memtable full → flush to disk as SSTable (L0)
- Background compaction merges SSTables, removing duplicates
Why fast write: All sequential I/O.
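A toy LSM write path in Python, for intuition only: real engines use skip lists for the memtable, fsync a WAL before acknowledging, and compact in background threads. The names here (`ToyLSM`, `memtable_limit`) are invented for the demo:
```python
import bisect

class ToyLSM:
    """Toy LSM tree: in-memory memtable flushed to sorted, immutable runs."""
    def __init__(self, memtable_limit: int = 4):
        self.memtable: dict[str, str] = {}
        self.sstables: list[list[tuple[str, str]]] = []  # newest run first
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        # A real engine appends to a WAL here before touching the memtable.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self) -> None:
        # Sequential write: dump the memtable as one sorted, immutable run.
        run = sorted(self.memtable.items())
        self.sstables.insert(0, run)
        self.memtable = {}

    def get(self, key: str) -> str | None:
        if key in self.memtable:           # 1. newest data first
            return self.memtable[key]
        for run in self.sstables:          # 2. then each run, newest to oldest
            i = bisect.bisect_left(run, (key, ""))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = ToyLSM()
for i in range(10):
    db.put(f"k{i}", f"v{i}")
print(db.get("k3"), len(db.sstables))  # 'v3', 2 flushed runs
```
Note how `get` already shows the read-path cost: every miss in a newer run forces a look at an older one.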
2.3.3 Read path
Lookup key K:
1. Check memtable (current)
2. Check L0 SSTables (newest, may be 4-10 files)
3. Check L1, L2, ... (older)
4. Use Bloom filter to skip files that definitely don't have K
5. Use sparse index per SSTable to find page within file
Why slower read: Multiple files to check.
Bloom filters: Each SSTable has a Bloom filter. No false negatives → if the filter says “not present”, the key is definitely not in that file.
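A minimal Bloom filter sketch in Python, deriving k probe positions from two hashes (the bit count and hash count are arbitrary here; real engines size them from a target false-positive rate):
```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits: int = 1024, n_hashes: int = 3):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = 0  # an int used as a bit array

    def _positions(self, key: bytes):
        # Derive k positions from two independent 64-bit hashes.
        h = hashlib.blake2b(key).digest()
        h1 = int.from_bytes(h[:8], "little")
        h2 = int.from_bytes(h[8:16], "little")
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key: bytes) -> bool:
        # False => definitely absent (no false negatives); True => maybe present.
        return all(self.bits & (1 << p) for p in self._positions(key))

bf = BloomFilter()
bf.add(b"key42")
print(bf.might_contain(b"key42"))   # True
print(bf.might_contain(b"key999"))  # almost certainly False -> skip this SSTable
```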
2.3.4 Compaction strategies
| Strategy | Behavior | Trade-off |
|---|---|---|
| Size-Tiered (STCS) | Merge similar-sized SSTables | Faster writes, higher space amp |
| Leveled (LCS) | Each level 10x larger; non-overlapping | Slower writes, lower space amp |
| Time-Window (TWCS) | Group by time | Best for time-series TTL |
| Universal (RocksDB) | Configurable | Tunable |
Production tip (Cassandra):
- STCS: heavy writes, can tolerate space waste
- LCS: read-heavy
- TWCS: time-series with TTL
2.3.5 Compaction debt
If writes faster than compaction → SSTables pile up.
Symptoms:
- Read latency increases (more files to merge)
- Disk space inflates
- Eventually: writes slow down
Fix: Add I/O throughput, throttle writes, increase compaction parallelism.
2.4 Write-Ahead Log (WAL)
Critical for durability: Every change written to WAL before applied to data files.
Write transaction:
1. Write change to WAL (sequential, fast)
2. fsync WAL ← durability
3. Apply to in-memory buffer
4. Reply to client
5. Later: flush dirty pages to data files (background)
6. Eventually: WAL old segments recycled
On crash recovery:
- Read WAL from last checkpoint
- Replay records to reconstruct state
- “Redo” committed, “undo” uncommitted
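A stripped-down WAL in Python to make replay concrete. Assumptions: one JSON record per line, explicit commit markers, redo-only recovery; the path /tmp/toy.wal and the record format are invented for this demo:
```python
import json, os

WAL_PATH = "/tmp/toy.wal"  # hypothetical path for this demo
open(WAL_PATH, "w").close()  # start with an empty log

def log_append(record: dict) -> None:
    with open(WAL_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())  # the durability point: survives power loss

def recover() -> dict:
    """Redo committed transactions only; records without a COMMIT are dropped."""
    state, committed, pending = {}, set(), {}
    with open(WAL_PATH) as f:
        for line in f:
            rec = json.loads(line)
            if rec["op"] == "set":
                pending.setdefault(rec["txn"], []).append((rec["key"], rec["val"]))
            elif rec["op"] == "commit":
                committed.add(rec["txn"])
    for txn in committed:
        for key, val in pending.get(txn, []):
            state[key] = val
    return state

log_append({"txn": 1, "op": "set", "key": "a", "val": 1})
log_append({"txn": 1, "op": "commit"})
log_append({"txn": 2, "op": "set", "key": "b", "val": 2})  # never commits
print(recover())  # {'a': 1} -- txn 2's write is discarded on recovery
```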
2.4.1 WAL formats
Postgres: Per-tuple records (insert/update/delete with old + new values). MySQL InnoDB: Redo log (page-level changes) + undo log (for MVCC). RocksDB: Per-batch records.
2.4.2 Group commit
Optimization: Batch many transactions in 1 fsync.
Without group commit:
```
T1: write WAL → fsync → done (1 ms)
T2: write WAL → fsync → done (1 ms)
T3: write WAL → fsync → done (1 ms)
Total: 3 ms for 3 commits ≈ 1,000 TPS
```
With group commit:
```
T1, T2, T3: write WAL
Single fsync → all done
Total: ~1.1 ms for 3 commits ≈ 2,700 TPS (larger batches scale further)
```
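The arithmetic as a tiny Python model (assuming fsync latency dominates commit cost; `commit_tps` is a made-up helper):
```python
def commit_tps(fsync_ms: float, batch_size: int) -> float:
    """Commits per second when one fsync covers `batch_size` commits."""
    return batch_size / (fsync_ms / 1000)

print(commit_tps(1.0, 1))    # 1,000 TPS: every commit pays a full fsync
print(commit_tps(1.0, 30))   # 30,000 TPS: 30 commits share one fsync
print(commit_tps(0.05, 1))   # 20,000 TPS: NVMe-class fsync, no batching needed
```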
Postgres: commit_delay + commit_siblings for group commit tuning.
2.4.3 Synchronous vs asynchronous commit
| Mode | Behavior | Trade-off |
|---|---|---|
| Synchronous | Wait for fsync before reply | Durable but slow |
| Asynchronous | Reply immediately, fsync later | Fast, may lose recent commits on crash |
| Synchronous + replica | Wait for replica too | Most durable, slowest |
Postgres: synchronous_commit = on/off/local/remote_apply.
2.5 MVCC Implementation
Reference: Tuan-Bonus-Consistency-Models-Isolation for the high-level view. This section covers implementation details.
2.5.1 Postgres MVCC
Each row has hidden columns:
- xmin: transaction that inserted the row
- xmax: transaction that deleted/updated it (NULL = current version)
- ctid: physical location (block, offset)
```
Initial:  row(id=1, value='A', xmin=10, xmax=NULL)

UPDATE id=1 SET value='B' (txn 20):
  Insert NEW row(id=1, value='B', xmin=20, xmax=NULL)
  Mark   OLD row(id=1, value='A', xmin=10, xmax=20)
  → Both versions exist
```
Visibility check:
```sql
SELECT * FROM users WHERE id = 1;
-- For each row version, check:
--   xmin in committed txns AND
--   (xmax IS NULL OR xmax NOT in committed txns at our snapshot)
```
2.5.2 The bloat problem
Updates create new row versions → table grows.
Without VACUUM:
- Table size 10x original
- Index size 10x original
- Query slow (sequential scan over dead tuples)
VACUUM (Postgres):
- Mark dead tuples as reusable
- Update visibility map
- Doesn’t shrink the table file (space is reused internally, not returned to the OS)
VACUUM FULL:
- Rewrite entire table → reclaim space
- Requires AccessExclusiveLock → blocks queries
- Use pg_repack for an online alternative
2.5.3 Oracle/MySQL MVCC
Oracle: Undo segments — older versions in separate space, GC’d over time. MySQL InnoDB: Undo logs in system tablespace.
Difference from Postgres:
- Postgres: new row versions are written into the heap alongside the old ones
- Oracle/MySQL: rows updated in place; old versions go to a separate undo store
→ Different cleanup behavior: Oracle/MySQL need no VACUUM but must size undo space.
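To make the visibility rule concrete, a simplified Python sketch of Postgres-style snapshot visibility. It ignores in-progress transaction arrays, subtransactions, and hint bits; `visible` is a hypothetical helper, not the actual Postgres routine:
```python
from dataclasses import dataclass

@dataclass
class Version:
    value: str
    xmin: int          # txn that created this version
    xmax: int | None   # txn that deleted/replaced it (None = live)

def visible(v: Version, snapshot_xid: int, committed: set[int]) -> bool:
    """Visible iff the creator committed before our snapshot and no
    committed-before-snapshot txn has deleted the version."""
    created = v.xmin in committed and v.xmin <= snapshot_xid
    deleted = v.xmax is not None and v.xmax in committed and v.xmax <= snapshot_xid
    return created and not deleted

versions = [
    Version("A", xmin=10, xmax=20),
    Version("B", xmin=20, xmax=None),
]
committed = {10, 20}
for snap in (15, 25):
    row = [v.value for v in versions if visible(v, snap, committed)]
    print(f"snapshot txn {snap} sees {row}")
# snapshot 15 sees ['A'], snapshot 25 sees ['B'] -- matches diagram 7.3
```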
2.6 Query Optimizer
The brain of the database: decides HOW to execute SQL.
2.6.1 Phases
SQL → Parser → AST → Logical Plan → Optimizer → Physical Plan → Executor
Logical plan: Relational algebra (project, select, join, group by). Physical plan: Specific algorithms (hash join vs merge join, index scan vs seq scan).
2.6.2 Cost-Based Optimization (CBO)
For each candidate plan:
- Estimate cardinality (rows at each step)
- Estimate cost (CPU + I/O)
- Pick lowest-cost plan
Statistics needed:
- Number of rows per table
- Distinct values per column
- Histograms for value distribution
- Index selectivity
```sql
-- Postgres: update stats
ANALYZE users;

-- View stats
SELECT * FROM pg_stats WHERE tablename = 'users';
```
When stats are wrong: Bad plan → 1000x slowdown.
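The simplest estimator behind those statistics, sketched in Python: with no histogram, planners commonly fall back to a uniform-distribution assumption (rows / distinct values for an equality predicate):
```python
def estimate_equality_rows(n_rows: int, n_distinct: int) -> float:
    """Default planner guess for `col = const`: rows / distinct values."""
    return n_rows / n_distinct

# users table: 1M rows
print(estimate_equality_rows(1_000_000, 500_000))  # email:  ~2 rows  -> index scan
print(estimate_equality_rows(1_000_000, 5))        # status: ~200K rows -> seq scan
# Stale n_distinct after a bulk load skews these estimates -> wrong plan.
```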
2.6.3 Join algorithms
| Algorithm | Best when | Cost |
|---|---|---|
| Nested Loop | Small tables, indexed inner | O(N×M) or O(N×log M) |
| Hash Join | Large tables, no order | O(N+M) but needs hash table memory |
| Merge Join | Both tables sorted | O(N+M) |
| Index Nested Loop | Inner has index | O(N×log M) |
Optimizer chooses based on size, indexes, available memory.
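Nested loop vs hash join in miniature (Python, illustrative data and key names):
```python
def nested_loop_join(outer, inner, out_key, in_key):
    # O(N*M) comparisons: fine when one side is tiny or the inner is indexed.
    return [(o, i) for o in outer for i in inner if o[out_key] == i[in_key]]

def hash_join(outer, inner, out_key, in_key):
    # O(N+M): build a hash table on one side, probe with the other.
    table = {}
    for i in inner:
        table.setdefault(i[in_key], []).append(i)
    return [(o, i) for o in outer for i in table.get(o[out_key], [])]

users  = [{"id": 1, "country": "VN"}, {"id": 2, "country": "US"}]
orders = [{"uid": 1, "total": 50}, {"uid": 1, "total": 70}, {"uid": 2, "total": 30}]
assert nested_loop_join(orders, users, "uid", "id") == hash_join(orders, users, "uid", "id")
print(hash_join(orders, users, "uid", "id"))
```
The hash table is what `work_mem` pays for: if it does not fit, the join spills to disk.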
2.6.4 EXPLAIN
```sql
EXPLAIN ANALYZE
SELECT * FROM orders o
JOIN users u ON o.user_id = u.id
WHERE u.country = 'VN';

-- Output:
-- Hash Join (cost=10..1000 rows=100 actual=80)
--   -> Seq Scan on orders (cost=0..500 rows=10000)
--   -> Hash (cost=5..20 rows=100)
--        -> Index Scan on users_country_idx (cost=0..15 rows=100)
```
Reading EXPLAIN:
- cost: planner estimate (startup..total)
- rows: estimated row count
- actual: real values (only with ANALYZE)
- Big mismatch between estimate and actual → bad stats
2.6.5 Plan stability problems
Plan flips = same query gets different plan due to:
- Stats changed
- Data distribution shift
- Memory pressure
- Lock contention
Prevention:
- pg_hint_plan extension
- Query store / plan freezing (SQL Server)
- Database tuning advisors
2.7 Indexes
2.7.1 B-tree index (most common)
```sql
CREATE INDEX idx_users_email ON users(email);

SELECT * FROM users WHERE email = '[email protected]';
-- Index lookup: O(log N) → fetch row from heap
```
Pros: Range queries (>, <, BETWEEN), equality, sorting.
Cons: Write overhead, space.
2.7.2 Hash index
```sql
CREATE INDEX idx_users_hash ON users USING HASH (email);
```
Pros: O(1) equality lookup. Cons: No range queries.
Postgres 10+: hash indexes WAL-logged, crash-safe.
2.7.3 Bitmap index
For low-cardinality columns (e.g., gender, status):
status = 'active': 101010111000...
status = 'pending': 010101000111...
status = 'closed': 000000000000...
Pros: Fast for WHERE status='active' AND country='VN' (bitmaps combine with cheap bitwise AND/OR).
Cons: Slow updates. Best for analytics, not OLTP.
Used by: Oracle, ClickHouse (different impl).
2.7.4 Covering index
Include extra columns in index to avoid heap lookup:
```sql
CREATE INDEX idx_users_covering
ON users (email)
INCLUDE (name, country);

-- Query satisfied entirely from the index
SELECT email, name, country FROM users WHERE email = '...';
```
Pros: Index-only scan, no heap fetch. Cons: Larger index size.
2.7.5 Partial index
```sql
CREATE INDEX idx_active_users
ON users (last_login)
WHERE status = 'active';
-- Smaller index (only active users)
```
2.7.6 Expression index
```sql
CREATE INDEX idx_email_lower ON users (LOWER(email));

SELECT * FROM users WHERE LOWER(email) = '[email protected]';
```
2.7.7 Multi-column index
```sql
CREATE INDEX idx_country_city ON users (country, city);

-- Used for:
WHERE country = 'VN'                    ✓
WHERE country = 'VN' AND city = 'HCM'   ✓
WHERE country = 'VN' ORDER BY city      ✓

-- Not used for:
WHERE city = 'HCM'   (city is not the leading column)
```
Order matters: Most selective column first, equality before range.
2.7.8 GIN / GiST / BRIN (Postgres specialized)
- GIN (Generalized Inverted Index): Array, JSONB, full-text
- GiST (Generalized Search Tree): Geometric, full-text, range
- BRIN (Block Range Index): Huge tables, sorted by physical order, very small index
```sql
-- BRIN for time-series
CREATE INDEX idx_logs_time ON logs USING BRIN (created_at);
-- 1000x smaller than B-tree, fast for range queries
```
2.8 Storage Formats: Row vs Columnar
2.8.1 Row-oriented
Each row stored contiguously:
Row 1: [id=1, name='Alice', age=25, country='VN', email='[email protected]']
Row 2: [id=2, name='Bob', age=30, country='US', email='[email protected]']
...
Used by: PostgreSQL, MySQL, Oracle (OLTP).
Pros: Fast for “fetch all columns of 1 row” (typical OLTP query). Cons: Slow for analytics (“AVG(age) over 1B rows” needs to read all columns).
2.8.2 Column-oriented (columnar)
Each column stored contiguously:
id: [1, 2, 3, 4, 5, ...]
name: ['Alice', 'Bob', 'Charlie', ...]
age: [25, 30, 28, 35, 22, ...]
country: ['VN', 'US', 'VN', 'JP', 'VN', ...]
Used by: ClickHouse, Apache Parquet, Apache ORC, AWS Redshift, Snowflake, BigQuery.
Pros:
- Compression: Same-type values compress 5-10x
- Vectorization: SIMD over column
- Late materialization: Defer row assembly; read only the needed columns
- Predicate pushdown: Skip blocks via min/max stats
Cons: Slow for “fetch entire row” (need to assemble from columns).
Performance gap: 100x for analytics queries.
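The layout difference in Python terms (illustrative; real columnar engines add compression, min/max stats, and SIMD on top):
```python
from array import array

N = 100_000

# Row-oriented: each record is contiguous -- great for "fetch one whole row".
rows = [{"id": i, "name": f"user{i}", "age": 20 + i % 50} for i in range(N)]

# Column-oriented: each column is contiguous -- great for scanning one column.
ages = array("i", (20 + i % 50 for i in range(N)))

# Analytics query: AVG(age).
avg_row = sum(r["age"] for r in rows) / N   # touches every dict, every field
avg_col = sum(ages) / N                     # one dense int array, cache-friendly
assert avg_row == avg_col
```
The columnar version also holds the age column in a fraction of the memory, which is where the compression wins come from.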
2.8.3 Apache Parquet — modern columnar
Used by: Lakehouse (Iceberg, Delta, Hudi), Spark, Trino, DuckDB.
Structure:
Parquet file
├── Header
├── Row group 1 (e.g., 128 MB)
│ ├── Column chunk: id (compressed, indexed)
│ ├── Column chunk: name
│ ├── ... (each column separate, with stats)
│ └── Column chunk: country
├── Row group 2
├── ...
└── Footer (metadata: schema, stats per column chunk)
Compression: commonly Snappy, zstd, or lz4.
Predicate pushdown:
```sql
SELECT * FROM events WHERE date = '2026-05-01' AND country = 'VN';

-- Reader checks footer stats:
-- Row group 1: date range 2026-04-30 to 2026-05-02 → may match → read
-- Row group 2: date range 2026-04-25 to 2026-04-29 → skip!
-- Saves 50%+ I/O
```
2.9 Buffer Pool
Cache of disk pages in RAM.
SELECT id FROM users WHERE id = 100;
Process:
1. Index lookup identifies the page containing id=100
2. Check buffer pool (a hash table keyed by page id): is the page in memory?
3. Hit: return immediately (~1 μs)
4. Miss: read from disk (~10-100 μs), evict cold page
Hit rate = critical metric:
- 99%+ → great
- 90% → ok
- 80% → bad (mostly disk I/O)
Postgres:
```sql
SELECT
  sum(heap_blks_hit) / nullif(sum(heap_blks_hit + heap_blks_read), 0) AS cache_hit
FROM pg_statio_user_tables;
```
Tuning:
- Postgres: shared_buffers = 25% of RAM
- MySQL InnoDB: innodb_buffer_pool_size = 60-70% of RAM
- Watch for double caching (DB buffer pool + OS page cache)
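The eviction logic in miniature: an LRU page-cache sketch in Python. Real buffer pools use clock-sweep or LRU variants and must handle dirty pages; `BufferPool` here is invented for the demo:
```python
import random
from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.pages: OrderedDict[int, bytes] = OrderedDict()
        self.hits = self.misses = 0

    def read_page(self, page_id: int) -> bytes:
        if page_id in self.pages:
            self.pages.move_to_end(page_id)      # mark as recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)   # evict the coldest page
            self.pages[page_id] = self._read_from_disk(page_id)
        return self.pages[page_id]

    def _read_from_disk(self, page_id: int) -> bytes:
        return bytes(8192)  # stand-in for a slow disk read

pool = BufferPool(capacity=100)
for _ in range(10_000):
    # 90% of reads go to 50 hot pages, the rest anywhere in 1000 pages
    page = random.randrange(50) if random.random() < 0.9 else random.randrange(1000)
    pool.read_page(page)
print(f"hit rate: {pool.hits / (pool.hits + pool.misses):.0%}")  # roughly 90%
```
Shrink the capacity below the hot set and the hit rate collapses, which is exactly the shared_buffers sizing story above.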
2.10 ARIES Recovery (briefly)
Algorithm for Recovery and Isolation Exploiting Semantics (IBM 1992).
3 phases on crash recovery:
- Analysis: Determine which transactions were active at crash
- Redo: Replay all WAL records (idempotent)
- Undo: Roll back uncommitted transactions
Key concept: WAL with LSN (Log Sequence Number), checkpoints.
Used by: MySQL InnoDB, SQL Server; Postgres uses a simpler redo-only variant (MVCC makes a separate undo pass unnecessary).
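A compressed sketch of the three passes in Python (illustrative only: real ARIES tracks pageLSN, CLRs, and a dirty-page table):
```python
log = [  # (lsn, txn, op, key, old, new) -- simplified ARIES-style records
    (1, "T1", "update", "x", 0, 10),
    (2, "T2", "update", "y", 0, 20),
    (3, "T1", "commit", None, None, None),
    # crash: T2 never committed
]

# 1. Analysis: find losers (txns with no commit record).
committed = {txn for (_, txn, op, *_) in log if op == "commit"}
losers = {txn for (_, txn, op, *_) in log if op == "update" and txn not in committed}

# 2. Redo: replay ALL updates, winners and losers alike ("repeating history").
page = {}
for (_, txn, op, key, old, new) in log:
    if op == "update":
        page[key] = new

# 3. Undo: roll back loser updates in reverse LSN order.
for (_, txn, op, key, old, new) in reversed(log):
    if op == "update" and txn in losers:
        page[key] = old

print(page)  # {'x': 10, 'y': 0} -- T1's update survives, T2's is undone
```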
2.11 Connection Models
2.11.1 Process per connection (Postgres default)
- Each connection = OS process (~10MB)
- Crash isolation
- Limited concurrency (1000 conn = 10GB RAM)
- Need PgBouncer for pooling
2.11.2 Thread per connection (MySQL default)
- Each connection = OS thread (~1MB)
- Better concurrency
- Less crash isolation
2.11.3 Coroutine / async (newer)
- One process, many connections
- Used by ScyllaDB (seastar framework)
- 10x throughput vs Cassandra
2.12 Distributed DB Specific Concepts
2.12.1 Disaggregated Storage
Aurora model (and Spanner, Snowflake):
Compute nodes: stateless, scale independently
↓ (network)
Distributed storage: replicated, durable
Aurora innovation: “Log is the database” — only WAL crosses network, storage applies records.
2.12.2 Compute-Storage separation
Modern DBs split:
- Compute: query execution (scales with workload)
- Storage: data persistence (scales with data)
Examples: Snowflake, Aurora, Spanner, Singlestore, ClickHouse Cloud.
Benefit: Pay separately, scale independently.
2.12.3 Vectorized execution
ClickHouse, DuckDB: Process columns in batches (1024 rows), not row-by-row.
```cpp
// Naive (row-by-row)
for (auto& row : rows) {
    if (row.age > 25) result.push_back(row);
}

// Vectorized (batch)
vector<bool> mask = compare_gt(age_column, 25); // SIMD
result = compact(rows, mask);
```
3-10x faster even before SIMD optimizations.
3. Practical Implications
3.1 OLTP vs OLAP DB choice
| | OLTP | OLAP |
|---|---|---|
| Workload | Many small transactions | Few large queries |
| Storage | Row-oriented | Column-oriented |
| Index | B-tree | Bitmap, BRIN |
| Examples | Postgres, MySQL | ClickHouse, BigQuery |
3.2 When to denormalize
Normalized OLTP: 3NF, joins for reads. Denormalized analytics: Pre-joined wide tables.
Star schema:
- Fact table (large, transactions)
- Dimension tables (smaller, joined)
Snowflake schema: Star schema with normalized dimensions.
3.3 Sharding considerations
See Tuan-07-Database-Sharding-Replication for the high-level view. Internals:
- Shard key = primary access pattern
- Cross-shard joins expensive (avoid)
- Resharding is rare but painful (mitigate with consistent hashing)
3.4 Backup strategies
Logical backup (pg_dump): SQL statements. Slow for big DB, portable.
Physical backup (file-level): Fast, restore exact state, version-locked.
PITR (Point-In-Time Recovery): Base backup + WAL → recover to any moment.
3-2-1 rule: 3 copies, 2 different media, 1 off-site.
4. Security First
4.1 SQL injection — Always parameterize
```python
# BAD
db.execute(f"SELECT * FROM users WHERE id = {user_input}")

# GOOD
db.execute("SELECT * FROM users WHERE id = %s", (user_input,))
```
4.2 Encryption
At rest: Transparent Data Encryption (TDE) — Postgres extensions, MySQL InnoDB, SQL Server built-in. In transit: TLS for client-server. Field-level: Encrypt sensitive columns separately.
4.3 Audit logging
Postgres pgaudit extension:
```sql
ALTER SYSTEM SET pgaudit.log = 'write, role, ddl';
```
4.4 Row-Level Security
Tuan-Bonus-Multi-Tenancy-SaaS-Patterns section 2.2 — RLS for multi-tenant isolation.
5. Performance Tuning Checklist
5.1 Slow query investigation
```sql
-- Postgres: find slow queries
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC LIMIT 20;

-- Check plan
EXPLAIN (ANALYZE, BUFFERS) <query>;
```
Common fixes:
- Add index (most common)
- Update stats (ANALYZE)
- Rewrite query (denormalize, materialize)
- Increase work_mem for sort/hash
5.2 Lock contention
```sql
-- Find blocking queries (Postgres)
SELECT
    blocked_activity.pid    AS blocked_pid,
    blocked_activity.query  AS blocked_query,
    blocking_activity.pid   AS blocking_pid,
    blocking_activity.query AS blocking_query
FROM pg_locks blocked_locks
JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_locks blocking_locks
  ON blocked_locks.relation = blocking_locks.relation
 AND blocking_locks.granted
 AND blocking_locks.pid != blocked_locks.pid
JOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
```
5.3 Buffer pool sizing
- Postgres shared_buffers: 25% of RAM (the 128 MB default is too small)
- effective_cache_size: total RAM available for caching (~75% of RAM)
- work_mem: per-operation sort/hash memory (16-64 MB)
- maintenance_work_mem: VACUUM, CREATE INDEX (1-2 GB)
5.4 Connection pooling
Without pooling:
10K clients × 10MB = 100GB RAM (Postgres process-per-conn)
With PgBouncer:
10K clients ↔ PgBouncer ↔ 100 backend connections
Memory: 100 × 10MB = 1GB
5.5 Replication tuning
- wal_level = replica for streaming replication
- max_wal_senders = number of replicas + headroom
- wal_keep_size to retain WAL for lagging replicas
- synchronous_standby_names for synchronous replication
6. Code Examples
6.1 Demonstrate B-tree vs LSM behavior
"""
Compare write/read performance of B-tree (sqlite) vs LSM (rocksdb).
"""
import sqlite3
import rocksdb
import time
import random
N = 100_000
# B-tree (SQLite)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
start = time.time()
for i in range(N):
conn.execute("INSERT INTO kv VALUES (?, ?)", (f"key{i}", f"val{i}"))
conn.commit()
print(f"SQLite write: {N / (time.time() - start):.0f} ops/s")
start = time.time()
for _ in range(N):
k = f"key{random.randint(0, N-1)}"
conn.execute("SELECT v FROM kv WHERE k = ?", (k,)).fetchone()
print(f"SQLite read: {N / (time.time() - start):.0f} ops/s")
# LSM (RocksDB); note the batched write amortizes per-op overhead
db = rocksdb.DB("/tmp/rocks.db", rocksdb.Options(create_if_missing=True))
start = time.time()
batch = rocksdb.WriteBatch()
for i in range(N):
batch.put(f"key{i}".encode(), f"val{i}".encode())
db.write(batch)
print(f"RocksDB write: {N / (time.time() - start):.0f} ops/s")
start = time.time()
for _ in range(N):
k = f"key{random.randint(0, N-1)}".encode()
db.get(k)
print(f"RocksDB read: {N / (time.time() - start):.0f} ops/s")
# Typical results (illustrative, hardware-dependent):
#   SQLite write:  30,000 ops/s    RocksDB write: 200,000 ops/s
#   SQLite read:  100,000 ops/s    RocksDB read:   50,000 ops/s
```
6.2 Show MVCC bloat
```sql
-- Setup
CREATE TABLE bloat_test (id SERIAL PRIMARY KEY, val INT);
INSERT INTO bloat_test (val) SELECT generate_series(1, 100000);
-- Initial size
SELECT pg_size_pretty(pg_relation_size('bloat_test'));
-- ~3.5 MB
-- Update all rows
UPDATE bloat_test SET val = val + 1;
-- Now size doubled (old + new versions)
SELECT pg_size_pretty(pg_relation_size('bloat_test'));
-- ~7 MB
-- More updates
UPDATE bloat_test SET val = val + 1;
UPDATE bloat_test SET val = val + 1;
SELECT pg_size_pretty(pg_relation_size('bloat_test'));
-- ~14 MB (4x original!)
-- VACUUM (mark for reuse, doesn't shrink)
VACUUM bloat_test;
SELECT pg_size_pretty(pg_relation_size('bloat_test'));
-- Still ~14 MB
-- VACUUM FULL (shrinks but locks)
VACUUM FULL bloat_test;
SELECT pg_size_pretty(pg_relation_size('bloat_test'));
-- Back to ~3.5 MB
```
6.3 Index usage analysis
```sql
-- Find unused indexes (Postgres)
SELECT
schemaname, tablename, indexname,
idx_scan as scans,
pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
-- Find missing indexes (large seq scans)
SELECT
schemaname, tablename,
seq_scan, idx_scan,
pg_size_pretty(pg_relation_size(relid)) AS size
FROM pg_stat_user_tables
WHERE seq_scan > idx_scan AND seq_scan > 100
ORDER BY seq_scan DESC;
```
7. System Design Diagrams
7.1 B-tree vs LSM Write Path
```mermaid
flowchart TB
    subgraph BTree["B-tree Write"]
        BW[Write request] --> BWAL[WAL append<br/>sequential]
        BW --> BPage[Find leaf page<br/>random I/O]
        BPage --> BUpdate[In-place update]
        BUpdate --> BSplit{Page full?}
        BSplit -->|Yes| BSplit2[Split, propagate<br/>multiple writes]
        BSplit -->|No| BDone[Done]
        BSplit2 --> BDone
    end
    subgraph LSM["LSM Write"]
        LW[Write request] --> LWAL[WAL append<br/>sequential]
        LW --> LMem[Insert into memtable<br/>in-memory]
        LMem --> LDone[Done]
        LMem -.flush async.-> LSST[SSTable L0<br/>sequential write]
        LSST -.compact async.-> LL1[SSTable L1]
    end
    style LSM fill:#c8e6c9
    style BTree fill:#fff9c4
```
7.2 LSM Architecture
```mermaid
flowchart TB
    Write[Write] --> Mem[Memtable<br/>in-memory<br/>sorted]
    Write --> WAL[(WAL)]
    Mem -.flush.-> L0_1[SSTable L0]
    Mem -.flush.-> L0_2[SSTable L0]
    Mem -.flush.-> L0_3[SSTable L0]
    L0_1 -.compact.-> L1_1[SSTable L1<br/>10x larger]
    L0_2 -.compact.-> L1_1
    L0_3 -.compact.-> L1_1
    L1_1 -.compact.-> L2_1[SSTable L2<br/>100x larger]
    Read[Read] --> Mem
    Read --> L0_1
    Read --> L0_2
    Read --> L0_3
    Read --> L1_1
    Read --> L2_1
    BloomFilter[Bloom Filter<br/>per SSTable] -.skip non-matching.-> L0_1
    BloomFilter -.-> L1_1
    BloomFilter -.-> L2_1
```
7.3 MVCC Visibility
```mermaid
flowchart LR
    subgraph Versions["Row Versions"]
        V1["v1: value=A<br/>xmin=10<br/>xmax=20"]
        V2["v2: value=B<br/>xmin=20<br/>xmax=30"]
        V3["v3: value=C<br/>xmin=30<br/>xmax=NULL"]
        V1 -.updated by.-> V2
        V2 -.updated by.-> V3
    end
    T15[Txn 15 snapshot] -->|visible| V1
    T15 -.invisible.-> V2
    T15 -.invisible.-> V3
    T25[Txn 25 snapshot] -.invisible.-> V1
    T25 -->|visible| V2
    T25 -.invisible.-> V3
    T35[Txn 35 snapshot] -.invisible.-> V1
    T35 -.invisible.-> V2
    T35 -->|visible| V3
    style V1 fill:#ffe0b2
    style V2 fill:#fff9c4
    style V3 fill:#c8e6c9
```
7.4 Row vs Columnar
```mermaid
flowchart LR
    subgraph Row["Row-oriented"]
        R1["Row1: id|name|age|country"]
        R2["Row2: id|name|age|country"]
        R3["Row3: id|name|age|country"]
        R4["Row4: id|name|age|country"]
        Note1["Read row = 1 disk seek<br/>Best for OLTP"]
    end
    subgraph Col["Columnar"]
        C1["id: [1,2,3,4,5,...]"]
        C2["name: [...]"]
        C3["age: [25,30,28,35,22,...]"]
        C4["country: [VN,US,VN,JP,...]"]
        Note2["AVG(age) reads only age column<br/>10-100x faster<br/>Best for OLAP"]
    end
    style Row fill:#fff9c4
    style Col fill:#c8e6c9
```
8. Aha Moments & Pitfalls
Aha Moments
#1: B-tree vs LSM = read-optimized vs write-optimized. Both ACID. Choose based on workload. Most OLTP DBs B-tree, most NoSQL/wide-column LSM.
#2: WAL fsync is fundamental bottleneck. Commit throughput = fsync rate. NVMe enables 10-100x more commits/sec than HDD.
#3: MVCC eliminates read-write blocking. Readers see snapshot, writers create new versions. Cost: bloat → VACUUM.
#4: Postgres “REPEATABLE READ” = Snapshot Isolation (SI), not the SQL-standard RR. Use SERIALIZABLE for true serializability. (See Tuan-Bonus-Consistency-Models-Isolation)
#5: Compaction debt kills LSM systems. Writes faster than compaction → SSTables pile up. Monitor and tune.
#6: Columnar gives 10-100x for analytics. Same data, different layout, different workload. Why ClickHouse, Snowflake exist.
#7: Buffer pool hit rate >90% is mandatory. <80% = mostly disk reads = slow. Tune shared_buffers aggressively.
#8: Stats determine plan quality. Bad stats → bad plan → 1000x slow. Run ANALYZE regularly.
Pitfalls
Pitfall 1: Over-indexing
Adding indexes “just in case” → write amplification, space waste. Fix: Only index columns used in WHERE/JOIN/ORDER BY of slow queries.
Pitfall 2: ORDER BY without index
ORDER BY created_at on a 10M-row table without an index → sorts 10M rows on every query. Fix: Index the sort column.
Pitfall 3: SELECT *
Reads all columns → bloats network, harms cache. Fix: Explicit column list.
Pitfall 4: Forgot ANALYZE
Bulk insert 10M rows → no ANALYZE → optimizer thinks empty → terrible plan. Fix: ANALYZE after bulk operations.
Pitfall 5: VACUUM disabled
“Save resources” → autovacuum off → bloat grows → queries slow → space exhausted. Fix: NEVER disable autovacuum. Tune autovacuum_vacuum_scale_factor instead.
Pitfall 6: Long-running transactions
Block VACUUM from reclaiming dead tuples → bloat grows. Fix: Monitor pg_stat_activity, kill long-running queries (statement_timeout).
Pitfall 7: N+1 query
Loop in app, query per iter → 1000 queries instead of 1. Fix: Single query with JOIN, or batch IN list.
Pitfall 8: Connection without pooling
10K clients × 10MB = 100GB RAM exhausted. Fix: PgBouncer in transaction mode.
Pitfall 9: Wrong index column order
(city, country) when most queries filter only WHERE country='VN' → index can’t be used (country is not the leading column). Fix: Match column order to your query patterns; verify with EXPLAIN.
Pitfall 10: Mistaking row-DB for analytics
Running SELECT AVG(amount) over a 1B-row table on Postgres → 30 minutes. Fix: Move analytics to a columnar engine (ClickHouse, BigQuery, Lakehouse).
9. Internal Links
| Topic | Connects to |
|---|---|
| Tuan-07-Database-Sharding-Replication | Sharding, replication patterns |
| Tuan-Bonus-Consistency-Models-Isolation | MVCC, isolation levels |
| Tuan-Bonus-Outbox-Pattern | WAL-based CDC (Debezium reads WAL) |
| Case-Design-Modern-Data-Lakehouse | Parquet, columnar at scale |
| Tuan-Foundations-OS-Essentials | fsync, mmap, page cache |
| Tuan-Foundations-Computer-Architecture | Cache-aware query execution |
| Tuan-Bonus-Multi-Region-Active-Active-DSQL | Distributed SQL internals |
References
Books:
- Database Internals (Alex Petrov, 2019) — comprehensive
- Designing Data-Intensive Applications Ch.3, 7 (Kleppmann)
- Use The Index, Luke! (free, Markus Winand) — https://use-the-index-luke.com/
- High Performance MySQL (4th ed)
- PostgreSQL: Up & Running (Obe & Hsu)
Papers:
- O’Neil et al., The LSM-Tree (1996)
- Mohan et al., ARIES (1992)
- Bayer & McCreight, Organization and Maintenance of Large Ordered Indexes (1972) — original B-tree
- Stonebraker, The Design and Implementation of POSTGRES (1986)
Online:
- CMU 15-445 Database Systems — https://15445.courses.cs.cmu.edu/
- Andy Pavlo’s talks on YouTube
- The Internals of PostgreSQL (Hironobu Suzuki) — https://www.interdb.jp/pg/
- RocksDB Wiki — https://github.com/facebook/rocksdb/wiki
Tools:
- pgAdmin, pgcli for Postgres
- pg_stat_statements extension
- pgaudit for audit
- pg_repack for online VACUUM FULL
- Percona Toolkit for MySQL
Next: Tuan-Foundations-Compilers-VMs — Compilers, bytecode, GC, JIT.