Foundations: Database Internals
“Backend dev: ‘Postgres slow, add index’. Senior: ‘Slow because the index doesn’t fit in memory’. Architect: ‘B-tree vs LSM-tree trade-off, columnar vs row, WAL fsync rate, MVCC bloat — those are the 4 root causes of 90% of performance issues’. Understanding DB internals = understanding why Cassandra is optimized for writes, why ClickHouse is 100x for analytics, why Postgres VACUUM kills production when you forget it.”
Tags: cs-foundations database storage-engine fundamentals Student: Hieu (Backend Dev → Architect) Related: Tuan-07-Database-Sharding-Replication · Tuan-Bonus-Consistency-Models-Isolation · Case-Design-Modern-Data-Lakehouse · Tuan-Foundations-OS-Essentials
1. Context & Why
Why do you need to understand Database Internals?
| Production problem | DB internals concept |
|---|---|
| Postgres VACUUM kill performance | MVCC bloat, dead tuples |
| Cassandra write fast, read slow | LSM-tree compaction |
| Index thêm nhưng query vẫn slow | Buffer pool hit rate, query plan |
| Iceberg time travel | Append-only manifest log |
| Aurora 5x faster than RDS | Disaggregated storage, log-is-DB |
| ClickHouse 100x analytics | Columnar storage, vectorized exec |
| Postgres connection limit | Process-per-connection |
| Replica lag spike | WAL apply throughput |
Key insight: Every database = a collection of trade-off decisions across 4 layers:
- Storage layer (B-tree vs LSM)
- Transaction layer (MVCC vs locks)
- Query layer (optimizer, execution)
- Replication layer (sync vs async, primary vs multi-primary)
Primary references
- Database Internals (Alex Petrov, 2019) — the bible for this topic
- Designing Data-Intensive Applications Ch.3 (Storage), Ch.7 (Transactions) — Kleppmann
- Readings in Database Systems (Red Book) — http://www.redbook.io/
- CMU 15-445 Database Systems (Andy Pavlo) — https://15445.courses.cs.cmu.edu/
- Use The Index, Luke! (free) — Markus Winand — https://use-the-index-luke.com/
2. Deep Dive — Storage Engines
2.1 The two families: B-tree vs LSM-tree
Every DB picks one of the two (or a hybrid):
| | B-tree | LSM-tree |
|---|---|---|
| Used by | PostgreSQL, MySQL InnoDB, SQLite, SQL Server | Cassandra, RocksDB, LevelDB, ScyllaDB, HBase |
| Write | In-place (random I/O) | Append-only (sequential I/O) |
| Read | O(log N) lookups, mostly 1 disk I/O | Multiple SST files merged |
| Write throughput | Lower (random I/O) | Higher (sequential I/O) |
| Read throughput | Higher | Lower (worse cache, multiple files) |
| Space amplification | Low | Higher (multiple versions, compaction debt) |
| Write amplification | Lower | Higher (compaction rewrites) |
| Best for | Read-heavy, OLTP | Write-heavy, time-series |
2.2 B-tree Deep Dive
2.2.1 Structure
```
                [50]
               /    \
        [20, 35]    [70, 85]
       /   |   \    /   |   \
    [10] [25] [40][55] [75] [90]
    data data data data data data
```
- Balanced: All leaves at same depth
- Page-based: Each node = 1 page (4-16 KB)
- High fan-out: 100s-1000s children per node
- Depth: Typically 3-4 levels for billions of rows
Lookup cost: ~3-4 disk I/O for any key (with caching, 0-1).
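A back-of-envelope sketch in Python (illustrative numbers only) of why high fan-out keeps lookups cheap:
```python
import math

def btree_depth(n_rows: int, fanout: int) -> int:
    """Levels needed so that fanout^depth >= n_rows."""
    return math.ceil(math.log(n_rows, fanout))

# A 16 KB page holding ~100-byte entries gives a fan-out of a few hundred.
for fanout in (100, 500):
    for n in (1_000_000, 1_000_000_000):
        print(f"fanout={fanout:>3} rows={n:>13,} -> depth={btree_depth(n, fanout)}")
# fanout=100: 1M rows -> 3 levels, 1B rows -> 5 levels
# fanout=500: 1B rows -> 4 levels (root + top levels usually cached, so ~3 I/Os)
```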
2.2.2 B+ Tree (most common variant)
- All data in leaf nodes
- Internal nodes only have keys (for navigation)
- Leaves linked: efficient range scan
```
Internal:   [20 | 50 | 80]
             ↓    ↓    ↓
Leaves:  [..15, 18][20, 35, 45][50, 65][80, 95...]
              ←──→        ←──→
          linked list for range scan
```
Why used: Better range queries, more keys per internal node → shallower tree.
2.2.3 Page splits
When inserting and page full:
- Split into 2 half-full pages
- Promote middle key to parent
- May cascade up
Cost: Random I/O for splits → why writes slower than LSM.
Fragmentation: Over time, pages get half-empty → bloat. Fix: REINDEX, VACUUM FULL.
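A sketch of the split step in Python (simplified to keys only, no disk pages; `split_page` is a made-up helper for the demo):
```python
import bisect

def split_page(page: list[int], insert_key: int, capacity: int):
    """Insert a key; if the page overflows, split and promote the middle key."""
    bisect.insort(page, insert_key)
    if len(page) <= capacity:
        return page, None, None          # fits: in-place update, no split
    mid = len(page) // 2
    left, promote, right = page[:mid], page[mid], page[mid + 1:]
    return left, promote, right          # parent must now absorb `promote`

page = [10, 20, 30, 40]
left, promote, right = split_page(page, 25, capacity=4)
print(left, promote, right)  # [10, 20] 25 [30, 40] -> two half-full pages
```
If the parent is also full, the same split repeats upward, which is the cascade mentioned above.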
2.2.4 Concurrency in B-tree
Latches (lightweight locks) protect pages:
- Latch crabbing (a.k.a. lock coupling): latch parent → latch child → release parent once the child is known safe
- B-link tree: right-sibling pointers allow simpler, lower-contention concurrency
2.3 LSM-tree Deep Dive
Log-Structured Merge-tree (O’Neil et al., 1996).
2.3.1 Architecture
```
┌─────────────┐
│  Memtable   │ ← In-memory sorted (skip list / red-black tree)
│  (current)  │   Write here
└──────┬──────┘
       │ Flush when full
       ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ SSTable L0  │ │ SSTable L0  │ │ SSTable L0  │ ← Newer
└─────────────┘ └─────────────┘ └─────────────┘
       │               │               │
       └───────────────┴───────┬───────┘   Compaction
                               ▼
┌─────────────────────────────────────┐
│  SSTable L1 (larger, sorted)        │
└─────────────────────────────────────┘
                │
                ▼ Compaction
┌─────────────────────────────────────┐
│  SSTable L2 (even larger)           │ ← Older
└─────────────────────────────────────┘
```
SSTable = Sorted String Table = immutable file with sorted (key, value) pairs.
2.3.2 Write path
- Append to WAL (durability)
- Insert into memtable (in-memory sorted)
- When memtable full → flush to disk as SSTable (L0)
- Background compaction merges SSTables, removing duplicates
Why fast write: All sequential I/O.
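A toy LSM write path in Python, for intuition only: real engines use skip lists for the memtable, fsync a WAL before acknowledging, and compact in background threads. The names here (`ToyLSM`, `memtable_limit`) are invented for the demo:
```python
import bisect

class ToyLSM:
    """Toy LSM tree: in-memory memtable flushed to sorted, immutable runs."""
    def __init__(self, memtable_limit: int = 4):
        self.memtable: dict[str, str] = {}
        self.sstables: list[list[tuple[str, str]]] = []  # newest run first
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        # A real engine appends to a WAL here before touching the memtable.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self) -> None:
        # Sequential write: dump the memtable as one sorted, immutable run.
        run = sorted(self.memtable.items())
        self.sstables.insert(0, run)
        self.memtable = {}

    def get(self, key: str) -> str | None:
        if key in self.memtable:           # 1. newest data first
            return self.memtable[key]
        for run in self.sstables:          # 2. then each run, newest to oldest
            i = bisect.bisect_left(run, (key, ""))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = ToyLSM()
for i in range(10):
    db.put(f"k{i}", f"v{i}")
print(db.get("k3"), len(db.sstables))  # 'v3', 2 flushed runs
```
Note how `get` already shows the read-path cost: every miss in a newer run forces a look at an older one.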
2.3.3 Read path
Lookup key K:
1. Check memtable (current)
2. Check L0 SSTables (newest, may be 4-10 files)
3. Check L1, L2, ... (older)
4. Use Bloom filter to skip files that definitely don't have K
5. Use sparse index per SSTable to find page within file
Why slower read: Multiple files to check.
Bloom filters: Each SSTable has a Bloom filter. No false negatives → if the filter says “not present”, the key is definitely not in that file.
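A minimal Bloom filter sketch in Python, deriving k probe positions from two hashes (the bit count and hash count are arbitrary here; real engines size them from a target false-positive rate):
```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits: int = 1024, n_hashes: int = 3):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = 0  # an int used as a bit array

    def _positions(self, key: bytes):
        # Derive k positions from two independent 64-bit hashes.
        h = hashlib.blake2b(key).digest()
        h1 = int.from_bytes(h[:8], "little")
        h2 = int.from_bytes(h[8:16], "little")
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key: bytes) -> bool:
        # False => definitely absent (no false negatives); True => maybe present.
        return all(self.bits & (1 << p) for p in self._positions(key))

bf = BloomFilter()
bf.add(b"key42")
print(bf.might_contain(b"key42"))   # True
print(bf.might_contain(b"key999"))  # almost certainly False -> skip this SSTable
```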
2.3.4 Compaction strategies
| Strategy | Behavior | Trade-off |
|---|---|---|
| Size-Tiered (STCS) | Merge similar-sized SSTables | Faster writes, higher space amp |
| Leveled (LCS) | Each level 10x larger; non-overlapping | Slower writes, lower space amp |
| Time-Window (TWCS) | Group by time | Best for time-series TTL |
| Universal (RocksDB) | Configurable | Tunable |
Production tip (Cassandra):
- STCS: heavy writes, can tolerate space waste
- LCS: read-heavy
- TWCS: time-series with TTL
2.3.5 Compaction debt
If writes faster than compaction → SSTables pile up.
Symptoms:
- Read latency increases (more files to merge)
- Disk space inflates
- Eventually: writes slow down
Fix: Add I/O throughput, throttle writes, increase compaction parallelism.
2.4 Write-Ahead Log (WAL)
Critical for durability: Every change written to WAL before applied to data files.
Write transaction:
1. Write change to WAL (sequential, fast)
2. fsync WAL ← durability
3. Apply to in-memory buffer
4. Reply to client
5. Later: flush dirty pages to data files (background)
6. Eventually: WAL old segments recycled
On crash recovery:
- Read WAL from last checkpoint
- Replay records to reconstruct state
- “Redo” committed, “undo” uncommitted
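A stripped-down WAL in Python to make replay concrete. Assumptions: one JSON record per line, explicit commit markers, redo-only recovery; the path /tmp/toy.wal and the record format are invented for this demo:
```python
import json, os

WAL_PATH = "/tmp/toy.wal"  # hypothetical path for this demo
open(WAL_PATH, "w").close()  # start with an empty log

def log_append(record: dict) -> None:
    with open(WAL_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())  # the durability point: survives power loss

def recover() -> dict:
    """Redo committed transactions only; records without a COMMIT are dropped."""
    state, committed, pending = {}, set(), {}
    with open(WAL_PATH) as f:
        for line in f:
            rec = json.loads(line)
            if rec["op"] == "set":
                pending.setdefault(rec["txn"], []).append((rec["key"], rec["val"]))
            elif rec["op"] == "commit":
                committed.add(rec["txn"])
    for txn in committed:
        for key, val in pending.get(txn, []):
            state[key] = val
    return state

log_append({"txn": 1, "op": "set", "key": "a", "val": 1})
log_append({"txn": 1, "op": "commit"})
log_append({"txn": 2, "op": "set", "key": "b", "val": 2})  # never commits
print(recover())  # {'a': 1} -- txn 2's write is discarded on recovery
```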
2.4.1 WAL formats
Postgres: Per-tuple records (insert/update/delete with old + new values). MySQL InnoDB: Redo log (page-level changes) + undo log (for MVCC). RocksDB: Per-batch records.
2.4.2 Group commit
Optimization: Batch many transactions in 1 fsync.
Without group commit:
```
T1: write WAL → fsync → done (1 ms)
T2: write WAL → fsync → done (1 ms)
T3: write WAL → fsync → done (1 ms)
Total: 3 ms for 3 commits ≈ 1,000 TPS
```
With group commit:
```
T1, T2, T3: write WAL
Single fsync → all done
Total: ~1.1 ms for 3 commits ≈ 2,700 TPS (larger batches scale further)
```
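The arithmetic as a tiny Python model (assuming fsync latency dominates commit cost; `commit_tps` is a made-up helper):
```python
def commit_tps(fsync_ms: float, batch_size: int) -> float:
    """Commits per second when one fsync covers `batch_size` commits."""
    return batch_size / (fsync_ms / 1000)

print(commit_tps(1.0, 1))    # 1,000 TPS: every commit pays a full fsync
print(commit_tps(1.0, 30))   # 30,000 TPS: 30 commits share one fsync
print(commit_tps(0.05, 1))   # 20,000 TPS: NVMe-class fsync, no batching needed
```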
Postgres: commit_delay + commit_siblings for group commit tuning.
2.4.3 Synchronous vs asynchronous commit
| Mode | Behavior | Trade-off |
|---|---|---|
| Synchronous | Wait for fsync before reply | Durable but slow |
| Asynchronous | Reply immediately, fsync later | Fast, may lose recent commits on crash |
| Synchronous + replica | Wait for replica too | Most durable, slowest |
Postgres: synchronous_commit = on/off/local/remote_apply.
2.5 MVCC Implementation
Reference: Tuan-Bonus-Consistency-Models-Isolation for the high-level view. This section covers implementation details.
2.5.1 Postgres MVCC
Each row has hidden columns:
- xmin: transaction that inserted the row
- xmax: transaction that deleted/updated it (NULL = current version)
- ctid: physical location (block, offset)
```
Initial:  row(id=1, value='A', xmin=10, xmax=NULL)

UPDATE id=1 SET value='B' (txn 20):
  Insert NEW row(id=1, value='B', xmin=20, xmax=NULL)
  Mark   OLD row(id=1, value='A', xmin=10, xmax=20)
  → Both versions exist
```
Visibility check:
```sql
SELECT * FROM users WHERE id = 1;
-- For each row version, check:
--   xmin in committed txns AND
--   (xmax IS NULL OR xmax NOT in committed txns at our snapshot)
```
2.5.2 The bloat problem
Updates create new row versions → table grows.
Without VACUUM:
- Table size 10x original
- Index size 10x original
- Query slow (sequential scan over dead tuples)
VACUUM (Postgres):
- Mark dead tuples as reusable
- Update visibility map
- Doesn’t shrink the table file (space is reused internally, not returned to the OS)
VACUUM FULL:
- Rewrite entire table → reclaim space
- Requires AccessExclusiveLock → blocks queries
- Use pg_repack for an online alternative
2.5.3 Oracle/MySQL MVCC
Oracle: Undo segments — older versions in separate space, GC’d over time. MySQL InnoDB: Undo logs in system tablespace.
Difference from Postgres:
- Postgres: new row versions are written into the heap alongside the old ones
- Oracle/MySQL: rows updated in place; old versions go to a separate undo store
→ Different cleanup behavior: Oracle/MySQL need no VACUUM but must size undo space.
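To make the visibility rule concrete, a simplified Python sketch of Postgres-style snapshot visibility. It ignores in-progress transaction arrays, subtransactions, and hint bits; `visible` is a hypothetical helper, not the actual Postgres routine:
```python
from dataclasses import dataclass

@dataclass
class Version:
    value: str
    xmin: int          # txn that created this version
    xmax: int | None   # txn that deleted/replaced it (None = live)

def visible(v: Version, snapshot_xid: int, committed: set[int]) -> bool:
    """Visible iff the creator committed before our snapshot and no
    committed-before-snapshot txn has deleted the version."""
    created = v.xmin in committed and v.xmin <= snapshot_xid
    deleted = v.xmax is not None and v.xmax in committed and v.xmax <= snapshot_xid
    return created and not deleted

versions = [
    Version("A", xmin=10, xmax=20),
    Version("B", xmin=20, xmax=None),
]
committed = {10, 20}
for snap in (15, 25):
    row = [v.value for v in versions if visible(v, snap, committed)]
    print(f"snapshot txn {snap} sees {row}")
# snapshot 15 sees ['A'], snapshot 25 sees ['B'] -- matches diagram 7.3
```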
2.6 Query Optimizer
The brain of the database: decides HOW to execute SQL.
2.6.1 Phases
SQL → Parser → AST → Logical Plan → Optimizer → Physical Plan → Executor
Logical plan: Relational algebra (project, select, join, group by). Physical plan: Specific algorithms (hash join vs merge join, index scan vs seq scan).
2.6.2 Cost-Based Optimization (CBO)
For each candidate plan:
- Estimate cardinality (rows at each step)
- Estimate cost (CPU + I/O)
- Pick lowest-cost plan
Statistics needed:
- Number of rows per table
- Distinct values per column
- Histograms for value distribution
- Index selectivity
```sql
-- Postgres: update stats
ANALYZE users;

-- View stats
SELECT * FROM pg_stats WHERE tablename = 'users';
```
When stats are wrong: Bad plan → 1000x slowdown.
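The simplest estimator behind those statistics, sketched in Python: with no histogram, planners commonly fall back to a uniform-distribution assumption (rows / distinct values for an equality predicate):
```python
def estimate_equality_rows(n_rows: int, n_distinct: int) -> float:
    """Default planner guess for `col = const`: rows / distinct values."""
    return n_rows / n_distinct

# users table: 1M rows
print(estimate_equality_rows(1_000_000, 500_000))  # email:  ~2 rows  -> index scan
print(estimate_equality_rows(1_000_000, 5))        # status: ~200K rows -> seq scan
# Stale n_distinct after a bulk load skews these estimates -> wrong plan.
```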
2.6.3 Join algorithms
| Algorithm | Best when | Cost |
|---|---|---|
| Nested Loop | Small tables, indexed inner | O(N×M) or O(N×log M) |
| Hash Join | Large tables, no order | O(N+M) but needs hash table memory |
| Merge Join | Both tables sorted | O(N+M) |
| Index Nested Loop | Inner has index | O(N×log M) |
Optimizer chooses based on size, indexes, available memory.
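Nested loop vs hash join in miniature (Python, illustrative data and key names):
```python
def nested_loop_join(outer, inner, out_key, in_key):
    # O(N*M) comparisons: fine when one side is tiny or the inner is indexed.
    return [(o, i) for o in outer for i in inner if o[out_key] == i[in_key]]

def hash_join(outer, inner, out_key, in_key):
    # O(N+M): build a hash table on one side, probe with the other.
    table = {}
    for i in inner:
        table.setdefault(i[in_key], []).append(i)
    return [(o, i) for o in outer for i in table.get(o[out_key], [])]

users  = [{"id": 1, "country": "VN"}, {"id": 2, "country": "US"}]
orders = [{"uid": 1, "total": 50}, {"uid": 1, "total": 70}, {"uid": 2, "total": 30}]
assert nested_loop_join(orders, users, "uid", "id") == hash_join(orders, users, "uid", "id")
print(hash_join(orders, users, "uid", "id"))
```
The hash table is what `work_mem` pays for: if it does not fit, the join spills to disk.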
2.6.4 EXPLAIN
```sql
EXPLAIN ANALYZE
SELECT * FROM orders o
JOIN users u ON o.user_id = u.id
WHERE u.country = 'VN';

-- Output:
-- Hash Join (cost=10..1000 rows=100 actual=80)
--   -> Seq Scan on orders (cost=0..500 rows=10000)
--   -> Hash (cost=5..20 rows=100)
--        -> Index Scan on users_country_idx (cost=0..15 rows=100)
```
Reading EXPLAIN:
- cost: planner estimate (startup..total)
- rows: estimated row count
- actual: real values (only with ANALYZE)
- Big mismatch between estimate and actual → bad stats
2.6.5 Plan stability problems
Plan flips = same query gets different plan due to:
- Stats changed
- Data distribution shift
- Memory pressure
- Lock contention
Prevention:
- pg_hint_plan extension
- Query store / plan freezing (SQL Server)
- Database tuning advisors
2.7 Indexes
2.7.1 B-tree index (most common)
```sql
CREATE INDEX idx_users_email ON users(email);

SELECT * FROM users WHERE email = '[email protected]';
-- Index lookup: O(log N) → fetch row from heap
```
Pros: Range queries (>, <, BETWEEN), equality, sorting.
Cons: Write overhead, space.
2.7.2 Hash index
```sql
CREATE INDEX idx_users_hash ON users USING HASH (email);
```
Pros: O(1) equality lookup. Cons: No range queries.
Postgres 10+: hash indexes WAL-logged, crash-safe.
2.7.3 Bitmap index
For low-cardinality columns (e.g., gender, status):
status = 'active': 101010111000...
status = 'pending': 010101000111...
status = 'closed': 000000000000...
Pros: Fast for WHERE status='active' AND country='VN' (bitmaps combine with cheap bitwise AND/OR).
Cons: Slow updates. Best for analytics, not OLTP.
Used by: Oracle, ClickHouse (different impl).
2.7.4 Covering index
Include extra columns in index to avoid heap lookup:
```sql
CREATE INDEX idx_users_covering
ON users (email)
INCLUDE (name, country);

-- Query satisfied entirely from the index
SELECT email, name, country FROM users WHERE email = '...';
```
Pros: Index-only scan, no heap fetch. Cons: Larger index size.
2.7.5 Partial index
```sql
CREATE INDEX idx_active_users
ON users (last_login)
WHERE status = 'active';
-- Smaller index (only active users)
```
2.7.6 Expression index
```sql
CREATE INDEX idx_email_lower ON users (LOWER(email));

SELECT * FROM users WHERE LOWER(email) = '[email protected]';
```
2.7.7 Multi-column index
```sql
CREATE INDEX idx_country_city ON users (country, city);

-- Used for:
WHERE country = 'VN'                    ✓
WHERE country = 'VN' AND city = 'HCM'   ✓
WHERE country = 'VN' ORDER BY city      ✓

-- Not used for:
WHERE city = 'HCM'   (city is not the leading column)
```
Order matters: Most selective column first, equality before range.
2.7.8 GIN / GiST / BRIN (Postgres specialized)
- GIN (Generalized Inverted Index): Array, JSONB, full-text
- GiST (Generalized Search Tree): Geometric, full-text, range
- BRIN (Block Range Index): Huge tables, sorted by physical order, very small index
```sql
-- BRIN for time-series
CREATE INDEX idx_logs_time ON logs USING BRIN (created_at);
-- 1000x smaller than B-tree, fast for range queries
```
2.8 Storage Formats: Row vs Columnar
2.8.1 Row-oriented
Each row stored contiguously:
Row 1: [id=1, name='Alice', age=25, country='VN', email='[email protected]']
Row 2: [id=2, name='Bob', age=30, country='US', email='[email protected]']
...
Used by: PostgreSQL, MySQL, Oracle (OLTP).
Pros: Fast for “fetch all columns of 1 row” (typical OLTP query). Cons: Slow for analytics (“AVG(age) over 1B rows” needs to read all columns).
2.8.2 Column-oriented (columnar)
Each column stored contiguously:
id: [1, 2, 3, 4, 5, ...]
name: ['Alice', 'Bob', 'Charlie', ...]
age: [25, 30, 28, 35, 22, ...]
country: ['VN', 'US', 'VN', 'JP', 'VN', ...]
Used by: ClickHouse, Apache Parquet, Apache ORC, AWS Redshift, Snowflake, BigQuery.
Pros:
- Compression: Same-type values compress 5-10x
- Vectorization: SIMD over column
- Late materialization: Defer row assembly; read only the needed columns
- Predicate pushdown: Skip blocks via min/max stats
Cons: Slow for “fetch entire row” (need to assemble from columns).
Performance gap: 100x for analytics queries.
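The layout difference in Python terms (illustrative; real columnar engines add compression, min/max stats, and SIMD on top):
```python
from array import array

N = 100_000

# Row-oriented: each record is contiguous -- great for "fetch one whole row".
rows = [{"id": i, "name": f"user{i}", "age": 20 + i % 50} for i in range(N)]

# Column-oriented: each column is contiguous -- great for scanning one column.
ages = array("i", (20 + i % 50 for i in range(N)))

# Analytics query: AVG(age).
avg_row = sum(r["age"] for r in rows) / N   # touches every dict, every field
avg_col = sum(ages) / N                     # one dense int array, cache-friendly
assert avg_row == avg_col
```
The columnar version also holds the age column in a fraction of the memory, which is where the compression wins come from.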
2.8.3 Apache Parquet — modern columnar
Used by: Lakehouse (Iceberg, Delta, Hudi), Spark, Trino, DuckDB.
Structure:
Parquet file
├── Header
├── Row group 1 (e.g., 128 MB)
│ ├── Column chunk: id (compressed, indexed)
│ ├── Column chunk: name
│ ├── ... (each column separate, with stats)
│ └── Column chunk: country
├── Row group 2
├── ...
└── Footer (metadata: schema, stats per column chunk)
Compression: commonly Snappy, zstd, or lz4.
Predicate pushdown:
```sql
SELECT * FROM events WHERE date = '2026-05-01' AND country = 'VN';

-- Reader checks footer stats:
-- Row group 1: date range 2026-04-30 to 2026-05-02 → may match → read
-- Row group 2: date range 2026-04-25 to 2026-04-29 → skip!
-- Saves 50%+ I/O
```
2.9 Buffer Pool
Cache of disk pages in RAM.
SELECT id FROM users WHERE id = 100;
Process:
1. Index lookup identifies the page containing id=100
2. Check buffer pool (a hash table keyed by page id): is the page in memory?
3. Hit: return immediately (~1 μs)
4. Miss: read from disk (~10-100 μs), evict cold page
Hit rate = critical metric:
- 99%+ → great
- 90% → ok
- 80% → bad (mostly disk I/O)
Postgres:
```sql
SELECT
  sum(heap_blks_hit) / nullif(sum(heap_blks_hit + heap_blks_read), 0) AS cache_hit
FROM pg_statio_user_tables;
```
Tuning:
- Postgres: shared_buffers = 25% of RAM
- MySQL InnoDB: innodb_buffer_pool_size = 60-70% of RAM
- Watch for double caching (DB buffer pool + OS page cache)
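The eviction logic in miniature: an LRU page-cache sketch in Python. Real buffer pools use clock-sweep or LRU variants and must handle dirty pages; `BufferPool` here is invented for the demo:
```python
import random
from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.pages: OrderedDict[int, bytes] = OrderedDict()
        self.hits = self.misses = 0

    def read_page(self, page_id: int) -> bytes:
        if page_id in self.pages:
            self.pages.move_to_end(page_id)      # mark as recently used
            self.hits += 1
        else:
            self.misses += 1
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)   # evict the coldest page
            self.pages[page_id] = self._read_from_disk(page_id)
        return self.pages[page_id]

    def _read_from_disk(self, page_id: int) -> bytes:
        return bytes(8192)  # stand-in for a slow disk read

pool = BufferPool(capacity=100)
for _ in range(10_000):
    # 90% of reads go to 50 hot pages, the rest anywhere in 1000 pages
    page = random.randrange(50) if random.random() < 0.9 else random.randrange(1000)
    pool.read_page(page)
print(f"hit rate: {pool.hits / (pool.hits + pool.misses):.0%}")  # roughly 90%
```
Shrink the capacity below the hot set and the hit rate collapses, which is exactly the shared_buffers sizing story above.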
2.10 ARIES Recovery (briefly)
Algorithm for Recovery and Isolation Exploiting Semantics (IBM 1992).
3 phases on crash recovery:
- Analysis: Determine which transactions were active at crash
- Redo: Replay all WAL records (idempotent)
- Undo: Roll back uncommitted transactions
Key concept: WAL with LSN (Log Sequence Number), checkpoints.
Used by: MySQL InnoDB, SQL Server; Postgres uses a simpler redo-only variant (MVCC makes a separate undo pass unnecessary).
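A compressed sketch of the three passes in Python (illustrative only: real ARIES tracks pageLSN, CLRs, and a dirty-page table):
```python
log = [  # (lsn, txn, op, key, old, new) -- simplified ARIES-style records
    (1, "T1", "update", "x", 0, 10),
    (2, "T2", "update", "y", 0, 20),
    (3, "T1", "commit", None, None, None),
    # crash: T2 never committed
]

# 1. Analysis: find losers (txns with no commit record).
committed = {txn for (_, txn, op, *_) in log if op == "commit"}
losers = {txn for (_, txn, op, *_) in log if op == "update" and txn not in committed}

# 2. Redo: replay ALL updates, winners and losers alike ("repeating history").
page = {}
for (_, txn, op, key, old, new) in log:
    if op == "update":
        page[key] = new

# 3. Undo: roll back loser updates in reverse LSN order.
for (_, txn, op, key, old, new) in reversed(log):
    if op == "update" and txn in losers:
        page[key] = old

print(page)  # {'x': 10, 'y': 0} -- T1's update survives, T2's is undone
```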
2.11 Connection Models
2.11.1 Process per connection (Postgres default)
- Each connection = OS process (~10MB)
- Crash isolation
- Limited concurrency (1000 conn = 10GB RAM)
- Need PgBouncer for pooling
2.11.2 Thread per connection (MySQL default)
- Each connection = OS thread (~1MB)
- Better concurrency
- Less crash isolation
2.11.3 Coroutine / async (newer)
- One process, many connections
- Used by ScyllaDB (seastar framework)
- 10x throughput vs Cassandra
2.12 Distributed DB Specific Concepts
2.12.1 Disaggregated Storage
Aurora model (and Spanner, Snowflake):
Compute nodes: stateless, scale independently
↓ (network)
Distributed storage: replicated, durable
Aurora innovation: “Log is the database” — only WAL crosses network, storage applies records.
2.12.2 Compute-Storage separation
Modern DBs split:
- Compute: query execution (scales with workload)
- Storage: data persistence (scales with data)
Examples: Snowflake, Aurora, Spanner, Singlestore, ClickHouse Cloud.
Benefit: Pay separately, scale independently.
2.12.3 Vectorized execution
ClickHouse, DuckDB: Process columns in batches (1024 rows), not row-by-row.
```cpp
// Naive (row-by-row)
for (auto& row : rows) {
    if (row.age > 25) result.push_back(row);
}

// Vectorized (batch)
vector<bool> mask = compare_gt(age_column, 25); // SIMD
result = compact(rows, mask);
```
3-10x faster even before SIMD optimizations.
3. Practical Implications
3.1 OLTP vs OLAP DB choice
| | OLTP | OLAP |
|---|---|---|
| Workload | Many small transactions | Few large queries |
| Storage | Row-oriented | Column-oriented |
| Index | B-tree | Bitmap, BRIN |
| Examples | Postgres, MySQL | ClickHouse, BigQuery |
3.2 When to denormalize
Normalized OLTP: 3NF, joins for reads. Denormalized analytics: Pre-joined wide tables.
Star schema:
- Fact table (large, transactions)
- Dimension tables (smaller, joined)
Snowflake schema: Star schema with normalized dimensions.
3.3 Sharding considerations
See Tuan-07-Database-Sharding-Replication for the high-level view. Internals:
- Shard key = primary access pattern
- Cross-shard joins expensive (avoid)
- Resharding is rare but painful (mitigate with consistent hashing)
3.4 Backup strategies
Logical backup (pg_dump): SQL statements. Slow for big DB, portable.
Physical backup (file-level): Fast, restore exact state, version-locked.
PITR (Point-In-Time Recovery): Base backup + WAL → recover to any moment.
3-2-1 rule: 3 copies, 2 different media, 1 off-site.
4. Security First
4.1 SQL injection — Always parameterize
```python
# BAD
db.execute(f"SELECT * FROM users WHERE id = {user_input}")

# GOOD
db.execute("SELECT * FROM users WHERE id = %s", (user_input,))
```
4.2 Encryption
At rest: Transparent Data Encryption (TDE) — Postgres extensions, MySQL InnoDB, SQL Server built-in. In transit: TLS for client-server. Field-level: Encrypt sensitive columns separately.
4.3 Audit logging
Postgres pgaudit extension:
```sql
ALTER SYSTEM SET pgaudit.log = 'write, role, ddl';
```
4.4 Row-Level Security
Tuan-Bonus-Multi-Tenancy-SaaS-Patterns section 2.2 — RLS for multi-tenant isolation.
5. Performance Tuning Checklist
5.1 Slow query investigation
```sql
-- Postgres: find slow queries
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC LIMIT 20;

-- Check plan
EXPLAIN (ANALYZE, BUFFERS) <query>;
```
Common fixes:
- Add index (most common)
- Update stats (ANALYZE)
- Rewrite query (denormalize, materialize)
- Increase work_mem for sort/hash
5.2 Lock contention
```sql
-- Find blocking queries (Postgres)
SELECT
    blocked_activity.pid    AS blocked_pid,
    blocked_activity.query  AS blocked_query,
    blocking_activity.pid   AS blocking_pid,
    blocking_activity.query AS blocking_query
FROM pg_locks blocked_locks
JOIN pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_locks blocking_locks
  ON blocked_locks.relation = blocking_locks.relation
 AND blocking_locks.granted
 AND blocking_locks.pid != blocked_locks.pid
JOIN pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
```
5.3 Buffer pool sizing
- Postgres shared_buffers: 25% of RAM (the 128 MB default is too small)
- effective_cache_size: total RAM available for caching (~75% of RAM)
- work_mem: per-operation sort/hash memory (16-64 MB)
- maintenance_work_mem: VACUUM, CREATE INDEX (1-2 GB)
5.4 Connection pooling
Without pooling:
10K clients × 10MB = 100GB RAM (Postgres process-per-conn)
With PgBouncer:
10K clients ↔ PgBouncer ↔ 100 backend connections
Memory: 100 × 10MB = 1GB
5.5 Replication tuning
- wal_level = replica for streaming replication
- max_wal_senders = number of replicas + headroom
- wal_keep_size to retain WAL for lagging replicas
- synchronous_standby_names for synchronous replication
6. Code Examples
6.1 Demonstrate B-tree vs LSM behavior
"""
Compare write/read performance of B-tree (sqlite) vs LSM (rocksdb).
"""
import sqlite3
import rocksdb
import time
import random
N = 100_000
# B-tree (SQLite)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
start = time.time()
for i in range(N):
conn.execute("INSERT INTO kv VALUES (?, ?)", (f"key{i}", f"val{i}"))
conn.commit()
print(f"SQLite write: {N / (time.time() - start):.0f} ops/s")
start = time.time()
for _ in range(N):
k = f"key{random.randint(0, N-1)}"
conn.execute("SELECT v FROM kv WHERE k = ?", (k,)).fetchone()
print(f"SQLite read: {N / (time.time() - start):.0f} ops/s")
# LSM (RocksDB); note the batched write amortizes per-op overhead
db = rocksdb.DB("/tmp/rocks.db", rocksdb.Options(create_if_missing=True))
start = time.time()
batch = rocksdb.WriteBatch()
for i in range(N):
batch.put(f"key{i}".encode(), f"val{i}".encode())
db.write(batch)
print(f"RocksDB write: {N / (time.time() - start):.0f} ops/s")
start = time.time()
for _ in range(N):
k = f"key{random.randint(0, N-1)}".encode()
db.get(k)
print(f"RocksDB read: {N / (time.time() - start):.0f} ops/s")
# Typical results (illustrative, hardware-dependent):
#   SQLite write:  30,000 ops/s    RocksDB write: 200,000 ops/s
#   SQLite read:  100,000 ops/s    RocksDB read:   50,000 ops/s
```
6.2 Show MVCC bloat
```sql
-- Setup
CREATE TABLE bloat_test (id SERIAL PRIMARY KEY, val INT);
INSERT INTO bloat_test (val) SELECT generate_series(1, 100000);
-- Initial size
SELECT pg_size_pretty(pg_relation_size('bloat_test'));
-- ~3.5 MB
-- Update all rows
UPDATE bloat_test SET val = val + 1;
-- Now size doubled (old + new versions)
SELECT pg_size_pretty(pg_relation_size('bloat_test'));
-- ~7 MB
-- More updates
UPDATE bloat_test SET val = val + 1;
UPDATE bloat_test SET val = val + 1;
SELECT pg_size_pretty(pg_relation_size('bloat_test'));
-- ~14 MB (4x original!)
-- VACUUM (mark for reuse, doesn't shrink)
VACUUM bloat_test;
SELECT pg_size_pretty(pg_relation_size('bloat_test'));
-- Still ~14 MB
-- VACUUM FULL (shrinks but locks)
VACUUM FULL bloat_test;
SELECT pg_size_pretty(pg_relation_size('bloat_test'));
-- Back to ~3.5 MB
```
6.3 Index usage analysis
```sql
-- Find unused indexes (Postgres)
SELECT
schemaname, tablename, indexname,
idx_scan as scans,
pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
-- Find missing indexes (large seq scans)
SELECT
schemaname, tablename,
seq_scan, idx_scan,
pg_size_pretty(pg_relation_size(relid)) AS size
FROM pg_stat_user_tables
WHERE seq_scan > idx_scan AND seq_scan > 100
ORDER BY seq_scan DESC;
```
7. System Design Diagrams
7.1 B-tree vs LSM Write Path
```mermaid
flowchart TB
    subgraph BTree["B-tree Write"]
        BW[Write request] --> BWAL[WAL append<br/>sequential]
        BW --> BPage[Find leaf page<br/>random I/O]
        BPage --> BUpdate[In-place update]
        BUpdate --> BSplit{Page full?}
        BSplit -->|Yes| BSplit2[Split, propagate<br/>multiple writes]
        BSplit -->|No| BDone[Done]
        BSplit2 --> BDone
    end
    subgraph LSM["LSM Write"]
        LW[Write request] --> LWAL[WAL append<br/>sequential]
        LW --> LMem[Insert into memtable<br/>in-memory]
        LMem --> LDone[Done]
        LMem -.flush async.-> LSST[SSTable L0<br/>sequential write]
        LSST -.compact async.-> LL1[SSTable L1]
    end
    style LSM fill:#c8e6c9
    style BTree fill:#fff9c4
```
7.2 LSM Architecture
```mermaid
flowchart TB
    Write[Write] --> Mem[Memtable<br/>in-memory<br/>sorted]
    Write --> WAL[(WAL)]
    Mem -.flush.-> L0_1[SSTable L0]
    Mem -.flush.-> L0_2[SSTable L0]
    Mem -.flush.-> L0_3[SSTable L0]
    L0_1 -.compact.-> L1_1[SSTable L1<br/>10x larger]
    L0_2 -.compact.-> L1_1
    L0_3 -.compact.-> L1_1
    L1_1 -.compact.-> L2_1[SSTable L2<br/>100x larger]
    Read[Read] --> Mem
    Read --> L0_1
    Read --> L0_2
    Read --> L0_3
    Read --> L1_1
    Read --> L2_1
    BloomFilter[Bloom Filter<br/>per SSTable] -.skip non-matching.-> L0_1
    BloomFilter -.-> L1_1
    BloomFilter -.-> L2_1
```
7.3 MVCC Visibility
```mermaid
flowchart LR
    subgraph Versions["Row Versions"]
        V1["v1: value=A<br/>xmin=10<br/>xmax=20"]
        V2["v2: value=B<br/>xmin=20<br/>xmax=30"]
        V3["v3: value=C<br/>xmin=30<br/>xmax=NULL"]
        V1 -.updated by.-> V2
        V2 -.updated by.-> V3
    end
    T15[Txn 15 snapshot] -->|visible| V1
    T15 -.invisible.-> V2
    T15 -.invisible.-> V3
    T25[Txn 25 snapshot] -.invisible.-> V1
    T25 -->|visible| V2
    T25 -.invisible.-> V3
    T35[Txn 35 snapshot] -.invisible.-> V1
    T35 -.invisible.-> V2
    T35 -->|visible| V3
    style V1 fill:#ffe0b2
    style V2 fill:#fff9c4
    style V3 fill:#c8e6c9
```
7.4 Row vs Columnar
```mermaid
flowchart LR
    subgraph Row["Row-oriented"]
        R1["Row1: id|name|age|country"]
        R2["Row2: id|name|age|country"]
        R3["Row3: id|name|age|country"]
        R4["Row4: id|name|age|country"]
        Note1["Read row = 1 disk seek<br/>Best for OLTP"]
    end
    subgraph Col["Columnar"]
        C1["id: [1,2,3,4,5,...]"]
        C2["name: [...]"]
        C3["age: [25,30,28,35,22,...]"]
        C4["country: [VN,US,VN,JP,...]"]
        Note2["AVG(age) reads only age column<br/>10-100x faster<br/>Best for OLAP"]
    end
    style Row fill:#fff9c4
    style Col fill:#c8e6c9
```
8. Aha Moments & Pitfalls
Aha Moments
#1: B-tree vs LSM = read-optimized vs write-optimized. Both ACID. Choose based on workload. Most OLTP DBs B-tree, most NoSQL/wide-column LSM.
#2: WAL fsync is fundamental bottleneck. Commit throughput = fsync rate. NVMe enables 10-100x more commits/sec than HDD.
#3: MVCC eliminates read-write blocking. Readers see snapshot, writers create new versions. Cost: bloat → VACUUM.
#4: Postgres “REPEATABLE READ” = Snapshot Isolation (SI), not the SQL-standard RR. Use SERIALIZABLE for true serializability. (See Tuan-Bonus-Consistency-Models-Isolation)
#5: Compaction debt kills LSM systems. Writes faster than compaction → SSTables pile up. Monitor and tune.
#6: Columnar gives 10-100x for analytics. Same data, different layout, different workload. Why ClickHouse, Snowflake exist.
#7: Buffer pool hit rate >90% is mandatory. <80% = mostly disk reads = slow. Tune shared_buffers aggressively.
#8: Stats determine plan quality. Bad stats → bad plan → 1000x slow. Run ANALYZE regularly.
Pitfalls
Pitfall 1: Over-indexing
Adding indexes “just in case” → write amplification, space waste. Fix: Only index columns used in WHERE/JOIN/ORDER BY of slow queries.
Pitfall 2: ORDER BY without index
ORDER BY created_at on a 10M-row table without an index → sorts 10M rows on every query. Fix: Index the sort column.
Pitfall 3: SELECT *
Reads all columns → bloats network, harms cache. Fix: Explicit column list.
Pitfall 4: Forgot ANALYZE
Bulk insert 10M rows → no ANALYZE → optimizer thinks empty → terrible plan. Fix: ANALYZE after bulk operations.
Pitfall 5: VACUUM disabled
“Save resources” → autovacuum off → bloat grows → queries slow → space exhausted. Fix: NEVER disable autovacuum. Tune autovacuum_vacuum_scale_factor instead.
Pitfall 6: Long-running transactions
Block VACUUM from reclaiming dead tuples → bloat grows. Fix: Monitor pg_stat_activity, kill long-running queries (statement_timeout).
Pitfall 7: N+1 query
Loop in app, query per iter → 1000 queries instead of 1. Fix: Single query with JOIN, or batch IN list.
Pitfall 8: Connection without pooling
10K clients × 10MB = 100GB RAM exhausted. Fix: PgBouncer in transaction mode.
Pitfall 9: Wrong index column order
(city, country) when most queries filter only WHERE country='VN' → index can’t be used (country is not the leading column). Fix: Match column order to your query patterns; verify with EXPLAIN.
Pitfall 10: Mistaking row-DB for analytics
Running SELECT AVG(amount) over a 1B-row table on Postgres → 30 minutes. Fix: Move analytics to a columnar engine (ClickHouse, BigQuery, Lakehouse).
9. Internal Links
| Topic | Connects to |
|---|---|
| Tuan-07-Database-Sharding-Replication | Sharding, replication patterns |
| Tuan-Bonus-Consistency-Models-Isolation | MVCC, isolation levels |
| Tuan-Bonus-Outbox-Pattern | WAL-based CDC (Debezium reads WAL) |
| Case-Design-Modern-Data-Lakehouse | Parquet, columnar at scale |
| Tuan-Foundations-OS-Essentials | fsync, mmap, page cache |
| Tuan-Foundations-Computer-Architecture | Cache-aware query execution |
| Tuan-Bonus-Multi-Region-Active-Active-DSQL | Distributed SQL internals |
References
Books:
- Database Internals (Alex Petrov, 2019) — comprehensive
- Designing Data-Intensive Applications Ch.3, 7 (Kleppmann)
- Use The Index, Luke! (free, Markus Winand) — https://use-the-index-luke.com/
- High Performance MySQL (4th ed)
- PostgreSQL: Up & Running (Obe & Hsu)
Papers:
- O’Neil et al., The LSM-Tree (1996)
- Mohan et al., ARIES (1992)
- Bayer & McCreight, Organization and Maintenance of Large Ordered Indexes (1972) — original B-tree
- Stonebraker, The Design and Implementation of POSTGRES (1986)
Online:
- CMU 15-445 Database Systems — https://15445.courses.cs.cmu.edu/
- Andy Pavlo’s talks on YouTube
- The Internals of PostgreSQL (Hironobu Suzuki) — https://www.interdb.jp/pg/
- RocksDB Wiki — https://github.com/facebook/rocksdb/wiki
Tools:
- pgAdmin, pgcli for Postgres
- pg_stat_statements extension
- pgaudit for audit
- pg_repack for online VACUUM FULL
- Percona Toolkit for MySQL
Next: Tuan-Foundations-Compilers-VMs — Compilers, bytecode, GC, JIT.