Tuần 03 — Indexing Mastery
“Mid-level dev: ‘Query chậm, thêm index.’ Senior: ‘Query chậm vì index không match access pattern. Đây là index đúng.’ Staff: ‘Query chậm vì 3 lý do, mình bỏ 2 index thừa, thêm 1 covering index, redesign schema để index-only scan work, tiết kiệm 40% disk.‘”
Tags: database postgresql indexing b-tree performance Thời lượng: 7 ngày (5-7h/ngày — tuần dày) Prerequisites: Tuan-01-DB-Internals-Refresh (B-tree page layout), Tuan-02-Schema-Design-Normalization (schema sạch trước khi index) Liên quan: Tuan-04-Query-Optimization-EXPLAIN · Tuan-09-DB-Observability-Tuning
1. Context & Why
1.1 Vì sao tuần này là xương sống
Khả năng “đọc một query và biết index nào cần thiết” là kỹ năng phân biệt mid-level và senior. Trong production:
- 60-80% bug performance gốc là missing/wrong index
- 10-20% là redundant/unused index ăn write throughput
- 5-10% là bloated index không reindex
Hiểu indexing = giải quyết 80% bottleneck DB tự thân.
1.2 Mental model — Index không phải miễn phí
graph LR A[Query SELECT] -->|index hits| B[Index lookup<br/>O(log N)] A -->|no index| C[Sequential scan<br/>O(N)] B -.->|fetch heap| D[Heap access<br/>1 random I/O] E[INSERT/UPDATE/DELETE] --> F[Maintain main heap] E --> G[Maintain ALL indexes] G --> H[+1 write per index] G --> I[+WAL volume] G --> J[Lock contention] style C fill:#ffccbc style G fill:#fff9c4
Mỗi index thêm:
- ✅ Đọc nhanh khi match access pattern
- ❌ +1 write per index per row mutation
- ❌ +N MB disk
- ❌ +WAL volume
- ❌ +shared_buffers space
- ❌ Maintenance: bloat, REINDEX
Rule vàng: Add index khi có bằng chứng (slow query, EXPLAIN), không phải khi “phòng hờ”.
1.3 Mục tiêu tuần
- Hiểu sâu B-tree, GIN, GiST, BRIN, Hash, SP-GiST — pick đúng cho từng workload
- Master multi-column index: thứ tự column, ESR rule
- Tự tin design covering index để đạt index-only scan
- Partial index, expression index — pattern thực chiến
- Detect và xóa unused/redundant index
- Manage bloat: monitor, REINDEX CONCURRENTLY
- Đọc
pg_stat_user_indexesnhư đọc bản đồ
1.4 Tham chiếu
- Use The Index, Luke! — Markus Winand (free, must-read) — https://use-the-index-luke.com/
- PostgreSQL Documentation: Indexes — https://www.postgresql.org/docs/current/indexes.html
- The Art of PostgreSQL — Dimitri Fontaine, Ch. Indexing
- PostgreSQL 14 Internals — Egor Rogov, Part 3
- Citus blog series on indexes — https://www.citusdata.com/blog/
2. Index Types — Map of Choices
mindmap root((Postgres<br/>Index<br/>Types)) B-tree Default Equality + Range Sortable Most queries Hash Equality only WAL-logged from PG10 Niche use GIN Array, JSONB Full-text search Trigram Multi-value column GiST Geometric Range types Trigram alt EXCLUDE constraint SP-GiST Space partitioned IP, hierarchical Less common BRIN Block Range Huge tables Naturally ordered data Specialized Bloom RUM pg_trgm
2.1 B-tree — workhorse
Default. Support:
- Equality:
= - Range:
<,<=,>,>=,BETWEEN - IN list
- Pattern matching:
LIKE 'prefix%'(NOTLIKE '%suffix') - NULL handling:
IS NULL,IS NOT NULL - Sort: ORDER BY uses index nếu match column order
- Joins: hash/merge/nested loop, planner pick
Structure (refresh từ Tuần 01):
graph TD Root["Root: keys [40, 80]"] L1["[10, 20, 30]"] L2["[50, 60, 70]"] L3["[90, 100, 110]"] Leaf1["[10→ctid] [15→ctid] [20→ctid]"] Leaf2["[50→ctid] [55→ctid] [60→ctid]"] Leaf3["[90→ctid] [95→ctid] [100→ctid]"] Root --> L1 Root --> L2 Root --> L3 L1 --> Leaf1 L2 --> Leaf2 L3 --> Leaf3 Leaf1 -.linked list.-> Leaf2 -.-> Leaf3
Postgres B-tree là B+ tree: data trong leaf, leaf linked list cho range scan.
Depth: 3-4 cho table tỷ rows. Lookup cost: 3-4 random I/O worst case, 0-1 với cache hot.
2.2 Hash — niche
Equality only. Trước PG10 không WAL-logged → không crash safe → ít dùng. Từ PG10 đã fix.
Khi nào dùng vs B-tree:
- Index size nhỏ hơn ~5-10% B-tree (cùng data)
- Lookup nhanh hơn marginal cho exact equality
- Không sortable, không range
Trong thực tế 2024: 99% case B-tree tốt hơn. Hash hiếm khi pick.
2.3 GIN — Generalized Inverted Index
GIN inverted index — như Google search. Một entry → list of doc IDs có entry đó.
Use cases:
- Array:
tags text[] - JSONB:
metadata jsonb - Full-text search:
tsvector - Trigram:
pg_trgmcho fuzzy search
CREATE TABLE articles (id bigint, tags text[], content text);
CREATE INDEX idx_articles_tags ON articles USING gin(tags);
CREATE INDEX idx_articles_fts ON articles USING gin(to_tsvector('english', content));
-- Query
SELECT * FROM articles WHERE tags @> ARRAY['rust'];
SELECT * FROM articles WHERE to_tsvector('english', content) @@ to_tsquery('postgres');Trade-off GIN:
- Lookup very fast
- Write slow (multiple entries per row)
- Update tốn (rebuild posting list)
- Index size lớn (overhead per token)
Pattern 2024 — pending list: GIN có gin_pending_list_limit (default 4MB). Writes go to pending list trước, periodic merge. Giảm write amplification.
-- Tune cho high-write
ALTER INDEX idx_articles_tags SET (fastupdate = on, gin_pending_list_limit = '32MB');2.4 GiST — Generalized Search Tree
Framework cho indexes “không-standard”: geometric, range, custom.
Use cases:
- PostGIS: spatial queries
- Range types:
tstzrange && tstzrange - EXCLUDE constraints
- Trigram (
pg_trgmcũng support GiST) - KNN search (
ORDER BY point <-> point LIMIT 10)
CREATE TABLE bookings (
id bigint, room_id int, during tstzrange,
EXCLUDE USING gist (room_id WITH =, during WITH &&) -- needs GiST
);
CREATE TABLE places (id bigint, location geography);
CREATE INDEX idx_places_loc ON places USING gist(location);
SELECT * FROM places ORDER BY location <-> ST_MakePoint(-122.4, 37.7) LIMIT 10;KNN search (k-nearest neighbor): GiST có thể sort theo distance trong index. Cực mạnh cho map/geo queries.
2.5 SP-GiST — Space-Partitioned
Cho data nội tại phân vùng không cân bằng: IP addresses, phone numbers, hierarchical paths.
CREATE TABLE access_log (id bigint, ip inet);
CREATE INDEX idx_access_log_ip ON access_log USING spgist(ip);
SELECT * FROM access_log WHERE ip << '192.168.0.0/16'; -- subnet containsNiche. Khi không chắc, GiST hoặc B-tree thường đủ.
2.6 BRIN — Block Range Index
Lưu min/max của mỗi N block (default 128 = 1MB). Cực nhỏ — index 100GB table chỉ vài MB.
Điều kiện hoạt động: Data phải naturally ordered theo column. Vd: created_at của events table, append-only.
CREATE TABLE events (
id bigint, occurred_at timestamptz, payload jsonb
);
CREATE INDEX idx_events_brin ON events USING brin(occurred_at);
SELECT * FROM events WHERE occurred_at BETWEEN '2026-05-01' AND '2026-05-02';Trade-off:
- ✅ Index siêu nhỏ
- ✅ Build fast
- ❌ Chỉ work với clustered data — nếu data shuffled, BRIN useless
- ❌ Không thay được B-tree cho equality lookup nhanh
Use case: time-series, append-only logs, partitioned tables.
2.7 Bloom (extension)
CREATE EXTENSION bloom;
Index nhiều column, mỗi column equality. Approximate (false positive possible).
CREATE INDEX idx_widgets_bloom ON widgets USING bloom(c1, c2, c3, c4, c5)
WITH (length=80, col1=2, col2=2, col3=4, col4=2, col5=2);
SELECT * FROM widgets WHERE c1 = 1 AND c3 = 100 AND c5 = 5;Use case: bảng nhiều column, query mix equality không predict được. Hiếm.
2.8 Decision tree — pick index type
flowchart TD A[Query pattern?] --> B{Equality + range trên scalar?} B -->|Yes| B1[B-tree] A --> C{Array / JSONB containment?} C -->|Yes| C1[GIN] A --> D{Full-text search?} D -->|Yes| D1[GIN on tsvector] A --> E{Fuzzy text / similarity?} E -->|Yes| E1[GIN/GiST + pg_trgm] A --> F{Geometric / spatial?} F -->|Yes| F1[GiST + PostGIS] A --> G{Range overlap?} G -->|Yes| G1[GiST] A --> H{Append-only time-series huge table?} H -->|Yes| H1[BRIN] A --> I{Vector similarity / embedding?} I -->|Yes| I1[HNSW or IVFFlat<br/>via pgvector] style B1 fill:#c8e6c9 style C1 fill:#c8e6c9 style I1 fill:#fff9c4
(Vector index sẽ học sâu Tuan-15-Vector-DB-AI)
3. Multi-column Index — Nghệ thuật thứ tự
3.1 Vấn đề kinh điển
Có query:
SELECT * FROM orders WHERE user_id = $1 AND status = 'pending' ORDER BY created_at DESC;Index nào tốt?
- A:
(user_id) - B:
(user_id, status) - C:
(user_id, status, created_at) - D:
(status, user_id, created_at) - E:
(created_at, user_id, status)
3.2 Rule ESR — Equality, Sort, Range
Markus Winand’s pattern:
- Equality columns trước (column với
=) - Sort column tiếp theo (column trong ORDER BY)
- Range column cuối (column với
<,>,BETWEEN)
Áp dụng cho query trên:
- Equality:
user_id,status - Sort:
created_at - Range: không có
→ Index tối ưu: (user_id, status, created_at)
Tại sao C thắng D?
- Cả 2 cùng cover (user_id, status, created_at)
- Nhưng D đặt
statustrướcuser_id. Nếuuser_idselectivity cao (mỗi user có nhiều order), đặt user_id trước giúp narrow xuống nhanh hơn.
Tại sao C thắng E?
- E đặt
created_attrước. Equalityuser_id = Xkhông thể skip trong B-tree khi column đầu làcreated_atrange/sort. - Index B-tree là multi-dimensional theo column order. Phải dùng prefix.
3.3 Left-prefix rule
Index (a, b, c) hỗ trợ:
WHERE a = $1WHERE a = $1 AND b = $2WHERE a = $1 AND b = $2 AND c = $3WHERE a = $1 ORDER BY b, cWHERE a = $1 AND b > $2 ORDER BY c(range trên b)
KHÔNG hỗ trợ (efficient):
WHERE b = $1(skip a)WHERE c = $1WHERE a > $1 AND b = $2(range a → không dùng được b cho equality lookup nữa)
3.4 Khi đảo column order
Query 1: WHERE country = 'VN' AND status = 'active' (1M VN users, 100K active)
Query 2: WHERE country = 'VN' AND created_at > '2026-01-01'
Index (country, status) tốt cho Q1. Q2 cần (country, created_at).
Cân nhắc gộp (country, status, created_at):
- Cover Q1 (prefix)
- Cover Q2? Không hiệu quả — sau
country, B-tree sort theostatustrước, không phảicreated_at. Range oncreated_atcần scan tất cả status.
→ Đôi khi cần 2 index riêng: (country, status) và (country, created_at).
3.5 PG17/18 — SAOP + Skip Scan
PG17 thực tế ship: Cải thiện handling của SAOP (Scalar Array Op — IN (...), = ANY(...)) trong B-tree. Planner có thể “advance to next array element” hiệu quả hơn → query với IN (long list) nhanh hơn rõ rệt.
Skip scan thực sự (Peter Geoghegan patch): Landed vào master cuối 2024, slated for PG18 (Sept 2025). Cho phép skip qua leading column low-cardinality:
-- Index: (gender, last_name)
-- Query: WHERE last_name = 'Smith'
-- Pre-PG18: seq scan or full index scan
-- PG18+: skip scan - iterate distinct gender values, lookup last_name within eachHữu ích khi leading column ít distinct values. Không thay được proper index design — chỉ safety net cho queries không match prefix.
Trên PG17 đang xài: workaround là tạo composite index match query, hoặc tạo nhiều partial index.
3.6 Demo lab
CREATE TABLE orders (
id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
user_id bigint NOT NULL,
status text NOT NULL,
created_at timestamptz NOT NULL DEFAULT now(),
total numeric
);
INSERT INTO orders (user_id, status, total)
SELECT
(random() * 10000)::bigint,
(ARRAY['pending', 'paid', 'shipped', 'cancelled'])[1 + (random() * 3)::int],
(random() * 1000)::numeric
FROM generate_series(1, 1000000);
-- No index
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE user_id = 42 AND status = 'pending' ORDER BY created_at DESC LIMIT 10;
-- Seq Scan, ~50ms
-- Index A
CREATE INDEX idx_a ON orders(user_id);
EXPLAIN (ANALYZE, BUFFERS) <query>;
-- Index Scan on idx_a, ~5ms
-- Index B
CREATE INDEX idx_b ON orders(user_id, status, created_at DESC);
EXPLAIN (ANALYZE, BUFFERS) <query>;
-- Index Scan on idx_b, ~0.2ms (winner)
DROP INDEX idx_a; -- planner sẽ pick idx_b4. Covering Index & Index-Only Scan
4.1 Index-only scan là gì
Bình thường:
- Planner dùng index → tìm ctid
- Fetch heap page → return tuple
Index-only scan: planner đọc tất cả column cần thiết từ index, không fetch heap.
Điều kiện:
- Query chỉ reference columns có trong index
- Visibility map nói page là “all-visible” (Tuần 01)
4.2 INCLUDE — covering index
PG11+ syntax:
-- Index B-tree key: (user_id), payload: total, created_at
CREATE INDEX idx_orders_covering
ON orders(user_id) INCLUDE (total, created_at);INCLUDE columns:
- Lưu trong leaf nhưng không là key (không sortable, không equality lookup)
- Cho phép index-only scan với query select chúng
EXPLAIN (ANALYZE, BUFFERS)
SELECT total, created_at FROM orders WHERE user_id = 42;
-- Index Only Scan using idx_orders_covering
-- Heap Fetches: 0 ← perfect index-only scan4.3 Khi nào dùng covering index
flowchart TD A[Query] --> B{Hot path?<br/>>1000 qps?} B -->|No| B1[Đừng add covering<br/>tăng size không xứng] B -->|Yes| C{Select bao nhiêu column?} C -->|1-3 column nhỏ| D[Add INCLUDE<br/>+payload] C -->|Nhiều column / big TEXT| E[Đừng INCLUDE<br/>index sẽ phình] D --> F{VM coverage?} F -->|>95% all-visible| G[Index-only scan work] F -->|<95%| H[Tune autovacuum<br/>để VM stay fresh] style G fill:#c8e6c9 style B1 fill:#fff9c4 style E fill:#ffccbc
Trade-off:
- ✅ Tiết kiệm 1 random I/O per row (heap fetch)
- ❌ Index size tăng (lưu thêm columns)
- ❌ Update bất kỳ column nào trong INCLUDE → maintain index
4.4 Heap Fetches — kẻ thù của index-only scan
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
-- Index Only Scan ...
-- Heap Fetches: 1234 ← bad! VM staleHeap Fetches > 0 nghĩa là index-only scan không thực sự là index-only — phải fetch heap vì VM bit chưa set.
Fix:
VACUUM (VERBOSE) orders;
-- Wait until autovacuum run, hoặc tune scale_factorPattern 2024: high-write table cần VACUUM thường xuyên hơn để keep VM fresh.
ALTER TABLE orders SET (
autovacuum_vacuum_scale_factor = 0.05, -- 5% instead of 20%
autovacuum_vacuum_insert_scale_factor = 0.05 -- PG13+: trigger by inserts too
);5. Partial Index
5.1 Concept
Index chỉ một subset rows.
-- Chỉ index active orders
CREATE INDEX idx_orders_active
ON orders(user_id, created_at)
WHERE status IN ('pending', 'paid', 'shipping');Khi nào dùng:
- Hot subset nhỏ trong table to (vd: 5% active, 95% archived)
- Filter condition luôn xuất hiện trong query
- Soft-delete pattern:
WHERE deleted_at IS NULL
Lợi ích:
- Index nhỏ → fit cache → faster
- Write cheaper (chỉ archived rows tránh maintain)
- Disk save lớn
5.2 Production patterns
-- 1. Soft delete
CREATE INDEX idx_users_email ON users(email) WHERE deleted_at IS NULL;
-- 2. Unique constraint cho subset
CREATE UNIQUE INDEX idx_one_default_per_user
ON addresses(user_id) WHERE is_default = true;
-- 3. Skip null-heavy column
CREATE INDEX idx_orders_payment_ref ON orders(payment_ref) WHERE payment_ref IS NOT NULL;
-- 4. Status workflow
CREATE INDEX idx_jobs_pending ON jobs(created_at) WHERE status = 'pending';
-- Job queue worker query: SELECT ... WHERE status='pending' ORDER BY created_at LIMIT 1005.3 Gotcha — query phải match WHERE clause
CREATE INDEX idx_partial ON orders(user_id) WHERE status = 'active';
-- DÙNG ĐƯỢC
SELECT * FROM orders WHERE user_id = 42 AND status = 'active';
-- KHÔNG DÙNG ĐƯỢC (planner không chắc partial bao trùm)
SELECT * FROM orders WHERE user_id = 42; -- Có thể có rows status != 'active'
SELECT * FROM orders WHERE user_id = 42 AND status IN ('active', 'paused');Predicate trong query phải imply predicate trong index.
6. Expression Index (Functional Index)
6.1 Concept
Index trên kết quả của function/expression, không phải column raw.
-- Case-insensitive lookup
CREATE INDEX idx_users_email_lower ON users(lower(email));
SELECT * FROM users WHERE lower(email) = lower($1);
-- Dùng được indexFunction phải IMMUTABLE — return same result cho same input, không có side effect.
6.2 Patterns hữu ích
-- 1. Date truncation
CREATE INDEX idx_events_day ON events(date_trunc('day', occurred_at));
SELECT * FROM events WHERE date_trunc('day', occurred_at) = '2026-05-16';
-- 2. JSONB extraction
CREATE INDEX idx_users_country ON users((metadata->>'country'));
SELECT * FROM users WHERE metadata->>'country' = 'VN';
-- 3. Computed value
CREATE INDEX idx_orders_revenue ON orders((qty * unit_price));
-- 4. Full-text search column
CREATE INDEX idx_products_search ON products USING gin(to_tsvector('english', name || ' ' || description));6.3 Combine với partial
-- Tag count cho active products
CREATE INDEX idx_products_tag_count
ON products(array_length(tags, 1))
WHERE status = 'active';
SELECT * FROM products
WHERE status = 'active' AND array_length(tags, 1) > 5;7. Specialized Indexes 2024-2026
7.1 pg_trgm — Trigram fuzzy search
CREATE EXTENSION pg_trgm;
CREATE INDEX idx_users_name_trgm ON users USING gin(name gin_trgm_ops);
-- Fuzzy match
SELECT * FROM users WHERE name ILIKE '%nguyen%'; -- ngày xưa seq scan, giờ index
SELECT * FROM users ORDER BY similarity(name, 'Nguyen') DESC LIMIT 10;Trigram = 3-char substring. “hello” → {’ h’, ’ he’, ‘hel’, ‘ell’, ‘llo’, ‘lo ’, ‘o ’}. Hai chuỗi share trigrams → similar.
Cực mạnh cho search-as-you-type, typo tolerance, fuzzy autocomplete.
7.2 PostGIS — spatial
CREATE EXTENSION postgis;
CREATE TABLE shops (
id bigint,
location geography(POINT, 4326)
);
CREATE INDEX idx_shops_location ON shops USING gist(location);
-- Within 5km of point
SELECT * FROM shops
WHERE ST_DWithin(location, ST_MakePoint(-122.4, 37.7)::geography, 5000);
-- K-nearest
SELECT * FROM shops
ORDER BY location <-> ST_MakePoint(-122.4, 37.7)::geography
LIMIT 10;7.3 pgvector — vector similarity (preview Tuần 15)
CREATE EXTENSION vector;
CREATE TABLE documents (
id bigint,
embedding vector(1536)
);
-- HNSW (PG14+ với pgvector 0.5+)
CREATE INDEX ON documents USING hnsw(embedding vector_cosine_ops);
SELECT * FROM documents ORDER BY embedding <=> $1 LIMIT 10;Sẽ đào sâu Tuan-15-Vector-DB-AI.
7.4 RUM (extension) — alternative GIN
Better cho full-text search với ranking + ordering:
CREATE EXTENSION rum;
CREATE INDEX idx_docs_rum ON docs USING rum(fts_vector rum_tsvector_ops);
SELECT *, ts_rank(fts_vector, query) AS rank
FROM docs, to_tsquery('english', 'postgres') query
WHERE fts_vector @@ query
ORDER BY fts_vector <=> query LIMIT 10; -- distance-based ordering trong index8. Index Maintenance & Bloat
8.1 Index bloat — vấn đề kinh điển
Mỗi UPDATE/DELETE → dead tuple → index entry trỏ tới dead tuple. VACUUM dọn nhưng không reclaim disk.
Hậu quả:
- Index size grow theo write rate
- Random I/O nhiều hơn (sparse pages)
- Cache hit ratio giảm
8.2 Đo bloat
-- pgstattuple
CREATE EXTENSION pgstattuple;
SELECT * FROM pgstattuple('idx_orders_user_id');
-- avg_leaf_density: % - <50% là bloated nặng
-- Quick estimate
SELECT
indexrelname AS index,
pg_size_pretty(pg_relation_size(indexrelid)) AS size,
idx_scan,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY pg_relation_size(indexrelid) DESC;Production tool: pgexperts/pgx_scripts/bloat/ hoặc pg_repack.
8.3 REINDEX CONCURRENTLY (PG12+)
Cách an toàn rebuild index trên production:
-- Lock acquired ngắn, không block reads/writes
REINDEX INDEX CONCURRENTLY idx_orders_user_id;
REINDEX TABLE CONCURRENTLY orders;
REINDEX DATABASE CONCURRENTLY appdb;Trade-off:
- ✅ No long lock
- ❌ Chậm hơn REINDEX thường (~2x)
- ❌ Cần disk space gấp đôi (build new index trước khi drop old)
- ❌ Nếu fail giữa chừng, leave invalid index → cần cleanup
8.4 Detect invalid index
SELECT
schemaname || '.' || indexrelname AS index,
pg_size_pretty(pg_relation_size(indexrelid))
FROM pg_stat_user_indexes
WHERE indexrelid IN (
SELECT indexrelid FROM pg_index WHERE indisvalid = false
);
-- Cleanup
DROP INDEX CONCURRENTLY idx_invalid;8.5 Schedule maintenance
# crontab: rebuild bloated indexes weekly off-peak
0 3 * * 0 psql -c "REINDEX TABLE CONCURRENTLY orders;"Hoặc tool pg_repack:
pg_repack -t orders --jobs 4pg_repack ưu việt vì cũng repack heap (table itself bloat).
9. Unused & Redundant Indexes — kẻ ngầm phá
9.1 Find unused
SELECT
schemaname || '.' || relname AS table,
indexrelname AS index,
pg_size_pretty(pg_relation_size(indexrelid)) AS size,
idx_scan AS scans,
idx_tup_read,
idx_tup_fetch
FROM pg_stat_user_indexes
WHERE idx_scan < 50 -- ít scan
AND pg_relation_size(indexrelid) > 10 * 1024 * 1024 -- >10MB
AND indexrelname NOT LIKE '%_pkey' -- skip PK
ORDER BY pg_relation_size(indexrelid) DESC;Cẩn thận:
- Reset stats khi cluster restart → đợi vài ngày sau restart mới đo
- Index dùng cho rare reports vẫn quan trọng
- Unique constraint indexes — không drop dù idx_scan thấp
9.2 Find redundant (subset)
Index A redundant nếu A là prefix của B:
idx_a: (user_id)
idx_b: (user_id, status)
→ idx_b cover idx_a. Drop idx_a.
Query find redundant:
WITH idx_info AS (
SELECT
n.nspname AS schema,
t.relname AS table,
i.relname AS index,
array_to_string(array_agg(a.attname ORDER BY x.n), ',') AS cols
FROM pg_index ix
JOIN pg_class t ON t.oid = ix.indrelid
JOIN pg_class i ON i.oid = ix.indexrelid
JOIN pg_namespace n ON n.oid = t.relnamespace
CROSS JOIN LATERAL unnest(ix.indkey) WITH ORDINALITY AS x(attnum, n)
JOIN pg_attribute a ON a.attrelid = t.oid AND a.attnum = x.attnum
WHERE n.nspname NOT IN ('pg_catalog','information_schema')
GROUP BY 1,2,3
)
SELECT a.schema, a.table, a.index AS redundant_with, a.cols, b.index AS covers_via, b.cols
FROM idx_info a JOIN idx_info b
ON a.schema = b.schema AND a.table = b.table
AND a.index <> b.index
AND b.cols LIKE a.cols || ',%';9.3 Index opinion vs index plan
Đừng drop blindly. Workflow:
- Drop in staging environment
- Run synthetic workload
- Monitor query plans
- Confirm no regression
- Drop production with
DROP INDEX CONCURRENTLY
10. Common Query Patterns → Index Recipes
10.1 OLTP — equality + sort
-- Query: orders by user, latest first
SELECT * FROM orders WHERE user_id = $1 ORDER BY created_at DESC LIMIT 20;
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC);10.2 Status filter + range
SELECT * FROM jobs WHERE status = 'pending' AND scheduled_at <= now() ORDER BY scheduled_at LIMIT 100;
-- Partial + composite
CREATE INDEX idx_jobs_pending ON jobs(scheduled_at) WHERE status = 'pending';10.3 Search by tag
SELECT * FROM articles WHERE tags @> ARRAY['rust', 'database'];
CREATE INDEX idx_articles_tags ON articles USING gin(tags);10.4 Full-text search
SELECT * FROM products WHERE search_vector @@ plainto_tsquery('iphone case');
-- Generated column + GIN
ALTER TABLE products ADD COLUMN search_vector tsvector
GENERATED ALWAYS AS (to_tsvector('english', name || ' ' || coalesce(description,''))) STORED;
CREATE INDEX idx_products_search ON products USING gin(search_vector);10.5 Time-series
SELECT count(*) FROM events WHERE occurred_at BETWEEN $1 AND $2;
-- BRIN nếu insert-only, sorted
CREATE INDEX idx_events_brin ON events USING brin(occurred_at);10.6 Geospatial
SELECT * FROM shops WHERE ST_DWithin(location, $1, 5000);
CREATE INDEX idx_shops_location ON shops USING gist(location);10.7 JSON path
SELECT * FROM users WHERE preferences @> '{"theme": "dark"}';
CREATE INDEX idx_users_prefs ON users USING gin(preferences);
-- Hoặc cho specific path
CREATE INDEX idx_users_theme ON users((preferences->>'theme'));10.8 Email case-insensitive
SELECT * FROM users WHERE lower(email) = lower($1);
CREATE INDEX idx_users_email_lower ON users(lower(email));
-- HOẶC dùng citext type (Tuần 02)10.9 Pagination cursor
-- Worst: OFFSET pagination
SELECT * FROM orders ORDER BY created_at DESC OFFSET 10000 LIMIT 20;
-- ↑ Skip 10000 rows trong index
-- Best: cursor (keyset pagination)
SELECT * FROM orders
WHERE (created_at, id) < ($last_created_at, $last_id)
ORDER BY created_at DESC, id DESC LIMIT 20;
CREATE INDEX idx_orders_keyset ON orders(created_at DESC, id DESC);10.10 Count-distinct
SELECT count(DISTINCT user_id) FROM events WHERE occurred_at > now() - interval '1 day';
-- Index helps but count(DISTINCT) inherently expensive
-- Alternative: HyperLogLog
CREATE EXTENSION hll;
-- Or pre-aggregate in materialized view11. Index trên different DBs (comparative)
11.1 MySQL InnoDB
- Clustered index (table is the PK B-tree)
- Secondary index lưu PK, không lưu ctid
- Lookup secondary → secondary tìm PK → PK tìm row (2 lookups vs Postgres 1)
→ Trong MySQL, PK lookup faster, secondary slower (extra hop). Trong Postgres, secondary tương đương PK.
11.2 Cassandra/DynamoDB
- Primary key gồm partition key + clustering key
- “Secondary index” rất hạn chế (only local), thường dùng materialized view
→ NoSQL design tập trung vào access pattern trước, schema sau.
11.3 ClickHouse
- Sparse index (mỗi 8192 rows 1 entry) — gần BRIN concept
- Sorted on disk by ORDER BY clause của table
→ Different paradigm. Đào sâu Tuần 14.
12. Lab — Real-world indexing
12.1 Setup
-- Reuse schema từ Tuần 02 lab (e-commerce)
-- Bulk insert 5M orders, 50K users, 100K products
INSERT INTO users (email, name) SELECT 'u' || i || '@x.com', 'User ' || i FROM generate_series(1, 50000) i;
INSERT INTO products (category_id, sku, name, price_cents, currency, status)
SELECT 1 + (i % 100), 'SKU' || i, 'Product ' || i, 100 + i, 'USD', 'active'
FROM generate_series(1, 100000) i;
INSERT INTO orders (user_id, status, currency, subtotal_cents, tax_cents, shipping_cents, shipping_address_id, placed_at)
SELECT
1 + (random() * 49999)::int,
(ARRAY['pending', 'paid', 'shipped', 'cancelled'])[1 + (random() * 3)::int]::order_status,
'USD',
(100 + random() * 10000)::bigint,
0, 0, NULL,
now() - (random() * interval '365 days')
FROM generate_series(1, 5000000);
ANALYZE;12.2 Bài tập 1 — Query plan với và không index
-- Query: 20 đơn mới nhất của user 42
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE user_id = 42 ORDER BY placed_at DESC LIMIT 20;
-- Note time + plan
-- Add B-tree
CREATE INDEX idx_orders_user_placed ON orders(user_id, placed_at DESC);
EXPLAIN (ANALYZE, BUFFERS) <same query>;
-- Compare12.3 Bài tập 2 — Index-only scan
-- Query chỉ select 2 column
SELECT placed_at, total_cents FROM orders WHERE user_id = 42 LIMIT 100;
-- Current plan (with idx_orders_user_placed): Index Scan + heap fetch
-- Improve: covering index
CREATE INDEX idx_orders_user_covering ON orders(user_id) INCLUDE (placed_at, total_cents);
EXPLAIN (ANALYZE, BUFFERS) <query>;
-- Check: Index Only Scan, Heap Fetches = ?12.4 Bài tập 3 — Partial index
-- Hot path: pending orders only
SELECT * FROM orders WHERE status = 'pending' ORDER BY placed_at LIMIT 50;
-- All-rows index
CREATE INDEX idx_orders_status_placed ON orders(status, placed_at);
SELECT pg_size_pretty(pg_relation_size('idx_orders_status_placed'));
-- Partial
CREATE INDEX idx_orders_pending ON orders(placed_at) WHERE status = 'pending';
SELECT pg_size_pretty(pg_relation_size('idx_orders_pending'));
-- Compare sizes. Compare query plans.
DROP INDEX idx_orders_status_placed;
EXPLAIN (ANALYZE, BUFFERS) <query>;12.5 Bài tập 4 — Reproduce bloat & REINDEX
CREATE INDEX idx_test ON orders(placed_at);
SELECT pg_size_pretty(pg_relation_size('idx_test')) AS before;
-- Generate bloat
DO $$ BEGIN
FOR i IN 1..5 LOOP
UPDATE orders SET placed_at = placed_at + interval '1 hour' WHERE id % 10 = 0;
END LOOP;
END $$;
SELECT pg_size_pretty(pg_relation_size('idx_test')) AS after_bloat;
REINDEX INDEX CONCURRENTLY idx_test;
SELECT pg_size_pretty(pg_relation_size('idx_test')) AS after_reindex;12.6 Bài tập 5 — Find unused indexes
-- Reset stats
SELECT pg_stat_reset();
-- Simulate workload (run 10 mins of mixed queries)
-- ... your workload ...
-- Find unused
SELECT indexrelname, idx_scan, pg_size_pretty(pg_relation_size(indexrelid))
FROM pg_stat_user_indexes
WHERE schemaname = 'public' AND idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;12.7 Bài tập 6 — GIN index trên tags
-- Random tags
UPDATE products SET tags = ARRAY(
SELECT (ARRAY['rust','postgres','ml','web','mobile','cloud'])[1 + (random()*5)::int]
FROM generate_series(1, (1 + random()*3)::int)
);
EXPLAIN ANALYZE SELECT * FROM products WHERE tags @> ARRAY['rust'];
-- Seq scan
CREATE INDEX idx_products_tags ON products USING gin(tags);
EXPLAIN ANALYZE SELECT * FROM products WHERE tags @> ARRAY['rust'];
-- Bitmap Heap Scan via GIN13. Anti-patterns
| Pattern | Tại sao tệ | Fix |
|---|---|---|
| Index mọi FK tự động | Một số FK ít query, tốn write | Index FK chỉ khi có truy vấn |
| Index trên column low-cardinality một mình | (status với 3 values) → seq scan vẫn nhanh hơn | Composite với column khác |
| Order column theo “intuition” | (created_at, user_id) khi query equality user_id | ESR rule |
LIKE '%suffix' expect index hit | B-tree không hỗ trợ leading wildcard | Reverse string + index, hoặc trigram |
| GIN trên column thường update | Heavy write amplification | B-tree + fastupdate tuning hoặc đổi access pattern |
| Drop unused index lúc production stress | Có thể là index dùng cho quarterly report | Theo dõi lâu hơn, check pg_stat_statements |
| REINDEX (non-concurrent) trong giờ peak | Block table | REINDEX CONCURRENTLY |
| Function trong WHERE không match expression index | WHERE upper(email) = X mà index trên lower(email) | Match exact expression |
| INCLUDE column lớn (TEXT 10KB) | Index size phình | Tách ra column raw, index không INCLUDE |
| Không ANALYZE sau bulk insert | Planner stats stale → pick wrong index | ANALYZE table_name sau bulk load |
14. Production Gotchas
14.1 Index disable vs drop
Không có “disable index” như MySQL. Cách workaround:
-- "Soft disable" by marking invalid
UPDATE pg_index SET indisvalid = false WHERE indexrelid = 'idx_x'::regclass;
-- Planner skip invalid index
-- Bật lại
UPDATE pg_index SET indisvalid = true WHERE indexrelid = 'idx_x'::regclass;Cẩn thận: trực tiếp update pg_index là hack, không dùng routine.
14.2 CREATE INDEX CONCURRENTLY và lock
CREATE INDEX CONCURRENTLY idx_x ON huge_table(col);- Không block read/write
- Chạy chậm hơn (2 lần scan table)
- Không thể inside transaction
- Nếu fail (vd: duplicate key in unique index) → leave invalid index → DROP & retry
Best practice: monitor progress:
SELECT * FROM pg_stat_progress_create_index;14.3 Index quá lớn không fit vào RAM
Nếu index size > shared_buffers → mỗi lookup có thể disk I/O.
Pattern:
- Hot table small enough: trust shared_buffers
- Hot table large: tune shared_buffers, hoặc partition table → mỗi partition có index nhỏ hơn
14.4 Index trên TOAST’d column
Index không thể trỏ tới TOAST’d value. Postgres tự động compress in main heap nếu cần, hoặc dùng MD5 substring.
Practical: nếu cần index trên TEXT 10KB:
- Tách thành column riêng nhỏ hơn
- Index trên hash:
CREATE INDEX ON t(md5(big_text)) - Hoặc full-text search → tsvector + GIN
14.5 ANALYZE quan trọng
Planner stats từ pg_statistic. Sau bulk insert, migration, không có stats → planner đoán cardinality sai → pick wrong plan.
ANALYZE orders;
-- Hoặc tự động
ALTER TABLE orders SET (autovacuum_analyze_scale_factor = 0.05);14.6 Index hint? Không có
Postgres không có USE INDEX hint như MySQL/Oracle. Cách “force”:
- Drop other indexes (temporary)
- Disable seq scan:
SET enable_seqscan = off;(debugging only) - Adjust planner cost:
SET random_page_cost = 1.0;
Extension pg_hint_plan (3rd party) có thể inject hints nhưng controversial. Postgres philosophy: fix planner stats, không hint.
15. Self-check
- Bạn có query
WHERE a = 1 AND b > 10 ORDER BY c. Index(a, b, c)vs(a, c, b)— cái nào tốt hơn? Tại sao? - INCLUDE column vs index column khác nhau ra sao? Tại sao có cả 2?
- Index-only scan có
Heap Fetches: 1000. Đây là good hay bad? Fix thế nào? - Khi nào dùng GIN, khi nào GiST? Cho 2 ví dụ rõ ràng.
- Partial index
WHERE status = 'pending'. QueryWHERE status IN ('pending', 'active')có dùng được index này không? - BRIN khi nào hữu ích? Khi nào unhelpful?
LIKE '%abc%'— index nào hỗ trợ?LIKE 'abc%'— index nào?- Drop index trên FK column — risk gì? Khi nào chấp nhận drop?
- REINDEX vs REINDEX CONCURRENTLY — trade-off?
- Bạn có 5M-row table với 20 index. INSERT chậm. Bước đầu tiên?
16. Tiếp theo
Bài tiếp: Tuan-04-Query-Optimization-EXPLAIN — đọc plan, hiểu cost, debug slow query.
Đọc:
- Use The Index, Luke! — đọc chương “The Anatomy of an Index” + “The Where Clause”
- Postgres docs Chapter 11 (Indexes)
Tuần 03 hoàn thành. ESR rule + INCLUDE + partial = 80% indexing skill. Cập nhật: 2026-05-16