Tuần 03 — Indexing Mastery

“Mid-level dev: ‘Query chậm, thêm index.’ Senior: ‘Query chậm vì index không match access pattern. Đây là index đúng.’ Staff: ‘Query chậm vì 3 lý do, mình bỏ 2 index thừa, thêm 1 covering index, redesign schema để index-only scan work, tiết kiệm 40% disk.‘”

Tags: database postgresql indexing b-tree performance Thời lượng: 7 ngày (5-7h/ngày — tuần dày) Prerequisites: Tuan-01-DB-Internals-Refresh (B-tree page layout), Tuan-02-Schema-Design-Normalization (schema sạch trước khi index) Liên quan: Tuan-04-Query-Optimization-EXPLAIN · Tuan-09-DB-Observability-Tuning

1. Context & Why

1.1 Vì sao tuần này là xương sống

Khả năng “đọc một query và biết index nào cần thiết” là kỹ năng phân biệt mid-level và senior. Trong production:

60-80% bug performance gốc là missing/wrong index
10-20% là redundant/unused index ăn write throughput
5-10% là bloated index không reindex

Hiểu indexing = giải quyết 80% bottleneck DB tự thân.

1.2 Mental model — Index không phải miễn phí

graph LR
    A[Query SELECT] -->|index hits| B[Index lookup<br/>O&#40;log N&#41;]
    A -->|no index| C[Sequential scan<br/>O&#40;N&#41;]
    B -.->|fetch heap| D[Heap access<br/>1 random I/O]

    E[INSERT/UPDATE/DELETE] --> F[Maintain main heap]
    E --> G[Maintain ALL indexes]
    G --> H[+1 write per index]
    G --> I[+WAL volume]
    G --> J[Lock contention]

    style C fill:#ffccbc
    style G fill:#fff9c4

Mỗi index thêm:

✅ Đọc nhanh khi match access pattern
❌ +1 write per index per row mutation
❌ +N MB disk
❌ +WAL volume
❌ +shared_buffers space
❌ Maintenance: bloat, REINDEX

Rule vàng: Add index khi có bằng chứng (slow query, EXPLAIN), không phải khi “phòng hờ”.

1.3 Mục tiêu tuần

Hiểu sâu B-tree, GIN, GiST, BRIN, Hash, SP-GiST — pick đúng cho từng workload
Master multi-column index: thứ tự column, ESR rule
Tự tin design covering index để đạt index-only scan
Partial index, expression index — pattern thực chiến
Detect và xóa unused/redundant index
Manage bloat: monitor, REINDEX CONCURRENTLY
Đọc pg_stat_user_indexes như đọc bản đồ

1.4 Tham chiếu

Use The Index, Luke! — Markus Winand (free, must-read) — https://use-the-index-luke.com/
PostgreSQL Documentation: Indexes — https://www.postgresql.org/docs/current/indexes.html
The Art of PostgreSQL — Dimitri Fontaine, Ch. Indexing
PostgreSQL 14 Internals — Egor Rogov, Part 3
Citus blog series on indexes — https://www.citusdata.com/blog/

2. Index Types — Map of Choices

mindmap
  root((Postgres<br/>Index<br/>Types))
    B-tree
      Default
      Equality + Range
      Sortable
      Most queries
    Hash
      Equality only
      WAL-logged from PG10
      Niche use
    GIN
      Array, JSONB
      Full-text search
      Trigram
      Multi-value column
    GiST
      Geometric
      Range types
      Trigram alt
      EXCLUDE constraint
    SP-GiST
      Space partitioned
      IP, hierarchical
      Less common
    BRIN
      Block Range
      Huge tables
      Naturally ordered data
    Specialized
      Bloom
      RUM
      pg_trgm

2.1 B-tree — workhorse

Default. Support:

Equality: =
Range: <, <=, >, >=, BETWEEN
IN list
Pattern matching: LIKE 'prefix%' (NOT LIKE '%suffix')
NULL handling: IS NULL, IS NOT NULL
Sort: ORDER BY uses index nếu match column order
Joins: hash/merge/nested loop, planner pick

Structure (refresh từ Tuần 01):

graph TD
    Root["Root: keys [40, 80]"]
    L1["[10, 20, 30]"]
    L2["[50, 60, 70]"]
    L3["[90, 100, 110]"]
    Leaf1["[10→ctid] [15→ctid] [20→ctid]"]
    Leaf2["[50→ctid] [55→ctid] [60→ctid]"]
    Leaf3["[90→ctid] [95→ctid] [100→ctid]"]

    Root --> L1
    Root --> L2
    Root --> L3
    L1 --> Leaf1
    L2 --> Leaf2
    L3 --> Leaf3

    Leaf1 -.linked list.-> Leaf2 -.-> Leaf3

Postgres B-tree là B+ tree: data trong leaf, leaf linked list cho range scan.

Depth: 3-4 cho table tỷ rows. Lookup cost: 3-4 random I/O worst case, 0-1 với cache hot.

2.2 Hash — niche

Equality only. Trước PG10 không WAL-logged → không crash safe → ít dùng. Từ PG10 đã fix.

Khi nào dùng vs B-tree:

Index size nhỏ hơn ~5-10% B-tree (cùng data)
Lookup nhanh hơn marginal cho exact equality
Không sortable, không range

Trong thực tế 2024: 99% case B-tree tốt hơn. Hash hiếm khi pick.

2.3 GIN — Generalized Inverted Index

GIN inverted index — như Google search. Một entry → list of doc IDs có entry đó.

Use cases:

Array: tags text[]
JSONB: metadata jsonb
Full-text search: tsvector
Trigram: pg_trgm cho fuzzy search

CREATE TABLE articles (id bigint, tags text[], content text);
 
CREATE INDEX idx_articles_tags ON articles USING gin(tags);
CREATE INDEX idx_articles_fts ON articles USING gin(to_tsvector('english', content));
 
-- Query
SELECT * FROM articles WHERE tags @> ARRAY['rust'];
SELECT * FROM articles WHERE to_tsvector('english', content) @@ to_tsquery('postgres');

Trade-off GIN:

Lookup very fast
Write slow (multiple entries per row)
Update tốn (rebuild posting list)
Index size lớn (overhead per token)

Pattern 2024 — pending list: GIN có gin_pending_list_limit (default 4MB). Writes go to pending list trước, periodic merge. Giảm write amplification.

-- Tune cho high-write
ALTER INDEX idx_articles_tags SET (fastupdate = on, gin_pending_list_limit = '32MB');

2.4 GiST — Generalized Search Tree

Framework cho indexes “không-standard”: geometric, range, custom.

Use cases:

PostGIS: spatial queries
Range types: tstzrange && tstzrange
EXCLUDE constraints
Trigram (pg_trgm cũng support GiST)
KNN search (ORDER BY point <-> point LIMIT 10)

CREATE TABLE bookings (
    id bigint, room_id int, during tstzrange,
    EXCLUDE USING gist (room_id WITH =, during WITH &&)  -- needs GiST
);
 
CREATE TABLE places (id bigint, location geography);
CREATE INDEX idx_places_loc ON places USING gist(location);
SELECT * FROM places ORDER BY location <-> ST_MakePoint(-122.4, 37.7) LIMIT 10;

KNN search (k-nearest neighbor): GiST có thể sort theo distance trong index. Cực mạnh cho map/geo queries.

2.5 SP-GiST — Space-Partitioned

Cho data nội tại phân vùng không cân bằng: IP addresses, phone numbers, hierarchical paths.

CREATE TABLE access_log (id bigint, ip inet);
CREATE INDEX idx_access_log_ip ON access_log USING spgist(ip);
 
SELECT * FROM access_log WHERE ip << '192.168.0.0/16';  -- subnet contains

Niche. Khi không chắc, GiST hoặc B-tree thường đủ.

2.6 BRIN — Block Range Index

Lưu min/max của mỗi N block (default 128 = 1MB). Cực nhỏ — index 100GB table chỉ vài MB.

Điều kiện hoạt động: Data phải naturally ordered theo column. Vd: created_at của events table, append-only.

CREATE TABLE events (
    id bigint, occurred_at timestamptz, payload jsonb
);
CREATE INDEX idx_events_brin ON events USING brin(occurred_at);
 
SELECT * FROM events WHERE occurred_at BETWEEN '2026-05-01' AND '2026-05-02';

Trade-off:

✅ Index siêu nhỏ
✅ Build fast
❌ Chỉ work với clustered data — nếu data shuffled, BRIN useless
❌ Không thay được B-tree cho equality lookup nhanh

Use case: time-series, append-only logs, partitioned tables.

2.7 Bloom (extension)

CREATE EXTENSION bloom;

Index nhiều column, mỗi column equality. Approximate (false positive possible).

CREATE INDEX idx_widgets_bloom ON widgets USING bloom(c1, c2, c3, c4, c5)
WITH (length=80, col1=2, col2=2, col3=4, col4=2, col5=2);
 
SELECT * FROM widgets WHERE c1 = 1 AND c3 = 100 AND c5 = 5;

Use case: bảng nhiều column, query mix equality không predict được. Hiếm.

2.8 Decision tree — pick index type

flowchart TD
    A[Query pattern?] --> B{Equality + range trên scalar?}
    B -->|Yes| B1[B-tree]

    A --> C{Array / JSONB containment?}
    C -->|Yes| C1[GIN]

    A --> D{Full-text search?}
    D -->|Yes| D1[GIN on tsvector]

    A --> E{Fuzzy text / similarity?}
    E -->|Yes| E1[GIN/GiST + pg_trgm]

    A --> F{Geometric / spatial?}
    F -->|Yes| F1[GiST + PostGIS]

    A --> G{Range overlap?}
    G -->|Yes| G1[GiST]

    A --> H{Append-only time-series huge table?}
    H -->|Yes| H1[BRIN]

    A --> I{Vector similarity / embedding?}
    I -->|Yes| I1[HNSW or IVFFlat<br/>via pgvector]

    style B1 fill:#c8e6c9
    style C1 fill:#c8e6c9
    style I1 fill:#fff9c4

(Vector index sẽ học sâu Tuan-15-Vector-DB-AI)

3. Multi-column Index — Nghệ thuật thứ tự

3.1 Vấn đề kinh điển

Có query:

SELECT * FROM orders WHERE user_id = $1 AND status = 'pending' ORDER BY created_at DESC;

Index nào tốt?

A: (user_id)
B: (user_id, status)
C: (user_id, status, created_at)
D: (status, user_id, created_at)
E: (created_at, user_id, status)

3.2 Rule ESR — Equality, Sort, Range

Markus Winand’s pattern:

Equality columns trước (column với =)
Sort column tiếp theo (column trong ORDER BY)
Range column cuối (column với <, >, BETWEEN)

Áp dụng cho query trên:

Equality: user_id, status
Sort: created_at
Range: không có

→ Index tối ưu: (user_id, status, created_at)

Tại sao C thắng D?

Cả 2 cùng cover (user_id, status, created_at)
Nhưng D đặt status trước user_id. Nếu user_id selectivity cao (mỗi user có nhiều order), đặt user_id trước giúp narrow xuống nhanh hơn.

Tại sao C thắng E?

E đặt created_at trước. Equality user_id = X không thể skip trong B-tree khi column đầu là created_at range/sort.
Index B-tree là multi-dimensional theo column order. Phải dùng prefix.

3.3 Left-prefix rule

Index (a, b, c) hỗ trợ:

WHERE a = $1
WHERE a = $1 AND b = $2
WHERE a = $1 AND b = $2 AND c = $3
WHERE a = $1 ORDER BY b, c
WHERE a = $1 AND b > $2 ORDER BY c (range trên b)

KHÔNG hỗ trợ (efficient):

WHERE b = $1 (skip a)
WHERE c = $1
WHERE a > $1 AND b = $2 (range a → không dùng được b cho equality lookup nữa)

3.4 Khi đảo column order

Query 1: WHERE country = 'VN' AND status = 'active' (1M VN users, 100K active) Query 2: WHERE country = 'VN' AND created_at > '2026-01-01'

Index (country, status) tốt cho Q1. Q2 cần (country, created_at).

Cân nhắc gộp (country, status, created_at):

Cover Q1 (prefix)
Cover Q2? Không hiệu quả — sau country, B-tree sort theo status trước, không phải created_at. Range on created_at cần scan tất cả status.

→ Đôi khi cần 2 index riêng: (country, status) và (country, created_at).

3.5 PG17/18 — SAOP + Skip Scan

PG17 thực tế ship: Cải thiện handling của SAOP (Scalar Array Op — IN (...), = ANY(...)) trong B-tree. Planner có thể “advance to next array element” hiệu quả hơn → query với IN (long list) nhanh hơn rõ rệt.

Skip scan thực sự (Peter Geoghegan patch): Landed vào master cuối 2024, slated for PG18 (Sept 2025). Cho phép skip qua leading column low-cardinality:

-- Index: (gender, last_name)
-- Query: WHERE last_name = 'Smith'
-- Pre-PG18: seq scan or full index scan
-- PG18+: skip scan - iterate distinct gender values, lookup last_name within each

Hữu ích khi leading column ít distinct values. Không thay được proper index design — chỉ safety net cho queries không match prefix.

Trên PG17 đang xài: workaround là tạo composite index match query, hoặc tạo nhiều partial index.

3.6 Demo lab

CREATE TABLE orders (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    user_id bigint NOT NULL,
    status text NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    total numeric
);
 
INSERT INTO orders (user_id, status, total)
SELECT
    (random() * 10000)::bigint,
    (ARRAY['pending', 'paid', 'shipped', 'cancelled'])[1 + (random() * 3)::int],
    (random() * 1000)::numeric
FROM generate_series(1, 1000000);
 
-- No index
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE user_id = 42 AND status = 'pending' ORDER BY created_at DESC LIMIT 10;
-- Seq Scan, ~50ms
 
-- Index A
CREATE INDEX idx_a ON orders(user_id);
EXPLAIN (ANALYZE, BUFFERS) <query>;
-- Index Scan on idx_a, ~5ms
 
-- Index B
CREATE INDEX idx_b ON orders(user_id, status, created_at DESC);
EXPLAIN (ANALYZE, BUFFERS) <query>;
-- Index Scan on idx_b, ~0.2ms (winner)
 
DROP INDEX idx_a;  -- planner sẽ pick idx_b

4. Covering Index & Index-Only Scan

4.1 Index-only scan là gì

Bình thường:

Planner dùng index → tìm ctid
Fetch heap page → return tuple

Index-only scan: planner đọc tất cả column cần thiết từ index, không fetch heap.

Điều kiện:

Query chỉ reference columns có trong index
Visibility map nói page là “all-visible” (Tuần 01)

4.2 INCLUDE — covering index

PG11+ syntax:

-- Index B-tree key: (user_id), payload: total, created_at
CREATE INDEX idx_orders_covering
ON orders(user_id) INCLUDE (total, created_at);

INCLUDE columns:

Lưu trong leaf nhưng không là key (không sortable, không equality lookup)
Cho phép index-only scan với query select chúng

EXPLAIN (ANALYZE, BUFFERS)
SELECT total, created_at FROM orders WHERE user_id = 42;
-- Index Only Scan using idx_orders_covering
-- Heap Fetches: 0  ← perfect index-only scan

4.3 Khi nào dùng covering index

flowchart TD
    A[Query] --> B{Hot path?<br/>>1000 qps?}
    B -->|No| B1[Đừng add covering<br/>tăng size không xứng]
    B -->|Yes| C{Select bao nhiêu column?}
    C -->|1-3 column nhỏ| D[Add INCLUDE<br/>+payload]
    C -->|Nhiều column / big TEXT| E[Đừng INCLUDE<br/>index sẽ phình]
    D --> F{VM coverage?}
    F -->|>95% all-visible| G[Index-only scan work]
    F -->|<95%| H[Tune autovacuum<br/>để VM stay fresh]

    style G fill:#c8e6c9
    style B1 fill:#fff9c4
    style E fill:#ffccbc

Trade-off:

✅ Tiết kiệm 1 random I/O per row (heap fetch)
❌ Index size tăng (lưu thêm columns)
❌ Update bất kỳ column nào trong INCLUDE → maintain index

4.4 Heap Fetches — kẻ thù của index-only scan

EXPLAIN (ANALYZE, BUFFERS) SELECT ...;
-- Index Only Scan ...
-- Heap Fetches: 1234   ← bad! VM stale

Heap Fetches > 0 nghĩa là index-only scan không thực sự là index-only — phải fetch heap vì VM bit chưa set.

Fix:

VACUUM (VERBOSE) orders;
-- Wait until autovacuum run, hoặc tune scale_factor

Pattern 2024: high-write table cần VACUUM thường xuyên hơn để keep VM fresh.

ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.05,  -- 5% instead of 20%
    autovacuum_vacuum_insert_scale_factor = 0.05  -- PG13+: trigger by inserts too
);

5. Partial Index

5.1 Concept

Index chỉ một subset rows.

-- Chỉ index active orders
CREATE INDEX idx_orders_active
ON orders(user_id, created_at)
WHERE status IN ('pending', 'paid', 'shipping');

Khi nào dùng:

Hot subset nhỏ trong table to (vd: 5% active, 95% archived)
Filter condition luôn xuất hiện trong query
Soft-delete pattern: WHERE deleted_at IS NULL

Lợi ích:

Index nhỏ → fit cache → faster
Write cheaper (chỉ archived rows tránh maintain)
Disk save lớn

5.2 Production patterns

-- 1. Soft delete
CREATE INDEX idx_users_email ON users(email) WHERE deleted_at IS NULL;
 
-- 2. Unique constraint cho subset
CREATE UNIQUE INDEX idx_one_default_per_user
ON addresses(user_id) WHERE is_default = true;
 
-- 3. Skip null-heavy column
CREATE INDEX idx_orders_payment_ref ON orders(payment_ref) WHERE payment_ref IS NOT NULL;
 
-- 4. Status workflow
CREATE INDEX idx_jobs_pending ON jobs(created_at) WHERE status = 'pending';
-- Job queue worker query: SELECT ... WHERE status='pending' ORDER BY created_at LIMIT 100

5.3 Gotcha — query phải match WHERE clause

CREATE INDEX idx_partial ON orders(user_id) WHERE status = 'active';
 
-- DÙNG ĐƯỢC
SELECT * FROM orders WHERE user_id = 42 AND status = 'active';
 
-- KHÔNG DÙNG ĐƯỢC (planner không chắc partial bao trùm)
SELECT * FROM orders WHERE user_id = 42;  -- Có thể có rows status != 'active'
SELECT * FROM orders WHERE user_id = 42 AND status IN ('active', 'paused');

Predicate trong query phải imply predicate trong index.

6. Expression Index (Functional Index)

6.1 Concept

Index trên kết quả của function/expression, không phải column raw.

-- Case-insensitive lookup
CREATE INDEX idx_users_email_lower ON users(lower(email));
 
SELECT * FROM users WHERE lower(email) = lower($1);
-- Dùng được index

Function phải IMMUTABLE — return same result cho same input, không có side effect.

6.2 Patterns hữu ích

-- 1. Date truncation
CREATE INDEX idx_events_day ON events(date_trunc('day', occurred_at));
SELECT * FROM events WHERE date_trunc('day', occurred_at) = '2026-05-16';
 
-- 2. JSONB extraction
CREATE INDEX idx_users_country ON users((metadata->>'country'));
SELECT * FROM users WHERE metadata->>'country' = 'VN';
 
-- 3. Computed value
CREATE INDEX idx_orders_revenue ON orders((qty * unit_price));
 
-- 4. Full-text search column
CREATE INDEX idx_products_search ON products USING gin(to_tsvector('english', name || ' ' || description));

6.3 Combine với partial

-- Tag count cho active products
CREATE INDEX idx_products_tag_count
ON products(array_length(tags, 1))
WHERE status = 'active';
 
SELECT * FROM products
WHERE status = 'active' AND array_length(tags, 1) > 5;

7. Specialized Indexes 2024-2026

7.1 pg_trgm — Trigram fuzzy search

CREATE EXTENSION pg_trgm;
 
CREATE INDEX idx_users_name_trgm ON users USING gin(name gin_trgm_ops);
 
-- Fuzzy match
SELECT * FROM users WHERE name ILIKE '%nguyen%';   -- ngày xưa seq scan, giờ index
SELECT * FROM users ORDER BY similarity(name, 'Nguyen') DESC LIMIT 10;

Trigram = 3-char substring. “hello” → {’ h’, ’ he’, ‘hel’, ‘ell’, ‘llo’, ‘lo ’, ‘o ’}. Hai chuỗi share trigrams → similar.

Cực mạnh cho search-as-you-type, typo tolerance, fuzzy autocomplete.

7.2 PostGIS — spatial

CREATE EXTENSION postgis;
 
CREATE TABLE shops (
    id bigint,
    location geography(POINT, 4326)
);
CREATE INDEX idx_shops_location ON shops USING gist(location);
 
-- Within 5km of point
SELECT * FROM shops
WHERE ST_DWithin(location, ST_MakePoint(-122.4, 37.7)::geography, 5000);
 
-- K-nearest
SELECT * FROM shops
ORDER BY location <-> ST_MakePoint(-122.4, 37.7)::geography
LIMIT 10;

7.3 pgvector — vector similarity (preview Tuần 15)

CREATE EXTENSION vector;
 
CREATE TABLE documents (
    id bigint,
    embedding vector(1536)
);
 
-- HNSW (PG14+ với pgvector 0.5+)
CREATE INDEX ON documents USING hnsw(embedding vector_cosine_ops);
 
SELECT * FROM documents ORDER BY embedding <=> $1 LIMIT 10;

Sẽ đào sâu Tuan-15-Vector-DB-AI.

7.4 RUM (extension) — alternative GIN

Better cho full-text search với ranking + ordering:

CREATE EXTENSION rum;
 
CREATE INDEX idx_docs_rum ON docs USING rum(fts_vector rum_tsvector_ops);
 
SELECT *, ts_rank(fts_vector, query) AS rank
FROM docs, to_tsquery('english', 'postgres') query
WHERE fts_vector @@ query
ORDER BY fts_vector <=> query LIMIT 10;  -- distance-based ordering trong index

8. Index Maintenance & Bloat

8.1 Index bloat — vấn đề kinh điển

Mỗi UPDATE/DELETE → dead tuple → index entry trỏ tới dead tuple. VACUUM dọn nhưng không reclaim disk.

Hậu quả:

Index size grow theo write rate
Random I/O nhiều hơn (sparse pages)
Cache hit ratio giảm

8.2 Đo bloat

-- pgstattuple
CREATE EXTENSION pgstattuple;
SELECT * FROM pgstattuple('idx_orders_user_id');
-- avg_leaf_density: %  - <50% là bloated nặng
 
-- Quick estimate
SELECT
    indexrelname AS index,
    pg_size_pretty(pg_relation_size(indexrelid)) AS size,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY pg_relation_size(indexrelid) DESC;

Production tool: pgexperts/pgx_scripts/bloat/ hoặc pg_repack.

8.3 REINDEX CONCURRENTLY (PG12+)

Cách an toàn rebuild index trên production:

-- Lock acquired ngắn, không block reads/writes
REINDEX INDEX CONCURRENTLY idx_orders_user_id;
 
REINDEX TABLE CONCURRENTLY orders;
REINDEX DATABASE CONCURRENTLY appdb;

Trade-off:

✅ No long lock
❌ Chậm hơn REINDEX thường (~2x)
❌ Cần disk space gấp đôi (build new index trước khi drop old)
❌ Nếu fail giữa chừng, leave invalid index → cần cleanup

8.4 Detect invalid index

SELECT
    schemaname || '.' || indexrelname AS index,
    pg_size_pretty(pg_relation_size(indexrelid))
FROM pg_stat_user_indexes
WHERE indexrelid IN (
    SELECT indexrelid FROM pg_index WHERE indisvalid = false
);
 
-- Cleanup
DROP INDEX CONCURRENTLY idx_invalid;

8.5 Schedule maintenance

# crontab: rebuild bloated indexes weekly off-peak
0 3 * * 0 psql -c "REINDEX TABLE CONCURRENTLY orders;"

Hoặc tool pg_repack:

pg_repack -t orders --jobs 4

pg_repack ưu việt vì cũng repack heap (table itself bloat).

9. Unused & Redundant Indexes — kẻ ngầm phá

9.1 Find unused

SELECT
    schemaname || '.' || relname AS table,
    indexrelname AS index,
    pg_size_pretty(pg_relation_size(indexrelid)) AS size,
    idx_scan AS scans,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
WHERE idx_scan < 50              -- ít scan
  AND pg_relation_size(indexrelid) > 10 * 1024 * 1024  -- >10MB
  AND indexrelname NOT LIKE '%_pkey'                    -- skip PK
ORDER BY pg_relation_size(indexrelid) DESC;

Cẩn thận:

Reset stats khi cluster restart → đợi vài ngày sau restart mới đo
Index dùng cho rare reports vẫn quan trọng
Unique constraint indexes — không drop dù idx_scan thấp

9.2 Find redundant (subset)

Index A redundant nếu A là prefix của B:

idx_a: (user_id)
idx_b: (user_id, status)

→ idx_b cover idx_a. Drop idx_a.

Query find redundant:

WITH idx_info AS (
    SELECT
        n.nspname AS schema,
        t.relname AS table,
        i.relname AS index,
        array_to_string(array_agg(a.attname ORDER BY x.n), ',') AS cols
    FROM pg_index ix
    JOIN pg_class t ON t.oid = ix.indrelid
    JOIN pg_class i ON i.oid = ix.indexrelid
    JOIN pg_namespace n ON n.oid = t.relnamespace
    CROSS JOIN LATERAL unnest(ix.indkey) WITH ORDINALITY AS x(attnum, n)
    JOIN pg_attribute a ON a.attrelid = t.oid AND a.attnum = x.attnum
    WHERE n.nspname NOT IN ('pg_catalog','information_schema')
    GROUP BY 1,2,3
)
SELECT a.schema, a.table, a.index AS redundant_with, a.cols, b.index AS covers_via, b.cols
FROM idx_info a JOIN idx_info b
  ON a.schema = b.schema AND a.table = b.table
 AND a.index <> b.index
 AND b.cols LIKE a.cols || ',%';

9.3 Index opinion vs index plan

Đừng drop blindly. Workflow:

Drop in staging environment
Run synthetic workload
Monitor query plans
Confirm no regression
Drop production with DROP INDEX CONCURRENTLY

10. Common Query Patterns → Index Recipes

10.1 OLTP — equality + sort

-- Query: orders by user, latest first
SELECT * FROM orders WHERE user_id = $1 ORDER BY created_at DESC LIMIT 20;
 
CREATE INDEX idx_orders_user_created ON orders(user_id, created_at DESC);

10.2 Status filter + range

SELECT * FROM jobs WHERE status = 'pending' AND scheduled_at <= now() ORDER BY scheduled_at LIMIT 100;
 
-- Partial + composite
CREATE INDEX idx_jobs_pending ON jobs(scheduled_at) WHERE status = 'pending';

10.3 Search by tag

SELECT * FROM articles WHERE tags @> ARRAY['rust', 'database'];
 
CREATE INDEX idx_articles_tags ON articles USING gin(tags);

10.4 Full-text search

SELECT * FROM products WHERE search_vector @@ plainto_tsquery('iphone case');
 
-- Generated column + GIN
ALTER TABLE products ADD COLUMN search_vector tsvector
GENERATED ALWAYS AS (to_tsvector('english', name || ' ' || coalesce(description,''))) STORED;
CREATE INDEX idx_products_search ON products USING gin(search_vector);

10.5 Time-series

SELECT count(*) FROM events WHERE occurred_at BETWEEN $1 AND $2;
 
-- BRIN nếu insert-only, sorted
CREATE INDEX idx_events_brin ON events USING brin(occurred_at);

10.6 Geospatial

SELECT * FROM shops WHERE ST_DWithin(location, $1, 5000);
 
CREATE INDEX idx_shops_location ON shops USING gist(location);

10.7 JSON path

SELECT * FROM users WHERE preferences @> '{"theme": "dark"}';
 
CREATE INDEX idx_users_prefs ON users USING gin(preferences);
-- Hoặc cho specific path
CREATE INDEX idx_users_theme ON users((preferences->>'theme'));

10.8 Email case-insensitive

SELECT * FROM users WHERE lower(email) = lower($1);
 
CREATE INDEX idx_users_email_lower ON users(lower(email));
-- HOẶC dùng citext type (Tuần 02)

10.9 Pagination cursor

-- Worst: OFFSET pagination
SELECT * FROM orders ORDER BY created_at DESC OFFSET 10000 LIMIT 20;
-- ↑ Skip 10000 rows trong index
 
-- Best: cursor (keyset pagination)
SELECT * FROM orders
WHERE (created_at, id) < ($last_created_at, $last_id)
ORDER BY created_at DESC, id DESC LIMIT 20;
 
CREATE INDEX idx_orders_keyset ON orders(created_at DESC, id DESC);

10.10 Count-distinct

SELECT count(DISTINCT user_id) FROM events WHERE occurred_at > now() - interval '1 day';
 
-- Index helps but count(DISTINCT) inherently expensive
-- Alternative: HyperLogLog
CREATE EXTENSION hll;
-- Or pre-aggregate in materialized view

11. Index trên different DBs (comparative)

11.1 MySQL InnoDB

Clustered index (table is the PK B-tree)
Secondary index lưu PK, không lưu ctid
Lookup secondary → secondary tìm PK → PK tìm row (2 lookups vs Postgres 1)

→ Trong MySQL, PK lookup faster, secondary slower (extra hop). Trong Postgres, secondary tương đương PK.

11.2 Cassandra/DynamoDB

Primary key gồm partition key + clustering key
“Secondary index” rất hạn chế (only local), thường dùng materialized view

→ NoSQL design tập trung vào access pattern trước, schema sau.

11.3 ClickHouse

Sparse index (mỗi 8192 rows 1 entry) — gần BRIN concept
Sorted on disk by ORDER BY clause của table

→ Different paradigm. Đào sâu Tuần 14.

12. Lab — Real-world indexing

12.1 Setup

-- Reuse schema từ Tuần 02 lab (e-commerce)
-- Bulk insert 5M orders, 50K users, 100K products
INSERT INTO users (email, name) SELECT 'u' || i || '@x.com', 'User ' || i FROM generate_series(1, 50000) i;
 
INSERT INTO products (category_id, sku, name, price_cents, currency, status)
SELECT 1 + (i % 100), 'SKU' || i, 'Product ' || i, 100 + i, 'USD', 'active'
FROM generate_series(1, 100000) i;
 
INSERT INTO orders (user_id, status, currency, subtotal_cents, tax_cents, shipping_cents, shipping_address_id, placed_at)
SELECT
    1 + (random() * 49999)::int,
    (ARRAY['pending', 'paid', 'shipped', 'cancelled'])[1 + (random() * 3)::int]::order_status,
    'USD',
    (100 + random() * 10000)::bigint,
    0, 0, NULL,
    now() - (random() * interval '365 days')
FROM generate_series(1, 5000000);
 
ANALYZE;

12.2 Bài tập 1 — Query plan với và không index

-- Query: 20 đơn mới nhất của user 42
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE user_id = 42 ORDER BY placed_at DESC LIMIT 20;
-- Note time + plan
 
-- Add B-tree
CREATE INDEX idx_orders_user_placed ON orders(user_id, placed_at DESC);
EXPLAIN (ANALYZE, BUFFERS) <same query>;
-- Compare

12.3 Bài tập 2 — Index-only scan

-- Query chỉ select 2 column
SELECT placed_at, total_cents FROM orders WHERE user_id = 42 LIMIT 100;
 
-- Current plan (with idx_orders_user_placed): Index Scan + heap fetch
-- Improve: covering index
CREATE INDEX idx_orders_user_covering ON orders(user_id) INCLUDE (placed_at, total_cents);
EXPLAIN (ANALYZE, BUFFERS) <query>;
-- Check: Index Only Scan, Heap Fetches = ?

12.4 Bài tập 3 — Partial index

-- Hot path: pending orders only
SELECT * FROM orders WHERE status = 'pending' ORDER BY placed_at LIMIT 50;
 
-- All-rows index
CREATE INDEX idx_orders_status_placed ON orders(status, placed_at);
SELECT pg_size_pretty(pg_relation_size('idx_orders_status_placed'));
 
-- Partial
CREATE INDEX idx_orders_pending ON orders(placed_at) WHERE status = 'pending';
SELECT pg_size_pretty(pg_relation_size('idx_orders_pending'));
 
-- Compare sizes. Compare query plans.
DROP INDEX idx_orders_status_placed;
EXPLAIN (ANALYZE, BUFFERS) <query>;

12.5 Bài tập 4 — Reproduce bloat & REINDEX

CREATE INDEX idx_test ON orders(placed_at);
SELECT pg_size_pretty(pg_relation_size('idx_test')) AS before;
 
-- Generate bloat
DO $$ BEGIN
    FOR i IN 1..5 LOOP
        UPDATE orders SET placed_at = placed_at + interval '1 hour' WHERE id % 10 = 0;
    END LOOP;
END $$;
 
SELECT pg_size_pretty(pg_relation_size('idx_test')) AS after_bloat;
 
REINDEX INDEX CONCURRENTLY idx_test;
SELECT pg_size_pretty(pg_relation_size('idx_test')) AS after_reindex;

12.6 Bài tập 5 — Find unused indexes

-- Reset stats
SELECT pg_stat_reset();
 
-- Simulate workload (run 10 mins of mixed queries)
-- ... your workload ...
 
-- Find unused
SELECT indexrelname, idx_scan, pg_size_pretty(pg_relation_size(indexrelid))
FROM pg_stat_user_indexes
WHERE schemaname = 'public' AND idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

12.7 Bài tập 6 — GIN index trên tags

-- Random tags
UPDATE products SET tags = ARRAY(
    SELECT (ARRAY['rust','postgres','ml','web','mobile','cloud'])[1 + (random()*5)::int]
    FROM generate_series(1, (1 + random()*3)::int)
);
 
EXPLAIN ANALYZE SELECT * FROM products WHERE tags @> ARRAY['rust'];
-- Seq scan
 
CREATE INDEX idx_products_tags ON products USING gin(tags);
EXPLAIN ANALYZE SELECT * FROM products WHERE tags @> ARRAY['rust'];
-- Bitmap Heap Scan via GIN

13. Anti-patterns

Pattern	Tại sao tệ	Fix
Index mọi FK tự động	Một số FK ít query, tốn write	Index FK chỉ khi có truy vấn
Index trên column low-cardinality một mình	(status với 3 values) → seq scan vẫn nhanh hơn	Composite với column khác
Order column theo “intuition”	(created_at, user_id) khi query equality user_id	ESR rule
`LIKE '%suffix'` expect index hit	B-tree không hỗ trợ leading wildcard	Reverse string + index, hoặc trigram
GIN trên column thường update	Heavy write amplification	B-tree + fastupdate tuning hoặc đổi access pattern
Drop unused index lúc production stress	Có thể là index dùng cho quarterly report	Theo dõi lâu hơn, check pg_stat_statements
REINDEX (non-concurrent) trong giờ peak	Block table	REINDEX CONCURRENTLY
Function trong WHERE không match expression index	`WHERE upper(email) = X` mà index trên `lower(email)`	Match exact expression
INCLUDE column lớn (TEXT 10KB)	Index size phình	Tách ra column raw, index không INCLUDE
Không ANALYZE sau bulk insert	Planner stats stale → pick wrong index	`ANALYZE table_name` sau bulk load

14. Production Gotchas

14.1 Index disable vs drop

Không có “disable index” như MySQL. Cách workaround:

-- "Soft disable" by marking invalid
UPDATE pg_index SET indisvalid = false WHERE indexrelid = 'idx_x'::regclass;
-- Planner skip invalid index
-- Bật lại
UPDATE pg_index SET indisvalid = true WHERE indexrelid = 'idx_x'::regclass;

Cẩn thận: trực tiếp update pg_index là hack, không dùng routine.

14.2 CREATE INDEX CONCURRENTLY và lock

CREATE INDEX CONCURRENTLY idx_x ON huge_table(col);

Không block read/write
Chạy chậm hơn (2 lần scan table)
Không thể inside transaction
Nếu fail (vd: duplicate key in unique index) → leave invalid index → DROP & retry

Best practice: monitor progress:

SELECT * FROM pg_stat_progress_create_index;

14.3 Index quá lớn không fit vào RAM

Nếu index size > shared_buffers → mỗi lookup có thể disk I/O.

Pattern:

Hot table small enough: trust shared_buffers
Hot table large: tune shared_buffers, hoặc partition table → mỗi partition có index nhỏ hơn

14.4 Index trên TOAST’d column

Index không thể trỏ tới TOAST’d value. Postgres tự động compress in main heap nếu cần, hoặc dùng MD5 substring.

Practical: nếu cần index trên TEXT 10KB:

Tách thành column riêng nhỏ hơn
Index trên hash: CREATE INDEX ON t(md5(big_text))
Hoặc full-text search → tsvector + GIN

14.5 ANALYZE quan trọng

Planner stats từ pg_statistic. Sau bulk insert, migration, không có stats → planner đoán cardinality sai → pick wrong plan.

ANALYZE orders;
-- Hoặc tự động
ALTER TABLE orders SET (autovacuum_analyze_scale_factor = 0.05);

14.6 Index hint? Không có

Postgres không có USE INDEX hint như MySQL/Oracle. Cách “force”:

Drop other indexes (temporary)
Disable seq scan: SET enable_seqscan = off; (debugging only)
Adjust planner cost: SET random_page_cost = 1.0;

Extension pg_hint_plan (3rd party) có thể inject hints nhưng controversial. Postgres philosophy: fix planner stats, không hint.

15. Self-check

Bạn có query WHERE a = 1 AND b > 10 ORDER BY c. Index (a, b, c) vs (a, c, b) — cái nào tốt hơn? Tại sao?
INCLUDE column vs index column khác nhau ra sao? Tại sao có cả 2?
Index-only scan có Heap Fetches: 1000. Đây là good hay bad? Fix thế nào?
Khi nào dùng GIN, khi nào GiST? Cho 2 ví dụ rõ ràng.
Partial index WHERE status = 'pending'. Query WHERE status IN ('pending', 'active') có dùng được index này không?
BRIN khi nào hữu ích? Khi nào unhelpful?
LIKE '%abc%' — index nào hỗ trợ? LIKE 'abc%' — index nào?
Drop index trên FK column — risk gì? Khi nào chấp nhận drop?
REINDEX vs REINDEX CONCURRENTLY — trade-off?
Bạn có 5M-row table với 20 index. INSERT chậm. Bước đầu tiên?

16. Tiếp theo

Bài tiếp: Tuan-04-Query-Optimization-EXPLAIN — đọc plan, hiểu cost, debug slow query.

Đọc:

Use The Index, Luke! — đọc chương “The Anatomy of an Index” + “The Where Clause”
Postgres docs Chapter 11 (Indexes)

Tuần 03 hoàn thành. ESR rule + INCLUDE + partial = 80% indexing skill. Cập nhật: 2026-05-16

lthieu's notes

Explorer

Tuan-03-Indexing-Mastery