Bonus — Postgres Partitioning Deep Dive
“Bảng 100M rows OK. 1B rows: VACUUM mất giờ, autovacuum không kịp, query plan chậm dù index, backup chậm, drop old data lock cả ngày. Partitioning fix mọi thứ — nhưng chỉ khi design đúng. Tuần này dạy bạn cách design đúng.”
Tags: database postgresql partitioning pg_partman operations Thời lượng: 5-6 ngày (5-7h/ngày) Prerequisites: Tuan-01-DB-Internals-Refresh · Tuan-03-Indexing-Mastery · Tuan-09-DB-Observability-Tuning Liên quan: Tuan-08-Zero-Downtime-Migration · Case-Design-Data-Realtime-Analytics
1. Context & Why
1.1 Khi nào cần partition
Triệu chứng cần partition:
- Bảng > 100GB hoặc > 1B rows
- Query chậm dù có index đúng (cache pressure)
- Bulk delete old data tốn cả ngày (
DELETE WHERE created_at < ...lock everything) - VACUUM không bao giờ “xong”
- Backup/restore mất quá lâu
- Index maintenance bloat
- Hot data + cold data distinct access pattern
Khi KHÔNG cần:
- Bảng < 10GB
- Query đều spread across rows (no time/range locality)
- Không có natural partition key
1.2 Partition là gì
Logical: 1 bảng từ góc nhìn SQL — INSERT INTO orders ..., SELECT FROM orders ... work as normal.
Physical: Postgres lưu nhiều “child tables”, route insert + filter query theo partition key.
graph TB Parent[orders<br/>parent partition table<br/>no rows stored here] P1[orders_2026_q1<br/>partition Q1] P2[orders_2026_q2<br/>partition Q2] P3[orders_2026_q3<br/>partition Q3] P4[orders_2026_q4<br/>partition Q4] Parent -.routes by created_at.-> P1 Parent -.routes.-> P2 Parent -.routes.-> P3 Parent -.routes.-> P4 Query[SELECT WHERE created_at = '2026-05-16'] -->|pruned| P2
1.3 Mục tiêu chương
- 3 partition strategies: range, list, hash
- Postgres native partitioning (PG10+) — fundamentals + production patterns
- Partition pruning rules — khi pruning work / không work
- Sub-partitioning (multi-level)
- Attach/detach partition online
- Foreign keys trên partitioned tables (PG12+)
- Partition-wise joins (PG12+)
- Default partition trap
- pg_partman cho automation
- Migration: existing huge table → partitioned
1.4 Tham chiếu
- Postgres docs Ch.5.11 — https://www.postgresql.org/docs/current/ddl-partitioning.html
- pg_partman — https://github.com/pgpartman/pg_partman
- Citus blog: partitioning — https://www.citusdata.com/blog/2018/01/24/citus-and-pg-partman-creating-a-scalable-time-series-database/
- Robert Haas / Bruce Momjian talks — partitioning internals
2. 3 Partition Strategies
2.1 Range partitioning — Most common
Partition theo continuous range của 1 hoặc nhiều columns. Typical: time.
CREATE TABLE orders (
id bigint GENERATED ALWAYS AS IDENTITY,
user_id bigint NOT NULL,
total numeric NOT NULL,
status text NOT NULL,
created_at timestamptz NOT NULL,
PRIMARY KEY (created_at, id) -- partition key MUST be in PK
) PARTITION BY RANGE (created_at);
-- Create partitions
CREATE TABLE orders_2026_q1 PARTITION OF orders
FOR VALUES FROM ('2026-01-01') TO ('2026-04-01');
CREATE TABLE orders_2026_q2 PARTITION OF orders
FOR VALUES FROM ('2026-04-01') TO ('2026-07-01');
CREATE TABLE orders_2026_q3 PARTITION OF orders
FOR VALUES FROM ('2026-07-01') TO ('2026-10-01');
CREATE TABLE orders_2026_q4 PARTITION OF orders
FOR VALUES FROM ('2026-10-01') TO ('2027-01-01');Range “FROM .. TO ..”: inclusive lower, exclusive upper.
Use cases:
- Time-series data (logs, events, orders)
- Sequential IDs (orders by id range)
- Geographic regions (lat/lon ranges)
2.2 List partitioning
Discrete enum values.
CREATE TABLE users_by_country (
id bigint,
country_code char(2) NOT NULL,
...
) PARTITION BY LIST (country_code);
CREATE TABLE users_us PARTITION OF users_by_country FOR VALUES IN ('US');
CREATE TABLE users_eu PARTITION OF users_by_country FOR VALUES IN ('DE', 'FR', 'GB', 'IT', 'ES');
CREATE TABLE users_apac PARTITION OF users_by_country FOR VALUES IN ('JP', 'KR', 'VN', 'TH', 'SG');
CREATE TABLE users_other PARTITION OF users_by_country DEFAULT;Use cases:
- Multi-region (compliance data residency)
- Status partitions (active vs archived)
- Tenant tier (free vs paid vs enterprise)
2.3 Hash partitioning
Spread evenly by hash. PG11+.
CREATE TABLE events (
id bigint, user_id bigint NOT NULL, ...
) PARTITION BY HASH (user_id);
CREATE TABLE events_p0 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_p2 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_p3 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 3);Use cases:
- Spread write load (no time/range locality)
- Tenant_id where uniform distribution wanted
- Avoid hot partitions
⚠️ Limit: cannot add/remove partitions without rehashing all data. Plan number upfront (e.g., 8, 16, 32).
2.4 Decision matrix
flowchart TD A[Choose strategy] --> B{Data has natural time?} B -->|Yes| C[Range by time] B -->|No| D{Discrete values like region/tier?} D -->|Yes| E[List] D -->|No| F{Need even spread?} F -->|Yes| G[Hash] F -->|No| H[Maybe don't partition] style C fill:#c8e6c9 style E fill:#c8e6c9 style G fill:#fff9c4
3. Constraints + Index trên Partitioned Table
3.1 Primary key
Rule: PK MUST include partition key columns.
-- BAD: PK doesn't include partition key
CREATE TABLE orders (
id bigint PRIMARY KEY,
created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);
-- ERROR: unique constraint must include all partitioning columns
-- GOOD: composite PK
CREATE TABLE orders (
id bigint,
created_at timestamptz NOT NULL,
PRIMARY KEY (created_at, id)
) PARTITION BY RANGE (created_at);Rationale: Postgres can only guarantee uniqueness within a partition. PK includes partition key → uniqueness implies cross-partition.
3.2 Indexes
PG11+: create index on parent → auto-propagate to all partitions.
CREATE INDEX ON orders (user_id);
-- Creates index on each partition automatically
-- Future partitions get index automaticallyPG11+ also: CREATE INDEX ON ONLY parent = parent-only marker. Then ATTACH per-partition indexes manually. Used for staged builds.
3.3 Unique constraints
Like PK — must include partition key.
-- Each user unique active session per quarter
CREATE TABLE sessions (
id bigint,
user_id bigint NOT NULL,
created_at timestamptz NOT NULL,
UNIQUE (user_id, created_at) -- includes partition key
) PARTITION BY RANGE (created_at);If you really need cross-partition unique (e.g., global email unique), partitioning may not fit, or you accept partial uniqueness + app-level check.
3.4 Foreign keys — PG12+
Big news (often missed): PG12 brought:
- FK FROM partitioned table to regular table (and other partitioned table)
- FK TO partitioned table from regular table
PG10-11 was very limited.
-- Order_items references orders
CREATE TABLE order_items (
id bigint,
order_created_at timestamptz NOT NULL,
order_id bigint NOT NULL,
product_id bigint NOT NULL REFERENCES products(id),
qty int,
PRIMARY KEY (order_created_at, id),
FOREIGN KEY (order_created_at, order_id) REFERENCES orders (created_at, id)
) PARTITION BY RANGE (order_created_at);FK across partition tables works PG12+. Performance: lookup on parent → routes to child partition.
3.5 CHECK constraints
Auto-propagate, also explicit per-partition:
ALTER TABLE orders_2026_q1 ADD CONSTRAINT q1_status CHECK (status IN ('pending', 'paid', 'shipped'));4. Partition Pruning
4.1 Concept
Planner skips irrelevant partitions when query filters on partition key.
-- Pruned: only orders_2026_q2 scanned
SELECT * FROM orders WHERE created_at >= '2026-04-15' AND created_at < '2026-05-15';
-- Not pruned: scans all partitions
SELECT * FROM orders WHERE user_id = 42;4.2 EXPLAIN to verify
EXPLAIN SELECT * FROM orders WHERE created_at = '2026-05-16 10:00:00';
-- Good:
-- Append
-- -> Seq Scan on orders_2026_q2 orders
-- Filter: ...
-- (only Q2 partition listed)
-- Bad (no pruning):
-- Append
-- -> Seq Scan on orders_2026_q1
-- -> Seq Scan on orders_2026_q2
-- -> Seq Scan on orders_2026_q3
-- -> Seq Scan on orders_2026_q44.3 When pruning works
flowchart TD A[Pruning enabled?] --> B{Filter directly uses partition key?} B -->|Yes col = X, col BETWEEN X AND Y| WORKS[Pruning works] B -->|No col uses function| C{Volatile function?} C -->|Yes now, current_user| MAYBE[PG14+ runtime pruning yes,<br/>PG13- maybe no] C -->|No, immutable| WORKS A --> D{Plan-time constant?} D -->|Yes literal in WHERE| WORKS D -->|No prepared statement parameter| WORKS_PG11[PG11+ generic plan supports] style WORKS fill:#c8e6c9 style MAYBE fill:#fff9c4
4.4 Common pruning failures
Function on partition column:
-- BAD: pruning may fail
SELECT * FROM events WHERE date(created_at) = '2026-05-16';
-- GOOD: range
SELECT * FROM events WHERE created_at >= '2026-05-16' AND created_at < '2026-05-17';Type mismatch:
-- BAD: 'created_at' partition is timestamptz, filter is text
SELECT * FROM events WHERE created_at::text = '2026-05-16';
-- GOOD: native type
SELECT * FROM events WHERE created_at >= timestamptz '2026-05-16';OR conditions can block:
SELECT * FROM events WHERE created_at = '2026-05-16' OR user_id = 42;
-- May not prune (planner uncertain)4.5 Runtime pruning (PG12+)
For prepared statements, parameters known only at exec time.
PREPARE q AS SELECT * FROM events WHERE created_at = $1;
EXECUTE q (timestamp '2026-05-16');
-- PG12+ prunes at executionCheck in EXPLAIN: Subplans Removed: N indicates runtime pruning kicked in.
4.6 Settings
SET enable_partition_pruning = on; -- default
SET constraint_exclusion = partition; -- legacy, less common now5. Default Partition (Trap)
Catch-all for values not matching any explicit partition.
CREATE TABLE orders_default PARTITION OF orders DEFAULT;⚠️ Trap: Adding new partition requires scanning default partition to verify no overlap. With huge default, that scan is slow + locks.
Pattern: don’t use default partition for time-based; always create future partitions ahead of time (pg_partman handles).
For list partitioning, default catches unexpected values — useful safety net.
6. Sub-partitioning (Multi-level)
CREATE TABLE events (
id bigint,
created_at timestamptz NOT NULL,
region text NOT NULL,
PRIMARY KEY (created_at, region, id)
) PARTITION BY RANGE (created_at);
CREATE TABLE events_2026 PARTITION OF events
FOR VALUES FROM ('2026-01-01') TO ('2027-01-01')
PARTITION BY LIST (region);
CREATE TABLE events_2026_us PARTITION OF events_2026 FOR VALUES IN ('US');
CREATE TABLE events_2026_eu PARTITION OF events_2026 FOR VALUES IN ('EU');
CREATE TABLE events_2026_apac PARTITION OF events_2026 FOR VALUES IN ('APAC');Query with both keys → prune both levels:
SELECT * FROM events WHERE created_at >= '2026-05-01' AND region = 'US';
-- Pruned to events_2026_us onlyUse cases:
- Time + region
- Time + tenant
- Time + status
⚠️ Complex management — pg_partman helps but extra layers add overhead.
7. Operations on Partitions
7.1 Attach (add existing table as partition)
-- Create regular table, populate, then attach
CREATE TABLE orders_2026_q1_new (LIKE orders INCLUDING ALL);
INSERT INTO orders_2026_q1_new SELECT * FROM ... ;
ALTER TABLE orders_2026_q1_new ADD CONSTRAINT q1_range
CHECK (created_at >= '2026-01-01' AND created_at < '2026-04-01');
ALTER TABLE orders ATTACH PARTITION orders_2026_q1_new
FOR VALUES FROM ('2026-01-01') TO ('2026-04-01');Optimization: With pre-existing CHECK constraint matching FROM/TO, Postgres skips full-table scan to validate.
7.2 Detach (remove partition without dropping data)
ALTER TABLE orders DETACH PARTITION orders_2026_q1;
-- Now standalone table; can drop, archive, move tablespacePG14+: concurrent detach:
ALTER TABLE orders DETACH PARTITION orders_2026_q1 CONCURRENTLY;Avoid long lock — useful for production.
7.3 Drop old partition (fast cleanup)
The big win:
-- Drop 3-year-old data in milliseconds (no scan, no DELETE)
DROP TABLE orders_2023_q1;vs DELETE FROM orders WHERE created_at < '2023-04-01' which scans + writes WAL + bloats table for hours.
This is the #1 reason to partition: time-based retention becomes free.
7.4 Move to cold tablespace
CREATE TABLESPACE cold LOCATION '/mnt/slow-disk/cold';
ALTER TABLE orders_2023_q1 SET TABLESPACE cold;
-- Old data on cheap HDD, hot data on SSDCommon pattern: hot SSD for last 3 months, warm HDD for older, drop oldest.
7.5 Indexing per partition
Different indexes per partition possible:
-- Old data: heavy compression, summary index only
ALTER TABLE orders_2023_q1 SET (toast_tuple_target = 128);
DROP INDEX idx_orders_2023_q1_user_id; -- old data not queried by user
-- New data: full indexing
-- (inherits from parent index)7.6 VACUUM per partition
VACUUM works per-partition. Hot partition gets aggressive autovacuum:
ALTER TABLE orders_2026_q2 SET (
autovacuum_vacuum_scale_factor = 0.05,
autovacuum_vacuum_cost_delay = 0
);Old partitions: low autovacuum priority.
7.7 Statistics per partition
Each partition has own stats — planner picks based on filter:
ALTER TABLE orders_2026_q2 ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders_2026_q2;7.8 Backup per partition
Logical backup per partition possible:
pg_dump --table=orders_2023_q1 mydb > q1_2023.sqlUseful for archiving old partitions to cold storage offsite.
8. Partition-wise Operations — PG12+
8.1 Partition-wise join
When 2 tables have matching partition scheme, planner can join per-partition (parallel-friendly).
-- Both partitioned same way (range on created_at, same boundaries)
SELECT * FROM orders o
JOIN order_items oi ON o.id = oi.order_id AND o.created_at = oi.order_created_at
WHERE o.created_at BETWEEN '2026-04-01' AND '2026-07-01';Enable:
SET enable_partitionwise_join = on;
SET enable_partitionwise_aggregate = on;Default off (PG12+) because planner cost increases. Turn on if useful for your workload.
8.2 Partition-wise aggregate
GROUP BY can run per-partition then combine.
SET enable_partitionwise_aggregate = on;
SELECT date_trunc('day', created_at), count(*) FROM events GROUP BY 1;
-- Plan: aggregate within each partition, then UNIONParallelizes well across CPU cores.
9. pg_partman — Automation
9.1 Why pg_partman
Manual partition management:
- Create future partitions
- Drop old partitions
- Retention policies
- Sub-partitions
→ Error-prone, repetitive. pg_partman automates.
9.2 Setup
CREATE EXTENSION pg_partman;9.3 Time-based partitioning
-- Create parent
CREATE TABLE events (
id bigint, created_at timestamptz NOT NULL, payload jsonb
) PARTITION BY RANGE (created_at);
-- Register with pg_partman
SELECT partman.create_parent(
p_parent_table => 'public.events',
p_control => 'created_at',
p_type => 'native',
p_interval => 'daily',
p_premake => 7 -- always 7 days ahead
);
-- Configure retention
UPDATE partman.part_config
SET retention = '90 days',
retention_keep_table = false -- drop, don't just detach
WHERE parent_table = 'public.events';
-- Run maintenance (cron this every hour)
SELECT partman.run_maintenance();Output: pg_partman creates events_p20260516, events_p20260517, etc. Drops anything < 90 days old.
9.4 Schedule maintenance
# Crontab
0 * * * * psql -c "SELECT partman.run_maintenance();" appdbOr use pg_cron extension for in-DB scheduling.
9.5 Sub-partition automation
pg_partman handles 2-level (time + region) automatically.
SELECT partman.create_sub_parent(
p_top_parent => 'public.events',
p_control => 'region',
p_type => 'native',
p_interval => '...'
);9.6 Other features
- Constraint exclusion on detached
- Automatic index on new partitions
- Partition templates (apply settings)
10. Migration: Existing Huge Table → Partitioned
10.1 The challenge
Non-partitioned orders table 500GB. Want to partition.
⚠️ Cannot directly convert — must rebuild.
10.2 Strategy A: New table + dual-write + backfill
-- 1. Create new partitioned table (same schema)
CREATE TABLE orders_new (
LIKE orders INCLUDING ALL,
PRIMARY KEY (created_at, id)
) PARTITION BY RANGE (created_at);
-- 2. Create partitions for history + future
CREATE TABLE orders_new_2024 PARTITION OF orders_new FOR VALUES FROM (...) TO (...);
-- ... per year/quarter
-- 3. App dual-write (INSERT to both)
-- 4. Backfill in chunks (Tuần 08)
-- 5. Switch reads to new
-- 6. Stop writing old
-- 7. DROP TABLE orders; ALTER TABLE orders_new RENAME TO orders;Standard expand-contract (Tuần 08). Takes weeks for huge table.
10.3 Strategy B: pg_partman with retention from start
For new feature where you anticipate growth:
- Design as partitioned from day 1
- Avoid migration pain later
Rule of thumb 2024: If you expect > 100M rows over product lifetime, partition from day 1.
10.4 Strategy C: Use pg_partman partition_data_proc
pg_partman ships with partition_data_proc to migrate existing data into partitioned table in batches.
-- After setting up partitioned table
SELECT partman.partition_data_proc(
p_parent_table => 'public.orders_new',
p_batch_count => 100,
p_interval => '1 month'
);Manages copy + lock minimization.
11. Performance Considerations
11.1 Planning time scales with partition count
100s of partitions → planning can take 10-100ms (vs 1ms unpartitioned).
-- Check
EXPLAIN SELECT ... ;
-- Planning Time: 50ms (high!)Mitigations:
- Prepared statements (plan reused)
- Fewer partitions (quarterly vs daily)
- Use pruning aggressively
- PG14+ has many planner optimizations for partitioned tables
11.2 Cross-partition operations slow
Operations spanning many partitions:
count(*)whole table → scan each partition- Cross-partition unique check
- Joins not partition-wise
→ Always filter on partition key when possible.
11.3 Connection overhead
Each backend caches partition info → memory + setup time. Many connections × many partitions = memory pressure.
→ Pooler (Tuần 06) crucial with partitioned tables.
11.4 Lock escalation
Some DDL on parent locks ALL partitions (e.g., ALTER TABLE orders ADD COLUMN).
PG12+ improvements but still careful with DDL.
12. Anti-patterns
| Pattern | Why bad | Fix |
|---|---|---|
| Too few partitions (1 per year) | No pruning benefit | Monthly/weekly |
| Too many partitions (1 per minute) | Planning overhead | Match data volume |
| Default partition huge | New partition attach slow | Always pre-create future |
| Partition key not in PK | Errors at creation | Composite PK |
WHERE func(partition_col) = X | No pruning | Use range on raw col |
| Forgot to ANALYZE new partition | Bad plans | Auto-analyze or manual |
| Cross-partition unique constraint | Not enforceable | Accept, use composite |
| Update partition key | Postgres physically moves row | Avoid; immutable design |
| Hash partition count guessed | Can’t easily resize | Pick generous power of 2 |
| No automation | Forget to create future partition | pg_partman or pg_cron |
| Heavy joins across partitions | Slow | Partition-wise join enabled |
DELETE FROM events WHERE created_at < ... | Slow + bloat | DROP TABLE old_partition |
13. Lab — 7 days
13.1 Day 1: Basic range partitioning
CREATE TABLE events (
id bigserial,
created_at timestamptz NOT NULL,
payload jsonb,
PRIMARY KEY (created_at, id)
) PARTITION BY RANGE (created_at);
-- Create monthly partitions for 6 months
DO $$
DECLARE
start_date date;
BEGIN
FOR i IN 0..5 LOOP
start_date := date_trunc('month', current_date) + (i || ' months')::interval;
EXECUTE format(
'CREATE TABLE events_%s PARTITION OF events FOR VALUES FROM (%L) TO (%L)',
to_char(start_date, 'YYYY_MM'),
start_date,
start_date + interval '1 month'
);
END LOOP;
END $$;
-- Insert spreading across months
INSERT INTO events (created_at, payload)
SELECT
current_date + (random() * 150 || ' days')::interval,
jsonb_build_object('user', i)
FROM generate_series(1, 1000000) i;
-- EXPLAIN pruning
EXPLAIN ANALYZE SELECT count(*) FROM events
WHERE created_at >= current_date AND created_at < current_date + interval '7 days';
-- Compare with non-partitioned
CREATE TABLE events_flat (LIKE events INCLUDING ALL);
INSERT INTO events_flat SELECT * FROM events;
EXPLAIN ANALYZE SELECT count(*) FROM events_flat WHERE created_at >= ... ;13.2 Day 2: List partitioning
Partition users by country. Test queries with/without country filter.
13.3 Day 3: Hash partitioning
Partition events by user_id hash (8 partitions). Verify even distribution. Test pruning.
13.4 Day 4: Sub-partitioning
Time + region two-level. Verify both pruning levels.
13.5 Day 5: pg_partman setup
Install, register events table, schedule maintenance, watch partitions auto-created.
13.6 Day 6: Attach/detach
Add partition for next quarter via ATTACH. Detach old partition. Drop.
13.7 Day 7: Migration
Take existing flat table 5M rows. Migrate to partitioned via dual-write + backfill (simplified version of production strategy).
14. Self-check
- Khi nào nên partition? 3 triệu chứng?
- 3 partition strategies + use case mỗi cái?
- Default partition trap — vì sao tránh time-based?
- PG12+ FK trên partitioned table — gì mới so PG10/11?
- Partition pruning — khi nào work, khi nào fail?
- Sub-partitioning — trade-off?
- ATTACH vs DETACH partition — workflow?
- pg_partman vai trò?
- Vì sao “drop old partition” rẻ hơn DELETE? (10x lighter cost)
- Migration flat → partitioned: chiến lược tổng quan?
15. Tiếp theo
Back to roadmap. Tuần này nên đọc kèm Tuần 08 (Migration) và Tuần 09 (Observability).
Bonus partitioning hoàn thành. Drop, not delete. Cập nhật: 2026-05-16