Bonus — Postgres Partitioning Deep Dive

“Bảng 100M rows OK. 1B rows: VACUUM mất giờ, autovacuum không kịp, query plan chậm dù index, backup chậm, drop old data lock cả ngày. Partitioning fix mọi thứ — nhưng chỉ khi design đúng. Tuần này dạy bạn cách design đúng.”

Tags: database postgresql partitioning pg_partman operations Thời lượng: 5-6 ngày (5-7h/ngày) Prerequisites: Tuan-01-DB-Internals-Refresh · Tuan-03-Indexing-Mastery · Tuan-09-DB-Observability-Tuning Liên quan: Tuan-08-Zero-Downtime-Migration · Case-Design-Data-Realtime-Analytics


1. Context & Why

1.1 Khi nào cần partition

Triệu chứng cần partition:

  • Bảng > 100GB hoặc > 1B rows
  • Query chậm dù có index đúng (cache pressure)
  • Bulk delete old data tốn cả ngày (DELETE WHERE created_at < ... lock everything)
  • VACUUM không bao giờ “xong”
  • Backup/restore mất quá lâu
  • Index maintenance bloat
  • Hot data + cold data distinct access pattern

Khi KHÔNG cần:

  • Bảng < 10GB
  • Query đều spread across rows (no time/range locality)
  • Không có natural partition key

1.2 Partition là gì

Logical: 1 bảng từ góc nhìn SQL — INSERT INTO orders ..., SELECT FROM orders ... work as normal.

Physical: Postgres lưu nhiều “child tables”, route insert + filter query theo partition key.

graph TB
    Parent[orders<br/>parent partition table<br/>no rows stored here]
    P1[orders_2026_q1<br/>partition Q1]
    P2[orders_2026_q2<br/>partition Q2]
    P3[orders_2026_q3<br/>partition Q3]
    P4[orders_2026_q4<br/>partition Q4]

    Parent -.routes by created_at.-> P1
    Parent -.routes.-> P2
    Parent -.routes.-> P3
    Parent -.routes.-> P4

    Query[SELECT WHERE created_at = '2026-05-16'] -->|pruned| P2

1.3 Mục tiêu chương

  • 3 partition strategies: range, list, hash
  • Postgres native partitioning (PG10+) — fundamentals + production patterns
  • Partition pruning rules — khi pruning work / không work
  • Sub-partitioning (multi-level)
  • Attach/detach partition online
  • Foreign keys trên partitioned tables (PG12+)
  • Partition-wise joins (PG12+)
  • Default partition trap
  • pg_partman cho automation
  • Migration: existing huge table → partitioned

1.4 Tham chiếu


2. 3 Partition Strategies

2.1 Range partitioning — Most common

Partition theo continuous range của 1 hoặc nhiều columns. Typical: time.

CREATE TABLE orders (
    id bigint GENERATED ALWAYS AS IDENTITY,
    user_id bigint NOT NULL,
    total numeric NOT NULL,
    status text NOT NULL,
    created_at timestamptz NOT NULL,
    PRIMARY KEY (created_at, id)  -- partition key MUST be in PK
) PARTITION BY RANGE (created_at);
 
-- Create partitions
CREATE TABLE orders_2026_q1 PARTITION OF orders
    FOR VALUES FROM ('2026-01-01') TO ('2026-04-01');
CREATE TABLE orders_2026_q2 PARTITION OF orders
    FOR VALUES FROM ('2026-04-01') TO ('2026-07-01');
CREATE TABLE orders_2026_q3 PARTITION OF orders
    FOR VALUES FROM ('2026-07-01') TO ('2026-10-01');
CREATE TABLE orders_2026_q4 PARTITION OF orders
    FOR VALUES FROM ('2026-10-01') TO ('2027-01-01');

Range “FROM .. TO ..”: inclusive lower, exclusive upper.

Use cases:

  • Time-series data (logs, events, orders)
  • Sequential IDs (orders by id range)
  • Geographic regions (lat/lon ranges)

2.2 List partitioning

Discrete enum values.

CREATE TABLE users_by_country (
    id bigint,
    country_code char(2) NOT NULL,
    ...
) PARTITION BY LIST (country_code);
 
CREATE TABLE users_us PARTITION OF users_by_country FOR VALUES IN ('US');
CREATE TABLE users_eu PARTITION OF users_by_country FOR VALUES IN ('DE', 'FR', 'GB', 'IT', 'ES');
CREATE TABLE users_apac PARTITION OF users_by_country FOR VALUES IN ('JP', 'KR', 'VN', 'TH', 'SG');
CREATE TABLE users_other PARTITION OF users_by_country DEFAULT;

Use cases:

  • Multi-region (compliance data residency)
  • Status partitions (active vs archived)
  • Tenant tier (free vs paid vs enterprise)

2.3 Hash partitioning

Spread evenly by hash. PG11+.

CREATE TABLE events (
    id bigint, user_id bigint NOT NULL, ...
) PARTITION BY HASH (user_id);
 
CREATE TABLE events_p0 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE events_p1 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE events_p2 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE events_p3 PARTITION OF events FOR VALUES WITH (MODULUS 4, REMAINDER 3);

Use cases:

  • Spread write load (no time/range locality)
  • Tenant_id where uniform distribution wanted
  • Avoid hot partitions

⚠️ Limit: cannot add/remove partitions without rehashing all data. Plan number upfront (e.g., 8, 16, 32).

2.4 Decision matrix

flowchart TD
    A[Choose strategy] --> B{Data has natural time?}
    B -->|Yes| C[Range by time]
    B -->|No| D{Discrete values like region/tier?}
    D -->|Yes| E[List]
    D -->|No| F{Need even spread?}
    F -->|Yes| G[Hash]
    F -->|No| H[Maybe don't partition]

    style C fill:#c8e6c9
    style E fill:#c8e6c9
    style G fill:#fff9c4

3. Constraints + Index trên Partitioned Table

3.1 Primary key

Rule: PK MUST include partition key columns.

-- BAD: PK doesn't include partition key
CREATE TABLE orders (
    id bigint PRIMARY KEY,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);
-- ERROR: unique constraint must include all partitioning columns
 
-- GOOD: composite PK
CREATE TABLE orders (
    id bigint,
    created_at timestamptz NOT NULL,
    PRIMARY KEY (created_at, id)
) PARTITION BY RANGE (created_at);

Rationale: Postgres can only guarantee uniqueness within a partition. PK includes partition key → uniqueness implies cross-partition.

3.2 Indexes

PG11+: create index on parent → auto-propagate to all partitions.

CREATE INDEX ON orders (user_id);
-- Creates index on each partition automatically
-- Future partitions get index automatically

PG11+ also: CREATE INDEX ON ONLY parent = parent-only marker. Then ATTACH per-partition indexes manually. Used for staged builds.

3.3 Unique constraints

Like PK — must include partition key.

-- Each user unique active session per quarter
CREATE TABLE sessions (
    id bigint,
    user_id bigint NOT NULL,
    created_at timestamptz NOT NULL,
    UNIQUE (user_id, created_at)  -- includes partition key
) PARTITION BY RANGE (created_at);

If you really need cross-partition unique (e.g., global email unique), partitioning may not fit, or you accept partial uniqueness + app-level check.

3.4 Foreign keys — PG12+

Big news (often missed): PG12 brought:

  • FK FROM partitioned table to regular table (and other partitioned table)
  • FK TO partitioned table from regular table

PG10-11 was very limited.

-- Order_items references orders
CREATE TABLE order_items (
    id bigint,
    order_created_at timestamptz NOT NULL,
    order_id bigint NOT NULL,
    product_id bigint NOT NULL REFERENCES products(id),
    qty int,
    PRIMARY KEY (order_created_at, id),
    FOREIGN KEY (order_created_at, order_id) REFERENCES orders (created_at, id)
) PARTITION BY RANGE (order_created_at);

FK across partition tables works PG12+. Performance: lookup on parent → routes to child partition.

3.5 CHECK constraints

Auto-propagate, also explicit per-partition:

ALTER TABLE orders_2026_q1 ADD CONSTRAINT q1_status CHECK (status IN ('pending', 'paid', 'shipped'));

4. Partition Pruning

4.1 Concept

Planner skips irrelevant partitions when query filters on partition key.

-- Pruned: only orders_2026_q2 scanned
SELECT * FROM orders WHERE created_at >= '2026-04-15' AND created_at < '2026-05-15';
 
-- Not pruned: scans all partitions
SELECT * FROM orders WHERE user_id = 42;

4.2 EXPLAIN to verify

EXPLAIN SELECT * FROM orders WHERE created_at = '2026-05-16 10:00:00';
 
-- Good:
-- Append
--   ->  Seq Scan on orders_2026_q2 orders
--         Filter: ...
-- (only Q2 partition listed)
 
-- Bad (no pruning):
-- Append
--   ->  Seq Scan on orders_2026_q1
--   ->  Seq Scan on orders_2026_q2
--   ->  Seq Scan on orders_2026_q3
--   ->  Seq Scan on orders_2026_q4

4.3 When pruning works

flowchart TD
    A[Pruning enabled?] --> B{Filter directly uses partition key?}
    B -->|Yes col = X, col BETWEEN X AND Y| WORKS[Pruning works]
    B -->|No col uses function| C{Volatile function?}
    C -->|Yes now, current_user| MAYBE[PG14+ runtime pruning yes,<br/>PG13- maybe no]
    C -->|No, immutable| WORKS

    A --> D{Plan-time constant?}
    D -->|Yes literal in WHERE| WORKS
    D -->|No prepared statement parameter| WORKS_PG11[PG11+ generic plan supports]

    style WORKS fill:#c8e6c9
    style MAYBE fill:#fff9c4

4.4 Common pruning failures

Function on partition column:

-- BAD: pruning may fail
SELECT * FROM events WHERE date(created_at) = '2026-05-16';
 
-- GOOD: range
SELECT * FROM events WHERE created_at >= '2026-05-16' AND created_at < '2026-05-17';

Type mismatch:

-- BAD: 'created_at' partition is timestamptz, filter is text
SELECT * FROM events WHERE created_at::text = '2026-05-16';
 
-- GOOD: native type
SELECT * FROM events WHERE created_at >= timestamptz '2026-05-16';

OR conditions can block:

SELECT * FROM events WHERE created_at = '2026-05-16' OR user_id = 42;
-- May not prune (planner uncertain)

4.5 Runtime pruning (PG12+)

For prepared statements, parameters known only at exec time.

PREPARE q AS SELECT * FROM events WHERE created_at = $1;
EXECUTE q (timestamp '2026-05-16');
-- PG12+ prunes at execution

Check in EXPLAIN: Subplans Removed: N indicates runtime pruning kicked in.

4.6 Settings

SET enable_partition_pruning = on;   -- default
SET constraint_exclusion = partition;  -- legacy, less common now

5. Default Partition (Trap)

Catch-all for values not matching any explicit partition.

CREATE TABLE orders_default PARTITION OF orders DEFAULT;

⚠️ Trap: Adding new partition requires scanning default partition to verify no overlap. With huge default, that scan is slow + locks.

Pattern: don’t use default partition for time-based; always create future partitions ahead of time (pg_partman handles).

For list partitioning, default catches unexpected values — useful safety net.


6. Sub-partitioning (Multi-level)

CREATE TABLE events (
    id bigint,
    created_at timestamptz NOT NULL,
    region text NOT NULL,
    PRIMARY KEY (created_at, region, id)
) PARTITION BY RANGE (created_at);
 
CREATE TABLE events_2026 PARTITION OF events
    FOR VALUES FROM ('2026-01-01') TO ('2027-01-01')
    PARTITION BY LIST (region);
 
CREATE TABLE events_2026_us PARTITION OF events_2026 FOR VALUES IN ('US');
CREATE TABLE events_2026_eu PARTITION OF events_2026 FOR VALUES IN ('EU');
CREATE TABLE events_2026_apac PARTITION OF events_2026 FOR VALUES IN ('APAC');

Query with both keys → prune both levels:

SELECT * FROM events WHERE created_at >= '2026-05-01' AND region = 'US';
-- Pruned to events_2026_us only

Use cases:

  • Time + region
  • Time + tenant
  • Time + status

⚠️ Complex management — pg_partman helps but extra layers add overhead.


7. Operations on Partitions

7.1 Attach (add existing table as partition)

-- Create regular table, populate, then attach
CREATE TABLE orders_2026_q1_new (LIKE orders INCLUDING ALL);
INSERT INTO orders_2026_q1_new SELECT * FROM ... ;
ALTER TABLE orders_2026_q1_new ADD CONSTRAINT q1_range
    CHECK (created_at >= '2026-01-01' AND created_at < '2026-04-01');
 
ALTER TABLE orders ATTACH PARTITION orders_2026_q1_new
    FOR VALUES FROM ('2026-01-01') TO ('2026-04-01');

Optimization: With pre-existing CHECK constraint matching FROM/TO, Postgres skips full-table scan to validate.

7.2 Detach (remove partition without dropping data)

ALTER TABLE orders DETACH PARTITION orders_2026_q1;
-- Now standalone table; can drop, archive, move tablespace

PG14+: concurrent detach:

ALTER TABLE orders DETACH PARTITION orders_2026_q1 CONCURRENTLY;

Avoid long lock — useful for production.

7.3 Drop old partition (fast cleanup)

The big win:

-- Drop 3-year-old data in milliseconds (no scan, no DELETE)
DROP TABLE orders_2023_q1;

vs DELETE FROM orders WHERE created_at < '2023-04-01' which scans + writes WAL + bloats table for hours.

This is the #1 reason to partition: time-based retention becomes free.

7.4 Move to cold tablespace

CREATE TABLESPACE cold LOCATION '/mnt/slow-disk/cold';
 
ALTER TABLE orders_2023_q1 SET TABLESPACE cold;
-- Old data on cheap HDD, hot data on SSD

Common pattern: hot SSD for last 3 months, warm HDD for older, drop oldest.

7.5 Indexing per partition

Different indexes per partition possible:

-- Old data: heavy compression, summary index only
ALTER TABLE orders_2023_q1 SET (toast_tuple_target = 128);
DROP INDEX idx_orders_2023_q1_user_id;  -- old data not queried by user
 
-- New data: full indexing
-- (inherits from parent index)

7.6 VACUUM per partition

VACUUM works per-partition. Hot partition gets aggressive autovacuum:

ALTER TABLE orders_2026_q2 SET (
    autovacuum_vacuum_scale_factor = 0.05,
    autovacuum_vacuum_cost_delay = 0
);

Old partitions: low autovacuum priority.

7.7 Statistics per partition

Each partition has own stats — planner picks based on filter:

ALTER TABLE orders_2026_q2 ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders_2026_q2;

7.8 Backup per partition

Logical backup per partition possible:

pg_dump --table=orders_2023_q1 mydb > q1_2023.sql

Useful for archiving old partitions to cold storage offsite.


8. Partition-wise Operations — PG12+

8.1 Partition-wise join

When 2 tables have matching partition scheme, planner can join per-partition (parallel-friendly).

-- Both partitioned same way (range on created_at, same boundaries)
SELECT * FROM orders o
JOIN order_items oi ON o.id = oi.order_id AND o.created_at = oi.order_created_at
WHERE o.created_at BETWEEN '2026-04-01' AND '2026-07-01';

Enable:

SET enable_partitionwise_join = on;
SET enable_partitionwise_aggregate = on;

Default off (PG12+) because planner cost increases. Turn on if useful for your workload.

8.2 Partition-wise aggregate

GROUP BY can run per-partition then combine.

SET enable_partitionwise_aggregate = on;
SELECT date_trunc('day', created_at), count(*) FROM events GROUP BY 1;
-- Plan: aggregate within each partition, then UNION

Parallelizes well across CPU cores.


9. pg_partman — Automation

9.1 Why pg_partman

Manual partition management:

  • Create future partitions
  • Drop old partitions
  • Retention policies
  • Sub-partitions

→ Error-prone, repetitive. pg_partman automates.

9.2 Setup

CREATE EXTENSION pg_partman;

9.3 Time-based partitioning

-- Create parent
CREATE TABLE events (
    id bigint, created_at timestamptz NOT NULL, payload jsonb
) PARTITION BY RANGE (created_at);
 
-- Register with pg_partman
SELECT partman.create_parent(
    p_parent_table => 'public.events',
    p_control => 'created_at',
    p_type => 'native',
    p_interval => 'daily',
    p_premake => 7  -- always 7 days ahead
);
 
-- Configure retention
UPDATE partman.part_config
SET retention = '90 days',
    retention_keep_table = false  -- drop, don't just detach
WHERE parent_table = 'public.events';
 
-- Run maintenance (cron this every hour)
SELECT partman.run_maintenance();

Output: pg_partman creates events_p20260516, events_p20260517, etc. Drops anything < 90 days old.

9.4 Schedule maintenance

# Crontab
0 * * * * psql -c "SELECT partman.run_maintenance();" appdb

Or use pg_cron extension for in-DB scheduling.

9.5 Sub-partition automation

pg_partman handles 2-level (time + region) automatically.

SELECT partman.create_sub_parent(
    p_top_parent => 'public.events',
    p_control => 'region',
    p_type => 'native',
    p_interval => '...'
);

9.6 Other features

  • Constraint exclusion on detached
  • Automatic index on new partitions
  • Partition templates (apply settings)

10. Migration: Existing Huge Table → Partitioned

10.1 The challenge

Non-partitioned orders table 500GB. Want to partition.

⚠️ Cannot directly convert — must rebuild.

10.2 Strategy A: New table + dual-write + backfill

-- 1. Create new partitioned table (same schema)
CREATE TABLE orders_new (
    LIKE orders INCLUDING ALL,
    PRIMARY KEY (created_at, id)
) PARTITION BY RANGE (created_at);
 
-- 2. Create partitions for history + future
CREATE TABLE orders_new_2024 PARTITION OF orders_new FOR VALUES FROM (...) TO (...);
-- ... per year/quarter
 
-- 3. App dual-write (INSERT to both)
-- 4. Backfill in chunks (Tuần 08)
-- 5. Switch reads to new
-- 6. Stop writing old
-- 7. DROP TABLE orders; ALTER TABLE orders_new RENAME TO orders;

Standard expand-contract (Tuần 08). Takes weeks for huge table.

10.3 Strategy B: pg_partman with retention from start

For new feature where you anticipate growth:

  • Design as partitioned from day 1
  • Avoid migration pain later

Rule of thumb 2024: If you expect > 100M rows over product lifetime, partition from day 1.

10.4 Strategy C: Use pg_partman partition_data_proc

pg_partman ships with partition_data_proc to migrate existing data into partitioned table in batches.

-- After setting up partitioned table
SELECT partman.partition_data_proc(
    p_parent_table => 'public.orders_new',
    p_batch_count => 100,
    p_interval => '1 month'
);

Manages copy + lock minimization.


11. Performance Considerations

11.1 Planning time scales with partition count

100s of partitions → planning can take 10-100ms (vs 1ms unpartitioned).

-- Check
EXPLAIN SELECT ... ;
-- Planning Time: 50ms (high!)

Mitigations:

  • Prepared statements (plan reused)
  • Fewer partitions (quarterly vs daily)
  • Use pruning aggressively
  • PG14+ has many planner optimizations for partitioned tables

11.2 Cross-partition operations slow

Operations spanning many partitions:

  • count(*) whole table → scan each partition
  • Cross-partition unique check
  • Joins not partition-wise

→ Always filter on partition key when possible.

11.3 Connection overhead

Each backend caches partition info → memory + setup time. Many connections × many partitions = memory pressure.

→ Pooler (Tuần 06) crucial with partitioned tables.

11.4 Lock escalation

Some DDL on parent locks ALL partitions (e.g., ALTER TABLE orders ADD COLUMN).

PG12+ improvements but still careful with DDL.


12. Anti-patterns

PatternWhy badFix
Too few partitions (1 per year)No pruning benefitMonthly/weekly
Too many partitions (1 per minute)Planning overheadMatch data volume
Default partition hugeNew partition attach slowAlways pre-create future
Partition key not in PKErrors at creationComposite PK
WHERE func(partition_col) = XNo pruningUse range on raw col
Forgot to ANALYZE new partitionBad plansAuto-analyze or manual
Cross-partition unique constraintNot enforceableAccept, use composite
Update partition keyPostgres physically moves rowAvoid; immutable design
Hash partition count guessedCan’t easily resizePick generous power of 2
No automationForget to create future partitionpg_partman or pg_cron
Heavy joins across partitionsSlowPartition-wise join enabled
DELETE FROM events WHERE created_at < ...Slow + bloatDROP TABLE old_partition

13. Lab — 7 days

13.1 Day 1: Basic range partitioning

CREATE TABLE events (
    id bigserial,
    created_at timestamptz NOT NULL,
    payload jsonb,
    PRIMARY KEY (created_at, id)
) PARTITION BY RANGE (created_at);
 
-- Create monthly partitions for 6 months
DO $$
DECLARE
    start_date date;
BEGIN
    FOR i IN 0..5 LOOP
        start_date := date_trunc('month', current_date) + (i || ' months')::interval;
        EXECUTE format(
            'CREATE TABLE events_%s PARTITION OF events FOR VALUES FROM (%L) TO (%L)',
            to_char(start_date, 'YYYY_MM'),
            start_date,
            start_date + interval '1 month'
        );
    END LOOP;
END $$;
 
-- Insert spreading across months
INSERT INTO events (created_at, payload)
SELECT
    current_date + (random() * 150 || ' days')::interval,
    jsonb_build_object('user', i)
FROM generate_series(1, 1000000) i;
 
-- EXPLAIN pruning
EXPLAIN ANALYZE SELECT count(*) FROM events
WHERE created_at >= current_date AND created_at < current_date + interval '7 days';
 
-- Compare with non-partitioned
CREATE TABLE events_flat (LIKE events INCLUDING ALL);
INSERT INTO events_flat SELECT * FROM events;
EXPLAIN ANALYZE SELECT count(*) FROM events_flat WHERE created_at >= ... ;

13.2 Day 2: List partitioning

Partition users by country. Test queries with/without country filter.

13.3 Day 3: Hash partitioning

Partition events by user_id hash (8 partitions). Verify even distribution. Test pruning.

13.4 Day 4: Sub-partitioning

Time + region two-level. Verify both pruning levels.

13.5 Day 5: pg_partman setup

Install, register events table, schedule maintenance, watch partitions auto-created.

13.6 Day 6: Attach/detach

Add partition for next quarter via ATTACH. Detach old partition. Drop.

13.7 Day 7: Migration

Take existing flat table 5M rows. Migrate to partitioned via dual-write + backfill (simplified version of production strategy).


14. Self-check

  1. Khi nào nên partition? 3 triệu chứng?
  2. 3 partition strategies + use case mỗi cái?
  3. Default partition trap — vì sao tránh time-based?
  4. PG12+ FK trên partitioned table — gì mới so PG10/11?
  5. Partition pruning — khi nào work, khi nào fail?
  6. Sub-partitioning — trade-off?
  7. ATTACH vs DETACH partition — workflow?
  8. pg_partman vai trò?
  9. Vì sao “drop old partition” rẻ hơn DELETE? (10x lighter cost)
  10. Migration flat → partitioned: chiến lược tổng quan?

15. Tiếp theo

Back to roadmap. Tuần này nên đọc kèm Tuần 08 (Migration) và Tuần 09 (Observability).


Bonus partitioning hoàn thành. Drop, not delete. Cập nhật: 2026-05-16