Tuần 08 — Zero-Downtime Migration
“Schema migration trên bảng 200M rows trong production lúc 3 giờ sáng. Bạn có 1 chance. Hỏng → revert phức tạp, có khi phải restore PITR (Tuần 07). Tuần này dạy bạn approach không bao giờ stress, không bao giờ down.”
Tags: database postgresql migration schema-change operations ddl Thời lượng: 7 ngày (5-7h/ngày) Prerequisites: Tuan-02-Schema-Design-Normalization · Tuan-05-Transactions-Locking (locks) · Tuan-07-Backup-PITR-DR Liên quan: Tuan-Bonus-CDC-Debezium · Case-Design-Migrate-Monolith-DB
1. Context & Why
1.1 “Stop the world” không còn là option
10 năm trước: maintenance window 2 AM Sunday OK. 2026: global product, 24/7 traffic, 5-min downtime = $50K + customer churn. Migration phải online.
1.2 Cấp độ nguy hiểm của DDL
graph TB L1[Level 1: Metadata only<br/>RENAME, COMMENT<br/>microseconds] --> Safe[Safe anytime] L2[Level 2: Brief AccessExclusive<br/>ADD COLUMN default null<br/>milliseconds] --> Safe L3[Level 3: Full table rewrite<br/>ADD COLUMN with non-null default fn<br/>ALTER TYPE<br/>minutes-hours] --> Risky[Risky - needs strategy] L4[Level 4: Long lock<br/>CREATE INDEX without CONCURRENTLY<br/>VACUUM FULL<br/>hours] --> Dangerous[Avoid online] L5[Level 5: Multi-step rollout<br/>Add column + backfill + constraint<br/>days] --> Expand[Expand-Contract] style Safe fill:#c8e6c9 style Risky fill:#fff9c4 style Dangerous fill:#ffccbc style Expand fill:#90caf9
1.3 Mục tiêu tuần
- Master Expand-Contract pattern
- Postgres-specific safe DDL (PG12+, PG16+ updates)
- Tools: gh-ost, pt-osc cho MySQL; pg_repack cho Postgres
- Dual-write + backfill chunking strategy
- Rollback plan cho mọi migration
- Lock timeout patterns
- Schema version control: Flyway, Liquibase, sqlx, Alembic
- Real-world recipes: rename column, add NOT NULL, change column type, partition existing table
1.4 Tham chiếu
- Strong migrations — https://github.com/ankane/strong_migrations (Rails best practices)
- GitLab Database Migration Style Guide — https://docs.gitlab.com/development/migration_style_guide/
- github/gh-ost — https://github.com/github/gh-ost
- percona-toolkit pt-online-schema-change — https://docs.percona.com/percona-toolkit/pt-online-schema-change.html
- Braintree blog — PostgreSQL migrations — https://medium.com/braintree-product-technology
- Soumya Roy — Postgres safe migrations posts
2. The Big Picture — Expand-Contract Pattern
2.1 Concept
Schema migration phải support 2 app versions cùng chạy:
- Old app code: schema cũ
- New app code: schema mới
- Cả 2 phải work với schema during migration
sequenceDiagram participant App as App participant DB as Database Note over App,DB: Phase 1 - Expand Note over DB: Add new column nullable Note over App: Deploy v2: write both old + new<br/>read old still Note over DB: Backfill new column in chunks Note over App: Deploy v3: write both, read NEW Note over App: Verify in production Note over App,DB: Phase 2 - Contract Note over App: Deploy v4: only new column Note over DB: Add NOT NULL constraint Note over DB: Drop old column
2.2 6 steps universal
- Backward-compatible schema change (Expand)
- Deploy code writes both old + new
- Backfill old data into new schema
- Deploy code reads new, writes both
- Wait + monitor — let traffic settle
- Drop old (Contract) when safe
Each step reversible. Worst case at any step: rollback code, no DB change to revert.
2.3 Anti-pattern — “rip the bandaid”
-- BAD: deploy and migrate together
BEGIN;
ALTER TABLE users RENAME COLUMN email TO email_address;
COMMIT;
-- Deploy new code that uses email_addressVấn đề:
- Brief moment: old app pods still running with column
email→ app errors - Migration takes lock → block writes during alter
- Rollback complex (rename back + revert code + hope no writes lost)
3. Postgres DDL Cheat Sheet — Safe vs Not
3.1 Safe operations (PG12+)
| Operation | Lock | Duration | Notes |
|---|---|---|---|
ADD COLUMN no default | AccessExclusive brief | ms | Metadata only |
ADD COLUMN DEFAULT constant (PG11+) | AccessExclusive brief | ms | Metadata only, default stored in catalog |
ADD COLUMN DEFAULT function() (volatile) | AccessExclusive long | minutes-hours | Rewrites table! |
DROP COLUMN | AccessExclusive brief | ms | Marks as removed, actual cleanup async |
RENAME COLUMN | AccessExclusive brief | ms | App must handle both names during deploy |
RENAME TABLE | AccessExclusive brief | ms | Same caveat |
ADD CONSTRAINT NOT NULL | AccessExclusive | full scan to validate | Use NOT VALID trick |
ADD CONSTRAINT CHECK | AccessExclusive | full scan to validate | Use NOT VALID trick |
ALTER COLUMN TYPE (compatible) | AccessExclusive | depends | Sometimes no rewrite |
CREATE INDEX | ShareUpdateExclusive blocks writes | minutes-hours | Use CONCURRENTLY! |
CREATE INDEX CONCURRENTLY | minimal | longer | Online safe |
DROP INDEX | AccessExclusive brief | ms | CONCURRENTLY available |
3.2 NOT NULL trick — PG12+ super useful
-- BAD: rewrites table, AccessExclusive, full scan
ALTER TABLE huge_table ALTER COLUMN col SET NOT NULL;
-- GOOD: 2 steps, no rewrite
ALTER TABLE huge_table ADD CONSTRAINT col_not_null
CHECK (col IS NOT NULL) NOT VALID;
-- Quick: only adds constraint, doesn't check existing
-- Later (can run anytime, doesn't lock writes):
ALTER TABLE huge_table VALIDATE CONSTRAINT col_not_null;
-- Validates in background while reads/writes continue
-- PG12+ shortcut after validate:
ALTER TABLE huge_table ALTER COLUMN col SET NOT NULL;
-- Postgres notices CHECK constraint exists and skips full scan
ALTER TABLE huge_table DROP CONSTRAINT col_not_null; -- cleanup3.3 ADD COLUMN with default — PG11+
-- PG10 and before: rewrites table
ALTER TABLE huge ADD COLUMN status text DEFAULT 'active';
-- PG11+: metadata only if default is constant
-- Same syntax, just fast now!Sneaky: DEFAULT now() or DEFAULT gen_random_uuid() is volatile function → rewrite. Constants OK.
3.4 ALTER COLUMN TYPE — compatibility matrix
-- PG10+: int → bigint = no rewrite (binary compatible at storage layer)
ALTER TABLE big_table ALTER COLUMN id TYPE bigint;
-- ✓ fast
-- text → varchar(N) - no rewrite if data fits
ALTER TABLE t ALTER COLUMN col TYPE varchar(255);
-- ✓ if existing data <=255 chars
-- text → int - always rewrite + may fail
ALTER TABLE t ALTER COLUMN col TYPE int USING col::int;
-- ✗ rewrite, AccessExclusive longCheck before:
EXPLAIN ALTER TABLE t ALTER COLUMN col TYPE bigint;
-- Not directly possible. Test in staging with same data size.3.5 PG12-17 — DDL feature timeline
PG12 (2019) đã ship cái nhiều người tưởng là gần đây:
- Foreign keys trên partitioned tables
- Partition-wise joins
- Pluggable table storage (table AM)
PG13-14:
- Logical replication of partitioned tables
- REINDEX CONCURRENTLY on partitioned indexes (PG14)
PG15-16:
- MERGE statement (PG15)
- Parallel apply for logical replication (PG16)
pg_stat_io(PG16)
PG17 (Sept 2024) thực tế ship:
MERGE ... RETURNINGsupport- Incremental sort improvements
- COPY enhancements (
ON_ERROR ignore,LOG_VERBOSITY) EXPLAIN (MEMORY)shows planner memory- Improved SAOP (
IN (...)) handling in B-tree - Built-in C.UTF-8 locale
PG18 (Sept 2025, expected at writing):
- True B-tree skip scan
- Asynchronous I/O (io_uring) groundwork
- More logical replication improvements
3.6 CREATE INDEX CONCURRENTLY in depth
CREATE INDEX CONCURRENTLY idx_orders_user ON orders(user_id);Process:
- Build catalog entry (brief lock)
- Scan table — collect tuples (no lock blocking reads/writes)
- Sort + insert into index
- Second scan — catch up changes from step 2
- Mark valid
⚠️ Cannot run inside transaction. If fails → leaves invalid index → drop & retry.
-- Check progress
SELECT * FROM pg_stat_progress_create_index;
-- Find invalid indexes
SELECT i.indexrelid::regclass FROM pg_index i WHERE NOT i.indisvalid;
-- Cleanup
DROP INDEX CONCURRENTLY idx_failed;4. Common Migration Recipes
4.1 Add nullable column
-- Single step, safe
ALTER TABLE users ADD COLUMN phone text;App deploy after migration. No code change needed for old app (it ignores new column).
4.2 Add NOT NULL column with default
-- Step 1: Add nullable
ALTER TABLE users ADD COLUMN onboarded_at timestamptz;
-- Step 2: App writes onboarded_at on signup
-- Deploy app v2
-- Step 3: Backfill in chunks (see section 5)
-- ...
-- Step 4: Add CHECK NOT NULL VALID
ALTER TABLE users ADD CONSTRAINT users_onboarded_at_not_null
CHECK (onboarded_at IS NOT NULL) NOT VALID;
ALTER TABLE users VALIDATE CONSTRAINT users_onboarded_at_not_null;
-- Step 5: Promote CHECK to NOT NULL
ALTER TABLE users ALTER COLUMN onboarded_at SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_onboarded_at_not_null;4.3 Rename column
-- Step 1: Add new column
ALTER TABLE users ADD COLUMN email_address text;
-- Step 2: Trigger to sync writes (during transition)
CREATE OR REPLACE FUNCTION sync_email() RETURNS trigger AS $$
BEGIN
IF NEW.email IS DISTINCT FROM OLD.email THEN
NEW.email_address := NEW.email;
END IF;
IF NEW.email_address IS DISTINCT FROM OLD.email_address THEN
NEW.email := NEW.email_address;
END IF;
RETURN NEW;
END $$ LANGUAGE plpgsql;
CREATE TRIGGER trg_users_sync_email BEFORE UPDATE OR INSERT ON users
FOR EACH ROW EXECUTE FUNCTION sync_email();
-- Step 3: Backfill old data
UPDATE users SET email_address = email WHERE email_address IS NULL;
-- In chunks!
-- Step 4: Deploy app v2 to use email_address
-- Step 5: Verify, wait, drop old column + trigger
ALTER TABLE users DROP COLUMN email;
DROP FUNCTION sync_email();4.4 Change column type
E.g., id int → id bigint:
-- PG10+: this is fast in most cases (no rewrite for int→bigint)
ALTER TABLE huge_table ALTER COLUMN id TYPE bigint;For incompatible types (text → int with parsing):
-- Step 1: Add new column
ALTER TABLE orders ADD COLUMN total_cents_new bigint;
-- Step 2: Trigger sync writes
CREATE OR REPLACE FUNCTION sync_total() RETURNS trigger AS $$
BEGIN
NEW.total_cents_new := (NEW.total_str::numeric * 100)::bigint;
RETURN NEW;
END $$ LANGUAGE plpgsql;
CREATE TRIGGER trg_orders_sync_total BEFORE UPDATE OR INSERT ON orders
FOR EACH ROW EXECUTE FUNCTION sync_total();
-- Step 3: Backfill in chunks
DO $$
DECLARE last_id bigint := 0;
BEGIN
LOOP
UPDATE orders SET total_cents_new = (total_str::numeric * 100)::bigint
WHERE id > last_id AND id <= last_id + 1000;
EXIT WHEN NOT FOUND;
last_id := last_id + 1000;
COMMIT;
PERFORM pg_sleep(0.1); -- breathe
END LOOP;
END $$;
-- Step 4: Deploy app v2 - reads/writes total_cents_new
-- Step 5: Verify, swap
BEGIN;
ALTER TABLE orders DROP COLUMN total_str;
ALTER TABLE orders RENAME COLUMN total_cents_new TO total_cents;
ALTER TABLE orders ALTER COLUMN total_cents SET NOT NULL;
COMMIT;
DROP FUNCTION sync_total;4.5 Add unique constraint
-- Step 1: Build unique index CONCURRENTLY
CREATE UNIQUE INDEX CONCURRENTLY idx_users_email_unique ON users(email);
-- Step 2: Promote to constraint (instant, uses existing index)
ALTER TABLE users ADD CONSTRAINT users_email_unique UNIQUE USING INDEX idx_users_email_unique;4.6 Add foreign key
-- Step 1: Add FK with NOT VALID
ALTER TABLE orders
ADD CONSTRAINT fk_orders_user
FOREIGN KEY (user_id) REFERENCES users(id)
NOT VALID;
-- Quick: just registers, doesn't check existing data
-- Step 2: Validate (table scan, but no AccessExclusive)
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_user;4.7 Partition existing huge table
Hard problem. Pattern:
-- Step 1: Create new partitioned table
CREATE TABLE orders_new (LIKE orders INCLUDING ALL)
PARTITION BY RANGE (created_at);
-- Create partitions
CREATE TABLE orders_2025 PARTITION OF orders_new
FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
CREATE TABLE orders_2026 PARTITION OF orders_new
FOR VALUES FROM ('2026-01-01') TO ('2027-01-01');
-- Step 2: Set up dual-write via trigger or app
-- (Triggers way - complicated; use app dual-write recommended)
-- Step 3: Backfill in chunks (insert into orders_new from orders)
INSERT INTO orders_new SELECT * FROM orders WHERE created_at < '2026-01-01' AND id BETWEEN x AND y;
-- Step 4: Swap atomically
BEGIN;
ALTER TABLE orders RENAME TO orders_old;
ALTER TABLE orders_new RENAME TO orders;
COMMIT;
-- Step 5: Verify, drop orders_old
DROP TABLE orders_old;Or use pg_partman extension for partition management.
4.8 Drop column
-- Step 1: Deploy app v2 that doesn't read/write the column
-- Step 2: Wait until confident no traffic uses it
-- Step 3: Drop
ALTER TABLE users DROP COLUMN deprecated_field;
-- Fast: marks as dropped, actual cleanup async (next VACUUM or rewrite)⚠️ Cannot easily rollback after step 3. Test on staging with copy of production data.
5. Backfill — Đừng chạy 1 UPDATE
5.1 The trap
-- BAD: 100M row UPDATE
UPDATE users SET status = 'active';
-- Holds lock 30 min, generates 100M dead tuples, bloat explosion,
-- replica lag spike, autovacuum overwhelmed5.2 Chunking pattern
DO $$
DECLARE
batch_size int := 5000;
rows_updated int;
last_id bigint := 0;
BEGIN
LOOP
UPDATE users
SET status = 'active', updated_at = now()
WHERE id > last_id AND id <= last_id + batch_size
AND status IS NULL; -- idempotent
GET DIAGNOSTICS rows_updated = ROW_COUNT;
EXIT WHEN rows_updated = 0;
last_id := last_id + batch_size;
COMMIT; -- not inside DO; use procedure or external loop
PERFORM pg_sleep(0.05); -- breathe, let replica catch up
END LOOP;
END $$;(Note: DO block can’t COMMIT; use psql \watch or external script for transaction boundary.)
5.3 External backfill script
import psycopg, time
with psycopg.connect("...") as conn:
last_id = 0
batch_size = 5000
while True:
with conn.transaction():
result = conn.execute("""
UPDATE users
SET status = 'active'
WHERE id > %s AND id <= %s AND status IS NULL
RETURNING 1
""", [last_id, last_id + batch_size]).fetchall()
rows = len(result)
print(f"Batch {last_id}-{last_id+batch_size}: {rows} rows")
if rows == 0 and is_last_chunk(last_id):
break
last_id += batch_size
time.sleep(0.1) # breathe5.4 Monitor backfill
-- Replication lag spike
SELECT now() - pg_last_xact_replay_timestamp() FROM pg_stat_replication;
-- Bloat growth
SELECT relname, n_dead_tup, n_live_tup FROM pg_stat_user_tables WHERE relname = 'users';
-- Throughput
SELECT now() - query_start AS running_time, query FROM pg_stat_activity WHERE query LIKE '%backfill%';If replica lag > 30s → pause, let it catch up. If bloat ratio > 30% → trigger VACUUM on the table.
5.5 Tools 2024
- pg_repack — online table reorganization, handles backfill
- PgPusher — task queue migration
- rails_safe_migrations — Ruby framework
- GitLab batch_count_migrations — Ruby framework
- lib_pq with conn.exec — DIY in any language
6. Tools by Database
6.1 Postgres
pg_repack — repack table without locks:
pg_repack -d appdb -t orders --jobs 4Use cases:
- Reduce bloat without VACUUM FULL
- Reorder by index physically (cluster equivalent online)
- Move table to different tablespace
pg_squeeze — alternative, simpler.
pg_partman — partition management.
6.2 MySQL
gh-ost (GitHub):
gh-ost \
--host=primary.host --user=admin --password=... \
--database=appdb --table=users \
--alter="ADD COLUMN new_col INT NOT NULL DEFAULT 0" \
--executeHow:
- Create ghost table with new schema
- Stream binlog → apply changes to ghost table
- Copy existing rows in chunks
- Cut over: rename atomically
pt-online-schema-change (Percona Toolkit):
pt-online-schema-change \
--alter "ADD COLUMN new_col INT NOT NULL DEFAULT 0" \
D=appdb,t=users \
--executeSimilar approach. gh-ost newer, more popular 2024.
6.3 Use case match
| Need | Postgres tool | MySQL tool |
|---|---|---|
| Add column with default | Native ALTER (PG11+) | gh-ost |
| Build index | CONCURRENTLY | gh-ost (or native online) |
| Change column type | ALTER (in many cases) | gh-ost |
| Reduce bloat | pg_repack | OPTIMIZE TABLE / pt-osc |
| Partition existing | Manual / pg_partman | gh-ost |
| Rename column | ADD + sync + DROP | gh-ost |
7. Schema Version Control
7.1 Migration framework comparison
| Tool | Language | Down migration | DSL | Notes |
|---|---|---|---|---|
| Flyway | Java (any lang via CLI) | Optional | SQL + Java | Most mature, enterprise |
| Liquibase | Java | Yes (auto) | XML/YAML/JSON/SQL | DSL complexity |
| sqlx migrate | Rust | Yes | SQL | Simple, popular Rust |
| golang-migrate | Go | Yes | SQL | Most popular Go |
| Alembic | Python | Yes | Python DSL or SQL | SQLAlchemy companion |
| Prisma Migrate | TS/JS | No (forward-only) | Prisma schema → SQL | Modern, ORM-first |
| ActiveRecord | Ruby | Yes | Ruby DSL | Rails default |
| Atlas | Multi | Declarative + versioned | HCL | Modern (2024) |
| dbmate | CLI | Yes | SQL | Language-agnostic |
7.2 Atlas — modern entrant 2024
# schema.hcl - desired state
table "users" {
schema = schema.public
column "id" {
type = bigint
null = false
}
column "email" {
type = text
null = false
}
primary_key {
columns = [column.id]
}
}atlas migrate diff --to "file://schema.hcl" --dev-url "postgres://..."
atlas migrate apply --url "postgres://..."Declarative + auto-generated migrations. Good for greenfield.
7.3 Naming conventions
migrations/
20260516120000_add_users_table.sql
20260516131500_add_orders_table.sql
20260516143000_add_email_index.sql
Timestamp prefix → strict ordering. Avoid collisions.
7.4 Forward-only vs reversible
Forward-only (Prisma):
- Pro: simpler, real-world rarely actually downgrades
- Con: cannot rollback DB state
Reversible (most tools):
- Pro: dev workflow nicer
- Con: false sense of security — rolling back data migrations is hard
Pattern 2024: reversible for dev/staging, forward-only mindset for production. Always roll forward, not back.
7.5 Migration in CI/CD
# .github/workflows/deploy.yml
deploy:
steps:
- name: Run migrations
run: migrate -path ./migrations -database $DB_URL up
- name: Deploy app
run: kubectl rollout restart deployment/appPattern:
- Migration runs first
- App deploy after — uses new schema
For destructive migrations (DROP COLUMN), reverse order:
- App deploy (no longer uses old column)
- Migration runs (DROP COLUMN)
7.6 Lock timeout — critical for production
Every migration should set:
SET lock_timeout = '5s';
SET statement_timeout = '60s';
ALTER TABLE huge_table ADD COLUMN foo int;
-- If acquiring AccessExclusive blocks for >5s, fails fast
-- Better than blocking app for 30 minIn Rails:
class AddFooToTable < ActiveRecord::Migration[7.1]
def change
safety_assured do
lock_timeout = '5s'
add_column :table, :foo, :integer
end
end
end8. Rollback Plans
8.1 Every migration needs a rollback plan
Document before running:
## Migration: add status column to users
### Forward
ALTER TABLE users ADD COLUMN status text;
-- App v2 writes status
-- Backfill...
-- Add NOT NULL
### Rollback at each step
- After ALTER: ALTER TABLE users DROP COLUMN status;
- After app v2 deploy: rollback app to v1, ALTER ... DROP COLUMN
- After backfill: cannot easily rollback data; can stop using column
- After NOT NULL: cannot easily rollback; remove constraint, then drop column
### If shit hits the fan
- PITR to before migration started (Tuần 07)
- Time required: ~30 min RTO8.2 Snapshot before migration
For risky migrations:
# Take backup just before
pgbackrest --stanza=appdb backup --type=incr
# Verify
pgbackrest --stanza=appdb infoIf migration fails irrecoverably → PITR to just before.
9. Real-world Production Story
9.1 Case study: Rename users.email → users.email_address (1B rows)
Day 1: Plan
- Review all callsites of
email - Identify ~200 callsites across 15 services
- Estimate 2-week rollout
Day 2-3: Setup
ALTER TABLE users ADD COLUMN email_address text;
CREATE INDEX CONCURRENTLY idx_users_email_address ON users(email_address);
CREATE OR REPLACE FUNCTION sync_email() RETURNS trigger AS $$ ...
CREATE TRIGGER ... FOR EACH ROW ...Day 4-7: Backfill
# Run continuously, monitor metrics
# 1B rows / 5000 batch / 100ms sleep = ~28 hours total
# Watch replica lag, pause if >30sDay 8: Verify
SELECT count(*) FROM users WHERE email_address IS NULL;
-- 0 expectedDay 9-15: Service migration
- Service 1 deploys v2 (reads email_address, writes both)
- Service 2 deploys v2
- … 15 services × 1-day rollout window
Day 16: Wait
- Monitor for any callsite still using
email - Use
pg_stat_user_tablesor audit log
Day 17: Drop
ALTER TABLE users DROP COLUMN email;
DROP TRIGGER trg_users_sync_email ON users;
DROP FUNCTION sync_email();Total: 17 days, 0 downtime.
9.2 Lessons
- Communicate — Slack channel migration-users-email, daily updates
- Monitor — backfill metrics, lag, error rates
- Pause-able — backfill can stop/resume
- Verifiable — explicit verification each step
- Don’t rush — extra week of dual-write OK; reverting messy migration is weeks
10. Anti-patterns
| Pattern | Why bad | Fix |
|---|---|---|
ALTER TABLE huge ADD COLUMN x int NOT NULL DEFAULT 0 (pre-PG11) | Full rewrite, AccessExclusive | Add nullable + backfill + add constraint |
One huge UPDATE | Bloat, lag, lock | Chunk |
| Migration in deploy script without lock_timeout | Block app for hours | Always SET lock_timeout |
| No rollback plan | Stuck if fails | Document before |
| Run migration without monitoring | Don’t know when broken | Watch replication lag, locks, app errors |
| Rename column + deploy together | Brief window app errors | Multi-step expand-contract |
| Drop column before deploy stops using it | App errors | Order matters |
| Run on primary only, ignore replica lag | Replica falls behind, may need rebuild | Throttle to keep lag low |
| Use down migrations as “undo” | Often broken, never tested | Forward-only mindset |
DROP TABLE without backup | Cannot recover | Snapshot first |
| Skip staging | Surprises in prod | Always test on prod-size data first |
11. Lab
11.1 Day 1 — Safe vs unsafe DDL
Test on huge_table (10M rows):
ALTER ADD COLUMN x int— time itALTER ADD COLUMN x int DEFAULT 1— time it (PG11+ should be fast)ALTER ADD COLUMN x uuid DEFAULT gen_random_uuid()— slow! Volatile function- Compare each
11.2 Day 2 — Backfill chunk
Add column, backfill 10M rows in chunks:
- Compare 1 UPDATE vs chunked
- Watch replication lag during each
- Watch
n_dead_tupgrowth
11.3 Day 3 — gh-ost on MySQL
Setup MySQL + gh-ost:
gh-ost \
--alter="ADD COLUMN test_col INT NOT NULL DEFAULT 0" \
--database=test --table=users \
--executeObserve cut-over.
11.4 Day 4 — pg_repack
# Create bloat
docker exec postgres psql -c "ALTER TABLE users SET (autovacuum_enabled = false);"
# Run updates...
docker exec postgres psql -c "SELECT pg_size_pretty(pg_relation_size('users'));"
pg_repack -h localhost -d test_db -t users --jobs 4
# Check size dropped11.5 Day 5 — Rename column (expand-contract)
Implement full pattern in lab:
- ADD new column
- Trigger sync
- Backfill
- Deploy app v2 (simulate by changing queries)
- Verify
- Drop old
11.6 Day 6 — Atlas / migration tool
Pick a tool (sqlx-migrate, golang-migrate, Atlas), apply 5 migrations forward + 2 rollbacks. Get comfortable with CLI.
11.7 Day 7 — Real scenario
Take e-commerce schema from Tuần 02. Migrate it to add currency column to users, NOT NULL, default ‘USD’, applying expand-contract.
12. Self-check
- Expand-contract — 6 steps?
ALTER TABLE huge ADD COLUMN x int DEFAULT 0PG10 vs PG11+ — khác biệt?- Add NOT NULL trick PG12+?
CREATE INDEX CONCURRENTLY— process + caveats?- Backfill 100M rows — strategy + chunk size?
- lock_timeout — vì sao mandatory cho production migration?
- gh-ost cơ chế — copy + binlog?
- Forward-only vs reversible migration — pick mindset 2024?
- Rollback plan content — 3 things must include?
- Replication lag spike during migration — cause + fix?
13. Tiếp theo
Bài tiếp: Tuan-09-DB-Observability-Tuning — sau khi master change, master visibility.
Tuần 08 hoàn thành. Expand, contract, never down. Cập nhật: 2026-05-16