Tuần 08 — Zero-Downtime Migration

“Schema migration trên bảng 200M rows trong production lúc 3 giờ sáng. Bạn có 1 chance. Hỏng → revert phức tạp, có khi phải restore PITR (Tuần 07). Tuần này dạy bạn approach không bao giờ stress, không bao giờ down.”

Tags: database postgresql migration schema-change operations ddl Thời lượng: 7 ngày (5-7h/ngày) Prerequisites: Tuan-02-Schema-Design-Normalization · Tuan-05-Transactions-Locking (locks) · Tuan-07-Backup-PITR-DR Liên quan: Tuan-Bonus-CDC-Debezium · Case-Design-Migrate-Monolith-DB

1. Context & Why

1.1 “Stop the world” không còn là option

10 năm trước: maintenance window 2 AM Sunday OK. 2026: global product, 24/7 traffic, 5-min downtime = $50K + customer churn. Migration phải online.

1.2 Cấp độ nguy hiểm của DDL

graph TB
    L1[Level 1: Metadata only<br/>RENAME, COMMENT<br/>microseconds] --> Safe[Safe anytime]

    L2[Level 2: Brief AccessExclusive<br/>ADD COLUMN default null<br/>milliseconds] --> Safe

    L3[Level 3: Full table rewrite<br/>ADD COLUMN with non-null default fn<br/>ALTER TYPE<br/>minutes-hours] --> Risky[Risky - needs strategy]

    L4[Level 4: Long lock<br/>CREATE INDEX without CONCURRENTLY<br/>VACUUM FULL<br/>hours] --> Dangerous[Avoid online]

    L5[Level 5: Multi-step rollout<br/>Add column + backfill + constraint<br/>days] --> Expand[Expand-Contract]

    style Safe fill:#c8e6c9
    style Risky fill:#fff9c4
    style Dangerous fill:#ffccbc
    style Expand fill:#90caf9

1.3 Mục tiêu tuần

Master Expand-Contract pattern
Postgres-specific safe DDL (PG12+, PG16+ updates)
Tools: gh-ost, pt-osc cho MySQL; pg_repack cho Postgres
Dual-write + backfill chunking strategy
Rollback plan cho mọi migration
Lock timeout patterns
Schema version control: Flyway, Liquibase, sqlx, Alembic
Real-world recipes: rename column, add NOT NULL, change column type, partition existing table

1.4 Tham chiếu

Strong migrations — https://github.com/ankane/strong_migrations (Rails best practices)
GitLab Database Migration Style Guide — https://docs.gitlab.com/development/migration_style_guide/
github/gh-ost — https://github.com/github/gh-ost
percona-toolkit pt-online-schema-change — https://docs.percona.com/percona-toolkit/pt-online-schema-change.html
Braintree blog — PostgreSQL migrations — https://medium.com/braintree-product-technology
Soumya Roy — Postgres safe migrations posts

2. The Big Picture — Expand-Contract Pattern

2.1 Concept

Schema migration phải support 2 app versions cùng chạy:

Old app code: schema cũ
New app code: schema mới
Cả 2 phải work với schema during migration

sequenceDiagram
    participant App as App
    participant DB as Database

    Note over App,DB: Phase 1 - Expand
    Note over DB: Add new column nullable
    Note over App: Deploy v2: write both old + new<br/>read old still
    Note over DB: Backfill new column in chunks
    Note over App: Deploy v3: write both, read NEW
    Note over App: Verify in production

    Note over App,DB: Phase 2 - Contract
    Note over App: Deploy v4: only new column
    Note over DB: Add NOT NULL constraint
    Note over DB: Drop old column

2.2 6 steps universal

Backward-compatible schema change (Expand)
Deploy code writes both old + new
Backfill old data into new schema
Deploy code reads new, writes both
Wait + monitor — let traffic settle
Drop old (Contract) when safe

Each step reversible. Worst case at any step: rollback code, no DB change to revert.

2.3 Anti-pattern — “rip the bandaid”

-- BAD: deploy and migrate together
BEGIN;
ALTER TABLE users RENAME COLUMN email TO email_address;
COMMIT;
-- Deploy new code that uses email_address

Vấn đề:

Brief moment: old app pods still running with column email → app errors
Migration takes lock → block writes during alter
Rollback complex (rename back + revert code + hope no writes lost)

3. Postgres DDL Cheat Sheet — Safe vs Not

3.1 Safe operations (PG12+)

Operation	Lock	Duration	Notes
`ADD COLUMN` no default	AccessExclusive brief	ms	Metadata only
`ADD COLUMN DEFAULT constant` (PG11+)	AccessExclusive brief	ms	Metadata only, default stored in catalog
`ADD COLUMN DEFAULT function()` (volatile)	AccessExclusive long	minutes-hours	Rewrites table!
`DROP COLUMN`	AccessExclusive brief	ms	Marks as removed, actual cleanup async
`RENAME COLUMN`	AccessExclusive brief	ms	App must handle both names during deploy
`RENAME TABLE`	AccessExclusive brief	ms	Same caveat
`ADD CONSTRAINT NOT NULL`	AccessExclusive	full scan to validate	Use `NOT VALID` trick
`ADD CONSTRAINT CHECK`	AccessExclusive	full scan to validate	Use `NOT VALID` trick
`ALTER COLUMN TYPE` (compatible)	AccessExclusive	depends	Sometimes no rewrite
`CREATE INDEX`	ShareUpdateExclusive blocks writes	minutes-hours	Use `CONCURRENTLY`!
`CREATE INDEX CONCURRENTLY`	minimal	longer	Online safe
`DROP INDEX`	AccessExclusive brief	ms	`CONCURRENTLY` available

3.2 NOT NULL trick — PG12+ super useful

-- BAD: rewrites table, AccessExclusive, full scan
ALTER TABLE huge_table ALTER COLUMN col SET NOT NULL;
 
-- GOOD: 2 steps, no rewrite
ALTER TABLE huge_table ADD CONSTRAINT col_not_null
    CHECK (col IS NOT NULL) NOT VALID;
-- Quick: only adds constraint, doesn't check existing
 
-- Later (can run anytime, doesn't lock writes):
ALTER TABLE huge_table VALIDATE CONSTRAINT col_not_null;
-- Validates in background while reads/writes continue
 
-- PG12+ shortcut after validate:
ALTER TABLE huge_table ALTER COLUMN col SET NOT NULL;
-- Postgres notices CHECK constraint exists and skips full scan
ALTER TABLE huge_table DROP CONSTRAINT col_not_null;  -- cleanup

3.3 ADD COLUMN with default — PG11+

-- PG10 and before: rewrites table
ALTER TABLE huge ADD COLUMN status text DEFAULT 'active';
 
-- PG11+: metadata only if default is constant
-- Same syntax, just fast now!

Sneaky: DEFAULT now() or DEFAULT gen_random_uuid() is volatile function → rewrite. Constants OK.

3.4 ALTER COLUMN TYPE — compatibility matrix

-- PG10+: int → bigint = no rewrite (binary compatible at storage layer)
ALTER TABLE big_table ALTER COLUMN id TYPE bigint;
-- ✓ fast
 
-- text → varchar(N) - no rewrite if data fits
ALTER TABLE t ALTER COLUMN col TYPE varchar(255);
-- ✓ if existing data <=255 chars
 
-- text → int - always rewrite + may fail
ALTER TABLE t ALTER COLUMN col TYPE int USING col::int;
-- ✗ rewrite, AccessExclusive long

Check before:

EXPLAIN ALTER TABLE t ALTER COLUMN col TYPE bigint;
-- Not directly possible. Test in staging with same data size.

3.5 PG12-17 — DDL feature timeline

PG12 (2019) đã ship cái nhiều người tưởng là gần đây:

Foreign keys trên partitioned tables
Partition-wise joins
Pluggable table storage (table AM)

PG13-14:

Logical replication of partitioned tables
REINDEX CONCURRENTLY on partitioned indexes (PG14)

PG15-16:

MERGE statement (PG15)
Parallel apply for logical replication (PG16)
pg_stat_io (PG16)

PG17 (Sept 2024) thực tế ship:

MERGE ... RETURNING support
Incremental sort improvements
COPY enhancements (ON_ERROR ignore, LOG_VERBOSITY)
EXPLAIN (MEMORY) shows planner memory
Improved SAOP (IN (...)) handling in B-tree
Built-in C.UTF-8 locale

PG18 (Sept 2025, expected at writing):

True B-tree skip scan
Asynchronous I/O (io_uring) groundwork
More logical replication improvements

3.6 CREATE INDEX CONCURRENTLY in depth

CREATE INDEX CONCURRENTLY idx_orders_user ON orders(user_id);

Process:

Build catalog entry (brief lock)
Scan table — collect tuples (no lock blocking reads/writes)
Sort + insert into index
Second scan — catch up changes from step 2
Mark valid

⚠️ Cannot run inside transaction. If fails → leaves invalid index → drop & retry.

-- Check progress
SELECT * FROM pg_stat_progress_create_index;
 
-- Find invalid indexes
SELECT i.indexrelid::regclass FROM pg_index i WHERE NOT i.indisvalid;
 
-- Cleanup
DROP INDEX CONCURRENTLY idx_failed;

4. Common Migration Recipes

4.1 Add nullable column

-- Single step, safe
ALTER TABLE users ADD COLUMN phone text;

App deploy after migration. No code change needed for old app (it ignores new column).

4.2 Add NOT NULL column with default

-- Step 1: Add nullable
ALTER TABLE users ADD COLUMN onboarded_at timestamptz;
 
-- Step 2: App writes onboarded_at on signup
-- Deploy app v2
 
-- Step 3: Backfill in chunks (see section 5)
-- ...
 
-- Step 4: Add CHECK NOT NULL VALID
ALTER TABLE users ADD CONSTRAINT users_onboarded_at_not_null
    CHECK (onboarded_at IS NOT NULL) NOT VALID;
ALTER TABLE users VALIDATE CONSTRAINT users_onboarded_at_not_null;
 
-- Step 5: Promote CHECK to NOT NULL
ALTER TABLE users ALTER COLUMN onboarded_at SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_onboarded_at_not_null;

4.3 Rename column

-- Step 1: Add new column
ALTER TABLE users ADD COLUMN email_address text;
 
-- Step 2: Trigger to sync writes (during transition)
CREATE OR REPLACE FUNCTION sync_email() RETURNS trigger AS $$
BEGIN
    IF NEW.email IS DISTINCT FROM OLD.email THEN
        NEW.email_address := NEW.email;
    END IF;
    IF NEW.email_address IS DISTINCT FROM OLD.email_address THEN
        NEW.email := NEW.email_address;
    END IF;
    RETURN NEW;
END $$ LANGUAGE plpgsql;
CREATE TRIGGER trg_users_sync_email BEFORE UPDATE OR INSERT ON users
FOR EACH ROW EXECUTE FUNCTION sync_email();
 
-- Step 3: Backfill old data
UPDATE users SET email_address = email WHERE email_address IS NULL;
-- In chunks!
 
-- Step 4: Deploy app v2 to use email_address
 
-- Step 5: Verify, wait, drop old column + trigger
ALTER TABLE users DROP COLUMN email;
DROP FUNCTION sync_email();

4.4 Change column type

E.g., id int → id bigint:

-- PG10+: this is fast in most cases (no rewrite for int→bigint)
ALTER TABLE huge_table ALTER COLUMN id TYPE bigint;

For incompatible types (text → int with parsing):

-- Step 1: Add new column
ALTER TABLE orders ADD COLUMN total_cents_new bigint;
 
-- Step 2: Trigger sync writes
CREATE OR REPLACE FUNCTION sync_total() RETURNS trigger AS $$
BEGIN
    NEW.total_cents_new := (NEW.total_str::numeric * 100)::bigint;
    RETURN NEW;
END $$ LANGUAGE plpgsql;
CREATE TRIGGER trg_orders_sync_total BEFORE UPDATE OR INSERT ON orders
FOR EACH ROW EXECUTE FUNCTION sync_total();
 
-- Step 3: Backfill in chunks
DO $$
DECLARE last_id bigint := 0;
BEGIN
    LOOP
        UPDATE orders SET total_cents_new = (total_str::numeric * 100)::bigint
        WHERE id > last_id AND id <= last_id + 1000;
        EXIT WHEN NOT FOUND;
        last_id := last_id + 1000;
        COMMIT;
        PERFORM pg_sleep(0.1);  -- breathe
    END LOOP;
END $$;
 
-- Step 4: Deploy app v2 - reads/writes total_cents_new
 
-- Step 5: Verify, swap
BEGIN;
ALTER TABLE orders DROP COLUMN total_str;
ALTER TABLE orders RENAME COLUMN total_cents_new TO total_cents;
ALTER TABLE orders ALTER COLUMN total_cents SET NOT NULL;
COMMIT;
 
DROP FUNCTION sync_total;

4.5 Add unique constraint

-- Step 1: Build unique index CONCURRENTLY
CREATE UNIQUE INDEX CONCURRENTLY idx_users_email_unique ON users(email);
 
-- Step 2: Promote to constraint (instant, uses existing index)
ALTER TABLE users ADD CONSTRAINT users_email_unique UNIQUE USING INDEX idx_users_email_unique;

4.6 Add foreign key

-- Step 1: Add FK with NOT VALID
ALTER TABLE orders
    ADD CONSTRAINT fk_orders_user
    FOREIGN KEY (user_id) REFERENCES users(id)
    NOT VALID;
-- Quick: just registers, doesn't check existing data
 
-- Step 2: Validate (table scan, but no AccessExclusive)
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_user;

4.7 Partition existing huge table

Hard problem. Pattern:

-- Step 1: Create new partitioned table
CREATE TABLE orders_new (LIKE orders INCLUDING ALL)
PARTITION BY RANGE (created_at);
 
-- Create partitions
CREATE TABLE orders_2025 PARTITION OF orders_new
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
CREATE TABLE orders_2026 PARTITION OF orders_new
    FOR VALUES FROM ('2026-01-01') TO ('2027-01-01');
 
-- Step 2: Set up dual-write via trigger or app
-- (Triggers way - complicated; use app dual-write recommended)
 
-- Step 3: Backfill in chunks (insert into orders_new from orders)
INSERT INTO orders_new SELECT * FROM orders WHERE created_at < '2026-01-01' AND id BETWEEN x AND y;
 
-- Step 4: Swap atomically
BEGIN;
ALTER TABLE orders RENAME TO orders_old;
ALTER TABLE orders_new RENAME TO orders;
COMMIT;
 
-- Step 5: Verify, drop orders_old
DROP TABLE orders_old;

Or use pg_partman extension for partition management.

4.8 Drop column

-- Step 1: Deploy app v2 that doesn't read/write the column
-- Step 2: Wait until confident no traffic uses it
-- Step 3: Drop
ALTER TABLE users DROP COLUMN deprecated_field;
-- Fast: marks as dropped, actual cleanup async (next VACUUM or rewrite)

⚠️ Cannot easily rollback after step 3. Test on staging with copy of production data.

5. Backfill — Đừng chạy 1 UPDATE

5.1 The trap

-- BAD: 100M row UPDATE
UPDATE users SET status = 'active';
-- Holds lock 30 min, generates 100M dead tuples, bloat explosion,
-- replica lag spike, autovacuum overwhelmed

5.2 Chunking pattern

DO $$
DECLARE
    batch_size int := 5000;
    rows_updated int;
    last_id bigint := 0;
BEGIN
    LOOP
        UPDATE users
        SET status = 'active', updated_at = now()
        WHERE id > last_id AND id <= last_id + batch_size
          AND status IS NULL;  -- idempotent
        GET DIAGNOSTICS rows_updated = ROW_COUNT;
 
        EXIT WHEN rows_updated = 0;
 
        last_id := last_id + batch_size;
        COMMIT;  -- not inside DO; use procedure or external loop
 
        PERFORM pg_sleep(0.05);  -- breathe, let replica catch up
    END LOOP;
END $$;

(Note: DO block can’t COMMIT; use psql \watch or external script for transaction boundary.)

5.3 External backfill script

import psycopg, time
 
with psycopg.connect("...") as conn:
    last_id = 0
    batch_size = 5000
 
    while True:
        with conn.transaction():
            result = conn.execute("""
                UPDATE users
                SET status = 'active'
                WHERE id > %s AND id <= %s AND status IS NULL
                RETURNING 1
            """, [last_id, last_id + batch_size]).fetchall()
 
        rows = len(result)
        print(f"Batch {last_id}-{last_id+batch_size}: {rows} rows")
 
        if rows == 0 and is_last_chunk(last_id):
            break
 
        last_id += batch_size
        time.sleep(0.1)  # breathe

5.4 Monitor backfill

-- Replication lag spike
SELECT now() - pg_last_xact_replay_timestamp() FROM pg_stat_replication;
 
-- Bloat growth
SELECT relname, n_dead_tup, n_live_tup FROM pg_stat_user_tables WHERE relname = 'users';
 
-- Throughput
SELECT now() - query_start AS running_time, query FROM pg_stat_activity WHERE query LIKE '%backfill%';

If replica lag > 30s → pause, let it catch up. If bloat ratio > 30% → trigger VACUUM on the table.

5.5 Tools 2024

pg_repack — online table reorganization, handles backfill
PgPusher — task queue migration
rails_safe_migrations — Ruby framework
GitLab batch_count_migrations — Ruby framework
lib_pq with conn.exec — DIY in any language

6. Tools by Database

6.1 Postgres

pg_repack — repack table without locks:

pg_repack -d appdb -t orders --jobs 4

Use cases:

Reduce bloat without VACUUM FULL
Reorder by index physically (cluster equivalent online)
Move table to different tablespace

pg_squeeze — alternative, simpler.

pg_partman — partition management.

6.2 MySQL

gh-ost (GitHub):

gh-ost \
    --host=primary.host --user=admin --password=... \
    --database=appdb --table=users \
    --alter="ADD COLUMN new_col INT NOT NULL DEFAULT 0" \
    --execute

How:

Create ghost table with new schema
Stream binlog → apply changes to ghost table
Copy existing rows in chunks
Cut over: rename atomically

pt-online-schema-change (Percona Toolkit):

pt-online-schema-change \
    --alter "ADD COLUMN new_col INT NOT NULL DEFAULT 0" \
    D=appdb,t=users \
    --execute

Similar approach. gh-ost newer, more popular 2024.

6.3 Use case match

Need	Postgres tool	MySQL tool
Add column with default	Native ALTER (PG11+)	gh-ost
Build index	CONCURRENTLY	gh-ost (or native online)
Change column type	ALTER (in many cases)	gh-ost
Reduce bloat	pg_repack	OPTIMIZE TABLE / pt-osc
Partition existing	Manual / pg_partman	gh-ost
Rename column	ADD + sync + DROP	gh-ost

7. Schema Version Control

7.1 Migration framework comparison

Tool	Language	Down migration	DSL	Notes
Flyway	Java (any lang via CLI)	Optional	SQL + Java	Most mature, enterprise
Liquibase	Java	Yes (auto)	XML/YAML/JSON/SQL	DSL complexity
sqlx migrate	Rust	Yes	SQL	Simple, popular Rust
golang-migrate	Go	Yes	SQL	Most popular Go
Alembic	Python	Yes	Python DSL or SQL	SQLAlchemy companion
Prisma Migrate	TS/JS	No (forward-only)	Prisma schema → SQL	Modern, ORM-first
ActiveRecord	Ruby	Yes	Ruby DSL	Rails default
Atlas	Multi	Declarative + versioned	HCL	Modern (2024)
dbmate	CLI	Yes	SQL	Language-agnostic

7.2 Atlas — modern entrant 2024

# schema.hcl - desired state
table "users" {
  schema = schema.public
  column "id" {
    type = bigint
    null = false
  }
  column "email" {
    type = text
    null = false
  }
  primary_key {
    columns = [column.id]
  }
}

atlas migrate diff --to "file://schema.hcl" --dev-url "postgres://..."
atlas migrate apply --url "postgres://..."

Declarative + auto-generated migrations. Good for greenfield.

7.3 Naming conventions

migrations/
  20260516120000_add_users_table.sql
  20260516131500_add_orders_table.sql
  20260516143000_add_email_index.sql

Timestamp prefix → strict ordering. Avoid collisions.

7.4 Forward-only vs reversible

Forward-only (Prisma):

Pro: simpler, real-world rarely actually downgrades
Con: cannot rollback DB state

Reversible (most tools):

Pro: dev workflow nicer
Con: false sense of security — rolling back data migrations is hard

Pattern 2024: reversible for dev/staging, forward-only mindset for production. Always roll forward, not back.

7.5 Migration in CI/CD

# .github/workflows/deploy.yml
deploy:
  steps:
    - name: Run migrations
      run: migrate -path ./migrations -database $DB_URL up
    - name: Deploy app
      run: kubectl rollout restart deployment/app

Pattern:

Migration runs first
App deploy after — uses new schema

For destructive migrations (DROP COLUMN), reverse order:

App deploy (no longer uses old column)
Migration runs (DROP COLUMN)

7.6 Lock timeout — critical for production

Every migration should set:

SET lock_timeout = '5s';
SET statement_timeout = '60s';
 
ALTER TABLE huge_table ADD COLUMN foo int;
-- If acquiring AccessExclusive blocks for >5s, fails fast
-- Better than blocking app for 30 min

In Rails:

class AddFooToTable < ActiveRecord::Migration[7.1]
  def change
    safety_assured do
      lock_timeout = '5s'
      add_column :table, :foo, :integer
    end
  end
end

8. Rollback Plans

8.1 Every migration needs a rollback plan

Document before running:

## Migration: add status column to users
 
### Forward
ALTER TABLE users ADD COLUMN status text;
-- App v2 writes status
-- Backfill...
-- Add NOT NULL
 
### Rollback at each step
- After ALTER: ALTER TABLE users DROP COLUMN status;
- After app v2 deploy: rollback app to v1, ALTER ... DROP COLUMN
- After backfill: cannot easily rollback data; can stop using column
- After NOT NULL: cannot easily rollback; remove constraint, then drop column
 
### If shit hits the fan
- PITR to before migration started (Tuần 07)
- Time required: ~30 min RTO

8.2 Snapshot before migration

For risky migrations:

# Take backup just before
pgbackrest --stanza=appdb backup --type=incr
 
# Verify
pgbackrest --stanza=appdb info

If migration fails irrecoverably → PITR to just before.

9. Real-world Production Story

9.1 Case study: Rename users.email → users.email_address (1B rows)

Day 1: Plan

Review all callsites of email
Identify ~200 callsites across 15 services
Estimate 2-week rollout

Day 2-3: Setup

ALTER TABLE users ADD COLUMN email_address text;
CREATE INDEX CONCURRENTLY idx_users_email_address ON users(email_address);
 
CREATE OR REPLACE FUNCTION sync_email() RETURNS trigger AS $$ ...
CREATE TRIGGER ... FOR EACH ROW ...

Day 4-7: Backfill

# Run continuously, monitor metrics
# 1B rows / 5000 batch / 100ms sleep = ~28 hours total
# Watch replica lag, pause if >30s

Day 8: Verify

SELECT count(*) FROM users WHERE email_address IS NULL;
-- 0 expected

Day 9-15: Service migration

Service 1 deploys v2 (reads email_address, writes both)
Service 2 deploys v2
… 15 services × 1-day rollout window

Day 16: Wait

Monitor for any callsite still using email
Use pg_stat_user_tables or audit log

Day 17: Drop

ALTER TABLE users DROP COLUMN email;
DROP TRIGGER trg_users_sync_email ON users;
DROP FUNCTION sync_email();

Total: 17 days, 0 downtime.

9.2 Lessons

Communicate — Slack channel migration-users-email, daily updates
Monitor — backfill metrics, lag, error rates
Pause-able — backfill can stop/resume
Verifiable — explicit verification each step
Don’t rush — extra week of dual-write OK; reverting messy migration is weeks

10. Anti-patterns

Pattern	Why bad	Fix
`ALTER TABLE huge ADD COLUMN x int NOT NULL DEFAULT 0` (pre-PG11)	Full rewrite, AccessExclusive	Add nullable + backfill + add constraint
One huge `UPDATE`	Bloat, lag, lock	Chunk
Migration in deploy script without lock_timeout	Block app for hours	Always SET lock_timeout
No rollback plan	Stuck if fails	Document before
Run migration without monitoring	Don’t know when broken	Watch replication lag, locks, app errors
Rename column + deploy together	Brief window app errors	Multi-step expand-contract
Drop column before deploy stops using it	App errors	Order matters
Run on primary only, ignore replica lag	Replica falls behind, may need rebuild	Throttle to keep lag low
Use down migrations as “undo”	Often broken, never tested	Forward-only mindset
`DROP TABLE` without backup	Cannot recover	Snapshot first
Skip staging	Surprises in prod	Always test on prod-size data first

11. Lab

11.1 Day 1 — Safe vs unsafe DDL

Test on huge_table (10M rows):

ALTER ADD COLUMN x int — time it
ALTER ADD COLUMN x int DEFAULT 1 — time it (PG11+ should be fast)
ALTER ADD COLUMN x uuid DEFAULT gen_random_uuid() — slow! Volatile function
Compare each

11.2 Day 2 — Backfill chunk

Add column, backfill 10M rows in chunks:

Compare 1 UPDATE vs chunked
Watch replication lag during each
Watch n_dead_tup growth

11.3 Day 3 — gh-ost on MySQL

Setup MySQL + gh-ost:

gh-ost \
    --alter="ADD COLUMN test_col INT NOT NULL DEFAULT 0" \
    --database=test --table=users \
    --execute

Observe cut-over.

11.4 Day 4 — pg_repack

# Create bloat
docker exec postgres psql -c "ALTER TABLE users SET (autovacuum_enabled = false);"
# Run updates...
docker exec postgres psql -c "SELECT pg_size_pretty(pg_relation_size('users'));"
 
pg_repack -h localhost -d test_db -t users --jobs 4
# Check size dropped

11.5 Day 5 — Rename column (expand-contract)

Implement full pattern in lab:

ADD new column
Trigger sync
Backfill
Deploy app v2 (simulate by changing queries)
Verify
Drop old

11.6 Day 6 — Atlas / migration tool

Pick a tool (sqlx-migrate, golang-migrate, Atlas), apply 5 migrations forward + 2 rollbacks. Get comfortable with CLI.

11.7 Day 7 — Real scenario

Take e-commerce schema from Tuần 02. Migrate it to add currency column to users, NOT NULL, default ‘USD’, applying expand-contract.

12. Self-check

Expand-contract — 6 steps?
ALTER TABLE huge ADD COLUMN x int DEFAULT 0 PG10 vs PG11+ — khác biệt?
Add NOT NULL trick PG12+?
CREATE INDEX CONCURRENTLY — process + caveats?
Backfill 100M rows — strategy + chunk size?
lock_timeout — vì sao mandatory cho production migration?
gh-ost cơ chế — copy + binlog?
Forward-only vs reversible migration — pick mindset 2024?
Rollback plan content — 3 things must include?
Replication lag spike during migration — cause + fix?

13. Tiếp theo

Bài tiếp: Tuan-09-DB-Observability-Tuning — sau khi master change, master visibility.

Tuần 08 hoàn thành. Expand, contract, never down. Cập nhật: 2026-05-16

lthieu's notes

Explorer

Tuan-08-Zero-Downtime-Migration