Tuần 08 — Zero-Downtime Migration

“Schema migration trên bảng 200M rows trong production lúc 3 giờ sáng. Bạn có 1 chance. Hỏng → revert phức tạp, có khi phải restore PITR (Tuần 07). Tuần này dạy bạn approach không bao giờ stress, không bao giờ down.”

Tags: database postgresql migration schema-change operations ddl Thời lượng: 7 ngày (5-7h/ngày) Prerequisites: Tuan-02-Schema-Design-Normalization · Tuan-05-Transactions-Locking (locks) · Tuan-07-Backup-PITR-DR Liên quan: Tuan-Bonus-CDC-Debezium · Case-Design-Migrate-Monolith-DB


1. Context & Why

1.1 “Stop the world” không còn là option

10 năm trước: maintenance window 2 AM Sunday OK. 2026: global product, 24/7 traffic, 5-min downtime = $50K + customer churn. Migration phải online.

1.2 Cấp độ nguy hiểm của DDL

graph TB
    L1[Level 1: Metadata only<br/>RENAME, COMMENT<br/>microseconds] --> Safe[Safe anytime]

    L2[Level 2: Brief AccessExclusive<br/>ADD COLUMN default null<br/>milliseconds] --> Safe

    L3[Level 3: Full table rewrite<br/>ADD COLUMN with non-null default fn<br/>ALTER TYPE<br/>minutes-hours] --> Risky[Risky - needs strategy]

    L4[Level 4: Long lock<br/>CREATE INDEX without CONCURRENTLY<br/>VACUUM FULL<br/>hours] --> Dangerous[Avoid online]

    L5[Level 5: Multi-step rollout<br/>Add column + backfill + constraint<br/>days] --> Expand[Expand-Contract]

    style Safe fill:#c8e6c9
    style Risky fill:#fff9c4
    style Dangerous fill:#ffccbc
    style Expand fill:#90caf9

1.3 Mục tiêu tuần

  • Master Expand-Contract pattern
  • Postgres-specific safe DDL (PG12+, PG16+ updates)
  • Tools: gh-ost, pt-osc cho MySQL; pg_repack cho Postgres
  • Dual-write + backfill chunking strategy
  • Rollback plan cho mọi migration
  • Lock timeout patterns
  • Schema version control: Flyway, Liquibase, sqlx, Alembic
  • Real-world recipes: rename column, add NOT NULL, change column type, partition existing table

1.4 Tham chiếu


2. The Big Picture — Expand-Contract Pattern

2.1 Concept

Schema migration phải support 2 app versions cùng chạy:

  • Old app code: schema cũ
  • New app code: schema mới
  • Cả 2 phải work với schema during migration
sequenceDiagram
    participant App as App
    participant DB as Database

    Note over App,DB: Phase 1 - Expand
    Note over DB: Add new column nullable
    Note over App: Deploy v2: write both old + new<br/>read old still
    Note over DB: Backfill new column in chunks
    Note over App: Deploy v3: write both, read NEW
    Note over App: Verify in production

    Note over App,DB: Phase 2 - Contract
    Note over App: Deploy v4: only new column
    Note over DB: Add NOT NULL constraint
    Note over DB: Drop old column

2.2 6 steps universal

  1. Backward-compatible schema change (Expand)
  2. Deploy code writes both old + new
  3. Backfill old data into new schema
  4. Deploy code reads new, writes both
  5. Wait + monitor — let traffic settle
  6. Drop old (Contract) when safe

Each step reversible. Worst case at any step: rollback code, no DB change to revert.

2.3 Anti-pattern — “rip the bandaid”

-- BAD: deploy and migrate together
BEGIN;
ALTER TABLE users RENAME COLUMN email TO email_address;
COMMIT;
-- Deploy new code that uses email_address

Vấn đề:

  • Brief moment: old app pods still running with column email → app errors
  • Migration takes lock → block writes during alter
  • Rollback complex (rename back + revert code + hope no writes lost)

3. Postgres DDL Cheat Sheet — Safe vs Not

3.1 Safe operations (PG12+)

OperationLockDurationNotes
ADD COLUMN no defaultAccessExclusive briefmsMetadata only
ADD COLUMN DEFAULT constant (PG11+)AccessExclusive briefmsMetadata only, default stored in catalog
ADD COLUMN DEFAULT function() (volatile)AccessExclusive longminutes-hoursRewrites table!
DROP COLUMNAccessExclusive briefmsMarks as removed, actual cleanup async
RENAME COLUMNAccessExclusive briefmsApp must handle both names during deploy
RENAME TABLEAccessExclusive briefmsSame caveat
ADD CONSTRAINT NOT NULLAccessExclusivefull scan to validateUse NOT VALID trick
ADD CONSTRAINT CHECKAccessExclusivefull scan to validateUse NOT VALID trick
ALTER COLUMN TYPE (compatible)AccessExclusivedependsSometimes no rewrite
CREATE INDEXShareUpdateExclusive blocks writesminutes-hoursUse CONCURRENTLY!
CREATE INDEX CONCURRENTLYminimallongerOnline safe
DROP INDEXAccessExclusive briefmsCONCURRENTLY available

3.2 NOT NULL trick — PG12+ super useful

-- BAD: rewrites table, AccessExclusive, full scan
ALTER TABLE huge_table ALTER COLUMN col SET NOT NULL;
 
-- GOOD: 2 steps, no rewrite
ALTER TABLE huge_table ADD CONSTRAINT col_not_null
    CHECK (col IS NOT NULL) NOT VALID;
-- Quick: only adds constraint, doesn't check existing
 
-- Later (can run anytime, doesn't lock writes):
ALTER TABLE huge_table VALIDATE CONSTRAINT col_not_null;
-- Validates in background while reads/writes continue
 
-- PG12+ shortcut after validate:
ALTER TABLE huge_table ALTER COLUMN col SET NOT NULL;
-- Postgres notices CHECK constraint exists and skips full scan
ALTER TABLE huge_table DROP CONSTRAINT col_not_null;  -- cleanup

3.3 ADD COLUMN with default — PG11+

-- PG10 and before: rewrites table
ALTER TABLE huge ADD COLUMN status text DEFAULT 'active';
 
-- PG11+: metadata only if default is constant
-- Same syntax, just fast now!

Sneaky: DEFAULT now() or DEFAULT gen_random_uuid() is volatile function → rewrite. Constants OK.

3.4 ALTER COLUMN TYPE — compatibility matrix

-- PG10+: int → bigint = no rewrite (binary compatible at storage layer)
ALTER TABLE big_table ALTER COLUMN id TYPE bigint;
-- ✓ fast
 
-- text → varchar(N) - no rewrite if data fits
ALTER TABLE t ALTER COLUMN col TYPE varchar(255);
-- ✓ if existing data <=255 chars
 
-- text → int - always rewrite + may fail
ALTER TABLE t ALTER COLUMN col TYPE int USING col::int;
-- ✗ rewrite, AccessExclusive long

Check before:

EXPLAIN ALTER TABLE t ALTER COLUMN col TYPE bigint;
-- Not directly possible. Test in staging with same data size.

3.5 PG12-17 — DDL feature timeline

PG12 (2019) đã ship cái nhiều người tưởng là gần đây:

  • Foreign keys trên partitioned tables
  • Partition-wise joins
  • Pluggable table storage (table AM)

PG13-14:

  • Logical replication of partitioned tables
  • REINDEX CONCURRENTLY on partitioned indexes (PG14)

PG15-16:

  • MERGE statement (PG15)
  • Parallel apply for logical replication (PG16)
  • pg_stat_io (PG16)

PG17 (Sept 2024) thực tế ship:

  • MERGE ... RETURNING support
  • Incremental sort improvements
  • COPY enhancements (ON_ERROR ignore, LOG_VERBOSITY)
  • EXPLAIN (MEMORY) shows planner memory
  • Improved SAOP (IN (...)) handling in B-tree
  • Built-in C.UTF-8 locale

PG18 (Sept 2025, expected at writing):

  • True B-tree skip scan
  • Asynchronous I/O (io_uring) groundwork
  • More logical replication improvements

3.6 CREATE INDEX CONCURRENTLY in depth

CREATE INDEX CONCURRENTLY idx_orders_user ON orders(user_id);

Process:

  1. Build catalog entry (brief lock)
  2. Scan table — collect tuples (no lock blocking reads/writes)
  3. Sort + insert into index
  4. Second scan — catch up changes from step 2
  5. Mark valid

⚠️ Cannot run inside transaction. If fails → leaves invalid index → drop & retry.

-- Check progress
SELECT * FROM pg_stat_progress_create_index;
 
-- Find invalid indexes
SELECT i.indexrelid::regclass FROM pg_index i WHERE NOT i.indisvalid;
 
-- Cleanup
DROP INDEX CONCURRENTLY idx_failed;

4. Common Migration Recipes

4.1 Add nullable column

-- Single step, safe
ALTER TABLE users ADD COLUMN phone text;

App deploy after migration. No code change needed for old app (it ignores new column).

4.2 Add NOT NULL column with default

-- Step 1: Add nullable
ALTER TABLE users ADD COLUMN onboarded_at timestamptz;
 
-- Step 2: App writes onboarded_at on signup
-- Deploy app v2
 
-- Step 3: Backfill in chunks (see section 5)
-- ...
 
-- Step 4: Add CHECK NOT NULL VALID
ALTER TABLE users ADD CONSTRAINT users_onboarded_at_not_null
    CHECK (onboarded_at IS NOT NULL) NOT VALID;
ALTER TABLE users VALIDATE CONSTRAINT users_onboarded_at_not_null;
 
-- Step 5: Promote CHECK to NOT NULL
ALTER TABLE users ALTER COLUMN onboarded_at SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT users_onboarded_at_not_null;

4.3 Rename column

-- Step 1: Add new column
ALTER TABLE users ADD COLUMN email_address text;
 
-- Step 2: Trigger to sync writes (during transition)
CREATE OR REPLACE FUNCTION sync_email() RETURNS trigger AS $$
BEGIN
    IF NEW.email IS DISTINCT FROM OLD.email THEN
        NEW.email_address := NEW.email;
    END IF;
    IF NEW.email_address IS DISTINCT FROM OLD.email_address THEN
        NEW.email := NEW.email_address;
    END IF;
    RETURN NEW;
END $$ LANGUAGE plpgsql;
CREATE TRIGGER trg_users_sync_email BEFORE UPDATE OR INSERT ON users
FOR EACH ROW EXECUTE FUNCTION sync_email();
 
-- Step 3: Backfill old data
UPDATE users SET email_address = email WHERE email_address IS NULL;
-- In chunks!
 
-- Step 4: Deploy app v2 to use email_address
 
-- Step 5: Verify, wait, drop old column + trigger
ALTER TABLE users DROP COLUMN email;
DROP FUNCTION sync_email();

4.4 Change column type

E.g., id intid bigint:

-- PG10+: this is fast in most cases (no rewrite for int→bigint)
ALTER TABLE huge_table ALTER COLUMN id TYPE bigint;

For incompatible types (text → int with parsing):

-- Step 1: Add new column
ALTER TABLE orders ADD COLUMN total_cents_new bigint;
 
-- Step 2: Trigger sync writes
CREATE OR REPLACE FUNCTION sync_total() RETURNS trigger AS $$
BEGIN
    NEW.total_cents_new := (NEW.total_str::numeric * 100)::bigint;
    RETURN NEW;
END $$ LANGUAGE plpgsql;
CREATE TRIGGER trg_orders_sync_total BEFORE UPDATE OR INSERT ON orders
FOR EACH ROW EXECUTE FUNCTION sync_total();
 
-- Step 3: Backfill in chunks
DO $$
DECLARE last_id bigint := 0;
BEGIN
    LOOP
        UPDATE orders SET total_cents_new = (total_str::numeric * 100)::bigint
        WHERE id > last_id AND id <= last_id + 1000;
        EXIT WHEN NOT FOUND;
        last_id := last_id + 1000;
        COMMIT;
        PERFORM pg_sleep(0.1);  -- breathe
    END LOOP;
END $$;
 
-- Step 4: Deploy app v2 - reads/writes total_cents_new
 
-- Step 5: Verify, swap
BEGIN;
ALTER TABLE orders DROP COLUMN total_str;
ALTER TABLE orders RENAME COLUMN total_cents_new TO total_cents;
ALTER TABLE orders ALTER COLUMN total_cents SET NOT NULL;
COMMIT;
 
DROP FUNCTION sync_total;

4.5 Add unique constraint

-- Step 1: Build unique index CONCURRENTLY
CREATE UNIQUE INDEX CONCURRENTLY idx_users_email_unique ON users(email);
 
-- Step 2: Promote to constraint (instant, uses existing index)
ALTER TABLE users ADD CONSTRAINT users_email_unique UNIQUE USING INDEX idx_users_email_unique;

4.6 Add foreign key

-- Step 1: Add FK with NOT VALID
ALTER TABLE orders
    ADD CONSTRAINT fk_orders_user
    FOREIGN KEY (user_id) REFERENCES users(id)
    NOT VALID;
-- Quick: just registers, doesn't check existing data
 
-- Step 2: Validate (table scan, but no AccessExclusive)
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_user;

4.7 Partition existing huge table

Hard problem. Pattern:

-- Step 1: Create new partitioned table
CREATE TABLE orders_new (LIKE orders INCLUDING ALL)
PARTITION BY RANGE (created_at);
 
-- Create partitions
CREATE TABLE orders_2025 PARTITION OF orders_new
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');
CREATE TABLE orders_2026 PARTITION OF orders_new
    FOR VALUES FROM ('2026-01-01') TO ('2027-01-01');
 
-- Step 2: Set up dual-write via trigger or app
-- (Triggers way - complicated; use app dual-write recommended)
 
-- Step 3: Backfill in chunks (insert into orders_new from orders)
INSERT INTO orders_new SELECT * FROM orders WHERE created_at < '2026-01-01' AND id BETWEEN x AND y;
 
-- Step 4: Swap atomically
BEGIN;
ALTER TABLE orders RENAME TO orders_old;
ALTER TABLE orders_new RENAME TO orders;
COMMIT;
 
-- Step 5: Verify, drop orders_old
DROP TABLE orders_old;

Or use pg_partman extension for partition management.

4.8 Drop column

-- Step 1: Deploy app v2 that doesn't read/write the column
-- Step 2: Wait until confident no traffic uses it
-- Step 3: Drop
ALTER TABLE users DROP COLUMN deprecated_field;
-- Fast: marks as dropped, actual cleanup async (next VACUUM or rewrite)

⚠️ Cannot easily rollback after step 3. Test on staging with copy of production data.


5. Backfill — Đừng chạy 1 UPDATE

5.1 The trap

-- BAD: 100M row UPDATE
UPDATE users SET status = 'active';
-- Holds lock 30 min, generates 100M dead tuples, bloat explosion,
-- replica lag spike, autovacuum overwhelmed

5.2 Chunking pattern

DO $$
DECLARE
    batch_size int := 5000;
    rows_updated int;
    last_id bigint := 0;
BEGIN
    LOOP
        UPDATE users
        SET status = 'active', updated_at = now()
        WHERE id > last_id AND id <= last_id + batch_size
          AND status IS NULL;  -- idempotent
        GET DIAGNOSTICS rows_updated = ROW_COUNT;
 
        EXIT WHEN rows_updated = 0;
 
        last_id := last_id + batch_size;
        COMMIT;  -- not inside DO; use procedure or external loop
 
        PERFORM pg_sleep(0.05);  -- breathe, let replica catch up
    END LOOP;
END $$;

(Note: DO block can’t COMMIT; use psql \watch or external script for transaction boundary.)

5.3 External backfill script

import psycopg, time
 
with psycopg.connect("...") as conn:
    last_id = 0
    batch_size = 5000
 
    while True:
        with conn.transaction():
            result = conn.execute("""
                UPDATE users
                SET status = 'active'
                WHERE id > %s AND id <= %s AND status IS NULL
                RETURNING 1
            """, [last_id, last_id + batch_size]).fetchall()
 
        rows = len(result)
        print(f"Batch {last_id}-{last_id+batch_size}: {rows} rows")
 
        if rows == 0 and is_last_chunk(last_id):
            break
 
        last_id += batch_size
        time.sleep(0.1)  # breathe

5.4 Monitor backfill

-- Replication lag spike
SELECT now() - pg_last_xact_replay_timestamp() FROM pg_stat_replication;
 
-- Bloat growth
SELECT relname, n_dead_tup, n_live_tup FROM pg_stat_user_tables WHERE relname = 'users';
 
-- Throughput
SELECT now() - query_start AS running_time, query FROM pg_stat_activity WHERE query LIKE '%backfill%';

If replica lag > 30s → pause, let it catch up. If bloat ratio > 30% → trigger VACUUM on the table.

5.5 Tools 2024

  • pg_repack — online table reorganization, handles backfill
  • PgPusher — task queue migration
  • rails_safe_migrations — Ruby framework
  • GitLab batch_count_migrations — Ruby framework
  • lib_pq with conn.exec — DIY in any language

6. Tools by Database

6.1 Postgres

pg_repack — repack table without locks:

pg_repack -d appdb -t orders --jobs 4

Use cases:

  • Reduce bloat without VACUUM FULL
  • Reorder by index physically (cluster equivalent online)
  • Move table to different tablespace

pg_squeeze — alternative, simpler.

pg_partman — partition management.

6.2 MySQL

gh-ost (GitHub):

gh-ost \
    --host=primary.host --user=admin --password=... \
    --database=appdb --table=users \
    --alter="ADD COLUMN new_col INT NOT NULL DEFAULT 0" \
    --execute

How:

  1. Create ghost table with new schema
  2. Stream binlog → apply changes to ghost table
  3. Copy existing rows in chunks
  4. Cut over: rename atomically

pt-online-schema-change (Percona Toolkit):

pt-online-schema-change \
    --alter "ADD COLUMN new_col INT NOT NULL DEFAULT 0" \
    D=appdb,t=users \
    --execute

Similar approach. gh-ost newer, more popular 2024.

6.3 Use case match

NeedPostgres toolMySQL tool
Add column with defaultNative ALTER (PG11+)gh-ost
Build indexCONCURRENTLYgh-ost (or native online)
Change column typeALTER (in many cases)gh-ost
Reduce bloatpg_repackOPTIMIZE TABLE / pt-osc
Partition existingManual / pg_partmangh-ost
Rename columnADD + sync + DROPgh-ost

7. Schema Version Control

7.1 Migration framework comparison

ToolLanguageDown migrationDSLNotes
FlywayJava (any lang via CLI)OptionalSQL + JavaMost mature, enterprise
LiquibaseJavaYes (auto)XML/YAML/JSON/SQLDSL complexity
sqlx migrateRustYesSQLSimple, popular Rust
golang-migrateGoYesSQLMost popular Go
AlembicPythonYesPython DSL or SQLSQLAlchemy companion
Prisma MigrateTS/JSNo (forward-only)Prisma schema → SQLModern, ORM-first
ActiveRecordRubyYesRuby DSLRails default
AtlasMultiDeclarative + versionedHCLModern (2024)
dbmateCLIYesSQLLanguage-agnostic

7.2 Atlas — modern entrant 2024

# schema.hcl - desired state
table "users" {
  schema = schema.public
  column "id" {
    type = bigint
    null = false
  }
  column "email" {
    type = text
    null = false
  }
  primary_key {
    columns = [column.id]
  }
}
atlas migrate diff --to "file://schema.hcl" --dev-url "postgres://..."
atlas migrate apply --url "postgres://..."

Declarative + auto-generated migrations. Good for greenfield.

7.3 Naming conventions

migrations/
  20260516120000_add_users_table.sql
  20260516131500_add_orders_table.sql
  20260516143000_add_email_index.sql

Timestamp prefix → strict ordering. Avoid collisions.

7.4 Forward-only vs reversible

Forward-only (Prisma):

  • Pro: simpler, real-world rarely actually downgrades
  • Con: cannot rollback DB state

Reversible (most tools):

  • Pro: dev workflow nicer
  • Con: false sense of security — rolling back data migrations is hard

Pattern 2024: reversible for dev/staging, forward-only mindset for production. Always roll forward, not back.

7.5 Migration in CI/CD

# .github/workflows/deploy.yml
deploy:
  steps:
    - name: Run migrations
      run: migrate -path ./migrations -database $DB_URL up
    - name: Deploy app
      run: kubectl rollout restart deployment/app

Pattern:

  1. Migration runs first
  2. App deploy after — uses new schema

For destructive migrations (DROP COLUMN), reverse order:

  1. App deploy (no longer uses old column)
  2. Migration runs (DROP COLUMN)

7.6 Lock timeout — critical for production

Every migration should set:

SET lock_timeout = '5s';
SET statement_timeout = '60s';
 
ALTER TABLE huge_table ADD COLUMN foo int;
-- If acquiring AccessExclusive blocks for >5s, fails fast
-- Better than blocking app for 30 min

In Rails:

class AddFooToTable < ActiveRecord::Migration[7.1]
  def change
    safety_assured do
      lock_timeout = '5s'
      add_column :table, :foo, :integer
    end
  end
end

8. Rollback Plans

8.1 Every migration needs a rollback plan

Document before running:

## Migration: add status column to users
 
### Forward
ALTER TABLE users ADD COLUMN status text;
-- App v2 writes status
-- Backfill...
-- Add NOT NULL
 
### Rollback at each step
- After ALTER: ALTER TABLE users DROP COLUMN status;
- After app v2 deploy: rollback app to v1, ALTER ... DROP COLUMN
- After backfill: cannot easily rollback data; can stop using column
- After NOT NULL: cannot easily rollback; remove constraint, then drop column
 
### If shit hits the fan
- PITR to before migration started (Tuần 07)
- Time required: ~30 min RTO

8.2 Snapshot before migration

For risky migrations:

# Take backup just before
pgbackrest --stanza=appdb backup --type=incr
 
# Verify
pgbackrest --stanza=appdb info

If migration fails irrecoverably → PITR to just before.


9. Real-world Production Story

9.1 Case study: Rename users.email → users.email_address (1B rows)

Day 1: Plan

  • Review all callsites of email
  • Identify ~200 callsites across 15 services
  • Estimate 2-week rollout

Day 2-3: Setup

ALTER TABLE users ADD COLUMN email_address text;
CREATE INDEX CONCURRENTLY idx_users_email_address ON users(email_address);
 
CREATE OR REPLACE FUNCTION sync_email() RETURNS trigger AS $$ ...
CREATE TRIGGER ... FOR EACH ROW ...

Day 4-7: Backfill

# Run continuously, monitor metrics
# 1B rows / 5000 batch / 100ms sleep = ~28 hours total
# Watch replica lag, pause if >30s

Day 8: Verify

SELECT count(*) FROM users WHERE email_address IS NULL;
-- 0 expected

Day 9-15: Service migration

  • Service 1 deploys v2 (reads email_address, writes both)
  • Service 2 deploys v2
  • … 15 services × 1-day rollout window

Day 16: Wait

  • Monitor for any callsite still using email
  • Use pg_stat_user_tables or audit log

Day 17: Drop

ALTER TABLE users DROP COLUMN email;
DROP TRIGGER trg_users_sync_email ON users;
DROP FUNCTION sync_email();

Total: 17 days, 0 downtime.

9.2 Lessons

  • Communicate — Slack channel migration-users-email, daily updates
  • Monitor — backfill metrics, lag, error rates
  • Pause-able — backfill can stop/resume
  • Verifiable — explicit verification each step
  • Don’t rush — extra week of dual-write OK; reverting messy migration is weeks

10. Anti-patterns

PatternWhy badFix
ALTER TABLE huge ADD COLUMN x int NOT NULL DEFAULT 0 (pre-PG11)Full rewrite, AccessExclusiveAdd nullable + backfill + add constraint
One huge UPDATEBloat, lag, lockChunk
Migration in deploy script without lock_timeoutBlock app for hoursAlways SET lock_timeout
No rollback planStuck if failsDocument before
Run migration without monitoringDon’t know when brokenWatch replication lag, locks, app errors
Rename column + deploy togetherBrief window app errorsMulti-step expand-contract
Drop column before deploy stops using itApp errorsOrder matters
Run on primary only, ignore replica lagReplica falls behind, may need rebuildThrottle to keep lag low
Use down migrations as “undo”Often broken, never testedForward-only mindset
DROP TABLE without backupCannot recoverSnapshot first
Skip stagingSurprises in prodAlways test on prod-size data first

11. Lab

11.1 Day 1 — Safe vs unsafe DDL

Test on huge_table (10M rows):

  • ALTER ADD COLUMN x int — time it
  • ALTER ADD COLUMN x int DEFAULT 1 — time it (PG11+ should be fast)
  • ALTER ADD COLUMN x uuid DEFAULT gen_random_uuid() — slow! Volatile function
  • Compare each

11.2 Day 2 — Backfill chunk

Add column, backfill 10M rows in chunks:

  • Compare 1 UPDATE vs chunked
  • Watch replication lag during each
  • Watch n_dead_tup growth

11.3 Day 3 — gh-ost on MySQL

Setup MySQL + gh-ost:

gh-ost \
    --alter="ADD COLUMN test_col INT NOT NULL DEFAULT 0" \
    --database=test --table=users \
    --execute

Observe cut-over.

11.4 Day 4 — pg_repack

# Create bloat
docker exec postgres psql -c "ALTER TABLE users SET (autovacuum_enabled = false);"
# Run updates...
docker exec postgres psql -c "SELECT pg_size_pretty(pg_relation_size('users'));"
 
pg_repack -h localhost -d test_db -t users --jobs 4
# Check size dropped

11.5 Day 5 — Rename column (expand-contract)

Implement full pattern in lab:

  • ADD new column
  • Trigger sync
  • Backfill
  • Deploy app v2 (simulate by changing queries)
  • Verify
  • Drop old

11.6 Day 6 — Atlas / migration tool

Pick a tool (sqlx-migrate, golang-migrate, Atlas), apply 5 migrations forward + 2 rollbacks. Get comfortable with CLI.

11.7 Day 7 — Real scenario

Take e-commerce schema from Tuần 02. Migrate it to add currency column to users, NOT NULL, default ‘USD’, applying expand-contract.


12. Self-check

  1. Expand-contract — 6 steps?
  2. ALTER TABLE huge ADD COLUMN x int DEFAULT 0 PG10 vs PG11+ — khác biệt?
  3. Add NOT NULL trick PG12+?
  4. CREATE INDEX CONCURRENTLY — process + caveats?
  5. Backfill 100M rows — strategy + chunk size?
  6. lock_timeout — vì sao mandatory cho production migration?
  7. gh-ost cơ chế — copy + binlog?
  8. Forward-only vs reversible migration — pick mindset 2024?
  9. Rollback plan content — 3 things must include?
  10. Replication lag spike during migration — cause + fix?

13. Tiếp theo

Bài tiếp: Tuan-09-DB-Observability-Tuning — sau khi master change, master visibility.


Tuần 08 hoàn thành. Expand, contract, never down. Cập nhật: 2026-05-16