Bonus 5 — Distributed SQL: CockroachDB, TiDB, YugabyteDB, Aurora DSQL, Spanner (Engineering)

“Distributed SQL = ACID + horizontal scale + SQL semantics. Spanner/Aurora DSQL/CockroachDB make it real. Bonus tuần này đào sâu engineering side để biết khi nào move + cách deploy production.”

Tags: database distributed-sql cockroachdb tidb yugabyte spanner dsql Thời lượng: 5-6 ngày Prerequisites: Tuan-05-Transactions-Locking · Tuan-Bonus-Multi-Region-Active-Active-DSQL (SD course) Liên quan: Case-Design-Migrate-Monolith-DB

1. Landscape 2024-2026

1.1 The contenders

Product	Vendor	OSS	Wire Protocol	Architecture
Spanner	Google	No (Cloud only)	Custom + JDBC	Paxos + TrueTime atomic clocks
CockroachDB	Cockroach Labs	Source-available BSL (Community), Commercial	Postgres	Raft per range + HLC
TiDB	PingCAP	Apache 2.0	MySQL	Raft + Percolator (TSO)
YugabyteDB	Yugabyte Inc	Apache 2.0 (Community)	Postgres + Cassandra (YCQL)	Raft per tablet + HLC
Aurora DSQL	AWS	No (Cloud only)	Postgres	Bounded-skew clock + journal-based
TiFlash	PingCAP	Apache 2.0	TiDB	Columnar replicas for HTAP
FoundationDB	Apple	Apache 2.0	Custom	Ordered KV + transactions, used by Snowflake metadata

1.2 Trade-off space

graph LR
    Single[Single Postgres<br/>1ms latency<br/>limited scale] --> Sharded[Vitess/Citus<br/>App-aware sharding<br/>SQL limitations]
    Sharded --> DistSQL[CockroachDB/TiDB<br/>ACID distributed<br/>2-5ms latency]
    DistSQL --> Global[Spanner/DSQL<br/>Multi-region strong<br/>5-50ms latency]

    style Single fill:#c8e6c9
    style DistSQL fill:#fff9c4
    style Global fill:#ffccbc

Each step: more scale, more latency cost. Pick based on real need.

2. CockroachDB Architecture

2.1 Layered design

graph TB
    SQL[SQL layer<br/>Postgres-compat parser, planner, optimizer]
    TX[Transaction layer<br/>distributed across ranges]
    DIST[Distribution layer<br/>routes to ranges]
    REP[Replication layer<br/>Raft per range]
    STORE[Storage layer<br/>Pebble - RocksDB fork in Go]

    Client --> SQL
    SQL --> TX
    TX --> DIST
    DIST --> REP
    REP --> STORE

2.2 Range-based partitioning

Table split into 64MB ranges. Each range:

Has a primary + 2 replicas (RF=3 default)
Independent Raft group
Can rebalance automatically
Has a leaseholder (typically Raft leader)

-- See ranges
SHOW RANGES FROM TABLE orders;
-- Lists: range_id, key, lease_holder, replicas, size
 
-- Manually split (rare)
ALTER TABLE orders SPLIT AT VALUES ('2026-01-01');

2.3 Raft per range

Each range = independent Raft group. Replicas elect leader, leader handles writes.

graph TB
    subgraph "Range 1 (rows 0-1M)"
        R1L[Leader N1]
        R1F1[Follower N2]
        R1F2[Follower N3]
    end

    subgraph "Range 2 (rows 1M-2M)"
        R2L[Leader N2]
        R2F1[Follower N1]
        R2F2[Follower N3]
    end

    R1L -.Raft.-> R1F1
    R1L -.Raft.-> R1F2
    R2L -.Raft.-> R2F1
    R2L -.Raft.-> R2F2

Important: each range’s Raft is independent → scales horizontally + parallel.

2.4 Distributed transactions

CRDB uses modified Percolator style 2PC:

sequenceDiagram
    Client->>Coord: BEGIN
    Coord->>Range1: write tentatively (intent)
    Coord->>Range2: write tentatively (intent)
    Note over Coord: Pick commit timestamp (HLC)
    Coord->>Range1: commit phase (resolve intent)
    Coord->>Range2: commit phase
    Coord-->>Client: COMMIT OK

    Note over Range1,Range2: Async cleanup of intents

Optimizations:

Parallel commits — pipeline phases
1PC fast path — single-range transaction skips 2PC
Async intent resolution — commit fast, clean up later

2.5 Hot range problem

Sequential key → all writes hit single range. CRDB pattern:

-- BAD - sequential PK
CREATE TABLE events (id SERIAL PRIMARY KEY, ...);
 
-- GOOD - random UUID
CREATE TABLE events (id UUID DEFAULT gen_random_uuid() PRIMARY KEY, ...);
 
-- ALSO GOOD - hash-prefixed sequential
CREATE TABLE events (
    bucket INT8 DEFAULT mod(unique_rowid(), 16),
    id SERIAL,
    PRIMARY KEY (bucket, id)
) PARTITION BY HASH (bucket);

unique_rowid() returns CRDB-style ID with shard prefix to avoid hot range automatically.

2.6 Locality (multi-region)

Tell each node its locality:

cockroach start --locality=region=us-east-1,zone=us-east-1a ...

Place rows based on locality (Regional By Row):

ALTER DATABASE appdb PRIMARY REGION 'us-east-1';
ALTER DATABASE appdb ADD REGION 'eu-west-1';
 
ALTER TABLE users SET LOCALITY REGIONAL BY ROW;
-- Each row gets `crdb_region` column; data stored in row's region

Result: row read/write from local region. Cross-region only when row is from other region.

2.7 Schema change rollout

CRDB schema changes are online + non-blocking (descriptor versioning):

graph LR
    V1[Schema v1 active] --> Adding[Adding new column<br/>v1 + v2 coexist]
    Adding --> V2[Schema v2 active<br/>old v1 deprecated]
    V2 --> Drop[Drop v1 column<br/>actual data removal]

Multiple cluster versions can run. Online schema change in seconds for most DDL.

3. TiDB Architecture

3.1 Components

graph TB
    Client --> TiDB[TiDB nodes<br/>MySQL compat<br/>SQL parser + planner<br/>stateless]
    TiDB --> PD[PD - Placement Driver<br/>cluster metadata + TSO]
    TiDB --> TiKV[TiKV nodes<br/>distributed KV<br/>Raft groups]
    TiKV -.async.-> TiFlash[TiFlash<br/>columnar replicas<br/>HTAP analytics]
    TiDB --> TiFlash

    style TiFlash fill:#fff9c4

TiDB: SQL layer, stateless, scale horizontally
PD: cluster metadata, timestamp oracle (TSO)
TiKV: distributed KV, Raft consensus
TiFlash: columnar replicas for analytics

3.2 HTAP — Hybrid OLTP/OLAP

TiFlash replicates TiKV data in columnar format. Analytics queries route to TiFlash automatically.

ALTER TABLE orders SET TIFLASH REPLICA 2;

Query optimizer picks TiKV (row) or TiFlash (column) per query.

Use case: OLTP + OLAP on same data, no ETL. Real-time analytics.

3.3 MySQL compatibility

99%+ MySQL syntax compatible. Easier migration from MySQL than to CockroachDB (Postgres).

Differences:

Some MySQL features missing (stored procedures limited, no full triggers)
TiDB-specific extensions (placement rules, region split)
Different optimizer (cost-based, similar to MySQL 8 but custom)

3.4 Percolator-style transactions

Uses Google Percolator algorithm + TSO (TimeStamp Oracle from PD).

sequenceDiagram
    Client->>TiDB: BEGIN
    TiDB->>PD: get_start_ts
    Client->>TiDB: writes...
    Client->>TiDB: COMMIT
    TiDB->>PD: get_commit_ts
    TiDB->>TiKV: prewrite (lock + tentative value)
    TiDB->>TiKV: commit (resolve locks)
    TiDB-->>Client: OK

TSO from PD provides global ordering. PD HA via Raft.

3.5 Setup (tiup)

# Install tiup
curl --proto '=https' --tlsv1.2 -sSf https://tiup-mirrors.pingcap.com/install.sh | sh
 
# Playground (single node)
tiup playground
 
# Production cluster
tiup cluster deploy mycluster v8.0.0 topology.yaml
tiup cluster start mycluster

3.6 Use case fit

High-volume OLTP MySQL-compat
Need HTAP (real-time analytics)
Self-hosted multi-region
Distributed sharding without app changes

Big users: PingCAP, ZTO Express, Lenovo, Pinterest, Square (partial).

4. YugabyteDB

4.1 Architecture

Postgres-compatible (PG11+ syntax reused). DocDB storage engine.

graph TB
    YQL[YSQL - Postgres compat]
    YCQL[YCQL - Cassandra compat]

    YQL --> Tablet[Tablet servers<br/>Raft + RocksDB]
    YCQL --> Tablet
    Master[Master servers<br/>metadata + load balancer]
    Tablet -.coordinated by.-> Master

Two APIs:

YSQL: Postgres-compatible (Postgres 11.x query layer reused)
YCQL: Cassandra-compatible CQL

Both share same storage layer.

4.2 Strengths

True Postgres compat (more features than CRDB)
Multi-cloud, self-host friendly
Apache 2.0 OSS Community Edition (more permissive than CRDB’s BSL)
Geo-partitioning

4.3 Trade-offs vs CockroachDB

	CockroachDB	YugabyteDB
License	BSL (eventually Apache after 4y)	Apache 2.0
Postgres compat	Built from scratch	Reuses PG code
Distributed transactions	Modified Percolator	Hybrid (PG + HLC)
Performance per node	Slightly higher	Slightly lower
Ecosystem	Smaller	Smaller
Community	Active	Active

5. Aurora DSQL (AWS)

5.1 Timeline

Preview: Dec 2024 (re:Invent)
GA: May 2025

5.2 Architecture

graph TB
    Client --> Endpoint[DSQL Endpoint<br/>per-region]
    Endpoint --> Compute[Compute nodes<br/>stateless, scale auto]
    Compute --> Journal[(Distributed Journal<br/>commit log)]
    Journal --> Storage[(Distributed Storage<br/>multi-AZ replicated)]

    Journal -.async cross-region.-> Journal2[Region 2 Journal]

Key innovations:

Decoupled compute + storage + journal — each scales independently
Multi-region active-active with strong consistency
Postgres wire protocol + most PG features
Serverless — no instance to manage

5.3 Consistency mechanism

DSQL does NOT use atomic clocks (like Spanner). Instead:

Bounded clock skew assumption (EC2 NTP + monitoring)
Adjustable clock service — bounded by HW
Journal-based commit ordering — totally ordered log

Different from Spanner TrueTime, different from CRDB HLC. AWS’s own design.

5.4 Strengths

Zero ops (managed)
Multi-region active-active
Postgres compatible
Auto-scaling
AWS integration (IAM, CloudWatch, etc)

5.5 Limitations (May 2025 GA)

AWS-only
Postgres 15-level compat (some features missing)
Some PG extensions not supported
Cost can scale rapidly
New product — battle-testing limited

5.6 Use case

Multi-region SaaS needing strong consistency on AWS. E.g., banking, inventory, multi-region B2B.

6. Spanner (Google)

6.1 The OG distributed SQL

Released as service 2017, paper 2012. Inspired all modern distributed SQL.

6.2 Architecture

graph TB
    Client --> Tablet[Tablet servers<br/>Paxos groups<br/>data stored as splits]
    Tablet --> Colossus[(Colossus<br/>distributed FS)]

    TrueTime[TrueTime API<br/>GPS + atomic clocks<br/>at each datacenter]
    TrueTime -.bounded uncertainty.-> Tablet

6.3 TrueTime — the secret sauce

TrueTime returns TT.now() = [earliest, latest] interval. Bounded by ~7ms typically.

Commit-wait: Spanner waits out the uncertainty to guarantee external consistency.

commit_timestamp = TT.now().latest
wait until TT.now().earliest > commit_timestamp
// Now safe to commit; all observers see consistent ordering

Hardware: GPS receivers + atomic clocks in every datacenter. Custom AWS/Google can’t easily replicate without similar HW.

6.4 External consistency

Spanner’s claim: globally external consistency. Means:

T1 commits before T2 starts → T2 sees T1
True even across continents
Stronger than serializability

This is unique to Spanner among major DBs.

6.5 Strengths

Battle-tested 10+ years inside Google
Backbone of YouTube, Gmail, Ads
True multi-region strong consistency
SQL (Standard SQL since 2017)
Petabyte scale

6.6 Limitations

GCP-only (no on-prem version, even via Anthos)
Custom SQL dialect (not Postgres/MySQL)
TrueTime requires GCP HW
Some features take long to add (no triggers until recently)

6.7 Interleaved tables

CREATE TABLE orders (...) PRIMARY KEY (user_id, order_id),
    INTERLEAVE IN PARENT users ON DELETE CASCADE;

Physically stores child rows next to parent → JOIN within parent locality fast.

Unique to Spanner; pattern others mimic.

6.8 Use case

Massive scale (>petabyte), multi-region strong consistency, willing to commit to GCP. Few apps qualify.

7. Real Performance Numbers

Distributed SQL has per-query latency overhead vs single Postgres:

Single-range transaction: 1-2ms (close to single PG)
Multi-range transaction: 2-5ms (network + 2PC)
Cross-region transaction: 50-200ms+ (depends on region distance)

Throughput on same hardware:

Single PG: ~50K TPS limit per instance
Distributed SQL: 100K-1M TPS aggregate across cluster

Rule of thumb:

Single PG handles up to ~10TB + 50K TPS
Distributed SQL: when exceed either

8. Migration Considerations

8.1 SQL compatibility gaps

Not 100% even for “Postgres-compat”:

Some Postgres extensions missing (CRDB has fewer extensions)
Cursor-heavy code may behave differently
Vendor-specific SQL (pg_catalog differences)
Stored procedures limited
Triggers different semantics
Sequences behave differently (gaps OK, can’t rely on order)

Test extensively in staging.

8.2 Performance characteristics

Per-query latency higher
High concurrency throughput much higher
Some queries slower (cross-range JOIN)
COUNT(*) on huge tables slow (must scan all ranges)

8.3 Operational

Different backup (continuous, distributed)
Monitoring different (per range, leaseholder, region)
Upgrade procedures (rolling, version compatibility)
Different alerts (Raft, leaseholder, ranges)

8.4 Cost

Distributed SQL cluster has higher base cost (multiple nodes minimum):

CRDB 3-node small cluster: $300-1000/month
Aurora DSQL: pay-per-query
TiDB cluster: similar to CRDB

vs Single Postgres: $50-200/month

→ Justify with scale or HA need.

9. Picking Distributed SQL

flowchart TD
    A[Need distributed SQL?] --> B{Data fits single Postgres?}
    B -->|Yes <10TB, 50K TPS| C[Stick with Postgres - simpler]

    B -->|No| D{Multi-region active-active critical?}
    D -->|Yes| E{Cloud?}
    E -->|AWS only| F[Aurora DSQL]
    E -->|GCP only, massive scale| G[Spanner]
    E -->|Multi-cloud / self-host| H[CockroachDB or Yugabyte]

    D -->|No - single region scale-out| I{Existing stack?}
    I -->|MySQL| J[TiDB - keeps MySQL compat]
    I -->|Postgres| K[CockroachDB or Yugabyte]

    style C fill:#c8e6c9
    style F fill:#fff9c4
    style H fill:#fff9c4
    style J fill:#fff9c4
    style K fill:#fff9c4

10. Pitfalls Specific to Distributed SQL

10.1 Sequential PK = hot range

→ Use UUID v4 or hash-prefixed.

10.2 Cross-range JOIN slow

Plan queries with locality in mind. Co-locate frequently-joined data.

10.3 Multi-row writes more expensive

-- 1000 INSERTs in 1 transaction across 1000 ranges = slow
-- Batch with IN clause better; same range best

10.4 Index everywhere costs

Each secondary index = its own ranges = its own Raft groups. More indexes = more replication overhead.

10.5 Long-running transactions slow more

Hold locks across multiple ranges. Network amplifies.

→ Keep transactions short.

10.6 COUNT(*) very slow

Must scan all ranges. Use approximate or maintained counters.

10.7 Vacuum-equivalent maintenance

CRDB has automatic GC. TiDB compaction. Different from Postgres VACUUM but similar concept.

11. Lab

Day 1: CockroachDB local cluster

# Single binary
cockroach demo
 
# Or 3-node cluster
cockroach start --insecure --listen-addr=localhost:26257 --http-addr=localhost:8080 --join=localhost:26257,localhost:26258,localhost:26259
# (repeat for 2 more nodes with different ports)
cockroach init --insecure --host=localhost:26257

Test failover: kill one node, verify cluster still serves.

Day 2: TiDB playground

tiup playground

Auto-spins TiDB + TiKV + PD. Connect via MySQL client. Test.

Day 3: Workload benchmark

# CRDB workload
cockroach workload init kv
cockroach workload run kv --duration=1m
 
# Compare with single Postgres pgbench
pgbench -c 50 -j 4 -T 60 postgres

Day 4: Multi-region simulation

Use Docker network latency:

docker network create --driver=overlay --opt encrypted ...

Add latency between containers. Measure cross-region transaction time.

Day 5: Schema change online

While running workload, run schema changes. Verify zero downtime.

Day 6: Migration plan

Plan: monolithic Postgres → CockroachDB. List:

SQL compat checks
Performance test queries
Migration approach (logical replication?)
Rollback plan

Day 7: Cost analysis

Compare TCO:

3-node CRDB cluster
Single beefy Postgres + read replica
Aurora DSQL serverless

For your specific workload (TPS, data size).

12. Anti-patterns

Pattern	Why bad	Fix
Move to distributed SQL “for HA”	Cheaper: Postgres + Patroni	Identify real bottleneck
Sequential auto-increment PK	Hot range	UUID v4 or random prefix
Cross-region transaction in hot path	Latency 50-200ms	Keep transaction region-local
Many small transactions	2PC overhead	Batch into larger
Forgot SCAN COUNT(*) is slow	Slow dashboards	Approximate or counter
Migrate without staging	Surprise SQL incompat	Mirror prod in staging
Don’t tune Raft config	Default may suboptimal	Tune per workload
Ignore range hot spots	Bottleneck on 1 node	Distribute write
Use as drop-in single Postgres	Different perf model	Re-architect

13. Self-check

Range vs shard — khác biệt CRDB vs Vitess?
Raft per range — vì sao scale?
TrueTime — Spanner unique advantage là gì?
Aurora DSQL khác Spanner clock như thế nào?
TiDB HTAP cơ chế (TiFlash)?
Hot range — cách prevent với CRDB?
CRDB vs YugabyteDB — pick khi nào?
Per-query latency overhead — bao nhiêu vs single PG?
Sequential PK trap — vì sao?
Migration risk lớn nhất từ Postgres → CRDB?

14. Tiếp theo

Tuan-Bonus-MySQL-Internals

Cập nhật: 2026-05-16

lthieu's notes

Explorer

Tuan-Bonus-Distributed-SQL-Engineering