Bonus 5 — Distributed SQL: CockroachDB, TiDB, YugabyteDB, Aurora DSQL, Spanner (Engineering)

“Distributed SQL = ACID + horizontal scale + SQL semantics. Spanner/Aurora DSQL/CockroachDB make it real. Bonus tuần này đào sâu engineering side để biết khi nào move + cách deploy production.”

Tags: database distributed-sql cockroachdb tidb yugabyte spanner dsql Thời lượng: 5-6 ngày Prerequisites: Tuan-05-Transactions-Locking · Tuan-Bonus-Multi-Region-Active-Active-DSQL (SD course) Liên quan: Case-Design-Migrate-Monolith-DB


1. Landscape 2024-2026

1.1 The contenders

ProductVendorOSSWire ProtocolArchitecture
SpannerGoogleNo (Cloud only)Custom + JDBCPaxos + TrueTime atomic clocks
CockroachDBCockroach LabsSource-available BSL (Community), CommercialPostgresRaft per range + HLC
TiDBPingCAPApache 2.0MySQLRaft + Percolator (TSO)
YugabyteDBYugabyte IncApache 2.0 (Community)Postgres + Cassandra (YCQL)Raft per tablet + HLC
Aurora DSQLAWSNo (Cloud only)PostgresBounded-skew clock + journal-based
TiFlashPingCAPApache 2.0TiDBColumnar replicas for HTAP
FoundationDBAppleApache 2.0CustomOrdered KV + transactions, used by Snowflake metadata

1.2 Trade-off space

graph LR
    Single[Single Postgres<br/>1ms latency<br/>limited scale] --> Sharded[Vitess/Citus<br/>App-aware sharding<br/>SQL limitations]
    Sharded --> DistSQL[CockroachDB/TiDB<br/>ACID distributed<br/>2-5ms latency]
    DistSQL --> Global[Spanner/DSQL<br/>Multi-region strong<br/>5-50ms latency]

    style Single fill:#c8e6c9
    style DistSQL fill:#fff9c4
    style Global fill:#ffccbc

Each step: more scale, more latency cost. Pick based on real need.


2. CockroachDB Architecture

2.1 Layered design

graph TB
    SQL[SQL layer<br/>Postgres-compat parser, planner, optimizer]
    TX[Transaction layer<br/>distributed across ranges]
    DIST[Distribution layer<br/>routes to ranges]
    REP[Replication layer<br/>Raft per range]
    STORE[Storage layer<br/>Pebble - RocksDB fork in Go]

    Client --> SQL
    SQL --> TX
    TX --> DIST
    DIST --> REP
    REP --> STORE

2.2 Range-based partitioning

Table split into 64MB ranges. Each range:

  • Has a primary + 2 replicas (RF=3 default)
  • Independent Raft group
  • Can rebalance automatically
  • Has a leaseholder (typically Raft leader)
-- See ranges
SHOW RANGES FROM TABLE orders;
-- Lists: range_id, key, lease_holder, replicas, size
 
-- Manually split (rare)
ALTER TABLE orders SPLIT AT VALUES ('2026-01-01');

2.3 Raft per range

Each range = independent Raft group. Replicas elect leader, leader handles writes.

graph TB
    subgraph "Range 1 (rows 0-1M)"
        R1L[Leader N1]
        R1F1[Follower N2]
        R1F2[Follower N3]
    end

    subgraph "Range 2 (rows 1M-2M)"
        R2L[Leader N2]
        R2F1[Follower N1]
        R2F2[Follower N3]
    end

    R1L -.Raft.-> R1F1
    R1L -.Raft.-> R1F2
    R2L -.Raft.-> R2F1
    R2L -.Raft.-> R2F2

Important: each range’s Raft is independent → scales horizontally + parallel.

2.4 Distributed transactions

CRDB uses modified Percolator style 2PC:

sequenceDiagram
    Client->>Coord: BEGIN
    Coord->>Range1: write tentatively (intent)
    Coord->>Range2: write tentatively (intent)
    Note over Coord: Pick commit timestamp (HLC)
    Coord->>Range1: commit phase (resolve intent)
    Coord->>Range2: commit phase
    Coord-->>Client: COMMIT OK

    Note over Range1,Range2: Async cleanup of intents

Optimizations:

  • Parallel commits — pipeline phases
  • 1PC fast path — single-range transaction skips 2PC
  • Async intent resolution — commit fast, clean up later

2.5 Hot range problem

Sequential key → all writes hit single range. CRDB pattern:

-- BAD - sequential PK
CREATE TABLE events (id SERIAL PRIMARY KEY, ...);
 
-- GOOD - random UUID
CREATE TABLE events (id UUID DEFAULT gen_random_uuid() PRIMARY KEY, ...);
 
-- ALSO GOOD - hash-prefixed sequential
CREATE TABLE events (
    bucket INT8 DEFAULT mod(unique_rowid(), 16),
    id SERIAL,
    PRIMARY KEY (bucket, id)
) PARTITION BY HASH (bucket);

unique_rowid() returns CRDB-style ID with shard prefix to avoid hot range automatically.

2.6 Locality (multi-region)

Tell each node its locality:

cockroach start --locality=region=us-east-1,zone=us-east-1a ...

Place rows based on locality (Regional By Row):

ALTER DATABASE appdb PRIMARY REGION 'us-east-1';
ALTER DATABASE appdb ADD REGION 'eu-west-1';
 
ALTER TABLE users SET LOCALITY REGIONAL BY ROW;
-- Each row gets `crdb_region` column; data stored in row's region

Result: row read/write from local region. Cross-region only when row is from other region.

2.7 Schema change rollout

CRDB schema changes are online + non-blocking (descriptor versioning):

graph LR
    V1[Schema v1 active] --> Adding[Adding new column<br/>v1 + v2 coexist]
    Adding --> V2[Schema v2 active<br/>old v1 deprecated]
    V2 --> Drop[Drop v1 column<br/>actual data removal]

Multiple cluster versions can run. Online schema change in seconds for most DDL.


3. TiDB Architecture

3.1 Components

graph TB
    Client --> TiDB[TiDB nodes<br/>MySQL compat<br/>SQL parser + planner<br/>stateless]
    TiDB --> PD[PD - Placement Driver<br/>cluster metadata + TSO]
    TiDB --> TiKV[TiKV nodes<br/>distributed KV<br/>Raft groups]
    TiKV -.async.-> TiFlash[TiFlash<br/>columnar replicas<br/>HTAP analytics]
    TiDB --> TiFlash

    style TiFlash fill:#fff9c4
  • TiDB: SQL layer, stateless, scale horizontally
  • PD: cluster metadata, timestamp oracle (TSO)
  • TiKV: distributed KV, Raft consensus
  • TiFlash: columnar replicas for analytics

3.2 HTAP — Hybrid OLTP/OLAP

TiFlash replicates TiKV data in columnar format. Analytics queries route to TiFlash automatically.

ALTER TABLE orders SET TIFLASH REPLICA 2;

Query optimizer picks TiKV (row) or TiFlash (column) per query.

Use case: OLTP + OLAP on same data, no ETL. Real-time analytics.

3.3 MySQL compatibility

99%+ MySQL syntax compatible. Easier migration from MySQL than to CockroachDB (Postgres).

Differences:

  • Some MySQL features missing (stored procedures limited, no full triggers)
  • TiDB-specific extensions (placement rules, region split)
  • Different optimizer (cost-based, similar to MySQL 8 but custom)

3.4 Percolator-style transactions

Uses Google Percolator algorithm + TSO (TimeStamp Oracle from PD).

sequenceDiagram
    Client->>TiDB: BEGIN
    TiDB->>PD: get_start_ts
    Client->>TiDB: writes...
    Client->>TiDB: COMMIT
    TiDB->>PD: get_commit_ts
    TiDB->>TiKV: prewrite (lock + tentative value)
    TiDB->>TiKV: commit (resolve locks)
    TiDB-->>Client: OK

TSO from PD provides global ordering. PD HA via Raft.

3.5 Setup (tiup)

# Install tiup
curl --proto '=https' --tlsv1.2 -sSf https://tiup-mirrors.pingcap.com/install.sh | sh
 
# Playground (single node)
tiup playground
 
# Production cluster
tiup cluster deploy mycluster v8.0.0 topology.yaml
tiup cluster start mycluster

3.6 Use case fit

  • High-volume OLTP MySQL-compat
  • Need HTAP (real-time analytics)
  • Self-hosted multi-region
  • Distributed sharding without app changes

Big users: PingCAP, ZTO Express, Lenovo, Pinterest, Square (partial).


4. YugabyteDB

4.1 Architecture

Postgres-compatible (PG11+ syntax reused). DocDB storage engine.

graph TB
    YQL[YSQL - Postgres compat]
    YCQL[YCQL - Cassandra compat]

    YQL --> Tablet[Tablet servers<br/>Raft + RocksDB]
    YCQL --> Tablet
    Master[Master servers<br/>metadata + load balancer]
    Tablet -.coordinated by.-> Master

Two APIs:

  • YSQL: Postgres-compatible (Postgres 11.x query layer reused)
  • YCQL: Cassandra-compatible CQL

Both share same storage layer.

4.2 Strengths

  • True Postgres compat (more features than CRDB)
  • Multi-cloud, self-host friendly
  • Apache 2.0 OSS Community Edition (more permissive than CRDB’s BSL)
  • Geo-partitioning

4.3 Trade-offs vs CockroachDB

CockroachDBYugabyteDB
LicenseBSL (eventually Apache after 4y)Apache 2.0
Postgres compatBuilt from scratchReuses PG code
Distributed transactionsModified PercolatorHybrid (PG + HLC)
Performance per nodeSlightly higherSlightly lower
EcosystemSmallerSmaller
CommunityActiveActive

5. Aurora DSQL (AWS)

5.1 Timeline

  • Preview: Dec 2024 (re:Invent)
  • GA: May 2025

5.2 Architecture

graph TB
    Client --> Endpoint[DSQL Endpoint<br/>per-region]
    Endpoint --> Compute[Compute nodes<br/>stateless, scale auto]
    Compute --> Journal[(Distributed Journal<br/>commit log)]
    Journal --> Storage[(Distributed Storage<br/>multi-AZ replicated)]

    Journal -.async cross-region.-> Journal2[Region 2 Journal]

Key innovations:

  • Decoupled compute + storage + journal — each scales independently
  • Multi-region active-active with strong consistency
  • Postgres wire protocol + most PG features
  • Serverless — no instance to manage

5.3 Consistency mechanism

DSQL does NOT use atomic clocks (like Spanner). Instead:

  • Bounded clock skew assumption (EC2 NTP + monitoring)
  • Adjustable clock service — bounded by HW
  • Journal-based commit ordering — totally ordered log

Different from Spanner TrueTime, different from CRDB HLC. AWS’s own design.

5.4 Strengths

  • Zero ops (managed)
  • Multi-region active-active
  • Postgres compatible
  • Auto-scaling
  • AWS integration (IAM, CloudWatch, etc)

5.5 Limitations (May 2025 GA)

  • AWS-only
  • Postgres 15-level compat (some features missing)
  • Some PG extensions not supported
  • Cost can scale rapidly
  • New product — battle-testing limited

5.6 Use case

Multi-region SaaS needing strong consistency on AWS. E.g., banking, inventory, multi-region B2B.


6. Spanner (Google)

6.1 The OG distributed SQL

Released as service 2017, paper 2012. Inspired all modern distributed SQL.

6.2 Architecture

graph TB
    Client --> Tablet[Tablet servers<br/>Paxos groups<br/>data stored as splits]
    Tablet --> Colossus[(Colossus<br/>distributed FS)]

    TrueTime[TrueTime API<br/>GPS + atomic clocks<br/>at each datacenter]
    TrueTime -.bounded uncertainty.-> Tablet

6.3 TrueTime — the secret sauce

TrueTime returns TT.now() = [earliest, latest] interval. Bounded by ~7ms typically.

Commit-wait: Spanner waits out the uncertainty to guarantee external consistency.

commit_timestamp = TT.now().latest
wait until TT.now().earliest > commit_timestamp
// Now safe to commit; all observers see consistent ordering

Hardware: GPS receivers + atomic clocks in every datacenter. Custom AWS/Google can’t easily replicate without similar HW.

6.4 External consistency

Spanner’s claim: globally external consistency. Means:

  • T1 commits before T2 starts → T2 sees T1
  • True even across continents
  • Stronger than serializability

This is unique to Spanner among major DBs.

6.5 Strengths

  • Battle-tested 10+ years inside Google
  • Backbone of YouTube, Gmail, Ads
  • True multi-region strong consistency
  • SQL (Standard SQL since 2017)
  • Petabyte scale

6.6 Limitations

  • GCP-only (no on-prem version, even via Anthos)
  • Custom SQL dialect (not Postgres/MySQL)
  • TrueTime requires GCP HW
  • Some features take long to add (no triggers until recently)

6.7 Interleaved tables

CREATE TABLE orders (...) PRIMARY KEY (user_id, order_id),
    INTERLEAVE IN PARENT users ON DELETE CASCADE;

Physically stores child rows next to parent → JOIN within parent locality fast.

Unique to Spanner; pattern others mimic.

6.8 Use case

Massive scale (>petabyte), multi-region strong consistency, willing to commit to GCP. Few apps qualify.


7. Real Performance Numbers

Distributed SQL has per-query latency overhead vs single Postgres:

  • Single-range transaction: 1-2ms (close to single PG)
  • Multi-range transaction: 2-5ms (network + 2PC)
  • Cross-region transaction: 50-200ms+ (depends on region distance)

Throughput on same hardware:

  • Single PG: ~50K TPS limit per instance
  • Distributed SQL: 100K-1M TPS aggregate across cluster

Rule of thumb:

  • Single PG handles up to ~10TB + 50K TPS
  • Distributed SQL: when exceed either

8. Migration Considerations

8.1 SQL compatibility gaps

Not 100% even for “Postgres-compat”:

  • Some Postgres extensions missing (CRDB has fewer extensions)
  • Cursor-heavy code may behave differently
  • Vendor-specific SQL (pg_catalog differences)
  • Stored procedures limited
  • Triggers different semantics
  • Sequences behave differently (gaps OK, can’t rely on order)

Test extensively in staging.

8.2 Performance characteristics

  • Per-query latency higher
  • High concurrency throughput much higher
  • Some queries slower (cross-range JOIN)
  • COUNT(*) on huge tables slow (must scan all ranges)

8.3 Operational

  • Different backup (continuous, distributed)
  • Monitoring different (per range, leaseholder, region)
  • Upgrade procedures (rolling, version compatibility)
  • Different alerts (Raft, leaseholder, ranges)

8.4 Cost

Distributed SQL cluster has higher base cost (multiple nodes minimum):

  • CRDB 3-node small cluster: $300-1000/month
  • Aurora DSQL: pay-per-query
  • TiDB cluster: similar to CRDB

vs Single Postgres: $50-200/month

→ Justify with scale or HA need.


9. Picking Distributed SQL

flowchart TD
    A[Need distributed SQL?] --> B{Data fits single Postgres?}
    B -->|Yes <10TB, 50K TPS| C[Stick with Postgres - simpler]

    B -->|No| D{Multi-region active-active critical?}
    D -->|Yes| E{Cloud?}
    E -->|AWS only| F[Aurora DSQL]
    E -->|GCP only, massive scale| G[Spanner]
    E -->|Multi-cloud / self-host| H[CockroachDB or Yugabyte]

    D -->|No - single region scale-out| I{Existing stack?}
    I -->|MySQL| J[TiDB - keeps MySQL compat]
    I -->|Postgres| K[CockroachDB or Yugabyte]

    style C fill:#c8e6c9
    style F fill:#fff9c4
    style H fill:#fff9c4
    style J fill:#fff9c4
    style K fill:#fff9c4

10. Pitfalls Specific to Distributed SQL

10.1 Sequential PK = hot range

→ Use UUID v4 or hash-prefixed.

10.2 Cross-range JOIN slow

Plan queries with locality in mind. Co-locate frequently-joined data.

10.3 Multi-row writes more expensive

-- 1000 INSERTs in 1 transaction across 1000 ranges = slow
-- Batch with IN clause better; same range best

10.4 Index everywhere costs

Each secondary index = its own ranges = its own Raft groups. More indexes = more replication overhead.

10.5 Long-running transactions slow more

Hold locks across multiple ranges. Network amplifies.

→ Keep transactions short.

10.6 COUNT(*) very slow

Must scan all ranges. Use approximate or maintained counters.

10.7 Vacuum-equivalent maintenance

CRDB has automatic GC. TiDB compaction. Different from Postgres VACUUM but similar concept.


11. Lab

Day 1: CockroachDB local cluster

# Single binary
cockroach demo
 
# Or 3-node cluster
cockroach start --insecure --listen-addr=localhost:26257 --http-addr=localhost:8080 --join=localhost:26257,localhost:26258,localhost:26259
# (repeat for 2 more nodes with different ports)
cockroach init --insecure --host=localhost:26257

Test failover: kill one node, verify cluster still serves.

Day 2: TiDB playground

tiup playground

Auto-spins TiDB + TiKV + PD. Connect via MySQL client. Test.

Day 3: Workload benchmark

# CRDB workload
cockroach workload init kv
cockroach workload run kv --duration=1m
 
# Compare with single Postgres pgbench
pgbench -c 50 -j 4 -T 60 postgres

Day 4: Multi-region simulation

Use Docker network latency:

docker network create --driver=overlay --opt encrypted ...

Add latency between containers. Measure cross-region transaction time.

Day 5: Schema change online

While running workload, run schema changes. Verify zero downtime.

Day 6: Migration plan

Plan: monolithic Postgres → CockroachDB. List:

  • SQL compat checks
  • Performance test queries
  • Migration approach (logical replication?)
  • Rollback plan

Day 7: Cost analysis

Compare TCO:

  • 3-node CRDB cluster
  • Single beefy Postgres + read replica
  • Aurora DSQL serverless

For your specific workload (TPS, data size).


12. Anti-patterns

PatternWhy badFix
Move to distributed SQL “for HA”Cheaper: Postgres + PatroniIdentify real bottleneck
Sequential auto-increment PKHot rangeUUID v4 or random prefix
Cross-region transaction in hot pathLatency 50-200msKeep transaction region-local
Many small transactions2PC overheadBatch into larger
Forgot SCAN COUNT(*) is slowSlow dashboardsApproximate or counter
Migrate without stagingSurprise SQL incompatMirror prod in staging
Don’t tune Raft configDefault may suboptimalTune per workload
Ignore range hot spotsBottleneck on 1 nodeDistribute write
Use as drop-in single PostgresDifferent perf modelRe-architect

13. Self-check

  1. Range vs shard — khác biệt CRDB vs Vitess?
  2. Raft per range — vì sao scale?
  3. TrueTime — Spanner unique advantage là gì?
  4. Aurora DSQL khác Spanner clock như thế nào?
  5. TiDB HTAP cơ chế (TiFlash)?
  6. Hot range — cách prevent với CRDB?
  7. CRDB vs YugabyteDB — pick khi nào?
  8. Per-query latency overhead — bao nhiêu vs single PG?
  9. Sequential PK trap — vì sao?
  10. Migration risk lớn nhất từ Postgres → CRDB?

14. Tiếp theo

Tuan-Bonus-MySQL-Internals

Cập nhật: 2026-05-16