Bonus 5 — Distributed SQL: CockroachDB, TiDB, YugabyteDB, Aurora DSQL, Spanner (Engineering)
“Distributed SQL = ACID + horizontal scale + SQL semantics. Spanner/Aurora DSQL/CockroachDB make it real. Bonus tuần này đào sâu engineering side để biết khi nào move + cách deploy production.”
Tags: database distributed-sql cockroachdb tidb yugabyte spanner dsql Thời lượng: 5-6 ngày Prerequisites: Tuan-05-Transactions-Locking · Tuan-Bonus-Multi-Region-Active-Active-DSQL (SD course) Liên quan: Case-Design-Migrate-Monolith-DB
1. Landscape 2024-2026
1.1 The contenders
| Product | Vendor | OSS | Wire Protocol | Architecture |
|---|---|---|---|---|
| Spanner | No (Cloud only) | Custom + JDBC | Paxos + TrueTime atomic clocks | |
| CockroachDB | Cockroach Labs | Source-available BSL (Community), Commercial | Postgres | Raft per range + HLC |
| TiDB | PingCAP | Apache 2.0 | MySQL | Raft + Percolator (TSO) |
| YugabyteDB | Yugabyte Inc | Apache 2.0 (Community) | Postgres + Cassandra (YCQL) | Raft per tablet + HLC |
| Aurora DSQL | AWS | No (Cloud only) | Postgres | Bounded-skew clock + journal-based |
| TiFlash | PingCAP | Apache 2.0 | TiDB | Columnar replicas for HTAP |
| FoundationDB | Apple | Apache 2.0 | Custom | Ordered KV + transactions, used by Snowflake metadata |
1.2 Trade-off space
graph LR Single[Single Postgres<br/>1ms latency<br/>limited scale] --> Sharded[Vitess/Citus<br/>App-aware sharding<br/>SQL limitations] Sharded --> DistSQL[CockroachDB/TiDB<br/>ACID distributed<br/>2-5ms latency] DistSQL --> Global[Spanner/DSQL<br/>Multi-region strong<br/>5-50ms latency] style Single fill:#c8e6c9 style DistSQL fill:#fff9c4 style Global fill:#ffccbc
Each step: more scale, more latency cost. Pick based on real need.
2. CockroachDB Architecture
2.1 Layered design
graph TB SQL[SQL layer<br/>Postgres-compat parser, planner, optimizer] TX[Transaction layer<br/>distributed across ranges] DIST[Distribution layer<br/>routes to ranges] REP[Replication layer<br/>Raft per range] STORE[Storage layer<br/>Pebble - RocksDB fork in Go] Client --> SQL SQL --> TX TX --> DIST DIST --> REP REP --> STORE
2.2 Range-based partitioning
Table split into 64MB ranges. Each range:
- Has a primary + 2 replicas (RF=3 default)
- Independent Raft group
- Can rebalance automatically
- Has a leaseholder (typically Raft leader)
-- See ranges
SHOW RANGES FROM TABLE orders;
-- Lists: range_id, key, lease_holder, replicas, size
-- Manually split (rare)
ALTER TABLE orders SPLIT AT VALUES ('2026-01-01');2.3 Raft per range
Each range = independent Raft group. Replicas elect leader, leader handles writes.
graph TB subgraph "Range 1 (rows 0-1M)" R1L[Leader N1] R1F1[Follower N2] R1F2[Follower N3] end subgraph "Range 2 (rows 1M-2M)" R2L[Leader N2] R2F1[Follower N1] R2F2[Follower N3] end R1L -.Raft.-> R1F1 R1L -.Raft.-> R1F2 R2L -.Raft.-> R2F1 R2L -.Raft.-> R2F2
Important: each range’s Raft is independent → scales horizontally + parallel.
2.4 Distributed transactions
CRDB uses modified Percolator style 2PC:
sequenceDiagram Client->>Coord: BEGIN Coord->>Range1: write tentatively (intent) Coord->>Range2: write tentatively (intent) Note over Coord: Pick commit timestamp (HLC) Coord->>Range1: commit phase (resolve intent) Coord->>Range2: commit phase Coord-->>Client: COMMIT OK Note over Range1,Range2: Async cleanup of intents
Optimizations:
- Parallel commits — pipeline phases
- 1PC fast path — single-range transaction skips 2PC
- Async intent resolution — commit fast, clean up later
2.5 Hot range problem
Sequential key → all writes hit single range. CRDB pattern:
-- BAD - sequential PK
CREATE TABLE events (id SERIAL PRIMARY KEY, ...);
-- GOOD - random UUID
CREATE TABLE events (id UUID DEFAULT gen_random_uuid() PRIMARY KEY, ...);
-- ALSO GOOD - hash-prefixed sequential
CREATE TABLE events (
bucket INT8 DEFAULT mod(unique_rowid(), 16),
id SERIAL,
PRIMARY KEY (bucket, id)
) PARTITION BY HASH (bucket);unique_rowid() returns CRDB-style ID with shard prefix to avoid hot range automatically.
2.6 Locality (multi-region)
Tell each node its locality:
cockroach start --locality=region=us-east-1,zone=us-east-1a ...Place rows based on locality (Regional By Row):
ALTER DATABASE appdb PRIMARY REGION 'us-east-1';
ALTER DATABASE appdb ADD REGION 'eu-west-1';
ALTER TABLE users SET LOCALITY REGIONAL BY ROW;
-- Each row gets `crdb_region` column; data stored in row's regionResult: row read/write from local region. Cross-region only when row is from other region.
2.7 Schema change rollout
CRDB schema changes are online + non-blocking (descriptor versioning):
graph LR V1[Schema v1 active] --> Adding[Adding new column<br/>v1 + v2 coexist] Adding --> V2[Schema v2 active<br/>old v1 deprecated] V2 --> Drop[Drop v1 column<br/>actual data removal]
Multiple cluster versions can run. Online schema change in seconds for most DDL.
3. TiDB Architecture
3.1 Components
graph TB Client --> TiDB[TiDB nodes<br/>MySQL compat<br/>SQL parser + planner<br/>stateless] TiDB --> PD[PD - Placement Driver<br/>cluster metadata + TSO] TiDB --> TiKV[TiKV nodes<br/>distributed KV<br/>Raft groups] TiKV -.async.-> TiFlash[TiFlash<br/>columnar replicas<br/>HTAP analytics] TiDB --> TiFlash style TiFlash fill:#fff9c4
- TiDB: SQL layer, stateless, scale horizontally
- PD: cluster metadata, timestamp oracle (TSO)
- TiKV: distributed KV, Raft consensus
- TiFlash: columnar replicas for analytics
3.2 HTAP — Hybrid OLTP/OLAP
TiFlash replicates TiKV data in columnar format. Analytics queries route to TiFlash automatically.
ALTER TABLE orders SET TIFLASH REPLICA 2;Query optimizer picks TiKV (row) or TiFlash (column) per query.
Use case: OLTP + OLAP on same data, no ETL. Real-time analytics.
3.3 MySQL compatibility
99%+ MySQL syntax compatible. Easier migration from MySQL than to CockroachDB (Postgres).
Differences:
- Some MySQL features missing (stored procedures limited, no full triggers)
- TiDB-specific extensions (placement rules, region split)
- Different optimizer (cost-based, similar to MySQL 8 but custom)
3.4 Percolator-style transactions
Uses Google Percolator algorithm + TSO (TimeStamp Oracle from PD).
sequenceDiagram Client->>TiDB: BEGIN TiDB->>PD: get_start_ts Client->>TiDB: writes... Client->>TiDB: COMMIT TiDB->>PD: get_commit_ts TiDB->>TiKV: prewrite (lock + tentative value) TiDB->>TiKV: commit (resolve locks) TiDB-->>Client: OK
TSO from PD provides global ordering. PD HA via Raft.
3.5 Setup (tiup)
# Install tiup
curl --proto '=https' --tlsv1.2 -sSf https://tiup-mirrors.pingcap.com/install.sh | sh
# Playground (single node)
tiup playground
# Production cluster
tiup cluster deploy mycluster v8.0.0 topology.yaml
tiup cluster start mycluster3.6 Use case fit
- High-volume OLTP MySQL-compat
- Need HTAP (real-time analytics)
- Self-hosted multi-region
- Distributed sharding without app changes
Big users: PingCAP, ZTO Express, Lenovo, Pinterest, Square (partial).
4. YugabyteDB
4.1 Architecture
Postgres-compatible (PG11+ syntax reused). DocDB storage engine.
graph TB YQL[YSQL - Postgres compat] YCQL[YCQL - Cassandra compat] YQL --> Tablet[Tablet servers<br/>Raft + RocksDB] YCQL --> Tablet Master[Master servers<br/>metadata + load balancer] Tablet -.coordinated by.-> Master
Two APIs:
- YSQL: Postgres-compatible (Postgres 11.x query layer reused)
- YCQL: Cassandra-compatible CQL
Both share same storage layer.
4.2 Strengths
- True Postgres compat (more features than CRDB)
- Multi-cloud, self-host friendly
- Apache 2.0 OSS Community Edition (more permissive than CRDB’s BSL)
- Geo-partitioning
4.3 Trade-offs vs CockroachDB
| CockroachDB | YugabyteDB | |
|---|---|---|
| License | BSL (eventually Apache after 4y) | Apache 2.0 |
| Postgres compat | Built from scratch | Reuses PG code |
| Distributed transactions | Modified Percolator | Hybrid (PG + HLC) |
| Performance per node | Slightly higher | Slightly lower |
| Ecosystem | Smaller | Smaller |
| Community | Active | Active |
5. Aurora DSQL (AWS)
5.1 Timeline
- Preview: Dec 2024 (re:Invent)
- GA: May 2025
5.2 Architecture
graph TB Client --> Endpoint[DSQL Endpoint<br/>per-region] Endpoint --> Compute[Compute nodes<br/>stateless, scale auto] Compute --> Journal[(Distributed Journal<br/>commit log)] Journal --> Storage[(Distributed Storage<br/>multi-AZ replicated)] Journal -.async cross-region.-> Journal2[Region 2 Journal]
Key innovations:
- Decoupled compute + storage + journal — each scales independently
- Multi-region active-active with strong consistency
- Postgres wire protocol + most PG features
- Serverless — no instance to manage
5.3 Consistency mechanism
DSQL does NOT use atomic clocks (like Spanner). Instead:
- Bounded clock skew assumption (EC2 NTP + monitoring)
- Adjustable clock service — bounded by HW
- Journal-based commit ordering — totally ordered log
Different from Spanner TrueTime, different from CRDB HLC. AWS’s own design.
5.4 Strengths
- Zero ops (managed)
- Multi-region active-active
- Postgres compatible
- Auto-scaling
- AWS integration (IAM, CloudWatch, etc)
5.5 Limitations (May 2025 GA)
- AWS-only
- Postgres 15-level compat (some features missing)
- Some PG extensions not supported
- Cost can scale rapidly
- New product — battle-testing limited
5.6 Use case
Multi-region SaaS needing strong consistency on AWS. E.g., banking, inventory, multi-region B2B.
6. Spanner (Google)
6.1 The OG distributed SQL
Released as service 2017, paper 2012. Inspired all modern distributed SQL.
6.2 Architecture
graph TB Client --> Tablet[Tablet servers<br/>Paxos groups<br/>data stored as splits] Tablet --> Colossus[(Colossus<br/>distributed FS)] TrueTime[TrueTime API<br/>GPS + atomic clocks<br/>at each datacenter] TrueTime -.bounded uncertainty.-> Tablet
6.3 TrueTime — the secret sauce
TrueTime returns TT.now() = [earliest, latest] interval. Bounded by ~7ms typically.
Commit-wait: Spanner waits out the uncertainty to guarantee external consistency.
commit_timestamp = TT.now().latest
wait until TT.now().earliest > commit_timestamp
// Now safe to commit; all observers see consistent ordering
Hardware: GPS receivers + atomic clocks in every datacenter. Custom AWS/Google can’t easily replicate without similar HW.
6.4 External consistency
Spanner’s claim: globally external consistency. Means:
- T1 commits before T2 starts → T2 sees T1
- True even across continents
- Stronger than serializability
This is unique to Spanner among major DBs.
6.5 Strengths
- Battle-tested 10+ years inside Google
- Backbone of YouTube, Gmail, Ads
- True multi-region strong consistency
- SQL (Standard SQL since 2017)
- Petabyte scale
6.6 Limitations
- GCP-only (no on-prem version, even via Anthos)
- Custom SQL dialect (not Postgres/MySQL)
- TrueTime requires GCP HW
- Some features take long to add (no triggers until recently)
6.7 Interleaved tables
CREATE TABLE orders (...) PRIMARY KEY (user_id, order_id),
INTERLEAVE IN PARENT users ON DELETE CASCADE;Physically stores child rows next to parent → JOIN within parent locality fast.
Unique to Spanner; pattern others mimic.
6.8 Use case
Massive scale (>petabyte), multi-region strong consistency, willing to commit to GCP. Few apps qualify.
7. Real Performance Numbers
Distributed SQL has per-query latency overhead vs single Postgres:
- Single-range transaction: 1-2ms (close to single PG)
- Multi-range transaction: 2-5ms (network + 2PC)
- Cross-region transaction: 50-200ms+ (depends on region distance)
Throughput on same hardware:
- Single PG: ~50K TPS limit per instance
- Distributed SQL: 100K-1M TPS aggregate across cluster
Rule of thumb:
- Single PG handles up to ~10TB + 50K TPS
- Distributed SQL: when exceed either
8. Migration Considerations
8.1 SQL compatibility gaps
Not 100% even for “Postgres-compat”:
- Some Postgres extensions missing (CRDB has fewer extensions)
- Cursor-heavy code may behave differently
- Vendor-specific SQL (
pg_catalogdifferences) - Stored procedures limited
- Triggers different semantics
- Sequences behave differently (gaps OK, can’t rely on order)
Test extensively in staging.
8.2 Performance characteristics
- Per-query latency higher
- High concurrency throughput much higher
- Some queries slower (cross-range JOIN)
- COUNT(*) on huge tables slow (must scan all ranges)
8.3 Operational
- Different backup (continuous, distributed)
- Monitoring different (per range, leaseholder, region)
- Upgrade procedures (rolling, version compatibility)
- Different alerts (Raft, leaseholder, ranges)
8.4 Cost
Distributed SQL cluster has higher base cost (multiple nodes minimum):
- CRDB 3-node small cluster: $300-1000/month
- Aurora DSQL: pay-per-query
- TiDB cluster: similar to CRDB
vs Single Postgres: $50-200/month
→ Justify with scale or HA need.
9. Picking Distributed SQL
flowchart TD A[Need distributed SQL?] --> B{Data fits single Postgres?} B -->|Yes <10TB, 50K TPS| C[Stick with Postgres - simpler] B -->|No| D{Multi-region active-active critical?} D -->|Yes| E{Cloud?} E -->|AWS only| F[Aurora DSQL] E -->|GCP only, massive scale| G[Spanner] E -->|Multi-cloud / self-host| H[CockroachDB or Yugabyte] D -->|No - single region scale-out| I{Existing stack?} I -->|MySQL| J[TiDB - keeps MySQL compat] I -->|Postgres| K[CockroachDB or Yugabyte] style C fill:#c8e6c9 style F fill:#fff9c4 style H fill:#fff9c4 style J fill:#fff9c4 style K fill:#fff9c4
10. Pitfalls Specific to Distributed SQL
10.1 Sequential PK = hot range
→ Use UUID v4 or hash-prefixed.
10.2 Cross-range JOIN slow
Plan queries with locality in mind. Co-locate frequently-joined data.
10.3 Multi-row writes more expensive
-- 1000 INSERTs in 1 transaction across 1000 ranges = slow
-- Batch with IN clause better; same range best10.4 Index everywhere costs
Each secondary index = its own ranges = its own Raft groups. More indexes = more replication overhead.
10.5 Long-running transactions slow more
Hold locks across multiple ranges. Network amplifies.
→ Keep transactions short.
10.6 COUNT(*) very slow
Must scan all ranges. Use approximate or maintained counters.
10.7 Vacuum-equivalent maintenance
CRDB has automatic GC. TiDB compaction. Different from Postgres VACUUM but similar concept.
11. Lab
Day 1: CockroachDB local cluster
# Single binary
cockroach demo
# Or 3-node cluster
cockroach start --insecure --listen-addr=localhost:26257 --http-addr=localhost:8080 --join=localhost:26257,localhost:26258,localhost:26259
# (repeat for 2 more nodes with different ports)
cockroach init --insecure --host=localhost:26257Test failover: kill one node, verify cluster still serves.
Day 2: TiDB playground
tiup playgroundAuto-spins TiDB + TiKV + PD. Connect via MySQL client. Test.
Day 3: Workload benchmark
# CRDB workload
cockroach workload init kv
cockroach workload run kv --duration=1m
# Compare with single Postgres pgbench
pgbench -c 50 -j 4 -T 60 postgresDay 4: Multi-region simulation
Use Docker network latency:
docker network create --driver=overlay --opt encrypted ...Add latency between containers. Measure cross-region transaction time.
Day 5: Schema change online
While running workload, run schema changes. Verify zero downtime.
Day 6: Migration plan
Plan: monolithic Postgres → CockroachDB. List:
- SQL compat checks
- Performance test queries
- Migration approach (logical replication?)
- Rollback plan
Day 7: Cost analysis
Compare TCO:
- 3-node CRDB cluster
- Single beefy Postgres + read replica
- Aurora DSQL serverless
For your specific workload (TPS, data size).
12. Anti-patterns
| Pattern | Why bad | Fix |
|---|---|---|
| Move to distributed SQL “for HA” | Cheaper: Postgres + Patroni | Identify real bottleneck |
| Sequential auto-increment PK | Hot range | UUID v4 or random prefix |
| Cross-region transaction in hot path | Latency 50-200ms | Keep transaction region-local |
| Many small transactions | 2PC overhead | Batch into larger |
| Forgot SCAN COUNT(*) is slow | Slow dashboards | Approximate or counter |
| Migrate without staging | Surprise SQL incompat | Mirror prod in staging |
| Don’t tune Raft config | Default may suboptimal | Tune per workload |
| Ignore range hot spots | Bottleneck on 1 node | Distribute write |
| Use as drop-in single Postgres | Different perf model | Re-architect |
13. Self-check
- Range vs shard — khác biệt CRDB vs Vitess?
- Raft per range — vì sao scale?
- TrueTime — Spanner unique advantage là gì?
- Aurora DSQL khác Spanner clock như thế nào?
- TiDB HTAP cơ chế (TiFlash)?
- Hot range — cách prevent với CRDB?
- CRDB vs YugabyteDB — pick khi nào?
- Per-query latency overhead — bao nhiêu vs single PG?
- Sequential PK trap — vì sao?
- Migration risk lớn nhất từ Postgres → CRDB?
14. Tiếp theo
Cập nhật: 2026-05-16