Bonus Week: Multi-Region Active-Active & Globally Distributed SQL

“In 2012 Google launched Spanner — distributed SQL with external consistency, using atomic clocks to synchronize time across data centers. In 2024 AWS launched Aurora DSQL — bringing that same concept to the mass market with a 99.999% SLA. Together with CockroachDB, YugabyteDB, and TiDB, a new category has formed: globally distributed SQL with strong consistency.”

Tags: system-design multi-region distributed-sql aurora-dsql spanner cockroachdb disaster-recovery bonus Student: Hieu (Backend Dev → Architect) Prerequisite: Tuan-07-Database-Sharding-Replication · Tuan-Bonus-Consensus-Raft-Paxos · Tuan-Bonus-Consistency-Models-Isolation Related: Case-Design-Payment-System · Case-Design-Stock-Exchange · Tuan-Bonus-Multi-Tenancy-SaaS-Patterns


1. Context & Why

Everyday analogy — a multinational bank

Hieu, imagine an international bank with branches in Hanoi, Tokyo, San Francisco, and London. Its customers:

  • VIPs travel for work → must be able to withdraw cash at any branch
  • Balances must be accurate globally — a withdrawal in Tokyo must be deducted from the Hanoi ledger immediately
  • If one branch burns down → the rest keep operating normally
  • Compliance: EU customers' data must stay in the EU (GDPR), US customers' data in the US

This is Multi-Region Active-Active — every region is active (accepts writes), data is globally consistent, and the system tolerates losing one region.

What it is not:

  • Active-Passive: one primary region, the others on standby. Failover = downtime + data loss.
  • Multi-region read replicas: writes go to a single region → cross-region latency.
  • Sharding by region: EU/US customers are isolated and cannot transact across regions.

Why does a backend dev need to understand this?

| Reason | Consequence |
|---|---|
| Outage costs | AWS US-EAST-1 outages (2017, 2021, 2024) → every single-region app went down |
| Compliance | GDPR data residency, China's Cybersecurity Law, India's DPDPA |
| Global latency | A user in VN calling a US API: 300ms RTT → unacceptable for real-time |
| Disaster recovery | Earthquake, fire, ransomware → off-region backups required |
| 2024-2026 distributed SQL maturity | Aurora DSQL (Dec 2024), Spanner GA, CockroachDB → no excuse not to use them |

Why doesn't Alex Xu cover this in depth?

Alex Xu Vol 1+2 (2020-2022) predate distributed SQL maturing for the mass market. CockroachDB Cloud went GA in 2020, Spanner's pricing reform landed in 2023, Aurora DSQL in Dec 2024. This is an evolution of the last 2-3 years.


2. Deep Dive — Core Concepts

2.1 Disaster Recovery Strategies — Spectrum

RPO (Recovery Point Objective)    RTO (Recovery Time Objective)
"How much data can we lose?"      "How long does recovery take?"

Backup/Restore       ────────  hours    ────────  hours-days     CHEAP
Pilot Light          ────────  minutes  ────────  hours          ↓
Warm Standby         ────────  seconds  ────────  minutes        ↓
Multi-Site Active-Active  ──── ~zero   ────────  ~zero          EXPENSIVE
| Strategy | RPO | RTO | Cost | Complexity |
|---|---|---|---|---|
| Backup/Restore | Hours | Hours-days | $ | Low |
| Pilot Light | Minutes | Hours | $$ | Medium |
| Warm Standby | Seconds | Minutes | $$$ | Medium |
| Active-Active | ~0 (sync rep) | ~0 (auto failover) | $$$$ | High |

When to choose which:

  • Internal tools, dev/staging: Backup/Restore
  • Customer-facing non-critical: Pilot Light or Warm Standby
  • Revenue-critical (e-commerce, banking, payment): Active-Active
  • Mission-critical (healthcare, aviation): Active-Active + chaos engineering
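The decision rule above can be sketched as a small lookup, assuming the worst-case RPO/RTO bounds from the table (the helper and its bounds are illustrative, not a standard API):

```python
# Hypothetical helper mapping RPO/RTO targets (in seconds) to the cheapest
# strategy from the table above; worst-case bounds per tier are illustrative.

DR_STRATEGIES = [
    ("Backup/Restore", 24 * 3600, 2 * 24 * 3600),  # (name, RPO, RTO) worst case
    ("Pilot Light", 15 * 60, 4 * 3600),
    ("Warm Standby", 30, 15 * 60),
    ("Active-Active", 0, 0),
]

def choose_dr_strategy(rpo_target_s: float, rto_target_s: float) -> str:
    """Return the first (cheapest) strategy whose worst case meets both targets."""
    for name, rpo_s, rto_s in DR_STRATEGIES:
        if rpo_s <= rpo_target_s and rto_s <= rto_target_s:
            return name
    return "Active-Active"

print(choose_dr_strategy(24 * 3600, 2 * 24 * 3600))  # Backup/Restore (internal tool)
print(choose_dr_strategy(0, 60))  # Active-Active (payment: zero loss, <1min recovery)
```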

2.2 The Hard Problem — Why Multi-Region Active-Active is Hard

Light is slow: the speed of light in vacuum is 300,000 km/s. US East to Asia ≈ 12,000 km → a 40ms one-way physical floor, and light in fiber travels roughly 1.5x slower than in vacuum.

Round trip latencies (typical):
  Same DC:           0.5 ms
  Same region:       2-5 ms
  Cross-region (US): 50-80 ms
  Cross-continent:   100-180 ms

The problem (for strong consistency):

  • Synchronous replication cross-region: 100-200ms write latency → bad UX
  • Async replication: risk of data loss if a region fails before replicating
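A quick sanity check on these latency floors, assuming light travels at ~300,000 km/s in vacuum and roughly two-thirds of that in fiber (distances are approximate great-circle figures; real routing adds more on top):

```python
# Physics check on the latency numbers above.

C_VACUUM_KM_S = 300_000   # speed of light in vacuum
C_FIBER_KM_S = 200_000    # ~2/3 c in optical fiber (refractive index ~1.5)

def one_way_ms(distance_km: float, speed_km_s: float = C_FIBER_KM_S) -> float:
    """One-way propagation delay in milliseconds."""
    return distance_km / speed_km_s * 1000

def rtt_ms(distance_km: float) -> float:
    """Round-trip propagation delay in milliseconds."""
    return 2 * one_way_ms(distance_km)

print(round(one_way_ms(12_000, C_VACUUM_KM_S)))  # 40  -- vacuum floor, US East -> Asia
print(round(rtt_ms(12_000)))                     # 120 -- fiber RTT, before routing overhead
```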

3 fundamental approaches:

  1. Async with conflict resolution (CRDT, LWW): available, weak consistency
  2. Sync with consensus (Raft, Paxos): consistent but slow
  3. TrueTime / atomic clocks: external consistency, fast (Spanner, DSQL)

2.3 Spanner — TrueTime External Consistency

Spanner (Google, 2012) was the first production globally distributed SQL database with strong consistency.

Key innovation: TrueTime

  • Atomic clocks + GPS receivers in every data center
  • Returns TT.now() = [earliest, latest] interval (~7ms uncertainty)
  • Guarantees: TT.after(t) returns true only after t has passed
Spanner commit protocol:
1. Acquire write timestamp T_commit ≥ TT.now().latest
2. Wait until TT.after(T_commit) = true (avg 7ms)
3. Apply commit
4. Reply to client

Result: External consistency
  If T1 commits before T2 starts → T1.commit_ts < T2.commit_ts
  Even across regions

External consistency = strongest possible: linearizability + serializability + real-time order across regions.

Cost: 7ms commit wait + 1-2 RTT cross-region for Paxos = ~30ms write latency multi-region.
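A toy simulation of the commit-wait step, assuming a fixed clock uncertainty epsilon (TrueTime's real epsilon varies; `tt_now`/`tt_after` here just mirror the pseudo-API above):

```python
# Toy model of Spanner's commit wait with a fixed uncertainty epsilon.

import time

EPSILON_S = 0.007  # ~7ms clock uncertainty, as in the text

def tt_now():
    """Return the (earliest, latest) interval around the local clock."""
    t = time.monotonic()
    return (t - EPSILON_S, t + EPSILON_S)

def tt_after(t: float) -> bool:
    """True only once t is guaranteed to be in the past."""
    earliest, _ = tt_now()
    return earliest > t

def commit() -> float:
    _, t_commit = tt_now()            # T_commit >= TT.now().latest
    while not tt_after(t_commit):     # commit wait
        time.sleep(0.001)
    return t_commit                   # now safe to apply and reply

start = time.monotonic()
commit()
waited_ms = (time.monotonic() - start) * 1000
# ~2 * epsilon = 14ms in this toy model (the Spanner paper reports an
# average wait closer to epsilon, since replication overlaps the wait)
print(f"commit wait ~{waited_ms:.0f} ms")
```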

2.4 Aurora DSQL — Spanner for AWS (2024)

Launched at AWS re:Invent 2024, GA Apr 2025.

Key features:

  • Active-Active multi-region by default
  • Strong consistency across regions (like Spanner)
  • Amazon Time Sync Service (atomic clocks, free on EC2 since 2023!) — no manual setup
  • PostgreSQL-compatible — drop-in replacement for many apps
  • 99.999% SLA (5 minutes downtime/year)
  • Serverless: scales to zero, no cluster management
  • OCC (Optimistic Concurrency Control) — no pessimistic locks
  • Disaggregated storage: separate storage layer that scales independently
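Because OCC aborts conflicting transactions at commit time instead of blocking on locks, clients must retry. A driver-agnostic sketch — with a real PostgreSQL driver you would catch a serialization failure (SQLSTATE 40001) rather than the stand-in exception used here:

```python
# Generic OCC retry loop with exponential backoff and jitter.
# OCCConflict stands in for a driver's serialization-failure error.

import random
import time

class OCCConflict(Exception):
    """Stand-in for a serialization failure raised at COMMIT."""

def with_occ_retry(fn, max_attempts: int = 5, base_backoff_s: float = 0.01):
    """Run fn, retrying on OCC conflicts; jitter de-correlates competing retries."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except OCCConflict:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_backoff_s * (2 ** attempt) * random.random())

attempts = 0
def flaky_transfer():
    global attempts
    attempts += 1
    if attempts < 3:
        raise OCCConflict()  # first two commits lose the optimistic race
    return "committed"

print(with_occ_retry(flaky_transfer))  # committed (3rd attempt)
```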

Architecture (high level):

┌────────────────────────────────────────────┐
│         Aurora DSQL (Region A)              │
│  Compute: query routers (stateless)         │
│  Storage: distributed log + KV store        │
│  Consensus: across regions                  │
└──────────────────────┬─────────────────────┘
                       │ sync replication
                       │ (atomic clock-coordinated)
┌──────────────────────┴─────────────────────┐
│         Aurora DSQL (Region B)              │
│  Same arch, write/read locally              │
└────────────────────────────────────────────┘

Cost (2026 pricing):

  • $4.00 / Distributed Processing Unit (DPU) hour
  • $0.33 / GB-month storage
  • Cheaper than Spanner for most workloads

Limitations:

  • Currently no foreign keys, no triggers (some PostgreSQL features)
  • Max DB size 100 TB
  • Limited to AWS regions

2.5 CockroachDB / YugabyteDB / TiDB

2.5.1 CockroachDB

  • Origin: ex-Google engineers (2014) — modeled after Spanner
  • HLC instead of atomic clocks: Hybrid Logical Clock with bounded skew
  • Multi-active: every node accepts reads/writes
  • Survival goals: configurable to survive zone/region/multi-region failures
  • Production: DoorDash, Comcast, Netflix, eBay
  • Open source (BSL license)
-- CockroachDB multi-region table
CREATE DATABASE myapp PRIMARY REGION "us-east-1" REGIONS "eu-west-1", "ap-southeast-1";
USE myapp;
 
-- Set survival goal (requires at least 3 database regions)
ALTER DATABASE myapp SURVIVE REGION FAILURE;
 
-- Region-aware table (region is part of the composite primary key)
CREATE TABLE users (
    id UUID DEFAULT gen_random_uuid(),
    region crdb_internal_region NOT NULL DEFAULT default_to_database_primary_region(gateway_region()),
    name TEXT,
    PRIMARY KEY (region, id)
)
LOCALITY REGIONAL BY ROW AS region;
-- Each row pinned to the user's home region for low-latency local reads

2.5.2 YugabyteDB

  • Origin: ex-Facebook engineers (2017)
  • YSQL (PostgreSQL-compatible) + YCQL (Cassandra-compatible)
  • Raft per shard (tablet)
  • Production: General Motors, Wells Fargo, Justuno

2.5.3 TiDB

  • Origin: PingCAP (2015), China-focused initially
  • MySQL-compatible
  • HTAP (OLTP + OLAP via TiFlash columnar)
  • Production: ByteDance, Pinterest, Square

2.6 Comparison Matrix

| Feature | Aurora DSQL | Spanner | CockroachDB | YugabyteDB | TiDB |
|---|---|---|---|---|---|
| Vendor | AWS | Google Cloud | OSS + Cloud | OSS + Cloud | OSS + Cloud |
| Time source | Amazon Time Sync (atomic) | TrueTime (atomic+GPS) | HLC | HLC | HLC |
| External consistency | ✅ | ✅ | ⚠️ (linearizable, not strict serializable) | ⚠️ | ⚠️ |
| Multi-region active | ✅ Built-in | ✅ Built-in | ✅ Configurable | ✅ Configurable | ✅ Configurable |
| SQL dialect | PostgreSQL | Custom (Spanner SQL) | PostgreSQL | PostgreSQL | MySQL |
| HTAP | No | Yes (limited) | No | No | Yes |
| Self-hosted | No (AWS managed) | No (GCP managed) | Yes | Yes | Yes |
| Best for | AWS shop | Google shop, global apps | OSS preference | Polyglot (SQL + NoSQL) | MySQL migration |

2.7 Routing Strategies

The problem: a user in Vietnam — which region should their request go to?

2.7.1 DNS-based Routing (Route 53)

api.myapp.com → DNS query →
  If user in Asia → returns IP of ap-southeast-1
  If user in Europe → returns IP of eu-west-1
  If user in Americas → returns IP of us-east-1

Latency-based routing: Route 53 returns the endpoint with the lowest latency. Geolocation routing: route by country.

Pros: Simple, free
Cons: DNS TTL caching → slow failover (5-10 min)

2.7.2 AWS Global Accelerator (Anycast)

api.myapp.com → Single static IP
              → Anycast (BGP) routes to nearest edge
              → Edge connects to healthy regional endpoint

Pros:

  • Sub-30s failover (no DNS cache)
  • Better performance than DNS
  • Single IP simpler for clients

Cons:

  • $0.018/hour per accelerator
  • Vendor-specific (AWS only)

2.7.3 CDN-level routing (Cloudflare, Fastly)

Request → Cloudflare edge (250+ locations)
       → Worker decides routing logic
       → Forwards to optimal regional backend

Pros:

  • Edge logic (Workers, Compute@Edge)
  • Best UX (closest to user)
  • Built-in DDoS, WAF

Cons:

  • Vendor lock-in
  • Worker cold start consideration

2.8 Conflict Resolution

Active-Active requires a strategy for resolving concurrent writes to the same data.

2.8.1 Strong Consistency (Spanner-style)

  • Sync consensus via atomic clocks
  • No conflicts possible (all writes serialized globally)
  • Cost: 30-100ms write latency

2.8.2 LWW (Last-Write-Wins)

  • Each write tagged with HLC timestamp
  • Higher timestamp wins
  • Risk: Lose updates, clock skew bugs
  • Use for: non-critical metadata (user preferences, timestamps)

2.8.3 CRDT-based (Riak, Redis CRDB)

  • Conflict-free replicated data types merge deterministically — replicas converge without coordination
  • Use for: counters, sets, presence data
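A minimal state-based CRDT, the classic G-Counter (grow-only counter): each region increments its own slot, and merge takes the element-wise max, so replicas converge regardless of delivery order.

```python
# State-based G-Counter: per-region counts, merged by element-wise max.

class GCounter:
    def __init__(self, region: str):
        self.region = region
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1):
        """Only this replica's own slot is ever incremented."""
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def merge(self, other: "GCounter"):
        """Commutative, associative, idempotent merge."""
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    @property
    def value(self) -> int:
        return sum(self.counts.values())

us, eu = GCounter("us-east-1"), GCounter("eu-west-1")
us.increment(3)
eu.increment(2)
us.merge(eu)   # replicate EU -> US
eu.merge(us)   # replicate US -> EU
print(us.value, eu.value)  # 5 5 -- converged without coordination
```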

2.8.4 Application-level (Custom)

  • App detects conflict, resolves with business logic
  • Example: “merge two carts” = union of items
  • Most flexible, most work
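The "merge two carts" example as code — a commutative, idempotent union, so both regions compute the same result whichever write arrives first (taking the max quantity per item is one possible policy, not the only one):

```python
# Application-level conflict resolution: union-of-items cart merge.

def merge_carts(cart_a: dict[str, int], cart_b: dict[str, int]) -> dict[str, int]:
    """Union of items; where both sides have an item, keep the larger quantity."""
    merged = dict(cart_a)
    for item, qty in cart_b.items():
        merged[item] = max(merged.get(item, 0), qty)
    return merged

us_cart = {"book": 1, "pen": 2}
eu_cart = {"pen": 3, "mug": 1}
print(merge_carts(us_cart, eu_cart))  # {'book': 1, 'pen': 3, 'mug': 1}
# Order-independent: both regions converge on the same cart
assert merge_carts(us_cart, eu_cart) == merge_carts(eu_cart, us_cart)
```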

2.8.5 Comparison

| Strategy | Consistency | Latency | Use case |
|---|---|---|---|
| Sync consensus (Spanner) | Strong | High (~50ms) | Banking, critical |
| LWW | Eventual | Low | Metadata |
| CRDT | Eventual (convergent) | Low | Counters, sets |
| Custom | Varies | Medium | Domain-specific |

2.9 Split-Brain Prevention

Risk: Network partition → 2 regions both think they’re primary → diverge writes → data corruption.

Mitigations:

2.9.1 Quorum-based (most common)

3 regions: A, B, C
Quorum = majority = 2

If A isolated from B+C:
  A: minority (1) → cannot accept writes
  B+C: majority (2) → accepts writes

When network heals:
  A reconciles from B+C (replays missed transactions)
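The quorum rule above reduces to a one-line predicate — a partition side may accept writes only if it can see a strict majority of the configured regions:

```python
# Quorum predicate: strict majority of all configured regions.

def can_accept_writes(visible_regions: set[str], all_regions: set[str]) -> bool:
    """True if this partition side sees a strict majority and may keep writing."""
    return len(visible_regions) > len(all_regions) // 2

regions = {"A", "B", "C"}
print(can_accept_writes({"A"}, regions))       # False -- minority side goes read-only
print(can_accept_writes({"B", "C"}, regions))  # True  -- majority keeps serving writes
```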

2.9.2 STONITH (Shoot The Other Node In The Head)

  • Hardware-level fencing: powered off via management API
  • Used in HA clusters (Pacemaker)
  • Less common in cloud (use cloud APIs instead)

2.9.3 Lease-based

  • Primary holds time-bounded lease
  • Must renew or lose primacy
  • If can’t communicate → step down automatically
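A minimal lease sketch (single lease, local monotonic clock; a real implementation would store and renew the lease through a consensus service so only one holder can exist):

```python
# Time-bounded lease: the holder must renew before expiry, or it
# automatically stops considering itself primary.

import time

class Lease:
    def __init__(self, duration_s: float):
        self.duration_s = duration_s
        self.expires_at = 0.0

    def acquire_or_renew(self):
        """Successful renewal pushes expiry forward by one lease duration."""
        self.expires_at = time.monotonic() + self.duration_s

    def is_held(self) -> bool:
        return time.monotonic() < self.expires_at

lease = Lease(duration_s=0.05)
lease.acquire_or_renew()
print(lease.is_held())   # True  -- acting as primary
time.sleep(0.06)         # partition: renewal impossible
print(lease.is_held())   # False -- stepped down automatically
```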

2.10 Cost Considerations

Multi-region active-active is expensive. For a 100 GB DB, 1M tx/day, 3 regions:

| Cost component | Single-region | Active-Active 3-region |
|---|---|---|
| Compute (DB instances) | $500/month | $1,500/month (3x) |
| Storage | $50/month | $150/month (3 copies) |
| Cross-region data transfer | $0 | $200-500/month (replication) |
| Routing (Global Accelerator) | $0 | $50-200/month |
| Monitoring/observability | $50/month | $150/month |
| Total | $600/month | $2,050-2,500/month |

→ ~3-4x the cost. The ROI question: is an extra ~$1,500-1,900/month worth the higher availability? It depends on the revenue impact of downtime.

Rule of thumb: multi-region is cost-effective when 1 hour of downtime costs more than 1 month of the additional spend. For most SaaS businesses at $100K+/month revenue → worth it.
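The rule of thumb as arithmetic, using the illustrative figures from the cost table above (the downtime costs in the examples are hypothetical):

```python
# Break-even check: multi-region pays off when one hour of downtime
# costs more than one month of the extra spend.

def multi_region_worth_it(downtime_cost_per_hour: float,
                          extra_monthly_cost: float) -> bool:
    return downtime_cost_per_hour > extra_monthly_cost

extra_monthly = 2_500 - 600  # ~$1,900/month extra, from the table above

print(multi_region_worth_it(5_000, extra_monthly))  # True  -- e.g. a busy marketplace
print(multi_region_worth_it(500, extra_monthly))    # False -- e.g. an internal tool
```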


3. Estimation — Multi-Region Capacity

3.1 Replication bandwidth

Scenario: 1000 transactions/sec, average 5KB write per transaction, 3 regions full mesh.

Outbound from each region = 1000 × 5KB = 5 MB/s per replica peer
With 3 regions full mesh: 5 MB/s × 2 peers = 10 MB/s outbound per region

Cross-region bandwidth at AWS pricing ($0.02/GB):

10 MB/s × 86,400 s × 30 days = 25 TB/month per region
3 regions × 25 TB = 75 TB/month total
75 TB × $0.02/GB = $1,500/month for replication alone
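The same estimate as a reusable calculation (decimal units, 1 GB = 10^9 bytes, matching cloud billing; the text rounds the 25.9 TB per-region figure down to 25):

```python
# Full-mesh replication bandwidth and cross-region transfer cost per month.

def replication_cost(tx_per_s: int, write_bytes: int, n_regions: int,
                     usd_per_gb: float = 0.02, days: int = 30):
    """Return (GB/month per region, GB/month total, USD/month)."""
    peers = n_regions - 1
    out_bytes_per_s = tx_per_s * write_bytes * peers      # outbound per region
    monthly_gb = out_bytes_per_s * 86_400 * days / 1e9    # per region
    total_gb = monthly_gb * n_regions
    return monthly_gb, total_gb, total_gb * usd_per_gb

per_region_gb, total_gb, usd = replication_cost(1_000, 5_000, 3)
print(f"{per_region_gb/1000:.1f} TB/region, "
      f"{total_gb/1000:.1f} TB total, ${usd:,.0f}/month")
# 25.9 TB/region, 77.8 TB total, $1,555/month
```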

3.2 Read latency budget

P95 latency target: 100ms for user actions.

Latency breakdown (US user → US region):
  Network DNS/TLS: 20ms
  App server processing: 30ms
  DB query: 30ms
  Network return: 20ms
  Total: 100ms ✓

Cross-region penalty (US user → EU region):

DNS+TLS: 20ms (same)
App: 30ms
Cross-region DB read: 100ms (RTT)
Return: 20ms
Total: 170ms ✗ (over budget)

The read-local pattern is mandatory.

3.3 RTO/RPO targets

| Tier | RTO | RPO | Strategy |
|---|---|---|---|
| Tier 1 (payment) | < 1 min | 0 (zero loss) | Sync replication multi-region |
| Tier 2 (e-commerce) | < 5 min | < 1 min | Async replication + auto failover |
| Tier 3 (analytics) | < 1 hour | < 1 hour | Daily snapshots cross-region |
| Tier 4 (logs) | < 1 day | < 1 day | Backup to cold storage |

3.4 Failover testing budget

Game day pattern: Simulate region failure quarterly.

Cost per game day:
  Engineer time: 5 engineers × 4 hours × $100/h = $2,000
  Potential customer impact (production): $0 if done right
  Tools (chaos engineering): included
  Total: $2,000/quarter = $8,000/year

ROI: 1 prevented production outage = saves $50K-500K depending on scale.


4. Security First

4.1 Data residency & sovereignty

Compliance requirements:

  • GDPR (EU): EU citizen data must reside in EU
  • China Cybersecurity Law: Chinese data must reside in China
  • India DPDPA: Critical personal data must reside in India
  • HIPAA (US): PHI must follow specific guidelines

Implementation patterns:

-- CockroachDB multi-region with row-level locality
CREATE TABLE users (
    id UUID DEFAULT gen_random_uuid(),
    home_region crdb_internal_region NOT NULL,
    pii_data JSONB,
    PRIMARY KEY (home_region, id)
)
LOCALITY REGIONAL BY ROW AS home_region;
-- EU users → eu-west-1, US users → us-east-1
-- One logical SQL table, but data physically separated by region

4.2 Cross-region encryption

Mandatory:

  • TLS 1.3 for inter-region replication
  • KMS keys per region (no single global key)
  • Customer-managed keys (CMK) for compliance
# Terraform: per-region KMS
resource "aws_kms_key" "us_east" {
  provider = aws.us-east-1
  description = "Aurora DSQL encryption key US East"
}
 
resource "aws_kms_key" "eu_west" {
  provider = aws.eu-west-1
  description = "Aurora DSQL encryption key EU West"
}

4.3 IAM cross-account / cross-region

Principle of least privilege: Each region has separate IAM roles.

Region A app → IAM role A → Aurora DSQL A only
Region B app → IAM role B → Aurora DSQL B only

No app has cross-region admin access.
Replication uses dedicated service role with minimal scope.

4.4 Audit logging

Every cross-region transaction must be logged:

  • Source region, destination region
  • Transaction ID, timestamp
  • User/role identity
  • Data classification

Forward to centralized SIEM (Splunk, Datadog, Wazuh) for compliance audits.
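A sketch of one such audit record as a JSON line ready for SIEM ingestion — field names here are illustrative, not a Splunk/Datadog schema:

```python
# One JSON line per cross-region transaction, covering the fields listed above.

import json
import uuid
from datetime import datetime, timezone

def audit_record(src_region: str, dst_region: str,
                 identity: str, classification: str) -> str:
    """Serialize a cross-region transaction event for the central SIEM."""
    return json.dumps({
        "event": "cross_region_transaction",
        "transaction_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source_region": src_region,
        "destination_region": dst_region,
        "identity": identity,
        "data_classification": classification,
    })

line = audit_record("us-east-1", "eu-west-1", "role/payment-svc", "pii")
print(line)
```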

4.5 Disaster recovery testing security

Game day must include:

  • Verify failover doesn’t expose unauthorized data
  • Confirm encryption keys valid in DR region
  • Test access controls survive failover
  • Validate audit log integrity

5. DevOps — Operating Multi-Region

5.1 Aurora DSQL setup (Terraform)

provider "aws" {
  alias  = "primary"
  region = "us-east-1"
}
 
provider "aws" {
  alias  = "secondary"
  region = "us-west-2"
}
 
# Both linked clusters share one witness region, distinct from their own
# regions (us-east-2 here). Attribute names follow the AWS provider's DSQL
# resources -- verify against your provider version.

# Primary cluster
resource "aws_dsql_cluster" "primary" {
  provider = aws.primary
 
  multi_region_properties {
    witness_region = "us-east-2"
  }
 
  tags = {
    Name = "primary-us-east-1"
  }
}
 
# Secondary cluster
resource "aws_dsql_cluster" "secondary" {
  provider = aws.secondary
 
  multi_region_properties {
    witness_region = "us-east-2"
  }
 
  tags = {
    Name = "secondary-us-west-2"
  }
}
 
# Link clusters for active-active
resource "aws_dsql_cluster_peering" "main" {
  provider          = aws.primary
  cluster_id        = aws_dsql_cluster.primary.id
  peer_cluster_arns = [aws_dsql_cluster.secondary.arn]
}

5.2 Application connection pattern

"""
Multi-region aware DB client with automatic failover.
"""
 
import os
import psycopg
from contextlib import contextmanager
 
 
class MultiRegionDB:
    def __init__(self):
        # Primary endpoint based on user region
        self.endpoints = {
            "us-east-1": "primary.dsql-cluster.amazonaws.com",
            "us-west-2": "secondary.dsql-cluster.amazonaws.com",
            "eu-west-1": "tertiary.dsql-cluster.amazonaws.com",
        }
        self.current_region = os.getenv("AWS_REGION", "us-east-1")
 
    @contextmanager
    def connection(self):
        """Try the local region first, fall back to the others."""
        order = [self.current_region] + [
            r for r in self.endpoints if r != self.current_region
        ]
 
        for region in order:
            try:
                conn = psycopg.connect(
                    host=self.endpoints[region],
                    user="app",
                    password=self._get_iam_token(region),
                    dbname="postgres",
                    sslmode="require",
                    connect_timeout=2,
                )
            except (psycopg.OperationalError, TimeoutError) as e:
                print(f"Failed to connect to {region}: {e}")
                continue
            try:
                # Only connection errors trigger fallback; errors raised by
                # the caller's queries propagate normally.
                yield conn
            finally:
                conn.close()
            return
 
        raise RuntimeError("All regions unreachable")
 
    def _get_iam_token(self, region):
        # Aurora DSQL uses short-lived IAM auth tokens instead of passwords.
        # The exact boto3 helper signature may vary by SDK version -- check
        # the dsql client docs.
        import boto3
        client = boto3.client("dsql", region_name=region)
        return client.generate_db_connect_auth_token(
            self.endpoints[region], region
        )
 
 
db = MultiRegionDB()
order_id = "ord-123"  # example value
 
with db.connection() as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT * FROM orders WHERE id = %s", (order_id,))

5.3 Health check & failover

# Route 53 health check + failover
resource "aws_route53_health_check" "primary" {
  fqdn              = "api-us-east.myapp.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = "3"
  request_interval  = "30"
 
  tags = {
    Name = "primary-health"
  }
}
 
resource "aws_route53_record" "api_primary" {
  zone_id = var.zone_id
  name    = "api.myapp.com"
  type    = "A"
  set_identifier = "primary"
 
  failover_routing_policy {
    type = "PRIMARY"
  }
 
  health_check_id = aws_route53_health_check.primary.id
  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }
}
 
resource "aws_route53_record" "api_secondary" {
  zone_id = var.zone_id
  name    = "api.myapp.com"
  type    = "A"
  set_identifier = "secondary"
 
  failover_routing_policy {
    type = "SECONDARY"
  }
 
  alias {
    name                   = aws_lb.secondary.dns_name
    zone_id                = aws_lb.secondary.zone_id
    evaluate_target_health = true
  }
}

5.4 Monitoring metrics

groups:
  - name: multi_region_alerts
    rules:
      - alert: CrossRegionReplicationLag
        expr: dsql_replication_lag_seconds > 5
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Replication lag {{ $value }}s between regions"
 
      - alert: RegionUnhealthy
        expr: up{job="api", region=~".+"} == 0
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "Region {{ $labels.region }} unreachable"
 
      - alert: SplitBrainSuspected
        expr: |
          count(dsql_is_primary == 1) by (cluster) > 1
        for: 1m
        labels: { severity: critical }
        annotations:
          summary: "Multiple primaries detected — split brain!"
 
      - alert: HighFailoverFrequency
        expr: changes(dsql_primary_region[1h]) > 3
        labels: { severity: warning }
        annotations:
          summary: "Failover happened {{ $value }} times in 1h"

5.5 Game day procedure

#!/bin/bash
# game-day-region-failure.sh
# Simulate us-east-1 failure quarterly
 
echo "Game Day: Simulating US-EAST-1 failure"
echo "Expected: us-west-2 takes over, RTO < 5min"
 
# 1. Block traffic to us-east-1 (security groups have no deny rules,
#    so revoke the existing allow rules instead of adding new ones)
aws elbv2 modify-target-group-attributes \
  --target-group-arn $US_EAST_TG \
  --attributes Key=deregistration_delay.timeout_seconds,Value=0
 
aws ec2 revoke-security-group-ingress \
  --group-id $US_EAST_SG \
  --protocol -1 --source-group $APP_SG
 
# 2. Watch failover
echo "Waiting for failover..."
START=$(date +%s)
while true; do
  if curl -sf https://api.myapp.com/health | grep -q '"region":"us-west-2"'; then
    END=$(date +%s)
    echo "Failover complete: $(($END - $START))s"
    break
  fi
  sleep 5
done
 
# 3. Verify data consistency
psql -h secondary.dsql-cluster.amazonaws.com -c "
  SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL '5 min';
"
 
# 4. Restore us-east-1 (re-add the revoked ingress rules)
aws ec2 authorize-security-group-ingress ...
 
# 5. Run reconciliation report
echo "Game Day complete. RTO: $(($END - $START))s. Generate report."

6. Code Implementation

6.1 CockroachDB region-aware app

"""
CockroachDB region-aware Python application.
Uses gateway region for low-latency local reads.
"""
 
import os
import psycopg
from psycopg.rows import dict_row
 
 
class RegionAwareDB:
    def __init__(self):
        self.region = os.getenv("CRDB_REGION", "us-east-1")
        self.dsn = os.getenv("CRDB_DSN")
 
    def connect(self):
        return psycopg.connect(
            self.dsn,
            row_factory=dict_row,
            # Cluster routing option for CockroachDB serverless;
            # the cluster name here is illustrative
            options="--cluster=mycluster --search_path=public",
        )
 
    def get_user(self, user_id: str) -> dict:
        """Read user from local region (low latency)."""
        with self.connect() as conn:
            with conn.cursor() as cur:
                cur.execute("""
                    SELECT * FROM users
                    WHERE id = %s
                    AND home_region = %s
                """, (user_id, self.region))
                return cur.fetchone()
 
    def transfer_money(self, from_user: str, to_user: str, amount: int):
        """Cross-region transfer (requires consensus)."""
        with self.connect() as conn:
            with conn.transaction():
                with conn.cursor() as cur:
                    # CockroachDB transactions are serializable by default;
                    # a write spanning regions pays the consensus RTT
                    cur.execute("""
                        UPDATE accounts
                        SET balance = balance - %s
                        WHERE user_id = %s AND balance >= %s
                    """, (amount, from_user, amount))
 
                    if cur.rowcount == 0:
                        raise ValueError("Insufficient funds")
 
                    cur.execute("""
                        UPDATE accounts
                        SET balance = balance + %s
                        WHERE user_id = %s
                    """, (amount, to_user))
 
 
db = RegionAwareDB()
user = db.get_user("user-123")  # Low latency, local read
db.transfer_money("user-123", "user-456", 100)  # Strong consistency, may cross region

6.2 Failover-aware HTTP middleware

import os
import time
 
from fastapi import FastAPI, Request
from starlette.middleware.base import BaseHTTPMiddleware
 
app = FastAPI()
 
 
class FailoverAwareMiddleware(BaseHTTPMiddleware):
    """Add region info to responses, monitor cross-region calls."""
 
    async def dispatch(self, request: Request, call_next):
        start = time.time()
        region = os.getenv("AWS_REGION", "unknown")
 
        response = await call_next(request)
 
        elapsed = time.time() - start
        response.headers["X-Region"] = region
        response.headers["X-Response-Time"] = f"{elapsed:.3f}s"
 
        # Alert if response time > 200ms (suggests cross-region call)
        if elapsed > 0.2:
            await self._log_slow_request(request, region, elapsed)
 
        return response
 
    async def _log_slow_request(self, request, region, elapsed):
        # Track slow requests for analysis
        print(f"[SLOW] {region} {request.url.path} took {elapsed:.3f}s")
 
 
app.add_middleware(FailoverAwareMiddleware)

6.3 Custom conflict resolution (LWW)

"""
Application-level Last-Write-Wins for cross-region conflicts.
"""
 
import json
import uuid
from datetime import datetime, timezone
 
 
class LWWConflictResolver:
    def __init__(self, db):
        self.db = db
 
    def update_user_profile(self, user_id: str, data: dict):
        """Update with LWW timestamp for cross-region safety."""
        timestamp = datetime.now(timezone.utc).isoformat()
        update_id = str(uuid.uuid4())
 
        with self.db.connect() as conn:
            with conn.cursor() as cur:
                # Check current timestamp; only update if newer
                cur.execute("""
                    UPDATE users
                    SET profile_data = %s,
                        last_modified = %s,
                        last_modified_by = %s
                    WHERE id = %s
                      AND (last_modified IS NULL OR last_modified < %s)
                    RETURNING id, last_modified
                """, (
                    json.dumps(data),
                    timestamp,
                    update_id,
                    user_id,
                    timestamp,
                ))
 
                result = cur.fetchone()
                if result is None:
                    print(f"Update rejected: newer write exists for {user_id}")
                    return False
 
                return True

7. System Design Diagrams

7.1 Active-Active Architecture

flowchart TB
    subgraph Global["Global Layer"]
        DNS[Route 53<br/>Latency-based routing]
        CDN[CloudFront / Cloudflare]
    end

    subgraph US["US-EAST-1"]
        USL[Load Balancer]
        USA[App Tier]
        USDB[(Aurora DSQL<br/>US Primary)]
    end

    subgraph EU["EU-WEST-1"]
        EUL[Load Balancer]
        EUA[App Tier]
        EUDB[(Aurora DSQL<br/>EU Primary)]
    end

    subgraph ASIA["AP-SOUTHEAST-1"]
        ASL[Load Balancer]
        ASA[App Tier]
        ASDB[(Aurora DSQL<br/>Asia Primary)]
    end

    UserUS[US Users] --> CDN --> DNS
    UserEU[EU Users] --> CDN
    UserASIA[Asia Users] --> CDN

    DNS -->|nearest| USL
    DNS -->|nearest| EUL
    DNS -->|nearest| ASL

    USL --> USA --> USDB
    EUL --> EUA --> EUDB
    ASL --> ASA --> ASDB

    USDB <-.sync replication.-> EUDB
    EUDB <-.sync replication.-> ASDB
    USDB <-.sync replication.-> ASDB

    style USDB fill:#4caf50,color:#fff
    style EUDB fill:#4caf50,color:#fff
    style ASDB fill:#4caf50,color:#fff

7.2 Failover Sequence

sequenceDiagram
    participant U as User
    participant DNS as Route 53
    participant US as US Region
    participant EU as EU Region
    participant HC as Health Checks

    Note over US,EU: Normal operation

    U->>DNS: Resolve api.myapp.com
    DNS-->>U: us-east-1 IP (lowest latency)
    U->>US: Request
    US-->>U: Response

    Note over US: ⚡ Region failure ⚡

    HC->>US: Health probe
    Note over HC: 3 consecutive failures<br/>(90 seconds)

    HC->>DNS: Mark us-east-1 unhealthy
    DNS->>DNS: Remove from rotation

    Note over U: Next request

    U->>DNS: Resolve api.myapp.com
    DNS-->>U: eu-west-1 IP (next best)
    U->>EU: Request
    EU-->>U: Response

    Note over US,EU: Total RTO: 90-120 seconds

7.3 Spanner-style Commit Wait

sequenceDiagram
    participant Client
    participant Coord as Coordinator (Region A)
    participant TT as TrueTime API
    participant RegB as Replica (Region B)
    participant RegC as Replica (Region C)

    Client->>Coord: BEGIN; UPDATE x = 5; COMMIT;

    Coord->>TT: now()
    TT-->>Coord: [t_earliest, t_latest]

    Coord->>Coord: T_commit = t_latest

    par Replicate to majority
        Coord->>RegB: Prepare T_commit
        Coord->>RegC: Prepare T_commit
    end
    RegB-->>Coord: ack
    RegC-->>Coord: ack

    Note over Coord: Commit Wait<br/>until TT.after(T_commit)

    Coord->>TT: after(T_commit)?
    TT-->>Coord: true

    Coord->>Coord: Apply commit
    Coord-->>Client: 200 OK

    Note over Client,RegC: Total: ~30-50ms (RTT + ~7ms wait)

7.4 Split-Brain Prevention via Quorum

flowchart TB
    subgraph Before["Before Partition: 3 regions, full mesh"]
        A1[Region A] <--> B1[Region B]
        B1 <--> C1[Region C]
        A1 <--> C1
    end

    subgraph Partition["⚡ Partition: A isolated"]
        A2[Region A<br/>Minority - 1 node]
        B2[Region B<br/>Majority - 2 nodes]
        C2[Region C<br/>Majority - 2 nodes]

        B2 <--> C2

        A2 -.X.- B2
        A2 -.X.- C2

        AStatus[A: cannot accept writes<br/>read-only mode]
        BCStatus[B+C: continue as primary<br/>accept writes]
    end

    subgraph After["After Heal: A reconciles"]
        A3[Region A<br/>Replays missed transactions<br/>from B/C]
        B3[Region B]
        C3[Region C]
        A3 <--> B3
        B3 <--> C3
        A3 <--> C3
    end

    Before --> Partition --> After

    style A2 fill:#ffcdd2
    style B2 fill:#c8e6c9
    style C2 fill:#c8e6c9

8. Aha Moments & Pitfalls

Aha Moments

#1: Aurora DSQL = Spanner for the mass market. Before 2024, only Spanner offered atomic-clock external consistency. Aurora DSQL democratizes the technology — drop-in PostgreSQL with a 99.999% SLA across regions.

#2: Atomic clocks are free on AWS. Amazon Time Sync Service (2023) provides microsecond-accurate time for free on EC2. This is the enabling technology behind DSQL.

#3: Active-Active is not binary. There is a spectrum: full active-active (every region writes), regional active (each region owns a subset), read-anywhere-write-primary. Pick the right level for the use case.

#4: Light speed is a physical limit. Cross-region sync replication cannot go below ~30ms. The architecture must either accept the latency cost or relax consistency.

#5: Read-local is mandatory for UX. A user in APAC cannot wait 200ms for every read. Pattern: local read replicas + sync writes to primary, or CockroachDB locality-aware tables.

#6: Split-brain is rare but catastrophic. One data-corruption incident = trust lost forever. Quorum membership and lease-based fencing are the two main defenses.

#7: DNS failover is slow (5-10 min). For RTO < 1 min, use Anycast (Global Accelerator) or CDN-level routing.

#8: Cost is 3-4x single-region. Justify it with business impact, not "best practice". SaaS at $100K+/month → worth it. Internal tools → maybe not.

Pitfalls

Pitfall 1: Thinking active-passive is enough

Wrong: "a warm standby is enough" → failover in a real outage takes an hour and loses data. Right: for revenue-critical systems, only active-active achieves zero downtime.

Pitfall 2: Same KMS key across regions

Wrong: one KMS key shared by all 3 regions → a key compromise = total loss. Right: per-region KMS with customer-managed keys.

Pitfall 3: Async replication for critical writes

Wrong: a payment ledger on async replication → one region failure = recent transactions lost. Right: sync replication via Spanner/DSQL, or app-level 2PC.

Pitfall 4: Ignore data residency

Wrong: EU user data automatically replicates to the US → GDPR violation, fines. Right: row-level locality (CockroachDB), tagged tables, region pinning.

Pitfall 5: No game day testing

Wrong: "failover should work" — never actually tested → a real outage discovers the bug. Right: quarterly game days that simulate region failure and measure RTO.

Pitfall 6: Cross-region in tight loops

Wrong: the app makes 100 sequential cross-region calls → 10 seconds of latency. Right: batch, prefetch, use a local cache. Cross-region calls are expensive.

Pitfall 7: Trust DNS TTL

Wrong: set TTL=300s and expect failover in 5 min → some clients cache for an hour. Right: use Anycast / Global Accelerator for sub-30s failover.

Pitfall 8: Forget reconciliation after partition heal

Wrong: the partition heals and the app carries on → diverged data persists. Right: automatic reconciliation (DSQL/CRDB) or a manual procedure.

Pitfall 9: No backup beyond replication

Wrong: "replication is backup" — ransomware encrypts the data → every replica is encrypted too. Right: point-in-time backups + cross-region copies + immutable storage.

Pitfall 10: Underestimate cost

Wrong: "multi-region is just 2x the cost" → the bill comes in at 4x because of data transfer. Right: calculate cross-region transfer carefully; use private connectivity (Direct Connect, ExpressRoute).


| Topic | Connection |
|---|---|
| Tuan-07-Database-Sharding-Replication | Foundation; multi-region is the extreme case |
| Tuan-Bonus-Consensus-Raft-Paxos | Underlying consensus protocols for distributed SQL |
| Tuan-Bonus-Consistency-Models-Isolation | External consistency, linearizability |
| Tuan-Bonus-CRDTs-Conflict-Free-Data-Types | Alternative conflict resolution for async replication |
| Tuan-Bonus-Multi-Tenancy-SaaS-Patterns | Tenant-region affinity |
| Case-Design-Payment-System | Cross-border payments require multi-region |
| Case-Design-Stock-Exchange | Geo-distributed exchanges |
| Tuan-13-Monitoring-Observability | Cross-region monitoring, replication lag |

References

Talks:

  • AWS re:Invent 2024 DAT427 — Aurora DSQL deep dive
  • Spanner talks at SIGMOD, OSDI


Next up: Tuan-Bonus-Multi-Tenancy-SaaS-Patterns — tenant isolation patterns for SaaS, complementary to multi-region.