Case Study: Design Nearby Friends (Real-Time Location Sharing)

“Hieu, imagine you're out with a group of friends in Saigon. Everyone is on Zalo, but nobody knows where anyone else is. You open the app — and within one second, the map shows 5 friends near you within a 5 km radius. Behind that ‘simplicity’ is a complex real-time system where WebSocket meets Pub/Sub meets Redis at the scale of 100 million users.”

Tags: system-design nearby-friends real-time websocket pubsub redis alex-xu case-study vol2
Student: Hieu
Prerequisites: Tuan-02-Back-of-the-envelope · Tuan-06-Cache-Strategy · Tuan-10-Consistent-Hashing · Tuan-17-Design-Chat-System
Related: Case-Design-Proximity-Service · Tuan-05-Load-Balancer · Tuan-08-Message-Queue · Tuan-13-Monitoring-Observability · Tuan-14-AuthN-AuthZ-Security
Reference: Alex Xu, System Design Interview — An Insider's Guide, Volume 2 — Chapter 2: Nearby Friends


1. Context & Why — Why does Nearby Friends matter?

1.1 Analogy — A group of friends out in the city

Hieu, imagine you and 10 friends are out in Ho Chi Minh City on a Saturday night. Everyone has their own plans — someone is eating in District 1, someone is shopping at Bitexco, someone is in Thu Duc. You want to know: who is near me so we can meet up?

The traditional way: You text each person — "Hey, where are you?" — then wait for each reply. 10 people, 10 messages, 10 waits. Some never reply; some reply 30 minutes later, by which time they have already moved somewhere else.

The Nearby Friends way: You open the app and turn on "Nearby Friends". It automatically shows on the map: "Minh is on Nguyen Hue, Tuan is on Dong Khoi 500 m away, Lan is on Bui Vien 800 m away." Refreshed every 30 seconds. You don't have to ask anyone — the system does everything automatically.

This is the Nearby Friends feature — similar to Snap Map (Snapchat), Zalo Nearby, Facebook Nearby Friends (discontinued in 2022), or WhatsApp Live Location.

1.2 The technical problem — Why is this hard?

| Problem | Explanation | Scale |
|---|---|---|
| Real-time location updates | Locations change constantly and must refresh every 30 seconds — not a one-shot query like Proximity Service | 10M concurrent users × one update per 30 s ≈ 333K updates/s |
| Bidirectional communication | The server must push friends' locations to the client, not just let the client pull | HTTP polling is wasteful; WebSocket is needed |
| Fan-out problem | Each user has ~400 friends on average; every location update must notify up to 400 people | 333K updates/s × 400 friends = 133M messages/s |
| Stateful connections | WebSocket is stateful — harder to scale than stateless HTTP | Needs connection management, reconnection, server affinity |
| Privacy sensitive | Location is among the most sensitive personal data — needs opt-in, opt-out, precision control | GDPR, CCPA, personal data protection laws |
| Selective sharing | Show only friends who opted in and are online — not all 400 friends | Non-trivial filtering logic |

1.3 Comparison with Proximity Service

| Aspect | Proximity Service (Chapter 1) | Nearby Friends (Chapter 2) |
|---|---|---|
| Subjects | Businesses (static) | Friends (dynamic — constantly moving) |
| Update frequency | Businesses rarely change location | Users move every second |
| Communication | Request-response (HTTP) | Bidirectional (WebSocket) |
| Data freshness | Staleness of a few hours is acceptable | Needs real-time (< 30 s) |
| Indexing | Geohash/Quadtree over millions of POIs | No geospatial index — only distance checks between friends |
| Scale challenge | Read-heavy (60K QPS) | Write-heavy + fan-out (133M messages/s) |
| Reference | Case-Design-Proximity-Service | This case study |

Key insight: Nearby Friends is NOT a Proximity Service for people. Proximity Service answers "who is near me?" across all strangers. Nearby Friends only shows friends — you already know the friends list; you only need their current locations.

1.4 Real-World Applications

| App | Feature | Characteristics |
|---|---|---|
| Snap Map (Snapchat) | Shows friends' locations on a map | Real-time, Bitmoji avatars, Ghost Mode to hide |
| Zalo Nearby | Finds Zalo users near your current location | Different — finds strangers, not just friends |
| WhatsApp Live Location | Shares real-time location with a contact/group | Time-boxed (15 min, 1 h, 8 h) |
| Find My Friends (Apple) | See family/friends' locations | iOS integration, battery-efficient |
| Telegram Live Location | Shares location inside a chat | Similar to WhatsApp |

2. Step 1 — Understand the Problem & Establish Design Scope

2.1 Clarifying Questions

| Question | Answer | Notes |
|---|---|---|
| What is the core feature? | Show a list of friends who are near me, on a map, updated in real time | The only core feature |
| How far is "nearby"? | A configurable radius, default 5 miles (~8 km) | Configurable radius |
| Update frequency? | Every 30 seconds | Not second-by-second real time — 30 s is enough |
| How big is the scale? | 100M DAU (Daily Active Users) | Facebook/Zalo scale |
| How many concurrently online? | ~10% of DAU at peak = 10M concurrent | Peak hours: 7-10 PM |
| Average friends per user? | 400 friends | Facebook average is ~338; round to 400 |
| Is opt-in required? | Yes — only users who enable the feature share their location | Privacy first |
| Store location history? | No — only the current location | Less storage, more privacy |
| Show distance or exact position? | Both — distance + position on the map | Client renders |

2.2 Functional Requirements

  • FR1: Users can enable/disable the Nearby Friends feature (opt-in/opt-out)
  • FR2: When enabled, the app shows friends within a configurable radius (default 5 miles)
  • FR3: The list refreshes every 30 seconds without a manual refresh
  • FR4: Each friend entry shows: name, distance, time of last update
  • FR5: Only friends who also enabled the feature are shown (mutual opt-in)
  • FR6: When a user disables the feature or goes offline, they disappear from their friends' lists

2.3 Non-Functional Requirements

| Requirement | Target | Rationale |
|---|---|---|
| Availability | 99.9% | A social feature, not safety-critical like navigation |
| Latency | Location update propagation < 1 second | From a friend's update to you receiving it |
| Scalability | 10M concurrent users, 333K location updates/s | Peak traffic |
| Consistency | Eventual consistency is fine | Receiving a location 1-2 seconds late is acceptable |
| Battery efficiency | Don't drain the battery | GPS polling every 30 s, not continuously |
| Privacy | Location shared only with opted-in friends | GDPR compliant |

2.4 Estimation — Back-of-the-Envelope

Concurrent Users:

100M DAU × 10% online at peak = 10M concurrent users

Location Update QPS:

10M users ÷ 30 s between updates ≈ 333K updates/s

Every second the system receives ~333K location updates from clients. This is write-heavy.

WebSocket Connections:

Each of the 10M concurrent users holds one persistent WebSocket connection. If each WebSocket server handles 50K connections (a realistic production figure):

10M ÷ 50K = 200 WebSocket servers

Pub/Sub Fan-out Volume:

333K updates/s × 400 friends = 133M messages/s (worst case)

But not all 400 friends are online and opted in. Assume 10% of friends are online and opted in:

133M × 10% ≈ 13.3M messages/s

This is the key number: 13.3 million messages per second must be delivered through the Pub/Sub system. It is the biggest challenge in this problem.

Redis Memory for the Location Cache:

10M users × ~100 bytes per entry (user_id, lat, lng, timestamp) ≈ 1 GB

Only 1 GB of Redis memory for the location cache of 10M users. A single Redis instance (64 GB RAM) handles it easily.

Bandwidth for Location Updates:

Inbound: 333K updates/s × ~100 bytes ≈ 33 MB/s
Outbound: 13.3M messages/s × ~100 bytes ≈ 1.33 GB/s

Outbound bandwidth is ~40x inbound — the signature of a fan-out system.

Estimation summary:

| Metric | Value |
|---|---|
| Concurrent users (peak) | 10M |
| WebSocket servers (50K conn/server) | 200 |
| Location update QPS | ~333K/s |
| Pub/Sub fan-out messages | ~13.3M/s |
| Redis memory (location cache) | ~1 GB |
| Inbound bandwidth | ~33 MB/s |
| Outbound bandwidth | ~1.33 GB/s |
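As a sanity check, the whole table can be reproduced with a few lines of arithmetic. A minimal sketch; the ~100-byte payload and the 10% online-friends ratio are the assumptions stated above:

```python
# Back-of-the-envelope check for the estimation table above.
DAU = 100_000_000
concurrent = DAU // 10                        # 10% of DAU online at peak -> 10M
UPDATE_INTERVAL_S = 30
PAYLOAD_BYTES = 100                           # user_id + lat + lng + timestamp

update_qps = concurrent / UPDATE_INTERVAL_S   # ~333K location updates/s
fanout_qps = update_qps * 400 * 0.10          # 400 friends, ~10% online -> ~13.3M/s
ws_servers = concurrent // 50_000             # 50K connections per server -> 200

print(f"update QPS     : {update_qps:,.0f}/s")
print(f"fan-out        : {fanout_qps:,.0f} messages/s")
print(f"WS servers     : {ws_servers}")
print(f"inbound        : {update_qps * PAYLOAD_BYTES / 1e6:.0f} MB/s")
print(f"outbound       : {fanout_qps * PAYLOAD_BYTES / 1e9:.2f} GB/s")
print(f"location cache : {concurrent * PAYLOAD_BYTES / 1e9:.1f} GB")
```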

3. Step 2 — High-Level Design

3.1 Choosing the Communication Protocol

Before designing the architecture, pick the communication protocol. This is the single most important decision in this problem.

| Option | Description | Pros | Cons | Fit? |
|---|---|---|---|---|
| HTTP Polling | Client sends a GET request every 30 s | Simple, stateless | Wastes bandwidth (~500 bytes of HTTP headers per request); 10M requests every 30 s = 333K QPS just to poll | No |
| HTTP Long Polling | Client sends a request; server holds it until new data arrives | Less wasteful than polling | Still one connection per pending request; not truly bidirectional | No |
| Server-Sent Events (SSE) | Server pushes events over HTTP | Simple, built-in browser support | Server → client only, no client → server. We need both directions | Not enough |
| WebSocket | Full-duplex, bidirectional communication over one TCP connection | Client sends locations, server pushes friend updates — on the same connection. Lightweight (2-6 bytes overhead/message) | Stateful, harder to scale than HTTP | Yes — this is the choice |

Why WebSocket? Because we need bidirectional communication: the client sends location updates (client → server) and the server pushes friend locations (server → client) over the same connection. WebSocket is the most natural choice. See Tuan-17-Design-Chat-System — the chat system uses WebSocket for the same reason.

3.2 High-Level Architecture Overview

The system has three main components:

| Component | Role | Technology |
|---|---|---|
| WebSocket Servers | Maintain persistent client connections, receive location updates, push friend locations | WebSocket (ws/wss) |
| Location Cache | Store each user's current (latest) location | Redis |
| Pub/Sub | Propagate location updates to all online friends | Redis Pub/Sub |

Data flow in brief (a server-side sketch of steps 3-4 follows the list):

  1. User A enables the feature → the client opens a WebSocket connection to a server
  2. The client sends a location update every 30 seconds over the WebSocket
  3. The server stores the location in Redis (Location Cache)
  4. The server publishes the location update on User A's Pub/Sub channel
  5. All of User A's friends subscribed to this channel receive the update
  6. The friends' WebSocket servers compute distances → push to each friend's client if within the radius
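A minimal sketch of steps 3-4 on the WebSocket server, using the redis-py asyncio client; the key/channel naming and the handler name are illustrative, not fixed by the chapter:

```python
import json
import redis.asyncio as redis

r = redis.Redis()          # in production: the Location Cache / Pub/Sub cluster
LOCATION_TTL_S = 120       # 2x the 30 s update interval (see section 4.8.2)

async def handle_location_update(user_id: str, lat: float, lng: float, ts: int) -> None:
    payload = json.dumps({"user_id": user_id, "lat": lat, "lng": lng, "ts": ts})
    # Step 3: store the latest location; the TTL is a safety net for missed cleanups.
    await r.set(f"user:{user_id}:location", payload, ex=LOCATION_TTL_S)
    # Step 4: publish on this user's channel; online friends are subscribed to it.
    await r.publish(f"channel:user_{user_id}", payload)
```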

3.3 High-Level Architecture Diagram

flowchart TB
    subgraph Clients
        A["User A<br/>(Mobile App)"]
        B["User B<br/>(Mobile App)"]
        C["User C<br/>(Mobile App)"]
    end

    subgraph "API Gateway / Load Balancer"
        LB["Load Balancer<br/>→ [[Tuan-05-Load-Balancer]]"]
    end

    subgraph "WebSocket Server Fleet"
        WS1["WebSocket Server 1<br/>50K connections"]
        WS2["WebSocket Server 2<br/>50K connections"]
        WSN["WebSocket Server N<br/>50K connections"]
    end

    subgraph "Data Layer"
        Redis_Cache["Redis — Location Cache<br/>Key: user_id<br/>Value: {lat, lng, timestamp}"]
        Redis_PubSub["Redis Pub/Sub<br/>Channel per user<br/>Friends subscribe"]
    end

    subgraph "Supporting Services"
        UserSvc["User Service<br/>(Friends list, profile)"]
        DB[("Database<br/>User data, friend relationships")]
    end

    A -->|"WebSocket"| LB
    B -->|"WebSocket"| LB
    C -->|"WebSocket"| LB

    LB --> WS1
    LB --> WS2
    LB --> WSN

    WS1 & WS2 & WSN -->|"SET location"| Redis_Cache
    WS1 & WS2 & WSN -->|"PUBLISH update"| Redis_PubSub
    WS1 & WS2 & WSN -->|"SUBSCRIBE friend channels"| Redis_PubSub

    WS1 & WS2 & WSN -->|"Get friends list"| UserSvc
    UserSvc --> DB

    style LB fill:#42a5f5,color:#fff
    style Redis_Cache fill:#ef5350,color:#fff
    style Redis_PubSub fill:#ff7043,color:#fff
    style DB fill:#66bb6a,color:#fff

3.4 API Design (WebSocket Messages)

Since we use WebSocket, there are no traditional REST endpoints. Instead, there are message types:

Client → Server Messages:

| Message Type | Payload | Purpose |
|---|---|---|
| location_update | {lat, lng, timestamp} | Send the current location every 30 s |
| enable_nearby | {} | Enable Nearby Friends |
| disable_nearby | {} | Disable Nearby Friends |
| update_radius | {radius_miles} | Change the visible radius |

Server → Client Messages:

| Message Type | Payload | Purpose |
|---|---|---|
| friend_location | {friend_id, lat, lng, timestamp, distance} | A friend's new location |
| friend_offline | {friend_id} | A friend disabled the feature or went offline |
| nearby_friends_list | [{friend_id, lat, lng, distance}, ...] | Full list sent on connect |
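For concreteness, two example frames as they might look on the wire. The {type, payload} envelope is an assumption; the chapter only fixes the field names in the tables above:

```python
client_to_server = {
    "type": "location_update",
    "payload": {"lat": 10.7769, "lng": 106.7009, "timestamp": 1700000000},
}

server_to_client = {
    "type": "friend_location",
    "payload": {"friend_id": "user_42", "lat": 10.7800, "lng": 106.7100,
                "timestamp": 1700000000, "distance": 1.2},
}
```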

4. Step 3 — Deep Dive

4.1 Location Update Flow — Step by step

This is the most important flow in the system. When User A sends a location update, what happens?

sequenceDiagram
    participant A as User A (Client)
    participant WS_A as WebSocket Server<br/>(serving User A)
    participant Cache as Redis<br/>Location Cache
    participant PubSub as Redis<br/>Pub/Sub
    participant WS_B as WebSocket Server<br/>(serving User B)
    participant B as User B (Client)

    Note over A,B: User A and User B are friends. Both have Nearby Friends enabled.

    A->>WS_A: location_update {lat: 10.77, lng: 106.70, ts: ...}

    par Parallel Operations
        WS_A->>Cache: SET user:A:location {lat, lng, ts}<br/>TTL = 120s
    and
        WS_A->>PubSub: PUBLISH channel:user_A {lat, lng, ts}
    end

    Note over PubSub: User B has SUBSCRIBEd to channel:user_A<br/>(because B is A's friend and is online)

    PubSub->>WS_B: Message on channel:user_A {lat, lng, ts}

    Note over WS_B: WS_B computes the distance between A and B.<br/>If <= B's radius → push to B.

    WS_B->>B: friend_location {friend_id: A, lat, lng, distance: 1.2km}

Step-by-step detail:

| Step | Action | Component | Latency | Notes |
|---|---|---|---|---|
| 1 | Client sends location over WebSocket | Client → WS Server | ~5 ms (LAN) | Binary message, minimal overhead |
| 2a | Store location in Redis cache | WS Server → Redis | ~1 ms | SET with TTL 120 s (2x update interval) |
| 2b | Publish on the user's Pub/Sub channel | WS Server → Redis Pub/Sub | ~1 ms | In parallel with step 2a |
| 3 | Redis fans out to subscribers | Redis Pub/Sub → WS Servers | ~1 ms | Every subscriber receives the message |
| 4 | Compute distance | WS Server (receiver) | ~0.01 ms | Haversine formula, CPU-bound |
| 5 | Push to client if within radius | WS Server → Client | ~5 ms | Over the WebSocket connection |
| | Total | | ~10-15 ms | Very fast |

Aha Moment: The whole flow from A updating a location to B receiving it takes only ~10-15 ms. Users perceive it as "real-time" even though locations only refresh every 30 seconds. The bottleneck is not latency — it is fan-out volume.

4.2 Redis Pub/Sub — The heart of the system

4.2.1 Channel design

Core design decision: one Pub/Sub channel per user.

| Design | Description | Pros | Cons |
|---|---|---|---|
| 1 channel per user (chosen) | Channel user:A; friends subscribe | Simple, granular control | Many channels (10M) |
| 1 channel per geohash cell | Channel geo:w3gvk; users in the cell subscribe | Fewer channels | Receives updates from strangers (not just friends), privacy issue |
| 1 global channel | Every update broadcast to everyone | Simplest | 333K updates/s × 10M subscribers = infeasible |

Why one channel per user? Because Nearby Friends only cares about friends, not everyone in the area. Each user subscribes only to their friends' channels — exactly the people they care about.

4.2.2 Subscribe/Unsubscribe Flow

When User B comes online and enables Nearby Friends (a sketch of steps 3-6 follows the table):

| Step | Action | Detail |
|---|---|---|
| 1 | B connects via WebSocket | Opens a persistent connection |
| 2 | Server fetches B's friends list | Query User Service → DB. Result: [A, C, D, E, …] (400 friends) |
| 3 | Server checks who is online and opted in | Query Redis: EXISTS user:A:location, user:C:location, … |
| 4 | Server subscribes B to the online friends' channels | SUBSCRIBE channel:user_A, channel:user_C, … |
| 5 | Server fetches online friends' latest locations | MGET user:A:location, user:C:location, … from the Redis cache |
| 6 | Server computes distances and sends the initial list | Push nearby_friends_list to B over the WebSocket |
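A minimal sketch of steps 3-6, again with redis-py asyncio. Here one MGET doubles as both the online check (step 3) and the location fetch (step 5), a small simplification of the table's separate EXISTS calls:

```python
import json
import redis.asyncio as redis

r = redis.Redis()

async def on_connect(user_id: str, friend_ids: list[str]):
    # Steps 3 + 5: non-None results are exactly the online, opted-in friends,
    # and they already carry the latest locations.
    keys = [f"user:{f}:location" for f in friend_ids]
    locations = await r.mget(keys)
    online = {f: json.loads(loc) for f, loc in zip(friend_ids, locations) if loc}

    # Step 4: subscribe to the channels of online friends only.
    pubsub = r.pubsub()
    if online:
        await pubsub.subscribe(*(f"channel:user_{f}" for f in online))

    # Step 6: the caller computes distances and pushes nearby_friends_list.
    return online, pubsub
```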

When User B goes offline or disables Nearby Friends:

| Step | Action | Detail |
|---|---|---|
| 1 | Server UNSUBSCRIBEs B from all friend channels | UNSUBSCRIBE channel:user_A, channel:user_C, … |
| 2 | Server DELETEs B's location from the cache | DEL user:B:location |
| 3 | Server PUBLISHes an "offline" event on B's channel | PUBLISH channel:user_B {status: "offline"} |
| 4 | B's friends receive the "offline" event | Push friend_offline {friend_id: B} to the friends' clients |

4.2.3 Pub/Sub Fan-out Analysis

This is the trickiest part. When User A updates their location, one PUBLISH turns into one delivered message per online subscriber of channel:user_A.

But not every user has exactly 400 friends. The real distribution is long-tail:

| User type | Friend count | % of users | Subscribers per channel |
|---|---|---|---|
| Light users | < 100 | 40% | ~10 |
| Average users | 100-500 | 45% | ~40 |
| Heavy users | 500-2000 | 14% | ~100-200 |
| Power users / celebrities | 2000-5000 | 1% | ~500+ |

Pitfall: fan-out explosion with popular users. A user with 5000 friends and 500 of them online → every 30 seconds, one update fans out into 500 messages. Multiply across 333K updates/s → bursts can get very large.

Mitigation for popular users: rate-limit fan-out. If a user has > 500 subscribers, lower their update frequency from 30 s to 60 s or 120 s. Normal users are unaffected.

4.2.4 Memory of the Redis Pub/Sub Channels

10M online users × ~40 subscriptions each = 400M subscription entries. At roughly 18-20 bytes of tracking overhead per entry, that is on the order of 7 GB for Pub/Sub metadata — acceptable for a Redis cluster.

4.3 WebSocket Connection Management

4.3.1 Connection Lifecycle

stateDiagram-v2
    [*] --> Connecting: User opens app,<br/>enables Nearby Friends
    Connecting --> Connected: WebSocket handshake OK
    Connected --> Subscribing: Server subscribes to<br/>friend channels
    Subscribing --> Active: Subscriptions done,<br/>initial list sent
    Active --> Active: Location updates<br/>every 30s
    Active --> Reconnecting: Network drop,<br/>server crash
    Reconnecting --> Connecting: Retry with<br/>exponential backoff
    Active --> Disconnecting: User disables feature<br/>or closes app
    Disconnecting --> [*]: Cleanup subscriptions,<br/>delete location cache
    Reconnecting --> [*]: Max retries exceeded

4.3.2 Connection Parameters

| Parameter | Value | Rationale |
|---|---|---|
| Heartbeat interval | 30 seconds | Matches the location update interval — piggyback the heartbeat on the location message |
| Connection timeout | 10 seconds | If the connection isn't up within 10 s → retry |
| Max reconnect attempts | 10 | After 10 tries → tell the user "Unable to connect" |
| Reconnect backoff | Exponential: 1s, 2s, 4s, 8s, …, max 60s | Avoids a thundering herd when a server restarts |
| Idle timeout | 5 minutes without a location update | Assume GPS is off or the app was killed → disconnect, clean up |
| Max message size | 1 KB | A location update needs only ~100 bytes; 1 KB leaves margin |

4.3.3 Reconnection Strategy

When the WebSocket connection is lost (network issue, server crash, app backgrounded), the client must reconnect:

| Step | Action | Detail |
|---|---|---|
| 1 | Client detects the disconnect | onclose or onerror event |
| 2 | Wait per backoff | Exponential: 1s, 2s, 4s, 8s, 16s, 32s, 60s (cap) |
| 3 | Reconnect through the Load Balancer | May land on a different server (the stateful concern!) |
| 4 | Re-authenticate | Send the token over the WebSocket |
| 5 | Server re-subscribes to friend channels | Same as the initial flow |
| 6 | Server sends the initial nearby friends list | Client refreshes the UI |

Important: On reconnect, the client may be routed to a different server (because of the Load Balancer). The new server must re-subscribe to all friend channels. That is the cost of reconnection — but since subscribing is O(number_of_friends) and the friends list is cached, the cost is acceptable.
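A client-side reconnect loop matching the parameters in 4.3.2 — exponential backoff capped at 60 s, jitter against thundering herds, at most 10 attempts. Sketch only; it assumes the `websockets` package, and re-authentication is elided:

```python
import asyncio
import random
import websockets

async def connect_with_backoff(url: str):
    for attempt in range(10):                    # max reconnect attempts (4.3.2)
        try:
            return await websockets.connect(url) # re-auth would happen right after
        except OSError:
            # 1s, 2s, 4s, ... capped at 60s, plus jitter so that 50K clients
            # of a crashed server do not all retry in the same instant.
            delay = min(2 ** attempt, 60) + random.uniform(0, 0.5)
            await asyncio.sleep(delay)
    raise ConnectionError("unable to connect")   # surfaced to the user
```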

4.4 Scaling WebSocket Servers

4.4.1 The challenge — Stateful Connections

WebSocket connections are stateful: each connection is pinned to one specific server. You cannot "load balance each request" as with HTTP. Once a connection is established, every message must go through that same server.

| Problem | Explanation | Solution |
|---|---|---|
| Server failure | A server dies → 50K users lose their connections | Clients auto-reconnect; the LB routes them to other servers |
| Uneven distribution | Server A has 60K connections, server B has 20K | LB tracks connection counts and routes new connections to the least-loaded server |
| Deployment | Rolling update → server restarts → 50K users reconnect at once | Graceful shutdown: the server notifies clients 30 s ahead; clients reconnect gradually |
| Memory pressure | Each connection costs ~10-50 KB of memory; 50K connections = 0.5-2.5 GB/server | Monitor memory; cap max connections per server |

4.4.2 Server Assignment Strategy

| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Random (via LB) | LB picks a random server for each new connection | Simple | Uneven distribution |
| Least connections (recommended) | LB picks the server with the fewest connections | Even distribution | LB must track state |
| Consistent hashing | Hash user_id → server | Reconnects land on the same server (warm cache) | Hard to handle server add/remove |

Use the Least Connections strategy for the WebSocket LB. See Tuan-05-Load-Balancer. Consistent hashing (Tuan-10-Consistent-Hashing) could work, but the complexity isn't worth it for this use case — a reconnection only costs a few seconds of re-subscribing.

4.4.3 Graceful Shutdown Flow

When a WebSocket server needs a restart (deploy, maintenance):

| Step | Action | Notes |
|---|---|---|
| 1 | Mark the server as "draining" | LB stops sending new connections to it |
| 2 | Server sends a "reconnect" signal to all clients | Clients begin reconnecting to other servers |
| 3 | Wait for connections to drain to 0 (or time out) | Max 60 seconds |
| 4 | Server unsubscribes from all Pub/Sub channels | Cleanup |
| 5 | Shut the server down | Safe |

4.5 Scaling Redis Pub/Sub

4.5.1 The problem — a single Redis Pub/Sub bottleneck

One Redis instance can handle ~500K messages/s of Pub/Sub throughput. But we need to deliver 13.3M messages/s, so we need multiple instances:

13.3M messages/s ÷ 500K messages/s per instance ≈ 27 instances

Round up: 30 Redis Pub/Sub instances (with buffer).

4.5.2 Sharding Strategy for Pub/Sub Channels

| Strategy | Description | Pros | Cons |
|---|---|---|---|
| Hash-based sharding | shard = hash(user_id) % N | Even distribution, deterministic | Resharding when adding/removing nodes |
| Consistent hashing | Use a hash ring | Smooth resharding | More complex |
| Range-based | user_id 1-1M → shard 1, 1M-2M → shard 2 | Simple | Uneven if user distribution is skewed |

Choice: hash-based sharding (simple, good enough).

Every WebSocket server must know which Redis shard holds user X's channel. Since the hash function is deterministic, every server can compute it without a lookup.
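A sketch of that computation; CRC32 stands in for whatever stable hash is actually used (Python's built-in hash() is randomized per process, so it would not work here):

```python
import zlib

NUM_SHARDS = 30  # from section 4.5.1

def shard_for(user_id: str) -> int:
    """Every WS server computes the same shard with no coordination."""
    return zlib.crc32(user_id.encode()) % NUM_SHARDS

# Both PUBLISH and SUBSCRIBE for channel:user_A go to shard_for("A").
```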

flowchart LR
    subgraph "WebSocket Servers"
        WS1["WS Server 1"]
        WS2["WS Server 2"]
        WS3["WS Server 3"]
    end

    subgraph "Redis Pub/Sub Cluster"
        R1["Redis Shard 1<br/>Channels: user_1, user_31, ..."]
        R2["Redis Shard 2<br/>Channels: user_2, user_32, ..."]
        R3["Redis Shard 3<br/>Channels: user_3, user_33, ..."]
        RN["Redis Shard N<br/>..."]
    end

    WS1 -->|"PUBLISH channel:user_1"| R1
    WS1 -->|"SUBSCRIBE channel:user_2"| R2
    WS2 -->|"PUBLISH channel:user_32"| R2
    WS2 -->|"SUBSCRIBE channel:user_3"| R3
    WS3 -->|"SUBSCRIBE channel:user_1"| R1

    R1 -->|"Fan-out"| WS2 & WS3
    R2 -->|"Fan-out"| WS1 & WS3

    style R1 fill:#ef5350,color:#fff
    style R2 fill:#ef5350,color:#fff
    style R3 fill:#ef5350,color:#fff
    style RN fill:#ef5350,color:#fff

Note: each WebSocket server may connect to many Redis shards — because one user's friends can live on several different shards. WS Server 1, serving User B, must subscribe to channel:user_A on shard 1, channel:user_C on shard 3, and so on.

4.5.3 Redis Pub/Sub Connection Budget

Each WebSocket server connects to every Redis shard (to subscribe/publish). With 200 WS servers and 30 Redis shards:

200 WS servers × 30 shards = 6,000 connections total, i.e. 200 connections per shard (one per WS server)

Each Redis instance can handle ~10K concurrent connections → comfortable headroom.

4.5.4 Alternative — a Message Queue instead of Redis Pub/Sub?

| Aspect | Redis Pub/Sub | Message Queue (Kafka, RabbitMQ) |
|---|---|---|
| Delivery guarantee | At-most-once (fire-and-forget) | At-least-once (Kafka), at-most-once (RabbitMQ) |
| Persistence | No — messages are lost if a subscriber is offline | Yes — Kafka stores messages on disk |
| Latency | Very low (~1 ms) | Higher (~5-50 ms depending on configuration) |
| Fan-out | Native — 1 publish, N subscribers receive | Needs consumer groups or topic-per-user |
| Memory | In-memory only | On disk (Kafka) |
| Ordering | Guaranteed per channel | Guaranteed per partition (Kafka) |

Why Redis Pub/Sub?

  1. At-most-once is enough: losing one location update is fine — a new one arrives 30 seconds later. No durability needed.
  2. Very low latency: Redis Pub/Sub ~1 ms vs Kafka ~5-50 ms. For a real-time feature, 1 ms matters.
  3. No persistence needed: we don't care about the location from 5 minutes ago. Only the latest.
  4. Simplicity: Redis Pub/Sub is built in; no extra Kafka cluster to deploy and manage.

Trade-off: if you need guaranteed delivery (e.g., a notification system), use Kafka. For Nearby Friends, Redis Pub/Sub is a perfect fit because we value low latency and simplicity over durability.

4.6 Nearby Friend Calculation — Computing distance

4.6.1 Server-side vs Client-side Calculation

| Approach | Description | Pros | Cons |
|---|---|---|---|
| Server-side (chosen) | The WS server computes the distance before pushing | Saves bandwidth — only friends within the radius are pushed | The server must know the subscriber's location |
| Client-side | The server pushes all friends' locations; the client filters | Simpler server | Wastes bandwidth — pushes faraway friends too |

Server-side calculation wins because:

  • With 40 online friends, the server pushes only the 5-10 within the radius instead of all 40 → saves 75-87% of bandwidth
  • The (mobile) client does less work → saves battery
  • The server already has both positions (sender and receiver) in the Redis cache

4.6.2 Haversine Formula

The distance between two points on the Earth's surface:

$$ d = 2R \arcsin\left( \sqrt{ \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right) } \right) $$

where R ≈ 6371 km (the Earth's radius), φ is latitude, and λ is longitude (both in radians).

Performance: Haversine is O(1) — just a few trigonometric operations. Computing 40 distances takes well under a millisecond. Not a bottleneck.
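The same formula in code — a standard, self-contained implementation:

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in km between two (lat, lng) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi, dlmb = radians(lat2 - lat1), radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# haversine_km(10.7769, 106.7009, 10.7679, 106.6946)  # Nguyen Hue -> Bui Vien, ~1.2 km
```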

4.6.3 Detailed flow on receiving a friend update

When the WS server serving User B receives an update from channel:user_A (a sketch follows the table):

| Step | Action | Detail |
|---|---|---|
| 1 | Receive the message from Pub/Sub | {user_id: A, lat: 10.77, lng: 106.70, ts: ...} |
| 2 | Fetch User B's current location | From local memory (cached when B last sent a location update) |
| 3 | Compute the Haversine distance | d = haversine(B.lat, B.lng, A.lat, A.lng) |
| 4 | Compare with B's radius | if d <= B.radius |
| 5a | Within radius → push to B | friend_location {friend_id: A, distance: d} |
| 5b | Outside radius → skip | No push; saves bandwidth |

Optimization: the WS server serving User B keeps B's location in local memory (no Redis query per message). It is refreshed every 30 seconds when B sends a location update. This is an "on-the-spot cache" — zero latency.
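A sketch of the receiver path, reusing haversine_km from 4.6.2. The local_locations dict is the "on-the-spot cache" described above; push stands in for whatever sends a frame to the subscriber's client (both names are illustrative):

```python
import json

DEFAULT_RADIUS_KM = 8.0   # 5 miles ~ 8 km; per-user configurable in a real system
local_locations: dict[str, tuple[float, float]] = {}   # subscriber_id -> (lat, lng)

async def on_friend_update(subscriber_id: str, message: bytes, push) -> None:
    update = json.loads(message)                 # step 1: Pub/Sub message arrives
    me = local_locations.get(subscriber_id)      # step 2: local memory, no Redis hop
    if me is None:
        return                                   # subscriber has no location fix yet
    d = haversine_km(me[0], me[1], update["lat"], update["lng"])   # step 3
    if d <= DEFAULT_RADIUS_KM:                   # step 4
        await push({"type": "friend_location",   # step 5a: within radius -> push
                    "friend_id": update["user_id"], "distance": round(d, 2)})
    # step 5b: outside the radius -> drop silently, saving client bandwidth
```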

4.7 Adding/Removing Friends — Dynamic Subscription

When a friend relationship changes (new friend, unfriend), subscriptions must be updated:

4.7.1 Adding a friend

| Step | Action | Trigger |
|---|---|---|
| 1 | User A and User B become friends | Friend request accepted |
| 2 | A notification goes to A's and B's WS servers | Via an internal message queue |
| 3 | A's WS server subscribes to channel:user_B | If B is online and opted in |
| 4 | B's WS server subscribes to channel:user_A | If A is online and opted in |
| 5 | Compute distances and push if needed | A and B each receive the other's location |

4.7.2 Unfriending

| Step | Action | Trigger |
|---|---|---|
| 1 | User A unfriends User B | UI action |
| 2 | A notification goes to A's and B's WS servers | Via an internal message queue |
| 3 | A's WS server unsubscribes from channel:user_B | |
| 4 | B's WS server unsubscribes from channel:user_A | |
| 5 | Push friend_offline to both A and B | Remove each other from the nearby list |

Note: SUBSCRIBE/UNSUBSCRIBE on Redis Pub/Sub is O(1) — extremely fast. No performance impact.

4.8 Handling Inactive Users — TTL and Cleanup

4.8.1 The problem

Users can go inactive for many reasons:

| Situation | What the system sees | Handling |
|---|---|---|
| User closes the app | WebSocket disconnect event | Clean up immediately |
| App killed by the OS | WebSocket disconnect event (possibly delayed) | Clean up immediately |
| User enters a dead zone | No location updates, missed heartbeats | Detect and clean up |
| User leaves the phone in one place | Location updates still arrive (position unchanged) | Nothing to do — still active |
| User turns off GPS | No GPS data → the app stops sending locations | Detect and notify the user |

4.8.2 TTL Strategy

| Data | TTL | Rationale |
|---|---|---|
| Location cache (Redis) | 120 seconds (2x the update interval) | No update within two cycles → the user is offline |
| WebSocket heartbeat | 60 seconds | Server pings, client pongs. Two consecutive missed pings → disconnect |
| Pub/Sub subscription | No TTL — cleaned up on disconnect | Subscribe/unsubscribe are explicit |

Flow when a user goes inactive:

sequenceDiagram
    participant A as User A (Client)
    participant WS as WebSocket Server
    participant Cache as Redis Cache
    participant PubSub as Redis Pub/Sub
    participant Friends as Friends' WS Servers

    Note over A,WS: User A loses network / closes app

    WS->>WS: Heartbeat timeout (60s, 2 missed pings)
    WS->>WS: Close WebSocket connection

    par Cleanup
        WS->>Cache: DEL user:A:location
    and
        WS->>PubSub: UNSUBSCRIBE all friend channels
    and
        WS->>PubSub: PUBLISH channel:user_A {status: "offline"}
    end

    PubSub->>Friends: User A offline notification
    Friends->>Friends: Remove A from nearby list,<br/>push friend_offline to clients

4.8.3 Redis TTL as a Safety Net

Even if the server never gets to clean up (e.g., it crashed), the Redis TTL on the location key deletes the data automatically — the key was written with a 120-second expiry, so it disappears at most two update cycles after the last update.

Friends will see that the user "last updated 2 minutes ago" → the client can dim this user or hide them from the list.

4.9 Geohash Optimization — Reducing Computation

4.9.1 The problem

For every friend location update, the WS server computes a Haversine distance. If a user has 40 online friends and each friend updates every 30 s, each user receives:

40 updates / 30 s ≈ 1.3 distance calculations per second

40 calculations every 30 seconds is tiny — not yet a bottleneck. But with 50K users on one WS server:

50K users × 1.3/s ≈ 67K calculations/s

Haversine is cheap (well under a microsecond per call), so 67K calculations/s costs only tens of milliseconds of CPU time per second. Not a bottleneck.

Conclusion: at 10M concurrent users, geohash optimization is NOT needed. Brute-force Haversine is fast enough. However, if the scale grows to 100M+ concurrent users, or friend counts grow to 5000+, geohash optimization may become necessary.

4.9.2 Geohash Optimization (if needed)

If you do need to optimize (at extreme scale), as sketched after the table:

| Technique | Description | Savings |
|---|---|---|
| Pre-filter by geohash | Each user has a geohash (derived from lat/lng). Compute Haversine only for friends in nearby geohash cells | Eliminates 80-90% of faraway friends |
| Lazy calculation | Recompute only when the friend's geohash changes (skip if still in the same cell) | Cuts 90%+ of calculations for stationary friends |
| Batch calculation | Batch several updates and compute once every 5-10 s instead of per update | Fewer CPU spikes |
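A sketch of the pre-filter idea using a plain lat/lng grid instead of a real geohash (same effect, no library): only friends in the same or an adjacent cell get a Haversine call. The 0.1 degree cell size (~11 km at the equator) is an assumption:

```python
CELL_DEG = 0.1   # ~11 km per cell at the equator

def grid_cell(lat: float, lng: float) -> tuple[int, int]:
    return (int(lat // CELL_DEG), int(lng // CELL_DEG))

def within_radius(me: tuple[float, float], friend: tuple[float, float],
                  radius_km: float = 8.0) -> bool:
    (r1, c1), (r2, c2) = grid_cell(*me), grid_cell(*friend)
    if abs(r1 - r2) > 1 or abs(c1 - c2) > 1:
        return False                 # clearly far away: skip the trigonometry
    return haversine_km(me[0], me[1], friend[0], friend[1]) <= radius_km
```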

4.10 Multi-Region Architecture

4.10.1 Why Multi-Region?

| Reason | Detail |
|---|---|
| Latency | A user in Vietnam connecting to a US server sees ~200 ms latency. WebSocket updates feel slow |
| Availability | One region down → the whole system down. Multi-region → failover |
| Data sovereignty | GDPR requires EU users' data to stay in the EU |
| User distribution | 100M DAU spread worldwide — one region cannot serve them all |

4.10.2 Regional Architecture

flowchart TB
    subgraph "Region: US-East"
        US_LB["Load Balancer"]
        US_WS["WebSocket Servers<br/>60 servers"]
        US_Redis["Redis Cluster<br/>(Cache + Pub/Sub)"]
    end

    subgraph "Region: EU-West"
        EU_LB["Load Balancer"]
        EU_WS["WebSocket Servers<br/>50 servers"]
        EU_Redis["Redis Cluster<br/>(Cache + Pub/Sub)"]
    end

    subgraph "Region: AP-Southeast"
        AP_LB["Load Balancer"]
        AP_WS["WebSocket Servers<br/>90 servers"]
        AP_Redis["Redis Cluster<br/>(Cache + Pub/Sub)"]
    end

    subgraph "Cross-Region"
        Bridge["Cross-Region<br/>Message Bridge<br/>(for cross-region friend pairs)"]
    end

    US_Redis <-->|"Cross-region<br/>friend updates"| Bridge
    EU_Redis <-->|"Cross-region<br/>friend updates"| Bridge
    AP_Redis <-->|"Cross-region<br/>friend updates"| Bridge

    style Bridge fill:#ff9800,color:#000
    style US_Redis fill:#ef5350,color:#fff
    style EU_Redis fill:#ef5350,color:#fff
    style AP_Redis fill:#ef5350,color:#fff

4.10.3 Cross-Region Friend Pairs

Problem: User A is in Vietnam (AP-Southeast), User B is in the US (US-East). They are friends and both have Nearby Friends enabled. How does A receive B's location?

| Approach | Description | Pros | Cons |
|---|---|---|---|
| Cross-region Pub/Sub bridge (chosen) | A's update publishes in the AP region; a message bridge forwards it to B's subscribers in the US region | Regions stay isolated; the bridge carries only cross-region pairs | Extra latency (~100-200 ms cross-region) |
| Global Pub/Sub | One Pub/Sub cluster serves the whole world | Simple | Single point of failure, high latency for remote users |
| Ignore cross-region | Only show friends in the same region | Simplest | Bad UX — friends abroad never show up |

Trade-off: cross-region friends receive location updates ~100-200 ms later (the message crosses the inter-region internet). But since nearby friends are usually in the same city (same region), the vast majority of updates are intra-region and very fast.

In practice: most friend pairs who both enable Nearby Friends are in the same area (who enables Nearby Friends to watch a friend 10,000 km away?). Cross-region pairs are an edge case — the higher latency is acceptable.

4.10.4 Region Assignment

Users are assigned to the nearest region based on IP or GPS location:

| User location | Region | Notes |
|---|---|---|
| Vietnam, Thailand, Indonesia | AP-Southeast (Singapore) | RTT ~10-30 ms |
| US, Canada | US-East or US-West | RTT ~10-50 ms |
| Europe | EU-West (Ireland/Frankfurt) | RTT ~10-30 ms |
| Japan, South Korea | AP-Northeast (Tokyo) | RTT ~10-20 ms |

DNS-based routing (e.g., AWS Route 53 latency-based routing) automatically sends each user to the lowest-latency region.


5. Estimation — Deep Dive Numbers

5.1 WebSocket Server Capacity Planning

10M concurrent connections ÷ 50K connections per server = 200 servers

Add a 20% buffer for failures and maintenance: 200 × 1.2 = 240 servers.

Memory per server:

50K connections × ~50 KB per connection ≈ 2.5 GB for connection state

Add the OS, the application, and Redis connections: ~4 GB total. An 8 GB RAM server is enough.

5.2 Redis Cluster Sizing

Location Cache (separate cluster):

10M users × ~100 bytes per entry ≈ 1 GB

One Redis instance (with replication) is enough.

Pub/Sub Cluster:

13.3M messages/s at ~450K messages/s effective per shard ≈ 30 shards (matching section 4.5.1), 2 GB RAM each

5.3 Network Bandwidth

Inbound (client → server):

333K updates/s × ~100 bytes ≈ 33 MB/s

Per server: 33 MB/s ÷ 240 ≈ 140 KB/s — negligible.

Outbound (server → client):

13.3M messages/s × ~100 bytes ≈ 1.33 GB/s

Per server: 1.33 GB/s ÷ 240 ≈ 5.5 MB/s (~44 Mbps) — acceptable (servers typically have 1-10 Gbps NICs).

Internal (server ↔ Redis):

Publishes into Redis: ~33 MB/s in; Pub/Sub fan-out back out of Redis: ~1.33 GB/s.

Adding cache reads/writes (~33 MB/s): ~1.4 GB/s of internal traffic in total.

5.4 Capacity Summary

| Resource | Quantity | Spec |
|---|---|---|
| WebSocket servers | 240 | 8 GB RAM, 4 vCPU |
| Redis (Location Cache) | 3 (1 primary + 2 replicas) | 4 GB RAM |
| Redis (Pub/Sub shards) | 30 | 2 GB RAM each |
| Total Redis memory | ~67 GB | Cache (1 GB) + Pub/Sub (60 GB) |
| Network bandwidth (internal) | ~1.4 GB/s | 10 Gbps network |
| Database (User/Friends) | 3 (1 primary + 2 replicas) | Standard PostgreSQL |

6. Security — Protecting Location Privacy

6.1 Opt-in / Opt-out — Rule number one

| Principle | Implementation | Detail |
|---|---|---|
| Default OFF | Nearby Friends is off by default | Users must enable it deliberately. Never auto-enable |
| Granular control | Choose who to share with | "Share with all friends" vs "Share with Close Friends only" |
| Easy OFF | One tap to turn off | Not Settings → Privacy → Location → Nearby Friends → Off |
| Auto OFF | Turn off automatically after a period | Option: off after 1 h, 4 h, 8 h — like WhatsApp Live Location |
| Visual indicator | Clearly show when sharing is on | Status bar icon, periodic reminder "You are sharing your location" |

6.2 Location Precision Control — Fuzzing

| Level | Precision | Use case | Implementation |
|---|---|---|---|
| Exact | ~10 m (GPS accuracy) | Close friends, family | Send raw GPS coordinates |
| Approximate | ~100-500 m | Regular friends | Round lat/lng to 2-3 decimal places (3 ≈ 110 m) |
| City-level | ~10 km | Acquaintances | Send the city name only, no coordinates |
| Hidden | N/A | Not sharing | Send no location; unsubscribe |

Implementation: server-side fuzzing. The client sends raw GPS; the server applies the precision level before publishing to Pub/Sub. The client never sees the logic — and cannot bypass it.
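A sketch of that server-side step; the level names follow the table, and the rounding choices are the assumption discussed above (3 decimal places ≈ 110 m):

```python
def fuzz(lat: float, lng: float, level: str) -> tuple[float, float] | None:
    """Apply the precision level before the location is published to Pub/Sub."""
    if level == "exact":
        return (lat, lng)                        # raw GPS, ~10 m
    if level == "approximate":
        return (round(lat, 3), round(lng, 3))    # ~110 m grid
    return None   # city-level / hidden: publish a city name or nothing at all
```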

6.3 Stalking Prevention

| Threat | Mitigation | Detail |
|---|---|---|
| Continuous tracking | Rate-limit visibility | Don't let user X view Y's location more than once per minute (client-side throttle) |
| Location history inference | Store no history | The server keeps only the latest location. No history API. The client shows real-time only |
| Fake accounts | Verification | Require phone verification to enable Nearby Friends |
| Harassment | Block + Report | Blocking auto-unsubscribes both directions. Reports go to the trust & safety team |
| Ghost Mode | Hide yourself but still see friends | With Ghost Mode on, A still subscribes to friends' channels (sees friends) but publishes no location → friends don't see A |
| Invisible to specific people | Per-friend setting | "Hide my location from Tuan" → unsubscribe Tuan from your channel |

6.4 Rate Limiting Location Updates

| Tier | Limit | Rationale |
|---|---|---|
| Per user | Min interval: 1 location update / 10 seconds | Stops clients from sending too often (battery drain, bandwidth) |
| Per user | Expected: 1 update / 30 seconds | Normal operation |
| Per server | 100K updates/s | Protects Redis from overload |
| Burst | 3 updates within 5 seconds (on enable) | Allows an initial burst to get a fix quickly |

See Tuan-09-Rate-Limiter for the Token Bucket implementation; a minimal sketch follows.
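A minimal token bucket matching the table: a full bucket of 3 allows the initial burst, and the refill rate sustains roughly one update per 30 seconds afterwards:

```python
import time

class TokenBucket:
    def __init__(self, capacity: float = 3.0, refill_per_s: float = 1 / 30):
        self.capacity = capacity
        self.tokens = capacity          # starts full -> burst of 3 on enable
        self.refill_per_s = refill_per_s
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_s)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                    # drop or delay this location update
```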

6.5 Data Retention — No Location History

| Data | Retention | Rationale |
|---|---|---|
| Current location (Redis) | TTL 120 seconds | Only the latest matters; auto-deleted |
| Location updates (logs) | Not stored | Privacy — no reason to keep them |
| Aggregated analytics | 90 days | "How many users enabled Nearby Friends?" — contains no locations |
| Pub/Sub messages | 0 — fire and forget | Redis Pub/Sub does not persist |

GDPR Article 5(1)(e): personal data may be kept only as long as necessary for the purpose of processing. Location data for Nearby Friends is only needed now — no archival.

6.6 GDPR and Data Protection Compliance

| Requirement | Implementation |
|---|---|
| Lawful basis | Consent (opt-in). Not legitimate interest — location is a "special category" |
| Right to be forgotten | On request: delete the location cache, unsubscribe all channels, delete account data |
| Data portability | Export: the list of friends location was shared with (no location data, since none is stored) |
| Purpose limitation | Location is used only for the Nearby Friends feature. Never for advertising, analytics, or third parties |
| Data minimization | Collect only lat, lng, timestamp. No altitude, speed, heading |
| Data Protection Impact Assessment (DPIA) | Mandatory — large-scale processing of location data |
| Data Processing Agreement | With the cloud provider (AWS/GCP) |

7. DevOps & Monitoring

7.1 Key Metrics — WebSocket Health

| Metric | Alert Threshold | Dashboard | Meaning |
|---|---|---|---|
| ws_connection_count per server | > 55K (110% capacity) | Gauge per server | Server near capacity; scale out |
| ws_connection_count total | < 8M (80% expected) | Single number | Possible problem — users failing to connect |
| ws_connection_churn_rate | > 5K/min per server | Time series | Many reconnections → network or server issue |
| ws_handshake_latency_p99 | > 500 ms | Histogram | Slow connections → check LB, check TLS |
| ws_message_send_errors | > 0.1% | Percentage | Messages failing to reach clients |

7.2 Key Metrics — Redis Pub/Sub

| Metric | Alert Threshold | Dashboard | Meaning |
|---|---|---|---|
| pubsub_channels_count | > 12M (120% expected) | Gauge | Many channels → many users online (good) or a leak (bad) |
| pubsub_messages_per_second | > 15M/s (>113% expected) | Time series | Near capacity → add shards |
| pubsub_subscribers_per_channel max | > 1000 | Histogram | Popular user → may need fan-out rate limiting |
| redis_memory_used per shard | > 80% max memory | Gauge | Increase memory or add shards |
| redis_connected_clients per shard | > 8K | Gauge | Many connections from WS servers |

7.3 Key Metrics — Location Propagation

| Metric | Alert Threshold | Dashboard | Meaning |
|---|---|---|---|
| location_propagation_latency_p50 | > 50 ms | Histogram | Median — should be < 50 ms |
| location_propagation_latency_p99 | > 500 ms | Histogram | Worst case — should be < 500 ms |
| location_propagation_latency_p999 | > 2 s | Histogram | Extreme — investigate if > 2 s |
| location_update_rate | < 200K/s (< 60% expected) | Time series | Users not sending updates → client bug? |
| nearby_friends_count avg per user | N/A (informational) | Histogram | How many nearby friends the average user sees |

7.4 Key Metrics — Business Health

| Metric | Alert Threshold | Dashboard | Meaning |
|---|---|---|---|
| feature_opt_in_rate | N/A (informational) | Percentage | % of DAU with Nearby Friends enabled |
| avg_session_duration | N/A (informational) | Time series | How long users keep the feature open |
| error_rate_by_type | > 1% for any error type | Stacked bar | Error breakdown: auth, timeout, redis, etc. |

7.5 Alerting Rules

| Alert | Condition | Severity | Action |
|---|---|---|---|
| WebSocket server overloaded | connections > 55K for 5 min | P2 | Auto-scale, add servers |
| Redis Pub/Sub throughput high | > 90% capacity for 10 min | P1 | Add shards, page on-call |
| Location propagation slow | p99 > 1 s for 5 min | P1 | Check Redis, check network |
| Connection churn spike | churn > 10K/min for 3 min | P2 | Check network, DNS, LB |
| Redis shard down | shard unreachable for 1 min | P1 | Auto-failover, page on-call |
| Cross-region bridge lag | > 5 s for 3 min | P2 | Check inter-region network |

See Tuan-13-Monitoring-Observability for alerting best practices and runbook structure.

7.6 Deployment Strategy

| Component | Strategy | Rationale |
|---|---|---|
| WebSocket servers | Rolling update with graceful drain | Stateful connections — drain before shutdown |
| Redis Pub/Sub | Add shards without downtime (resharding) | Must not restart — subscriptions would be lost |
| Redis Cache | Blue-green with replication | Fail over to a replica if the primary dies |
| Configuration changes (radius, TTL) | Feature flags (LaunchDarkly/Unleash) | Change without deploying |
| New features | Canary 5% → 25% → 100% | Catch problems early |

7.7 Load Testing Considerations

| Scenario | Test method | Target |
|---|---|---|
| 50K WebSocket connections per server | Locust/k6 WebSocket load test | Verify a server handles 50K connections |
| 333K location updates/s | Distributed load generators | Verify Redis Pub/Sub throughput |
| Fan-out explosion (a user with 5000 friends) | Synthetic test with high-fan-out users | Verify no cascading failures |
| Server crash during peak | Kill 1 WS server, observe reconnection | Verify reconnection < 30 s |
| Redis shard failure | Kill 1 Redis shard, observe failover | Verify auto-failover < 10 s |

8. Diagrams

8.1 Complete Location Update Flow

flowchart TB
    subgraph "1. Client sends location"
        Client_A["User A Mobile App"]
        GPS["GPS Module<br/>lat: 10.7769<br/>lng: 106.7009"]
        GPS --> Client_A
    end

    subgraph "2. WebSocket Server receives"
        WS_A["WS Server 1<br/>(serving User A)"]
    end

    subgraph "3. Parallel writes"
        Redis_Cache["Redis Cache<br/>SET user:A:location<br/>{lat, lng, ts}<br/>TTL=120s"]
        Redis_PubSub["Redis Pub/Sub<br/>PUBLISH channel:user_A<br/>{lat, lng, ts}"]
    end

    subgraph "4. Fan-out to subscribers"
        WS_B["WS Server 2<br/>(serving User B)<br/>subscribed to channel:user_A"]
        WS_C["WS Server 3<br/>(serving User C)<br/>subscribed to channel:user_A"]
        WS_D["WS Server 1<br/>(serving User D)<br/>subscribed to channel:user_A"]
    end

    subgraph "5. Distance check + push"
        Check_B["B at 10.78, 106.71<br/>d = 1.2km < 5mi ✓"]
        Check_C["C at 10.90, 106.85<br/>d = 20km > 5mi ✗"]
        Check_D["D at 10.77, 106.70<br/>d = 0.1km < 5mi ✓"]
    end

    subgraph "6. Client receives"
        Client_B["User B sees:<br/>A is 1.2km away"]
        Client_D["User D sees:<br/>A is 0.1km away"]
    end

    Client_A -->|"WebSocket"| WS_A
    WS_A --> Redis_Cache
    WS_A --> Redis_PubSub

    Redis_PubSub --> WS_B
    Redis_PubSub --> WS_C
    Redis_PubSub --> WS_D

    WS_B --> Check_B
    WS_C --> Check_C
    WS_D --> Check_D

    Check_B -->|"Push"| Client_B
    Check_C -->|"Skip — out of radius"| X["(no push)"]
    Check_D -->|"Push"| Client_D

    style Redis_Cache fill:#ef5350,color:#fff
    style Redis_PubSub fill:#ff7043,color:#fff
    style Check_C fill:#e0e0e0,color:#999
    style X fill:#e0e0e0,color:#999

8.2 Redis Pub/Sub Fan-out Detail

flowchart LR
    subgraph "User A updates location"
        A_Update["User A<br/>location_update<br/>{lat, lng}"]
    end

    subgraph "Redis Pub/Sub Shard 3<br/>(channel:user_A hashed to shard 3)"
        Channel_A["channel:user_A<br/>Subscribers: B, C, D, E, F"]
    end

    subgraph "Fan-out (5 messages)"
        M1["→ WS Server 2 (for B)"]
        M2["→ WS Server 3 (for C)"]
        M3["→ WS Server 1 (for D)"]
        M4["→ WS Server 4 (for E)"]
        M5["→ WS Server 2 (for F)"]
    end

    A_Update -->|"PUBLISH"| Channel_A
    Channel_A --> M1
    Channel_A --> M2
    Channel_A --> M3
    Channel_A --> M4
    Channel_A --> M5

    style Channel_A fill:#ef5350,color:#fff
    style A_Update fill:#42a5f5,color:#fff

8.3 WebSocket Server Scaling

flowchart TB
    subgraph "Load Balancer Layer"
        LB["L7 Load Balancer<br/>Least Connections strategy<br/>WebSocket upgrade support"]
    end

    subgraph "WebSocket Server Fleet (240 servers)"
        subgraph "AZ-1 (80 servers)"
            WS1_1["WS 1<br/>48K conn"]
            WS1_2["WS 2<br/>50K conn"]
            WS1_N["WS ...<br/>49K conn"]
        end
        subgraph "AZ-2 (80 servers)"
            WS2_1["WS 81<br/>50K conn"]
            WS2_2["WS 82<br/>47K conn"]
            WS2_N["WS ...<br/>50K conn"]
        end
        subgraph "AZ-3 (80 servers)"
            WS3_1["WS 161<br/>49K conn"]
            WS3_2["WS 162<br/>50K conn"]
            WS3_N["WS ...<br/>48K conn"]
        end
    end

    subgraph "Auto Scaling"
        ASG["Auto Scaling Group<br/>Min: 200 | Max: 400<br/>Target: 80% connection capacity"]
    end

    LB --> WS1_1 & WS1_2 & WS1_N
    LB --> WS2_1 & WS2_2 & WS2_N
    LB --> WS3_1 & WS3_2 & WS3_N

    ASG -.->|"Scale out/in"| WS1_N & WS2_N & WS3_N

    style LB fill:#42a5f5,color:#fff
    style ASG fill:#ff9800,color:#000

Why 3 Availability Zones? If one AZ goes down, the remaining two still handle 67% of the traffic. With the 20% buffer (240 servers vs the 200 needed), the system survives a single AZ failure.

8.4 Full System Architecture — Production Grade

flowchart TB
    subgraph "Client Layer"
        iOS["iOS App"]
        Android["Android App"]
    end

    subgraph "Edge Layer"
        CDN["CDN<br/>(static assets)"]
        DNS["DNS<br/>Latency-based routing"]
    end

    subgraph "Gateway Layer"
        ALB["Application Load Balancer<br/>WebSocket support<br/>TLS termination"]
        AuthZ["Auth Service<br/>JWT validation"]
    end

    subgraph "Application Layer"
        WS_Fleet["WebSocket Server Fleet<br/>240 servers across 3 AZs<br/>50K connections/server"]
    end

    subgraph "Data Layer"
        Redis_Cache2["Redis Cluster<br/>Location Cache<br/>3 nodes (1P+2R)"]
        Redis_PS["Redis Pub/Sub Cluster<br/>30 shards"]
    end

    subgraph "Supporting Services"
        User_Svc["User Service<br/>(Friends list)"]
        Notif_Svc["Notification Service<br/>(push notifications)"]
        Config_Svc["Config Service<br/>(feature flags)"]
    end

    subgraph "Storage"
        PG[("PostgreSQL<br/>Users, Friends<br/>1P + 2R")]
    end

    subgraph "Observability"
        Metrics["Prometheus + Grafana"]
        Logs["ELK Stack"]
        Traces["Jaeger/Zipkin"]
    end

    iOS & Android --> DNS
    DNS --> ALB
    ALB -->|"WS Upgrade"| AuthZ
    AuthZ -->|"Valid token"| WS_Fleet

    WS_Fleet --> Redis_Cache2
    WS_Fleet --> Redis_PS
    WS_Fleet --> User_Svc
    User_Svc --> PG

    WS_Fleet -.->|"User offline > 5min"| Notif_Svc

    WS_Fleet -.->|"Metrics"| Metrics
    WS_Fleet -.->|"Logs"| Logs
    WS_Fleet -.->|"Traces"| Traces

    Config_Svc -.->|"Feature flags"| WS_Fleet

    style ALB fill:#42a5f5,color:#fff
    style Redis_Cache2 fill:#ef5350,color:#fff
    style Redis_PS fill:#ff7043,color:#fff
    style PG fill:#66bb6a,color:#fff
    style Metrics fill:#ab47bc,color:#fff

9. Aha Moments & Pitfalls

9.1 Aha Moments — The key insights

Aha 1: Pub/Sub is much simpler than polling

The problem: how does User B learn that User A just updated their location?

The naive way: polling — every 30 seconds, B's server queries Redis for the locations of all 40 online friends. That is 10M users × 40 reads / 30 s ≈ 13.3M Redis reads/s.

The Pub/Sub way: B subscribes to their friends' channels. When A updates, B is notified automatically — Redis only absorbs ~333K publishes/s.

Pub/Sub cuts the Redis read load from 13.3M to 333K — a 40x reduction. That is the power of event-driven architecture.

Aha 2: WebSocket is stateful — and that's OK

Many backend devs fear stateful services because they are harder to scale. But for real-time features, stateful is mandatory — there is no other way to maintain a persistent bidirectional connection.

Lesson: not everything must be stateless. Stateful services have their place — the key is knowing how to scale them (connection draining, least-connections LB, graceful shutdown).

Aha 3: Redis Pub/Sub doesn't persist — and that's a feature

If one location update is missed because Pub/Sub is fire-and-forget, no harm done — a new update arrives 30 seconds later. The "no persistence" property of Redis Pub/Sub turns from a weakness into a strength: no disk, no cleanup, no retention policy.

Compare: Kafka persists every message, manages offsets, needs disk space and compaction. For Nearby Friends, all of that is unnecessary overhead.

Aha 4: Fan-out is the bottleneck, not latency

Each location update takes ~10-15 ms to propagate — extremely fast. The bottleneck is the number of messages to fan out: 13.3M/s. This is a throughput problem, not a latency problem.

Interview takeaway: when the interviewer asks "how would you optimize?", don't answer "reduce latency". Answer "reduce fan-out" or "increase the throughput of the Pub/Sub layer".

Aha 5: Server-side filtering saves serious bandwidth

Push only the friends within the radius instead of all friends. With ~40 online friends but only 5-10 in radius, the server sends 5-10 messages instead of 40 — roughly 75-87% fewer pushes.

75%+ of outbound bandwidth is saved by server-side distance calculation. For mobile users (4G/5G data caps), this matters a lot.

Aha 6: No geospatial index needed

Unlike Proximity Service (which needs geohash/quadtree to search 100M businesses), Nearby Friends only checks distances between a user and ~40 friends. 40 Haversine calculations are trivial — no spatial index required.

The key difference: Proximity Service is "find strangers within a radius" (search in a large dataset). Nearby Friends is "check the positions of people you already know" (lookup of known entities). Search needs an index. Lookup doesn't.

9.2 Pitfalls — Common mistakes

Pitfall 1: HTTP polling instead of WebSocket

HTTP polling for Nearby Friends is the biggest mistake. Every 30 seconds, 10M clients send an HTTP request → 333K QPS. Each request carries ~500 bytes of HTTP headers. And the server can't push — the client must pull.

Fix: WebSocket — persistent connection, bidirectional, minimal overhead (2-6 bytes/message).

Pitfall 2: Unhandled fan-out explosion

A user with 5000 friends, 500 of them online → every 30 seconds, a 500-message fan-out. If 1000 such users update at the same moment, that is 1000 × 500 = 500K messages in a single burst.

That can spike Redis Pub/Sub and ripple through the whole system.

Fix: rate-limit fan-out for users with many friends. Or batch: accumulate popular users' updates and publish them together every 5-10 seconds instead of one publish per update.

Pitfall 3: Mishandled WebSocket reconnection

Server restarts, network blips, backgrounded apps → connections drop. If clients don't reconnect, or reconnect all at once (thundering herd), the system can overload.

Fix: exponential backoff with jitter. Client retries: 1s + random(0-500ms), 2s + random, 4s + random, … capped at 60s.

Pitfall 4: Forgetting cleanup when a user goes offline

A user closes the app but the server never cleans up: the location cache lingers (TTL not yet expired), the Pub/Sub subscriptions stay active, and friends still see the user as "online".

Fix: three layers of cleanup:

  1. WebSocket disconnect event → immediate cleanup
  2. Missed heartbeats → detect and clean up
  3. Redis TTL → safety net that auto-deletes stale data

Pitfall 5: Storing location history "for later"

"We might need it for a feature someday" — and then a GDPR audit finds you have been storing the locations of 100M users every 30 seconds without consent.

Fix: don't store what you don't need. Nearby Friends only needs the current location. If you ever need location history, design a separate feature with separate consent.

Pitfall 6: Ignoring the precision-vs-privacy trade-off

Sending raw GPS coordinates (10 decimal places, ~0.1 mm precision) to every friend. Nobody needs to know where you are to the tenth of a millimeter.

Fix: round coordinates to 3-4 decimal places (~10-100 m precision). Precise enough to render on a map. Not precise enough to tell which room of the building you are in.

Pitfall 7: Never testing fan-out at production scale

Dev tests with 100 users, 10 friends each → fine. Production has 10M users, 400 friends each → Redis Pub/Sub overloads.

Fix: load-test with realistic numbers before launch. Simulate 10M connections, 333K updates/s, 13.3M fan-out messages/s.


10. Summary — Decision Framework

10.1 When to use the Pub/Sub pattern?

| Situation | Recommendation |
|---|---|
| Real-time updates to known recipients | Pub/Sub — subscribers are known in advance |
| Heavy fan-out where messages need not be durable | Redis Pub/Sub — fire-and-forget |
| Guaranteed delivery required | Kafka/RabbitMQ instead of Redis Pub/Sub |
| 1-to-1 messaging | Direct WebSocket, no Pub/Sub needed |

10.2 When to use WebSocket?

| Situation | Recommendation |
|---|---|
| Server must push data to the client | WebSocket |
| Client → server only | Plain HTTP is enough |
| Server → client only | SSE (Server-Sent Events) may be enough |
| Both directions + low latency | WebSocket — this is Nearby Friends |
| Low-frequency updates (every few minutes) | Long polling may be enough |

10.3 Architecture Decisions Recap

| Decision | Chosen | Alternative | Why |
|---|---|---|---|
| Communication protocol | WebSocket | HTTP Polling, SSE | Bidirectional, low overhead |
| Location propagation | Redis Pub/Sub | Kafka, RabbitMQ | Low latency, no persistence needed |
| Location storage | Redis (in-memory cache) | Database | Only the latest matters; TTL auto-deletes |
| Distance calculation | Server-side Haversine | Client-side, geohash pre-filter | Saves bandwidth; 40 calculations are trivial |
| Channel design | 1 channel per user | Per geohash, per friend-pair | Granular, privacy-friendly |
| Scaling WebSocket | Least-connections LB | Consistent hashing | Simple; cheap reconnections |
| Scaling Pub/Sub | Hash-based sharding | Consistent hashing | Simple, deterministic |
| Multi-region | Regional clusters + cross-region bridge | Global cluster | Low latency locally; bridge for remote pairs |

11. Related Topics

| Topic | Link | Relevance |
|---|---|---|
| Proximity Service (businesses) | Case-Design-Proximity-Service | Static vs dynamic locations, geohash indexing |
| Chat System (WebSocket) | Tuan-17-Design-Chat-System | WebSocket management, connection lifecycle, message delivery |
| Consistent Hashing | Tuan-10-Consistent-Hashing | Shard assignment for Redis Pub/Sub, server routing |
| Load Balancer | Tuan-05-Load-Balancer | WebSocket-aware LB, least-connections strategy |
| Message Queue | Tuan-08-Message-Queue | Pub/Sub vs message queue trade-offs |
| Cache Strategy | Tuan-06-Cache-Strategy | Redis caching patterns, TTL strategy |
| Rate Limiter | Tuan-09-Rate-Limiter | Rate limiting location updates, fan-out throttling |
| Monitoring | Tuan-13-Monitoring-Observability | WebSocket metrics, Pub/Sub monitoring, alerting |
| Security | Tuan-14-AuthN-AuthZ-Security | JWT for WebSocket auth, privacy controls |
| Database Replication | Tuan-07-Database-Sharding-Replication | Redis replication, PostgreSQL for user data |

12. Interview Tips

12.1 Answer structure

| Step | Content | Time |
|---|---|---|
| 1. Clarify requirements | Ask about scale, features, constraints | 3-5 min |
| 2. High-level design | WebSocket + Redis Cache + Redis Pub/Sub | 5-7 min |
| 3. Deep dive | Pick 2-3 topics to go deep on (Pub/Sub fan-out, scaling WebSocket, privacy) | 15-20 min |
| 4. Wrap up | Trade-offs, alternatives, monitoring | 3-5 min |

12.2 Questions interviewers often ask

| Question | Direction of the answer |
|---|---|
| "Why not HTTP polling?" | Bandwidth waste; the server must push; bidirectional need |
| "How do you scale WebSocket?" | Least-connections LB, graceful drain, auto-scaling |
| "What if Redis Pub/Sub goes down?" | Graceful degradation — the nearby list freezes; the client shows "last updated X min ago" |
| "How do you handle popular users?" | Rate-limit fan-out, batch updates, reduce frequency |
| "Privacy concerns?" | Opt-in, fuzzing, no history, GDPR, Ghost Mode |
| "Why no geohash?" | Only ~40 friends — brute-force Haversine is fast enough; no spatial index needed |
| "Alternatives to Redis Pub/Sub?" | Kafka (if durability is needed), custom in-memory Pub/Sub (if ultra-low latency is needed) |

12.3 Bonus points

| Topic | Detail | Impression |
|---|---|---|
| Battery optimization | Location updates every 30 s instead of continuously; use the iOS/Android significant location change APIs | Shows mobile awareness |
| Graceful degradation | Redis down → serve stale data, label it "approximate" | Shows resilience thinking |
| A/B testing | Test 30 s vs 60 s update intervals, test default radii | Shows product thinking |
| Cost estimation | 240 WS servers ≈ $17K/month | Shows business awareness |

“Nearby Friends is not a hard algorithms problem — no geohash, no quadtree, no Dijkstra. It is hard because of real-time communication at scale: 10 million WebSocket connections, 333 thousand location updates per second, 13 million Pub/Sub messages per second. It is a problem of infrastructure and engineering trade-offs, not of mathematics.”