Case Study: Design S3-like Object Storage

“S3 giống như một kho hàng khổng lồ của Amazon: mỗi món hàng có mã vạch duy nhất, tìm được ngay trong vài millisecond, không bao giờ mất, và kho có thể mở rộng vô hạn. Nhưng đằng sau sự đơn giản của API PUT/GET/DELETE là một trong những hệ thống phức tạp nhất từng được xây dựng — quản lý hàng trăm petabyte data trên hàng chục nghìn ổ cứng.”

Tags: system-design case-study object-storage s3 distributed-storage erasure-coding alex-xu-vol2 Student: Hieu Prerequisite: Tuan-01-Scale-From-Zero-To-Millions · Tuan-02-Back-of-the-envelope · Tuan-07-Database-Sharding-Replication Lien quan: Tuan-10-Consistent-Hashing · Tuan-15-Data-Security-Encryption · Case-Design-Google-Drive Reference: Alex Xu, System Design Interview Volume 2 — Chapter 9: S3-like Object Storage

Context & Why — Tại sao bài này quan trọng?

Analogy: Kho hàng khổng lồ Amazon

Hieu, hãy tưởng tượng em đang quản lý kho hàng khổng lồ nhất thế giới — giống như fulfillment center của Amazon. Trong kho này:

Mỗi món hàng có một mã vạch duy nhất (unique key) — quét mã vạch là tìm được ngay, dù kho có hàng tỷ món hàng
Kho được chia thành các khu vực (buckets) — khu vực “Điện tử”, khu vực “Sách”, khu vực “Quần áo”
Mỗi khu vực có sổ danh mục (metadata) — ghi rõ mã vạch, vị trí kệ, ngày nhập, kích thước, trọng lượng
Khi nhập hàng (PUT), người quản lý tìm kệ trống, đặt hàng vào, ghi vào sổ danh mục
Khi xuất hàng (GET), quét mã vạch, tra sổ → biết ngay hàng ở kệ nào, lấy ra ngay lập tức
Khi hủy hàng (DELETE), đánh dấu “đã hủy” trong sổ, cuối tuần có đội dọn dẹp thu gom (garbage collection)
Kho không bao giờ mất hàng — mỗi món đều có 3 bản sao ở 3 khu vực khác nhau, hoặc dùng kỹ thuật mã hóa đặc biệt (erasure coding) để phục hồi ngay cả khi cả khu vực bị cháy
Kho mở rộng vô hạn — cứ thêm nhà kho mới, kết nối vào hệ thống, không cần di chuyển hàng cũ

Đó chính xác là Amazon S3, Google Cloud Storage, Azure Blob Storage — một distributed object storage system.

Tại sao đây là bài toán kinh điển trong System Design Interview?

Khía cạnh	Tại sao khó?
Scale	100+ PB storage, hàng tỷ objects — mọi quyết định thiết kế đều bị amplify ở scale này
Durability	11 nines (99.999999999%) — mất 1 object trong 10 tỷ objects sau 10,000 năm
Metadata at scale	100 tỷ objects = 100 tỷ rows metadata — DB nào chịu được? Shard thế nào?
Erasure coding	Kỹ thuật toán học phức tạp nhưng tiết kiệm 50% storage so với replication
Garbage collection	Delete object ≠ xóa bytes ngay lập tức — cần GC phức tạp cho append-only storage
Listing at scale	Liệt kê objects trong bucket có hàng tỷ objects — không đơn giản như SELECT *
Consistency model	Strong consistency cho reads sau writes — trong distributed system, đây là thử thách lớn

Key Insight: Object storage nhìn đơn giản từ bên ngoài — chỉ là PUT/GET/DELETE. Nhưng bên trong, nó là sự kết hợp tinh vi của distributed file system, database sharding, erasure coding, garbage collection, và placement algorithms. Đây là bài toán tổng hợp nhiều kiến thức nhất trong system design.

Object Storage trong thực tế

Object storage là backbone của modern cloud:

Service	Công ty	Scale
Amazon S3	AWS	100+ exabytes, hàng nghìn tỷ objects
Google Cloud Storage	Google	Exabyte-scale
Azure Blob Storage	Microsoft	Exabyte-scale
MinIO	Open source	S3-compatible, dùng cho private cloud
Ceph RADOS	Open source	Unified storage (block + object + file)

Mỗi ngày:

S3 xử lý hàng chục triệu requests/second
Netflix, Airbnb, Spotify đều lưu data chính trên S3
95%+ các startup dùng S3 hoặc tương đương cho static assets, backups, data lakes

Step 1 — Understand the Problem & Establish Design Scope

1.1 Clarifying Questions (Câu hỏi làm rõ)

Trong interview, luôn hỏi interviewer trước khi thiết kế. Dưới đây là các câu hỏi quan trọng:

Câu hỏi	Trả lời / Giả định	Tại sao quan trọng
Những feature chính cần hỗ trợ?	Bucket CRUD, Object PUT/GET/DELETE, versioning, listing	Xác định scope — không cần build full IAM system
Storage scale?	100+ PB, hàng tỷ objects	Ảnh hưởng mọi quyết định architecture
Object size range?	Từ vài KB đến vài GB	Cần multipart upload cho large objects
Durability requirement?	11 nines (99.999999999%)	Quyết định replication vs erasure coding strategy
Availability requirement?	99.99% (4 nines)	High availability across data centers
Consistency model?	Strong read-after-write consistency	Sau khi PUT thành công, GET phải trả về data mới nhất
API compatibility?	S3-compatible REST API	Industry standard, dễ migrate
Versioning?	Có, optional per bucket	Ảnh hưởng metadata design
Storage classes?	Multiple tiers (hot/warm/cold/archive)	Cost optimization strategy
Multi-region?	Single region cho scope interview	Giảm complexity, focus vào core design

1.2 Functional Requirements (Yêu cầu chức năng)

FR1: Bucket operations — tạo, xóa, liệt kê buckets
FR2: Object upload (PUT) — upload object lên bucket, support objects từ vài bytes đến vài GB
FR3: Object download (GET) — download object bằng bucket name + object key
FR4: Object delete (DELETE) — xóa object (soft delete với versioning, hard delete không versioning)
FR5: Object listing — liệt kê objects trong bucket, hỗ trợ prefix filter và pagination
FR6: Object versioning — giữ lại multiple versions của cùng object, enable/disable per bucket
FR7: Multipart upload — upload large objects bằng cách chia thành parts, upload song song
FR8: Storage classes — hỗ trợ multiple storage tiers với lifecycle transitions

1.3 Non-Functional Requirements (Yêu cầu phi chức năng)

Yêu cầu	Mục tiêu	Giải thích
Durability (Độ bền)	99.999999999% (11 nines)	Mất 1 object trong 100 tỷ objects sau 10,000 năm
Availability (Sẵn sàng)	99.99% uptime	< 52 phút downtime/năm
Scalability (Mở rộng)	100+ PB, billions of objects	Horizontal scaling cho mọi component
Performance	TTFB < 100ms cho small objects, high throughput cho large objects	Ảnh hưởng trực tiếp tới user experience
Consistency	Strong read-after-write	GET sau PUT thành công phải trả về data mới nhất
Cost efficiency	Minimize storage cost per GB	Erasure coding, tiered storage, compression

1.4 Back-of-the-Envelope Estimation

Assumptions:

Thông số	Giá trị	Giải thích
Total storage	100 PB raw data	Large-scale object storage
Number of objects	100 billion ( $1 0^{11}$ )	Trung bình 1 KB metadata per object
Average object size	1 MB	Nhiều small objects, một số large objects
Daily new objects	1 billion	~1% growth per day
Read:Write ratio	4:1	Reads nhiều hơn writes
Object size distribution	60% < 1MB, 30% 1MB-100MB, 10% > 100MB	Long-tail distribution

Storage Estimation:

R a w s t or a g e = 100 PB = 100 \times 1 0^{15} b y t es

Wi t h er a s u re co d in g (1.5 x o v er h e a d) = 100 PB \times 1.5 = 150 PB t o t a l d i s k s p a ce

Wi t h 3 x re pl i c a t i o n (t ha y t h \overset{ˊ}{\overset{e}{^}}) = 100 PB \times 3 = 300 PB t o t a l d i s k s p a ce

Aha Moment: Erasure coding tiết kiệm 50% storage so với 3x replication (150 PB vs 300 PB) mà vẫn đảm bảo cùng mức durability. Ở scale 100 PB, sự khác biệt là 150 PB disk space — tương đương hàng chục triệu USD chi phí phần cứng.

S a v in g s = 300 PB - 150 PB = 150 PB \approx $4.5 M / ye a r (a t $30/ TB / ye a r)

Metadata DB Size Estimation:

M e t a d a t a p er o bj ec t \approx 1 K B (o bj ec t_k ey, b u c k e t_i d, v ers i o n_i d, s i ze, c h ec k s u m, t im es t am p s, s t or a g e_c l a ss)

T o t a l m e t a d a t a = 1 0^{11} \times 1 K B = 1 0^{11} \times 1 0^{3} b y t es = 1 0^{14} b y t es = 100 TB

Wi t h in d e x es (2 x) = 100 TB \times 2 = 200 TB m e t a d a t a s t or a g e

Quan trọng: 200 TB metadata — quá lớn cho single database. Cần sharding strategy. Tham chiếu Tuan-07-Database-Sharding-Replication.

QPS Estimation:

W r i t e QPS = \frac{1 0 ^{9} n e w o bj ec t s / d a y}{86 , 400 seco n d s / d a y} \approx 11, 574 w r i t es / s

R e a d QPS = 11, 574 \times 4 = 46, 296 re a d s / s

P e ak w r i t e QPS = 11, 574 \times 3 \approx 34, 722 w r i t es / s

P e ak re a d QPS = 46, 296 \times 3 \approx 138, 889 re a d s / s

Bandwidth Estimation:

W r i t e ban d w i d t h_{a vg} = 11, 574 o bj ec t s / s \times 1 MB = 11, 574 MB / s \approx 11.6 GB / s \approx 92.6 G b p s

R e a d ban d w i d t h_{a vg} = 46, 296 o bj ec t s / s \times 1 MB \approx 46.3 GB / s \approx 370 G b p s

Note: Bandwidth yêu cầu rất lớn — cần distribute across nhiều data nodes. Một server thường có NIC 10–25 Gbps, nên cần tối thiểu 15–37 data nodes chỉ riêng cho bandwidth (chưa tính storage capacity).

Disk Capacity per Node:

D i s k p er n o d e = 16 TB (t y p i c a l HDD)

N u mb er o f d i s k s n ee d e d = \frac{150 PB}{16 TB} = \frac{150 , 000 TB}{16 TB} = 9, 375 d i s k s

Wi t h 12 d i s k s / n o d e \Rightarrow \frac{9 , 375}{12} \approx 782 s t or a g e n o d es

Step 2 — High-Level Design

2.1 Các Component chính

Hieu, hệ thống S3-like object storage có thể chia thành 4 service lớn:

Component	Vai trò	Analogy kho hàng
API Service	Xử lý REST API requests, authentication, authorization, routing	Quầy tiếp nhận — nhận đơn hàng, kiểm tra giấy tờ, chuyển cho bộ phận xử lý
Metadata Service	Quản lý bucket info + object metadata (key, version, size, location)	Sổ danh mục kho — ghi nhận mọi món hàng: mã, vị trí kệ, ngày nhập
Data Service	Lưu trữ actual bytes trên disk, quản lý data files	Kệ hàng thực tế — nơi đặt món hàng, tổ chức theo hàng/cột
Placement Service	Quyết định data ghi vào node nào, quản lý cluster topology	Quản lý phân bổ — quyết định hàng mới đặt ở khu vực nào, kệ nào

2.2 Tại sao tách thành 4 services?

Lý do	Giải thích
Independent scaling	Metadata service cần nhiều CPU/RAM (query-heavy), Data service cần nhiều disk (storage-heavy)
Different failure modes	Metadata DB down ≠ data unavailable. Có thể serve cached reads khi metadata partially down
Technology fit	Metadata cần relational DB (structured queries), Data cần custom storage engine (sequential I/O)
Separation of concerns	Placement logic thay đổi independent — thêm nodes, rebalance, không ảnh hưởng API/Data

2.3 High-Level Architecture Diagram

flowchart TB
    subgraph "Client Layer"
        SDK["S3 SDK<br/>(AWS CLI, boto3, etc.)"]
        APP["Applications<br/>(Web, Mobile, Data Pipeline)"]
    end

    subgraph "API Layer"
        LB["Load Balancer<br/>(ALB / Nginx)"]
        API1["API Service 1<br/>(Stateless)"]
        API2["API Service 2<br/>(Stateless)"]
        API3["API Service N<br/>(Stateless)"]
    end

    subgraph "Metadata Layer"
        META["Metadata Service"]
        METADB[("Metadata DB<br/>(Sharded PostgreSQL)<br/>• Bucket table<br/>• Object table")]
        METACACHE[("Redis Cluster<br/>• Hot metadata cache")]
    end

    subgraph "Placement Layer"
        PLACE["Placement Service<br/>(Leader-based)<br/>• Cluster map<br/>• Consistent hashing"]
    end

    subgraph "Data Layer"
        DN1["Data Node 1<br/>• Data files<br/>• Checksum DB"]
        DN2["Data Node 2<br/>• Data files<br/>• Checksum DB"]
        DN3["Data Node 3<br/>• Data files<br/>• Checksum DB"]
        DNN["Data Node N<br/>• Data files<br/>• Checksum DB"]
    end

    SDK & APP --> LB
    LB --> API1 & API2 & API3
    API1 & API2 & API3 --> META
    API1 & API2 & API3 --> PLACE
    API1 & API2 & API3 --> DN1 & DN2 & DN3 & DNN
    META --> METADB
    META --> METACACHE
    PLACE -.->|"heartbeat"| DN1 & DN2 & DN3 & DNN

2.4 API Design (S3-compatible REST API)

Operation	HTTP Method	Endpoint	Mô tả
Create bucket	PUT	`/{bucket}`	Tạo bucket mới
Delete bucket	DELETE	`/{bucket}`	Xóa bucket (phải rỗng)
List buckets	GET	`/`	Liệt kê tất cả buckets của user
Upload object	PUT	`/{bucket}/{key}`	Upload object vào bucket
Download object	GET	`/{bucket}/{key}`	Download object
Delete object	DELETE	`/{bucket}/{key}`	Xóa object
List objects	GET	`/{bucket}?prefix=&delimiter=&max-keys=&continuation-token=`	Liệt kê objects với filtering
Initiate multipart	POST	`/{bucket}/{key}?uploads`	Bắt đầu multipart upload
Upload part	PUT	`/{bucket}/{key}?partNumber=&uploadId=`	Upload một part
Complete multipart	POST	`/{bucket}/{key}?uploadId=`	Hoàn thành multipart upload

2.5 Luồng hoạt động tổng quan

Upload flow (PUT object):

Client gửi PUT request → Load Balancer → API Service
API Service xác thực request (auth, bucket exists, permissions)
API Service hỏi Placement Service: “Object này nên ghi vào data nodes nào?”
Placement Service trả về danh sách data nodes (primary + replicas / erasure coding group)
API Service gửi data → Data Node(s), data được ghi vào data file (append-only)
Data Node trả về confirmation + storage location (file_id, offset, length)
API Service ghi metadata vào Metadata Service (object_key → storage_location mapping)
Trả về 200 OK cho client

Download flow (GET object):

Client gửi GET request → Load Balancer → API Service
API Service query Metadata Service: “Object này lưu ở đâu?” → nhận được (node_id, file_id, offset, length)
API Service gửi read request đến Data Node tương ứng
Data Node đọc data từ disk, verify checksum, trả về bytes
API Service stream data về client

Step 3 — Design Deep Dive

3.1 Object Storage vs Block Storage vs File Storage

Trước khi deep dive, Hieu cần hiểu rõ tại sao chúng ta chọn object storage chứ không phải block storage hay file storage.

So sánh ba loại storage

Đặc điểm	Block Storage	File Storage	Object Storage
Đơn vị lưu trữ	Block (512B – 4KB)	File trong directory tree	Object (any size) với unique key
Access method	Block device (mount as disk)	File system (NFS, SMB)	REST API (PUT/GET/DELETE)
Metadata	Minimal (block address)	File attributes (name, permissions, dates)	Rich, custom metadata (unlimited key-value)
Hierarchy	Flat (block addresses)	Directory tree (hierarchical)	Flat namespace (bucket + key)
Modify in-place	Có — overwrite bất kỳ block nào	Có — edit file in-place	Không — immutable, phải write lại toàn bộ object
Performance	Lowest latency (< 1ms)	Low latency (1-10ms)	Higher latency (10-100ms)
Scalability	Limited by single server/SAN	Limited by file system (billions of files = slow)	Virtually unlimited — horizontal scale
Use case	Database disks, VM storage, boot volumes	Shared files, home directories, legacy apps	Static assets, backups, data lakes, media
Ví dụ	AWS EBS, SAN	AWS EFS, NFS servers	AWS S3, GCS, Azure Blob

Tại sao object storage scales tốt hơn?

Lý do	Giải thích
Flat namespace	Không có directory tree → không bị bottleneck khi listing lớn, không cần lock directory
Immutable objects	Write-once, read-many → không cần distributed locking, không conflict, dễ cache
Separation of metadata & data	Metadata service scale independently từ data service
No POSIX semantics	Không cần support open/close/seek/lock → đơn giản hóa distributed protocol
Commodity hardware	Chạy trên HDD thường, không cần expensive SAN/NAS

Aha Moment: Object storage hy sinh latency và in-place modification để đạt được unlimited scalability và extreme durability. Đó là trade-off cốt lõi. Khi data lên đến petabytes, đây là trade-off đúng đắn.

3.2 Data Storage Design — Trái tim của hệ thống

3.2.1 Tại sao không lưu mỗi object thành một file trên disk?

Câu hỏi tự nhiên: tại sao không lưu mỗi object thành một file riêng trên file system? Đơn giản, dễ hiểu mà?

Vấn đề	Giải thích
Inode exhaustion	Mỗi file tiêu tốn 1 inode. ext4 mặc định ~256M inodes per filesystem. Với hàng tỷ objects → hết inodes trước khi hết disk space
Small file inefficiency	File 1 KB vẫn chiếm 1 block (4 KB) trên disk → waste 75% cho small objects
Directory listing	Directory chứa hàng triệu files → `ls` chậm, file system metadata fragmented
Filesystem overhead	Mỗi file cần metadata (permissions, timestamps, xattr) → overhead per file ~256–512 bytes
Random I/O	Nhiều small files = random I/O pattern → HDD performance rất tệ (100-200 IOPS)

3.2.2 Giải pháp: Append-only Data Files (WAL-like approach)

Thay vì lưu mỗi object thành một file, chúng ta pack nhiều objects vào chung một data file lớn — giống cách Write-Ahead Log (WAL) hoạt động.

Cách hoạt động:

Mỗi Data Node có một write-active data file (ví dụ: data_001.dat, size limit ~8 GB)
Khi nhận object mới, append object bytes vào cuối data file hiện tại
Ghi nhận vị trí: (file_id, offset, length) — đây chính là storage location của object
Khi data file đầy (đạt 8 GB), seal nó (read-only), tạo data file mới
Object bị delete? Không xóa bytes — chỉ đánh dấu trong metadata. Garbage collector sẽ xử lý sau

Tại sao append-only?

Lợi ích	Giải thích
Sequential I/O	Chỉ write vào cuối file → sequential write → HDD đạt throughput 100-200 MB/s (vs 1-2 MB/s random)
No fragmentation	Không overwrite, không delete → disk không bị fragment
Simple concurrency	Chỉ 1 writer append → không cần lock toàn bộ file
Easy replication	Replicate nguyên data file → đơn giản, reliable
Fast recovery	Scan data file from beginning → rebuild state

Mapping object_id → physical location:

Field	Mô tả	Ví dụ
`object_id`	UUID unique của object	`a1b2c3d4-e5f6-...`
`data_node_id`	Node nào lưu object	`node-42`
`file_id`	Data file nào trên node đó	`data_001.dat`
`offset`	Vị trí byte bắt đầu trong file	`1048576` (1 MB)
`length`	Kích thước object (bytes)	`524288` (512 KB)

Khi GET object, hệ thống: query metadata → lấy (data_node_id, file_id, offset, length) → gửi request đến data node → data node seek(offset) rồi read(length) → trả về bytes.

3.2.3 Data File Compaction

Vì objects bị delete chỉ được đánh dấu chứ không xóa bytes thật, data files sẽ có “holes” (dead space). Cần compaction để thu hồi space:

Background process scan data files, tính % dead space
Khi dead space > threshold (ví dụ 30%), trigger compaction
Compaction: đọc tất cả live objects từ file cũ → ghi vào data file mới → cập nhật metadata locations → xóa file cũ
Compaction chạy offline, không ảnh hưởng read/write path

Analogy: Giống như defragmentation trên Windows ngày xưa — sắp xếp lại data liên tục, loại bỏ khoảng trống.

3.3 Metadata Service — Bộ não của hệ thống

Metadata service là component quan trọng nhất — nếu data nodes mất data, erasure coding có thể recover. Nhưng nếu metadata mất → không biết object nào ở đâu → toàn bộ data trở nên vô nghĩa.

3.3.1 Schema Design

Bucket Table:

Column	Type	Mô tả
`bucket_id`	BIGINT (PK)	Auto-increment hoặc Snowflake ID
`bucket_name`	VARCHAR(63)	Globally unique, DNS-compatible
`owner_id`	BIGINT (FK)	User sở hữu bucket
`acl`	JSON	Access Control List
`versioning_enabled`	BOOLEAN	Bật/tắt versioning
`default_storage_class`	ENUM	STANDARD, IA, GLACIER, ARCHIVE
`created_at`	TIMESTAMP	Thời điểm tạo
`region`	VARCHAR	Region bucket thuộc về

Note: Bucket table nhỏ (triệu rows) → không cần shard. Single DB + read replicas là đủ.

Object Table (sharded):

Column	Type	Mô tả
`bucket_id`	BIGINT	Bucket chứa object
`object_name`	VARCHAR(1024)	Key (path) của object trong bucket
`object_id`	UUID	Internal unique ID, dùng để map tới physical location
`version_id`	BIGINT	Version number (auto-increment per object)
`is_latest`	BOOLEAN	Đánh dấu version mới nhất
`is_delete_marker`	BOOLEAN	True nếu đây là delete marker (versioning)
`size`	BIGINT	Object size in bytes
`etag`	VARCHAR(32)	MD5 hash của content (integrity check)
`content_type`	VARCHAR(256)	MIME type
`storage_class`	ENUM	STANDARD, IA, GLACIER, ARCHIVE
`data_node_id`	INT	Node lưu data
`file_id`	INT	Data file ID trên node đó
`offset`	BIGINT	Byte offset trong data file
`length`	BIGINT	Object size trên disk (có thể khác `size` nếu compressed)
`checksum`	VARCHAR(64)	CRC32 hoặc SHA-256 của stored data
`created_at`	TIMESTAMP	Thời điểm upload
`custom_metadata`	JSON	User-defined metadata (x-amz-meta-*)

Composite Primary Key: (bucket_id, object_name, version_id)

Indexes:

(bucket_id, object_name, is_latest) — cho GET latest version
(bucket_id, object_name_prefix) — cho LIST operations (prefix scan)

3.3.2 Sharding Strategy

Với 100 tỷ objects, cần shard metadata DB. Shard key là bucket_id:

Sharding option	Ưu điểm	Nhược điểm
Shard by bucket_id (chọn)	Objects trong cùng bucket nằm cùng shard → LIST efficient, range query tốt	Hot bucket (bucket có hàng tỷ objects) → hot shard
Shard by hash(bucket_id + object_name)	Distribute đều hơn	LIST operations cần scatter-gather across shards → chậm
Shard by object_id	Distribute đều nhất	LIST by prefix hoàn toàn bất khả thi → cần secondary index

Trade-off: Chọn shard by bucket_id vì LIST operations rất phổ biến (80%+ API calls cho S3 là LIST hoặc GET). Hot bucket problem giải quyết bằng cách split hot bucket across multiple shards (sub-sharding by prefix).

3.3.3 Metadata Caching

Redis cluster cache hot metadata:

Cache pattern	Khi nào dùng
Bucket metadata	Cache toàn bộ (small dataset) — bucket ACL, versioning config
Object metadata	Cache recent writes (write-through) — location mapping cho GET
LIST results	Cache paginated results (TTL short, 5-10s) — invalidate on write

3.4 Placement Service — Bộ phận phân bổ

Placement service quyết định data ghi vào node nào — đây là core intelligence của hệ thống.

3.4.1 Vai trò

Chức năng	Mô tả
Cluster map	Biết tất cả data nodes: IP, capacity, health status, rack, failure domain
Placement decision	Khi API service hỏi “ghi object này ở đâu?”, trả về danh sách target nodes
Failure domain awareness	Đảm bảo replicas/erasure coding chunks nằm trên different racks/AZs
Rebalancing	Khi add/remove nodes, quyết định data cần di chuyển
Heartbeat monitoring	Nhận heartbeat từ data nodes, detect failures

3.4.2 Virtual Cluster Map & Consistent Hashing

Placement service dùng consistent hashing (tham chiếu Tuan-10-Consistent-Hashing) để phân bổ data:

Mỗi data node được map lên virtual nodes trên hash ring
Object’s object_id được hash → rơi vào vị trí trên ring → N virtual nodes tiếp theo (theo chiều kim đồng hồ) là target nodes
Virtual nodes cho phép distribute data đều hơn giữa physical nodes
Khi add/remove node → chỉ cần move data từ/tới adjacent nodes trên ring

Failure domain rules:

Rule	Mô tả
Rack awareness	Replicas phải nằm trên ít nhất 2 racks khác nhau
AZ awareness	Nếu có multiple AZs, replicas phải span ít nhất 2 AZs
Disk awareness	Không đặt 2 replicas trên cùng physical disk

3.4.3 Leader-based Architecture

Placement service chạy dưới dạng leader-based cluster (3 hoặc 5 nodes, sử dụng Raft/Paxos):

Lý do dùng leader	Giải thích
Single source of truth	Cluster map phải consistent — không thể có 2 versions khác nhau
Coordination	Rebalancing decisions cần central coordinator để tránh conflicting moves
Heartbeat aggregation	Leader collect heartbeats, single point of failure detection

Data nodes gửi heartbeat mỗi 5-10 giây tới placement service. Nếu miss 3 heartbeats liên tiếp → node coi như dead → trigger re-replication.

3.5 Data Durability — Bảo vệ data bằng mọi giá

Durability 11 nines nghĩa là gì?

P (d a t a l oss) = 1 - 0.99999999999 = 1 0^{- 11}

N \overset{ˊ}{\overset{e}{ˆ}} u l ư u 100 tỷ objects \to m \overset{ˊ}{\overset{a}{ˆ}} t 100 \times 1 0^{9} \times 1 0^{- 11} = 1 o bj ec t t ro n g 10, 000 n \overset{a}{˘} m

Để đạt được mức durability này, có 2 strategies chính:

3.5.1 Strategy 1: Replication (3x)

Cách hoạt động: Lưu 3 bản sao giống hệt nhau trên 3 data nodes khác nhau.

Đặc điểm	Giá trị
Storage overhead	3x (200% overhead)
Write amplification	3x (ghi 1 object = ghi 3 lần)
Read performance	Tốt — có thể read từ bất kỳ replica nào
Recovery speed	Nhanh — chỉ cần copy nguyên bản từ healthy replica
Fault tolerance	Tolerate 2 node failures
Complexity	Thấp — dễ implement, dễ hiểu

Khi nào dùng: Small-scale systems, hot data cần low latency reads, khi simplicity quan trọng hơn cost.

3.5.2 Strategy 2: Erasure Coding (Reed-Solomon)

Cách hoạt động: Thay vì copy nguyên bản, dùng toán học để encode data thành k data chunks + m parity chunks.

Reed-Solomon Encoding:

Input: k data chunks \to Output: k + m total chunks (k data + m parity)

C \overset{o}{ˊ} thể m \overset{ˊ}{\overset{a}{ˆ}} t b \overset{ˊ}{\overset{a}{ˆ}} t k \overset{y}{ˋ} m chunks \to v \tilde{\overset{a}{ˆ}} n recover đư ợc to \overset{a}{ˋ} n bộ data từ k chunks c \overset{o}{ˋ} n lại

Ví dụ phổ biến: (8, 4) Erasure Coding — 8 data chunks + 4 parity chunks = 12 total chunks:

Đặc điểm	Giá trị
Data chunks (k)	8
Parity chunks (m)	4
Total chunks	12
Storage overhead	$\frac{k + m}{k} = \frac{12}{8} = 1.5 x$ (50% overhead)
Fault tolerance	Tolerate any 4 chunk losses
Compare to replication	3x replication = 200% overhead, tolerates 2 failures. EC(8,4) = 50% overhead, tolerates 4 failures

Aha Moment: Erasure coding (8,4) dùng 50% overhead nhưng chịu được 4 failures — trong khi 3x replication dùng 200% overhead chỉ chịu được 2 failures. Erasure coding vượt trội hoàn toàn ở scale lớn.

So sánh chi tiết Replication vs Erasure Coding:

Tiêu chí	3x Replication	EC (8,4)	Winner
Storage overhead	200%	50%	EC
Fault tolerance	2 failures	4 failures	EC
Write latency	Thấp (parallel write 3 copies)	Cao hơn (encode + distribute 12 chunks)	Replication
Read latency	Thấp (read 1 copy)	Cao hơn (đọc k=8 chunks + decode)	Replication
Recovery speed	Nhanh (copy 1 full replica)	Chậm hơn (đọc k chunks + decode + write)	Replication
Network usage (recovery)	Transfer 1 full object	Transfer k chunks	Depends on object size
Complexity	Đơn giản	Phức tạp (math + distributed coordination)	Replication
Cost at scale (100PB)	$9M/year (300 PB disk)	$4.5M/year (150 PB disk)	EC

Annual savings of EC vs Replication = (300 - 150) PB \times $30/ TB / ye a r = 150, 000 TB \times $30 = $4, 500, 000/ ye a r

Thực tế trong industry: S3 dùng erasure coding cho hầu hết storage classes. Chỉ dùng replication cho data cần ultra-low latency. Google Colossus dùng Reed-Solomon. Facebook f4 warm storage dùng EC (10,4). Azure LRC (Local Reconstruction Codes) là variant tối ưu hơn.

3.5.3 Durability Calculation

Với erasure coding (8,4), 12 chunks trên 12 nodes khác nhau:

P (single node failure/year) = 0.01 (1% — t y p i c a l f or ser v er)

P (data loss) = P (more than 4 nodes fail simultaneously)

P (data loss) = i = 5 \sum 12 (i 12) \times 0.0 1^{i} \times 0.9 9^{12 - i}

P (data loss) \approx (5 12) \times 0.0 1^{5} \approx 792 \times 1 0^{- 10} \approx 7.92 \times 1 0^{- 8}

Với thêm background scrubbing + fast recovery (giảm window of vulnerability), durability thực tế đạt $1 0^{- 11}$ trở lên → 11 nines.

3.6 Data Integrity — Checksum & Self-Healing

3.6.1 Checksum per Object

Mỗi object (hoặc chunk trong erasure coding) đều có checksum được tính khi write:

Checksum algorithm	Speed	Collision resistance	Dùng khi
CRC32	Rất nhanh	Thấp (32-bit)	Integrity check nội bộ, per-block verification
MD5	Nhanh	Trung bình (128-bit)	ETag trong S3, backward compatible
SHA-256	Chậm hơn	Rất cao (256-bit)	Content verification, security-sensitive

Write path: tính checksum → lưu checksum cùng data → trả ETag cho client Read path: đọc data → tính lại checksum → so sánh với stored checksum → nếu mismatch → corrupt!

3.6.2 Periodic Background Scrubbing

Dù data không bị đọc, nó vẫn có thể bị bit rot (silent data corruption do phần cứng):

Loại corruption	Nguyên nhân	Tần suất
Bit rot	Cosmic rays, magnetic decay trên HDD	~1 bit lỗi per 10^14-10^15 bits read
Firmware bugs	Disk firmware write sai vị trí	Rare nhưng catastrophic
Bad sectors	HDD surface degradation	Tăng theo tuổi disk
Silent corruption	Data sai nhưng không báo error	Nguy hiểm nhất vì không biết

Scrubbing process:

Background worker scan tất cả data files trên mỗi node (cycle time: 2-4 tuần)
Đọc mỗi object/chunk → tính checksum → so sánh với stored checksum
Nếu mismatch → object bị corrupt → trigger self-healing

3.6.3 Self-Healing

Khi phát hiện corruption:

Đánh dấu corrupt chunk là invalid
Đọc data từ healthy copies: với replication, đọc từ replica khác; với erasure coding, decode từ k healthy chunks
Re-write chunk mới lên node khác (hoặc cùng node nếu disk vẫn healthy)
Cập nhật placement metadata
Log event cho monitoring

Key Insight: Self-healing chạy tự động, liên tục, không cần human intervention. Hệ thống luôn trong trạng thái “đang tự sửa chữa” — giống cơ thể con người tự chữa lành vết thương nhỏ mà ta không nhận ra.

3.7 Garbage Collection — Dọn dẹp data đã xóa

3.7.1 Tại sao cần Garbage Collection?

Vì data files là append-only, khi user DELETE object:

Metadata được đánh dấu “deleted” (hoặc thêm delete marker nếu versioning)
Bytes trên disk KHÔNG bị xóa — vẫn nằm trong data file
Nếu không dọn dẹp → disk space leak → eventually hết disk

3.7.2 Mark-and-Sweep GC

Phase	Mô tả
Mark phase	Scan metadata DB → tìm tất cả deleted objects → lập danh sách `(file_id, offset, length)` cần thu hồi
Sweep phase	Với mỗi data file có dead objects: đọc live objects → ghi vào data file mới → cập nhật metadata → xóa file cũ

3.7.3 Compaction Details

Compaction (sweep phase) chỉ trigger khi dead space vượt threshold:

De a d r a t i o = \frac{Tổng size deleted objects trong file}{Tổng size data file}

Trigger compaction khi De a d r a t i o > 30%

Compaction flow:

Chọn data file có dead ratio > threshold
Lock file (chuyển sang read-only nếu đang write-active — nhưng data files sealed thì đã read-only)
Đọc tuần tự, skip dead objects, ghi live objects vào new data file
Atomic swap: cập nhật metadata pointers từ old file → new file
Xóa old data file
Report freed space cho placement service

3.7.4 Orphan Cleanup

Orphans là data mà không có metadata nào trỏ tới — xảy ra khi:

Write succeed trên data node nhưng metadata write fail (crash giữa chừng)
Multipart upload bị abort nhưng parts chưa được cleanup

Detection: Periodic scan so sánh data files vs metadata → data không có metadata = orphan → xóa sau grace period (24-48 giờ, phòng trường hợp write đang in-flight).

3.8 Versioning — Quản lý nhiều phiên bản

3.8.1 Cách Versioning hoạt động

Khi bucket có versioning enabled:

Hành động	Không versioning	Có versioning
PUT object (key đã tồn tại)	Overwrite — version cũ mất	Tạo version mới, version cũ vẫn còn
DELETE object	Xóa vĩnh viễn	Thêm delete marker — object “ẩn” nhưng versions cũ vẫn truy cập được
GET object	Trả về object hiện tại (hoặc 404)	Trả về latest version (hoặc 404 nếu latest là delete marker)
GET object?versionId=X	Không hỗ trợ	Trả về version cụ thể
DELETE object?versionId=X	Không hỗ trợ	Xóa vĩnh viễn version cụ thể

3.8.2 Version Chain trong Metadata

Mỗi version là một row riêng trong Object table:

bucket_id	object_name	version_id	is_latest	is_delete_marker	data_location
1	`photos/cat.jpg`	3	true	false	`(node-5, file-12, 8192, 524288)`
1	`photos/cat.jpg`	2	false	false	`(node-3, file-7, 4096, 262144)`
1	`photos/cat.jpg`	1	false	false	`(node-8, file-2, 0, 131072)`

Khi DELETE photos/cat.jpg (versioning enabled):

bucket_id	object_name	version_id	is_latest	is_delete_marker	data_location
1	`photos/cat.jpg`	4	true	true	NULL
1	`photos/cat.jpg`	3	false	false	`(node-5, file-12, 8192, 524288)`
1	`photos/cat.jpg`	2	false	false	`(node-3, file-7, 4096, 262144)`
1	`photos/cat.jpg`	1	false	false	`(node-8, file-2, 0, 131072)`

GET photos/cat.jpg → thấy latest version là delete marker → return 404. Nhưng GET photos/cat.jpg?versionId=3 → return version 3 bình thường.

3.8.3 Lifecycle Policies

Lifecycle policies tự động quản lý versions:

Policy	Mô tả	Ví dụ
Expiration	Xóa versions cũ hơn N ngày	Giữ tối đa 90 ngày
Max versions	Giữ tối đa N versions per object	Giữ 5 versions mới nhất
Transition	Chuyển versions cũ sang storage class rẻ hơn	Sau 30 ngày → chuyển sang GLACIER
Delete marker cleanup	Xóa delete markers khi không còn version nào khác	Tránh accumulate orphan delete markers

3.9 Multipart Upload — Upload file lớn

3.9.1 Tại sao cần Multipart Upload?

Vấn đề với single PUT	Giải pháp bằng Multipart
Object 5 GB upload fail ở 4.9 GB → upload lại từ đầu	Chia thành 1000 parts x 5 MB, fail part nào → retry part đó
Bandwidth bottleneck: 1 TCP connection	Upload nhiều parts song song qua multiple connections
Timeout risk cho large objects	Mỗi part upload nhanh, timeout risk thấp
Memory pressure: buffer toàn bộ object	Stream từng part, memory usage thấp

3.9.2 Multipart Upload Flow

Phase 1: Initiate

Client gửi POST /{bucket}/{key}?uploads
Server tạo upload_id, lưu vào metadata DB
Trả về upload_id cho client

Phase 2: Upload Parts

Client chia object thành parts (5 MB - 5 GB per part, tối đa 10,000 parts)
Upload mỗi part: PUT /{bucket}/{key}?partNumber=N&uploadId=XXX
Server lưu mỗi part vào data node, ghi metadata (upload_id, part_number, etag, size, location)
Client upload song song nhiều parts → tận dụng bandwidth tối đa

Phase 3: Complete (hoặc Abort)

Complete: Client gửi danh sách (part_number, etag) → Server verify, concatenate parts logically, tạo object metadata entry mới
Abort: Client gửi abort → Server cleanup tất cả uploaded parts (orphan cleanup)

Note: “Concatenate parts logically” không có nghĩa copy bytes. Server chỉ cần lưu danh sách part locations trong metadata → GET object sẽ đọc parts theo thứ tự. Tránh data movement không cần thiết.

3.9.3 Multipart Upload Timeout

Multipart uploads chưa complete/abort sau 7 ngày → coi như abandoned → auto-abort + cleanup parts. Lifecycle policy có thể configure timeout này.

3.10 Listing Objects — Khó hơn tưởng tượng

3.10.1 Tại sao Listing khó?

Object storage có flat namespace — không có thật sự directories. Nhưng users kỳ vọng thấy folder-like structure:

photos/
  2024/
    january/
      cat.jpg
      dog.jpg
    february/
      bird.jpg
  2025/
    march/
      fish.jpg

Thực tế trong storage, đây chỉ là 4 objects với keys:

photos/2024/january/cat.jpg
photos/2024/january/dog.jpg
photos/2024/february/bird.jpg
photos/2025/march/fish.jpg

3.10.2 Prefix + Delimiter Listing

S3 API dùng prefix + delimiter để simulate directories:

Request	prefix	delimiter	Kết quả
List “root” of photos/	`photos/`	`/`	CommonPrefixes: `[photos/2024/, photos/2025/]`
List 2024 folder	`photos/2024/`	`/`	CommonPrefixes: `[photos/2024/january/, photos/2024/february/]`
List january folder	`photos/2024/january/`	`/`	Contents: `[cat.jpg, dog.jpg]`
List all in photos/	`photos/`	(none)	Contents: `[all 4 objects]`

3.10.3 Pagination với Continuation Token

Bucket có hàng tỷ objects — không thể trả tất cả trong 1 response. S3 dùng pagination:

Parameter	Mô tả
`max-keys`	Tối đa bao nhiêu results per page (default 1000)
`continuation-token`	Opaque token trỏ tới vị trí tiếp theo (thường là last key encoded)
`is-truncated`	True nếu còn results → client gửi request tiếp với continuation-token

Tại sao dùng continuation token thay vì offset/limit?

offset/limit	continuation-token
Skip N rows → chậm khi N lớn (O(N) scan)	Seek directly tới vị trí (O(log N) B-tree lookup)
Kết quả thay đổi nếu data thay đổi giữa pages	Stable — luôn tiếp tục từ vị trí cũ
Stateless nhưng performance tệ	Stateless + performance tốt

Pitfall: Listing objects at scale (bucket có hàng tỷ objects) là một trong những operations tốn kém nhất. S3 charge riêng cho LIST requests (đắt hơn GET). Khi design, cần index prefix-based trong metadata DB để listing efficient.

3.11 Storage Classes — Tối ưu chi phí

3.11.1 Phân loại Storage Classes

Storage Class	Access Pattern	Latency	Cost (relative)	Durability	Ví dụ
Standard (Hot)	Frequent access	Milliseconds	$$$$ (1x)	11 nines	Active app data, website assets
Infrequent Access (Warm)	1-2 lần/tháng	Milliseconds	$$$ (0.5x)	11 nines	Backups < 6 tháng, logs
Glacier / Cold	1-2 lần/năm	Minutes → Hours	$$ (0.2x)	11 nines	Compliance archives, old backups
Deep Archive	Gần như không access	12-48 hours	$ (0.05x)	11 nines	Legal hold, 10-year retention

3.11.2 Cách implement khác nhau giữa các tiers

Aspect	Hot / Warm	Cold	Archive
Storage media	HDD (hoặc SSD cho ultra-hot)	High-density HDD, drives spin down	Tape library hoặc ultra-cold HDD
Erasure coding	EC(8,4) — balance durability + performance	EC(10,6) — higher parity, tolerate more failures	EC(12,4) — optimize cho space
Data placement	Spread across active nodes	Packed dense, cold nodes	Offline media, spin-up on demand
Retrieval	Instant	Instant (nhưng retrieval fee cao)	Async — submit job, wait hours
Minimum duration	None	30 ngày (charge nếu xóa sớm)	90-180 ngày

3.11.3 Lifecycle Transitions

Objects tự động transition giữa classes theo policy:

Rule	Ví dụ
Standard → IA sau 30 ngày	Logs mới: access thường xuyên → sau 30 ngày ít access hơn
IA → Glacier sau 90 ngày	Logs cũ: gần như không access nhưng cần giữ theo compliance
Glacier → Delete sau 365 ngày	Retention policy: giữ 1 năm rồi xóa
Standard → IA cho versions cũ	Current version: hot. Previous versions: warm

Cost optimization impact: Transition 100 PB từ Standard ( $23/ TB / m o n t h) \to Gl a c i er ($ 4/TB/month) tiết kiệm:

S a v in g s = 100, 000 TB \times ($23 - $4) / m o n t h = 100, 000 \times $19 = $1, 900, 000/ m o n t h = $22, 800, 000/ ye a r

Estimation — Capacity Planning tổng hợp

Storage Budget

Raw data = 100 PB

Erasure coded (1.5x) = 150 PB

Compaction overhead (20% temporary) = 150 \times 1.2 = 180 PB p e ak

Metadata (with indexes) = 200 TB

Total storage needed \approx 180 PB + 0.2 PB = 180.2 PB

Metadata DB Sizing

Objects = 1 0^{11}

Row size \approx 1 K B

Index size \approx 1 K B p er ro w (m u lt i pl e in d e x es)

Total metadata = 1 0^{11} \times 2 K B = 200 TB

Shard size target = 500 GB (mana g e ab l e p er D B in s t an ce)

Number of shards = \frac{200 TB}{500 GB} = \frac{200 , 000 GB}{500 GB} = 400 s ha r d s

With 3x replication per shard = 400 \times 3 = 1, 200 D B in s t an ces

Network Bandwidth

Write bandwidth (avg) = 11, 574 o bj / s \times 1 MB = 11.6 GB / s

Write bandwidth with EC overhead = 11.6 \times 1.5 = 17.4 GB / s = 139 G b p s

Read bandwidth (avg) = 46, 296 o bj / s \times 1 MB = 46.3 GB / s = 370 G b p s

Total bandwidth = 139 + 370 = 509 G b p s

Hardware Estimation

Storage nodes = \frac{180 PB}{16 TB / d i s k \times 12 d i s k s / n o d e} = \frac{180 , 000 TB}{192 TB} \approx 938 n o d es

Metadata DB instances = 1, 200

API service instances = \frac{180 , 000 QPS}{5 , 000 QPS / in s t an ce} = 36 in s t an ces

Placement service instances = 5 (R a f t c l u s t er)

Security — Bảo mật Object Storage

Pre-signed URLs

Pre-signed URLs cho phép tạm thời chia sẻ access tới private objects mà không cần expose credentials:

Đặc điểm	Mô tả
Cách tạo	Server sign URL bằng secret key + expiration time
Format	`https://bucket.s3.amazonaws.com/key?X-Amz-Signature=xxx&X-Amz-Expires=3600`
Timeout	Configurable (1 giây → 7 ngày)
Use case	Share private file, allow upload without AWS credentials
Security	URL hết hạn → access bị revoke tự động

Bucket Policies & ACL

Mechanism	Granularity	Mô tả
Bucket Policy	Bucket-level	JSON policy gắn vào bucket — ai được làm gì (allow/deny)
Object ACL	Object-level	Per-object permissions (owner, read, write)
IAM Policy	User/Role-level	Gắn vào user/role — user này được access buckets nào

Evaluation order: Deny overrides Allow. Explicit Deny > Explicit Allow > Implicit Deny (default).

Encryption at Rest

Mode	Key Management	Ai giữ key?	Use case
SSE-S3	S3 managed keys	S3 tự quản lý	Default, đơn giản nhất
SSE-KMS	KMS managed keys	Customer manage key in KMS	Audit trail, key rotation, fine-grained control
SSE-C	Customer-provided keys	Customer giữ key, gửi kèm mỗi request	Full control, S3 không lưu key

Encryption flow (SSE-KMS):

Client PUT object → API service
API service gọi KMS: “Encrypt data key with master key”
KMS trả về encrypted data key + plaintext data key
API service dùng plaintext data key để encrypt object (AES-256)
Lưu encrypted object + encrypted data key vào storage
Xóa plaintext data key khỏi memory
Khi GET: lấy encrypted data key → gọi KMS decrypt → dùng plaintext key decrypt object → trả về client

Envelope Encryption: KMS không encrypt toàn bộ object (quá lớn). Thay vào đó, encrypt một data key nhỏ, dùng data key đó để encrypt object. Gọi là envelope encryption vì data key được “bọc” (wrapped) bởi master key.

Encryption in Transit

Protocol	Mô tả
TLS 1.2/1.3	Tất cả API calls phải qua HTTPS
Mutual TLS	Giữa internal services (API → Data Node, API → Metadata Service)
VPC Endpoints	Access S3 qua private network, traffic không đi qua internet

Access Logging

Mọi API call được log để audit:

Field	Ví dụ
Timestamp	`2026-03-18T10:30:00Z`
Requester	`arn:aws:iam::123456:user/hieu`
Operation	`REST.PUT.OBJECT`
Bucket	`my-app-bucket`
Key	`uploads/report.pdf`
Status	`200`
Bytes transferred	`1048576`
Source IP	`203.0.113.42`
User Agent	`aws-cli/2.0`

Access logs được lưu vào separate bucket — tạo thành audit trail bất biến. Tham chiếu Tuan-15-Data-Security-Encryption.

DevOps — Monitoring & Operations

Durability Monitoring

Metric	Mô tả	Alert threshold
Under-replicated chunks	Chunks có fewer copies than target	> 0.01% → alert
Unrecoverable chunks	Chunks đã mất quá nhiều copies, không thể recover	> 0 → P0 critical
Scrubbing progress	% data đã được verify trong current cycle	< 50% sau 2 tuần → investigate
Corrupt chunks detected	Số chunks corrupt phát hiện bởi scrubbing	Trend tăng → disk health issue
Recovery throughput	Speed re-replicate/re-encode after node failure	Quá chậm → tăng bandwidth allocation

Disk & Node Health

Metric	Mô tả	Alert threshold
Disk failure rate	% disks failed trong 30 ngày	> 2% → review hardware batch
SMART indicators	Predictive disk failure signals	Reallocated sectors > 100 → schedule replacement
Node heartbeat	Heartbeat miss count	3 consecutive misses → node declared dead
Disk I/O latency	P99 read/write latency per disk	P99 > 100ms → disk degrading
Disk utilization	% capacity used per disk	> 85% → add capacity

Placement & Rebalancing

Metric	Mô tả	Alert threshold
Placement skew	Max node usage / Average node usage	> 1.5 → rebalance needed
Rebalance progress	% data moved so far / total to move	Stalled > 1 hour → investigate
Cross-rack distribution	% objects có chunks trên cùng rack	> 0 → placement bug
Node capacity variance	Standard deviation of disk usage across nodes	> 20% → check placement algorithm

Garbage Collection

Metric	Mô tả	Alert threshold
GC backlog	Total dead space not yet reclaimed	> 20% of total storage → increase GC throughput
Compaction rate	GB reclaimed per hour	Trending down → investigate
Orphan count	Data without metadata references	Trending up → check write path
GC latency	Time from delete to space reclaimed	> 7 days → increase GC frequency

Storage Utilization per Class

Metric	Mô tả
Capacity per class	TB stored in Standard / IA / Glacier / Archive
Transition volume	GB transitioned between classes per day
Retrieval volume	GB retrieved from cold/archive per day
Cost per class	Monthly cost breakdown by storage class

Operational Wisdom: Ở scale 100+ PB, disk failures xảy ra hàng ngày. Hệ thống phải tự healing liên tục. Monitoring không phải để “phát hiện sự cố” mà để “confirm hệ thống đang tự healing đúng cách”.

Mermaid Diagrams

Overall Architecture (Chi tiết)

flowchart TB
    subgraph "Client Layer"
        CLI["AWS CLI<br/>s3 cp, s3 sync"]
        SDK["SDK<br/>(boto3, aws-sdk-js)"]
        CONSOLE["Web Console<br/>(S3 Dashboard)"]
    end

    subgraph "Edge & Gateway"
        DNS["DNS<br/>(Route 53)"]
        LB["Load Balancer<br/>(L7 ALB)"]
        AUTH["Auth Service<br/>(IAM / STS)"]
    end

    subgraph "API Layer (Stateless)"
        API1["API Service"]
        API2["API Service"]
        APIN["API Service"]
    end

    subgraph "Metadata Layer"
        METASVC["Metadata Service<br/>(Query Router)"]
        SHARD1[("Shard 1<br/>bucket_id 0-99")]
        SHARD2[("Shard 2<br/>bucket_id 100-199")]
        SHARDN[("Shard N<br/>bucket_id ...")]
        REDIS[("Redis Cluster<br/>Metadata Cache")]
    end

    subgraph "Placement Layer"
        PL["Placement Leader"]
        PF1["Placement Follower"]
        PF2["Placement Follower"]
    end

    subgraph "Data Layer (Storage Nodes)"
        DN1["Data Node 1<br/>Rack A / AZ-1<br/>12 x 16TB HDD"]
        DN2["Data Node 2<br/>Rack B / AZ-1<br/>12 x 16TB HDD"]
        DN3["Data Node 3<br/>Rack A / AZ-2<br/>12 x 16TB HDD"]
        DNN["Data Node N<br/>Rack B / AZ-2<br/>12 x 16TB HDD"]
    end

    subgraph "Background Services"
        GC["Garbage Collector<br/>Mark-and-Sweep"]
        SCRUB["Scrubbing Service<br/>Integrity Check"]
        LIFECYCLE["Lifecycle Manager<br/>Transitions + Expiry"]
        REBAL["Rebalancer<br/>Data Migration"]
    end

    CLI & SDK & CONSOLE --> DNS
    DNS --> LB
    LB --> AUTH
    AUTH --> API1 & API2 & APIN

    API1 & API2 & APIN --> METASVC
    API1 & API2 & APIN --> PL
    API1 & API2 & APIN --> DN1 & DN2 & DN3 & DNN

    METASVC --> SHARD1 & SHARD2 & SHARDN
    METASVC --> REDIS

    PL --- PF1 & PF2
    PL -.->|"heartbeat"| DN1 & DN2 & DN3 & DNN

    GC --> METASVC
    GC --> DN1 & DN2 & DN3 & DNN
    SCRUB --> DN1 & DN2 & DN3 & DNN
    LIFECYCLE --> METASVC
    REBAL --> PL
    REBAL --> DN1 & DN2 & DN3 & DNN

Write Flow (PUT Object)

sequenceDiagram
    participant C as Client
    participant LB as Load Balancer
    participant API as API Service
    participant AUTH as Auth Service
    participant META as Metadata Service
    participant PLACE as Placement Service
    participant DN1 as Data Node 1<br/>(Primary)
    participant DN2 as Data Node 2<br/>(Replica/EC)
    participant DN3 as Data Node 3<br/>(Replica/EC)

    C->>LB: PUT /my-bucket/photos/cat.jpg
    LB->>API: Forward request

    API->>AUTH: Verify signature + permissions
    AUTH-->>API: Authorized

    API->>META: Check bucket exists + versioning config
    META-->>API: Bucket OK, versioning=enabled

    API->>PLACE: Where to store? (object_id, size)
    PLACE-->>API: Target nodes: [DN1, DN2, DN3]<br/>Strategy: 3x replication

    Note over API: Generate object_id (UUID)<br/>Compute checksum (MD5)

    par Write to all replicas
        API->>DN1: Write data (primary)
        API->>DN2: Write data (replica 2)
        API->>DN3: Write data (replica 3)
    end

    DN1-->>API: OK (file_id=7, offset=1048576, length=524288)
    DN2-->>API: OK (file_id=3, offset=2097152, length=524288)
    DN3-->>API: OK (file_id=11, offset=0, length=524288)

    Note over API: Wait for W replicas<br/>(quorum write: W=2 of 3)

    API->>META: Save object metadata<br/>(key, version, locations, checksum)
    META-->>API: Metadata saved

    API-->>C: 200 OK<br/>ETag: "d41d8cd98f00b204e9800998ecf8427e"<br/>VersionId: "v3"

Erasure Coding Visualization

flowchart LR
    subgraph "Original Object (8 MB)"
        OBJ["Object Data<br/>8 MB"]
    end

    subgraph "Split into k=8 Data Chunks"
        D1["Data 1<br/>1 MB"]
        D2["Data 2<br/>1 MB"]
        D3["Data 3<br/>1 MB"]
        D4["Data 4<br/>1 MB"]
        D5["Data 5<br/>1 MB"]
        D6["Data 6<br/>1 MB"]
        D7["Data 7<br/>1 MB"]
        D8["Data 8<br/>1 MB"]
    end

    subgraph "Reed-Solomon Encoder"
        ENC["RS Encoder<br/>k=8, m=4<br/>Generate 4 parity"]
    end

    subgraph "m=4 Parity Chunks"
        P1["Parity 1<br/>1 MB"]
        P2["Parity 2<br/>1 MB"]
        P3["Parity 3<br/>1 MB"]
        P4["Parity 4<br/>1 MB"]
    end

    subgraph "Distributed to 12 Nodes"
        N1["Node 1<br/>Data 1 ✅"]
        N2["Node 2<br/>Data 2 ✅"]
        N3["Node 3<br/>Data 3 ❌"]
        N4["Node 4<br/>Data 4 ✅"]
        N5["Node 5<br/>Data 5 ✅"]
        N6["Node 6<br/>Data 6 ❌"]
        N7["Node 7<br/>Data 7 ✅"]
        N8["Node 8<br/>Data 8 ✅"]
        N9["Node 9<br/>Parity 1 ❌"]
        N10["Node 10<br/>Parity 2 ✅"]
        N11["Node 11<br/>Parity 3 ❌"]
        N12["Node 12<br/>Parity 4 ✅"]
    end

    OBJ --> D1 & D2 & D3 & D4 & D5 & D6 & D7 & D8
    D1 & D2 & D3 & D4 & D5 & D6 & D7 & D8 --> ENC
    ENC --> P1 & P2 & P3 & P4

    D1 --> N1
    D2 --> N2
    D3 --> N3
    D4 --> N4
    D5 --> N5
    D6 --> N6
    D7 --> N7
    D8 --> N8
    P1 --> N9
    P2 --> N10
    P3 --> N11
    P4 --> N12

Trong diagram trên: 4 nodes bị fail (Node 3, 6, 9, 11 — marked ❌). Vẫn còn 8 healthy chunks (>= k=8) → recover được 100% data. Đây là sức mạnh của erasure coding.

Garbage Collection Flow

flowchart TB
    subgraph "Phase 1: Mark"
        A["GC Scheduler<br/>(periodic trigger)"] --> B["Scan Metadata DB<br/>Find deleted objects"]
        B --> C["Build deletion list<br/>(file_id, offset, length)"]
        C --> D["Group by data file<br/>Calculate dead ratio per file"]
    end

    subgraph "Phase 2: Filter"
        D --> E{"Dead ratio<br/>> 30%?"}
        E -->|"No"| F["Skip file<br/>(not worth compacting)"]
        E -->|"Yes"| G["Add to compaction queue"]
    end

    subgraph "Phase 3: Sweep (Compaction)"
        G --> H["Read data file<br/>sequentially"]
        H --> I["For each object:<br/>Check metadata - alive?"]
        I -->|"Alive"| J["Copy to new data file"]
        I -->|"Dead"| K["Skip (don't copy)"]
        J --> L["All objects processed?"]
        K --> L
        L -->|"No"| I
        L -->|"Yes"| M["Atomic metadata update:<br/>old locations → new locations"]
    end

    subgraph "Phase 4: Cleanup"
        M --> N["Delete old data file"]
        N --> O["Report freed space<br/>to Placement Service"]
        O --> P["Update storage metrics"]
    end

Metadata Service Architecture

flowchart TB
    subgraph "API Services"
        API1["API 1"]
        API2["API 2"]
        APIN["API N"]
    end

    subgraph "Metadata Service Layer"
        ROUTER["Query Router<br/>(Shard Locator)"]
        CACHE["Redis Cluster<br/>L1 Cache<br/>(Hot objects, Bucket info)"]
    end

    subgraph "Shard Group 1 (bucket_id 0-999)"
        S1P[("Primary<br/>PostgreSQL")]
        S1R1[("Replica 1<br/>(Sync)")]
        S1R2[("Replica 2<br/>(Async)")]
    end

    subgraph "Shard Group 2 (bucket_id 1000-1999)"
        S2P[("Primary<br/>PostgreSQL")]
        S2R1[("Replica 1<br/>(Sync)")]
        S2R2[("Replica 2<br/>(Async)")]
    end

    subgraph "Shard Group N"
        SNP[("Primary<br/>PostgreSQL")]
        SNR1[("Replica 1<br/>(Sync)")]
        SNR2[("Replica 2<br/>(Async)")]
    end

    subgraph "Shard Config"
        ZK["ZooKeeper / etcd<br/>(Shard mapping config)"]
    end

    API1 & API2 & APIN --> ROUTER
    ROUTER --> CACHE
    ROUTER --> S1P & S2P & SNP
    S1P --> S1R1 & S1R2
    S2P --> S2R1 & S2R2
    SNP --> SNR1 & SNR2
    ROUTER --> ZK

    style CACHE fill:#f9f,stroke:#333

Read path: Query Router → check Redis cache → nếu miss → tính shard từ bucket_id → query đúng shard primary (cho strong consistency) hoặc replica (cho eventually consistent reads)

Write path: Query Router → tính shard → write to shard primary → primary replicate to sync replica → return success → async replicate to async replica

Aha Moments & Pitfalls

Aha Moments — Những insight quan trọng

#	Insight	Chi tiết
1	Erasure coding > Replication at scale	Ở scale 100+ PB, erasure coding tiết kiệm hàng triệu USD/năm so với 3x replication, với durability tốt hơn. Đây là kỹ thuật then chốt mà mọi large-scale storage system đều dùng.
2	Metadata là bottleneck, không phải data	Data nodes scale linearly bằng cách thêm disk. Nhưng metadata DB chịu toàn bộ query load — mỗi GET, PUT, DELETE, LIST đều cần metadata lookup. Shard metadata đúng cách là key to scalability.
3	Append-only đơn giản hóa mọi thứ	Khi data chỉ append, không overwrite, không delete → không cần locks, không fragmentation, sequential I/O, easy replication. Trade-off: cần GC — nhưng GC chạy background, offline, và predictable.
4	Listing objects khó đến bất ngờ	Flat namespace + hàng tỷ objects + prefix-based “directory” simulation + pagination = bài toán indexing phức tạp. S3 LIST là operation đắt nhất (về compute) và được charge cao nhất.
5	Immutability là superpower	Object storage objects are immutable — write once, read many. Điều này cho phép: aggressive caching (content never changes), easy replication (no conflicts), simple consistency (no concurrent writes to same data).
6	Separation of concerns chính là architecture	4 services (API, Metadata, Data, Placement) có thể scale, fail, upgrade independently. Đây là essence của distributed system design — decompose by concern, not by function.
7	Checksum everywhere	Ở scale lớn, bit rot và silent corruption là chắc chắn sẽ xảy ra, không phải “có thể”. Background scrubbing + self-healing là must-have, không phải nice-to-have.
8	Cost optimization > Performance optimization	Object storage bài toán chính là cost, không phải latency. Tiered storage, erasure coding, compaction — tất cả đều nhằm giảm $/GB. Performance “good enough” là đủ.

Pitfalls — Những bẫy thường gặp trong interview

#	Pitfall	Giải thích	Cách tránh
1	Lưu mỗi object thành một file	Inode exhaustion, small file inefficiency, random I/O → system sụp ở scale lớn	Append-only data files, pack objects liên tiếp
2	Chỉ dùng 3x replication	200% overhead ở 100 PB = lãng phí $4.5M/năm. Interviewer kỳ vọng bạn biết erasure coding	Đề xuất erasure coding cho durability, replication cho hot path
3	Bỏ qua metadata scalability	”Dùng single MySQL” cho 100 tỷ objects → impossible. Metadata là bottleneck #1	Sharding by bucket_id, caching, read replicas
4	Quên garbage collection	Append-only mà không GC → disk đầy. Đây là phần dễ bị bỏ qua nhất	Mention mark-and-sweep, compaction, orphan cleanup
5	Không tách Placement Service	Hardcode data node selection trong API service → không rebalance được, không failure domain aware	Separate placement service với cluster map + consistent hashing
6	Listing dùng offset/limit	Offset 1 triệu trong SQL = scan 1 triệu rows → extremely slow	Continuation token (cursor-based pagination)
7	Quên data integrity	Không checksum, không scrubbing → silent corruption tích lũy → data loss	Checksum per object, periodic scrubbing, self-healing
8	Đánh đồng delete = xóa bytes	Trong append-only system, delete chỉ là đánh dấu metadata. Bytes trên disk vẫn còn	Giải thích rõ soft delete + GC reclaims space later
9	Thiết kế cho performance thay vì cost	Object storage không cần sub-millisecond latency. Tối ưu sai metric → over-engineering	Focus on $/GB, durability, scalability trước; performance “good enough”
10	Quên multipart upload	Object 5 GB upload trong single PUT → timeout, no retry, memory pressure	Multipart upload cho objects > 100 MB

Wrap Up — Step 4 trong Interview

Những điểm nên mention khi kết thúc

Topic	Nói gì
Single point of failure	Metadata service là critical — cần sharding + replication + caching. Placement service dùng Raft cluster. API service stateless, dễ scale.
Data integrity	Checksum at every layer: client compute ETag, data node verify on write, scrubbing verify periodically, self-healing repair automatically
Global scale	Multi-region replication (cross-region replication), DNS-based routing tới nearest region, eventual consistency cho cross-region reads
Cost optimization	Storage classes + lifecycle policies + erasure coding → minimize $/GB. Ở scale 100+ PB, 1% savings = millions USD/year
Consistency model	Strong read-after-write cho single region. Eventual consistency cho cross-region. Tham chiếu Tuan-07-Database-Sharding-Replication
Evolution path	Start với replication → migrate to erasure coding khi scale grows. Start với single-region → add cross-region replication. Incremental complexity.

So sánh với hệ thống thực tế

Feature	Amazon S3	Google Cloud Storage	Azure Blob	MinIO
Durability	11 nines	11 nines	16 nines (LRS-RA-GRS)	Depends on config
Consistency	Strong (since 2020)	Strong	Strong	Strong
Erasure coding	Yes (custom)	Yes (custom)	Yes (LRC)	Yes (Reed-Solomon)
Storage classes	6 tiers	4 tiers	4 tiers	Tiering via ILM
Max object size	5 TB	5 TB	4.75 TB	5 TB
Multipart	Yes (10,000 parts)	Yes (composite objects)	Yes (50,000 blocks)	Yes
Versioning	Yes	Yes	Yes (snapshots)	Yes

Internal Links — Liên kết nội bộ

Link	Liên quan đến phần nào
Tuan-07-Database-Sharding-Replication	Metadata DB sharding by bucket_id, replication strategy cho durability, consistency models
Tuan-10-Consistent-Hashing	Placement service dùng consistent hashing để distribute objects across data nodes
Tuan-15-Data-Security-Encryption	Encryption at rest (SSE-S3/KMS/C), envelope encryption, key management
Case-Design-Google-Drive	Google Drive dùng object storage (S3) làm backend — understanding S3 internals giúp hiểu sâu hơn Drive architecture
Tuan-01-Scale-From-Zero-To-Millions	Horizontal scaling strategy, stateless services, load balancing
Tuan-02-Back-of-the-envelope	Capacity estimation methodology dùng trong bài này
Tuan-06-Cache-Strategy	Redis cache cho metadata service, cache invalidation strategy
Tuan-08-Message-Queue	Async processing cho GC, lifecycle transitions, rebalancing events

“S3 dạy em rằng simplicity bên ngoài đòi hỏi complexity khổng lồ bên trong. PUT/GET/DELETE — chỉ 3 operations. Nhưng đằng sau là erasure coding, append-only storage, distributed metadata, placement algorithms, garbage collection, background scrubbing, lifecycle management. Khi em hiểu cách S3 hoạt động, em hiểu được essence của distributed storage — và đó là foundation cho mọi hệ thống lưu trữ data ở scale lớn.”

lthieu's notes

Explorer

Case-Design-S3-Object-Storage