Week 11: Microservices Patterns

“Microservices are not a destination. They are a strategy for organizing a system, and if you adopt them at the wrong time you end up with a distributed monolith: all the complexity of microservices with none of the benefits.”

Tags: system-design microservices architecture devops security alex-xu
Student: Hieu
Prerequisite: Tuan-04-API-Design-REST-gRPC · Tuan-08-Message-Queue · Tuan-05-Load-Balancer
Related: Tuan-12-CICD-Pipeline · Tuan-13-Monitoring-Observability · Tuan-14-AuthN-AuthZ-Security · Tuan-15-Data-Security-Encryption

1. Context & Why

Everyday analogy

Hieu, imagine you are running a restaurant:

Monolith = a restaurant with a single kitchen

One kitchen cooks everything: pho, broken rice, pizza, sushi
If the kitchen breaks down → the whole restaurant closes
Adding a new dish → you have to rework the whole kitchen layout
Every cook has to know how to make everything
When it gets busy, you cannot expand just the pho station

Microservices = a food court with many specialized stalls

A pho stall, a broken rice stall, a pizza stall: each one runs on its own
The pho stall breaks down → the other stalls keep selling as usual (fault isolation)
Want to add a bun bo stall → set up one new stall without affecting the others (independent deployment)
Each stall has its own specialist cook (team autonomy)
The pho stall is packed → open 2 more pho stalls, no need to expand the pizza stall (independent scaling)

But a food court has costs too:

An information desk at the entrance to direct customers → API Gateway
A central ordering system to coordinate orders across stalls → Message Queue / Saga
Cameras watching the whole food court → Distributed Tracing / Monitoring
Security guards checking badges between stalls → mTLS / Zero Trust

Why is this week 11?

Because before tackling microservices, Hieu needs a solid grasp of:

API Design (Week 4): how services communicate
Load Balancer (Week 5): how load is spread across instances
Message Queue (Week 8): the foundation of async communication
Consistent Hashing (Week 10): how data and traffic are distributed

Microservices are not the first step. As Alex Xu puts it: “Start with monolith, evolve to microservices when complexity demands it.”

2. Deep Dive: Microservices Architecture End to End

2.1 Monolith vs Microservices: Detailed Trade-offs

Criterion	Monolith	Microservices
Deployment	Deploy the whole system every time	Deploy each service independently
Scaling	Scale the whole app (vertically or horizontally)	Scale each service separately
Codebase	A single repo (or a monorepo)	Multi-repo, or a monorepo with enforced boundaries
Technology	A single tech stack	Each service can use a different tech stack
Data	One shared database	Database per service (ideally)
Team	All developers work in one codebase	Each team owns 1-2 services
Testing	Simple integration tests (in-process)	Needs contract testing; E2E tests are complex
Debugging	Clear stack traces	Distributed tracing (Jaeger, Zipkin)
Latency	Function call = nanoseconds	Network call = milliseconds
Consistency	ACID transactions are easy	Eventual consistency, Saga pattern
Operational cost	Low (1 app, 1 DB, 1 deploy pipeline)	High (N apps, N DBs, N pipelines)

When SHOULD you use microservices?

The team has more than 10-15 developers who are starting to step on each other's toes in the codebase
Different parts need to be deployed at different frequencies
Parts need to scale separately (for example, the search service needs 20 instances while the user service needs only 3)
You need real fault isolation (one part dying must not take the whole system down)
The codebase has grown too large (> 500K LOC) and build time exceeds 30 minutes

When should you NOT?

Small team (< 5 devs): operational overhead will eat up the productivity gains
The product's domain boundaries are not clear yet: splitting in the wrong place is very painful
No DevOps maturity yet (CI/CD, monitoring, containerization)
Still at the MVP stage: development speed matters more than scalability

2.2 Service Decomposition Strategies

Strategy 1: Decompose by Business Domain

This is the most common approach, based on the Bounded Context concept from DDD (Domain-Driven Design):

E-commerce Platform
├── User Service          → Accounts, profiles
├── Product Service       → Catalog, inventory
├── Order Service         → Order placement, order status
├── Payment Service       → Payments, refunds
├── Shipping Service      → Delivery, tracking
├── Notification Service  → Email, SMS, push notifications
├── Search Service        → Full-text search, recommendations
└── Review Service        → Ratings, comments

Principle: Each service maps 1:1 to a business capability. If you split by technical layer instead (one service for the “database layer”, another for the “business logic layer”), that is a mistake, because every business change will then require touching several services.

Strategy 2: Decompose by Subdomain (DDD approach)

DDD divides a domain into three kinds of subdomain:

Type	Characteristics	Example (E-commerce)	Strategy
Core Domain	Competitive advantage, most complex	Recommendation Engine, Pricing Engine	Build in-house, invest the most
Supporting Domain	Supports the core, not a USP	Order Management, Inventory	Build, or customize off-the-shelf
Generic Domain	Common to every business	Authentication, Payment Gateway, Notification	Buy/use SaaS (Auth0, Stripe, Twilio)

Context Mapping (how Bounded Contexts relate to each other):

┌──────────────────┐     Shared Kernel     ┌──────────────────┐
│  Order Context    │◄────────────────────►│  Payment Context  │
│                   │                       │                   │
│  - Order          │   Anti-Corruption     │  - Transaction    │
│  - OrderItem      │       Layer           │  - Refund         │
│  - OrderStatus    │◄─────────────────────│  - PaymentMethod  │
└──────────────────┘                       └──────────────────┘
        │                                           │
        │  Customer/Supplier                        │  Conformist
        ▼                                           ▼
┌──────────────────┐                       ┌──────────────────┐
│ Shipping Context  │                       │ External Payment  │
│                   │                       │ Provider (Stripe) │
│  - Shipment       │                       │                   │
│  - Carrier        │                       │  (Conform to their│
│  - TrackingInfo   │                       │   API contract)   │
└──────────────────┘                       └──────────────────┘

Anti-Corruption Layer (ACL): When service A talks to a service B whose model differs significantly, the ACL acts as a translator so that A's internal model is not polluted by B's model. This matters most when integrating with a legacy system or a third party.
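As a rough illustration of the ACL idea, the sketch below translates an external payment payload into a model owned by the Order context. The payload fields (`txn_ref`, `amt_cents`, `state`), the status vocabulary, and the `PaymentResult` type are all assumptions invented for this example; they are not any real provider's API.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class PaymentResult:
    """Model owned by the Order context (hypothetical)."""
    transaction_id: str
    amount: Decimal
    succeeded: bool


class PaymentAcl:
    """Anti-Corruption Layer: the only place that knows the external payload shape."""

    # Assumed external status vocabulary; a real provider will differ.
    _SUCCESS_STATES = {"captured", "settled"}

    def to_domain(self, external: dict) -> PaymentResult:
        return PaymentResult(
            transaction_id=str(external["txn_ref"]),
            amount=Decimal(external["amt_cents"]) / Decimal(100),
            succeeded=external["state"] in self._SUCCESS_STATES,
        )
```

If the external provider renames a field, only `PaymentAcl` changes; the rest of the Order context keeps working with `PaymentResult`.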

2.3 Inter-Service Communication

Synchronous Communication

REST over HTTP

Order Service --HTTP POST--> Payment Service
              <--200 OK-----

Pros	Cons
Simple, easy to understand	Tight coupling (the caller waits for the response)
Rich tooling (Postman, Swagger)	Cascading failure (A → B → C: if C dies, everything fails)
Easy to debug	Latency adds up with every hop
Stateless	Retry logic is tricky (requires idempotency)

gRPC (Google Remote Procedure Call)

Order Service --gRPC (HTTP/2, Protobuf)--> Payment Service
              <--Binary response-----------

Pros	Cons
Often faster than REST/JSON thanks to binary serialization (2-10x is commonly cited; the gain depends on the workload)	Steeper learning curve
Contract-first (`.proto` files)	Not human-readable (binary)
Bi-directional streaming	Limited browser support (needs grpc-web)
Automatic code generation	Harder to debug than REST
HTTP/2 multiplexing	Load balancing is more complex (needs an L7 LB)

When to use gRPC vs REST?

gRPC: internal service-to-service communication, high throughput, streaming data
REST: public APIs, browser clients, cases where a human-readable format is needed

Asynchronous Communication

Message Queue (Event-Driven)

Order Service --publish "OrderCreated"--> Message Broker (Kafka/RabbitMQ)
                                              │
                                              ├──> Payment Service (subscribe)
                                              ├──> Inventory Service (subscribe)
                                              └──> Notification Service (subscribe)

Pros	Cons
Loose coupling (the producer does not know its consumers)	Eventual consistency (not immediate)
Fault tolerance (the queue buffers while a consumer is down)	Debugging is harder (message tracing)
Services scale independently	Message ordering challenges
No cascading failure	Extra infrastructure (broker cluster)
Natural load leveling	Duplicate message handling (idempotency)

More on message queues: Tuan-08-Message-Queue
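The "duplicate message handling" drawback in the table above is usually addressed with an idempotent consumer. Here is a minimal sketch, assuming every event carries a unique `event_id` field (a hypothetical name) and that processed IDs are kept in memory; a real service would use a durable store.

```python
class IdempotentConsumer:
    """Skips events whose ID was already processed (at-least-once delivery)."""

    def __init__(self, handler):
        self._handler = handler
        # Assumption: in-memory set for illustration; use a durable store in practice.
        self._seen: set[str] = set()

    def on_message(self, event: dict) -> bool:
        event_id = event["event_id"]
        if event_id in self._seen:
            return False  # duplicate delivery, ignored
        self._handler(event)
        self._seen.add(event_id)  # mark only after the handler succeeded
        return True
```

Marking the ID only after the handler returns means a crash mid-handler leads to a retry rather than a lost event, at the cost of the handler possibly running twice if the crash happens between the two steps.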

Hybrid approach (the most common in practice):

Sync for queries that need an immediate response (GET user profile, GET product details)
Async for commands/events (place order, send notification, update inventory)

2.4 API Gateway Pattern

The API Gateway is the single entry point for all client requests; it plays the role of the food court's information desk.

Responsibilities:

                         ┌─────────────────────────────────────┐
                         │           API Gateway               │
   Mobile App ──────────►│                                     │
   Web App ─────────────►│  1. Request Routing                 │
   3rd Party ───────────►│  2. Authentication/Authorization    │
                         │  3. Rate Limiting                   │
                         │  4. SSL Termination                 │
                         │  5. Load Balancing                  │
                         │  6. Request/Response Transformation │
                         │  7. Response Caching                │
                         │  8. Circuit Breaking                │
                         │  9. Logging & Monitoring            │
                         │ 10. API Versioning                  │
                         └──────────┬──────────────────────────┘
                                    │
                    ┌───────────────┼───────────────┐
                    ▼               ▼               ▼
              User Service    Order Service   Product Service

Popular technologies: Kong, AWS API Gateway, Nginx (with Lua), Envoy, Traefik, APISIX

Note: The API Gateway can become a single point of failure and a performance bottleneck. To mitigate this:

Deploy multiple instances behind a load balancer
Keep gateway logic thin: routing, auth, and rate limiting only
Do not put business logic in the gateway

2.5 Backend for Frontend (BFF) Pattern

When there are several kinds of client (mobile, web, IoT) with different data needs:

                  ┌─────────────┐
   Mobile App ───►│  Mobile BFF  │───┐
                  └─────────────┘   │
                  ┌─────────────┐   │    ┌──────────────┐
   Web App ──────►│   Web BFF    │───┼───►│ User Service  │
                  └─────────────┘   │    ├──────────────┤
                  ┌─────────────┐   │    │ Order Service │
   IoT Device ──►│   IoT BFF    │───┘    ├──────────────┤
                  └─────────────┘        │Product Service│
                                         └──────────────┘

Why use a BFF?

Mobile needs small payloads (limited bandwidth) → the mobile BFF returns only the required fields
Web needs richer data → the web BFF aggregates several services
IoT needs a different protocol (MQTT) → the IoT BFF converts between protocols

BFF vs API Gateway: The API Gateway is shared infrastructure (routing, auth). A BFF is application-level (data aggregation and transformation for one kind of client). A typical deployment is: Client → API Gateway → BFF → Microservices.
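To make the "mobile BFF returns only the required fields" point concrete, here is a small sketch of the aggregation step. It assumes the upstream User and Order responses have already been fetched; the field names (`name`, `id`, `status`) are hypothetical.

```python
def mobile_order_summary(user: dict, orders: list[dict]) -> dict:
    """Mobile BFF: combine two upstream responses and keep only small fields."""
    return {
        "name": user["name"],
        "orders": [{"id": o["id"], "status": o["status"]} for o in orders],
    }
```

A web BFF could expose the same data with more fields; the downstream services do not need to know which client is asking.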

2.6 Saga Pattern — Distributed Transactions

In a monolith, an order flow is simple:

BEGIN TRANSACTION;
  INSERT INTO orders (...);
  UPDATE inventory SET quantity = quantity - 1;
  INSERT INTO payments (...);
COMMIT;

In microservices, each service has its own database, so a single ACID transaction cannot span services. The Saga pattern solves this by splitting the transaction into a sequence of local transactions, each paired with a compensating transaction that undoes it.

Choreography-based Saga (decentralized)

Each service listens for events and decides the next step on its own:

Order Service                  Payment Service              Inventory Service
     │                              │                              │
     │──"OrderCreated"──────────►   │                              │
     │                              │──"PaymentProcessed"────────► │
     │                              │                              │──"InventoryReserved"──►
     │                              │                              │
     │   (If Payment fails)         │                              │
     │◄──"PaymentFailed"───────────│                              │
     │──"OrderCancelled"──────────►│                              │
     │                              │──"PaymentRefunded"─────────►│
     │                              │                              │──"InventoryReleased"──►

Pros	Cons
Simple when there are few steps (2-3 services)	Hard to follow the flow once many services are involved
Loose coupling	Cyclic dependencies can appear
No single point of failure	Testing is complex
Each service is autonomous	No global view of the transaction

Orchestration-based Saga (centralized)

A Saga Orchestrator coordinates the whole flow:

                    Saga Orchestrator
                          │
           ┌──────────────┼──────────────┐
           ▼              ▼              ▼
      Order Service  Payment Service  Inventory Service

  Step 1: CreateOrder()
  Step 2: ProcessPayment()
  Step 3: ReserveInventory()

  If Step 3 fails:
    Compensate 2: RefundPayment()
    Compensate 1: CancelOrder()

Pros	Cons
Flow is easy to follow (centralized logic)	The orchestrator is a single point of failure
Easy to add or change steps	Risk: the orchestrator accumulates too much logic
Avoids cyclic dependencies	Coupling to the orchestrator
Easy to test (testing the orchestrator = testing the flow)	Needs to manage a state machine

Which one when?

Choreography: 2-4 services, simple flow, the team wants maximum loose coupling
Orchestration: > 4 services, complex flow, clear visibility required, complex compensating logic
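The orchestrated flow can be sketched in a few lines: run each step in order and, when one fails, run the compensations of the completed steps in reverse. This is a simplified in-process sketch; `run_saga` and the step tuple shape are assumptions for illustration, and a production orchestrator would also persist its state and retry compensations.

```python
from typing import Callable

# (name, action, compensation): hypothetical shape for this sketch
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: list[str] = []
    done: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"{name}:failed")
            for prev_name, _, compensate in reversed(done):
                compensate()
                log.append(f"{prev_name}:compensated")
            return log
        done.append(step)
        log.append(f"{name}:ok")
    return log
```

With steps CreateOrder, ProcessPayment, ReserveInventory, a failure in the third step triggers the payment compensation first and the order compensation second, matching the compensation order described above.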

2.7 Circuit Breaker Pattern

When service B is down and service A keeps calling it, A wastes resources and the failure cascades. A Circuit Breaker prevents this, much like the electrical breaker in a house:

State Machine:

   ┌──────────┐   failure threshold   ┌──────────┐
   │  CLOSED  │──────────────────────►│   OPEN   │◄───────────┐
   │ (allow)  │       exceeded        │ (reject) │            │
   └──────────┘                       └────┬─────┘            │
        ▲                                  │                  │
        │                                  │ timeout          │ trial call
        │                                  ▼                  │ fails
        │   trial calls succeed      ┌───────────┐            │
        └────────────────────────────│ HALF-OPEN │────────────┘
                                     │(allow few)│
                                     └───────────┘

Three states:

CLOSED (normal): Requests pass through as usual. Failures are counted.
OPEN (tripped): When failures exceed the threshold (for example, 5 failures within 10 seconds), the circuit trips. Every request gets a fallback response immediately, without calling service B.
HALF-OPEN (probing): After a timeout (for example, 30 seconds), 1-2 trial requests are let through. If they succeed → CLOSED. If they fail → back to OPEN.

Fallback strategies:

Return cached data (stale but available)
Return a default response
Return a friendly error message
Redirect to a backup service
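The three states translate into a small state machine. The sketch below is a simplification under stated assumptions: it counts consecutive failures rather than "N failures within a time window", allows one trial call at a time in HALF-OPEN, and is not thread-safe. The `CircuitBreaker` class and its parameter names are made up for this example.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the timeout can be tested
        self.state = "CLOSED"
        self._failures = 0
        self._opened_at = 0.0

    def call(self, fn, fallback):
        if self.state == "OPEN":
            if self._clock() - self._opened_at < self.reset_timeout:
                return fallback()      # fail fast, do not call downstream
            self.state = "HALF_OPEN"   # timeout elapsed: allow a trial call
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
                self.state = "OPEN"
                self._opened_at = self._clock()
            return fallback()
        # Success: close the circuit and reset the failure count.
        self.state = "CLOSED"
        self._failures = 0
        return result
```

Usage would look like `breaker.call(lambda: call_payment_api(order), lambda: cached_response)`, where both callables are placeholders for your own functions.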

2.8 Bulkhead Pattern

The name comes from the watertight bulkheads on a ship: if one compartment is breached, water floods only that compartment and does not spread to the others.

In microservices, a Bulkhead isolates resources so that a failure in one part does not affect the others:

┌──────────────────────────────────────────────┐
│              Order Service                    │
│                                              │
│  ┌──────────────┐   ┌──────────────┐         │
│  │ Thread Pool A │   │ Thread Pool B │         │
│  │ (10 threads)  │   │ (10 threads)  │         │
│  │               │   │               │         │
│  │ → Payment API │   │ → Inventory   │         │
│  │               │   │   API         │         │
│  └──────────────┘   └──────────────┘         │
│                                              │
│  If the Payment API dies, only Pool A blocks │
│  Pool B (Inventory) keeps working normally   │
└──────────────────────────────────────────────┘

Ways to implement a Bulkhead:

Thread pool isolation: each downstream dependency gets its own thread pool
Connection pool isolation: cap the connections to each dependency
Semaphore isolation: cap concurrent calls (lighter than a thread pool)
Pod/Container isolation: each function group runs in its own pod
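Of the options above, semaphore isolation is the easiest to sketch. The example below caps concurrent calls to one dependency and rejects extra callers immediately instead of queueing them; the `Bulkhead` class name and the reject-with-fallback policy are assumptions for this illustration.

```python
import threading


class Bulkhead:
    """Semaphore isolation: cap concurrent calls to one dependency."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, fallback):
        if not self._slots.acquire(blocking=False):
            return fallback()  # compartment full: reject instead of queueing
        try:
            return fn()
        finally:
            self._slots.release()
```

One `Bulkhead` instance per downstream dependency (for example, one for the Payment API and one for the Inventory API) keeps a slow dependency from exhausting capacity meant for the other.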

2.9 Service Discovery

In microservices, services scale dynamically (instances are added and removed), so IP addresses cannot be hardcoded. Service Discovery answers the question: “Service A wants to call Service B. Where is Service B?”

Client-Side Discovery

                      ┌──────────────┐
                      │   Service    │
  Service A ─────────►│   Registry   │  (Consul, Eureka, etcd)
       │  1. Query    │              │
       │  "user-svc"  │ user-svc:    │
       │              │  - 10.0.1.5  │
       │              │  - 10.0.1.6  │
       │              │  - 10.0.1.7  │
       │              └──────────────┘
       │
       │  2. Client picks an instance itself (round-robin, random, etc.)
       │
       └──────────────────────────────►  10.0.1.6 (User Service)

Pros: Fewer hops; the client can choose any strategy it likes
Cons: The client must implement discovery logic, once per language
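The two steps in the diagram (query the registry, then pick an instance) can be sketched with an in-memory registry and a round-robin client. `ServiceRegistry` and `RoundRobinClient` are hypothetical stand-ins, not the Consul or Eureka client APIs, and health checking is left out.

```python
import itertools


class ServiceRegistry:
    """In-memory stand-in for a registry such as Consul or Eureka."""

    def __init__(self):
        self._instances: dict[str, list[str]] = {}

    def register(self, name: str, address: str) -> None:
        self._instances.setdefault(name, []).append(address)

    def lookup(self, name: str) -> list[str]:
        return list(self._instances.get(name, []))


class RoundRobinClient:
    def __init__(self, registry: ServiceRegistry, service: str):
        self._registry = registry
        self._service = service
        self._counter = itertools.count()

    def next_instance(self) -> str:
        instances = self._registry.lookup(self._service)  # step 1: query
        if not instances:
            raise LookupError(f"no instances for {self._service}")
        return instances[next(self._counter) % len(instances)]  # step 2: pick
```

Because the lookup happens on every call, instances registered later are picked up automatically; a real client would typically cache the list and refresh it periodically.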

Server-Side Discovery

                      ┌──────────────┐
  Service A ─────────►│ Load Balancer│─────────► 10.0.1.6 (User Service)
                      │  / Router    │
                      └──────┬───────┘
                             │ query
                      ┌──────▼───────┐
                      │   Service    │
                      │   Registry   │
                      └──────────────┘

Pros: Simple client (it only needs the LB address), language-agnostic
Cons: One extra hop (latency); the LB can become a bottleneck

Popular tools:

Tool	Type	Health Check	Key-Value Store	Notes
Consul (HashiCorp)	Both	HTTP, TCP, gRPC, Script	Yes	Multi-datacenter native
Eureka (Netflix)	Client-side	Heartbeat	No	Java ecosystem, Spring Cloud
etcd	Server-side (via Kubernetes)	n/a	Yes	Kubernetes native
ZooKeeper	Server-side	Session-based	Yes	Legacy, being replaced

2.10 Service Mesh — Istio / Linkerd

A Service Mesh is an infrastructure layer that handles service-to-service communication, so developers do not have to implement cross-cutting concerns (retry, circuit breaker, mTLS, observability) in application code.

Sidecar Proxy Pattern

┌─────────────────────────┐     ┌─────────────────────────┐
│         Pod A           │     │         Pod B           │
│ ┌─────────┐ ┌─────────┐│     │┌─────────┐ ┌─────────┐ │
│ │ Service │ │ Sidecar ││     ││ Sidecar │ │ Service │ │
│ │    A    │→│ Proxy A ││────►││ Proxy B │→│    B    │ │
│ │         │ │(Envoy)  ││     ││(Envoy)  │ │         │ │
│ └─────────┘ └─────────┘│     │└─────────┘ └─────────┘ │
└─────────────────────────┘     └─────────────────────────┘

              ▲                          ▲
              │      Control Plane       │
              │    (Istiod / Linkerd)    │
              └──────────┬───────────────┘
                         │
                   ┌─────▼──────┐
                   │  Config:   │
                   │  - Routing │
                   │  - mTLS    │
                   │  - Retry   │
                   │  - CB      │
                   └────────────┘

Each pod has a sidecar proxy (usually Envoy) running next to the application container. All inbound and outbound traffic goes through the sidecar, which handles:

Feature	Without a Service Mesh	With a Service Mesh
mTLS	Implemented by hand in code	Automatic, transparent
Retry / Timeout	A library in every service	Configured in the control plane
Circuit Breaker	Code-level (Hystrix, opossum)	Infrastructure-level
Observability	Manual instrumentation	Automatic metrics and traces
Traffic splitting	Manual (canary flags)	Declarative (90/10 split)
Rate Limiting	In code or at the gateway	Policy at the mesh level

Istio vs Linkerd:

Criterion	Istio	Linkerd
Proxy	Envoy (C++)	linkerd2-proxy (Rust)
Complexity	High, many features	Simple, easy to set up
Resource overhead	Higher (~50-100MB/sidecar)	Lower (~20-30MB/sidecar)
Learning curve	High	Low
Features	Very complete	Core features, enough for most needs
Best for	Enterprise, needs every feature	Startup/mid-size, needs simplicity

2.11 Distributed Tracing: Following a Request Across Services

When a request passes through 5-10 services, how do you find the bottleneck?

Request ID: abc-123

API Gateway [2ms]
  └─► Order Service [15ms]
        ├─► User Service [3ms]        ← cache hit, fast
        ├─► Product Service [45ms]     ← ⚠️ slow query!
        │     └─► Inventory DB [40ms]  ← ⚠️ root cause
        └─► Payment Service [8ms]
              └─► Stripe API [120ms]   ← external, expected

Total: 193ms

The three pillars of observability:

Logs: “What happened?” Structured logs (JSON) from every service
Metrics: “How is the system doing?” QPS, latency, error rate (RED metrics)
Traces: “Where did the request go, and how long did it take?” Span-based tracing

OpenTelemetry is the current standard and covers all three pillars. Popular tracing backends:

Jaeger (Uber, open-source): distributed tracing
Zipkin (Twitter, open-source): distributed tracing
Grafana Tempo: traces, integrates well with the Grafana stack

Details: Tuan-13-Monitoring-Observability
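The core idea behind span-based tracing is that every span in one request shares a trace ID and records its parent span. The sketch below shows only that relationship; `Span` and `start_trace` are illustrative names and deliberately not the OpenTelemetry API, which should be used in real services.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])

    def child(self, name: str) -> "Span":
        # A child keeps the trace_id and points at its parent's span_id.
        return Span(name=name, trace_id=self.trace_id, parent_id=self.span_id)


def start_trace(name: str) -> Span:
    """Create the root span of a new trace (no parent)."""
    return Span(name=name, trace_id=uuid.uuid4().hex)
```

In the example trace above, the API Gateway span would be the root, Order Service its child, and Product Service a child of Order Service; a tracing backend rebuilds the tree from these parent links.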

2.12 Data Management: Database per Service

Principle: each service owns its own database

┌──────────┐    ┌──────────┐    ┌──────────┐
│  Order   │    │ Payment  │    │ Product  │
│ Service  │    │ Service  │    │ Service  │
└────┬─────┘    └────┬─────┘    └────┬─────┘
     │               │               │
┌────▼─────┐    ┌────▼─────┐    ┌────▼─────┐
│ Order DB │    │Payment DB│    │Product DB│
│(Postgres)│    │(Postgres)│    │ (Mongo)  │
└──────────┘    └──────────┘    └──────────┘

Why?

Loose coupling: changing one database's schema does not affect other services
Independent scaling: the Order DB may need sharding while the Product DB does not
Tech freedom: Payment uses PostgreSQL (ACID), Product uses MongoDB (flexible schema), Search uses Elasticsearch

Shared Database Anti-Pattern:

┌──────────┐    ┌──────────┐    ┌──────────┐
│  Order   │    │ Payment  │    │ Product  │
│ Service  │    │ Service  │    │ Service  │
└────┬─────┘    └────┬─────┘    └────┬─────┘
     │               │               │
     └───────────────┼───────────────┘
                     │
              ┌──────▼──────┐
              │  Shared DB   │  ← ANTI-PATTERN!
              │  (1 schema)  │
              └─────────────┘

Why is a shared DB an anti-pattern?

Tight coupling: Service A changes a table → Service B breaks
No independent scaling: a bottleneck caused by one service affects the whole DB
Schema conflicts: two teams change the same table and collide
No polyglot persistence: every service is forced onto the same DB technology

The challenge with database per service: when you need data from several services, you cannot JOIN. Common workarounds:

API Composition: Service A calls the APIs of Services B and C, then joins the data at the application level
CQRS (Command Query Responsibility Segregation): separate the read and write models. Writes go to the service's DB. Reads come from a denormalized read model (materialized view).
Event Sourcing + Change Data Capture: services publish events when data changes → other services build a local copy
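API Composition, the first option, amounts to doing the JOIN in application code. A minimal sketch, assuming the order rows and a product lookup keyed by ID have already been fetched from their services (the field names are hypothetical):

```python
def compose_order_view(orders: list[dict], products_by_id: dict[str, dict]) -> list[dict]:
    """Join order rows with product data in application code (no SQL JOIN)."""
    view = []
    for order in orders:
        product = products_by_id.get(order["product_id"], {})
        view.append({
            "order_id": order["id"],
            "product_name": product.get("name", "unknown"),
        })
    return view
```

The `"unknown"` default matters: with separate databases there is no foreign key guaranteeing that the product still exists, so the composer has to tolerate missing rows.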

2.13 Strangler Fig Pattern: Migrating from a Monolith

Nobody should rewrite a monolith from scratch (the “big bang rewrite”). The Strangler Fig pattern is named after the strangler fig tree, which grows around a host tree and gradually replaces it:

Phase 1: Monolith handles everything
┌─────────────────────────────────┐
│          Monolith               │
│  [Users] [Orders] [Products]    │
│  [Payments] [Notifications]     │
└─────────────────────────────────┘

Phase 2: Extract one service, route via proxy
┌──────────────────┐
│   API Proxy      │──── /api/notifications ───► Notification Service (new)
│   (Strangler     │
│    Facade)       │──── /api/* ──────────────► Monolith (existing)
└──────────────────┘

Phase 3: Continue extracting
┌──────────────────┐
│   API Proxy      │──── /api/notifications ───► Notification Service
│                  │──── /api/payments ─────────► Payment Service
│                  │──── /api/products ─────────► Product Service
│                  │──── /api/* ──────────────► Monolith (smaller)
└──────────────────┘

Phase 4: Monolith fully replaced
┌──────────────────┐
│   API Gateway    │──── /api/notifications ───► Notification Service
│                  │──── /api/payments ─────────► Payment Service
│                  │──── /api/products ─────────► Product Service
│                  │──── /api/orders ───────────► Order Service
│                  │──── /api/users ────────────► User Service
└──────────────────┘

Migration principles:

Start with the least coupled service (often Notification or Logging)
Put a proxy/facade in front of the monolith
Extract one service at a time and route its traffic through the proxy
Keep the monolith running in parallel → easy rollback
Once the new service is stable → switch off the corresponding part of the monolith
Repeat until the monolith is empty
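The strangler facade itself is mostly a routing table that grows with each extraction. Here is a small sketch of that routing decision; the `route` function and backend names are assumptions for illustration, and a real proxy (Nginx, Envoy, etc.) would express this as configuration.

```python
MONOLITH = "monolith"


def route(path: str, extracted: dict[str, str]) -> str:
    """Return the backend for a path: longest matching extracted prefix, else monolith."""
    matches = [p for p in extracted if path == p or path.startswith(p + "/")]
    if not matches:
        return MONOLITH
    return extracted[max(matches, key=len)]
```

In Phase 2 the table holds a single entry (`/api/notifications`); each later phase adds one entry, and rolling back an extraction is just removing its entry so traffic falls through to the monolith again.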

2.14 Modern Service Mesh — Ambient Mesh, Cilium, eBPF

2024-2026 update: The classic sidecar pattern (Istio classic, Linkerd 2.x) is increasingly being challenged by sidecarless approaches. This is likely to be one of the most important trends in microservice infrastructure over the next three years.

2.14.1 Problems with the Traditional Sidecar

Section 2.10 above describes the sidecar pattern: every pod gets one Envoy proxy. In production this has a significant cost:

Problem	Details
Resource overhead	A cluster of 1000 pods → 1000 Envoy sidecars × 50-100MB = 50-100GB of RAM for the mesh alone
Latency tax	Each request passes through 2 sidecars (egress + ingress) → +1-2ms per hop
Operational cost	1000 sidecars must be rolling-upgraded on every Istio update
Lifecycle coupling	A sidecar restart disrupts application traffic
Sidecar injection complexity	Webhook configurations, init containers, namespace labels

Real-world impact: These costs are reportedly why some companies (HBO Max and Salesforce among them) began moving away from Istio classic after 2-3 years in production and started evaluating alternatives.

2.14.2 Istio Ambient Mesh (2023+, GA 2024)

Istio Ambient removes the sidecar by splitting the mesh into two layers:

┌─────────────────────────────────────────────────────┐
│   Application Pods (no sidecar!)                     │
│   ┌───────────┐  ┌───────────┐  ┌───────────┐       │
│   │  Service A│  │  Service B│  │  Service C│       │
│   └─────┬─────┘  └─────┬─────┘  └─────┬─────┘       │
└─────────┼──────────────┼──────────────┼─────────────┘
          │              │              │
          ▼              ▼              ▼
   ┌─────────────────────────────────────────┐
   │   Layer 4: ztunnel (per-node DaemonSet) │
   │   - mTLS, identity, basic routing       │
   │   - Always on, no per-pod sidecar       │
   └─────────────────────────────────────────┘
                        │
                        ▼ (only when L7 needed)
   ┌─────────────────────────────────────────┐
   │   Layer 7: waypoint proxy (optional)    │
   │   - HTTP routing, retry, auth policy    │
   │   - Per-namespace, opt-in               │
   └─────────────────────────────────────────┘

The two layers:

L4 (ztunnel): a DaemonSet, 1 pod per node. Handles mTLS + L4 routing. Always on.
L7 (waypoint proxy): optional, deployed only when L7 features are needed (HTTP routing, retry policies). Per-namespace.

Comparison with the sidecar mesh:

Criterion	Sidecar Mesh	Ambient Mesh
Memory per pod	+50-100MB	0 (no sidecar)
Memory per node	0 (the cost is paid per pod)	~200MB (1 ztunnel)
Latency tax	2 sidecar hops	L4 ztunnel only (no L7 proxy unless a waypoint is used)
Pod restart on mesh upgrade	Required	Not required
L7 features	Always on (cost)	Opt-in (pay-per-use)
Suitable for	Apps that need every L7 feature	Mixed workloads where cost matters
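A quick back-of-the-envelope comparison using the figures from the table. The node count (50 nodes for 1000 pods) is an assumption added for this example, and waypoint proxies are left out, so treat the result as an order-of-magnitude estimate only.

```python
def sidecar_mesh_mb(pods: int, mb_per_sidecar: int) -> int:
    """Mesh memory when every pod carries its own sidecar."""
    return pods * mb_per_sidecar


def ambient_mesh_mb(nodes: int, mb_per_ztunnel: int = 200) -> int:
    """Mesh memory when each node runs one ztunnel (waypoints not counted)."""
    return nodes * mb_per_ztunnel


# Assumed cluster: 1000 pods spread over 50 nodes.
low, high = sidecar_mesh_mb(1000, 50), sidecar_mesh_mb(1000, 100)
ambient = ambient_mesh_mb(50)
print(f"sidecar: {low}-{high} MB, ambient: {ambient} MB")
```

Under these assumptions the sidecar mesh needs 50,000-100,000 MB against 10,000 MB for ambient; the gap narrows as more namespaces opt in to waypoint proxies.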

When to use Ambient:

Large clusters (1000+ pods) that need to cut overhead
Workloads that mostly do not need L7 routing (mTLS only)
Mesh upgrades must not disrupt applications

When to stay with the (classic) sidecar:

L7 features are needed on every request (rate limiting, auth, transformation)
Heavy existing investment in custom Envoy filters
Mature stack, no appetite for migration risk

Reference: https://istio.io/latest/docs/ambient/overview/

2.14.3 Cilium: a Mesh on eBPF (sidecarless from day 1)

Cilium uses eBPF (extended Berkeley Packet Filter) to implement networking and service mesh functions directly in the kernel, without a sidecar proxy.

┌──────────────────────────────────────────┐
│         Application Pod                    │
│         (no sidecar)                       │
└────────────────┬─────────────────────────┘
                 │
                 ▼ (kernel)
┌──────────────────────────────────────────┐
│   Linux Kernel + eBPF programs            │
│   - Service routing (L4/L7)              │
│   - Network policy enforcement           │
│   - mTLS (with WireGuard or IPsec)       │
│   - Observability (Hubble)               │
│   - Load balancing (replace kube-proxy)  │
└──────────────────────────────────────────┘

eBPF basics:

Runs sandboxed code in the kernel without a kernel module
Verified at load time → safe
Hot-reloadable, no kernel reboot
Visibility into every packet/syscall

Cilium vs Istio Ambient:

Criterion	Cilium	Istio Ambient
Data plane	eBPF (kernel)	ztunnel (userspace)
Performance	Highest (in-kernel)	Good (user-space proxy)
L7 features	Cilium Service Mesh (Envoy as needed)	Waypoint (Envoy)
CNI replacement	Yes (is itself a CNI; can also replace kube-proxy)	No (needs a separate CNI)
Maturity	GA, production-proven (Google GKE Dataplane V2, Datadog)	GA since Istio 1.24 (November 2024)
Vendor lock-in	CNCF project, vendor-neutral	CNCF project (Istio community)

Real-world adoptions:

Google GKE Dataplane V2 — built on Cilium
Datadog — Cilium for production observability
Adobe — migrated to Cilium for performance

References:

Cilium docs: https://cilium.io/
Liz Rice, Learning eBPF (O’Reilly 2023): https://learning.oreilly.com/library/view/learning-ebpf/9781098135119/
Brendan Gregg’s eBPF resources: https://www.brendangregg.com/ebpf.html

2.14.4 Decision matrix: which mesh to choose?

Need full L7 features for every service?
├─ YES → Istio classic (sidecar) — proven path
└─ NO  → Need maximum performance + observability?
         ├─ YES → Cilium (eBPF)
         └─ NO  → Istio Ambient (mid-ground)

Anti-pattern: Migrating between meshes frequently. Each mesh is a 6-12 month investment. Pick one and stick with it.

2.15 Cell-based Architecture & Shuffle Sharding

This is a pattern AWS describes using for many of its critical systems (DynamoDB, Lambda, Route 53). It is a widely adopted industry approach to blast radius reduction.

2.15.1 The problem: blast radius of a single failure

Traditional architecture:

                    [Load Balancer]
                          │
       ┌──────────────────┼──────────────────┐
       │                  │                  │
   [Service A1]      [Service A2]      [Service A3]
       │                  │                  │
       └──────────────────┼──────────────────┘
                          │
                  [Single Database]
                          │
                  ⚡ Failure mode ⚡

When one component fails (DB corruption, bad deployment, hot key, DDoS) → every user is affected.

Real-world incident examples:

2017 AWS S3 outage: a typo in an operational command took S3 in us-east-1 down for about 4 hours, with losses estimated at $150M+
2021 Facebook 6-hour outage: BGP misconfiguration → all services down
2024 CrowdStrike: bad update → 8.5M Windows machines crashed worldwide

2.15.2 Cell-based Architecture

Principle: Split the system into N independent cells, each serving a subset of users. A failure in one cell then affects only about 1/N of users.

            [Cellular Router]
                    │
      ┌─────────────┼─────────────┐
      ▼             ▼             ▼
 ┌──────────┐  ┌──────────┐  ┌──────────┐
 │  Cell 1  │  │  Cell 2  │  │  Cell 3  │
 │          │  │          │  │          │
 │ App tier │  │ App tier │  │ App tier │
 │   DB     │  │   DB     │  │   DB     │
 │   Cache  │  │   Cache  │  │   Cache  │
 │          │  │          │  │          │
 │ Users    │  │ Users    │  │ Users    │
 │ A-G      │  │ H-O      │  │ P-Z      │
 └──────────┘  └──────────┘  └──────────┘

Key properties:

Independent: a cell shares no state with any other cell
Fixed size: each cell has a capped capacity (e.g., 100K users) and is never scaled beyond it; growth means adding cells
Identical stack: Cell 1 = Cell 2 architecturally; only the data differs
Routed deterministically: each user belongs to one fixed cell (by user_id hash; the diagram uses name ranges only for readability)

Failure isolation:

Cell 1 DB corruption → only users A-G are affected (~33%)
Bad deployment in Cell 1 → roll out to one cell first (blue-green test); a rollback affects only ~33% of users
Hot key in Cell 1 → does not spread to Cell 2/3

2.15.3 Cell sizing trade-off

Cell size	Blast radius (assuming 1M total users)	Operational overhead
1M users/cell	100% impact on failure	Minimal, same as a monolith
100K users/cell	10% impact	Manageable
10K users/cell	1% impact	High (10x as many cells as the 100K case)
1K users/cell	0.1% impact	Very high

AWS standard: 100K-1M units per cell. Optimize for:

Big enough to be efficient (DB economy of scale)
Small enough that a failure is not catastrophic

2.15.4 Shuffle Sharding: increasing isolation

Problem: with plain cells, a single "bad actor" user who overloads Cell A degrades every user in Cell A (a 10% blast radius with 10 cells).

Shuffle Sharding (AWS pattern): each user is assigned a small subset of resources instead of an entire cell.

Example: 8 worker nodes, each user assigned 2 random nodes:

8 workers: [W1, W2, W3, W4, W5, W6, W7, W8]

User Alice  → [W1, W3]
User Bob    → [W2, W7]
User Carol  → [W3, W6]
User Dave   → [W1, W5]
User Evan   → [W4, W8]

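Math: there are C(8, 2) = 28 distinct shards (pairs of workers). If Bob is an attacker or noisy neighbor who saturates both of his workers, only users assigned exactly the same pair lose all of their capacity: 1/28 ≈ 3.6% of possible shards. Users who share just one worker with Bob still have a healthy worker to fail over to, provided clients retry.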

Generalized: with N nodes and shard size K:

$$\text{Combinations} = \binom{N}{K} = \frac{N!}{K!\,(N-K)!}$$

AWS Route 53: 2048 virtual name servers, 4 per customer → C(2048, 4) ≈ 7.3 × 10^11 distinct shards. The probability that two customers share all 4 servers is about 1.4 × 10^-12, effectively zero.

Reference: AWS Builders Library, Workload isolation using shuffle-sharding: https://aws.amazon.com/builders-library/workload-isolation-using-shuffle-sharding/
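The shuffle-sharding arithmetic above can be checked with a few lines of Node.js. This is a minimal sketch, not AWS's implementation: the helper names `choose`, `assignShard`, and `partialOverlapFraction` are made up for illustration, and the hash-ranking assignment is just one simple way to get a stable, pseudo-random shard per customer.

```javascript
// Illustrative sketch only; function names are hypothetical, not an AWS API.
const crypto = require('crypto');

// Binomial coefficient C(n, k) via the multiplicative formula.
// After step i the running value equals C(n - k + i, i), so it stays an integer.
function choose(n, k) {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// Deterministically pick `shardSize` distinct workers for a customer by
// ranking workers on a hash of (customerId, worker). The same customer
// always gets the same shard; different customers get different mixes.
function assignShard(customerId, workers, shardSize) {
  return workers
    .map((worker) => ({
      worker,
      score: crypto.createHash('sha256').update(`${customerId}:${worker}`).digest('hex'),
    }))
    .sort((a, b) => (a.score < b.score ? -1 : a.score > b.score ? 1 : 0))
    .slice(0, shardSize)
    .map((entry) => entry.worker);
}

// Fraction of possible shards that share at least one worker with a given shard.
function partialOverlapFraction(n, k) {
  return 1 - choose(n - k, k) / choose(n, k);
}

const workers = ['W1', 'W2', 'W3', 'W4', 'W5', 'W6', 'W7', 'W8'];
console.log('distinct shards (8 choose 2):', choose(8, 2));
console.log('distinct shards (2048 choose 4):', choose(2048, 4));
console.log('shard for alice:', assignShard('alice', workers, 2));
console.log('share >= 1 worker with a given shard:', partialOverlapFraction(8, 2).toFixed(3));
```

Note the difference between the two overlap figures: with 8 workers and shards of 2, only 1 in 28 shards is identical to the attacker's, but 13 in 28 share at least one worker. Shuffle sharding therefore relies on clients retrying against the remaining healthy worker.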

2.15.5 Cellular Router

The router maps each request to a cell. There are three common approaches:

Pattern	Description	Trade-off
DNS-based	`cell-1.example.com`, `cell-2.example.com`. The client knows which cell it belongs to	Simple, but the client must know its cell
Smart router	App-tier router: `route(user_id) → cell_id`	Hidden from the client, but the router can become a SPOF/bottleneck
Stateless edge router	Edge function (e.g., Cloudflare Workers) computes `cell_id = hash(user_id)`	Very scalable, but needs consistent hashing so that adding a cell does not remap most users

Best practice: a stateless edge router with consistent hashing. The cell mapping table is cached at the edge.
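As a rough illustration of the stateless router idea, the sketch below builds a consistent-hash ring with virtual nodes and routes a user to a cell. The cell names, the number of virtual nodes, and the function names (`buildRing`, `routeToCell`) are assumptions made for this example; a real edge router would also handle cell health and explicit user migrations.

```javascript
// Sketch of a consistent-hash cell router; all names here are hypothetical.
const crypto = require('crypto');

// 32-bit hash derived from SHA-256 (any well-distributed hash would do).
function hash32(key) {
  return crypto.createHash('sha256').update(key).digest().readUInt32BE(0);
}

// Build a sorted ring of virtual nodes: each cell owns many points on the ring.
function buildRing(cells, vnodesPerCell = 64) {
  const ring = [];
  for (const cell of cells) {
    for (let v = 0; v < vnodesPerCell; v++) {
      ring.push({ point: hash32(`${cell}#${v}`), cell });
    }
  }
  return ring.sort((a, b) => a.point - b.point);
}

// Route to the first ring point at or after hash(userId), wrapping around.
function routeToCell(ring, userId) {
  const h = hash32(String(userId));
  let lo = 0;
  let hi = ring.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (ring[mid].point < h) lo = mid + 1;
    else hi = mid;
  }
  return ring[lo % ring.length].cell;
}

const ring3 = buildRing(['cell-1', 'cell-2', 'cell-3']);
const ring4 = buildRing(['cell-1', 'cell-2', 'cell-3', 'cell-4']);

// Adding cell-4 should only move users INTO cell-4; nobody moves between old cells.
let moved = 0;
for (let i = 0; i < 1000; i++) {
  if (routeToCell(ring3, `user-${i}`) !== routeToCell(ring4, `user-${i}`)) moved++;
}
console.log('user-42 →', routeToCell(ring3, 'user-42'));
console.log('users remapped after adding a 4th cell:', moved, 'of 1000');
```

The design choice worth noticing is that the router holds no per-user state: the ring can be rebuilt from the cell list alone, which is what makes it cheap to cache at the edge.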

2.15.6 Production examples

System	Cell unit	Routing
AWS DynamoDB	Partitions (cells)	Internal routing layer
AWS Lambda	Cell = isolated execution environment	Smart router
AWS Route 53	Shuffle-sharded across 2048 virtual name servers	Anycast + shuffle
Slack	Cell = one availability zone (AZ-siloed services), per Slack's engineering blog	Edge load balancers drain traffic per AZ
Stripe	API cells per region	Edge routing

2.15.7 When NOT to use a cell-based architecture?

Small systems (< 100K users): the overhead outweighs the benefit
Workloads with heavy cross-user queries (e.g., a social network feed): cell boundaries kill performance
Small teams: the operational complexity of N cells becomes overwhelming

Further references:

AWS Well-Architected whitepaper: Reducing the Scope of Impact with Cell-Based Architecture
AWS Builders Library: Avoiding fallback in distributed systems: https://aws.amazon.com/builders-library/avoiding-fallback-in-distributed-systems/
Cellular architecture (Werner Vogels) — https://www.allthingsdistributed.com/2024/08/cellular-architecture.html
Slack’s cell-based migration — https://slack.engineering/slacks-migration-to-a-cellular-architecture/

3. Estimation — Network Overhead & Service Mesh Latency

3.1 Network Overhead: Monolith vs Microservices

Scenario: e-commerce checkout flow. In a monolith, checkout is a single in-process function call chain. In microservices, each step becomes a network call.

Monolith (in-process):

$$\text{Latency}_{\text{monolith}} = T_{\text{validate\_user}} + T_{\text{check\_inventory}} + T_{\text{process\_payment}} + T_{\text{create\_order}}$$

$$= 2\,\text{ms} + 5\,\text{ms} + 50\,\text{ms} + 3\,\text{ms} = 60\,\text{ms}$$

Each function call: ~100 ns overhead (negligible).

Microservices (network calls):

Each service call adds:

Serialization/Deserialization: ~0.5 ms each (JSON) or ~0.1 ms each (Protobuf)
Network round trip (same DC): ~0.5 ms
TLS handshake (with mTLS, first connection only): ~2 ms (subsequent calls: ~0 ms with connection pooling, so it is left out of the per-call figure)
Load balancer hop: ~0.2 ms

$$\text{Overhead}_{\text{per\_call}} = T_{\text{serialize}} + T_{\text{network}} + T_{\text{LB}} + T_{\text{deserialize}}$$

REST (JSON):

$$\text{Overhead}_{\text{REST}} = 0.5\,\text{ms} + 0.5\,\text{ms} + 0.2\,\text{ms} + 0.5\,\text{ms} = 1.7\,\text{ms per call}$$

gRPC (Protobuf):

$$\text{Overhead}_{\text{gRPC}} = 0.1\,\text{ms} + 0.5\,\text{ms} + 0.2\,\text{ms} + 0.1\,\text{ms} = 0.9\,\text{ms per call}$$

The checkout flow needs 4 service calls (sequential):

$$\text{Latency}_{\text{microservices\_REST}} = 60\,\text{ms} + (4 \times 1.7\,\text{ms}) = 66.8\,\text{ms}$$

$$\text{Latency}_{\text{microservices\_gRPC}} = 60\,\text{ms} + (4 \times 0.9\,\text{ms}) = 63.6\,\text{ms}$$

Overhead percentage (relative to the 60 ms monolith):

$$\text{Overhead\%}_{\text{REST}} = \frac{6.8\,\text{ms}}{60\,\text{ms}} \times 100\% \approx 11.3\%$$

$$\text{Overhead\%}_{\text{gRPC}} = \frac{3.6\,\text{ms}}{60\,\text{ms}} \times 100\% \approx 6.0\%$$

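Aha: network overhead is roughly 6-11% for sequential calls. If Hieu parallelizes the independent calls (User + Inventory concurrently, then Payment), the end-to-end penalty shrinks considerably.

Parallelized flow (User check + Inventory check run concurrently, so only 3 call overheads sit on the critical path):

$$\text{Latency}_{\text{parallel}} = \max(T_{\text{user}}, T_{\text{inventory}}) + T_{\text{payment}} + T_{\text{order}} + 3 \times \text{Overhead}_{\text{gRPC}}$$

$$= \max(2, 5) + 50 + 3 + (3 \times 0.9) = 60.7\,\text{ms}$$

$$\text{Overhead\%}_{\text{parallel}} = \frac{0.7\,\text{ms}}{60\,\text{ms}} \times 100\% \approx 1.2\%$$

Note that the 2.7 ms of network overhead is still paid. It is offset by the 2 ms of user-validation work hidden behind the inventory check, which leaves a net increase of 0.7 ms over the monolith.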
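The estimates above are easy to reproduce and tweak in code. The sketch below simply encodes this section's illustrative numbers (they are assumptions of the exercise, not measurements), with variable and function names chosen for this example.

```javascript
// All figures are this section's illustrative estimates, in milliseconds.
const work = { validateUser: 2, checkInventory: 5, processPayment: 50, createOrder: 3 };

// Per-call overhead = serialize + network round trip + LB hop + deserialize.
const perCallOverhead = {
  rest: 0.5 + 0.5 + 0.2 + 0.5,
  grpc: 0.1 + 0.5 + 0.2 + 0.1,
};

const monolith = Object.values(work).reduce((total, ms) => total + ms, 0);

// Four sequential service calls: all work plus four call overheads.
function sequential(overhead) {
  return monolith + 4 * overhead;
}

// User and inventory checks run concurrently, so only the slower one and
// three call overheads remain on the critical path.
function parallelFirstTwo(overhead) {
  return (
    Math.max(work.validateUser, work.checkInventory) +
    work.processPayment +
    work.createOrder +
    3 * overhead
  );
}

// Extra latency relative to the monolith baseline, in percent.
function pctOverMonolith(latency) {
  return ((latency - monolith) / monolith) * 100;
}

for (const [label, latency] of [
  ['REST sequential', sequential(perCallOverhead.rest)],
  ['gRPC sequential', sequential(perCallOverhead.grpc)],
  ['gRPC parallel', parallelFirstTwo(perCallOverhead.grpc)],
]) {
  console.log(`${label}: ${latency.toFixed(1)} ms (+${pctOverMonolith(latency).toFixed(1)}%)`);
}
```

Changing `work.processPayment` shows how quickly the relative overhead grows when the business logic itself is fast: the fixed per-call cost matters most for chatty, lightweight calls.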

3.2 Service Mesh Sidecar Latency

When a service mesh (Istio/Linkerd) is added, every request passes through 2 sidecar proxies (source + destination):

$$\text{Overhead}_{\text{sidecar}} = T_{\text{source\_proxy}} + T_{\text{dest\_proxy}}$$

Istio (Envoy proxy):

$$\text{Overhead}_{\text{Istio}} \approx 0.5\,\text{ms} + 0.5\,\text{ms} = 1.0\,\text{ms per hop}$$

Linkerd (linkerd2-proxy):

$$\text{Overhead}_{\text{Linkerd}} \approx 0.3\,\text{ms} + 0.3\,\text{ms} = 0.6\,\text{ms per hop}$$

For the checkout flow (4 hops, gRPC + Istio):

$$\text{Latency}_{\text{mesh}} = 60\,\text{ms} + (4 \times 0.9\,\text{ms}) + (4 \times 1.0\,\text{ms}) = 67.6\,\text{ms}$$

$$\text{TotalOverhead}_{\text{mesh}} = \frac{7.6\,\text{ms}}{60\,\text{ms}} \times 100\% \approx 12.7\%$$

3.3 Bandwidth overhead: Microservices

Every service call carries HTTP headers, TLS overhead, and a serialization envelope:

$$\text{Overhead}_{\text{bandwidth\_per\_call}} \approx 500\,\text{bytes (headers)} + 200\,\text{bytes (TLS)} + 100\,\text{bytes (envelope)}$$

$$= 800\,\text{bytes/call}$$

If the system handles 10,000 requests/s and each request averages 5 internal service calls:

$$\text{Internal calls} = 10{,}000 \times 5 = 50{,}000\ \text{internal calls/s}$$

$$\text{Bandwidth}_{\text{overhead}} = 50{,}000 \times 800\,\text{bytes} = 40\,\text{MB/s} = 320\,\text{Mbps}$$

Observation: 320 Mbps of overhead for internal communication alone. This is why gRPC (binary framing, smaller compressed headers) matters at large scale. A service mesh adds a little to this overhead rather than reducing it, as the summary table shows.
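The same back-of-the-envelope calculation can be wrapped in a small helper so other traffic levels are easy to try. This is a sketch using this section's assumed figures; `internalBandwidth` is a name invented for this example.

```javascript
// Estimate bandwidth spent purely on per-call overhead (headers, TLS, envelope).
// Inputs are the illustrative figures from this section, not measurements.
function internalBandwidth({ requestsPerSec, callsPerRequest, overheadBytesPerCall }) {
  const callsPerSec = requestsPerSec * callsPerRequest;
  const bytesPerSec = callsPerSec * overheadBytesPerCall;
  return {
    callsPerSec,
    megabytesPerSec: bytesPerSec / 1e6,       // MB/s (decimal megabytes)
    megabitsPerSec: (bytesPerSec * 8) / 1e6,  // Mbps
  };
}

const restEstimate = internalBandwidth({
  requestsPerSec: 10000,
  callsPerRequest: 5,
  overheadBytesPerCall: 800,
});
const grpcEstimate = internalBandwidth({
  requestsPerSec: 10000,
  callsPerRequest: 5,
  overheadBytesPerCall: 400, // the ~400 bytes/call gRPC figure used in the summary table
});

console.log('REST-style overhead:', restEstimate);
console.log('gRPC-style overhead:', grpcEstimate);
```

Halving the per-call overhead halves the internal bandwidth bill, which is the quantitative core of the argument for compact wire formats at this call volume.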

3.4 Tổng hợp

Metric	Monolith	Microservices (REST)	Microservices (gRPC)	gRPC + Service Mesh
Checkout latency	60ms	66.8ms	63.6ms	67.6ms
Overhead	0%	11.3%	6.0%	12.7%
Bandwidth/call	0	~800 bytes	~400 bytes	~500 bytes

4. Security: Securing Microservices

4.1 Service-to-Service Authentication (mTLS)

In a monolith, functions call each other inside the same process, so they are implicitly trusted. In microservices, every network call can be intercepted, spoofed, or tampered with.

mTLS (mutual TLS): both the client and the server verify each other's certificate.

Service A (client)                           Service B (server)
     │                                              │
     │──── ClientHello ────────────────────────────►│
     │◄─── ServerHello + Server Certificate ────────│
     │◄─── CertificateRequest ─────────────────────│
     │──── Client Certificate ─────────────────────►│
     │                                              │
     │  (Both sides verify each other's cert)       │
     │                                              │
     │◄════ Encrypted channel established ════════►│

Without mTLS: anyone with network access can impersonate Service A and call Service B.
With mTLS: only services holding a valid certificate (signed by a trusted CA) can communicate.

In a service mesh: mTLS is enabled automatically and transparently; application code does not need to change.

# Istio PeerAuthentication: enforce mTLS for the whole namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: ecommerce
spec:
  mtls:
    mode: STRICT   # Accept only mTLS; reject plaintext

4.2 Zero Trust Networking

Traditional model: "castle and moat", trusting everything inside the network perimeter.
Zero Trust: "never trust, always verify". Every request must be authenticated and authorized, regardless of where it comes from.

Zero Trust principles for microservices:

Verify identity: each service has its own identity (certificate, service account)
Least privilege: Service A may call only the endpoints of Service B that it actually needs
Encrypt everywhere: all communication is encrypted (mTLS)
Verify explicitly: token-based auth on every request; do not cache trust decisions for long
Assume breach: design so that when one service is compromised, the blast radius is as small as possible

# Istio AuthorizationPolicy: only Order Service may call Payment Service
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: payment-service-policy
  namespace: ecommerce
spec:
  selector:
    matchLabels:
      app: payment-service
  action: ALLOW
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/ecommerce/sa/order-service"]
      to:
        - operation:
            methods: ["POST"]
            paths: ["/api/v1/payments", "/api/v1/refunds"]
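To make the least-privilege rule concrete, here is a toy evaluator that mirrors the shape of the policy above. It is a teaching sketch only: Istio enforces policies inside the Envoy sidecar, not with code like this, and the `isAllowed` function and the flattened rule format are assumptions made for illustration (exact-match paths only, no wildcards).

```javascript
// Toy model of an ALLOW policy: a request passes only if some rule matches
// its caller identity, HTTP method, and path. Anything else is denied.
const paymentPolicy = {
  action: 'ALLOW',
  rules: [
    {
      principals: ['cluster.local/ns/ecommerce/sa/order-service'],
      methods: ['POST'],
      paths: ['/api/v1/payments', '/api/v1/refunds'],
    },
  ],
};

function isAllowed(policy, request) {
  return policy.rules.some(
    (rule) =>
      rule.principals.includes(request.principal) &&
      rule.methods.includes(request.method) &&
      rule.paths.includes(request.path)
  );
}

const fromOrderService = {
  principal: 'cluster.local/ns/ecommerce/sa/order-service',
  method: 'POST',
  path: '/api/v1/payments',
};
const fromUserService = {
  principal: 'cluster.local/ns/ecommerce/sa/user-service',
  method: 'POST',
  path: '/api/v1/payments',
};

console.log('order-service POST /payments:', isAllowed(paymentPolicy, fromOrderService));
console.log('user-service POST /payments:', isAllowed(paymentPolicy, fromUserService));
```

The important property is the default: once an ALLOW policy selects a workload, a request that matches no rule is rejected, so a compromised User Service cannot reach the payment endpoints even though it sits on the same network.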

4.3 Secrets Management (HashiCorp Vault)

Microservices need many secrets: database passwords, API keys, certificates, encryption keys. Never hardcode secrets in code or in environment variables.

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Order Service │     │Payment Service│     │ User Service  │
│              │     │              │     │              │
│  vault-agent │     │  vault-agent │     │  vault-agent │
│  (sidecar)   │     │  (sidecar)   │     │  (sidecar)   │
└──────┬───────┘     └──────┬───────┘     └──────┬───────┘
       │                    │                    │
       │    Authenticate    │                    │
       │    (K8s SA Token)  │                    │
       └────────────────────┼────────────────────┘
                            │
                     ┌──────▼──────┐
                     │ HashiCorp   │
                     │   Vault     │
                     │             │
                     │ Secrets:    │
                     │ - DB creds  │
                     │ - API keys  │
                     │ - TLS certs │
                     └─────────────┘

Vault features for microservices:

Dynamic secrets: Vault issues short-lived DB credentials and rotates them automatically, so a leaked secret expires on its own
PKI (Certificate Authority): Vault issues certificates for mTLS and supports automatic renewal
Transit engine: encryption as a service; services send data to Vault for encryption and never hold the encryption keys
Audit logging: every secret access is logged

# Vault Agent Injector annotations on a Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
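spec:
  selector:                      # Required by apps/v1; must match the template labels
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
      annotations: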
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/role: "order-service"
        vault.hashicorp.com/agent-inject-secret-db-creds: "database/creds/order-db"
        vault.hashicorp.com/agent-inject-template-db-creds: |
          {{- with secret "database/creds/order-db" -}}
          DB_HOST=order-db.ecommerce.svc
          DB_USER={{ .Data.username }}
          DB_PASS={{ .Data.password }}
          {{- end -}}
    spec:
      serviceAccountName: order-service
      containers:
        - name: order-service
          image: myrepo/order-service:v1.2.3
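On the application side, the service then reads the rendered file instead of an environment variable. The sketch below assumes the injector's default behavior of writing the secret named `db-creds` to `/vault/secrets/db-creds`, and assumes the `KEY=VALUE` layout produced by the template above; `parseEnvFile` and `loadDbCreds` are hypothetical helper names.

```javascript
// Sketch: load DB credentials from the file rendered by the Vault Agent sidecar.
const fs = require('fs');

// Parse simple KEY=VALUE lines; blank lines and '#' comments are skipped.
// Only the first '=' splits the pair, so values may themselves contain '='.
function parseEnvFile(text) {
  const result = {};
  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (!trimmed || trimmed.startsWith('#')) continue;
    const separator = trimmed.indexOf('=');
    if (separator === -1) continue;
    result[trimmed.slice(0, separator)] = trimmed.slice(separator + 1);
  }
  return result;
}

// Assumed default path for the secret named "db-creds".
function loadDbCreds(path = '/vault/secrets/db-creds') {
  return parseEnvFile(fs.readFileSync(path, 'utf8'));
}

// Example with inline text (no Vault needed to try the parser):
const sample = 'DB_HOST=order-db.ecommerce.svc\nDB_USER=v-k8s-order-abc\nDB_PASS=s3cr=t';
console.log(Object.keys(parseEnvFile(sample)));
```

Because dynamic credentials expire, a long-running service would re-read the file when a connection attempt is rejected rather than loading it once at startup.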

4.4 Network Policies (Kubernetes)

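Kubernetes default: every pod can talk to every other pod, which is far too permissive. Network Policies restrict that traffic (they take effect only if the cluster's network plugin enforces them).

# Allow only Order Service and API Gateway to call Payment Service on port 8080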
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payment-service-ingress
  namespace: ecommerce
spec:
  podSelector:
    matchLabels:
      app: payment-service
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: order-service
        - podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 8080

# Deny all by default: every permitted flow must be declared explicitly
# (this also blocks DNS egress, so add an explicit allow rule for cluster DNS)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ecommerce
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

4.5 API Gateway Security

API Gateway is the first line of defense:

Security Feature	Description
Authentication	Verify the JWT before forwarding the request to services
Rate Limiting	Block abuse and DDoS at the edge → Tuan-09-Rate-Limiter
IP Whitelisting	Allow only known IPs for admin endpoints
WAF (Web Application Firewall)	Block SQL injection, XSS, and other common attacks
Request Validation	Validate the schema before forwarding (reject malformed requests)
CORS	Control cross-origin requests
Bot Detection	Fingerprinting, CAPTCHA challenges
Response Sanitization	Strip internal headers and stack traces before responding to the client

# Kong API Gateway — security plugins
plugins:
  - name: jwt
    config:
      secret_is_base64: false
      claims_to_verify:
        - exp
 
  - name: rate-limiting
    config:
      minute: 100
      hour: 10000
      policy: redis
 
  - name: ip-restriction
    config:
      allow:
        - 10.0.0.0/8       # Internal
        - 203.0.113.0/24    # Office IP
 
  - name: bot-detection
    config:
      deny:
        - "Scrapy"
        - "curl"
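For intuition about what a setting like `minute: 100` means, here is a minimal fixed-window counter, one common way to implement a per-minute limit. It is an in-memory sketch for a single process with made-up names (`FixedWindowLimiter`, `allow`); it does not claim to reproduce Kong's internals, and a shared store such as Redis would be needed to enforce the limit across gateway instances.

```javascript
// Fixed-window rate limiter: at most `limit` requests per key per window.
class FixedWindowLimiter {
  constructor(limit, windowMs, now = () => Date.now()) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now;            // injectable clock, handy for tests
    this.counters = new Map(); // key -> { windowStart, count }
  }

  allow(key) {
    // Align the window to fixed boundaries (e.g., the start of each minute).
    const windowStart = Math.floor(this.now() / this.windowMs) * this.windowMs;
    const entry = this.counters.get(key);
    if (!entry || entry.windowStart !== windowStart) {
      this.counters.set(key, { windowStart, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}

// Usage: 100 requests per minute per client.
const limiter = new FixedWindowLimiter(100, 60 * 1000);
console.log('first request allowed:', limiter.allow('client-a'));
```

A known weakness of fixed windows is the burst at the boundary: a client can send 100 requests at the end of one minute and 100 more at the start of the next. Sliding-window or token-bucket variants smooth that out at the cost of a little more state.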

5. DevOps: Operating Microservices

5.1 Kubernetes Basics for Microservices

Kubernetes (K8s) is the standard orchestration platform for microservices:

┌──────────────────────────── Kubernetes Cluster ─────────────────────────────┐
│                                                                             │
│  ┌─────────── Namespace: ecommerce ───────────────────────────────────┐    │
│  │                                                                     │    │
│  │  ┌─ Deployment: order-service ─┐  ┌─ Deployment: payment-service ─┐│    │
│  │  │  ┌─Pod─┐ ┌─Pod─┐ ┌─Pod─┐   │  │  ┌─Pod─┐ ┌─Pod─┐            ││    │
│  │  │  │ App │ │ App │ │ App │   │  │  │ App │ │ App │            ││    │
│  │  │  │Envoy│ │Envoy│ │Envoy│   │  │  │Envoy│ │Envoy│            ││    │
│  │  │  └─────┘ └─────┘ └─────┘   │  │  └─────┘ └─────┘            ││    │
│  │  └────────────────────────────┘  └────────────────────────────────┘│    │
│  │                                                                     │    │
│  │  ┌─ Service: order-svc (ClusterIP) ─┐                              │    │
│  │  │  Selector: app=order-service      │                              │    │
│  │  │  Port: 8080                       │                              │    │
│  │  └──────────────────────────────────┘                              │    │
│  │                                                                     │    │
│  │  ┌─ Ingress Controller (nginx/traefik) ─┐                          │    │
│  │  │  /api/orders → order-svc:8080         │                          │    │
│  │  │  /api/payments → payment-svc:8080     │                          │    │
│  │  └──────────────────────────────────────┘                          │    │
│  └─────────────────────────────────────────────────────────────────────┘    │
│                                                                             │
│  ┌─ Namespace: monitoring ────────────┐                                    │
│  │  Prometheus, Grafana, Jaeger        │                                    │
│  └────────────────────────────────────┘                                    │
└─────────────────────────────────────────────────────────────────────────────┘

Core Kubernetes objects for microservices:

Object	Role	Example
Pod	Smallest deployable unit; holds 1+ containers	App container + sidecar
Deployment	Manages replicas and rolling updates	3 replicas of order-service
Service	Stable networking endpoint	ClusterIP, LoadBalancer
Ingress	HTTP routing from outside the cluster	Path-based routing
ConfigMap	Non-sensitive config	Feature flags, URLs
Secret	Sensitive config (base64-encoded only, not encrypted by default)	DB passwords (prefer Vault)
HPA	Auto-scaling based on metrics	Scale when CPU > 70%
PDB	Preserves availability during maintenance	Min 2 pods available
NetworkPolicy	Firewall between pods	Restrict inter-service traffic
Namespace	Logical isolation	ecommerce, monitoring, staging

5.2 Helm Charts: Package Manager for Kubernetes

Helm templates and manages Kubernetes manifests:

microservices-chart/
├── Chart.yaml
├── values.yaml                    # Default values
├── values-staging.yaml            # Overrides for staging
├── values-production.yaml         # Overrides for production
└── templates/
    ├── deployment.yaml
    ├── service.yaml
    ├── hpa.yaml
    ├── ingress.yaml
    ├── networkpolicy.yaml
    └── _helpers.tpl

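# values.yaml: shared defaults for every microservice
replicaCount: 2
 
image:
  repository: ""   # Set per service (e.g., --set image.repository=myrepo/order-service); Helm does not render templates inside values.yaml
  tag: "latest"    # Placeholder; pin an explicit version per release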
  pullPolicy: IfNotPresent
 
service:
  type: ClusterIP
  port: 8080
 
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
 
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
 
vault:
  enabled: true
  role: ""  # Override per service

# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "chart.fullname" . }}
  labels:
    {{- include "chart.labels" . | nindent 4 }}
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      {{- include "chart.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "chart.selectorLabels" . | nindent 8 }}
      annotations:
        {{- if .Values.vault.enabled }}
        vault.hashicorp.com/agent-inject: "true"
        vault.hashicorp.com/role: {{ .Values.vault.role | quote }}
        {{- end }}
    spec:
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          ports:
            - containerPort: {{ .Values.service.port }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          livenessProbe:
            httpGet:
              path: /healthz
              port: {{ .Values.service.port }}
            initialDelaySeconds: 15
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /readyz
              port: {{ .Values.service.port }}
            initialDelaySeconds: 5
            periodSeconds: 5
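The probes in this template expect the service to answer on `/healthz` and `/readyz`. A minimal Node.js handler for those two paths could look like the sketch below. The `createHealthHandler` name and the `isReady` callback are assumptions for illustration; what "ready" means (DB connected, caches warm, and so on) is up to each service.

```javascript
// Sketch of liveness/readiness endpoints for the probes above.
// Liveness: the process is up and able to answer.
// Readiness: the service can currently take traffic (dependencies reachable).
function createHealthHandler({ isReady }) {
  return (req, res) => {
    if (req.url === '/healthz') {
      res.writeHead(200, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ status: 'ok' }));
      return;
    }
    if (req.url === '/readyz') {
      const ready = Boolean(isReady());
      // 503 tells Kubernetes to stop routing traffic to this pod without restarting it.
      res.writeHead(ready ? 200 : 503, { 'Content-Type': 'application/json' });
      res.end(JSON.stringify({ status: ready ? 'ready' : 'not-ready' }));
      return;
    }
    res.writeHead(404);
    res.end();
  };
}

// Usage (left commented out so this snippet has no side effects):
// const http = require('http');
// let dbConnected = false;
// http.createServer(createHealthHandler({ isReady: () => dbConnected })).listen(8080);
```

Keeping the two endpoints separate matters: a failing liveness probe restarts the container, while a failing readiness probe only removes the pod from the Service endpoints until it recovers.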

# Deploy each service
helm install order-service ./microservices-chart \
  --set image.repository=myrepo/order-service \
  --set image.tag=v1.2.3 \
  --set vault.role=order-service \
  -f values-production.yaml
 
helm install payment-service ./microservices-chart \
  --set image.repository=myrepo/payment-service \
  --set image.tag=v2.0.1 \
  --set vault.role=payment-service \
  -f values-production.yaml

5.3 Service Mesh Setup (Istio)

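# Install Istio (the demo profile is meant for evaluation; pick a production-appropriate profile for real clusters)
istioctl install --set profile=demo -y
 
# Enable sidecar injection for the namespace
kubectl label namespace ecommerce istio-injection=enabled
 
# Verify sidecar injection
kubectl get pods -n ecommerce
# Each pod should show 2/2 containers (app + Envoy sidecar)

# Istio VirtualService: traffic management
# Note: subsets v1/v2 must be defined in a DestinationRule for host order-service (not shown here);
# the DestinationRule further down targets payment-service.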
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: order-service
  namespace: ecommerce
spec:
  hosts:
    - order-service
  http:
    - route:
        - destination:
            host: order-service
            subset: v1
          weight: 90
        - destination:
            host: order-service
            subset: v2
          weight: 10       # Canary: 10% of traffic goes to v2
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: gateway-error,connect-failure,refused-stream
      timeout: 10s
---
# DestinationRule — Circuit Breaker config
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: payment-service
  namespace: ecommerce
spec:
  host: payment-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: DEFAULT
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
    outlierDetection:                    # Circuit breaker (per-host ejection)
      consecutive5xxErrors: 5            # Eject a host after 5 consecutive 5xx errors
      interval: 10s                      # Ejection analysis sweep runs every 10s
      baseEjectionTime: 30s              # Minimum ejection duration: 30s
      maxEjectionPercent: 50             # Eject at most 50% of hosts
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2

5.4 Distributed Tracing with Jaeger

# Jaeger deployment on Kubernetes (the all-in-one image keeps traces in memory by default; suitable for dev/test)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.53
          ports:
            - containerPort: 16686   # Jaeger UI
            - containerPort: 14268   # Collector HTTP
            - containerPort: 4317    # OTLP gRPC
            - containerPort: 4318    # OTLP HTTP
          env:
            - name: COLLECTOR_OTLP_ENABLED
              value: "true"

# Istio Telemetry: send traces to Jaeger
# The provider name must match an extensionProviders entry in the mesh config
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
    - providers:
        - name: jaeger
      randomSamplingPercentage: 10    # Sample 10% traffic (production)

5.5 Centralized Logging (EFK Stack)

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Order Service │    │Payment Service│    │ User Service  │
│  (stdout/err) │    │  (stdout/err) │    │  (stdout/err) │
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                    │                    │
       └────────────────────┼────────────────────┘
                            │
                    ┌───────▼────────┐
                    │   Fluent Bit    │  (DaemonSet, runs on every node)
                    │  - Parse JSON   │
                    │  - Add metadata │
                    │    (pod, ns)    │
                    └───────┬────────┘
                            │
                    ┌───────▼────────┐
                    │ Elasticsearch   │  (or OpenSearch, Loki)
                    │  - Index logs   │
                    │  - Full-text    │
                    │    search       │
                    └───────┬────────┘
                            │
                    ┌───────▼────────┐
                    │    Kibana       │  (or Grafana)
                    │  - Dashboards   │
                    │  - Alerts       │
                    │  - Search       │
                    └────────────────┘

Structured logging is mandatory in microservices:

{
  "timestamp": "2026-03-18T10:30:45.123Z",
  "level": "INFO",
  "service": "order-service",
  "traceId": "abc123def456",
  "spanId": "span789",
  "userId": "user-42",
  "orderId": "order-1001",
  "message": "Order created successfully",
  "duration_ms": 45,
  "metadata": {
    "items_count": 3,
    "total_amount": 150.00
  }
}

Important: traceId must be propagated across all services so that logs can be correlated with traces in Jaeger.
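A minimal sketch of what that looks like in a Node.js service: a logger that always emits one JSON object per line, plus helpers that reuse an incoming trace id or start a new one. The helper names (`createLogger`, `traceIdFrom`, `outgoingHeaders`) are invented for this example, and the `X-Trace-Id` header is assumed here because the code in section 6.2 uses it; OpenTelemetry instrumentation would normally propagate the W3C `traceparent` header instead.

```javascript
// Sketch: structured JSON logging with trace id propagation.
const crypto = require('crypto');

// Returns a log function bound to one service name.
// `write` is injectable so output can be captured in tests.
function createLogger(service, write = (line) => console.log(line)) {
  return function log(level, message, context = {}) {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      service,
      ...context,
      message,
    };
    write(JSON.stringify(entry));
    return entry;
  };
}

// Reuse the caller's trace id if present, otherwise start a new trace.
// Node's http module lowercases incoming header names.
function traceIdFrom(headers) {
  return headers['x-trace-id'] || crypto.randomBytes(16).toString('hex');
}

// Headers to attach to every downstream call.
function outgoingHeaders(traceId) {
  return { 'X-Trace-Id': traceId };
}

const log = createLogger('order-service');
const traceId = traceIdFrom({ 'x-trace-id': 'abc123def456' });
log('INFO', 'Order created successfully', { traceId, orderId: 'order-1001', duration_ms: 45 });
```

With every log line carrying the same `traceId`, a single search in Kibana or Grafana returns the full story of one request across Order, Payment, and User services.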

6. Code Examples

6.1 Docker Compose: Multi-Service Setup with API Gateway

# docker-compose.yml
# E-commerce microservices with API Gateway
version: "3.8"
 
services:
  # ─── API Gateway (Kong) ───
  kong-database:
    image: postgres:15-alpine
    environment:
      POSTGRES_USER: kong
      POSTGRES_DB: kong
      POSTGRES_PASSWORD: kong_password
    volumes:
      - kong-db-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U kong"]
      interval: 10s
      timeout: 5s
      retries: 5
 
  kong-migration:
    image: kong:3.5
    command: kong migrations bootstrap
    depends_on:
      kong-database:
        condition: service_healthy
    environment:
      KONG_DATABASE: postgres
      KONG_PG_HOST: kong-database
      KONG_PG_USER: kong
      KONG_PG_PASSWORD: kong_password
 
  api-gateway:
    image: kong:3.5
    depends_on:
      kong-migration:
        condition: service_completed_successfully
    environment:
      KONG_DATABASE: postgres
      KONG_PG_HOST: kong-database
      KONG_PG_USER: kong
      KONG_PG_PASSWORD: kong_password
      KONG_PROXY_LISTEN: 0.0.0.0:8000
      KONG_ADMIN_LISTEN: 0.0.0.0:8001
      KONG_LOG_LEVEL: info
    ports:
      - "8000:8000"   # Proxy
      - "8001:8001"   # Admin API
    healthcheck:
      test: ["CMD", "kong", "health"]
      interval: 10s
      timeout: 5s
      retries: 5
 
  # ─── User Service ───
  user-service:
    build: ./services/user-service
    environment:
      PORT: 3001
      DB_HOST: user-db
      DB_NAME: users
      NODE_ENV: production
    depends_on:
      user-db:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3001/healthz"]
      interval: 10s
      timeout: 3s
      retries: 3
 
  user-db:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: users
      POSTGRES_USER: user_svc
      POSTGRES_PASSWORD: user_password
    volumes:
      - user-db-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U user_svc"]
      interval: 10s
      timeout: 5s
      retries: 5
 
  # ─── Order Service ───
  order-service:
    build: ./services/order-service
    environment:
      PORT: 3002
      DB_HOST: order-db
      DB_NAME: orders
      PAYMENT_SERVICE_URL: http://payment-service:3003
      USER_SERVICE_URL: http://user-service:3001
      RABBITMQ_URL: amqp://rabbit:rabbit_password@rabbitmq:5672   # Credentials must match RABBITMQ_DEFAULT_USER/PASS below
      NODE_ENV: production
    depends_on:
      order-db:
        condition: service_healthy
      rabbitmq:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3002/healthz"]
      interval: 10s
      timeout: 3s
      retries: 3
 
  order-db:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: orders
      POSTGRES_USER: order_svc
      POSTGRES_PASSWORD: order_password
    volumes:
      - order-db-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U order_svc"]
      interval: 10s
      timeout: 5s
      retries: 5
 
  # ─── Payment Service ───
  payment-service:
    build: ./services/payment-service
    environment:
      PORT: 3003
      DB_HOST: payment-db
      DB_NAME: payments
      RABBITMQ_URL: amqp://rabbit:rabbit_password@rabbitmq:5672   # Credentials must match RABBITMQ_DEFAULT_USER/PASS below
      NODE_ENV: production
    depends_on:
      payment-db:
        condition: service_healthy
      rabbitmq:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3003/healthz"]
      interval: 10s
      timeout: 3s
      retries: 3
 
  payment-db:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: payments
      POSTGRES_USER: payment_svc
      POSTGRES_PASSWORD: payment_password
    volumes:
      - payment-db-data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U payment_svc"]
      interval: 10s
      timeout: 5s
      retries: 5
 
  # ─── Message Broker ───
  rabbitmq:
    image: rabbitmq:3.13-management-alpine
    ports:
      - "15672:15672"   # Management UI
    environment:
      RABBITMQ_DEFAULT_USER: rabbit
      RABBITMQ_DEFAULT_PASS: rabbit_password
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "check_running"]
      interval: 10s
      timeout: 5s
      retries: 5
 
  # ─── Observability ───
  jaeger:
    image: jaegertracing/all-in-one:1.53
    ports:
      - "16686:16686"   # Jaeger UI
      - "4318:4318"     # OTLP HTTP
    environment:
      COLLECTOR_OTLP_ENABLED: "true"
 
volumes:
  kong-db-data:
  user-db-data:
  order-db-data:
  payment-db-data:

6.2 Circuit Breaker in Node.js (Opossum)

// services/order-service/lib/circuit-breaker.js
const CircuitBreaker = require('opossum');
const axios = require('axios');
 
/**
 * Factory that creates a circuit breaker for each downstream service.
 * One breaker per service → Bulkhead pattern.
 */
function createServiceBreaker(serviceName, baseURL, options = {}) {
  const defaultOptions = {
    timeout: 3000,             // 3s timeout per request
    errorThresholdPercentage: 50, // Open the circuit when 50% of requests fail
    resetTimeout: 30000,       // Try again after 30s (HALF-OPEN)
    rollingCountTimeout: 10000,// 10s rolling window for the error %
    rollingCountBuckets: 10,   // 10 buckets x 1s
    volumeThreshold: 5,        // Need at least 5 requests before the % is evaluated
    ...options,
  };
 
  // Function wrapped by the circuit breaker
  async function makeRequest({ method = 'GET', path = '/', data = null, headers = {} }) {
    const response = await axios({
      method,
      url: `${baseURL}${path}`,
      data,
      headers: {
        'Content-Type': 'application/json',
        ...headers,
      },
      timeout: defaultOptions.timeout,
    });
    return response.data;
  }
 
  const breaker = new CircuitBreaker(makeRequest, defaultOptions);
 
  // ─── Event handlers for observability ───
  breaker.on('success', (result) => {
    console.log(JSON.stringify({
      level: 'DEBUG',
      service: serviceName,
      event: 'circuit_breaker_success',
      state: breaker.status.stats,
    }));
  });
 
  breaker.on('timeout', () => {
    console.log(JSON.stringify({
      level: 'WARN',
      service: serviceName,
      event: 'circuit_breaker_timeout',
      message: `Request to ${serviceName} timed out after ${defaultOptions.timeout}ms`,
    }));
  });
 
  breaker.on('reject', () => {
    console.log(JSON.stringify({
      level: 'WARN',
      service: serviceName,
      event: 'circuit_breaker_rejected',
      message: `Circuit is OPEN for ${serviceName}. Request rejected.`,
    }));
  });
 
  breaker.on('open', () => {
    console.log(JSON.stringify({
      level: 'ERROR',
      service: serviceName,
      event: 'circuit_breaker_opened',
      message: `Circuit OPENED for ${serviceName}! Too many failures.`,
    }));
  });
 
  breaker.on('halfOpen', () => {
    console.log(JSON.stringify({
      level: 'INFO',
      service: serviceName,
      event: 'circuit_breaker_half_open',
      message: `Circuit HALF-OPEN for ${serviceName}. Testing with probe request...`,
    }));
  });
 
  breaker.on('close', () => {
    console.log(JSON.stringify({
      level: 'INFO',
      service: serviceName,
      event: 'circuit_breaker_closed',
      message: `Circuit CLOSED for ${serviceName}. Service recovered.`,
    }));
  });
 
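  // ─── Fallback: default response when the call fails, times out, or the circuit is OPEN ───
  // Callers must check `fallback: true`; a payment fallback must never be treated as a successful charge.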
  breaker.fallback((request, error) => {
    console.log(JSON.stringify({
      level: 'WARN',
      service: serviceName,
      event: 'circuit_breaker_fallback',
      error: error.message,
      message: `Fallback activated for ${serviceName}`,
    }));
 
    return {
      fallback: true,
      service: serviceName,
      message: `${serviceName} is temporarily unavailable. Using fallback response.`,
      timestamp: new Date().toISOString(),
    };
  });
 
  return breaker;
}
 
// ─── Create a breaker for each downstream service ───
const paymentBreaker = createServiceBreaker(
  'payment-service',
  process.env.PAYMENT_SERVICE_URL || 'http://payment-service:3003',
  { timeout: 5000, resetTimeout: 60000 } // Payment needs a longer timeout
);
 
const userBreaker = createServiceBreaker(
  'user-service',
  process.env.USER_SERVICE_URL || 'http://user-service:3001',
  { timeout: 2000 } // User service should be fast
);
 
// ─── Usage ───
async function getUser(userId, traceId) {
  return userBreaker.fire({
    method: 'GET',
    path: `/api/users/${userId}`,
    headers: { 'X-Trace-Id': traceId },
  });
}
 
async function processPayment(paymentData, traceId) {
  return paymentBreaker.fire({
    method: 'POST',
    path: '/api/payments',
    data: paymentData,
    headers: { 'X-Trace-Id': traceId },
  });
}
 
// ─── Health check endpoint exposes circuit states ───
function getCircuitHealth() {
  return {
    payment: {
      state: paymentBreaker.opened ? 'OPEN' : paymentBreaker.halfOpen ? 'HALF-OPEN' : 'CLOSED',
      stats: paymentBreaker.status.stats,
    },
    user: {
      state: userBreaker.opened ? 'OPEN' : userBreaker.halfOpen ? 'HALF-OPEN' : 'CLOSED',
      stats: userBreaker.status.stats,
    },
  };
}
 
module.exports = { getUser, processPayment, getCircuitHealth };

6.3 Saga Pattern — Orchestration Example (Node.js)

// services/order-service/lib/saga-orchestrator.js
const amqp = require('amqplib');
const { processPayment } = require('./circuit-breaker');
 
/**
 * Saga Orchestrator cho Order Creation flow.
 *
 * Steps:
 *   1. Create Order (local DB)
 *   2. Reserve Inventory (async via queue)
 *   3. Process Payment (sync via circuit breaker)
 *   4. Confirm Order (local DB)
 *
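 * Compensating Transactions (run in reverse order of the completed steps):
 *   4 fails → Refund Payment → Release Inventory → Cancel Order
 *   3 fails → Release Inventory → Cancel Order
 *   2 fails → Cancel Order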
 */
 
// ─── Saga State Machine ───
const SAGA_STATES = {
  STARTED: 'STARTED',
  INVENTORY_RESERVED: 'INVENTORY_RESERVED',
  PAYMENT_PROCESSED: 'PAYMENT_PROCESSED',
  COMPLETED: 'COMPLETED',
  COMPENSATING: 'COMPENSATING',
  FAILED: 'FAILED',
};
 
class OrderSaga {
  constructor(orderData, { orderRepo, channel, logger }) {
    this.orderData = orderData;
    this.orderRepo = orderRepo;
    this.channel = channel;       // RabbitMQ channel
    this.logger = logger;
    this.sagaLog = [];             // Audit trail
    this.state = SAGA_STATES.STARTED;
    this.orderId = null;
    this.compensations = [];       // Stack of compensation functions
  }
 
  log(step, status, details = {}) {
    const entry = {
      timestamp: new Date().toISOString(),
      sagaId: this.orderId,
      step,
      status,
      state: this.state,
      ...details,
    };
    this.sagaLog.push(entry);
    this.logger.info(JSON.stringify(entry));
  }
 
  async execute() {
    try {
      // Step 1: Create Order (PENDING status)
      await this.step_createOrder();
 
      // Step 2: Reserve Inventory
      await this.step_reserveInventory();
 
      // Step 3: Process Payment
      await this.step_processPayment();
 
      // Step 4: Confirm Order
      await this.step_confirmOrder();
 
      this.state = SAGA_STATES.COMPLETED;
      this.log('saga', 'COMPLETED', { orderId: this.orderId });
 
      return { success: true, orderId: this.orderId, sagaLog: this.sagaLog };
 
    } catch (error) {
      this.log('saga', 'FAILED', { error: error.message });
      this.state = SAGA_STATES.COMPENSATING;
 
      // Execute compensating transactions in reverse order
      await this.compensate();
 
      return { success: false, orderId: this.orderId, error: error.message, sagaLog: this.sagaLog };
    }
  }
 
  // ─── Step 1: Create Order ───
  async step_createOrder() {
    this.log('create_order', 'EXECUTING');
 
    const order = await this.orderRepo.create({
      ...this.orderData,
      status: 'PENDING',
      createdAt: new Date(),
    });
    this.orderId = order.id;
 
    // Push compensation: cancel order
    this.compensations.push(async () => {
      this.log('create_order', 'COMPENSATING');
      await this.orderRepo.updateStatus(this.orderId, 'CANCELLED');
      this.log('create_order', 'COMPENSATED');
    });
 
    this.log('create_order', 'SUCCESS', { orderId: this.orderId });
  }
 
  // ─── Step 2: Reserve Inventory ───
  async step_reserveInventory() {
    this.log('reserve_inventory', 'EXECUTING');
 
    // Publish to inventory queue and wait for response
    const result = await this.publishAndWait(
      'inventory.reserve',
      {
        orderId: this.orderId,
        items: this.orderData.items,
      },
      'inventory.reserve.result',
      10000  // 10s timeout
    );
 
    if (!result.success) {
      throw new Error(`Inventory reservation failed: ${result.reason}`);
    }
 
    this.state = SAGA_STATES.INVENTORY_RESERVED;
 
    // Push compensation: release inventory
    this.compensations.push(async () => {
      this.log('reserve_inventory', 'COMPENSATING');
      await this.publishMessage('inventory.release', {
        orderId: this.orderId,
        items: this.orderData.items,
      });
      this.log('reserve_inventory', 'COMPENSATED');
    });
 
    this.log('reserve_inventory', 'SUCCESS', { reservationId: result.reservationId });
  }
 
  // ─── Step 3: Process Payment ───
  async step_processPayment() {
    this.log('process_payment', 'EXECUTING');
 
    // Sync call via circuit breaker
    const paymentResult = await processPayment(
      {
        orderId: this.orderId,
        amount: this.orderData.totalAmount,
        currency: this.orderData.currency || 'VND',
        userId: this.orderData.userId,
      },
      this.orderId  // traceId
    );
 
    if (paymentResult.fallback) {
      throw new Error('Payment service unavailable (circuit breaker fallback)');
    }
 
    if (!paymentResult.success) {
      throw new Error(`Payment failed: ${paymentResult.reason}`);
    }
 
    this.state = SAGA_STATES.PAYMENT_PROCESSED;
 
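    // Push compensation: refund payment
    this.compensations.push(async () => {
      this.log('process_payment', 'COMPENSATING');
      try {
        const refundResult = await processPayment(
          {
            orderId: this.orderId,
            amount: this.orderData.totalAmount,
            type: 'REFUND',
            originalTransactionId: paymentResult.transactionId,
          },
          this.orderId
        );
        // The breaker resolves with a fallback object instead of throwing, so check it explicitly
        if (refundResult.fallback || !refundResult.success) {
          throw new Error(refundResult.reason || 'Refund not confirmed by payment service');
        }
        // Log COMPENSATED only when the refund actually went through
        this.log('process_payment', 'COMPENSATED');
      } catch (refundErr) {
        // Payment refund failure is critical: alert and manual intervention needed
        this.log('process_payment', 'COMPENSATION_FAILED', { error: refundErr.message });
        // TODO: Publish to dead-letter queue for manual processing
      }
    });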
 
    this.log('process_payment', 'SUCCESS', { transactionId: paymentResult.transactionId });
  }
 
  // ─── Step 4: Confirm Order ───
  async step_confirmOrder() {
    this.log('confirm_order', 'EXECUTING');
    await this.orderRepo.updateStatus(this.orderId, 'CONFIRMED');
 
    // Publish event for downstream services (notification, analytics)
    await this.publishMessage('order.confirmed', {
      orderId: this.orderId,
      userId: this.orderData.userId,
      totalAmount: this.orderData.totalAmount,
    });
 
    this.log('confirm_order', 'SUCCESS');
  }
 
  // ─── Compensate: reverse all completed steps ───
  async compensate() {
    this.log('compensation', 'STARTED', { stepsToCompensate: this.compensations.length });
 
    // Execute compensations in reverse order (LIFO)
    while (this.compensations.length > 0) {
      const compensationFn = this.compensations.pop();
      try {
        await compensationFn();
      } catch (compError) {
        this.log('compensation', 'STEP_FAILED', { error: compError.message });
        // Continue compensating remaining steps
      }
    }
 
    this.state = SAGA_STATES.FAILED;
    this.log('compensation', 'COMPLETED');
  }
 
  // ─── Helper: Publish message to RabbitMQ ───
  async publishMessage(queue, message) {
    await this.channel.assertQueue(queue, { durable: true });
    this.channel.sendToQueue(queue, Buffer.from(JSON.stringify(message)), {
      persistent: true,
      messageId: `${this.orderId}-${Date.now()}`,
    });
  }
 
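  // ─── Helper: Publish and wait for response (RPC pattern) ───
  async publishAndWait(requestQueue, message, responseQueue, timeoutMs) {
    const correlationId = `${this.orderId}-${Date.now()}`;
    await this.channel.assertQueue(responseQueue, { durable: true });
    await this.channel.assertQueue(requestQueue, { durable: true });

    let consumerTag;
    let timer;
    try {
      // Plain (non-async) executor: every failure path goes through reject()
      return await new Promise((resolve, reject) => {
        timer = setTimeout(() => {
          reject(new Error(`Timeout waiting for response from ${requestQueue}`));
        }, timeoutMs);

        // Start listening before sending so a fast reply cannot be missed
        this.channel
          .consume(responseQueue, (msg) => {
            if (!msg) return; // consumer was cancelled by the broker
            if (msg.properties.correlationId !== correlationId) {
              // Reply for another saga: requeue it. Production code should prefer
              // an exclusive reply queue per orchestrator to avoid redelivery churn.
              this.channel.nack(msg, false, true);
              return;
            }
            this.channel.ack(msg);
            try {
              resolve(JSON.parse(msg.content.toString()));
            } catch (parseErr) {
              reject(parseErr);
            }
          })
          .then((ok) => {
            consumerTag = ok.consumerTag;
            this.channel.sendToQueue(requestQueue, Buffer.from(JSON.stringify(message)), {
              persistent: true,
              correlationId,
              replyTo: responseQueue,
            });
          })
          .catch(reject);
      });
    } finally {
      // Runs on success, timeout, and error: never leave the timer or consumer behind
      clearTimeout(timer);
      if (consumerTag) await this.channel.cancel(consumerTag);
    }
  }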
}
 
// ─── Usage in Express route ───
async function createOrder(req, res) {
  const channel = req.app.get('rabbitChannel');
  const orderRepo = req.app.get('orderRepo');
  const logger = req.app.get('logger');
 
  const saga = new OrderSaga(req.body, { orderRepo, channel, logger });
  const result = await saga.execute();
 
  if (result.success) {
    res.status(201).json({
      orderId: result.orderId,
      status: 'CONFIRMED',
      message: 'Order created successfully',
    });
  } else {
    res.status(500).json({
      orderId: result.orderId,
      status: 'FAILED',
      error: result.error,
      message: 'Order creation failed. All changes have been rolled back.',
    });
  }
}
 
module.exports = { OrderSaga, createOrder };
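
The core of the orchestrator above is the compensation stack: every completed step pushes an undo action, and a failure unwinds them in LIFO order. The sketch below isolates that idea so it can be run without RabbitMQ or a database. `runSaga` and the step objects are hypothetical names for illustration, not part of the order service code.

```javascript
// Generic saga runner sketch: each step has an `action` and an optional `compensate`.
async function runSaga(steps) {
  const completed = []; // stack of steps whose action succeeded
  const log = [];
  try {
    for (const step of steps) {
      await step.action();
      log.push(`${step.name}:done`);
      completed.push(step);
    }
    return { success: true, log };
  } catch (err) {
    // Undo in reverse (LIFO) order; keep going even if one compensation fails.
    while (completed.length > 0) {
      const step = completed.pop();
      if (!step.compensate) continue;
      try {
        await step.compensate();
        log.push(`${step.name}:compensated`);
      } catch (compErr) {
        log.push(`${step.name}:compensation_failed`);
      }
    }
    return { success: false, error: err.message, log };
  }
}

// Usage: payment fails at step 3, so steps 2 and 1 are undone in that order.
const released = [];
const steps = [
  { name: 'create_order', action: async () => {}, compensate: async () => released.push('order') },
  { name: 'reserve_inventory', action: async () => {}, compensate: async () => released.push('inventory') },
  { name: 'process_payment', action: async () => { throw new Error('card declined'); } },
];

const sagaResult = runSaga(steps); // a Promise

sagaResult.then((result) => {
  console.log(result.success); // false
  console.log(released);       // [ 'inventory', 'order' ]
});
```

The failing step itself has nothing to compensate because its action never completed; only the steps already on the stack are undone.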

7. Mermaid Diagrams

7.1 Microservices Architecture Overview

flowchart TD
    subgraph Clients
        WEB[Web App]
        MOB[Mobile App]
        IOT[IoT Device]
    end

    subgraph "Edge Layer"
        CDN[CDN / WAF]
        LB[Load Balancer]
    end

    subgraph "API Layer"
        GW[API Gateway<br/>Kong / Envoy]
        BFF_WEB[BFF Web]
        BFF_MOB[BFF Mobile]
    end

    subgraph "Service Mesh (Istio)"
        subgraph "Core Services"
            US[User Service<br/>+ Envoy Sidecar]
            OS[Order Service<br/>+ Envoy Sidecar]
            PS[Payment Service<br/>+ Envoy Sidecar]
            PRS[Product Service<br/>+ Envoy Sidecar]
            IS[Inventory Service<br/>+ Envoy Sidecar]
        end

        subgraph "Supporting Services"
            NS[Notification Service<br/>+ Envoy Sidecar]
            SS[Search Service<br/>+ Envoy Sidecar]
        end

        CP[Istiod<br/>Control Plane]
    end

    subgraph "Async Communication"
        MQ[Message Broker<br/>RabbitMQ / Kafka]
    end

    subgraph "Data Layer"
        UDB[(User DB<br/>PostgreSQL)]
        ODB[(Order DB<br/>PostgreSQL)]
        PDB[(Payment DB<br/>PostgreSQL)]
        PRDB[(Product DB<br/>MongoDB)]
        IDB[(Inventory DB<br/>Redis)]
        ES[(Elasticsearch)]
    end

    subgraph "Observability"
        PROM[Prometheus]
        GRAF[Grafana]
        JAEG[Jaeger]
        EFK[EFK Stack]
    end

    subgraph "Security"
        VAULT[HashiCorp Vault]
    end

    WEB --> CDN
    MOB --> CDN
    IOT --> CDN
    CDN --> LB
    LB --> GW
    GW --> BFF_WEB
    GW --> BFF_MOB
    BFF_WEB --> US & OS & PRS
    BFF_MOB --> US & OS & PRS
    OS -->|gRPC| PS
    OS -->|gRPC| IS
    OS -->|event| MQ
    PS -->|event| MQ
    MQ --> NS
    MQ --> SS
    US --- UDB
    OS --- ODB
    PS --- PDB
    PRS --- PRDB
    IS --- IDB
    SS --- ES
    CP -.->|config| US & OS & PS & PRS & IS & NS & SS
    VAULT -.->|secrets| US & OS & PS & PRS & IS
    US & OS & PS -.->|metrics| PROM
    PROM --> GRAF
    US & OS & PS -.->|traces| JAEG
    US & OS & PS -.->|logs| EFK

    style GW fill:#e65100,color:#fff,stroke:#333
    style CP fill:#1565c0,color:#fff,stroke:#333
    style VAULT fill:#7b1fa2,color:#fff,stroke:#333
    style MQ fill:#2e7d32,color:#fff,stroke:#333

7.2 Saga Flow — Order Creation (Orchestration)

sequenceDiagram
    participant C as Client
    participant GW as API Gateway
    participant O as Order Service<br/>(Saga Orchestrator)
    participant I as Inventory Service
    participant P as Payment Service
    participant N as Notification Service
    participant MQ as Message Queue

    C->>GW: POST /api/orders
    GW->>O: Forward request

    Note over O: Step 1: Create Order (PENDING)
    O->>O: INSERT order (status=PENDING)

    Note over O: Step 2: Reserve Inventory
    O->>MQ: Publish "inventory.reserve"
    MQ->>I: Consume
    I->>I: Reserve items
    I->>MQ: Publish "inventory.reserve.result" ✓
    MQ->>O: Consume result

    Note over O: Step 3: Process Payment
    O->>P: POST /api/payments (via Circuit Breaker)
    P->>P: Charge customer
    P->>O: 200 OK {transactionId}

    Note over O: Step 4: Confirm Order
    O->>O: UPDATE order (status=CONFIRMED)
    O->>MQ: Publish "order.confirmed"
    MQ->>N: Consume
    N->>N: Send email/SMS

    O->>GW: 201 Created
    GW->>C: 201 {orderId, status: CONFIRMED}

    Note over O,P: ═══ FAILURE SCENARIO ═══

    Note over O: If Payment fails at Step 3:
    P-->>O: 402 Payment Failed

    Note over O: Compensate Step 2: Release Inventory
    O->>MQ: Publish "inventory.release"
    MQ->>I: Consume
    I->>I: Release reserved items

    Note over O: Compensate Step 1: Cancel Order
    O->>O: UPDATE order (status=CANCELLED)

    O->>GW: 500 {error, status: FAILED}
    GW->>C: 500 Order creation failed

7.3 Circuit Breaker State Machine

stateDiagram-v2
    [*] --> CLOSED

    CLOSED --> OPEN : Failure count >= threshold<br/>(e.g., 5 failures in 10s)
    CLOSED --> CLOSED : Success / Failure < threshold

    OPEN --> HALF_OPEN : Reset timeout expires<br/>(e.g., after 30s)
    OPEN --> OPEN : All requests rejected<br/>(return fallback immediately)

    HALF_OPEN --> CLOSED : Probe request succeeds<br/>(service recovered!)
    HALF_OPEN --> OPEN : Probe request fails<br/>(still broken, wait more)

    note right of CLOSED
        Normal operation.
        Requests pass through.
        Count failures in rolling window.
    end note

    note right of OPEN
        Circuit tripped!
        All requests fail fast.
        Return fallback response.
        No load on failing service.
    end note

    note right of HALF_OPEN
        Testing recovery.
        Allow limited probe requests.
        If OK → close circuit.
        If fail → open again.
    end note
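
The state machine above can be sketched in a few lines of plain JavaScript. This is a minimal illustration, not the opossum API: `SimpleCircuitBreaker` is a hypothetical class, it counts consecutive failures instead of a rolling window, it is synchronous to keep the transitions easy to follow, and the clock is injected (`now`) so the example is deterministic.

```javascript
// Minimal circuit breaker sketch (hypothetical class, not the opossum API).
class SimpleCircuitBreaker {
  constructor({ failureThreshold = 5, resetTimeoutMs = 30000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;           // injectable clock, makes the sketch testable
    this.state = 'CLOSED';
    this.failures = 0;        // consecutive failures (simplification of a rolling window)
    this.openedAt = 0;
  }

  // Runs `action`; returns `fallback()` instead when the circuit is OPEN or the action throws.
  call(action, fallback) {
    if (this.state === 'OPEN') {
      if (this.now() - this.openedAt < this.resetTimeoutMs) return fallback(); // fail fast
      this.state = 'HALF_OPEN'; // reset timeout expired: let one probe through
    }
    try {
      const result = action();
      this.state = 'CLOSED';    // a success (including a successful probe) closes the circuit
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === 'HALF_OPEN' || this.failures >= this.failureThreshold) {
        this.state = 'OPEN';
        this.openedAt = this.now();
      }
      return fallback();
    }
  }
}

// Usage with a fake clock (milliseconds)
let clock = 0;
const breaker = new SimpleCircuitBreaker({ failureThreshold: 2, resetTimeoutMs: 1000, now: () => clock });
const failing = () => { throw new Error('service down'); };
const healthy = () => 'ok';
const fallback = () => 'fallback';

breaker.call(failing, fallback);               // 1st failure, still CLOSED
breaker.call(failing, fallback);               // 2nd failure, trips to OPEN
console.log(breaker.state);                    // OPEN
console.log(breaker.call(healthy, fallback));  // fallback (rejected without calling the service)
clock = 1000;                                  // reset timeout has elapsed
console.log(breaker.call(healthy, fallback));  // ok (probe succeeded)
console.log(breaker.state);                    // CLOSED
```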

8. Aha Moments & Pitfalls

Aha Moments

#1: A distributed monolith is the real nightmare. If you deploy microservices but the services share a database, have to be deployed in lockstep, and a change in one service forces a change in another, you have a distributed monolith: all the drawbacks of a monolith plus all the drawbacks of a distributed system, with none of the benefits. Test it with one question: "Can service A be deployed without also deploying service B?" If not, it is a distributed monolith.

#2: "Monolith first" is not a sign of weakness. Shopify runs a huge Rails monolith that handles billions of dollars in GMV. Basecamp/37signals still runs a monolith. Microservices are a trade-off, not an upgrade. Amazon and Netflix moved to microservices when their teams had grown past 1,000 engineers and the codebase could no longer be managed as one unit.

#3: Every network call is a new failure point. In a monolith, a function call almost never fails (barring OOM). In microservices, any network call can fail because of a network partition, DNS resolution failure, timeout, service crash, or resource exhaustion. With 10 services chained in series, each at 99.9% availability:

$$\text{Availability}_{\text{chain}} = 0.999^{10} \approx 0.990 = 99\%$$

99% availability means about 87.6 hours of downtime per year, compared with about 8.76 hours for a single service at 99.9%. That is why Circuit Breaker, Retry, Timeout, and Fallback exist.
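
The arithmetic above can be checked with a short script. This is a minimal sketch: `chainAvailability` and `downtimeHoursPerYear` are illustrative helper names, and it assumes the services fail independently and are called strictly in series.

```javascript
// Availability of services called in series: every hop must be up,
// so the individual availabilities multiply.
function chainAvailability(perServiceAvailability, hops) {
  return Math.pow(perServiceAvailability, hops);
}

// Expected downtime per year, in hours, for a given availability.
function downtimeHoursPerYear(availability) {
  const HOURS_PER_YEAR = 365 * 24; // 8760, ignoring leap years
  return (1 - availability) * HOURS_PER_YEAR;
}

const single = 0.999;
const chain = chainAvailability(single, 10);

console.log(chain.toFixed(4));                        // 0.9900
console.log(downtimeHoursPerYear(single).toFixed(2)); // 8.76
console.log(downtimeHoursPerYear(chain).toFixed(1));  // 87.2
```

The unrounded chain availability gives about 87.2 hours per year; rounding it to exactly 99% first yields the 87.6 hours quoted in the text.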

#4: Saga does not replace ACID. A saga guarantees only eventual consistency. Between step 2 and step 3 of the saga, the system is in an inconsistent state (inventory reserved, payment not yet taken). Design the UI/UX for a "processing" state and handle the edge cases (the user refreshing the page mid-saga, duplicate submits).

#5: Observability is non-negotiable. In a monolith, grepping a log file is enough. In microservices, without distributed tracing, centralized logging, and metrics, debugging a production issue is like searching for a needle in 10 haystacks at once. Set up observability before deploying microservices, not after.

Pitfalls: Common Mistakes

Pitfall 1: Too Many Microservices Too Early (Nano-services)

Wrong: A brand-new project with 3 developers, split into 15 services. Right: Start with a modular monolith (a monolith with clear module boundaries). When one module needs to scale or deploy on its own, extract it into a service. "If you can't build a well-structured monolith, what makes you think microservices is the answer?" (Simon Brown)

Pitfall 2: Shared Database Coupling

Wrong: 3 services all query the users table in a shared PostgreSQL database. Right: User Service owns the users table. Other services call the User Service API or subscribe to its events to build a local copy (CQRS). It hurts at first but avoids a coupling nightmare in the long run.

Pitfall 3: Synchronous Chains (Death by a Thousand Cuts)

Wrong: A → B → C → D → E (5 chained sync calls). D dies → A times out. Right: Use async events where possible. If a call must be sync, use Circuit Breaker + Timeout + Fallback. Keep it to 2-3 sync hops at most.
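
The Timeout + Fallback part of that advice can be sketched with `Promise.race`. The helper names `withTimeout` and `callWithFallback` are hypothetical, and the 50 ms budget is an arbitrary value for the demo.

```javascript
// Rejects if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms}ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Calls a downstream dependency with a time budget; degrades to a fallback value on any failure.
async function callWithFallback(call, ms, fallbackValue) {
  try {
    return await withTimeout(call(), ms);
  } catch (err) {
    return fallbackValue;
  }
}

// Usage: one fast dependency and one that answers after 200 ms.
const fastCall = () => Promise.resolve('fresh data');
const slowCall = () => new Promise((resolve) => setTimeout(() => resolve('fresh data'), 200));

const demo = Promise.all([
  callWithFallback(fastCall, 50, 'cached data'),
  callWithFallback(slowCall, 50, 'cached data'),
]);
demo.then((values) => console.log(values)); // [ 'fresh data', 'cached data' ]
```

Note that this only stops waiting; it does not cancel the underlying request, which keeps running on the slow service until it finishes.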

Pitfall 4: No Contract Testing

Wrong: Service A depends on the response format of Service B. B changes its response → A crashes in production. Right: Use Consumer-Driven Contract Testing (Pact). The consumer (A) writes a contract describing the response it expects. The provider (B) verifies the contract before deploying. CI/CD blocks the deploy if the contract is broken.
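
To make the idea concrete, here is a hand-rolled sketch of what a contract check does. It is deliberately not the Pact API: `userContract` and `verifyContract` are hypothetical names, and the contract only records the fields and types the consumer actually reads.

```javascript
// Hypothetical consumer contract: the fields Service A reads from Service B's response.
const userContract = {
  id: 'string',
  email: 'string',
  isActive: 'boolean',
};

// Returns a list of violations; an empty list means the response satisfies the contract.
// Extra fields in the response are allowed: the consumer only cares about what it reads.
function verifyContract(contract, response) {
  const violations = [];
  for (const [field, expectedType] of Object.entries(contract)) {
    if (!(field in response)) {
      violations.push(`missing field "${field}"`);
    } else if (typeof response[field] !== expectedType) {
      violations.push(`field "${field}" is ${typeof response[field]}, expected ${expectedType}`);
    }
  }
  return violations;
}

// Usage: the provider runs this in CI against a sample of its own response.
const currentResponse = { id: 'u-1', email: 'user@example.com', isActive: true, plan: 'free' };
const breakingResponse = { id: 42, email: 'user@example.com' };

console.log(verifyContract(userContract, currentResponse));  // []
console.log(verifyContract(userContract, breakingResponse));
// [ 'field "id" is number, expected string', 'missing field "isActive"' ]
```

A provider pipeline would fail the build whenever the returned list is not empty, which is the "CI/CD blocks the deploy" step described above.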

Pitfall 5: Ignoring Data Consistency

Wrong: Splitting into services but still expecting strong consistency between them (treating a distributed system like a single database). Right: Accept eventual consistency. Design for idempotency. Use Saga for distributed transactions. Implement the outbox pattern so that "write to DB + publish event" is atomic.

Pitfall 6: No API Versioning Strategy

Wrong: Changing the API contract → every consumer has to deploy at the same time. Right: Version APIs from day one (/api/v1/, /api/v2/). Stay backward compatible. Use header-based or URI-based versioning. Deprecate the old version only after consumers have migrated.
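
A small sketch of both versioning styles side by side. Everything here is an assumption for illustration: the `handlers` table, the `Accept-Version` header name (a common convention, not a standard), and the default of v1 for clients that send no version.

```javascript
// Hypothetical handlers for two versions of the same endpoint.
const handlers = {
  1: (user) => ({ name: `${user.firstName} ${user.lastName}` }),            // old shape
  2: (user) => ({ firstName: user.firstName, lastName: user.lastName }),    // new shape
};
const DEFAULT_VERSION = 1; // unversioned clients keep getting the old contract

// URI-based version wins; otherwise fall back to the header, then to the default.
function resolveVersion(path, headers = {}) {
  const fromUri = /^\/api\/v(\d+)\//.exec(path);
  if (fromUri) return Number(fromUri[1]);
  const fromHeader = Number(headers['accept-version']);
  return Number.isInteger(fromHeader) && fromHeader > 0 ? fromHeader : DEFAULT_VERSION;
}

function handle(path, headers, user) {
  const handler = handlers[resolveVersion(path, headers)];
  if (!handler) return { status: 404, body: { error: 'Unsupported API version' } };
  return { status: 200, body: handler(user) };
}

// Usage
const user = { firstName: 'An', lastName: 'Tran' };
console.log(handle('/api/v1/users/1', {}, user).body);                      // { name: 'An Tran' }
console.log(handle('/api/v2/users/1', {}, user).body);                      // { firstName: 'An', lastName: 'Tran' }
console.log(handle('/api/users/1', { 'accept-version': '2' }, user).body);  // { firstName: 'An', lastName: 'Tran' }
console.log(handle('/api/v9/users/1', {}, user).status);                    // 404
```

Because v1 keeps working while v2 is rolled out, consumers can migrate one at a time instead of deploying together.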

Pitfall 7: Logging Without Correlation

Wrong: Each service logs on its own, with no traceId. When a bug occurs, there is no way to tell which services the request passed through. Right: Propagate the traceId through every service (the X-Trace-Id HTTP header or OpenTelemetry context). Every log entry must carry the traceId. Use Jaeger/Zipkin to visualize the trace.
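
A minimal sketch of manual propagation with the `X-Trace-Id` header. The helper names are hypothetical; the sketch assumes incoming header names are already lower-cased, as Node's `http` module does, and it stands in for what OpenTelemetry context propagation would do automatically.

```javascript
const { randomUUID } = require('node:crypto');

// Reuse the caller's trace ID if present; otherwise this service is the edge and creates one.
function ensureTraceId(incomingHeaders) {
  return incomingHeaders['x-trace-id'] || randomUUID();
}

// Headers to attach to every outgoing call so the next hop sees the same trace ID.
function outgoingHeaders(traceId, extra = {}) {
  return { ...extra, 'X-Trace-Id': traceId };
}

// Every log line carries the trace ID so logs from all services can be joined on it.
function logLine(service, traceId, message) {
  return JSON.stringify({ service, traceId, message });
}

// Usage
const traceId = ensureTraceId({ 'x-trace-id': 'abc-123' });
console.log(outgoingHeaders(traceId)); // { 'X-Trace-Id': 'abc-123' }
console.log(logLine('order-service', traceId, 'order created'));
// {"service":"order-service","traceId":"abc-123","message":"order created"}
```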

Pitfall 8: Sidecar Mesh at Large Scale

Wrong: A cluster of 2000 pods × 80MB per Istio sidecar = 160GB of RAM just for the mesh. A mesh upgrade means 2000 pod restarts. Right: Evaluate Istio Ambient or Cilium for large clusters: L4 mTLS through a DaemonSet (ztunnel) or eBPF, with opt-in L7 features. See section 2.14.

Pitfall 9: Single Cell, Large Blast Radius

Wrong: Microservices, but one shared DB for all users. One bad query → every user is affected. Right: A cell-based architecture splits users into N independent cells. AWS DynamoDB and Lambda both use this pattern. A bad deployment in one cell affects only 1/N of users. See section 2.15.
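
The routing step of a cell-based architecture can be sketched as a stable hash from user ID to cell. `cellFor` and `CELL_COUNT` are hypothetical; real cell routers usually keep an explicit user-to-cell mapping or use consistent hashing (Week 10), because a plain modulo reassigns most users when the number of cells changes.

```javascript
const { createHash } = require('node:crypto');

// Maps a user to one of `cellCount` cells. The same user always lands in the same cell.
function cellFor(userId, cellCount) {
  const digest = createHash('sha256').update(String(userId)).digest();
  return digest.readUInt32BE(0) % cellCount; // first 4 bytes of the hash as an unsigned int
}

// Usage
const CELL_COUNT = 8; // hypothetical number of cells
const cell = cellFor('user-42', CELL_COUNT);
console.log(cell === cellFor('user-42', CELL_COUNT)); // true: routing is deterministic
console.log(cell >= 0 && cell < CELL_COUNT);          // true: always a valid cell index
```

Each cell then runs its own copy of the services and its own database, so a bad query or deployment stays inside the cell the hash selected.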

Pitfall 10: Dual-write to DB + MQ

Wrong: db.commit(); kafka.send(...): if Kafka fails after the commit, the data is inconsistent. Right: Use the Outbox Pattern. See Tuan-Bonus-Outbox-Pattern for the full deep dive.
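
A minimal in-memory sketch of the Outbox Pattern, to show why it removes the dual write. The `db` object, `transaction`, `createOrder`, and `relayOutbox` are all hypothetical stand-ins: a real implementation would use a database transaction and a poller or CDC process as the relay.

```javascript
// In-memory stand-in for a database with two tables.
const db = { orders: [], outbox: [] };

// Simulated transaction: changes go to a draft and are swapped in only if `fn` succeeds.
function transaction(fn) {
  const draft = { orders: [...db.orders], outbox: [...db.outbox] };
  fn(draft); // a throw here leaves `db` untouched
  db.orders = draft.orders;
  db.outbox = draft.outbox;
}

// The business row and the event row are written in ONE transaction; no broker call here.
function createOrder(order) {
  transaction((tx) => {
    tx.orders.push(order);
    tx.outbox.push({ id: `evt-${order.id}`, type: 'order.created', payload: order, published: false });
  });
}

// Relay: publish pending events, marking each as published only after the broker accepts it.
function relayOutbox(publish) {
  for (const event of db.outbox) {
    if (event.published) continue;
    try {
      publish(event);
      event.published = true;
    } catch (err) {
      break; // broker unavailable: stop and retry on the next poll, preserving event order
    }
  }
}

// Usage
const delivered = [];
const brokerDown = () => { throw new Error('broker unavailable'); };
const brokerUp = (event) => delivered.push(event.id);

createOrder({ id: 'o-1', total: 250000 });
relayOutbox(brokerDown);  // nothing delivered, the event stays pending
relayOutbox(brokerUp);    // delivered on the next poll
relayOutbox(brokerUp);    // already published, so not sent again
console.log(delivered);   // [ 'evt-o-1' ]
```

If the relay crashes between `publish` and marking the row, the event is sent again on the next poll, so delivery is at-least-once and consumers should deduplicate on the event `id`.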

Pitfall 11: Non-idempotent Saga Compensation

Wrong: The compensation is called twice (because of a retry) → the customer is refunded twice. Right: Every compensation action must be idempotent. Use UPDATE ... WHERE status='charged' instead of refunding blindly.
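
The conditional update can be sketched in memory as a guarded status transition. The `payments` map, the `refund` function, and the `refundsIssued` counter are hypothetical; the guard plays the role of the `WHERE status='charged'` clause.

```javascript
// In-memory payments table. Status may move charged -> refunded exactly once.
const payments = new Map([['tx-1', { status: 'charged', amount: 100000 }]]);
let refundsIssued = 0; // stands in for the real money movement

// Mirrors: UPDATE payments SET status='refunded' WHERE id = ? AND status = 'charged'
function refund(transactionId) {
  const payment = payments.get(transactionId);
  if (!payment || payment.status !== 'charged') {
    return { refunded: false }; // already refunded (or unknown): a retry changes nothing
  }
  payment.status = 'refunded';
  refundsIssued += 1;
  return { refunded: true };
}

// Usage: the compensation is retried, but money moves only once.
console.log(refund('tx-1')); // { refunded: true }
console.log(refund('tx-1')); // { refunded: false }
console.log(refundsIssued);  // 1
```

In a real database the check and the update must be a single atomic statement, as in the SQL above; a separate read followed by a write would reintroduce the race.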

9. Internal Links: Link Map

Prerequisites (read first)

Tuan-01-Scale-From-Zero-To-Millions: Understand when scaling is needed, from a single server to a distributed system
Tuan-04-API-Design-REST-gRPC: The foundation for communication between services (REST vs gRPC)
Tuan-05-Load-Balancer: Distributing requests across multiple instances
Tuan-08-Message-Queue: Async communication, event-driven architecture
Tuan-10-Consistent-Hashing: Distributing data/traffic evenly across services

Bonus chapters (Architect-level, advanced)

Tuan-Bonus-Consensus-Raft-Paxos: The service mesh control plane relies on etcd/Raft
Tuan-Bonus-Consistency-Models-Isolation: Isolation for cross-service transactions
Tuan-Bonus-Outbox-Pattern: Outbox + Saga deep dive, with a concrete latency comparison of choreography vs orchestration

Same module (Architecture-DevOps-Security)

Tuan-12-CICD-Pipeline: Deploying microservices (one pipeline per service)
Tuan-13-Monitoring-Observability: Distributed tracing, centralized logging, metrics
Tuan-14-AuthN-AuthZ-Security: Service-to-service auth, JWT, OAuth2
Tuan-15-Data-Security-Encryption: mTLS, encryption at rest, secrets management

Applied in Case Studies

Tuan-16-Design-URL-Shortener: When does a URL Shortener need microservices? (hint: almost never)
Tuan-17-Design-Chat-System: Chat system at large scale: message service, presence service, notification service
Tuan-18-Design-News-Feed: Feed generation service, fan-out service, notification service
Tuan-19-Design-Notification-System: Notification is a typical supporting service in microservices
Tuan-20-Design-Key-Value-Store: Data layer for microservices

Other related notes

Tuan-02-Back-of-the-envelope: Estimating microservices overhead
Tuan-06-Cache-Strategy: Cache per service, distributed cache
Tuan-07-Database-Sharding-Replication: Database-per-service strategy
Tuan-09-Rate-Limiter: Rate limiting at the API Gateway level

References

Alex Xu, System Design Interview — Chapter 2 (Estimation), Chapter 4-13 (Patterns applied)
Sam Newman, Building Microservices (2nd Edition) — The definitive guide
Chris Richardson, Microservices Patterns — Saga, CQRS, Event Sourcing
Martin Fowler, Microservices Guide — https://martinfowler.com/microservices/
Eric Evans, Domain-Driven Design — Bounded Context, Context Mapping
Microsoft, Cloud Design Patterns — Circuit Breaker, Bulkhead, Strangler Fig
Istio Documentation — https://istio.io/latest/docs/
HashiCorp Vault — https://developer.hashicorp.com/vault
sdi.anhvy.dev — Vietnamese System Design Reference

Next week: Tuan-12-CICD-Pipeline. CI/CD pipelines for microservices: one pipeline per service, automated testing, canary deployment
