Bonus Week: Progressive Delivery — Argo Rollouts, Flagger, Feature Flags
“Traditional CI/CD: deploy → 100% of users → if bug → roll everything back. Progressive Delivery: deploy → 1% of users → measure → 5% → measure → 25% → 100%. Bug detected? Automatic rollback at 5%. This is the evolution from ‘deploy and hope’ to ‘deploy and verify’.”
Tags: system-design cicd progressive-delivery canary feature-flags bonus Student: Hieu (Backend Dev → Architect) Prerequisites: Tuan-12-CICD-Pipeline · Tuan-13-Monitoring-Observability Related: Tuan-11-Microservices-Pattern · Tuan-Bonus-Platform-Engineering-IDP
1. Context & Why
Everyday analogy — opening new restaurants
Hieu, imagine you open a chain of 100 phở restaurants with a new recipe:
Approach 1 — Big-bang launch (traditional CI/CD):
- Day 1: Switch all 100 restaurants to the new recipe at once
- Customers complain: “this phở is awful!”
- All 100 restaurants lose revenue on the same day
- Revert to the old recipe → 200 restaurant-days affected
Approach 2 — Progressive launch (Progressive Delivery):
- Day 1: New recipe at 1 restaurant
- Measure: revenue, complaints, NPS
- Day 2: If OK → 5 restaurants
- Day 3: Still OK → 25 restaurants
- Day 7: 100% if metrics look good
- Bug discovered at 5 restaurants → roll back only those 5
Progressive Delivery = Continuous Deployment + Risk Reduction + Automated Decisions.
Why should a backend dev care?
| Reason | Consequence without it |
|---|---|
| Reduce blast radius | A bug hits 100% of users instead of 1% |
| Automated rollback | Detect + rollback in seconds instead of hours |
| Real-world validation | Pre-prod tests miss issues that 1% of prod traffic reveals |
| Decouple deploy from release | Ship code without enabling the feature yet |
| DORA metrics | Higher deploy frequency, lower MTTR |
| A/B testing built-in | Validate hypotheses with traffic subsets |
Why doesn’t Alex Xu go deep here?
Alex Xu Vol 1+2 covers basic CI/CD (blue-green, canary) but not automated analysis, feature-flag platforms, or progressive rollout patterns. These are post-2020 evolutions.
Key references
- Argo Rollouts docs — https://argo-rollouts.readthedocs.io/
- Flagger docs — https://docs.flagger.app/
- Continuous Delivery (Humble & Farley, 2010) — foundational
- Accelerate (Forsgren, Humble, Kim, 2018)
- LaunchDarkly Engineering — https://launchdarkly.com/blog/engineering/
- The Pragmatic Programmer’s Guide to Feature Flags — https://launchdarkly.com/blog/the-pragmatic-programmers-guide-to-feature-flags/
2. Deep Dive — Core Concepts
2.1 Deployment Strategies Spectrum
Risk Speed Complexity
Big Bang ────█ █─── ─
Blue-Green ──██ ██── ██
Rolling ─███ ██── ██
Canary (manual) ████ ███─ ███
Canary (automated) ████ ████ ████
A/B Testing ████ ████ █████
Shadow / Mirror █████ ─███ █████
2.1.1 Big Bang
v1: 100% → v2: 100% (instant cutover)
When: stateful systems where running two versions is hard (e.g. some DB schema migrations). Avoid for: stateless services
2.1.2 Blue-Green
Blue (v1) running, Green (v2) deployed
Switch load balancer: 100% Blue → 100% Green
Keep Blue for instant rollback
Pros: Instant rollback, zero-downtime Cons: 2x infrastructure during deploy, no gradual validation
2.1.3 Rolling Update (K8s default)
10 pods running v1
Replace 1 pod at a time:
Pod 1: v1 → v2
Pod 2: v1 → v2
...
At any time: mix of v1/v2 pods serving traffic
Pros: Default in K8s, gradual Cons: No automated analysis, all-or-nothing per pod
2.1.4 Canary
v1: 99%, v2: 1%
v1: 95%, v2: 5%
v1: 75%, v2: 25%
v1: 50%, v2: 50%
v1: 0%, v2: 100%
Pros: Gradual exposure, can pause at any % Cons: Manual decisions; need infrastructure (service mesh, ingress)
2.1.5 A/B Testing
Same as canary but split based on USER attributes:
v1: existing users
v2: new users in segment X
Measure business metrics (conversion, engagement)
Pros: Validate hypothesis with real users Cons: Need experimentation platform, statistical rigor
2.1.6 Shadow / Mirror
v1: receives 100% traffic, returns response
v2: receives 100% mirrored traffic, response discarded
Compare v1 vs v2 outputs (correctness, performance)
Pros: Test against real traffic without impact Cons: 2x infrastructure, side-effect concerns
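The "compare v1 vs v2 outputs" step can be sketched as a small response diff. Everything here (the ShadowResult shape, the ignored fields) is a hypothetical illustration, not any particular mirroring tool's API:

```python
from dataclasses import dataclass

@dataclass
class ShadowResult:
    match: bool   # did v1 and v2 return the same body?
    v1_ms: float  # observed latency of each version
    v2_ms: float

def compare_shadow(v1_response: dict, v2_response: dict,
                   v1_ms: float, v2_ms: float,
                   ignore_fields=frozenset({"request_id", "timestamp"})) -> ShadowResult:
    """Diff mirrored responses, ignoring fields expected to differ per call."""
    a = {k: v for k, v in v1_response.items() if k not in ignore_fields}
    b = {k: v for k, v in v2_response.items() if k not in ignore_fields}
    return ShadowResult(match=(a == b), v1_ms=v1_ms, v2_ms=v2_ms)

# Same body modulo request_id -> counted as a match; latency gap is recorded
r = compare_shadow({"total": 100, "request_id": "a"},
                   {"total": 100, "request_id": "b"}, 12.0, 19.5)
```

In practice the mismatch rate and the latency delta would be exported as metrics and reviewed before promoting v2.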
2.2 Argo Rollouts — Kubernetes-native Progressive Delivery
Argo Rollouts: Replaces K8s Deployment with Rollout CRD, supports canary + blue-green with automated analysis.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  replicas: 10
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: payment-service
          image: myorg/payment:v2.0.0
          ports: [{ containerPort: 8080 }]
  strategy:
    canary:
      steps:
        - setWeight: 5            # 5% traffic to v2
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 25           # 25%
        - pause: { duration: 30m }
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: latency
        - setWeight: 50
        - pause: { duration: 30m }
        - setWeight: 100          # full rollout
      canaryService: payment-canary   # Routes v2 traffic
      stableService: payment-stable   # Routes v1 traffic
      trafficRouting:
        istio:
          virtualService:
            name: payment-vs
            routes: [primary]

Steps execution:
- Deploy 1 canary pod with v2
- Route 5% traffic to canary
- Pause 10 minutes
- Run analysis (query Prometheus)
- If pass → next step; if fail → rollback automatically
- Continue progression
2.3 AnalysisTemplate — Automated Verification
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status!~"5..",
              version="canary"
            }[5m])) /
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              version="canary"
            }[5m]))
    - name: p99-latency
      interval: 1m
      successCondition: result[0] <= 0.5
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                service="{{args.service-name}}",
                version="canary"
              }[5m])) by (le)
            )

Decision logic:
- Run query every minute
- If 3 consecutive failures → abort rollout, trigger rollback
- If 3 consecutive successes → progress to next step
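The decision loop above can be sketched in a few lines of Python. This is a simplified illustration of the logic described here (consecutive pass/fail counting), not Argo Rollouts' actual implementation, whose semantics are richer (measurement counts, inconclusive states, etc.):

```python
def analyze(measurements: list[float], threshold: float = 0.95,
            limit: int = 3) -> str:
    """Simplified rollout gate: `limit` consecutive failing measurements
    abort the rollout; `limit` consecutive passing ones progress it."""
    fails = 0
    passes = 0
    for value in measurements:       # one measurement per interval (e.g. 1m)
        if value >= threshold:
            passes += 1
            fails = 0
        else:
            fails += 1
            passes = 0
        if fails >= limit:
            return "abort"           # trigger rollback
        if passes >= limit:
            return "progress"        # move to next canary step
    return "inconclusive"

analyze([0.99, 0.98, 0.97])          # -> "progress"
analyze([0.99, 0.90, 0.90, 0.90])    # -> "abort"
```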
2.4 Flagger — Service Mesh-native
Flagger (Weaveworks/Flux): Similar concept, integrates with Istio/Linkerd/AWS App Mesh natively.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payment
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment
  service:
    port: 80
    targetPort: 8080
    gateways: [public-gateway]
  analysis:
    interval: 1m
    threshold: 5        # max 5 failed checks
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange: { min: 99 }
        interval: 1m
      - name: request-duration
        thresholdRange: { max: 500 }   # ms
        interval: 30s
    webhooks:
      - name: smoke-test
        type: pre-rollout
        url: http://test-runner/run-smoke
      - name: load-test
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://payment-canary/"

Differences from Argo Rollouts:
- Flagger more integrated with service mesh
- Argo Rollouts more flexible (custom analysis providers)
- Both production-grade
2.5 Feature Flags — Decouple Deploy from Release
Concept: Deploy code with feature OFF. Enable feature later via flag.
# Without feature flag
def get_recommendations(user_id):
    return new_ml_algorithm(user_id)  # Released to all immediately

# With feature flag
def get_recommendations(user_id):
    if feature_flag.enabled("new_recommendations", user_id=user_id):
        return new_ml_algorithm(user_id)
    return old_algorithm(user_id)

Use cases:
- Gradual rollout: 1% → 10% → 100%
- Kill switch: Disable broken feature instantly
- A/B testing: Compare variants
- Targeted rollout: Enable for specific users (beta testers)
- Trunk-based development: Merge unfinished features behind flag
2.6 Feature Flag Platforms
| Tool | Type | Best for |
|---|---|---|
| LaunchDarkly | Commercial SaaS | Enterprise, full features |
| Statsig | SaaS, freemium | Experimentation focus |
| Unleash | OSS + cloud | Self-host preference |
| Flagsmith | OSS + cloud | Privacy-conscious |
| OpenFeature (CNCF) | Vendor-neutral spec | Avoid lock-in |
| Cloudflare Workers KV | DIY simple flags | Existing CF infra |
| Custom DB | Self-built | Small scale, simple needs |
2.7 OpenFeature — Standardization
OpenFeature (CNCF): Vendor-neutral feature flag SDK.
# Python OpenFeature SDK
from openfeature import api
from openfeature.contrib.provider.flagsmith import FlagsmithProvider

# Configure provider (swap-able)
api.set_provider(FlagsmithProvider(env_key="..."))

# Use anywhere
client = api.get_client("my-app")
if client.get_boolean_value("new_checkout", default=False, evaluation_context={
    "user_id": user_id,
    "country": "VN"
}):
    new_checkout_flow()
else:
    old_checkout_flow()

Magic: Switch from LaunchDarkly to Unleash → just change the provider, no code change.
2.8 Targeting & Rollout Strategies
2.8.1 Percentage rollout
flag: new_checkout
default: false
targets:
  - rule: percentage(10)   # 10% random users
    value: true

2.8.2 Attribute-based
flag: premium_features
default: false
targets:
  - rule: user.plan == "enterprise"
    value: true
  - rule: user.country in ["US", "VN", "JP"]
    value: true
  - rule: user.id in ["beta-tester-1", "beta-tester-2"]
    value: true

2.8.3 Context-based
flag: heavy_query_optimization
targets:
  - rule: request.path == "/reports"
    value: true
  - rule: request.method == "POST" && request.size > 1MB
    value: true

2.8.4 Sticky bucketing
Important: User in 10% bucket today should stay in 10% tomorrow (consistent UX).
# Hash user_id deterministically (requires the mmh3 package)
import mmh3

def in_bucket(user_id: str, percentage: int) -> bool:
    hash_val = mmh3.hash(f"new_checkout:{user_id}") % 100
    return hash_val < percentage

2.9 Experimentation (A/B Testing)
Beyond rollout: measure business metrics.
import statsig

statsig.initialize("...")

# Get experiment value
config = statsig.get_experiment("new_checkout_design", user)
button_color = config.get_string("button_color", "blue")
button_text = config.get_string("button_text", "Buy Now")

# Render UI with variant
render_button(color=button_color, text=button_text)

# Log conversion event
if user_purchased:
    statsig.log_event("purchase", value=order_total)

Experiment platform computes:
- Sample size per variant
- Statistical significance
- Lift in conversion
- Confidence intervals
Tools: Statsig, LaunchDarkly Experiments, Eppo, Amplitude Experiment.
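The "statistical significance" these platforms compute can be illustrated with a minimal stdlib-only sketch of a two-proportion z-test (|z| > 1.96 is roughly significant at the 95% level). The conversion numbers below are invented:

```python
import math

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Z-score for the difference between two conversion rates,
    using the pooled standard error."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)               # pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # pooled standard error
    return (p_b - p_a) / se

# n=50 per variant: a "lift" from 10% to 14% is nowhere near significant
z_small = two_proportion_z(5, 50, 7, 50)       # |z| < 1.96
# n=5000 per variant: the same lift is clearly significant
z_big = two_proportion_z(500, 5000, 700, 5000)  # |z| > 1.96
```

This is exactly why Pitfall 6 below (declaring a winner with n=50 users) is a pitfall: the observed lift is identical, but only the large sample supports a conclusion.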
2.10 Rollback Strategies
2.10.1 Automated rollback (Argo Rollouts/Flagger)
analysis:
  failureLimit: 3        # 3 consecutive failures → abort
  inconclusiveLimit: 5

When abort triggered:
- Stop progression
- Route 100% traffic back to stable
- Notify team
- Keep canary pod for debugging (configurable)
2.10.2 Feature flag kill switch
# Bug detected → disable feature instantly (illustrative pseudo-code;
# real platforms expose this via their dashboard or REST API)
launchdarkly.update_flag("new_checkout", enabled=False)
# Effect propagates within seconds, no deploy needed

Power: sub-30-second rollback vs a 10-minute deploy rollback.
2.10.3 Database rollback considerations
Problem: Rolling back code is easy, rolling back DB schema is hard.
Pattern: Expand-Contract:
Migration v1 → v2:
Phase 1 (Expand): Add new column, code writes to both old & new
Phase 2: Backfill new column from old
Phase 3: Code reads from new
Phase 4 (Contract): Remove old column
Each phase deployable & rollback-safe.
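A toy sketch of the Expand phase (Phase 1) using SQLite — the schema and column names here are made up. The point is the dual-write: the old full_name column and the new split columns are both written, so v1 code (reads old) and v2 code (reads new) can run against the same schema and rollback stays safe:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT)")
# Phase 1 (Expand): add the new columns; the old one stays in place
conn.execute("ALTER TABLE users ADD COLUMN first_name TEXT")
conn.execute("ALTER TABLE users ADD COLUMN last_name TEXT")

def save_name(user_id: int, full_name: str) -> None:
    """Dual-write during Expand: populate both old and new columns."""
    first, _, last = full_name.partition(" ")
    conn.execute(
        "INSERT OR REPLACE INTO users (id, full_name, first_name, last_name) "
        "VALUES (?, ?, ?, ?)",
        (user_id, full_name, first, last),
    )

save_name(1, "Ada Lovelace")
row = conn.execute(
    "SELECT full_name, first_name, last_name FROM users WHERE id = 1"
).fetchone()
# Either code version finds the data it expects
```

Only after Phase 3 (all code reads the new columns) is it safe to run the Contract phase and drop full_name.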
2.11 DORA Metrics
4 key metrics measure DevOps performance:
| Metric | Elite |
|---|---|
| Deploy frequency | On-demand (multiple per day) |
| Lead time for changes | < 1 day commit to prod |
| Change failure rate | 0-15% |
| MTTR | < 1 hour |
Progressive Delivery improves all 4:
- Deploy frequency ↑ (lower risk per deploy)
- Lead time ↓ (auto-pipeline)
- Change failure rate ↓ (catch issues at 5%)
- MTTR ↓ (auto-rollback)
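As a toy illustration, the four metrics fall straight out of a deploy log. All data and field shapes here are invented:

```python
from datetime import datetime

# Hypothetical deploy log: (deployed_at, caused_incident, recovered_at)
deploys = [
    (datetime(2026, 1, 5, 9, 0),  False, None),
    (datetime(2026, 1, 5, 15, 0), True,  datetime(2026, 1, 5, 15, 20)),
    (datetime(2026, 1, 6, 10, 0), False, None),
    (datetime(2026, 1, 7, 11, 0), False, None),
]

days = (deploys[-1][0].date() - deploys[0][0].date()).days + 1
deploy_frequency = len(deploys) / days               # deploys per day
failures = [d for d in deploys if d[1]]
change_failure_rate = len(failures) / len(deploys)   # fraction of bad deploys
# Mean time to restore, in minutes, over the failed deploys
mttr_minutes = sum(
    (rec - dep).total_seconds() for dep, _, rec in failures
) / len(failures) / 60
```

(Lead time for changes would need commit timestamps joined against this log; same idea.)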
3. Estimation
3.1 Time saved by automation
Manual canary (without Argo Rollouts):
- Engineer monitors metrics manually: 2h per deploy
- 10 deploys/week × 2h × 5 engineers = 100h/week
- $200K/year just monitoring time
Automated (Argo Rollouts):
- Engineer reviews dashboard occasionally: 0.5h/deploy
- Saves 75h/week = $150K/year
- Plus prevents bad deploys (avg incident cost $50K)
3.2 Risk reduction
Without progressive delivery:
- 1% bad deploy rate × 10 deploys/week × 50 weeks = 5 incidents/year
- Average impact: 30 min × 100K users × $X cost = $$$
With progressive delivery:
- Same bad-deploy rate, but caught at 5% traffic
- Impact: 5 min × 5K users = 25K user-minutes vs 3M → ~99% reduction in affected user-minutes
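The same arithmetic in user-minutes, using the illustrative traffic numbers above:

```python
# Big bang: a bad deploy hits everyone until rollback
big_bang_user_minutes = 30 * 100_000   # 30 min outage x 100K users

# Progressive: caught at 5% traffic, auto-rollback within ~5 min
canary_user_minutes = 5 * 5_000        # 5 min x 5K users (5% of 100K)

reduction = 1 - canary_user_minutes / big_bang_user_minutes  # ≈ 0.99
```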
3.3 Feature flag overhead
- ~5-10% perf overhead from flag evaluation
- Mitigations: SDK cache, edge evaluation
- Cost: LaunchDarkly is paid SaaS; self-hosting (e.g. Unleash) is ~$0 in license cost, but you pay in infra and ops time
4. Security First
4.1 Flag evaluation auth
Threat: Attacker manipulates flag → enable hidden feature.
Mitigations:
- API key per environment
- Restrict who can change flags (RBAC)
- Audit log every flag change
- Use cryptographic verification (signed flags)
4.2 Sensitive data in flag context
Don’t include PII in the flag context (it gets cached and logged):
# BAD
client.get_boolean_value("flag", context={"email": "[email protected]"})

# GOOD — a stable one-way identifier instead of raw PII
# (note: Python's built-in hash() is salted per process, so use hashlib)
client.get_boolean_value("flag", context={"user_id": hashlib.sha256(email.encode()).hexdigest()})

4.3 Flag debt
Problem: Flags accumulate over time (100s of dead flags).
Risks:
- Old flags = old code paths = potential bugs
- Hard to reason about behavior
- Audit nightmare
Solution: Flag lifecycle management
- Tag flags with owner, created_at, sunset_date
- Auto-alert when flag > 90 days old
- Quarterly cleanup ritual
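A sketch of that lifecycle check — the flag-registry shape here is invented, but any platform exposing owner/created_at/sunset_date metadata supports the same scan:

```python
from datetime import date

# Hypothetical flag registry with lifecycle metadata
flags = {
    "new_checkout": {"owner": "team-pay", "created_at": date(2025, 1, 10),
                     "sunset_date": date(2025, 4, 1)},
    "dark_mode":    {"owner": "team-ui",  "created_at": date(2026, 1, 20),
                     "sunset_date": None},
}

def stale_flags(registry: dict, today: date, max_age_days: int = 90) -> list[str]:
    """Flags older than max_age_days, or past their sunset date -> alert."""
    out = []
    for name, meta in registry.items():
        age = (today - meta["created_at"]).days
        past_sunset = meta["sunset_date"] is not None and today > meta["sunset_date"]
        if age > max_age_days or past_sunset:
            out.append(name)
    return out

to_clean = stale_flags(flags, today=date(2026, 2, 1))  # ["new_checkout"]
```

Wire this into the quarterly cleanup ritual (or the TooManyOldFlags alert in section 5.5).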
4.4 Canary security testing
Run security scans on canary before promotion:
- DAST (OWASP ZAP)
- Container scan (Trivy)
- API contract test
Fail rollout if security regression.
5. DevOps — Implementation
5.1 Argo Rollouts setup
# Install
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f \
  https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml

# Install kubectl plugin
brew install argoproj/tap/kubectl-argo-rollouts

# Use
kubectl argo rollouts get rollout payment-service
kubectl argo rollouts pause payment-service
kubectl argo rollouts promote payment-service
kubectl argo rollouts abort payment-service

5.2 Service mesh setup (Istio)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payment-vs
spec:
  hosts: [payment]
  http:
    - name: primary
      route:
        - destination:
            host: payment
            subset: stable
          weight: 100
        - destination:
            host: payment
            subset: canary
          weight: 0   # Argo Rollouts updates this
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment
spec:
  host: payment
  subsets:
    - name: stable
      labels: { version: stable }
    - name: canary
      labels: { version: canary }

5.3 Feature flag (Unleash) setup
# docker-compose
services:
  unleash:
    image: unleashorg/unleash-server
    environment:
      DATABASE_URL: "postgres://unleash:password@db/unleash"
    ports: ["4242:4242"]
  db:
    image: postgres:15
    environment:
      POSTGRES_USER: unleash
      POSTGRES_PASSWORD: password
      POSTGRES_DB: unleash

5.4 Application integration
from UnleashClient import UnleashClient

client = UnleashClient(
    url="http://unleash:4242/api/",
    app_name="my-app",
    custom_headers={"Authorization": API_TOKEN}
)
client.initialize_client()

def get_recommendations(user_id):
    if client.is_enabled(
        "new_recommendations",
        context={"userId": user_id}
    ):
        return new_algorithm(user_id)
    return old_algorithm(user_id)

5.5 Monitoring
groups:
  - name: progressive_delivery
    rules:
      - alert: RolloutAborted
        expr: rollout_phase{phase="Aborted"} == 1
        for: 1m
        annotations:
          summary: "Rollout {{ $labels.name }} aborted automatically"
      - alert: RolloutPaused
        expr: rollout_phase{phase="Paused"} == 1
        for: 1h
        annotations:
          summary: "Rollout {{ $labels.name }} paused > 1h"
      - alert: HighFlagEvaluationLatency
        expr: |
          histogram_quantile(0.99, rate(flag_eval_duration_bucket[5m])) > 0.05
        annotations:
          summary: "Flag SDK P99 > 50ms"
      - alert: TooManyOldFlags
        expr: count(flag_age_days > 90) > 50
        annotations:
          summary: "{{ $value }} flags > 90 days old. Cleanup needed."

6. Code Implementation
6.1 Custom feature flag service
"""
Lightweight feature flag service.
For when LaunchDarkly is overkill.
"""
import hashlib
import json

import redis


class FeatureFlags:
    def __init__(self, redis_client: redis.Redis):
        self.redis = redis_client

    def is_enabled(self, flag: str, user_id: str | None = None,
                   context: dict | None = None) -> bool:
        """Check if flag enabled for user/context."""
        rules = self._get_rules(flag)
        if not rules:
            return False
        if rules.get("enabled") is False:
            return False  # Kill switch

        # Check user-specific overrides
        if user_id and user_id in rules.get("enabled_users", []):
            return True

        # Check segment rules
        if context:
            for segment in rules.get("segments", []):
                if self._matches_segment(context, segment):
                    return True

        # Percentage rollout (sticky)
        rollout = rules.get("rollout_percentage", 0)
        if rollout > 0 and user_id:
            return self._in_bucket(user_id, flag, rollout)
        return False

    def _get_rules(self, flag: str):
        data = self.redis.get(f"flag:{flag}")
        return json.loads(data) if data else None

    def _matches_segment(self, context: dict, segment: dict) -> bool:
        for key, expected in segment.items():
            if context.get(key) != expected:
                return False
        return True

    def _in_bucket(self, user_id: str, flag: str, percentage: int) -> bool:
        # Deterministic hashing for sticky assignment
        h = hashlib.md5(f"{flag}:{user_id}".encode()).hexdigest()
        bucket = int(h[:8], 16) % 100
        return bucket < percentage

    def update_flag(self, flag: str, rules: dict):
        """Admin API: update flag rules."""
        self.redis.set(f"flag:{flag}", json.dumps(rules))
        # TTL not set = persistent


# Usage
ff = FeatureFlags(redis.Redis())

# Set up rule
ff.update_flag("new_checkout", {
    "enabled": True,
    "rollout_percentage": 25,
    "enabled_users": ["beta-1", "beta-2"],
    "segments": [
        {"plan": "enterprise"},
        {"country": "VN"}
    ]
})

# Use
if ff.is_enabled("new_checkout", user_id="user-123",
                 context={"plan": "enterprise"}):
    new_checkout()
else:
    old_checkout()

6.2 Canary deployment manual orchestration
"""
Manual canary if not using Argo Rollouts.
"""
import asyncio
from dataclasses import dataclass

import requests


@dataclass
class CanaryStep:
    weight: int
    duration_min: int


class CanaryOrchestrator:
    def __init__(self, prometheus_url: str, k8s_client):
        self.prom = prometheus_url
        self.k8s = k8s_client

    async def deploy(self, service: str, new_version: str,
                     steps: list[CanaryStep]):
        # 1. Deploy canary pod
        await self.k8s.apply_manifest({
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": f"{service}-canary"},
            "spec": {
                "replicas": 1,
                "template": {
                    "spec": {
                        "containers": [{
                            "image": f"myorg/{service}:{new_version}"
                        }]
                    }
                }
            }
        })

        # 2. Progressive rollout
        for step in steps:
            print(f"Routing {step.weight}% to canary...")
            await self._update_traffic_split(service, step.weight)

            # Wait without blocking the event loop
            await asyncio.sleep(step.duration_min * 60)

            # Analyze
            success_rate = await self._query_prometheus(f"""
                sum(rate(http_requests_total{{
                    service="{service}",
                    version="canary",
                    status!~"5.."
                }}[5m])) /
                sum(rate(http_requests_total{{
                    service="{service}",
                    version="canary"
                }}[5m]))
            """)

            if success_rate < 0.95:
                print(f"FAILED at {step.weight}% — rolling back")
                await self._update_traffic_split(service, 0)
                await self.k8s.delete(f"{service}-canary")
                raise Exception("Canary failed analysis")

            print(f"OK at {step.weight}% — success_rate={success_rate}")

        # 3. Promote
        print("Promoting canary to stable")
        await self._promote(service, new_version)
        await self.k8s.delete(f"{service}-canary")

    async def _update_traffic_split(self, service, weight):
        # Update Istio VirtualService
        ...

    async def _query_prometheus(self, query):
        resp = requests.get(f"{self.prom}/api/v1/query", params={"query": query})
        return float(resp.json()["data"]["result"][0]["value"][1])

7. System Design Diagrams
7.1 Canary Rollout Flow
sequenceDiagram
    participant Dev
    participant CI
    participant Argo as Argo Rollouts
    participant Mesh as Istio
    participant Prom as Prometheus
    Dev->>CI: Push v2.0.0
    CI->>Argo: Update Rollout image
    Argo->>Argo: Spawn canary pod
    Argo->>Mesh: Set 5% traffic to canary
    Note over Argo: Wait 10min
    loop Every 1min
        Argo->>Prom: Query success_rate
        Prom-->>Argo: 0.97
        Note over Argo: ✓ pass
    end
    Argo->>Mesh: Set 25% traffic to canary
    Note over Argo: Wait 30min
    loop Every 1min
        Argo->>Prom: Query
        Prom-->>Argo: 0.92
        Note over Argo: ✗ fail (3 in row)
    end
    Argo->>Mesh: Rollback: 0% to canary
    Argo->>Argo: Delete canary pod
    Argo->>Dev: Slack alert: rollout aborted
7.2 Blue-Green vs Canary
flowchart TB
    subgraph BG["Blue-Green"]
        BGUser[Users]
        BGUser --> BGLB[Load Balancer]
        BGLB -->|100%| BGBlue[Blue v1]
        BGLB -.0%.-> BGGreen[Green v2 ready]
        Note1[Switch atomically:<br/>0% Blue → 100% Green]
    end
    subgraph Canary["Canary"]
        CUser[Users]
        CUser --> CMesh[Service Mesh]
        CMesh -->|95%| CStable[Stable v1]
        CMesh -->|5%| CCanary[Canary v2]
        Note2[Gradually shift weight:<br/>5% → 25% → 50% → 100%]
    end
    style Note1 fill:#fff9c4
    style Note2 fill:#c8e6c9
7.3 Feature Flag Decision Tree
flowchart TD
    Request[Request] --> Get[Get flag value]
    Get --> Cache{In SDK cache?}
    Cache -->|Yes, valid| Return[Return cached value]
    Cache -->|No| Fetch[Fetch from server]
    Fetch --> Eval{Evaluate rules}
    Eval --> Kill{Kill switch?}
    Kill -->|enabled=false| Default[Return default]
    Kill -->|enabled=true| User{User-specific override?}
    User -->|Yes| Override[Return override value]
    User -->|No| Segment{Match segment?}
    Segment -->|Yes| SegmentVal[Return segment value]
    Segment -->|No| Bucket{In rollout bucket?}
    Bucket -->|Yes| Enabled[Return true]
    Bucket -->|No| Default
    Return --> App[Application logic]
    Override --> App
    SegmentVal --> App
    Enabled --> App
    Default --> App
    style Default fill:#ffcdd2
    style Enabled fill:#c8e6c9
7.4 Decoupling Deploy from Release
gantt
    title Deploy vs Release Timeline
    dateFormat YYYY-MM-DD
    axisFormat %m-%d
    section Code
        Develop feature :2026-01-01, 14d
        Code merged + deployed :milestone, 2026-01-15, 0d
    section Hidden
        Internal testing :2026-01-15, 7d
        Beta users (1%) :2026-01-22, 7d
        Wider beta (10%) :2026-01-29, 14d
    section Released
        50% rollout :2026-02-12, 7d
        100% rollout :milestone, 2026-02-19, 0d
        Remove feature flag :2026-03-19, 0d
    section Big Bang (legacy)
        Develop :crit, 2026-01-01, 14d
        Release to all :milestone, crit, 2026-01-15, 0d
8. Aha Moments & Pitfalls
Aha Moments
#1: Deploy ≠ Release. Code can be deployed but feature OFF. This decouples engineering velocity from product launch decisions.
#2: Automated analysis = unbiased decision. Humans biased toward “ship it”. Automated metrics-based rollout = objective gate.
#3: 5% canary catches 95% of bugs. Issues that don’t reproduce in staging often surface at 5% real traffic.
#4: Feature flag = kill switch. Production incident? Disable feature in 30 seconds, no deploy. Faster than rollback.
#5: Sticky bucketing matters for UX. User getting feature today, not tomorrow = bad. Hash deterministically.
#6: DORA metrics correlate with business. Higher deploy frequency + lower failure rate = better profitability (Accelerate research).
#7: Progressive delivery + feature flags = compound benefit. Combined: deploy continuously, release gradually, rollback instantly.
#8: Flag debt is real. 100+ stale flags = liability. Lifecycle management mandatory.
Pitfalls
Pitfall 1: No analysis on canary
Deploy 5% but no metrics check → just slow rollout. Fix: AnalysisTemplate with success_rate + latency.
Pitfall 2: Flag in tight loop
if flag.enabled() { ... } called 1000x/request → SDK overhead. Fix: Evaluate once per request, cache.
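One way to implement "evaluate once per request" is a small per-request memo object. sdk_is_enabled below is a stand-in for the real SDK call, and the counter just makes the saving visible:

```python
calls = 0

def sdk_is_enabled(flag: str, user_id: str) -> bool:
    """Stand-in for the (potentially slow) flag SDK/network call."""
    global calls
    calls += 1
    return flag == "new_checkout"

class RequestFlags:
    """Created once per request; caches each flag's value for its lifetime,
    so a flag checked in a tight loop hits the SDK at most once."""
    def __init__(self, user_id: str):
        self.user_id = user_id
        self._cache: dict[str, bool] = {}

    def enabled(self, flag: str) -> bool:
        if flag not in self._cache:
            self._cache[flag] = sdk_is_enabled(flag, self.user_id)
        return self._cache[flag]

flags = RequestFlags("user-123")
results = [flags.enabled("new_checkout") for _ in range(1000)]  # the tight loop
# 1000 checks, 1 SDK call
```

A fresh RequestFlags per request also keeps values consistent within a request even if the flag flips mid-flight.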
Pitfall 3: Different bucket each visit
User in 10% today, 50% tomorrow → confusing UX. Fix: Sticky bucketing via hash(user_id).
Pitfall 4: No flag cleanup
200 flags accumulate, 80% dead. Fix: Owner + sunset_date. Quarterly cleanup.
Pitfall 5: Flags for permanent config
“Should we use Postgres or MySQL” — this is config, not flag. Fix: Flags = temporary. Permanent decisions in config files.
Pitfall 6: No statistical rigor
A/B test “showed lift” but n=50 users. Fix: Statistical significance, sample size calc.
Pitfall 7: Canary without traffic
5% canary at 3am = 0 actual users → no signal. Fix: Require minimum traffic for analysis.
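A simple guard for this (a hypothetical helper, not from any SDK): estimate how many requests the analysis window actually saw and skip or extend the window when the sample is too small to be meaningful:

```python
def has_signal(request_rate_rps: float, window_min: int,
               min_samples: int = 1000) -> bool:
    """True if the analysis window saw enough canary traffic to trust
    its metrics; otherwise the controller should wait or extend."""
    estimated_requests = request_rate_rps * 60 * window_min
    return estimated_requests >= min_samples

has_signal(0.2, 5)   # 60 requests at 3am -> not enough signal
has_signal(10, 5)    # 3000 requests -> analyze away
```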
Pitfall 8: Database not rollback-safe
Code v2 changes schema → can’t rollback to v1. Fix: Expand-Contract pattern.
Pitfall 9: Manual rollback
Bug detected at 50% → 30 minutes to manual rollback. Fix: Automated rollback on metric failure.
Pitfall 10: Feature flag for security
“Disable login for attackers” via flag → flag service is now critical path. Fix: Use rate limiting / WAF, not flags.
9. Internal Links
| Topic | Relation |
|---|---|
| Tuan-12-CICD-Pipeline | Foundation; PD adds verification + automation |
| Tuan-13-Monitoring-Observability | Metrics drive PD analysis |
| Tuan-11-Microservices-Pattern | Service mesh enables canary |
| Tuan-14-AuthN-AuthZ-Security | Flag-based security gates |
| Tuan-Bonus-Platform-Engineering-IDP | Self-service deploy via IDP |
References
Tools:
- Argo Rollouts — https://argo-rollouts.readthedocs.io/
- Flagger — https://docs.flagger.app/
- Spinnaker (Netflix) — https://spinnaker.io/
- LaunchDarkly — https://launchdarkly.com/
- Statsig — https://statsig.com/
- Unleash — https://www.getunleash.io/
- Flagsmith — https://flagsmith.com/
- OpenFeature — https://openfeature.dev/
Books:
- Continuous Delivery (Humble & Farley, 2010)
- Accelerate (Forsgren, Humble, Kim, 2018)
- Feature Flag Best Practices (LaunchDarkly e-book)
Research:
- DORA State of DevOps reports — https://dora.dev/research/
- Trunk-Based Development — https://trunkbaseddevelopment.com/
Engineering blogs:
- LaunchDarkly Engineering — https://launchdarkly.com/blog/engineering/
- Netflix, Spinnaker — https://netflixtechblog.com/
- Facebook, Gatekeeper — internal tool
- Google, Feature flags at scale
Next: Tuan-Bonus-Edge-Wasm-Architecture — Edge computing with WebAssembly.