Bonus Week: Multi-Tenancy SaaS Patterns

“Day 1: single tenant, everything is simple. Day 100: 1,000 tenants, and one enterprise ‘noisy neighbor’ strangles the other 999. Day 365: a compliance audit demands ‘prove tenant A’s data cannot leak to tenant B’, and you cannot answer. Multi-tenancy patterns are the life-or-death architecture decision for a SaaS.”

Tags: system-design multi-tenancy saas isolation noisy-neighbor bonus Student: Hieu (Backend Dev → Architect) Prerequisite: Tuan-07-Database-Sharding-Replication · Tuan-09-Rate-Limiter · Tuan-11-Microservices-Pattern Related: Tuan-Bonus-Multi-Region-Active-Active-DSQL · Case-Design-Payment-System · Tuan-14-AuthN-AuthZ-Security


1. Context & Why

Everyday analogy — housing models

Hieu, imagine you manage three housing models:

Silo (detached house):

  • Each family gets its own house, own land, own fence
  • Absolute privacy, 100% isolation
  • Expensive: one house per family
  • If house A burns down → only A loses; the fire doesn't spread to B

Pool (apartment building):

  • One building, many apartments
  • Resources are shared (elevators, hallways, plumbing)
  • Cheap: costs are split
  • One apartment catches fire → it can spread; one noisy resident → the whole floor hears it

Hybrid (tiered complex):

  • Tier 1 (Free): crowded apartment block
  • Tier 2 (Pro): upscale apartments with fewer residents
  • Tier 3 (Enterprise): private villa
  • Flexible: pay more = get more isolation

These map directly to the three multi-tenancy models in SaaS:

| Model | Isolation | Cost/tenant | Use case |
|---|---|---|---|
| Silo | Highest | Highest | Enterprise, regulated industries |
| Pool | Low | Lowest | Free tier, startup, B2C |
| Hybrid (Bridge) | Depends on tier | Depends on tier | SaaS with multiple plans |

Why does a backend dev need to understand multi-tenancy?

| Reason | Consequence if done wrong |
|---|---|
| Most products are SaaS | Multi-tenancy is the default problem, not the exception |
| Compliance (SOC 2, HIPAA, PCI) | “Prove isolation” — without the right architecture = audit fail |
| Noisy neighbor | One big tenant effectively DDoSes the whole platform |
| Cost economics | Pool for the free tier (cheap), Silo for enterprise (premium) |
| Data residency | EU tenants must store data in the EU; per-tenant region selection |
| BYOK (Bring Your Own Key) | Enterprise requirement; needs tenant-level encryption |

Why doesn't Alex Xu go deep here?

Alex Xu Vol 1+2 covers sharding (tenant-by-shard) but not the full multi-tenancy spectrum: the pool model, RLS, K8s namespace-per-tenant, blast-radius isolation. Multi-tenancy is a cross-cutting concern that shows up at every layer (DB, app, network, observability).



2. Deep Dive — Khái niệm cốt lõi

2.1 Tenancy Models — The Three Main Ones

2.1.1 Silo Model (1 tenant = 1 stack)

Tenant A:                    Tenant B:                   Tenant C:
┌──────────┐                ┌──────────┐                ┌──────────┐
│  App A   │                │  App B   │                │  App C   │
└────┬─────┘                └────┬─────┘                └────┬─────┘
┌────▼─────┐                ┌────▼─────┐                ┌────▼─────┐
│   DB A   │                │   DB B   │                │   DB C   │
└──────────┘                └──────────┘                └──────────┘

Pros:

  • Maximum isolation (security, performance, compliance)
  • Per-tenant customization (different versions, features)
  • Easy to migrate / archive 1 tenant

Cons:

  • High operational overhead (1000 tenants = 1000 DB instances)
  • Expensive
  • Hard to push fix to all tenants quickly
  • Complex deployment automation

Use case: Enterprise SaaS (Salesforce-scale customers), regulated industries (HIPAA/PCI).

2.1.2 Pool Model (all tenants share)

                ┌────────────────┐
                │   Shared App   │
                │ (multi-tenant) │
                └───────┬────────┘
                        │
                ┌───────▼────────┐
                │   Shared DB    │
                │  (tenant_id    │
                │   column +     │
                │   RLS)         │
                └────────────────┘

Pros:

  • Lowest cost (high tenant density)
  • Easy upgrades (1 deploy → all tenants)
  • Operational simplicity

Cons:

  • Noisy neighbor risk
  • Higher blast radius (1 bad query → all tenants slow)
  • Compliance harder to prove
  • An account takeover can leak data across tenants if RLS has a bug

Use case: B2C, freemium SaaS, low-value tenants.

2.1.3 Hybrid / Bridge / Pod Model

Concept: Combine — pool by default, silo for enterprise.

Tier 1 (Free):     [Shared Pool — 10K tenants]
Tier 2 (Pro):      [Pod 1: 100 tenants] [Pod 2: 100 tenants] ...
Tier 3 (Enterprise): [Silo 1] [Silo 2] [Silo 3]

Implementation patterns:

  • Sharded multi-tenant (Shopify pattern): pods of N tenants
  • Tier-based: Different deployments per tier
  • Cell-based: Independent cells (limit blast radius)
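
To make the tier-based idea concrete, here is a minimal routing sketch in Python; the TENANT_TIERS registry and the DSN strings are illustrative assumptions, not a real topology:

import hashlib

# Hypothetical tier registry and DSNs (illustrative placeholders only)
TENANT_TIERS = {"acme": "enterprise", "widget": "pro", "hobby42": "free"}
POOL_DSN = "postgres://pool-db/app"
POD_DSNS = ["postgres://pod-1/app", "postgres://pod-2/app"]

def dsn_for_tenant(tenant_id: str) -> str:
    tier = TENANT_TIERS.get(tenant_id, "free")
    if tier == "enterprise":
        # Silo: dedicated database per tenant
        return f"postgres://tenant-{tenant_id}-db/app"
    if tier == "pro":
        # Pod: a stable hash pins the tenant to one pod
        # (hashlib, not hash(), so the mapping survives process restarts)
        digest = hashlib.sha256(tenant_id.encode()).hexdigest()
        return POD_DSNS[int(digest, 16) % len(POD_DSNS)]
    # Free: shared pool, isolated via tenant_id column + RLS
    return POOL_DSN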

Pros:

  • Best of both worlds
  • Right-cost per tenant
  • Flexibility to upgrade tenant tier

Cons:

  • Operational complexity (multiple deployment shapes)
  • Cross-tier feature parity tricky

2.2 Database Tenancy Patterns

2.2.1 Database per tenant (Silo)

-- Each tenant has own database
CREATE DATABASE tenant_acme;
CREATE DATABASE tenant_widget_co;
 
-- App routes to the right database via the connection string
-- (e.g. postgres://host/tenant_acme) based on the tenant in URL/header.
-- Note: search_path selects schemas, not databases, so it does not apply here.

Pros: Maximum isolation, easy backup/restore per tenant.
Cons: 1000 tenants = 1000 DBs → connection pool nightmare.

2.2.2 Schema per tenant (PostgreSQL)

-- Single database, schema per tenant
CREATE SCHEMA tenant_acme;
CREATE SCHEMA tenant_widget;
 
CREATE TABLE tenant_acme.users (...);
CREATE TABLE tenant_widget.users (...);
 
-- App sets search_path
SET search_path = tenant_acme;
SELECT * FROM users;  -- Resolves to tenant_acme.users

Pros: Easier than DB-per-tenant, still good isolation.
Cons: Migration complexity (apply to all schemas), connection pooling tricky.
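
The migration pain is usually scripted away: loop over all tenant schemas and apply the same DDL to each. A minimal asyncpg sketch (error handling and per-schema tracking elided):

import asyncpg

async def migrate_all_schemas(dsn: str, ddl: str):
    """Apply one migration statement to every tenant schema."""
    conn = await asyncpg.connect(dsn)
    try:
        schemas = await conn.fetch(
            "SELECT schema_name FROM information_schema.schemata "
            "WHERE schema_name LIKE 'tenant_%'"
        )
        for row in schemas:
            # search_path cannot be a bind parameter; the value comes
            # from the system catalog, not from user input
            await conn.execute(f'SET search_path = "{row["schema_name"]}"')
            await conn.execute(ddl)
    finally:
        await conn.close()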

2.2.3 Shared schema with tenant_id (Pool)

-- Single schema, tenant_id column on every table
CREATE TABLE users (
    id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL,
    email TEXT,
    ...
);
 
-- Index includes tenant_id
CREATE INDEX idx_users_tenant ON users (tenant_id, email);
 
-- Every query MUST filter by tenant_id
SELECT * FROM users WHERE tenant_id = $1 AND email = $2;

Risk: Forget WHERE tenant_id → leak across tenants. Use Row-Level Security (RLS).
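
Before RLS enters the picture, an application-layer guard can refuse any query that lacks a tenant filter. A crude sketch (real codebases enforce this in the ORM or repository layer; the substring check is a simplification):

class TenantScopedRepo:
    """Wraps a DB connection; refuses queries without a tenant filter."""

    def __init__(self, conn, tenant_id: str):
        self.conn = conn
        self.tenant_id = tenant_id

    async def fetch(self, query: str, *args):
        if "tenant_id" not in query.lower():
            raise ValueError(f"Query missing tenant filter: {query}")
        # Convention: $1 is always the tenant_id
        return await self.conn.fetch(query, self.tenant_id, *args)

# Usage:
#   repo = TenantScopedRepo(conn, tenant_id)
#   await repo.fetch("SELECT * FROM users WHERE tenant_id = $1 AND email = $2", email)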

2.2.4 PostgreSQL Row-Level Security (RLS)

-- Enable RLS
ALTER TABLE users ENABLE ROW LEVEL SECURITY;
 
-- Policy: users see only their tenant
CREATE POLICY tenant_isolation ON users
    USING (tenant_id = current_setting('app.tenant_id')::uuid);
 
-- App sets tenant context per session
SET app.tenant_id = 'acme-uuid-here';
SELECT * FROM users;  -- Auto-filtered to acme tenant

Magic: Even if app forgets WHERE clause, RLS prevents cross-tenant leak.

Performance: RLS adds query plan overhead (~5-10%). Add tenant_id to indexes.

Tools:

  • pgvector: RLS policies apply to vector tables like any other table
  • Supabase: built on Postgres RLS
  • PostGraphile: generates a GraphQL API that leverages RLS

2.2.5 Modern Alternative — Neon Branching

Neon (serverless Postgres) supports per-tenant branching:

  • Each tenant = separate Postgres branch
  • Copy-on-write storage → cheap
  • Independent compute (autosuspend)
  • Sub-second branch creation

Use case: Per-tenant dev/staging environments, “free tier with own DB”.

2.3 Network-Level Isolation

2.3.1 Kubernetes Namespace-per-Tenant

# Tenant gets own namespace
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-acme
 
---
# NetworkPolicy: deny cross-tenant traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cross-tenant
  namespace: tenant-acme
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: tenant-acme
        - namespaceSelector:
            matchLabels:
              name: shared-platform
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: tenant-acme
        - namespaceSelector:
            matchLabels:
              name: shared-platform

2.3.2 K8s vCluster (per-tenant virtual cluster)

vCluster (Loft Labs): Each tenant gets virtual K8s cluster inside shared physical cluster.

# Create vCluster for tenant
vcluster create acme-cluster --namespace tenant-acme
 
# Tenant has own k8s API, can deploy own resources
# Physical cluster admin doesn't see tenant's resources

Pros: Strong K8s-level isolation; the tenant has admin rights inside its own vCluster.
Cons: Resource overhead, complexity.

2.3.3 Capsule

Capsule: K8s native multi-tenancy operator.

  • Tenants = K8s CRD
  • Quota enforcement
  • Network policies per tenant
  • RBAC per tenant

2.4 Tenant Context Propagation

Critical: Every operation must know “which tenant”.

# Middleware extracts tenant from JWT/header
async def tenant_middleware(request, call_next):
    # 1. Extract from JWT
    token = request.headers.get("Authorization", "").replace("Bearer ", "")
    claims = jwt.decode(token, key, algorithms=["RS256"])
    tenant_id = claims["tenant_id"]
 
    # 2. Set context (works with asyncio)
    tenant_context.set(tenant_id)
 
    # 3. Set DB session variable for RLS. set_config() accepts a bind
    #    parameter; interpolating tenant_id into an f-string risks SQL injection.
    async with get_db_pool().acquire() as conn:
        await conn.fetchval(
            "SELECT set_config('app.tenant_id', $1, false)", tenant_id
        )
 
    # 4. Pass to log/metrics
    log.contextualize(tenant_id=tenant_id)
 
    response = await call_next(request)
    response.headers["X-Tenant-ID"] = tenant_id
    return response

Common bugs:

  • Background jobs without tenant context (cross-tenant leak)
  • Logger forgets tenant_id (unable to debug per-tenant)
  • Cache key doesn’t include tenant_id (cache leak)
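
The cache bug in particular is cheap to prevent: wrap the cache client so every key is namespaced by tenant. A minimal sketch, assuming the tenant_context ContextVar from the middleware above and a redis asyncio client:

class TenantCache:
    """Namespaces every cache key by the current tenant."""

    def __init__(self, redis):
        self.redis = redis

    def _key(self, key: str) -> str:
        # Raises LookupError if no tenant context is set; failing
        # loudly beats silently sharing keys across tenants
        return f"{tenant_context.get()}:{key}"

    async def get(self, key: str):
        return await self.redis.get(self._key(key))

    async def set(self, key: str, value, ttl: int = 300):
        await self.redis.set(self._key(key), value, ex=ttl)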

2.5 Noisy Neighbor Mitigation

Problem: 1 tenant runs heavy report query, blocks others.

2.5.1 Per-tenant rate limiting

# Fixed-window counter per tenant (a simpler stand-in for the token bucket from Tuan-09-Rate-Limiter)
class TenantRateLimiter:
    def __init__(self, redis):
        self.redis = redis
 
    async def check(self, tenant_id: str, action: str) -> bool:
        key = f"ratelimit:{tenant_id}:{action}"
        count = await self.redis.incr(key)
        if count == 1:
            await self.redis.expire(key, 60)
        return count <= self._limit_for(tenant_id, action)
 
    def _limit_for(self, tenant_id, action):
        tier = self._get_tier(tenant_id)  # tier lookup from tenant config (not shown)
        limits = {
            "free": {"api": 100, "report": 5},
            "pro": {"api": 1000, "report": 50},
            "enterprise": {"api": 10000, "report": 500},
        }
        return limits[tier][action]

2.5.2 Resource quotas (K8s)

apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-acme-quota
  namespace: tenant-acme
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
    services: "20"

2.5.3 Database connection pooling per tenant

# PgBouncer per tenant for predictable resource usage
pgbouncer_config = {
    "tenant_acme": {
        "max_client_conn": 100,
        "default_pool_size": 20,
    },
    "tenant_widget": {
        "max_client_conn": 50,
        "default_pool_size": 10,
    },
}

2.5.4 Query timeouts per tenant

-- PostgreSQL: SET statement_timeout per session
SET app.tenant_id = 'tenant-acme';
SET statement_timeout = '5s';  -- Free tier
-- Enterprise tier:
SET statement_timeout = '60s';
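
Both the RLS context and the per-tier timeout can be applied together when a connection is checked out. A sketch using set_config(), which, unlike SET, accepts bind parameters (see section 6.1); TIER_TIMEOUTS is an illustrative mapping:

TIER_TIMEOUTS = {"free": "5s", "pro": "30s", "enterprise": "60s"}  # illustrative

async def apply_tenant_limits(conn, tenant_id: str, tier: str):
    """Set RLS context and per-tier timeout on a checked-out asyncpg connection.
    Call inside an open transaction: is_local=true scopes both values to it."""
    await conn.fetchval(
        "SELECT set_config('app.tenant_id', $1, true)", tenant_id
    )
    await conn.fetchval(
        "SELECT set_config('statement_timeout', $1, true)",
        TIER_TIMEOUTS.get(tier, "5s"),
    )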

2.5.5 Fair scheduling

Async job processing with fair queueing:

Naive: FIFO queue. 1 tenant submits 10K jobs → blocks others.

Fair: Round-robin across tenants:
  Queue tenant-A: [j1, j2, j3, j4]
  Queue tenant-B: [j5]
  Queue tenant-C: [j6, j7]

  Process: j1, j5, j6, j2, j7, j3, j4
  → No starvation

Tools: Sidekiq Enterprise (Ruby), AWS SQS FIFO with group ID.

2.6 Tenant Onboarding & Provisioning

Automated provisioning pipeline:

async def onboard_tenant(name: str, tier: str):
    """Provision new tenant idempotently."""
    tenant_id = generate_uuid()
 
    # 1. Database
    if tier == "enterprise":
        # Silo: dedicated DB
        await create_database(f"tenant_{tenant_id}")
        await run_migrations(tenant_id)
    else:
        # Pool: insert tenant record + RLS context
        await db.execute(
            "INSERT INTO tenants (id, name, tier) VALUES ($1, $2, $3)",
            tenant_id, name, tier
        )
 
    # 2. K8s namespace (if enterprise)
    if tier == "enterprise":
        await k8s.create_namespace(f"tenant-{tenant_id}")
        await k8s.apply_quota(tenant_id, tier_to_quota(tier))
 
    # 3. Storage bucket
    await s3.create_bucket(f"tenant-{tenant_id}-uploads")
 
    # 4. Encryption key (BYOK pattern)
    if tier == "enterprise":
        kms_key = await kms.create_key(
            description=f"Tenant {name}",
            tags={"tenant_id": tenant_id}
        )
        await store_tenant_kms_key(tenant_id, kms_key)
 
    # 5. Default admin user
    admin_user = await create_user(
        email=name + "@admin.com",
        tenant_id=tenant_id,
        role="admin"
    )
 
    # 6. Welcome email + setup checklist
    await send_welcome(admin_user.email, tenant_id)
 
    return tenant_id

Idempotency: a failure mid-way must be safe to retry. Each step first checks whether its resource already exists (see the sketch below).
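
One way to get that retry safety: record completed steps per tenant and skip them on retry. A minimal sketch; the provisioning_steps table is an assumed bookkeeping table:

async def run_step(db, tenant_id: str, step: str, fn):
    """Run a provisioning step at most once per tenant."""
    done = await db.fetchval(
        "SELECT 1 FROM provisioning_steps WHERE tenant_id = $1 AND step = $2",
        tenant_id, step,
    )
    if done:
        return  # already completed on a previous attempt
    await fn()
    await db.execute(
        "INSERT INTO provisioning_steps (tenant_id, step) VALUES ($1, $2)",
        tenant_id, step,
    )

# Usage inside onboard_tenant:
#   await run_step(db, tenant_id, "s3_bucket",
#                  lambda: s3.create_bucket(f"tenant-{tenant_id}-uploads"))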

2.7 Tenant Deletion (GDPR / offboarding)

async def delete_tenant(tenant_id: str):
    """GDPR-compliant tenant deletion."""
 
    # 1. Soft delete: mark for deletion (30-day grace period)
    await db.execute(
        "UPDATE tenants SET status = 'deleting', delete_after = NOW() + INTERVAL '30 days' WHERE id = $1",
        tenant_id
    )
 
    # 2. Disable access immediately
    await invalidate_all_tokens(tenant_id)
    await k8s.scale_namespace(f"tenant-{tenant_id}", replicas=0)
 
    # 3. After 30 days (cron job)
    if past_grace_period(tenant_id):
        # Delete data
        if is_silo_tenant(tenant_id):
            await drop_database(f"tenant_{tenant_id}")
        else:
            await db.execute(
                "DELETE FROM users WHERE tenant_id = $1", tenant_id
            )
            # ... all tenant tables
 
        # Delete S3
        await s3.delete_bucket(f"tenant-{tenant_id}-uploads")
 
        # Schedule KMS key deletion (7-30 days mandatory waiting)
        await kms.schedule_key_deletion(get_tenant_kms_key(tenant_id), 30)
 
        # Delete K8s namespace
        await k8s.delete_namespace(f"tenant-{tenant_id}")
 
        # Audit log
        await audit_log.record(
            event="tenant_deleted",
            tenant_id=tenant_id,
            timestamp=now()
        )

2.8 Cell-based Multi-Tenancy

Pattern: Group N tenants into “cells”, each cell isolated.

Cell 1: Tenants A-F (max 100 tenants)
Cell 2: Tenants G-M
Cell 3: Tenants N-T
Cell 4: Tenants U-Z

Each cell:
  - Own DB instance
  - Own app deployment
  - Independent failure domain

Pros:

  • Blast radius = 1 cell (10-100 tenants)
  • Easier capacity planning
  • A/B test deploy at cell level

Cons:

  • Need cell router (which cell does tenant X belong to?)
  • Operational overhead (N cells to manage)

Reference: Tuan-11-Microservices-Pattern, section 2.15 Cell-based Architecture.
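
The "which cell?" question from the cons list is usually answered by a small routing service. A minimal sketch, assuming a persistent dict-like placement_store (real cell routers persist placement explicitly rather than hashing, so a tenant can later be migrated between cells):

class CellRouter:
    """Maps tenant_id → cell_id. Placement is stored, not derived."""

    def __init__(self, placement_store, cells: list[str]):
        self.store = placement_store  # assumed dict-like persistent store
        self.cells = cells

    def cell_for(self, tenant_id: str) -> str:
        cell = self.store.get(tenant_id)
        if cell is None:
            # Assign new tenants to the least-populated cell
            cell = min(self.cells, key=self._tenant_count)
            self.store[tenant_id] = cell
        return cell

    def _tenant_count(self, cell: str) -> int:
        return sum(1 for c in self.store.values() if c == cell)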


3. Estimation

3.1 Density per pool

PostgreSQL pool model with proper indexes:

  • 1 PG instance handles ~10K-100K small tenants
  • Bottleneck: connection pooling (PgBouncer can multiplex thousands of clients onto ~100 server connections)

Per-tenant overhead (pool, RLS):

  • ~5-10% query overhead from RLS policy evaluation
  • Minimal storage overhead (just tenant_id column)

3.2 Cost per tenant

| Model | Cost per tenant/month |
|---|---|
| Pool (shared infra) | ~$1 |
| Pod (10-100 tenants per pod) | ~$50 |
| Silo (dedicated infra) | $500+ |

ROI calculation:

  • Free tier (low ARPU $0): MUST be pool
  • Pro tier ($50/month ARPU): Pod or pool
  • Enterprise ($1000+/month ARPU): Silo OK
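
Plugging the cost table from 3.2 into a quick margin check makes the rule obvious; a small sketch with those numbers:

# ARPU vs infra cost per tenant/month (numbers from the tables above)
tiers = {
    "free":       {"arpu": 0},
    "pro":        {"arpu": 50},
    "enterprise": {"arpu": 1000},
}
model_cost = {"pool": 1, "pod": 50, "silo": 500}

for tier, t in tiers.items():
    for model, cost in model_cost.items():
        margin = t["arpu"] - cost
        print(f"{tier:11s} on {model:4s}: margin ${margin}/month")
# → the free tier only survives on pool; enterprise ARPU absorbs silo cost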

3.3 Onboarding time

| Operation | Pool | Silo |
|---|---|---|
| Tenant creation | < 1s (insert row) | 5-30 min (provision DB, namespace, etc.) |
| Tenant deletion | Minutes (DELETE rows) | Hours (drop DB, cleanup) |

3.4 Noisy neighbor blast radius

| Model | Blast radius |
|---|---|
| Pure Pool | 100% (all tenants) |
| Pod | 1-10% (single pod) |
| Silo | 0.1% (single tenant) |
| Cell-based pod | 1% (single cell) |

4. Security First

4.1 The “Cross-tenant leak” nightmare

Worst case: Bug allows tenant A user to read tenant B data.

Defense in depth:

  1. App layer: Always filter by tenant_id (review in code review)
  2. DB layer: RLS policies (defense even if app forgets)
  3. Network: Per-tenant namespaces (vCluster, Capsule)
  4. Audit: Log every cross-tenant access attempt
  5. Testing: Automated tenant isolation tests

4.2 Tenant Isolation Testing

# Pytest: ensure tenant isolation
@pytest.mark.parametrize("attacker_tenant,victim_tenant", [
    ("tenant-a", "tenant-b"),
])
def test_cannot_read_other_tenant(attacker_tenant, victim_tenant):
    # Create data in victim tenant
    with set_tenant_context(victim_tenant):
        victim_user = create_user(email="victim@example.com")
 
    # Try to read as attacker
    with set_tenant_context(attacker_tenant):
        result = get_user_by_email("victim@example.com")
        assert result is None, "Cross-tenant leak detected!"
 
    # Try direct ID access
    with set_tenant_context(attacker_tenant):
        result = get_user_by_id(victim_user.id)
        assert result is None, "Cross-tenant leak via direct ID!"
 
# Run for ALL endpoints, not just selected
def test_all_endpoints_isolated():
    for endpoint in get_all_api_endpoints():
        for method in endpoint.methods:
            # Attempt cross-tenant access
            assert is_endpoint_isolated(endpoint, method)

4.3 BYOK (Bring Your Own Key)

Enterprise tenants want to control their encryption keys.

# Per-tenant KMS key (kms: assumed thin wrapper around a KMS client, e.g. AWS KMS)
def encrypt_for_tenant(tenant_id: str, plaintext: bytes) -> bytes:
    key_arn = get_tenant_kms_key(tenant_id)  # From tenant config
    return kms.encrypt(key_id=key_arn, plaintext=plaintext)
 
def decrypt_for_tenant(tenant_id: str, ciphertext: bytes) -> bytes:
    return kms.decrypt(ciphertext=ciphertext)
    # KMS auto-detects key from ciphertext metadata

Implications:

  • Tenant can revoke key → permanent data loss for that tenant (intentional)
  • Per-tenant KMS = per-tenant cost
  • Compliance benefit: tenant proves control of their data

4.4 Audit Logging Per Tenant

Every action must be auditable per tenant:

CREATE TABLE audit_log (
    id BIGSERIAL,
    timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    tenant_id UUID NOT NULL,
    user_id UUID,
    action TEXT NOT NULL,
    resource_type TEXT,
    resource_id TEXT,
    ip_address INET,
    user_agent TEXT,
    metadata JSONB,
    -- Append-only: revoke UPDATE/DELETE
    -- PK must include the partition key on a partitioned table
    PRIMARY KEY (id, tenant_id)
) PARTITION BY LIST (tenant_id);
 
-- Partition by tenant for efficient retrieval
CREATE TABLE audit_log_tenant_acme PARTITION OF audit_log
  FOR VALUES IN ('acme-tenant-uuid');

Forward to immutable storage (S3 Object Lock, blockchain) for compliance.

4.5 Compliance Frameworks

| Framework | Multi-tenancy requirements |
|---|---|
| SOC 2 Type II | "Logical isolation" (RLS sufficient if properly tested) |
| HIPAA | Stronger isolation; many require silo for PHI |
| PCI DSS | Card data must be tokenized; cardholder environment separated |
| GDPR | Tenant consent, data portability, right to delete |
| ISO 27001 | Risk assessment per tenant, controls documented |
| FedRAMP | High baseline often requires a gov-cloud silo |

5. DevOps — Operating a Multi-Tenant SaaS

5.1 Provisioning Pipeline

# Argo Workflow: tenant provisioning
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: provision-tenant-
spec:
  entrypoint: provision
  arguments:
    parameters:
      - name: tenant-name
      - name: tier
 
  templates:
    - name: provision
      steps:
        - - name: create-namespace
            template: k8s-namespace
        - - name: create-database
            template: db-setup
        - - name: apply-quota
            template: resource-quota
        - - name: setup-monitoring
            template: monitoring-config
        - - name: send-welcome
            template: notify
 
    - name: k8s-namespace
      script:
        image: kubectl:latest
        command: [sh]
        source: |
          kubectl create namespace tenant-{{workflow.parameters.tenant-name}}
          kubectl label namespace tenant-{{workflow.parameters.tenant-name}} \
            tenant={{workflow.parameters.tenant-name}} \
            tier={{workflow.parameters.tier}}

5.2 Per-tenant monitoring

Critical: Dashboards filterable by tenant.

# Prometheus queries per tenant
rate(http_requests_total{tenant_id="acme"}[5m])
 
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket{tenant_id="acme"}[5m])
)
 
# Per-tenant resource usage
sum(rate(container_cpu_usage_seconds_total{namespace=~"tenant-.*"}[5m]))
  by (namespace)

Grafana variable: $tenant dropdown → filter all panels.
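
For those queries to exist, the tenant label must be attached at instrumentation time. A sketch with prometheus_client (watch label cardinality: with tens of thousands of tenants, prefer per-tier labels or exemplars):

from prometheus_client import Counter, Histogram

HTTP_REQUESTS = Counter(
    "http_requests_total", "HTTP requests", ["tenant_id", "status"]
)
HTTP_LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency", ["tenant_id"]
)

def record_request(tenant_id: str, status: int, seconds: float):
    # Every sample carries the tenant label, enabling per-tenant queries
    HTTP_REQUESTS.labels(tenant_id=tenant_id, status=str(status)).inc()
    HTTP_LATENCY.labels(tenant_id=tenant_id).observe(seconds)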

5.3 Per-tenant alerting

- alert: TenantQuotaApproached
  expr: |
    (
      sum(container_memory_usage_bytes) by (namespace)
      / on(namespace)
      kube_resourcequota{type="hard", resource="requests.memory"}
    ) > 0.8
  for: 5m
  labels:
    severity: warning
    tenant: "{{ $labels.namespace }}"
  annotations:
    summary: "Tenant {{ $labels.namespace }} at 80% memory quota"
 
- alert: TenantSLOViolation
  expr: |
    (
      sum by (tenant_id) (rate(http_requests_total{status=~"5..", tenant_id=~".+"}[5m]))
      /
      sum by (tenant_id) (rate(http_requests_total{tenant_id=~".+"}[5m]))
    ) > 0.01
  for: 10m
  labels:
    severity: critical
    tenant: "{{ $labels.tenant_id }}"
  annotations:
    summary: "SLO violation for tenant {{ $labels.tenant_id }}: > 1% error rate"

5.4 Tenant cost allocation

FinOps integration: Track infrastructure cost per tenant.

# Daily cost allocation job
def allocate_costs():
    total_cost = aws.get_billing_total()  # $X
 
    # Allocation strategies:
    # 1. By usage (compute time, storage)
    tenant_usage = {}
    for tenant_id in get_all_tenants():
        tenant_usage[tenant_id] = {
            "compute": prometheus.get_metric(
                f'sum(container_cpu_usage_seconds_total{{tenant_id="{tenant_id}"}})'
            ),
            "storage": s3.get_bucket_size(f"tenant-{tenant_id}-*"),
            "network": cloudwatch.get_metric("NetworkOut", tenant_id),
        }
 
    # 2. Or weight by tier (proportional); consumed by calculate_tenant_cost (not shown)
    tier_weights = {"free": 0.1, "pro": 1.0, "enterprise": 10.0}
 
    for tenant_id in get_all_tenants():
        cost = calculate_tenant_cost(tenant_id, total_cost, tenant_usage)
        save_cost_record(tenant_id, cost, date=today())

5.5 Disaster scenarios

| Scenario | Mitigation |
|---|---|
| Cross-tenant data leak detected | Halt service, security incident response, forensics, customer notification |
| One tenant overuses → impacts others | Quotas + circuit breaker; auto-isolate (see the sketch below) |
| Migration fails for some tenants | Per-tenant migration tracking, rollback per tenant |
| Tenant requests forensic copy | Pre-built tooling for per-tenant backup export |
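
The auto-isolate mitigation can be a per-tenant circuit breaker: when one tenant's error rate spikes, shed only that tenant's traffic. A minimal in-process sketch (thresholds are illustrative):

import time
from collections import defaultdict

class TenantCircuitBreaker:
    """Trips per tenant, so one misbehaving tenant is shed
    without touching the others."""

    def __init__(self, max_errors: int = 50, window_s: int = 60):
        self.max_errors = max_errors
        self.window_s = window_s
        self.errors: dict[str, list[float]] = defaultdict(list)

    def record_error(self, tenant_id: str):
        now = time.monotonic()
        self.errors[tenant_id].append(now)
        # Keep only errors inside the sliding window
        self.errors[tenant_id] = [
            t for t in self.errors[tenant_id] if now - t < self.window_s
        ]

    def allow(self, tenant_id: str) -> bool:
        now = time.monotonic()
        recent = [t for t in self.errors[tenant_id] if now - t < self.window_s]
        return len(recent) < self.max_errors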

6. Code Implementation

6.1 PostgreSQL RLS-based pool tenancy

-- Migration: enable multi-tenancy
ALTER TABLE users ADD COLUMN tenant_id UUID NOT NULL;
CREATE INDEX idx_users_tenant_email ON users (tenant_id, email);
 
ALTER TABLE users ENABLE ROW LEVEL SECURITY;
 
CREATE POLICY tenant_isolation ON users
    FOR ALL
    USING (tenant_id = current_setting('app.tenant_id')::uuid);
 
-- Repeat for all tenant tables
 
# FastAPI app with tenant context
from contextvars import ContextVar
from fastapi import FastAPI, Request, Depends
from fastapi.responses import JSONResponse
import jwt  # PyJWT
import asyncpg
 
JWT_SECRET = "change-me"  # demo secret for HS256; use real key management
 
app = FastAPI()
tenant_ctx: ContextVar[str] = ContextVar("tenant_id")
db_pool: asyncpg.Pool = None
 
 
@app.middleware("http")
async def tenant_middleware(request: Request, call_next):
    auth = request.headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        # Raising HTTPException inside middleware bypasses FastAPI's
        # exception handlers, so return a response directly
        return JSONResponse({"detail": "Missing token"}, status_code=401)
 
    try:
        claims = jwt.decode(
            auth[7:], JWT_SECRET, algorithms=["HS256"]
        )
        tenant_id = claims["tenant_id"]
    except jwt.InvalidTokenError:  # PyJWT's base error class
        return JSONResponse({"detail": "Invalid token"}, status_code=401)
 
    tenant_ctx.set(tenant_id)
    response = await call_next(request)
    response.headers["X-Tenant-ID"] = tenant_id
    return response
 
 
async def get_db():
    """DB connection with tenant context set."""
    async with db_pool.acquire() as conn:
        tenant_id = tenant_ctx.get()
        # SET cannot take bind parameters, so use set_config() instead.
        # is_local=true scopes the setting to this transaction, so the
        # tenant id cannot leak when the connection returns to the pool.
        async with conn.transaction():
            await conn.fetchval(
                "SELECT set_config('app.tenant_id', $1, true)",
                tenant_id,
            )
            yield conn
 
 
@app.get("/users")
async def list_users(conn=Depends(get_db)):
    # No need to filter by tenant_id — RLS does it
    rows = await conn.fetch("SELECT id, email FROM users LIMIT 100")
    return [dict(r) for r in rows]

6.2 Tenant context propagation across services

"""
Propagate tenant context across HTTP, gRPC, async jobs.
"""
 
import httpx
from contextvars import ContextVar
 
tenant_ctx: ContextVar[str] = ContextVar("tenant_id")
 
 
class TenantAwareHTTPClient(httpx.AsyncClient):
    """HTTP client that auto-propagates tenant context."""
 
    async def request(self, method, url, **kwargs):
        headers = kwargs.pop("headers", {})
        try:
            tenant_id = tenant_ctx.get()
            headers["X-Tenant-ID"] = tenant_id
        except LookupError:
            pass  # No tenant context (e.g., system call)
        return await super().request(method, url, headers=headers, **kwargs)
 
 
# Usage
client = TenantAwareHTTPClient()
async with client:
    # Tenant ID auto-included in headers
    response = await client.get("https://internal-service/data")
 
 
# For async jobs (Celery, RQ, etc.): a mixin intended for celery.Task
class TenantAwareTask:
    def apply_async(self, args=None, **kwargs):
        try:
            tenant_id = tenant_ctx.get()
            kwargs["headers"] = kwargs.get("headers") or {}
            kwargs["headers"]["tenant_id"] = tenant_id
        except LookupError:
            pass
        return super().apply_async(args, **kwargs)
 
 
# Worker side: restore context. Note: before_task_publish fires on the
# *publisher*; task_prerun is the worker-side signal.
from celery.signals import task_prerun
 
@task_prerun.connect
def restore_tenant_context(task=None, **kwargs):
    # Depending on Celery protocol version, custom headers appear on
    # task.request.headers or directly on task.request
    headers = getattr(task.request, "headers", None) or {}
    tenant_id = headers.get("tenant_id") or getattr(task.request, "tenant_id", None)
    if tenant_id:
        tenant_ctx.set(tenant_id)

6.3 Fair queueing for async jobs

"""
Fair scheduler: prevents 1 tenant from monopolizing job queue.
"""
 
import asyncio
from collections import deque
 
 
class FairTenantQueue:
    """Round-robin across tenant queues."""
 
    def __init__(self):
        self.queues: dict[str, deque] = {}
        self.tenant_order: list[str] = []
        self.idx = 0
        self.lock = asyncio.Lock()
        self.event = asyncio.Event()
 
    async def enqueue(self, tenant_id: str, job):
        async with self.lock:
            if tenant_id not in self.queues:
                self.queues[tenant_id] = deque()
                self.tenant_order.append(tenant_id)
            self.queues[tenant_id].append(job)
            self.event.set()
 
    async def dequeue(self) -> tuple:  # blocks until a job is available
        while True:
            async with self.lock:
                # Round-robin
                tried = 0
                while tried < len(self.tenant_order):
                    tenant = self.tenant_order[self.idx]
                    self.idx = (self.idx + 1) % len(self.tenant_order)
 
                    queue = self.queues[tenant]
                    if queue:
                        job = queue.popleft()
                        return tenant, job
 
                    tried += 1
 
                # All empty, wait for new job
                self.event.clear()
 
            await self.event.wait()
 
 
# Worker
async def worker(queue: FairTenantQueue):
    while True:
        # dequeue() blocks until a job is available, so no None check is needed
        tenant_id, job = await queue.dequeue()
        tenant_ctx.set(tenant_id)  # restore tenant context for RLS/logging
        await process(job)  # process(): your job handler (not shown)

7. System Design Diagrams

7.1 Tenancy Models Spectrum

flowchart LR
    subgraph Pool["Pool Model"]
        SharedApp[Shared App<br/>multi-tenant]
        SharedDB[(Shared DB<br/>tenant_id + RLS)]
        SharedApp --> SharedDB
    end

    subgraph Pod["Pod Model"]
        PodApp1[Pod 1<br/>~100 tenants]
        PodApp2[Pod 2<br/>~100 tenants]
        PodDB1[(Pod 1 DB)]
        PodDB2[(Pod 2 DB)]
        PodApp1 --> PodDB1
        PodApp2 --> PodDB2
    end

    subgraph Silo["Silo Model"]
        SiloApp1[Tenant A App]
        SiloApp2[Tenant B App]
        SiloDB1[(Tenant A DB)]
        SiloDB2[(Tenant B DB)]
        SiloApp1 --> SiloDB1
        SiloApp2 --> SiloDB2
    end

    Pool -->|"upgrade to Pro"| Pod
    Pod -->|"upgrade to Enterprise"| Silo
    End[Running all three tiers together = Hybrid SaaS]

    style Pool fill:#fff9c4
    style Pod fill:#c8e6c9
    style Silo fill:#bbdefb

7.2 RLS Tenant Isolation

sequenceDiagram
    participant User
    participant App
    participant Auth as JWT Auth
    participant DB as PostgreSQL

    User->>App: GET /users (Bearer token)
    App->>Auth: Decode JWT
    Auth-->>App: tenant_id="acme"

    App->>DB: BEGIN<br/>SET app.tenant_id = 'acme'
    App->>DB: SELECT * FROM users<br/>(no WHERE clause!)

    DB->>DB: Apply RLS policy:<br/>tenant_id = current_setting('app.tenant_id')
    DB-->>App: Only acme users returned

    App->>DB: COMMIT
    App-->>User: { users: [...] }

    Note over DB: Even if app forgets WHERE,<br/>RLS prevents cross-tenant leak

7.3 Cell-Based Multi-Tenancy

flowchart TB
    Router[Cell Router<br/>tenant_id → cell_id]

    Router --> Cell1
    Router --> Cell2
    Router --> Cell3

    subgraph Cell1["Cell 1 (Tenants 1-100)"]
        App1[App Tier]
        DB1[(DB)]
        App1 --> DB1
    end

    subgraph Cell2["Cell 2 (Tenants 101-200)"]
        App2[App Tier]
        DB2[(DB)]
        App2 --> DB2
    end

    subgraph Cell3["Cell 3 (Tenants 201-300)"]
        App3[App Tier]
        DB3[(DB)]
        App3 --> DB3
    end

    User[User of tenant 50] --> Router

    Note["Bug in Cell 1 → only 100 tenants affected<br/>vs 300 in single pool"]

    style Note fill:#fff9c4

7.4 Noisy Neighbor Mitigation Layers

flowchart TD
    Request[Tenant Request]

    Request --> L1{Rate Limit<br/>per tenant?}
    L1 -->|Exceeded| Block1[429 Too Many Requests]
    L1 -->|OK| L2{Resource Quota<br/>K8s ResourceQuota?}

    L2 -->|Exceeded| Block2[503 Service Unavailable]
    L2 -->|OK| L3{Connection Pool<br/>tenant-aware?}

    L3 -->|Exceeded| Block3[Wait or Reject]
    L3 -->|OK| L4{Query Timeout<br/>per tenant?}

    L4 -->|Exceeded| Block4[Cancel query]
    L4 -->|OK| Success[Process request]

    style Block1 fill:#ffcdd2
    style Block2 fill:#ffcdd2
    style Block3 fill:#ffcdd2
    style Block4 fill:#ffcdd2
    style Success fill:#c8e6c9

8. Aha Moments & Pitfalls

Aha Moments

#1: Multi-tenancy is the first decision that touches everything. Pool vs Silo vs Hybrid determines cost, complexity, compliance, and scalability, and it is hard to migrate later.

#2: PostgreSQL RLS is defense-in-depth. Even if app code has a bug and forgets a WHERE clause, RLS prevents the cross-tenant leak. Mandatory for the pool model.

#3: Noisy neighbor is the silent killer. One enterprise customer with a bad query → 99% of small tenants slow down → churn. Per-tenant rate limits + quotas are mandatory.

#4: The cell-based pattern reduces blast radius. Pure pool: 100% impact. Cell pool: 1-10% impact. AWS DynamoDB, Slack, and Stripe all use this pattern.

#5: Tenant context propagation is hard. HTTP requests are fine, async jobs are tricky, and background workers are easy to forget. A logger without tenant_id makes per-tenant debugging impossible.

#6: BYOK is an enterprise differentiator. Big customers care about key control. Implementation cost is high, but premium pricing justifies it.

#7: Compliance shapes architecture. HIPAA and FedRAMP often force the silo model. Plan upfront; retrofitting is expensive.

#8: Tenant deletion is a requirement, not a nice-to-have. GDPR grants the right to delete. Architect for hard delete + audit trail from day one.

Pitfalls

Pitfall 1: Hardcode tenant in URL

Wrong: acme.app.com/users → tenant taken from the subdomain only. Right: tenant from the JWT claim; subdomain as a secondary check.

Pitfall 2: Forget tenant_id trong index

Wrong: CREATE INDEX ON users (email) → cross-tenant scans are slow. Right: CREATE INDEX ON users (tenant_id, email).

Pitfall 3: No RLS in pool model

Wrong: “the code always filters by tenant_id” → one SQL bug = data breach. Right: defense in depth: app-level filter + RLS.

Pitfall 4: Background jobs without context

Wrong: a cron job processes all tenants and forgets to set context → queries miss the filter. Right: explicitly set tenant context per job, with audit logging.

Pitfall 5: Cache without tenant_id

Wrong: cache.set("user:123", data) → tenant A’s user 123 collides with tenant B’s user 123. Right: cache.set(f"user:{tenant_id}:123", data).

Pitfall 6: No fair scheduling

Wrong: FIFO job queue → one tenant submits 100K jobs → others wait hours. Right: round-robin or weighted fair scheduling.

Pitfall 7: Single connection pool

Wrong: one PgBouncer pool for all tenants → one tenant exhausts the pool → others fail. Right: tenant-aware pool quotas or separate pools per tier.

Pitfall 8: No tenant-level monitoring

Wrong: aggregate metrics only — you can’t see that “tenant X has a 50% error rate”. Right: tenant_id label on all metrics; per-tenant dashboards.

Pitfall 9: Same KMS key for all tenants

Wrong: one master key encrypts all tenants → a key compromise = total breach. Right: per-tenant KMS keys (a cost vs security trade-off).

Pitfall 10: No isolation testing

Wrong: manual review only → a bug slips into production → the cross-tenant leak is found by a customer. Right: automated tests for every endpoint; tenant_isolation_test.py runs on every PR.


| Topic | Relation |
|---|---|
| Tuan-07-Database-Sharding-Replication | Sharding by tenant_id; pool model uses a tenant_id column |
| Tuan-09-Rate-Limiter | Per-tenant rate limiting |
| Tuan-11-Microservices-Pattern | Cell-based architecture; namespace per tenant |
| Tuan-14-AuthN-AuthZ-Security | Tenant context from JWT; RBAC scoped to tenant |
| Tuan-15-Data-Security-Encryption | BYOK per tenant |
| Tuan-Bonus-Multi-Region-Active-Active-DSQL | Tenant-region affinity |
| Case-Design-Payment-System | Per-merchant tenant model |


Next: Tuan-Bonus-MCP-Architecture — Model Context Protocol for LLM tools.