Bonus Week: Multi-Tenancy SaaS Patterns

“Day 1: single tenant, everything is simple. Day 100: 1,000 tenants, and one enterprise ‘noisy neighbor’ strangles the other 999. Day 365: a compliance audit demands ‘prove tenant A’s data cannot leak to tenant B’, and you cannot answer. Multi-tenancy patterns are the life-or-death architecture decision for a SaaS.”

Tags: system-design multi-tenancy saas isolation noisy-neighbor bonus Student: Hieu (Backend Dev → Architect) Prerequisite: Tuan-07-Database-Sharding-Replication · Tuan-09-Rate-Limiter · Tuan-11-Microservices-Pattern Related: Tuan-Bonus-Multi-Region-Active-Active-DSQL · Case-Design-Payment-System · Tuan-14-AuthN-AuthZ-Security


1. Context & Why

Everyday analogy — housing models

Hieu, imagine you manage three housing models:

Silo (detached house):

  • Each family gets its own house, own land, own fence
  • Absolute privacy, 100% isolation
  • Expensive: one house per family
  • If house A burns down → only A loses; the fire doesn't spread to B

Pool (apartment building):

  • One building, many apartments
  • Resources are shared (elevators, hallways, plumbing)
  • Cheap: costs are split
  • One apartment catches fire → it can spread; one noisy resident → the whole floor hears it

Hybrid (tiered complex):

  • Tier 1 (Free): crowded apartment block
  • Tier 2 (Pro): upscale apartments with fewer residents
  • Tier 3 (Enterprise): private villa
  • Flexible: pay more = get more isolation

These map directly to the three multi-tenancy models in SaaS:

| Model | Isolation | Cost/tenant | Use case |
|---|---|---|---|
| Silo | Highest | Highest | Enterprise, regulated industries |
| Pool | Low | Lowest | Free tier, startup, B2C |
| Hybrid (Bridge) | Depends on tier | Depends on tier | SaaS with multiple plans |

Why does a backend dev need to understand multi-tenancy?

| Reason | Consequence if done wrong |
|---|---|
| Most products are SaaS | Multi-tenancy is the default problem, not the exception |
| Compliance (SOC 2, HIPAA, PCI) | “Prove isolation” — without the right architecture = audit fail |
| Noisy neighbor | One big tenant effectively DDoSes the whole platform |
| Cost economics | Pool for the free tier (cheap), Silo for enterprise (premium) |
| Data residency | EU tenants must store data in the EU; per-tenant region selection |
| BYOK (Bring Your Own Key) | Enterprise requirement; needs tenant-level encryption |

Why doesn't Alex Xu go deep here?

Alex Xu Vol 1+2 covers sharding (tenant-by-shard) but not the full multi-tenancy spectrum: the pool model, RLS, K8s namespace-per-tenant, blast-radius isolation. Multi-tenancy is a cross-cutting concern that shows up at every layer (DB, app, network, observability).



2. Deep Dive — Khái niệm cốt lõi

2.1 Tenancy Models — The Three Main Ones

2.1.1 Silo Model (1 tenant = 1 stack)

Tenant A:                    Tenant B:                   Tenant C:
┌──────────┐                ┌──────────┐                ┌──────────┐
│  App A   │                │  App B   │                │  App C   │
└────┬─────┘                └────┬─────┘                └────┬─────┘
┌────▼─────┐                ┌────▼─────┐                ┌────▼─────┐
│   DB A   │                │   DB B   │                │   DB C   │
└──────────┘                └──────────┘                └──────────┘

Pros:

  • Maximum isolation (security, performance, compliance)
  • Per-tenant customization (different versions, features)
  • Easy to migrate / archive 1 tenant

Cons:

  • High operational overhead (1000 tenants = 1000 DB instances)
  • Expensive
  • Hard to push fix to all tenants quickly
  • Complex deployment automation

Use case: Enterprise SaaS (Salesforce-scale customers), regulated industries (HIPAA/PCI).

2.1.2 Pool Model (all tenants share)

                ┌────────────────┐
                │   Shared App   │
                │ (multi-tenant) │
                └───────┬────────┘
                        │
                ┌───────▼────────┐
                │   Shared DB    │
                │  (tenant_id    │
                │   column +     │
                │   RLS)         │
                └────────────────┘

Pros:

  • Lowest cost (high tenant density)
  • Easy upgrades (1 deploy → all tenants)
  • Operational simplicity

Cons:

  • Noisy neighbor risk
  • Higher blast radius (1 bad query → all tenants slow)
  • Compliance harder to prove
  • An account takeover can leak data across tenants if RLS has a bug

Use case: B2C, freemium SaaS, low-value tenants.

2.1.3 Hybrid / Bridge / Pod Model

Concept: Combine — pool by default, silo for enterprise.

Tier 1 (Free):     [Shared Pool — 10K tenants]
Tier 2 (Pro):      [Pod 1: 100 tenants] [Pod 2: 100 tenants] ...
Tier 3 (Enterprise): [Silo 1] [Silo 2] [Silo 3]

Implementation patterns:

  • Sharded multi-tenant (Shopify pattern): pods of N tenants
  • Tier-based: Different deployments per tier
  • Cell-based: Independent cells (limit blast radius)
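
To make the tier-based idea concrete, here is a minimal routing sketch in Python; the TENANT_TIERS registry and the DSN strings are illustrative assumptions, not a real topology:

import hashlib

# Hypothetical tier registry and DSNs (illustrative placeholders only)
TENANT_TIERS = {"acme": "enterprise", "widget": "pro", "hobby42": "free"}
POOL_DSN = "postgres://pool-db/app"
POD_DSNS = ["postgres://pod-1/app", "postgres://pod-2/app"]

def dsn_for_tenant(tenant_id: str) -> str:
    tier = TENANT_TIERS.get(tenant_id, "free")
    if tier == "enterprise":
        # Silo: dedicated database per tenant
        return f"postgres://tenant-{tenant_id}-db/app"
    if tier == "pro":
        # Pod: a stable hash pins the tenant to one pod
        # (hashlib, not hash(), so the mapping survives process restarts)
        digest = hashlib.sha256(tenant_id.encode()).hexdigest()
        return POD_DSNS[int(digest, 16) % len(POD_DSNS)]
    # Free: shared pool, isolated via tenant_id column + RLS
    return POOL_DSN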

Pros:

  • Best of both worlds
  • Right-cost per tenant
  • Flexibility to upgrade tenant tier

Cons:

  • Operational complexity (multiple deployment shapes)
  • Cross-tier feature parity tricky

2.2 Database Tenancy Patterns

2.2.1 Database per tenant (Silo)

-- Each tenant has own database
CREATE DATABASE tenant_acme;
CREATE DATABASE tenant_widget_co;
 
-- App routes to the right database via the connection string
-- (e.g. postgres://host/tenant_acme) based on the tenant in URL/header.
-- Note: search_path selects schemas, not databases, so it does not apply here.

Pros: Maximum isolation, easy backup/restore per tenant.
Cons: 1000 tenants = 1000 DBs → connection pool nightmare.

2.2.2 Schema per tenant (PostgreSQL)

-- Single database, schema per tenant
CREATE SCHEMA tenant_acme;
CREATE SCHEMA tenant_widget;
 
CREATE TABLE tenant_acme.users (...);
CREATE TABLE tenant_widget.users (...);
 
-- App sets search_path
SET search_path = tenant_acme;
SELECT * FROM users;  -- Resolves to tenant_acme.users

Pros: Easier than DB-per-tenant, still good isolation.
Cons: Migration complexity (apply to all schemas), connection pooling tricky.
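
The migration pain is usually scripted away: loop over all tenant schemas and apply the same DDL to each. A minimal asyncpg sketch (error handling and per-schema tracking elided):

import asyncpg

async def migrate_all_schemas(dsn: str, ddl: str):
    """Apply one migration statement to every tenant schema."""
    conn = await asyncpg.connect(dsn)
    try:
        schemas = await conn.fetch(
            "SELECT schema_name FROM information_schema.schemata "
            "WHERE schema_name LIKE 'tenant_%'"
        )
        for row in schemas:
            # search_path cannot be a bind parameter; the value comes
            # from the system catalog, not from user input
            await conn.execute(f'SET search_path = "{row["schema_name"]}"')
            await conn.execute(ddl)
    finally:
        await conn.close()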

2.2.3 Shared schema with tenant_id (Pool)

-- Single schema, tenant_id column on every table
CREATE TABLE users (
    id UUID PRIMARY KEY,
    tenant_id UUID NOT NULL,
    email TEXT,
    ...
);
 
-- Index includes tenant_id
CREATE INDEX idx_users_tenant ON users (tenant_id, email);
 
-- Every query MUST filter by tenant_id
SELECT * FROM users WHERE tenant_id = $1 AND email = $2;

Risk: Forget WHERE tenant_id → leak across tenants. Use Row-Level Security (RLS).
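
Before RLS enters the picture, an application-layer guard can refuse any query that lacks a tenant filter. A crude sketch (real codebases enforce this in the ORM or repository layer; the substring check is a simplification):

class TenantScopedRepo:
    """Wraps a DB connection; refuses queries without a tenant filter."""

    def __init__(self, conn, tenant_id: str):
        self.conn = conn
        self.tenant_id = tenant_id

    async def fetch(self, query: str, *args):
        if "tenant_id" not in query.lower():
            raise ValueError(f"Query missing tenant filter: {query}")
        # Convention: $1 is always the tenant_id
        return await self.conn.fetch(query, self.tenant_id, *args)

# Usage:
#   repo = TenantScopedRepo(conn, tenant_id)
#   await repo.fetch("SELECT * FROM users WHERE tenant_id = $1 AND email = $2", email)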

2.2.4 PostgreSQL Row-Level Security (RLS)

-- Enable RLS
ALTER TABLE users ENABLE ROW LEVEL SECURITY;
 
-- Policy: users see only their tenant
CREATE POLICY tenant_isolation ON users
    USING (tenant_id = current_setting('app.tenant_id')::uuid);
 
-- App sets tenant context per session
SET app.tenant_id = 'acme-uuid-here';
SELECT * FROM users;  -- Auto-filtered to acme tenant

Magic: Even if app forgets WHERE clause, RLS prevents cross-tenant leak.

Performance: RLS adds query plan overhead (~5-10%). Add tenant_id to indexes.

Tools:

  • pgvector: RLS policies apply to vector tables like any other table
  • Supabase: built on Postgres RLS
  • PostGraphile: generates a GraphQL API that leverages RLS

2.2.5 Modern Alternative — Neon Branching

Neon (serverless Postgres) supports per-tenant branching:

  • Each tenant = separate Postgres branch
  • Copy-on-write storage → cheap
  • Independent compute (autosuspend)
  • Sub-second branch creation

Use case: Per-tenant dev/staging environments, “free tier with own DB”.

2.3 Network-Level Isolation

2.3.1 Kubernetes Namespace-per-Tenant

# Tenant gets own namespace
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-acme
 
---
# NetworkPolicy: deny cross-tenant traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cross-tenant
  namespace: tenant-acme
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              name: tenant-acme
        - namespaceSelector:
            matchLabels:
              name: shared-platform
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: tenant-acme
        - namespaceSelector:
            matchLabels:
              name: shared-platform

2.3.2 K8s vCluster (per-tenant virtual cluster)

vCluster (Loft Labs): Each tenant gets virtual K8s cluster inside shared physical cluster.

# Create vCluster for tenant
vcluster create acme-cluster --namespace tenant-acme
 
# Tenant has own k8s API, can deploy own resources
# Physical cluster admin doesn't see tenant's resources

Pros: Strong K8s-level isolation; the tenant has admin rights inside its own vCluster.
Cons: Resource overhead, complexity.

2.3.3 Capsule

Capsule: K8s native multi-tenancy operator.

  • Tenants = K8s CRD
  • Quota enforcement
  • Network policies per tenant
  • RBAC per tenant

2.4 Tenant Context Propagation

Critical: Every operation must know “which tenant”.

# Middleware extracts tenant from JWT/header
async def tenant_middleware(request, call_next):
    # 1. Extract from JWT
    token = request.headers.get("Authorization", "").replace("Bearer ", "")
    claims = jwt.decode(token, key, algorithms=["RS256"])
    tenant_id = claims["tenant_id"]
 
    # 2. Set context (works with asyncio)
    tenant_context.set(tenant_id)
 
    # 3. Set DB session variable for RLS. set_config() accepts a bind
    #    parameter; interpolating tenant_id into an f-string risks SQL injection.
    async with get_db_pool().acquire() as conn:
        await conn.fetchval(
            "SELECT set_config('app.tenant_id', $1, false)", tenant_id
        )
 
    # 4. Pass to log/metrics
    log.contextualize(tenant_id=tenant_id)
 
    response = await call_next(request)
    response.headers["X-Tenant-ID"] = tenant_id
    return response

Common bugs:

  • Background jobs without tenant context (cross-tenant leak)
  • Logger forgets tenant_id (unable to debug per-tenant)
  • Cache key doesn’t include tenant_id (cache leak)
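
The cache bug in particular is cheap to prevent: wrap the cache client so every key is namespaced by tenant. A minimal sketch, assuming the tenant_context ContextVar from the middleware above and a redis asyncio client:

class TenantCache:
    """Namespaces every cache key by the current tenant."""

    def __init__(self, redis):
        self.redis = redis

    def _key(self, key: str) -> str:
        # Raises LookupError if no tenant context is set; failing
        # loudly beats silently sharing keys across tenants
        return f"{tenant_context.get()}:{key}"

    async def get(self, key: str):
        return await self.redis.get(self._key(key))

    async def set(self, key: str, value, ttl: int = 300):
        await self.redis.set(self._key(key), value, ex=ttl)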

2.5 Noisy Neighbor Mitigation

Problem: 1 tenant runs heavy report query, blocks others.

2.5.1 Per-tenant rate limiting

# Fixed-window counter per tenant (a simpler stand-in for the token bucket from Tuan-09-Rate-Limiter)
class TenantRateLimiter:
    def __init__(self, redis):
        self.redis = redis
 
    async def check(self, tenant_id: str, action: str) -> bool:
        key = f"ratelimit:{tenant_id}:{action}"
        count = await self.redis.incr(key)
        if count == 1:
            await self.redis.expire(key, 60)
        return count <= self._limit_for(tenant_id, action)
 
    def _limit_for(self, tenant_id, action):
        tier = self._get_tier(tenant_id)  # tier lookup from tenant config (not shown)
        limits = {
            "free": {"api": 100, "report": 5},
            "pro": {"api": 1000, "report": 50},
            "enterprise": {"api": 10000, "report": 500},
        }
        return limits[tier][action]

2.5.2 Resource quotas (K8s)

apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-acme-quota
  namespace: tenant-acme
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
    services: "20"

2.5.3 Database connection pooling per tenant

# PgBouncer per tenant for predictable resource usage
pgbouncer_config = {
    "tenant_acme": {
        "max_client_conn": 100,
        "default_pool_size": 20,
    },
    "tenant_widget": {
        "max_client_conn": 50,
        "default_pool_size": 10,
    },
}

2.5.4 Query timeouts per tenant

-- PostgreSQL: SET statement_timeout per session
SET app.tenant_id = 'tenant-acme';
SET statement_timeout = '5s';  -- Free tier
-- Enterprise tier:
SET statement_timeout = '60s';
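
Both the RLS context and the per-tier timeout can be applied together when a connection is checked out. A sketch using set_config(), which, unlike SET, accepts bind parameters (see section 6.1); TIER_TIMEOUTS is an illustrative mapping:

TIER_TIMEOUTS = {"free": "5s", "pro": "30s", "enterprise": "60s"}  # illustrative

async def apply_tenant_limits(conn, tenant_id: str, tier: str):
    """Set RLS context and per-tier timeout on a checked-out asyncpg connection.
    Call inside an open transaction: is_local=true scopes both values to it."""
    await conn.fetchval(
        "SELECT set_config('app.tenant_id', $1, true)", tenant_id
    )
    await conn.fetchval(
        "SELECT set_config('statement_timeout', $1, true)",
        TIER_TIMEOUTS.get(tier, "5s"),
    )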

2.5.5 Fair scheduling

Async job processing with fair queueing:

Naive: FIFO queue. 1 tenant submits 10K jobs → blocks others.

Fair: Round-robin across tenants:
  Queue tenant-A: [j1, j2, j3, j4]
  Queue tenant-B: [j5]
  Queue tenant-C: [j6, j7]

  Process: j1, j5, j6, j2, j7, j3, j4
  → No starvation

Tools: Sidekiq Enterprise (Ruby), AWS SQS FIFO with group ID.

2.6 Tenant Onboarding & Provisioning

Automated provisioning pipeline:

async def onboard_tenant(name: str, tier: str):
    """Provision new tenant idempotently."""
    tenant_id = generate_uuid()
 
    # 1. Database
    if tier == "enterprise":
        # Silo: dedicated DB
        await create_database(f"tenant_{tenant_id}")
        await run_migrations(tenant_id)
    else:
        # Pool: insert tenant record + RLS context
        await db.execute(
            "INSERT INTO tenants (id, name, tier) VALUES ($1, $2, $3)",
            tenant_id, name, tier
        )
 
    # 2. K8s namespace (if enterprise)
    if tier == "enterprise":
        await k8s.create_namespace(f"tenant-{tenant_id}")
        await k8s.apply_quota(tenant_id, tier_to_quota(tier))
 
    # 3. Storage bucket
    await s3.create_bucket(f"tenant-{tenant_id}-uploads")
 
    # 4. Encryption key (BYOK pattern)
    if tier == "enterprise":
        kms_key = await kms.create_key(
            description=f"Tenant {name}",
            tags={"tenant_id": tenant_id}
        )
        await store_tenant_kms_key(tenant_id, kms_key)
 
    # 5. Default admin user
    admin_user = await create_user(
        email=name + "@admin.com",
        tenant_id=tenant_id,
        role="admin"
    )
 
    # 6. Welcome email + setup checklist
    await send_welcome(admin_user.email, tenant_id)
 
    return tenant_id

Idempotency: a failure mid-way must be safe to retry. Each step first checks whether its resource already exists (see the sketch below).
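
One way to get that retry safety: record completed steps per tenant and skip them on retry. A minimal sketch; the provisioning_steps table is an assumed bookkeeping table:

async def run_step(db, tenant_id: str, step: str, fn):
    """Run a provisioning step at most once per tenant."""
    done = await db.fetchval(
        "SELECT 1 FROM provisioning_steps WHERE tenant_id = $1 AND step = $2",
        tenant_id, step,
    )
    if done:
        return  # already completed on a previous attempt
    await fn()
    await db.execute(
        "INSERT INTO provisioning_steps (tenant_id, step) VALUES ($1, $2)",
        tenant_id, step,
    )

# Usage inside onboard_tenant:
#   await run_step(db, tenant_id, "s3_bucket",
#                  lambda: s3.create_bucket(f"tenant-{tenant_id}-uploads"))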

2.7 Tenant Deletion (GDPR / offboarding)

async def delete_tenant(tenant_id: str):
    """GDPR-compliant tenant deletion."""
 
    # 1. Soft delete: mark for deletion (30-day grace period)
    await db.execute(
        "UPDATE tenants SET status = 'deleting', delete_after = NOW() + INTERVAL '30 days' WHERE id = $1",
        tenant_id
    )
 
    # 2. Disable access immediately
    await invalidate_all_tokens(tenant_id)
    await k8s.scale_namespace(f"tenant-{tenant_id}", replicas=0)
 
    # 3. After 30 days (cron job)
    if past_grace_period(tenant_id):
        # Delete data
        if is_silo_tenant(tenant_id):
            await drop_database(f"tenant_{tenant_id}")
        else:
            await db.execute(
                "DELETE FROM users WHERE tenant_id = $1", tenant_id
            )
            # ... all tenant tables
 
        # Delete S3
        await s3.delete_bucket(f"tenant-{tenant_id}-uploads")
 
        # Schedule KMS key deletion (7-30 days mandatory waiting)
        await kms.schedule_key_deletion(get_tenant_kms_key(tenant_id), 30)
 
        # Delete K8s namespace
        await k8s.delete_namespace(f"tenant-{tenant_id}")
 
        # Audit log
        await audit_log.record(
            event="tenant_deleted",
            tenant_id=tenant_id,
            timestamp=now()
        )

2.8 Cell-based Multi-Tenancy

Pattern: Group N tenants into “cells”, each cell isolated.

Cell 1: Tenants A-F (max 100 tenants)
Cell 2: Tenants G-M
Cell 3: Tenants N-T
Cell 4: Tenants U-Z

Each cell:
  - Own DB instance
  - Own app deployment
  - Independent failure domain

Pros:

  • Blast radius = 1 cell (10-100 tenants)
  • Easier capacity planning
  • A/B test deploy at cell level

Cons:

  • Need cell router (which cell does tenant X belong to?)
  • Operational overhead (N cells to manage)

Reference: Tuan-11-Microservices-Pattern, section 2.15 Cell-based Architecture.
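
The "which cell?" question from the cons list is usually answered by a small routing service. A minimal sketch, assuming a persistent dict-like placement_store (real cell routers persist placement explicitly rather than hashing, so a tenant can later be migrated between cells):

class CellRouter:
    """Maps tenant_id → cell_id. Placement is stored, not derived."""

    def __init__(self, placement_store, cells: list[str]):
        self.store = placement_store  # assumed dict-like persistent store
        self.cells = cells

    def cell_for(self, tenant_id: str) -> str:
        cell = self.store.get(tenant_id)
        if cell is None:
            # Assign new tenants to the least-populated cell
            cell = min(self.cells, key=self._tenant_count)
            self.store[tenant_id] = cell
        return cell

    def _tenant_count(self, cell: str) -> int:
        return sum(1 for c in self.store.values() if c == cell)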


3. Estimation

3.1 Density per pool

PostgreSQL pool model with proper indexes:

  • 1 PG instance handles ~10K-100K small tenants
  • Bottleneck: connection pooling (PgBouncer can multiplex thousands of clients onto ~100 server connections)

Per-tenant overhead (pool, RLS):

  • ~5-10% query overhead from RLS policy evaluation
  • Minimal storage overhead (just tenant_id column)

3.2 Cost per tenant

| Model | Cost per tenant/month |
|---|---|
| Pool (shared infra) | ~$1 |
| Pod (10-100 tenants per pod) | ~$50 |
| Silo (dedicated infra) | $500+ |

ROI calculation:

  • Free tier (low ARPU $0): MUST be pool
  • Pro tier ($50/month ARPU): Pod or pool
  • Enterprise ($1000+/month ARPU): Silo OK
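
Plugging the cost table from 3.2 into a quick margin check makes the rule obvious; a small sketch with those numbers:

# ARPU vs infra cost per tenant/month (numbers from the tables above)
tiers = {
    "free":       {"arpu": 0},
    "pro":        {"arpu": 50},
    "enterprise": {"arpu": 1000},
}
model_cost = {"pool": 1, "pod": 50, "silo": 500}

for tier, t in tiers.items():
    for model, cost in model_cost.items():
        margin = t["arpu"] - cost
        print(f"{tier:11s} on {model:4s}: margin ${margin}/month")
# → the free tier only survives on pool; enterprise ARPU absorbs silo cost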

3.3 Onboarding time

| Operation | Pool | Silo |
|---|---|---|
| Tenant creation | < 1s (insert row) | 5-30 min (provision DB, namespace, etc.) |
| Tenant deletion | Minutes (DELETE rows) | Hours (drop DB, cleanup) |

3.4 Noisy neighbor blast radius

| Model | Blast radius |
|---|---|
| Pure Pool | 100% (all tenants) |
| Pod | 1-10% (single pod) |
| Silo | 0.1% (single tenant) |
| Cell-based pod | 1% (single cell) |

4. Security First

4.1 The “Cross-tenant leak” nightmare

Worst case: Bug allows tenant A user to read tenant B data.

Defense in depth:

  1. App layer: Always filter by tenant_id (review in code review)
  2. DB layer: RLS policies (defense even if app forgets)
  3. Network: Per-tenant namespaces (vCluster, Capsule)
  4. Audit: Log every cross-tenant access attempt
  5. Testing: Automated tenant isolation tests

4.2 Tenant Isolation Testing

# Pytest: ensure tenant isolation
@pytest.mark.parametrize("attacker_tenant,victim_tenant", [
    ("tenant-a", "tenant-b"),
])
def test_cannot_read_other_tenant(attacker_tenant, victim_tenant):
    # Create data in victim tenant
    with set_tenant_context(victim_tenant):
        victim_user = create_user(email="victim@example.com")
 
    # Try to read as attacker
    with set_tenant_context(attacker_tenant):
        result = get_user_by_email("victim@example.com")
        assert result is None, "Cross-tenant leak detected!"
 
    # Try direct ID access
    with set_tenant_context(attacker_tenant):
        result = get_user_by_id(victim_user.id)
        assert result is None, "Cross-tenant leak via direct ID!"
 
# Run for ALL endpoints, not just selected
def test_all_endpoints_isolated():
    for endpoint in get_all_api_endpoints():
        for method in endpoint.methods:
            # Attempt cross-tenant access
            assert is_endpoint_isolated(endpoint, method)

4.3 BYOK (Bring Your Own Key)

Enterprise tenants want to control their encryption keys.

# Per-tenant KMS key (kms: assumed thin wrapper around a KMS client, e.g. AWS KMS)
def encrypt_for_tenant(tenant_id: str, plaintext: bytes) -> bytes:
    key_arn = get_tenant_kms_key(tenant_id)  # From tenant config
    return kms.encrypt(key_id=key_arn, plaintext=plaintext)
 
def decrypt_for_tenant(tenant_id: str, ciphertext: bytes) -> bytes:
    return kms.decrypt(ciphertext=ciphertext)
    # KMS auto-detects key from ciphertext metadata

Implications:

  • Tenant can revoke key → permanent data loss for that tenant (intentional)
  • Per-tenant KMS = per-tenant cost
  • Compliance benefit: tenant proves control of their data

4.4 Audit Logging Per Tenant

Every action must be auditable per tenant:

CREATE TABLE audit_log (
    id BIGSERIAL,
    timestamp TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    tenant_id UUID NOT NULL,
    user_id UUID,
    action TEXT NOT NULL,
    resource_type TEXT,
    resource_id TEXT,
    ip_address INET,
    user_agent TEXT,
    metadata JSONB,
    -- Append-only: revoke UPDATE/DELETE
    -- PK must include the partition key on a partitioned table
    PRIMARY KEY (id, tenant_id)
) PARTITION BY LIST (tenant_id);
 
-- Partition by tenant for efficient retrieval
CREATE TABLE audit_log_tenant_acme PARTITION OF audit_log
  FOR VALUES IN ('acme-tenant-uuid');

Forward to immutable storage (S3 Object Lock, blockchain) for compliance.

4.5 Compliance Frameworks

| Framework | Multi-tenancy requirements |
|---|---|
| SOC 2 Type II | "Logical isolation" (RLS sufficient if properly tested) |
| HIPAA | Stronger isolation; many require silo for PHI |
| PCI DSS | Card data must be tokenized; cardholder environment separated |
| GDPR | Tenant consent, data portability, right to delete |
| ISO 27001 | Risk assessment per tenant, controls documented |
| FedRAMP | High baseline often requires a gov-cloud silo |

5. DevOps — Operating a Multi-Tenant SaaS

5.1 Provisioning Pipeline

# Argo Workflow: tenant provisioning
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: provision-tenant-
spec:
  entrypoint: provision
  arguments:
    parameters:
      - name: tenant-name
      - name: tier
 
  templates:
    - name: provision
      steps:
        - - name: create-namespace
            template: k8s-namespace
        - - name: create-database
            template: db-setup
        - - name: apply-quota
            template: resource-quota
        - - name: setup-monitoring
            template: monitoring-config
        - - name: send-welcome
            template: notify
 
    - name: k8s-namespace
      script:
        image: kubectl:latest
        command: [sh]
        source: |
          kubectl create namespace tenant-{{workflow.parameters.tenant-name}}
          kubectl label namespace tenant-{{workflow.parameters.tenant-name}} \
            tenant={{workflow.parameters.tenant-name}} \
            tier={{workflow.parameters.tier}}

5.2 Per-tenant monitoring

Critical: Dashboards filterable by tenant.

# Prometheus queries per tenant
rate(http_requests_total{tenant_id="acme"}[5m])
 
histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket{tenant_id="acme"}[5m])
)
 
# Per-tenant resource usage
sum(rate(container_cpu_usage_seconds_total{namespace=~"tenant-.*"}[5m]))
  by (namespace)

Grafana variable: $tenant dropdown → filter all panels.
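
For those queries to exist, the tenant label must be attached at instrumentation time. A sketch with prometheus_client (watch label cardinality: with tens of thousands of tenants, prefer per-tier labels or exemplars):

from prometheus_client import Counter, Histogram

HTTP_REQUESTS = Counter(
    "http_requests_total", "HTTP requests", ["tenant_id", "status"]
)
HTTP_LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency", ["tenant_id"]
)

def record_request(tenant_id: str, status: int, seconds: float):
    # Every sample carries the tenant label, enabling per-tenant queries
    HTTP_REQUESTS.labels(tenant_id=tenant_id, status=str(status)).inc()
    HTTP_LATENCY.labels(tenant_id=tenant_id).observe(seconds)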

5.3 Per-tenant alerting

- alert: TenantQuotaApproached
  expr: |
    (
      sum(container_memory_usage_bytes) by (namespace)
      / on(namespace)
      kube_resourcequota{type="hard", resource="requests.memory"}
    ) > 0.8
  for: 5m
  labels:
    severity: warning
    tenant: "{{ $labels.namespace }}"
  annotations:
    summary: "Tenant {{ $labels.namespace }} at 80% memory quota"
 
- alert: TenantSLOViolation
  expr: |
    (
      sum by (tenant_id) (rate(http_requests_total{status=~"5..", tenant_id=~".+"}[5m]))
      /
      sum by (tenant_id) (rate(http_requests_total{tenant_id=~".+"}[5m]))
    ) > 0.01
  for: 10m
  labels:
    severity: critical
    tenant: "{{ $labels.tenant_id }}"
  annotations:
    summary: "SLO violation for tenant {{ $labels.tenant_id }}: > 1% error rate"

5.4 Tenant cost allocation

FinOps integration: Track infrastructure cost per tenant.

# Daily cost allocation job
def allocate_costs():
    total_cost = aws.get_billing_total()  # $X
 
    # Allocation strategies:
    # 1. By usage (compute time, storage)
    tenant_usage = {}
    for tenant_id in get_all_tenants():
        tenant_usage[tenant_id] = {
            "compute": prometheus.get_metric(
                f'sum(container_cpu_usage_seconds_total{{tenant_id="{tenant_id}"}})'
            ),
            "storage": s3.get_bucket_size(f"tenant-{tenant_id}-*"),
            "network": cloudwatch.get_metric("NetworkOut", tenant_id),
        }
 
    # 2. Or weight by tier (proportional); consumed by calculate_tenant_cost (not shown)
    tier_weights = {"free": 0.1, "pro": 1.0, "enterprise": 10.0}
 
    for tenant_id in get_all_tenants():
        cost = calculate_tenant_cost(tenant_id, total_cost, tenant_usage)
        save_cost_record(tenant_id, cost, date=today())

5.5 Disaster scenarios

| Scenario | Mitigation |
|---|---|
| Cross-tenant data leak detected | Halt service, security incident response, forensics, customer notification |
| One tenant overuses → impacts others | Quotas + circuit breaker; auto-isolate (see the sketch below) |
| Migration fails for some tenants | Per-tenant migration tracking, rollback per tenant |
| Tenant requests forensic copy | Pre-built tooling for per-tenant backup export |
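
The auto-isolate mitigation can be a per-tenant circuit breaker: when one tenant's error rate spikes, shed only that tenant's traffic. A minimal in-process sketch (thresholds are illustrative):

import time
from collections import defaultdict

class TenantCircuitBreaker:
    """Trips per tenant, so one misbehaving tenant is shed
    without touching the others."""

    def __init__(self, max_errors: int = 50, window_s: int = 60):
        self.max_errors = max_errors
        self.window_s = window_s
        self.errors: dict[str, list[float]] = defaultdict(list)

    def record_error(self, tenant_id: str):
        now = time.monotonic()
        self.errors[tenant_id].append(now)
        # Keep only errors inside the sliding window
        self.errors[tenant_id] = [
            t for t in self.errors[tenant_id] if now - t < self.window_s
        ]

    def allow(self, tenant_id: str) -> bool:
        now = time.monotonic()
        recent = [t for t in self.errors[tenant_id] if now - t < self.window_s]
        return len(recent) < self.max_errors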

6. Code Implementation

6.1 PostgreSQL RLS-based pool tenancy

-- Migration: enable multi-tenancy
ALTER TABLE users ADD COLUMN tenant_id UUID NOT NULL;
CREATE INDEX idx_users_tenant_email ON users (tenant_id, email);
 
ALTER TABLE users ENABLE ROW LEVEL SECURITY;
 
CREATE POLICY tenant_isolation ON users
    FOR ALL
    USING (tenant_id = current_setting('app.tenant_id')::uuid);
 
-- Repeat for all tenant tables
 
# FastAPI app with tenant context
from contextvars import ContextVar
from fastapi import FastAPI, Request, Depends
from fastapi.responses import JSONResponse
import jwt  # PyJWT
import asyncpg
 
JWT_SECRET = "change-me"  # demo secret for HS256; use real key management
 
app = FastAPI()
tenant_ctx: ContextVar[str] = ContextVar("tenant_id")
db_pool: asyncpg.Pool = None
 
 
@app.middleware("http")
async def tenant_middleware(request: Request, call_next):
    auth = request.headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        # Raising HTTPException inside middleware bypasses FastAPI's
        # exception handlers, so return a response directly
        return JSONResponse({"detail": "Missing token"}, status_code=401)
 
    try:
        claims = jwt.decode(
            auth[7:], JWT_SECRET, algorithms=["HS256"]
        )
        tenant_id = claims["tenant_id"]
    except jwt.InvalidTokenError:  # PyJWT's base error class
        return JSONResponse({"detail": "Invalid token"}, status_code=401)
 
    tenant_ctx.set(tenant_id)
    response = await call_next(request)
    response.headers["X-Tenant-ID"] = tenant_id
    return response
 
 
async def get_db():
    """DB connection with tenant context set."""
    async with db_pool.acquire() as conn:
        tenant_id = tenant_ctx.get()
        # SET cannot take bind parameters, so use set_config() instead.
        # is_local=true scopes the setting to this transaction, so the
        # tenant id cannot leak when the connection returns to the pool.
        async with conn.transaction():
            await conn.fetchval(
                "SELECT set_config('app.tenant_id', $1, true)",
                tenant_id,
            )
            yield conn
 
 
@app.get("/users")
async def list_users(conn=Depends(get_db)):
    # No need to filter by tenant_id — RLS does it
    rows = await conn.fetch("SELECT id, email FROM users LIMIT 100")
    return [dict(r) for r in rows]

6.2 Tenant context propagation across services

"""
Propagate tenant context across HTTP, gRPC, async jobs.
"""
 
import httpx
from contextvars import ContextVar
 
tenant_ctx: ContextVar[str] = ContextVar("tenant_id")
 
 
class TenantAwareHTTPClient(httpx.AsyncClient):
    """HTTP client that auto-propagates tenant context."""
 
    async def request(self, method, url, **kwargs):
        headers = kwargs.pop("headers", {})
        try:
            tenant_id = tenant_ctx.get()
            headers["X-Tenant-ID"] = tenant_id
        except LookupError:
            pass  # No tenant context (e.g., system call)
        return await super().request(method, url, headers=headers, **kwargs)
 
 
# Usage
client = TenantAwareHTTPClient()
async with client:
    # Tenant ID auto-included in headers
    response = await client.get("https://internal-service/data")
 
 
# For async jobs (Celery, RQ, etc.): a mixin intended for celery.Task
class TenantAwareTask:
    def apply_async(self, args=None, **kwargs):
        try:
            tenant_id = tenant_ctx.get()
            kwargs["headers"] = kwargs.get("headers") or {}
            kwargs["headers"]["tenant_id"] = tenant_id
        except LookupError:
            pass
        return super().apply_async(args, **kwargs)
 
 
# Worker side: restore context. Note: before_task_publish fires on the
# *publisher*; task_prerun is the worker-side signal.
from celery.signals import task_prerun
 
@task_prerun.connect
def restore_tenant_context(task=None, **kwargs):
    # Depending on Celery protocol version, custom headers appear on
    # task.request.headers or directly on task.request
    headers = getattr(task.request, "headers", None) or {}
    tenant_id = headers.get("tenant_id") or getattr(task.request, "tenant_id", None)
    if tenant_id:
        tenant_ctx.set(tenant_id)

6.3 Fair queueing for async jobs

"""
Fair scheduler: prevents 1 tenant from monopolizing job queue.
"""
 
import asyncio
from collections import deque
 
 
class FairTenantQueue:
    """Round-robin across tenant queues."""
 
    def __init__(self):
        self.queues: dict[str, deque] = {}
        self.tenant_order: list[str] = []
        self.idx = 0
        self.lock = asyncio.Lock()
        self.event = asyncio.Event()
 
    async def enqueue(self, tenant_id: str, job):
        async with self.lock:
            if tenant_id not in self.queues:
                self.queues[tenant_id] = deque()
                self.tenant_order.append(tenant_id)
            self.queues[tenant_id].append(job)
            self.event.set()
 
    async def dequeue(self) -> tuple:  # blocks until a job is available
        while True:
            async with self.lock:
                # Round-robin
                tried = 0
                while tried < len(self.tenant_order):
                    tenant = self.tenant_order[self.idx]
                    self.idx = (self.idx + 1) % len(self.tenant_order)
 
                    queue = self.queues[tenant]
                    if queue:
                        job = queue.popleft()
                        return tenant, job
 
                    tried += 1
 
                # All empty, wait for new job
                self.event.clear()
 
            await self.event.wait()
 
 
# Worker
async def worker(queue: FairTenantQueue):
    while True:
        # dequeue() blocks until a job is available, so no None check is needed
        tenant_id, job = await queue.dequeue()
        tenant_ctx.set(tenant_id)  # restore tenant context for RLS/logging
        await process(job)  # process(): your job handler (not shown)

7. System Design Diagrams

7.1 Tenancy Models Spectrum

flowchart LR
    subgraph Pool["Pool Model"]
        SharedApp[Shared App<br/>multi-tenant]
        SharedDB[(Shared DB<br/>tenant_id + RLS)]
        SharedApp --> SharedDB
    end

    subgraph Pod["Pod Model"]
        PodApp1[Pod 1<br/>~100 tenants]
        PodApp2[Pod 2<br/>~100 tenants]
        PodDB1[(Pod 1 DB)]
        PodDB2[(Pod 2 DB)]
        PodApp1 --> PodDB1
        PodApp2 --> PodDB2
    end

    subgraph Silo["Silo Model"]
        SiloApp1[Tenant A App]
        SiloApp2[Tenant B App]
        SiloDB1[(Tenant A DB)]
        SiloDB2[(Tenant B DB)]
        SiloApp1 --> SiloDB1
        SiloApp2 --> SiloDB2
    end

    Pool -->|"upgrade to Pro"| Pod
    Pod -->|"upgrade to Enterprise"| Silo
    End[Running all three tiers together = Hybrid SaaS]

    style Pool fill:#fff9c4
    style Pod fill:#c8e6c9
    style Silo fill:#bbdefb

7.2 RLS Tenant Isolation

sequenceDiagram
    participant User
    participant App
    participant Auth as JWT Auth
    participant DB as PostgreSQL

    User->>App: GET /users (Bearer token)
    App->>Auth: Decode JWT
    Auth-->>App: tenant_id="acme"

    App->>DB: BEGIN<br/>SET app.tenant_id = 'acme'
    App->>DB: SELECT * FROM users<br/>(no WHERE clause!)

    DB->>DB: Apply RLS policy:<br/>tenant_id = current_setting('app.tenant_id')
    DB-->>App: Only acme users returned

    App->>DB: COMMIT
    App-->>User: { users: [...] }

    Note over DB: Even if app forgets WHERE,<br/>RLS prevents cross-tenant leak

7.3 Cell-Based Multi-Tenancy

flowchart TB
    Router[Cell Router<br/>tenant_id → cell_id]

    Router --> Cell1
    Router --> Cell2
    Router --> Cell3

    subgraph Cell1["Cell 1 (Tenants 1-100)"]
        App1[App Tier]
        DB1[(DB)]
        App1 --> DB1
    end

    subgraph Cell2["Cell 2 (Tenants 101-200)"]
        App2[App Tier]
        DB2[(DB)]
        App2 --> DB2
    end

    subgraph Cell3["Cell 3 (Tenants 201-300)"]
        App3[App Tier]
        DB3[(DB)]
        App3 --> DB3
    end

    User[User of tenant 50] --> Router

    Note["Bug in Cell 1 → only 100 tenants affected<br/>vs 300 in single pool"]

    style Note fill:#fff9c4

7.4 Noisy Neighbor Mitigation Layers

flowchart TD
    Request[Tenant Request]

    Request --> L1{Rate Limit<br/>per tenant?}
    L1 -->|Exceeded| Block1[429 Too Many Requests]
    L1 -->|OK| L2{Resource Quota<br/>K8s ResourceQuota?}

    L2 -->|Exceeded| Block2[503 Service Unavailable]
    L2 -->|OK| L3{Connection Pool<br/>tenant-aware?}

    L3 -->|Exceeded| Block3[Wait or Reject]
    L3 -->|OK| L4{Query Timeout<br/>per tenant?}

    L4 -->|Exceeded| Block4[Cancel query]
    L4 -->|OK| Success[Process request]

    style Block1 fill:#ffcdd2
    style Block2 fill:#ffcdd2
    style Block3 fill:#ffcdd2
    style Block4 fill:#ffcdd2
    style Success fill:#c8e6c9

8. Aha Moments & Pitfalls

Aha Moments

#1: Multi-tenancy is the first decision that touches everything. Pool vs Silo vs Hybrid determines cost, complexity, compliance, and scalability, and it is hard to migrate later.

#2: PostgreSQL RLS is defense-in-depth. Even if app code has a bug and forgets a WHERE clause, RLS prevents the cross-tenant leak. Mandatory for the pool model.

#3: Noisy neighbor is the silent killer. One enterprise customer with a bad query → 99% of small tenants slow down → churn. Per-tenant rate limits + quotas are mandatory.

#4: The cell-based pattern reduces blast radius. Pure pool: 100% impact. Cell pool: 1-10% impact. AWS DynamoDB, Slack, and Stripe all use this pattern.

#5: Tenant context propagation is hard. HTTP requests are fine, async jobs are tricky, and background workers are easy to forget. A logger without tenant_id makes per-tenant debugging impossible.

#6: BYOK is an enterprise differentiator. Big customers care about key control. Implementation cost is high, but premium pricing justifies it.

#7: Compliance shapes architecture. HIPAA and FedRAMP often force the silo model. Plan upfront; retrofitting is expensive.

#8: Tenant deletion is a requirement, not a nice-to-have. GDPR grants the right to delete. Architect for hard delete + audit trail from day one.

Pitfalls

Pitfall 1: Hardcode tenant in URL

Wrong: acme.app.com/users → tenant taken from the subdomain only. Right: tenant from the JWT claim; subdomain as a secondary check.

Pitfall 2: Forget tenant_id trong index

Wrong: CREATE INDEX ON users (email) → cross-tenant scans are slow. Right: CREATE INDEX ON users (tenant_id, email).

Pitfall 3: No RLS in pool model

Wrong: “the code always filters by tenant_id” → one SQL bug = data breach. Right: defense in depth: app-level filter + RLS.

Pitfall 4: Background jobs without context

Wrong: a cron job processes all tenants and forgets to set context → queries miss the filter. Right: explicitly set tenant context per job, with audit logging.

Pitfall 5: Cache without tenant_id

Wrong: cache.set("user:123", data) → tenant A’s user 123 collides with tenant B’s user 123. Right: cache.set(f"user:{tenant_id}:123", data).

Pitfall 6: No fair scheduling

Wrong: FIFO job queue → one tenant submits 100K jobs → others wait hours. Right: round-robin or weighted fair scheduling.

Pitfall 7: Single connection pool

Wrong: one PgBouncer pool for all tenants → one tenant exhausts the pool → others fail. Right: tenant-aware pool quotas or separate pools per tier.

Pitfall 8: No tenant-level monitoring

Wrong: aggregate metrics only — you can’t see that “tenant X has a 50% error rate”. Right: tenant_id label on all metrics; per-tenant dashboards.

Pitfall 9: Same KMS key for all tenants

Wrong: one master key encrypts all tenants → a key compromise = total breach. Right: per-tenant KMS keys (a cost vs security trade-off).

Pitfall 10: No isolation testing

Wrong: manual review only → a bug slips into production → the cross-tenant leak is found by a customer. Right: automated tests for every endpoint; tenant_isolation_test.py runs on every PR.


| Topic | Relation |
|---|---|
| Tuan-07-Database-Sharding-Replication | Sharding by tenant_id; pool model uses a tenant_id column |
| Tuan-09-Rate-Limiter | Per-tenant rate limiting |
| Tuan-11-Microservices-Pattern | Cell-based architecture; namespace per tenant |
| Tuan-14-AuthN-AuthZ-Security | Tenant context from JWT; RBAC scoped to tenant |
| Tuan-15-Data-Security-Encryption | BYOK per tenant |
| Tuan-Bonus-Multi-Region-Active-Active-DSQL | Tenant-region affinity |
| Case-Design-Payment-System | Per-merchant tenant model |


Next: Tuan-Bonus-MCP-Architecture — Model Context Protocol for LLM tools.