Week 15: Vector Databases for AI

“From 2024 to 2026, every product engineer needs to understand vector search. Not to build ChatGPT, but because ‘RAG for team docs’ is a feature everyone asks for. This week takes you from embedding fundamentals to deploying RAG in production.”

Tags: database vector-db pgvector qdrant pinecone hnsw ai rag
Duration: 7 days (6-8h/day, a dense week)
Prerequisites: Tuan-13-Search-Engines-ES (hybrid search concept), Tuan-14-OLAP-Columnar-ClickHouse (large-scale data)
Related: Tuan-Bonus-LLM-Serving-Infrastructure (SD course) · Case-Design-Data-AI-RAG


1. Context & Why

1.1 Vector search 101

Traditional search:

"What is recursion?" → keyword match → docs containing "recursion"

Vector search:

"What is recursion?" → embedding → semantic match → docs ABOUT recursion (even if the word never appears)
"How does a function call itself?" → also matches docs about recursion

1.2 Embedding overview

Text → high-dim vector capturing meaning.

"king" → [0.12, -0.34, 0.56, ..., 0.78]   (1536 dimensions for OpenAI ada-002)
"queen" → [0.14, -0.32, 0.58, ..., 0.75]  (close to king)
"car" → [0.85, 0.21, -0.43, ..., -0.12]   (far from king)

Distance/similarity:

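  • Cosine similarity: -1..1 (1 = same direction, 0 = orthogonal, -1 = opposite)
  • Euclidean (L2): straight-line distance, lower = closer
  • Inner product: dot product, equal to cosine similarity when vectors are unit-normalized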
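A minimal sketch of the three measures in plain Python. The 3-dimensional vectors are made up for illustration and are not real embeddings; real models produce hundreds or thousands of dimensions.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    """Euclidean distance: lower means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b, in [-1, 1]."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors (hypothetical values, chosen only to show the behaviour)
king = [0.9, 0.1, 0.3]
queen = [0.8, 0.2, 0.35]
car = [-0.2, 0.9, -0.4]

print("king vs queen:", cosine_similarity(king, queen), l2_distance(king, queen))
print("king vs car:  ", cosine_similarity(king, car), l2_distance(king, car))
```

The king/car pair scores below zero, which shows that cosine similarity is not confined to 0..1. Most embedding APIs return unit-length vectors, in which case ranking by cosine, inner product, or L2 gives the same order.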

1.3 Goals for the week

  • Embedding models 2024-2026: OpenAI, Cohere, Voyage, open-source
  • Vector index algorithms: HNSW, IVFFlat, IVFPQ, ScaNN, DiskANN
  • pgvector deep dive (0.7+, 0.8+)
  • Vector DB comparison: Qdrant, Pinecone, Weaviate, Milvus, Chroma
  • Hybrid search: dense + sparse (BM25 + vector)
  • Filtered search at scale
  • RAG pipeline production
  • Cost economics
  • Anti-patterns


2. Embeddings 2024-2026

2.1 Embedding model landscape

| Model | Provider | Dim | Context | Cost (per M tokens) | Notes |
|---|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 (configurable) | 8K | $0.02 | Default 2024 |
| text-embedding-3-large | OpenAI | 3072 (configurable down to 256, Matryoshka) | 8K | $0.13 | Higher quality |
| text-embedding-ada-002 | OpenAI | 1536 | 8K | $0.10 | Legacy |
| embed-english-v3.0 | Cohere | 1024 | 512 | $0.10 | Multi-lang option |
| voyage-3 | Voyage AI | 1024 | 32K | $0.06 | RAG-optimized 2024 |
| voyage-3-large | Voyage AI | 1024 | 32K | $0.18 | Flagship 2025, Matryoshka |
| voyage-code-3 | Voyage AI | 1024 | 32K | $0.18 | Code retrieval |
| BGE-large-en-v1.5 | BAAI | 1024 | 512 | Free (self-host) | OSS leader 2024 |
| e5-mistral-7b-instruct | Microsoft | 4096 | 32K | Free (self-host) | 7B parameter |
| sentence-transformers/all-MiniLM-L6-v2 | HuggingFace | 384 | 512 | Free | Lightweight, fast |
| nomic-embed-text-v1.5 | Nomic | 768 (Matryoshka) | 8K | Free | OSS, Matryoshka tech |

2.2 Choosing embedding model

flowchart TD
    A[Choose embedding] --> B{Privacy critical?}
    B -->|Yes| C[Self-host: BGE, E5, Nomic, MiniLM]
    B -->|No| D{Quality tier?}
    D -->|Top quality| E[OpenAI text-embedding-3-large<br/>or Voyage-3-large]
    D -->|Good balance| F[OpenAI text-embedding-3-small<br/>or Voyage-3]
    D -->|Cheap, OK quality| G[Cohere embed-light]
    A --> H{Multi-language?}
    H -->|Yes| I[Cohere multilingual<br/>or BGE-M3]
    A --> J{Need long context?}
    J -->|16K+| K[Voyage-3 or specific model]

    style F fill:#c8e6c9

2.3 Matryoshka embeddings

A modern technique: the model is trained so that a truncated prefix of the embedding is still a useful embedding.

# embedding = [d1, d2, ..., d1536]
# First 256 dims = good enough for fast filter
# All 1536 = best quality
 
# Two-stage retrieval
fast_results = search(query, top_k=100, dims=256)
final = rerank(query, fast_results, dims=1536, top_k=10)

OpenAI's text-embedding-3 models and nomic-embed-text-v1.5 support Matryoshka truncation.
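The two-stage idea can be sketched in plain Python as follows. This is an illustration under assumptions: random Gaussian vectors stand in for real embeddings (only a Matryoshka-trained model concentrates meaning in the leading dimensions), and the function names `truncate` and `two_stage_search` are hypothetical. The one detail that carries over to real systems is re-normalizing after truncation so cosine scores stay comparable.

```python
import math
import random

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def truncate(v, dims):
    """Keep the first `dims` components, then re-normalize."""
    return normalize(v[:dims])

def cosine(a, b):
    """Dot product; equals cosine similarity because inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))

def two_stage_search(query, corpus, coarse_dims, shortlist, top_k):
    # Stage 1: cheap scan using only the leading dimensions
    q_small = truncate(query, coarse_dims)
    coarse = sorted(
        range(len(corpus)),
        key=lambda i: -cosine(q_small, truncate(corpus[i], coarse_dims)),
    )[:shortlist]
    # Stage 2: re-score the shortlist with the full vectors
    q_full = normalize(query)
    return sorted(coarse, key=lambda i: -cosine(q_full, normalize(corpus[i])))[:top_k]

random.seed(0)
corpus = [[random.gauss(0, 1) for _ in range(64)] for _ in range(200)]
# The query is a slightly noisy copy of document 17
query = [x + random.gauss(0, 0.05) for x in corpus[17]]
print(two_stage_search(query, corpus, coarse_dims=16, shortlist=20, top_k=3))
```

In a real deployment the truncated vectors would be stored and indexed once rather than recomputed per query.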

2.4 Reranking

After initial retrieval, rerank with cross-encoder for better top-k.

from sentence_transformers import CrossEncoder

# Stage 1: vector search returns a broad candidate set (top 100)
candidates = vector_search(query_embedding, top_k=100)

# Stage 2: the cross-encoder scores each (query, document) pair jointly
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
scores = reranker.predict([(query, doc.content) for doc in candidates])

# Keep the 10 highest-scoring (candidate, score) pairs
top_10 = sorted(zip(candidates, scores), key=lambda x: -x[1])[:10]

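A cross-encoder is slower per pair but more accurate, because it reads the query and the document together. Pattern: retrieve broadly with vectors, then rerank a narrow set.

A hosted reranker such as the Cohere Rerank API (about $1 per 1,000 reranks) saves you the implementation work.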


3. Vector Index Algorithms

3.1 Exhaustive (brute-force)

For N = 10K vectors, one query costs 10K distance computations over 1536 dims, which finishes in well under 100 ms. Fine.

For N = 10M, that is too slow per query. You need an approximate nearest neighbor (ANN) index.

3.2 IVF (Inverted File)

graph TB
    Centroids[K centroids<br/>from k-means clustering]
    Cluster1[Cluster 1<br/>1000 vectors]
    Cluster2[Cluster 2<br/>1000 vectors]
    Cluster3[Cluster 3<br/>1000 vectors]

    Centroids --> Cluster1
    Centroids --> Cluster2
    Centroids --> Cluster3

    Query[Query] -->|find nearest centroids| Centroids
    Query --> Cluster1
    Query --> Cluster3

Build:

  1. K-means cluster vectors into K centroids
  2. Each vector assigned to nearest centroid

Query:

  1. Find top-p nearest centroids
  2. Brute-force within those clusters

Trade-off:

  • Higher p → more accurate, slower
  • Lower p → faster, less accurate

Tunables: nlist (K), nprobe (p).
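The build and query steps above can be sketched in plain Python. This is a toy illustration, not how pgvector or Faiss implement IVF: the k-means loop is the simplest possible version, and the names `build_ivf` and `search_ivf` are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance (same ordering as L2)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centroids, v):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda c: sq_dist(centroids[c], v))

def build_ivf(vectors, nlist, iters=10, seed=0):
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, nlist)]
    for _ in range(iters):  # plain k-means
        members = [[] for _ in range(nlist)]
        for v in vectors:
            members[nearest(centroids, v)].append(v)
        for c, group in enumerate(members):
            if group:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(group) for col in zip(*group)]
    # Inverted lists: centroid index -> ids of the vectors assigned to it
    lists = [[] for _ in range(nlist)]
    for i, v in enumerate(vectors):
        lists[nearest(centroids, v)].append(i)
    return centroids, lists

def search_ivf(query, vectors, centroids, lists, nprobe, top_k):
    # 1. Pick the nprobe closest centroids
    probe = sorted(range(len(centroids)),
                   key=lambda c: sq_dist(centroids[c], query))[:nprobe]
    # 2. Brute-force only inside those clusters
    candidates = [i for c in probe for i in lists[c]]
    return sorted(candidates, key=lambda i: sq_dist(vectors[i], query))[:top_k]

rng = random.Random(1)
vectors = [[rng.random() for _ in range(8)] for _ in range(500)]
centroids, lists = build_ivf(vectors, nlist=10)
query = [rng.random() for _ in range(8)]

exact = sorted(range(len(vectors)), key=lambda i: sq_dist(vectors[i], query))[:5]
approx = search_ivf(query, vectors, centroids, lists, nprobe=3, top_k=5)
full = search_ivf(query, vectors, centroids, lists, nprobe=10, top_k=5)
print("recall@5 with nprobe=3:", len(set(approx) & set(exact)) / 5)
```

Setting nprobe equal to nlist scans every cluster, so the result matches brute force exactly; lowering nprobe trades recall for speed.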

3.3 HNSW (Hierarchical Navigable Small World)

Multi-layer graph. Top layer sparse, bottom layer dense.

graph TB
    subgraph "Layer 2 (sparse)"
        A2[A]
        B2[B]
        C2[C]
    end
    subgraph "Layer 1"
        A1[A]
        B1[B]
        C1[C]
        D1[D]
        E1[E]
        F1[F]
    end
    subgraph "Layer 0 (dense)"
        A0[A]
        B0[B]
        C0[C]
        D0[D]
        E0[E]
        F0[F]
        G0[G]
        H0[H]
        I0[I]
    end

    A2 -.layer link.-> B2
    B2 -.-> C2
    A2 -.descend.-> A1
    B2 -.descend.-> B1

Query traverses top → bottom, greedy nearest at each layer.

Pros vs IVF:

  • ✅ Higher recall (95-99% typical)
  • ✅ Faster query
  • ❌ Higher memory (graph structure)
  • ❌ Slower build

HNSW is the default index in most vector DBs in 2024-2026.

Tunables:

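  • m: max connections per node (default 16 in pgvector and hnswlib). Higher = better recall, more memory.
  • ef_construction: candidate list size during build (default 64 in pgvector, 200 in hnswlib). Higher = better graph quality, slower build.
  • ef_search: candidate list size at query time (default 40 in pgvector). Higher = better recall, slower queries.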
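To make the role of ef_search concrete, here is a sketch of the best-first search that HNSW runs on a single layer, with a result list capped at `ef`. It is deliberately incomplete: the upper layers and HNSW's incremental construction are omitted, and the graph is built by a brute-force nearest-neighbor pass as a stand-in. `build_knn_graph` and `search_layer` are hypothetical names.

```python
import heapq
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_knn_graph(vectors, m):
    """Stand-in for HNSW construction: link each node to its m nearest, both ways."""
    graph = {i: set() for i in range(len(vectors))}
    for i, v in enumerate(vectors):
        others = (j for j in range(len(vectors)) if j != i)
        for j in sorted(others, key=lambda j: sq_dist(v, vectors[j]))[:m]:
            graph[i].add(j)
            graph[j].add(i)
    return graph

def search_layer(query, vectors, graph, entry, ef):
    """Best-first graph search that keeps at most ef results."""
    d0 = sq_dist(query, vectors[entry])
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    results = [(-d0, entry)]     # max-heap via negated distance: worst result on top
    while candidates:
        d, node = heapq.heappop(candidates)
        if d > -results[0][0]:
            break  # the closest candidate is worse than the worst kept result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = sq_dist(query, vectors[nb])
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst
    return [n for _, n in sorted((-neg, n) for neg, n in results)]

rng = random.Random(0)
vectors = [[rng.random() for _ in range(8)] for _ in range(200)]
graph = build_knn_graph(vectors, m=16)
query = [rng.random() for _ in range(8)]
exact = sorted(range(len(vectors)), key=lambda i: sq_dist(vectors[i], query))[:10]

for ef in (10, 40, 200):
    found = search_layer(query, vectors, graph, entry=0, ef=ef)[:10]
    print("ef =", ef, "recall@10 =", len(set(found) & set(exact)) / 10)
```

A larger `ef` lets the search keep exploring past local minima, which is why raising ef_search improves recall at the cost of latency.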

3.4 IVFPQ — Product Quantization

For huge datasets (100M+), product quantization saves memory by lossily compressing each vector.

1536-dim vector × 4 bytes = 6144 bytes (~6 KB) per vector
× 100M vectors ≈ 614 GB

With PQ (96 sub-quantizers of 16 dims each, 8-bit codes): 96 × 1 byte = 96 bytes per vector
× 100M = 9.6 GB (64x compression)

Recall typically drops by ~5-10%, which is often acceptable.
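The memory arithmetic can be checked with a few lines of Python. The helper names are hypothetical, and the estimate ignores the codebooks themselves and any index overhead, both of which are small next to the per-vector codes at this scale.

```python
def raw_bytes(dim, bytes_per_float=4):
    """Size of one uncompressed float32 vector."""
    return dim * bytes_per_float

def pq_bytes(num_subquantizers, bits_per_code=8):
    """Size of one PQ code: one small integer per sub-quantizer."""
    return num_subquantizers * bits_per_code // 8

dim = 1536
n_vectors = 100_000_000
m = 96  # sub-quantizers, each covering 1536 / 96 = 16 dims

raw = raw_bytes(dim)
code = pq_bytes(m)

print("raw per vector:", raw, "bytes")
print("PQ per vector: ", code, "bytes")
print("raw total:     ", raw * n_vectors / 1e9, "GB")
print("PQ total:      ", code * n_vectors / 1e9, "GB")
print("compression:   ", raw / code, "x")
```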

3.5 DiskANN

From Microsoft Research. A graph index in the same family as HNSW, but designed to live on SSD rather than entirely in RAM.

Pros: indexes 100M+ vectors on commodity hardware
Cons: slower than in-memory HNSW at high recall

ScaNN (Google) targets similar large-scale territory.

3.6 Algorithm comparison

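| Algo | Memory | Build time | Query speed | Recall | Best for |
|---|---|---|---|---|---|
| Brute force | High | None | Slow | 100% | <100K vectors |
| IVFFlat | Medium | Medium | Fast | 90-95% | 1M-10M |
| HNSW | High | Slow | Very fast | 95-99% | 100K-100M, in-mem |
| IVFPQ | Low | Slow | Fast | 85-90% | 10M-1B+, memory-constrained |
| DiskANN | Low (disk) | Slow | Medium | 95% | 100M+, disk-based |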

4. pgvector Deep Dive

4.1 Install + basic

CREATE EXTENSION vector;
 
CREATE TABLE documents (
    id bigserial PRIMARY KEY,
    content text,
    embedding vector(1536)
);
 
INSERT INTO documents (content, embedding)
VALUES ('hello world', '[0.1, 0.2, ..., 0.5]');
 
-- Cosine distance
SELECT content, embedding <=> '[0.1, 0.2, ..., 0.5]'::vector AS distance
FROM documents
ORDER BY distance LIMIT 10;

Distance operators:

  • <-> Euclidean (L2)
  • <#> Inner product (negated)
  • <=> Cosine

4.2 Indexes

-- IVFFlat (older)
CREATE INDEX ON documents USING ivfflat(embedding vector_cosine_ops) WITH (lists = 100);
SET ivfflat.probes = 10;
 
-- HNSW (pgvector 0.5+, recommended 2024)
CREATE INDEX ON documents USING hnsw(embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
SET hnsw.ef_search = 100;

Notes:

  • HNSW recommended for most cases
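  • Build an IVFFlat index only after the table has data, because the centroids are computed by k-means over the existing rows
  • lists for IVFFlat: roughly rows / 1000 up to 1M rows, sqrt(rows) beyond that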
  • m, ef_construction for HNSW: defaults usually fine

4.3 pgvector 0.7+ features (2024)

  • HNSW build parallelism (available since 0.6): multi-threaded index builds
  • Iterative scan (added in 0.8): better filtered search
  • Half-precision (halfvec): 2 bytes per dim vs 4 bytes
  • Bit vectors (bit(n)): for binary embeddings
  • Sparse vectors (sparsevec)

-- Half-precision: 50% memory saving
CREATE TABLE docs_half (id bigserial, embedding halfvec(1536));
 
-- Sparse
CREATE TABLE docs_sparse (id bigserial, embedding sparsevec(30000));

4.4 Filtered search (the hard problem)

Query: “find similar to X where category = ‘tech’”.

Two strategies:

graph LR
    subgraph "Pre-filter"
        P1[Filter rows category=tech]
        P1 --> P2[Brute force vector among them]
        P2 --> P3[Top-k]
    end

    subgraph "Post-filter"
        Q1[Vector top-100 candidates]
        Q1 --> Q2[Filter by category]
        Q2 --> Q3[Top-k - may have <k]
    end

Trade-off:

  • Pre-filter: exact, but the brute-force step gets slow when many rows pass the filter
  • Post-filter: fast, but can return fewer than k results when few candidates pass the filter
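A small simulation makes the trade-off visible. It is a sketch under assumptions: exact sorting stands in for the ANN index, the data is random, and only 0.5% of documents carry the category being filtered on. `pre_filter` and `post_filter` are hypothetical names.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pre_filter(query, docs, category, k):
    """Filter first, then rank only the matching rows."""
    matching = [d for d in docs if d["category"] == category]
    return sorted(matching, key=lambda d: sq_dist(d["vec"], query))[:k]

def post_filter(query, docs, category, k, fetch=100):
    """Take the nearest `fetch` rows first, then drop the ones that fail the filter."""
    nearest = sorted(docs, key=lambda d: sq_dist(d["vec"], query))[:fetch]
    return [d for d in nearest if d["category"] == category][:k]

rng = random.Random(0)
docs = [
    {
        "id": i,
        "category": "tech" if i % 200 == 0 else "other",  # 50 of 10,000 rows
        "vec": [rng.random() for _ in range(8)],
    }
    for i in range(10_000)
]
query = [rng.random() for _ in range(8)]

pre = pre_filter(query, docs, "tech", k=10)
post = post_filter(query, docs, "tech", k=10)
print("pre-filter returned: ", len(pre))
print("post-filter returned:", len(post))
```

With so few matching rows, the top-100 candidate set rarely contains ten of them, so post-filtering comes back short while pre-filtering always fills the result.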

pgvector 0.8’s iterative scan is a smart hybrid: it keeps scanning the index until enough rows pass the filter.

-- Force pre-filter via partial index
CREATE INDEX ON documents USING hnsw(embedding vector_cosine_ops) WHERE category = 'tech';
-- Specific to that category, fast
 
-- Or use combined approach with filter clause
SELECT * FROM documents
WHERE category = 'tech'
ORDER BY embedding <=> $1 LIMIT 10;

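When the filter matches very few rows, Postgres can filter first (via an index on the filter column) and sort exactly, so this works. When it matches a moderate fraction, an HNSW scan applies the filter after the index lookup and may return fewer than LIMIT rows unless iterative scan is enabled.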

4.5 pgvector vs dedicated vector DB

| | pgvector | Qdrant/Pinecone |
|---|---|---|
| Setup | Already have Postgres | New service |
| Up to N vectors | Good <10M | Excellent to 100M+ |
| Filtered search | OK (better in 0.8+) | Native, fast |
| Hybrid (BM25+vector) | Yes (PG FTS) | Native |
| Operational complexity | 1 system | 2 systems |
| Vendor lock | None | Some |
| Cost | DB cost | Add service |

Pattern 2024-2026: start with pgvector for <10M vectors. Move to a dedicated vector DB when:

  • You have more than 10M vectors and the count is growing
  • You need <10ms p99 latency
  • You need complex filtering on high-cardinality fields
  • You want tighter integration between vectors and metadata (Qdrant’s payload, etc)

5. Dedicated Vector DBs

5.1 Qdrant

Rust-based, OSS, fast filtered search.

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, FieldCondition, Filter, MatchValue, PointStruct, VectorParams,
)
 
client = QdrantClient(host="localhost", port=6333)
 
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
)
 
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=1, vector=[0.1, 0.2, ...], payload={"category": "tech", "url": "..."})
    ]
)
 
# Query with filter
results = client.search(
    collection_name="docs",
    query_vector=[0.1, 0.2, ...],
    query_filter=Filter(must=[FieldCondition(key="category", match=MatchValue(value="tech"))]),
    limit=10
)

Pros:

  • Filter integration excellent (pre-filter via indexed payload)
  • Rust speed
  • OSS Apache 2.0
  • Hybrid search (dense + sparse)

5.2 Pinecone

Managed-only, very simple API.

# Pinecone client v3+; older releases used pinecone.init(...)
from pinecone import Pinecone

pc = Pinecone(api_key="...")
index = pc.Index("docs")
 
index.upsert([("id1", [0.1, 0.2, ...], {"category": "tech"})])
 
results = index.query(
    vector=[0.1, 0.2, ...],
    filter={"category": "tech"},
    top_k=10,
    include_metadata=True
)

Pros:

  • Zero ops
  • Multi-tenant ready
  • Serverless tier 2024

Cons:

  • Closed source
  • Vendor lock
  • Cost at scale ($70+/month/pod)

5.3 Weaviate

GraphQL-first, OSS.

Special: built-in vectorizer modules (auto-embed on insert).

{
  Get {
    Document(
      nearText: { concepts: ["machine learning"] }
      where: { path: ["category"], operator: Equal, valueString: "tech" }
      limit: 10
    ) { content category }
  }
}

5.4 Milvus

Most feature-rich OSS. Built on Faiss/Knowhere.

Strengths:

  • Multi-replica, multi-shard
  • Multiple index types (HNSW, IVF, DiskANN, etc)
  • Petabyte scale

5.5 Chroma

Lightweight, Python-first, popular in dev/notebook contexts.

import chromadb
client = chromadb.Client()
collection = client.create_collection("docs")
collection.add(documents=["text1", "text2"], ids=["1", "2"])
results = collection.query(query_texts=["query"], n_results=5)

Built-in default embedding model (sentence-transformers). Easy start.

5.6 Lance / LanceDB

Built on Apache Arrow, embedded mode. Like DuckDB for vectors.

import lancedb
db = lancedb.connect("./data")
table = db.create_table("docs", data=[...])
results = table.search([0.1, 0.2, ...]).limit(10).to_pandas()

5.7 Decision matrix

flowchart TD
    A[Vector DB needed] --> B{Already have Postgres?}
    B -->|Yes| C{<10M vectors?}
    C -->|Yes| D[pgvector]
    C -->|No| E[Dedicated vector DB]
    B -->|No| E
    E --> F{OSS pref?}
    F -->|Yes| G[Qdrant, Milvus, Weaviate, Chroma]
    F -->|No / fully managed| H[Pinecone]
    A --> I{Embedded?}
    I -->|Yes - app library| J[LanceDB, Chroma]

    style D fill:#c8e6c9
    style G fill:#c8e6c9

6. RAG Pipeline Production

6.1 Architecture

graph TB
    subgraph "Indexing phase"
        Docs[Documents] --> Chunker[Chunker]
        Chunker --> Embedder[Embed each chunk]
        Embedder --> Store[(Vector DB)]
    end

    subgraph "Query phase"
        Query[User query] --> QEmbed[Embed query]
        QEmbed --> Retrieve[Search vector DB]
        Query --> BM25[BM25 search]
        Retrieve --> Fusion[Hybrid fusion RRF]
        BM25 --> Fusion
        Fusion --> Rerank[Cross-encoder rerank]
        Rerank --> LLM[LLM with context]
        LLM --> Answer[Answer]
    end

    Store -.-> Retrieve

6.2 Chunking strategies

| Strategy | When |
|---|---|
| Fixed size (500 tokens) | Simple baseline |
| Sentence-aware | Better than fixed |
| Paragraph-aware | Natural breaks |
| Section/heading-aware | Code, docs with structure |
| Semantic chunking | Group similar sentences (slower) |
| Recursive | Try larger, split if needed |

Sliding window (overlap):

chunk1: tokens 0-500
chunk2: tokens 400-900  (100 overlap)
chunk3: tokens 800-1300

Overlap helps catch info at chunk boundaries.
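A minimal sliding-window chunker that reproduces the boundaries above. It is a sketch: the input is any list of tokens (here plain integers stand in for tokenizer output), and `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(tokens, size=500, overlap=100):
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

tokens = list(range(1300))  # stand-in for 1300 tokenizer tokens
chunks = sliding_window_chunks(tokens)
for c in chunks:
    print("tokens", c[0], "to", c[-1] + 1)
```

The early `break` avoids emitting a trailing chunk that would contain only tokens already covered by the previous window.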

6.3 Metadata enrichment

{
    "id": "doc1_chunk5",
    "content": "...",
    "embedding": [...],
    "metadata": {
        "source_url": "https://...",
        "title": "...",
        "section": "Installation",
        "author": "...",
        "created_at": "2026-05-16",
        "doc_type": "tutorial",
        "language": "en"
    }
}

Use metadata for:

  • Filtering (date range, language, doc type)
  • Citing sources
  • Re-fetch context

6.4 Indexing pipeline

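import openai  # openai>=1.0; reads OPENAI_API_KEY from the environment

# load_document, chunk_recursive and vector_db are placeholders for your own code
def index_document(doc_path):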
    text = load_document(doc_path)
    chunks = chunk_recursive(text, max_size=500, overlap=50)
 
    # Batch embed
    embeddings = openai.embeddings.create(
        model="text-embedding-3-small",
        input=[c.content for c in chunks]
    ).data
 
    # Upsert to vector DB
    vector_db.upsert([
        {
            "id": f"{doc_path}#{i}",
            "vector": emb.embedding,
            "metadata": {**c.metadata, "doc_path": doc_path}
        }
        for i, (c, emb) in enumerate(zip(chunks, embeddings))
    ])

6.5 Query pipeline

def rag_query(question):
    # 1. Hybrid retrieval (embed the query with the same model used for indexing)
    q_embed = openai.embeddings.create(model="...", input=question).data[0].embedding
 
    dense_results = vector_db.search(q_embed, top_k=20)
    sparse_results = bm25_search(question, top_k=20)
 
    # 2. RRF fusion: merge the two rankings by rank position, not raw score
    fused = reciprocal_rank_fusion(dense_results, sparse_results, k=60)
 
    # 3. Rerank the fused list, then keep the best 5 chunks as context
    candidates = fused[:50]  # a cap; here at most 40 (20 dense + 20 sparse)
    reranked = cross_encoder_rerank(question, candidates)
    context_chunks = reranked[:5]
 
    # 4. LLM call with context
    prompt = build_prompt(question, context_chunks)
    answer = llm.complete(prompt)
 
    return answer, [c.metadata for c in context_chunks]  # for citation
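The pipeline above calls `reciprocal_rank_fusion` without defining it. One possible implementation is sketched below, assuming each result list is an ordered list of document ids (the pipeline's result objects would need to be mapped to ids first). Each document scores 1 / (k + rank) per list it appears in, and the scores are summed.

```python
from collections import defaultdict

def reciprocal_rank_fusion(*ranked_lists, k=60):
    """Fuse several rankings of document ids into one, best first."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score; break ties by id so the output is deterministic
    return sorted(scores, key=lambda d: (-scores[d], str(d)))

dense = ["a", "b", "c", "d"]   # ids ranked by vector similarity
sparse = ["c", "a", "e"]       # ids ranked by BM25
print(reciprocal_rank_fusion(dense, sparse))  # ['a', 'c', 'b', 'e', 'd']
```

Because only rank positions are used, RRF needs no score normalization between the dense and sparse retrievers, which is the main reason it is a common default for hybrid search.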

6.6 Evaluation

Don’t deploy RAG without evals. Common frameworks: RAGAS, TruLens.

Metrics:

  • Retrieval: hit rate, MRR, NDCG (do the retrieved chunks contain the answer?)
  • Generation: faithfulness, relevance, hallucination rate
  • End-to-end: human eval samples

Build an eval set early (50-200 Q&A pairs from real users).
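The two simplest retrieval metrics fit in a few lines. This sketch assumes the eval set is stored as two dicts keyed by question id: the ranked ids the retriever returned, and the set of ids a human marked as relevant. The data below is invented for illustration.

```python
def hit_rate_at_k(retrieved, relevant, k):
    """Fraction of questions with at least one relevant doc in the top k."""
    hits = sum(1 for q in retrieved if set(retrieved[q][:k]) & relevant[q])
    return hits / len(retrieved)

def mrr(retrieved, relevant):
    """Mean reciprocal rank of the first relevant doc (0 if none was retrieved)."""
    total = 0.0
    for q, ranking in retrieved.items():
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant[q]:
                total += 1.0 / rank
                break
    return total / len(retrieved)

# Hypothetical eval data: 3 questions, 3 retrieved docs each
retrieved = {
    "q1": ["d3", "d1", "d9"],
    "q2": ["d4", "d5", "d6"],
    "q3": ["d7", "d2", "d8"],
}
relevant = {"q1": {"d1"}, "q2": {"d0"}, "q3": {"d7"}}

print("hit rate@1:", hit_rate_at_k(retrieved, relevant, k=1))
print("hit rate@3:", hit_rate_at_k(retrieved, relevant, k=3))
print("MRR:       ", mrr(retrieved, relevant))
```

Running the same functions before and after a change (new chunking, added reranker) gives a quick regression check on retrieval quality.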

6.7 Common improvements (in order of impact)

  1. Better chunking — biggest single improvement
  2. Better retriever — hybrid > dense only
  3. Reranker — adds 5-15% accuracy
  4. Query rewriting — expand/clarify before retrieve
  5. HyDE (hypothetical document embedding) — generate hypothetical answer, embed it instead of query
  6. Multi-query — generate multiple variants, retrieve each, union
  7. Self-RAG — model decides when to retrieve

7. Filtered Search at Scale

7.1 The problem

SELECT * FROM documents
WHERE org_id = 42 AND lang = 'en' AND created_at > '2026-01-01'
ORDER BY embedding <=> $1 LIMIT 10;

100M total docs, org 42 has 1M. Need filter + similarity.

7.2 Strategies

A. Pre-filter via partition or partial index: for low-cardinality filters

-- One index per org
CREATE INDEX ON documents USING hnsw(embedding vector_cosine_ops) WHERE org_id = 42;

→ Works for a small number of orgs. Doesn’t scale to 10K orgs.

B. Filter during HNSW traversal — modern vector DBs

  • Qdrant: payload-aware HNSW
  • pgvector 0.8+ iterative scan
  • Pinecone: native filtering

C. Multi-tenancy partitioning

  • Qdrant: collection per tenant if small per tenant
  • Or single collection with filter on tenant_id

D. Per-tenant index

  • Build separate HNSW per org
  • Trade-off: memory grows linearly

7.3 Cardinality matters

graph LR
    A[Filter cardinality?] --> B{<1% rows match}
    B --> B1[Pre-filter brute-force OK]
    A --> C{50-100% rows match}
    C --> C1[Filter during traversal<br/>or post-filter]
    A --> D{Between}
    D --> D1[Iterative scan / native]

8. Cost Economics

8.1 Embedding cost

100K documents × 1000 tokens/doc = 100M tokens
× $0.02 per M tokens (OpenAI 3-small)
= $2 one-time embedding

Self-host BGE/E5: $0 marginal, ~30 docs/sec on GPU.

8.2 Storage cost

100M vectors × 1536 dim × 4 bytes = 600 GB

pgvector (Postgres): same as DB storage
Pinecone: ~$60-200/month for p1 pod
Qdrant Cloud: ~$50-100/month at similar size

8.3 Query cost

RAG query:
- Embed query: ~$0.00001
- Vector search: ~$0 (compute already paid)
- LLM call: ~$0.005 (4K context, GPT-4o)
- Total: ~$0.005/query

10M queries/month = $50K/month at GPT-4o pricing
                  = $1500/month with GPT-4o-mini
                  = $500/month with self-host Llama

LLM dominates RAG cost. Optimize prompt size + model choice.
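The figures above can be reproduced with a small calculator, which is handy for swapping in your own prices. The per-query costs are the rough estimates from this section, not current list prices, and the function names are hypothetical.

```python
def embedding_cost(num_docs, tokens_per_doc, price_per_million_tokens):
    """One-time cost of embedding a corpus."""
    return num_docs * tokens_per_doc / 1_000_000 * price_per_million_tokens

def per_query_cost(embed=0.00001, search=0.0, llm=0.005):
    """Cost of one RAG query, using this section's rough estimates."""
    return embed + search + llm

def monthly_cost(queries_per_month, cost_per_query):
    return queries_per_month * cost_per_query

total = per_query_cost()
llm_share = 0.005 / total

print(f"one-time embedding: ${embedding_cost(100_000, 1_000, 0.02):.2f}")
print(f"per query: ${total:.5f} (LLM share {llm_share:.1%})")
print(f"10M queries/month: ${monthly_cost(10_000_000, total):,.0f}")
```

The LLM call accounts for well over 99% of the per-query cost in this estimate, which is why prompt size and model choice are the first things to optimize.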

8.4 Optimization

  • Embed with smaller model (3-small vs 3-large)
  • Quantize vectors (half precision, PQ)
  • Cache frequent queries
  • Use smaller LLM with fewer context tokens
  • Hybrid search reduces context size needed

9. Anti-patterns

| Pattern | Why bad | Fix |
|---|---|---|
| Use vector DB as primary store | No ACID, eventually consistent | Vector DB for search, real data elsewhere |
| Single huge chunk (entire doc) | Embedding “averages out” meaning | Chunk to 200-500 tokens |
| No metadata | Hard filter, no citation | Always include metadata |
| Vector-only retrieval | Misses exact keyword matches | Hybrid (BM25 + vector) |
| Embed with v1, query with v2 | Vectors incompatible | Re-index on model change |
| Reranker before retriever | Slow, expensive | Retrieve broad, rerank narrow |
| No eval set | Don’t know quality | Build eval early |
| 10K+ tokens per chunk | Embed quality degrades | Smaller chunks |
| Filter post-search top-10 | If filter sparse, no results | Pre-filter or use native filter |
| Same embedding for all use cases | One-size doesn’t fit | Different models for QA, code, multilingual |

10. Lab

10.1 Day 1: pgvector basics

docker run -d -p 5432:5432 -e POSTGRES_PASSWORD=lab pgvector/pgvector:pg16
 
PGPASSWORD=lab psql -h localhost -U postgres -c "CREATE EXTENSION vector;"

Build small RAG with 100 docs. Manual embedding via API. Search.

10.2 Day 2: HNSW tuning

Index 1M synthetic vectors. Tune m, ef_construction, ef_search. Measure recall vs latency.

10.3 Day 3: Hybrid search

Combine keyword search (Postgres FTS as the BM25 stand-in) with pgvector. Implement RRF fusion.

10.4 Day 4: Qdrant setup

Deploy Qdrant, migrate 1M vectors from pgvector. Compare filtered search performance.

10.5 Day 5: Real RAG

Build production-ish RAG over your team docs / GitHub README / personal notes. Use LangChain or DIY.

10.6 Day 6: Reranking

Add Cohere Rerank or local cross-encoder. Measure accuracy delta.

10.7 Day 7: Eval

Build 50 Q&A eval set. Measure hit rate at top-5, top-10 with and without rerank.


11. Self-check

  1. Embedding fundamentals: what does a vector represent? How do you compare two words?
  2. HNSW vs IVF: what are the pros and cons of each?
  3. pgvector 0.7+: name 3 new features.
  4. Filtered search: what are the 4 strategies?
  5. Matryoshka embedding: what is the concept, and what is it used for?
  6. RAG pipeline: what are the 5 stages?
  7. Chunking strategy: why does the sliding window matter?
  8. Hybrid search: why does it beat vector-only?
  9. Reranker: at which stage does it go, and why?
  10. RAG cost: what is the dominant cost?

12. Next

Phase 4 complete. Next lesson: Tuan-16-ORM-CQRS-Multi-Tenancy (application patterns).


Week 15 complete. Embedding is the new database column. Updated: 2026-05-16