Week 15: Vector Databases for AI

“In 2024-2026, every product engineer needs to understand vector search. Not to build ChatGPT, but because ‘RAG for team docs’ is a feature everyone asks for. This week takes you from embedding fundamentals to deploying production RAG.”

Tags: database vector-db pgvector qdrant pinecone hnsw ai rag
Duration: 7 days (6-8h/day; a dense week)
Prerequisites: Tuan-13-Search-Engines-ES (hybrid search concept), Tuan-14-OLAP-Columnar-ClickHouse (large-scale data)
Related: Tuan-Bonus-LLM-Serving-Infrastructure (SD course) · Case-Design-Data-AI-RAG

1. Context & Why

1.1 Vector search 101

Traditional search:

"What is recursion?" → keyword match → docs containing "recursion"

Vector search:

"What is recursion?" → embedding → semantic match → docs ABOUT recursion (even if they never use the word)
"How does a function call itself?" → also matches docs about recursion

1.2 Embedding overview

Text → high-dim vector capturing meaning.

"king" → [0.12, -0.34, 0.56, ..., 0.78]   (1536 dimensions for OpenAI ada-002)
"queen" → [0.14, -0.32, 0.58, ..., 0.75]  (close to king)
"car" → [0.85, 0.21, -0.43, ..., -0.12]   (far from king)

Distance/similarity:

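Cosine similarity: -1..1 (1 = same direction, 0 = orthogonal, -1 = opposite)
Euclidean (L2): straight-line distance; smaller = closer
Inner product: dot product; equals cosine similarity when both vectors are unit-normalized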
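The three measures can be sketched in a few lines of NumPy. The 3-dimensional vectors below are made up for illustration and are not real embeddings; the function names are this sketch's own.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b; ranges from -1 to 1."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Straight-line (Euclidean) distance; 0 means identical vectors."""
    return float(np.linalg.norm(a - b))

def inner_product(a: np.ndarray, b: np.ndarray) -> float:
    """Plain dot product; sensitive to vector length as well as direction."""
    return float(np.dot(a, b))

# Toy vectors (assumed values, chosen so that king/queen point the same way)
king = np.array([0.9, 0.1, 0.4])
queen = np.array([0.8, 0.2, 0.5])
car = np.array([-0.3, 0.9, -0.2])

print(cosine_similarity(king, queen) > cosine_similarity(king, car))  # True
print(l2_distance(king, queen) < l2_distance(king, car))              # True
```

For unit-length vectors all three give the same ranking, which is why many pipelines normalize embeddings once and then use the cheaper inner product.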

1.3 Goals for the week

Embedding models 2024-2026: OpenAI, Cohere, Voyage, open-source
Vector index algorithms: HNSW, IVFFlat, IVFPQ, ScaNN, DiskANN
pgvector deep dive (0.7+, 0.8+)
Vector DB comparison: Qdrant, Pinecone, Weaviate, Milvus, Chroma
Hybrid search: dense + sparse (BM25 + vector)
Filtered search at scale
RAG pipeline production
Cost economics
Anti-patterns

1.4 References

pgvector — https://github.com/pgvector/pgvector
MTEB Leaderboard — https://huggingface.co/spaces/mteb/leaderboard (embedding model rankings)
Qdrant docs — https://qdrant.tech/documentation/
Pinecone Learning Center — https://www.pinecone.io/learn/
Faiss tutorial — Facebook’s vector lib
HNSW paper — Malkov & Yashunin 2016
DiskANN paper — Microsoft 2019
LangChain, LlamaIndex docs for RAG patterns

2. Embeddings 2024-2026

2.1 Embedding model landscape

Model	Provider	Dim	Context	Cost (per M tokens)	Notes
text-embedding-3-small	OpenAI	1536 (configurable)	8K	$0.02	Default 2024
text-embedding-3-large	OpenAI	3072 (configurable down to 256, Matryoshka)	8K	$0.13	Higher quality
text-embedding-ada-002	OpenAI	1536	8K	$0.10	Legacy
embed-english-v3.0	Cohere	1024	512	$0.10	Multi-lang option
voyage-3	Voyage AI	1024	32K	$0.06	RAG-optimized 2024
voyage-3-large	Voyage AI	1024	32K	$0.18	Flagship 2025, Matryoshka
voyage-code-3	Voyage AI	1024	32K	$0.18	Code retrieval
BGE-large-en-v1.5	BAAI	1024	512	Free (self-host)	OSS leader 2024
e5-mistral-7b-instruct	Microsoft	4096	32K	Free (self-host)	7B parameter
sentence-transformers/all-MiniLM-L6-v2	HuggingFace	384	256 (default truncation)	Free	Lightweight, fast
nomic-embed-text-v1.5	Nomic	768 (Matryoshka)	8K	Free	OSS, Matryoshka tech

2.2 Choosing embedding model

flowchart TD
    A[Choose embedding] --> B{Privacy critical?}
    B -->|Yes| C[Self-host: BGE, E5, Nomic, MiniLM]
    B -->|No| D{Quality tier?}
    D -->|Top quality| E[OpenAI text-embedding-3-large<br/>or voyage-3-large]
    D -->|Good balance| F[OpenAI text-embedding-3-small<br/>or voyage-3]
    D -->|Cheap, OK quality| G[Cohere embed-light]
    A --> H{Multi-language?}
    H -->|Yes| I[Cohere multilingual<br/>or BGE-M3]
    A --> J{Need long context?}
    J -->|16K+| K[Voyage-3 or specific model]

    style F fill:#c8e6c9

2.3 Matryoshka embeddings

Modern technique: the embedding is trained so that a truncated prefix is still a useful, lower-quality embedding.

# embedding = [d1, d2, ..., d1536]
# First 256 dims = good enough for fast filter
# All 1536 = best quality
 
# Two-stage retrieval
fast_results = search(query, top_k=100, dims=256)
final = rerank(query, fast_results, dims=1536, top_k=10)

The OpenAI text-embedding-3 models, nomic-embed-text-v1.5, and voyage-3-large (see the table in 2.1) support Matryoshka-style truncation.
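A minimal sketch of the two-stage idea, assuming the embeddings are already in memory as NumPy arrays. `truncate_embedding` and `two_stage_search` are illustrative names, not a library API; the important detail is re-normalizing after truncation so cosine scores stay comparable.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and re-normalize to unit length."""
    head = np.asarray(vec, dtype=np.float32)[:dims]
    norm = np.linalg.norm(head)
    return head / norm if norm > 0 else head

def two_stage_search(query, corpus, coarse_dims=256, shortlist=100, top_k=10):
    """Shortlist with truncated vectors, then rescore with full vectors."""
    full_dims = len(query)

    # Stage 1: cheap cosine scoring on the first `coarse_dims` dimensions
    q_small = truncate_embedding(query, coarse_dims)
    c_small = np.stack([truncate_embedding(v, coarse_dims) for v in corpus])
    candidates = np.argsort(-(c_small @ q_small))[:shortlist]

    # Stage 2: exact cosine scoring on the shortlist only
    q_full = truncate_embedding(query, full_dims)
    c_full = np.stack([truncate_embedding(corpus[i], full_dims) for i in candidates])
    order = np.argsort(-(c_full @ q_full))[:top_k]
    return [int(candidates[i]) for i in order]
```

In a real system stage 1 would be an ANN index over the short vectors; the brute-force scan here only keeps the sketch self-contained.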

2.4 Reranking

After initial retrieval, rerank with cross-encoder for better top-k.

# Stage 1: vector search top 100
candidates = vector_search(query_embedding, top_k=100)
 
# Stage 2: cross-encoder rerank
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
scores = reranker.predict([(query, doc.content) for doc in candidates])
top_10 = sorted(zip(candidates, scores), key=lambda x: -x[1])[:10]

Cross-encoder slower per-pair but more accurate. Pattern: vector retrieve broad, rerank narrow.

Cohere Rerank API (priced per search, about $1 per 1,000 at the time of writing) saves you from hosting a reranker yourself.

3. Vector Index Algorithms

3.1 Exhaustive (brute-force)

For N=10K vectors → 10K × 1536-dim distance computations per query → <100ms. Fine.

For N=10M → too slow. Need approximate nearest neighbor (ANN) index.
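For reference, exhaustive search is only a few lines. The sketch below assumes cosine similarity and an in-memory NumPy matrix; `exact_top_k` is an illustrative name. It is also a handy ground truth when measuring the recall of an ANN index.

```python
import numpy as np

def exact_top_k(query: np.ndarray, vectors: np.ndarray, k: int = 10):
    """Exact cosine top-k by scoring every vector (O(N * dim) per query)."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = v @ q

    k = min(k, len(scores))
    # argpartition finds the k best in O(N); only those k are then sorted
    top = np.argpartition(-scores, k - 1)[:k]
    top = top[np.argsort(-scores[top])]
    return top, scores[top]
```

The cost grows linearly with N, which is why this stops being viable somewhere between the 10K and 10M cases above.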

3.2 IVF (Inverted File)

graph TB
    Centroids[K centroids<br/>from k-means clustering]
    Cluster1[Cluster 1<br/>1000 vectors]
    Cluster2[Cluster 2<br/>1000 vectors]
    Cluster3[Cluster 3<br/>1000 vectors]

    Centroids --> Cluster1
    Centroids --> Cluster2
    Centroids --> Cluster3

    Query[Query] -->|find nearest centroids| Centroids
    Query --> Cluster1
    Query --> Cluster3

Build:

K-means cluster vectors into K centroids
Each vector assigned to nearest centroid

Query:

Find top-p nearest centroids
Brute-force within those clusters

Trade-off:

Higher p → more accurate, slower
Lower p → faster, less accurate

Tunables: nlist (K), nprobe (p).
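The build and query steps above can be sketched as a toy in-memory index. This is a teaching sketch with a naive k-means, assuming squared L2 distance; `ToyIVF` is a made-up class, not how Faiss or pgvector implement IVF.

```python
import numpy as np

class ToyIVF:
    def __init__(self, nlist: int, n_iter: int = 10, seed: int = 0):
        self.nlist = nlist            # number of centroids (K)
        self.n_iter = n_iter          # k-means iterations
        self.rng = np.random.default_rng(seed)

    def _assign(self, vectors: np.ndarray) -> np.ndarray:
        # Squared L2 distance from every vector to every centroid
        d = ((vectors[:, None, :] - self.centroids[None, :, :]) ** 2).sum(axis=2)
        return d.argmin(axis=1)

    def build(self, vectors: np.ndarray) -> None:
        self.vectors = vectors
        start = self.rng.choice(len(vectors), size=self.nlist, replace=False)
        self.centroids = vectors[start].copy()
        for _ in range(self.n_iter):
            assign = self._assign(vectors)
            for c in range(self.nlist):
                members = vectors[assign == c]
                if len(members):      # leave empty clusters where they are
                    self.centroids[c] = members.mean(axis=0)
        assign = self._assign(vectors)
        # Inverted lists: centroid id -> ids of the vectors assigned to it
        self.lists = [np.flatnonzero(assign == c) for c in range(self.nlist)]

    def search(self, query: np.ndarray, k: int = 10, nprobe: int = 1) -> np.ndarray:
        # Step 1: pick the nprobe centroids closest to the query
        centroid_dist = ((self.centroids - query) ** 2).sum(axis=1)
        probe = np.argsort(centroid_dist)[:nprobe]
        # Step 2: brute-force only inside those clusters
        ids = np.concatenate([self.lists[c] for c in probe])
        d = ((self.vectors[ids] - query) ** 2).sum(axis=1)
        return ids[np.argsort(d)[:k]]
```

With `nprobe == nlist` every cluster is scanned and the result equals brute force; lowering `nprobe` trades recall for speed.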

3.3 HNSW (Hierarchical Navigable Small World)

Multi-layer graph. Top layer sparse, bottom layer dense.

graph TB
    subgraph "Layer 2 (sparse)"
        A2[A]
        B2[B]
        C2[C]
    end
    subgraph "Layer 1"
        A1[A]
        B1[B]
        C1[C]
        D1[D]
        E1[E]
        F1[F]
    end
    subgraph "Layer 0 (dense)"
        A0[A]
        B0[B]
        C0[C]
        D0[D]
        E0[E]
        F0[F]
        G0[G]
        H0[H]
        I0[I]
    end

    A2 -.layer link.-> B2
    B2 -.-> C2
    A2 -.descend.-> A1
    B2 -.descend.-> B1

Query traverses top → bottom, greedy nearest at each layer.

Pros vs IVF:

✅ Higher recall (95-99% typical)
✅ Faster query
❌ Higher memory (graph structure)
❌ Slower build

HNSW is the default index in most vector DBs as of 2024-2026.

Tunables:

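m: max connections per node (default 16 in both hnswlib and pgvector). Higher = better accuracy, more memory.
ef_construction: candidate list size during build (default 200 in hnswlib, 64 in pgvector). Higher = better graph quality, slower build.
ef_search: candidate list size at query time (default 40 in pgvector). Higher = better recall, slower queries.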

3.4 IVFPQ — Product Quantization

For huge datasets (100M+), memory savings via lossy compression of vectors.

1536-dim vector × 4 bytes = 6,144 bytes (~6 KB) per vector
× 100M vectors ≈ 614 GB

With PQ (96 sub-vectors of 16 dims each, one 8-bit code per sub-vector): 96 × 1 byte = 96 bytes
× 100M = 9.6 GB (64x compression)

Accuracy hit ~5-10%, often acceptable.
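The arithmetic above is easy to rerun for other dimensions or sub-vector counts. A small sketch, with helper names that are this example's own:

```python
def raw_bytes_per_vector(dims: int, bytes_per_float: int = 4) -> int:
    """Size of one uncompressed float32 vector."""
    return dims * bytes_per_float

def pq_bytes_per_vector(num_subvectors: int, bits_per_code: int = 8) -> int:
    """Each sub-vector is replaced by one codebook index of `bits_per_code` bits."""
    return num_subvectors * bits_per_code // 8

dims, n_vectors, m = 1536, 100_000_000, 96
raw = raw_bytes_per_vector(dims)
pq = pq_bytes_per_vector(m)

print(f"dims per sub-vector: {dims // m}")             # 16
print(f"raw total: {raw * n_vectors / 1e9:.1f} GB")    # 614.4 GB
print(f"PQ total:  {pq * n_vectors / 1e9:.1f} GB")     # 9.6 GB
print(f"compression: {raw // pq}x")                    # 64x
```

These figures cover the codes only; the codebooks and the IVF centroids add a small fixed overhead on top.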

3.5 DiskANN

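From Microsoft Research. A graph index (Vamana) in the same family as HNSW, but designed to keep the graph on SSD rather than entirely in RAM.

Pros: Index 100M+ vectors on commodity hardware
Cons: Higher query latency than in-memory HNSW at the same recall

ScaNN (Google) targets a similar scale, but through quantization rather than a disk-resident graph.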

3.6 Algorithm comparison

Algo	Memory	Build time	Query speed	Recall	Best for
Brute force	High	None	Slow	100%	<100K vectors
IVFFlat	Medium	Medium	Fast	90-95%	1M-10M
HNSW	High	Slow	Very fast	95-99%	100K-100M, in-mem
IVFPQ	Low	Slow	Fast	85-90%	10M-1B+, memory-constrained
DiskANN	Low (disk)	Slow	Medium	95%	100M+, disk-based

4. pgvector Deep Dive

4.1 Install + basic

CREATE EXTENSION vector;
 
CREATE TABLE documents (
    id bigserial PRIMARY KEY,
    content text,
    embedding vector(1536)
);
 
INSERT INTO documents (content, embedding)
VALUES ('hello world', '[0.1, 0.2, ..., 0.5]');
 
-- Cosine distance
SELECT content, embedding <=> '[0.1, 0.2, ..., 0.5]'::vector AS distance
FROM documents
ORDER BY distance LIMIT 10;

Distance operators:

<-> Euclidean (L2)
<#> Inner product (negated)
<=> Cosine

4.2 Indexes

-- IVFFlat (older)
CREATE INDEX ON documents USING ivfflat(embedding vector_cosine_ops) WITH (lists = 100);
SET ivfflat.probes = 10;
 
-- HNSW (pgvector 0.5+, recommended 2024)
CREATE INDEX ON documents USING hnsw(embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
SET hnsw.ef_search = 100;

Notes:

HNSW recommended for most cases
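IVFFlat: build the index after the table has data (centroids are computed from the rows present at build time)
lists for IVFFlat: roughly rows / 1000 up to 1M rows, sqrt(rows) beyond that; start probes at about sqrt(lists)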
m, ef_construction for HNSW: defaults usually fine

4.3 pgvector 0.6-0.8 features (2024)

Parallel HNSW build (0.6): multi-worker index builds
Half-precision (halfvec, 0.7): 2 bytes per dim vs 4 bytes
Bit vectors (bit(n), 0.7): for binary embeddings
Sparse vectors (sparsevec, 0.7)
Iterative scan (0.8): better filtered search

-- Half-precision: 50% memory saving
CREATE TABLE docs_half (id bigserial, embedding halfvec(1536));
 
-- Sparse
CREATE TABLE docs_sparse (id bigserial, embedding sparsevec(30000));

4.4 Filtered search (the hard problem)

Query: “find similar to X where category = ‘tech’”.

Two strategies:

graph LR
    subgraph "Pre-filter"
        P1[Filter rows category=tech]
        P1 --> P2[Brute force vector among them]
        P2 --> P3[Top-k]
    end

    subgraph "Post-filter"
        Q1[Vector top-100 candidates]
        Q1 --> Q2[Filter by category]
        Q2 --> Q3[Top-k - may have <k]
    end

Trade-off:

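Pre-filter: exact, but if the filter still leaves many rows, the brute-force scan over them is slow
Post-filter: fast, but may return fewer than k results when few candidates pass the filter

pgvector 0.8’s iterative scan is a hybrid: it keeps scanning the index until enough rows pass the filter.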

-- Force pre-filter via partial index
CREATE INDEX ON documents USING hnsw(embedding vector_cosine_ops) WHERE category = 'tech';
-- Specific to that category, fast
 
-- Or use combined approach with filter clause
SELECT * FROM documents
WHERE category = 'tech'
ORDER BY embedding <=> $1 LIMIT 10;

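When very few rows match, Postgres can filter first and scan the matches exactly, so this works. When the HNSW index is used, the filter is applied after the index scan, so a filter that rejects most candidates can return fewer than LIMIT rows unless iterative scan (0.8+) is enabled.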
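The difference between the two strategies shows up clearly on synthetic data. The sketch below uses exact NumPy search in place of a real ANN index (an assumption made to keep it self-contained); `pre_filter` and `post_filter` are illustrative names.

```python
import numpy as np

def post_filter(query, vectors, mask, k=10, fetch=100):
    """Take the `fetch` nearest vectors first, then drop those failing the filter."""
    d = np.linalg.norm(vectors - query, axis=1)
    nearest = np.argsort(d)[:fetch]
    return nearest[mask[nearest]][:k]

def pre_filter(query, vectors, mask, k=10):
    """Apply the filter first, then search exactly among the matching rows."""
    ids = np.flatnonzero(mask)
    d = np.linalg.norm(vectors[ids] - query, axis=1)
    return ids[np.argsort(d)[:k]]

rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 32))
query = rng.normal(size=32)

# A selective filter: only 50 of 10,000 rows (0.5%) match
mask = np.zeros(10_000, dtype=bool)
mask[rng.choice(10_000, size=50, replace=False)] = True

print("pre-filter results: ", len(pre_filter(query, vectors, mask)))
print("post-filter results:", len(post_filter(query, vectors, mask)))
```

Pre-filter always returns the full 10 here, while post-filter only keeps whichever of the 100 nearest happen to match, which is usually far fewer than 10 at this selectivity.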

4.5 pgvector vs dedicated vector DB

	pgvector	Qdrant/Pinecone
Setup	Already have Postgres	New service
Scale	Good to ~10M vectors	Excellent to 100M+
Filtered search	OK (better in 0.8+ with iterative scan)	Native, fast
Hybrid (BM25+vector)	Yes (PG FTS; ts_rank, not true BM25)	Native
Operational complexity	1 system	2 systems
Vendor lock	None	Some
Cost	DB cost	Add service

Pattern 2024-2026: Start with pgvector below 10M vectors. Move to a dedicated vector DB when:

More than 10M vectors and growing
Need <10ms p99 latency
Complex filtering on high-cardinality fields
Tighter integration between vectors and metadata (Qdrant’s payload, etc)

5. Dedicated Vector DBs

5.1 Qdrant

Rust-based, OSS, fast filtered search.

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue,  # needed by the filtered query below
)
 
client = QdrantClient(host="localhost", port=6333)
 
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
)
 
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=1, vector=[0.1, 0.2, ...], payload={"category": "tech", "url": "..."})
    ]
)
 
# Query with filter
results = client.search(
    collection_name="docs",
    query_vector=[0.1, 0.2, ...],
    query_filter=Filter(must=[FieldCondition(key="category", match=MatchValue(value="tech"))]),
    limit=10
)

Pros:

Filter integration excellent (pre-filter via indexed payload)
Rust speed
OSS Apache 2.0
Hybrid search (dense + sparse)

5.2 Pinecone

Managed-only, very simple API.

from pinecone import Pinecone  # v3+ client; the older pinecone.init() was removed
pc = Pinecone(api_key="...")
index = pc.Index("docs")
 
index.upsert([("id1", [0.1, 0.2, ...], {"category": "tech"})])
 
results = index.query(
    vector=[0.1, 0.2, ...],
    filter={"category": "tech"},
    top_k=10,
    include_metadata=True
)

Pros:

Zero ops
Multi-tenant ready
Serverless tier 2024

Cons:

Closed source
Vendor lock
Cost at scale ($70+/month/pod)

5.3 Weaviate

GraphQL-first, OSS.

Special: built-in vectorizer modules (auto-embed on insert).

{
  Get {
    Document(
      nearText: { concepts: ["machine learning"] }
      where: { path: ["category"], operator: Equal, valueString: "tech" }
      limit: 10
    ) { content category }
  }
}

5.4 Milvus

Most feature-rich OSS. Built on Faiss/Knowhere.

Strengths:

Multi-replica, multi-shard
Multiple index types (HNSW, IVF, DiskANN, etc)
Billion-vector scale

5.5 Chroma

Lightweight, Python-first, popular in dev/notebook contexts.

import chromadb
client = chromadb.Client()
collection = client.create_collection("docs")
collection.add(documents=["text1", "text2"], ids=["1", "2"])
results = collection.query(query_texts=["query"], n_results=5)

Built-in default embedding model (sentence-transformers). Easy start.

5.6 Lance / LanceDB

Built on Apache Arrow, embedded mode. Like DuckDB for vectors.

import lancedb
db = lancedb.connect("./data")
table = db.create_table("docs", data=[...])
results = table.search([0.1, 0.2, ...]).limit(10).to_pandas()

5.7 Decision matrix

flowchart TD
    A[Vector DB needed] --> B{Already have Postgres?}
    B -->|Yes| C{<10M vectors?}
    C -->|Yes| D[pgvector]
    C -->|No| E[Dedicated vector DB]
    B -->|No| E
    E --> F{OSS pref?}
    F -->|Yes| G[Qdrant, Milvus, Weaviate, Chroma]
    F -->|No / fully managed| H[Pinecone]
    A --> I{Embedded?}
    I -->|Yes - app library| J[LanceDB, Chroma]

    style D fill:#c8e6c9
    style G fill:#c8e6c9

6. RAG Pipeline Production

6.1 Architecture

graph TB
    subgraph "Indexing phase"
        Docs[Documents] --> Chunker[Chunker]
        Chunker --> Embedder[Embed each chunk]
        Embedder --> Store[(Vector DB)]
    end

    subgraph "Query phase"
        Query[User query] --> QEmbed[Embed query]
        QEmbed --> Retrieve[Search vector DB]
        Query --> BM25[BM25 search]
        Retrieve --> Fusion[Hybrid fusion RRF]
        BM25 --> Fusion
        Fusion --> Rerank[Cross-encoder rerank]
        Rerank --> LLM[LLM with context]
        LLM --> Answer[Answer]
    end

    Store -.-> Retrieve

6.2 Chunking strategies

Strategy	When
Fixed size (500 tokens)	Simple baseline
Sentence-aware	Better than fixed
Paragraph-aware	Natural breaks
Section/heading-aware	Code, docs with structure
Semantic chunking	Group similar sentences (slower)
Recursive	Try larger, split if needed

Sliding window (overlap):

chunk1: tokens 0-500
chunk2: tokens 400-900  (100 overlap)
chunk3: tokens 800-1300

Overlap helps catch info at chunk boundaries.
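A minimal sliding-window chunker that reproduces the boundaries above, assuming the text is already tokenized into a list. `sliding_window_chunks` is an illustrative helper, not a library function.

```python
def sliding_window_chunks(tokens, size=500, overlap=100):
    """Split `tokens` into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

tokens = list(range(1300))  # stand-in for 1,300 real tokens
for chunk in sliding_window_chunks(tokens):
    print(chunk[0], chunk[-1] + 1)
# 0 500
# 400 900
# 800 1300
```

The early `break` avoids emitting a trailing chunk that would consist only of tokens the previous window already covered.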

6.3 Metadata enrichment

{
    "id": "doc1_chunk5",
    "content": "...",
    "embedding": [...],
    "metadata": {
        "source_url": "https://...",
        "title": "...",
        "section": "Installation",
        "author": "...",
        "created_at": "2026-05-16",
        "doc_type": "tutorial",
        "language": "en"
    }
}

Use metadata for:

Filtering (date range, language, doc type)
Citing sources
Re-fetch context

6.4 Indexing pipeline

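import openai  # openai>=1.0; reads OPENAI_API_KEY from the environment

# load_document, chunk_recursive and vector_db are placeholders for your own code
def index_document(doc_path):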
    text = load_document(doc_path)
    chunks = chunk_recursive(text, max_size=500, overlap=50)
 
    # Batch embed
    embeddings = openai.embeddings.create(
        model="text-embedding-3-small",
        input=[c.content for c in chunks]
    ).data
 
    # Upsert to vector DB
    vector_db.upsert([
        {
            "id": f"{doc_path}#{i}",
            "vector": emb.embedding,
            "metadata": {**c.metadata, "doc_path": doc_path}
        }
        for i, (c, emb) in enumerate(zip(chunks, embeddings))
    ])

6.5 Query pipeline

def rag_query(question):
    # 1. Hybrid retrieval
    q_embed = openai.embeddings.create(model="...", input=question).data[0].embedding
 
    dense_results = vector_db.search(q_embed, top_k=20)
    sparse_results = bm25_search(question, top_k=20)
 
    # 2. RRF fusion
    fused = reciprocal_rank_fusion(dense_results, sparse_results, k=60)
 
    # 3. Rerank
    candidates = fused[:50]
    reranked = cross_encoder_rerank(question, candidates)
    context_chunks = reranked[:5]
 
    # 4. LLM call with context
    prompt = build_prompt(question, context_chunks)
    answer = llm.complete(prompt)
 
    return answer, [c.metadata for c in context_chunks]  # for citation
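The `reciprocal_rank_fusion` step is small enough to write by hand. A minimal sketch, assuming each result list is an ordered list of document IDs (the pipeline above passes result objects, so you would map them to IDs first):

```python
from collections import defaultdict

def reciprocal_rank_fusion(*ranked_lists, k=60):
    """Fuse rankings: each list contributes 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

dense = ["a", "b", "c"]    # vector search order
sparse = ["c", "a", "d"]   # BM25 order
print(reciprocal_rank_fusion(dense, sparse))  # ['a', 'c', 'b', 'd']
```

RRF only uses ranks, never raw scores, so it needs no calibration between cosine distances and BM25 scores. The constant `k=60` is the commonly used default and damps the influence of the very top ranks.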

6.6 Evaluation

Don’t deploy RAG without evals. Common frameworks: RAGAS, TruLens.

Metrics:

Retrieval: hit rate, MRR, NDCG (does retrieved contain answer?)
Generation: faithfulness, relevance, hallucination rate
End-to-end: human eval samples

Build eval set early (50-200 Q&A pairs from real users).
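The two simplest retrieval metrics need no framework. A sketch, assuming that for each eval question you have the retrieved IDs in rank order and the set of IDs that contain the answer (function names are this example's own):

```python
def hit_rate_at_k(retrieved, relevant, k):
    """Fraction of queries with at least one relevant ID in the top k."""
    hits = sum(1 for r, rel in zip(retrieved, relevant) if set(r[:k]) & set(rel))
    return hits / len(retrieved)

def mean_reciprocal_rank(retrieved, relevant):
    """Average of 1 / rank of the first relevant result (0 if none is found)."""
    total = 0.0
    for r, rel in zip(retrieved, relevant):
        rel = set(rel)
        for rank, doc_id in enumerate(r, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(retrieved)

# Three eval questions with made-up chunk IDs
retrieved = [["d1", "d2", "d3"], ["d9", "d4", "d5"], ["d7", "d8", "d6"]]
relevant = [["d1"], ["d5"], ["d0"]]

print(hit_rate_at_k(retrieved, relevant, k=3))      # 2 of 3 queries hit
print(mean_reciprocal_rank(retrieved, relevant))    # (1 + 1/3 + 0) / 3
```

Running these before and after each pipeline change (chunking, hybrid, rerank) makes it possible to tell whether a change actually helped retrieval.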

6.7 Common improvements (in order of impact)

Better chunking — biggest single improvement
Better retriever — hybrid > dense only
Reranker — adds 5-15% accuracy
Query rewriting — expand/clarify before retrieve
HyDE (hypothetical document embedding) — generate hypothetical answer, embed it instead of query
Multi-query — generate multiple variants, retrieve each, union
Self-RAG — model decides when to retrieve

7. Filtered Search at Scale

7.1 The problem

SELECT * FROM documents
WHERE org_id = 42 AND lang = 'en' AND created_at > '2026-01-01'
ORDER BY embedding <=> $1 LIMIT 10;

100M total docs, org 42 has 1M. Need filter + similarity.

7.2 Strategies

A. Pre-filter via partition or partial index: for low-cardinality filters (few distinct values)

-- One index per org
CREATE INDEX ON documents USING hnsw(embedding vector_cosine_ops) WHERE org_id = 42;

→ Works for a small number of orgs. Doesn’t scale to 10K orgs (one index each).

B. Filter during HNSW traversal — modern vector DBs

Qdrant: payload-aware HNSW
pgvector 0.8+ iterative scan
Pinecone: native filtering

C. Multi-tenancy partitioning

Qdrant: collection per tenant if small per tenant
Or single collection with filter on tenant_id

D. Per-tenant index

Build separate HNSW per org
Trade-off: memory grows linearly

7.3 Selectivity matters

graph LR
    A[What fraction of rows match the filter?] --> B{<1% rows match}
    B --> B1[Pre-filter brute-force OK]
    A --> C{50-100% rows match}
    C --> C1[Filter during traversal<br/>or post-filter]
    A --> D{Between}
    D --> D1[Iterative scan / native]

8. Cost Economics

8.1 Embedding cost

100K documents × 1000 tokens/doc = 100M tokens
× $0.02 per M tokens (OpenAI 3-small)
= $2 one-time embedding

Self-host BGE/E5: no per-token fee (you pay for GPU time instead), ~30 docs/sec on GPU.

8.2 Storage cost

100M vectors × 1536 dim × 4 bytes = 600 GB

pgvector (Postgres): same as DB storage
Pinecone: ~$60-200/month per p1 pod (a 600 GB index spans many pods, so the total is far higher)
Qdrant Cloud: ~$50-100/month for a comparable single small cluster

8.3 Query cost

RAG query:
- Embed query: ~$0.00001
- Vector search: ~$0 (compute already paid)
- LLM call: ~$0.005 (4K context, GPT-4o)
- Total: ~$0.005/query

10M queries/month = $50K/month at GPT-4o pricing
                  = $1500/month with GPT-4o-mini
                  = $500/month with self-host Llama

LLM dominates RAG cost. Optimize prompt size + model choice.
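The per-query numbers above can be plugged into a tiny calculator to see how strongly the LLM call dominates. The defaults are the estimates from this section, not current list prices, and the function names are this sketch's own.

```python
def monthly_rag_cost(queries_per_month, embed_cost=0.00001, llm_cost=0.005):
    """Total monthly spend; vector search is treated as already-paid compute."""
    return queries_per_month * (embed_cost + llm_cost)

def llm_share(embed_cost=0.00001, llm_cost=0.005):
    """Fraction of the per-query cost that goes to the LLM call."""
    return llm_cost / (embed_cost + llm_cost)

print(f"${monthly_rag_cost(10_000_000):,.0f} per month")        # $50,100 per month
print(f"LLM share of per-query cost: {llm_share():.1%}")        # 99.8%
```

Because the LLM share is so close to 100%, halving the prompt size or switching to a cheaper model moves the bill far more than any change on the embedding side.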

8.4 Optimization

Embed with smaller model (3-small vs 3-large)
Quantize vectors (half precision, PQ)
Cache frequent queries
Use smaller LLM with fewer context tokens
Hybrid search reduces context size needed

9. Anti-patterns

Pattern	Why bad	Fix
Use vector DB as primary store	Many dedicated vector DBs lack ACID transactions and are eventually consistent	Vector DB for search, real data elsewhere
Single huge chunk (entire doc)	Embedding “averages out” meaning	Chunk to 200-500 tokens
No metadata	Hard filter, no citation	Always include metadata
Vector-only retrieval	Misses exact keyword matches	Hybrid (BM25 + vector)
Embed with v1, query with v2	Vectors incompatible	Re-index on model change
Reranking the whole corpus (no retriever first)	Cross-encoder cost grows with every document scored	Retrieve broad, rerank narrow
No eval set	Don’t know quality	Build eval early
10K+ tokens per chunk	Embed quality degrades	Smaller chunks
Filter post-search top-10	If filter sparse, no results	Pre-filter or use native filter
Same embedding for all use cases	One-size doesn’t fit	Different models for QA, code, multilingual

10. Lab

10.1 Day 1: pgvector basics

docker run -d -p 5432:5432 -e POSTGRES_PASSWORD=lab pgvector/pgvector:pg16
 
PGPASSWORD=lab psql -h localhost -U postgres -c "CREATE EXTENSION vector;"

Build small RAG with 100 docs. Manual embedding via API. Search.

10.2 Day 2: HNSW tuning

Index 1M synthetic vectors. Tune m, ef_construction, ef_search. Measure recall vs latency.

10.3 Day 3: Hybrid search

Combine BM25 (Postgres FTS) + pgvector. Implement RRF fusion.

10.4 Day 4: Qdrant setup

Deploy Qdrant, migrate 1M vectors from pgvector. Compare filtered search performance.

10.5 Day 5: Real RAG

Build production-ish RAG over your team docs / GitHub README / personal notes. Use LangChain or DIY.

10.6 Day 6: Reranking

Add Cohere Rerank or local cross-encoder. Measure accuracy delta.

10.7 Day 7: Eval

Build 50 Q&A eval set. Measure hit rate at top-5, top-10 with and without rerank.

11. Self-check

Embedding fundamentals: what does a vector represent? How do you compare two words?
HNSW vs IVF: pros and cons of each?
pgvector 0.7+: 3 new features?
Filtered search: 4 strategies?
Matryoshka embedding: concept + use case?
RAG pipeline: 5 stages?
Chunking strategy: why does the sliding window matter?
Hybrid search: why does it beat vector-only?
Reranker: at which stage does it go? Why?
RAG cost: what is the dominant cost?

12. Next

Phase 4 complete. Next lesson: Tuan-16-ORM-CQRS-Multi-Tenancy (application patterns).

Week 15 complete. Embedding is the new database column. Updated: 2026-05-16

lthieu's notes
