Week 15: Vector Databases for AI
"In 2024-2026, every product engineer needs to understand vector search. Not to build ChatGPT, but because 'RAG for team docs' is a feature everyone asks for. This week takes you from embedding fundamentals to deploying production RAG."
Tags: database vector-db pgvector qdrant pinecone hnsw ai rag
Duration: 7 days (6-8h/day, a dense week)
Prerequisites: Tuan-13-Search-Engines-ES (hybrid search concept), Tuan-14-OLAP-Columnar-ClickHouse (large-scale data)
Related: Tuan-Bonus-LLM-Serving-Infrastructure (SD course) · Case-Design-Data-AI-RAG
1. Context & Why
1.1 Vector search 101
Traditional search:
"What is recursion?" → keyword match → docs containing "recursion"
Vector search:
"What is recursion?" → embedding → semantic match → docs ABOUT recursion (even without word)
"How does a function call itself?" → also matches docs about recursion
1.2 Embedding overview
Text → high-dim vector capturing meaning.
"king" → [0.12, -0.34, 0.56, ..., 0.78] (1536 dimensions for OpenAI ada-002)
"queen" → [0.14, -0.32, 0.58, ..., 0.75] (close to king)
"car" → [0.85, 0.21, -0.43, ..., -0.12] (far from king)
Distance/similarity:
- Cosine similarity: -1..1 (1 = identical direction, 0 = orthogonal, -1 = opposite direction)
- Euclidean (L2): distance
- Inner product: dot product
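The three measures above can be sketched in plain Python. The 4-dimensional vectors below are a hypothetical shortening of the "king", "queen", and "car" examples (only the components visible above are used), so the numbers are illustrative, not real model output.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    """Euclidean (L2) distance: smaller means closer."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b, in the range -1..1."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical 4-dim stand-ins for the 1536-dim vectors shown above.
king = [0.12, -0.34, 0.56, 0.78]
queen = [0.14, -0.32, 0.58, 0.75]
car = [0.85, 0.21, -0.43, -0.12]

print("king vs queen:", cosine_similarity(king, queen), l2_distance(king, queen))
print("king vs car:  ", cosine_similarity(king, car), l2_distance(king, car))
```

With these toy values, king and queen score close to 1 while king and car score below 0, which is why the cosine range has to include negative values.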
1.3 Goals for the week
- Embedding models 2024-2026: OpenAI, Cohere, Voyage, open-source
- Vector index algorithms: HNSW, IVFFlat, IVFPQ, ScaNN, DiskANN
- pgvector deep dive (0.7+, 0.8+)
- Vector DB comparison: Qdrant, Pinecone, Weaviate, Milvus, Chroma
- Hybrid search: dense + sparse (BM25 + vector)
- Filtered search at scale
- RAG pipeline production
- Cost economics
- Anti-patterns
1.4 References
- pgvector — https://github.com/pgvector/pgvector
- MTEB Leaderboard — https://huggingface.co/spaces/mteb/leaderboard (embedding model rankings)
- Qdrant docs — https://qdrant.tech/documentation/
- Pinecone Learning Center — https://www.pinecone.io/learn/
- Faiss tutorial — Facebook’s vector lib
- HNSW paper — Malkov & Yashunin 2016
- DiskANN paper — Microsoft 2019
- LangChain, LlamaIndex docs for RAG patterns
2. Embeddings 2024-2026
2.1 Embedding model landscape
| Model | Provider | Dim | Context | Cost (per M tokens) | Notes |
|---|---|---|---|---|---|
| text-embedding-3-small | OpenAI | 1536 (configurable) | 8K | $0.02 | Default 2024 |
| text-embedding-3-large | OpenAI | 3072 (configurable down to 256, Matryoshka) | 8K | $0.13 | Higher quality |
| text-embedding-ada-002 | OpenAI | 1536 | 8K | $0.10 | Legacy |
| embed-english-v3.0 | Cohere | 1024 | 512 | $0.10 | Multi-lang option |
| voyage-3 | Voyage AI | 1024 | 32K | $0.06 | RAG-optimized 2024 |
| voyage-3-large | Voyage AI | 1024 | 32K | $0.18 | Flagship 2025, Matryoshka |
| voyage-code-3 | Voyage AI | 1024 | 32K | $0.18 | Code retrieval |
| BGE-large-en-v1.5 | BAAI | 1024 | 512 | Free (self-host) | OSS leader 2024 |
| e5-mistral-7b-instruct | Microsoft | 4096 | 32K | Free (self-host) | 7B parameter |
| sentence-transformers/all-MiniLM-L6-v2 | HuggingFace | 384 | 512 | Free | Lightweight, fast |
| nomic-embed-text-v1.5 | Nomic | 768 (Matryoshka) | 8K | Free | OSS, Matryoshka tech |
2.2 Choosing embedding model
flowchart TD
    A[Choose embedding] --> B{Privacy critical?}
    B -->|Yes| C[Self-host: BGE, E5, Nomic, MiniLM]
    B -->|No| D{Quality tier?}
    D -->|Top quality| E[OpenAI text-embedding-3-large<br/>or Voyage-3-large]
    D -->|Good balance| F[OpenAI text-embedding-3-small<br/>or Voyage-3]
    D -->|Cheap, OK quality| G[Cohere embed-light]
    A --> H{Multi-language?}
    H -->|Yes| I[Cohere multilingual<br/>or BGE-M3]
    A --> J{Need long context?}
    J -->|16K+| K[Voyage-3 or specific model]
    style F fill:#c8e6c9
2.3 Matryoshka embeddings
A modern technique: the embedding is trained so that a truncated prefix of the vector is still a useful (lower-quality) embedding.
# embedding = [d1, d2, ..., d1536]
# First 256 dims = good enough for fast filter
# All 1536 = best quality
# Two-stage retrieval
fast_results = search(query, top_k=100, dims=256)
final = rerank(query, fast_results, dims=1536, top_k=10)
OpenAI 3.x and Nomic support Matryoshka.
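Truncating a Matryoshka embedding usually means keeping the first dimensions and rescaling to unit length, so cosine and inner-product scores stay comparable. A minimal sketch, assuming the vector comes from a model trained for this (the 8-dim input is made up; truncating a non-Matryoshka embedding this way will generally hurt quality):

```python
import math

def truncate_and_normalize(embedding, dims):
    """Keep the first `dims` components and rescale the result to unit length."""
    head = embedding[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

# Hypothetical 8-dim "full" embedding; real ones have hundreds of dimensions.
full = [0.5, 0.5, 0.5, 0.5, 0.1, -0.1, 0.05, 0.0]
short = truncate_and_normalize(full, 4)
print(short)
```

The short vectors go into the fast first-stage index; the full vectors are kept alongside for the second-stage rescoring.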
2.4 Reranking
After initial retrieval, rerank with cross-encoder for better top-k.
# Stage 1: vector search top 100
candidates = vector_search(query_embedding, top_k=100)
# Stage 2: cross-encoder rerank
from sentence_transformers import CrossEncoder
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')
scores = reranker.predict([(query, doc.content) for doc in candidates])
top_10 = sorted(zip(candidates, scores), key=lambda x: -x[1])[:10]
A cross-encoder is slower per pair but more accurate. Pattern: vector retrieve broad, rerank narrow.
Cohere Rerank API ($1 per 1000 reranks) saves implementation.
3. Vector Index Algorithms
3.1 Exhaustive (brute-force)
For N=10K vectors → 10K × 1536-dim distance comp per query → <100ms. Fine.
For N=10M → too slow. Need approximate nearest neighbor (ANN) index.
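For reference, exhaustive search is just "compute every distance, keep the k smallest". A minimal sketch with toy 2-dim vectors (the ids and values are made up):

```python
import heapq
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_top_k(query, vectors, k):
    """Return the k (distance, id) pairs closest to `query`, nearest first.

    Cost is one distance computation per stored vector, i.e. O(N * dim) per query.
    """
    return heapq.nsmallest(k, ((l2(query, v), i) for i, v in vectors.items()))

vectors = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [5.0, 5.0]}
result = brute_force_top_k([0.9, 0.1], vectors, k=2)
print([doc_id for _, doc_id in result])  # ['b', 'a']
```

This is also the ground truth that ANN recall is measured against.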
3.2 IVF (Inverted File)
graph TB
    Centroids[K centroids<br/>from k-means clustering]
    Cluster1[Cluster 1<br/>1000 vectors]
    Cluster2[Cluster 2<br/>1000 vectors]
    Cluster3[Cluster 3<br/>1000 vectors]
    Centroids --> Cluster1
    Centroids --> Cluster2
    Centroids --> Cluster3
    Query[Query] -->|find nearest centroids| Centroids
    Query --> Cluster1
    Query --> Cluster3
Build:
- K-means cluster vectors into K centroids
- Each vector assigned to nearest centroid
Query:
- Find top-p nearest centroids
- Brute-force within those clusters
Trade-off:
- Higher p → more accurate, slower
- Lower p → faster, less accurate
Tunables: nlist (K), nprobe (p).
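The build and query steps above can be sketched as a toy IVF in pure Python. This is an illustration under simplifying assumptions (naive k-means initialization from the first `nlist` vectors, tiny 2-dim data), not how Faiss or pgvector implement it.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda c: l2(point, centroids[c]))

def assign(vectors, centroids):
    """Inverted lists: for each centroid, the ids of the vectors assigned to it."""
    lists = [[] for _ in centroids]
    for idx, v in enumerate(vectors):
        lists[nearest(v, centroids)].append(idx)
    return lists

def build_ivf(vectors, nlist, iterations=10):
    """Plain k-means, then one final assignment of every vector to a list."""
    dim = len(vectors[0])
    centroids = [list(v) for v in vectors[:nlist]]  # naive init (assumption)
    for _ in range(iterations):
        for c, members in enumerate(assign(vectors, centroids)):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [
                    sum(vectors[i][d] for i in members) / len(members)
                    for d in range(dim)
                ]
    return centroids, assign(vectors, centroids)

def ivf_search(query, vectors, centroids, lists, nprobe, k):
    """Scan only the `nprobe` lists whose centroids are nearest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    return sorted(candidates, key=lambda i: l2(query, vectors[i]))[:k]

vectors = [[0, 0], [10, 10], [0, 1], [1, 0], [10, 11], [11, 10]]
centroids, lists = build_ivf(vectors, nlist=2)
print(lists)                                                            # [[0, 2, 3], [1, 4, 5]]
print(ivf_search([0.9, 0.2], vectors, centroids, lists, nprobe=1, k=2))  # [3, 0]
```

Setting `nprobe` equal to `nlist` scans every list and degenerates to brute force, which is the "higher p, more accurate, slower" end of the trade-off.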
3.3 HNSW (Hierarchical Navigable Small World)
Multi-layer graph. Top layer sparse, bottom layer dense.
graph TB
    subgraph "Layer 2 (sparse)"
        A2[A]
        B2[B]
        C2[C]
    end
    subgraph "Layer 1"
        A1[A]
        B1[B]
        C1[C]
        D1[D]
        E1[E]
        F1[F]
    end
    subgraph "Layer 0 (dense)"
        A0[A]
        B0[B]
        C0[C]
        D0[D]
        E0[E]
        F0[F]
        G0[G]
        H0[H]
        I0[I]
    end
    A2 -.layer link.-> B2
    B2 -.-> C2
    A2 -.descend.-> A1
    B2 -.descend.-> B1
Query traverses top → bottom, greedy nearest at each layer.
Pros vs IVF:
- ✅ Higher recall (95-99% typical)
- ✅ Faster query
- ❌ Higher memory (graph structure)
- ❌ Slower build
HNSW is the default index in most vector DBs in 2024-2026.
Tunables:
- m: max connections per node (default 16). Higher = better accuracy, more memory.
- ef_construction: candidate list size during build (default 200 in hnswlib; pgvector defaults to 64). Higher = better quality, slower build.
- ef_search: runtime accuracy/speed trade-off.
3.4 IVFPQ — Product Quantization
For huge datasets (100M+), memory savings via lossy compression of vectors.
1536-dim vector × 4 bytes = 6,144 bytes (~6KB) per vector
× 100M vectors = ~600GB
With PQ (96 sub-quantizers of 16 dims each, 8-bit codes): 96 × 1 byte = 96 bytes per vector
× 100M = 9.6GB (64x compression)
Accuracy hit ~5-10%, often acceptable.
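The memory arithmetic above is easy to parameterize. A small sketch (function names are illustrative; it counts only the per-vector payload and ignores codebooks, ids, and index overhead):

```python
def raw_bytes_per_vector(dim, bytes_per_float=4):
    """Uncompressed float32 storage for one vector."""
    return dim * bytes_per_float

def pq_bytes_per_vector(dim, num_subquantizers, bits_per_code=8):
    """Product quantization stores one small code per sub-vector."""
    if dim % num_subquantizers != 0:
        raise ValueError("dim must be divisible by num_subquantizers")
    return num_subquantizers * bits_per_code // 8

n_vectors = 100_000_000
raw = raw_bytes_per_vector(1536)     # 6144 bytes
pq = pq_bytes_per_vector(1536, 96)   # 96 bytes (each code covers 16 dims)

print(f"raw: {raw * n_vectors / 1e9:.1f} GB")
print(f"pq:  {pq * n_vectors / 1e9:.1f} GB")
print(f"compression: {raw // pq}x")
```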
3.5 DiskANN
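Microsoft research. A graph index (Vamana) in the same family as HNSW, but designed to live on SSD instead of entirely in RAM.
Pros: Index 100M+ vectors on commodity hardware
Cons: Slower than in-memory HNSW at the same recall
ScaNN (Google) targets similar large-scale workloads, relying on quantization rather than a disk-resident graph.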
3.6 Algorithm comparison
| Algo | Memory | Build time | Query speed | Recall | Best for |
|---|---|---|---|---|---|
| Brute force | High | None | Slow | 100% | <100K vectors |
| IVFFlat | Medium | Medium | Fast | 90-95% | 1M-10M |
| HNSW | High | Slow | Very fast | 95-99% | 100K-100M, in-mem |
| IVFPQ | Low | Slow | Fast | 85-90% | 10M-1B+, memory-constrained |
| DiskANN | Low (disk) | Slow | Medium | 95% | 100M+, disk-based |
4. pgvector Deep Dive
4.1 Install + basic
CREATE EXTENSION vector;
CREATE TABLE documents (
id bigserial PRIMARY KEY,
content text,
embedding vector(1536)
);
INSERT INTO documents (content, embedding)
VALUES ('hello world', '[0.1, 0.2, ..., 0.5]');
-- Cosine distance
SELECT content, embedding <=> '[0.1, 0.2, ..., 0.5]'::vector AS distance
FROM documents
ORDER BY distance LIMIT 10;
Distance operators:
- <-> Euclidean (L2) distance
- <#> Inner product (negated)
- <=> Cosine distance
4.2 Indexes
-- IVFFlat (older)
CREATE INDEX ON documents USING ivfflat(embedding vector_cosine_ops) WITH (lists = 100);
SET ivfflat.probes = 10;
-- HNSW (pgvector 0.5+, recommended 2024)
CREATE INDEX ON documents USING hnsw(embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
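SET hnsw.ef_search = 100;
Notes:
- HNSW recommended for most cases
- Build an IVFFlat index only after the table has data: centroids are computed from the rows present at build time
- lists for IVFFlat: the pgvector README suggests rows / 1000 up to 1M rows and sqrt(rows) above that; for probes, start at sqrt(lists)
- m, ef_construction for HNSW: defaults usually fine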
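As a starting point for IVFFlat sizing, the pgvector README suggests `lists = rows / 1000` for up to 1M rows and `sqrt(rows)` beyond that, with `probes` starting around `sqrt(lists)`. A small helper that encodes this rule of thumb (the function name is made up; treat the output as a first guess to tune against measured recall):

```python
import math

def suggest_ivfflat_params(rows):
    """Return (lists, probes) starting values for a table with `rows` vectors."""
    if rows <= 1_000_000:
        lists = max(1, rows // 1000)
    else:
        lists = math.isqrt(rows)
    probes = max(1, math.isqrt(lists))
    return lists, probes

for rows in (500_000, 1_000_000, 10_000_000):
    lists, probes = suggest_ivfflat_params(rows)
    print(f"{rows:>10} rows -> lists = {lists}, probes = {probes}")
```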
4.3 Recent pgvector features (0.6 to 0.8, 2024)
- HNSW build parallelism: multi-threaded index builds (added in 0.6)
- Half-precision (halfvec): 2 bytes per dim vs 4 bytes (0.7)
- Bit vectors (bit(n)): for binary embeddings (0.7)
- Sparse vectors (sparsevec) (0.7)
- Iterative scan: better filtered search (added in 0.8)
-- Half-precision: 50% memory saving
CREATE TABLE docs_half (id bigserial, embedding halfvec(1536));
-- Sparse
CREATE TABLE docs_sparse (id bigserial, embedding sparsevec(30000));
4.4 Filtered search (the hard problem)
Query: "find similar to X where category = 'tech'".
Two strategies:
graph LR
    subgraph "Pre-filter"
        P1[Filter rows category=tech]
        P1 --> P2[Brute force vector among them]
        P2 --> P3[Top-k]
    end
    subgraph "Post-filter"
        Q1[Vector top-100 candidates]
        Q1 --> Q2[Filter by category]
        Q2 --> Q3[Top-k - may have <k]
    end
Trade-off:
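- Pre-filter: accurate, but if many rows pass the filter the brute-force scan is slow
- Post-filter: fast, but can return fewer than k results when the filter discards most candidates
pgvector 0.8's iterative scan is a hybrid: it keeps scanning the index until enough rows pass the filter.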
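The difference between the two strategies shows up even on five rows. The sketch below uses made-up data and exact distances in place of a real index. The `iterative_filter` function only illustrates the idea of widening the candidate set until enough rows pass; it is not pgvector's actual implementation.

```python
def l2sq(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pre_filter(query, rows, predicate, k):
    """Filter first, then rank only the matching rows (always exact)."""
    matching = [r for r in rows if predicate(r)]
    return [r["id"] for r in sorted(matching, key=lambda r: l2sq(query, r["vec"]))[:k]]

def post_filter(query, rows, predicate, k, candidates):
    """Take the nearest `candidates` rows, then drop those failing the filter."""
    nearest = sorted(rows, key=lambda r: l2sq(query, r["vec"]))[:candidates]
    return [r["id"] for r in nearest if predicate(r)][:k]

def iterative_filter(query, rows, predicate, k, start=3):
    """Widen the candidate set until k rows pass or every row has been seen."""
    candidates = start
    while True:
        hits = post_filter(query, rows, predicate, k, candidates)
        if len(hits) >= k or candidates >= len(rows):
            return hits
        candidates *= 2

rows = [
    {"id": 1, "vec": [0.0, 0.0], "category": "news"},
    {"id": 2, "vec": [0.1, 0.0], "category": "news"},
    {"id": 3, "vec": [0.2, 0.0], "category": "news"},
    {"id": 4, "vec": [3.0, 0.0], "category": "tech"},
    {"id": 5, "vec": [4.0, 0.0], "category": "tech"},
]

def is_tech(row):
    return row["category"] == "tech"

query = [0.0, 0.0]
print(pre_filter(query, rows, is_tech, k=2))                  # [4, 5]
print(post_filter(query, rows, is_tech, k=2, candidates=3))   # []
print(iterative_filter(query, rows, is_tech, k=2))            # [4, 5]
```

Post-filtering returns nothing here because the three nearest rows are all "news", which is exactly the "may have <k" failure mode.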
-- Force pre-filter via partial index
CREATE INDEX ON documents USING hnsw(embedding vector_cosine_ops) WHERE category = 'tech';
-- Specific to that category, fast
-- Or use combined approach with filter clause
SELECT * FROM documents
WHERE category = 'tech'
ORDER BY embedding <=> $1 LIMIT 10;
For high-selectivity filters (very few rows match), this works. For low-selectivity filters, it may degrade.
4.5 pgvector vs dedicated vector DB
| | pgvector | Qdrant/Pinecone |
|---|---|---|
| Setup | Already have Postgres | New service |
| Scale | Good up to ~10M vectors | Excellent to 100M+ |
| Filtered search | OK (better with iterative scan in 0.8+) | Native, fast |
| Hybrid (BM25+vector) | Yes (PG FTS) | Native |
| Operational complexity | 1 system | 2 systems |
| Vendor lock | None | Some |
| Cost | DB cost | Add service |
Pattern 2024-2026: Start pgvector for <10M vectors. Move to dedicated when:
- More than 10M vectors and growing
- Need <10ms p99 latency
- Complex filtering with high cardinality
- Tighter integration with embeddings (Qdrant’s payload, etc)
5. Dedicated Vector DBs
5.1 Qdrant
Rust-based, OSS, fast filtered search.
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue,  # needed for the filtered query below
)
client = QdrantClient(host="localhost", port=6333)
client.create_collection(
collection_name="docs",
vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
)
client.upsert(
collection_name="docs",
points=[
PointStruct(id=1, vector=[0.1, 0.2, ...], payload={"category": "tech", "url": "..."})
]
)
# Query with filter
results = client.search(
collection_name="docs",
query_vector=[0.1, 0.2, ...],
query_filter=Filter(must=[FieldCondition(key="category", match=MatchValue(value="tech"))]),
limit=10
)
Pros:
- Filter integration excellent (pre-filter via indexed payload)
- Rust speed
- OSS Apache 2.0
- Hybrid search (dense + sparse)
5.2 Pinecone
Managed-only, very simple API.
# pinecone.init() was removed in the v3+ client; use the Pinecone class instead
from pinecone import Pinecone
pc = Pinecone(api_key="...")
index = pc.Index("docs")
index.upsert([("id1", [0.1, 0.2, ...], {"category": "tech"})])
results = index.query(
vector=[0.1, 0.2, ...],
filter={"category": "tech"},
top_k=10,
include_metadata=True
)
Pros:
- Zero ops
- Multi-tenant ready
- Serverless tier 2024
Cons:
- Closed source
- Vendor lock
- Cost at scale ($70+/month/pod)
5.3 Weaviate
GraphQL-first, OSS.
Special: built-in vectorizer modules (auto-embed on insert).
{
Get {
Document(
nearText: { concepts: ["machine learning"] }
where: { path: ["category"], operator: Equal, valueString: "tech" }
limit: 10
) { content category }
}
}
5.4 Milvus
Most feature-rich OSS. Built on Faiss/Knowhere.
Strengths:
- Multi-replica, multi-shard
- Multiple index types (HNSW, IVF, DiskANN, etc)
- Petabyte scale
5.5 Chroma
Lightweight, Python-first, popular in dev/notebook contexts.
import chromadb
client = chromadb.Client()
collection = client.create_collection("docs")
collection.add(documents=["text1", "text2"], ids=["1", "2"])
results = collection.query(query_texts=["query"], n_results=5)
Built-in default embedding model (sentence-transformers). Easy start.
5.6 Lance / LanceDB
Built on Apache Arrow, embedded mode. Like DuckDB for vectors.
import lancedb
db = lancedb.connect("./data")
table = db.create_table("docs", data=[...])
results = table.search([0.1, 0.2, ...]).limit(10).to_pandas()
5.7 Decision matrix
flowchart TD
    A[Vector DB needed] --> B{Already have Postgres?}
    B -->|Yes| C{<10M vectors?}
    C -->|Yes| D[pgvector]
    C -->|No| E[Dedicated vector DB]
    B -->|No| E
    E --> F{OSS pref?}
    F -->|Yes| G[Qdrant, Milvus, Weaviate, Chroma]
    F -->|No / fully managed| H[Pinecone]
    A --> I{Embedded?}
    I -->|Yes - app library| J[LanceDB, Chroma]
    style D fill:#c8e6c9
    style G fill:#c8e6c9
6. RAG Pipeline Production
6.1 Architecture
graph TB
    subgraph "Indexing phase"
        Docs[Documents] --> Chunker[Chunker]
        Chunker --> Embedder[Embed each chunk]
        Embedder --> Store[(Vector DB)]
    end
    subgraph "Query phase"
        Query[User query] --> QEmbed[Embed query]
        QEmbed --> Retrieve[Search vector DB]
        Query --> BM25[BM25 search]
        Retrieve --> Fusion[Hybrid fusion RRF]
        BM25 --> Fusion
        Fusion --> Rerank[Cross-encoder rerank]
        Rerank --> LLM[LLM with context]
        LLM --> Answer[Answer]
    end
    Store -.-> Retrieve
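The "Hybrid fusion RRF" step in the diagram merges the dense and BM25 rankings using only ranks, so the two score scales never need to be calibrated against each other. A minimal sketch of reciprocal rank fusion, assuming each ranking is a list of document ids ordered best first (the ids are made up; `k=60` is the commonly used constant):

```python
def reciprocal_rank_fusion(*rankings, k=60):
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], str(d)))

dense = ["d1", "d2", "d3"]    # hypothetical vector-search ranking
sparse = ["d3", "d1", "d4"]   # hypothetical BM25 ranking
fused = reciprocal_rank_fusion(dense, sparse)
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

Documents found by both retrievers (d1, d3) rise above documents found by only one, which is the behaviour hybrid search relies on.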
6.2 Chunking strategies
| Strategy | When |
|---|---|
| Fixed size (500 tokens) | Simple baseline |
| Sentence-aware | Better than fixed |
| Paragraph-aware | Natural breaks |
| Section/heading-aware | Code, docs with structure |
| Semantic chunking | Group similar sentences (slower) |
| Recursive | Try larger, split if needed |
Sliding window (overlap):
chunk1: tokens 0-500
chunk2: tokens 400-900 (100 overlap)
chunk3: tokens 800-1300
Overlap helps catch info at chunk boundaries.
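The sliding window above can be sketched as a short function over a token list. Here the tokens are just integers standing in for real tokenizer output, and the function name is illustrative:

```python
def sliding_window_chunks(tokens, size=500, overlap=100):
    """Split `tokens` into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

tokens = list(range(1300))
chunks = sliding_window_chunks(tokens, size=500, overlap=100)
spans = [(c[0], c[-1] + 1) for c in chunks]
print(spans)  # [(0, 500), (400, 900), (800, 1300)]
```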
6.3 Metadata enrichment
{
"id": "doc1_chunk5",
"content": "...",
"embedding": [...],
"metadata": {
"source_url": "https://...",
"title": "...",
"section": "Installation",
"author": "...",
"created_at": "2026-05-16",
"doc_type": "tutorial",
"language": "en"
}
}
Use metadata for:
- Filtering (date range, language, doc type)
- Citing sources
- Re-fetch context
6.4 Indexing pipeline
import openai  # reads OPENAI_API_KEY from the environment

# load_document, chunk_recursive and vector_db are placeholders for your own code
def index_document(doc_path):
    text = load_document(doc_path)
    chunks = chunk_recursive(text, max_size=500, overlap=50)
    # Batch embed: one request for all chunks of the document
    embeddings = openai.embeddings.create(
        model="text-embedding-3-small",
        input=[c.content for c in chunks]
    ).data
    # Upsert to vector DB, one record per chunk
    vector_db.upsert([
        {
            "id": f"{doc_path}#{i}",
            "vector": emb.embedding,
            "metadata": {**c.metadata, "doc_path": doc_path}
        }
        for i, (c, emb) in enumerate(zip(chunks, embeddings))
    ])
6.5 Query pipeline
import openai

# vector_db, bm25_search, reciprocal_rank_fusion, cross_encoder_rerank,
# build_prompt and llm are placeholders for your own components
def rag_query(question):
    # 1. Hybrid retrieval (embed the query with the same model used for indexing)
    q_embed = openai.embeddings.create(model="...", input=question).data[0].embedding
    dense_results = vector_db.search(q_embed, top_k=20)
    sparse_results = bm25_search(question, top_k=20)
    # 2. RRF fusion
    fused = reciprocal_rank_fusion(dense_results, sparse_results, k=60)
    # 3. Rerank
    candidates = fused[:50]
    reranked = cross_encoder_rerank(question, candidates)
    context_chunks = reranked[:5]
    # 4. LLM call with context
    prompt = build_prompt(question, context_chunks)
    answer = llm.complete(prompt)
    return answer, [c.metadata for c in context_chunks]  # for citation
6.6 Evaluation
Don’t deploy RAG without evals. Common framework: RAGAS, TruLens.
Metrics:
- Retrieval: hit rate, MRR, NDCG (does retrieved contain answer?)
- Generation: faithfulness, relevance, hallucination rate
- End-to-end: human eval samples
Build eval set early (50-200 Q&A pairs from real users).
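The retrieval metrics above need nothing more than the retrieved ids per question and the set of ids that actually contain the answer. A minimal sketch with a made-up three-question eval set:

```python
def hit_rate_at_k(results, relevant, k):
    """Share of questions with at least one relevant doc in the top k."""
    hits = sum(1 for retrieved, gold in zip(results, relevant) if gold & set(retrieved[:k]))
    return hits / len(results)

def mean_reciprocal_rank(results, relevant):
    """Average of 1 / rank of the first relevant doc (0 if none was retrieved)."""
    total = 0.0
    for retrieved, gold in zip(results, relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break
    return total / len(results)

# Hypothetical eval set: retrieved ids per question, and the gold ids.
results = [["a", "b", "c"], ["x", "y", "z"], ["m", "n", "o"]]
relevant = [{"a"}, {"z"}, {"q"}]

print(hit_rate_at_k(results, relevant, k=2))
print(hit_rate_at_k(results, relevant, k=3))
print(mean_reciprocal_rank(results, relevant))
```

Running the same functions before and after a change (new chunking, added reranker) gives a like-for-like comparison on the same eval set.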
6.7 Common improvements (in order of impact)
- Better chunking — biggest single improvement
- Better retriever — hybrid > dense only
- Reranker — adds 5-15% accuracy
- Query rewriting — expand/clarify before retrieve
- HyDE (hypothetical document embedding) — generate hypothetical answer, embed it instead of query
- Multi-query — generate multiple variants, retrieve each, union
- Self-RAG — model decides when to retrieve
7. Filtered Search at Scale
7.1 The problem
SELECT * FROM documents
WHERE org_id = 42 AND lang = 'en' AND created_at > '2026-01-01'
ORDER BY embedding <=> $1 LIMIT 10;
100M total docs, org 42 has 1M. Need filter + similarity.
7.2 Strategies
A. Pre-filter via partition or partial index: for low-cardinality filters (few distinct values)
-- One index per org
CREATE INDEX ON documents USING hnsw(embedding vector_cosine_ops) WHERE org_id = 42;
→ Works for a small number of orgs. Doesn't scale to 10K orgs.
B. Filter during HNSW traversal — modern vector DBs
- Qdrant: payload-aware HNSW
- pgvector 0.8+ iterative scan
- Pinecone: native filtering
C. Multi-tenancy partitioning
- Qdrant: collection per tenant if small per tenant
- Or single collection with filter on tenant_id
D. Per-tenant index
- Build separate HNSW per org
- Trade-off: memory grows linearly
7.3 Cardinality matters
graph LR
    A[Filter cardinality?] --> B{<1% rows match}
    B --> B1[Pre-filter brute-force OK]
    A --> C{50-100% rows match}
    C --> C1[Filter during traversal<br/>or post-filter]
    A --> D{Between}
    D --> D1[Iterative scan / native]
8. Cost Economics
8.1 Embedding cost
100K documents × 1000 tokens/doc = 100M tokens
× $0.02 per M tokens (OpenAI 3-small)
= $2 one-time embedding
Self-host BGE/E5: $0 marginal, ~30 docs/sec on GPU.
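The one-time embedding cost above generalizes to a one-line formula. A small sketch (the function name is illustrative; prices are the per-million-token figures from the table in section 2.1 and may have changed since):

```python
def embedding_cost_usd(num_docs, tokens_per_doc, price_per_million_tokens):
    """One-time cost of embedding a corpus through a hosted API."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens

small = embedding_cost_usd(100_000, 1000, 0.02)  # text-embedding-3-small
large = embedding_cost_usd(100_000, 1000, 0.13)  # text-embedding-3-large
print(f"3-small: ${small:.2f}, 3-large: ${large:.2f}")
```

Re-indexing after a model change costs the same amount again, which is worth budgeting for up front.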
8.2 Storage cost
100M vectors × 1536 dim × 4 bytes = 600 GB
pgvector (Postgres): same as DB storage
Pinecone: ~$60-200/month for p1 pod
Qdrant Cloud: ~$50-100/month at similar size
8.3 Query cost
RAG query:
- Embed query: ~$0.00001
- Vector search: ~$0 (compute already paid)
- LLM call: ~$0.005 (4K context, GPT-4o)
- Total: ~$0.005/query
10M queries/month = $50K/month at GPT-4o pricing
= $1500/month with GPT-4o-mini
= $500/month with self-host Llama
LLM dominates RAG cost. Optimize prompt size + model choice.
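To see how prompt size and model price drive the total, the per-query estimate can be written as a function. The token prices below are hypothetical placeholders chosen so that the result matches the ~$0.005/query figure above; substitute your provider's current price sheet.

```python
def llm_cost_per_query(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """USD cost of one LLM call given per-million-token prices."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

def monthly_cost_usd(queries_per_month, cost_per_query):
    return queries_per_month * cost_per_query

# Hypothetical prices: $1 per M input tokens, $4 per M output tokens.
per_query = llm_cost_per_query(
    input_tokens=4000, output_tokens=250,
    input_price_per_m=1.0, output_price_per_m=4.0,
)
monthly = monthly_cost_usd(10_000_000, per_query)
print(f"${per_query:.4f} per query, ${monthly:,.0f} per month")
```

Halving the context sent to the model removes a large share of the bill in this example, because input tokens account for most of the per-query cost.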
8.4 Optimization
- Embed with smaller model (3-small vs 3-large)
- Quantize vectors (half precision, PQ)
- Cache frequent queries
- Use smaller LLM with fewer context tokens
- Hybrid search reduces context size needed
9. Anti-patterns
| Pattern | Why bad | Fix |
|---|---|---|
| Use vector DB as primary store | No ACID, eventually consistent | Vector DB for search, real data elsewhere |
| Single huge chunk (entire doc) | Embedding “averages out” meaning | Chunk to 200-500 tokens |
| No metadata | Hard filter, no citation | Always include metadata |
| Vector-only retrieval | Misses exact keyword matches | Hybrid (BM25 + vector) |
| Embed with v1, query with v2 | Vectors incompatible | Re-index on model change |
| Reranker before retriever | Slow, expensive | Retrieve broad, rerank narrow |
| No eval set | Don’t know quality | Build eval early |
| 10K+ tokens per chunk | Embed quality degrades | Smaller chunks |
| Filter post-search top-10 | If filter sparse, no results | Pre-filter or use native filter |
| Same embedding for all use cases | One-size doesn’t fit | Different models for QA, code, multilingual |
10. Lab
10.1 Day 1: pgvector basics
docker run -d -p 5432:5432 -e POSTGRES_PASSWORD=lab pgvector/pgvector:pg16
PGPASSWORD=lab psql -h localhost -U postgres -c "CREATE EXTENSION vector;"
Build a small RAG with 100 docs. Embed manually via API, then search.
10.2 Day 2: HNSW tuning
Index 1M synthetic vectors. Tune m, ef_construction, ef_search. Measure recall vs latency.
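For the recall-vs-latency measurement, recall is computed against exact (brute-force) results for the same queries. A minimal harness sketch; `search_fn` is a placeholder for whatever function runs your pgvector query, and the id lists are made up:

```python
import time

def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k that the approximate search also returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

def mean_latency_ms(search_fn, queries):
    """Average wall-clock time per call to `search_fn`, in milliseconds."""
    start = time.perf_counter()
    for q in queries:
        search_fn(q)
    return (time.perf_counter() - start) * 1000 / len(queries)

exact = [1, 2, 3, 4, 5]    # ids from an exact scan
approx = [1, 2, 3, 9, 5]   # ids from the HNSW index
print(recall_at_k(approx, exact, k=5))  # 0.8
```

Repeat the measurement for each `ef_search` value and plot recall against latency to pick an operating point.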
10.3 Day 3: Hybrid search
Combine BM25 (Postgres FTS) + pgvector. Implement RRF fusion.
10.4 Day 4: Qdrant setup
Deploy Qdrant, migrate 1M vectors from pgvector. Compare filtered search performance.
10.5 Day 5: Real RAG
Build production-ish RAG over your team docs / GitHub README / personal notes. Use LangChain or DIY.
10.6 Day 6: Reranking
Add Cohere Rerank or local cross-encoder. Measure accuracy delta.
10.7 Day 7: Eval
Build 50 Q&A eval set. Measure hit rate at top-5, top-10 with and without rerank.
11. Self-check
- Embedding fundamentals: what does a vector represent? How do you compare 2 words?
- HNSW vs IVF: pros/cons of each?
- pgvector 0.7+: 3 new features?
- Filtered search: 4 strategies?
- Matryoshka embedding: concept + use case?
- RAG pipeline: 5 stages?
- Chunking strategy: why does the sliding window matter?
- Hybrid search: why does it beat vector-only?
- Reranker: at which stage does it go? Why?
- RAG cost: what is the dominant cost?
12. Next
Phase 4 complete. Next lesson: Tuan-16-ORM-CQRS-Multi-Tenancy (application patterns).
Week 15 complete. Embedding is the new database column. Updated: 2026-05-16