Week 13: Search Engines (Elasticsearch / OpenSearch)
“Postgres uses a B-tree to find rows by key. Elasticsearch uses an inverted index to find rows by content. A different paradigm means a different toolkit. The same data may need both.”
Tags: database elasticsearch opensearch lucene search full-text
Duration: 7 days (5-7h/day)
Prerequisites: Tuan-03-Indexing-Mastery (GIN, trigram baseline)
Related: Tuan-15-Vector-DB-AI (hybrid search) · Case-Design-Data-AI-RAG
1. Context & Why
1.1 ES vs OpenSearch (license fork 2021)
timeline
    title Elasticsearch / OpenSearch
    2010 : First Elasticsearch release - Apache 2.0
    2021 : Elastic moves to SSPL/Elastic License v2 - not OSS
         : AWS forks → OpenSearch
    2024 : Elastic adds AGPLv3 as a third option (AGPLv3 + ELv2 + SSPL) - OSS again
         : OpenSearch moves under the Linux Foundation
Status 2024-2026:
- Elasticsearch 8.x: triple license (AGPLv3, ELv2, or SSPL)
- OpenSearch 2.x: Apache 2.0
- Cloud: AWS OpenSearch Service (managed OS), Elastic Cloud (managed ES)
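- Core APIs (indexing, search, Query DSL, aggregations) are still largely compatible because OpenSearch forked from Elasticsearch 7.10.2; features added after the fork (vector search, lifecycle management, security) have diverged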
Choose:
- New project, OSS preference, AWS-centric → OpenSearch
- Existing ES, need latest Elastic features → Elasticsearch
- ML/vector advanced → ES leading
1.2 Goals for the week
- Inverted index — fundamental mental model
- BM25 scoring + analyzer pipeline
- Mapping design: text vs keyword, multi-field
- Query DSL: query vs filter context
- Aggregations
- Index lifecycle: hot-warm-cold-frozen
- Cluster topology + sharding strategy
- Postgres FTS vs ES: where to draw the line
- Hybrid search (BM25 + vector)
1.3 References
- Elasticsearch: The Definitive Guide (free online, dated but core)
- Elastic Documentation — https://www.elastic.co/guide
- OpenSearch Documentation — https://opensearch.org/docs/
- Lucene in Action — Manning (old but Lucene internals)
- Tim Bray — On Search blog series
2. Inverted Index — Fundamental Data Structure
2.1 Concept
Forward index (document → content; a Postgres B-tree on a text column indexes whole values, not individual words):
doc1 → "the quick brown fox"
doc2 → "the lazy brown dog"
Lookup “fox” → must scan all docs.
Inverted index (Lucene):
"the" → [doc1, doc2]
"quick" → [doc1]
"brown" → [doc1, doc2]
"fox" → [doc1]
"lazy" → [doc2]
"dog" → [doc2]
Lookup “fox” → direct: doc1.
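The mapping above can be sketched in a few lines of Python. This is only an illustration of the data structure, assuming a naive lowercase + whitespace tokenizer; `build_inverted_index` is a hypothetical helper, not a Lucene API.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of doc IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive tokenizer (assumption)
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy brown dog",
}
index = build_inverted_index(docs)
print(index["fox"])    # ['doc1']
print(index["brown"])  # ['doc1', 'doc2']
```

A lookup is now a dictionary access on the term instead of a scan over every document.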
2.2 Structure detail
graph TB
    subgraph "Inverted index"
        T1[Term dictionary<br/>FST - Finite State Transducer]
        T2[Posting lists<br/>doc IDs + positions + offsets]
        T3[Term frequencies]
        T4[Field norms]
    end
    Query[search 'quick fox'] -->|tokenize| Tokens[tokens: 'quick', 'fox']
    Tokens --> T1
    T1 --> T2
    T2 --> Score[BM25 scoring]
Each segment in Lucene contains:
- Term dictionary (FST for prefix-search efficiency)
- Posting lists (compressed)
- Skip lists for fast intersection
- Norms (field length normalization)
- DocValues (column-oriented for sort/aggs)
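To see why sorted posting lists plus a way to skip ahead matter, here is a rough sketch of intersecting two posting lists for an AND query. It assumes doc IDs are plain sorted integers and uses `bisect_left` as a stand-in for a skip list; Lucene's real implementation works on compressed blocks.

```python
from bisect import bisect_left

def intersect(short, long):
    """Intersect two sorted doc-ID lists, jumping ahead in the longer one."""
    result = []
    pos = 0
    for doc_id in short:
        # bisect_left plays the role of a skip: jump to the first entry
        # >= doc_id instead of stepping through every posting.
        pos = bisect_left(long, doc_id, pos)
        if pos == len(long):
            break
        if long[pos] == doc_id:
            result.append(doc_id)
    return result

quick = [1, 7, 42]
brown = [1, 2, 3, 5, 7, 8, 13, 21, 34, 55]
print(intersect(quick, brown))  # [1, 7]
```

Driving the loop from the shorter list keeps the work proportional to the rarer term, which is the reason rare terms make AND queries cheap.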
2.3 Segment + commit
graph LR
    Index[Index document] --> Buffer[In-memory buffer]
    Buffer -->|refresh 1s default| Segment1[Segment 1 - searchable]
    Segment1 --> Segment2[Segment 2]
    Segment2 --> Merge[Merge into bigger segment]
    Merge --> Disk[(Disk segment)]
- Refresh (default 1s): in-memory buffer → segment available for search
- Commit/flush: segments persisted, transaction log truncated
- Merge: combine small segments → larger (better search perf)
Trade-off: refresh interval. Default 1s “near real-time”. Tune to 30s for bulk indexing → less segment churn → faster.
PUT /my-index/_settings
{ "index": {"refresh_interval": "30s"} }
3. BM25 Scoring
3.1 Formula
score(d, q) = sum for each term t in q:
IDF(t) × (TF(t,d) × (k1+1)) / (TF(t,d) + k1 × (1 - b + b × |d|/avgdl))
Components:
- TF (term frequency): how often term appears in doc
- IDF (inverse document frequency): rarer term = higher weight
- Doc length normalization: shorter docs with same TF score higher
- k1 (default 1.2): term frequency saturation
- b (default 0.75): length normalization
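The formula can be checked with a small scorer. This is a sketch under assumptions: documents are pre-tokenized lists, and IDF uses the Lucene-style form ln(1 + (N - n + 0.5) / (n + 0.5)), which the notes above do not spell out. `bm25_score` is a hypothetical name.

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized doc against the query, following the formula above."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # docs containing the term
        if df == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))  # assumed IDF form
        tf = doc.count(term)
        norm = 1 - b + b * len(doc) / avgdl  # length normalization
        score += idf * (tf * (k1 + 1)) / (tf + k1 * norm)
    return score

corpus = [
    "the quick brown fox".split(),
    "the lazy brown dog".split(),
    "the fox".split(),
]
scores = [bm25_score(["fox"], d, corpus) for d in corpus]
print(scores)
```

With the same TF for "fox", the two-word document scores higher than the four-word one, and the document without the term scores zero.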
3.2 Intuition
- Term in many docs = common (low IDF) = less informative
- Term repeated in doc = relevant (high TF) but with diminishing returns (k1)
- Short doc matches better than long doc with same TF
3.3 Compare to TF-IDF
BM25 is improved TF-IDF:
- Term frequency saturation (k1 cap)
- Length normalization (b)
- Better empirical results
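A quick way to see the saturation point is to compare the term-frequency part of each scheme in isolation. The sketch below assumes the simplest TF-IDF variant (raw term frequency) and a document of exactly average length, so the b term drops out; both function names are hypothetical.

```python
def tfidf_tf(tf):
    """Raw term frequency: grows without bound (assumed TF-IDF variant)."""
    return float(tf)

def bm25_tf(tf, k1=1.2):
    """BM25 term-frequency factor for an average-length doc: bounded by k1 + 1."""
    return tf * (k1 + 1) / (tf + k1)

for tf in (1, 2, 5, 20, 100):
    print(tf, tfidf_tf(tf), round(bm25_tf(tf), 3))
```

However many times the term repeats, the BM25 factor never reaches k1 + 1 (2.2 with the default), which is what blunts keyword stuffing.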
BM25 has been the default similarity since Elasticsearch 5.0, and OpenSearch inherited that default.
3.4 Custom similarity
PUT /my-index
{
"settings": {"similarity": {"my_bm25": {"type": "BM25", "k1": 1.5, "b": 0.5}}},
"mappings": {"properties": {"content": {"type": "text", "similarity": "my_bm25"}}}
}
Tune k1 and b for the domain (e.g., academic papers vs news articles); a lower b weakens the length penalty.
4. Analyzer Pipeline
4.1 Steps
graph LR
    Input["The Quick Brown Foxes"] --> CharFilter[Char filter<br/>HTML strip, etc]
    CharFilter --> Tokenizer[Tokenizer<br/>whitespace, standard, etc]
    Tokenizer --> TokenFilter1[Token filter 1<br/>lowercase]
    TokenFilter1 --> TokenFilter2[Token filter 2<br/>stop words]
    TokenFilter2 --> TokenFilter3[Token filter 3<br/>stemming]
    TokenFilter3 --> Output["[quick, brown, fox]"]
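The same pipeline can be mimicked in plain Python to make the stage order concrete. This is a toy sketch: the stop-word list is a tiny illustrative set and `toy_stem` only strips plural endings, whereas real analyzers use Porter/Snowball stemmers. All function names are hypothetical.

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "of"}  # tiny illustrative list

def strip_html(text):
    """Char filter: replace tags with spaces before tokenizing."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Tokenizer: split into runs of word characters."""
    return re.findall(r"\w+", text)

def lowercase(tokens):
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def toy_stem(tokens):
    """Crude plural stripping only; not a real stemming algorithm."""
    out = []
    for t in tokens:
        if t.endswith("es") and len(t) > 4:
            t = t[:-2]
        elif t.endswith("s") and len(t) > 3:
            t = t[:-1]
        out.append(t)
    return out

def analyze(text):
    tokens = tokenize(strip_html(text))
    for token_filter in (lowercase, remove_stop_words, toy_stem):
        tokens = token_filter(tokens)
    return tokens

print(analyze("<b>The Quick Brown Foxes</b>"))  # ['quick', 'brown', 'fox']
```

Order matters: lowercasing must run before stop-word removal, otherwise "The" would survive the filter.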
4.2 Built-in analyzers
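- standard: default, Unicode-aware
- simple: lowercase + split on non-letters
- whitespace: split on whitespace only
- keyword: single token (whole input)
- english, french, german, etc: language-specific stop words and stemming. Vietnamese is not among the built-in language analyzers; it needs a community analysis plugin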
4.3 Custom analyzer
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": ["lowercase", "asciifolding", "english_stop", "english_stemmer"]
}
},
"filter": {
"english_stop": {"type": "stop", "stopwords": "_english_"},
"english_stemmer": {"type": "stemmer", "language": "english"}
}
}
}
}
4.4 Test analyzer
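// my_analyzer lives in the index settings, so call _analyze on that index (my-index here)
POST /my-index/_analyze
{
"analyzer": "my_analyzer",
"text": "The Quick Brown Foxes Jumping!"
}
// Output: [quick, brown, fox, jump]
4.5 Multi-field index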
Same text indexed multiple ways:
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "english",
"fields": {
"raw": {"type": "keyword"}, // exact match, sort, aggregate
"ngram": {"type": "text", "analyzer": "edge_ngram_analyzer"} // autocomplete
}
}
}
}
}
Query:
- title → analyzed full-text
- title.raw → exact
- title.ngram → prefix match
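What the title.ngram sub-field stores can be sketched as follows. The parameter names min_gram and max_gram mirror the edge_ngram tokenizer settings; the values and the `edge_ngrams` helper itself are assumptions for illustration.

```python
def edge_ngrams(token, min_gram=2, max_gram=10):
    """Return the prefixes of token between min_gram and max_gram characters."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]

grams = edge_ngrams("elastic", max_gram=5)
print(grams)  # ['el', 'ela', 'elas', 'elast']

# At query time the typed prefix is matched as an exact term against the grams.
print("ela" in grams)  # True
```

Because every prefix is indexed as its own term, autocomplete becomes a cheap exact-term lookup instead of a wildcard scan, at the cost of a larger index.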
5. Mapping & Field Types
5.1 Common types
| Type | Use |
|---|---|
| text | Full-text, analyzed |
| keyword | Exact, aggregate, sort |
| long, integer, short, byte | Integers |
| double, float, half_float | Floats |
| date | ISO date or epoch |
| boolean | True/false |
| nested | Array of objects, indexed as independent docs |
| object | Nested JSON, flattened internally |
| geo_point | Lat/lon |
| geo_shape | Polygons |
| ip | IP addresses |
| dense_vector | Embeddings (vector search) |
| sparse_vector | Sparse encoding (ELSER) |
5.2 text vs keyword
"name": {"type": "text"} // tokenized, full-text search
"status": {"type": "keyword"} // exact, aggregate
"email": {"type": "keyword"} // exact matchEmail in text analyzer → split on @ → bad. Use keyword for IDs, emails, status.
5.3 Dynamic mapping
ES auto-detects fields on first document. Risk: wrong type guessed.
PUT /my-index
{
"mappings": {
"dynamic": "strict", // reject unknown fields
"properties": { ... }
}
}
dynamic: strict is recommended for production schemas.
5.4 nested vs object
// object: flattened
{"comments": [{"author": "alice", "text": "..."}]}
// → "comments.author": ["alice"], "comments.text": ["..."]
// Loses relationship between elements
// nested: separate docs internally
"comments": {"type": "nested", "properties": {"author": {...}, "text": {...}}}Use nested if querying for combinations within array (e.g., comments by alice with text containing X).
6. Query DSL
6.1 query vs filter context
{
"query": {
"bool": {
"must": [{"match": {"title": "elasticsearch"}}], // scored
"filter": [{"term": {"status": "published"}}] // not scored, cacheable
}
}
}
- must / should: query context, contribute to the score
- filter / must_not: filter context, no scoring, cacheable
Filters are faster for repeated queries because no score is computed and their results can be cached.
6.2 Common queries
// match: full-text analyzed
{"match": {"title": "quick brown fox"}}
// match_phrase: exact phrase
{"match_phrase": {"title": "quick brown fox"}}
// term: exact (no analysis)
{"term": {"status": "active"}}
// terms: multi-value
{"terms": {"tags": ["rust", "db"]}}
// range
{"range": {"created_at": {"gte": "2026-01-01", "lt": "2027-01-01"}}}
// exists
{"exists": {"field": "email"}}
// wildcard (slow)
{"wildcard": {"username": "ali*"}}
// prefix
{"prefix": {"username": "ali"}}
// fuzzy (typo tolerance)
{"fuzzy": {"title": {"value": "elsticsearch", "fuzziness": "AUTO"}}}
// bool combinator
{"bool": {
"must": [...], "should": [...],
"filter": [...], "must_not": [...],
"minimum_should_match": 1
}}
6.3 Multi-match
Across multiple fields:
{"multi_match": {
"query": "elasticsearch tutorial",
"fields": ["title^3", "body", "tags^2"], // boost titles 3x, tags 2x
"type": "best_fields" // or "most_fields", "cross_fields", "phrase"
}}
6.4 Function score / rank_feature
Boost based on field value (e.g., recency, popularity):
{"function_score": {
"query": {...},
"functions": [
{"field_value_factor": {"field": "popularity", "modifier": "log1p"}},
{"gauss": {"created_at": {"origin": "now", "scale": "10d"}}}
],
"score_mode": "sum",
"boost_mode": "multiply"
}}
6.5 Highlight
{
"query": {"match": {"body": "elasticsearch"}},
"highlight": {
"fields": {"body": {"pre_tags": ["<mark>"], "post_tags": ["</mark>"]}}
}
}
7. Aggregations
7.1 Metric aggregations
{
"size": 0,
"aggs": {
"avg_price": {"avg": {"field": "price"}},
"total_revenue": {"sum": {"field": "amount"}},
"max_score": {"max": {"field": "score"}},
"unique_users": {"cardinality": {"field": "user_id"}}
}
}
7.2 Bucket aggregations
{
"aggs": {
"by_category": {
"terms": {"field": "category.keyword", "size": 10}
},
"by_month": {
"date_histogram": {"field": "created_at", "calendar_interval": "month"}
},
"by_price_range": {
"range": {"field": "price", "ranges": [{"to": 100}, {"from": 100, "to": 1000}, {"from": 1000}]}
}
}
}
7.3 Nested aggregations
{
"aggs": {
"by_category": {
"terms": {"field": "category.keyword"},
"aggs": {
"avg_price": {"avg": {"field": "price"}},
"top_products": {"top_hits": {"size": 3}}
}
}
}
}
7.4 Composite agg for pagination
{
"aggs": {
"all_categories": {
"composite": {
"size": 1000,
"sources": [{"category": {"terms": {"field": "category.keyword"}}}]
}
}
}
}
Paginate through all buckets with a cursor: pass the after_key from each response as after in the next request.
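A client-side paging loop might look like the sketch below. It assumes a `search` callable that takes a request body and returns the response as a dict; `iter_composite_buckets` and `fake_search` are hypothetical, and the stub stands in for a real cluster so the loop can run on its own.

```python
def iter_composite_buckets(search, agg_name, composite, page_size=1000):
    """Yield every bucket of a composite aggregation, page by page."""
    after = None
    while True:
        page_request = dict(composite, size=page_size)
        if after is not None:
            page_request["after"] = after  # resume after the last seen key
        body = {"size": 0, "aggs": {agg_name: {"composite": page_request}}}
        result = search(body)["aggregations"][agg_name]
        yield from result["buckets"]
        after = result.get("after_key")
        if after is None or not result["buckets"]:
            return

CATEGORIES = ["books", "games", "music", "toys", "video"]

def fake_search(body):
    """Stub that imitates the response shape (assumption, not a client API)."""
    comp = body["aggs"]["all_categories"]["composite"]
    after = comp.get("after", {}).get("category")
    remaining = [c for c in CATEGORIES if after is None or c > after]
    page = remaining[: comp["size"]]
    agg = {"buckets": [{"key": {"category": c}, "doc_count": 1} for c in page]}
    if page:
        agg["after_key"] = {"category": page[-1]}
    return {"aggregations": {"all_categories": agg}}

sources = {"sources": [{"category": {"terms": {"field": "category.keyword"}}}]}
keys = [
    b["key"]["category"]
    for b in iter_composite_buckets(fake_search, "all_categories", sources, page_size=2)
]
print(keys)  # ['books', 'games', 'music', 'toys', 'video']
```

In real use, `search` would wrap whatever client call sends the body to the cluster; the loop itself only depends on the after_key contract.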
8. Cluster Topology
8.1 Roles
graph TB
    subgraph "Cluster"
        Master[Master eligible<br/>cluster state]
        Master2[Master eligible]
        Master3[Master eligible]
        Data1[Data hot<br/>SSD, recent data]
        Data2[Data hot]
        DataWarm1[Data warm<br/>HDD, older]
        DataFrozen[Data frozen<br/>snapshot-only]
        Ingest1[Ingest node<br/>preprocessing]
        Coord[Coordinating only<br/>routing]
    end
    Client --> Coord
    Coord --> Data1
    Coord --> Data2
    Coord --> DataWarm1
For clusters with more than about 10 nodes, use dedicated roles; smaller clusters can combine roles on the same nodes.
8.2 Sharding strategy
Index = N primary shards + R replicas each.
- Primary shards: write throughput
- Replicas: read throughput + fault tolerance
Sizing:
- Shard size: 10-50 GB ideal
- Too few large shards → uneven load
- Too many small shards → overhead
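The sizing rule reduces to simple arithmetic. The sketch below assumes a 30 GB target, picked from the middle of the 10-50 GB range above; `primary_shard_count` is a hypothetical helper and ignores growth, replicas, and node count.

```python
import math

def primary_shard_count(total_gb, target_shard_gb=30):
    """Number of primary shards so each lands near the target size."""
    return max(1, math.ceil(total_gb / target_shard_gb))

shards = primary_shard_count(400)
print(shards)                  # 14
print(round(400 / shards, 1))  # 28.6 GB per shard
```

For time-based data, apply the same arithmetic per index (for example per day) rather than to the whole dataset.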
PUT /my-index
{"settings": {"number_of_shards": 5, "number_of_replicas": 1}}
number_of_shards cannot be changed in place after creation. Plan ahead, or use the split/shrink APIs or a reindex into a new index.
8.3 Document routing
Default: hash(_id) % primary_shards. Custom routing:
POST /my-index/_doc/1?routing=user_42
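{...}
Use case: multi-tenant, where all of a tenant's data lives on the same shard so a query touches one shard. Pass the same routing value on get and search requests; otherwise lookups by ID can miss and searches fan out to every shard.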
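The effect of a shared routing value can be sketched with any stable hash. Here `zlib.crc32` is a stand-in chosen for illustration (Elasticsearch uses Murmur3 internally), and `shard_for` is a hypothetical helper.

```python
import zlib

def shard_for(routing_value, num_primary_shards):
    """Pick a shard from the routing value (stand-in hash, not Murmur3)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards

NUM_SHARDS = 5
doc_ids = ("1", "2", "3", "4")

# Default routing: the document _id is the routing value.
by_id = {doc_id: shard_for(doc_id, NUM_SHARDS) for doc_id in doc_ids}

# Custom routing: every document of tenant user_42 shares one routing value.
by_tenant = {doc_id: shard_for("user_42", NUM_SHARDS) for doc_id in doc_ids}

print(by_id)
print(by_tenant)
```

The modulo also shows why the primary shard count is fixed at creation: changing the divisor would send existing documents to different shards.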
8.4 Allocation awareness
Spread replicas across racks/AZs:
cluster.routing.allocation.awareness.attributes: zone
node.attr.zone: us-east-1a
This prevents a primary and its replica from being allocated in the same AZ.
9. Index Lifecycle Management (ILM)
9.1 Hot-Warm-Cold-Frozen
graph LR
    Hot[Hot tier<br/>SSD<br/>recent writes/reads<br/>0-7 days] --> Warm[Warm tier<br/>HDD/SSD<br/>older reads<br/>7-30 days]
    Warm --> Cold[Cold tier<br/>HDD<br/>searchable snapshots<br/>30-90 days]
    Cold --> Frozen[Frozen tier<br/>S3 snapshots<br/>partial mount<br/>>90 days]
    Frozen --> Delete[Delete<br/>policy]
Defined via ILM policy:
PUT _ilm/policy/logs_policy
{
"policy": {
"phases": {
"hot": {"actions": {"rollover": {"max_size": "50gb", "max_age": "1d"}}},
"warm": {"min_age": "7d", "actions": {"forcemerge": {"max_num_segments": 1}, "shrink": {"number_of_shards": 1}}},
"cold": {"min_age": "30d", "actions": {"searchable_snapshot": {"snapshot_repository": "s3_repo"}}},
"frozen": {"min_age": "90d", "actions": {"searchable_snapshot": {...}}},
"delete": {"min_age": "365d", "actions": {"delete": {}}}
}
}
}
9.2 Rollover pattern
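PUT /logs-2026-05-16-000001
{
"aliases": {"logs": {"is_write_index": true}}
}
The app writes to the logs alias. ILM rolls over to a new index when a threshold is hit. The index (usually via an index template) must also set index.lifecycle.name and index.lifecycle.rollover_alias.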
9.3 Searchable snapshots
ES/OS feature: data lives in S3 (cheap), searchable without restore. Cold/frozen tier core feature.
Cost saving: roughly 80-90% on storage versus hot SSD for old data, as a rule of thumb; the real figure depends on provider pricing and replica count.
10. Postgres FTS vs Elasticsearch
| | Postgres FTS | Elasticsearch |
|---|---|---|
| Setup | Built in (pg_trgm is an optional extension) | Separate cluster |
| Operational | Same as Postgres | Additional system |
| Scale (data) | Up to ~100M docs | Billions |
| Search relevance | Decent (ts_rank) | Best-in-class (BM25 + functions) |
| Language support | Few stemmers | Many languages, custom analyzers |
| Aggregations | SQL GROUP BY | Rich aggregations |
| Updates | Real-time | Near-real-time (1s refresh) |
| Cost | $0 (already paying for DB) | New cost |
| Hybrid with structured data | Native | Need lookup back to DB |
Decision matrix:
flowchart TD
    A[Need search?] --> B{Volume of indexed docs?}
    B -->|<1M, structured| C[Postgres FTS<br/>tsvector + GIN]
    B -->|1M-50M, simple| D[Postgres FTS + pg_trgm]
    B -->|>50M or complex queries| E[Elasticsearch / OpenSearch]
    B -->|Need ML, ranking, analytics| E
    B -->|AI / RAG vector search| F[Vector DB or ES with vectors]
    style C fill:#c8e6c9
    style D fill:#c8e6c9
    style E fill:#fff9c4
11. Hybrid Search (BM25 + Vector) 2024-2026
11.1 Concept
Combine lexical (BM25) + semantic (embedding similarity) for best results.
graph LR
    Query[Query] --> BM25[BM25 search<br/>exact, recent]
    Query --> Embedding[Embed query]
    Embedding --> VecSearch[Vector search<br/>semantic similar]
    BM25 --> Merge[Hybrid merge<br/>RRF or weighted]
    VecSearch --> Merge
    Merge --> Final[Final ranked results]
11.2 Reciprocal Rank Fusion (RRF)
score(doc) = sum over rankers r: 1 / (k + rank_r(doc))
Default k=60. Combine top-N from each ranker, weight by inverse rank.
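The fusion step is small enough to write out. This sketch assumes each ranker returns an ordered list of doc IDs (best first) and that ranks start at 1; `rrf` is a hypothetical helper, not a client API.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse ranked lists: score(doc) = sum of 1 / (k + rank) over rankers."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bm25_top = ["d1", "d2", "d3"]
vector_top = ["d3", "d1", "d4"]
fused = rrf([bm25_top, vector_top])
print([doc_id for doc_id, _ in fused])  # ['d1', 'd3', 'd2', 'd4']
```

Only ranks are used, never raw scores, so BM25 and cosine similarity can be combined without normalizing their scales.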
11.3 ES Hybrid query (8.x)
{
"knn": {
"field": "embedding",
"query_vector": [...],
"k": 10,
"num_candidates": 100
},
"query": {"match": {"title": "query text"}},
"rank": {"rrf": {"window_size": 50, "rank_constant": 20}}
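}
Elasticsearch 8.x applies RRF when the rank section is present (newer 8.x releases express the same thing with the retriever API). OpenSearch offers a comparable hybrid query whose sub-query scores are combined in a search pipeline.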
11.4 With pgvector (Postgres-side)
-- Lexical rank via tsvector (ts_rank, which is not true BM25) + vector similarity via pgvector
WITH bm25 AS (
SELECT id, ts_rank(search_vector, query) AS score
-- plainto_tsquery accepts plain words; to_tsquery('iphone case') is a syntax error
FROM products, plainto_tsquery('iphone case') AS query
WHERE search_vector @@ query
ORDER BY score DESC LIMIT 100
),
vec AS (
SELECT id, 1 - (embedding <=> '[...]'::vector) AS score
FROM products ORDER BY embedding <=> '[...]'::vector LIMIT 100
),
fused AS (
-- RRF: 1 / (k + rank) with k = 60, computed per ranker
SELECT id, 1.0/(60 + rank() OVER (ORDER BY score DESC)) AS rrf_score
FROM bm25
UNION ALL
SELECT id, 1.0/(60 + rank() OVER (ORDER BY score DESC)) FROM vec
)
SELECT id, sum(rrf_score) AS hybrid_score
FROM fused GROUP BY id ORDER BY hybrid_score DESC LIMIT 10;
Tuan-15-Vector-DB-AI goes deeper.
12. Anti-patterns
| Pattern | Why bad | Fix |
|---|---|---|
| ES as primary DB | No ACID, eventually consistent | Postgres + ES copy |
| ES as join engine | Cross-index join awkward | Denormalize |
| 1 huge index forever | Slow, hard manage | Time-based rollover |
| Default mapping for important fields | Wrong type guessed | Explicit mapping |
| Sorting/aggregating on a text field | Disabled by default; enabling fielddata is memory-heavy | Multi-field with a keyword sub-field |
| dynamic: true in prod | Mapping explosion | dynamic: strict |
| 1 shard 500GB | Slow recovery, uneven load | Split the index or reindex into more shards |
| Refresh every 1s for bulk | Many small segments | Set 30s during bulk |
| Forgot replicas | Data loss on node fail | At least 1 replica |
| No backup | Cluster fail = lose | Snapshot to S3 daily |
13. Lab
13.1 Day 1: Setup
docker run -d --name es -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" elasticsearch:8.13.0
curl http://localhost:9200/
Index sample data, run basic queries.
13.2 Day 2: Mapping
Design mapping for product catalog. text vs keyword. Multi-field for sort + search.
13.3 Day 3: Query DSL
Practice 10 query types: match, term, range, bool, fuzzy, prefix, wildcard, multi_match, function_score, nested.
13.4 Day 4: Aggregations
Build analytics queries: top categories, sales over time, avg price by category, percentiles.
13.5 Day 5: Cluster
3-node cluster. Add/remove nodes. Watch shard rebalance.
13.6 Day 6: ILM
Setup index alias + rollover + ILM policy. Watch indices move through phases.
13.7 Day 7: Hybrid search
Combine BM25 + dense_vector. Use sentence-transformers or OpenAI embeddings.
14. Self-check
- Inverted index vs forward index: can you draw an example?
- BM25 vs TF-IDF: what are the main differences?
- text vs keyword field: when do you pick each?
- query context vs filter context: what is the difference and why does it matter?
- Multi-field mapping: an example use case?
- Shard size sweet spot? Why should shards not be huge?
- Hot-warm-cold-frozen tiering: where do the cost savings come from?
- Postgres FTS vs ES: 3 reasons to pick ES?
- RRF hybrid search: what is the formula?
- Index alias + rollover: how does the pattern work?
15. Next
Next lesson: Tuan-14-OLAP-Columnar-ClickHouse (analytical workloads).
Week 13 complete. Search != Query. Updated: 2026-05-16