Week 13: Search Engines (Elasticsearch / OpenSearch)

“Postgres uses a B-tree to find rows by key. Elasticsearch uses an inverted index to find documents by content. Different paradigm = different toolkit. The same data may need both.”

Tags: database elasticsearch opensearch lucene search full-text
Duration: 7 days (5-7h/day)
Prerequisites: Tuan-03-Indexing-Mastery (GIN, trigram baseline)
Related: Tuan-15-Vector-DB-AI (hybrid search) · Case-Design-Data-AI-RAG

1. Context & Why

1.1 ES vs OpenSearch (license fork 2021)

timeline
    title Elasticsearch / OpenSearch
    2010 : Elasticsearch first public release - Apache 2.0
    2021 : Elastic moves to SSPL/Elastic License v2 - not OSS
         : AWS forks → OpenSearch
    2024 : Elastic adds an AGPLv3 option (OSS again) - triple license AGPLv3 / ELv2 / SSPL
         : OpenSearch moves under the Linux Foundation

Status 2024-2026:

Elastic 8.x: triple license (AGPLv3 / ELv2 / SSPL)
OpenSearch 2.x: Apache 2.0
Cloud: AWS OpenSearch Service (managed OS), Elastic Cloud (managed ES)
Core APIs largely compatible (OpenSearch forked from ES 7.10); newer features diverge

Choose:

New project, OSS preference, AWS-centric → OpenSearch
Existing ES, need latest Elastic features → Elasticsearch
ML/vector advanced → ES leading

1.2 Goals for the week

Inverted index — fundamental mental model
BM25 scoring + analyzer pipeline
Mapping design: text vs keyword, multi-field
Query DSL: query vs filter context
Aggregations
Index lifecycle: hot-warm-cold-frozen
Cluster topology + sharding strategy
Postgres FTS vs ES: where to draw the line
Hybrid search (BM25 + vector)

1.3 References

Elasticsearch: The Definitive Guide (free online, dated but core)
Elastic Documentation — https://www.elastic.co/guide
OpenSearch Documentation — https://opensearch.org/docs/
Lucene in Action — Manning (old but Lucene internals)
Tim Bray — On Search blog series

2. Inverted Index — Fundamental Data Structure

2.1 Concept

Forward index (doc → text; a Postgres B-tree on a text column only helps with whole-value or prefix lookups):

doc1 → "the quick brown fox"
doc2 → "the lazy brown dog"

Lookup “fox” → must scan all docs.

Inverted index (Lucene):

"the"   → [doc1, doc2]
"quick" → [doc1]
"brown" → [doc1, doc2]
"fox"   → [doc1]
"lazy"  → [doc2]
"dog"   → [doc2]

Lookup “fox” → direct: doc1.
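The two-document example above can be reproduced in a few lines. This is a minimal sketch, assuming a naive lowercase + whitespace tokenizer (Lucene's real analysis chain is covered in section 4); `build_inverted_index` is a hypothetical helper name.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of doc IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer (assumption)
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {"doc1": "the quick brown fox", "doc2": "the lazy brown dog"}
index = build_inverted_index(docs)

print(index["fox"])    # ['doc1']
print(index["brown"])  # ['doc1', 'doc2']
```

A term lookup is a single dictionary access; no document text is scanned at query time.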

2.2 Structure detail

graph TB
    subgraph "Inverted index"
        T1[Term dictionary<br/>FST - Finite State Transducer]
        T2[Posting lists<br/>doc IDs + positions + offsets]
        T3[Term frequencies]
        T4[Field norms]
    end

    Query[search 'quick fox'] -->|tokenize| Tokens[tokens: 'quick', 'fox']
    Tokens --> T1
    T1 --> T2
    T2 --> Score[BM25 scoring]

Each segment in Lucene contains:

Term dictionary (FST for prefix-search efficiency)
Posting lists (compressed)
Skip lists for fast intersection
Norms (field length normalization)
DocValues (column-oriented for sort/aggs)

2.3 Segment + commit

graph LR
    Index[Index document] --> Buffer[In-memory buffer]
    Buffer -->|refresh 1s default| Segment1[Segment 1 - searchable]
    Segment1 --> Segment2[Segment 2]
    Segment2 --> Merge[Merge into bigger segment]
    Merge --> Disk[(Disk segment)]

Refresh (default 1s): in-memory buffer → segment available for search
Commit/flush: segments persisted, transaction log truncated
Merge: combine small segments → larger (better search perf)

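Trade-off: the refresh interval. The default of 1s gives “near real-time” search. Raising it to 30s during bulk indexing produces fewer, larger segments (less segment churn), so ingestion is faster, at the cost of new documents taking longer to become searchable.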

PUT /my-index/_settings
{ "index": {"refresh_interval": "30s"} }

3. BM25 Scoring

3.1 Formula

score(d, q) = sum for each term t in q:
    IDF(t) × (TF(t,d) × (k1+1)) / (TF(t,d) + k1 × (1 - b + b × |d|/avgdl))

Components:

TF (term frequency): how often term appears in doc
IDF (inverse document frequency): rarer term = higher weight
Doc length normalization: shorter docs with same TF score higher
k1 (default 1.2): term frequency saturation
b (default 0.75): length normalization

3.2 Intuition

Term in many docs = common (low IDF) = less informative
Term repeated in doc = relevant (high TF) but with diminishing returns (k1)
Short doc matches better than long doc with same TF
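The intuition above can be checked numerically. This is a sketch of the per-term formula from 3.1, assuming a Lucene-style IDF, `ln(1 + (N - n + 0.5) / (n + 0.5))`, since the formula above leaves `IDF(t)` unspecified; the corpus numbers are made up for illustration.

```python
import math

def bm25_term_score(tf, doc_len, avgdl, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score contribution of one query term for one document."""
    # Assumption: Lucene-style IDF; other BM25 variants define it differently.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    norm = k1 * (1 - b + b * doc_len / avgdl)  # length normalization
    return idf * (tf * (k1 + 1)) / (tf + norm)

# Hypothetical corpus: 1000 docs, term appears in 10, average length 100 tokens
common = dict(avgdl=100, n_docs=1000, doc_freq=10)
short_doc = bm25_term_score(tf=2, doc_len=50, **common)
long_doc = bm25_term_score(tf=2, doc_len=400, **common)
tf_1 = bm25_term_score(tf=1, doc_len=100, **common)
tf_10 = bm25_term_score(tf=10, doc_len=100, **common)
tf_100 = bm25_term_score(tf=100, doc_len=100, **common)

print(short_doc > long_doc)           # True: same TF, shorter doc scores higher
print(tf_10 - tf_1 > tf_100 - tf_10)  # True: gains flatten as TF grows (k1)
```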

3.3 Compare to TF-IDF

BM25 is improved TF-IDF:

Term frequency saturation (k1 cap)
Length normalization (b)
Better empirical results

BM25 has been the default similarity since Elasticsearch 5.0; OpenSearch inherited it from the fork.

3.4 Custom similarity

PUT /my-index
{
  "settings": {"similarity": {"my_bm25": {"type": "BM25", "k1": 1.5, "b": 0.5}}},
  "mappings": {"properties": {"content": {"type": "text", "similarity": "my_bm25"}}}
}

Tune for domain (e.g., academic vs news).

4. Analyzer Pipeline

4.1 Steps

graph LR
    Input["The Quick Brown Foxes"] --> CharFilter[Char filter<br/>HTML strip, etc]
    CharFilter --> Tokenizer[Tokenizer<br/>whitespace, standard, etc]
    Tokenizer --> TokenFilter1[Token filter 1<br/>lowercase]
    TokenFilter1 --> TokenFilter2[Token filter 2<br/>stop words]
    TokenFilter2 --> TokenFilter3[Token filter 3<br/>stemming]
    TokenFilter3 --> Output["[quick, brown, fox]"]
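The pipeline in the diagram can be imitated in plain Python. This is a toy sketch, not Lucene's implementation: the stop-word list is a tiny made-up sample and `naive_stem` strips a few suffixes instead of running a real Porter/Snowball stemmer.

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "of"}  # tiny illustrative list (assumption)

def strip_html(text):
    """Char filter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Tokenizer: split on runs of non-alphanumeric characters."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def naive_stem(token):
    """Toy stemmer: strip a few English suffixes (not Porter/Snowball)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    tokens = tokenize(strip_html(text))
    tokens = [t.lower() for t in tokens]                 # token filter 1: lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]  # token filter 2: stop words
    return [naive_stem(t) for t in tokens]               # token filter 3: stemming

print(analyze("<b>The</b> Quick Brown Foxes"))  # ['quick', 'brown', 'fox']
```

The same chain must run at query time, otherwise a search for "Foxes" would never meet the indexed term "fox".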

4.2 Built-in analyzers

standard — default, Unicode-aware
simple — lowercase + split non-letter
whitespace — split whitespace only
keyword — single token (whole input)
english, french, german, etc. — language-specific stop words and stemming (Vietnamese is not built in; it needs a plugin)

4.3 Custom analyzer

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "english_stop", "english_stemmer"]
        }
      },
      "filter": {
        "english_stop": {"type": "stop", "stopwords": "_english_"},
        "english_stemmer": {"type": "stemmer", "language": "english"}
      }
    }
  }
}

4.4 Test analyzer

POST /my-index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The Quick Brown Foxes Jumping!"
}
// Output: [quick, brown, fox, jump]

4.5 Multi-field index

Same text indexed multiple ways:

{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "raw": {"type": "keyword"},          // exact match, sort, aggregate
          "ngram": {"type": "text", "analyzer": "edge_ngram_analyzer"}  // autocomplete
        }
      }
    }
  }
}

Query:

title → analyzed full-text
title.raw → exact
title.ngram → prefix match (edge_ngram_analyzer is a custom analyzer and must be defined under settings.analysis)

5. Mapping & Field Types

5.1 Common types

Type	Use
`text`	Full-text, analyzed
`keyword`	Exact, aggregate, sort
`long`, `integer`, `short`, `byte`	Integers
`double`, `float`, `half_float`	Floats
`date`	ISO date or epoch
`boolean`	True/false
`nested`	Array of objects, each indexed as a separate hidden document
`object`	Nested JSON, flattened internally
`geo_point`	Lat/lon
`geo_shape`	Polygons
`ip`	IP addresses
`dense_vector`	Embeddings (vector search)
`sparse_vector`	Sparse encoding (ELSER)

5.2 text vs keyword

"name": {"type": "text"}        // tokenized, full-text search
"status": {"type": "keyword"}    // exact, aggregate
"email": {"type": "keyword"}     // exact match

An email in a text field is split at @ by the standard analyzer, so exact lookups misbehave. Use keyword for IDs, emails, status.

5.3 Dynamic mapping

ES auto-detects fields on first document. Risk: wrong type guessed.

PUT /my-index
{
  "mappings": {
    "dynamic": "strict",        // reject unknown fields
    "properties": { ... }
  }
}

dynamic: strict recommended for production schemas.

5.4 nested vs object

// object: flattened
{"comments": [{"author": "alice", "text": "..."}]}
// → "comments.author": ["alice"], "comments.text": ["..."]
// Loses relationship between elements
 
// nested: separate docs internally
"comments": {"type": "nested", "properties": {"author": {...}, "text": {...}}}

Use nested if querying for combinations within array (e.g., comments by alice with text containing X).
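The lost relationship is easiest to see with a false positive. The sketch below imitates the two behaviours in plain Python; `match_object` and `match_nested` are hypothetical helpers, not Elasticsearch APIs, and the sample comments are made up.

```python
comments = [
    {"author": "alice", "text": "great post"},
    {"author": "bob", "text": "bad take"},
]

def flatten(objects):
    """object-type behaviour: one multi-valued field per key."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(key, []).append(value)
    return flat

def match_object(objects, author, word):
    """Each condition is checked against the flattened arrays independently."""
    flat = flatten(objects)
    return author in flat["author"] and any(word in t.split() for t in flat["text"])

def match_nested(objects, author, word):
    """Both conditions must hold inside the same array element."""
    return any(o["author"] == author and word in o["text"].split() for o in objects)

print(match_object(comments, "alice", "bad"))  # True  (false positive)
print(match_nested(comments, "alice", "bad"))  # False
```

Alice never wrote "bad", yet the flattened form matches because "alice" and "bad" both appear somewhere in the arrays.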

6. Query DSL

6.1 query vs filter context

{
  "query": {
    "bool": {
      "must": [{"match": {"title": "elasticsearch"}}],     // scored
      "filter": [{"term": {"status": "published"}}]         // not scored, cacheable
    }
  }
}

must / should — query context, contribute to score
filter / must_not — filter context, no scoring, cached

Filters are faster for repeated queries: they skip scoring, and frequently used filters are cached.
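One way to apply this rule consistently is to build request bodies through a small helper that always sends exact-match conditions to `filter`. This is a sketch only: `build_search` and the `lang` field are hypothetical, and the body is just printed rather than sent to a cluster.

```python
import json

def build_search(text, exact_filters):
    """Put the full-text clause in `must` (scored) and exact ones in `filter`."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],
                "filter": [{"term": {f: v}} for f, v in exact_filters.items()],
            }
        }
    }

body = build_search("elasticsearch", {"status": "published", "lang": "en"})
print(json.dumps(body, indent=2))
```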

6.2 Common queries

// match: full-text analyzed
{"match": {"title": "quick brown fox"}}
 
// match_phrase: exact phrase
{"match_phrase": {"title": "quick brown fox"}}
 
// term: exact (no analysis)
{"term": {"status": "active"}}
 
// terms: multi-value
{"terms": {"tags": ["rust", "db"]}}
 
// range
{"range": {"created_at": {"gte": "2026-01-01", "lt": "2027-01-01"}}}
 
// exists
{"exists": {"field": "email"}}
 
// wildcard (slow)
{"wildcard": {"username": "ali*"}}
 
// prefix
{"prefix": {"username": "ali"}}
 
// fuzzy (typo tolerance)
{"fuzzy": {"title": {"value": "elsticsearch", "fuzziness": "AUTO"}}}
 
// bool combinator
{"bool": {
  "must": [...], "should": [...],
  "filter": [...], "must_not": [...],
  "minimum_should_match": 1
}}

6.3 Multi-match

Across multiple fields:

{"multi_match": {
  "query": "elasticsearch tutorial",
  "fields": ["title^3", "body", "tags^2"],   // boost titles 3x, tags 2x
  "type": "best_fields"                       // or "most_fields", "cross_fields", "phrase"
}}

6.4 Function score / rank_feature

Boost based on field value (e.g., recency, popularity):

{"function_score": {
  "query": {...},
  "functions": [
    {"field_value_factor": {"field": "popularity", "modifier": "log1p"}},
    {"gauss": {"created_at": {"origin": "now", "scale": "10d"}}}
  ],
  "score_mode": "sum",
  "boost_mode": "multiply"
}}
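To make the arithmetic of the request above concrete, here is a sketch of how the two functions combine. It assumes the documented behaviour of the `log1p` modifier (common logarithm of 1 + value) and of the Gaussian decay with its default `decay` of 0.5; the input numbers are made up, and ages are expressed in days for simplicity.

```python
import math

def field_value_factor_log1p(value, factor=1.0):
    """`log1p` modifier: common logarithm of (1 + factor * value)."""
    return math.log10(1 + factor * value)

def gauss_decay(distance, scale, decay=0.5, offset=0.0):
    """Gaussian decay: equals `decay` when distance == offset + scale."""
    sigma_sq = -(scale ** 2) / (2 * math.log(decay))
    d = max(0.0, abs(distance) - offset)
    return math.exp(-(d ** 2) / (2 * sigma_sq))

def final_score(query_score, popularity, age_days, scale_days=10):
    functions = [
        field_value_factor_log1p(popularity),
        gauss_decay(age_days, scale_days),
    ]
    combined = sum(functions)       # score_mode: sum
    return query_score * combined   # boost_mode: multiply

# popularity 99 -> log10(100) = 2.0; age 10d at scale 10d -> 0.5
print(round(final_score(2.0, popularity=99, age_days=10), 6))  # 5.0
```

With `score_mode: sum`, a very popular but old document can still outrank a fresh one; switching to `multiply` makes recency act as a hard damper instead.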

6.5 Highlight

{
  "query": {"match": {"body": "elasticsearch"}},
  "highlight": {
    "fields": {"body": {"pre_tags": ["<mark>"], "post_tags": ["</mark>"]}}
  }
}

7. Aggregations

7.1 Metric aggregations

{
  "size": 0,
  "aggs": {
    "avg_price": {"avg": {"field": "price"}},
    "total_revenue": {"sum": {"field": "amount"}},
    "max_score": {"max": {"field": "score"}},
    "unique_users": {"cardinality": {"field": "user_id"}}
  }
}

7.2 Bucket aggregations

{
  "aggs": {
    "by_category": {
      "terms": {"field": "category.keyword", "size": 10}
    },
    "by_month": {
      "date_histogram": {"field": "created_at", "calendar_interval": "month"}
    },
    "by_price_range": {
      "range": {"field": "price", "ranges": [{"to": 100}, {"from": 100, "to": 1000}, {"from": 1000}]}
    }
  }
}

7.3 Nested aggregations

{
  "aggs": {
    "by_category": {
      "terms": {"field": "category.keyword"},
      "aggs": {
        "avg_price": {"avg": {"field": "price"}},
        "top_products": {"top_hits": {"size": 3}}
      }
    }
  }
}
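For readers coming from SQL, the terms + avg combination above behaves like a GROUP BY with an aggregate per group. The sketch below computes the same shape in plain Python over made-up documents; `terms_with_avg` is a hypothetical helper.

```python
from collections import defaultdict

products = [  # hypothetical sample documents
    {"name": "case", "category": "accessories", "price": 10.0},
    {"name": "cable", "category": "accessories", "price": 20.0},
    {"name": "phone", "category": "devices", "price": 600.0},
]

def terms_with_avg(docs, bucket_field, metric_field):
    """One bucket per distinct value, then an avg metric inside each bucket."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[bucket_field]].append(doc[metric_field])
    return {
        key: {"doc_count": len(vals), "avg": sum(vals) / len(vals)}
        for key, vals in buckets.items()
    }

result = terms_with_avg(products, "category", "price")
print(result["accessories"])  # {'doc_count': 2, 'avg': 15.0}
```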

7.4 Composite agg for pagination

{
  "aggs": {
    "all_categories": {
      "composite": {
        "size": 1000,
        "sources": [{"category": {"terms": {"field": "category.keyword"}}}]
      }
    }
  }
}

Paginate through all buckets with a cursor: pass the after_key from each response as after in the next request.
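The pagination loop can be sketched as a generator. `search` is a stand-in for whatever client call sends the request body and returns the parsed JSON response; `fake_search` is a hypothetical in-memory stub so the example runs without a cluster.

```python
def iter_composite_buckets(search, agg_name, sources, page_size=1000):
    """Yield every bucket of a composite aggregation, page by page."""
    after = None
    while True:
        composite = {"size": page_size, "sources": sources}
        if after is not None:
            composite["after"] = after  # resume after the last bucket seen
        body = {"size": 0, "aggs": {agg_name: {"composite": composite}}}
        agg = search(body)["aggregations"][agg_name]
        yield from agg["buckets"]
        after = agg.get("after_key")
        if after is None or not agg["buckets"]:
            break

def fake_search(body):
    """In-memory stand-in that serves three categories, page by page."""
    cats = ["books", "games", "music"]
    comp = body["aggs"]["all_categories"]["composite"]
    start = cats.index(comp["after"]["category"]) + 1 if "after" in comp else 0
    page = cats[start:start + comp["size"]]
    agg = {"buckets": [{"key": {"category": c}, "doc_count": 1} for c in page]}
    if page:
        agg["after_key"] = {"category": page[-1]}
    return {"aggregations": {"all_categories": agg}}

sources = [{"category": {"terms": {"field": "category.keyword"}}}]
keys = [b["key"]["category"]
        for b in iter_composite_buckets(fake_search, "all_categories", sources, page_size=2)]
print(keys)  # ['books', 'games', 'music']
```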

8. Cluster Topology

8.1 Roles

graph TB
    subgraph "Cluster"
        Master[Master eligible<br/>cluster state]
        Master2[Master eligible]
        Master3[Master eligible]

        Data1[Data hot<br/>SSD, recent data]
        Data2[Data hot]

        DataWarm1[Data warm<br/>HDD, older]

        DataFrozen[Data frozen<br/>snapshot-only]

        Ingest1[Ingest node<br/>preprocessing]

        Coord[Coordinating only<br/>routing]
    end

    Client --> Coord
    Coord --> Data1
    Coord --> Data2
    Coord --> DataWarm1

For clusters with more than about 10 nodes, use dedicated roles. Smaller clusters can combine roles on the same nodes.

8.2 Sharding strategy

Index = N primary shards + R replicas each.

Primary shards: write throughput
Replicas: read throughput + fault tolerance

Sizing:

Shard size: 10-50 GB ideal
Too few large shards → uneven load
Too many small shards → overhead

PUT /my-index
{"settings": {"number_of_shards": 5, "number_of_replicas": 1}}

number_of_shards cannot be changed in place after creation. Plan ahead, or use the _split / _shrink APIs or a reindex into a new index.
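Since the primary count is fixed at creation, it helps to estimate it up front from the sizing guideline above. A back-of-the-envelope sketch, assuming a 30 GB target (the middle of the 10-50 GB range) and a made-up 500 GB index:

```python
import math

def primary_shards_needed(total_gb, target_shard_gb=30):
    """Primaries needed so each shard stays near the target size."""
    return max(1, math.ceil(total_gb / target_shard_gb))

def total_shards(primaries, replicas):
    """Each primary has `replicas` copies, so the cluster holds this many shards."""
    return primaries * (1 + replicas)

p = primary_shards_needed(500)         # 500 GB of index data (hypothetical)
print(p, total_shards(p, replicas=1))  # 17 34
```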

8.3 Document routing

Default: shard = hash(_routing) % number_of_primary_shards, where _routing defaults to _id. Custom routing:

POST /my-index/_doc/1?routing=user_42
{...}

Use case: multi-tenant. All of a tenant's data lands on the same shard, so a search that passes the same routing value only has to hit that shard.
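The routing rule is simple enough to simulate. In this sketch `zlib.crc32` stands in for the Murmur3 hash the engine actually uses, so the shard numbers will not match a real cluster; only the property "same routing value, same shard" carries over.

```python
import zlib

def shard_for(routing_key, num_primary_shards):
    """Pick a shard from the routing value (which defaults to the doc _id)."""
    # Assumption: crc32 is a stand-in for the engine's Murmur3 hash.
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards

# (doc _id, routing value) pairs; all three belong to tenant user_42
docs = [("1", "user_42"), ("2", "user_42"), ("3", "user_42")]
shards = {shard_for(routing, 5) for _id, routing in docs}
print(len(shards))  # 1: every doc routed by user_42 lands on one shard
```

The flip side is skew: one very large tenant makes one very large shard, so custom routing fits best when tenants are many and similar in size.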

8.4 Allocation awareness

Spread replicas across racks/AZs:

cluster.routing.allocation.awareness.attributes: zone
node.attr.zone: us-east-1a

Prevent both primary + replica on same AZ.

9. Index Lifecycle Management (ILM)

9.1 Hot-Warm-Cold-Frozen

graph LR
    Hot[Hot tier<br/>SSD<br/>recent writes/reads<br/>0-7 days] --> Warm[Warm tier<br/>HDD/SSD<br/>older reads<br/>7-30 days]
    Warm --> Cold[Cold tier<br/>HDD<br/>searchable snapshots<br/>30-90 days]
    Cold --> Frozen[Frozen tier<br/>S3 snapshots<br/>partial mount<br/>>90 days]
    Frozen --> Delete[Delete<br/>policy]

Defined via ILM policy:

PUT _ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {"actions": {"rollover": {"max_size": "50gb", "max_age": "1d"}}},
      "warm": {"min_age": "7d", "actions": {"forcemerge": {"max_num_segments": 1}, "shrink": {"number_of_shards": 1}}},
      "cold": {"min_age": "30d", "actions": {"searchable_snapshot": {"snapshot_repository": "s3_repo"}}},
      "frozen": {"min_age": "90d", "actions": {"searchable_snapshot": {...}}},
      "delete": {"min_age": "365d", "actions": {"delete": {}}}
    }
  }
}

9.2 Rollover pattern

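PUT /logs-2026-05-16-000001
{
  "settings": {"index.lifecycle.name": "logs_policy", "index.lifecycle.rollover_alias": "logs"},
  "aliases": {"logs": {"is_write_index": true}}
}

The app writes to the logs alias. When a rollover threshold is hit, ILM creates the next index (-000002) and moves the write alias to it.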

9.3 Searchable snapshots

ES/OS feature: data lives in S3 (cheap), searchable without restore. Cold/frozen tier core feature.

Cost saving: roughly 80-90% on storage vs hot SSD for old data (rule of thumb; varies with provider and access pattern).

10. Postgres FTS vs Elasticsearch

Aspect	Postgres FTS	Elasticsearch
Setup	Built in (no extension needed)	Separate cluster
Operational	Same as Postgres	Additional system
Scale (data)	Up to ~100M docs	Billions
Search relevance	Decent (ts_rank)	Best-in-class (BM25 + functions)
Language support	Few stemmers	Many languages, custom analyzers
Aggregations	SQL GROUP BY	Rich aggregations
Updates	Real-time	Near-real-time (1s refresh)
Cost	$0 (already paying for DB)	New cost
Hybrid with structured data	Native	Need lookup back to DB

Decision matrix:

flowchart TD
    A[Need search?] --> B{Volume of indexed docs?}
    B -->|<1M, structured| C[Postgres FTS<br/>tsvector + GIN]
    B -->|1M-50M, simple| D[Postgres FTS + pg_trgm]
    B -->|>50M or complex queries| E[Elasticsearch / OpenSearch]
    B -->|Need ML, ranking, analytics| E
    B -->|AI / RAG vector search| F[Vector DB or ES with vectors]

    style C fill:#c8e6c9
    style D fill:#c8e6c9
    style E fill:#fff9c4

11. Hybrid Search (BM25 + Vector) 2024-2026

11.1 Concept

Combine lexical (BM25) + semantic (embedding similarity) for best results.

graph LR
    Query[Query] --> BM25[BM25 search<br/>exact, recent]
    Query --> Embedding[Embed query]
    Embedding --> VecSearch[Vector search<br/>semantic similar]

    BM25 --> Merge[Hybrid merge<br/>RRF or weighted]
    VecSearch --> Merge

    Merge --> Final[Final ranked results]

11.2 Reciprocal Rank Fusion (RRF)

score(doc) = sum over rankers r: 1 / (k + rank_r(doc))

Default k=60. Combine top-N from each ranker, weight by inverse rank.
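The formula translates almost directly into code. A minimal sketch, assuming each ranker returns an ordered list of doc IDs with ranks starting at 1; the two hit lists are made up.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse ranked lists of doc IDs; rank positions start at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

bm25_hits = ["a", "b", "c"]    # hypothetical lexical ranking
vector_hits = ["b", "c", "d"]  # hypothetical semantic ranking
print(rrf([bm25_hits, vector_hits]))  # ['b', 'c', 'a', 'd']
```

Only rank positions enter the sum, so BM25 scores and cosine similarities never need to be normalized onto a common scale. Documents found by both rankers ("b", "c") rise above those found by one.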

11.3 ES Hybrid query (8.x)

{
  "knn": {
    "field": "embedding",
    "query_vector": [...],
    "k": 10,
    "num_candidates": 100
  },
  "query": {"match": {"title": "query text"}},
  "rank": {"rrf": {"window_size": 50, "rank_constant": 20}}
}

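ES 8.8+ can fuse the two result sets with RRF natively. Parameter names have shifted across 8.x releases (newer versions expose this through the retriever API), so check the docs for your version. OpenSearch offers a comparable hybrid query combined through a search pipeline.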

11.4 With pgvector (Postgres-side)

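-- Lexical rank via tsvector (ts_rank approximates relevance; it is not true BM25) + vector via pgvector
WITH bm25 AS (
    SELECT id, ts_rank(search_vector, query) AS score
    -- plainto_tsquery accepts free text; to_tsquery('iphone case') is a syntax error
    FROM products, plainto_tsquery('iphone case') query
    WHERE search_vector @@ query
    ORDER BY score DESC LIMIT 100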
),
vec AS (
    SELECT id, 1 - (embedding <=> '[...]'::vector) AS score
    FROM products ORDER BY embedding <=> '[...]' LIMIT 100
),
fused AS (
    SELECT id, 1.0/(60 + rank() OVER (ORDER BY score DESC)) AS rrf_score
    FROM bm25
    UNION ALL
    SELECT id, 1.0/(60 + rank() OVER (ORDER BY score DESC)) FROM vec
)
SELECT id, sum(rrf_score) AS hybrid_score
FROM fused GROUP BY id ORDER BY hybrid_score DESC LIMIT 10;

Tuan-15-Vector-DB-AI goes deeper.

12. Anti-patterns

Pattern	Why bad	Fix
ES as primary DB	No ACID, eventually consistent	Postgres + ES copy
ES as join engine	Cross-index join awkward	Denormalize
1 huge index forever	Slow, hard manage	Time-based rollover
Default mapping for important fields	Wrong type guessed	Explicit mapping
Sorting or aggregating on a text field	Needs fielddata, heavy on heap	Multi-field with a keyword sub-field
`dynamic: true` in production	Mapping explosion	`dynamic: strict`
1 shard 500GB	Slow recovery, uneven	Resize via reindex
Refresh every 1s for bulk	Many small segments	Set 30s during bulk
Forgot replicas	Data loss on node fail	At least 1 replica
No backup	Cluster fail = lose	Snapshot to S3 daily

13. Lab

13.1 Day 1: Setup

docker run -d --name es -p 9200:9200 -e "discovery.type=single-node" -e "xpack.security.enabled=false" elasticsearch:8.13.0
 
curl http://localhost:9200/

Index sample data, basic queries.

13.2 Day 2: Mapping

Design mapping for product catalog. text vs keyword. Multi-field for sort + search.

13.3 Day 3: Query DSL

Practice 10 query types: match, term, range, bool, fuzzy, prefix, wildcard, multi_match, function_score, nested.

13.4 Day 4: Aggregations

Build analytics queries: top categories, sales over time, avg price by category, percentiles.

13.5 Day 5: Cluster

3-node cluster. Add/remove nodes. Watch shard rebalance.

13.6 Day 6: ILM

Setup index alias + rollover + ILM policy. Watch indices move through phases.

13.7 Day 7: Hybrid search

Combine BM25 + dense_vector. Use sentence-transformers or OpenAI embeddings.

14. Self-check

Inverted index vs forward index: can you draw an example?
BM25 vs TF-IDF: what are the main differences?
text vs keyword field: when to pick each?
query context vs filter context: what differs, and why does it matter?
Multi-field mapping: example use case?
Shard size sweet spot? Why shouldn't shards be huge?
Hot-warm-cold-frozen tiering: where do the cost savings come from?
Postgres FTS vs ES: 3 reasons to pick ES?
RRF hybrid search: what is the formula?
Index alias + rollover: how does the pattern work?

15. Next

Next lesson: Tuan-14-OLAP-Columnar-ClickHouse (analytical workloads).

Week 13 complete. Search != Query. Updated: 2026-05-16

lthieu's notes