
4.5 Indexing & Tokenization (BM25 / KNN / Hybrid)

Pop’s knowledge base retrieval is built on three core technologies:
BM25 (keyword retrieval), KNN vector search (semantic retrieval), and Hybrid (a strategy that combines the two).

This section explains their principles, data structures, pros & cons, and best-use scenarios.


📚 1. Architecture of Pop’s Retrieval System

Pop uses a dual-index architecture:

Chunk Data
 ├── BM25 Index (Keyword Inverted Index)
 └── Vector Index (Embedding-based Semantic Search)

Depending on which retrieval mode is selected, Pop invokes the corresponding index module.


🔍 2. BM25 (Keyword Retrieval)

BM25 is a classic retrieval algorithm widely used in traditional search engines.
It relies on term matching, not semantic understanding.

2.1 How BM25 Works

BM25 computes similarity using three major factors:

| Factor | Purpose |
| --- | --- |
| Term Frequency (TF) | How often a term appears in a chunk |
| Inverse Document Frequency (IDF) | How rare the term is across all chunks (rarer → more important) |
| Chunk Length Normalization | Prevents long chunks from scoring unfairly high |
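The three factors above combine into the classic Okapi BM25 formula. The sketch below is a minimal, self-contained illustration of that formula, not Pop’s actual index code; the sample chunks and the `k1`/`b` defaults are illustrative assumptions.

```python
import math

def bm25_scores(query_terms, chunks, k1=1.5, b=0.75):
    """Score each pre-tokenized chunk against the query with Okapi BM25."""
    N = len(chunks)
    avgdl = sum(len(c) for c in chunks) / N          # average chunk length
    # document frequency: how many chunks contain each query term
    df = {t: sum(1 for c in chunks if t in c) for t in query_terms}
    scores = []
    for chunk in chunks:
        s = 0.0
        for t in query_terms:
            tf = chunk.count(t)                      # term frequency in this chunk
            if tf == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # length normalization penalizes chunks longer than average
            norm = 1 - b + b * len(chunk) / avgdl
            s += idf * tf * (k1 + 1) / (tf + k1 * norm)
        scores.append(s)
    return scores

chunks = [
    "configure site domain binding".split(),
    "install the cli tool".split(),
    "domain dns records and binding steps".split(),
]
print(bm25_scores(["domain", "binding"], chunks))
```

Note that the chunk without either query term scores exactly zero: BM25 matches terms, so a paraphrase with no shared words is invisible to it.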

2.2 Best Use Cases for BM25

| Scenario | Explanation |
| --- | --- |
| Documents with jargon | API docs, part numbers, command names |
| Queries needing exact matches | “SR-71”, “gpt-embedding-api” |
| Short keyword queries | FAQs, titles, parameter lookups |

2.3 Strengths of BM25

  • Fast and lightweight
  • Excellent for technical terms & code tokens
  • Independent of model accuracy

2.4 Limitations of BM25

  • Cannot understand meaning
  • Weak at synonyms / paraphrasing
  • Not suitable for descriptive questions like:
    “What does this feature do?”

🧠 3. KNN (Vector / Semantic Search)

KNN searches by semantic similarity, using embedding vectors.

3.1 How KNN Works

  1. Each chunk is converted into an embedding (vector, e.g., 1024 dimensions)
  2. Vectors are stored in a vector database
  3. Queries are embedded into a vector
  4. K-nearest neighbors are found via cosine similarity / dot product

The smaller the distance (i.e., the higher the similarity), the more semantically relevant the chunk.
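The four steps above can be sketched with a tiny in-memory index. This is a toy illustration using hand-written 3-dimensional vectors and a linear scan; a real system embeds text with a model (e.g., into 1024 dimensions) and uses an optimized index structure instead of scanning every vector.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def knn(query_vec, index, k=2):
    """Return the ids of the k chunks most similar to the query vector."""
    ranked = sorted(index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]

# Toy "embeddings"; the ids and vectors are made up for illustration.
index = {
    "configure-site-domain": [0.9, 0.1, 0.0],
    "install-cli":           [0.0, 0.2, 0.9],
    "dns-records":           [0.8, 0.3, 0.1],
}
print(knn([1.0, 0.0, 0.0], index, k=2))
# → ['configure-site-domain', 'dns-records']
```

Because the query and the chunks meet in the same vector space, a question phrased differently from the document can still land near the right chunk.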

3.2 Strengths of KNN

  • Understands meaning
  • Handles synonyms, paraphrasing, vague questions
  • Best for intelligent Q&A

Example:

User asks:
“How do I set up domain binding?”

Even if the document says “Configure site domain”, KNN still finds it.

3.3 Limitations of KNN

  • Weaker than BM25 for code, parameters, exact terms
  • Embedding quality depends on the model
  • Large vector sets require optimized indexing structures

⚡ 4. Hybrid Retrieval (BM25 + KNN Combined)

Hybrid merges BM25 and semantic search to achieve the best of both worlds.
Pop defaults to Hybrid mode.

4.1 How Hybrid Works

Common scoring formula:

Hybrid Score = α * BM25_Score + β * Embedding_Score

Pop uses weighted fusion + multi-stage ranking to ensure:

  • BM25 hits jargon & code
  • KNN hits natural language queries
  • Final ranking returns the best N results
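The weighted-fusion formula above can be sketched as follows. One practical detail the formula glosses over: BM25 scores and cosine similarities live on different scales, so each score list is normalized before fusing. The min-max normalization and the α = 0.4 / β = 0.6 weights here are illustrative assumptions, not Pop’s actual internals.

```python
def minmax(scores):
    """Normalize scores to [0, 1] so the two scales are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(bm25, emb, alpha=0.4, beta=0.6):
    """Hybrid Score = α * BM25_Score + β * Embedding_Score (per chunk)."""
    return [alpha * b + beta * e for b, e in zip(minmax(bm25), minmax(emb))]

bm25 = [4.2, 0.0, 3.1]    # keyword scores per chunk
emb  = [0.91, 0.15, 0.88] # cosine similarities per chunk
print(hybrid_scores(bm25, emb))
```

A chunk that scores well on both signals rises to the top, while a chunk strong on only one signal still stays in the running, which is exactly the “best of both worlds” behavior Hybrid is after.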

4.2 Best Use Cases for Hybrid

| Scenario | BM25 | KNN | Hybrid |
| --- | --- | --- | --- |
| Code, commands | ✔️ | | ✔️ |
| Technical terminology | ✔️ | | ✔️ |
| Fuzzy questions | | ✔️ | ✔️ |
| Long descriptive queries | | ✔️ | ✔️ |
| FAQ-style content | ✔️ | ✔️ | ✔️ |
| Large/complex KBs | | | ⭐ Must use |

Hybrid is the universal best-choice strategy.


🔡 5. Tokenization in Pop

5.1 BM25 Tokenization

Pop’s BM25 uses:

  • Chinese word segmentation
  • English tokenization on whitespace and punctuation
  • Special handling for URLs, file paths, and identifiers

Example:

"如何绑定域名到网站?" (“How do I bind a domain to a website?”)
→ ["如何" (how), "绑定" (bind), "域名" (domain), "网站" (website)]
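For the English and identifier side, a minimal regex-based sketch of such special handling might look like this. This is a hypothetical illustration, not Pop’s tokenizer: real BM25 pipelines (and the Chinese word segmenter) are considerably more sophisticated.

```python
import re

# Keep URLs and dotted/hyphenated identifiers whole; split everything else
# on whitespace and punctuation. Pattern is an illustrative assumption.
TOKEN = re.compile(r"https?://[\w./-]+|[\w./-]+")

def tokenize_en(text):
    """Lowercased tokens, preserving URLs and identifiers as single tokens."""
    return [t.lower() for t in TOKEN.findall(text)]

print(tokenize_en("Call gpt-embedding-api via https://api.example.com/v1!"))
```

Keeping `gpt-embedding-api` and the URL as single tokens is what lets BM25 hit exact technical terms that a naive punctuation split would shred.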

5.2 KNN Tokenization

Embedding models perform their own internal tokenization:

  • Understand Chinese without manual segmentation
  • Handle synonyms and semantic structures

For example, embedding models know:

“绑定域名” (“bind a domain”) ≈ “域名配置” (“domain configuration”)

📊 6. Weighting & Ranking Rules

Pop’s Hybrid retrieval pipeline includes:

  1. BM25 scoring
  2. Embedding similarity scoring
  3. Weighted fusion
  4. Duplicate removal
  5. Document-level priority reorder
  6. Final ranking

This produces stable and reliable results across diverse knowledge bases.
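The later stages of the pipeline (fusion, duplicate removal, document-level priority, final ranking) can be sketched as follows. The tuple layout, the priority semantics, and the sample data are all illustrative assumptions about how such a stage could be wired, not Pop’s actual implementation.

```python
def rank(candidates, top_n=3):
    """candidates: (chunk_id, doc_priority, fused_score) tuples,
    possibly containing duplicates when both indexes return a chunk."""
    best = {}
    # duplicate removal: keep the highest fused score per chunk
    for chunk_id, prio, score in candidates:
        if chunk_id not in best or score > best[chunk_id][1]:
            best[chunk_id] = (prio, score)
    # document-level priority first, then fused score (both descending)
    ranked = sorted(best.items(), key=lambda kv: (-kv[1][0], -kv[1][1]))
    return [chunk_id for chunk_id, _ in ranked[:top_n]]

candidates = [
    ("a", 1, 0.90),  # returned by both indexes: keep the higher score
    ("a", 1, 0.70),
    ("b", 2, 0.60),  # higher-priority document outranks a higher raw score
    ("c", 1, 0.80),
]
print(rank(candidates))
# → ['b', 'a', 'c']
```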


📌 7. Recommended Retrieval Strategy

| KB Type | Recommended Mode |
| --- | --- |
| Technical docs / API / code | Hybrid (best) |
| FAQ / structured documentation | Hybrid |
| Plain-text books / novels | KNN |
| Heavy math / symbolic content | BM25 + Hybrid |
| Massive KBs (> 1M chunks) | KNN + limited BM25 |

✅ Summary

Each retrieval strategy excels in different tasks:

| Retrieval Mode | Strengths | Weaknesses |
| --- | --- | --- |
| BM25 | Exact matching, technical terms, code | Weak semantic understanding |
| KNN | Semantic search, fuzzy queries | Weak with jargon and exact terms |
| Hybrid | Best overall; default strategy | Requires maintaining dual indexes |

Hybrid is Pop’s recommended—and default—retrieval mechanism for most knowledge bases.