4.5 Indexing & Tokenization (BM25 / KNN / Hybrid)
Pop’s knowledge base retrieval system is composed of three core technologies:
BM25 (keyword retrieval) + KNN vector search (semantic retrieval) + Hybrid (combined retrieval strategy).
This section explains their principles, data structures, pros & cons, and best-use scenarios.
📚 1. Architecture of Pop’s Retrieval System
Pop uses a dual-index architecture:
Chunk Data
├── BM25 Index (Keyword Inverted Index)
└── Vector Index (Embedding-based Semantic Search)
Depending on which retrieval mode is selected, Pop invokes the corresponding index module.
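As a rough sketch of this dispatch (the class, method, and field names below are assumed interfaces for illustration, not Pop's actual API):

```python
# Rough sketch of the dual-index dispatch.
# The class, method, and field names are assumed interfaces, not Pop's actual API.
from typing import Dict, List


class KnowledgeBaseRetriever:
    def __init__(self, bm25_index, vector_index):
        self.bm25 = bm25_index        # keyword inverted index
        self.vectors = vector_index   # embedding-based vector index

    def search(self, query: str, mode: str = "hybrid", top_k: int = 5) -> List[Dict]:
        if mode == "bm25":
            return self.bm25.search(query, top_k)
        if mode == "knn":
            return self.vectors.search(query, top_k)
        # Hybrid: query both indexes, then merge and re-rank (see Section 4)
        hits = self.bm25.search(query, top_k) + self.vectors.search(query, top_k)
        best = {}
        for hit in hits:                          # keep the best score per chunk
            cid = hit["chunk_id"]
            if cid not in best or hit["score"] > best[cid]["score"]:
                best[cid] = hit
        return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:top_k]
```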
🔍 2. BM25 (Keyword Retrieval)
BM25 is a classic retrieval algorithm widely used in traditional search engines.
It relies on term matching, not semantic understanding.
2.1 How BM25 Works
BM25 computes similarity using three major factors:
| Factor | Purpose |
|---|---|
| Term Frequency (TF) | How often a term appears in a chunk |
| Inverse Document Frequency (IDF) | How rare the term is across chunks (rarer → more important) |
| Chunk Length Normalization | Prevents long chunks from scoring unfairly high |
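For reference, here is a compact sketch of the standard Okapi-style BM25 scoring that these three factors combine into; the k1 and b values are common defaults, not necessarily Pop's settings:

```python
# Standard Okapi-style BM25 score for one chunk.
# k1 and b are common defaults (k1≈1.5, b≈0.75), not necessarily Pop's settings.
import math
from collections import Counter

def bm25_score(query_terms, chunk_terms, doc_freq, n_chunks, avg_chunk_len,
               k1=1.5, b=0.75):
    tf = Counter(chunk_terms)
    # Chunk-length normalization: long chunks get their term frequencies damped
    length_norm = 1 - b + b * len(chunk_terms) / avg_chunk_len
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        # IDF: the rarer the term across all chunks, the larger its weight
        idf = math.log(1 + (n_chunks - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
        # TF with saturation: repeated occurrences add diminishing weight
        score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
    return score
```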
2.2 Best Use Cases for BM25
| Scenario | Explanation |
|---|---|
| Documents with jargon | API docs, part numbers, command names |
| Queries with exact match | “SR-71”, “gpt-embedding-api” |
| Short keyword queries | FAQ, titles, parameter lookup |
2.3 Strengths of BM25
- Fast and lightweight
- Excellent for technical terms & code tokens
- Independent of model accuracy
2.4 Limitations of BM25
- Cannot understand meaning
- Weak at synonyms / paraphrasing
- Not suitable for descriptive questions like:
“What does this feature do?”
🧠 3. KNN (Vector / Semantic Search)
KNN searches by semantic similarity, using embedding vectors.
3.1 How KNN Works
- Each chunk is converted into an embedding (vector, e.g., 1024 dimensions)
- Vectors are stored in a vector database
- Queries are embedded into a vector
- K-nearest neighbors are found via cosine similarity / dot product
The smaller the distance (i.e., the higher the similarity), the more semantically relevant the chunk.
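A minimal brute-force sketch of this lookup, with a placeholder `embed()` standing in for whichever embedding model is configured:

```python
# Brute-force KNN over L2-normalized embeddings; `embed` is a placeholder for
# whichever embedding model is configured, not a specific Pop API.
import numpy as np

def knn_search(query: str, chunk_vectors: np.ndarray, embed, k: int = 5):
    q = embed(query)                                            # shape: (dim,)
    q = q / np.linalg.norm(q)
    chunks = chunk_vectors / np.linalg.norm(chunk_vectors, axis=1, keepdims=True)
    sims = chunks @ q                        # cosine similarity via dot product
    top = np.argsort(-sims)[:k]              # indices of the k most similar chunks
    return [(int(i), float(sims[i])) for i in top]
```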
3.2 Strengths of KNN
- Understands meaning
- Handles synonyms, paraphrasing, vague questions
- Best for intelligent Q&A
Example:
User asks:
“How do I set up domain binding?”
Even if the document says “Configure site domain”, KNN still finds it.
3.3 Limitations of KNN
- Weaker than BM25 for code, parameters, exact terms
- Embedding quality depends on the model
- Large vector sets require optimized indexing structures (approximate nearest-neighbor indexes such as HNSW or IVF)
⚡ 4. Hybrid Retrieval (BM25 + KNN Combined)
Hybrid merges BM25 and semantic search to achieve the best of both worlds.
Pop defaults to Hybrid mode.
4.1 How Hybrid Works
Common scoring formula:
Hybrid Score = α * BM25_Score + β * Embedding_Score
Pop uses weighted fusion + multi-stage ranking to ensure:
- BM25 hits jargon & code
- KNN hits natural language queries
- Final ranking returns the best N results
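A sketch of weighted score fusion in the spirit of the formula above; the min-max normalization and the α/β weights are illustrative assumptions, since BM25 and embedding scores live on different scales:

```python
# Weighted fusion of BM25 and embedding scores keyed by chunk id.
# Min-max normalization and the alpha/beta weights are illustrative choices,
# not Pop's published parameters.
def hybrid_rank(bm25_scores: dict, knn_scores: dict,
                alpha: float = 0.5, beta: float = 0.5, top_n: int = 5):
    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0              # avoid division by zero
        return {cid: (s - lo) / span for cid, s in scores.items()}

    bm25_n, knn_n = normalize(bm25_scores), normalize(knn_scores)
    fused = {cid: alpha * bm25_n.get(cid, 0.0) + beta * knn_n.get(cid, 0.0)
             for cid in set(bm25_n) | set(knn_n)}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```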
4.2 Best Use Cases for Hybrid
| Scenario | BM25 | KNN | Hybrid |
|---|---|---|---|
| Code, commands | ✔️ | ❌ | ✔️ |
| Technical terminology | ✔️ | ❌ | ✔️ |
| Fuzzy questions | ❌ | ✔️ | ✔️ |
| Long descriptive queries | ❌ | ✔️ | ✔️ |
| FAQ-style content | ✔️ | ✔️ | ✔️ |
| Large/complex KBs | ❌ | ❌ | ⭐ Must use |
Hybrid is the universal best-choice strategy.
🔡 5. Tokenization in Pop
5.1 BM25 Tokenization
Pop’s BM25 uses:
- Chinese word segmentation
- English split by space & punctuation
- Special processing (URLs, paths, identifiers)
Example:
"如何绑定域名到网站?"
→ ["如何", "绑定", "域名", "网站"]
5.2 KNN Tokenization
Embedding models perform their own internal tokenization:
- Understand Chinese without manual segmentation
- Handle synonyms and semantic structures
For example, an embedding model recognizes that:
“绑定域名” ("bind a domain") ≈ “域名配置” ("domain configuration")
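A small sketch of this paraphrase matching; the multilingual sentence-transformers model named below is an illustrative choice, not Pop's embedding model:

```python
# Paraphrase similarity via embeddings.
# The model below is an illustrative multilingual choice, not Pop's embedding model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
emb = model.encode(["绑定域名", "域名配置"])   # "bind a domain" / "domain configuration"
print(util.cos_sim(emb[0], emb[1]))           # high cosine similarity, no manual segmentation
```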
📊 6. Weighting & Ranking Rules
Pop’s Hybrid retrieval pipeline includes:
- BM25 scoring
- Embedding similarity scoring
- Weighted fusion
- Duplicate removal
- Document-level priority reorder
- Final ranking
This produces stable and reliable results across diverse knowledge bases.
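As a sketch of the later stages (duplicate removal and document-level priority reordering), with the `chunk_id`, `score`, and `priority` field names assumed for illustration:

```python
# Post-fusion stages: deduplicate chunks, then reorder by document-level priority.
# The chunk_id / score / priority field names are assumptions for illustration.
def finalize(fused_hits, top_n=5):
    best = {}
    for hit in sorted(fused_hits, key=lambda h: h["score"], reverse=True):
        best.setdefault(hit["chunk_id"], hit)   # duplicate removal: keep best-scoring copy
    deduped = list(best.values())
    # Document-level priority reorder: higher-priority documents first, then by fused score
    deduped.sort(key=lambda h: (h.get("priority", 0), h["score"]), reverse=True)
    return deduped[:top_n]
```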
📌 7. Recommended Retrieval Strategy
| KB Type | Recommended Mode |
|---|---|
| Technical docs / API / code | Hybrid (best) |
| FAQ / structured documentation | Hybrid |
| Pure text books / novels | KNN |
| Heavy math / symbolic content | BM25 + Hybrid |
| Massive KB (> 1M chunks) | KNN + limited BM25 |
✅ Summary
Each retrieval strategy excels in different tasks:
| Retrieval Mode | Strengths | Weaknesses |
|---|---|---|
| BM25 | Exact match, tech terms, code | Weak semantic ability |
| KNN | Semantic search, fuzzy queries | Weak with jargon |
| Hybrid | Best overall, default strategy | Requires dual indexes |
Hybrid is Pop’s recommended—and default—retrieval mechanism for most knowledge bases.