
4.5 Indexing & Tokenization (BM25 / KNN / Hybrid)

Pop’s knowledge base retrieval is built on three core technologies:
BM25 (keyword retrieval), KNN vector search (semantic retrieval), and Hybrid (a strategy that combines the two).

This section explains their principles, data structures, pros & cons, and best-use scenarios.


📚 1. Architecture of Pop’s Retrieval System

Pop uses a dual-index architecture:

Chunk Data
 ├── BM25 Index (Keyword Inverted Index)
 └── Vector Index (Embedding-based Semantic Search)

Depending on which retrieval mode is selected, Pop invokes the corresponding index module.


🔍 2. BM25 (Keyword Retrieval)

BM25 is a classic retrieval algorithm widely used in traditional search engines.
It relies on term matching, not semantic understanding.

2.1 How BM25 Works

BM25 computes similarity using three major factors:

| Factor | Purpose |
| --- | --- |
| Term Frequency (TF) | How often a term appears in a chunk |
| Inverse Document Frequency (IDF) | How rare the term is across all chunks (rarer → more important) |
| Chunk Length Normalization | Prevents long chunks from scoring unfairly high |
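The three factors above combine into the classic Okapi BM25 formula. The sketch below is a minimal, self-contained illustration of that formula, not Pop’s actual index code; the sample chunks and the `k1`/`b` defaults are illustrative assumptions.

```python
import math

def bm25_scores(query_terms, chunks, k1=1.5, b=0.75):
    """Score each pre-tokenized chunk against the query with Okapi BM25."""
    N = len(chunks)
    avgdl = sum(len(c) for c in chunks) / N          # average chunk length
    # document frequency: how many chunks contain each query term
    df = {t: sum(1 for c in chunks if t in c) for t in query_terms}
    scores = []
    for chunk in chunks:
        s = 0.0
        for t in query_terms:
            tf = chunk.count(t)                      # term frequency in this chunk
            if tf == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            # length normalization penalizes chunks longer than average
            norm = 1 - b + b * len(chunk) / avgdl
            s += idf * tf * (k1 + 1) / (tf + k1 * norm)
        scores.append(s)
    return scores

chunks = [
    "configure site domain binding".split(),
    "install the cli tool".split(),
    "domain dns records and binding steps".split(),
]
print(bm25_scores(["domain", "binding"], chunks))
```

Note that the chunk without either query term scores exactly zero: BM25 matches terms, so a paraphrase with no shared words is invisible to it.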

2.2 Best Use Cases for BM25

| Scenario | Explanation |
| --- | --- |
| Documents with jargon | API docs, part numbers, command names |
| Queries needing exact matches | “SR-71”, “gpt-embedding-api” |
| Short keyword queries | FAQs, titles, parameter lookups |

2.3 Strengths of BM25

  • Fast and lightweight
  • Excellent for technical terms & code tokens
  • Independent of model accuracy

2.4 Limitations of BM25

  • Cannot understand meaning
  • Weak at synonyms / paraphrasing
  • Not suitable for descriptive questions like:
    “What does this feature do?”

🧠 3. KNN (Vector / Semantic Search)

KNN searches by semantic similarity, using embedding vectors.

3.1 How KNN Works

  1. Each chunk is converted into an embedding (vector, e.g., 1024 dimensions)
  2. Vectors are stored in a vector database
  3. Queries are embedded into a vector
  4. K-nearest neighbors are found via cosine similarity / dot product

The smaller the distance (i.e., the higher the similarity), the more semantically relevant the chunk.
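The four steps above can be sketched with a tiny in-memory index. This is a toy illustration using hand-written 3-dimensional vectors and a linear scan; a real system embeds text with a model (e.g., into 1024 dimensions) and uses an optimized index structure instead of scanning every vector.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def knn(query_vec, index, k=2):
    """Return the ids of the k chunks most similar to the query vector."""
    ranked = sorted(index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]

# Toy "embeddings"; the ids and vectors are made up for illustration.
index = {
    "configure-site-domain": [0.9, 0.1, 0.0],
    "install-cli":           [0.0, 0.2, 0.9],
    "dns-records":           [0.8, 0.3, 0.1],
}
print(knn([1.0, 0.0, 0.0], index, k=2))
# → ['configure-site-domain', 'dns-records']
```

Because the query and the chunks meet in the same vector space, a question phrased differently from the document can still land near the right chunk.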

3.2 Strengths of KNN

  • Understands meaning
  • Handles synonyms, paraphrasing, vague questions
  • Best for intelligent Q&A

Example:

User asks:
“How do I set up domain binding?”

Even if the document says “Configure site domain”, KNN still finds it.

3.3 Limitations of KNN

  • Weaker than BM25 for code, parameters, exact terms
  • Embedding quality depends on the model
  • Large vector sets require optimized indexing structures

⚡ 4. Hybrid Retrieval (BM25 + KNN Combined)

Hybrid merges BM25 and semantic search to achieve the best of both worlds.
Pop defaults to Hybrid mode.

4.1 How Hybrid Works

Common scoring formula:

Hybrid Score = α * BM25_Score + β * Embedding_Score

Pop uses weighted fusion + multi-stage ranking to ensure:

  • BM25 hits jargon & code
  • KNN hits natural language queries
  • Final ranking returns the best N results
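The weighted-fusion formula above can be sketched as follows. One practical detail the formula glosses over: BM25 scores and cosine similarities live on different scales, so each score list is normalized before fusing. The min-max normalization and the α = 0.4 / β = 0.6 weights here are illustrative assumptions, not Pop’s actual internals.

```python
def minmax(scores):
    """Normalize scores to [0, 1] so the two scales are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(bm25, emb, alpha=0.4, beta=0.6):
    """Hybrid Score = α * BM25_Score + β * Embedding_Score (per chunk)."""
    return [alpha * b + beta * e for b, e in zip(minmax(bm25), minmax(emb))]

bm25 = [4.2, 0.0, 3.1]    # keyword scores per chunk
emb  = [0.91, 0.15, 0.88] # cosine similarities per chunk
print(hybrid_scores(bm25, emb))
```

A chunk that scores well on both signals rises to the top, while a chunk strong on only one signal still stays in the running, which is exactly the “best of both worlds” behavior Hybrid is after.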

4.2 Best Use Cases for Hybrid

| Scenario | BM25 | KNN | Hybrid |
| --- | --- | --- | --- |
| Code, commands | ✔️ | | ✔️ |
| Technical terminology | ✔️ | | ✔️ |
| Fuzzy questions | | ✔️ | ✔️ |
| Long descriptive queries | | ✔️ | ✔️ |
| FAQ-style content | ✔️ | ✔️ | ✔️ |
| Large/complex KBs | | | ⭐ Must use |

Hybrid is the universal best-choice strategy.


🔡 5. Tokenization in Pop

5.1 BM25 Tokenization

Pop’s BM25 uses:

  • Chinese word segmentation
  • English tokenization on whitespace and punctuation
  • Special handling for URLs, file paths, and identifiers

Example:

"如何绑定域名到网站?" (“How do I bind a domain to a website?”)
→ ["如何" (how), "绑定" (bind), "域名" (domain), "网站" (website)]
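For the English and identifier side, a minimal regex-based sketch of such special handling might look like this. This is a hypothetical illustration, not Pop’s tokenizer: real BM25 pipelines (and the Chinese word segmenter) are considerably more sophisticated.

```python
import re

# Keep URLs and dotted/hyphenated identifiers whole; split everything else
# on whitespace and punctuation. Pattern is an illustrative assumption.
TOKEN = re.compile(r"https?://[\w./-]+|[\w./-]+")

def tokenize_en(text):
    """Lowercased tokens, preserving URLs and identifiers as single tokens."""
    return [t.lower() for t in TOKEN.findall(text)]

print(tokenize_en("Call gpt-embedding-api via https://api.example.com/v1!"))
```

Keeping `gpt-embedding-api` and the URL as single tokens is what lets BM25 hit exact technical terms that a naive punctuation split would shred.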

5.2 KNN Tokenization

Embedding models perform their own internal tokenization:

  • Understand Chinese without manual segmentation
  • Handle synonyms and semantic structures

For example, embedding models know:

“绑定域名” (“bind a domain”) ≈ “域名配置” (“domain configuration”)

📊 6. Weighting & Ranking Rules

Pop’s Hybrid retrieval pipeline includes:

  1. BM25 scoring
  2. Embedding similarity scoring
  3. Weighted fusion
  4. Duplicate removal
  5. Document-level priority reorder
  6. Final ranking

This produces stable and reliable results across diverse knowledge bases.
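The later stages of the pipeline (fusion, duplicate removal, document-level priority, final ranking) can be sketched as follows. The tuple layout, the priority semantics, and the sample data are all illustrative assumptions about how such a stage could be wired, not Pop’s actual implementation.

```python
def rank(candidates, top_n=3):
    """candidates: (chunk_id, doc_priority, fused_score) tuples,
    possibly containing duplicates when both indexes return a chunk."""
    best = {}
    # duplicate removal: keep the highest fused score per chunk
    for chunk_id, prio, score in candidates:
        if chunk_id not in best or score > best[chunk_id][1]:
            best[chunk_id] = (prio, score)
    # document-level priority first, then fused score (both descending)
    ranked = sorted(best.items(), key=lambda kv: (-kv[1][0], -kv[1][1]))
    return [chunk_id for chunk_id, _ in ranked[:top_n]]

candidates = [
    ("a", 1, 0.90),  # returned by both indexes: keep the higher score
    ("a", 1, 0.70),
    ("b", 2, 0.60),  # higher-priority document outranks a higher raw score
    ("c", 1, 0.80),
]
print(rank(candidates))
# → ['b', 'a', 'c']
```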


📌 7. Recommended Retrieval Strategy

| KB Type | Recommended Mode |
| --- | --- |
| Technical docs / API / code | Hybrid (best) |
| FAQ / structured documentation | Hybrid |
| Plain-text books / novels | KNN |
| Heavy math / symbolic content | BM25 + Hybrid |
| Massive KBs (> 1M chunks) | KNN + limited BM25 |

✅ Summary

Each retrieval strategy excels in different tasks:

| Retrieval Mode | Strengths | Weaknesses |
| --- | --- | --- |
| BM25 | Exact matching, technical terms, code | Weak semantic understanding |
| KNN | Semantic search, fuzzy queries | Weak with jargon and exact terms |
| Hybrid | Best overall; default strategy | Requires maintaining dual indexes |

Hybrid is Pop’s recommended—and default—retrieval mechanism for most knowledge bases.