4.4 Knowledge Base Storage Structure and Chunking Rules
Pop’s Knowledge Base delivers high‑quality retrieval and accurate answers because of its core processing pipeline:
⭑ Document Parsing → Cleaning → Chunking → Embedding → Structured Storage
This section explains how Pop splits documents into searchable segments and how those segments are stored internally.
🗂️ 1. Overview of Knowledge Base Storage Structure
Each Knowledge Base consists of three major data layers:
Knowledge Base (KB)
├── Documents
├── Chunks
└── Indexes
    ├── Vector Index (KNN)
    └── BM25 Text Index
Processing flow:
Original Document → Parsing → Cleaning → Automatic Chunking → Embedding → Indexing
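In code form, the flow above can be sketched roughly as follows. Every function body here is a trivial stand‑in, and all names are illustrative — this is not Pop’s actual API, only the shape of the pipeline:

```python
# Illustrative sketch of the processing flow; each step is a trivial
# stand-in for the real, format-aware logic described in later sections.
def parse(raw: str) -> str:
    return raw  # real parsing is format-specific (Section 2)

def clean(text: str) -> str:
    return " ".join(text.split())  # real cleaning removes noise (Section 3)

def chunk(text: str, size: int = 400) -> list[str]:
    # naive fixed-size split; real chunking is structure-aware (Section 4)
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str) -> list[float]:
    return [float(len(text))]  # stand-in for a real embedding model

def process(raw: str) -> list[dict]:
    chunks = chunk(clean(parse(raw)))
    return [{"order": i, "content": c, "embedding": embed(c)}
            for i, c in enumerate(chunks)]
```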
📄 2. Document Parsing
Pop automatically parses documents based on their formats:
| Document Type | Parsing Method |
|---|---|
| PDF | Text extraction + OCR for scanned pages |
| Word / PPT | Direct text extraction |
| Markdown | Parsed by heading structure |
| HTML / URL | Main‑content extraction (removes ads/navigation) |
| Code Files | Treated as plain text |
After parsing, Pop obtains a clean text version that proceeds to the cleaning stage.
🧹 3. Text Cleaning
To ensure high‑quality embeddings, Pop performs additional cleaning:
- Removes headers, footers, and page numbers
- Strips redundant blank lines and garbled characters
- Merges paragraphs split by hard line breaks
- Removes duplicated segments
- Normalizes inconsistent headings and lists
This helps significantly improve chunking and retrieval quality.
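A few of the rules above can be illustrated with simple regular expressions. This is only a sketch, not Pop’s cleaning code — header/footer and duplicate removal require layout information and are omitted here:

```python
import re

def clean_text(raw: str) -> str:
    """Illustrative cleaning pass mirroring some of the rules above."""
    lines = raw.splitlines()
    # Drop lines that contain only a page number.
    lines = [ln for ln in lines if not re.fullmatch(r"\s*\d+\s*", ln)]
    text = "\n".join(lines)
    # Collapse runs of blank lines into a single paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Merge paragraphs split by hard line breaks (single newline between text).
    text = re.sub(r"(?<=\S)\n(?=\S)", " ", text)
    return text.strip()
```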
🔀 4. Core Logic of Document Chunking
Chunking is one of the most critical steps in building a Knowledge Base.
Good chunks make AI retrieval precise; poor chunks lead to off‑topic answers.
Pop uses a hybrid intelligent chunking strategy:
4.1 Chunking by Heading Structure
Used for Markdown, PDF, Word, etc.:
Example:
# 1. System Overview
## 1.1 Feature List
### 1.1.1 User Module
Each heading level forms a natural segmentation boundary.
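A minimal heading‑based splitter for Markdown might look like the following — an illustrative sketch, not Pop’s parser:

```python
import re

def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into sections at heading boundaries (illustrative)."""
    sections, title, buf = [], "(preamble)", []
    for line in markdown.splitlines():
        m = re.match(r"#{1,6}\s+(.*)", line)
        if m:
            if buf:  # close out the previous section
                sections.append({"title": title, "content": "\n".join(buf).strip()})
            title, buf = m.group(1), []
        else:
            buf.append(line)
    sections.append({"title": title, "content": "\n".join(buf).strip()})
    return sections
```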
4.2 Semantic-Aware Paragraph Chunking
In addition to headings, Pop splits by natural paragraph groups:
- Avoid chunks that are too long
- Avoid chunks that are too short
- Maintain semantic coherence
- Preserve context links
Default strategy:
- Target chunk length: 250–500 Chinese characters
- Tolerance range: 200–600 characters
- Auto‑merge overly short paragraphs
- Auto‑split overly long paragraphs
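The merge/split strategy can be sketched as a post‑processing pass over paragraphs using the tolerance range above (200–600 characters). Sentence‑boundary handling here is simplified, and this is not Pop’s implementation:

```python
def normalize_chunks(paras: list[str], target_min: int = 200,
                     target_max: int = 600) -> list[str]:
    """Merge overly short paragraphs, then split overly long ones."""
    merged: list[str] = []
    for p in paras:
        if merged and len(merged[-1]) < target_min:
            merged[-1] = merged[-1] + "\n" + p   # auto-merge short paragraphs
        else:
            merged.append(p)
    chunks: list[str] = []
    for p in merged:
        while len(p) > target_max:               # auto-split long paragraphs
            cut = p.rfind("。", 0, target_max)   # prefer a sentence boundary
            if cut == -1:
                cut = target_max - 1             # fall back to a hard cut
            chunks.append(p[:cut + 1].strip())
            p = p[cut + 1:]
        if p.strip():
            chunks.append(p.strip())
    return chunks
```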
4.3 Special Handling Rules
Different content types use optimized strategies:
| Content Type | Handling Strategy |
|---|---|
| Lists (bullet/number) | Kept as a single chunk |
| Tables | Flattened into clean plain text |
| Code blocks | Stored as independent chunks |
| Form fields | Compressed into key‑value lines |
| Images (OCR) | Extracts meaningful OCR text |
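For tables, “flattened into clean plain text” could look like the sketch below, which renders each row as `header: value` pairs — one plausible reading of the strategy, since Pop’s exact output format is not specified here:

```python
def flatten_table(rows: list[list[str]], headers: list[str]) -> str:
    """Flatten table rows into retrieval-friendly plain text (illustrative)."""
    lines = []
    for row in rows:
        lines.append("; ".join(f"{h}: {v}" for h, v in zip(headers, row)))
    return "\n".join(lines)
```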
🧩 5. Chunk Storage Structure
Each chunk is stored as a structured object, for example:
{
"id": "chunk_xxx",
"kb_id": "kb_abc",
"document_id": "doc_123",
"content": "...",
"tokens": 356,
"embedding": [0.12, -0.44, ...],
"order": 17,
"metadata": {
"title": "1.2 System Features",
"page": 5,
"source": "xxx.pdf"
}
}
Field description:
| Field | Description |
|---|---|
| id | Unique ID of the chunk |
| kb_id | Parent knowledge base |
| document_id | Parent document |
| content | Text content of the chunk |
| tokens | Token count |
| embedding | Vector embedding |
| order | Position in the original document |
| metadata | Title, page, source info |
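The stored object above maps naturally onto a small data class. This mirror is purely illustrative — it is not Pop’s internal schema definition:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """In-memory mirror of the stored chunk object (illustrative)."""
    id: str                # unique ID of the chunk
    kb_id: str             # parent knowledge base
    document_id: str       # parent document
    content: str           # text content of the chunk
    tokens: int            # token count
    embedding: list[float] # vector embedding
    order: int             # position in the original document
    metadata: dict = field(default_factory=dict)  # title, page, source info
```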
🔎 6. Index Structures
Pop builds two types of indexes:
6.1 KNN Vector Index
Used for semantic retrieval:
- Embeddings generated by the selected model
- Stored in a high‑dimensional vector database
- Supports 768D, 1024D, etc.
- Enables semantic recall
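Semantic recall can be illustrated with brute‑force cosine similarity over chunk embeddings; in practice, a vector database replaces this linear scan with an approximate nearest‑neighbor index:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def knn(query: list[float], chunks: list[dict], k: int = 3) -> list[dict]:
    """Return the k chunks whose embeddings are closest to the query."""
    return sorted(chunks, key=lambda c: cosine(query, c["embedding"]),
                  reverse=True)[:k]
```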
6.2 BM25 Text Index
Ideal for:
- Keyword matching
- FAQ search
- Terminology lookup
Both work together for hybrid retrieval.
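One common way to combine the two result lists is reciprocal rank fusion (RRF). The fusion method Pop actually uses is not specified, so treat this as a sketch of the general technique:

```python
def hybrid_rank(vector_ids: list[str], bm25_ids: list[str],
                k: int = 60) -> list[str]:
    """Merge two ranked ID lists with reciprocal rank fusion (illustrative)."""
    scores: dict[str, float] = {}
    for ranked in (vector_ids, bm25_ids):
        for rank, cid in enumerate(ranked):
            # Each list contributes 1 / (k + rank); k dampens rank outliers.
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```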
🧠 7. Why Is Chunking So Important?
Chunking forms the foundation of RAG (Retrieval-Augmented Generation).
It directly impacts:
- Retrieval accuracy
- Whether correct passages are found
- Whether answers include proper context
- Traceability to source documents
Good chunks = high‑quality RAG
Bad chunks = off‑topic or incomplete answers.
Pop’s intelligent chunking provides consistent, reliable results.
📌 Summary
Pop implements a full pipeline:
- Document parsing
- Cleaning and noise removal
- Intelligent chunking
- Embedding generation
- BM25 indexing
- High‑performance storage
These form the core of Pop’s Knowledge Base, enabling accurate and traceable AI answers based on document content.