4.4 Knowledge Base Storage Structure and Chunking Rules
Pop’s Knowledge Base delivers high‑quality retrieval and accurate answers because of its core processing pipeline:
⭑ Document Parsing → Cleaning → Chunking → Embedding → Structured Storage
This section explains how Pop splits documents into searchable segments and how those segments are stored internally.
🗂️ 1. Overview of Knowledge Base Storage Structure
Each Knowledge Base consists of three major data layers:
Knowledge Base (KB)
├── Documents
├── Chunks
└── Indexes
    ├── Vector Index (KNN)
    └── BM25 Text Index
Processing flow:
Original Document → Parsing → Cleaning → Automatic Chunking → Embedding → Indexing
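In code form, the flow above can be sketched roughly as follows. Every function body here is a trivial stand‑in, and all names are illustrative — this is not Pop’s actual API, only the shape of the pipeline:

```python
# Illustrative sketch of the processing flow; each step is a trivial
# stand-in for the real, format-aware logic described in later sections.
def parse(raw: str) -> str:
    return raw  # real parsing is format-specific (Section 2)

def clean(text: str) -> str:
    return " ".join(text.split())  # real cleaning removes noise (Section 3)

def chunk(text: str, size: int = 400) -> list[str]:
    # naive fixed-size split; real chunking is structure-aware (Section 4)
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str) -> list[float]:
    return [float(len(text))]  # stand-in for a real embedding model

def process(raw: str) -> list[dict]:
    chunks = chunk(clean(parse(raw)))
    return [{"order": i, "content": c, "embedding": embed(c)}
            for i, c in enumerate(chunks)]
```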
📄 2. Document Parsing
Pop automatically parses documents based on their formats:
| Document Type | Parsing Method |
|---|---|
| PDF | Text extraction + OCR for scanned pages |
| Word / PPT | Direct text extraction |
| Markdown | Parsed by heading structure |
| HTML / URL | Main‑content extraction (removes ads/navigation) |
| Code Files | Treated as plain text |
After parsing, Pop obtains a clean text version that proceeds to the cleaning stage.
🧹 3. Text Cleaning
To ensure high‑quality embeddings, Pop performs additional cleaning:
- Removes headers, footers, and page numbers
- Strips redundant blank lines and garbled characters
- Merges paragraphs split by hard line breaks
- Removes duplicated segments
- Normalizes inconsistent headings and lists
This helps significantly improve chunking and retrieval quality.
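A few of the rules above can be illustrated with simple regular expressions. This is only a sketch, not Pop’s cleaning code — header/footer and duplicate removal require layout information and are omitted here:

```python
import re

def clean_text(raw: str) -> str:
    """Illustrative cleaning pass mirroring some of the rules above."""
    lines = raw.splitlines()
    # Drop lines that contain only a page number.
    lines = [ln for ln in lines if not re.fullmatch(r"\s*\d+\s*", ln)]
    text = "\n".join(lines)
    # Collapse runs of blank lines into a single paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Merge paragraphs split by hard line breaks (single newline between text).
    text = re.sub(r"(?<=\S)\n(?=\S)", " ", text)
    return text.strip()
```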
🔀 4. Core Logic of Document Chunking
Chunking is one of the most critical steps in building a Knowledge Base.
Good chunks make AI retrieval precise; poor chunks lead to off‑topic answers.
Pop uses a hybrid intelligent chunking strategy:
4.1 Chunking by Heading Structure
Used for Markdown, PDF, Word, etc.:
Example:
# 1. System Overview
## 1.1 Feature List
### 1.1.1 User Module
Each heading level forms a natural segmentation boundary.
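A minimal heading‑based splitter for Markdown might look like the following — an illustrative sketch, not Pop’s parser:

```python
import re

def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into sections at heading boundaries (illustrative)."""
    sections, title, buf = [], "(preamble)", []
    for line in markdown.splitlines():
        m = re.match(r"#{1,6}\s+(.*)", line)
        if m:
            if buf:  # close out the previous section
                sections.append({"title": title, "content": "\n".join(buf).strip()})
            title, buf = m.group(1), []
        else:
            buf.append(line)
    sections.append({"title": title, "content": "\n".join(buf).strip()})
    return sections
```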
4.2 Semantic-Aware Paragraph Chunking
In addition to headings, Pop splits by natural paragraph groups:
- Avoid chunks that are too long
- Avoid chunks that are too short
- Maintain semantic coherence
- Preserve context links
Default strategy:
- Target chunk length: 250–500 Chinese characters
- Tolerance range: 200–600 characters
- Auto‑merge overly short paragraphs
- Auto‑split overly long paragraphs
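The merge/split strategy can be sketched as a post‑processing pass over paragraphs using the tolerance range above (200–600 characters). Sentence‑boundary handling here is simplified, and this is not Pop’s implementation:

```python
def normalize_chunks(paras: list[str], target_min: int = 200,
                     target_max: int = 600) -> list[str]:
    """Merge overly short paragraphs, then split overly long ones."""
    merged: list[str] = []
    for p in paras:
        if merged and len(merged[-1]) < target_min:
            merged[-1] = merged[-1] + "\n" + p   # auto-merge short paragraphs
        else:
            merged.append(p)
    chunks: list[str] = []
    for p in merged:
        while len(p) > target_max:               # auto-split long paragraphs
            cut = p.rfind("。", 0, target_max)   # prefer a sentence boundary
            if cut == -1:
                cut = target_max - 1             # fall back to a hard cut
            chunks.append(p[:cut + 1].strip())
            p = p[cut + 1:]
        if p.strip():
            chunks.append(p.strip())
    return chunks
```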
4.3 Special Handling Rules
Different content types use optimized strategies:
| Content Type | Handling Strategy |
|---|---|
| Lists (bullet/number) | Kept as a single chunk |
| Tables | Flattened into clean plain text |
| Code blocks | Stored as independent chunks |
| Form fields | Compressed into key‑value lines |
| Images (OCR) | Extracts meaningful OCR text |
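For tables, “flattened into clean plain text” could look like the sketch below, which renders each row as `header: value` pairs — one plausible reading of the strategy, since Pop’s exact output format is not specified here:

```python
def flatten_table(rows: list[list[str]], headers: list[str]) -> str:
    """Flatten table rows into retrieval-friendly plain text (illustrative)."""
    lines = []
    for row in rows:
        lines.append("; ".join(f"{h}: {v}" for h, v in zip(headers, row)))
    return "\n".join(lines)
```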
🧩 5. Chunk Storage Structure
Each chunk is stored as a structured object, for example:
{
"id": "chunk_xxx",
"kb_id": "kb_abc",
"document_id": "doc_123",
"content": "...",
"tokens": 356,
"embedding": [0.12, -0.44, ...],
"order": 17,
"metadata": {
"title": "1.2 System Features",
"page": 5,
"source": "xxx.pdf"
}
}
Field description:
| Field | Description |
|---|---|
| id | Unique ID of the chunk |
| kb_id | Parent knowledge base |
| document_id | Parent document |
| content | Text content of the chunk |
| tokens | Token count |
| embedding | Vector embedding |
| order | Position in the original document |
| metadata | Title, page, source info |
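The stored object above maps naturally onto a small data class. This mirror is purely illustrative — it is not Pop’s internal schema definition:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """In-memory mirror of the stored chunk object (illustrative)."""
    id: str                # unique ID of the chunk
    kb_id: str             # parent knowledge base
    document_id: str       # parent document
    content: str           # text content of the chunk
    tokens: int            # token count
    embedding: list[float] # vector embedding
    order: int             # position in the original document
    metadata: dict = field(default_factory=dict)  # title, page, source info
```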
🔎 6. Index Structures
Pop builds two types of indexes:
6.1 KNN Vector Index
Used for semantic retrieval:
- Embeddings generated by the selected model
- Stored in a high‑dimensional vector database
- Supports 768D, 1024D, etc.
- Enables semantic recall
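Semantic recall can be illustrated with brute‑force cosine similarity over chunk embeddings; in practice, a vector database replaces this linear scan with an approximate nearest‑neighbor index:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def knn(query: list[float], chunks: list[dict], k: int = 3) -> list[dict]:
    """Return the k chunks whose embeddings are closest to the query."""
    return sorted(chunks, key=lambda c: cosine(query, c["embedding"]),
                  reverse=True)[:k]
```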
6.2 BM25 Text Index
Ideal for:
- Keyword matching
- FAQ search
- Terminology lookup
Both work together for hybrid retrieval.
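One common way to combine the two result lists is reciprocal rank fusion (RRF). The fusion method Pop actually uses is not specified, so treat this as a sketch of the general technique:

```python
def hybrid_rank(vector_ids: list[str], bm25_ids: list[str],
                k: int = 60) -> list[str]:
    """Merge two ranked ID lists with reciprocal rank fusion (illustrative)."""
    scores: dict[str, float] = {}
    for ranked in (vector_ids, bm25_ids):
        for rank, cid in enumerate(ranked):
            # Each list contributes 1 / (k + rank); k dampens rank outliers.
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```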
🧠 7. Why Is Chunking So Important?
Chunking forms the foundation of RAG (Retrieval-Augmented Generation).
It directly impacts:
- Retrieval accuracy
- Whether correct passages are found
- Whether answers include proper context
- Traceability to source documents
Good chunks = high‑quality RAG
Bad chunks = off‑topic or incomplete answers.
Pop’s intelligent chunking provides consistent, reliable results.
📌 Summary
Pop implements a full pipeline:
- Document parsing
- Cleaning and noise removal
- Intelligent chunking
- Embedding generation
- BM25 indexing
- High‑performance storage
These form the core of Pop’s Knowledge Base, enabling accurate and traceable AI answers based on document content.