
4.4 Knowledge Base Storage Structure and Chunking Rules

Pop’s Knowledge Base can deliver high‑quality retrieval and accurate answers because of its core processing pipeline:
⭑ Document Parsing → Cleaning → Chunking → Embedding → Structured Storage

This section explains how Pop splits documents into searchable segments and how those segments are stored internally.


🗂️ 1. Overview of Knowledge Base Storage Structure

Each Knowledge Base consists of three major data layers:

Knowledge Base (KB)
 ├── Documents
 ├── Chunks
 └── Indexes
       ├── Vector Index (KNN)
       └── BM25 Text Index

Processing flow:

Original Document → Parsing → Cleaning → Automatic Chunking → Embedding → Indexing
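
As a rough illustration of this flow, the sketch below chains the stages together in Python. Every function here is a simplified, hypothetical stand‑in for the corresponding stage, not Pop’s actual internals.

```python
# Minimal sketch of the ingestion flow; every function is a simplified,
# hypothetical stand-in for the corresponding pipeline stage.

def parse(raw: str) -> str:                 # Section 2: format-aware parsing
    return raw

def clean(text: str) -> str:                # Section 3: noise removal
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())

def chunk(text: str, size: int = 400) -> list[str]:   # Section 4: chunking
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(content: str) -> list[float]:     # Section 6.1: embedding (placeholder vector)
    return [0.0] * 768

def ingest(raw: str, kb_id: str) -> list[dict]:
    records = []
    for order, content in enumerate(chunk(clean(parse(raw)))):
        records.append({"kb_id": kb_id, "order": order,
                        "content": content, "embedding": embed(content)})
    return records   # ready for vector + BM25 indexing (Section 6)
```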

📄 2. Document Parsing

Pop automatically parses documents based on their formats:

| Document Type | Parsing Method |
| --- | --- |
| PDF | Text extraction, plus OCR for scanned pages |
| Word / PPT | Direct text extraction |
| Markdown | Parsed by heading structure |
| HTML / URL | Main‑content extraction (ads and navigation removed) |
| Code files | Treated as plain text |

After parsing, Pop obtains a plain‑text version of the document, which then moves on to the cleaning stage.
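
The dispatch sketch below gives a rough sense of how a format‑to‑strategy mapping can look. The extensions and strategy labels are illustrative assumptions, not Pop’s internal identifiers.

```python
from pathlib import Path

# Hypothetical mapping from file extension to parsing strategy; the
# extensions and labels are illustrative, not Pop's internal identifiers.
PARSING_STRATEGY = {
    ".pdf":  "text extraction + OCR fallback for scanned pages",
    ".docx": "direct text extraction",
    ".pptx": "direct text extraction",
    ".md":   "parsed by heading structure",
    ".html": "main-content extraction (ads/navigation removed)",
}

def parsing_strategy(path: str) -> str:
    """Return the parsing strategy for a file, defaulting to plain text."""
    return PARSING_STRATEGY.get(Path(path).suffix.lower(), "treated as plain text")

print(parsing_strategy("handbook.pdf"))   # text extraction + OCR fallback for scanned pages
print(parsing_strategy("main.go"))        # treated as plain text
```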


🧹 3. Text Cleaning

To ensure high‑quality embeddings, Pop performs additional cleaning:

  • Removes headers, footers, and page numbers
  • Strips redundant blank lines and stray or garbled characters
  • Merges paragraphs split by hard line breaks
  • Removes duplicated segments
  • Normalizes inconsistent headings and lists

These steps significantly improve chunking and retrieval quality.
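
For a rough sense of what such a pass can look like, the regex‑based sketch below implements a few of these steps. The specific patterns are simplified assumptions for illustration, not Pop’s actual cleaning rules.

```python
import re

# Illustrative cleanup pass; the patterns are simplified assumptions,
# not Pop's actual cleaning rules.
def clean_text(text: str) -> str:
    # Drop standalone page-number lines (e.g. a line containing only "12").
    text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.MULTILINE)
    # Merge paragraphs split by hard line breaks: a single newline
    # between two non-empty lines becomes a space.
    text = re.sub(r"(?<=\S)\n(?=\S)", " ", text)
    # Collapse runs of blank lines into a single paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Strip non-printable control characters.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    return text.strip()
```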


🔀 4. Core Logic of Document Chunking

Chunking is one of the most critical steps in building a Knowledge Base.
Good chunks make AI retrieval precise; poor chunks lead to off‑topic answers.

Pop uses a hybrid intelligent chunking strategy:


4.1 Chunking by Heading Structure

Used for Markdown, PDF, Word, etc.:

Example:

# 1. System Overview
## 1.1 Feature List
### 1.1.1 User Module

Each heading level forms a natural segmentation boundary.
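
As a concrete sketch of this idea, the splitter below segments Markdown at each heading line. It illustrates the general technique under simple assumptions and is not Pop’s exact splitter.

```python
import re

# Minimal heading-based splitter for Markdown; a sketch of the general
# technique, not Pop's exact implementation.
def split_by_headings(markdown: str) -> list[dict]:
    sections, title, lines = [], None, []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):      # each heading opens a new segment
            if title is not None or lines:
                sections.append({"title": title or "(preamble)",
                                 "content": "\n".join(lines).strip()})
            title, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    if title is not None or lines:
        sections.append({"title": title or "(preamble)",
                         "content": "\n".join(lines).strip()})
    return sections
```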


4.2 Semantic-Aware Paragraph Chunking

In addition to headings, Pop splits by natural paragraph groups:

  • Avoid chunks that are too long
  • Avoid chunks that are too short
  • Maintain semantic coherence
  • Preserve context links

Default strategy (a sketch of the merge/split logic follows this list):

  • Target chunk length: 250–500 Chinese characters per chunk
  • Tolerance range: 200–600 characters
  • Auto‑merge overly short paragraphs
  • Auto‑split overly long paragraphs
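
The sketch below shows one way such length‑based merging and splitting could work. The thresholds mirror the defaults above, but the logic itself is an illustrative assumption rather than Pop’s exact algorithm.

```python
# Illustrative length-based merge/split pass; the thresholds mirror the
# defaults above, but the logic is an assumption, not Pop's exact algorithm.
TARGET_MAX = 500       # preferred upper bound on chunk length (characters)
TOLERANCE_MAX = 600    # absolute upper bound before forcing a split

def normalize_chunks(paragraphs: list[str]) -> list[str]:
    chunks, buffer = [], ""
    for para in paragraphs:
        candidate = f"{buffer}\n{para}".strip() if buffer else para
        if len(candidate) <= TARGET_MAX:
            buffer = candidate                 # keep merging short paragraphs
        else:
            if buffer:
                chunks.append(buffer)          # flush the accumulated chunk
            buffer = para
        while len(buffer) > TOLERANCE_MAX:     # split an overly long paragraph
            chunks.append(buffer[:TARGET_MAX])
            buffer = buffer[TARGET_MAX:]
    if buffer:
        chunks.append(buffer)
    return chunks
```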

4.3 Special Handling Rules

Different content types use optimized strategies:

| Content Type | Handling Strategy |
| --- | --- |
| Lists (bulleted / numbered) | Kept as a single chunk |
| Tables | Flattened into clean plain text |
| Code blocks | Stored as independent chunks |
| Form fields | Compressed into key‑value lines |
| Images (OCR) | Meaningful OCR text extracted |
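
As an example of the "Tables" and "Form fields" rules, the helper below flattens tabular data into key‑value lines. It is a hypothetical illustration of the idea, not Pop’s actual flattening logic.

```python
# One possible way to flatten a table into key-value text; a hypothetical
# illustration of the rules above, not Pop's actual flattening logic.
def flatten_table(header: list[str], rows: list[list[str]]) -> str:
    lines = []
    for row in rows:
        pairs = [f"{col}: {cell}" for col, cell in zip(header, row)]
        lines.append("; ".join(pairs))
    return "\n".join(lines)

print(flatten_table(["Name", "Role"], [["Alice", "Admin"], ["Bob", "Viewer"]]))
# Name: Alice; Role: Admin
# Name: Bob; Role: Viewer
```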

🧩 5. Chunk Storage Structure

Each chunk is stored as a structured object, for example:

{
  "id": "chunk_xxx",
  "kb_id": "kb_abc",
  "document_id": "doc_123",
  "content": "...",
  "tokens": 356,
  "embedding": [0.12, -0.44, ...],
  "order": 17,
  "metadata": {
      "title": "1.2 System Features",
      "page": 5,
      "source": "xxx.pdf"
  }
}

Field description:

| Field | Description |
| --- | --- |
| id | Unique ID of the chunk |
| kb_id | Parent knowledge base |
| document_id | Parent document |
| content | Text content of the chunk |
| tokens | Token count |
| embedding | Vector embedding |
| order | Position in the original document |
| metadata | Title, page number, and source file |

🔎 6. Index Structures

Pop builds two types of indexes:

6.1 KNN Vector Index

Used for semantic retrieval (a brief similarity sketch follows this list):

  • Embeddings are generated by the selected embedding model
  • Vectors are stored in a high‑dimensional vector database
  • Supports 768‑ and 1024‑dimensional vectors, among others
  • Enables semantic (meaning‑based) recall
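
The brute‑force search below illustrates the idea behind semantic recall: rank stored chunks by cosine similarity to the query embedding. A real vector database uses approximate‑nearest‑neighbor structures instead; this is only a sketch.

```python
import math

# Brute-force semantic search over stored chunk embeddings; a sketch of the
# idea behind the vector index, not Pop's actual vector database.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_search(query_vec: list[float], chunks: list[dict], k: int = 5) -> list[dict]:
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["embedding"]), reverse=True)
    return ranked[:k]
```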

6.2 BM25 Text Index

Ideal for:

  • Keyword matching
  • FAQ search
  • Terminology lookup

The two indexes work together for hybrid retrieval, as sketched below.
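
One common way to merge the two ranked result lists is reciprocal rank fusion. The sketch below uses that approach purely as an illustrative assumption, since this section does not specify Pop’s exact fusion method.

```python
# Hybrid result fusion via reciprocal rank fusion (RRF); this particular
# fusion method is an assumption for illustration, not confirmed as Pop's.
def reciprocal_rank_fusion(vector_hits: list[str], bm25_hits: list[str],
                           k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for hits in (vector_hits, bm25_hits):
        for rank, chunk_id in enumerate(hits):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# A chunk ranked well by both indexes rises to the top.
print(reciprocal_rank_fusion(["c3", "c1", "c7"], ["c1", "c9", "c3"]))
# ['c1', 'c3', 'c9', 'c7']
```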


🧠 7. Why Is Chunking So Important?

Chunking forms the foundation of RAG (Retrieval-Augmented Generation).
It directly impacts:

  • Retrieval accuracy
  • Whether correct passages are found
  • Whether answers include proper context
  • Traceability to source documents

Good chunks = high‑quality RAG.
Bad chunks = the model struggles to answer correctly.

Pop’s intelligent chunking provides consistent, reliable results.


📌 Summary

Pop implements a full pipeline:

  • Document parsing
  • Cleaning and noise removal
  • Intelligent chunking
  • Embedding generation
  • Vector and BM25 indexing
  • High‑performance storage

These form the core of Pop’s Knowledge Base, enabling accurate and traceable AI answers based on document content.