
3.7 Multimodal Understanding

Pop integrates advanced multimodal AI capabilities, enabling your assistant not only to read text but also to analyze images, PDFs, screenshots, spreadsheets, and structured content.
Whether it’s a technical diagram, an error screenshot, a scanned PDF, a design mockup, or tabular data, AI can help you extract, summarize, and interpret the information.

This chapter introduces all multimodal features supported by Pop and how to use them.


🎯 Use Cases and Advantages of Multimodal AI

Multimodal understanding can help solve real-world problems:

  • Error screenshot: recognize error messages, locate issues, suggest fixes
  • Image understanding: read UI elements, interpret diagrams, analyze layout
  • PDF summarization: convert lengthy documents into structured summaries
  • Scanning / OCR: detect and extract text from scanned or photographed pages
  • Table understanding: extract table data into JSON/Markdown
  • Diagram analysis: understand architecture diagrams, flowcharts, reports
  • File-based Q&A: answer questions based on image/PDF content

🖼️ 1. Image Understanding

You can drag images directly into the chat window or use the Upload Image feature.

Supported formats:

  • PNG / JPG / JPEG
  • Screenshots (macOS / Windows)
  • UI captures, error screenshots
  • Diagrams and flowcharts

AI can:

  • Read and interpret text
  • Detect UI components
  • Summarize image content
  • Translate text inside images
  • Suggest code fixes from error screenshots
  • Extract structured information from charts

📄 2. PDF Understanding

Pop supports direct analysis of PDF files via drag-and-drop.

AI can handle:

  • Text-based PDFs (standard digital documents)
  • Scanned PDFs (automatic OCR)
  • Multi-page technical documents
  • Papers, specifications, reports

Supported functions include:

✔ PDF Summaries

Generate chapter summaries, key points, and structured outlines.

✔ PDF Q&A

Locate answers precisely from PDF content.

✔ Structural Extraction

AI can output:

  • Paragraphs
  • Table of contents tree
  • Tables
  • Code blocks
  • Highlight summaries
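As an illustration of the table-of-contents tree mentioned above, the extracted structure can be thought of as a simple recursive node type. A minimal sketch in Python (the `OutlineNode` class and section titles are hypothetical, not a Pop API):

```python
from dataclasses import dataclass, field


@dataclass
class OutlineNode:
    # One node in an extracted table-of-contents tree
    title: str
    children: list["OutlineNode"] = field(default_factory=list)


def render_outline(node: "OutlineNode", depth: int = 0) -> list[str]:
    # Flatten the tree into indented lines, two spaces per level
    lines = [("  " * depth) + node.title]
    for child in node.children:
        lines.extend(render_outline(child, depth + 1))
    return lines


toc = OutlineNode("1 Introduction", [OutlineNode("1.1 Background")])
print("\n".join(render_outline(toc)))
```

A tree like this is what makes "chapter summaries" and outline export possible downstream.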

✔ OCR for Scanned PDFs

Contracts, receipts, and photographed book pages can all be recognized and extracted as text.


📁 3. File Understanding (File Insight)

Beyond images and PDFs, Pop can interpret many file types:

  • Word (doc/docx)
  • Excel (xls/xlsx)
  • Markdown (md)
  • Text files (txt/log/json/yaml)
  • Source code files

Simply drop files into the chat and AI parses them automatically.
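Conceptually, this kind of file handling starts with detecting the file type and routing it to the right parser. A toy sketch of that dispatch step (the mapping and category names are illustrative; Pop's real detection may also inspect file contents, not just the extension):

```python
from pathlib import Path

# Hypothetical extension-to-parser mapping, not a Pop API
PARSERS = {
    ".docx": "word-document",
    ".xlsx": "spreadsheet",
    ".md": "markdown",
    ".json": "structured-text",
    ".py": "source-code",
}


def classify(filename: str) -> str:
    # Unknown extensions fall back to plain-text handling
    return PARSERS.get(Path(filename).suffix.lower(), "plain-text")


print(classify("report.md"))   # markdown
print(classify("server.log"))  # plain-text
```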


🔍 4. OCR & Text Extraction

Pop includes a built-in OCR engine supporting:

  • Scanned contracts
  • Phone-captured documents
  • Whiteboard photos
  • Table screenshots
  • Receipts & invoices

OCR output formats include:

  • Raw text
  • Markdown
  • Structured JSON

Example:

{
  "headers": ["Item", "Quantity", "Amount"],
  "rows": []
}
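Once OCR output is in structured JSON, converting between the formats above is a simple transformation. A minimal sketch (the function and its headers/rows schema are illustrative, not a Pop API):

```python
def table_to_markdown(headers, rows):
    # Build a pipe-delimited Markdown table from headers and row lists
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


md = table_to_markdown(["Item", "Quantity", "Amount"],
                       [["Pen", 2, "$4.00"]])
print(md)
```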

📊 5. Data Extraction & Structured Output

AI can extract structured information such as:

  • Invoice fields
  • Table-to-JSON conversion
  • Report → bullet list
  • Form extraction
  • Key-value field detection

Examples:

“Extract all prices from this image and output JSON.”

or

“Convert the full chapter structure of this PDF into an outline.”
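The first prompt above amounts to a post-processing step like the following sketch (the regex and helper are purely illustrative; Pop's extraction is model-driven and handles far messier layouts than a pattern match):

```python
import json
import re


def extract_prices(text: str) -> str:
    # Match dollar amounts like $3.50 or $12 and emit them as JSON
    prices = re.findall(r"\$\d+(?:\.\d{2})?", text)
    return json.dumps(prices)


print(extract_prices("Coffee $3.50, Bagel $2"))  # ["$3.50", "$2"]
```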


🤖 6. Multimodal + Workflows

In workflows, AI nodes can also process files:

  • Read uploaded PDFs automatically
  • Summarize and forward results to next nodes
  • Extract tables from screenshots and convert to Excel
  • Perform OCR on images and use the output for logic decisions

This enables fully automated pipelines.
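Conceptually, such a pipeline is just a chain of steps where each node's output feeds the next node's input. A toy sketch with stubbed nodes (none of these function names are Pop APIs, and the OCR result is hard-coded for illustration):

```python
def ocr_node(image_bytes: bytes) -> str:
    # Stub: a real node would run the OCR engine on the image
    return "Invoice total: $42.00"


def decision_node(text: str) -> str:
    # Route downstream work based on the extracted text
    return "needs-approval" if "$" in text else "archive"


def run_pipeline(image_bytes: bytes) -> str:
    # Each node's output feeds the next node's input
    return decision_node(ocr_node(image_bytes))


print(run_pipeline(b""))  # needs-approval
```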


🛠 How to Use Multimodal Features

1. Drag-and-drop files into the chat

AI identifies and processes the file type instantly.

2. Use the Upload File button

Supports multiple file formats.

3. Automate processing via workflows

Ideal for repetitive tasks.

4. Use images/PDFs inside the knowledge base

They can serve as KB resources for RAG.


🔐 Privacy & Local Security

Pop follows a strict privacy model:

  • Local parsing first (OCR, PDF text extraction)
  • Cloud model uploads happen only with your consent
  • No storage or third-party sharing

When using local models (e.g., Ollama), all multimodal data stays on your device.


📌 Summary

Pop’s multimodal capabilities extend AI far beyond text:

  • Images
  • Screenshots
  • PDFs
  • Tables
  • Text files

These features enable high-level tasks like summarization, analysis, Q&A, OCR, and structured extraction, making Pop ideal for productivity, documentation, and data workflows.