Intelligent Document Ingestion

The document zoo

Enterprise knowledge does not live in neat markdown files. It lives in a chaotic ecosystem of formats, systems, and access patterns that accumulated over decades. Before you can embed a single vector, you have to get that knowledge out.

Here is what a typical enterprise document landscape looks like:

Structured documents. Internal wikis (Confluence, Notion), SharePoint sites, knowledge bases, CRM records, structured databases. These have APIs, consistent formats, and usually clear metadata. They are the easiest to ingest -- but they represent perhaps 20-30% of enterprise knowledge.

Semi-structured documents. PDFs (reports, contracts, policies), Word documents, PowerPoint presentations, Excel spreadsheets. They have visual structure (headings, tables, sections) but extracting that structure programmatically is hard. This is the bulk of most enterprise corpora -- 40-50%.

Unstructured content. Email archives, Slack/Teams messages, meeting transcripts, scanned documents (paper that was photographed or passed through a flatbed scanner), handwritten notes. Minimal structure, often no metadata, sometimes not even machine-readable text. This is the remaining 20-30% and contains some of the highest-value institutional knowledge.

Media. Diagrams, flowcharts, architectural drawings, photographs of whiteboards, recorded presentations. This content is semantically rich but was historically invisible to text-based RAG systems. Vision-capable models are changing this.

The ingestion pipeline is the least glamorous and most important part of your RAG system. Poor ingestion -- garbled text, lost tables, stripped metadata -- propagates errors through every downstream stage. No amount of sophisticated chunking, embedding, or retrieval can compensate for bad input data.

You are surveying your organisation's document landscape to plan a RAG deployment. Which document source is likely to be the most technically challenging to ingest?

The PDF extraction problem

PDF is the most common document format in enterprise corpora and the most difficult to extract correctly. Here is why.

A PDF is not a document. It is a set of drawing instructions. A page's content stream effectively says "draw character 'T' at coordinates (72, 640), draw character 'h' at coordinates (78, 640)..." There is no inherent concept of words, sentences, paragraphs, headings, or tables. What you see when you view a PDF is a rendering engine reconstructing visual structure from low-level drawing commands.

This means extracting text from a PDF requires the extraction tool to reverse-engineer the visual layout:

Word formation. Characters with close horizontal proximity are grouped into words. Simple in theory, but variable character spacing (kerning) and justified text make it ambiguous. Is that a space between characters or just wide kerning?

Line detection. Characters at the same vertical position form a line. But multi-column layouts mean characters at the same vertical position might belong to different columns. Without detecting column boundaries, you get interleaved text: "The vendor shall deliver Annual revenue exceeded" -- alternating fragments from two columns.

Paragraph boundaries. A larger vertical gap between lines signals a paragraph break. But how large? It varies by document. And headers, footers, and footnotes create additional false paragraph boundaries.

Table extraction. Tables are the hardest. A PDF table is just characters positioned in a grid-like pattern. There are no actual cell boundaries, row markers, or column delimiters. The extraction tool must detect alignment patterns, infer the grid structure, and reconstruct rows and columns. Merged cells, nested tables, and tables that span multiple pages make this significantly harder.

Headers, footers, and page numbers. These repeat on every page and must be detected and excluded, or they pollute every chunk with "Page 47 of 132 -- CONFIDENTIAL" noise.
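To make the first two steps concrete, here is a minimal sketch of word formation and line detection over positioned characters. The `Glyph` type, the fixed-width `typeset` helper, and both thresholds are assumptions made up for illustration; real extractors work from font metrics and far messier input.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    char: str
    x: float      # left edge, in points
    y: float      # baseline, in points (PDF origin is bottom-left)
    width: float  # advance width, in points

def group_lines(glyphs, y_tolerance=2.0):
    """Group glyphs with nearly equal baselines into lines, top of page first."""
    lines = []
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x)):
        if lines and abs(lines[-1][0].y - g.y) <= y_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])
    return [sorted(line, key=lambda g: g.x) for line in lines]

def line_to_text(line, space_gap=1.5):
    """Insert a space wherever the horizontal gap between glyphs exceeds space_gap."""
    out = [line[0].char]
    for prev, cur in zip(line, line[1:]):
        if cur.x - (prev.x + prev.width) > space_gap:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)

def typeset(text, x, y, char_width=6.0):
    """Demo helper: place each non-space character on a fixed-width grid."""
    return [Glyph(c, x + i * char_width, y, char_width)
            for i, c in enumerate(text) if c != " "]

# Two columns that share the same baselines, as in a two-column report.
page = typeset("The vendor", 72, 640) + typeset("Annual revenue", 320, 640)
page += typeset("shall deliver", 72, 626) + typeset("exceeded", 320, 626)

text = [line_to_text(line) for line in group_lines(page)]
print(text)
```

Because nothing here detects the column boundary, each output line runs straight across both columns. A layout-aware tool would split the glyphs by column first and only then form lines.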

The consequence: no PDF extraction tool is perfect. Every tool makes trade-offs between speed, accuracy, and the types of documents it handles well. Your ingestion pipeline must account for extraction errors and include quality checks.
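One clean-up step that is easy to add yourself is removing repeated headers and footers. The sketch below assumes each page arrives as a list of text lines and treats any line that recurs on most pages, once digits are masked, as boilerplate. The 60% threshold and the sample pages are illustrative assumptions.

```python
import re
from collections import Counter

def normalise(line):
    """Mask digits so 'Page 3 of 132' and 'Page 47 of 132' compare equal."""
    return re.sub(r"\d+", "#", line.strip())

def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines whose digit-masked form appears on at least min_fraction of pages."""
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page; ignore blank lines.
        counts.update({normalise(line) for line in page if line.strip()})
    threshold = min_fraction * len(pages)
    repeated = {key for key, n in counts.items() if n >= threshold}
    return [[line for line in page if normalise(line) not in repeated]
            for page in pages]

# Hypothetical three-page extract with a running header and page footer.
pages = [
    ["ACME Corp -- CONFIDENTIAL", "The vendor shall deliver the goods.", "Page 1 of 3"],
    ["ACME Corp -- CONFIDENTIAL", "Payment is due within 30 days.", "Page 2 of 3"],
    ["ACME Corp -- CONFIDENTIAL", "This agreement is governed by law.", "Page 3 of 3"],
]
cleaned = strip_repeated_lines(pages)
print(cleaned)
```

This heuristic needs a multi-page document to work, and it will wrongly remove body text that legitimately repeats on most pages (a recurring table header, for example), so treat it as one quality check among several.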

You extract text from a 200-page contract PDF and notice the output contains interleaved text like 'The indemnification cap Annual review of the shall not exceed contract terms shall.' What is the most likely cause?

The extraction toolkit

Here are the open-source tools that handle enterprise document extraction, with honest assessments of each.

Unstructured.io (Apache 2.0). The most comprehensive open-source ingestion framework. It handles PDFs, Word, PowerPoint, HTML, email, images, and more. For PDFs, it uses a combination of OCR (Tesseract or PaddleOCR), layout detection (detectron2 or YOLOX), and table extraction. Its partitioning API automatically selects the right extraction strategy per document. The trade-off: the full pipeline (with layout detection and OCR) is slow -- expect 2-5 pages per second for complex PDFs. For simpler PDFs, the "fast" strategy skips layout analysis and processes 20+ pages per second, but at lower quality.

Docling (MIT, IBM Research). A more recent entrant focused specifically on high-quality document understanding. Docling uses a layout analysis model trained on the DocLayNet dataset and a table structure recognition model (TableFormer). It excels at complex layouts: multi-column, nested tables, figures with captions. Output is structured -- you get a document tree with typed elements (heading, paragraph, table, figure), not just flat text. Slower than Unstructured's fast mode (1-3 pages per second) but significantly more accurate on complex documents.

Marker (GPL-3.0). Optimised for converting PDFs to markdown. Marker is fast (10-20 pages per second) and handles typical business documents well. It uses a combination of heuristic rules and small ML models for layout detection. Less accurate than Docling on complex layouts but much faster. The GPL-3.0 licence may be a concern for some commercial deployments.

pypdf / pymupdf (BSD / AGPL). Low-level PDF libraries that extract raw text. Fast (50+ pages per second) but no layout analysis -- you get text in reading order as best the library can determine, which fails on multi-column and complex layouts. Use these as a fast first pass, with a fallback to Docling/Unstructured for documents that fail quality checks.

PaddleOCR (Apache 2.0). An OCR engine from Baidu, particularly strong for scanned documents and multilingual text. Often used as a component within Unstructured.io or custom pipelines rather than as a standalone ingestion tool.

The practical approach: use a tiered extraction strategy. Run fast extraction (pymupdf) on all documents. Apply quality checks (coherence scoring, table detection). Re-extract failed documents with layout-aware tools (Docling or Unstructured). This gives you the speed of fast extraction for the 60-70% of documents that are simple, and the accuracy of layout analysis for the ones that need it.
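A minimal sketch of the tiered strategy follows. The two extractors are passed in as plain callables, so `fake_fast` and `fake_layout` stand in for whatever tools you choose; they are not real library calls. The coherence score is a deliberately crude assumption: it measures the share of tokens that look like ordinary words, which catches garbled glyphs and broken OCR but not column interleaving.

```python
import re

WORDLIKE = re.compile(r"[A-Za-z][A-Za-z'\-]{0,19}[.,;:!?]?")

def coherence_score(text):
    """Fraction of whitespace-separated tokens that look like ordinary words."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if WORDLIKE.fullmatch(t)) / len(tokens)

def extract_tiered(path, fast_extract, layout_extract, min_score=0.8):
    """Try the fast extractor first; fall back to the layout-aware one on low quality."""
    text = fast_extract(path)
    if coherence_score(text) >= min_score:
        return text, "fast"
    return layout_extract(path), "layout"

# Stand-in extractors for the demo (hypothetical, not real tools).
def fake_fast(path):
    samples = {
        "clean.pdf": "The vendor shall deliver the goods on time.",
        "scan.pdf": "Th3 v#nd0r sh@ll d3l!v3r (cid:12) (cid:88) g00ds",
    }
    return samples[path]

def fake_layout(path):
    return "The vendor shall deliver the goods."

for name in ("clean.pdf", "scan.pdf"):
    text, tier = extract_tiered(name, fake_fast, fake_layout)
    print(name, tier)
```

In practice you would combine several signals (token quality, detected tables, text length per page) and log which tier each document used, so you can check whether the assumed share of simple documents holds for your corpus.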

Using vision models for documents

Gemma 4 and other recent vision-language models can see documents, not just read their text. This opens up extraction capabilities that text-based tools cannot match.

Table understanding. Rather than trying to reverse-engineer table structure from character positions, you can screenshot a PDF page containing a table and pass it to a vision model with the prompt: "Extract the table on this page as a markdown table. Preserve all headers, row labels, and numerical values exactly." Vision models handle merged cells, nested headers, and formatting that defeats traditional table extractors.

Chart and diagram interpretation. A bar chart showing quarterly revenue trends is semantically rich but invisible to text extraction. A vision model can describe the chart: "Q1: $4.2B, Q2: $3.8B, Q3: $4.1B, Q4: $4.5B. Revenue declined in Q2 before recovering. Full year total: $16.6B." That description can be embedded and retrieved.

Handwritten annotations. Scanned documents with handwritten marginalia (common in legal and engineering contexts) are often readable to vision models but poorly handled, or missed entirely, by conventional OCR. Marginal notes like "NOT APPROVED -- see revised terms 3/15" carry critical information.

Layout-as-context. A vision model understands that text in a sidebar is supplementary, that a bold heading introduces a new section, and that an italicised footnote is a qualification. This structural understanding can inform how you chunk the document.

The trade-off: vision model processing is slow and expensive compared to text extraction. Processing a single PDF page through Gemma 4 takes 2-5 seconds. For a 200-page document, that is 7-17 minutes per document. At 100,000 such documents, that is roughly 460 to 1,160 days of sequential processing on a single worker; even spread across dozens of parallel workers, you are looking at weeks of processing.
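A quick back-of-envelope check of these numbers, assuming every document has 200 pages. The `workers` parameter is an added assumption that shows how the total changes if pages are processed in parallel.

```python
def processing_days(documents, pages_per_doc, seconds_per_page, workers=1):
    """Wall-clock days needed if the work is split evenly across workers."""
    total_seconds = documents * pages_per_doc * seconds_per_page / workers
    return total_seconds / 86_400  # seconds per day

# Per-document cost at 2 and 5 seconds per page.
minutes_low = 200 * 2 / 60
minutes_high = 200 * 5 / 60
print(f"per document: {minutes_low:.1f} to {minutes_high:.1f} minutes")

# Whole corpus, sequentially and with a hypothetical pool of 32 workers.
for workers in (1, 32):
    low = processing_days(100_000, 200, 2, workers)
    high = processing_days(100_000, 200, 5, workers)
    print(f"{workers:>2} worker(s): {low:,.0f} to {high:,.0f} days")
```

With one worker the corpus takes about 463 to 1,157 days. With 32 workers it takes about 14 to 36 days, which is where a figure in weeks comes from.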

The practical approach: use vision models selectively. Run text extraction first. Flag pages with detected tables, charts, or low OCR confidence. Process only those pages through the vision model. This typically means 5-15% of pages go through vision processing -- a manageable workload.
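The routing rule can be sketched in a few lines. The sketch assumes your text-extraction pass already reports, per page, whether it saw a table or figure and how confident the OCR was. The `PageSignals` fields and the 0.85 threshold are illustrative assumptions, not values from any particular tool.

```python
from dataclasses import dataclass

@dataclass
class PageSignals:
    page_number: int
    has_table: bool
    has_figure: bool
    ocr_confidence: float  # 0.0 to 1.0; use 1.0 for born-digital text

def pages_for_vision(pages, min_confidence=0.85):
    """Return the page numbers that should be re-processed by the vision model."""
    return [
        p.page_number
        for p in pages
        if p.has_table or p.has_figure or p.ocr_confidence < min_confidence
    ]

# Hypothetical 20-page document: one table, one chart, one badly scanned page.
pages = [
    PageSignals(n, has_table=(n == 4), has_figure=(n == 9),
                ocr_confidence=0.6 if n == 17 else 0.98)
    for n in range(1, 21)
]
selected = pages_for_vision(pages)
share = len(selected) / len(pages)
print(selected, f"{share:.0%}")
```

Tracking the selected share per corpus tells you early whether the vision workload will stay manageable or whether the thresholds need tuning.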

Your corpus contains 50,000 PDF documents. About 30% have tables that text extraction handles poorly. How should you architect the extraction pipeline?

The real work: connectors and change detection

Extraction quality gets the attention, but connectors are where the engineering effort actually goes. A connector is the integration that pulls documents from a source system, handles authentication, respects rate limits, tracks what has changed, and feeds documents into your extraction pipeline.

Authentication. Confluence uses OAuth 2.0 or API tokens. SharePoint uses Microsoft Graph API with Azure AD tokens. Slack uses bot tokens with scoped permissions. Email archives might use IMAP, EWS, or Graph API. Each connector must handle token refresh, permission scoping, and credential rotation. For a system that ingests from 8-10 source systems, authentication management alone is a significant engineering surface.

Rate limiting. External APIs throttle requests, and the limits vary by vendor, service, and plan. Indicative figures: around 10 requests/second per user for Confluence's cloud API and 10,000 requests per 10 minutes for Microsoft Graph, while Slack's API has tiered rate limits by method. Your connectors must implement backoff, queuing, and parallelism that respects these limits. Ignoring rate limits can get your API access suspended.

Incremental sync. The initial ingestion is a one-time cost. The ongoing challenge is detecting what has changed since the last sync. Some systems provide change feeds or webhooks (Confluence has webhooks, SharePoint has delta queries). Others require you to poll and compare modification timestamps. Some (exported email archives, file shares you can only reach over the network) have no usable change detection at all -- you must hash the content and compare.
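The backoff behaviour described under rate limiting can be sketched as follows. `RateLimited` is a hypothetical exception that your own connector code would raise on an HTTP 429 response, and the delays are assumptions. The sketch honours a server-supplied retry delay when there is one and otherwise uses exponential backoff with jitter.

```python
import random
import time

class RateLimited(Exception):
    """Raised by a (hypothetical) connector call when the API answers HTTP 429."""
    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds from a Retry-After header, if any

def call_with_backoff(request, max_attempts=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call request(); on RateLimited, wait and retry up to max_attempts times."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            backoff = min(max_delay, base_delay * 2 ** attempt)
            # Prefer the server's hint; otherwise pick a random delay up to backoff.
            delay = err.retry_after if err.retry_after is not None \
                else random.uniform(0, backoff)
            sleep(delay)

# Demo: a request that is throttled twice, then succeeds.
attempts = {"n": 0}

def flaky_request():
    attempts["n"] += 1
    if attempts["n"] <= 2:
        raise RateLimited(retry_after=0.5)
    return {"status": 200}

waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_request, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the function testable without real delays. A production connector would usually pair this with a client-side limiter so it rarely hits the server's limit in the first place.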

Change detection patterns:

Source	Change detection method	Latency
Confluence	Webhooks + CQL query for modified pages	Near real-time
SharePoint	Microsoft Graph delta query	Minutes
Slack	Events API (webhooks)	Near real-time
Email (Exchange)	EWS sync state or Graph delta	Minutes
File shares (SMB/NFS)	Periodic polling with timestamp or content-hash comparison; inotify / FSEvents only when running on the file server itself	Seconds to hours
Databases	Change Data Capture (Debezium) or polling	Seconds to minutes
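For sources that offer no change feed, the hash-and-compare fallback looks roughly like this. The sketch assumes you keep a `{path: hash}` snapshot from the previous sync, and it reads file contents from an in-memory dict to stay self-contained; the paths are made up.

```python
import hashlib

def content_hash(data: bytes) -> str:
    """SHA-256 of the raw bytes, as a hex string."""
    return hashlib.sha256(data).hexdigest()

def diff_snapshot(previous, current_files):
    """Compare the last snapshot ({path: hash}) with current contents ({path: bytes}).

    Returns the new snapshot plus sorted lists of added, changed, and deleted paths.
    """
    current = {path: content_hash(data) for path, data in current_files.items()}
    added = sorted(p for p in current if p not in previous)
    changed = sorted(p for p in current if p in previous and current[p] != previous[p])
    deleted = sorted(p for p in previous if p not in current)
    return current, added, changed, deleted

# First sync: everything is new.
files_v1 = {"policies/leave.docx": b"v1", "contracts/acme.pdf": b"signed"}
snapshot, added, changed, deleted = diff_snapshot({}, files_v1)

# Second sync: one file edited, one removed, one created.
files_v2 = {"policies/leave.docx": b"v2", "policies/travel.docx": b"new"}
snapshot2, added2, changed2, deleted2 = diff_snapshot(snapshot, files_v2)
print(added2, changed2, deleted2)
```

Hashing every file on every poll is expensive on large shares. A common refinement is to compare size and modification time first and hash only the files where those differ.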

The "embed at write time" pattern. Rather than batch re-indexing on a schedule, the most responsive architecture embeds documents as they change. A change event triggers: extract, chunk, embed, upsert into the vector database. This keeps your index continuously fresh, with ingestion latency measured in seconds to minutes rather than hours to days.
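A minimal sketch of the change-event handler, under several stated assumptions: the event already carries extracted text (so the extract step is skipped), `toy_embed` is a placeholder for a real embedding model, and `InMemoryIndex` stands in for a vector database. All names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ChangeEvent:
    doc_id: str
    content: str  # already-extracted text, to keep the sketch short

def chunk(text, max_words=8):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def toy_embed(text):
    """Placeholder for a real embedding model: returns a 2-dimensional vector."""
    return [float(len(text)), float(len(text.split()))]

class InMemoryIndex:
    """Stand-in for a vector database, keyed by (doc_id, chunk_no)."""
    def __init__(self):
        self.rows = {}

    def replace_document(self, doc_id, chunks, vectors):
        # Remove stale chunks first so a shortened document leaves no orphans.
        for key in [k for k in self.rows if k[0] == doc_id]:
            del self.rows[key]
        for n, (text, vec) in enumerate(zip(chunks, vectors)):
            self.rows[(doc_id, n)] = (vec, text)

def handle_change(event, index, embed=toy_embed):
    """Chunk, embed, and upsert one changed document; return the chunk count."""
    chunks = chunk(event.content)
    index.replace_document(event.doc_id, chunks, [embed(c) for c in chunks])
    return len(chunks)

index = InMemoryIndex()
first = handle_change(ChangeEvent("policy-42", " ".join(["word"] * 20)), index)
second = handle_change(ChangeEvent("policy-42", "The policy was shortened today"), index)
print(first, second, len(index.rows))
```

The detail worth copying is the delete-before-insert step: a plain upsert keyed by chunk number would leave the old trailing chunks of a shortened document in the index, where they would keep surfacing in retrieval.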

The trade-off: continuous indexing requires event-driven infrastructure (message queues, worker pools) and must handle failures gracefully. If the embedding service is down, you need a dead letter queue and retry mechanism. Batch re-indexing is simpler to operate but introduces a latency gap between when a document changes and when the RAG system knows about it.
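The retry and dead-letter behaviour can be sketched with an in-memory queue. A real deployment would use a message broker and would wait between attempts; the `drain` function, the attempt limit, and the failing document below are illustrative assumptions.

```python
from collections import deque

def drain(queue, process, max_attempts=3):
    """Process queued (event, attempts) pairs.

    Failed events are re-queued until they have failed max_attempts times,
    then moved to the returned dead-letter list with the error text.
    """
    dead_letters = []
    while queue:
        event, attempts = queue.popleft()
        try:
            process(event)
        except Exception as err:
            if attempts + 1 >= max_attempts:
                dead_letters.append((event, repr(err)))
            else:
                queue.append((event, attempts + 1))
    return dead_letters

# Demo: embedding always fails for one document and succeeds for the others.
processed = []
calls = []

def flaky_process(event):
    calls.append(event)
    if event == "doc-b":
        raise ConnectionError("embedding service unavailable")
    processed.append(event)

queue = deque([("doc-a", 0), ("doc-b", 0), ("doc-c", 0)])
dead = drain(queue, flaky_process)
print(processed, dead)
```

Events in the dead-letter list are not lost: an operator or a scheduled job can inspect the recorded error and replay them once the downstream service is healthy again.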

For most enterprises, a hybrid approach works best: continuous indexing for high-change sources (Confluence, Slack, email) and periodic batch re-indexing for stable sources (archived contracts, policy documents, historical records).

Your RAG system indexes Confluence, SharePoint, and a network file share. Users report that recently updated Confluence pages appear in search results within minutes, but updated files on the network share take up to 24 hours. Why?

Module 5 -- Final Assessment

Why is text extraction from multi-column PDF layouts particularly error-prone?

What is the primary advantage of Docling over pymupdf for document extraction in a RAG pipeline?

When should you use a vision-language model like Gemma 4 for document extraction instead of text-based tools?

What is the 'embed at write time' pattern in document ingestion?