The document zoo
Enterprise knowledge does not live in neat markdown files. It lives in a chaotic ecosystem of formats, systems, and access patterns that accumulated over decades. Before you can embed a single vector, you have to get that knowledge out.
Here is what a typical enterprise document landscape looks like:
Structured documents. Internal wikis (Confluence, Notion), SharePoint sites, knowledge bases, CRM records, structured databases. These have APIs, consistent formats, and usually clear metadata. They are the easiest to ingest -- but they represent perhaps 20-30% of enterprise knowledge.
Semi-structured documents. PDFs (reports, contracts, policies), Word documents, PowerPoint presentations, Excel spreadsheets. They have visual structure (headings, tables, sections), but extracting that structure programmatically is hard. This is the largest share of most enterprise corpora -- roughly 40-50%.
Unstructured content. Email archives, Slack/Teams messages, meeting transcripts, scanned documents (paper that was photographed or passed through a flatbed scanner), handwritten notes. Minimal structure, often no metadata, sometimes not even machine-readable text. This is roughly another 20-30%, and it contains some of the highest-value institutional knowledge.
Media. Diagrams, flowcharts, architectural drawings, photographs of whiteboards, recorded presentations. This content is semantically rich but was historically invisible to text-based RAG systems. Vision-capable models are changing this.
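The four categories above can be turned into a first-pass routing step, so each document is sent to an extractor suited to its class. The following is a minimal sketch, not a definitive implementation: the DocClass enum, the extension map, the API_SOURCES labels, and the classify function are all hypothetical names introduced for illustration. A real pipeline would also need to inspect content, since a scanned PDF, for example, belongs with unstructured content even though its extension says otherwise.

```python
from enum import Enum
from pathlib import Path
from typing import Optional


class DocClass(Enum):
    STRUCTURED = "structured"
    SEMI_STRUCTURED = "semi-structured"
    UNSTRUCTURED = "unstructured"
    MEDIA = "media"


# Hypothetical extension map (an assumption for illustration only).
# Extensions alone cannot detect scanned PDFs or image-only slides.
EXTENSION_CLASS = {
    ".pdf": DocClass.SEMI_STRUCTURED,
    ".docx": DocClass.SEMI_STRUCTURED,
    ".pptx": DocClass.SEMI_STRUCTURED,
    ".xlsx": DocClass.SEMI_STRUCTURED,
    ".eml": DocClass.UNSTRUCTURED,
    ".msg": DocClass.UNSTRUCTURED,
    ".txt": DocClass.UNSTRUCTURED,
    ".png": DocClass.MEDIA,
    ".jpg": DocClass.MEDIA,
    ".jpeg": DocClass.MEDIA,
    ".svg": DocClass.MEDIA,
    ".mp4": DocClass.MEDIA,
}

# Hypothetical labels for API-backed source systems.
API_SOURCES = {"confluence", "notion", "sharepoint", "crm"}


def classify(path: str, source: Optional[str] = None) -> DocClass:
    """Guess a document class from its file extension, then its source system."""
    suffix = Path(path).suffix.lower()
    if suffix in EXTENSION_CLASS:
        return EXTENSION_CLASS[suffix]
    # No recognizable extension: records pulled from an API-backed system
    # (wiki pages, CRM entries) are treated as structured.
    if source is not None and source.lower() in API_SOURCES:
        return DocClass.STRUCTURED
    # Default to the hardest case so nothing skips careful handling.
    return DocClass.UNSTRUCTURED
```

The extension is checked before the source system because a file attached to a wiki page or stored in SharePoint is still a PDF or a slide deck, and needs the same extraction as any other.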
The ingestion pipeline is the least glamorous and most important part of your RAG system. Poor ingestion -- garbled text, lost tables, stripped metadata -- propagates errors through every downstream stage. No amount of sophisticated chunking, embedding, or retrieval can compensate for bad input data.
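One inexpensive way to stop bad input from propagating is to gate extracted text before it reaches chunking. The sketch below is a crude heuristic under stated assumptions: the looks_garbled function and its 0.6 threshold are hypothetical, chosen for illustration rather than taken from any library or benchmark, and it only catches the most obvious failures (empty output, Unicode replacement characters, text dominated by symbols).

```python
def looks_garbled(text: str, min_clean_ratio: float = 0.6) -> bool:
    """Flag extracted text that is probably unusable for embedding.

    min_clean_ratio is an illustrative threshold, not a tuned value.
    """
    stripped = text.strip()
    if not stripped:
        # Extraction produced nothing (e.g. an image-only page).
        return True
    if "\ufffd" in stripped:
        # U+FFFD marks bytes that failed to decode.
        return True
    # Share of characters that are letters, digits, or whitespace.
    clean = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return clean / len(stripped) < min_clean_ratio
```

Documents that fail such a check can be routed to a fallback extractor or held for review instead of being embedded as-is. Lost tables and stripped metadata need separate checks; this heuristic does not detect them.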