The zero-network-request RAG stack
Here is something that would have been impractical two years ago: a complete RAG pipeline -- embed documents, index them, search by semantic similarity, and generate answers with citations -- running entirely in a browser tab. No server. No API calls. No data leaving the user's device.
The architecture is straightforward:
- Embed: Run a small embedding model (Nomic Embed, BGE-small) in the browser via Transformers.js to convert document text into vectors
- Index: Store the vectors in an in-memory HNSW index with persistence to IndexedDB
- Search: Given a user query, embed it with the same model and find the top-k most similar document chunks
- Generate: Pass the retrieved chunks as context to a local LLM (e.g. Gemma 3n E2B/E4B) and generate an answer
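The search step above reduces to a similarity ranking over stored vectors. A minimal sketch, assuming vectors already produced by an embedding model (in the real stack, a Transformers.js feature-extraction pipeline), shows the mechanics with a brute-force scan; an HNSW index replaces the linear scan with an approximate graph traversal but returns the same kind of ranked result:

```javascript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-k over an in-memory index of { id, text, vector } chunks.
// The chunk shape here is illustrative, not a required schema.
function topK(index, queryVector, k) {
  return index
    .map((chunk) => ({ ...chunk, score: cosineSimilarity(chunk.vector, queryVector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

At a few thousand chunks the brute-force scan is often fast enough on its own; HNSW earns its complexity when the index grows to tens of thousands of vectors.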
Every step runs client-side; the user's documents never leave their device. This is the "personal knowledge base" pattern -- each person builds their own local knowledge base from the documents they choose to add, and only they can access it.
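The handoff from retrieval to generation is mostly prompt assembly. One common approach -- sketched here with illustrative function and field names, not a fixed API -- is to number each retrieved chunk so the model can cite it as [1], [2], and so on:

```javascript
// Assemble the context window for the local model from retrieved chunks.
// `chunk.source` and `chunk.text` are assumed fields on each retrieved item.
function buildPrompt(question, chunks) {
  const context = chunks
    .map((chunk, i) => `[${i + 1}] (${chunk.source}) ${chunk.text}`)
    .join('\n');
  return [
    'Answer using only the context below. Cite sources as [n].',
    '',
    'Context:',
    context,
    '',
    `Question: ${question}`,
  ].join('\n');
}
```

The resulting string is what gets passed to the local model, and the [n] markers let the UI link each citation in the answer back to the source chunk on the device.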
This pattern is particularly powerful for:
- Regulated professionals (lawyers, doctors, financial advisers) who work with sensitive client documents
- Field workers who need to query technical documentation offline
- Executives who want AI search over confidential board papers and financial reports
- Anyone whose documents are too sensitive for cloud processing