Edge AI & Private Inference

RAG in the Browser

Running a full retrieval-augmented generation pipeline client-side with zero network requests -- in-browser embedding, vector search, and generation for the personal knowledge base pattern.

The zero-network-request RAG stack

Here is something that would have been impractical two years ago: a complete RAG pipeline -- embed documents, index them, search by semantic similarity, and generate answers with citations -- running entirely in a browser tab. No server. No API calls. No data leaving the user's device.

The architecture is straightforward:

  1. Embed: Run a small embedding model (Nomic Embed, BGE-small) in the browser via Transformers.js to convert document text into vectors
  2. Index: Store the vectors in an in-memory HNSW index with persistence to IndexedDB
  3. Search: Given a user query, embed it with the same model and find the top-k most similar document chunks
  4. Generate: Pass the retrieved chunks as context to a local LLM (Gemma 3n E2B/E4B) and generate an answer
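Step 1 usually begins before any model runs: documents are split into overlapping chunks so that each embedding covers a bounded span of text and retrieved results can cite a specific passage. A minimal character-based chunker as a sketch (the chunk size and overlap below are illustrative defaults, not values prescribed by this pattern):

```javascript
// Split a document into fixed-size chunks with overlap so retrieval
// returns a bounded passage rather than a whole document.
// chunkSize and overlap are illustrative choices, not prescribed values.
function chunkDocument(text, chunkSize = 500, overlap = 100) {
  const chunks = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push({
      text: text.slice(start, start + chunkSize),
      start, // character offset kept so generated answers can cite sources
    });
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

Each chunk's text would then be passed to the in-browser embedding model, with the resulting vector stored alongside the chunk in the index.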

Every step runs client-side. The user's documents stay on their device. This is the "personal knowledge base" pattern -- each employee has their own local knowledge base that only they can access, built from the documents they choose to add.
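The retrieval and prompt-assembly steps reduce to plain arithmetic and string building; for a few thousand chunks, a brute-force cosine scan is fast enough even before reaching for HNSW. A sketch of steps 3 and 4, with toy 3-dimensional vectors standing in for real embedding output (the sample chunks and query are invented for illustration):

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Step 3: brute-force top-k search over { text, vector } chunks.
// In the real pipeline, queryVector comes from the same in-browser
// embedding model used to index the documents.
function topK(queryVector, chunks, k = 3) {
  return chunks
    .map((c) => ({ ...c, score: cosine(queryVector, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy 3-dimensional vectors stand in for real embeddings.
const index = [
  { text: 'Refund policy: 30 days.', vector: [0.9, 0.1, 0.0] },
  { text: 'Shipping takes 5 days.',  vector: [0.1, 0.9, 0.0] },
  { text: 'Refunds need a receipt.', vector: [0.8, 0.2, 0.1] },
];
const hits = topK([1, 0, 0], index, 2);

// Step 4: number the retrieved chunks in the prompt so the local
// model can cite sources as [n] in its answer.
const prompt = [
  'Answer using only the context below. Cite sources as [n].',
  ...hits.map((c, i) => `[${i + 1}] ${c.text}`),
  'Question: What is the refund policy?',
].join('\n');
```

Swapping the brute-force scan for an HNSW index changes the lookup cost, not the interface: the generate step still receives the same ranked, numbered chunks.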

This pattern is particularly powerful for:

  • Regulated professionals (lawyers, doctors, financial advisers) who work with sensitive client documents
  • Field workers who need to query technical documentation offline
  • Executives who want AI search over confidential board papers and financial reports
  • Anyone whose documents are too sensitive for cloud processing

A law firm wants AI-powered search over client case files. Each lawyer works on different cases with strict client confidentiality requirements. What architecture fits?