Retrieval-augmented generation has become the default architecture for enterprise AI. Glean, Notion AI, Harvey AI, and Confluence all use RAG to ground LLM responses in proprietary data. But the pipeline has a latency problem that most teams misdiagnose. They optimize the LLM call — switching models, tuning context windows, batching requests — while the retrieval step quietly dominates end-to-end latency. A vector database round-trip takes 1–5ms. An in-process HNSW lookup takes 0.0015ms. That 3,300x gap is the difference between a RAG application that feels slow and one that feels instant.
Anatomy of RAG Latency
A RAG pipeline has four sequential stages, each contributing to the total time between a user’s question and the final answer. Understanding where time is spent is the prerequisite to optimizing it.
Stage 1: Embedding. The user’s query is converted into a vector embedding. Using OpenAI’s text-embedding-3-small, this takes 20–50ms including network latency. Locally hosted models (ONNX, sentence-transformers) reduce this to 2–8ms.

Stage 2: Vector search. The query embedding is compared against your document index to retrieve the top-K most relevant chunks. With an external vector database like Pinecone, Weaviate, or Qdrant, this takes 1–5ms over the network.

Stage 3: Context assembly. Retrieved chunks are formatted into a prompt with the system instruction and user query. This is string manipulation — sub-millisecond.

Stage 4: LLM generation. The assembled prompt is sent to GPT-4o, Claude, or a self-hosted model. Time to first token is 200–800ms depending on load and model.
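The four stages can be sketched as one sequential function. The stage implementations below are stand-in stubs (real code would call an embedding model, a vector index, and an LLM API), but the timing harness shows where each stage's milliseconds get accounted for:

```python
import time

# Stand-in stage implementations. In a real pipeline these would call an
# embedding model, a vector index, and an LLM API respectively.
def embed(query):                      # Stage 1: query -> vector
    return [0.1, 0.2, 0.3]

def vector_search(vec, k=4):           # Stage 2: top-K chunk retrieval
    return [f"chunk-{i}" for i in range(k)]

def assemble_prompt(chunks, query):    # Stage 3: pure string manipulation
    context = "\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"

def generate(prompt):                  # Stage 4: LLM generation (stubbed)
    return "answer"

def rag_answer(query):
    """Run the four stages sequentially, recording per-stage wall time."""
    timings = {}
    t0 = time.perf_counter()
    vec = embed(query)
    timings["embed_ms"] = (time.perf_counter() - t0) * 1000
    t0 = time.perf_counter()
    chunks = vector_search(vec)
    timings["search_ms"] = (time.perf_counter() - t0) * 1000
    t0 = time.perf_counter()
    prompt = assemble_prompt(chunks, query)
    timings["assemble_ms"] = (time.perf_counter() - t0) * 1000
    t0 = time.perf_counter()
    answer = generate(prompt)
    timings["generate_ms"] = (time.perf_counter() - t0) * 1000
    return answer, timings
```

Instrumenting each stage separately, rather than timing the whole call, is what surfaces a retrieval bottleneck that would otherwise hide behind the LLM's time-to-first-token.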
Standard RAG Pipeline — External Vector DB
At first glance, the LLM call dominates. But here is what most teams miss: the LLM call is being aggressively optimized by the model providers. GPT-4o’s time-to-first-token has improved 40% in the last year. Speculative decoding, quantization, and batching continue to push inference latency down. Meanwhile, the vector search step has a hard floor — network physics. A TCP round-trip to a hosted vector database in the same AWS region takes 0.5–1ms minimum, plus query execution. You cannot optimize away the speed of light.
The Retrieval Bottleneck at Scale
The problem compounds when you look at production RAG deployments. A single user query often requires multiple retrieval calls. Agentic RAG workflows — where the system decomposes a complex question into sub-queries — might execute 3–5 vector searches per user interaction. Multi-hop retrieval chains add more. Each round-trip to an external vector DB adds 1–5ms. At 4 retrievals per query, you are spending 4–20ms just on vector lookups before the LLM sees a single token.
Companies building production RAG systems understand this. Glean processes enterprise search across thousands of data sources — every millisecond in retrieval multiplied across millions of daily queries adds up to infrastructure cost and user-perceived latency. Notion AI retrieves from workspace documents that change frequently, requiring fast re-indexing and fast search. Harvey AI retrieves from legal corpora where retrieval precision is critical and latency directly impacts attorney productivity at $500+/hour. Confluence AI search operates over corporate knowledge bases where users expect Google-speed responses, not multi-second waits.
In-Process HNSW: Eliminating the Network
The solution is to move the vector index into the application process itself. Instead of querying an external vector database over TCP, the HNSW graph lives in the same memory space as your application. Cachee’s VADD and VSEARCH commands implement this directly: vectors are indexed in-process, and nearest-neighbor lookups execute in 0.0015ms (1.5 microseconds). No serialization, no TCP handshake, no network hop. The query touches L1/L2 CPU cache and returns.
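VADD and VSEARCH are Cachee's own commands; as an illustration of the in-process shape of the call, here is a minimal nearest-neighbor index in plain NumPy. It uses a brute-force cosine scan rather than a real HNSW graph, so it is a sketch of the memory layout rather than the algorithm, but the defining property is the same: the query never crosses a socket.

```python
import numpy as np

class InProcessIndex:
    """Minimal in-process nearest-neighbor index (brute-force cosine scan).

    A production system would use an HNSW graph for sub-linear lookups;
    the point here is that search is a single in-memory matrix product --
    no serialization, no TCP handshake, no network hop.
    """
    def __init__(self, dim):
        self.dim = dim
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.ids = []

    def add(self, doc_id, vec):
        v = np.asarray(vec, dtype=np.float32)
        v = v / np.linalg.norm(v)          # unit-normalize: dot product == cosine
        self.vectors = np.vstack([self.vectors, v])
        self.ids.append(doc_id)

    def search(self, query, k=3):
        q = np.asarray(query, dtype=np.float32)
        q = q / np.linalg.norm(q)
        sims = self.vectors @ q             # one in-memory matmul, no I/O
        top = np.argsort(-sims)[:k]
        return [(self.ids[i], float(sims[i])) for i in top]
```

Usage is two calls: `add("doc-1", embedding)` at index time, `search(query_embedding, k)` on the hot path. Swapping the brute-force scan for an HNSW graph changes the complexity, not the zero-network call shape.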
Optimized RAG Pipeline — Cachee In-Process HNSW
The improvement from optimizing a single retrieval call looks modest: 1–5ms out of a roughly 540ms response is under one percent. But the real gain appears in two scenarios. First, agentic and multi-hop RAG: when your pipeline makes 4–8 retrieval calls per query, the savings multiply. Reducing retrieval from 4–40ms to 0.006–0.012ms shifts the pipeline from retrieval-bound to purely LLM-bound. Second, cached LLM responses: when you combine in-process vector search with semantic caching, the LLM call is also eliminated on cache hits. The entire pipeline collapses to embedding + vector search + cache lookup — all in under 10ms total, with zero API cost.
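A semantic cache can be sketched in a few lines: a query whose embedding lands close enough to a previously answered query reuses the cached response and skips the LLM entirely. The 0.95 cosine threshold below is an illustrative choice, not a recommendation, and the linear scan stands in for what would be another vector-index lookup in production.

```python
import numpy as np

class SemanticCache:
    """Sketch of a semantic response cache keyed by query embeddings.

    A query within `threshold` cosine similarity of a previously answered
    query returns the cached LLM response -- no generation call at all.
    """
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.embeddings = []   # unit vectors of cached queries
        self.responses = []

    def get(self, query_vec):
        q = np.asarray(query_vec, dtype=np.float32)
        q = q / np.linalg.norm(q)
        for emb, resp in zip(self.embeddings, self.responses):
            if float(emb @ q) >= self.threshold:
                return resp          # cache hit: LLM call skipped
        return None                  # miss: caller falls through to the LLM

    def put(self, query_vec, response):
        q = np.asarray(query_vec, dtype=np.float32)
        self.embeddings.append(q / np.linalg.norm(q))
        self.responses.append(response)
```

The threshold is the precision/recall dial: set it too low and paraphrases of different questions collide; too high and only near-verbatim repeats hit.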
Architecture: Where Cachee Fits in Your RAG Stack
Cachee does not replace your vector database for bulk indexing and persistence. It replaces the hot-path query layer. The architecture is a tiered retrieval system: Cachee’s in-process HNSW index holds your most frequently accessed vectors — the top 100K–1M document chunks that account for 90%+ of retrieval hits. The external vector DB (Pinecone, Weaviate, Qdrant, pgvector) serves as the persistence and bulk storage layer for the full corpus. Cache misses fall through to the external DB transparently.
This tiered approach mirrors how CPU caches work: L1 is tiny and fast, L2 is larger and slower, main memory is huge and slowest. The AI infrastructure equivalent is in-process HNSW (L1), hosted vector DB (L2), and cold storage or re-embedding (L3). The economics work because query distributions follow a power law — a small percentage of documents account for the majority of retrievals. Keeping those hot documents in-process eliminates the latency penalty for 90%+ of queries.
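The fallthrough logic itself is a few lines. In the sketch below, `hot_index` and `external_db` are hypothetical stand-ins for an in-process index and a hosted vector DB client, assumed to share a `search(vec, k)` interface; wiring them to Cachee and Pinecone/Qdrant clients is the integration work, not the architecture.

```python
class TieredRetriever:
    """Two-tier retrieval: a small in-process index for hot vectors,
    with transparent fallthrough to an external vector DB on miss."""

    def __init__(self, hot_index, external_db, min_results=1):
        self.hot = hot_index           # tier 1: in-process, microseconds
        self.cold = external_db        # tier 2: hosted DB, network round-trip
        self.min_results = min_results

    def search(self, vec, k=4):
        hits = self.hot.search(vec, k)
        if len(hits) >= self.min_results:
            return hits                # hot path: no network hop
        return self.cold.search(vec, k)  # miss: fall through to external DB
```

Because both tiers expose the same interface, the caller never knows which tier answered, which is what makes the fallthrough "transparent."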
The Compounding Effect
RAG optimization is not a single-layer problem. The fastest RAG pipelines combine three optimizations: local embedding (ONNX runtime, 2–8ms vs. 20–50ms API), in-process vector search (Cachee VSEARCH, 0.0015ms vs. 1–5ms), and semantic response caching (skip the LLM entirely on near-duplicate queries). Together, these reduce the median RAG response time from 538ms to under 10ms on cache hits and under 510ms on cache misses. The infrastructure cost drops proportionally — fewer LLM API calls, fewer vector DB queries, lower compute per request. For companies processing millions of RAG queries daily, the savings compound into six or seven figures annually. Check Cachee pricing to see the cost comparison against standalone vector DB deployments.
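The blended numbers follow from basic expectation arithmetic over the cache hit rate. The 40% hit rate below is a hypothetical for illustration, not a measured figure:

```python
def blended_latency_ms(hit_rate, hit_ms, miss_ms):
    """Expected response time given a semantic-cache hit rate."""
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms

# Using the article's figures: ~10ms on cache hits, ~510ms on misses.
# At a hypothetical 40% hit rate, expected latency drops to ~310ms:
latency = blended_latency_ms(0.40, 10.0, 510.0)
```

The same expectation applies to cost: every cache hit is an LLM API call that was never made, so the dollar savings scale linearly with the hit rate.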
Related Reading
- AI Infrastructure Solutions
- In-Process Vector Search
- Cachee Pricing
- Start Free Trial
- How Cachee Works