Retrieval-augmented generation has become the default architecture for enterprise AI. Glean, Notion AI, Harvey AI, and Confluence all use RAG to ground LLM responses in proprietary data. But the pipeline has a latency problem that most teams misdiagnose. They optimize the LLM call — switching models, tuning context windows, batching requests — while the retrieval step quietly dominates end-to-end latency. A vector database round-trip takes 1–5ms. An in-process HNSW lookup takes 0.0015ms. That gap of up to 3,300x is the difference between a RAG application that feels slow and one that feels instant.
Anatomy of RAG Latency
A RAG pipeline has four sequential stages, each contributing to the total time between a user’s question and the final answer. Understanding where time is spent is the prerequisite to optimizing it.
- Stage 1: Embedding. The user’s query is converted into a vector embedding. Using OpenAI’s text-embedding-3-small, this takes 20–50ms including network latency. Locally hosted models (ONNX, sentence-transformers) reduce this to 2–8ms.
- Stage 2: Vector search. The query embedding is compared against your document index to retrieve the top-K most relevant chunks. With an external vector database like Pinecone, Weaviate, or Qdrant, this takes 1–5ms over the network.
- Stage 3: Context assembly. Retrieved chunks are formatted into a prompt with the system instruction and user query. This is string manipulation — sub-millisecond.
- Stage 4: LLM generation. The assembled prompt is sent to GPT-4o, Claude, or a self-hosted model. Time to first token is 200–800ms depending on load and model.
Standard RAG Pipeline — External Vector DB
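The four stages can be sketched as a minimal timed pipeline. Every stage function here is a hypothetical stand-in (a real system would call an embedding model, a vector index, and an LLM); only the timing structure is the point.

```python
import time

def embed_query(query):
    return [0.1, 0.2, 0.3]                   # Stage 1: query -> embedding (stub)

def vector_search(embedding, k=4):
    return [f"chunk-{i}" for i in range(k)]  # Stage 2: top-K retrieval (stub)

def assemble_context(query, chunks):
    # Stage 3: plain string manipulation, sub-millisecond
    return "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {query}"

def generate(prompt):
    return "stub answer"                     # Stage 4: LLM generation (stub)

def rag_pipeline(query):
    timings, t = {}, time.perf_counter()
    emb = embed_query(query)
    timings["embed"], t = time.perf_counter() - t, time.perf_counter()
    chunks = vector_search(emb)
    timings["search"], t = time.perf_counter() - t, time.perf_counter()
    prompt = assemble_context(query, chunks)
    timings["assemble"], t = time.perf_counter() - t, time.perf_counter()
    answer = generate(prompt)
    timings["generate"] = time.perf_counter() - t
    return answer, timings

answer, timings = rag_pipeline("What is our refund policy?")
```

With real backends plugged in, the `timings` dict is exactly the per-stage breakdown discussed above, and it makes the retrieval share of total latency visible per request.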
At first glance, the LLM call dominates. But here is what most teams miss: the LLM call is being aggressively optimized by the model providers. GPT-4o’s time-to-first-token has improved 40% in the last year. Speculative decoding, quantization, and batching continue to push inference latency down. Meanwhile, the vector search step has a hard floor — network physics. A TCP round-trip to a hosted vector database in the same AWS region takes 0.5–1ms minimum, plus query execution. You cannot optimize away the speed of light.
The Retrieval Bottleneck at Scale
The problem compounds when you look at production RAG deployments. A single user query often requires multiple retrieval calls. Agentic RAG workflows — where the system decomposes a complex question into sub-queries — might execute 3–5 vector searches per user interaction. Multi-hop retrieval chains add more. Each round-trip to an external vector DB adds 1–5ms. At 4 retrievals per query, you are spending 4–20ms just on vector lookups before the LLM sees a single token.
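The arithmetic is easy to check. Assuming the per-lookup latencies quoted in this article and four retrievals per query:

```python
# Retrieval cost per query for an agentic pipeline, using the per-lookup
# latencies quoted above (illustrative ranges, not measurements).
retrievals_per_query = 4
network_lookup_ms = (1.0, 5.0)   # external vector DB round-trip, best/worst
in_process_lookup_ms = 0.0015    # in-process HNSW lookup

network_total = tuple(n * retrievals_per_query for n in network_lookup_ms)
in_process_total = in_process_lookup_ms * retrievals_per_query

print(network_total)     # (4.0, 20.0) -- ms spent on lookups alone
print(in_process_total)  # 0.006 ms for the same four lookups
```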
Companies building production RAG systems understand this. Glean processes enterprise search across thousands of data sources — every millisecond in retrieval multiplied across millions of daily queries adds up to infrastructure cost and user-perceived latency. Notion AI retrieves from workspace documents that change frequently, requiring fast re-indexing and fast search. Harvey AI retrieves from legal corpora where retrieval precision is critical and latency directly impacts attorney productivity at $500+/hour. Confluence AI search operates over corporate knowledge bases where users expect Google-speed responses, not multi-second waits.
In-Process HNSW: Eliminating the Network
The solution is to move the vector index into the application process itself. Instead of querying an external vector database over TCP, the HNSW graph lives in the same memory space as your application. Cachee’s VADD and VSEARCH commands implement this directly: vectors are indexed in-process, and nearest-neighbor lookups execute in 0.0015ms (1.5 microseconds). No serialization, no TCP handshake, no network hop. The query touches L1/L2 CPU cache and returns.
Optimized RAG Pipeline — Cachee In-Process HNSW
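Cachee's VADD/VSEARCH API is not shown in this article, so here is a minimal in-process lookup sketch. A brute-force cosine scan stands in for the HNSW graph (a real deployment would use an HNSW index for sub-linear search); the property being illustrated is that the index lives in the application's memory, so a query is a function call rather than a network round-trip.

```python
import numpy as np

# Build a toy in-process index of 10K unit-normalized vectors.
rng = np.random.default_rng(0)
dim, n = 384, 10_000
index = rng.standard_normal((n, dim)).astype(np.float32)
index /= np.linalg.norm(index, axis=1, keepdims=True)

def vsearch(query, k=5):
    """Top-K by cosine similarity. No serialization, no TCP, no network hop:
    the whole lookup is a matrix-vector product in local memory."""
    q = query / np.linalg.norm(query)
    scores = index @ q                 # similarity against every vector
    return np.argsort(-scores)[:k]     # ids, highest similarity first

ids = vsearch(index[42])
print(int(ids[0]))  # 42 -- a vector is its own nearest neighbor
```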
In a single-retrieval pipeline, faster retrieval moves end-to-end latency by only a few percent, which can look modest. But the real gain appears in two scenarios. First, agentic and multi-hop RAG: when your pipeline makes 4–8 retrieval calls per query, the savings multiply. Reducing retrieval from 12–40ms to 0.006–0.012ms shifts the pipeline from retrieval-bound to purely LLM-bound. Second, cached LLM responses: when you combine in-process vector search with semantic caching, the LLM call is also eliminated on cache hits. The entire pipeline collapses to embedding + vector search + cache lookup — all in under 6ms total, with zero API cost.
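A semantic cache of the kind described above can be sketched as follows. The class shape and the 0.92 similarity threshold are illustrative assumptions, not Cachee's API:

```python
import numpy as np

class SemanticCache:
    """Return a stored LLM response when a new query's embedding is close
    enough to a previously answered one. Threshold is an assumed value;
    tune it against your false-hit tolerance."""

    def __init__(self, threshold=0.92):
        self.embeddings = []   # unit-normalized query embeddings
        self.responses = []
        self.threshold = threshold

    def _unit(self, v):
        v = np.asarray(v, dtype=np.float32)
        return v / np.linalg.norm(v)

    def get(self, query_embedding):
        if not self.embeddings:
            return None
        q = self._unit(query_embedding)
        sims = np.stack(self.embeddings) @ q      # cosine vs. all cached queries
        best = int(np.argmax(sims))
        return self.responses[best] if sims[best] >= self.threshold else None

    def put(self, query_embedding, response):
        self.embeddings.append(self._unit(query_embedding))
        self.responses.append(response)

cache = SemanticCache()
cache.put([1.0, 0.0, 0.0], "cached answer")
print(cache.get([0.99, 0.1, 0.0]))  # near-duplicate query: "cached answer"
print(cache.get([0.0, 1.0, 0.0]))   # unrelated query: None, fall through to LLM
```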
Architecture: Where Cachee Fits in Your RAG Stack
Cachee does not replace your vector database for bulk indexing and persistence. It replaces the hot-path query layer. The architecture is a tiered retrieval system: Cachee’s in-process HNSW index holds your most frequently accessed vectors — the top 100K–1M document chunks that account for 90%+ of retrieval hits. The external vector DB (Pinecone, Weaviate, Qdrant, pgvector) serves as the persistence and bulk storage layer for the full corpus. Cache misses fall through to the external DB transparently.
This tiered approach mirrors how CPU caches work: L1 is tiny and fast, L2 is larger and slower, main memory is huge and slowest. The AI infrastructure equivalent is in-process HNSW (L1), hosted vector DB (L2), and cold storage or re-embedding (L3). The economics work because query distributions follow a power law — a small percentage of documents account for the majority of retrievals. Keeping those hot documents in-process eliminates the latency penalty for 90%+ of queries.
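The tiered fall-through can be sketched as a small routing function. `hot_search` and `remote_search` are hypothetical stand-ins for the in-process index and the external vector DB client; neither API is shown in this article.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    chunk_id: str
    score: float

def tiered_search(query_embedding, k, hot_search, remote_search, min_score=0.80):
    """Try the in-process hot tier first; fall through to the external DB.
    min_score is an assumed cutoff: the hot tier only holds the most
    frequently accessed chunks, so a weak top-1 score is treated as a miss."""
    hits = hot_search(query_embedding, k)
    if hits and hits[0].score >= min_score:
        return hits                               # in-process path: microseconds
    return remote_search(query_embedding, k)      # network path: milliseconds

hot = lambda q, k: [Hit("hot-doc", 0.95)]
empty_hot = lambda q, k: []
remote = lambda q, k: [Hit("remote-doc", 0.70)]

print(tiered_search([0.1], 1, hot, remote)[0].chunk_id)        # hot-doc
print(tiered_search([0.1], 1, empty_hot, remote)[0].chunk_id)  # remote-doc
```

Because misses fall through transparently, application code only ever calls `tiered_search`; the hot tier is purely a latency optimization, not a correctness dependency.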
The Compounding Effect
RAG optimization is not a single-layer problem. The fastest RAG pipelines combine three optimizations: local embedding (ONNX runtime, 2–8ms vs. 20–50ms API), in-process vector search (Cachee VSEARCH, 0.0015ms vs. 1–5ms), and semantic response caching (skip the LLM entirely on near-duplicate queries). Together, these reduce the median RAG response time from 538ms to under 10ms on cache hits and under 510ms on cache misses. The infrastructure cost drops proportionally — fewer LLM API calls, fewer vector DB queries, lower compute per request. For companies processing millions of RAG queries daily, the savings compound into six or seven figures annually. Check Cachee pricing to see the cost comparison against standalone vector DB deployments.
Related Reading
- AI Infrastructure Solutions
- In-Process Vector Search
- Cachee Pricing
- Start Free Trial
- How Cachee Works
The Numbers That Matter
Cache performance discussions get philosophical fast. Here are the actual measured numbers from production deployments running on documented hardware, so you can compare against your own infrastructure instead of trusting marketing copy.
- L0 hot path GET: 28.9 nanoseconds on Apple M4 Max, single-threaded against pre-warmed in-memory cache. This is the floor — there's no faster way to read a key.
- L1 CacheeLFU GET: ~89 nanoseconds on AWS Graviton4 (c8g.metal-48xl). Sharded DashMap with admission filtering.
- Sustained throughput: 32 million ops/sec single-threaded on M4 Max, 7.41 million ops/sec at 16 workers on Graviton4 c8g.16xlarge.
- L2 fallback: Sub-millisecond hits against ElastiCache Redis 7.4 over same-AZ network when L1 misses cascade through.
The compounding effect matters more than any single number. A 28-nanosecond L0 hit means your application spends almost zero time on cache lookups in the hot path, leaving the CPU free for the actual business logic that generates revenue.
Average Latency Hides The Real Story
Average latency is the most misleading number in cache benchmarking. The percentile distribution is what actually breaks production systems. Tail latency — the slowest 0.1% of requests — is where users notice the lag and where SLAs get violated.
| Percentile | Network Redis (same-AZ) | In-process L0 |
|---|---|---|
| p50 | ~85 microseconds | 28.9 nanoseconds |
| p95 | ~140 microseconds | ~45 nanoseconds |
| p99 | ~280 microseconds | ~80 nanoseconds |
| p99.9 | ~1.2 milliseconds | ~150 nanoseconds |
The p99.9 spike on networked Redis isn't a bug — it's the cost of running a single-threaded event loop that occasionally blocks on background tasks like RDB snapshots, AOF rewrites, and expired-key sweeps. Cachee's L0 stays inside a few hundred nanoseconds because the hot-path read is a lock-free shard lookup with no background work scheduled on the same thread.
If your application is sensitive to tail latency — payments, real-time bidding, fraud detection, trading — the p99.9 number is the one to optimize against. Average latency improvements that don't move the tail are vanity metrics.
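One way to see why averages mislead: build a synthetic latency distribution with a rare slow tail and compare the mean against the p99.9. The numbers are illustrative, not measurements.

```python
import math
import random
import statistics

# 99.8% fast requests plus a 0.2% slow tail -- the shape that makes
# average latency look healthy while the tail violates SLAs.
random.seed(1)
samples = [random.uniform(0.08, 0.15) for _ in range(9_980)]  # ms
samples += [random.uniform(1.0, 2.0) for _ in range(20)]      # slow tail

def percentile(data, p):
    """Nearest-rank percentile: value at or below which p% of samples fall."""
    data = sorted(data)
    idx = min(len(data) - 1, math.ceil(p / 100 * len(data)) - 1)
    return data[idx]

print(round(statistics.mean(samples), 3))   # the average hides the tail
print(round(percentile(samples, 50), 3))    # p50: also looks fine
print(round(percentile(samples, 99.9), 3))  # p99.9: an order of magnitude worse
```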
Memory Efficiency Is The Hidden Cost Lever
Throughput numbers get the headlines but memory efficiency determines your monthly bill. A cache that stores the same hot data in less RAM lets you run a smaller instance class — and on AWS that's the difference between profitable and breakeven for a lot of services.
Redis stores each key as a Simple Dynamic String with 16 bytes of header overhead, plus dictEntry pointers in the main hashtable, plus embedded TTL metadata. For 1KB values, the total per-entry footprint (value plus overhead) lands around 1100-1200 bytes once you account for hashtable load factor and slab fragmentation. At a million keys, that's roughly 1.2 GB of resident memory for about 1 GB of payload.
Cachee's L1 layer uses sharded DashMap entries with compact packing — a 64-bit key hash, value bytes, an 8-byte expiry timestamp, and a small frequency counter for the CacheeLFU admission filter. Per-entry overhead lands at roughly 40 bytes of structural data on top of the value itself. For the same million-key workload, that's roughly 11% smaller resident memory (about 1.06 GB versus the ~1.2 GB above). On AWS ElastiCache pricing, that gap is the difference between needing a cache.r7g.large versus a cache.r7g.xlarge for borderline workloads.
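The back-of-envelope arithmetic, using the per-entry figures quoted above. These are approximations; the exact savings depend on where in the 1100-1200 byte range a Redis entry actually lands.

```python
keys = 1_000_000
value_bytes = 1024

# Per-entry footprints from the figures above (approximate).
redis_entry = 1200               # upper end of the ~1100-1200 B total footprint
cachee_entry = value_bytes + 40  # value + ~40 B structural overhead

redis_gb = keys * redis_entry / 1e9
cachee_gb = keys * cachee_entry / 1e9
savings = 1 - cachee_gb / redis_gb

print(f"{redis_gb:.2f} GB vs {cachee_gb:.2f} GB ({savings:.0%} smaller)")
# -> 1.20 GB vs 1.06 GB (11% smaller)
```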
Observability And What To Measure
You can't tune what you can't measure. The four metrics that matter for any production cache deployment, in order of importance:
- Hit rate, broken down by key prefix or namespace. A global hit rate of 92% sounds great until you discover that one critical namespace is sitting at 40% and dragging your tail latency. Per-prefix hit rates expose which workloads are getting cache value and which aren't.
- Latency percentiles, not averages. p50, p95, p99, and p99.9 for both cache hits and cache misses. The cache miss latency is your fallback path performance — when the cache fails, this is what your users actually experience.
- Memory pressure and eviction rate. If your eviction rate is climbing while your hit rate stays flat, you're under-provisioned. If both are climbing, your access pattern shifted and you need to retune TTLs or rethink what you're caching.
- Stale-read rate. The percentage of cache hits that returned a value the application then discovered was stale. This is the canary for your invalidation strategy. If it's above 1%, your invalidation logic has a bug.
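The first metric, per-prefix hit rate, takes only a few lines to track. A minimal sketch, assuming keys are namespaced with a `:` separator:

```python
from collections import defaultdict

class HitRateByPrefix:
    """Track hit rate per key namespace (the prefix before the first ':'),
    so one cold namespace can't hide inside a healthy global number."""

    def __init__(self):
        self.hits = defaultdict(int)
        self.total = defaultdict(int)

    def record(self, key, hit):
        prefix = key.split(":", 1)[0]
        self.total[prefix] += 1
        if hit:
            self.hits[prefix] += 1

    def rates(self):
        return {p: self.hits[p] / n for p, n in self.total.items()}

m = HitRateByPrefix()
m.record("user:1", True)
m.record("user:2", True)
m.record("search:q1", False)
print(m.rates())  # {'user': 1.0, 'search': 0.0}
```

A real deployment would export these as labeled counters to a metrics system rather than computing ratios in-process, but the namespace breakdown is the part that matters.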
Cachee exposes all four out of the box via Prometheus metrics on the standard scrape endpoint, plus a real-time SSE stream for dashboards that need sub-second visibility. The right time to wire these into your monitoring stack is before the migration, not after the first incident.
Make Your RAG Pipeline 3,300x Faster at the Retrieval Layer.
In-process HNSW vector search at 0.0015ms. No external database. No network round-trip. Drop-in replacement for your retrieval hot path.
Start Free Trial | Schedule Demo