
RAG Caching: Cache Retrieval-Augmented Generation Results at 31ns

May 10, 2026 | 15 min read | Engineering

A typical Retrieval-Augmented Generation pipeline processes a user query through four stages: embedding the query into a vector representation (5-20ms), searching a vector database for relevant document chunks (10-50ms), constructing a prompt from the retrieved chunks (1-5ms), and sending the assembled prompt to a large language model for generation (200ms-2s). The total wall-clock time for a single RAG query ranges from 250 milliseconds to over 2 seconds. The total cost per query, including embedding API calls, vector database compute, and LLM inference, ranges from $0.01 to $0.05 depending on the model and retrieval depth. At enterprise scale -- 1 million queries per month -- this is $10,000 to $50,000 per month in pure compute cost, plus the latency that makes your AI assistant feel slow.

Here is the part that most teams overlook: 40-60% of RAG queries in enterprise knowledge bases are semantically identical or near-identical. Employees ask the same questions about PTO policy, expense reports, compliance procedures, product specifications, and IT troubleshooting. Customer-facing chatbots see the same product questions, pricing inquiries, and support issues from different users. Internal code assistants get the same questions about API usage, deployment procedures, and error resolution. Every one of these repeated queries runs the full pipeline -- embedding, retrieval, generation -- and produces the same answer at the same cost with the same latency.

Caching the full RAG pipeline output eliminates this redundancy. The first query runs the full pipeline and stores the result. Subsequent identical queries hit the cache and return the stored response at in-process L1 speed: 31 nanoseconds. No embedding. No vector search. No LLM call. The cost drops from $0.03 per query to effectively zero for cache hits. The latency drops from 500ms+ to 31ns -- a roughly 16,000,000x improvement (500ms / 31ns ≈ 16.1 million).

$18K
Monthly Savings (1M queries, 60% hit rate)
16,000,000x
Latency Improvement on Cache Hit
40-60%
Typical Enterprise RAG Hit Rate

Why RAG Results Are Cacheable

RAG results are cacheable because the pipeline is deterministic given fixed inputs. The same query embedding, searched against the same vector index version, with the same top-K parameter and the same model version, produces the same retrieved chunks. The same retrieved chunks, assembled into the same prompt template and sent to the same model with temperature=0, produce the same generated response. If any of these inputs change -- the vector index is updated, the model version changes, the prompt template is modified -- then the cached result should be invalidated. But as long as the inputs are fixed, the output is fixed, and caching it is not just safe but obligatory for any team that cares about cost and latency.

The challenge is not whether RAG results can be cached. The challenge is building a cache key that captures all the inputs that affect the output, so that stale results are never served after an upstream change. A naive cache key -- just the query string -- will serve stale results after a knowledge base update. A correct cache key must include every input that affects the output. This is exactly what Cachee computation fingerprinting was designed for.

The RAG Cache Key: Getting It Right

The cache key for a RAG result must be a cryptographic fingerprint of every input that affects the output. Missing a single input creates a correctness bug: the cache serves a result that does not reflect the current state of the system. Including unnecessary inputs reduces the hit rate without improving correctness. The correct set of inputs for a RAG cache key is:

cache_key = SHA3-256(
    query_embedding      ||  # The vector representation of the user's query
    top_k                ||  # Number of chunks retrieved (affects context)
    model_version        ||  # LLM model identifier + version
    system_prompt        ||  # System prompt template (affects generation)
    collection_version   ||  # Vector index version (changes on knowledge base update)
    temperature          ||  # Sampling temperature (must be 0 for deterministic caching)
    max_tokens               # Max generation length (affects truncation)
)

The collection_version field is the most important and most commonly omitted. When your team updates the knowledge base -- adding new documents, updating existing ones, or removing deprecated content -- the vector index changes. Any cached RAG result that was generated from the old index is now stale: it may reference information that has been corrected, it may miss information that has been added, or it may include information that has been removed. Including the collection_version in the cache key means that every knowledge base update automatically invalidates all cached results. No manual cache purge. No invalidation race condition. The fingerprint changes because the input changed, and the cache treats the old results as misses.

The query_embedding is used instead of the raw query string because the embedding is what actually drives retrieval: two queries with the same embedding retrieve the same chunks by construction. In practice, distinct phrasings such as "What is the PTO policy?" and "Tell me about PTO" rarely produce bit-identical floating-point vectors, so deployments that want paraphrases to collapse into a single cache entry typically quantize the embedding before hashing (a sketch follows) or use the semantic similarity strategy described below. Note the tradeoff: keying on the embedding means the embedding step runs on every query, including cache hits; teams that need the pure 31ns hot path can key on a normalized query string instead, at the cost of fewer paraphrase collisions.
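
A minimal quantization sketch, assuming L2-normalizable numpy embeddings; the rounding granularity is an illustrative choice of ours, not a Cachee requirement:

import hashlib
import numpy as np

def embedding_cache_key(embedding: np.ndarray, decimals: int = 2) -> str:
    # Quantize a normalized embedding before hashing so that near-identical
    # query vectors collapse to the same cache key. Coarser rounding raises
    # the hit rate but also the risk of merging genuinely distinct queries.
    normalized = embedding / np.linalg.norm(embedding)
    quantized = np.round(normalized, decimals=decimals).astype(np.float32)
    return hashlib.sha3_256(quantized.tobytes()).hexdigest()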

Temperature Must Be Zero for Deterministic Caching

If your LLM temperature is greater than zero, the same prompt produces different outputs on each call. Caching these results means serving a single sampled response to all subsequent queries, which may not be the most appropriate response for every context. For RAG caching to be correct, set temperature=0 (or use a model that supports deterministic decoding). If your application requires varied responses, cache only the retrieval stage (chunks) and generate fresh responses from cached chunks.

Three RAG Caching Strategies

Not all RAG caching is the same. The right strategy depends on your hit rate requirements, your privacy constraints, and your tolerance for serving slightly stale responses. Here are three strategies, ranked from highest precision to highest recall.

Strategy 1: Exact Match Caching

Exact match caching uses the full fingerprint described above as the cache key. A cache hit occurs only when every input -- the query embedding, top-K, model version, system prompt, collection version, temperature, and max tokens -- matches exactly. This is the safest strategy because it has zero false positive risk: a cache hit always returns the exact result that would have been generated by running the full pipeline.

The tradeoff is hit rate. Exact match caching only hits on queries that produce identical embeddings. "What is the PTO policy?" hits if someone else asked the exact same question (or a question with an identical embedding). But "How many vacation days do I get?" produces a different embedding and misses, even though the retrieval results and generated response would be nearly identical. Typical hit rates for exact match caching in enterprise knowledge bases range from 40-55%, depending on query diversity.

import hashlib
import json
from cachee import CacheeClient

# Assumed defined elsewhere in the application: embed_model (the sentence
# encoder), llm (the generation client), SYSTEM_PROMPT, build_prompt(), and
# the VectorCollection type.
client = CacheeClient()

def rag_query_cached(query: str, collection: VectorCollection) -> str:
    # 1. Embed the query
    embedding = embed_model.encode(query)

    # 2. Build the cache fingerprint
    fingerprint_input = {
        "query_embedding": embedding.tobytes().hex(),
        "top_k": 10,
        "model_version": "gpt-4-2026-04-01",
        "system_prompt_hash": hashlib.sha3_256(SYSTEM_PROMPT.encode()).hexdigest(),
        "collection_version": collection.version(),
        "temperature": 0,
        "max_tokens": 1024,
    }
    cache_key = hashlib.sha3_256(
        json.dumps(fingerprint_input, sort_keys=True).encode()
    ).hexdigest()

    # 3. Check cache
    response = client.get(
        key=cache_key,
        contract="rag_exact_match"
    )

    if response.verification_status == "Valid":
        # Cache hit — return cached response (31ns lookup)
        return response.value

    # 4. Cache miss — run full RAG pipeline
    chunks = collection.search(embedding, top_k=10)
    prompt = build_prompt(SYSTEM_PROMPT, chunks, query)
    answer = llm.generate(prompt, temperature=0, max_tokens=1024)

    # 5. Cache the result with full attestation
    client.put(
        key=cache_key,
        value=answer,
        contract="rag_exact_match",
        metadata={
            "query": query,
            "chunks_retrieved": len(chunks),
            "model": "gpt-4-2026-04-01",
            "collection_version": collection.version(),
        }
    )

    return answer

Strategy 2: Semantic Similarity Caching

Semantic similarity caching relaxes the exact match requirement. Instead of requiring identical embeddings, it searches for cached entries whose query embeddings are within a cosine similarity threshold (typically 0.95-0.98) of the incoming query. This means "What is the PTO policy?" and "How many vacation days do I get?" can share a cache entry if their embeddings are sufficiently similar.

The hit rate improvement is significant: 55-70% in typical enterprise deployments, compared to 40-55% for exact match. But the strategy introduces two risks. First, there is a correctness risk: two queries with similar embeddings may retrieve different chunks and produce different answers, especially for queries near the decision boundary of the similarity threshold. Second, there is a privacy risk: if User A asks a query that includes PII in the phrasing, and User B asks a similar query, the cached response from User A (which may reference User A's context) could be served to User B.

Semantic similarity caching is appropriate for public knowledge bases (product documentation, FAQ, compliance manuals) where the retrieved chunks do not vary significantly for similar queries and where queries do not contain PII. It is not appropriate for personalized assistants, multi-tenant systems with data isolation requirements, or any system where a cached response might leak information between users.
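
One way to enforce that constraint is to gate the semantic path on query content before consulting the cache. The guard below is a deliberately crude sketch of ours -- the pattern catches only email addresses and US-SSN-shaped strings, and a production system should use a real PII detection service:

import re

# Illustrative only -- a real deployment should use a dedicated PII detector.
PII_PATTERN = re.compile(
    r"\b\d{3}-\d{2}-\d{4}\b"        # US-SSN-shaped number
    r"|[\w.+-]+@[\w-]+\.[\w.-]+"    # email address
)

def semantic_cache_allowed(query: str) -> bool:
    # Gate semantic caching: fall back to exact match when the query may
    # carry PII that could leak through a shared cache entry.
    return PII_PATTERN.search(query) is None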

def rag_query_semantic_cached(query: str, collection: VectorCollection) -> str:
    embedding = embed_model.encode(query)

    # Search for semantically similar cached entries
    similar = client.search_similar(
        embedding=embedding,
        similarity_threshold=0.97,
        contract="rag_semantic",
        max_results=1,
    )

    if similar and similar[0].verification_status == "Valid":
        # Semantic cache hit — similar enough query was cached
        return similar[0].value

    # Cache miss — run full pipeline
    chunks = collection.search(embedding, top_k=10)
    prompt = build_prompt(SYSTEM_PROMPT, chunks, query)
    answer = llm.generate(prompt, temperature=0, max_tokens=1024)

    # Cache with the embedding for future similarity searches
    cache_key = hashlib.sha3_256(embedding.tobytes()).hexdigest()
    client.put(
        key=cache_key,
        value=answer,
        contract="rag_semantic",
        embedding=embedding,
        metadata={
            "query": query,
            "collection_version": collection.version(),
        }
    )

    return answer

Strategy 3: Chunk-Level Caching

Chunk-level caching operates at the retrieval stage rather than the generation stage. Instead of caching the full RAG output (query + chunks + generated response), it caches the individual retrieved chunks keyed by the query embedding and collection version. When a cache hit occurs, the cached chunks are used to build the prompt, and only the LLM generation step runs. Because the cache key is derived from the query embedding, the embedding step (5-20ms) still executes on every query; what a hit eliminates is the vector search (10-50ms). The LLM call (200ms-2s) is still required.

The advantage of chunk-level caching is that it avoids the correctness and privacy risks of response-level caching. Different users can retrieve the same chunks but receive personalized responses because the generation step still runs with the user's context. The hit rate for chunk-level caching is typically 60-80% because chunk retrieval patterns follow a power-law distribution: a small number of document chunks are retrieved frequently by many different queries.

The disadvantage is that the most expensive step -- LLM generation -- is not cached. Chunk-level caching saves the 10-50ms vector search per query on cache hits, compared to 250ms-2s for full response caching. The cost savings are also smaller because the LLM inference cost (which dominates the per-query cost) is not eliminated. Chunk-level caching is best used as a complement to exact match caching: cache full responses for exact matches, and cache chunks for near-misses where the response cannot be safely reused.

def rag_query_chunk_cached(query: str, collection: VectorCollection) -> str:
    embedding = embed_model.encode(query)

    # Build chunk cache key (embedding + collection version only)
    chunk_key = hashlib.sha3_256(
        embedding.tobytes() +
        collection.version().encode()
    ).hexdigest()

    # Check chunk cache
    chunk_response = client.get(
        key=f"chunks:{chunk_key}",
        contract="rag_chunks"
    )

    if chunk_response.verification_status == "Valid":
        # Chunk cache hit -- skip the vector search (the embedding above
        # still runs because it is part of the cache key)
        chunks = json.loads(chunk_response.value)
    else:
        # Chunk cache miss -- run retrieval, normalizing to dicts so both
        # branches hand build_prompt the same structure
        chunks = [c.to_dict() for c in collection.search(embedding, top_k=10)]

        # Cache the chunks (not the generated response)
        client.put(
            key=f"chunks:{chunk_key}",
            value=json.dumps(chunks),
            contract="rag_chunks",
            metadata={
                "collection_version": collection.version(),
                "chunk_count": len(chunks),
            }
        )

    # Always run generation (not cached at this level)
    prompt = build_prompt(SYSTEM_PROMPT, chunks, query)
    answer = llm.generate(prompt, temperature=0, max_tokens=1024)

    return answer

Computation Fingerprints: Why RAG Cache Integrity Matters

Caching RAG results without integrity verification creates a new attack surface. An attacker who gains access to the cache can modify cached responses, injecting misinformation, malicious instructions, or hallucinated content that is served to every subsequent user who triggers a cache hit. This is a cache poisoning attack, and it is particularly dangerous for RAG systems because users trust the responses as being grounded in the retrieved documents.

Cachee's computation fingerprinting binds each cached RAG result to the exact inputs that produced it. The fingerprint SHA3-256(query_embedding || top_k || model_version || system_prompt || collection_version || temperature || max_tokens) is computed at write time and verified at read time. If an attacker modifies the cached response, the fingerprint no longer matches, and the cache returns a miss instead of serving the poisoned result. If an attacker modifies the fingerprint to match the new response, the three post-quantum signatures (ML-DSA-65, FALCON-512, SLH-DSA-SHA2-128f) detect the tampering because the signatures were computed over the original fingerprint.

This is not academic. RAG cache poisoning has been demonstrated in research settings and is an active area of concern for any organization deploying RAG systems at scale. The defense is not "trust the cache" -- it is "verify the cache on every read." Cachee verification makes this the default behavior, not an opt-in configuration.
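
Cachee performs this verify-on-read check internally; the sketch below reconstructs the equivalent logic in plain application code purely for illustration (the function names and the exact hashed material are ours, not part of the Cachee API):

import hashlib
import json

def compute_fingerprint(inputs: dict, response: str) -> str:
    # Canonical serialization (sorted keys) plus the response text, so a
    # change to either the recorded inputs or the cached answer produces
    # a different digest.
    material = json.dumps(inputs, sort_keys=True).encode() + response.encode()
    return hashlib.sha3_256(material).hexdigest()

def entry_is_tampered(stored_fp: str, inputs: dict, response: str) -> bool:
    # Verify-on-read: a mismatch means the entry changed after it was
    # written, so the read is treated as a miss and the pipeline reruns.
    return stored_fp != compute_fingerprint(inputs, response)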

Every RAG Cache Hit Is Verified

Cachee verifies three independent post-quantum signatures and the computation fingerprint on every cache read. A poisoned cache entry fails verification and is treated as a miss, triggering a fresh RAG pipeline execution. The cost of this verification is negligible -- signature verification completes in microseconds, compared to the 250ms-2s you save by not running the full RAG pipeline.

Knowledge Base Updates and Automatic Invalidation

The most common RAG caching failure is serving a cached response that was generated from an outdated knowledge base. A customer asks "What are your pricing tiers?" and receives a cached response from last week -- before your team updated the pricing page. The response is confident, well-formatted, and completely wrong. This is worse than a cache miss because the user trusts the response as authoritative.

Cachee solves this with the collection_version field in the computation fingerprint. Every time your knowledge base is updated -- documents added, modified, or removed -- the collection version increments. Because the collection version is part of the fingerprint, every cached RAG result that was generated from the old collection version automatically misses on the next read. No manual cache purge. No invalidation race condition. No window during which stale results are served.

The collection version can be as simple as a monotonically increasing integer, or as precise as a hash of the entire collection's document set. The choice depends on your update frequency and your tolerance for unnecessary cache invalidation. A simple integer invalidates everything on every update. A collection hash invalidates only when the actual content changes, which is more precise but requires computing the hash on every update. For most deployments, the simple integer is sufficient because knowledge base updates are infrequent relative to query volume -- even a daily update only invalidates the cache once per day.
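
As one way to implement the content-hash variant (a sketch of ours; how you enumerate documents depends on your store), hash each document's identifier and content digest in sorted order, so the version changes exactly when the content does:

import hashlib

def collection_version(doc_digests: dict[str, str]) -> str:
    # doc_digests maps document ID -> content hash. Sorting makes the
    # result independent of iteration order; any add, edit, or delete
    # changes the version and thus invalidates dependent cache entries.
    h = hashlib.sha3_256()
    for doc_id in sorted(doc_digests):
        h.update(doc_id.encode())
        h.update(doc_digests[doc_id].encode())
    return h.hexdigest()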

Cost Analysis: The Business Case for RAG Caching

The business case for RAG caching is straightforward arithmetic. The inputs are your query volume, your per-query cost, and your expected hit rate. The output is the monthly savings. Here is the math for a typical enterprise deployment.

Metric                        | Without Caching | With Cachee (60% Hit Rate)
Monthly queries               | 1,000,000       | 1,000,000
Queries hitting full pipeline | 1,000,000       | 400,000
Queries served from cache     | 0               | 600,000
Cost per full pipeline query  | $0.03           | $0.03
Cost per cached query         | N/A             | ~$0.00
Monthly compute cost          | $30,000         | $12,000
Monthly savings               | --              | $18,000
Annual savings                | --              | $216,000
Average response latency      | 500ms           | 200ms (blended)
P99 response latency          | 2,000ms         | 2,000ms (miss) / 31ns (hit)

The $18,000 monthly savings is the conservative case. Organizations with higher query volumes, higher per-query costs (using larger models), or higher hit rates (more FAQ-heavy workloads) save proportionally more. A 10-million-query-per-month deployment with a 70% hit rate and $0.05 per-query cost saves $350,000 per month. These are infrastructure costs that appear on your cloud bill every month, and RAG caching eliminates the majority of them.
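
The same arithmetic as a reusable sketch (the function and parameter names are ours), reproducing both the table's scenario and the larger deployment above:

def monthly_savings(queries: int, cost_per_query: float, hit_rate: float) -> float:
    # Savings = queries served from cache x full-pipeline cost per query,
    # treating the marginal cost of a cache hit as approximately zero.
    return queries * hit_rate * cost_per_query

print(f"${monthly_savings(1_000_000, 0.03, 0.60):,.0f} per month")   # $18,000
print(f"${monthly_savings(10_000_000, 0.05, 0.70):,.0f} per month")  # $350,000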

The latency improvement is equally important. A 31-nanosecond cache hit feels instantaneous to the user. A 500-millisecond full pipeline execution feels noticeably slow. For conversational AI applications where response time directly affects user satisfaction and engagement, the difference between "instant" and "half a second" is the difference between an AI assistant that feels intelligent and one that feels laggy.

Contract Configuration for RAG Caching

RAG cache contracts define the freshness, verification, and invalidation requirements for cached RAG results. The contract configuration depends on your knowledge base update frequency, your accuracy requirements, and your risk tolerance for serving stale responses.

# cachee.toml — RAG caching contracts

[contracts.rag_exact_match]
computation_type = "rag_response"
max_staleness_ms = 3600000  # 1 hour — adjust to knowledge base update frequency
verification = ["ML-DSA-65", "FALCON-512", "SLH-DSA-SHA2-128f"]
fingerprint_fields = [
    "query_embedding", "top_k", "model_version",
    "system_prompt", "collection_version", "temperature", "max_tokens"
]
audit_trail_depth_days = 90
invalidation = "versioned"  # collection_version change invalidates all
strict_mode = true

[contracts.rag_semantic]
computation_type = "rag_response_semantic"
max_staleness_ms = 1800000  # 30 minutes — tighter window for fuzzy matching
verification = ["ML-DSA-65", "FALCON-512", "SLH-DSA-SHA2-128f"]
fingerprint_fields = [
    "query_embedding_cluster", "top_k", "model_version",
    "collection_version"
]
audit_trail_depth_days = 90
invalidation = "versioned"
strict_mode = true

[contracts.rag_chunks]
computation_type = "rag_chunk_retrieval"
max_staleness_ms = 86400000  # 24 hours — chunks change less frequently
verification = ["ML-DSA-65", "FALCON-512", "SLH-DSA-SHA2-128f"]
fingerprint_fields = ["query_embedding", "top_k", "collection_version"]
audit_trail_depth_days = 30
invalidation = "versioned"
strict_mode = true

Choosing the Right Strategy

The three strategies are not mutually exclusive. The optimal deployment for most enterprise RAG systems is a layered approach. Use exact match caching as the first layer: it has zero correctness risk and catches 40-55% of queries. Use chunk-level caching as the second layer: it saves retrieval time for queries that miss the exact match cache and has minimal correctness risk because generation still runs. Reserve semantic similarity caching for public knowledge bases where privacy and per-user personalization are not concerns.

The decision tree is simple. If your RAG system serves a single knowledge base with no user personalization and no PII in queries, use semantic similarity caching with a 0.97 threshold -- you will achieve 55-70% hit rates. If your RAG system has per-user context, multi-tenant data isolation, or PII in queries, use exact match caching for full responses and chunk-level caching for retrieval results. If you are unsure, start with exact match caching. It is safe by default, and the 40-55% hit rate still saves $12,000-$16,500 per month on a 1-million-query workload.
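
The same decision tree in code form (a sketch; the predicate names are ours):

def choose_strategy(has_pii: bool, multi_tenant: bool, personalized: bool) -> str:
    # Mirrors the decision tree above: semantic caching only when no privacy
    # or personalization constraint applies; otherwise layer exact match
    # (full responses) with chunk-level caching (retrieval results).
    if not (has_pii or multi_tenant or personalized):
        return "semantic similarity caching (threshold ~0.97)"
    return "exact match for responses + chunk-level caching for retrieval"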

Every cached RAG result in Cachee is signed by three post-quantum algorithms, bound to a computation fingerprint that includes the collection version, and governed by a cache contract that enforces freshness and invalidation. This means your RAG cache is not just fast -- it is tamper-proof, version-aware, and auditable. When your knowledge base updates, stale results invalidate automatically. When an attacker attempts to poison the cache, signature verification catches it. When an auditor asks how you ensure RAG response accuracy, you point to the computation fingerprints and contracts, not a hand-wave about TTLs.

The Bottom Line

RAG pipelines are expensive and slow, but 40-60% of queries are redundant. Caching the full pipeline output at the Cachee L1 tier serves repeated queries at 31 nanoseconds -- roughly 16,000,000x faster than the full pipeline. Three caching strategies (exact match, semantic similarity, chunk-level) address different precision/recall/privacy tradeoffs. Computation fingerprints ensure that knowledge base updates automatically invalidate stale results, and triple PQ signatures prevent cache poisoning. The cost savings are $18K/month at 1M queries with a 60% hit rate. The latency improvement is 500ms to 31ns for cache hits. The integrity guarantee is three independent post-quantum signatures verified on every read.

Your RAG pipeline is spending $18K/month on redundant queries. Cachee serves cached RAG results at 31ns with cryptographic integrity.
