LLM Caching: Cut OpenAI API Costs 60%
LLM inference is expensive. Not "expensive compared to other API calls" expensive. Expensive in absolute dollar terms. A single GPT-4 class call costs $0.01 to $0.06 depending on token count. That sounds small until you calculate what happens at production scale. A product serving 1 million API calls per day pays $10,000 to $60,000 per day in inference costs alone, which adds up to roughly $300,000 to $1.8 million per month. That is before infrastructure, engineering salaries, or any other operational expense. The LLM API bill is often the single largest variable cost in the entire stack.
The uncomfortable truth is that 40% of those API calls are near-duplicates. The same customer support question asked in three different phrasings. The same code generation prompt submitted by different users. The same summarization request on the same document. The same data extraction prompt run against records with identical structures. Every one of these duplicate calls goes to the LLM provider, consumes inference compute, generates tokens, and produces a bill. And every one of them returns an answer that is functionally identical to an answer you already have.
This post walks through four caching strategies that eliminate redundant LLM inference, the cost math at different scales, and the engineering considerations that determine which strategy works for which use case. The combined effect: a 60% reduction in API costs with no degradation in response quality for the cached subset.
The Cost Problem at Scale
Before discussing solutions, it is worth understanding the exact cost structure that makes LLM caching so impactful. LLM pricing is token-based. You pay for input tokens (the prompt) and output tokens (the completion). For GPT-4 class models, input tokens cost roughly $0.01 per 1,000 tokens and output tokens cost roughly $0.03 per 1,000 tokens. A typical API call with a 500-token prompt and a 300-token completion costs approximately $0.014.
At 1 million calls per day, that is $14,000 per day, or roughly $420,000 per month. At 5 million calls per day, it is $70,000 per day ($2.1 million per month). At 10 million calls per day, it is $140,000 per day ($4.2 million per month). These numbers are not hypothetical. Any product with a chat interface, a code assistant, a document processing pipeline, or a customer support bot can reach 1 million calls per day within months of launch.
The scaling problem is that LLM costs scale linearly with call volume. Unlike traditional compute where you can optimize code to handle more requests per server, LLM inference costs are set by the provider. You cannot make a GPT-4 call cheaper by writing better code. You can only make fewer calls. That is what caching does. It replaces a $0.014 API call with a $0.00 cache lookup. Every cached response is money not spent on inference.
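The table below follows directly from that arithmetic. As a rough sketch (the function name, the 30-day month, and the flat $0.014 average cost per call are assumptions; real traffic and token counts vary), the savings can be modeled in a few lines of Python:
# Rough cost model behind the savings table below (30-day month assumed).
def monthly_llm_cost(calls_per_day, cost_per_call=0.014, hit_rate=0.0, days=30):
    uncached_calls = calls_per_day * (1 - hit_rate)   # only cache misses reach the API
    return uncached_calls * cost_per_call * days

for volume in [100_000, 500_000, 1_000_000, 5_000_000, 10_000_000]:
    base = monthly_llm_cost(volume)
    cached = monthly_llm_cost(volume, hit_rate=0.60)
    print(f"{volume:>10,} calls/day: ${base:>12,.0f} -> ${cached:>11,.0f} (saves ${base - cached:,.0f})")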
| Daily Call Volume | Monthly Cost (uncached) | Monthly Cost (60% cached) | Monthly Savings |
|---|---|---|---|
| 100,000 | $42,000 | $16,800 | $25,200 |
| 500,000 | $210,000 | $84,000 | $126,000 |
| 1,000,000 | $420,000 | $168,000 | $252,000 |
| 5,000,000 | $2,100,000 | $840,000 | $1,260,000 |
| 10,000,000 | $4,200,000 | $1,680,000 | $2,520,000 |
Strategy 1: Exact Match Caching
The simplest caching strategy is exact match. Hash the prompt. Check if you have seen that exact hash before. If yes, return the cached completion. If no, call the LLM and cache the result. This is standard HTTP-style caching applied to LLM requests.
The Computation Fingerprint
The cache key is not just the prompt text. It is a computation fingerprint that includes every parameter that affects the output. If any parameter changes, the cache key changes, and the request goes to the LLM.
fingerprint = SHA3-256(
prompt_text ||
model_version ||
temperature ||
max_tokens ||
system_prompt ||
top_p ||
frequency_penalty ||
presence_penalty
)
This fingerprint ensures that a prompt cached with temperature=0 is not returned for the same prompt with temperature=0.7. It ensures that a prompt cached against GPT-4 is not returned for the same prompt against GPT-3.5. Every parameter that affects the output is included in the fingerprint, so the cache only returns results that were generated under identical conditions.
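In Python, the same fingerprint can be sketched with the standard library. This is a minimal illustration rather than any particular implementation; the helper name and parameter list are placeholders, and the key point is canonical serialization so that identical requests always hash to the same key.
import hashlib
import json

def request_fingerprint(prompt_text, system_prompt, model_version, **params):
    # Canonical JSON (sorted keys) keeps the byte layout stable across calls.
    payload = json.dumps({
        "prompt": prompt_text,
        "system": system_prompt,
        "model": model_version,
        "params": params,   # temperature, max_tokens, top_p, penalties, ...
    }, sort_keys=True)
    return hashlib.sha3_256(payload.encode("utf-8")).hexdigest()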
Exact match caching catches 15-20% of calls in most production systems. That might sound low, but at 1 million calls per day, 15% is 150,000 calls. At $0.014 per call, that is $2,100 per day, roughly $63,000 per month, in savings from the simplest possible caching strategy. The implementation is trivial: hash the inputs, check the cache, store on miss. There are no machine learning models, no embedding computations, no similarity thresholds to tune.
The Determinism Question
A common objection to LLM caching is that LLMs are non-deterministic. The same prompt can produce different completions. This is true at temperature > 0, where the model samples from a probability distribution and introduces randomness. At temperature = 0, decoding is greedy and output is effectively deterministic for most providers: the same prompt produces the same completion the vast majority of the time, with only occasional variation from floating-point and batching effects. For any use case that runs at temperature = 0 -- which includes most production applications in code generation, data extraction, classification, and structured output -- exact match caching is safe. The cached response matches what the LLM would have produced.
For use cases with temperature > 0 (creative writing, brainstorming, conversational variety), exact match caching is still useful if the application can tolerate returning the same response for the same prompt. Many customer support bots and FAQ systems can. The user asks "how do I reset my password," and the answer is the same every time regardless of whether the model samples it fresh or the cache returns a previously generated response.
Strategy 2: Semantic Caching
Exact match caching misses near-duplicates. "How do I reset my password" and "I need to reset my password, how?" are different strings with different hashes, but they have the same intent and should produce the same response. Semantic caching catches these near-duplicates by comparing the meaning of prompts rather than their exact text.
The approach is straightforward. When a prompt arrives, compute its embedding vector using a lightweight embedding model (not the expensive LLM). Compare the embedding against cached prompt embeddings using cosine similarity. If the similarity exceeds a threshold (typically 0.95), return the cached completion associated with the closest matching prompt. If no match exceeds the threshold, call the LLM and cache both the prompt embedding and the completion.
# Semantic cache lookup
embedding = embed_model.encode(prompt)                # ~0.5ms
matches = cache.nearest(embedding, threshold=0.95)
if matches:
    return matches[0].completion                      # Cache hit: ~0.6ms total
else:
    completion = llm.complete(prompt)                 # Cache miss: full API call
    cache.insert(embedding, prompt, completion)
    return completion
The embedding computation adds roughly 0.5 milliseconds of overhead per request. This is negligible compared to the 500-2000 millisecond latency of an LLM API call. The similarity search adds another 0.1-1 millisecond depending on the size of the embedding index. The total overhead for a semantic cache hit is under 2 milliseconds, compared to the 500+ milliseconds and $0.014 cost of a fresh LLM call.
Choosing the Similarity Threshold
The threshold is the critical tuning parameter. At 0.99, semantic caching behaves almost like exact match -- it only catches near-identical prompts with minor punctuation or whitespace differences. At 0.90, it catches a broader range of paraphrases but risks returning irrelevant cached responses for prompts that happen to be superficially similar. The sweet spot for most applications is 0.95, which catches genuine paraphrases while maintaining high precision.
In production systems, semantic caching catches 35-45% of calls on top of what exact match already catches. The combined hit rate -- exact match plus semantic -- reaches 50-60% for applications with repetitive query patterns like customer support, documentation search, and code generation.
Threshold Sensitivity
Setting the semantic similarity threshold too low (below 0.92) risks returning cached responses for prompts that look similar but have different intent. "How do I delete my account" and "How do I create my account" have high lexical overlap but opposite intent. Monitor false-positive rates when tuning the threshold. Start at 0.95 and lower gradually while measuring response accuracy.
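A quick way to build intuition for the threshold is to score known paraphrase pairs and known trap pairs before turning the cache on. A minimal sketch, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (any lightweight embedding model works; the pairs are illustrative):
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
pairs = [
    ("How do I reset my password", "I need to reset my password, how?"),  # should match
    ("How do I delete my account", "How do I create my account"),         # must not match
]
for a, b in pairs:
    emb_a, emb_b = model.encode([a, b])
    score = util.cos_sim(emb_a, emb_b).item()
    print(f"{score:.3f}  {a!r} vs {b!r}")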
Strategy 3: Prefix Caching
Many LLM applications use a system prompt that is identical across all requests. A customer support bot might prepend a 2,000-token system prompt to every user message. A code assistant might include a 1,500-token context block describing the codebase. These system prompts are sent to the LLM on every call, and the provider charges for every input token, including the system prompt tokens that are identical across requests.
Prefix caching exploits this pattern. When the system prompt is the same across multiple requests, the LLM provider (or your caching layer) can cache the processed representation of the system prompt and only process the variable suffix (the user message) on each call. Some providers offer this natively -- OpenAI and Anthropic both provide prefix caching features that reduce input token costs for repeated prefixes.
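As one sketch of the provider-native route, Anthropic's Messages API lets you mark the shared system prompt as cacheable with a cache_control block. The model name is a placeholder and provider flags have changed over time, so treat this as an assumption and check the current documentation rather than copying it verbatim.
import anthropic

LONG_SYSTEM_PROMPT = "..."                       # the 2,000-token prefix shared by every request
user_message = "How do I reset my password?"

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-latest",            # placeholder model name
    max_tokens=512,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},  # provider caches this prefix across calls
    }],
    messages=[{"role": "user", "content": user_message}],
)
print(response.content[0].text)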
At the application layer, you can implement prefix caching by separating the system prompt from the user message in your cache key computation. If you have already called the LLM with the same system prompt and a similar user message, you can return the cached result. This is especially effective when combined with semantic caching: the system prompt is hashed exactly (it must match byte-for-byte), but the user message is compared semantically (it can be a paraphrase).
fingerprint = SHA3-256(
system_prompt || # Exact match required
model_version ||
temperature
)
# User message: semantic similarity
user_embedding = embed_model.encode(user_message)
bucket = cache.get_bucket(fingerprint)
match = bucket.nearest(user_embedding, threshold=0.95)
Prefix caching by itself saves 10-15% on input token costs for applications with long system prompts. Combined with semantic caching on the user message, it pushes the overall cache hit rate higher because the system prompt no longer pollutes the semantic similarity computation. Two user messages that are similar in meaning will match even if they have different system prompts, as long as you are comparing within the same system-prompt bucket.
Strategy 4: Response Deduplication
Response deduplication is the inverse of prompt caching. Instead of asking "have I seen this prompt before," it asks "have I generated this response before for a different prompt." If multiple different prompts produce the same output, you can identify the equivalence class and cache a single response for all prompts in that class.
This is particularly effective for classification and extraction tasks. Consider a sentiment analysis pipeline that classifies customer reviews. The LLM might receive 10,000 unique review texts, but the output is one of five categories: very positive, positive, neutral, negative, very negative. If you can identify that certain patterns of review text always map to the same category, you can short-circuit the LLM call entirely.
The implementation uses a two-phase approach. In the first phase, you run prompts through the LLM normally and build a map of (prompt_embedding, response) pairs. In the second phase, you analyze the map to identify clusters of prompt embeddings that produced identical responses. For new prompts that fall within an established cluster, you return the cluster's response without calling the LLM.
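A minimal sketch of both phases, assuming unit-normalized prompt embeddings as numpy vectors; the class name and the 0.97 threshold are illustrative, not a prescribed design. It keeps one centroid per distinct response and classifies a new prompt by its nearest centroid.
import numpy as np

class DedupIndex:
    """Maps clusters of prompt embeddings to the single response they all produced."""
    def __init__(self, threshold=0.97):
        self.threshold = threshold
        self.centroids = {}   # response -> (unit centroid vector, member count)

    def update(self, embedding, response):
        # Phase 1: fold each observed (embedding, response) pair into its response cluster.
        if response in self.centroids:
            centroid, n = self.centroids[response]
            blended = centroid * n + embedding
            self.centroids[response] = (blended / np.linalg.norm(blended), n + 1)
        else:
            self.centroids[response] = (embedding, 1)

    def classify(self, embedding):
        # Phase 2: return a cached response only if the prompt lands inside a known cluster.
        best_response, best_score = None, self.threshold
        for response, (centroid, _) in self.centroids.items():
            score = float(np.dot(embedding, centroid))   # cosine similarity for unit vectors
            if score >= best_score:
                best_response, best_score = response, score
        return best_response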
Response deduplication catches 5-10% of calls that the other strategies miss. Its primary value is in batch processing pipelines where the output space is smaller than the input space -- classification, entity extraction, yes/no decisions, and structured data extraction.
Combined Architecture
The four strategies compose into a layered caching architecture. Each layer catches a different type of redundancy, and the combined effect is greater than any individual strategy.
def llm_cached(prompt, system_prompt, params):
    # Layer 1: Exact match (15-20% hit rate)
    exact_key = sha3_256(prompt + system_prompt + serialize(params))  # serialize(): canonical JSON of params
    if result := exact_cache.get(exact_key):
        return result                                    # 31ns
    # Layer 2: Prefix + semantic match (35-45% hit rate)
    prefix_key = sha3_256(system_prompt + serialize(params))
    user_emb = embed(prompt)                             # 0.5ms
    bucket = semantic_cache.get_bucket(prefix_key)
    if match := bucket.nearest(user_emb, threshold=0.95):
        return match.completion                          # 0.6ms
    # Layer 3: Response dedup cluster (5-10% hit rate)
    if cluster := dedup_index.classify(user_emb):
        return cluster.response                          # 0.3ms
    # Cache miss: call LLM
    result = llm.complete(prompt, system_prompt, params)
    # Populate all cache layers
    exact_cache.set(exact_key, result, ttl=3600)
    bucket.insert(user_emb, prompt, result)
    dedup_index.update(user_emb, result)
    return result
The layers are checked in order of precision. Exact match comes first: it is both the fastest lookup (31 nanoseconds) and the most precise (zero false positives). Semantic match is next (0.6 milliseconds, precision depends on the threshold), followed by response deduplication (0.3 milliseconds, precision depends on cluster quality). A request only reaches the LLM if all three layers miss.
Cache Hit Rate by Layer
| Layer | Strategy | Incremental Hit Rate | Lookup Time | False Positive Risk |
|---|---|---|---|---|
| L1 | Exact match | 15-20% | 31 ns | None |
| L2 | Semantic match | 35-45% | 0.6 ms | Low (threshold-dependent) |
| L3 | Response dedup | 5-10% | 0.3 ms | Medium (cluster-dependent) |
| -- | Combined | 55-65% | Varies | -- |
TTL and Invalidation
Cached LLM responses must expire. The world changes. Product features change. Pricing changes. A cached response about your return policy from three months ago might be wrong today. TTL (time-to-live) controls how long cached responses remain valid before they are evicted and the next request goes to the LLM for a fresh response.
The right TTL depends on the content type. Factual responses about stable information (how to reset a password, what programming language syntax looks like) can have long TTLs of 24-72 hours. Responses about dynamic information (current pricing, stock availability, real-time data) need short TTLs of 5-30 minutes. Responses about time-sensitive information (today's news, current weather) should not be cached at all or should have TTLs under 1 minute.
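In practice this usually lands as a small TTL policy consulted when a response is written to the cache. A sketch, with assumed category names and the TTLs suggested above (how prompts get tagged with a category is application-specific, often by route or feature rather than by classifying text):
# TTLs in seconds by content category, following the guidance above.
TTL_POLICY = {
    "stable_fact": 72 * 3600,    # password resets, language syntax
    "dynamic_data": 15 * 60,     # pricing, stock availability
    "time_sensitive": 0,         # news, weather: do not cache
}

def ttl_for(category):
    return TTL_POLICY.get(category, 3600)   # default: 1 hour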
Invalidation is the harder problem. When your product documentation changes, all cached responses about that documentation are now potentially stale. You need a way to invalidate cached responses that reference changed source material. The cleanest approach is to include a version hash of the source material in the cache fingerprint. When the documentation changes, the version hash changes, the fingerprint changes, and old cached responses are naturally evicted.
# Include source material version in fingerprint
fingerprint = SHA3-256(
prompt ||
model_version ||
temperature ||
system_prompt ||
docs_version_hash # Changes when docs update
)
Latency Improvement
Cost reduction is the primary motivation, but the latency improvement is equally significant. An LLM API call takes 500-2000 milliseconds depending on the model, token count, and provider load. A cached response lookup takes 31 nanoseconds for an exact match or 0.6 milliseconds for a semantic match. That is roughly an 800x to 3,000x reduction for a semantic hit, and tens of millions of times faster for an exact hit.
For interactive applications -- chat interfaces, code completions, search results -- latency determines user experience. A 1.5-second response feels sluggish. A sub-millisecond cached response feels instant. Users do not care whether the response was generated fresh or served from cache. They care that it arrived fast and answered their question.
The latency improvement also enables architectures that would be impractical with raw LLM calls. Real-time features that need sub-100ms responses cannot wait for a 1.5-second LLM call. But if 60% of requests hit the cache at sub-millisecond latency, the average response time drops to 600 milliseconds (0.6 * 1ms + 0.4 * 1500ms), and the P50 response time is under 1 millisecond. That changes what you can build.
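The same arithmetic works for any hit rate and latency mix. A tiny sketch, assuming a 1.5-second fresh call and roughly 1 millisecond for a cache hit:
def mean_latency_ms(hit_rate, hit_ms=1.0, miss_ms=1500.0):
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms

print(mean_latency_ms(0.60))   # ~600 ms mean; the median request is a cache hit, so P50 < 1 ms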
Latency Comparison
Fresh LLM API call: 500-2000ms. Exact match cache hit: 31ns (roughly 16 million to 64 million times faster). Semantic cache hit: 0.6ms (833x-3,333x faster). At 60% cache hit rate, median response time drops from 1.5 seconds to under 1 millisecond. The user perceives instant responses for the majority of queries.
What Not to Cache
Not every LLM call should be cached. Some use cases require fresh responses every time, and caching would degrade the user experience or produce incorrect results.
Creative content generation at high temperature is designed to produce different outputs each time. If a user asks for "five marketing tagline ideas," they expect different ideas each time they ask. Caching would return the same five taglines repeatedly, defeating the purpose of the feature.
Conversation with memory. If the LLM is maintaining a conversation history and the response depends on the full context window, caching based on the latest user message alone will produce incorrect results. The same message ("yes, do that") has entirely different meanings depending on the conversation context.
Real-time data queries. "What is the current price of Bitcoin" has a different correct answer every minute. Caching this response would return stale data. Either do not cache, or use a very short TTL (under 60 seconds) and accept that some responses will be slightly stale.
Personalized responses. If the response depends on user-specific context (their purchase history, their account settings, their location), the cache key must include those context elements. A response cached for User A should never be returned for User B. This is solvable -- include the user context in the fingerprint -- but it reduces the cache hit rate because fewer requests share the same (prompt + user context) combination.
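One way to express that, extending the fingerprint notation used throughout this post (user_context_hash is an illustrative field, built from whatever per-user signals actually shape the response):
fingerprint = SHA3-256(
prompt ||
model_version ||
temperature ||
system_prompt ||
user_context_hash # purchase history, settings, locale: scopes entries to one user
)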
Implementation with Cachee
Setting up LLM caching with Cachee requires three steps: initializing the cache, wrapping your LLM client, and configuring TTL and eviction. The cache runs in-process, so there is no network round-trip for cache lookups. Exact match lookups complete in 31 nanoseconds.
# Initialize Cachee with LLM caching profile
cachee init --profile llm-cache \
--capacity 500000 \
--eviction cachee-lfu \
--ttl-default 3600
# Start the cache
cachee start
# Monitor LLM cache performance
cachee status --profile llm-cache
# Output:
# LLM Cache:
# Entries: 234,891 / 500,000
# Hit rate: 61.7%
# Exact hits: 18.3%
# Semantic: 38.2%
# Dedup: 5.2%
# Avg hit: 31ns (exact) / 0.6ms (semantic)
# Saved: $8,640 this month
# Evictions: 1,247 (CacheeLFU)
The CacheeLFU eviction policy is important for LLM caching. Frequency-based eviction keeps the most commonly requested responses in cache and evicts rarely-accessed responses. This matches the access pattern of LLM applications, where a small number of common queries (password reset, pricing questions, API documentation) dominate traffic and benefit most from caching.
The Cost Math
Here is the math for a production application processing 2 million LLM calls per day at an average cost of $0.014 per call.
Without caching: 2,000,000 calls per day multiplied by $0.014 per call equals $28,000 per day, or $840,000 per month. With 60% cache hit rate: 800,000 calls per day reach the LLM. 1,200,000 calls per day are served from cache. LLM cost: 800,000 multiplied by $0.014 equals $11,200 per day, or $336,000 per month. Cache infrastructure cost: negligible (in-process, no additional servers). Monthly savings: $504,000.
The embedding model for semantic caching adds a small cost. A lightweight embedding model processes the 2 million daily prompts at roughly $0.0001 per prompt, about 140x cheaper than the $0.014 LLM call. That is $200 per day or $6,000 per month. Net savings: $498,000 per month. The cache pays for itself within the first hour of operation.
Even at smaller scales, the economics are compelling. At 100,000 calls per day, monthly LLM cost is $42,000. At 60% cache hit rate, monthly cost drops to $16,800. Monthly savings: $25,200. The embedding cost is $300 per month. Net savings: $24,900 per month.
Beyond Cost: Reliability
LLM APIs have outages. OpenAI's status page shows multiple incidents per month. When the API is down, uncached requests fail. Cached requests succeed. A cache with a 60% hit rate means that 60% of your traffic continues to be served during an API outage. For many applications, that is the difference between a degraded experience and a total outage.
Rate limits are another reliability concern. OpenAI imposes rate limits on API calls. If your application exceeds the rate limit, requests are throttled or rejected. A cache that handles 60% of requests reduces your effective API call rate by 60%, keeping you well under rate limits that would otherwise require upgrading to a higher tier or implementing complex backoff logic.
The cache also provides a natural failover mechanism. If the LLM API returns an error, you can fall back to the most recent cached response for a similar prompt (using semantic matching with a slightly lower threshold). The response might not be perfectly fresh, but it is better than an error page. This degraded-but-functional mode is possible because the cache retains responses even after they would normally expire. You keep expired entries in a "stale" pool and serve them only when the live API is unavailable.
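A sketch of that failover path, reusing the llm_cached function from the combined-architecture example; the exception type, the stale_pool, and the relaxed 0.92 threshold are all illustrative names and values, not a fixed API:
def llm_with_failover(prompt, system_prompt, params):
    try:
        return llm_cached(prompt, system_prompt, params)   # normal path: cache layers, then API
    except LLMAPIError:
        # Degraded mode: accept a slightly stale, slightly looser semantic match.
        user_emb = embed(prompt)
        stale = stale_pool.nearest(user_emb, threshold=0.92)
        if stale:
            return stale.completion
        raise                                              # nothing usable in the stale pool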
The Bottom Line
40% of LLM API calls are near-duplicates that produce functionally identical responses. A layered caching strategy -- exact match, semantic similarity, prefix caching, and response deduplication -- eliminates 60% of API calls. At $0.014 per call and 1 million calls per day, that is $8,400 per day, roughly $252,000 per month, in savings. The cache lookup takes 31 nanoseconds instead of 1.5 seconds, improving both cost and latency. Cache the computation, not just the bytes.
Cut LLM API costs 60%. Serve cached completions at 31 nanoseconds.
brew install cachee