Every duplicate or near-duplicate prompt to GPT-4 costs you $0.03–0.06. At 100K requests per day, that is $3,000–6,000 per day wasted on answers you have already paid for. Each of those answers was generated once, billed at full price, and discarded, and then your application fires the same question again with slightly different phrasing. Semantic caching matches similar prompts — not just identical ones — and serves cached responses instantly. The result is a 40–60% reduction in API spend, sub-millisecond response times on cache hits, and an architecture that actually gets cheaper as your traffic grows.
Why Traditional Caching Doesn’t Work for LLMs
You might think the obvious solution is to cache LLM responses the same way you cache database queries: hash the input, store the output, serve it on match. The problem is that natural language does not hash deterministically. A user asking “What is Kubernetes?” and another asking “Explain Kubernetes to me” and a third asking “Can you tell me about Kubernetes?” will produce three completely different cache keys. Three cache misses. Three API calls to GPT-4. Three nearly identical responses billed at full price. The semantic meaning is the same. The cache key is not.
This is why hash-based caching typically achieves less than a 5% hit rate on LLM traffic. Natural language is inherently variable. Users rephrase, abbreviate, add filler words, change sentence structure, and use synonyms — all while asking the same fundamental question. Even slight variations like capitalization, trailing punctuation, or an extra “please” at the end generate a different SHA-256 hash. Traditional caching architectures were designed for structured, deterministic inputs like SQL queries or API parameters. They were never designed for the fuzzy, probabilistic nature of human language. Applying exact-match caching to LLM prompts is like using a combination lock where the combination changes every time someone rephrases the same request.
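A minimal sketch makes the failure concrete. Hashing the three Kubernetes questions from above with SHA-256 — the standard exact-match cache key — yields three unrelated digests, so every rephrasing is a guaranteed miss:

```python
import hashlib

def exact_cache_key(prompt: str) -> str:
    """Hash-based cache key: any surface variation changes the digest."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

prompts = [
    "What is Kubernetes?",
    "Explain Kubernetes to me",
    "Can you tell me about Kubernetes?",
]

# Three semantically identical prompts produce three distinct keys,
# which means three cache misses and three full-price API calls.
keys = {exact_cache_key(p) for p in prompts}
print(len(keys))  # 3
```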
What Is Semantic Caching?
Semantic caching replaces the hash function with a vector embedding. Instead of hashing the prompt string, you convert it into a high-dimensional vector that captures its meaning. When a new prompt arrives, you embed it and search for the nearest cached prompt vector using cosine similarity. If the similarity score exceeds a configurable threshold — typically 0.92–0.97 — you serve the cached response instead of calling the LLM API. The key insight is that semantically identical questions produce nearly identical embeddings regardless of how the words are arranged.
“What is Kubernetes?” and “Explain Kubernetes to me” might have a cosine similarity of 0.96. That is well above a 0.93 threshold, so the second prompt gets the cached response from the first — no API call, no tokens consumed, no latency penalty. This approach routinely achieves 40–60% hit rates on typical LLM traffic, compared to the less-than-5% hit rate of hash-based caching. The improvement comes from the fundamental reality that humans ask the same questions in dozens of different ways, and embeddings collapse all those surface-level variations into a single semantic neighborhood. The embedding model itself is small and fast — a 384-dimension embedding takes roughly 2–5ms to compute — so the overhead of the similarity search is negligible compared to the 800ms–3s latency of an actual LLM API call.
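The lookup logic reduces to a nearest-neighbor search with a threshold check. Here is a sketch using plain Python and toy 4-dimension vectors standing in for real 384-dimension embeddings (production systems would use an embedding model and an approximate nearest neighbor index instead of a linear scan):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def lookup(query_vec, cache, threshold=0.93):
    """Return the cached response nearest to query_vec if it clears the threshold."""
    best_sim, best_resp = -1.0, None
    for vec, response in cache:
        sim = cosine_similarity(query_vec, vec)
        if sim > best_sim:
            best_sim, best_resp = sim, response
    return best_resp if best_sim >= threshold else None

# Toy vectors: a rephrased question lands near the cached one, an unrelated
# question lands far away.
cached = [([0.9, 0.1, 0.3, 0.2], "Kubernetes is a container orchestrator...")]
near = [0.88, 0.12, 0.31, 0.22]  # high similarity -> cache hit
far = [0.1, 0.9, 0.05, 0.4]      # low similarity -> miss, call the LLM

print(lookup(near, cached) is not None)  # True
print(lookup(far, cached) is None)       # True
```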
The Cost Math
LLM APIs price by token volume. GPT-4o charges $5 per million input tokens and $15 per million output tokens. Claude 3.5 Sonnet charges $3 per million input and $15 per million output. At scale, these numbers compound fast. Consider a customer support chatbot handling 100,000 queries per day with an average of 500 input tokens and 800 output tokens per request. Without caching, you are consuming 50 million input tokens and 80 million output tokens daily. With GPT-4o, that is $250/day in input costs and $1,200/day in output costs — $1,450/day total, or roughly $529,000 per year.
At a 50% semantic cache hit rate, half of those requests never reach the API. Your annual spend drops from $529K to $265K — a savings of $264,000 per year. At 60% hit rate, the savings climb to $317K. The cache infrastructure itself costs a fraction of that: embedding computation, vector storage, and the cache layer combined run under $500/month at this scale. The ROI is not marginal. It is an order of magnitude.
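The arithmetic above is easy to reproduce and adapt to your own token volumes. This sketch uses the article's figures (100K requests/day, 500 input and 800 output tokens, GPT-4o at $5/$15 per million tokens):

```python
def daily_cost(requests, in_tokens, out_tokens, in_price, out_price, hit_rate=0.0):
    """Daily API spend in dollars; prices are per million tokens.
    Requests served from cache never reach the API."""
    misses = requests * (1 - hit_rate)
    return misses * (in_tokens * in_price + out_tokens * out_price) / 1_000_000

base = daily_cost(100_000, 500, 800, 5, 15)                 # $1,450/day
half = daily_cost(100_000, 500, 800, 5, 15, hit_rate=0.5)   # $725/day
annual_savings = (base - half) * 365                        # ~$264K/year

print(f"${base:,.0f}/day without cache; ${annual_savings:,.0f}/yr saved at 50% hits")
```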
| Model | Daily Cost (No Cache) | 50% Hit Rate | 60% Hit Rate | Annual Savings (60%) |
|---|---|---|---|---|
| GPT-4o | $1,450/day | $725/day | $580/day | $317K |
| Claude 3.5 Sonnet | $1,350/day | $675/day | $540/day | $296K |
| GPT-4o mini | $135/day | $67.50/day | $54/day | $29.6K |
| Gemini 1.5 Pro | $1,025/day | $512/day | $410/day | $225K |
These numbers assume 100K requests/day at 500 input + 800 output tokens per request. Your actual savings scale linearly with volume. At 1M requests/day, multiply everything by 10. The economics only improve at scale because cache hit rates tend to increase as traffic grows — more requests means more opportunities for semantic overlap.
How Cachee Handles LLM Caching
Cachee implements a two-tier architecture purpose-built for AI inference caching. The first tier is an L1 in-process cache that handles exact-match lookups in 1.5 microseconds. If the exact prompt string has been seen before, the response is returned from the application’s own memory — no embedding computation, no similarity search, no network hop. This catches the surprisingly large percentage of traffic that is truly identical: automated health checks, retry loops in integrations, bots, and power users who ask the same question repeatedly.
The second tier is the L2 semantic search layer. When L1 misses, the prompt is embedded and compared against the cached embedding index using approximate nearest neighbor search. If the cosine similarity exceeds the configurable threshold (default 0.95), the cached response is returned. The threshold is tunable per use case: a customer support bot might use 0.93 (more aggressive matching, higher hit rate), while a code generation tool might use 0.97 (stricter matching, higher precision). Cachee also supports TTL-aware semantic caching — queries about time-sensitive data (“What is Bitcoin’s price?”) can be configured with shorter TTLs or excluded from semantic matching entirely, ensuring stale financial data or live statistics are never served from cache.
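The control flow of the two tiers can be sketched in a few lines. This is an illustrative stand-in, not Cachee's implementation: `toy_embed` is a deliberately crude bag-of-characters embedding used only to exercise the L1/L2 paths, and the L2 scan is linear rather than an approximate nearest neighbor index:

```python
import math

class TwoTierCache:
    """Sketch: L1 exact-match dict backed by an L2 semantic index."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed
        self.threshold = threshold
        self.l1 = {}   # exact prompt string -> response
        self.l2 = []   # (embedding, response) pairs

    def get(self, prompt):
        if prompt in self.l1:        # L1: exact hit, no embedding work at all
            return self.l1[prompt]
        qv = self.embed(prompt)      # L2: embed, then search by cosine similarity
        best_sim, best = -1.0, None
        for vec, resp in self.l2:
            dot = sum(a * b for a, b in zip(qv, vec))
            norm = math.sqrt(sum(a * a for a in qv)) * math.sqrt(sum(b * b for b in vec))
            sim = dot / norm if norm else 0.0
            if sim > best_sim:
                best_sim, best = sim, resp
        return best if best_sim >= self.threshold else None

    def put(self, prompt, response):
        self.l1[prompt] = response
        self.l2.append((self.embed(prompt), response))

# Toy embedding (character counts) just to demo the control flow.
def toy_embed(text):
    t = text.lower()
    return [t.count(c) for c in "abcdefghijklmnopqrstuvwxyz "]

cache = TwoTierCache(toy_embed)
cache.put("What is Kubernetes?", "Kubernetes is a container orchestrator...")
print(cache.get("What is Kubernetes?"))  # L1 exact hit
print(cache.get("zzz"))                  # no match in either tier -> None
```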
Predictive pre-warming takes this further. Cachee’s ML layer learns prompt patterns and pre-computes embeddings for anticipated queries before they arrive. If your support bot sees a spike in “how do I reset my password” queries every Monday morning, the embedding and response are already warm in L1 before the first user types the question.
Beyond Cost: Latency
Cost savings are the headline number, but the latency improvement is what your users actually feel. A GPT-4o API call takes 800ms–3 seconds depending on output length, server load, and whether you are hitting rate limits. During peak hours, latency can spike to 5–8 seconds. Your users see a spinner. They see “thinking...” animations. They wait. A cached response from Cachee’s L1 tier returns in 1.5 microseconds. Not milliseconds — microseconds. That is the difference between a chatbot that feels like it is “generating” and one that feels like it already knows the answer. For AI infrastructure that serves real-time user interactions, this latency gap is the difference between an application that feels intelligent and one that feels slow.
The latency benefit also eliminates a class of reliability problems. When the OpenAI API experiences degraded performance or outages — which happens more often than their status page suggests — your cached responses continue serving instantly. Your application develops a resilience layer that absorbs API instability without any user-visible impact. At 50–60% cache hit rate, half or more of your traffic is completely decoupled from upstream API availability.
Implementation: What Changes in Your Code
The architectural shift from “call LLM on every request” to “check cache, then call LLM” requires minimal code changes but delivers outsized returns. The pattern is straightforward: intercept the prompt before it reaches the API, check both exact and semantic caches, and only forward to the LLM on a true miss. On the response path, store the result with its embedding for future lookups. The entire integration is typically 10–15 lines of code.
The critical design decisions are around threshold tuning and TTL policy. A threshold of 0.95 is a safe default for most applications — it catches rephrased questions while avoiding false positives where semantically adjacent but meaningfully different questions return the wrong cached answer. Lower the threshold to 0.92 for high-volume, low-stakes use cases like FAQ bots. Raise it to 0.97–0.99 for applications where precision matters more than hit rate, such as medical or legal information retrieval. TTL should reflect the freshness requirements of your data: static knowledge bases can use 24-hour or longer TTLs, while anything referencing real-time data should use short TTLs or bypass semantic caching entirely.
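The intercept-then-forward pattern described above fits in a short wrapper. The names here (`cached_completion`, the cache client's `get`/`put`, `call_llm`) are hypothetical placeholders for whatever cache client and LLM SDK you use; the stub below exists only to demonstrate the control flow:

```python
import time

def cached_completion(prompt, cache, call_llm, ttl_seconds=86400, threshold=0.95):
    """Check the cache first; only forward to the LLM on a true miss.
    Entries older than ttl_seconds are treated as expired."""
    hit = cache.get(prompt, threshold=threshold)
    if hit is not None and time.time() - hit["stored_at"] < ttl_seconds:
        return hit["response"]          # cache hit: no API call, no tokens billed
    response = call_llm(prompt)         # true miss: pay for the tokens once
    cache.put(prompt, {"response": response, "stored_at": time.time()})
    return response

class ExactCacheStub:
    """Stand-in cache doing exact matching only, to keep the demo self-contained."""
    def __init__(self):
        self.store = {}
    def get(self, prompt, threshold=0.95):
        return self.store.get(prompt)
    def put(self, prompt, entry):
        self.store[prompt] = entry

calls = []
def fake_llm(p):
    calls.append(p)
    return f"answer to {p}"

cache = ExactCacheStub()
first = cached_completion("What is Kubernetes?", cache, fake_llm)
second = cached_completion("What is Kubernetes?", cache, fake_llm)
print(first == second, len(calls))  # second request never touched the "API"
```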
When Not to Cache
Semantic caching is not universally applicable. There are categories of LLM traffic where caching is inappropriate or counterproductive, and understanding these boundaries is as important as understanding the benefits.
- Personalized responses: If the LLM output depends on user-specific context (account history, preferences, prior conversation), a semantically similar prompt from a different user should not return the same cached answer. Namespace your cache per user or per conversation thread.
- Creative generation: Applications that rely on LLM creativity — marketing copy variations, brainstorming, fiction — need different outputs for the same input. Caching defeats the purpose.
- Real-time data: “What is the stock price of AAPL?” requires a live answer. Serve this from cache and you serve stale data. Use TTL-based expiration measured in seconds, not hours.
- Multi-turn conversations: The same user message means different things in different conversation contexts. Cache at the full-context level (system prompt + message history), not the individual message level.
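Two of these boundaries — per-user namespacing and real-time bypass — are straightforward to enforce in code. The helpers below are illustrative sketches under simplifying assumptions: the keyword screen is deliberately crude (production systems would classify with a model or route on endpoint metadata), and the namespaced key folds user ID, system prompt, and message history into the cache key so one user's personalized answer never leaks to another:

```python
import hashlib

REALTIME_MARKERS = ("price", "stock", "score", "weather")

def should_bypass_cache(prompt: str) -> bool:
    """Crude keyword screen for queries that need a live answer."""
    p = prompt.lower()
    return any(marker in p for marker in REALTIME_MARKERS)

def namespaced_key(user_id, system_prompt, history, message):
    """Scope the cache entry to user + full conversation context."""
    context = "\x1f".join([user_id, system_prompt, *history, message])
    return hashlib.sha256(context.encode("utf-8")).hexdigest()

print(should_bypass_cache("What is the stock price of AAPL?"))  # True
print(should_bypass_cache("What is Kubernetes?"))               # False
# Same message, different user or different history -> different cache entry
print(namespaced_key("u1", "sys", [], "hi") != namespaced_key("u2", "sys", [], "hi"))
```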
For everything else — knowledge retrieval, FAQ responses, documentation queries, classification tasks, entity extraction, summarization of static documents — semantic caching delivers massive savings with near-zero risk of incorrect responses.
Further Reading
- AI Inference Caching: How It Works
- Predictive Caching: AI Pre-Warming for LLM Responses
- AI Infrastructure Solutions
- How to Reduce Redis Latency in Production
- Cut ElastiCache Costs Without Losing Performance
- Cachee Performance Benchmarks
Stop Paying for Answers You Already Have.
Semantic caching cuts LLM API costs 40–60% and delivers cached responses in 1.5µs instead of seconds. See it in action.
Start Free Trial | Schedule Demo