AI agents built on LangChain, CrewAI, and AutoGPT make 3–5 LLM calls per user task. A research agent might call GPT-4o to parse the query, call it again to select tools, a third time to synthesize results, and a fourth time to format the output. At $0.03 per call, a single agent task costs $0.09–0.15. Scale that to 50,000 tasks per day and you are spending $4,500–7,500 per day on LLM API calls alone — and 40–60% of those calls regenerate answers that another user's task already produced minutes or hours earlier. The problem is that you cannot blindly cache agent responses, because context matters. The solution is context-aware semantic caching.
The Hidden Repetition in Agent Workflows
AI agents decompose complex tasks into sub-steps, and many of those sub-steps are remarkably similar across users. Consider a customer support agent that handles “How do I reset my password?” and “I need to change my password.” Both queries trigger the same internal chain: intent classification (returns “password_reset”), tool lookup (returns the password reset API endpoint), knowledge retrieval (returns the password reset documentation), and response generation (returns step-by-step instructions). Four LLM calls, all producing nearly identical intermediate results.
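The four-call chain above can be sketched in a few lines. This is a toy illustration, not real LangChain code: `llm` is a stub standing in for a GPT-4o call, and the prompt strings are invented for clarity. The point is simply that one user task fans out into four separate LLM invocations.

```python
calls = []

def llm(prompt: str) -> str:
    # Stub standing in for a GPT-4o API call; records each invocation
    calls.append(prompt)
    return f"<output for: {prompt[:40]}>"

def support_agent(query: str) -> str:
    intent = llm("Classify intent: " + query)      # call 1: intent classification
    tool = llm("Select tool for: " + intent)       # call 2: tool lookup
    docs = llm("Retrieve knowledge via: " + tool)  # call 3: knowledge retrieval
    return llm("Generate answer from: " + docs)    # call 4: response generation

support_agent("How do I reset my password?")
print(len(calls))  # four LLM calls for a single user task
```

Run the same function for "I need to change my password" and all four intermediate outputs would be nearly identical — which is exactly the repetition a cache can exploit.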
The data confirms this pattern consistently. Across enterprise deployments, 40–60% of agent sub-calls produce outputs that are semantically equivalent to a previous call. The breakdown by call type:
- Tool selection / routing: 70–85% cacheable. The mapping from intent to tool is deterministic for most queries. If “reset my password” routes to the password_reset tool, so will every semantically equivalent rephrasing.
- Knowledge retrieval / RAG: 50–70% cacheable. The same document chunks are retrieved for similar questions. Caching the retrieval results avoids redundant vector search and LLM re-reading.
- System prompt processing: 90%+ cacheable. The system prompt is identical across all users. The LLM’s “understanding” of the system prompt does not change between invocations.
- Final response generation: 30–50% cacheable. This depends on whether the user’s specific context changes the answer materially.
Why Naive Caching Breaks Agents
The tempting approach is to hash the prompt and cache the response. This fails for agents because the same prompt means different things in different contexts. “Summarize the results” as step 3 of a research agent depends entirely on what steps 1 and 2 produced. Cache the summary from one user’s research and serve it to another user researching a different topic, and you have delivered a confidently wrong answer.
Multi-turn conversation context compounds the problem. An agent that received “I’m a premium customer” three messages ago should not serve the same cached response as an agent handling a free-tier user, even if the current prompt is identical. The context window — the accumulated state from prior turns — fundamentally changes the meaning of the prompt.
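The failure mode is easy to demonstrate. A naive key hashes only the prompt text, so two users at step 3 of completely different research tasks produce the same key — and the second user is served the first user's summary:

```python
import hashlib

def naive_key(prompt: str) -> str:
    # Hashes the prompt text alone -- context is invisible to the key
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

# User A is researching GPU pricing; user B is researching refund policy.
# Both agents reach step 3 with the identical instruction:
key_user_a = naive_key("Summarize the results")
key_user_b = naive_key("Summarize the results")
print(key_user_a == key_user_b)  # True: a collision, so B gets A's summary
```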
This is why teams that try basic prompt caching on agent workflows typically abandon the approach within weeks. The false positive rate — serving cached responses that are wrong for the current context — is unacceptable. But the solution is not to give up on caching. It is to make the cache context-aware.
Context-Aware Key Generation
The core technique is to construct cache keys that incorporate the relevant context, not just the prompt text. Instead of hashing only the user’s message, you hash the prompt plus the context window that actually affects the output. The key insight is that not all context is relevant. A 20-message conversation history might contain 18 messages that do not affect the current step’s output. Identifying and hashing only the relevant context is what makes this approach both effective and performant.
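A minimal sketch of the technique, using only the standard library. The caller is assumed to have already extracted the context fields that matter for this step (intent, user tier, prior tool results, and so on) — that extraction logic is the application's, not shown here:

```python
import hashlib
import json

def context_aware_key(prompt: str, relevant_context: dict) -> str:
    """Build a cache key from the prompt plus only the context fields
    that affect this step's output -- not the full conversation history."""
    payload = json.dumps(
        {"prompt": prompt.strip().lower(), "context": relevant_context},
        sort_keys=True,  # deterministic serialization across callers
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Same prompt, different relevant context -> different keys (no false positive)
k1 = context_aware_key("Summarize the results", {"topic": "pricing", "tier": "premium"})
k2 = context_aware_key("Summarize the results", {"topic": "refunds", "tier": "free"})

# Irrelevant history was never hashed, so equivalent requests collide (a cache hit)
k3 = context_aware_key("Summarize the results", {"topic": "pricing", "tier": "premium"})
print(k1 != k2, k1 == k3)  # True True
```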
Cachee’s semantic cache takes this further by using vector similarity on the context-aware keys. Two users who asked slightly different questions but triggered the same tool chain with the same prior results will match on the semantic level, even if their exact prompt strings differ. This is what pushes hit rates from the 15–20% you get with hash-based context keys to the 40–60% range that delivers meaningful cost savings.
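Cachee's internals are not shown here, but the idea can be sketched: exact matching on the context key, similarity matching on the prompt. The `embed` function below is a deliberately crude word-count stub so the example runs standalone; a real semantic cache would use a sentence-embedding model and an approximate-nearest-neighbor index instead of a linear scan:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: bag of words. Stands in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold
        self.entries = []  # (prompt embedding, context key, cached response)

    def put(self, prompt: str, context_key: str, response: str) -> None:
        self.entries.append((embed(prompt), context_key, response))

    def get(self, prompt: str, context_key: str):
        query = embed(prompt)
        for vec, ctx, response in self.entries:
            # Context keys must match exactly; prompts only semantically
            if ctx == context_key and cosine(query, vec) >= self.threshold:
                return response
        return None

cache = SemanticCache()
cache.put("how do i reset my password", "tool:password_reset", "Use the reset link.")
hit = cache.get("how can i reset my password please", "tool:password_reset")
miss = cache.get("how do i reset my password", "tool:billing")  # wrong context
```

The exact-match requirement on the context key is what keeps the semantic fuzziness safe: similarity is only allowed to bridge rephrasings, never context differences.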
Per-Step Caching Strategy
Not every step in an agent chain should be cached the same way. The optimal strategy assigns different caching policies based on the step’s position in the chain and its sensitivity to context:
| Agent Step | Cache Strategy | TTL | Hit Rate |
|---|---|---|---|
| Intent classification | Semantic, prompt-only key | 24h | 75–85% |
| Tool selection | Semantic, prompt + intent key | 24h | 70–80% |
| RAG retrieval | Semantic, query + filter key | 1–6h | 50–70% |
| Response generation | Semantic, full context key | 1h | 30–50% |
Early steps (intent classification, tool selection) are highly cacheable because they depend primarily on the user’s question, not on accumulated context. Later steps (response generation) require the full context key but still achieve meaningful hit rates because many users ask similar questions within similar contexts. The per-step approach maximizes overall savings without compromising accuracy at any point in the chain.
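The table maps naturally onto a per-step policy configuration. The step names and field names below are illustrative, not a real framework's API; the point is that each step declares which inputs participate in its key and how long its entries live:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CachePolicy:
    key_parts: tuple       # which inputs participate in the cache key
    ttl_seconds: int       # how long a cached entry stays valid

# One policy per agent step, mirroring the table above
POLICIES = {
    "intent_classification": CachePolicy(("prompt",), 24 * 3600),
    "tool_selection":        CachePolicy(("prompt", "intent"), 24 * 3600),
    "rag_retrieval":         CachePolicy(("query", "filters"), 6 * 3600),
    "response_generation":   CachePolicy(("prompt", "full_context"), 3600),
}

def build_key(step: str, inputs: dict) -> tuple:
    policy = POLICIES[step]
    # Only the fields named by the policy are hashed; everything else
    # (e.g. unrelated conversation history) is ignored
    return (step,) + tuple(str(inputs.get(part, "")) for part in policy.key_parts)

key = build_key("tool_selection", {
    "prompt": "reset my password",
    "intent": "password_reset",
    "history": "20 messages of unrelated chatter",  # excluded from the key
})
```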
The Enterprise Math
Consider an enterprise deployment running AI agents for customer support, internal knowledge retrieval, and sales enablement across three LangChain-based applications:
Without caching: roughly 1.25M tasks per month × 4.2 calls per task × $0.03 per call comes to about $157,500/month, or $1,890,000/year in LLM API costs. With context-aware semantic caching at a blended 50% hit rate across all steps, half those calls are eliminated, for annual savings of $945,000. Even at a conservative 35% blended hit rate, the savings are $661,500/year. For companies like Klarna, Intercom, and Zendesk that have publicly discussed their AI agent deployments handling millions of interactions monthly, these numbers are table stakes.
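The arithmetic is easy to sanity-check. The task volume below (15M tasks/year, about 1.25M/month) is the figure that reproduces a $1.89M annual spend at 4.2 calls per task and $0.03 per call:

```python
TASKS_PER_YEAR = 15_000_000   # ~1.25M tasks/month across the three apps
CALLS_PER_TASK = 4.2          # average LLM calls per agent task
COST_PER_CALL = 0.03          # dollars per call

annual_cost = TASKS_PER_YEAR * CALLS_PER_TASK * COST_PER_CALL
savings_at_50 = annual_cost * 0.50   # blended 50% hit rate
savings_at_35 = annual_cost * 0.35   # conservative 35% hit rate

print(round(annual_cost))    # 1890000
print(round(savings_at_50))  # 945000
print(round(savings_at_35))  # 661500
```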
What Not to Cache
Context-aware caching is powerful, but there are agent steps that should never be cached:
- Actions with side effects: If an agent step triggers an API call that creates, updates, or deletes data, do not cache the result. The action needs to execute every time.
- User-specific data retrieval: Steps that fetch the current user’s account balance, order status, or other live personal data should bypass the cache entirely. Use a TTL of zero or exclude the namespace.
- Steps with randomness requirements: Brainstorming agents, creative writing agents, or any step where output diversity is a feature should not be cached.
- Time-sensitive reasoning: If the agent is answering “What is the current status of my order?” the answer changes by the minute. Apply very short TTLs (seconds) or skip caching.
The principle is straightforward: cache the reasoning steps (classification, routing, knowledge synthesis) and skip the action steps (writes, fetches of live data). Most agent workflows are 60–70% reasoning and 30–40% action, which is why the blended hit rates of 40–60% are achievable without serving incorrect responses.
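In practice this principle becomes a gate applied before every cache lookup. The tag names below are invented for illustration — real agent frameworks expose step metadata in their own ways — but the shape of the check is the same: reasoning steps pass through to the cache, action steps and live-data fetches bypass it unconditionally:

```python
# Illustrative tags; a real deployment would derive these from step metadata
NEVER_CACHE = {"has_side_effects", "live_user_data", "needs_randomness"}

def is_cacheable(step_tags: set, time_sensitive: bool = False) -> bool:
    """Gate applied before the cache lookup: reasoning steps are eligible,
    action steps and live-data fetches bypass the cache entirely."""
    if step_tags & NEVER_CACHE:
        return False
    if time_sensitive:   # e.g. "What is the current status of my order?"
        return False
    return True

print(is_cacheable({"reasoning"}))             # True: classification, routing
print(is_cacheable({"has_side_effects"}))      # False: writes must always run
print(is_cacheable({"reasoning"}, time_sensitive=True))  # False: answer goes stale
```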
For AI infrastructure teams running agents at scale, context-aware caching is not an optimization — it is a requirement. The alternative is paying full price for every LLM call, most of which are regenerating answers your system already knows. The cache does not make your agent dumber. It makes it remember.