An NVIDIA H100 costs anywhere from $2–4 per hour on commodity GPU clouds to more than $12 per hour on AWS. At typical LLM inference workloads, GPU compute utilization sits between 30% and 40%. The GPU is not saturated by model computation — it is waiting. Waiting for KV cache lookups. Waiting for context retrieval. Waiting for token history to arrive from external stores. The data access layer is the bottleneck, and every millisecond the GPU stalls is money evaporating at a rate of tens of thousands of dollars per GPU per year. OpenAI, Anthropic, Cohere, and AI21 Labs all face this problem at enormous scale. The fix is not more GPUs. It is faster data.
Where the GPU Time Actually Goes
LLM inference has two phases: prefill (processing the input prompt in parallel) and decode (generating output tokens one at a time). During prefill, the GPU is compute-bound — matrix multiplications saturate the tensor cores. Utilization can spike above 80%. But decode is fundamentally different. Each token generation requires reading the KV cache for all previous tokens, performing attention computation, and writing the new KV entry. For a 32K context window, the KV cache alone can be 2–4GB per request. The GPU spends more time moving data than computing on it.
This is the memory-bandwidth bottleneck. An H100 has 3.35 TB/s of HBM3 bandwidth, but the KV cache access pattern is irregular and difficult to prefetch. When you add external data dependencies — retrieving context documents, loading user session state, fetching conversation history from Redis or DynamoDB — the problem compounds. Each external fetch introduces milliseconds of stall time during which the GPU sits idle, burning through your cloud budget while producing nothing.
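To put rough numbers on that, the sketch below estimates the KV cache footprint and the time needed just to stream it once from HBM. The model layout (a Llama-3-8B-style configuration with grouped-query attention) is an assumption chosen for illustration; models without grouped-query attention carry caches several times larger.

```python
# Back-of-the-envelope KV cache sizing, assuming a Llama-3-8B-style layout:
# 32 layers, 8 KV heads (grouped-query attention), head_dim 128, fp16 entries.
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
context_len = 32_000  # 32K-token context window

# K and V tensors for one token across all layers
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes   # 128 KB
kv_cache_bytes = kv_bytes_per_token * context_len
print(f"KV cache per request: {kv_cache_bytes / 1e9:.1f} GB")          # ~4.2 GB

# Time to stream the whole cache once from HBM3 at 3.35 TB/s. Every decoded
# token re-reads the cache, so this is a per-token floor for naive attention.
hbm_bandwidth = 3.35e12  # bytes/sec
print(f"Per-token cache read: {kv_cache_bytes / hbm_bandwidth * 1e3:.2f} ms")  # ~1.25 ms
```

At roughly a millisecond of HBM traffic per decoded token before any external fetch even starts, decode is bandwidth-bound rather than compute-bound.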
The Math on Wasted GPU-Hours
On AWS, H100s come in p5.48xlarge instances (8 GPUs) at approximately $98/hr, or $12.25/hr per GPU. Running 24/7, that is $107,310 per GPU per year. At 35% utilization, you are paying for 8,760 GPU-hours but only using 3,066 hours of actual compute. The remaining 5,694 hours — $69,750 per GPU per year — are spent waiting on data.
Scale this to a production LLM serving cluster. A mid-size deployment uses 32–64 GPUs. At 64 GPUs with 35% utilization, the annual waste is $4.46 million in idle GPU-hours. For companies operating at the scale of OpenAI (rumored 25,000+ H100s), Anthropic (14,000+ H100s), or Cohere and AI21 Labs (thousands each), the waste runs into hundreds of millions of dollars annually. Even a 10% improvement in utilization translates to tens of millions saved.
| Cluster Size | Annual GPU Cost | Annual Idle Cost at 35% Util | Annual Savings at 60% vs. 35% Util |
|---|---|---|---|
| 8 GPUs | $858K | $558K | $215K |
| 32 GPUs | $3.43M | $2.23M | $858K |
| 64 GPUs | $6.87M | $4.46M | $1.72M |
| 256 GPUs | $27.5M | $17.8M | $6.87M |
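The table follows directly from the per-GPU rate above. A minimal script to reproduce it, or to plug in your own cluster size and utilization, might look like this:

```python
# Annual cost, idle spend, and recoverable savings per cluster size,
# using the p5.48xlarge rate quoted above (~$12.25 per GPU-hour).
GPU_HOURLY = 12.25
ANNUAL_PER_GPU = GPU_HOURLY * 8_760  # ~$107,310 per GPU per year

def cluster_costs(gpus: int, current_util: float = 0.35, target_util: float = 0.60):
    total = gpus * ANNUAL_PER_GPU
    idle = total * (1 - current_util)             # spend on GPU-hours that do no work
    saved = total * (target_util - current_util)  # idle spend recovered at the target
    return total, idle, saved

for gpus in (8, 32, 64, 256):
    total, idle, saved = cluster_costs(gpus)
    print(f"{gpus:>3} GPUs: ${total/1e6:.2f}M total, ${idle/1e6:.2f}M idle, ${saved/1e6:.2f}M saved")
```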
Why External Data Stores Are the Problem
The GPU data access bottleneck has two components: on-GPU memory bandwidth (the KV cache problem, addressed by hardware and model architecture) and off-GPU data fetching (context retrieval, session state, conversation history — addressed by caching). The second component is where most AI infrastructure teams have room to improve immediately.
In a typical LLM serving stack, external data fetches happen at multiple points in the request lifecycle:
- Context retrieval for RAG: Fetching relevant document chunks from a vector database. Round-trip: 1–5ms via Pinecone/Weaviate, 0.5–2ms via Redis, 1.5µs via in-process Cachee.
- Conversation history: Loading prior messages from a session store (Redis, DynamoDB). Round-trip: 0.5–3ms network, 1.5µs in-process.
- User preferences and guardrails: Loading user-specific system prompts, content filters, and persona configurations. Round-trip: 0.5–2ms from Redis, 1.5µs in-process.
- KV cache offloading: When GPU memory is full, KV cache entries are offloaded to host CPU memory or NVMe, then reloaded. Round-trip: 50–200µs for CPU memory, 0.5–2ms for NVMe.
Each of these data fetches stalls the inference pipeline. The GPU cannot begin token generation until context is assembled. It cannot continue multi-turn conversations without loading history. Every millisecond of data fetch latency directly reduces GPU utilization.
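The standard fix for these stalls is a read-through lookup that touches the network store only on a miss. Cachee's client API is not reproduced here, so the sketch below stands in with a plain in-process dict over a generic Redis fallback; the names and keys are illustrative.

```python
import json
import redis  # stand-in for whatever network store currently holds session state

l1_cache: dict[str, list] = {}                   # in-process L1 (illustrative stand-in)
l2 = redis.Redis(host="localhost", port=6379)    # ~0.5-3 ms per round trip

def get_conversation_history(session_id: str) -> list:
    """Read-through lookup: in-process hit on the hot path, network only on a miss."""
    history = l1_cache.get(session_id)
    if history is not None:
        return history                           # microseconds, no TCP, no serialization
    raw = l2.get(f"history:{session_id}")        # network fetch only when L1 misses
    history = json.loads(raw) if raw else []
    l1_cache[session_id] = history               # next turn of this conversation stays local
    return history
```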
L1 Caching at 1.5 Microseconds
Cachee’s L1 in-process cache stores frequently accessed data — conversation history, user configurations, RAG context chunks, KV cache segments — in the application’s own memory space. Access time: 1.5 microseconds. No TCP connection, no serialization, no network stack. The data already lives in the same address space as the inference server, often warm in the CPU caches.
The throughput improvement is multiplicative. By eliminating external data fetch stalls, the GPU receives input data faster, processes more requests per second, and maintains higher sustained utilization. In practice, this translates to 2–4x more inferences per GPU depending on the ratio of data-fetch time to compute time in your specific workload. A workload where 60% of time was spent on data fetches and 40% on compute can shift to 15% data access and 85% compute — more than doubling effective throughput without adding a single GPU.
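That arithmetic is worth making explicit. Treating compute time per request as fixed and shrinking only the fetch share, the speedup works out as follows; the fractions are the ones used in the example above.

```python
# Throughput gain from shrinking the data-fetch share of request time.
# Assumes compute time per request is unchanged; only the fetch share shrinks.
def speedup(fetch_before: float, fetch_after: float) -> float:
    compute = 1.0 - fetch_before                  # compute share of the old request time
    new_total = compute / (1.0 - fetch_after)     # new request time, as a fraction of the old
    return 1.0 / new_total

print(f"{speedup(0.60, 0.15):.2f}x")  # ~2.1x for the 60% -> 15% example above
print(f"{speedup(0.60, 0.05):.2f}x")  # ~2.4x if data fetches nearly disappear
```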
The Compounding Throughput Effect
GPU utilization improvements compound across the whole infrastructure bill. When you double the effective throughput per GPU, you do not just reclaim idle compute — you halve the size of the cluster needed to serve your traffic. A 64-GPU cluster operating at 35% utilization serves the same requests as a 32-GPU cluster at 70% utilization. That is 32 fewer H100s to provision, cool, network, and manage. At $107K/GPU/year, that is $3.4 million in annual infrastructure savings from a caching layer that costs a fraction of a single GPU.
The companies operating at the frontier of LLM serving — OpenAI serving ChatGPT to 200M+ users, Anthropic scaling Claude across enterprise deployments, Cohere powering enterprise search and RAG, AI21 Labs serving Jamba models — all face the same physics. GPU compute is expensive. Data access latency is the controllable variable. Moving the data access layer from milliseconds to microseconds is the highest-leverage optimization available in LLM serving infrastructure today. See the Cachee vector search benchmarks for the full performance data.
Related Reading
- AI Infrastructure Solutions
- In-Process Vector Search
- Cachee Pricing
- How Cachee Works
The Numbers That Matter
Cache performance discussions get philosophical fast. Here are the actual measured numbers from production deployments running on documented hardware, so you can compare against your own infrastructure instead of trusting marketing copy.
- L0 hot path GET: 28.9 nanoseconds on Apple M4 Max, single-threaded against pre-warmed in-memory cache. This is the floor — there's no faster way to read a key.
- L1 CacheeLFU GET: ~89 nanoseconds on AWS Graviton4 (c8g.metal-48xl). Sharded DashMap with admission filtering.
- Sustained throughput: 32 million ops/sec single-threaded on M4 Max, 7.41 million ops/sec at 16 workers on Graviton4 c8g.16xlarge.
- L2 fallback: Sub-millisecond hits against ElastiCache Redis 7.4 over same-AZ network when L1 misses cascade through.
The compounding effect matters more than any single number. A 28-nanosecond L0 hit means your application spends almost zero time on cache lookups in the hot path, leaving the CPU free for the actual business logic that generates revenue.
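One way to see that compounding is to fold the per-tier latencies into a single expected lookup cost. The latencies below are the measured figures from the list above; the hit-rate split across tiers is an assumption for illustration.

```python
# Expected per-lookup cost across tiers: measured latencies from the list above,
# hit-rate split across tiers assumed for illustration only.
tiers = [
    ("L0 hot path",  28.9e-9, 0.70),   # 28.9 ns, assumed 70% of lookups
    ("L1 CacheeLFU", 89.0e-9, 0.25),   # ~89 ns, assumed 25%
    ("L2 Redis",     500e-6,  0.05),   # ~0.5 ms same-AZ, assumed 5%
]
expected = sum(latency * share for _, latency, share in tiers)
print(f"Expected lookup cost: {expected * 1e6:.1f} µs")  # ~25 µs, dominated by L2 misses
```

Even a small L2 miss rate dominates the average, which is why keeping the hot set resident in L0/L1 matters more than shaving nanoseconds off the fast path.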
When Caching Actually Helps
Caching isn't free. It introduces a consistency problem you didn't have before. Before adding any cache layer, the question to answer is whether your workload actually benefits from caching at all.
Caching helps when three conditions hold simultaneously. First, your reads dramatically outnumber your writes — typically a 10:1 ratio or higher. Second, the same keys get read repeatedly within a window where a cached value remains valid. Third, the cost of computing or fetching the underlying value is meaningfully higher than the cost of a cache lookup. Database queries that hit secondary indexes, RPC calls to slow upstream services, expensive computed aggregations, and rendered template fragments all qualify.
Caching hurts when those conditions don't hold. Write-heavy workloads suffer because every write invalidates a cache entry, multiplying your work. Workloads with poor key locality suffer because the cache wastes memory storing entries that never get reused. Workloads where the underlying fetch is already fast — well-indexed primary key lookups against a properly tuned database, for example — gain almost nothing from caching and inherit the consistency complexity for no reason.
The honest first step before any cache deployment is measuring your actual read/write ratio, key access distribution, and underlying fetch latency. If your read/write ratio is below 5:1 or your underlying database is already returning results in single-digit milliseconds, the engineering time is better spent elsewhere.
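One way to get those numbers is to sample your access log and compute the read/write ratio and key reuse directly before any cache is deployed. A minimal sketch, assuming a log of (operation, key) tuples; the thresholds are judgment calls, not Cachee defaults.

```python
from collections import Counter

def worth_caching(ops: list[tuple[str, str]], min_ratio: float = 10.0) -> bool:
    """ops is a sampled access log of ("read" | "write", key) tuples."""
    reads  = [key for op, key in ops if op == "read"]
    writes = [key for op, key in ops if op == "write"]
    ratio = len(reads) / max(len(writes), 1)

    # Key locality: share of reads that land on the top 10% most-read keys.
    counts = Counter(reads)
    top = sorted(counts.values(), reverse=True)[: max(len(counts) // 10, 1)]
    hot_share = sum(top) / max(len(reads), 1)

    print(f"read/write ratio: {ratio:.1f}, hot-key share of reads: {hot_share:.0%}")
    return ratio >= min_ratio and hot_share >= 0.5
```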
Memory Efficiency Is The Hidden Cost Lever
Throughput numbers get the headlines but memory efficiency determines your monthly bill. A cache that stores the same hot data in less RAM lets you run a smaller instance class — and on AWS that's the difference between profitable and breakeven for a lot of services.
Redis stores each key as a Simple Dynamic String with 16 bytes of header overhead, plus dictEntry pointers in the main hashtable, plus embedded TTL metadata. For 1KB values, the per-entry footprint lands around 1,100–1,200 bytes once you account for hashtable load factor and slab fragmentation. At a million keys, that's roughly 1.2 GB of resident memory to hold about 1 GB of actual values.
Cachee's L1 layer uses sharded DashMap entries with compact packing — a 64-bit key hash, value bytes, an 8-byte expiry timestamp, and a small frequency counter for the CacheeLFU admission filter. Per-entry overhead lands at roughly 40 bytes of structural data on top of the value itself. For the same million-key workload, that's about 13% smaller resident memory. On AWS ElastiCache pricing, that gap is the difference between needing a cache.r7g.large versus a cache.r7g.xlarge for borderline workloads.
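Using the per-entry figures above, the footprint gap for a million 1KB entries works out roughly as follows; the reduction lands around 11–13% depending on where in the quoted Redis range you assume.

```python
# Resident memory for 1M entries with 1KB values, using the per-entry figures above.
entries     = 1_000_000
value_bytes = 1_024

redis_per_entry  = 1_200              # upper end of the ~1,100-1,200 byte footprint quoted above
cachee_per_entry = value_bytes + 40   # value plus ~40 bytes of structural overhead

redis_total  = entries * redis_per_entry
cachee_total = entries * cachee_per_entry
print(f"Redis:  {redis_total / 1e9:.2f} GB")                # 1.20 GB
print(f"Cachee: {cachee_total / 1e9:.2f} GB")               # 1.06 GB
print(f"Reduction: {1 - cachee_total / redis_total:.0%}")   # ~11% at this end of the range
```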
What This Actually Costs
Concrete pricing math beats hypotheticals. A typical SaaS workload with 1 billion cache operations per month, average 800-byte values, and a 5 GB hot working set commonly runs on an AWS ElastiCache cache.r7g.xlarge primary plus a read replica — roughly $480 per month for the two nodes, plus cross-AZ data transfer charges that quietly add another $50-150 per month depending on access patterns.
Migrating the hot path to an in-process L0/L1 cache and keeping ElastiCache as a cold L2 fallback drops the dedicated cache spend to $120-180 per month. For workloads where the hot working set fits inside the application's existing memory budget, you can eliminate the dedicated cache tier entirely. The cache becomes a library you link into your binary instead of a separate service to operate.
Annualized, that's $3,600 to $4,500 per year on a single small workload. Multiply across a fleet of services and the savings start showing up in finance team conversations. The bigger savings usually come from eliminating cross-AZ data transfer charges, which Redis-as-a-service architectures incur on every read that crosses an availability zone.
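Putting the monthly figures together: the dollar amounts below are the rough estimates quoted above, not a live quote from AWS pricing pages.

```python
# Rough monthly cache spend before vs. after moving the hot path in-process.
# Dollar figures are the approximate estimates quoted above, not live AWS pricing.
nodes_before = 480                # cache.r7g.xlarge primary + read replica
l2_after     = (120, 180)         # smaller ElastiCache footprint kept as a cold L2
cross_az     = (50, 150)          # transfer charges avoided once hot reads stay in-process

node_savings = (nodes_before - l2_after[1], nodes_before - l2_after[0])
print(f"Node savings per year: ${node_savings[0] * 12:,}-{node_savings[1] * 12:,}")       # $3,600-4,320
print(f"Cross-AZ transfer avoided per year: ${cross_az[0] * 12:,}-{cross_az[1] * 12:,}")  # $600-1,800
```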
Stop Burning GPU-Hours on Data Fetches.
L1 caching at 1.5 microseconds eliminates the data access bottleneck in LLM serving. Get 2–4x more inferences per GPU.
Start Free Trial | Schedule Demo