An NVIDIA H100 rents for anywhere from roughly $2 to over $12 per hour, depending on cloud provider and commitment. At typical LLM inference workloads, GPU compute utilization sits between 30% and 40%. The GPU is not saturated by model computation; it is waiting. Waiting for KV cache lookups. Waiting for context retrieval. Waiting for token history to arrive from external stores. The data access layer is the bottleneck, and every millisecond the GPU stalls is money evaporating at tens of thousands of dollars per GPU per year. OpenAI, Anthropic, Cohere, and AI21 Labs all face this problem at enormous scale. The fix is not more GPUs. It is faster data.
Where the GPU Time Actually Goes
LLM inference has two phases: prefill (processing the input prompt in parallel) and decode (generating output tokens one at a time). During prefill, the GPU is compute-bound — matrix multiplications saturate the tensor cores. Utilization can spike above 80%. But decode is fundamentally different. Each token generation requires reading the KV cache for all previous tokens, performing attention computation, and writing the new KV entry. For a 32K context window, the KV cache alone can be 2–4GB per request. The GPU spends more time moving data than computing on it.
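To make the 2–4GB figure concrete, here is a back-of-envelope KV cache sizing sketch. The model configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16) is an illustrative assumption in the ballpark of an 8B-parameter model, not a figure from any specific deployment:

```python
# Back-of-envelope KV cache sizing. The default config below assumes a
# Llama-3-8B-style model (32 layers, 8 KV heads via GQA, head dim 128,
# fp16 = 2 bytes per value); adjust for your own model.

def kv_cache_bytes(seq_len: int,
                   n_layers: int = 32,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """KV cache size for one request: 2 tensors (K and V) per layer,
    each n_kv_heads * head_dim values per token."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

per_token = kv_cache_bytes(1)          # 128 KiB per token
full_ctx = kv_cache_bytes(32 * 1024)   # 4 GiB at a full 32K context
print(f"{per_token / 1024:.0f} KiB/token, {full_ctx / 2**30:.1f} GiB at 32K")
```

Larger models with more layers or KV heads land proportionally higher, which is why long-context serving pushes the cache off the GPU in the first place.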
This is the memory-bandwidth bottleneck. An H100 has 3.35 TB/s of HBM3 bandwidth, but the KV cache access pattern is irregular and difficult to prefetch. When you add external data dependencies — retrieving context documents, loading user session state, fetching conversation history from Redis or DynamoDB — the problem compounds. Each external fetch introduces milliseconds of stall time during which the GPU sits idle, burning through your cloud budget while producing nothing.
The Math on Wasted GPU-Hours
On AWS, a p5.48xlarge instance (8 H100s) costs approximately $98/hr on demand, or about $12.25/hr per GPU. Running 24/7, that is $107,310 per GPU per year. At 35% utilization, you are paying for 8,760 GPU-hours but getting only 3,066 hours of actual compute. The remaining 5,694 hours, roughly $69,750 per GPU per year, are spent waiting on data.
Scale this to a production LLM serving cluster. A mid-size deployment runs 32–64 GPUs. At 64 GPUs and 35% utilization, the annual waste is $4.46 million in idle GPU-hours. For companies operating at the scale of OpenAI (reportedly 25,000+ H100s), Anthropic (reportedly 14,000+), or Cohere and AI21 Labs (thousands each), the waste runs into hundreds of millions of dollars annually. Even a ten-point gain in utilization translates to tens of millions saved.
| Cluster Size | Annual GPU Cost | Wasted at 35% Util | Saved at 60% vs 35% Util |
|---|---|---|---|
| 8 GPUs | $858K | $558K | $215K |
| 32 GPUs | $3.43M | $2.23M | $858K |
| 64 GPUs | $6.87M | $4.46M | $1.72M |
| 256 GPUs | $27.5M | $17.8M | $6.87M |
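The table follows from simple arithmetic on the per-GPU rate. A minimal sketch, assuming the ~$12.25/GPU-hr AWS on-demand figure above:

```python
# Reproduces the cost table from the per-GPU rate. Assumes the AWS
# p5.48xlarge on-demand price of ~$12.25/GPU-hr cited in the text.
HOURLY_RATE = 12.25
HOURS_PER_YEAR = 24 * 365  # 8,760

def annual_cost(gpus: int) -> float:
    return gpus * HOURLY_RATE * HOURS_PER_YEAR

def wasted(gpus: int, utilization: float) -> float:
    """Dollars spent on GPU-hours that do no useful compute."""
    return annual_cost(gpus) * (1 - utilization)

def saved(gpus: int, util_before: float, util_after: float) -> float:
    """Idle spend recovered by raising utilization."""
    return wasted(gpus, util_before) - wasted(gpus, util_after)

for n in (8, 32, 64, 256):
    print(f"{n:>3} GPUs: cost ${annual_cost(n)/1e6:.2f}M, "
          f"wasted ${wasted(n, 0.35)/1e6:.2f}M, "
          f"saved ${saved(n, 0.35, 0.60)/1e6:.2f}M")
```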
Why External Data Stores Are the Problem
The GPU data access bottleneck has two components: on-GPU memory bandwidth (the KV cache problem, addressed by hardware and model architecture) and off-GPU data fetching (context retrieval, session state, conversation history — addressed by caching). The second component is where most AI infrastructure teams have room to improve immediately.
In a typical LLM serving stack, external data fetches happen at multiple points in the request lifecycle:
- Context retrieval for RAG: Fetching relevant document chunks from a vector database. Round-trip: 1–5ms via Pinecone/Weaviate, 0.5–2ms via Redis, 1.5µs via in-process Cachee.
- Conversation history: Loading prior messages from a session store (Redis, DynamoDB). Round-trip: 0.5–3ms network, 1.5µs in-process.
- User preferences and guardrails: Loading user-specific system prompts, content filters, and persona configurations. Round-trip: 0.5–2ms from Redis, 1.5µs in-process.
- KV cache offloading: When GPU memory is full, KV cache entries are offloaded to host CPU memory or NVMe, then reloaded. Round-trip: 50–200µs for CPU memory, 0.5–2ms for NVMe.
Each of these data fetches stalls the inference pipeline. The GPU cannot begin token generation until context is assembled. It cannot continue multi-turn conversations without loading history. Every millisecond of data fetch latency directly reduces GPU utilization.
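A toy model makes the stall cost visible. The numbers below (20ms of decode compute per request, three external fetches) are illustrative assumptions, not measurements, and the model assumes fetches serialize with compute rather than overlapping:

```python
# Toy model of how per-request data-fetch latency caps GPU utilization.
# The compute time and fetch counts are illustrative assumptions; real
# serving stacks also lose time to batching and scheduling.

def utilization(compute_ms: float, fetches: int, fetch_ms: float) -> float:
    """Fraction of wall-clock time spent computing, assuming fetches
    serialize with compute (no overlap or prefetch)."""
    return compute_ms / (compute_ms + fetches * fetch_ms)

network = utilization(compute_ms=20.0, fetches=3, fetch_ms=2.0)     # 2 ms RTT
in_proc = utilization(compute_ms=20.0, fetches=3, fetch_ms=0.0015)  # 1.5 µs hit
print(f"network: {network:.0%}, in-process: {in_proc:.2%}")
```

Even this optimistic model loses nearly a quarter of the GPU's time to three 2ms round trips; production pipelines with more fetches per request fare worse.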
L1 Caching at 1.5 Microseconds
Cachee’s L1 in-process cache stores frequently accessed data — conversation history, user configurations, RAG context chunks, KV cache segments — in the application’s own memory space. Access time: 1.5 microseconds. No TCP connection, no serialization, no network stack. The data is already in CPU cache lines adjacent to the inference process.
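The size of that gap is easy to demonstrate with a generic micro-benchmark: an in-process dictionary lookup versus a simulated network round trip. This illustrates in-process versus networked access in general rather than benchmarking Cachee itself, and the 0.5ms sleep is a stand-in assumption for a fast Redis RTT:

```python
# In-process lookup vs. simulated network round trip. The sleep is a
# stand-in for network latency; absolute numbers vary by machine.
import time

cache = {f"session:{i}": {"history": ["..."]} for i in range(10_000)}

N = 100_000
t0 = time.perf_counter()
for i in range(N):
    _ = cache[f"session:{i % 10_000}"]  # in-process hit: no syscall, no I/O
in_proc_us = (time.perf_counter() - t0) / N * 1e6

t0 = time.perf_counter()
for _ in range(100):
    time.sleep(0.0005)  # simulated 0.5 ms network round trip
net_us = (time.perf_counter() - t0) / 100 * 1e6

print(f"in-process: ~{in_proc_us:.1f} us/lookup, network: ~{net_us:.0f} us/lookup")
```

On typical hardware the in-process path lands well under a microsecond, a gap of three orders of magnitude before serialization costs are even counted.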
The throughput impact is direct. Eliminating external data fetch stalls means the GPU receives input data faster, processes more requests per second, and sustains higher utilization. In practice this translates to roughly 2–4x more inferences per GPU, depending on the ratio of data-fetch time to compute time in your specific workload. A request path where 60% of wall-clock time went to data fetches and 40% to compute can shift to roughly 15% data access and 85% compute, slightly more than doubling effective throughput without adding a single GPU.
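That shift can be sketched as a two-component time budget, holding per-request compute time fixed while the data-access share shrinks:

```python
# Throughput gain from shrinking the data-access share of request time.
# Compute time per request is held fixed; only the fetch share changes.

def speedup(data_frac_before: float, data_frac_after: float) -> float:
    """Throughput multiplier when the data-access fraction of request
    time drops from data_frac_before to data_frac_after."""
    compute = 1.0 - data_frac_before           # compute share of old total
    new_total = compute / (1.0 - data_frac_after)
    return 1.0 / new_total

print(f"{speedup(0.60, 0.15):.2f}x")  # 60/40 split -> 15/85 split: 2.12x
```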
The Compounding Throughput Effect
GPU utilization improvements pay off twice. When you double the effective throughput per GPU, you do not just get more out of the GPUs you have; you also shrink the cluster needed to serve your traffic. A 64-GPU cluster at 35% utilization delivers the same effective capacity as a 32-GPU cluster at 70% (64 × 0.35 = 32 × 0.70). That is 32 fewer H100s to provision, cool, network, and manage. At roughly $107K/GPU/year, that is $3.4 million in annual infrastructure savings from a caching layer that costs a fraction of a single GPU.
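The cluster-size equivalence is just effective-capacity arithmetic (GPUs × utilization), using the ~$107K/GPU/year rate derived above:

```python
# Cluster-size equivalence: effective capacity = GPUs x utilization.
# Rate assumption: ~$107,310/GPU/year from the AWS on-demand figures.

ANNUAL_PER_GPU = 12.25 * 8760  # $107,310

def effective_gpus(gpus: int, utilization: float) -> float:
    """GPU-equivalents of useful compute a cluster actually delivers."""
    return gpus * utilization

before = effective_gpus(64, 0.35)    # 22.4 GPU-equivalents
after = effective_gpus(32, 0.70)     # 22.4 GPU-equivalents
assert abs(before - after) < 1e-9    # same serving capacity

savings = 32 * ANNUAL_PER_GPU        # 32 fewer H100s to run
print(f"${savings / 1e6:.2f}M/year")
```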
The companies operating at the frontier of LLM serving — OpenAI serving ChatGPT to 200M+ users, Anthropic scaling Claude across enterprise deployments, Cohere powering enterprise search and RAG, AI21 Labs serving Jamba models — all face the same physics. GPU compute is expensive. Data access latency is the controllable variable. Moving the data access layer from milliseconds to microseconds is the highest-leverage optimization available in LLM serving infrastructure today. See the Cachee vector search benchmarks for the full performance data.
Related Reading
- AI Infrastructure Solutions
- In-Process Vector Search
- Cachee Pricing
- Start Free Trial
- How Cachee Works
Stop Burning GPU-Hours on Data Fetches.
L1 caching at 1.5 microseconds eliminates the data access bottleneck in LLM serving. Get 2–4x more inferences per GPU.
Start Free Trial · Schedule Demo