
ML Inference Caching: Eliminate Redundant GPU Compute

April 26, 2026 | 14 min read | Engineering

Machine learning inference is the most expensive computation most companies run. A single GPU instance costs $2 to $8 per hour. A production inference fleet serving real-time predictions might run 10, 50, or 200 GPU instances around the clock. The monthly bill for inference compute alone can range from $15,000 for a modest deployment to $1.2 million for a large-scale recommendation or fraud detection system.

Here is the part that should make your infrastructure team uncomfortable: approximately 40% of those inference calls are redundant. The same input arrives. The same model processes it. The same output is produced. The GPU burns the same cycles, consumes the same power, and occupies the same memory to compute a result that was already computed minutes, hours, or days ago. The result was stored nowhere. Nobody cached it. So the GPU computes it again.

A cached inference result returns in 31 nanoseconds from local memory. Not 10 milliseconds from a GPU. Not 50 milliseconds from a model serving endpoint. 31 nanoseconds. The difference between caching and not caching inference results is the difference between a GPU fleet that runs at 60% utilization doing useful work and one that runs at 100% utilization doing 40% redundant work.

40% -- Redundant Inference Calls
31 ns -- Cached Result Latency
$2-8/hr -- GPU Instance Cost

This post walks through five ML workloads where inference caching delivers immediate savings, explains the computation fingerprint approach that makes caching safe across model updates, addresses the "but my model changes" objection, and provides cost savings math at three scales. If you run ML inference in production and you are not caching results, you are lighting money on fire.

Why ML Inference Is Cacheable

ML inference, for the vast majority of production workloads, is a deterministic function. Given the same input tensor and the same model weights, the output is identical. There are exceptions -- models with dropout at inference time, stochastic sampling in generative models -- but these represent a minority of production inference workloads. Classification, regression, embedding generation, recommendation scoring, and fraud detection are all deterministic given the same input and model version.

This determinism is what makes caching viable. If you can identify that the same input has been processed by the same model before, you can return the cached result without invoking the model. The GPU never fires. The memory is never allocated. The result appears in 31 nanoseconds from an in-process cache instead of 10-100 milliseconds from a model serving endpoint.

The challenge is identifying "same input, same model" reliably. This is where the computation fingerprint comes in.

The Computation Fingerprint

A computation fingerprint for ML inference is a cryptographic hash that uniquely identifies the exact inference computation. It is constructed from four components:

fingerprint = SHA3-256(
    input_bytes       ||  // Serialized input tensor or feature vector
    model_version     ||  // Model checkpoint hash or version identifier
    parameters        ||  // Inference parameters (temperature, top_k, etc.)
    domain_separator      // Prevents cross-model cache collisions
)

Input bytes are the serialized representation of whatever goes into the model. For an embedding model, this is the tokenized text. For a classification model, this is the feature vector. For a recommendation model, this is the user-item pair encoding. The key requirement is that the serialization is canonical: the same logical input always produces the same bytes.
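
A minimal illustration of canonicality, assuming the input is a flat float feature vector (the field layout is hypothetical): fix the element type and byte order once, and the same logical input always serializes to the same bytes.

```python
import struct

def canonical_feature_bytes(features: list[float]) -> bytes:
    """Serialize a feature vector canonically: fixed element type
    (float32) and fixed byte order (big-endian), so the same logical
    input always yields the same bytes regardless of platform."""
    return struct.pack(f">{len(features)}f", *features)

# Two independently built but logically identical vectors serialize
# to identical bytes -- the precondition for fingerprint-based caching.
v1 = canonical_feature_bytes([0.5, 1.25, -3.0])
v2 = canonical_feature_bytes([1.0 / 2, 5.0 / 4, -3.0])
assert v1 == v2 and len(v1) == 12
```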

Model version is a hash of the model weights or a version identifier that changes whenever the model is updated. This is what makes caching safe across model updates: when you deploy a new model version, the fingerprint changes for every input, and all cached results are naturally invalidated. The cache does not serve stale predictions from an old model.

Parameters capture any inference-time configuration that affects the output. For most deterministic models (classifiers, embedding models, regressors), there are no such parameters. For generative models, this might include temperature, top-k, or other sampling parameters. Including them in the fingerprint ensures that different inference configurations produce different cache keys.

Domain separator prevents cross-model collisions. Without it, two different models that happen to receive the same input bytes could return each other's cached results. The domain separator is typically the model name or a unique model identifier.
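
The four-component construction above can be sketched with Python's standard hashlib. The length-prefixing of each field is an assumption of this sketch, added to keep field boundaries unambiguous; it is not a documented Cachee detail.

```python
import hashlib
import struct

def inference_fingerprint(input_bytes: bytes, model_version: str,
                          parameters: bytes, domain_separator: str) -> str:
    """Hash the four fingerprint components with SHA3-256.

    Each field is length-prefixed so boundaries are unambiguous
    (e.g. "ab" + "c" can never collide with "a" + "bc")."""
    h = hashlib.sha3_256()
    for field in (input_bytes, model_version.encode(),
                  parameters, domain_separator.encode()):
        h.update(struct.pack(">Q", len(field)))  # 8-byte length prefix
        h.update(field)
    return h.hexdigest()

# Same input + same model version -> same fingerprint (cache hit).
a = inference_fingerprint(b"\x01\x02", "model-v1.3.2", b"", "embedder")
b = inference_fingerprint(b"\x01\x02", "model-v1.3.2", b"", "embedder")
# New model version -> every fingerprint changes (natural invalidation).
c = inference_fingerprint(b"\x01\x02", "model-v2.0.0", b"", "embedder")
assert a == b and a != c
```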

Five Cacheable ML Workloads

Not all ML workloads benefit equally from caching. The value depends on three factors: the cost of recomputation, the hit rate in practice, and the size of the cached result. Here are five workloads where inference caching delivers substantial savings, ordered by typical impact.

1. Embedding Lookups

Embedding generation is the most common ML inference workload in production. Every search query, every product listing, every document, every user profile gets converted into a dense vector representation. These embeddings power search, recommendations, clustering, and similarity computations across virtually every consumer-facing application.

A typical embedding model (BERT-base, 768 dimensions) takes 5-15 milliseconds per inference on a GPU, or 20-50 milliseconds on CPU. The output is a 3,072-byte vector (768 float32 values). This is small enough to cache millions of entries in memory. A 4 GB cache holds approximately 1.3 million embeddings.

GPU cost of recomputation: At 10ms per embedding on an A10G GPU ($1.21/hr on AWS), each embedding costs approximately $0.0000034. That sounds negligible until you compute 100 million embeddings per day: $340/day, $10,200/month.
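
The arithmetic behind that per-embedding figure is a single proration; a quick sanity check, using only the numbers above:

```python
def cost_per_inference(gpu_hourly_usd: float, latency_ms: float) -> float:
    """Dollar cost of one inference: hourly GPU rate prorated by
    per-inference latency (3,600,000 ms in an hour)."""
    return gpu_hourly_usd * latency_ms / 3_600_000

# A10G at $1.21/hr, 10 ms per embedding.
per_embedding = cost_per_inference(1.21, 10)   # ~$0.0000034 each
daily = per_embedding * 100_000_000            # 100M embeddings/day
assert abs(per_embedding - 3.36e-6) < 1e-7
assert 330 < daily < 340                       # ~$340/day, ~$10,200/mo
```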

Cache hit rate in practice: 60-85%, depending on the application. Search engines see the same queries repeatedly (the top 1,000 queries account for 30-40% of traffic). E-commerce product embeddings are computed once per product and queried millions of times. User profile embeddings change only when the user's behavior changes, which is infrequent relative to query frequency.

Savings with 70% cache hit rate: 70% of 100 million daily embeddings are served from cache. 70 million GPU inferences eliminated per day. At $0.0000034 per inference, that is $238/day, $7,140/month in GPU compute savings. Plus the latency improvement: 70% of requests return in 31 nanoseconds instead of 10 milliseconds.

2. Classification Results

Classification models assign labels to inputs. Content moderation classifiers check whether text or images violate policies. Sentiment classifiers score customer reviews. Intent classifiers route support tickets. All of these are deterministic: the same input always gets the same label from the same model.

GPU cost of recomputation: A classification inference typically takes 5-20 milliseconds on GPU, depending on model size. A content moderation system processing 50 million items per day at 10ms per item needs approximately 139 GPU-hours per day. At $1.21/hr (A10G), that is $168/day, $5,040/month.

Cache hit rate in practice: 30-60%. Content moderation sees repeated content from reposts, copies, and near-duplicates. Sentiment classification over product reviews sees repeated or near-identical reviews. Intent classification sees the same customer questions phrased identically. The hit rate depends on the diversity of inputs. Applications with high redundancy (social media, e-commerce reviews) see higher rates than applications with unique inputs (medical imaging, scientific data).

Savings with 45% cache hit rate: 22.5 million cached results per day. 62.5 GPU-hours eliminated per day. $75/day, $2,268/month. Modest per-unit savings, but they compound across every classification model in your pipeline. Most production systems run 5-15 classification models, so the aggregate savings are 5-15x the per-model number.

3. Recommendation Scores

Recommendation systems compute scores for user-item pairs. A user visits a product page, and the system scores thousands of candidate items to generate a ranked recommendation list. The same user visiting the same page within a short window produces identical scores if the model and feature inputs have not changed.

GPU cost of recomputation: Scoring 1,000 candidate items for a single user takes 5-50 milliseconds on GPU, depending on model complexity. A system serving 10 million recommendation requests per day, each scoring 1,000 candidates, performs 10 billion candidate scorings per day. Even on efficient hardware, this represents substantial GPU allocation.

Cache hit rate in practice: 40-70%. Users frequently revisit the same pages within sessions. The recommendation context (user profile + page context) is identical across refreshes and repeat visits. Session-level caching of recommendation results with a 5-minute TTL captures the majority of redundant recomputations. Some systems see higher rates when users browse similar categories repeatedly.

Savings with 50% cache hit rate: 5 million recommendation requests served from cache per day. 50% reduction in GPU inference load for the recommendation service. For a fleet of 20 GPU instances dedicated to recommendations at $3/hr each, this eliminates 10 instances: $720/day, $21,600/month.

4. Fraud Scoring

Fraud detection models score transactions in real time. A transaction arrives with features like merchant category, amount, location, device fingerprint, and recent transaction history. The model outputs a fraud probability. Many transactions share identical feature patterns, especially for legitimate transactions from repeat customers at familiar merchants.

GPU cost of recomputation: Fraud scoring is typically fast (1-5 milliseconds per transaction) because latency requirements are strict. But the volume is massive. A payment processor handling 50,000 transactions per second performs 4.32 billion scorings per day. Even at 2 milliseconds per scoring, that requires roughly 100 GPU-seconds of compute every wall-clock second -- the equivalent of 100 GPUs running continuously, a substantial fleet.

Cache hit rate in practice: 35-55%. Repeat purchases at the same merchant for similar amounts (subscription payments, regular grocery shopping, recurring bills) produce identical feature vectors. The cache key is the feature vector fingerprint, not the transaction ID, so different transactions with identical features share the same cached score. This is correct behavior: if the features are identical, the fraud score is identical.
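
A sketch of that keying decision, with a hypothetical feature layout (merchant, amount, country, device hash -- none of these field names come from a real fraud system):

```python
import hashlib
import struct

def fraud_cache_key(merchant_id: int, amount_cents: int,
                    country: str, device_hash: bytes) -> str:
    """Key the fraud-score cache on the feature vector, not the
    transaction ID: identical features -> identical cached score."""
    h = hashlib.sha3_256()
    h.update(struct.pack(">QQ", merchant_id, amount_cents))
    h.update(country.encode())
    h.update(device_hash)
    return h.hexdigest()

# Two distinct transactions (different IDs, same features) -- e.g. a
# monthly subscription charge -- share one cache entry.
k1 = fraud_cache_key(42, 999, "US", b"\xaa" * 32)  # txn A
k2 = fraud_cache_key(42, 999, "US", b"\xaa" * 32)  # txn B
assert k1 == k2
```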

Savings with 40% cache hit rate: 1.73 billion cached scorings per day. At 2ms per scoring on GPU, that eliminates approximately 960 GPU-hours per day. At $1.21/hr, savings are $1,162/day, $34,860/month. For a high-volume payment processor, this is one of the highest-impact applications of inference caching.

5. Semantic Search Results

Semantic search systems convert queries into embeddings and perform nearest-neighbor search against a vector index. The expensive step is the query embedding generation (covered above) plus the vector similarity computation. When the same query arrives, both the embedding and the search results can be cached.

GPU cost of recomputation: Query embedding plus vector search typically takes 15-50 milliseconds end-to-end. A search system handling 10,000 queries per second processes 864 million queries per day. The query embedding step alone requires significant GPU allocation. Adding the vector search (which may also use GPU acceleration) increases the cost further.

Cache hit rate in practice: 50-80%. Search queries follow a heavy-tailed distribution. The top 10% of unique queries account for 60-70% of total query volume. "How to reset password," "order status," "return policy" -- these queries appear thousands of times per day and produce identical search results each time. Caching the final search results (not just the embedding) eliminates both the embedding generation and the vector search.

Savings with 65% cache hit rate: 561 million queries served from cache per day. At 30ms per query on GPU, that eliminates approximately 4,680 GPU-hours per day. At $1.21/hr, savings are $5,663/day, $169,880/month. Semantic search is the single highest-impact workload for inference caching due to the combination of high redundancy and expensive per-query computation.

Aggregate Savings Across Workloads

Most production ML systems run multiple models in a pipeline. A typical e-commerce application might run embedding generation, classification, recommendation scoring, and fraud detection on every user interaction. The redundancy compounds across the pipeline: if a user triggers the same feature vector through the same sequence of models, every model's output can be cached.

| Workload          | Monthly GPU Cost | Cache Hit Rate | Monthly Savings |
|-------------------|------------------|----------------|-----------------|
| Embedding lookups | $10,200          | 70%            | $7,140          |
| Classification    | $5,040           | 45%            | $2,268          |
| Recommendations   | $43,200          | 50%            | $21,600         |
| Fraud scoring     | $87,120          | 40%            | $34,860         |
| Semantic search   | $261,360         | 65%            | $169,880        |
| Total             | $406,920         | --             | $235,748        |

For a large-scale production system running all five workloads, inference caching saves approximately $235,748 per month, or $2.83 million per year. Even at a fraction of this scale -- a company running just embedding lookups and classification -- the savings are $9,408/month, $112,896/year. These numbers use conservative cache hit rates. Systems with highly repetitive traffic patterns (e-commerce, financial services, customer support) routinely see higher rates.

$235K -- Monthly Savings (Full Stack)
$2.83M -- Annual Savings (Full Stack)
58% -- Average GPU Cost Reduction

Addressing "But My Model Updates"

The most common objection to ML inference caching is that models change. Teams retrain models daily, weekly, or continuously. How do you prevent the cache from serving predictions from an old model?

The answer is in the computation fingerprint. The model version is part of the fingerprint. When you deploy a new model version, every fingerprint changes. The cache produces misses for every input until the new model's results populate the cache. This happens automatically, with no cache flush, no manual invalidation, and no coordination between the model deployment system and the cache.

# Model v1 deployed
fingerprint_v1 = SHA3-256(input + "model-v1.3.2" + params + domain)
# Cache: MISS -> compute -> store result

# Same input, same model
fingerprint_v1 = SHA3-256(input + "model-v1.3.2" + params + domain)
# Cache: HIT -> return cached result (31ns)

# Model v2 deployed
fingerprint_v2 = SHA3-256(input + "model-v2.0.0" + params + domain)
# Cache: MISS -> compute with new model -> store new result

# Old v1 results expire naturally via TTL or LFU eviction

The model version string should be a content-addressable identifier whenever possible: a hash of the model weights, a commit SHA from your model registry, or a version number that is guaranteed to change on every update. Using a timestamp or "latest" tag is insufficient because it does not uniquely identify the model weights.
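
A content-addressable version identifier can be derived directly from the checkpoint bytes; a stdlib-only sketch (streamed chunked hashing is an implementation choice here, not a Cachee requirement):

```python
import hashlib
import io

def model_version_hash(weights) -> str:
    """Content-addressable model version: SHA3-256 over the weight
    bytes, streamed in 1 MiB chunks so a large checkpoint never has
    to be loaded fully into memory. Any retrain changes this hash,
    which changes every fingerprint that embeds it and so invalidates
    the old cache entries automatically."""
    h = hashlib.sha3_256()
    for chunk in iter(lambda: weights.read(1 << 20), b""):
        h.update(chunk)
    return h.hexdigest()

# In production this would read the checkpoint file; BytesIO stands in.
v1 = model_version_hash(io.BytesIO(b"weights: epoch 40"))
v1_again = model_version_hash(io.BytesIO(b"weights: epoch 40"))
v2 = model_version_hash(io.BytesIO(b"weights: epoch 41"))
assert v1 == v1_again and v1 != v2
```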

For models that update continuously (online learning, streaming updates), the model version changes frequently, which reduces cache hit rates. This is correct behavior: if the model is genuinely different, the predictions should be recomputed. In practice, even continuously updated models produce stable predictions for the majority of inputs. The top-1 prediction for most inputs does not change across incremental model updates. The cache hit rate is lower than for batch-updated models, but it is still positive and still saves GPU compute.

Implementation Architecture

ML inference caching sits between your application and your model serving infrastructure. It intercepts inference requests, computes the fingerprint, checks the cache, and either returns the cached result or forwards the request to the model server.

Application Request
    |
    v
[Fingerprint Computation] -- SHA3-256(input + model_version + params)
    |
    v
[Cache Lookup] -- 31ns on hit
    |
    +-- HIT --> Return cached result (31ns total)
    |
    +-- MISS --> Forward to model server
                    |
                    v
                [GPU Inference] -- 10-50ms
                    |
                    v
                [Cache Store] -- result + fingerprint
                    |
                    v
                Return result to application

The cache layer adds negligible latency on a miss (the fingerprint computation takes approximately 60 nanoseconds, the cache miss lookup takes approximately 25 nanoseconds, total overhead of 85 nanoseconds on a cold path). On a hit, the total latency is 31 nanoseconds versus 10-50 milliseconds for GPU inference. The hit path is 300,000x to 1,600,000x faster than the miss path.
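
Those ratios fall straight out of the stated latencies; a quick check, pure arithmetic with no assumptions beyond the numbers above:

```python
hit_ns = 31                       # cached-result latency
miss_low_ms, miss_high_ms = 10, 50  # GPU inference latency range
overhead_ns = 60 + 25             # fingerprint + miss lookup

speedup_low = miss_low_ms * 1_000_000 / hit_ns    # ~322,580x
speedup_high = miss_high_ms * 1_000_000 / hit_ns  # ~1,612,903x
assert 300_000 < speedup_low < 330_000
assert 1_600_000 < speedup_high < 1_620_000
assert overhead_ns == 85          # cold-path overhead in nanoseconds
```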

Memory Sizing

The memory required for an inference cache depends on the size of the cached results and the number of unique inputs you want to cache.

| Workload                       | Result Size | 1M Entries | 10M Entries |
|--------------------------------|-------------|------------|-------------|
| Embeddings (768-dim)           | 3,072 bytes | 2.9 GB     | 29 GB       |
| Classification (label + score) | 64 bytes    | 61 MB      | 610 MB      |
| Recommendation (top-50 scores) | 400 bytes   | 381 MB     | 3.8 GB      |
| Fraud score (probability)      | 8 bytes     | 7.6 MB     | 76 MB       |
| Search results (top-10 IDs)    | 80 bytes    | 76 MB      | 762 MB      |

Classification results and fraud scores are tiny. You can cache 10 million entries in under 1 GB. Embeddings are larger but still manageable: 1 million embeddings fit in 3 GB of memory, which is a small fraction of a typical application server's capacity. Cachee's LFU admission policy ensures that only frequently accessed results remain in cache, so the memory budget is spent on the entries that produce the most cache hits.
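
The table rows reduce to one multiplication; a sketch that reproduces two of them (ignoring key and allocator overhead, which the table also appears to ignore):

```python
def cache_memory_bytes(result_size_bytes: int, entries: int) -> int:
    """Raw payload memory for an inference cache. Keys, pointers, and
    allocator overhead would add to this -- treated as zero here."""
    return result_size_bytes * entries

GIB = 1 << 30
MIB = 1 << 20

# 1M BERT-base embeddings: 768 float32 values = 3,072 bytes each.
emb_gib = cache_memory_bytes(768 * 4, 1_000_000) / GIB     # ~2.9 GB
# 10M fraud scores: one float64 probability each.
fraud_mib = cache_memory_bytes(8, 10_000_000) / MIB        # ~76 MB
assert 2.8 < emb_gib < 2.9
assert 76 < fraud_mib < 77
```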

Cost Savings at Three Scales

The following table shows the impact of inference caching at three scales, assuming a blended cache hit rate of 50% across all workloads and an average GPU cost of $1.50/hr per instance.

| Scale                  | GPU Instances | Monthly GPU Cost | Savings (50% hit rate) | Annual Savings |
|------------------------|---------------|------------------|------------------------|----------------|
| Startup (2 models)     | 4             | $4,320           | $2,160/mo              | $25,920        |
| Mid-size (5 models)    | 25            | $27,000          | $13,500/mo             | $162,000       |
| Enterprise (15 models) | 200           | $216,000         | $108,000/mo            | $1,296,000     |
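
The table's math reduces to a one-line model (720 hours approximates a 30-day month, which matches the table's figures):

```python
HOURS_PER_MONTH = 720  # 30-day month, as the table assumes

def monthly_savings(instances: int, hourly_usd: float,
                    hit_rate: float) -> float:
    """GPU dollars saved per month at a given blended cache hit rate:
    every cache hit is GPU time not spent."""
    return instances * hourly_usd * HOURS_PER_MONTH * hit_rate

# The three rows of the table, at $1.50/hr and a 50% hit rate.
assert monthly_savings(4, 1.50, 0.5) == 2_160      # startup
assert monthly_savings(25, 1.50, 0.5) == 13_500    # mid-size
assert monthly_savings(200, 1.50, 0.5) == 108_000  # enterprise
# At a 70% hit rate, enterprise annual savings reach ~$1.8M.
assert round(monthly_savings(200, 1.50, 0.7) * 12) == 1_814_400
```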

At the startup scale, inference caching saves $25,920 per year. That is roughly the cost of one junior engineer. At the enterprise scale, inference caching saves $1.3 million per year. That is the cost of a small engineering team, saved by eliminating redundant GPU work.

These numbers assume a conservative 50% cache hit rate. Production systems with repetitive traffic patterns routinely see 60-80% hit rates on their most expensive inference workloads. At a 70% hit rate, the enterprise savings increase to $1.8 million per year. The relationship is linear: every 10% increase in cache hit rate reduces GPU costs by an additional 10%.

When Not to Cache Inference Results

Do not cache results from models where the output must reflect real-time state. A stock price prediction model that takes current market data as input should not return cached results from 30 seconds ago. However, even here, sub-second caching (TTL of 100-500 milliseconds) can eliminate redundant computations from concurrent requests for the same prediction. Additionally, do not cache results from stochastic models where different outputs are desired for the same input (creative text generation, diverse recommendation exploration). For deterministic inference, caching is always correct.

Implementation With Cachee

Cachee provides ML inference caching with computation fingerprinting as a core feature. The setup requires three steps.

# Install Cachee
brew tap h33ai-postquantum/tap
brew install cachee

# Initialize with ML inference caching mode
cachee init --mode inference

# Start Cachee
cachee start

The integration wraps your existing inference client with a caching layer:

// Pseudocode for ML inference caching
fn predict_with_cache(input: &Tensor, model: &ModelInfo) -> Prediction {
    // Step 1: Compute fingerprint
    let fingerprint = sha3_256(
        input.to_bytes(),
        model.version_hash(),
        model.inference_params(),
        model.domain_separator(),
    );

    // Step 2: Check cache
    if let Some(result) = cachee.get(fingerprint) {
        return result;  // 31 nanoseconds
    }

    // Step 3: Run inference (cache miss)
    let result = model.predict(input);  // 10-50 milliseconds on GPU

    // Step 4: Cache the result
    cachee.set(fingerprint, result, model.cache_ttl());  // TTL from model config

    result
}

The cache TTL should be set based on the model's update frequency. For models updated daily, a 12-24 hour TTL is appropriate. For models updated weekly, 3-7 days. For static models (pre-trained embeddings that do not change), the TTL can be set to 30 days or longer. The fingerprint-based invalidation handles model updates correctly regardless of TTL, so the TTL is a secondary safety mechanism rather than the primary invalidation strategy.

Beyond Individual Predictions: Pipeline Caching

Most ML systems run multi-model pipelines. A request might flow through an embedding model, a classification model, a ranking model, and a post-processing model. Each step is independently cacheable. But the pipeline itself is also cacheable: if the same input produces the same output from the entire pipeline, you can cache the final result and skip all intermediate steps.

Pipeline caching uses a compound fingerprint that includes the versions of all models in the pipeline:

pipeline_fingerprint = SHA3-256(
    input_bytes         ||
    model_1_version     ||
    model_2_version     ||
    model_3_version     ||
    pipeline_version    ||
    domain_separator
)

If any model in the pipeline is updated, the pipeline fingerprint changes. This is more aggressive than per-model caching (the entire pipeline cache invalidates when any component changes) but captures the highest-value optimization: skipping the entire pipeline on a cache hit. In practice, both per-model and pipeline caching are used together. The pipeline cache provides the fast path when nothing has changed. Per-model caching provides intermediate speedups when some models are updated and others are not.

The compounding effect is significant. If each model in a four-model pipeline has a 50% cache hit rate, the pipeline hit rate is lower (because all four must hit). But the per-model caching still saves GPU compute on each intermediate step that hits. The combined savings from pipeline caching plus per-model caching consistently exceed the savings from either approach alone.
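
That compounding argument can be made concrete under a simplifying independence assumption; real traffic is correlated (a repeated input tends to hit every stage), so actual pipeline hit rates run higher than this bound:

```python
def pipeline_hit_rate(per_model_rates: list[float]) -> float:
    """Pipeline-level hit rate if per-model hits were independent:
    every stage must hit for the whole pipeline to be skipped.
    This is a pessimistic lower bound for correlated traffic."""
    p = 1.0
    for rate in per_model_rates:
        p *= rate
    return p

# Four stages at 50% each: only ~6% of requests skip the entire
# pipeline under independence, yet each stage still saves 50% of its
# own GPU work -- which is why both cache layers are used together.
assert pipeline_hit_rate([0.5, 0.5, 0.5, 0.5]) == 0.0625
```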

The Bottom Line

40% of ML inference calls in a typical production system compute results that were already computed before. Same input, same model, same output. A cached result returns in 31 nanoseconds. GPU inference takes 10 to 50 milliseconds. The computation fingerprint -- hash of input, model version, and parameters -- ensures cached results invalidate automatically on model updates. At enterprise scale, inference caching saves over $1 million per year in GPU compute. The GPU cycles you are not burning are the cheapest cycles you will ever find.

Stop burning GPU cycles on redundant inference. Cache the result at 31ns. Keep the GPU for new work.
