Cache Expensive Computation: The Engineering Guide
Most caching literature focuses on caching data: store a database query result, avoid the round-trip next time. That is useful, but it misses the larger opportunity. The most expensive thing in most applications is not fetching data. It is computing something with that data. A database query takes 2-10 milliseconds. A zero-knowledge proof verification takes 25 microseconds and runs on every access. An ML inference call takes 50-200 milliseconds. A pricing engine calculation takes 10-50 milliseconds. These computations are repeated millions of times per day with identical inputs, and each repetition burns CPU, memory, and time that could be avoided entirely.
This is the guide to caching computation results, not data. It covers five categories of expensive computation that benefit most from result caching, the technique for building computation fingerprints that make caching safe, and the measured savings at each layer. The core insight is simple: if a function is pure (same inputs always produce same outputs) and expensive (more than a few microseconds), its results should be cached. The question is how to do it correctly.
The Computation Fingerprint
Before we dive into specific computation types, one prerequisite applies to all of them: every computation cache needs a fingerprint -- a key that uniquely identifies the inputs to the computation. If two requests produce the same fingerprint, their computation results are identical and the cached result can be returned.
A computation fingerprint is typically a hash of the function's inputs. For a pure function f(a, b, c) -> result, the fingerprint is SHA3-256(serialize(a) || serialize(b) || serialize(c)). The hash must include every input that affects the output. Missing an input creates a correctness bug where two different computations collide on the same cache key.
```rust
use sha3::{Digest, Sha3_256};

// Computation fingerprint example
fn compute_fingerprint(inputs: &ComputeInputs) -> CacheKey {
    let mut hasher = Sha3_256::new();
    hasher.update(&inputs.algorithm_version.to_le_bytes());
    hasher.update(&inputs.parameters);
    hasher.update(&inputs.payload);
    CacheKey::from(hasher.finalize())
}

// Cache-aware computation
fn compute_with_cache(inputs: &ComputeInputs, cache: &Cache) -> ComputeResult {
    let key = compute_fingerprint(inputs);
    if let Some(cached) = cache.get(&key) {
        return cached; // ~31ns instead of recomputation
    }
    let result = expensive_compute(inputs); // full computation on a miss
    cache.insert(key, result.clone());
    result
}
```
The fingerprint must also include the algorithm version. If you update the computation logic (new model weights, new pricing rules, new verification algorithm), the old cached results are invalid. Including a version identifier in the fingerprint ensures that algorithm updates automatically invalidate stale results without requiring a cache flush.
Type 1: Zero-Knowledge Proof Verification
ZK proof verification is one of the most CPU-intensive operations in modern security infrastructure. A STARK proof verification involves evaluating polynomial constraints over a large finite field, checking Merkle paths in the FRI commitment scheme, and verifying the out-of-domain sampling. Even an optimized implementation takes 20-30 microseconds on a modern CPU.
But here is the critical insight: the same proof is often verified multiple times. In a system where proofs are submitted once and verified on every access (authorization checks, document authenticity, transaction validation), the same proof might be verified hundreds or thousands of times before it expires. Each verification is identical. The proof has not changed. The verification key has not changed. The result (valid or invalid) is deterministic.
The Savings
Uncached verification: 25 microseconds per verification (STARK proof, optimized implementation). Cached verification: 0.085 microseconds (85 nanoseconds) -- the time to compute the fingerprint and perform a cache lookup. That is a 294x speedup.
At 100,000 verifications per second, uncached verification consumes 2.5 seconds of CPU time per second -- you need at least 3 CPU cores dedicated to verification. Cached verification at a 95% hit rate consumes 0.13 seconds of CPU time per second. The 5% of cache misses (new proofs being verified for the first time) still require full verification, but 95% of the compute is eliminated.
The Fingerprint
For STARK proofs, the fingerprint includes the proof bytes, the public inputs, and the verification key hash. The verification key changes when the circuit changes (algorithm update), so including its hash ensures that proofs verified under an old circuit are re-verified under the new one.
```rust
// STARK verification fingerprint
fn stark_verify_fingerprint(
    proof: &StarkProof,
    public_inputs: &[FieldElement],
    vk_hash: &[u8; 32],
) -> CacheKey {
    let mut hasher = Sha3_256::new();
    hasher.update(b"stark-verify-v1");
    hasher.update(vk_hash);
    hasher.update(&proof.commitment_hash());
    for input in public_inputs {
        hasher.update(&input.to_bytes());
    }
    CacheKey::from(hasher.finalize())
}
```
When It Is Safe
ZK proof verification is a pure function: the same proof with the same public inputs and the same verification key always produces the same result (accept or reject). Caching is always safe as long as the fingerprint includes all three components. The cached result can be a single boolean (valid/invalid) or a small struct with the verification result and metadata.
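A minimal sketch of that shape, using a plain `HashMap` as a stand-in for the real cache (`VerifyOutcome` and `verify_cached` are illustrative names, not part of any library):

```rust
use std::collections::HashMap;

// Cached result: the 1-byte outcome plus a little metadata, keyed by the
// 32-byte fingerprint.
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct VerifyOutcome {
    pub valid: bool,
    pub verified_at: u64, // unix seconds, kept for auditing
}

pub struct VerifyCache {
    entries: HashMap<[u8; 32], VerifyOutcome>,
}

impl VerifyCache {
    pub fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    // Return the cached outcome; on a miss, run `verify` once and store it.
    pub fn verify_cached<F>(&mut self, fingerprint: [u8; 32], verify: F) -> VerifyOutcome
    where
        F: FnOnce() -> VerifyOutcome,
    {
        *self.entries.entry(fingerprint).or_insert_with(verify)
    }
}
```

On a hit the closure never runs, so a given proof is fully verified at most once per cache lifetime.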
Type 2: ML Model Inference
ML inference is the largest compute cost in many modern applications. A classification model might take 10-50 milliseconds per inference. A language model embedding takes 20-100 milliseconds. A recommendation model takes 5-30 milliseconds. These numbers are per-request, and they dominate the request latency budget.
The cacheability of ML inference depends on the input distribution. If your inputs are high-cardinality and unique (e.g., free-form text), the cache hit rate will be low. But many ML use cases have moderate-cardinality inputs with heavy repetition. A fraud detection model that scores transactions by merchant category, amount bucket, and user risk tier has a finite input space. A content moderation model that classifies images by perceptual hash has a finite input space. A recommendation model that generates suggestions by user segment and context has a finite input space.
The Savings
| Inference Type | Uncached Latency | Cached Latency | Speedup | Typical Hit Rate |
|---|---|---|---|---|
| Text classification | 15ms | 31ns | 483,871x | 40-60% |
| Image classification (by phash) | 50ms | 31ns | 1,612,903x | 70-90% |
| Embedding lookup | 20ms | 31ns | 645,161x | 80-95% |
| Recommendation (by segment) | 30ms | 31ns | 967,742x | 85-95% |
| Fraud scoring (by feature bucket) | 10ms | 31ns | 322,581x | 60-80% |
Even at a 50% hit rate, caching ML inference halves your GPU or CPU inference costs. At an 85% hit rate, you eliminate 85% of inference calls. For applications running inference on GPUs at $2-5/hour per instance, an 85% hit rate means 85% fewer GPU instances. The dollar savings are immediate and proportional.
The Fingerprint
ML inference fingerprints must include the model version (weights change between deployments), the featurized input (not the raw input -- apply feature engineering first, then hash), and any inference-time parameters (temperature, top-k, threshold). For image inputs, use a perceptual hash (pHash) rather than a cryptographic hash of raw pixels, so that visually identical images with minor encoding differences still hit the cache.
```rust
// ML inference fingerprint
fn inference_fingerprint(
    model_version: &str,
    features: &FeatureVector,
    params: &InferenceParams,
) -> CacheKey {
    let mut hasher = Sha3_256::new();
    hasher.update(model_version.as_bytes());
    hasher.update(&features.to_bytes());
    hasher.update(&params.temperature.to_le_bytes());
    hasher.update(&params.threshold.to_le_bytes());
    CacheKey::from(hasher.finalize())
}
```
When It Is Safe
Deterministic inference (classification, embedding, scoring with fixed weights) is always safe to cache. Non-deterministic inference (sampling with temperature > 0, dropout at inference time) should not be cached because the same inputs can produce different outputs. The rule is simple: if you set the random seed and get the same output twice, the inference is cacheable. If the outputs differ, it is not.
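That rule can be checked mechanically: run the candidate inference twice on the same input and compare the raw outputs. A sketch (`is_cacheable` is a hypothetical helper; in practice you would run this check offline, once per model version, not per request):

```rust
// Determinism probe: a computation is cacheable only if two runs with
// identical inputs produce byte-identical outputs.
pub fn is_cacheable<F>(infer: F) -> bool
where
    F: Fn() -> Vec<u8>,
{
    infer() == infer()
}
```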
Type 3: Pricing Engine Calculations
Pricing engines in fintech, insurance, and e-commerce are surprisingly expensive to evaluate. A real-time pricing calculation might involve evaluating 50-200 business rules, looking up rate tables, applying regional tax logic, computing volume discounts, and checking promotional eligibility. Each evaluation takes 5-50 milliseconds depending on complexity.
Pricing calculations are highly cacheable because the input space is bounded. A quote for "product X, quantity Y, customer tier Z, region W" produces the same price until the pricing rules change. Product catalogs have thousands to millions of SKUs, but the combinations that are actually requested follow a power-law distribution. The top 1,000 product-quantity-tier-region combinations account for 80-90% of pricing requests.
The Savings
A typical e-commerce platform serving 50,000 pricing requests per second with an average computation time of 15 milliseconds per request needs 750 CPU-seconds of pricing computation per second. That requires approximately 750 CPU cores dedicated to pricing. With a 90% cache hit rate, you need 75 CPU cores. That is a 10x reduction in pricing infrastructure, which translates directly to compute cost savings.
At $0.04/hour per vCPU on AWS (c7g spot pricing), 750 vCPUs cost $30/hour or $21,600/month. With caching, 75 vCPUs cost $3/hour or $2,160/month. The cache saves $19,440 per month in compute costs alone, before accounting for the latency improvement that increases conversion rates.
The Fingerprint
Pricing fingerprints must include the pricing rule version (so that rule updates invalidate cached prices), the product identifier, the quantity, the customer tier, the region, the currency, and the timestamp bucket (if prices vary by time of day or day of week). The timestamp bucket should be rounded to the granularity of price changes -- if prices update hourly, bucket to the hour.
```rust
// Pricing computation fingerprint
fn pricing_fingerprint(
    rule_version: u64,
    product_id: &str,
    quantity: u32,
    customer_tier: &str,
    region: &str,
    time_bucket: u64,
) -> CacheKey {
    let mut hasher = Sha3_256::new();
    hasher.update(b"pricing-v1");
    hasher.update(&rule_version.to_le_bytes());
    hasher.update(product_id.as_bytes());
    hasher.update(&quantity.to_le_bytes());
    hasher.update(customer_tier.as_bytes());
    hasher.update(region.as_bytes());
    hasher.update(&time_bucket.to_le_bytes());
    CacheKey::from(hasher.finalize())
}
```
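The `time_bucket` input can be derived with a one-line helper (a sketch; the granularity must match how often prices actually change):

```rust
// Round a unix timestamp down to the pricing window it falls in, so every
// request inside one window shares a fingerprint. For hourly prices,
// granularity_secs = 3600.
pub fn time_bucket(unix_secs: u64, granularity_secs: u64) -> u64 {
    unix_secs / granularity_secs
}
```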
Type 4: Auth Decision Pipelines
Authentication and authorization pipelines are executed on every single API request in most applications. A typical auth pipeline includes: validate the JWT signature, decode the claims, check token expiry, look up the user's roles, evaluate RBAC or ABAC policies, and return an allow/deny decision. This pipeline takes 1-10 milliseconds depending on the complexity of the policy engine and whether it requires database lookups.
Auth decisions are highly repetitive. A single user session might generate thousands of API requests, all with the same JWT, all requiring the same auth decision. The token does not change between requests (until it expires or is refreshed). The user's roles do not change between requests (role changes are rare events). The RBAC policies do not change between requests (policy updates are infrequent). Every execution of the auth pipeline with the same token and the same policies produces the same result.
The Savings
In a production system processing 200,000 API requests per second, the auth pipeline runs 200,000 times per second. At 3 milliseconds per evaluation, that is 600 CPU-seconds per second -- 600 cores dedicated to auth. With a 95% cache hit rate (5% misses from new tokens, token refreshes, and role changes), you need 30 cores. The cache saves 570 cores of auth compute.
The latency improvement is equally important. Removing 3 milliseconds from every API request is a material improvement in user experience. At 200,000 requests per second, this is 600 seconds of cumulative user wait time eliminated per second. Over a day, that is 51.8 million seconds (roughly 600 days) of aggregate user wait time removed by caching a single computation.
The Fingerprint
Auth decision fingerprints must include the token signature (or token hash -- do not include the full JWT as it may be large), the policy version, and the resource being accessed. The token signature uniquely identifies the claims, so any change in the user, roles, or expiry produces a different fingerprint. The policy version ensures that policy updates invalidate cached decisions.
```rust
// Auth decision fingerprint
fn auth_decision_fingerprint(
    token_signature: &[u8],
    policy_version: u64,
    resource: &str,
    action: &str,
) -> CacheKey {
    let mut hasher = Sha3_256::new();
    hasher.update(b"auth-decision-v1");
    hasher.update(token_signature);
    hasher.update(&policy_version.to_le_bytes());
    hasher.update(resource.as_bytes());
    hasher.update(action.as_bytes());
    CacheKey::from(hasher.finalize())
}
```
Safety Considerations
Auth caching requires careful TTL management. The cached decision must expire before or at the same time as the underlying token. If a JWT has a 15-minute expiry, the cached auth decision should have a TTL of no more than 15 minutes. For systems that support token revocation, the cache must be invalidated on revocation events, either via a pub/sub notification or by including a revocation check in the cache hit path. The trade-off is clear: caching auth decisions saves enormous compute, but stale auth decisions are a security risk. Keep TTLs tight and invalidation fast.
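The TTL rule reduces to taking the minimum of the token's remaining lifetime and a policy ceiling. A sketch (`auth_cache_ttl` is an illustrative name; revocation-driven invalidation is a separate path and is not shown):

```rust
// TTL for a cached auth decision: never longer than the token's remaining
// lifetime, and never longer than the policy cap (e.g. 15 min = 900 s).
// All times are unix seconds. Expired tokens are never cached.
pub fn auth_cache_ttl(now: u64, token_expiry: u64, cap_secs: u64) -> Option<u64> {
    if token_expiry <= now {
        return None; // already expired: do not cache a decision
    }
    Some((token_expiry - now).min(cap_secs))
}
```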
Type 5: Risk Scoring
Risk scoring in financial services, insurance, and fraud detection involves evaluating complex models against multi-dimensional feature vectors. A typical risk score computation involves feature extraction (parsing transaction metadata, user history aggregation), model evaluation (decision tree ensemble, logistic regression, or neural network), and rule overlay (regulatory limits, manual overrides, exclusion lists). The total computation takes 5-30 milliseconds.
Risk scoring is cacheable when the input features can be bucketed. Raw transaction data is high-cardinality, but the features derived from that data often have bounded cardinality. A fraud model might use features like "merchant category" (600 MCC codes), "transaction amount bucket" ($0-10, $10-50, $50-200, $200-1000, $1000+), "user risk tier" (low, medium, high, very high), and "time of day bucket" (morning, afternoon, evening, night). The total feature space is 600 * 5 * 4 * 4 = 48,000 combinations. That fits in a few megabytes of cache.
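The amount buckets from that example look like this in code (a sketch; the cent-denominated thresholds are an assumption about the representation):

```rust
// Map a raw transaction amount (in cents) to the bucket index the model
// actually consumes. The fingerprint hashes this index, not the raw amount.
pub fn amount_bucket(amount_cents: u64) -> u8 {
    match amount_cents {
        0..=999 => 0,         // $0 - $10
        1_000..=4_999 => 1,   // $10 - $50
        5_000..=19_999 => 2,  // $50 - $200
        20_000..=99_999 => 3, // $200 - $1,000
        _ => 4,               // $1,000+
    }
}
```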
The Savings
| Risk Scoring Type | Uncached | Cached | Speedup | Typical Hit Rate |
|---|---|---|---|---|
| Transaction fraud | 12ms | 31ns | 387,097x | 70-85% |
| Credit risk | 25ms | 31ns | 806,452x | 80-90% |
| Insurance premium | 30ms | 31ns | 967,742x | 85-95% |
| AML screening | 8ms | 31ns | 258,065x | 60-75% |
For a payment processor evaluating 50,000 transactions per second with a 12-millisecond fraud scoring computation, uncached scoring requires 600 CPU-seconds per second. At an 80% cache hit rate, this drops to 120 CPU-seconds per second. At $0.04/hour per vCPU, this saves approximately $14,000 per month in compute costs.
The Fingerprint
Risk scoring fingerprints hash the bucketed features, not the raw transaction data. This is important: two transactions with different raw amounts but the same amount bucket should produce the same risk score (assuming the model only uses amount buckets, not raw amounts). The fingerprint must also include the model version and the exclusion list version (since exclusion list updates change the score for excluded entities).
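Following the pattern of the earlier fingerprints, a sketch (std's `DefaultHasher` stands in for SHA3-256 so the snippet is dependency-free; production code should hash with SHA3-256 like the other examples, and the field names are illustrative):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Risk-scoring fingerprint over *bucketed* features plus the model and
// exclusion-list versions -- never over raw transaction data.
pub fn risk_fingerprint(
    model_version: u64,
    exclusion_list_version: u64,
    mcc_code: u16,
    amount_bucket: u8,
    risk_tier: u8,
    time_of_day_bucket: u8,
) -> u64 {
    let mut hasher = DefaultHasher::new();
    b"risk-score-v1".hash(&mut hasher);
    model_version.hash(&mut hasher);
    exclusion_list_version.hash(&mut hasher);
    mcc_code.hash(&mut hasher);
    amount_bucket.hash(&mut hasher);
    risk_tier.hash(&mut hasher);
    time_of_day_bucket.hash(&mut hasher);
    hasher.finish()
}
```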
The "Percent Computation Avoided" Metric
The single most important metric for computation caching is percent computation avoided: the fraction of computation requests served from cache instead of recomputed. This is not the same as cache hit rate, because different computations have different costs. A 90% hit rate on a 25-microsecond computation saves less compute than a 70% hit rate on a 50-millisecond computation.
The weighted metric is:
% Compute Avoided = SUM(hit_rate_i * cost_i * volume_i) / SUM(cost_i * volume_i)
Where hit_rate_i is the cache hit rate for computation type i, cost_i is the per-invocation cost in CPU-time, and volume_i is the invocations per second. This metric tells you what fraction of your total compute budget is being saved by caching.
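The formula translates directly into code. A sketch, where each workload is a `(hit_rate, cpu_seconds_per_call, calls_per_second)` tuple:

```rust
// Weighted "% compute avoided": CPU time saved by cache hits divided by
// the CPU time the same workload would cost with no cache at all.
pub fn percent_compute_avoided(workloads: &[(f64, f64, f64)]) -> f64 {
    let avoided: f64 = workloads.iter().map(|(hit, cost, vol)| hit * cost * vol).sum();
    let total: f64 = workloads.iter().map(|(_, cost, vol)| cost * vol).sum();
    100.0 * avoided / total
}
```

Note how an expensive computation dominates the weighted result even at a lower hit rate, which is exactly why this metric differs from raw hit rate.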
Target Metrics by Computation Type
| Computation Type | Target Hit Rate | Target % Avoided | TTL Guidance |
|---|---|---|---|
| ZK proof verification | 90-98% | 92-98% | Until proof expiry |
| ML inference | 60-90% | 65-92% | Until model update |
| Pricing calculation | 85-95% | 87-96% | Until rule update (or time bucket) |
| Auth decision | 90-98% | 92-98% | Min(token expiry, 15 min) |
| Risk scoring | 70-90% | 75-92% | Until model or exclusion list update |
If your "% compute avoided" is below these targets, the cause is usually one of three things: the fingerprint is too specific (including inputs that do not affect the output, causing cache fragmentation), the TTL is too short (results expire before they are reused), or the input distribution is too uniform (no hot spots, every input is unique). Each has a different fix.
Implementation: In-Process vs. Network Cache
Where you cache computation results matters as much as whether you cache them. A network cache (Redis, Memcached) adds 300 microseconds to 3 milliseconds of latency per lookup. If your computation takes 25 microseconds (like ZK proof verification), caching it in a network cache that adds 300 microseconds of lookup time makes the cache slower than the computation. You have made performance worse by adding a cache.
This is why computation caching must be in-process for any computation that takes less than approximately 1 millisecond. The lookup must be faster than the computation, or caching provides negative value. In-process cache lookup takes 31 nanoseconds -- orders of magnitude faster than any computation you would bother caching.
For computations that take more than 10 milliseconds (ML inference, complex pricing), a network cache still provides significant savings even with its 300-microsecond overhead. But an in-process cache provides even more savings. The decision is not "in-process or network" -- it is "in-process always, network additionally for shared state."
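One way to layer the two tiers (the `NetworkCache` trait here is hypothetical; a Redis or Memcached client would sit behind it):

```rust
use std::collections::HashMap;

// Stand-in for a network cache client (Redis, Memcached, ...).
pub trait NetworkCache {
    fn get(&self, key: &[u8; 32]) -> Option<Vec<u8>>;
    fn put(&mut self, key: [u8; 32], value: Vec<u8>);
}

pub struct TieredCache<N: NetworkCache> {
    local: HashMap<[u8; 32], Vec<u8>>, // in-process tier, ~nanoseconds
    remote: N,                          // shared tier, ~hundreds of us
}

impl<N: NetworkCache> TieredCache<N> {
    pub fn new(remote: N) -> Self {
        Self { local: HashMap::new(), remote }
    }

    pub fn get_or_compute<F>(&mut self, key: [u8; 32], compute: F) -> Vec<u8>
    where
        F: FnOnce() -> Vec<u8>,
    {
        // In-process first: the fast path is never penalized by the network.
        if let Some(v) = self.local.get(&key) {
            return v.clone();
        }
        // Network second: another instance may have computed this already.
        if let Some(v) = self.remote.get(&key) {
            self.local.insert(key, v.clone()); // promote to local tier
            return v;
        }
        // Double miss: run the full computation and populate both tiers.
        let v = compute();
        self.local.insert(key, v.clone());
        self.remote.put(key, v.clone());
        v
    }
}
```

The local tier answers repeats within one process; the network tier lets other instances skip the computation too, and hits there are promoted into the local tier.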
Memory Budget
Computation results are typically small. A ZK verification result is 1 byte (valid/invalid) plus 32 bytes of fingerprint. A pricing result is 8 bytes (price as f64) plus 32 bytes of fingerprint. An auth decision is 1 byte plus metadata plus 32 bytes of fingerprint. A million cached computation results occupy approximately 50-100 MB of memory. This is trivial on modern servers.
The CacheeLFU admission policy ensures that cold entries are evicted to make room for hot ones. You do not need to pre-size the cache for your entire input space. Size it for your hot working set -- the computations that are repeated frequently -- and let CacheeLFU handle the rest.
```shell
# Install Cachee
brew tap h33ai-postquantum/tap
brew install cachee

# Initialize for computation caching
cachee init --mode compute --memory 512mb

# Start
cachee start
```
The Compound Effect
Most request paths involve multiple expensive computations. A single API request might require auth decision (3ms), fraud scoring (12ms), pricing calculation (15ms), and response assembly (2ms). Total computation: 32 milliseconds. With computation caching at 90% average hit rate across all four types, the expected computation per request drops to: (0.1 * 3) + (0.1 * 12) + (0.1 * 15) + 2 = 5 milliseconds. That is an 84% reduction in request latency and compute cost.
The compound effect accelerates with scale. At 100,000 requests per second, the uncached compute requirement is 3.2 million CPU-milliseconds per second (3,200 CPU cores). The cached requirement is 500,000 CPU-milliseconds per second (500 CPU cores). The cache saves 2,700 CPU cores of compute at the cost of a few hundred megabytes of memory per application instance.
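The per-request arithmetic above generalizes to a small helper (a sketch; stages are `(hit_rate, cost_ms)` pairs and `uncacheable_ms` covers work that cannot be cached, like response assembly):

```rust
// Expected per-request compute: each cacheable stage contributes its cost
// weighted by its miss rate, plus the uncacheable tail.
pub fn expected_request_ms(stages: &[(f64, f64)], uncacheable_ms: f64) -> f64 {
    stages.iter().map(|(hit, cost)| (1.0 - hit) * cost).sum::<f64>() + uncacheable_ms
}
```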
The Bottom Line
The most expensive thing in your application is not data fetching. It is computation. Five types of computation -- ZK proof verification, ML inference, pricing engines, auth pipelines, and risk scoring -- are repeated millions of times daily with identical inputs. Cache expensive computation results using SHA3-256 fingerprints of the inputs, and you eliminate 70-98% of compute cost depending on the hit rate. Use in-process caching for sub-millisecond computations where network cache latency would exceed the computation itself. The metric that matters is percent computation avoided, not cache hit rate.
Stop recomputing what you already know: cached computation results are a 31ns lookup away.