STARK Verification: From 25us to 85ns
STARK verification is supposed to be the fast part. Proving takes seconds. Verification takes microseconds. That asymmetry is the entire point of succinct proofs. But "microseconds" is not the same as "free," and when your system verifies the same proof hundreds or thousands of times per second, those microseconds become a structural bottleneck that no amount of arithmetic optimization can fix.
A single STARK verification on production hardware takes approximately 25 microseconds. That sounds fast until you calculate what happens at scale. A system processing 100,000 proof verifications per second spends 2.5 seconds of cumulative CPU time every second on verification alone. At 1 million verifications per second, you need 25 full CPU cores dedicated entirely to checking proofs you have already checked before. Verifier work scales as O(log^2 n) in the trace length, but the constant factors are large enough that the aggregate cost dominates your compute budget.
This post walks through the STARK verification pipeline step by step, measures where those 25 microseconds go, and describes the optimization that reduces verification to 85 nanoseconds -- a 294x speedup. The core insight is not algorithmic. It is architectural. You verify once. You cache the result. You never verify the same proof twice.
Anatomy of 25 Microseconds
To understand why cached verification is safe, you first need to understand what happens during full verification. A STARK proof verification is not a single operation. It is a pipeline of four distinct mathematical checks, each with its own computational cost and security purpose. Every step must pass for the proof to be accepted. If any step fails, the entire verification fails.
Step 1: FRI Layer Checks (9.2 microseconds)
The FRI (Fast Reed-Solomon Interactive Oracle Proof of Proximity) protocol is the backbone of STARK soundness. During verification, the verifier checks that each FRI layer is consistent with the previous layer. This involves sampling random query positions, reading Merkle-authenticated values at those positions, and checking that adjacent layers are consistent under the folding operation. Each layer requires reading two or more field elements, computing a folding check, and verifying a Merkle authentication path from the leaf to the layer commitment root.
For a typical STARK proof with a trace length of 2^20 and a blowup factor of 8, the FRI component has approximately 17-20 layers. Each query position generates a chain of layer checks. With 30-50 queries (the standard security parameter range for 128-bit security), the verifier performs 500-1000 individual layer consistency checks. Each check involves a small number of field multiplications and additions, but the aggregate cost is 9.2 microseconds -- roughly 37% of total verification time.
The FRI checks dominate because they are inherently sequential within each query chain. You cannot check layer k+1 until you have the folded value from layer k. Parallelism across query positions helps, but the per-query chain depth limits how much speedup you can extract.
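To make the per-query work concrete, here is a minimal sketch of a single layer consistency check for a degree-2 fold, over a toy 31-bit Mersenne prime field. Treat the field choice, function names, and folding factor as illustrative assumptions; a real verifier uses whatever field and layer schedule the proof system fixes.

// Toy prime field (p = 2^31 - 1), for illustration only.
const P: u64 = (1 << 31) - 1;

fn fadd(a: u64, b: u64) -> u64 { (a + b) % P }
fn fsub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn fmul(a: u64, b: u64) -> u64 { ((a as u128 * b as u128) % (P as u128)) as u64 }
fn fpow(mut b: u64, mut e: u64) -> u64 {
    let mut acc = 1;
    while e > 0 {
        if e & 1 == 1 { acc = fmul(acc, b); }
        b = fmul(b, b);
        e >>= 1;
    }
    acc
}
fn finv(a: u64) -> u64 { fpow(a, P - 2) } // Fermat inverse, requires a != 0

/// One FRI layer consistency check for a degree-2 fold: the claimed value of
/// layer k+1 at x^2 must equal the fold of layer k's values at x and -x under
/// the verifier's challenge beta:
///   f_{k+1}(x^2) = (f_k(x) + f_k(-x))/2 + beta * (f_k(x) - f_k(-x))/(2x)
fn fri_fold_check(f_x: u64, f_neg_x: u64, x: u64, beta: u64, claimed_next: u64) -> bool {
    let even = fmul(fadd(f_x, f_neg_x), finv(2));
    let odd = fmul(fsub(f_x, f_neg_x), finv(fmul(2, x)));
    fadd(even, fmul(beta, odd)) == claimed_next
}

With 30-50 queries and 17-20 layers per query chain, this check runs 500-1000 times per proof, which is where the 9.2 microseconds go.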
Step 2: Constraint Evaluation (7.8 microseconds)
The verifier must evaluate the AIR (Algebraic Intermediate Representation) constraints at the queried positions and check that the constraint composition polynomial is consistent with the claimed trace values. This step ensures that the prover actually satisfied the computation's constraints, not just that they produced a valid-looking FRI proof.
Constraint evaluation involves reading trace column values at queried positions (Merkle-authenticated), computing the constraint polynomials at those positions using the trace values, combining the constraint evaluations with random coefficients, and checking consistency with the committed constraint composition polynomial. The cost scales with the number of constraints and their algebraic degree. A system with 50 constraints of degree 2-4 requires 200-400 field operations per query position. Across 30-50 query positions, this totals 6,000-20,000 field operations. At roughly 1-2 nanoseconds per field operation on modern hardware, plus the cost of the authenticated reads, the measured total is approximately 7.8 microseconds.
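The combination step is a plain random linear combination. A minimal sketch, reusing the toy field helpers from the FRI sketch above (constraint evaluation itself is AIR-specific, so the per-constraint values are taken as input here):

/// Combine per-constraint evaluations C_i(x) at a queried position with the
/// verifier's random coefficients alpha_i; the result is compared against the
/// committed constraint composition polynomial's value at the same position.
fn compose_constraints(constraint_evals: &[u64], alphas: &[u64]) -> u64 {
    assert_eq!(constraint_evals.len(), alphas.len());
    constraint_evals
        .iter()
        .zip(alphas)
        .fold(0, |acc, (&c, &a)| fadd(acc, fmul(a, c)))
}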
Step 3: Merkle Path Verification (5.1 microseconds)
Every value the verifier reads during FRI checks and constraint evaluation must be authenticated against a Merkle commitment. The verifier does not trust any value the prover provides -- it only trusts values that can be authenticated via a Merkle path to a root that was committed before the verifier's randomness was revealed.
A STARK proof at the scale described above contains approximately 200-400 Merkle authentication paths (query positions times the number of committed polynomials). Each path has depth log2(domain_size) = 20-23 hash operations. At roughly 50 nanoseconds per SHA3-256 hash on production hardware, each path costs about 1 microsecond. But many paths share prefixes, so the amortized cost per path is lower. The total Merkle verification cost across all paths is approximately 5.1 microseconds.
This step is the most parallelizable. Every Merkle path is independent. With SIMD-accelerated hashing and parallel path verification, you can reduce this step significantly on multi-core hardware. But the floor is bounded by memory access patterns -- each path reads 20+ hash digests from non-contiguous memory locations.
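A minimal sketch of a single path check, using the sha3 crate for concreteness (illustrative; the production verifier uses whatever SHA3-256 implementation the proof system ships):

use sha3::{Digest, Sha3_256};

/// Recompute the root from a leaf and its authentication path. The path is
/// ordered leaf-to-root; bit i of `index` gives the node's position (left or
/// right) at depth i, which determines concatenation order before hashing.
fn verify_merkle_path(leaf: [u8; 32], mut index: usize, path: &[[u8; 32]], root: [u8; 32]) -> bool {
    let mut node = leaf;
    for sibling in path {
        let mut h = Sha3_256::new();
        if index & 1 == 0 {
            h.update(node);
            h.update(sibling);
        } else {
            h.update(sibling);
            h.update(node);
        }
        node = h.finalize().into();
        index >>= 1;
    }
    node == root
}

A 23-level path is 23 such hash calls, which at ~50 nanoseconds per hash matches the roughly 1 microsecond per-path figure above.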
Step 4: OOD Consistency Check (2.9 microseconds)
The out-of-domain (OOD) sampling check is the final verification step. The verifier picks a random point outside the evaluation domain and checks that the trace polynomial evaluations, constraint evaluations, and FRI polynomial are all consistent at that point. This is the step that binds the FRI proof to the actual computation. Without it, a prover could produce a valid FRI proof for an unrelated polynomial.
The OOD check requires evaluating the trace polynomials at the OOD point (using the committed values and interpolation), evaluating the constraint composition polynomial at the OOD point, checking that the deep composition polynomial at the OOD point matches the FRI polynomial evaluation, and verifying that all evaluations are consistent with the committed boundary conditions. This step involves several polynomial evaluations and interpolations, but they are all at a single point, so the cost is modest: approximately 2.9 microseconds.
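Schematically, simplified to a single trace column T and one challenge (notation ours, not the exact composition the verifier uses), the binding works through the DEEP quotient: at each queried position x, the verifier checks that

$$
D(x) \;=\; \gamma \cdot \frac{T(x) - \bar{t}}{x - z}
$$

matches the polynomial that FRI proves to be low-degree, where z is the OOD point and \bar{t} is the prover's claimed value of T(z). D(x) can only be a low-degree polynomial if x - z actually divides T(x) - \bar{t}, that is, if T really evaluates to \bar{t} at z.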
| Verification Step | Time | % of Total | Operations |
|---|---|---|---|
| FRI layer checks | 9.2 us | 36.8% | Field arithmetic + Merkle reads |
| Constraint evaluation | 7.8 us | 31.2% | AIR constraint computation |
| Merkle path verification | 5.1 us | 20.4% | SHA3-256 hash chains |
| OOD consistency check | 2.9 us | 11.6% | Polynomial evaluation |
| Total | 25.0 us | 100% | -- |
The Scale Problem
Twenty-five microseconds per verification is fast in isolation. The problem emerges at scale, and it emerges in a way that is not immediately obvious from looking at individual request latency.
Consider a system that uses STARK proofs for authentication attestation. Every authenticated session produces a proof. Every service that consumes that session must verify the proof before trusting it. In a microservices architecture with 8 downstream services, a single user session generates 8 verification operations. At 10,000 concurrent sessions with an average of 3 requests per second per session, you have 240,000 verifications per second. At 25 microseconds each, that is 6 seconds of cumulative CPU time per second of wall-clock time. You need at least 6 CPU cores dedicated entirely to STARK verification.
The problem compounds with proof reuse. The same proof gets verified by the same service multiple times because the service does not remember that it already verified this proof. It receives the proof bytes, runs the full 25-microsecond pipeline, confirms the proof is valid, and then discards the verification result. The next request with the same proof repeats the entire process. Every redundant verification is 25 microseconds of wasted compute. At 240,000 verifications per second, if 85% are redundant (the same proof verified by the same service within its validity window), you are wasting 5.1 seconds of CPU time per second on mathematical computation that produces the same answer every time.
The Redundancy Pattern
In production systems, 80-95% of STARK verifications are redundant. The same proof is verified by the same service multiple times during the proof's validity window. Each redundant verification consumes 25 microseconds of CPU time and produces the identical result. At 100K verifications per second with 90% redundancy, you waste 2.25 CPU-seconds per second on re-verification. This is not a latency problem. It is a throughput problem. The CPU time spent on redundant verification is CPU time not available for serving requests.
The Optimization: Verify Once, Cache the Result
The optimization is conceptually simple. Verify each unique proof exactly once. Cache the verification result. On subsequent encounters with the same proof, return the cached result instead of re-running the 25-microsecond pipeline. The cached lookup takes 85 nanoseconds. The speedup is 294x.
But "conceptually simple" hides a critical design question: how do you know the cached result is trustworthy? If an attacker can cause the cache to return "valid" for a proof that was never actually verified, the entire security model collapses. Memoization of a security-critical function requires cryptographic binding between the input and the cached result.
Computation Fingerprints
The cache key is not the proof bytes alone. It is a computation fingerprint that cryptographically binds five components: a content digest of the proof bytes, the verification key (which encodes the AIR constraints and public parameters), the constraint set hash (in case the constraint system is parameterized), the domain parameters (field size, blowup factor, number of queries), and the security parameter. The fingerprint is computed as a SHA3-256 hash over the concatenation of these components. Hashing the proof's digest rather than its raw bytes binds the proof just as strongly -- any change to the proof bytes changes the digest -- while keeping the fingerprint input a few hundred bytes.
fingerprint = SHA3-256(
    proof_digest ||
    verification_key_hash ||
    constraint_set_hash ||
    domain_params_canonical ||
    security_parameter
)
This fingerprint has a critical property: it is collision-resistant under SHA3-256. Two different inputs will produce different fingerprints with overwhelming probability (2^-256 collision probability). This means the cache will never return a verification result for proof A when queried with proof B, even if an attacker crafts the proofs adversarially.
The cached value is a single byte: 0x01 for valid, 0x00 for invalid. Combined with the fingerprint as the cache key, the cached entry says: "the proof with this exact fingerprint was verified with this exact verification key, constraint set, domain, and security parameter, and the result was valid (or invalid)." This is not a guess. It is a truth claim backed by the collision resistance of SHA3-256.
The Verification Pipeline with Caching
The modified verification pipeline has six steps instead of four. The first two steps are new. The last four are the original verification steps, but they only execute on cache misses.
- Compute fingerprint. Hash the proof's content digest, the verification key hash, the constraint set hash, the domain parameters, and the security parameter using SHA3-256. The proof's own digest is computed once when the proof bytes (typically 50-200 KB) are deserialized and is reused on every subsequent lookup, so the per-verification fingerprint covers only a few fixed-size digests, on the order of a couple hundred bytes. Cost: approximately 50 nanoseconds.
- Check cache. Look up the fingerprint in the in-process cache. Cost: approximately 35 nanoseconds for an in-process hash map lookup. If the fingerprint is found, return the cached result immediately. Total cost for a cache hit: 85 nanoseconds.
- On cache miss: FRI layer checks. 9.2 microseconds, same as before.
- On cache miss: Constraint evaluation. 7.8 microseconds, same as before.
- On cache miss: Merkle path verification. 5.1 microseconds, same as before.
- On cache miss: OOD consistency check. 2.9 microseconds, same as before. After full verification completes, store the fingerprint and result in the cache. Total cost for a cache miss: approximately 25.1 microseconds (25 microseconds verification + 85 nanoseconds fingerprint and cache overhead).
The overhead on cache misses is negligible: 85 nanoseconds added to a 25-microsecond operation is a 0.34% increase. The benefit on cache hits is transformative: 85 nanoseconds instead of 25 microseconds, a 294x reduction.
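Putting the two paths together: with cache hit rate h, the expected per-verification cost is

$$
\mathbb{E}[\text{cost}] \;=\; h \cdot 85\,\text{ns} \;+\; (1 - h) \cdot 25.1\,\mu\text{s},
$$

which at the steady-state hit rate measured below (h = 0.913) comes to roughly 2.3 microseconds -- about an 11x average-case reduction on top of the 294x hit-path reduction.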
"Isn't This Just Memoization?"
Yes and no. It is memoization in the computer science sense -- caching the result of a pure function to avoid recomputation. But calling it "just memoization" misses what makes it safe for a security-critical function.
Standard memoization assumes the cache is trustworthy. If you memoize a sorting function, a corrupted cache entry returns the wrong sort order, which is a bug but not a security breach. If you memoize a proof verification function, a corrupted cache entry could return "valid" for an invalid proof, which means an attacker can bypass your authentication system.
The computation fingerprint makes the memoization safe by ensuring three properties. First, the fingerprint is deterministic and collision-resistant. Two different (proof, verification_key, constraint_set, domain, security_parameter) tuples will produce different fingerprints. The cache cannot confuse one proof for another. Second, the fingerprint includes all parameters that affect the verification result. If you change the verification key, the constraint set, the domain parameters, or the security parameter, the fingerprint changes. A proof verified under one set of parameters cannot produce a cache hit when queried under different parameters. Third, the cache is local to the verifier process. It is not shared across trust boundaries. Each process maintains its own cache, which means a compromised process can only corrupt its own verification results -- it cannot poison the cache of another process.
This is memoization with cryptographic binding. The fingerprint proves that the cached result was produced by verifying this specific proof with these specific parameters. The cache is an acceleration structure, not a trust assumption. The trust is in the original verification and the collision resistance of SHA3-256.
Architecture
The cached verification architecture has three components: the fingerprint function, the cache, and the verification engine. They compose as follows.
use std::sync::LazyLock;

use dashmap::DashMap;
use sha3::{Digest, Sha3_256};

// In-process cache: fingerprint -> result. DashMap gives sharded concurrent
// access (see Implementation Notes below).
static CACHE: LazyLock<DashMap<[u8; 32], VerificationResult>> =
    LazyLock::new(DashMap::new);

fn verify_cached(
    proof: &StarkProof,
    vk: &VerificationKey,
    constraints: &ConstraintSet,
    domain: &DomainParams,
    security_parameter: u32,
) -> VerificationResult {
    // Step 1: Compute the fingerprint over everything that affects the
    // result (~50ns). proof.digest() is the proof's content digest,
    // computed once when the proof bytes were deserialized.
    let mut h = Sha3_256::new();
    h.update(proof.digest());
    h.update(vk.hash());
    h.update(constraints.hash());
    h.update(domain.canonical_bytes());
    h.update(security_parameter.to_le_bytes());
    let fp: [u8; 32] = h.finalize().into();

    // Step 2: Check the cache (~35ns). VerificationResult is a small Copy enum.
    if let Some(result) = CACHE.get(&fp) {
        return *result; // cache hit: 85ns total
    }

    // Steps 3-6: Full verification on a miss (~25us).
    let result = stark_verify(proof, vk, constraints, domain);

    // Cache the result -- valid or invalid -- for subsequent lookups.
    CACHE.insert(fp, result);
    result // cache miss: ~25.1us total
}
The cache itself is an in-process concurrent hash map with a CacheeLFU eviction policy. The memory footprint is small: each entry is 32 bytes (fingerprint) + 1 byte (result) + 8 bytes (metadata) = 41 bytes. A cache holding 1 million verification results occupies approximately 41 MB. Given that STARK proofs themselves are 50-200 KB each, the cache is a tiny fraction of the memory already consumed by proof handling.
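As a sketch of that 41-byte layout (the field names and the split of the 8 metadata bytes are our assumptions; the text fixes only the 32 + 1 + 8 budget):

/// 41 bytes per entry: 32 (key) + 1 (result) + 8 (metadata).
#[allow(dead_code)]
#[repr(C, packed)]
struct CacheEntry {
    fingerprint: [u8; 32], // SHA3-256 computation fingerprint (cache key)
    result: u8,            // 0x01 = valid, 0x00 = invalid
    freq: u32,             // CacheeLFU access-frequency counter
    expires_at: u32,       // TTL expiry in seconds; see the TTL section below
}

const _: () = assert!(std::mem::size_of::<CacheEntry>() == 41);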
The eviction policy matters. CacheeLFU (frequency-based eviction) ensures that the most frequently verified proofs stay in cache. This is important because proof access patterns follow a power law distribution: a small number of proofs (currently valid session proofs, frequently accessed attestations) are verified thousands of times, while most proofs are verified once or twice. LRU eviction would evict frequently accessed proofs in favor of recently seen but rarely re-accessed proofs. CacheeLFU keeps the hot proofs in cache and evicts the cold ones.
TTL and Expiration
Cached verification results must expire. A proof that was valid yesterday may not be valid today if the verification key has been rotated, the constraint set has been updated, or the proof's validity window has closed. The fingerprint handles the first two cases automatically: a rotated verification key produces a different fingerprint, so old cached results are never returned for the new key. But temporal validity -- a proof that is valid for 60 minutes -- requires explicit TTL management.
The cache assigns a TTL to each entry based on the proof's validity window. If the proof is valid for 60 minutes, the cached verification result expires after 60 minutes. This ensures that expired proofs are re-verified (and rejected) rather than served from cache as valid. The TTL is derived from the proof's metadata, not configured globally. Different proofs can have different validity windows, and the cache respects each one.
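A sketch of a TTL-respecting lookup on a DashMap-backed cache, with the entry simplified to a (result, expiry) pair. One practical detail: the read guard from get must be dropped before calling remove on the same key, or DashMap can deadlock on the shard lock.

use std::time::{SystemTime, UNIX_EPOCH};
use dashmap::DashMap;

fn now_secs() -> u64 {
    SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs()
}

/// Expired entries behave like misses: the caller falls through to full
/// verification, which then rejects the proof on its expired timestamp.
fn get_unexpired(cache: &DashMap<[u8; 32], (u8, u64)>, fp: &[u8; 32]) -> Option<u8> {
    {
        let entry = cache.get(fp)?;
        let (result, expires_at) = *entry;
        if now_secs() < expires_at {
            return Some(result); // fresh hit
        }
        // Stale entry: the read guard is dropped at the end of this block.
    }
    cache.remove(fp); // safe now; removing while holding the guard would deadlock
    None
}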
Benchmark Results
We measured the optimization on production hardware (c8g.metal-48xl, 192 vCPUs, Graviton4) with realistic proof sizes and access patterns. The benchmark generates 10,000 unique STARK proofs with a trace length of 2^20 and simulates an access pattern where 15% of proofs are accessed once (cold), 35% are accessed 10-50 times (warm), and 50% are accessed 100-1000 times (hot). This distribution matches real production patterns where a small number of session proofs dominate verification traffic.
| Metric | Without Cache | With Cache | Improvement |
|---|---|---|---|
| Verification latency (cache hit) | 25.0 us | 0.085 us (85 ns) | 294x |
| P50 verification latency | 24.8 us | 0.082 us | 302x |
| P99 verification latency | 27.3 us | 25.1 us (miss) | 1.09x |
| Throughput (single core, hit path) | 40,000 verify/sec | 11,764,705 verify/sec | 294x |
| CPU utilization at 100K verify/sec | 250% | 22.6% | 11x reduction |
| Cache hit rate (steady state) | N/A | 91.3% | -- |
| Memory overhead | 0 MB | 3.7 MB | -- |
The P99 latency with caching is essentially unchanged from the uncached case because P99 captures the cache miss path, which still runs the full 25-microsecond verification. The improvement is in the average case and in throughput. At a 91.3% cache hit rate, 91.3% of verifications complete in 85 nanoseconds and 8.7% complete in 25.1 microseconds. The weighted average is 0.085 * 0.913 + 25.1 * 0.087 = 2.26 microseconds -- but the median is 85 nanoseconds because most verifications are cache hits.
The CPU utilization reduction is the most operationally significant number. At 100,000 verifications per second, the uncached system requires 2.5 CPU cores dedicated to verification. The cached system, at the measured 91.3% hit rate, requires roughly 0.23 CPU cores. Those 2.3 freed CPU cores are now available for serving actual application logic.
Security Analysis
The cached verification optimization does not weaken the security of the STARK verification system. Here is the argument, step by step.
Soundness is preserved. Every unique (proof, verification_key, constraint_set, domain) tuple is verified by the full STARK verification pipeline at least once. The cached result is the output of that full verification. If the proof is invalid, the full verification rejects it, and the cache stores "invalid." Subsequent lookups for the same fingerprint return "invalid." An invalid proof never returns a "valid" cache hit.
Collision resistance prevents confusion. Two different proofs produce different SHA3-256 fingerprints with probability 1 - 2^-256. An attacker cannot craft a proof that has the same fingerprint as a previously verified valid proof. The probability of an accidental collision across the lifetime of the system (even verifying 2^64 proofs) is negligible.
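The standard birthday bound makes the lifetime claim concrete: after computing q fingerprints,

$$
\Pr[\text{any collision}] \;\le\; \binom{q}{2} \cdot 2^{-256} \;\approx\; 2^{-129} \quad \text{for } q = 2^{64},
$$

far below any practical threat threshold.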
Parameter binding prevents downgrade. The fingerprint includes the verification key, constraint set, and domain parameters. If an attacker attempts to verify a proof against a weaker constraint set and then replay the "valid" result against the original constraint set, the fingerprints will differ. The cache miss will trigger full verification, which will reject the proof under the stronger constraints.
TTL prevents temporal replay. Expired proofs are evicted from the cache. A proof that was valid during its validity window cannot be served from cache after the window closes. Re-verification after TTL expiry will reject the proof based on its expired timestamp.
Process isolation prevents cross-boundary attacks. The cache is in-process memory, not shared across processes or network boundaries. A compromised verifier process can only affect its own cache. This is the same trust model as the original uncached verifier: if the process is compromised, its verification results are untrustworthy regardless of whether they come from cache or from the verification engine.
When Not to Cache
There are three scenarios where cached STARK verification is not appropriate, and you should run full verification every time.
One-time proofs. If each proof is verified exactly once and never re-encountered, the cache provides no benefit. The 85-nanosecond overhead per verification (fingerprint computation + cache lookup) is wasted. This pattern is rare in production but common in batch verification pipelines where proofs arrive, get verified once, and are archived.
Adversarial cache pressure. If an attacker can submit a high volume of unique proofs designed to thrash the cache, the cache may evict hot entries and degrade performance. In this scenario, the cache becomes a liability: you pay the fingerprint computation cost on every request but rarely get a cache hit. Rate limiting at the proof submission layer mitigates this, but if your threat model includes adversarial cache pressure, you should size the cache accordingly or disable it for untrusted proof sources.
Constraint set changes mid-flight. If your constraint set changes frequently (multiple times per minute), the fingerprint changes on every constraint update, and previously cached results become useless. The cache works best when the verification parameters are stable for minutes to hours. If your constraint set is dynamic, the cache hit rate will be low.
Implementation Notes
We built the cached verification layer in Rust using an in-process DashMap for the concurrent hash map. DashMap shards the map across independently locked segments, so concurrent verification threads rarely contend with each other on cache lookups. The fingerprint computation uses the SHA3-256 implementation from the same library used for Merkle path verification, so there is no additional dependency.
The cache is initialized at process startup with a configurable capacity (default: 1 million entries, approximately 41 MB). Eviction uses CacheeLFU, which tracks access frequency with a compact counter per entry. The eviction overhead is amortized across insertions and does not affect lookup latency.
Integration with existing STARK verification code requires wrapping the verification function call. The wrapper computes the fingerprint, checks the cache, and either returns the cached result or delegates to the original verification function. No changes to the STARK verification logic itself are required. The original function remains available for cases where you want to bypass the cache (testing, debugging, adversarial scenarios).
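As a usage sketch building on the verify_cached wrapper above (make_test_fixture is a hypothetical helper producing a proof plus matching parameters), a property test that the cache never changes the answer is a cheap guard when wiring the wrapper into an existing codebase:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn cache_is_transparent() {
        let (proof, vk, constraints, domain) = make_test_fixture(); // hypothetical helper
        let direct = stark_verify(&proof, &vk, &constraints, &domain);
        let first = verify_cached(&proof, &vk, &constraints, &domain, 128); // miss path
        let second = verify_cached(&proof, &vk, &constraints, &domain, 128); // hit path
        assert_eq!(direct, first); // caching never changes the verdict
        assert_eq!(first, second); // hit path agrees with miss path
    }
}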
# Cachee with STARK verification caching enabled
cachee init --stark-cache-size 1000000
cachee start
# Monitor cache performance
cachee status --stark
# Output:
# STARK verification cache:
# Entries: 847,293 / 1,000,000
# Hit rate: 91.3%
# Avg hit: 85ns
# Avg miss: 25.1us
# Memory: 34.7 MB
# Evictions: 12,847 (CacheeLFU)
The Compound Effect
The 294x speedup on individual verification is significant. But the compound effect on system architecture is more important. When verification is cheap (85 nanoseconds), you can verify proofs at points in your architecture where verification was previously too expensive. You can verify on every request instead of once per session. You can verify at every microservice boundary instead of only at the gateway. You can verify in middleware without worrying about adding 25 microseconds of latency per request.
This changes the security posture of the system. Instead of "we verify once and trust the session token downstream," you move to "we verify everywhere because verification is effectively free." Redundant verification becomes a security feature rather than a performance liability. Every service independently confirms the proof's validity, so a compromised intermediate service cannot forge verification results for downstream services.
At 85 nanoseconds per cached verification, the cost of ubiquitous verification is negligible. A request that passes through 8 services incurs 8 * 85 nanoseconds = 680 nanoseconds of total verification overhead. That is less than a single network round-trip. The security benefit of independent verification at every boundary far exceeds the sub-microsecond cost.
The Bottom Line
STARK verification at 25 microseconds per proof is fast enough for individual checks but too expensive for high-throughput systems. Cached verification with cryptographic fingerprints reduces this to 85 nanoseconds -- a 294x speedup -- without weakening security. The fingerprint binds the cached result to the exact proof and parameters, making the cache an acceleration structure rather than a trust assumption. At 85 nanoseconds, verification becomes cheap enough to run at every service boundary, transforming redundant verification from a performance problem into a security advantage.
294x faster STARK verification. Verify once, serve from cache at 85 nanoseconds.
brew install cachee