
Cache STARK FRI Verification: 25us to 85ns

May 1, 2026 | 16 min read | Engineering

Every STARK verification runs the FRI protocol. Every FRI verification re-executes the same Reed-Solomon proximity test over the same polynomial commitment. If the proof has not changed and the parameters have not changed, the result is identical every time. Yet every system in production today recomputes it from scratch. This is not an algorithmic limitation. It is an architectural oversight. FRI accounts for 9.2 microseconds of the 25-microsecond STARK verification pipeline -- 37% of the total cost -- and it is entirely cacheable.

This post is a deep dive into FRI specifically: what it does, why it dominates STARK verification cost, what makes it safe to cache, and how computation fingerprinting eliminates the bottleneck. The result is a 294x speedup on the full verification pipeline, but the insight starts with understanding why FRI is where the time goes.

- 9.2 us -- FRI verification (37% of STARK)
- 85 ns -- cached full verification
- 294x -- total speedup

What FRI Actually Does

FRI stands for Fast Reed-Solomon Interactive Oracle Proof of Proximity. The name is precise and every word matters. It is a protocol that proves a committed function is close to a low-degree polynomial, which is the core soundness guarantee that makes STARKs work. Without FRI, a prover could commit to an arbitrary function that has nothing to do with the computation being proved, and the verifier would have no way to detect the fraud.

The Reed-Solomon code is a specific error-correcting code defined by polynomial evaluation. A polynomial of degree d over a field F, evaluated at n points where n is much larger than d, produces a codeword in the Reed-Solomon code RS[F, n, d]. The key property is distance: any two distinct polynomials of degree at most d agree on at most d points, so any two distinct codewords in RS[F, n, d] differ in at least n - d of the n positions. When n is much larger than d, a committed function is either close to a unique degree-d polynomial or far from all of them, and in the latter case random spot-checks will detect the disagreement with high probability. FRI exploits this property to efficiently prove that a committed evaluation is close to a degree-d polynomial, without the verifier ever seeing the full polynomial.
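The distance property can be checked directly on a toy instance. A minimal sketch over a small prime field (the modulus 97 and all helper names are illustrative stand-ins, not the Goldilocks parameters used in production):

```python
# Toy Reed-Solomon codeword: evaluate degree-d polynomials at n >> d points
# over F_97 and check the distance property: two distinct degree-d
# polynomials agree on at most d points.
P = 97  # small illustrative prime

def evaluate(coeffs, x):
    """Horner evaluation of a polynomial over F_P (coeffs low-to-high)."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc

def encode(coeffs, n):
    """RS[F_P, n, d] codeword: evaluations at domain points 0..n-1."""
    return [evaluate(coeffs, x) for x in range(n)]

d = 3
f = [5, 1, 0, 2]   # a degree-3 polynomial
g = [9, 1, 0, 2]   # a different degree-3 polynomial
cw_f, cw_g = encode(f, 32), encode(g, 32)
agreements = sum(a == b for a, b in zip(cw_f, cw_g))
assert agreements <= d  # distinct degree-d polynomials agree on <= d points
```

With n = 32 and d = 3, the two codewords differ in at least 29 of the 32 positions, which is why a handful of random spot-checks suffices to distinguish them.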

The protocol works in rounds. Each round takes the current polynomial, which lives in an evaluation domain of size n, and folds it into a polynomial of half the degree living in a domain of size n/2. The folding uses a random challenge from the verifier (or, in the non-interactive Fiat-Shamir variant, derived from the transcript hash). After log2(d) rounds, the degree-d polynomial has been reduced to a constant, which the prover sends in the clear. The verifier checks that each folding step was performed correctly by sampling random positions and verifying consistency between adjacent layers.

The Layer Structure

For a concrete STARK proof with a trace length of 2^20 and a blowup factor of 8, the evaluation domain has size 2^23 (about 8.4 million points) and the polynomial has degree at most 2^20. Each folding round halves the degree, so reducing to a constant takes log2(2^20) = 20 rounds. In practice, implementations stop early once the degree drops below a small threshold, sending the remaining polynomial coefficients directly instead of continuing the folding, which yields approximately 17-18 layers depending on the stopping condition.

Each layer is a committed evaluation. The prover computes the folded polynomial at every point in the halved domain, builds a Merkle tree over the evaluations, and sends the Merkle root as the layer commitment. The verifier stores these commitments and uses them during the query phase to check consistency.
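A layer commitment and its query-phase authentication can be sketched with a minimal Merkle tree over the evaluations, using SHA3-256 from the standard library. The helper names are illustrative and the tree assumes a power-of-two leaf count, as FRI domains guarantee:

```python
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha3_256(b).digest()

def merkle_root(leaves):
    """Hash the leaves, then pair-and-hash up to a single 32-byte root."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_path(leaves, index):
    """Authentication path for leaf `index`: sibling hashes, bottom-up."""
    level, path = [h(leaf) for leaf in leaves], []
    while len(level) > 1:
        path.append(level[index ^ 1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return path

def verify_path(leaf, index, path, root):
    """Recompute the root from a leaf and its path; log2(n) hashes."""
    node = h(leaf)
    for sibling in path:
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root

# Eight 8-byte evaluations standing in for one (tiny) FRI layer
evals = [int.to_bytes(v, 8, "little") for v in [3, 1, 4, 1, 5, 9, 2, 6]]
root = merkle_root(evals)
assert verify_path(evals[5], 5, merkle_path(evals, 5), root)
```

The `log2(layer_size)` hashes in `verify_path` are exactly the per-path cost discussed below: 23 hashes for the first layer of a 2^23-point domain.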

The query phase is where the 9.2 microseconds come from. For each query position (typically 30-50 positions for 128-bit security), the verifier must read the evaluation at that position from every FRI layer, verify the Merkle authentication path for each read, and check that the folding relation holds between adjacent layers. The folding check is a small algebraic computation: given evaluations f(x) and f(-x) at a position and its sibling, the folded value should equal (f(x) + f(-x))/2 + alpha * (f(x) - f(-x))/(2x), where alpha is the round challenge. This involves a handful of field multiplications and additions per layer per query.
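The folding relation can be verified symbolically on a toy field. The sketch below (illustrative modulus 97) splits a polynomial into even and odd coefficient halves, folds with a challenge alpha, and checks that the verifier-side identity holds at every domain point:

```python
P = 97  # small illustrative prime, stand-in for Goldilocks

def ev(coeffs, x):
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc

def fold(coeffs, alpha):
    """One FRI fold: write f(x) = f_e(x^2) + x*f_o(x^2), return f_e + alpha*f_o."""
    fe, fo = coeffs[0::2], coeffs[1::2]
    return [(e + alpha * o) % P for e, o in zip(fe, fo)]  # assumes even length

f = [7, 3, 1, 4, 1, 5, 9, 2]  # degree-7 polynomial, low-to-high coefficients
alpha = 13
folded = fold(f, alpha)

# Verifier check at every x: folded(x^2) == (f(x)+f(-x))/2 + alpha*(f(x)-f(-x))/(2x)
inv2 = pow(2, -1, P)
for x in range(1, P):
    lhs = ev(folded, x * x % P)
    rhs = ((ev(f, x) + ev(f, (-x) % P)) * inv2
           + alpha * (ev(f, x) - ev(f, (-x) % P)) * inv2 * pow(x, -1, P)) % P
    assert lhs == rhs
```

The identity holds because (f(x) + f(-x))/2 recovers the even part f_e(x^2) and (f(x) - f(-x))/(2x) recovers the odd part f_o(x^2); the production check is the same arithmetic over Goldilocks with a precomputed inverse of 2x.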

Per-Layer Cost Breakdown

The cost per query per layer has three components. First, reading the evaluation values: 2 field elements per layer (the query position and its sibling). On the Goldilocks field (p = 2^64 - 2^32 + 1), each field element is 8 bytes. The memory read is fast but contributes to cache pressure when processing many queries across many layers. Second, the Merkle authentication path verification: log2(layer_size) hash operations per path. For the first layer (size 2^23), this is 23 SHA3-256 hashes at roughly 50 nanoseconds each, totaling 1.15 microseconds. Later layers are smaller and have shorter paths. Third, the algebraic folding check: 4 field multiplications, 3 additions, and 1 inversion (or multiplication by a precomputed inverse). On the Goldilocks field, each multiplication is approximately 1.5 nanoseconds using Montgomery form, so the algebraic check costs roughly 10 nanoseconds per layer per query.

| FRI Component | Per Query Per Layer | Total (40 queries, 18 layers) | % of FRI |
|---|---|---|---|
| Merkle path verification | ~55 ns (avg) | ~39,600 ns | ~43% |
| Field element reads | ~30 ns | ~21,600 ns | ~23% |
| Folding algebra check | ~10 ns | ~7,200 ns | ~8% |
| Challenge derivation | amortized | ~2,400 ns | ~3% |
| Final poly check | once | ~2,100 ns | ~2% |
| Cache/memory overhead | varies | ~19,100 ns | ~21% |
| Total FRI | -- | ~92,000 ns (9.2 us) | 100% |
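The totals in the table follow mechanically from the per-unit numbers. A quick sketch of the arithmetic:

```python
# Reproduce the FRI cost-model totals from the per-unit costs above.
queries, layers = 40, 18
checks = queries * layers        # 720 per-layer checks per verification

merkle_ns = 55 * checks          # ~55 ns average Merkle path cost
reads_ns = 30 * checks           # two 8-byte field reads plus cache effects
fold_ns = 10 * checks            # 4 mul + 3 add + 1 inverse-multiply

assert merkle_ns == 39_600
assert reads_ns == 21_600
assert fold_ns == 7_200

# Challenge derivation, final poly check, and memory overhead close the gap
total_ns = merkle_ns + reads_ns + fold_ns + 2_400 + 2_100 + 19_100
assert total_ns == 92_000        # 9.2 us
```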

The Merkle path verification dominates within FRI, which means FRI's cost is ultimately dominated by hashing. This is important for the caching argument: the work being repeated is not cheap arithmetic. It is SHA3-256 hash chain verification across hundreds of Merkle paths.

Why FRI Is Repeated

In a typical deployment, STARK proofs are generated once and verified many times. An authentication attestation proof is generated when a user authenticates. That proof is then verified by every downstream service that needs to trust the authentication result. In a microservices architecture with 8 downstream services, each processing 3 requests per second per session, a single user's proof is verified 24 times per second. With 10,000 concurrent sessions, the system performs 240,000 FRI verifications per second. Each one re-executes the same 9.2-microsecond FRI pipeline on the same proof bytes, producing the same boolean result.
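The aggregate cost of that redundancy is easy to quantify. A sketch of the arithmetic for the deployment just described:

```python
# Aggregate FRI cost for the deployment described above.
services, rps_per_session, sessions = 8, 3, 10_000
verifications_per_sec = services * rps_per_session * sessions
assert verifications_per_sec == 240_000

fri_seconds_per_sec = verifications_per_sec * 9.2e-6
# ~2.2 CPU-seconds of identical FRI work burned every wall-clock second
assert 2.2 < fri_seconds_per_sec < 2.21
```

That is more than two full cores doing nothing but re-deriving a result that has not changed.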

The redundancy is structural, not accidental. Each service must independently verify the proof because it cannot trust that another service verified correctly. This is the correct security model. The problem is not that services verify independently. The problem is that each individual service re-verifies the same proof on every request, even though the proof has not changed since the last request. A service that verified proof P at time t=0 and receives the same proof P at time t=1 will re-execute the full FRI pipeline despite the fact that the proof bytes are identical, the verification key is identical, the domain parameters are identical, and therefore the result is identical.

This is where the opportunity lies. The FRI protocol, despite originating as an interactive protocol, becomes deterministic once the Fiat-Shamir heuristic is applied. The verifier's challenges are derived from the transcript hash, which is itself derived from the proof bytes and public parameters. Given the same proof and the same parameters, the verifier will generate the same challenges, read the same evaluation values, compute the same folding checks, and arrive at the same result. Every time. The function from (proof, parameters) to (accept/reject) is pure. It has no side effects, no randomness (the "randomness" is derived deterministically from the inputs), and no external dependencies. It is the textbook case for memoization.

The Fiat-Shamir Determinism Argument

A common objection to caching FRI results is: "FRI is an interactive protocol. The verifier's randomness is essential to soundness. If you skip the random sampling, you lose the security guarantee." This objection is correct for the interactive version of FRI, but it does not apply to the non-interactive version used in practice.

In the interactive version, the verifier sends fresh random challenges to the prover in each round. The prover cannot predict these challenges, which is what prevents the prover from cheating. If the prover knew the challenges in advance, it could construct a fake proof that passes the folding checks at the queried positions without actually having a valid low-degree polynomial.

In the non-interactive version (which is what every deployed STARK system uses), the Fiat-Shamir transform replaces the verifier's random challenges with hash-derived challenges. The challenge for round i is computed as H(transcript || round_i_commitment), where transcript includes all previous commitments and challenges, and H is a cryptographic hash function (typically SHA3-256 or BLAKE3). The prover must commit to each layer before the challenge for that layer is determined. Because the hash function is modeled as a random oracle, the prover cannot predict the challenges and cannot cheat.

The critical point: once the proof is generated and the challenges are fixed by the Fiat-Shamir transcript, the verification is entirely deterministic. The verifier does not generate any fresh randomness. It derives the challenges from the proof bytes using the same hash function the prover used. Two verifiers with the same proof bytes and the same hash function will compute the same challenges, query the same positions, and reach the same verdict. This determinism is what makes caching safe. You are not skipping the random sampling. You are recognizing that the "random" sampling, in the non-interactive setting, is a deterministic function of the proof bytes.
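The determinism is visible in a few lines of code. This sketch uses an illustrative transcript format (not any particular STARK library's), absorbing each layer commitment before squeezing the next challenge:

```python
import hashlib

def derive_challenges(layer_commitments, num_rounds):
    """Fiat-Shamir: challenge_i = H(transcript || commitment_i); no fresh randomness."""
    transcript = b""
    challenges = []
    for i in range(num_rounds):
        transcript += layer_commitments[i]
        digest = hashlib.sha3_256(transcript).digest()
        challenges.append(int.from_bytes(digest[:8], "little"))
        transcript += digest  # the challenge is absorbed back into the transcript
    return challenges

commitments = [bytes([i + 1]) * 32 for i in range(18)]  # stand-in layer roots
a = derive_challenges(commitments, 18)
b = derive_challenges(commitments, 18)
assert a == b  # same proof bytes -> same challenges, every time
assert derive_challenges([bytes(32)] + commitments[1:], 18) != a  # any change cascades
```

Two verifiers running this on the same proof bytes compute identical challenge sequences, which is exactly why the (proof, parameters) to verdict mapping is a pure function.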

Computation Fingerprinting for FRI

The cache key for a FRI verification result must capture every input that affects the output. If any input changes, the cached result is invalid and must not be returned. The computation fingerprint is a SHA3-256 hash over the following components, concatenated in a canonical order.

fri_fingerprint = SHA3-256(
    polynomial_commitment  ||   // Merkle root of the evaluation
    field_parameters       ||   // p = 2^64 - 2^32 + 1 (Goldilocks)
    blowup_factor          ||   // 8 (domain expansion ratio)
    num_fri_layers         ||   // 18 (folding depth)
    grinding_factor        ||   // 16-32 bits (proof-of-work)
    num_queries            ||   // 40 (security parameter)
    fri_layer_commitments  ||   // All intermediate Merkle roots
    final_poly_coefficients    // The constant/low-degree remainder
)

Each component is serialized in a fixed, canonical byte order. The polynomial commitment is the 32-byte Merkle root of the initial evaluation. The field parameters encode the prime p and any extension field structure (for Goldilocks, this is simply the 8-byte prime). The blowup factor is a single byte (values 2-16 are standard). The number of FRI layers is a single byte. The grinding factor is a single byte. The number of queries is a two-byte integer. The FRI layer commitments are 32 bytes each, concatenated in order. The final polynomial coefficients are 8 bytes each (Goldilocks field elements), concatenated in order.
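The serialization rules above translate directly into code. A sketch of the fingerprint function (parameter names are illustrative; the canonical ordering follows the list in the pseudocode):

```python
import hashlib

def fri_fingerprint(poly_commitment: bytes, prime: int, blowup: int,
                    num_layers: int, grinding_bits: int, num_queries: int,
                    layer_commitments: list, final_poly: list) -> bytes:
    """SHA3-256 over the fixed-order canonical serialization of all FRI inputs."""
    buf = bytearray()
    buf += poly_commitment                      # 32-byte initial Merkle root
    buf += prime.to_bytes(8, "little")          # field prime (8 bytes for Goldilocks)
    buf += bytes([blowup, num_layers, grinding_bits])  # one byte each
    buf += num_queries.to_bytes(2, "little")    # two-byte security parameter
    for root in layer_commitments:              # 32 bytes each, in layer order
        buf += root
    for coeff in final_poly:                    # 8-byte Goldilocks elements
        buf += coeff.to_bytes(8, "little")
    return hashlib.sha3_256(bytes(buf)).digest()

GOLDILOCKS = 2**64 - 2**32 + 1
fp = fri_fingerprint(bytes(32), GOLDILOCKS, 8, 18, 20, 40,
                     [bytes([i]) * 32 for i in range(18)], [1, 2, 3, 4])
assert len(fp) == 32
# Changing any single component (here, the blowup factor) changes the key:
fp2 = fri_fingerprint(bytes(32), GOLDILOCKS, 4, 18, 20, 40,
                      [bytes([i]) * 32 for i in range(18)], [1, 2, 3, 4])
assert fp != fp2
```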

The total fingerprint input size for a typical proof is approximately 700-900 bytes: 32 (initial commitment) + 8 (field) + 1 (blowup) + 1 (layers) + 1 (grinding) + 2 (queries) + 18*32 (layer commitments) + variable final poly. SHA3-256 processes this input in approximately 15-20 nanoseconds. Combined with the cache lookup (35 nanoseconds for an in-process DashMap), the total cost of a cache hit is approximately 50-55 nanoseconds for the FRI-specific fingerprint, or 85 nanoseconds for the full STARK verification fingerprint that includes the constraint set and verification key.

Why Each Component Is Necessary

The polynomial commitment identifies which polynomial is being tested for proximity. Two different polynomials will have different Merkle roots (with overwhelming probability under SHA3-256), so they will produce different fingerprints. If you omit the polynomial commitment, two different proofs could share a fingerprint, and a valid result for one could be incorrectly returned for the other.

The field parameters identify the algebraic structure. FRI over the Goldilocks field and FRI over a binary extension field are different protocols with different soundness guarantees. A proof valid over one field is meaningless over another. In practice, most systems use a single field, but the fingerprint must include it to prevent cross-field confusion in systems that support multiple fields.

The blowup factor determines the rate of the Reed-Solomon code. A blowup factor of 8 means the evaluation domain is 8 times larger than the polynomial degree, giving a code rate of 1/8. A higher blowup factor provides better soundness at the cost of larger proofs. Changing the blowup factor changes the number of FRI layers, the size of each layer, and the soundness guarantee. A proof verified with blowup factor 8 should not produce a cache hit when queried with blowup factor 4.

The number of FRI layers and the grinding factor jointly determine the security level. More layers means a deeper folding, which can affect the final polynomial degree and the soundness error. The grinding factor adds a proof-of-work requirement: the prover must find a nonce such that H(transcript || nonce) has a certain number of leading zeros. This raises the computational cost of cheating without affecting the verification cost (beyond checking the nonce). Both must be part of the fingerprint because they affect whether a given proof should be accepted.
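The verifier side of the grinding check is a single hash and a leading-zeros comparison. A sketch, assuming the common leading-zero-bits convention (a toy 12-bit target so the prover-side search runs in milliseconds; production uses 16-32 bits):

```python
import hashlib

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        return bits + (8 - byte.bit_length())
    return bits

def check_grinding(transcript: bytes, nonce: int, grinding_bits: int) -> bool:
    """Verifier side: one hash, then compare leading zeros against the target."""
    digest = hashlib.sha3_256(transcript + nonce.to_bytes(8, "little")).digest()
    return leading_zero_bits(digest) >= grinding_bits

# Prover side (the expensive part): brute-force a nonce meeting the target
transcript = b"example transcript"
nonce = next(n for n in range(1 << 24) if check_grinding(transcript, n, 12))
assert check_grinding(transcript, nonce, 12)
```

The asymmetry is the point: the prover performs ~2^grinding_bits hashes on average, while the verifier performs exactly one.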

The number of queries is the primary security parameter. With 40 queries over a code of rate 1/8, the soundness error is approximately (1/8)^40, which is astronomically small. Reducing the number of queries weakens soundness. The fingerprint must include it so that a proof verified with 40 queries does not produce a cache hit for a verification with 30 queries.

Goldilocks Field Specifics

The Goldilocks prime p = 2^64 - 2^32 + 1 has properties that make it ideal for STARK implementations and that interact with caching in specific ways.

First, arithmetic in the Goldilocks field is fast. Addition is a 64-bit add followed by a conditional subtraction. Multiplication uses the identity that reduction modulo 2^64 - 2^32 + 1 can be done with shifts and adds instead of a general-purpose modular reduction. In Montgomery form, multiplication costs approximately 1.5 nanoseconds on modern hardware. This means the algebraic checks in FRI (the folding relation) are cheap. The cost is dominated by hashing (Merkle paths) and memory access (reading evaluation values), not by field arithmetic.
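The shift-and-add reduction follows from two congruences: 2^64 = 2^32 - 1 (mod p) and therefore 2^96 = -1 (mod p). A sketch checked against plain modular arithmetic (Python's final `% P` stands in for the couple of conditional subtracts a real implementation would use):

```python
P = 2**64 - 2**32 + 1  # the Goldilocks prime

def goldilocks_mul(a: int, b: int) -> int:
    """Multiply mod p via shifts and adds, no general-purpose division."""
    n = a * b                      # up to 128 bits
    lo = n & 0xFFFFFFFFFFFFFFFF    # bits  0..63
    mid = (n >> 64) & 0xFFFFFFFF   # bits 64..95
    hi = n >> 96                   # bits 96..127
    # n = hi*2^96 + mid*2^64 + lo  =  lo + mid*(2^32 - 1) - hi  (mod p)
    return (lo + mid * ((1 << 32) - 1) - hi) % P

import random
for _ in range(1000):
    a, b = random.randrange(P), random.randrange(P)
    assert goldilocks_mul(a, b) == a * b % P
```

In hardware terms, `mid * (2^32 - 1)` is a shift and a subtract, which is why the whole multiply lands around 1.5 nanoseconds.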

Second, the Goldilocks field has a large multiplicative subgroup of order 2^32, which means it supports NTT (Number Theoretic Transform) up to size 2^32. This is relevant for FRI because the evaluation domains are subgroups of the multiplicative group. A domain of size 2^23 (for blowup factor 8 and trace length 2^20) fits comfortably within the 2^32 subgroup, and the NTT can be used to efficiently interpolate and evaluate polynomials during proof generation. During verification, the NTT is not used directly, but the domain structure (powers of a 2^23-th root of unity) determines the query positions and the folding arithmetic.
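The 2-adic structure comes from the factorization p - 1 = 2^32 * (2^32 - 1) = 2^32 * 3 * 5 * 17 * 257 * 65537. A sketch that finds a multiplicative group generator from scratch and derives the generator of the 2^23-point FRI domain:

```python
P = 2**64 - 2**32 + 1
# p - 1 = 2^32 * (2^32 - 1); the odd part factors as 3 * 5 * 17 * 257 * 65537
PRIME_FACTORS = [2, 3, 5, 17, 257, 65537]

def find_generator() -> int:
    """Smallest g generating the full multiplicative group: g^((p-1)/q) != 1 for all q."""
    g = 2
    while any(pow(g, (P - 1) // q, P) == 1 for q in PRIME_FACTORS):
        g += 1
    return g

g = find_generator()
omega = pow(g, (P - 1) // 2**23, P)  # generator of the size-2^23 evaluation domain
assert pow(omega, 2**23, P) == 1
assert pow(omega, 2**22, P) != 1      # exact order 2^23, so the domain has 2^23 points
```

Squaring omega yields the generator of the next (halved) FRI domain, which is how the layer domains nest.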

Third, Goldilocks field elements are 8 bytes, which is half the size of elements in a 128-bit field. This has a direct impact on caching: the FRI layer evaluations, the Merkle leaves, and the final polynomial coefficients are all 8 bytes per element instead of 16. The fingerprint input is correspondingly smaller, the hashing is faster, and the memory footprint of cached entries is lower. At scale, the 2x reduction in element size translates to measurable savings in both fingerprint computation time and cache memory usage.

The Per-Layer Caching Alternative

An alternative to caching the final FRI accept/reject result is to cache individual layer checks. Instead of asking "did this entire FRI verification pass?", you cache the answer to "did layer k pass the folding check at query position q?". This finer-grained caching has higher hit rates in theory (because different proofs might share layer commitments, though this is rare in practice) but requires more cache entries and more fingerprint computations.

We evaluated per-layer caching and found it inferior to whole-FRI caching for three reasons. First, the overhead: each layer check is approximately 55 nanoseconds (dominated by the Merkle path hash), and the fingerprint computation for a single layer check is approximately 20 nanoseconds. The cache hit saves 55 nanoseconds but costs 55 nanoseconds (20ns fingerprint + 35ns lookup). The net savings per layer hit is effectively zero. Second, the hit rate: in practice, different proofs do not share FRI layer commitments. Each proof has unique layer commitments because each proof is for a different computation (or the same computation with different inputs, which produces different trace polynomials and therefore different FRI layers). Third, the memory: with 40 queries and 18 layers per proof, per-layer caching requires 720 cache entries per proof versus 1 entry for whole-FRI caching. At 1 million proofs, that is 720 million entries (approximately 30 GB) versus 1 million entries (approximately 41 MB).

Whole-FRI caching is the correct granularity. The fingerprint captures the entire FRI verification in a single 32-byte key, and the cached value is a single byte (accept/reject). The cache is compact, the hit rate matches the proof reuse rate, and the savings per hit are substantial (9.2 microseconds instead of 55 nanoseconds per layer).

Integration with Full STARK Verification Cache

FRI caching can operate independently or as part of a full STARK verification cache. In the independent mode, the FRI result is cached separately from the constraint evaluation, Merkle path verification, and OOD consistency check results. This is useful when different parts of the verification pipeline have different cache hit rates or different freshness requirements. In the integrated mode, the entire STARK verification result is cached under a single fingerprint that includes the FRI parameters plus the constraint set hash and verification key hash.

The integrated mode is simpler and more effective. The fingerprint is slightly larger (adding the constraint set hash and verification key hash is 64 additional bytes), but the cache lookup is a single operation instead of four. The hit rate is the same because a cache hit on the full verification implies a hit on all sub-components. And the savings per hit are larger: 25 microseconds instead of 9.2 microseconds for FRI alone.
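The integrated mode reduces to a thin memoization wrapper. A sketch with a plain dict standing in for the concurrent DashMap of the Rust deployment, and a hypothetical `verify_fn` callable standing in for the full 25-microsecond verifier:

```python
import hashlib

class VerificationCache:
    """Memoize full STARK verification results under a 32-byte fingerprint."""

    def __init__(self, verify_fn):
        self._verify = verify_fn   # hypothetical full verifier: (proof, vk) -> bool
        self._cache = {}           # dict stands in for the concurrent DashMap

    def verify(self, proof: bytes, vk: bytes, constraints_hash: bytes) -> bool:
        # Single fingerprint over proof bytes, verification key, constraint set
        fp = hashlib.sha3_256(proof + vk + constraints_hash).digest()
        hit = self._cache.get(fp)
        if hit is not None:
            return hit                        # fast path: fingerprint + lookup
        result = self._verify(proof, vk)      # slow path: full verification, once
        self._cache[fp] = result
        return result

calls = []
cache = VerificationCache(lambda p, vk: calls.append(1) or True)
for _ in range(3):
    assert cache.verify(b"proof", b"vk", b"constraints")
assert len(calls) == 1  # the expensive verifier ran exactly once
```

Because the fingerprint covers the proof, key, and constraint set, a hit on this single entry implies a hit on every sub-component, FRI included.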

In our production deployment, we use the integrated mode with a single DashMap holding full STARK verification results. The FRI-specific analysis in this post explains why FRI is the largest single contributor to the cached savings: when you eliminate the 25-microsecond full verification, 9.2 of those saved microseconds come from not re-running FRI. The remaining 15.8 microseconds come from skipping constraint evaluation (7.8us), Merkle path verification (5.1us), and OOD consistency checks (2.9us).

Benchmark: FRI-Specific Measurements

We isolated the FRI verification cost on production hardware (c8g.metal-48xl, 192 vCPUs, Graviton4, Goldilocks field) with proofs generated from a real authentication attestation pipeline. The trace length is 2^20, blowup factor is 8, 40 query positions, and 18 FRI layers with a grinding factor of 20 bits.

| Metric | Without Cache | With Cache (full STARK) | Improvement |
|---|---|---|---|
| FRI verification time | 9.2 us | 0 us (cached at STARK level) | N/A (skipped) |
| Full STARK verification | 25.0 us | 0.085 us (85 ns) | 294x |
| Throughput (single core) | 40,000/sec | 11,764,705/sec | 294x |
| FRI % of uncached pipeline | 36.8% | 0% (on hit) | -- |
| Fingerprint compute time | N/A | 50 ns | -- |
| Cache lookup time | N/A | 35 ns | -- |
| Cache hit rate (steady state) | N/A | 91.3% | -- |

The FRI-specific cost of 9.2 microseconds grows only slowly across proof sizes from 2^16 to 2^22 trace length: the number of query positions (40) is constant, and the per-query cost scales logarithmically with domain size (each doubling of the trace adds one layer to every query chain). At 2^22 trace length, FRI takes approximately 10.8 microseconds due to the two additional layers, but the relative contribution to total STARK verification remains around 37%.

FRI Soundness and Caching

Caching the FRI acceptance result does not weaken FRI soundness. The soundness guarantee comes from the first verification: FRI with 40 queries over a rate-1/8 code has soundness error less than 2^-120. If the proof is valid, the FRI check passes, and the cached result is "accept." If the proof is invalid, the FRI check fails with probability at least 1 - 2^-120, and the cached result is "reject." Subsequent cache lookups return the same result. The only scenario where caching could be problematic is if the cache entry is corrupted (bit flip, memory error), which is addressed by in-process memory integrity and optional ECC RAM verification.

Practical Implications for System Design

Understanding FRI as the specific bottleneck changes how you design STARK-based systems. If FRI accounts for 37% of your verification cost and you cannot cache, your optimization options are limited: you can reduce the number of queries (weakening security), use a larger blowup factor (increasing proof size), or use a faster hash function for Merkle trees (marginal improvement). None of these options eliminate the fundamental cost of re-executing 720 folding checks per verification.

Caching eliminates the problem entirely. After the first verification, the FRI cost drops to zero. The system can verify the same proof at every service boundary, on every request, without worrying about the cumulative FRI cost. This enables architectural patterns that would be prohibitively expensive without caching: per-request proof verification in middleware, independent verification at every microservice, and verification in hot loops without rate-limiting.

The broader lesson is that FRI's computational cost is front-loaded. The first verification does 9.2 microseconds of real mathematical work that must happen for soundness. Every subsequent verification of the same proof does the same 9.2 microseconds of work and produces the same result. The work is necessary once. It is wasteful thereafter. Computation fingerprinting converts this from "necessary every time" to "necessary once, free thereafter."

Why Not Use SNARKs Instead?

A common response to the FRI bottleneck is: "just use a SNARK. Groth16 verification takes 1.5 milliseconds, but you avoid FRI entirely." This misses the point in two ways. First, SNARK verification at 1.5 milliseconds is 60x slower than STARK verification at 25 microseconds. Caching a SNARK verification result saves more absolute time per hit, but the uncached baseline is much worse. Second, SNARKs require a trusted setup, which is a security liability for systems that need post-quantum resilience or that cannot tolerate a setup ceremony. STARKs with FRI are transparent -- no trusted setup, no toxic waste, no ceremony.

The correct comparison is not STARK-with-FRI vs SNARK-without-FRI. It is cached-STARK (85 nanoseconds) vs cached-SNARK (also 85 nanoseconds, since the cache lookup cost is the same regardless of proof system). Both arrive at the same cached performance. The difference is in the uncached path (25 microseconds vs 1.5 milliseconds) and in the security model (transparent vs trusted setup). For systems that need transparency, post-quantum security assumptions, and high throughput, cached STARKs are the right choice.

The Bottom Line

FRI is the single largest component of STARK verification cost at 9.2 microseconds (37% of the 25-microsecond pipeline). It is also the most cacheable: the Fiat-Shamir transform makes FRI verification fully deterministic, and the computation fingerprint (polynomial commitment + field parameters + blowup factor + number of layers + grinding factor) uniquely identifies each FRI instance. Caching the full STARK verification result -- including FRI -- reduces the amortized cost from 25 microseconds to 85 nanoseconds. The FRI bottleneck is real, but it only has to be paid once per unique proof.

294x faster STARK verification. FRI computed once, served from cache at 85 nanoseconds.
