
Cache Goldilocks STARK Proofs in Production

May 1, 2026 | 15 min read | Engineering

The choice of field in a STARK system determines everything downstream: proof size, verification time, prover throughput, and ultimately how many verifications per second your cache must absorb. The Goldilocks prime -- p = 2^64 - 2^32 + 1 -- has emerged as the field of choice for high-performance STARK systems because it fits in a single 64-bit machine word. This seemingly simple property eliminates multi-precision arithmetic from the entire STARK pipeline, making every field operation 4-8x faster than systems operating over 256-bit fields. Faster verification means more verifications per second. More verifications per second means more cache pressure. More cache pressure means more value from caching. The Goldilocks field makes STARK verification fast enough that the caching layer becomes the critical optimization target.

2^64 - 2^32 + 1: the Goldilocks prime
4-8x: faster than 256-bit fields
85 ns: cached verification

What Makes Goldilocks Special

A field element in a STARK system is a number modulo a prime p. Every arithmetic operation -- addition, subtraction, multiplication -- produces a result that must be reduced modulo p. The cost of modular reduction depends on the structure of p. For a generic 256-bit prime, modular reduction requires a full 256-bit division or, with Montgomery or Barrett reduction, several 256-bit multiplications. These operations use multi-precision arithmetic libraries that operate on arrays of 64-bit words, with carry propagation between words. Each field multiplication involves 8-16 machine multiplications plus carry handling.

The Goldilocks prime p = 2^64 - 2^32 + 1 has a structure that makes reduction trivially cheap. To reduce a 128-bit product x into the Goldilocks field, you decompose x into two 64-bit halves: x = x_hi * 2^64 + x_lo. Then you substitute 2^64 = 2^32 - 1 (mod p), giving x = x_hi * (2^32 - 1) + x_lo (mod p). Multiplying by 2^32 - 1 is a left shift by 32 bits followed by a subtraction, and then x_lo is added in: three core operations, plus a conditional correction or two for 64-bit wraparound. No multi-precision arithmetic. No carry propagation across word boundaries. No division. The entire reduction fits in a handful of machine instructions.
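In code, the reduction can be sketched as follows. This is a minimal, unoptimized version of the trick used by Goldilocks libraries such as Plonky2; it additionally uses the identity 2^96 ≡ -1 (mod p) to handle the top 32 bits of the product, and all names are illustrative:

```rust
/// The Goldilocks prime p = 2^64 - 2^32 + 1.
pub const P: u64 = 0xFFFF_FFFF_0000_0001;
/// 2^32 - 1. Note 2^64 ≡ EPSILON (mod p) and 2^96 ≡ -1 (mod p).
pub const EPSILON: u64 = 0xFFFF_FFFF;

/// Reduce a 128-bit product into the Goldilocks field.
pub fn reduce128(x: u128) -> u64 {
    let x_lo = x as u64;            // bits 0..64, weight 1
    let x_hi = (x >> 64) as u64;
    let x_hi_lo = x_hi & EPSILON;   // bits 64..96, weight 2^64 ≡ 2^32 - 1
    let x_hi_hi = x_hi >> 32;       // bits 96..128, weight 2^96 ≡ -1

    // t0 = x_lo - x_hi_hi (mod p); a u64 borrow is a wrap of 2^64 ≡ 2^32 - 1
    let (mut t0, borrow) = x_lo.overflowing_sub(x_hi_hi);
    if borrow {
        t0 = t0.wrapping_sub(EPSILON); // cannot underflow: t0 > 2^64 - 2^32 here
    }
    // t1 = x_hi_lo * (2^32 - 1) always fits in 64 bits
    let t1 = x_hi_lo * EPSILON;

    // res = t0 + t1 (mod p); a u64 carry is a wrap of 2^64 ≡ 2^32 - 1
    let (mut res, carry) = t0.overflowing_add(t1);
    if carry {
        res = res.wrapping_add(EPSILON); // cannot overflow: res < 2^64 - 2^33 here
    }
    // One conditional subtraction canonicalizes (res < 2p is guaranteed)
    if res >= P {
        res -= P;
    }
    res
}
```

The only multiplication is the 32x32-bit product `x_hi_lo * EPSILON`; everything else is shifts, adds, and conditional corrections, which is exactly why the reduction costs a handful of instructions.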

This efficiency applies to every field operation in the STARK pipeline. NTT (Number Theoretic Transform) butterfly operations, polynomial evaluations, FRI folding checks, constraint evaluations, Merkle hash computations (when hashing field elements), and random coefficient generation all use field arithmetic. A single STARK verification involves thousands to tens of thousands of field operations. Reducing each operation from 8-16 machine multiplications (256-bit field) to 3-4 machine instructions (Goldilocks) produces a 4-8x end-to-end speedup in verification time.

Who Uses Goldilocks

The Goldilocks field has been adopted by several major STARK and proof systems. Plonky2, developed by Polygon, uses Goldilocks as its base field for recursive proof composition, generating recursive proofs in approximately 170 milliseconds on consumer hardware -- recursion that would be prohibitively expensive over a 256-bit field. ethSTARK, the STARK specification used by StarkWare for Ethereum validity proofs, supports Goldilocks as a field option. Polygon Miden, a zero-knowledge virtual machine, uses Goldilocks for its execution trace and constraint evaluation. H33-74-AIR, the algebraic intermediate representation used in the H33-74 attestation system, operates over Goldilocks for the same performance reasons. The convergence of these independent projects on the same field reflects a practical consensus: 64-bit native arithmetic is the right trade-off for STARK systems that prioritize verification throughput.

Why Goldilocks Matters for Caching

The relationship between field choice and caching is indirect but powerful. It works through a chain of consequences. Goldilocks makes verification faster. Faster verification means a single CPU core can verify more proofs per second. More verifications per second means the system encounters more unique proofs per second. More unique proofs means larger cache working sets. Larger cache working sets mean eviction pressure increases. Higher eviction pressure means the eviction policy (CacheeLFU) must be more precise. And in the other direction: faster verification makes each cache miss cheaper, so the relative savings per hit shrink. But the hit path itself costs the same 85 nanoseconds per cached lookup regardless of field size, so the absolute value of serving a result from cache is unchanged.

Here is the concrete math. Over a 256-bit field, STARK verification takes approximately 100-150 microseconds. A single CPU core can verify approximately 7,000-10,000 proofs per second. At 90% cache hit rate, each core handles approximately 70,000-100,000 proof checks per second (90% at 85ns + 10% at 100-150us). Over Goldilocks, STARK verification takes approximately 15-25 microseconds. A single CPU core can verify approximately 40,000-67,000 proofs per second without caching. At 90% cache hit rate, each core handles approximately 400,000-670,000 proof checks per second. The cache amplifies the Goldilocks performance advantage by an additional 5-6x because the cache hit path (85ns) is the same for both fields, and the miss path is much cheaper for Goldilocks.
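The effective-throughput arithmetic above can be reproduced with a small model. The hit and miss costs are the post's figures; this is a back-of-envelope model, not a benchmark:

```rust
/// Effective single-core proof checks per second, given a cache hit rate,
/// the cached-lookup cost, and the full-verification cost (both in ns).
fn checks_per_sec(hit_rate: f64, hit_ns: f64, miss_ns: f64) -> f64 {
    // Expected per-check latency is a weighted average of hit and miss paths.
    let avg_ns = hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns;
    1e9 / avg_ns
}
```

Plugging in 90% hits at 85 ns with a 20 us Goldilocks miss gives roughly 480,000 checks/sec; the same hit rate with a 125 us 256-bit-field miss gives roughly 80,000 checks/sec, which is the 5-6x amplification the post describes.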

| Metric                  | 256-bit Field        | Goldilocks (64-bit)      | Advantage        |
|-------------------------|----------------------|--------------------------|------------------|
| Field multiplication    | 8-16 machine muls    | 1 mul + shift + add      | 4-8x             |
| STARK verification      | 100-150 us           | 15-25 us                 | 4-6x             |
| Verify/sec (no cache)   | 7,000-10,000         | 40,000-67,000            | 5-7x             |
| Verify/sec (90% cache)  | 70,000-100,000       | 400,000-670,000          | 5-7x             |
| Cached lookup           | 85 ns                | 85 ns                    | Same             |
| Cache value per proof   | Larger (more data)   | Smaller (compact field)  | 2-3x less memory |

The Computation Fingerprint for Goldilocks STARKs

Caching STARK verification results requires a computation fingerprint that uniquely identifies the proof and its verification context. For Goldilocks STARKs, the fingerprint must include six components: the field modulus (to distinguish Goldilocks from other fields), the constraint count and algebraic degree (to identify the specific AIR), the trace length (which determines the domain size and FRI layer count), the FRI configuration (blowup factor, number of queries, folding factor), the proof bytes themselves, and the verification key or constraint commitment.

fingerprint = SHA3-256(
    field_modulus: 0xFFFFFFFF00000001,  // Goldilocks
    constraint_count: 147000,
    max_degree: 4,
    trace_length: 512,
    fri_blowup: 8,
    fri_queries: 40,
    fri_folding: 4,
    proof_bytes: [...],
    vk_hash: [...]
)

The field modulus is included in the fingerprint even though it seems redundant. The reason is cross-circuit safety. A system that verifies proofs from multiple STARK backends -- some over Goldilocks, some over the BN254 scalar field, some over the BLS12-381 scalar field -- must ensure that a proof verified over one field cannot produce a cache hit when queried with a proof from a different field. Including the field modulus in the fingerprint guarantees that proofs from different fields always produce different fingerprints, even if the proof bytes happen to be identical (which would be an extraordinary coincidence but is not cryptographically impossible).

The fingerprint computation costs approximately 50 nanoseconds for a typical Goldilocks STARK proof -- but only because the fingerprint does not hash the full proof. The proof bytes would otherwise be the dominant input: a proof with a 512-row trace and 40 FRI queries is approximately 50-80 KB, and at SHA3-256's throughput of roughly 1 GB/sec on modern hardware, hashing 80 KB would take approximately 80 microseconds -- several times the cost of the verification the cache is meant to avoid. Instead, we hash the proof's Merkle root commitments (a few hundred bytes) plus the FRI final values (a few hundred bytes) plus the verification key hash. The full proof bytes are implicitly committed to by the Merkle roots, so nothing is lost. This reduces the fingerprint input to approximately 1-2 KB, bringing the hash time down to approximately 50 nanoseconds.
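A dependency-free sketch of this reduced-input fingerprint. To keep the snippet runnable without external crates, it uses the standard library's DefaultHasher purely as a stand-in for SHA3-256 (a production system would use a real SHA3-256 and a 32-byte digest); the input names mirror the pseudocode above and are illustrative:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const GOLDILOCKS_MODULUS: u64 = 0xFFFF_FFFF_0000_0001;

/// Fingerprint over commitments, not the full 50-80 KB proof: the Merkle
/// roots already bind the proof body, so hashing them (plus FRI finals and
/// the vk hash) keeps the input to roughly 1-2 KB.
fn fingerprint(
    constraint_commitment: &[u8; 32],
    fri_commitments: &[[u8; 32]],
    fri_remainder: &[u64],
    vk_hash: &[u8; 32],
) -> u64 {
    // u64 digest from the stand-in hasher; real code returns [u8; 32].
    let mut h = DefaultHasher::new();
    GOLDILOCKS_MODULUS.hash(&mut h);    // cross-field isolation
    constraint_commitment.hash(&mut h); // circuit identity
    fri_commitments.hash(&mut h);       // Merkle roots commit to proof body
    fri_remainder.hash(&mut h);         // final FRI values
    vk_hash.hash(&mut h);               // verifier key binding
    h.finish()
}
```

Because the field modulus and constraint commitment are hashed first, two proofs from different fields or different circuits diverge at the very start of the fingerprint input.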

A Concrete Example: secp256k1 Scalar Multiplication

To make the caching architecture concrete, consider a specific circuit: verifying a secp256k1 elliptic curve scalar multiplication inside a Goldilocks STARK. This is a common building block for Bitcoin signature verification in zero-knowledge. The circuit takes a scalar k and a point P on the secp256k1 curve and proves that Q = k * P, where Q is the output point.

The secp256k1 base field is a 256-bit prime. Goldilocks is a 64-bit prime. Representing a single secp256k1 field element inside a Goldilocks STARK requires four Goldilocks limbs (4 * 64 = 256 bits). Each limb is a Goldilocks field element, and arithmetic on the secp256k1 field element requires carry propagation and range proofs across the four limbs. This is called non-native field arithmetic, and it is the primary source of constraint complexity in the circuit.

A single secp256k1 point addition in non-native arithmetic requires approximately 50-80 constraints. A scalar multiplication with a 256-bit scalar requires approximately 256 point doublings and 128 point additions (on average, for a random scalar), totaling approximately 384 group operations or roughly 20,000-30,000 constraints. Add range proofs for carry propagation (approximately 2 range proof constraints per limb per operation, with 4 limbs per field element), and the total constraint count for a single secp256k1 scalar multiplication is approximately 147,000 constraints.

With a trace length of 512 rows (the next power of two above the constraint count divided by the number of constraint evaluations per row), the Goldilocks STARK proof for this circuit has the following verification characteristics: FRI layer checks take approximately 6 microseconds (fewer layers than a larger trace), constraint evaluation takes approximately 12 microseconds (147,000 constraints are expensive even at Goldilocks speed), Merkle path verification takes approximately 4 microseconds, and OOD consistency check takes approximately 3 microseconds. Total: approximately 25 microseconds per verification.

The Non-Native Arithmetic Cost

The 147,000 constraints for a secp256k1 scalar multiplication deserve scrutiny. In a native 256-bit field STARK, the same scalar multiplication would require approximately 10,000-15,000 constraints because each field operation is a single constraint. The 10x constraint blowup from non-native arithmetic is the price of using a fast field (Goldilocks) for a computation that naturally operates over a large field (secp256k1).

Is the trade-off worth it? Yes, because the verification time scales with both the constraint count and the per-constraint cost. With 147,000 constraints over Goldilocks at approximately 0.08 nanoseconds per constraint evaluation (amortized across batched, vectorized evaluation, including all the field operations for non-native arithmetic), the constraint evaluation step costs approximately 12 microseconds. With 15,000 constraints over a 256-bit field at approximately 0.8 nanoseconds per constraint evaluation (amortized the same way), the constraint evaluation step also costs approximately 12 microseconds. The constraint counts differ by 10x, but the per-constraint costs also differ by 10x, so the total constraint evaluation time is approximately the same. The advantage of Goldilocks appears in the other verification steps (FRI, Merkle paths, OOD check), where the native 64-bit arithmetic is unambiguously faster.

The caching implication is clear: non-native arithmetic makes constraint evaluation more expensive and more complex, which makes the verification result more valuable to cache. Each cached verification of a secp256k1 scalar multiplication proof saves 25 microseconds of dense computation. At 100,000 verifications per second with 90% cache hit rate, caching saves 2.25 seconds of CPU time per second -- CPU time that would otherwise be spent on carry propagation and range proof checks for non-native field arithmetic that produces the same result every time.

Cross-Circuit Cache Sharing

A production STARK system typically verifies proofs from multiple circuits. A blockchain validator might verify transaction proofs, state transition proofs, and signature aggregation proofs, each with different AIR constraints and different trace lengths. The cache must handle proofs from all circuits simultaneously without confusing proofs from one circuit with proofs from another.

The computation fingerprint handles this automatically. Each circuit has a different constraint set, which produces a different constraint commitment hash, which produces a different fingerprint prefix. A proof from the transaction circuit and a proof from the signature circuit will always produce different fingerprints, even if the proof bytes are structurally similar. The cache stores verification results from all circuits in the same hash map, and the fingerprint ensures that each lookup returns the result for the correct circuit.

Cross-circuit cache sharing provides a secondary benefit: memory amortization. Instead of maintaining separate caches for each circuit (each with its own eviction policy, memory allocation, and monitoring), a single unified cache with CacheeLFU eviction automatically allocates memory to the circuits that produce the most cache hits. If the transaction proof circuit generates 80% of verification traffic, CacheeLFU will allocate approximately 80% of the cache capacity to transaction proof results, without any explicit configuration. This adaptive allocation is more efficient than static per-circuit partitioning, which requires manual tuning and cannot adapt to changing traffic patterns.

Fingerprint Collision Across Circuits

A natural concern with cross-circuit cache sharing is collision: could a proof from circuit A produce the same fingerprint as a proof from circuit B? The answer is no, with overwhelming probability. The fingerprint is a SHA3-256 hash, which provides 256-bit collision resistance. The constraint commitment hash for each circuit is derived from the circuit's constraint polynomials, which are structurally different between circuits. Two circuits with different constraints will produce different constraint commitment hashes, which will produce different fingerprint prefixes. Even if the proof bytes were identical (a near-impossibility for proofs from different circuits), the constraint commitment component of the fingerprint would differentiate them.

The probability of an accidental collision across the lifetime of the system is bounded by the birthday paradox. With 2^64 total fingerprints (far more than any system will ever produce), the collision probability is approximately 2^-128. This is the same probability as guessing an AES-128 key on the first try. It is not a practical concern.

Production Setup: Compute, Fingerprint, Verify, Cache, Serve

The production pipeline for cached Goldilocks STARK verification has five stages. Each stage has a specific latency budget and a clear responsibility.

Stage 1: Proof Arrival (network-dependent)

The proof arrives from the prover or from an upstream service. For a typical Goldilocks STARK proof with a 512-row trace and 40 FRI queries, the proof size is approximately 50-80 KB. Over a local network, receiving the proof takes approximately 0.1-0.3 milliseconds. Over the public internet, it takes 1-50 milliseconds depending on geography and bandwidth. The proof is received into a byte buffer and is ready for fingerprinting.

Stage 2: Fingerprint Computation (50 nanoseconds)

The fingerprint is computed by hashing the proof's Merkle root commitments, FRI final values, and verification key hash. This avoids hashing the full proof bytes (50-80 KB) by relying on the Merkle roots as implicit commitments to the full proof data. The SHA3-256 hash over approximately 1-2 KB of fingerprint input takes approximately 50 nanoseconds on modern hardware. The fingerprint is a 32-byte hash that serves as the cache key.

Stage 3: Cache Lookup (35 nanoseconds)

The fingerprint is used to query the in-process DashMap. If the fingerprint is found (cache hit), the cached verification result (valid/invalid) is returned immediately. Total time for a cache hit: 50 nanoseconds (fingerprint) + 35 nanoseconds (lookup) = 85 nanoseconds. If the fingerprint is not found (cache miss), proceed to Stage 4.

Stage 4: Full Verification (15-25 microseconds)

On a cache miss, the full STARK verification pipeline executes: FRI layer checks, constraint evaluation (including non-native field arithmetic for circuits like secp256k1), Merkle path verification, and OOD consistency check. Over Goldilocks, this takes 15-25 microseconds depending on the circuit complexity. The verification result (valid/invalid) is stored in a single byte.

Stage 5: Cache Insert and Serve (10 nanoseconds)

The fingerprint and verification result are inserted into the DashMap. The insertion takes approximately 10 nanoseconds amortized (including CacheeLFU frequency counter update and potential eviction). The verification result is returned to the caller. Total time for a cache miss: 50 nanoseconds (fingerprint) + 35 nanoseconds (lookup miss) + 15,000-25,000 nanoseconds (verification) + 10 nanoseconds (insert) = approximately 15.1-25.1 microseconds.

// Production Goldilocks STARK cached verification
fn verify_goldilocks_cached(
    proof: &GoldilocksStarkProof,
    vk: &VerificationKey,
    air: &GoldilocksAir,
) -> VerificationResult {
    // Stage 2: Fingerprint (50ns) -- hash commitments, not full proof bytes
    let mut h = Sha3_256::new();
    h.update(GOLDILOCKS_MODULUS.to_le_bytes()); // 0xFFFFFFFF00000001
    h.update(air.constraint_commitment());      // circuit identity
    h.update(proof.fri_commitments());          // Merkle roots
    h.update(proof.fri_remainder());            // final FRI values
    h.update(vk.hash());                        // verifier key binding
    let fp = h.finalize();

    // Stage 3: Cache lookup (35ns)
    if let Some(result) = CACHE.get(&fp) {
        return result;  // 85ns total
    }

    // Stage 4: Full verification (15-25us)
    let result = goldilocks_stark_verify(proof, vk, air);

    // Stage 5: Cache insert (10ns)
    CACHE.insert(fp, result);

    result
}

Memory Profile for Goldilocks STARK Caches

Goldilocks STARK proofs are generally smaller than their 256-bit field counterparts because field elements are 8 bytes instead of 32 bytes. This means the Merkle tree nodes in the proof are smaller (8-byte leaves instead of 32-byte leaves, though the Merkle hash digests are still 32 bytes for SHA3-256), and the FRI layer values are 8 bytes each instead of 32 bytes. A typical Goldilocks STARK proof is 50-80 KB, compared to 150-300 KB for an equivalent proof over a 256-bit field.

However, the cache does not store the full proof. It stores only the fingerprint (32 bytes) and the verification result (1 byte), plus metadata (8 bytes for CacheeLFU frequency counter and TTL). Each cache entry is 41 bytes regardless of the proof size or the field. This means the cache memory profile is the same for Goldilocks and 256-bit field STARKs: 41 bytes per entry, approximately 41 MB per million entries.

The field choice affects the cache indirectly through the working set size. Goldilocks verification is 4-8x faster, which means the system processes 4-8x more proofs per second, which means the cache sees 4-8x more unique proofs per time window, which means the cache needs 4-8x more capacity to maintain the same hit rate. If a 256-bit field system needs a 1 million entry cache for a 90% hit rate, a Goldilocks system handling the same proof distribution at higher throughput might need 4-8 million entries for the same hit rate. At 41 bytes per entry, that is 164-328 MB -- still modest compared to the GB-scale memory requirements of post-quantum key caching.
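The sizing arithmetic above fits in a few lines. The 41-byte entry layout is the post's figure, and MB here means decimal megabytes:

```rust
/// Per-entry footprint: 32-byte fingerprint + 1-byte result
/// + 8 bytes of CacheeLFU frequency counter and TTL metadata.
const ENTRY_BYTES: u64 = 32 + 1 + 8;

/// Cache memory in (decimal) megabytes for a given entry capacity.
fn cache_mb(entries: u64) -> f64 {
    (entries * ENTRY_BYTES) as f64 / 1e6
}
```

One million entries is 41 MB; the 4-8 million entries a Goldilocks-rate system might need come to 164-328 MB, matching the figures in the text.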

FRI Configuration and Cache Behavior

The FRI (Fast Reed-Solomon Interactive Oracle Proof) configuration significantly affects verification time and therefore cache economics. Three FRI parameters matter: the blowup factor, the number of queries, and the folding factor.

The blowup factor (typically 4, 8, or 16) determines the ratio between the evaluation domain size and the trace length. A higher blowup factor increases soundness but also increases proof size and verification time. For Goldilocks STARKs, a blowup factor of 8 is standard, providing approximately 128 bits of security with 30-50 queries. The blowup factor is included in the computation fingerprint because changing it changes the verification semantics.

The number of FRI queries (typically 30-50) determines the number of random positions the verifier checks. More queries provide higher soundness but increase verification time linearly. Each additional query adds approximately 0.3-0.5 microseconds of verification time for a Goldilocks STARK. The query count is a tunable parameter: systems that prioritize verification speed can use fewer queries with a corresponding reduction in provable soundness. The query count must be included in the fingerprint because a proof verified with 30 queries has a different soundness guarantee than the same proof verified with 50 queries.

The folding factor (typically 2 or 4) determines how many FRI layers are collapsed in each folding step. A folding factor of 4 means each FRI round reduces the domain size by 4x instead of 2x, resulting in fewer FRI layers and fewer Merkle path verifications, but each folding step is more expensive. The optimal folding factor depends on the relative cost of field arithmetic versus Merkle hashing. For Goldilocks, where field arithmetic is cheap, a folding factor of 4 is often optimal because it reduces the number of expensive Merkle path verifications.
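The interaction of these parameters can be made concrete with a small sketch. The stopping size is illustrative (real FRI implementations stop at a fixed remainder degree), but the shape of the calculation holds: with the post's configuration (trace length 512, blowup 8) the evaluation domain is 4096, and folding by 4 yields half as many layers as folding by 2:

```rust
/// Number of FRI folding rounds for a given trace length, blowup factor,
/// and folding factor, stopping once the domain is at or below `remainder`.
fn fri_layers(trace_len: u64, blowup: u64, folding: u64, remainder: u64) -> u32 {
    let mut domain = trace_len * blowup; // evaluation domain size
    let mut layers = 0;
    while domain > remainder {
        domain /= folding; // each round shrinks the domain by the folding factor
        layers += 1;
    }
    layers
}
```

Fewer layers means fewer Merkle path verifications per query, which is exactly where the folding-factor-4 choice pays off when field arithmetic is as cheap as it is over Goldilocks.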

Handling Circuit Updates

In production, circuits are not static. Constraint sets are updated when the computation being proved changes (a new transaction format, an updated verification rule, a bug fix in the circuit). When the constraint set changes, the constraint commitment hash changes, and all existing cache entries for the old constraint set become unreachable -- they can never produce a cache hit because no new proof will have the old constraint commitment. These orphaned entries will eventually be evicted by CacheeLFU as their frequency counters age out.

The transition period during a circuit update is the most cache-hostile phase. All proofs for the new circuit are cache misses. If the system was operating at 90% cache hit rate before the update, it drops to 0% immediately after the update and gradually recovers as the cache fills with verification results for the new circuit. The recovery time depends on the proof generation rate and the cache capacity. For a system generating 10,000 unique proofs per second with a 1 million entry cache, full recovery takes approximately 100 seconds (1 million entries / 10,000 entries per second). During this recovery window, every verification pays the full 15-25 microsecond cost.
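The recovery estimate is simply cache capacity divided by the unique-proof rate. A helper makes the scaling explicit, under the post's worst-case assumption that every proof during recovery is unique:

```rust
/// Seconds until a cold cache of `capacity` entries refills after a
/// circuit update, given the rate of unique proofs arriving per second.
fn recovery_secs(capacity: u64, unique_proofs_per_sec: u64) -> f64 {
    capacity as f64 / unique_proofs_per_sec as f64
}
```

For the example in the text, a 1 million entry cache refilling at 10,000 unique proofs per second recovers in about 100 seconds; doubling the proof rate halves the window.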

To mitigate the cold-start penalty, you can pre-warm the cache by verifying known proofs before the circuit update goes live. If you have access to a set of proofs that will be frequently verified under the new circuit (for example, proofs for well-known accounts or high-traffic transactions), verify them during a maintenance window and populate the cache before the new circuit starts receiving live traffic. This reduces the cold-start period from minutes to seconds.

The Bottom Line

The Goldilocks field makes STARK verification 4-8x faster by enabling native 64-bit arithmetic. This amplifies the value of caching: faster verification means higher throughput, which means more cache pressure, which means more savings from each 85-nanosecond cached lookup. The computation fingerprint includes the field modulus, constraint commitment, FRI configuration, and proof commitments -- ensuring cross-circuit safety and cross-field isolation. In production, the pipeline is: proof arrives, fingerprint in 50ns, cache lookup in 35ns, full verify only on miss. For non-native arithmetic circuits like secp256k1 over Goldilocks, the high constraint count makes caching especially valuable -- each cached hit saves 25 microseconds of dense carry-propagation computation that produces the same result every time.

Cache Goldilocks STARK verification at 85 nanoseconds. Native 64-bit field arithmetic, cached results.

brew install cachee

STARK Verification: 25us to 85ns