FIPS 204 Caching: ML-DSA Key Sizes and Performance at Scale
FIPS 204 was finalized by NIST in August 2024, standardizing ML-DSA (Module-Lattice-Based Digital Signature Algorithm, formerly known as CRYSTALS-Dilithium) as the primary post-quantum digital signature standard. Every organization that signs data -- authentication tokens, API responses, document attestations, audit records -- will eventually migrate to ML-DSA. The question is not whether you will use it, but how your infrastructure will handle the performance and size implications when you do.
The numbers are unforgiving. An ML-DSA-65 public key is 1,952 bytes. An ML-DSA-65 signature is 3,309 bytes. Compare that to Ed25519: 32-byte public key, 64-byte signature. Your post-quantum signatures are 52 times larger than your classical signatures. Every system that stores, transmits, or verifies these signatures will feel the difference. Your cache layer will feel it most, because the cache is where signature verification results should live -- and where the size explosion can be contained.
This guide examines FIPS 204 from a caching perspective. Not the mathematics of lattice-based signatures. Not the security proofs. The operational reality: how large are the keys and signatures at each security level, how fast is verification, what happens to Redis when you start caching 3,309-byte signatures at scale, and why the correct caching strategy is to cache the 1-byte boolean verification result instead of the signature itself.
ML-DSA Parameter Sets: The Complete Reference
FIPS 204 defines three parameter sets, each targeting a different NIST security level. The levels correspond to the computational effort required to break the scheme: Level 2 is as hard to break as finding a SHA-256 collision, Level 3 as hard as breaking AES-192, and Level 5 as hard as breaking AES-256. The parameter sets differ in key sizes, signature sizes, and verification performance. Understanding these differences is essential for making caching decisions.
| Parameter | ML-DSA-44 (Level 2) | ML-DSA-65 (Level 3) | ML-DSA-87 (Level 5) |
|---|---|---|---|
| NIST Security Level | Level 2 (SHA-256 collision) | Level 3 (AES-192) | Level 5 (AES-256) |
| Public Key Size | 1,312 bytes | 1,952 bytes | 2,592 bytes |
| Secret Key Size | 2,560 bytes | 4,032 bytes | 4,896 bytes |
| Signature Size | 2,420 bytes | 3,309 bytes | 4,627 bytes |
| Matrix Dimensions (k, l) | (4, 4) | (6, 5) | (8, 7) |
| Polynomial Degree (n) | 256 | 256 | 256 |
| Modulus (q) | 8,380,417 | 8,380,417 | 8,380,417 |
| Dropped Bits (d) | 13 | 13 | 13 |
| Challenge Weight (tau) | 39 | 49 | 60 |
| Gamma1 | 2^17 | 2^19 | 2^19 |
| Gamma2 | (q-1)/88 | (q-1)/32 | (q-1)/32 |
| Beta (tau * eta) | 78 | 196 | 120 |
The size progression is clear. From ML-DSA-44 to ML-DSA-87, public keys grow by 97% (1,312 to 2,592 bytes), and signatures grow by 91% (2,420 to 4,627 bytes). Every step up in security level nearly doubles the data your cache must handle per signature operation. This is not a minor overhead. At 100,000 signature verifications per second with ML-DSA-65, you are processing 316 MB/sec of signature data alone -- before counting the messages being verified or the public keys used for verification.
Why ML-DSA-65 Is the Sweet Spot
For most applications, ML-DSA-65 (NIST Level 3) provides the optimal balance between security and performance. Here is the reasoning.
Security level 3 matches the deployment reality. Level 3 targets AES-192 equivalent security. For context, no classical computer has ever broken AES-128, let alone AES-192. Level 3 provides substantial margin against both classical and quantum attacks. Level 5 (AES-256 equivalent) is appropriate for classified government systems and long-term archival signatures. For authentication tokens, API response attestation, and session management -- the use cases where caching matters most -- Level 3 is more than sufficient.
The size penalty from Level 3 to Level 5 is severe. Moving from ML-DSA-65 to ML-DSA-87 increases the public key by 640 bytes (33%) and the signature by 1,318 bytes (40%). For a system verifying 100,000 signatures per second, that is an additional 126 MB/sec of data throughput just for the size increase. The security gain (from AES-192 to AES-256 equivalent) does not justify the operational cost for most use cases.
Verification performance scales with parameters. ML-DSA-65 verification requires computing NTT operations on (6, 5) matrices, while ML-DSA-87 uses (8, 7) matrices. The computational cost scales roughly with k*l, meaning ML-DSA-87 verification is approximately 1.87x slower than ML-DSA-65 (56/30). At scale, this difference determines how many verifications your infrastructure can perform per second -- and how critical caching becomes.
ML-DSA-65 is what the ecosystem is standardizing on. TLS libraries, certificate authorities, and authentication frameworks are implementing ML-DSA-65 as the default parameter set. Using the same parameter set as the rest of your stack simplifies key management, reduces interoperability risk, and ensures that your cached verification results are compatible with the verification logic of every other component in your infrastructure.
The Redis Problem: Signature Sizes at Scale
Redis was designed when signatures were 64 bytes. The transition to post-quantum signatures changes the math fundamentally. Here is what happens when you cache ML-DSA signatures in Redis.
Memory Consumption
Consider caching 1 million ML-DSA-65 signatures in Redis. Each signature is 3,309 bytes. With Redis overhead (key storage, internal data structures, memory fragmentation), each entry consumes approximately 3,500 bytes. One million entries requires 3.35 GB of Redis memory. For ML-DSA-87, that grows to 4.46 GB. And this is just the signatures -- it does not include the public keys, the messages, or any other cached data in the same Redis instance.
Compare that to caching the verification result. Each verification result is a boolean: true or false. That is 1 byte. One million verification results with Redis overhead is approximately 100 MB. The difference between caching signatures and caching verification results is a factor of 33x in memory consumption.
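As a rough sketch of that arithmetic -- the 190-byte Redis per-entry overhead and the 80-byte Cachee result entry below are assumptions for illustration, not measurements; real overhead depends on Redis version, key length, and fragmentation:

// Rough memory estimate for 1 million cached entries per parameter set.
// REDIS_OVERHEAD and CACHEE_RESULT_ENTRY are assumed values for illustration.
const ENTRIES: u64 = 1_000_000;
const REDIS_OVERHEAD: u64 = 190;       // assumed key + SDS + dict-entry overhead per signature
const CACHEE_RESULT_ENTRY: u64 = 80;   // 32-byte fingerprint + 1-byte result + metadata

fn gib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0 * 1024.0)
}

fn main() {
    for (name, sig_len) in [("ML-DSA-44", 2_420u64), ("ML-DSA-65", 3_309), ("ML-DSA-87", 4_627)] {
        let signature_cache = ENTRIES * (sig_len + REDIS_OVERHEAD);  // cache the signature
        let result_cache = ENTRIES * CACHEE_RESULT_ENTRY;            // cache the 1-byte result
        println!(
            "{name}: {:.2} GiB of signatures vs {:.2} GiB of results ({}x smaller)",
            gib(signature_cache),
            gib(result_cache),
            signature_cache / result_cache
        );
    }
}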
Network Bandwidth
At 100,000 verifications per second with ML-DSA-65 signatures cached in Redis, you are transferring 3,309 bytes per lookup over the network. That is 316 MB/sec of cache read traffic -- about 2.6 Gbps -- just for signature lookups. Add the write traffic for cache population and you are consuming significant network bandwidth for cache operations alone. On a shared network, this competes with your application traffic, database queries, and inter-service communication.
With Cachee's in-process L1 tier, there is no network transfer at all. The 31ns lookup happens in the application's memory space. But even if you must use a distributed cache, caching the 1-byte verification result instead of the 3,309-byte signature reduces network bandwidth by 3,309x for reads. At 100,000 lookups per second, that is the difference between 316 MB/sec and 95 KB/sec.
Latency at Each Size
Redis GET latency is not constant with respect to value size. For small values (under 100 bytes), Redis GET latency is typically 50-100 microseconds over the network. As value size increases, serialization, deserialization, and network transfer time grow linearly. For a 3,309-byte ML-DSA-65 signature, typical Redis GET latency is 120-180 microseconds. For a 4,627-byte ML-DSA-87 signature, latency increases to 150-220 microseconds.
| Parameter Set | Signature Size | Redis GET Latency | Cachee L1 Lookup | Speedup |
|---|---|---|---|---|
| ML-DSA-44 | 2,420 B | ~110 us | 31 ns | 3,548x |
| ML-DSA-65 | 3,309 B | ~150 us | 31 ns | 4,839x |
| ML-DSA-87 | 4,627 B | ~185 us | 31 ns | 5,968x |
The speedup increases with signature size because Redis latency grows with value size while Cachee's in-process lookup latency remains constant at 31 nanoseconds regardless of the size of the original data. This is a fundamental architectural advantage: in-process caching eliminates the network, and the network cost is what scales with value size.
Do Not Cache the Signature. Cache the Verification Result.
The instinct when implementing ML-DSA caching is to cache the signature so you do not have to fetch it again. This is wrong. The signature is 3,309 bytes (ML-DSA-65). The verification result is 1 byte (true or false). Caching the result instead of the signature shrinks the cached value by 3,309x per entry (roughly 38x in total footprint once the fingerprint key and metadata are counted), reduces network bandwidth by 3,309x, and eliminates the re-verification computation entirely. You are not caching a large object to avoid fetching it. You are caching a tiny result to avoid recomputing it.
The Correct Caching Strategy: Verification Result Caching
ML-DSA verification is computationally expensive. It involves NTT operations on polynomial matrices, hash computations, and comparison checks. For ML-DSA-65, a single verification takes approximately 250-400 microseconds on modern hardware. At 100,000 verifications per second, that is 25-40 CPU-seconds per wall-clock second -- 25 to 40 full CPU cores dedicated entirely to signature verification.
The insight is that signature verification is deterministic. Given the same message, signature, and public key, the result is always the same. If you verified a signature once and the result was true, it will be true every subsequent time you verify it with the same inputs. The computation is pure. The result is cacheable.
The cache key should be a fingerprint of the verification inputs: SHA3-256(public_key || message || signature). The cached value is a single byte: 0x01 for valid, 0x00 for invalid. This is what Cachee's computation fingerprinting does automatically -- it hashes the inputs to the computation and stores the result, binding them together cryptographically.
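As an illustration of that fingerprint construction, here is a minimal sketch using the sha3 crate rather than the Cachee API; the "ml-dsa-65-verify" label folded into the hash stands in for the computation and version fields that Cachee binds in automatically:

// Minimal fingerprint sketch: SHA3-256 over a computation label plus the three
// verification inputs, with the boolean result encoded as a single byte.
use sha3::{Digest, Sha3_256};

fn verification_fingerprint(public_key: &[u8], message: &[u8], signature: &[u8]) -> [u8; 32] {
    let mut hasher = Sha3_256::new();
    hasher.update(b"ml-dsa-65-verify"); // domain separation for this computation
    hasher.update(public_key);          // 1,952 bytes for ML-DSA-65
    hasher.update(message);             // variable length
    hasher.update(signature);           // 3,309 bytes for ML-DSA-65
    hasher.finalize().into()
}

fn encode_result(is_valid: bool) -> [u8; 1] {
    if is_valid { [0x01] } else { [0x00] }
}

Because the public key and signature lengths are fixed for a given parameter set, the concatenation is unambiguous, and the leading label keeps ML-DSA-65 verification results from colliding with other cached computations over the same byte strings.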
Here is the size comparison for 1 million cached verifications:
| Caching Strategy | Per Entry | 1M Entries | 10M Entries |
|---|---|---|---|
| Cache full signature (ML-DSA-44) | 2,420 B | 2.26 GB | 22.6 GB |
| Cache full signature (ML-DSA-65) | 3,309 B | 3.08 GB | 30.8 GB |
| Cache full signature (ML-DSA-87) | 4,627 B | 4.31 GB | 43.1 GB |
| Cache verification result (any level) | 1 B | 0.95 MB* | 9.5 MB* |
*Value only. Adding the cache key (32-byte SHA3-256 fingerprint) and attestation metadata, the actual per-entry footprint with Cachee is approximately 80 bytes -- roughly 80 MB per million entries.
The reduction from ML-DSA-65 full signature caching to verification result caching is 3,309x per entry. For the total cache footprint with metadata overhead, the reduction is approximately 38x. Either way, it is the difference between needing roughly 30 GB of cache memory for 10 million entries and needing less than 1 GB.
Implementation: ML-DSA Verification Caching with Cachee
The following examples demonstrate how to implement ML-DSA verification caching. The pattern is the same regardless of which ML-DSA parameter set you use.
use cachee::{CacheeEngine, ComputationFingerprint, CacheContract};
// Define the verification computation type
let contract = CacheContract::new("ml-dsa-65-verify")
.freshness_ms(86_400_000) // 24 hours -- signatures don't expire quickly
.strict_mode(true)
.attestation_required(true);
// Compute the fingerprint from the verification inputs
let fingerprint = ComputationFingerprint::new()
.input(&public_key_bytes) // 1,952 bytes (ML-DSA-65)
.input(&message_bytes) // variable
.input(&signature_bytes) // 3,309 bytes (ML-DSA-65)
.computation("ml-dsa-65-verify")
.version("fips-204-final")
.hardware_class("graviton4")
.finalize(); // SHA3-256 -> 32 bytes
// Check cache first; fall back to full verification on a miss
// (engine is an already-configured CacheeEngine; construction omitted)
let is_valid = match engine.get(&fingerprint) {
    Some(result) => {
        // Cache hit: result is 1 byte (0x01 = valid, 0x00 = invalid)
        // Saved: 250-400us of verification computation
        // Saved: 3,309 bytes of signature transfer (if using a distributed cache)
        result[0] == 0x01
    }
    None => {
        // Cache miss: perform the full ML-DSA-65 verification
        let valid = ml_dsa_65_verify(&public_key_bytes, &message_bytes, &signature_bytes);
        // Cache the 1-byte result, not the 3,309-byte signature
        let result = if valid { vec![0x01] } else { vec![0x00] };
        engine.put(&fingerprint, &result, &contract);
        valid
    }
};
Notice what is cached: 1 byte. Not the 1,952-byte public key. Not the 3,309-byte signature. Not the variable-length message. The 1-byte boolean result. The fingerprint (32 bytes) is the cache key, computed from all three inputs. If any input changes -- different message, different signature, different public key -- the fingerprint changes and the cache returns a miss, forcing re-verification. The computation fingerprint guarantees correctness: a cached result is only returned when the exact same inputs are presented.
Batch Verification Caching
In high-throughput systems, ML-DSA verifications often arrive in batches. An API gateway verifying 32 requests per batch, each with its own ML-DSA-65 signature, can batch the cache lookups to amortize overhead.
// Batch verification with cache
let fingerprints: Vec<ComputationFingerprint> = requests.iter()
.map(|req| {
ComputationFingerprint::new()
.input(&req.public_key)
.input(&req.message)
.input(&req.signature)
.computation("ml-dsa-65-verify")
.version("fips-204-final")
.hardware_class("graviton4")
.finalize()
})
.collect();
// Batch lookup: 31ns * 32 = ~1us for all 32 lookups
let cached_results = engine.get_batch(&fingerprints);
// Only verify signatures that were not cached
let mut results = Vec::with_capacity(requests.len());
for (i, cached) in cached_results.iter().enumerate() {
match cached {
Some(result) => results.push(result[0] == 0x01),
None => {
let is_valid = ml_dsa_65_verify(
&requests[i].public_key,
&requests[i].message,
&requests[i].signature,
);
let result = if is_valid { vec![0x01] } else { vec![0x00] };
engine.put(&fingerprints[i], &result, &contract);
results.push(is_valid);
}
}
}
// With 80% cache hit rate:
// - 26 of 32 served from cache in ~806ns (31ns each)
// - 6 of 32 verified in ~1,800us (300us each)
// - Total: ~1,801us vs ~9,600us without caching (5.3x speedup)
The batch pattern matters because ML-DSA verification is CPU-bound. Each verification that hits the cache saves 250-400 microseconds of CPU time. At 80% cache hit rate on a 32-request batch, you save approximately 7,800 microseconds of CPU time per batch. That is 7.8 milliseconds of CPU reclaimed per batch, which at 1,000 batches per second translates to 7.8 CPU-seconds per wall-clock second -- almost 8 full CPU cores freed from signature verification.
Security Considerations for Verification Result Caching
Caching verification results raises a legitimate security question: what if an attacker poisons the cache to make an invalid signature appear valid? This is a real concern, and it requires specific mitigations.
Cache entry integrity. If an attacker can modify a cached verification result from 0x00 (invalid) to 0x01 (valid), they can bypass signature verification for any message/signature pair that has been previously verified and cached. This is why cache entry integrity is not optional for verification result caching. Cachee's triple PQ signatures on every cache entry prevent this attack: modifying a cached verification result invalidates the cache entry's own signatures, causing it to be rejected on read.
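The shape of that check, sketched here with a single attestation signature for brevity (this is not Cachee's internal entry format or its triple-signature scheme; ml_dsa_65_verify and the cache signing key are placeholders):

// Illustrative only: a cache entry that binds the 1-byte result to its
// fingerprint with the cache's own signature. Flipping 0x00 to 0x01 breaks
// the attestation, and the entry is treated as a miss.
struct AttestedEntry {
    fingerprint: [u8; 32],
    result: u8,            // 0x01 = valid, 0x00 = invalid
    attestation: Vec<u8>,  // signature over fingerprint || result
}

fn trusted_result(entry: &AttestedEntry, cache_signing_pk: &[u8]) -> Option<bool> {
    let mut signed = Vec::with_capacity(33);
    signed.extend_from_slice(&entry.fingerprint);
    signed.push(entry.result);
    if ml_dsa_65_verify(cache_signing_pk, &signed, &entry.attestation) {
        Some(entry.result == 0x01)
    } else {
        None // tampered or corrupted entry: discard it and re-verify the original signature
    }
}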
Cache key collision resistance. The cache key is SHA3-256(public_key || message || signature). If an attacker can find a collision -- a different (message, signature) pair that produces the same SHA3-256 hash -- they could cause a valid verification result to be returned for an invalid signature. SHA3-256 has 128-bit collision resistance, which is well beyond any foreseeable attack capability. This is the same hash function securing the rest of the PQ infrastructure.
Cache poisoning via first verification. An attacker might try to submit an invalid signature for verification, knowing that the 0x00 (invalid) result will be cached. If the attacker later discovers a valid signature for the same message, they would want the cached result to not interfere. This is handled naturally by the fingerprint: a different signature produces a different SHA3-256 fingerprint, so the cached invalid result does not affect lookups with the valid signature. The fingerprint includes all three inputs (public key, message, signature), so different signatures always produce different cache keys.
Time-of-check-to-time-of-use (TOCTOU). If a public key is revoked after a verification result is cached, the cached result remains valid until it expires. This is addressed by cache contracts. For authentication-critical verifications, set the freshness window to match your key revocation propagation time. If your PKI distributes CRLs every 15 minutes, set the cache freshness to 15 minutes. A cached verification result older than the freshness window is discarded, forcing re-verification against the current key state.
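In terms of the contract builder from the earlier example, a revocation-aware contract for a PKI that distributes CRLs every 15 minutes would look like this (same assumed API as above):

// Freshness bound to CRL propagation: cached verification results older than
// 15 minutes are discarded, forcing re-verification against current key state.
let revocation_aware_contract = CacheContract::new("ml-dsa-65-verify")
    .freshness_ms(900_000) // 15 minutes, matching the CRL distribution interval
    .strict_mode(true)
    .attestation_required(true);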
Migration Path: Classical to Post-Quantum Signature Caching
Most organizations are not migrating from "no signatures" to ML-DSA. They are migrating from Ed25519 or ECDSA to ML-DSA. The caching strategy should support both classical and post-quantum signatures during the migration period.
# cachee.toml -- dual signature verification caching
[contracts.ecdsa-p256-verify]
computation = "ecdsa-p256-verify"
freshness_ms = 3600000 # 1 hour
attestation = true
[contracts.ml-dsa-65-verify]
computation = "ml-dsa-65-verify"
freshness_ms = 86400000 # 24 hours
attestation = true
[contracts.hybrid-verify]
computation = "hybrid-ecdsa-mldsa65-verify"
freshness_ms = 3600000 # 1 hour (bound by shorter-lived classical key)
attestation = true
The hybrid verification contract caches the result of verifying both a classical and a post-quantum signature. The freshness window is bound by the shorter-lived key (the classical ECDSA key, which is typically rotated more frequently than PQ keys). During migration, you cache three types of verification results: classical-only, PQ-only, and hybrid. As endpoints migrate to PQ-only, the classical cache entries naturally expire and are not repopulated.
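A sketch of the hybrid path, reusing the fingerprinting pattern from the earlier examples; HybridRequest, ecdsa_p256_verify, and ml_dsa_65_verify are placeholders for your own request type and verification routines:

// Hybrid verification: both signatures must verify; only the combined 1-byte
// result is cached, under the hybrid contract with the shorter freshness window.
struct HybridRequest {
    ecdsa_public_key: Vec<u8>,
    ml_dsa_public_key: Vec<u8>,  // 1,952 bytes for ML-DSA-65
    message: Vec<u8>,
    ecdsa_signature: Vec<u8>,
    ml_dsa_signature: Vec<u8>,   // 3,309 bytes for ML-DSA-65
}

fn hybrid_verify_cached(
    engine: &CacheeEngine,
    contract: &CacheContract, // the hybrid-ecdsa-mldsa65-verify contract
    req: &HybridRequest,
) -> bool {
    let fingerprint = ComputationFingerprint::new()
        .input(&req.ecdsa_public_key)
        .input(&req.ml_dsa_public_key)
        .input(&req.message)
        .input(&req.ecdsa_signature)
        .input(&req.ml_dsa_signature)
        .computation("hybrid-ecdsa-mldsa65-verify")
        .version("fips-204-final")
        .hardware_class("graviton4")
        .finalize();

    if let Some(result) = engine.get(&fingerprint) {
        return result[0] == 0x01; // cache hit: both signatures already checked
    }

    // Cache miss: verify the classical and post-quantum signatures, cache 1 byte
    let is_valid = ecdsa_p256_verify(&req.ecdsa_public_key, &req.message, &req.ecdsa_signature)
        && ml_dsa_65_verify(&req.ml_dsa_public_key, &req.message, &req.ml_dsa_signature);
    let result = if is_valid { vec![0x01] } else { vec![0x00] };
    engine.put(&fingerprint, &result, contract);
    is_valid
}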
Performance at Scale: What the Numbers Mean
Let us put the numbers together for a production scenario. Consider an API gateway handling 500,000 requests per second, each requiring ML-DSA-65 signature verification.
Without caching: 500,000 verifications/sec at 300 microseconds each requires 150 CPU-seconds per wall-clock second. That is 150 CPU cores dedicated entirely to signature verification. On a 96-vCPU Graviton4 instance, you need two full instances just for signature verification, with no capacity left for anything else. Cost: approximately $4.60/hour for signature verification alone.
With verification result caching at 90% hit rate: 50,000 verifications/sec (10% cache miss) at 300 microseconds each requires 15 CPU-seconds per wall-clock second. The 450,000 cache hits at 31 nanoseconds consume 0.014 CPU-seconds. Total: 15.014 CPU cores. That is a 90% reduction in CPU requirement for signature verification, from 150 cores to 15. One Graviton4 instance handles the entire load with capacity to spare.
Cache memory for 500K verifications/sec at 90% hit rate: With a 24-hour freshness window, the cache holds approximately 43 million unique verification results (500K lookups/sec over an 86,400-second window, assuming roughly 0.1% of lookups carry a previously unseen key/message/signature tuple). At 80 bytes per entry (32-byte fingerprint + 1-byte result + metadata), that is 3.2 GB. Compare that to caching the signatures themselves: 43 million * 3,309 bytes is roughly 133 GB. The verification result strategy fits in a single instance's memory. The signature caching strategy requires a distributed cache cluster.
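The same capacity math as a small sketch, with the traffic rate, hit rate, verification cost, and per-entry size as explicit assumptions to replace with your own measurements:

// Capacity estimate for verification-result caching. Every input below is an
// assumption taken from the scenario above, not a measured value.
fn main() {
    let rate_per_sec: f64 = 500_000.0;      // verification requests per second
    let hit_rate: f64 = 0.90;               // cache hit rate
    let verify_us: f64 = 300.0;             // ML-DSA-65 verification cost (microseconds)
    let hit_ns: f64 = 31.0;                 // in-process lookup cost (nanoseconds)
    let unique_entries: f64 = 43_000_000.0; // unique results within the freshness window
    let entry_bytes: f64 = 80.0;            // fingerprint + result + metadata

    let miss_cores = rate_per_sec * (1.0 - hit_rate) * verify_us / 1_000_000.0;
    let hit_cores = rate_per_sec * hit_rate * hit_ns / 1_000_000_000.0;
    let cache_gib = unique_entries * entry_bytes / (1024.0 * 1024.0 * 1024.0);

    println!("CPU cores spent on misses: {:.1}", miss_cores); // ~15.0
    println!("CPU cores spent on hits:   {:.3}", hit_cores);  // ~0.014
    println!("Cache memory (GiB):        {:.1}", cache_gib);  // ~3.2
}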
The FIPS 204 Caching Formula
FIPS 204 ML-DSA signatures are 52x larger than Ed25519. Verification is roughly 10x slower. The correct caching strategy is to cache the 1-byte verification result, not the 3,309-byte signature. This delivers a 3,309x reduction in cached value size per entry, eliminates re-verification computation (300us saved per hit), and reduces network bandwidth by 3,309x for distributed lookups. At scale, the difference is between 150 CPU cores and 15. Between roughly 133 GB of cache memory and 3.2 GB. Between two dedicated instances and a fraction of one.
FIPS 204 is not a future requirement. It is a finalized standard that federal agencies are mandated to adopt and that the private sector is adopting voluntarily to protect long-lived signatures against future quantum forgery. The signature sizes are large. The verification is expensive. Caching is not optional for any system that verifies ML-DSA signatures at scale. But the caching strategy matters: cache the result, not the signature. Use computation fingerprints to bind results to inputs. Set freshness windows to match your key lifecycle. And use a cache that provides integrity verification on the cached results themselves, because a verification cache without its own integrity guarantees is a security liability masquerading as an optimization.
ML-DSA-65 signatures are 3,309 bytes. Cached verification results are 1 byte. Cachee serves both at 31ns with PQ attestation.