
NIST Level 5 Key Caching: ML-KEM-1024 + ML-DSA-87

May 1, 2026 | 16 min read | Engineering

NIST Level 5 is the ceiling. It targets the classical security of AES-256 -- a 256-bit security margin. In a world where quantum computers can run Grover's algorithm against symmetric ciphers, Level 5 algorithms are designed to require approximately 2^128 quantum operations to break. This is the security level mandated by the NSA's CNSA 2.0 suite for national security systems, and it is the appropriate choice for any cryptographic material that must remain secure for decades.

The cost of that maximum security is size. ML-KEM-1024 uses 1,568-byte public keys and 1,568-byte ciphertexts. ML-DSA-87 uses 2,592-byte public keys and 4,627-byte signatures. Combined, a single authenticated session carries 6,195 bytes of cryptographic material: one 1,568-byte ML-KEM ciphertext plus one 4,627-byte ML-DSA signature. That is 65x the 96 bytes required by classical X25519 plus ECDSA-P256 (a 32-byte public key and a 64-byte signature). At one million concurrent sessions, you are looking at 6.2 gigabytes of cache memory dedicated solely to cryptographic material. At ten million sessions, 62 gigabytes. These numbers change the cache architecture conversation fundamentally. Network-attached caches are not viable. In-process caching is not optional; it is the only path that works.

At a glance: 6,195 B of post-quantum material per Level 5 session, 65x the classical 96 B, and 6.2 GB of cache at one million sessions.

What NIST Level 5 Means in Practice

NIST's security levels are defined relative to the difficulty of attacking specific symmetric algorithms. Level 1 is as hard as brute-forcing AES-128. Level 3 is as hard as brute-forcing AES-192. Level 5 is as hard as brute-forcing AES-256. In quantum terms, Grover's algorithm provides a quadratic speedup for brute-force searches, reducing AES-256's effective security from 2^256 to 2^128 quantum operations. Level 5 post-quantum algorithms are designed so that breaking them with a quantum computer requires at least 2^128 quantum operations, matching the quantum-era security of AES-256.
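The level-to-AES mapping above can be sanity-checked with exact integer arithmetic. A small sketch (the dictionary is just a restatement of the levels in the previous paragraph):

```python
import math

# NIST PQC security level -> AES key size whose brute-force cost it must match.
AES_KEY_BITS = {1: 128, 3: 192, 5: 256}

for level, bits in AES_KEY_BITS.items():
    keyspace = 2**bits
    # Grover's quadratic speedup: sqrt(2^bits) = 2^(bits/2) quantum operations.
    grover_ops = math.isqrt(keyspace)
    print(f"Level {level}: classical 2^{bits}, quantum ~2^{grover_ops.bit_length() - 1}")
```

For Level 5 this prints `classical 2^256, quantum ~2^128`, matching the security bound in the text.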

The practical implication is longevity. Data encrypted or signed today with Level 5 algorithms is protected against quantum attacks for as long as the underlying mathematical problems remain hard. For lattice-based schemes like ML-KEM-1024 and ML-DSA-87, the hardness assumption is the Module Learning With Errors (MLWE) problem at higher dimensions. The larger parameters at Level 5 make the lattice dimension and noise parameters large enough that even the most optimistic projections for quantum lattice algorithms cannot reach the security bound within any foreseeable timeframe.

Level 5 is not universally necessary. It is specifically required by CNSA 2.0 for national security systems, appropriate for root certificate authorities whose certificates are valid for 20-30 years, necessary for code signing keys that protect critical infrastructure, and recommended for financial transaction signatures that may need to be verified decades later for legal or compliance reasons. For session tokens that expire in 15 minutes, Level 5 is overkill. The security level should match the data's secrecy or authenticity lifetime.

ML-KEM-1024: The Key Encapsulation Component

ML-KEM-1024 (FIPS 203) is the Level 5 variant of the NIST-standardized key encapsulation mechanism, formerly known as CRYSTALS-Kyber. It encapsulates a 32-byte shared secret using lattice-based cryptography with the following parameters: a module dimension of k=4, a polynomial degree of n=256, and a modulus of q=3329. The resulting public key is 1,568 bytes, the ciphertext is 1,568 bytes, and the shared secret is 32 bytes.
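Those sizes are not arbitrary; they fall out of the FIPS 203 encoding rules. A quick arithmetic check (du = 11 and dv = 5 are the ciphertext compression bit-widths for the ML-KEM-1024 parameter set):

```python
K, N = 4, 256       # ML-KEM-1024 module dimension and polynomial degree
DU, DV = 11, 5      # ciphertext compression bit-widths for this parameter set

# Public key: k polynomials at 12 bits per coefficient, plus the 32-byte seed rho.
pk_bytes = K * N * 12 // 8 + 32

# Ciphertext: k polynomials compressed to du bits each, plus one at dv bits.
ct_bytes = K * N * DU // 8 + N * DV // 8

print(pk_bytes, ct_bytes)  # 1568 1568
```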

The 1,568-byte public key is the persistent cache burden. In a TLS-like key exchange, the server's ML-KEM-1024 public key must be available for every incoming connection. If you are handling 100,000 new connections per second, the server reads the public key 100,000 times per second. With an in-process cache, each read is a 35-nanosecond hash map lookup. With a network-attached cache, each read involves serializing a request, sending it over TCP, waiting for the response, and deserializing 1,568 bytes. At best, this takes 0.3 milliseconds per read. At 100,000 reads per second, that is 30 seconds of cumulative network latency per second -- an impossibility that immediately eliminates network-attached caching from the Level 5 architecture.
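The arithmetic behind that elimination, under the stated assumptions (100,000 reads per second, 0.3 ms per best-case network GET, 35 ns per in-process hit):

```python
reads_per_sec = 100_000
inproc_hit_s = 35e-9      # in-process hash map lookup
network_get_s = 0.3e-3    # best-case network-attached GET for a 1,568-byte key

# Cumulative latency incurred per wall-clock second of traffic.
inproc_load_s = reads_per_sec * inproc_hit_s
network_load_s = reads_per_sec * network_get_s

print(f"in-process: {inproc_load_s * 1000:.1f} ms/s, network: {network_load_s:.0f} s/s")
```

The network path accumulates 30 seconds of latency per second of traffic; the in-process path accumulates 3.5 milliseconds.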

The ciphertext is 1,568 bytes per encapsulation. Unlike public keys, which are cached and reused across many connections, each ciphertext is unique to a specific session. Ciphertexts are not typically cached for reuse; they are processed once during key establishment and then discarded or archived. However, if you need to store session establishment records for audit purposes, the 1,568-byte ciphertext per session adds up. At one million sessions per day, that is 1.57 GB of ciphertext data per day for audit logs alone.

ML-KEM-1024 Performance

Key generation for ML-KEM-1024 takes approximately 0.1-0.2 milliseconds on modern hardware. Encapsulation takes approximately 0.15-0.25 milliseconds. Decapsulation takes approximately 0.15-0.25 milliseconds. These are all faster than a network round-trip to a cache server, which reinforces the architectural point: you cannot afford to fetch key material from a remote store for every operation. The cryptographic operations themselves are faster than the network overhead of fetching their inputs. The only sensible architecture is to keep the key material in-process.

ML-DSA-87: The Signature Component

ML-DSA-87 (FIPS 204, formerly CRYSTALS-Dilithium Level 5) is the Level 5 variant of the NIST-standardized digital signature scheme. It uses module lattices with a module dimension of k=8, l=7, a polynomial degree of n=256, and a modulus of q=8380417. The public key is 2,592 bytes, the private key is 4,896 bytes, and the signature is 4,627 bytes.
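As with ML-KEM, the sizes follow from the FIPS 204 encoding. A quick check (omega = 75 is the hint weight for this parameter set, and the z vector packs 20 bits per coefficient because gamma1 = 2^19):

```python
K, L, N = 8, 7, 256   # ML-DSA-87 module dimensions (k, l) and polynomial degree
OMEGA = 75            # maximum weight of the hint vector h
LAMBDA = 256          # collision strength of the challenge hash, in bits

# Public key: 32-byte seed + t1, with k polynomials at 10 bits per coefficient.
pk_bytes = 32 + K * N * 10 // 8

# Signature: challenge hash (lambda/4 bytes) + z packed at 20 bits per
# coefficient (gamma1 = 2^19) + encoded hint (omega + k bytes).
sig_bytes = LAMBDA // 4 + L * N * 20 // 8 + (OMEGA + K)

print(pk_bytes, sig_bytes)  # 2592 4627
```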

The signature size is the dominant cache concern. At 4,627 bytes per signature, Level 5 ML-DSA signatures are 6.7x larger than Level 1 FALCON-512 signatures (690 bytes), 1.4x larger than Level 3 ML-DSA-65 signatures (3,309 bytes), and 72x larger than classical ECDSA-P256 signatures (64 bytes). In a system that caches the most recent signature per session for ongoing verification, 4,627 bytes per session is a significant memory commitment.

The public key at 2,592 bytes is also substantial. In a signature verification flow, the verifier needs the signer's public key. If the public key is fetched from a remote store on every verification, the 2,592-byte transfer plus network overhead adds 0.3-0.8 milliseconds per verification. For a system performing 100,000 verifications per second, that is 30-80 seconds of cumulative network latency per second. In-process caching of public keys eliminates this entirely, reducing the retrieval to a 35-nanosecond hash map lookup.

ML-DSA-87 Verification Performance

ML-DSA-87 verification takes approximately 0.3-0.5 milliseconds on modern hardware. This is the operation that benefits most from caching. When you cache the verification result (not the public key, but the fact that a specific signature over specific data was verified as valid), you eliminate the 0.3-0.5 millisecond verification cost on cache hits. The cached result lookup takes 35 nanoseconds. The speedup is approximately 10,000-15,000x for verification result caching.

At 100,000 signature verifications per second with a 90% cache hit rate, the savings are dramatic. Without caching: 100,000 verifications at 0.4 milliseconds each equals 40 seconds of CPU time per second, requiring 40 CPU cores dedicated to signature verification. With caching: 10,000 misses at 0.4 milliseconds plus 90,000 hits at 35 nanoseconds equals 4.003 seconds of CPU time per second, requiring 4 CPU cores. The cache frees 36 CPU cores from verification duty.
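A verification-result cache of this kind can be sketched in a few lines. `verify_fn` below is a stand-in for whatever ML-DSA-87 verifier your library provides (the name is hypothetical, not a real API):

```python
import hashlib

class VerifyCache:
    """Cache verification *results*, keyed by a digest of (pubkey, message, signature).

    On a hit, the 0.3-0.5 ms ML-DSA-87 verification is replaced by a dict lookup.
    """
    def __init__(self, verify_fn):
        self.verify_fn = verify_fn   # hypothetical: (pk, msg, sig) -> bool
        self.results = {}

    def verify(self, pubkey: bytes, message: bytes, signature: bytes) -> bool:
        # Production code should length-prefix the fields before hashing to
        # avoid ambiguity at the concatenation boundaries.
        key = hashlib.sha3_256(pubkey + message + signature).digest()
        if key not in self.results:
            self.results[key] = self.verify_fn(pubkey, message, signature)
        return self.results[key]
```

With the numbers above: 10,000 misses at 0.4 ms plus 90,000 hits at 35 ns is about 4.003 s of CPU time per second, versus 40 s uncached.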

The Memory Math at Level 5

The combined per-session footprint for ML-KEM-1024 + ML-DSA-87 is 6,195 bytes of cryptographic material. Adding session metadata (timestamps, identifiers, state flags) of approximately 135 bytes brings the total to approximately 6,330 bytes per session. Here is what that means at scale.

| Active Sessions | Crypto Material | Total with Metadata | Classical Equivalent |
| --- | --- | --- | --- |
| 10,000 | 62 MB | 63.3 MB | 0.96 MB |
| 100,000 | 620 MB | 633 MB | 9.6 MB |
| 1,000,000 | 6.2 GB | 6.33 GB | 96 MB |
| 10,000,000 | 62 GB | 63.3 GB | 960 MB |
| 100,000,000 | 620 GB | 633 GB | 9.6 GB |
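Every row above reduces to three per-session constants (the 135-byte metadata figure is the estimate from the preceding paragraph):

```python
CRYPTO_B = 1_568 + 4_627   # ML-KEM-1024 ciphertext + ML-DSA-87 signature = 6,195 B
META_B = 135               # estimated session metadata (timestamps, IDs, flags)
CLASSICAL_B = 96           # X25519 public key (32 B) + ECDSA-P256 signature (64 B)

for sessions in (10_000, 100_000, 1_000_000, 10_000_000, 100_000_000):
    crypto_gb = sessions * CRYPTO_B / 1e9
    total_gb = sessions * (CRYPTO_B + META_B) / 1e9
    classical_gb = sessions * CLASSICAL_B / 1e9
    print(f"{sessions:>11,}  {crypto_gb:8.3f} GB  {total_gb:8.3f} GB  {classical_gb:7.4f} GB")
```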

At one million sessions, 6.2 GB of cryptographic material requires a server with at least 16 GB of RAM to leave room for the operating system, application logic, and other memory consumers. This is feasible on a single server -- modern cloud instances with 64-256 GB of RAM are readily available. But the memory is dedicated to session caching and cannot be shared with other in-process workloads.

At ten million sessions, 62 GB is at the upper limit of single-server memory for standard instances. You either need a high-memory instance (256 GB or more) or a distributed in-process caching architecture where sessions are partitioned across multiple servers. Note that "distributed in-process" is different from "network-attached cache." Each server owns a partition of the session space and serves those sessions from its own in-process cache. There is no central cache server. The routing layer directs sessions to their owning server.

Redis Is Not Viable at Level 5

A single Redis GET for a 6.2 KB value takes approximately 0.8 milliseconds when you account for TCP round-trip, protocol parsing, and data transfer. An in-process hash map lookup for the same value takes 31-35 nanoseconds. That is a 25,806x difference. At 100,000 session lookups per second, Redis would consume 80 seconds of cumulative network latency per second -- an impossible load. At Level 5 sizes, in-process caching is not an optimization. It is a requirement.

Why In-Process Caching Is Non-Negotiable

The argument for in-process caching at Level 5 is not about preference. It is about physics. The speed of light through fiber optic cable is approximately 200,000 kilometers per second. A round-trip within a data center (1 kilometer of cable) takes approximately 10 microseconds for the light itself. Add switch latency, protocol overhead, serialization, and deserialization, and a cache lookup over the network takes 0.3-1.0 milliseconds. This is a hard floor that no amount of network optimization can reduce below 0.1 milliseconds for non-trivial payloads.

An in-process hash map lookup does not cross a network boundary. It computes a hash of the key (approximately 10-20 nanoseconds for a 32-byte SHA3-256 fingerprint), indexes into the hash table (a few nanoseconds for pointer arithmetic and memory access), and returns a pointer to the cached value (zero-copy). The total is 31-35 nanoseconds. There is no serialization, no TCP handshake, no protocol parsing, no data copy over the network. The value is already in the same address space as the application that needs it.

At Level 5, the network overhead for a 6.2 KB payload is proportionally worse than at Level 1 (1.49 KB). Network-attached caches transfer the full value on every GET. A 6.2 KB value takes approximately 4x longer to transfer over the network than a 1.49 KB value. Combined with the higher number of bytes per session, this means that Level 5 shifts the break-even point where network-attached caching becomes impractical to a much lower session count. At Level 1, you can get away with Redis up to perhaps 500,000 sessions. At Level 5, the network becomes a bottleneck at 50,000 sessions.

In-Process Architecture for Level 5

The in-process cache for Level 5 sessions uses a sharded DashMap with CacheeLFU eviction. Sharding across 64 segments (the default in DashMap) ensures that concurrent reads from different threads do not contend for the same lock. CacheeLFU eviction tracks access frequency per entry, evicting the least frequently accessed sessions when the cache reaches capacity. This is critical at Level 5 because the cost of a cache miss is high: re-fetching a 2,592-byte public key and re-verifying a 4,627-byte signature is orders of magnitude more expensive than at Level 1.
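A minimal Python sketch of the same idea -- lock-striped shards plus least-frequently-used eviction -- standing in for the DashMap + CacheeLFU design (the real Rust implementation differs in detail; this only illustrates the structure):

```python
import threading
from collections import Counter

class ShardedLfuCache:
    """Sharded in-process cache with frequency-based eviction.

    64 lock-striped shards keep concurrent readers from contending on one
    lock; per-entry access counts let eviction drop the least frequently
    used session first, since a Level 5 miss is expensive to repair.
    """
    def __init__(self, capacity_per_shard=8192, shards=64):
        self.capacity = capacity_per_shard
        self.shards = [({}, Counter(), threading.Lock()) for _ in range(shards)]

    def _shard(self, key: bytes):
        return self.shards[hash(key) % len(self.shards)]

    def get(self, key: bytes):
        entries, freq, lock = self._shard(key)
        with lock:
            if key in entries:
                freq[key] += 1
                return entries[key]
            return None

    def put(self, key: bytes, value: bytes):
        entries, freq, lock = self._shard(key)
        with lock:
            if len(entries) >= self.capacity and key not in entries:
                victim = min(entries, key=lambda k: freq[k])  # least frequently used
                del entries[victim]
                del freq[victim]
            entries[key] = value
            freq[key] += 1
```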

Memory management for Level 5 caches requires careful attention to fragmentation. When entries are 6+ KB each and the cache holds millions of entries, memory allocation and deallocation patterns can create fragmentation that wastes 20-30% of allocated memory. Using a slab allocator or arena allocator for cache entries mitigates this. Each entry is allocated from a fixed-size slab (8 KB for Level 5, allowing room for metadata alongside the 6,330-byte payload), which eliminates fragmentation at the cost of some internal waste per entry.

The FALCON-1024 Alternative

ML-DSA-87 is not the only Level 5 signature option. FALCON-1024 also provides Level 5 security (technically NIST Level 5 equivalent for the NTRU lattice problem) with dramatically smaller signatures: 1,330 bytes versus ML-DSA-87's 4,627 bytes. That is 3.5x smaller. The trade-off is implementation complexity. FALCON requires careful constant-time Gaussian sampling during signing, which is harder to implement securely than ML-DSA's rejection sampling approach.

| Parameter | ML-DSA-87 | FALCON-1024 | Difference |
| --- | --- | --- | --- |
| Security level | NIST Level 5 | NIST Level 5* | Equivalent |
| Public key | 2,592 B | 1,793 B | FALCON 31% smaller |
| Signature | 4,627 B | 1,330 B | FALCON 3.5x smaller |
| Combined (sig + pk) | 7,219 B | 3,123 B | FALCON 2.3x smaller |
| 1M sessions (sig + pk) | 7.22 GB | 3.12 GB | FALCON saves 4.1 GB |
| Hardness assumption | Module LWE | NTRU lattices | Different math |
| Keygen complexity | Simple | Complex (Gaussian) | ML-DSA easier |

The cache implication is substantial. Switching from ML-DSA-87 to FALCON-1024 for the signature component (while keeping ML-KEM-1024 for key exchange) reduces the per-session signature-related overhead from 7,219 bytes to 3,123 bytes, a 2.3x reduction. At one million sessions, this saves 4.1 GB of cache memory. At ten million sessions, it saves 41 GB. If your primary constraint is cache memory, the FALCON-1024 variant of Level 5 is significantly more cache-friendly while providing equivalent security.

The decision between ML-DSA-87 and FALCON-1024 is not purely about size. ML-DSA-87 is simpler to implement securely, has a more straightforward constant-time implementation, and is the default choice in FIPS 204. FALCON-1024 requires careful implementation of the trapdoor sampler, and bugs in the sampler can leak private key information. For organizations that prioritize implementation simplicity and have the memory budget for 4,627-byte signatures, ML-DSA-87 is the safer choice. For organizations that are constrained on cache memory and have the cryptographic engineering expertise to implement FALCON correctly, FALCON-1024 offers compelling size savings.

Cache Architecture: L1 / L2 / Archival

At Level 5, a single-tier cache is not always sufficient. The memory requirements are large enough that a tiered caching architecture provides better cost efficiency. The tier structure separates active sessions (L1), recently active sessions (L2), and archived sessions (archival export).

L1: In-Process Active Sessions

The L1 cache holds currently active sessions -- sessions that have had a request within the last TTL window (typically 15-60 minutes). This is the in-process DashMap with CacheeLFU eviction described above. The L1 cache is sized for the expected number of concurrent active sessions, not the total number of sessions that have ever existed. If you have 10 million registered users but only 500,000 are active at any given time, L1 holds 500,000 entries at approximately 3.17 GB. This is manageable on a single server.

L2: Warm Storage for Recently Active Sessions

The L2 cache holds sessions that were recently active but have not had a request in the last TTL window. These sessions might become active again (a user who steps away for 30 minutes and returns), so keeping their cryptographic material in a fast-access store avoids the cost of re-establishing the session from scratch. L2 can be backed by a memory-mapped file, a local SSD-backed store, or a compressed in-memory structure. The access latency for L2 is 1-10 microseconds, which is 30-300x slower than L1 but 100-1000x faster than re-establishing a Level 5 session from a database.

L2 is where compression becomes valuable. Level 5 cryptographic material has moderate compressibility. ML-DSA-87 signatures contain structured polynomial coefficients that compress to approximately 60-70% of their original size with LZ4. A 4,627-byte signature compresses to approximately 3,000-3,200 bytes. Across one million L2 entries, this saves 1.4-1.6 GB of memory. The decompression cost (approximately 0.5 microseconds for LZ4 on a 4.6 KB payload) is negligible compared to the L2 access latency.
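A minimal two-tier sketch of the L1/L2 design, using zlib from the Python standard library as a stand-in for LZ4 (the demotion policy is simplified to an arbitrary entry; a real cache would demote by idle time):

```python
import zlib

class TieredCache:
    """Two-tier session cache: L1 holds raw bytes, L2 holds compressed bytes."""
    def __init__(self, l1_capacity: int):
        self.l1_capacity = l1_capacity
        self.l1 = {}   # session_id -> raw cryptographic material (fast path)
        self.l2 = {}   # session_id -> compressed material (warm path)

    def get(self, session_id: str):
        if session_id in self.l1:
            return self.l1[session_id]                  # nanosecond-scale path
        if session_id in self.l2:
            value = zlib.decompress(self.l2.pop(session_id))
            self.put(session_id, value)                 # promote back to L1
            return value                                # microsecond-scale path
        return None                                     # miss: rebuild the session

    def put(self, session_id: str, value: bytes):
        if len(self.l1) >= self.l1_capacity:
            # Demote an arbitrary entry to L2, compressing on the way down.
            old_id, old_value = self.l1.popitem()
            self.l2[old_id] = zlib.compress(old_value)
        self.l1[session_id] = value
```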

Archival Export

Sessions that have been inactive for longer than the L2 retention period are exported to archival storage. For Level 5 use cases (national security, long-lived certificates, financial records), the session establishment records may need to be retained for years or decades for audit and compliance purposes. Archival export writes the session's cryptographic material (public keys, signatures, ciphertexts) to a durable store (S3, database, HSM-backed archive) and removes the entry from both L1 and L2. The export is batched and asynchronous to avoid blocking the hot path.

The archival export format should include the full session establishment record: the ML-KEM-1024 public key and ciphertext (for key exchange verification), the ML-DSA-87 public key and signature (for attestation verification), timestamps, session identifiers, and a cryptographic binding to the session's H33-74 attestation if applicable. This bundle is approximately 12-15 KB per session when including all metadata. At one million archived sessions per day, that is 12-15 GB of archival data per day, which is manageable with standard object storage pricing.
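A sketch of the record bundling; the field names and JSON encoding are illustrative, not a fixed format:

```python
import base64
import json

def archival_record(session_id: str, kem_pk: bytes, kem_ct: bytes,
                    dsa_pk: bytes, signature: bytes, established_at: float) -> bytes:
    """Bundle one Level 5 session establishment record for durable storage.

    A real deployment would pin a schema version and add the attestation
    binding described above where applicable.
    """
    b64 = lambda b: base64.b64encode(b).decode()
    record = {
        "v": 1,
        "session_id": session_id,
        "established_at": established_at,
        "ml_kem_1024": {"public_key": b64(kem_pk), "ciphertext": b64(kem_ct)},
        "ml_dsa_87": {"public_key": b64(dsa_pk), "signature": b64(signature)},
    }
    return json.dumps(record, separators=(",", ":")).encode()
```

Base64 inflates the 10,355 bytes of raw material (both public keys, ciphertext, and signature) by about 4/3, which is why a full record lands in the 12-15 KB range quoted above.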

Where Level 5 Is Required

Level 5 is not the default choice. It is the choice when the consequences of a cryptographic break are severe enough to justify the infrastructure cost. The following use cases require Level 5 or an equivalent security margin.

National Security Systems (CNSA 2.0)

The NSA's Commercial National Security Algorithm Suite 2.0 (CNSA 2.0) mandates specific post-quantum algorithms and security levels for national security systems. For digital signatures, CNSA 2.0 requires ML-DSA-87 (Level 5) or, in the future, FALCON-1024. For key establishment, it requires ML-KEM-1024 (Level 5). These are not recommendations; they are requirements. Systems that process classified or sensitive government data must use Level 5 algorithms. The cache architecture for these systems must accommodate the 6,195-byte per-session footprint as a non-negotiable constraint.

Root Certificate Authorities

Root CA certificates are valid for 20-30 years. A quantum computer that breaks the root CA's signature invalidates every certificate issued under that root. The cost of a root CA compromise is catastrophic: revocation and reissuance of every downstream certificate, loss of trust in the entire certificate hierarchy, and potential for undetectable man-in-the-middle attacks during the revocation period. Level 5 provides the maximum security margin for root CA keys, ensuring that even aggressive projections for quantum computing development do not threaten the root's integrity within its validity period.

Code Signing for Critical Infrastructure

Code signing keys for operating systems, firmware, industrial control systems, and medical devices protect the integrity of software that may run for decades. A signing key compromise allows an attacker to distribute malicious updates that appear authentic. Level 5 ensures that the signing key remains secure against quantum attacks for the entire operational lifetime of the signed software. The cache requirements are modest for code signing (typically hundreds to thousands of active signing sessions, not millions), so the Level 5 memory overhead is manageable.

Financial Transaction Signatures

Financial regulators increasingly require that transaction records be verifiable for 7-10 years or longer. A signature scheme that can be broken within that window exposes the financial institution to fraud claims that cannot be cryptographically refuted. Level 5 provides assurance that transaction signatures remain verifiable and unforgeable for the regulatory retention period, even if quantum computers become available during that period. The cache architecture for financial transaction signing is typically high-throughput (millions of signatures per day) with shorter session lifetimes (minutes), so the active session count in L1 is manageable even at Level 5 sizes.

Benchmarks: Level 5 Cache Performance

We benchmarked Level 5 session caching on production hardware (c8g.metal-48xl, 192 vCPUs, Graviton4) with realistic session counts and access patterns. The benchmark simulates a mixed workload with 500,000 active sessions (L1), 2 million warm sessions (L2), and an access pattern where 70% of requests hit L1, 25% hit L2, and 5% are misses requiring full session reconstruction.

| Metric | Level 5 (ML-KEM-1024 + ML-DSA-87) | Level 1 (FALCON-512 + ML-KEM-512) |
| --- | --- | --- |
| L1 cache memory | 3.17 GB | 745 MB |
| L1 lookup latency | 35 ns | 33 ns |
| L2 lookup latency | 4.2 µs | 1.8 µs |
| Cache miss penalty | 1.2 ms | 0.4 ms |
| Weighted avg latency | 0.085 µs | 0.044 µs |
| Throughput (single core) | 11.7M lookups/sec | 22.7M lookups/sec |
| Verification result cache | 31 ns (hit) | 31 ns (hit) |

The L1 lookup latency is nearly identical between Level 5 and Level 1 (35 ns vs 33 ns) because the hash map lookup time is dominated by the key hash computation, not the value size. The value is returned as a pointer; no data copy occurs during the lookup. The difference in throughput is due to cache line effects: larger values cause more L3 cache pollution during sequential scans, which slightly reduces throughput under high concurrency.

The L2 lookup latency is 2.3x higher at Level 5 because the compressed value is larger (approximately 4 KB compressed vs 1.1 KB at Level 1), requiring more time for decompression and memory copy. The miss penalty is 3x higher because reconstructing a Level 5 session requires generating larger keys and performing more expensive cryptographic operations.

Implementation with Cachee

# Initialize Cachee for NIST Level 5 sessions
cachee init --pq-level 5 --l1-capacity 500000 --l2-capacity 2000000

# Configure L2 warm storage
cachee config set l2.backend mmap
cachee config set l2.compression lz4
cachee config set l2.path /var/cachee/l2

# Start with tiered caching
cachee start

# Monitor Level 5 cache metrics
cachee status --level-5

# Output:
# NIST Level 5 session cache:
#   L1 entries:   487,293 / 500,000
#   L1 memory:    3.09 GB
#   L1 hit rate:  94.2%
#   L1 avg hit:   35ns
#   L2 entries:   1,847,293 / 2,000,000
#   L2 memory:    7.8 GB (compressed)
#   L2 hit rate:  82.1%
#   L2 avg hit:   4.2us
#   Eviction:     CacheeLFU
#   PQ bundle:    ML-KEM-1024 + ML-DSA-87

The Bottom Line

NIST Level 5 with ML-KEM-1024 + ML-DSA-87 provides maximum post-quantum security at 6,195 bytes per session -- 65x classical. At this scale, in-process caching is non-negotiable: Redis at 6.2 KB per GET takes 0.8 ms versus 31 ns in-process, a 25,806x gap. Use tiered caching: L1 in-process for active sessions, L2 compressed warm storage for recently active, archival export for compliance. Consider FALCON-1024 as an alternative signature scheme for 3.5x smaller signatures at equivalent security. Reserve Level 5 for national security systems, root CAs, code signing, and long-lived financial records. For everything else, Level 1 or Level 3 provides sufficient security at a fraction of the cache cost.

Level 5 post-quantum caching at 31 nanoseconds. Maximum security, minimum latency.

brew install cachee