FALCON-1024 vs ML-DSA-87: Cache NIST L5

May 1, 2026 | 16 min read | Engineering

NIST Level 5 is the highest security category in the post-quantum standardization effort. It claims security equivalent to AES-256 -- meaning a quantum attacker with a cryptographically relevant quantum computer and unlimited classical resources cannot break the scheme without effort equivalent to brute-forcing a 256-bit symmetric key. Two signature schemes compete at this level: FALCON-1024, producing 1,330-byte signatures, and ML-DSA-87, producing 4,627-byte signatures. The 3.5x size difference has direct consequences for any system that caches signatures at scale.

This is not a theoretical comparison. When your system caches 1 million Level 5 signatures, the choice between FALCON-1024 and ML-DSA-87 is the choice between 1.33 GB and 4.63 GB of raw signature data, before any per-entry cache overhead. At 10 million signatures, it is 13.3 GB vs 46.3 GB. The algorithm you choose determines whether your cache fits in a single process or requires a distributed system with all the latency and complexity that entails.

FALCON-1024 signature: 1,330 B
ML-DSA-87 signature: 4,627 B
Size difference: 3.5x

FALCON-1024: NTRU at 256-bit Security

FALCON-1024 is the NIST Level 5 parameter set for FALCON. It doubles the ring dimension from 512 to 1024, operating in Z[x]/(x^1024 + 1) with modulus q = 12289. The doubling of the ring dimension increases all key and signature sizes compared to FALCON-512, but the increase is sublinear for signatures due to the compression scheme.

The public key is 1,793 bytes: 1 byte header plus 1,792 bytes for the polynomial h encoded as 1024 coefficients of 14 bits each. The signature is 1,330 bytes on average: 41 bytes of header (including the 40-byte salt) plus approximately 1,289 bytes for the compressed signature polynomial. The private key is 2,305 bytes: a 1-byte header plus the NTRU basis polynomials f and g (5 bits per coefficient, 640 bytes each) and F (8 bits per coefficient, 1,024 bytes). G and the sampling tree for the larger ring are not stored in the encoded key; they are recomputed from f, g, and F when the key is loaded.
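The public key and signature sizes quoted above are simple arithmetic, and a few lines of Rust make them checkable. This is a minimal sketch; the function names are illustrative and do not come from any FALCON library.

```rust
/// Encoded FALCON public key size in bytes: a 1-byte header plus
/// `n` coefficients of the polynomial h at 14 bits each.
pub fn falcon_public_key_bytes(n: usize) -> usize {
    1 + n * 14 / 8
}

/// Signature size in bytes: a 41-byte header (which includes the 40-byte salt)
/// plus the compressed signature polynomial, whose length varies per signature.
pub fn falcon_signature_bytes(compressed_poly_bytes: usize) -> usize {
    41 + compressed_poly_bytes
}
```

`falcon_public_key_bytes(1024)` gives 1,793 and `falcon_signature_bytes(1289)` gives 1,330, matching the figures in the text.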

The NTRU lattice at dimension 1024 has a different security landscape than at dimension 512. The best known attacks against NTRU-1024 are lattice reduction algorithms (BKZ with progressive sieving) applied to the NTRU lattice of rank 2048. The estimated classical cost is 2^371 operations and the estimated quantum cost (using Grover-accelerated sieving) is 2^337 operations, both of which exceed the NIST Level 5 threshold of 2^256 equivalent security. The security margin is comfortable, though not as wide as for some other Level 5 schemes.

Key Generation and Signing Costs

FALCON-1024 key generation is expensive: approximately 20-30 milliseconds on production hardware (Graviton4). This is roughly 2.5x the cost of FALCON-512 key generation (8-12ms), which reflects the superlinear scaling of the NTRU basis computation with ring dimension. Key generation involves finding polynomials f, g such that the NTRU lattice has good geometry for Gaussian sampling. This requires iterating over candidate (f, g) pairs and testing whether the resulting lattice basis meets the quality threshold. The expected number of iterations is small (typically 1-3), but each iteration involves an O(n^2) Gram-Schmidt computation on the 2n-dimensional lattice basis.

Signing with FALCON-1024 takes approximately 1.0-1.2 milliseconds, roughly 2x the cost of FALCON-512 (0.5ms). The signing algorithm is fast Fourier sampling over the NTRU tree, which has depth log2(1024) = 10 (versus depth 9 for FALCON-512). Each tree level involves O(n) operations in the ring, so the total signing cost scales as O(n * log(n)), giving the roughly 2x increase.

Verification with FALCON-1024 takes approximately 2.1 microseconds, compared to 1.2 microseconds for FALCON-512. The verification involves decompressing the signature, computing a polynomial product (c - s2*h mod q using a size-1024 NTT), and checking the norm bound. By O(n log n) scaling, the NTT at size 1024 costs roughly 2.2x the size-512 NTT, which accounts for most of the 1.75x latency increase.
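The scaling argument used for signing and verification can be checked with a one-line model. The sketch below assumes cost is exactly proportional to n log2(n), which ignores constant factors and cache effects; the function name is illustrative.

```rust
/// Relative cost of an O(n log n) operation when the problem size
/// grows from `from` to `to`, assuming cost is proportional to n * log2(n).
pub fn n_log_n_ratio(from: u32, to: u32) -> f64 {
    let cost = |n: u32| n as f64 * (n as f64).log2();
    cost(to) / cost(from)
}
```

`n_log_n_ratio(512, 1024)` evaluates (1024 x 10) / (512 x 9), a little over 2x, which is the source of the "roughly 2x" figures for the step from FALCON-512 to FALCON-1024.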

ML-DSA-87: Module Lattices at 256-bit Security

ML-DSA-87 is the NIST Level 5 parameter set for the ML-DSA standard (FIPS 204), based on the CRYSTALS-Dilithium scheme. It uses module lattices over the ring Z_q[x]/(x^256 + 1) with q = 8380417. The "87" denotes the module dimensions: k=8 rows and l=7 columns in the matrix A that defines the module lattice. This is the largest ML-DSA parameter set, and the increased module dimensions are what push the signature and key sizes up.

The public key is 2,592 bytes: 32 bytes for the seed that generates the matrix A (via an XOF), plus 2,560 bytes for the vector t1 (the high-order bits of As1 + s2, encoded as k*256 coefficients of 10 bits each). The signature is 4,627 bytes: 64 bytes for the challenge hash, 4,480 bytes for the response vector z (encoded as l*256 coefficients of 20 bits each), and 83 bytes for the hint vector h (a sparse binary vector indicating where rounding overflows occurred). The private key is 4,896 bytes.
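To make the "k*256 coefficients of 10 bits each" arithmetic concrete, here is a minimal bit-packing sketch. It packs least-significant bit first, which is an illustrative assumption; FIPS 204 is the authority on the normative encoding, and both function names are hypothetical.

```rust
/// Packs each coefficient into `bits` bits, least-significant bit first.
/// Bits of a coefficient above position `bits` are ignored.
pub fn pack_bits(coeffs: &[u32], bits: u32) -> Vec<u8> {
    let total_bits = coeffs.len() * bits as usize;
    let mut out = vec![0u8; (total_bits + 7) / 8];
    let mut pos = 0usize; // index of the next output bit
    for &c in coeffs {
        for i in 0..bits {
            if (c >> i) & 1 == 1 {
                out[pos / 8] |= 1 << (pos % 8);
            }
            pos += 1;
        }
    }
    out
}

/// ML-DSA public key size in bytes: a 32-byte seed plus t1,
/// which holds k polynomials of 256 coefficients at 10 bits each.
pub fn ml_dsa_public_key_bytes(k: usize) -> usize {
    32 + k * 256 * 10 / 8
}
```

Packing 8 x 256 ten-bit coefficients yields 2,560 bytes, and `ml_dsa_public_key_bytes(8)` gives the 2,592-byte total.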

The signing algorithm for ML-DSA-87 uses rejection sampling. The signer generates a random mask vector y, computes w = Ay, derives a challenge c by hashing the message together with the high-order bits of w, computes z = y + c*s1, and checks whether z is "small enough" and whether the hint for the rounding of w - c*s2 is sparse enough. If either check fails, the signer rejects and tries again with a fresh y. The expected number of attempts is approximately 3.85 for ML-DSA-87 (the commonly quoted 5.1 figure belongs to ML-DSA-65), which means signing involves approximately 4 matrix-vector multiplications on average.

ML-DSA-87 Performance Characteristics

ML-DSA-87 key generation is fast: approximately 0.25 milliseconds on production hardware. This is 80-120x faster than FALCON-1024 key generation. The key generation involves expanding a seed to generate the matrix A, sampling short error vectors s1 and s2, and computing t = As1 + s2. All operations are over the module ring with NTT acceleration, and there is no iterative rejection step.

Signing with ML-DSA-87 takes approximately 2.5-3.0 milliseconds on average, primarily due to the rejection sampling loop. Each iteration involves a size-256 NTT for 15 polynomials (k+l=15 in the module), plus polynomial arithmetic. With an average of 3.85 iterations, the total cost is roughly 15 NTTs per iteration times 4 iterations, or about 60 NTTs. At approximately 40 nanoseconds per size-256 NTT on Graviton4, this is about 2.4 microseconds for the NTTs alone, a tiny share of the total; the dominant cost is the memory operations for the 2,592-byte public key expansion and the 4,627-byte signature encoding.
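The rejection loop at the heart of ML-DSA signing is easier to see in a toy. The sketch below is not ML-DSA: plain integers stand in for polynomial vectors, a xorshift generator stands in for the XOF, and only the "z is small enough" check is modeled. All names and parameters are hypothetical.

```rust
/// Toy, scalar version of the sign-with-rejection loop.
/// `gamma1` bounds the mask y, and `beta` must bound |challenge * secret|.
/// `seed` must be nonzero. Returns the accepted z and the number of attempts.
pub fn sign_toy(secret: i64, challenge: i64, gamma1: i64, beta: i64, mut seed: u64) -> (i64, u32) {
    assert!(seed != 0 && gamma1 > beta && beta >= 0);
    let mut attempts = 0;
    loop {
        attempts += 1;
        // Fresh mask y, roughly uniform in (-gamma1, gamma1] (xorshift64 step).
        seed ^= seed << 13;
        seed ^= seed >> 7;
        seed ^= seed << 17;
        let y = (seed % (2 * gamma1 as u64)) as i64 - gamma1 + 1;
        // Candidate response: mask plus challenge times secret.
        let z = y + challenge * secret;
        // Accept only if z falls in the range that does not depend on the secret.
        if z.abs() < gamma1 - beta {
            return (z, attempts);
        }
        // Otherwise reject and retry with a new mask.
    }
}

/// Expected number of attempts when each attempt is accepted independently
/// with probability `p_accept` (the mean of a geometric distribution).
pub fn expected_attempts(p_accept: f64) -> f64 {
    1.0 / p_accept
}
```

The expected-attempts figure quoted for a parameter set is the reciprocal of its per-attempt acceptance probability, so multiplying it by the per-iteration NTT count and per-NTT cost from the paragraph above reproduces the NTT budget.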

Verification with ML-DSA-87 takes approximately 1.5 microseconds. Despite the larger signature, verification is faster than FALCON-1024 (1.5us vs 2.1us) because ML-DSA verification does not require a size-1024 NTT. Instead, it uses 15 size-256 NTTs (for the matrix-vector product) and a hash computation to check the challenge. The size-256 NTTs are individually cheaper than the size-1024 NTT, and the matrix-vector product is embarrassingly parallelizable across the module dimensions.

Property	FALCON-1024	ML-DSA-87	Ratio
NIST Security Level	5 (256-bit)	5 (256-bit)	--
Hardness assumption	NTRU lattices	Module-LWE	--
Public key size	1,793 B	2,592 B	0.69x (FALCON smaller)
Signature size	1,330 B	4,627 B	0.29x (FALCON smaller)
Private key size	2,305 B	4,896 B	0.47x (FALCON smaller)
Key generation	20-30 ms	0.25 ms	80-120x (ML-DSA faster)
Signing	1.0-1.2 ms	2.5-3.0 ms	0.4x (FALCON faster)
Verification	2.1 us	1.5 us	1.4x (ML-DSA faster)
FIPS standard	Pending (FN-DSA)	FIPS 204 (published)	--
Cached verification	31 ns	31 ns	1x (same)

The Cache Memory Equation

The 3.5x signature size difference between FALCON-1024 (1,330 bytes) and ML-DSA-87 (4,627 bytes) carries over almost directly to cache memory consumption when caching full signatures: roughly 3.4x once per-entry overhead is included. This matters at every scale.

Cached Entries	FALCON-1024	ML-DSA-87	Delta
100,000	140 MB	470 MB	330 MB
1,000,000	1.40 GB	4.70 GB	3.30 GB
5,000,000	7.01 GB	23.5 GB	16.5 GB
10,000,000	14.0 GB	47.0 GB	33.0 GB

The per-entry sizes include 72 bytes of cache overhead (32-byte key, 8-byte pointer, 8-byte TTL, 8-byte frequency counter, 16-byte alignment). FALCON-1024: 1,330 + 72 = 1,402 bytes/entry. ML-DSA-87: 4,627 + 72 = 4,699 bytes/entry.

At 1 million entries, FALCON-1024 uses 1.40 GB -- comfortably within the memory budget of a standard server with 16-32 GB RAM. ML-DSA-87 uses 4.70 GB, which is feasible on a 32 GB server but consumes a significant fraction of available memory. At 5 million entries, FALCON-1024 uses 7.01 GB (fits on a 16 GB server with room for the application), while ML-DSA-87 requires 23.5 GB (needs a high-memory instance). At 10 million entries, FALCON-1024 requires 14.0 GB (fits on a 32 GB server), while ML-DSA-87 requires 47.0 GB (needs either a 64 GB server or a distributed cache).
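The table rows follow from one multiplication per cell. A minimal sketch, assuming decimal gigabytes (10^9 bytes) and the 72-byte overhead itemized above; the names are illustrative.

```rust
/// Per-entry cache overhead in bytes (key, pointer, TTL, counter, alignment).
pub const CACHE_OVERHEAD_BYTES: u64 = 72;

/// Total cache footprint in bytes for `entries` cached signatures.
pub fn cache_bytes(entries: u64, signature_bytes: u64) -> u64 {
    entries * (signature_bytes + CACHE_OVERHEAD_BYTES)
}

/// Converts bytes to decimal gigabytes (1 GB = 10^9 bytes).
pub fn as_gb(bytes: u64) -> f64 {
    bytes as f64 / 1e9
}
```

`cache_bytes(1_000_000, 1_330)` returns 1,402,000,000 bytes (1.40 GB) and `cache_bytes(1_000_000, 4_627)` returns 4,699,000,000 bytes (4.70 GB).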

The distributed cache option is particularly problematic at Level 5 sizes. A Redis GET for a 4,627-byte value takes approximately 200-250 microseconds (network round-trip plus serialization of the larger payload). For a 1,330-byte FALCON-1024 value, the same Redis GET takes approximately 140-180 microseconds. In both cases, the Redis latency is orders of magnitude slower than the in-process cache hit (31 nanoseconds). But the larger ML-DSA-87 values push against Redis's sweet spot for per-key throughput: at sustained 100,000 GETs per second with 4.6 KB values, Redis transfers 460 MB/s of payload data, which is about 3.7 Gbps, or 37% of a 10 Gbps network link.
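The throughput figure is a unit conversion, sketched below. It counts payload bytes only, ignoring protocol framing and TCP overhead, so real link usage would be somewhat higher; the function names are illustrative.

```rust
/// Payload throughput in megabytes per second (1 MB = 10^6 bytes).
pub fn payload_mb_per_s(ops_per_s: f64, value_bytes: f64) -> f64 {
    ops_per_s * value_bytes / 1e6
}

/// Fraction of a link consumed by payload alone, where `link_gbps`
/// is the link speed in gigabits per second.
pub fn link_utilization(ops_per_s: f64, value_bytes: f64, link_gbps: f64) -> f64 {
    ops_per_s * value_bytes * 8.0 / (link_gbps * 1e9)
}
```

At 100,000 GETs per second on a 10 Gbps link, `link_utilization` returns about 0.37 for 4,627-byte values and about 0.11 for 1,330-byte values.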

The Verification-Boolean Shortcut

If you only need to know whether a signature is valid (not retransmit the signature itself), you can cache the verification boolean instead of the full signature. The cache entry becomes 33 bytes: a 32-byte computation fingerprint (SHA3-256 of the signature, public key hash, and message hash) plus a 1-byte result. With 72 bytes of cache overhead, each entry is 105 bytes.

At this size, both FALCON-1024 and ML-DSA-87 have identical cache footprints: 105 bytes per entry, 105 MB per million entries. The 3.5x size advantage of FALCON-1024 vanishes because you are no longer storing the signature itself. If your architecture only requires verification caching (not signature caching), the choice between FALCON-1024 and ML-DSA-87 should be based on other factors: signing speed, key generation speed, standardization status, and the hardness assumption diversity you want.

However, many architectures require both signature caching and verification caching. The signing service caches full signatures to avoid re-signing. The verification services cache verification booleans to avoid re-verification. In this architecture, the total cache memory is dominated by the signing service's full-signature cache, and FALCON-1024's 3.5x size advantage matters.

// Signing service: cache full signatures (size-sensitive)
// FALCON-1024: 1,402 B/entry vs ML-DSA-87: 4,699 B/entry
// The cache key is the message hash alone, which assumes one signing key per cache.
fn sign_cached(msg: &[u8], sk: &PrivateKey) -> Signature {
    let key = sha3_256(msg);
    if let Some(sig) = SIGN_CACHE.get(&key) {
        return sig.clone();  // 31ns, saves 1-3ms signing
    }
    let sig = sign(sk, msg);
    SIGN_CACHE.insert(key, sig.clone());  // cache one copy, return the other
    sig
}

// Verification service: cache boolean (size-identical)
// Both schemes: 105 B/entry
fn verify_cached(sig: &[u8], pk: &PublicKey, msg: &[u8]) -> bool {
    // Fingerprint = SHA3-256(signature || public key hash || message hash)
    let fp = sha3_256(&[sig, &pk.hash()[..], &sha3_256(msg)[..]].concat());
    if let Some(result) = VERIFY_CACHE.get(&fp) {
        return *result;  // 31ns, saves 1.5-2.1us verification
    }
    let result = verify(pk, msg, sig);
    VERIFY_CACHE.insert(fp, result);
    result
}
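The snippet above depends on types and caches defined elsewhere. For readers who want something that compiles on its own, here is a minimal standard-library sketch of the same get-or-compute pattern for the verification cache. Everything in it is an assumption for illustration: `VerifyCache` is a hypothetical type, a `Mutex<HashMap>` stands in for the concurrent map, and `DefaultHasher` (64-bit, not cryptographic) stands in for SHA3-256, so it is not suitable for production use.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

/// Hypothetical verification cache keyed by a 64-bit fingerprint.
pub struct VerifyCache {
    entries: Mutex<HashMap<u64, bool>>,
}

impl VerifyCache {
    pub fn new() -> Self {
        VerifyCache { entries: Mutex::new(HashMap::new()) }
    }

    /// Fingerprint over (signature, public key, message).
    /// Slices hash their length first, so the three fields cannot run together.
    fn fingerprint(sig: &[u8], pk: &[u8], msg: &[u8]) -> u64 {
        let mut h = DefaultHasher::new();
        sig.hash(&mut h);
        pk.hash(&mut h);
        msg.hash(&mut h);
        h.finish()
    }

    /// Returns the cached result, or runs `verify` once and stores its result.
    pub fn verify_cached<F>(&self, sig: &[u8], pk: &[u8], msg: &[u8], verify: F) -> bool
    where
        F: FnOnce() -> bool,
    {
        let fp = Self::fingerprint(sig, pk, msg);
        if let Some(&hit) = self.entries.lock().unwrap().get(&fp) {
            return hit; // cache hit: the verifier is not called
        }
        let result = verify();
        self.entries.lock().unwrap().insert(fp, result);
        result
    }
}
```

Passing the real verifier as a closure keeps the cache independent of the signature scheme, which is why the same structure serves FALCON-1024 and ML-DSA-87 at the same per-entry cost. Two threads that miss on the same fingerprint at the same time will both run the verifier; that is harmless here because verification is deterministic.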

When Each Algorithm Wins

FALCON-1024 Wins When:

Signature volume is high and cache memory is constrained. If your system produces and caches more than 1 million Level 5 signatures, FALCON-1024's 1,330-byte signatures save 3.3 GB per million entries compared to ML-DSA-87. At 10 million signatures, the savings are 33 GB -- the difference between fitting in a single 16 GB process and needing a 48+ GB process or a distributed cache.

Network bandwidth is limited. At 100,000 signatures per second, FALCON-1024 transmits 133 MB/s versus ML-DSA-87's 463 MB/s. On a 10 Gbps internal link, FALCON uses 10.6% of capacity while ML-DSA uses 37.0%. On constrained networks (satellite links, IoT gateways, cross-region replication), the 3.5x bandwidth difference can be the limiting factor.

Signing speed matters more than key generation speed. FALCON-1024 signs at 1.0-1.2 milliseconds versus ML-DSA-87's 2.5-3.0 milliseconds. For high-throughput signing (real-time attestation, stream authentication), the 2-3x signing speed advantage reduces latency and CPU cost per operation. The slow key generation (20-30ms vs 0.25ms) is amortized over the lifetime of the key pair and is irrelevant for steady-state performance.

You want NTRU as a distinct hardness assumption. FALCON-1024 relies on the hardness of finding short vectors in NTRU lattices, while ML-DSA-87 relies on the hardness of Module-LWE (together with Module-SIS). These are mathematically distinct problems. Both are lattice-based, so a generic improvement in lattice reduction would affect both schemes, but an attack that exploits structure specific to one problem (say, the module structure behind Module-LWE) could break ML-DSA without affecting FALCON, and vice versa. In a three-family deployment that uses FALCON (NTRU), ML-DSA (Module-LWE), and SLH-DSA (hash functions), which together cover three distinct hardness assumptions, FALCON-1024 provides diversification that ML-DSA cannot.

ML-DSA-87 Wins When:

FIPS compliance is required today. ML-DSA-87 is standardized in FIPS 204, which is published and final. FALCON-1024 is pending standardization as FN-DSA (FIPS 206). If your regulatory or contractual requirements mandate a published FIPS standard, ML-DSA-87 is the only option at Level 5.

Implementation simplicity is a priority. ML-DSA's rejection sampling is straightforward to implement in constant time. FALCON's fast Fourier sampling over the NTRU tree requires careful floating-point to fixed-point conversion, constant-time tree traversal, and correct Gaussian sampling. The implementation complexity of FALCON has historically been a source of side-channel vulnerabilities. If you are using a library without constant-time guarantees, ML-DSA is the safer choice.

Key generation frequency is high. If your system generates new key pairs frequently (ephemeral signatures, per-session key pairs, key rotation every few minutes), ML-DSA-87's 0.25ms key generation is 80-120x faster than FALCON-1024's 20-30ms. At 1,000 key generations per second, ML-DSA consumes 0.25 CPU-seconds per second of wall-clock time (a quarter of one core), while FALCON consumes 20-30 CPU-seconds per second (20-30 fully loaded cores). This is typically not the bottleneck (keys are reused, not regenerated per-operation), but it matters for systems that prioritize rapid key rotation.

Verification speed is the bottleneck on cache misses. ML-DSA-87 verifies at 1.5 microseconds versus FALCON-1024's 2.1 microseconds. On cache misses (the first verification of a new signature), ML-DSA is 1.4x faster. If your cache hit rate is low (below 50%), the uncached verification latency dominates, and ML-DSA's faster verification is an advantage. On cache hits, both schemes verify at 31 nanoseconds (the cache lookup cost), but because a miss costs roughly 50-70x more than a hit, the gap in mean latency only closes as the hit rate approaches 100%: at an 80% hit rate the mean is still about 325 nanoseconds for ML-DSA-87 versus 445 nanoseconds for FALCON-1024.
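How much the uncached latency matters at a given hit rate is a weighted average. The sketch below assumes a miss costs only the uncached verification time and ignores the failed lookup and the insert; the function names are illustrative.

```rust
/// Mean verification latency in nanoseconds for a given cache hit rate.
/// `miss_ns` is the full cost of a miss (the uncached verification).
pub fn mean_verify_ns(hit_rate: f64, hit_ns: f64, miss_ns: f64) -> f64 {
    hit_rate * hit_ns + (1.0 - hit_rate) * miss_ns
}

/// Hit rate at which hits and misses contribute equally to the mean.
/// Above this rate the lookup cost dominates; below it, misses do.
pub fn crossover_hit_rate(hit_ns: f64, miss_ns: f64) -> f64 {
    miss_ns / (hit_ns + miss_ns)
}
```

Evaluating `mean_verify_ns` with a 31 ns hit and a 1,500 ns or 2,100 ns miss across a range of hit rates shows how the two schemes converge on the lookup cost as the hit rate rises.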

Redis Latency at Level 5 Sizes

Level 5 signatures are larger than Level 1 or Level 3 signatures, and the size affects Redis performance. Redis command processing time includes serialization of the value, which scales linearly with value size. For small values (under 256 bytes), the serialization cost is negligible compared to network latency. For multi-kilobyte values, serialization becomes measurable.

Value Size	Redis GET Latency (same-AZ)	In-Process Lookup	Ratio
33 B (verification boolean)	110 us	31 ns	3,548x
690 B (FALCON-512 sig)	135 us	31 ns	4,355x
1,330 B (FALCON-1024 sig)	155 us	31 ns	5,000x
3,309 B (ML-DSA-65 sig)	195 us	31 ns	6,290x
4,627 B (ML-DSA-87 sig)	225 us	31 ns	7,258x

The Redis latency increases with value size, but the in-process lookup remains constant at 31 nanoseconds regardless of value size. This is because the in-process DashMap stores values inline in memory, and the lookup returns a reference to the value without copying or deserializing it. Redis must serialize the value into the RESP protocol, transmit it over TCP, and deserialize it on the client side. The larger the value, the more serialization and network transmission work is required.

At ML-DSA-87's 4,627-byte signature size, the Redis latency of 225 microseconds is 150x slower than the ML-DSA-87 verification itself (1.5 microseconds). Using Redis to cache a 1.5-microsecond operation and paying 225 microseconds to retrieve the cached result is counterproductive. The cache makes performance worse, not better. This is the fundamental argument for in-process caching at post-quantum signature sizes: the signatures are too large for efficient network transmission, and the verification is too fast for network-latency caching to provide any benefit.
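The argument reduces to a comparison between lookup latency and the cost of the operation being cached. A minimal sketch, with illustrative function names, using the latencies quoted in this section:

```rust
/// How many times slower the cache lookup is than simply recomputing.
/// Values above 1.0 mean the cache makes things worse.
pub fn lookup_to_compute_ratio(lookup_us: f64, compute_us: f64) -> f64 {
    lookup_us / compute_us
}

/// A cache only pays off when a lookup is cheaper than recomputing.
pub fn cache_pays_off(lookup_us: f64, compute_us: f64) -> bool {
    lookup_us < compute_us
}
```

`lookup_to_compute_ratio(225.0, 1.5)` returns 150.0, and `cache_pays_off(225.0, 1.5)` is false, while the in-process lookup at 0.031 microseconds passes the same test. By the same comparison, a 225-microsecond Redis lookup is still cheaper than a 2.5-3.0 millisecond ML-DSA-87 signing operation, so the objection applies to caching verification results, not necessarily to caching signatures.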

Three-Family Level 5 Architecture

A maximum-security deployment uses Level 5 parameter sets from all three families: FALCON-1024 (NTRU lattices), ML-DSA-87 (Module-LWE lattices), and SLH-DSA-256f (hash functions). The security model requires an attacker to break all three hardness assumptions simultaneously -- NTRU, Module-LWE, and hash preimage resistance. These are three independent mathematical bets, and the probability that all three are broken by a single algorithmic advance is negligible.

In this architecture, the caching requirements for each family are different. FALCON-1024 handles high-frequency signing with compact 1,330-byte signatures (best cache density). ML-DSA-87 handles compliance-critical operations where FIPS 204 is required (larger signatures at 4,627 bytes, but lower frequency). SLH-DSA-256f handles root attestations and long-lived proofs (49,856-byte signatures, but very low frequency -- perhaps once per hour or per day). The total cache memory is dominated by FALCON-1024 entries because they are the most numerous, despite each entry being 3.5x smaller than an ML-DSA-87 entry.
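Whether FALCON-1024 entries dominate depends entirely on the entry counts, which the article does not specify. The sketch below computes each family's share of cache memory for whatever counts you supply; the struct, the function names, and any counts you plug in are hypothetical, and only the signature sizes and the 72-byte overhead come from the text.

```rust
/// Per-entry cache overhead in bytes, as itemized earlier in the article.
const CACHE_OVERHEAD_BYTES: u64 = 72;

pub struct FamilyLoad {
    pub name: &'static str,
    pub signature_bytes: u64,
    pub cached_entries: u64, // workload-specific; not given in the article
}

/// Cache footprint of one family in bytes.
pub fn family_cache_bytes(load: &FamilyLoad) -> u64 {
    load.cached_entries * (load.signature_bytes + CACHE_OVERHEAD_BYTES)
}

/// Each family's fraction of total cache memory.
/// Assumes at least one family has a nonzero entry count.
pub fn memory_shares(loads: &[FamilyLoad]) -> Vec<(&'static str, f64)> {
    let total: u64 = loads.iter().map(family_cache_bytes).sum();
    loads
        .iter()
        .map(|l| (l.name, family_cache_bytes(l) as f64 / total as f64))
        .collect()
}
```

As an illustration with made-up counts of 10 million FALCON-1024, 100,000 ML-DSA-87, and 1,000 SLH-DSA-256f entries, FALCON-1024 accounts for about 96% of the total despite having the smallest entries.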

The H33-74 attestation pipeline produces a compact 74-byte proof that binds all three signature families. The attestation is signed by whichever family is appropriate for the operation's frequency and compliance requirements, and the verification result is cached with a computation fingerprint that includes the attestation bytes, the signing family identifier, and the public key hash. This allows the cache to hold verification results for all three families in a single DashMap, differentiated by their fingerprints.

The Standardization Gap

As of May 2026, ML-DSA-87 (FIPS 204) is the only NIST Level 5 post-quantum signature with a published FIPS standard. FALCON-1024 (FN-DSA, FIPS 206) is expected but not yet published. For deployments that must demonstrate FIPS compliance today, ML-DSA-87 is the only compliant Level 5 option. For deployments that prioritize cache efficiency and are willing to adopt FALCON-1024 ahead of the FN-DSA publication, the 3.5x signature size advantage makes FALCON-1024 the better choice for high-volume workloads. If you must stay within published FIPS standards, track the NIST FN-DSA timeline and plan to add FALCON-1024 when the standard publishes.

The Bottom Line

At NIST Level 5, FALCON-1024 produces signatures 3.5x smaller than ML-DSA-87 (1,330 bytes vs 4,627 bytes). For systems that cache full signatures, this means roughly 3.4x less memory once per-entry overhead is counted: 1.40 GB vs 4.70 GB per million entries. FALCON-1024 signs 2-3x faster (1.0-1.2ms vs 2.5-3.0ms) but has slower key generation (20-30ms vs 0.25ms) and slower verification (2.1us vs 1.5us). On in-process cache hits at 31 nanoseconds, the verification speed difference vanishes. The choice comes down to cache memory (FALCON wins), FIPS compliance (ML-DSA wins today), and which hardness assumption you want as your primary bet at Level 5.
