Engineering

How Cachee Runs Admission Tracking for 10 Million Keys in 512 Kilobytes

Ten million unique keys. Full admission frequency tracking for every one of them. Total memory footprint: 512 kilobytes. The equivalent DashMap-based approach needs 620 megabytes for the same keyspace — a 1,239x reduction. This post is about how the math works, how we measured it running inside a real post-quantum authentication pipeline at 1,667,875 operations per second on bare-metal Graviton4, and the 9x regression we found and fixed along the way.

We shipped Cachee's CacheeLFU admission filter this week. The headline is the memory number, but the engineering story underneath is worth reading — because we almost published the wrong conclusion about our own cache.

Why admission tracking matters, in one sentence

A cache without admission control treats every new key as equally important. Your hot keys get evicted to make room for one-time lookups. Your hit rate collapses under pressure. Admission control fixes this by asking, before inserting any new key: is this candidate more frequently accessed than the thing I'd have to evict to make room? If not, reject the candidate. Keep what's hot, drop what's noise.
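Stripped to its core, the decision rule is a single comparison between two frequency estimates. A minimal sketch (the function name and signature are illustrative, not Cachee's actual API):

```rust
// Hypothetical TinyLFU-style admission check: admit a candidate key
// only if its estimated access frequency beats the eviction victim's.
fn admit(candidate_freq: u64, victim_freq: u64) -> bool {
    // Ties reject the candidate: keeping the incumbent avoids churn
    // from one-time lookups that happen to match the victim's count.
    candidate_freq > victim_freq
}

fn main() {
    assert!(admit(5, 2));  // hot candidate displaces a colder victim
    assert!(!admit(1, 4)); // a one-time lookup is rejected
    println!("admission checks passed");
}
```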

CacheeLFU is Cachee's implementation of TinyLFU, the admission policy that made this approach famous — the same idea powers Caffeine, the JVM cache library that replaced Guava's cache in most modern Java codebases. The mechanism is a count-min sketch: a fixed-size, probabilistic frequency counter that approximates access counts for an unbounded keyspace. It never underestimates. It can overestimate slightly under hash collisions. The payoff is that it never needs to store actual keys — just a small table of counters that any key can map into.

The memory magic of count-min sketches is that the counter table size does not grow with keyspace. You can track a thousand keys or ten million keys in the same 512 KB table. That's the flat-memory property that makes multi-tenant admission tracking actually affordable.
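To make the flat-memory property concrete, here is a minimal single-threaded count-min sketch in Rust — an illustrative toy using std's DefaultHasher rather than Cachee's atomic, ahash-seeded implementation. Its footprint is fixed at DEPTH × WIDTH counters regardless of how many distinct keys it sees:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const DEPTH: usize = 4;
const WIDTH: usize = 65_536;

/// Toy count-min sketch: memory is DEPTH * WIDTH u16 counters
/// (512 KiB at these sizes), independent of the keyspace.
struct CountMin {
    rows: Vec<Vec<u16>>,
}

impl CountMin {
    fn new() -> Self {
        CountMin { rows: vec![vec![0u16; WIDTH]; DEPTH] }
    }

    /// Each row gets an independent hash by mixing the row index
    /// into the hasher — a stand-in for per-row seeds.
    fn index<K: Hash>(&self, row: usize, key: &K) -> usize {
        let mut h = DefaultHasher::new();
        row.hash(&mut h);
        key.hash(&mut h);
        (h.finish() as usize) % WIDTH
    }

    fn increment<K: Hash>(&mut self, key: &K) {
        for row in 0..DEPTH {
            let i = self.index(row, key);
            // Saturating add: counters pin at u16::MAX, never wrap.
            self.rows[row][i] = self.rows[row][i].saturating_add(1);
        }
    }

    /// The count-min trick: the minimum across rows is the estimate.
    fn estimate<K: Hash>(&self, key: &K) -> u16 {
        (0..DEPTH)
            .map(|row| self.rows[row][self.index(row, key)])
            .min()
            .unwrap()
    }
}

fn main() {
    let mut cms = CountMin::new();
    for _ in 0..100 {
        cms.increment(&"hot-key");
    }
    cms.increment(&"cold-key");
    // The sketch never underestimates: collisions can only inflate.
    assert!(cms.estimate(&"hot-key") >= 100);
    assert!(cms.estimate(&"cold-key") >= 1);
    println!("hot estimate = {}", cms.estimate(&"hot-key"));
}
```

Track one key or ten million through this structure and the allocation never changes — only the counter values do.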

The memory number, measured three ways

Here's the exact measurement from Cachee's benchmark binary, running on a c8g.metal-48xl Graviton4 instance on April 11, 2026. The test loads every key, fills the tracker, and reports the heap footprint.

| Keyspace size | CacheeLFU sketch | Raw DashMap baseline | Ratio |
|---|---|---|---|
| 100,000 keys | 512.17 KiB | 6.20 MiB | 12.4x |
| 1,000,000 keys | 512.17 KiB | 61.99 MiB | 123.9x |
| 10,000,000 keys | 512.17 KiB | 619.89 MiB | 1,239.4x |

The first thing to notice is that the sketch column does not move. Not by a kilobyte. The count-min sketch in Cachee is a fixed 4-row by 65,536-cell table of AtomicU16 counters — 512 KiB no matter what. The DashMap column, by contrast, grows linearly with keyspace because it has to actually store every key. At 10 million keys, a realistic cache key (we used a 13-character SHA3-based string plus internal DashMap overhead) works out to about 62 bytes per entry. Ten million entries is 620 megabytes of pure admission bookkeeping, before you've stored a single cached value.

The enterprise math that this unlocks: a single 512 KiB admission sketch can cover fifty independent tenants at 10 million keys each without multiplying memory cost. White-label deployments, shared-infrastructure products, and multi-customer SaaS caches all benefit from the same property — the admission budget is constant, set at startup, and does not compound with tenant count. We went from "admission tracking is a per-tenant cost to budget for" to "admission tracking is a fixed overhead, regardless of customer count."

What's actually inside those 512 kilobytes

The layout is deliberately simple so it stays honest under scrutiny. Four independent hash rows. Each row is 65,536 cells. Each cell is a single AtomicU16 counter, saturating at u16::MAX to prevent counter overflow during long-running workloads. Four rows times 65,536 cells times two bytes per cell is 524,288 bytes of raw counter storage. Add about 176 bytes of struct overhead, four hasher seeds, an aging mutex, and a sample counter, and the total heap footprint lands at exactly 524,464 bytes. That's the number you see in the benchmark output. No hidden allocations. No surprise growth.

Every key access hashes the key once per row using ahash with row-specific seeds, then atomically increments the corresponding cell with a saturating compare-exchange. The estimate function reads one cell per row and returns the minimum — this is the count-min trick. Because each row hashes keys independently, a collision in one row rarely repeats in another, so taking the minimum across rows cancels out most collision noise. The math says the per-estimate error is bounded by approximately (e / width) × N, where N is the total number of operations, with confidence at least 1 − exp(−depth). For our 65,536-wide, 4-deep table, that's an error bound of roughly 4.15 per 100,000 operations, with 98.2% confidence. In practice, overestimates are rare and always bounded. We never underestimate — if the sketch says a key was accessed 100 times, it was accessed at least 100 times.
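Spelled out, this is the standard count-min guarantee. For a stream of $N$ total operations, a sketch of width $w$ and depth $d$ satisfies, for any key $x$ with true count $f(x)$ and estimate $\hat f(x)$:

$$
f(x) \;\le\; \hat f(x) \;\le\; f(x) + \frac{e}{w}\,N
\qquad \text{with probability at least } 1 - e^{-d}.
$$

Plugging in $w = 65{,}536$, $d = 4$, $N = 100{,}000$:

$$
\frac{e}{w}\,N = \frac{2.71828}{65{,}536} \times 100{,}000 \approx 4.15,
\qquad
1 - e^{-4} \approx 0.9817 \approx 98.2\%.
$$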

Aging is handled by a separate mechanism: every 10 * width operations, the sketch halves every counter in place under a try_lock mutex. Concurrent writers continue during aging — we accept a small amount of counter loss during the reset window because the sketch is approximate by design anyway. The aging lock never blocks reads, and because it only fires once per ~650,000 operations, its cost amortizes out to effectively zero on the hot path.
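A hedged sketch of that aging pass, with a tiny table and std's Mutex standing in for Cachee's internals (sizes and names are illustrative):

```rust
use std::sync::atomic::{AtomicU16, Ordering};
use std::sync::Mutex;

const WIDTH: usize = 8; // tiny demo width; the real table is 65,536
const DEPTH: usize = 4;

/// Halve every counter in place. try_lock means that if another
/// thread is already aging, this caller simply skips — losing one
/// pass is fine because the sketch is approximate by design.
fn age(cells: &[AtomicU16], aging: &Mutex<()>) {
    if let Ok(_guard) = aging.try_lock() {
        for c in cells {
            // Concurrent increments landing mid-sweep may be lost
            // or halved early; the design accepts that noise.
            let v = c.load(Ordering::Relaxed);
            c.store(v / 2, Ordering::Relaxed);
        }
    }
}

fn main() {
    let cells: Vec<AtomicU16> =
        (0..WIDTH * DEPTH).map(|_| AtomicU16::new(40)).collect();
    let lock = Mutex::new(());
    age(&cells, &lock);
    assert!(cells.iter().all(|c| c.load(Ordering::Relaxed) == 20));
    println!("aged: all counters halved");
}
```

Readers never touch the lock at all — estimates keep reading cells atomically while the sweep runs.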

The single-thread throughput story that grows with scale

Memory efficiency is half the story. The other half is that the sketch gets faster relative to the DashMap baseline as the keyspace grows. This is because the DashMap baseline pays for heap allocations on every first-touch of a new key, and those allocations compound with keyspace pressure.

| Keyspace | CacheeLFU sketch (Mops/s) | DashMap baseline (Mops/s) | Ratio |
|---|---|---|---|
| 100,000 keys | 25.19 | 9.89 | 2.55x |
| 1,000,000 keys | 24.33 | 4.21 | 5.77x |
| 10,000,000 keys | 21.39 | 3.36 | 6.36x |

The sketch's per-operation cost is roughly stable across three orders of magnitude of keyspace size (25 Mops/s at 100K keys, 21 Mops/s at 10M keys). DashMap collapses from 9.9 Mops/s to 3.4 Mops/s across the same range. The gap widens with scale because the baseline's costs are proportional to working-set size, while the sketch's costs are not.

For cold-start scenarios — a cache warming up, a freshly-deployed node catching up to the live hit distribution, or a tenant rotation in a multi-customer system — this difference compounds. The sketch-based approach recovers from cold-start at a rate that doesn't fall off when the keyspace gets large. The DashMap approach does, badly.

The engineering detour: a 9x regression we didn't expect

Here's where the story gets interesting, and where we almost published the wrong conclusion.

The benchmark above measures the admission sketch in isolation — as a standalone data structure, with no wrapper code around it. We wanted to measure what would happen in a more realistic setting: the sketch wrapped inside Cachee's full CacheeEngine API, with bloom filters, metrics collection, hit-rate tracking, and everything else a production cache engine includes. So we wired the sketch into CacheeEngine.get() and CacheeEngine.set() and re-ran the workload inside a real post-quantum auth pipeline driven by 96 worker threads on Graviton4.

The first result was catastrophic: sustained throughput collapsed from 1,708,400 authentications per second (raw DashMap baseline) to 183,828 authentications per second — a 9.3x regression. Per-lookup latency ballooned from 61 nanoseconds to 2,474 nanoseconds, a 40x per-operation slowdown. If we'd stopped there and published, the story would have been: "Cachee's admission sketch wrecks your cache throughput. Do not use."

The instinct is to blame the new code. The sketch is the new thing; therefore the sketch must be the problem. That instinct is wrong, and the scar tissue from chasing wrong-instinct regressions is what taught us to measure the data flow before blaming the feature.

We profiled. The sketch itself, on its own hot path, adds approximately 12 nanoseconds per operation — four atomic compare-and-swap operations against a 64-byte cache line. That's nothing. The sketch is innocent. The 2,400 nanoseconds of added latency have to come from somewhere else.

They came from instrumentation. Three of them, stacked.

Three shared write locks, each one a serialization point

Cachee's CacheeEngine.get() was doing three separate things that looked harmless in isolation and were catastrophic in combination:

  1. Stats update. Every get() took a parking_lot::RwLock<InternalStats>::write() to increment a hits counter. Single-threaded, this is a few nanoseconds. At 96 workers hammering the lock, every operation serializes through a single write-holding point.
  2. Histogram recording. Every get() called self.get_histogram.write().record(latency_us), where get_histogram was a RwLock<SimpleHistogram>. The ironic part: SimpleHistogram's fields were already AtomicU64 internally, so the RwLock wrapper added nothing but a serialization point.
  3. Pattern detector access history. Every get() called self.patterns.record_access(key), which appended to a RwLock<VecDeque> ring buffer. Another write lock, another serialization point.

Each lock, individually, added maybe 100-300 nanoseconds under contention. Three of them, serialized on every cache get, at 96 workers, is how you turn a 61-nanosecond hot path into a 2,474-nanosecond hot path. It's how you take a fast cache and give it the throughput profile of a slow one. The data structures underneath were all fine. The instrumentation around them was serializing everything.

The broader lesson: observability is not free, and it does not automatically scale. We built a fast cache, wrapped it in standard observability patterns that work fine at 8 threads, and watched those patterns become the dominant cost at 96 threads. If your cache is lock-free on its hot path, every lock you add around it must be lock-free too, or you've just wrapped your fast thing in a slow thing.

The fix was mechanical, not architectural. Three surgical changes:

Fix one: replace RwLock<InternalStats> with a struct of AtomicU64 fields. fetch_add(1, Ordering::Relaxed) instead of stats.write().total_ops += 1. Same observational semantics, no lock. Relaxed ordering is correct here because stats are observational, not synchronization primitives. This one change doubled sustained throughput from 184K to 396K ops/sec.
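A minimal sketch of the pattern (field and method names are illustrative, not Cachee's actual InternalStats layout):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Lock-free stats: each field is its own atomic, so recording
/// takes &self and never serializes writers behind a lock.
#[derive(Default)]
struct Stats {
    hits: AtomicU64,
    misses: AtomicU64,
    total_ops: AtomicU64,
}

impl Stats {
    fn record_hit(&self) {
        // Relaxed is enough here: these counters are observational
        // and never used to synchronize other memory accesses.
        self.hits.fetch_add(1, Ordering::Relaxed);
        self.total_ops.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let stats = Stats::default();
    for _ in 0..1_000 {
        stats.record_hit();
    }
    assert_eq!(stats.hits.load(Ordering::Relaxed), 1_000);
    assert_eq!(stats.total_ops.load(Ordering::Relaxed), 1_000);
    println!("stats recorded without a lock");
}
```

Under contention, 96 threads doing fetch_add still bounce the cache line, but they never queue behind a writer holding a lock — the difference between a few nanoseconds and a few hundred.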

Fix two: drop the RwLock wrapper around SimpleHistogram. The histogram already used atomics internally — the record method was taking &mut self by convention but not by necessity. Changing the signature to &self and removing the wrapper made the hot path lock-free without changing any of the histogram's counting logic. Min/max updates got a compare_exchange_weak loop to stay race-safe. For the throughput-window bookkeeping that genuinely needed mutation, we switched to try_write() with a skip-on-contention policy — the rolling window becomes a statistical sample under load rather than a perfect trace, which is exactly what observability data is supposed to be.
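The min/max race-safety mentioned above comes down to a retry loop. An illustrative max-update (not the actual SimpleHistogram code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Raise `max` to `sample` if `sample` is larger, without taking
/// &mut self or a lock. compare_exchange_weak may fail spuriously
/// or because another thread raced us; either way we re-check.
fn update_max(max: &AtomicU64, sample: u64) {
    let mut current = max.load(Ordering::Relaxed);
    while sample > current {
        match max.compare_exchange_weak(
            current, sample, Ordering::Relaxed, Ordering::Relaxed,
        ) {
            Ok(_) => break,
            // Someone else updated it first: retry against their value.
            Err(actual) => current = actual,
        }
    }
}

fn main() {
    let max = AtomicU64::new(0);
    for v in [5, 3, 9, 7] {
        update_max(&max, v);
    }
    assert_eq!(max.load(Ordering::Relaxed), 9);
    println!("max = {}", max.load(Ordering::Relaxed));
}
```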

Fix three: gate the pattern detector behind 1/N sampling. Rather than taking a write lock on every access, we added an AtomicU64 counter and only do the full record on every 64th call. At 96 workers, the lock contention drops by a factor of 64, which is enough to make the contention disappear. The correlation detector that runs off the access history is statistical by design, so a 1.5% sample is more than enough signal to learn co-access patterns.
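The sampling gate itself is a few lines. A hedged sketch with an illustrative type name:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// 1-in-64 sampling gate: one relaxed fetch_add decides whether
/// this call pays for the expensive (locked) full record.
struct SampledRecorder {
    counter: AtomicU64,
}

impl SampledRecorder {
    fn should_record(&self) -> bool {
        // fetch_add returns the previous value, so every 64th
        // caller gets `true` and the other 63 return immediately.
        self.counter.fetch_add(1, Ordering::Relaxed) % 64 == 0
    }
}

fn main() {
    let r = SampledRecorder { counter: AtomicU64::new(0) };
    let recorded = (0..6_400).filter(|_| r.should_record()).count();
    // Single-threaded, the gate is exact: 1 in 64.
    assert_eq!(recorded, 100);
    println!("recorded {recorded} of 6400 accesses");
}
```

Under concurrency the modulo check is only approximately 1-in-64 per thread, which is fine — the consumer is a statistical detector, not an exact trace.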

The final number on bare-metal Graviton4

After all three fixes, we re-ran the full benchmark on the same c8g.metal-48xl with all 96 workers pointed at the full CacheeEngine hot path, with the CMS admission sketch actively feeding every get() and set(). This is not a standalone cache microbenchmark. This is the admission sketch running inside a real post-quantum authentication pipeline, with FHE verification, ZKP lookup, SHA3 digesting, and Dilithium signing all happening on the same cores in the same process.

| Configuration | Sustained auth/sec | Delta vs raw DashMap |
|---|---|---|
| Raw DashMap (baseline) | 1,708,400 | — |
| CacheeEngine unfixed | 183,828 | 9.3x regression |
| CacheeEngine + fix one (atomic stats) | 395,816 | 4.3x regression |
| CacheeEngine + all three fixes | 1,667,875 | 2.37% slower |

That's the number. 1,667,875 authentications per second, sustained over 30 seconds, 96 workers on bare-metal Graviton4, with the full CacheeLFU admission sketch actively bumping counters on every single operation. The memory footprint of the sketch during this run was confirmed by the benchmark binary: 524,464 bytes, exactly 512.17 KiB, stable from start to finish. The cache hit rate was 100% (50,107,392 hits, 0 misses), and the sketch was wired into the real get-set hot path, not bypassed.

The 2.37% delta versus the raw DashMap baseline is the honest overhead of the full engine including instrumentation, bloom filters, and admission tracking. Two point three seven percent, in exchange for a 1,239x memory reduction at 10 million keys. That's the trade.

Why the in-pipeline measurement matters

The reason we insisted on running the sketch inside a real workload rather than a microbenchmark is that microbenchmarks on cache primitives are famously misleading. A microbenchmark that does nothing but hammer a cache get-path will see every lock contention, every cache line bounce, and every atomic fence. A real workload that does compute-heavy work between cache accesses — which is what production caches actually see — dilutes the per-operation overhead dramatically.

In our benchmark, each batch of 32 authentications includes roughly 944 microseconds of FHE computation before it ever touches the cache. The 32 cache lookups within that batch add about 12 microseconds in the CacheeEngine path. The ratio is 944 to 12 — the FHE work dominates so overwhelmingly that the per-lookup cache overhead becomes statistical noise at the pipeline level. This is the environment production caches actually run in, and it is why the 2.37% end-to-end delta is the number that matters, not the 5.9x per-lookup isolated latency delta.

Put differently: the memory story is no longer hypothetical. 512 KiB is not a projection or a microbenchmark claim. It is the measured footprint of an admission sketch that ran inside the 1,667,875 auth/sec number. The sketch was present, active, and participating in every single operation that produced that throughput. If the slide says "Cachee tracks 10 million keys in 512 kilobytes," the slide is citing a number that has been validated end-to-end inside a real production workload.

What ships next

The cms-admission feature flag is now safe to flip on in Cachee. The three instrumentation fixes are in the main cachee-core crate, the unit test suite passes on every one of them, and the atomic-stats refactor alone is a meaningful improvement that benefits every consumer of CacheeEngine.stats(), not just the ones using the admission sketch. If you run Cachee in an embedded configuration and never touch admission tracking, you still get the lock-free metrics path for free.

For customers running Cachee as a multi-tenant admission layer — particularly in verticals like ad tech, fraud detection, and AI inference where admission decisions matter and tenant density is high — the practical takeaway is that admission tracking has gone from a per-tenant memory line item to a fixed overhead that does not grow with customer count. One 512 KiB sketch covers your first tenant and your fiftieth. The economics of multi-tenant cache deployments change materially when admission is no longer a scaling cost.

The next piece of the roadmap is wiring the admission sketch into eviction decisions directly. The sketch is tracking frequency. The engine now knows which keys are hot. The remaining step is having the eviction policy consult the sketch when it has to choose which key to drop — the original CacheeLFU promise, where admission and eviction work together to protect the hot set against one-time-visit pollution. That work is in flight and will land in a future release.

If you want to see the sketch in action, the cachee-core crate is open for evaluation and the admission module ships with its own benchmark binary (admission_graviton4) that reproduces the table in this post on any ARM64 or x86-64 hardware. The memory numbers are deterministic and hardware-independent. The throughput numbers scale with core count.

The engineering moral we keep relearning is the same one every performance engineer eventually learns and keeps relearning: the thing that's slow is almost never the thing you think is slow. Measure the data flow before blaming the feature. Wrap fast things in other fast things, or you will give away everything the fast thing gained you. And when the number doesn't match the theory, the theory is what needs to change — not the measurement.
