We published 2,209,429 authentications per second in March. Our current production number is 1,667,875. That looks like a 24% regression. It isn't. This is the full accounting.
Two numbers, two configurations
The 2.2M number was real. It was measured on Graviton4 metal with 96 workers running the full H33 pipeline: BFV fully homomorphic encryption, batch Dilithium signing, and ZKP verification. But the cache layer was a raw DashMap — a concurrent hash map with no admission control, no instrumentation, no telemetry, no pattern detection. That configuration does not ship to customers.
The 1,667,875 number is also real. Same hardware. Same pipeline. But the cache layer is the full production CacheeEngine: CacheeLFU admission sketch (Count-Min Sketch with width-partitioned multi-tenant support), atomic statistics collectors, lock-free histograms, and a sampled pattern detector. That is the configuration customers actually run.
The difference between the two is 2.37%. That's CacheeEngine's overhead.
The math
| Build | Configuration | Auth/sec |
|---|---|---|
| v10 (Feb) | Raw DashMap, no cache engine | 2,172,518 |
| v11 (Mar) | Raw DashMap, no cache engine | 2,209,429 |
| Apr 11 | Raw DashMap baseline (same metal) | 1,708,400 |
| Apr 11 | Full production CacheeEngine | 1,667,875 |
Two things are visible in this table. First, the CacheeEngine overhead: 1,708,400 raw vs 1,667,875 with full cache = 2.37% cost. That's effectively free. Second, there's a separate 21% drop in the raw DashMap baseline itself — Feb/Mar raw was 2.2M, April raw was 1.7M on the same instance type. That has nothing to do with the cache layer.
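Both deltas are simple percentage drops against the relevant baseline. A quick sanity check on the table's numbers (the 21% figure is measured against the v10 February raw baseline):

```rust
// Verify the two deltas in the table above from the measured auth/sec figures.
fn pct_drop(baseline: f64, measured: f64) -> f64 {
    (baseline - measured) / baseline * 100.0
}

fn main() {
    // CacheeEngine overhead: April raw DashMap vs April full cache engine.
    let overhead = pct_drop(1_708_400.0, 1_667_875.0);
    // Raw-baseline regression: v10 (Feb) raw vs April raw, same instance type.
    let raw_regression = pct_drop(2_172_518.0, 1_708_400.0);
    println!("cache overhead:  {overhead:.2}%"); // ~2.37%
    println!("raw regression:  {raw_regression:.1}%"); // ~21.4%
}
```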
How we got from 184K to 1,667,875
CacheeEngine's first integration did not produce a 2.37% overhead. It produced a 9.3x regression: 1.7M raw auth/sec dropped to 184K with the cache layer enabled. Three contended write locks in the instrumentation code serialized the entire hot path at 96 workers.
Bug 1: InternalStats wrapped in Arc<RwLock<InternalStats>>
Every cache operation — every get, every set, every admission check — took a write lock to increment hit/miss/admission counters. At 96 concurrent workers, this single lock became the bottleneck for the entire pipeline. Every worker waited in line to bump a counter.
Fix: Replace RwLock<InternalStats> with Arc<InternalStats> where every counter is an AtomicU64. No lock. No contention. Each worker atomically increments its counter and moves on.
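A minimal sketch of the shape of the fix (field and method names here are illustrative, not the actual CacheeEngine types): the stats struct becomes plain atomics behind an `Arc`, so no worker ever blocks another just to bump a counter.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Shared stats as lock-free atomics. Arc<InternalStats> replaces
// Arc<RwLock<InternalStats>>; increments never contend on a lock.
#[derive(Default)]
struct InternalStats {
    hits: AtomicU64,
    misses: AtomicU64,
    admissions: AtomicU64,
}

impl InternalStats {
    fn record_hit(&self) {
        // Relaxed is enough: counters are monotonic and only read for reporting.
        self.hits.fetch_add(1, Ordering::Relaxed);
    }
    fn record_miss(&self) {
        self.misses.fetch_add(1, Ordering::Relaxed);
    }
}

fn main() {
    let stats = Arc::new(InternalStats::default());
    let workers: Vec<_> = (0..8)
        .map(|_| {
            let s = Arc::clone(&stats);
            std::thread::spawn(move || {
                for _ in 0..10_000 {
                    s.record_hit();
                }
            })
        })
        .collect();
    for w in workers {
        w.join().unwrap();
    }
    // 8 workers x 10,000 hits, with no lock anywhere on the path.
    assert_eq!(stats.hits.load(Ordering::Relaxed), 80_000);
}
```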
Result: 183,828 to 395,816 auth/sec. A 115% improvement from removing one lock.
Bug 2: MetricsCollector histograms wrapped in RwLock<SimpleHistogram>
The histograms already used atomic counters internally — the SimpleHistogram struct was built on AtomicU64 bins from the start. But the MetricsCollector wrapped each histogram in an additional RwLock and called record() with &mut self. The outer lock was pure redundant overhead.
Fix: Remove the RwLock. Change SimpleHistogram::record() from &mut self to &self. The atomics inside the histogram already handle concurrent access correctly.
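A sketch of what an atomics-only histogram looks like, assuming power-of-two bins (the bin layout and method names are illustrative, not the production code). The key point is the `&self` signature: the atomics inside handle concurrent recording, so an outer lock adds nothing but contention.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Histogram built on AtomicU64 bins, shareable across workers without a lock.
struct SimpleHistogram {
    // One bin per power-of-two bucket (e.g. latency in nanoseconds).
    bins: Vec<AtomicU64>,
}

impl SimpleHistogram {
    fn new(num_bins: usize) -> Self {
        Self {
            bins: (0..num_bins).map(|_| AtomicU64::new(0)).collect(),
        }
    }

    // &self, not &mut self: fetch_add already serializes correctly at the
    // hardware level, so wrapping this in an RwLock is pure overhead.
    fn record(&self, value: u64) {
        let bin = (64 - value.leading_zeros() as usize).min(self.bins.len() - 1);
        self.bins[bin].fetch_add(1, Ordering::Relaxed);
    }

    fn count(&self) -> u64 {
        self.bins.iter().map(|b| b.load(Ordering::Relaxed)).sum()
    }
}
```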
Result: 395,816 to approximately 1,400,000 auth/sec.
Bug 3: PatternDetector::record_access() writing to RwLock<VecDeque> on every cache get
The workload pattern detector maintained a rolling window of recent access patterns in a VecDeque protected by an RwLock. Every single cache get — the hottest path in the entire system — took a write lock on this deque to push a new entry. The pattern data was only consulted periodically for prefetch decisions, but the lock was paid on every operation.
Fix: Add a sampling gate. An AtomicU64 counter increments on every access. The full RwLock write only fires every 64th call. The other 63 calls do nothing — one atomic increment and return. The sampling rate is configurable (default 64).
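The gate can be sketched like this (names and the window-trim policy are illustrative assumptions, not the production implementation). Sixty-three of every 64 calls are a single relaxed atomic increment; only the 64th pays for the `RwLock` write into the rolling window.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::RwLock;

struct PatternDetector {
    access_count: AtomicU64,
    sample_every: u64, // configurable sampling rate; the default is 64
    window: RwLock<VecDeque<u64>>,
}

impl PatternDetector {
    fn new(sample_every: u64) -> Self {
        Self {
            access_count: AtomicU64::new(0),
            sample_every,
            window: RwLock::new(VecDeque::new()),
        }
    }

    fn record_access(&self, key_hash: u64) {
        let n = self.access_count.fetch_add(1, Ordering::Relaxed);
        // Fast path: nothing beyond the increment above.
        if n % self.sample_every != 0 {
            return;
        }
        // Slow path, 1 in 64: take the write lock and push the sample.
        let mut w = self.window.write().unwrap();
        w.push_back(key_hash);
        if w.len() > 1024 {
            w.pop_front();
        }
    }
}
```

The trade-off is deliberate: prefetch decisions only need a statistical picture of the access pattern, so sampling loses nothing that matters while removing the lock from 98.4% of cache gets.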
Result: approximately 1,400,000 to 1,667,875 auth/sec — landing at 2.37% below the raw DashMap baseline.
What all three bugs had in common
All three root causes were contended write locks in the cache's instrumentation code — the stats counters, the metrics histograms, the pattern detector. Not in the CacheeLFU admission sketch. Not in the DashMap. Not in the bloom filter. Not in the L0 hot tier. The actual cache data structures were fine. The code that observed the cache was the problem.
None of these bugs were visible in single-threaded benchmarks, and none in 4-worker or 8-worker runs. They only surfaced at 96 concurrent workers on bare-metal Graviton4, where lock contention dominates everything. A write lock that takes 50 nanoseconds under no contention takes 5 microseconds under 96-way contention: a 100x amplification that turns a trivial operation into the pipeline bottleneck.
The 21% raw baseline regression
The table shows a separate issue: the raw DashMap path itself dropped from 2.2M (Feb/Mar) to 1.7M (April) on the same c8g.metal-48xl instance type. This is not CacheeEngine's fault — the cache layer was not involved in the raw baseline measurement. The regression is under investigation. The primary suspect is a Dilithium signing change between v10/v11 and the April build, but it requires dedicated profiling to confirm. We documented this as an open item and will report the root cause when we find it.
Why we publish the lower number
We could report 2,209,429 auth/sec. It was measured, it was real, and it's a bigger headline. But nobody deploys the raw DashMap configuration. Every customer runs with CacheeLFU admission, atomic stats, histograms, and the pattern detector. The number that matters is the number that represents what ships.
1,667,875 auth/sec is the honest number. It includes every feature a production deployment has. It runs on the same Graviton4 metal our customers target. The CacheeEngine overhead is 2.37% — less than measurement variance between benchmark runs.
The three-bug story is worth telling because it's a pattern that applies to every high-concurrency Rust system: the code that observes your hot path will become your hot path if you're not careful with your locking strategy. Atomic counters. Lock-free histograms. Sampled writes. These are not optimizations — they are requirements at 96-worker scale.
Before fixes: 183,828 auth/sec — 9.3x regression from raw
After fixes: 1,667,875 auth/sec — 2.37% overhead vs raw
Recovery: 9.1x
The lesson: an RwLock on the hot path is almost always wrong at scale, even when the code inside the lock is trivial.
Related Reading
- How We Made Session Validation 30 Nanoseconds
- How Cachee Runs Admission Tracking for 10 Million Keys in 512 Kilobytes
- 16 Microseconds: Why We Stopped Using ElastiCache
1,667,875 auth/sec. Full production stack. 2.37% cache overhead.
CacheeEngine ships with lock-free instrumentation, CacheeLFU admission, and atomic telemetry. The overhead you pay for observability is less than measurement noise.