Cachee v4.3 drops L1 GET latency from 4.65µs to 31ns and adds horizontal scaling with pub/sub cache coherence. These are production measurements from a Graviton4 c8g.16xlarge running our native cache server. Here's what we changed and why.
The Bottleneck Wasn't the Cache
Our v3.0 engine (DashMap + adaptive eviction) could look up a key in 4.65µs. But the HTTP response — the thing the client actually sees — took 14.5µs. Where did the extra 10µs go?
Middleware. Every GET request passed through compression negotiation, CORS headers, security headers, rate limiting middleware, and authentication middleware. Each layer adds dispatch overhead, header allocation, and async context switches. For a cache hit that takes microseconds, the middleware stack was the bottleneck — not the cache itself.
The insight: For GET /cache/:key, most middleware does nothing useful. Compression? We can pre-compress at write time. CORS? Not needed for API calls with API keys. Security headers? Already set by the CDN. We needed a fast lane.
The Fast Lane
The fast lane is an Axum middleware that sits before all other middleware. It intercepts GET /cache/:key requests and serves them directly — inline auth, DashMap lookup, pre-compressed response body. Everything else passes through to the normal middleware stack.
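The dispatch decision can be sketched as a small predicate. This is an illustrative, dependency-free sketch, not Cachee's actual Axum middleware: names are hypothetical, and the real version runs inside a Tower/Axum layer. A request takes the fast lane only when it is a GET for exactly one key segment under /cache/; everything else falls through.

```rust
// Sketch of the fast-lane dispatch decision (names are illustrative,
// not Cachee's actual API). A request takes the fast lane only if it
// is a GET for /cache/:key; everything else falls through to the
// normal middleware stack.
fn fast_lane_key<'a>(method: &str, path: &'a str) -> Option<&'a str> {
    if method != "GET" {
        return None; // non-GET passes through untouched
    }
    let key = path.strip_prefix("/cache/")?;
    // Reject nested paths like /cache/a/b; only a single key segment qualifies.
    if key.is_empty() || key.contains('/') {
        return None;
    }
    Some(key)
}

fn main() {
    assert_eq!(fast_lane_key("GET", "/cache/user:42"), Some("user:42"));
    assert_eq!(fast_lane_key("DELETE", "/cache/user:42"), None); // falls through
    assert_eq!(fast_lane_key("GET", "/stats"), None);
}
```

Returning `None` for non-GET methods is exactly what sidesteps the 405 problem described below: the request never matches the fast lane, so the normal router handles it.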
The key design decision: this is a middleware, not a separate router. Our first implementation used a separate Axum Router with a fallback, but Axum returns 405 Method Not Allowed when a path matches but the method doesn't — so DELETE /cache/:key returned 405 instead of falling through. The middleware approach lets non-GET requests pass through cleanly.
Pre-Compression at Write Time
In v3.0, every GET response ran through compression middleware that negotiated Accept-Encoding and compressed on the fly. For small values this was fast, but for anything over 1KB it added measurable overhead.
v4.3 pre-compresses values at write time. When you SET a key, we store three copies of the value: the raw bytes plus two pre-compressed variants.
On GET, the fast lane checks Accept-Encoding and returns the matching pre-compressed variant. Zero runtime compression. Zero negotiation overhead. Just a DashMap lookup and a pre-built response body.
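A minimal sketch of the idea, with assumptions labeled: the three variants here (identity, gzip, brotli) and the field names are illustrative, and the actual compression step at SET time (which would use real gzip/brotli encoders) is stubbed out so the example stays dependency-free. The point is the GET path: variant selection is a field read, not a compression pass.

```rust
// Illustrative sketch of write-time pre-compression and read-time variant
// selection. The concrete encodings and field names are assumptions; the
// real compression step (a gzip/brotli encoder at SET time) is stubbed out.
struct StoredValue {
    identity: Vec<u8>, // raw bytes, always present
    gzip: Vec<u8>,     // would be produced by a gzip encoder at SET time
    brotli: Vec<u8>,   // would be produced by a brotli encoder at SET time
}

impl StoredValue {
    // Pick the best pre-built body for the client's Accept-Encoding header.
    // No compression happens here: it is a field read, nothing more.
    fn select(&self, accept_encoding: &str) -> (&'static str, &[u8]) {
        if accept_encoding.contains("br") {
            ("br", &self.brotli)
        } else if accept_encoding.contains("gzip") {
            ("gzip", &self.gzip)
        } else {
            ("identity", &self.identity)
        }
    }
}

fn main() {
    let v = StoredValue {
        identity: b"hello".to_vec(),
        gzip: b"<gzip bytes>".to_vec(),
        brotli: b"<br bytes>".to_vec(),
    };
    assert_eq!(v.select("gzip, deflate, br").0, "br");
    assert_eq!(v.select("gzip").0, "gzip");
    assert_eq!(v.select("").0, "identity");
}
```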
Inline Authentication
The standard auth middleware extracted the API key, dispatched through a middleware chain, and performed the comparison. The fast lane does it inline with a constant-time comparison using the subtle crate.
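A dependency-free sketch of what constant-time comparison means; the production path would use subtle's `ConstantTimeEq` rather than this hand-rolled version. The idea: examine every byte regardless of where a mismatch occurs, so comparison time leaks nothing about how much of the key an attacker guessed correctly.

```rust
// Dependency-free sketch of a constant-time comparison. Production code
// uses subtle::ConstantTimeEq; this version shows the idea: touch every
// byte no matter where the first mismatch is, so timing reveals nothing.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false; // length is not secret here, only the bytes are
    }
    // OR together the XOR of every byte pair; zero iff all bytes match.
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

fn main() {
    assert!(ct_eq(b"sk-live-abc123", b"sk-live-abc123"));
    assert!(!ct_eq(b"sk-live-abc123", b"sk-live-abc124"));
    assert!(!ct_eq(b"short", b"longer-key"));
}
```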
No timing side-channels. No async overhead. No middleware dispatch. Just a direct comparison in the hot path.
Horizontal Scaling with Pub/Sub
v4.3 adds the infrastructure for horizontal scaling. Each Cachee instance maintains its own L1 DashMap (fast, local reads) and shares a Redis L2 backend. The challenge: cache coherence.
When Instance A deletes a key, Instance B still has the old value in L1. Without coherence, you serve stale data until TTL expires.
Our solution: Redis pub/sub. Every mutation (delete, update) publishes to the cachee:invalidate channel. All instances subscribe and evict invalidated keys from their local L1. Self-origin messages are filtered by instance ID to prevent loops.
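The subscriber side can be sketched in a few lines. The message shape and names below are assumptions for illustration; in the real system the message arrives over the Redis cachee:invalidate channel and the local L1 is a DashMap, not a HashMap.

```rust
use std::collections::HashMap;

// Sketch of the subscriber side of cache coherence. Message shape and
// names are illustrative; the real message travels over the Redis
// cachee:invalidate channel and the local L1 is a DashMap.
struct Invalidation {
    origin: String, // instance ID of the mutating node
    key: String,
}

// Apply an invalidation to the local L1, skipping self-origin messages so
// an instance never reacts to its own mutations. Returns true if evicted.
fn apply(l1: &mut HashMap<String, Vec<u8>>, my_id: &str, msg: &Invalidation) -> bool {
    if msg.origin == my_id {
        return false; // self-origin: loop prevention
    }
    l1.remove(&msg.key).is_some()
}

fn main() {
    let mut l1 = HashMap::new();
    l1.insert("user:42".to_string(), b"stale".to_vec());
    let msg = Invalidation { origin: "node-a".into(), key: "user:42".into() };
    // Instance B evicts; instance A (the originator) ignores its own message.
    assert!(apply(&mut l1, "node-b", &msg));
    assert!(!l1.contains_key("user:42"));
    assert!(!apply(&mut l1, "node-a", &msg)); // self-origin is a no-op
}
```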
Why pub/sub, not polling? Polling adds latency proportional to the poll interval. Pub/sub propagates in ~1ms regardless of cluster size. Redis handles millions of pub/sub messages per second — it's not a bottleneck.
Each instance registers in a Redis hash (cachee:instances) with TTL heartbeat keys. The control plane instance monitors healthy instances and can trigger ECS Fargate scale-out when load increases.
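The liveness check itself is simple. This is a sketch under stated assumptions: in the real system the timestamps come from TTL heartbeat keys in Redis and the registry is the cachee:instances hash; here plain epoch-second integers stand in for both.

```rust
use std::collections::HashMap;

// Sketch of the control plane's liveness check. Timestamps stand in for
// TTL heartbeat keys in Redis; the registry stands in for the
// cachee:instances hash.
fn healthy_instances(
    registry: &HashMap<String, u64>, // instance ID -> last heartbeat (epoch secs)
    now: u64,
    ttl_secs: u64,
) -> Vec<String> {
    let mut alive: Vec<String> = registry
        .iter()
        .filter(|(_, &seen)| now.saturating_sub(seen) <= ttl_secs)
        .map(|(id, _)| id.clone())
        .collect();
    alive.sort(); // deterministic order for the scale-out decision
    alive
}

fn main() {
    let mut reg = HashMap::new();
    reg.insert("node-a".to_string(), 1000);
    reg.insert("node-b".to_string(), 980); // heartbeat is 25s old at t=1005
    assert_eq!(healthy_instances(&reg, 1005, 10), vec!["node-a".to_string()]);
}
```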
The Numbers
All latencies are server-side monotonic clock deltas. This includes inline auth + DashMap lookup + HTTP response body construction. Client-side adds TCP/kernel overhead (~0.2-0.5ms for keep-alive connections, ~1-3ms for cold curl requests).
Measured on a Graviton4 c8g.16xlarge (64 vCPU, ARM64) in us-east-1 with 100 serial requests. L1 hit rate: ~99% (104/105 hits, 0 errors).
What's Next
31ns is fast, but there's still room. We're exploring:
io_uring — Replace epoll with io_uring for syscall-free socket reads. Potential 20-30% latency reduction on Linux 6.1+.
CPU pinning — Pin Tokio worker threads to specific cores. Eliminates cache-line bouncing on multi-socket systems.
Huge pages — 2MB TLB entries for large DashMap instances. Especially impactful at 10M+ keys.
DPDK/XDP — Kernel bypass networking for dedicated NIC setups. Sub-microsecond responses become possible when you skip the kernel entirely.
Ready for 31ns cache hits?
Deploy Cachee in under an hour. Sub-microsecond latency on day one. No migration required.
Related Reading
The Numbers That Matter
Cache performance discussions get philosophical fast. Here are the actual measured numbers from production deployments running on documented hardware, so you can compare against your own infrastructure instead of trusting marketing copy.
- L0 hot path GET: 28.9 nanoseconds on Apple M4 Max, single-threaded against pre-warmed in-memory cache. This is the floor — there's no faster way to read a key.
- L1 CacheeLFU GET: ~89 nanoseconds on AWS Graviton4 (c8g.metal-48xl). Sharded DashMap with admission filtering.
- Sustained throughput: 32 million ops/sec single-threaded on M4 Max, 7.41 million ops/sec at 16 workers on Graviton4 c8g.16xlarge.
- L2 fallback: Sub-millisecond hits against ElastiCache Redis 7.4 over a same-AZ network when an L1 miss falls through to L2.
The compounding effect matters more than any single number. A 28-nanosecond L0 hit means your application spends almost zero time on cache lookups in the hot path, leaving the CPU free for the actual business logic that generates revenue.
Average Latency Hides The Real Story
Average latency is the most misleading number in cache benchmarking. The percentile distribution is what actually breaks production systems. Tail latency — the slowest 0.1% of requests — is where users notice the lag and where SLAs get violated.
| Percentile | Network Redis (same-AZ) | In-process L0 |
|---|---|---|
| p50 | ~85 microseconds | 28.9 nanoseconds |
| p95 | ~140 microseconds | ~45 nanoseconds |
| p99 | ~280 microseconds | ~80 nanoseconds |
| p99.9 | ~1.2 milliseconds | ~150 nanoseconds |
The p99.9 spike on networked Redis isn't a bug — it's the cost of running a single-threaded event loop that occasionally blocks on background tasks like RDB snapshots, AOF rewrites, and expired-key sweeps. Cachee's L0 stays inside a few hundred nanoseconds because the hot-path read is a lock-free shard lookup with no background work scheduled on the same thread.
If your application is sensitive to tail latency — payments, real-time bidding, fraud detection, trading — the p99.9 number is the one to optimize against. Average latency improvements that don't move the tail are vanity metrics.
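The measurement methodology above can be sketched with a nearest-rank percentile over raw samples. This is generic measurement code, not Cachee's internals: the point is that p99.9 comes from sorting every sample, and that an average can look harmless while the tail does not.

```rust
// Sketch of nearest-rank percentile extraction from raw latency samples.
// Generic measurement code, not Cachee internals: p99.9 comes from
// sorting every sample, never from an average.
fn percentile(sorted: &[u64], p: f64) -> u64 {
    assert!(!sorted.is_empty() && (0.0..=100.0).contains(&p));
    // Nearest-rank: index of the smallest sample covering p percent.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.saturating_sub(1).min(sorted.len() - 1)]
}

fn main() {
    // 1000 synthetic samples: mostly fast, with a slow tail the mean hides.
    let mut ns: Vec<u64> = vec![30; 990];
    ns.extend(vec![150; 10]); // the slowest 1%
    ns.sort();
    let mean: u64 = ns.iter().sum::<u64>() / ns.len() as u64;
    assert_eq!(mean, 31);                   // average looks harmless
    assert_eq!(percentile(&ns, 50.0), 30);  // p50
    assert_eq!(percentile(&ns, 99.9), 150); // the tail tells the real story
}
```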
Memory Efficiency Is The Hidden Cost Lever
Throughput numbers get the headlines but memory efficiency determines your monthly bill. A cache that stores the same hot data in less RAM lets you run a smaller instance class — and on AWS that's the difference between profitable and breakeven for a lot of services.
Redis stores each key as a Simple Dynamic String with 16 bytes of header overhead, plus dictEntry pointers in the main hashtable, plus embedded TTL metadata. For 1KB values, per-entry overhead lands around 1100-1200 bytes once you account for hashtable load factor and slab fragmentation. At a million keys, that's roughly 1.2 GB of resident memory in overhead alone, before counting the values themselves.
Cachee's L1 layer uses sharded DashMap entries with compact packing — a 64-bit key hash, value bytes, an 8-byte expiry timestamp, and a small frequency counter for the CacheeLFU admission filter. Per-entry overhead lands at roughly 40 bytes of structural data on top of the value itself. For the same million-key workload of 1KB values, that cuts resident memory roughly in half (about 1.1 GB versus 2.2 GB). On AWS ElastiCache pricing, that gap is the difference between needing a cache.r7g.large versus a cache.r7g.xlarge for borderline workloads.
Observability And What To Measure
You can't tune what you can't measure. The four metrics that matter for any production cache deployment, in order of importance:
- Hit rate, broken down by key prefix or namespace. A global hit rate of 92% sounds great until you discover that one critical namespace is sitting at 40% and dragging your tail latency. Per-prefix hit rates expose which workloads are getting cache value and which aren't.
- Latency percentiles, not averages. p50, p95, p99, and p99.9 for both cache hits and cache misses. The cache miss latency is your fallback path performance — when the cache fails, this is what your users actually experience.
- Memory pressure and eviction rate. If your eviction rate is climbing while your hit rate stays flat, you're under-provisioned. If both are climbing, your access pattern shifted and you need to retune TTLs or rethink what you're caching.
- Stale-read rate. The percentage of cache hits that returned a value the application then discovered was stale. This is the canary for your invalidation strategy. If it's above 1%, your invalidation logic has a bug.
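The first metric above, per-prefix hit rate, can be sketched in a few lines. The prefix rule (everything before the first ':') and the struct names are assumptions for illustration; the real exporter publishes these counters through Prometheus.

```rust
use std::collections::HashMap;

// Sketch of per-prefix hit-rate accounting. The prefix rule (everything
// before the first ':') and the names are assumptions; the real exporter
// publishes these counters via Prometheus.
#[derive(Default)]
struct HitStats {
    hits: u64,
    total: u64,
}

fn record(stats: &mut HashMap<String, HitStats>, key: &str, hit: bool) {
    // Namespace = key prefix before the first ':' (e.g. "user" in "user:42").
    let prefix = key.split(':').next().unwrap_or(key).to_string();
    let s = stats.entry(prefix).or_default();
    s.total += 1;
    if hit {
        s.hits += 1;
    }
}

fn hit_rate(stats: &HashMap<String, HitStats>, prefix: &str) -> f64 {
    match stats.get(prefix) {
        Some(s) if s.total > 0 => s.hits as f64 / s.total as f64,
        _ => 0.0,
    }
}

fn main() {
    let mut stats = HashMap::new();
    record(&mut stats, "user:1", true);
    record(&mut stats, "user:2", true);
    record(&mut stats, "session:9", false);
    // A healthy-looking global rate can hide a cold namespace.
    assert_eq!(hit_rate(&stats, "user"), 1.0);
    assert_eq!(hit_rate(&stats, "session"), 0.0);
}
```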
Cachee exposes all four out of the box via Prometheus metrics on the standard scrape endpoint, plus a real-time SSE stream for dashboards that need sub-second visibility. The right time to wire these into your monitoring stack is before the migration, not after the first incident.