Cachee v4.3 drops L1 GET latency from 4.65µs to 1.5µs and adds horizontal scaling with pub/sub cache coherence. These are production measurements from a Graviton4 c8g.16xlarge running our native cache server. Here's what we changed and why.

1.5µs L1 GET avg (was 4.65µs)
3.7µs P99 latency (was ~16µs)
660K+ ops/sec (was 215K)
216x P99 vs Redis (was 38x)

The Bottleneck Wasn't the Cache

Our v3.0 engine (DashMap + adaptive eviction) could look up a key in 4.65µs. But the HTTP response — the thing the client actually sees — took 14.5µs. Where did the extra 10µs go?

Middleware. Every GET request passed through compression negotiation, CORS headers, security headers, rate limiting, and authentication. Each layer adds dispatch overhead, header allocation, and async context switches. For a cache hit that takes microseconds, the middleware stack was the bottleneck — not the cache itself.

The insight: For GET /cache/:key, most middleware does nothing useful. Compression? We can pre-compress at write time. CORS? Not needed for API calls with API keys. Security headers? Already set by the CDN. We needed a fast lane.

The Fast Lane

The fast lane is an Axum middleware that sits before all other middleware. It intercepts GET /cache/:key requests and serves them directly — inline auth, DashMap lookup, pre-compressed response body. Everything else passes through to the normal middleware stack.

```rust
// Fast lane intercept — outermost middleware layer
pub async fn intercept(
    State(state): State<AppState>,
    req: Request,
    next: Next,
) -> Response {
    if req.method() == Method::GET {
        if let Some(key) = req.uri().path().strip_prefix("/cache/") {
            // Inline auth (~1µs, constant-time)
            // DashMap lookup (~0.5µs)
            // Return pre-compressed bytes
            return fast_cache_get(&state, req.headers(), key);
        }
    }
    // Everything else → full middleware stack
    next.run(req).await
}
```

The key design decision: this is a middleware, not a separate router. Our first implementation used a separate Axum Router with a fallback, but Axum returns 405 Method Not Allowed when a path matches but the method doesn't — so DELETE /cache/:key returned 405 instead of falling through. The middleware approach lets non-GET requests pass through cleanly.

Pre-Compression at Write Time

In v3.0, every GET response ran through compression middleware that negotiated Accept-Encoding and compressed on the fly. For small values this was fast, but for anything over 1KB it added measurable overhead.

v4.3 pre-compresses values at write time. When you SET a key, we store three copies:

raw: original bytes (identity)
brotli: compressed at quality 4 (fast, ~65% ratio)
gzip: compressed at level 6 (compatible, ~60% ratio)

On GET, the fast lane checks Accept-Encoding and returns the matching pre-compressed variant. Zero runtime compression. Zero negotiation overhead. Just a DashMap lookup and a pre-built response body.
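The variant selection itself is a cheap string check. Here's a minimal, std-only sketch of that negotiation step; `select_variant` is an illustrative name, and Cachee's real parser may also honor q-values, which this sketch ignores:

```rust
/// Pick the pre-compressed variant matching an Accept-Encoding header.
/// Illustrative sketch: prefers brotli (best ratio), then gzip, else raw.
fn select_variant(accept_encoding: &str) -> &'static str {
    // Split "gzip, deflate, br;q=0.9" into bare encoding tokens.
    let tokens: Vec<&str> = accept_encoding
        .split(',')
        .map(|t| t.trim().split(';').next().unwrap_or("").trim())
        .collect();
    if tokens.contains(&"br") {
        "brotli"
    } else if tokens.contains(&"gzip") {
        "gzip"
    } else {
        "raw"
    }
}
```

Because all three variants were built at SET time, this check is the only per-request work: no compressor ever runs on the GET path.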

Inline Authentication

The standard auth middleware extracted the API key, dispatched through a middleware chain, and performed the comparison. The fast lane does it inline with constant-time comparison using the subtle crate:

```rust
// ~1µs — constant-time, no middleware dispatch
fn inline_auth_check(headers: &HeaderMap, state: &AppState) -> bool {
    let Some(key) = headers.get("x-api-key").map(|v| v.as_bytes()) else {
        return false;
    };
    state.api_keys.iter().any(|stored| {
        stored.len() == key.len()
            && bool::from(stored.as_bytes().ct_eq(key)) // constant-time
    })
}
```

No timing side-channels. No async overhead. No middleware dispatch. Just a direct comparison in the hot path.
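For readers unfamiliar with the technique: a constant-time comparison XOR-accumulates every byte pair and never exits early, so response timing doesn't leak the position of the first mismatch. A conceptual, std-only sketch of what `subtle`'s `ct_eq` does (the crate adds compiler-barrier hardening this sketch omits):

```rust
/// Constant-time byte equality, conceptually: always scan the full
/// slice, accumulating differences, instead of returning at the
/// first mismatching byte.
fn ct_eq_bytes(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false; // length is not secret here
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // nonzero iff any byte differs
    }
    diff == 0
}
```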

Horizontal Scaling with Pub/Sub

v4.3 adds the infrastructure for horizontal scaling. Each Cachee instance maintains its own L1 DashMap (fast, local reads) and shares a Redis L2 backend. The challenge: cache coherence.

When Instance A deletes a key, Instance B still has the old value in L1. Without coherence, you serve stale data until TTL expires.

Our solution: Redis pub/sub. Every mutation (delete, update) publishes to the cachee:invalidate channel. All instances subscribe and evict invalidated keys from their local L1. Self-origin messages are filtered by instance ID to prevent loops.

Why pub/sub, not polling? Polling adds latency proportional to the poll interval. Pub/sub propagates in ~1ms regardless of cluster size. Redis handles millions of pub/sub messages per second — it's not a bottleneck.
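The subscriber side reduces to a filter: apply every invalidation except your own. A minimal sketch of that logic, decoupled from Redis; the struct fields and function name are illustrative, not Cachee's actual wire format:

```rust
/// One message on the cachee:invalidate channel (illustrative shape).
#[derive(Debug, Clone)]
struct Invalidation {
    origin: String, // instance ID of the publisher
    key: String,    // cache key to evict from local L1
}

/// Keys this instance should evict from its L1 DashMap.
/// Drops self-origin messages so an instance never reacts to its
/// own mutations (preventing loops).
fn keys_to_evict(my_id: &str, msgs: &[Invalidation]) -> Vec<String> {
    msgs.iter()
        .filter(|m| m.origin != my_id)
        .map(|m| m.key.clone())
        .collect()
}
```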

Each instance registers in a Redis hash (cachee:instances) with TTL heartbeat keys. The control plane instance monitors healthy instances and can trigger ECS Fargate scale-out when load increases.
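The health check the control plane runs over that hash is simple: an instance is alive if its last heartbeat is fresher than the TTL window. A std-only sketch under that assumption (names illustrative; the real implementation reads heartbeat keys from Redis):

```rust
use std::time::Duration;

/// Filter the instance registry down to healthy members: those whose
/// last heartbeat is younger than the TTL window.
fn healthy_instances<'a>(
    heartbeats: &[(&'a str, Duration)], // (instance_id, age of last heartbeat)
    ttl: Duration,
) -> Vec<&'a str> {
    heartbeats
        .iter()
        .filter(|(_, age)| *age < ttl)
        .map(|(id, _)| *id)
        .collect()
}
```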

The Numbers

Metric          v4.2 (Feb 2026)    v4.3 (Mar 2026)
L1 GET avg      4.65µs             1.5µs
P99 latency     ~16µs              3.7µs
HTTP response   14.5µs             1.5µs
Throughput      215K ops/sec       660K+ ops/sec
P99 vs Redis    38x                216x
Scaling         Single instance    Pub/sub cluster

All latencies are server-side monotonic clock deltas. This includes inline auth + DashMap lookup + HTTP response body construction. Client-side adds TCP/kernel overhead (~0.2-0.5ms for keep-alive connections, ~1-3ms for cold curl requests).

Measured on a Graviton4 c8g.16xlarge (64 vCPU, ARM64) in us-east-1 with 100 serial requests. L1 hit rate: 99.05% (104/105 hits, 0 errors).

What's Next

1.5µs is fast, but there's still room. We're exploring:

io_uring — Replace epoll with io_uring for syscall-free socket reads. Potential 20-30% latency reduction on Linux 6.1+.

CPU pinning — Pin Tokio worker threads to specific cores. Eliminates cache-line bouncing on multi-socket systems.

Huge pages — 2MB TLB entries for large DashMap instances. Especially impactful at 10M+ keys.

DPDK/XDP — Kernel bypass networking for dedicated NIC setups. Sub-microsecond responses become possible when you skip the kernel entirely.

Ready for 1.5µs cache hits?

Deploy Cachee in under an hour. Microsecond-scale cache hits on day one. No migration required.

Start Free Trial