# v4.3 Cluster: 1.5µs Latency + Horizontal Scaling

Production Benchmark · Mar 7, 2026

Fast-lane middleware, pre-compression at write time, and pub/sub cache coherence.
Production measurements from a Graviton4 c8g.16xlarge instance with ElastiCache Redis.
All latencies are server-side `Instant::now()` deltas.
| Headline Metric | Value | vs v3.0 |
|---|---|---|
| L1 GET latency (avg) | 1.5µs | 3.1x faster |
| P99 latency | 3.7µs | 4.3x faster |
| GET throughput | 660K+ ops/sec | ~3x higher |
| P99 advantage vs Redis | 216x | up from 38x |
## Optimization Breakdown

| Optimization | What It Does | Latency Impact |
|---|---|---|
| Fast Lane Middleware | Intercepts GET /cache/:key before all middleware (compression, CORS, security headers) | -13µs per request |
| Pre-Compression | Brotli + gzip stored at write time; serves pre-compressed bytes on read | -2µs (no runtime compress) |
| Inline Auth | Constant-time API key check in the fast lane (~1µs), no middleware dispatch | -1µs per request |
| ETag / 304 | If-None-Match support; returns 304 Not Modified without a body | 0 bytes transferred on match |
| Request Deduplication | Concurrent identical GETs coalesced; waiters get the first requester's result | Eliminates redundant lookups |
| Pub/Sub Coherence | Redis channel broadcasts invalidations to all instances | ~1ms propagation |
| Instance Registry | Redis hash + TTL heartbeat keys for instance discovery | Auto-cleanup of stale nodes |
| L2 Promotion | Redis miss → L1 set (with pre-compression) for subsequent hits | Cold start → hot in 1 request |
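The fast-lane dispatch test is simple enough to sketch. A minimal std-only version of the routing decision, assuming names that are illustrative rather than the shipped API (the real handler also performs auth, ETag, and encoding negotiation):

```rust
/// Decide whether a request can take the fast lane: only `GET /cache/:key`
/// qualifies; everything else falls through to the full middleware stack.
/// Simplified sketch; function and route shapes are assumptions.
fn fast_lane_key<'a>(method: &str, path: &'a str) -> Option<&'a str> {
    if method != "GET" {
        return None; // writes and deletes always take the normal path
    }
    // Strip the route prefix; no match means no fast lane.
    let key = path.strip_prefix("/cache/")?;
    // Reject empty keys and nested paths so other routes are not shadowed.
    if key.is_empty() || key.contains('/') {
        return None;
    }
    Some(key)
}
```

Because this check runs before the Axum router and middleware tower, a hit skips the entire per-request middleware cost the table attributes ~13µs to.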
## Before vs After: v4.2 → v4.3

| Metric | v4.2 (Feb 2026) | v4.3 (Mar 2026) | Improvement |
|---|---|---|---|
| L1 GET latency (avg) | 4.65µs | 1.5µs | 3.1x faster |
| P50 latency | ~4µs | 1.4µs | 2.9x faster |
| P99 latency | ~16µs | 3.7µs | 4.3x faster |
| HTTP response (L1 hit) | 14.5µs (through middleware) | 1.5µs (fast lane) | 9.7x faster |
| GET throughput | 215K ops/sec | 660K+ ops/sec | ~3x higher |
| P99 vs Redis | 38x faster | 216x faster | 5.7x better |
| Horizontal scaling | Single instance only | Pub/sub coherent cluster | Virtually unlimited |
| L1 hit rate | 100% (warm set) | 99.05% (production) | Real production measurement |
## Horizontal Scaling Architecture

### Pub/Sub Cache Coherence

```
                       Client Request
                              │
                              ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│   Instance A    │  │   Instance B    │  │   Instance N    │
│  ┌───────────┐  │  │  ┌───────────┐  │  │  ┌───────────┐  │
│  │ Fast Lane │  │  │  │ Fast Lane │  │  │  │ Fast Lane │  │
│  │   1.5µs   │  │  │  │   1.5µs   │  │  │  │   1.5µs   │  │
│  └─────┬─────┘  │  │  └─────┬─────┘  │  │  └─────┬─────┘  │
│  ┌─────▼─────┐  │  │  ┌─────▼─────┐  │  │  ┌─────▼─────┐  │
│  │  DashMap  │  │  │  │  DashMap  │  │  │  │  DashMap  │  │
│  │ L1 Cache  │  │  │  │ L1 Cache  │  │  │  │ L1 Cache  │  │
│  └───────────┘  │  │  └───────────┘  │  │  └───────────┘  │
└────────┬────────┘  └────────┬────────┘  └────────┬────────┘
         │                    │                    │
         └──────────┬─────────┴──────────┬─────────┘
                    │                    │
             ┌──────▼────────┐    ┌──────▼────────┐
             │ Redis Pub/Sub │    │   Redis L2    │
             │ Invalidation  │    │  Persistence  │
             └───────────────┘    └───────────────┘
```
### How it works

When any instance deletes or updates a key, it publishes to the `cachee:invalidate` channel. All other instances subscribe and evict the key from their local L1. Self-origin messages are filtered by instance ID to prevent loops.
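The subscriber side reduces to a small handler. A minimal sketch, assuming a message carrying the origin instance ID and the key (the post does not show the actual wire format, so these field names are illustrative):

```rust
use std::collections::HashMap;

/// Invalidation message as received on the `cachee:invalidate` channel.
/// Field names are assumptions for illustration.
struct Invalidation {
    origin: String, // instance ID of the publisher
    key: String,    // cache key to evict
}

/// Apply one invalidation to the local L1; returns true if an eviction
/// happened. Self-origin messages are dropped: the publisher already
/// updated its own L1, and re-evicting would thrash its hot set.
fn apply_invalidation(
    local_id: &str,
    msg: &Invalidation,
    l1: &mut HashMap<String, Vec<u8>>,
) -> bool {
    if msg.origin == local_id {
        return false; // filter self-origin to prevent loops
    }
    l1.remove(&msg.key).is_some()
}
```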
## Scaling Projection

| Instances | Throughput (GET) | L1 Latency | Coherence Overhead |
|---|---|---|---|
| 1 | 660K+ ops/sec | 1.5µs | None |
| 3 | ~2M ops/sec | 1.5µs | ~1ms invalidation propagation |
| 10 | ~6.6M ops/sec | 1.5µs | ~1ms invalidation propagation |
| N | N × 660K ops/sec | 1.5µs | Redis pub/sub fan-out |

### Virtually unlimited throughput
Each instance serves reads independently from local L1 memory. Adding instances scales
read throughput linearly. The only shared state is Redis L2 (for cold starts) and the
pub/sub channel (for invalidation). Write-heavy workloads are bounded by Redis pub/sub
fan-out, which handles millions of messages/sec.
## Production Test Results — Mar 7, 2026

### Integration Test on Graviton4 c8g.16xlarge

- **SET**: 0.686ms (includes L2 write-through to Redis)
- **GET**: L1 HIT, gzip content-encoding, served by the fast lane
- **DELETE**: 200 OK, L1 + L2 cleared, pub/sub broadcast sent
- **GET (miss)**: miss after DELETE (confirmed invalidation)
- **Pub/Sub**: subscriber connected; channel active, ready for multi-instance
### Latency Benchmark — 100 Serial Requests

| Metric | Value | Notes |
|---|---|---|
| L1 average | 1.5µs | 0.0015ms, server-side `Instant::now()` |
| L1 P50 | 1.4µs | 0.0014ms, median response |
| L1 P99 | 3.7µs | 0.0037ms, tail latency |
| L1 hit rate | 99.05% | 104/105 hits, 0 errors |
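For anyone reproducing these numbers, the percentiles can be recomputed from raw samples. A nearest-rank sketch; the method is an assumption, since the post reports P50/P99 but not how they were derived from the serial requests:

```rust
/// Nearest-rank percentile over latency samples (microseconds).
/// rank = ceil(p/100 * n), 1-based, clamped to the sample range.
/// The percentile method is an assumption for illustration.
fn percentile(samples: &[f64], p: f64) -> f64 {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.max(1).min(sorted.len()) - 1]
}
```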
## Infrastructure

### Compute

| Spec | Value |
|---|---|
| Instance | c8g.16xlarge (Graviton4) |
| Architecture | ARM64 (aarch64) |
| vCPUs | 64 |
| Region | us-east-1 |
| Container | Docker (121MB image) |
| Base image | debian:bookworm-slim |

### Cache Stack

| Component | Value |
|---|---|
| L1 Engine | DashMap (lock-free concurrent map) |
| L2 Backend | ElastiCache Redis |
| Compression | Brotli (q=4) + gzip (level 6) at write time |
| Auth | Inline constant-time compare (`subtle` crate) |
| HTTP Server | Axum + Hyper + Tokio |
| Coherence | Redis pub/sub + instance registry |
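The Auth row refers to a constant-time key comparison via the `subtle` crate. The idea, sketched with plain std for illustration (production code should use `subtle::ConstantTimeEq` rather than hand-rolling this):

```rust
/// Constant-time byte comparison in the spirit of `subtle`: fold XOR
/// differences instead of returning at the first mismatch, so response
/// timing does not reveal how many leading bytes of the key matched.
/// Illustrative only; the shipped code uses the `subtle` crate.
fn keys_equal(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false; // key length is not treated as secret here
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // accumulate differences without branching
    }
    diff == 0
}
```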
## Room for Improvement

### Future Optimizations

- **io_uring (zero-copy I/O)**: Replace epoll with io_uring for syscall-free socket reads. Potential 20-30% latency reduction on Linux 6.1+.
- **CPU pinning (NUMA-aware workers)**: Pin Tokio worker threads to specific cores. Eliminates cache-line bouncing on multi-socket systems.
- **Huge pages (2MB TLB entries)**: Reduce TLB misses for large DashMap instances. Especially impactful at 10M+ keys.
- **DPDK / XDP (kernel-bypass networking)**: Bypass the kernel network stack entirely. Sub-microsecond responses possible with a dedicated NIC.