Production Benchmark Mar 7, 2026

v4.3 Cluster: 1.5µs Latency + Horizontal Scaling

Fast-lane middleware, write-time pre-compression, and pub/sub cache coherence. Production measurements from a Graviton4 c8g.16xlarge instance with ElastiCache Redis. All latencies are server-side Instant::now() deltas.

L1 GET latency (avg) — 1.5µs (3.1x faster than v3.0)
P99 latency — 3.7µs (4.3x faster than v3.0)
GET throughput — 660K+ ops/sec (~3x increase)
P99 vs Redis — 216x (up from 38x in v3.0)

Optimization Breakdown

| Optimization | What It Does | Latency Impact |
| --- | --- | --- |
| Fast Lane Middleware | Intercepts GET /cache/:key before the full middleware stack (compression, CORS, security headers) runs | -13µs per request |
| Pre-Compression | Brotli + gzip stored at write time; reads serve pre-compressed bytes | -2µs (no runtime compression) |
| Inline Auth | Constant-time API key check inside the fast lane (~1µs), no middleware dispatch | -1µs per request |
| ETag / 304 | If-None-Match support; returns 304 Not Modified without a body | 0 bytes transferred on match |
| Request Deduplication | Concurrent identical GETs are coalesced; waiters receive the first requester's result | Eliminates redundant lookups |
| Pub/Sub Coherence | Redis channel broadcasts invalidations to all instances | ~1ms propagation |
| Instance Registry | Redis hash + TTL heartbeat keys for instance discovery | Auto-cleanup of stale nodes |
| L2 Promotion | On L1 miss, a Redis (L2) hit is promoted into L1 (with pre-compression) for subsequent hits | Cold start → hot in 1 request |
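The request-deduplication row can be sketched with std primitives. This is a hypothetical illustration, not the cache's actual implementation (the real fast lane runs under Tokio and would use async equivalents): concurrent identical GETs coalesce onto one in-flight lookup, and waiters receive the first requester's result.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

// One in-flight lookup per key. The first requester (the "leader") runs
// the real load; everyone else blocks on the slot and reuses its result.
struct Slot {
    val: Mutex<Option<String>>,
    ready: Condvar,
}

pub struct Dedup {
    inflight: Mutex<HashMap<String, Arc<Slot>>>,
}

impl Dedup {
    pub fn new() -> Self {
        Dedup { inflight: Mutex::new(HashMap::new()) }
    }

    /// First caller for `key` becomes the leader and runs `load`; callers
    /// arriving while the lookup is in flight wait on the same slot.
    pub fn get(&self, key: &str, load: impl FnOnce() -> String) -> String {
        let (slot, leader) = {
            let mut map = self.inflight.lock().unwrap();
            if let Some(s) = map.get(key) {
                (Arc::clone(s), false)
            } else {
                let s = Arc::new(Slot {
                    val: Mutex::new(None),
                    ready: Condvar::new(),
                });
                map.insert(key.to_string(), Arc::clone(&s));
                (s, true)
            }
        };
        if leader {
            let v = load();
            *slot.val.lock().unwrap() = Some(v.clone());
            slot.ready.notify_all();
            // Lookup finished: later requesters start a fresh load.
            self.inflight.lock().unwrap().remove(key);
            v
        } else {
            let mut guard = slot.val.lock().unwrap();
            while guard.is_none() {
                guard = slot.ready.wait(guard).unwrap();
            }
            guard.clone().unwrap()
        }
    }
}
```

Eight concurrent requests for the same key trigger exactly one backing lookup; the other seven reuse its result.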

Before vs After: v4.2 → v4.3

| Metric | v4.2 (Feb 2026) | v4.3 (Mar 2026) | Improvement |
| --- | --- | --- | --- |
| L1 GET Latency (avg) | 4.65µs | 1.5µs | 3.1x faster |
| P50 Latency | ~4µs | 1.4µs | 2.9x faster |
| P99 Latency | ~16µs | 3.7µs | 4.3x faster |
| HTTP Response (L1 hit) | 14.5µs (through middleware) | 1.5µs (fast lane) | 9.7x faster |
| GET Throughput | 215K ops/sec | 660K+ ops/sec | ~3x higher |
| P99 vs Redis | 38x faster | 216x faster | 5.7x better |
| Horizontal Scaling | Single instance only | Pub/sub coherent cluster | Virtually unlimited |
| L1 Hit Rate | 100% (warm set) | 99.05% (production) | Real production measurement |

Horizontal Scaling Architecture

Pub/Sub Cache Coherence
                        Client Request
                              │
                              ▼
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│   Instance A    │  │   Instance B    │  │   Instance N    │
│  ┌───────────┐  │  │  ┌───────────┐  │  │  ┌───────────┐  │
│  │ Fast Lane │  │  │  │ Fast Lane │  │  │  │ Fast Lane │  │
│  │   1.5µs   │  │  │  │   1.5µs   │  │  │  │   1.5µs   │  │
│  └─────┬─────┘  │  │  └─────┬─────┘  │  │  └─────┬─────┘  │
│  ┌─────▼─────┐  │  │  ┌─────▼─────┐  │  │  ┌─────▼─────┐  │
│  │  DashMap  │  │  │  │  DashMap  │  │  │  │  DashMap  │  │
│  │ L1 Cache  │  │  │  │ L1 Cache  │  │  │  │ L1 Cache  │  │
│  └───────────┘  │  │  └───────────┘  │  │  └───────────┘  │
└────────┬────────┘  └────────┬────────┘  └────────┬────────┘
         │                    │                    │
         └───────────┬───────┴──────────┬─────────┘
                     │                  │
              ┌──────▼──────┐    ┌──────▼──────┐
              │Redis Pub/Sub│    │  Redis L2   │
              │ Invalidation│    │ Persistence │
              └─────────────┘    └─────────────┘
How it works
When any instance deletes or updates a key, it publishes to cachee:invalidate. All other instances subscribe and evict the key from their local L1. Self-origin messages are filtered by instance ID to prevent loops.
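A minimal sketch of that invalidation handler, assuming a payload of (origin instance ID, key); the actual wire format on cachee:invalidate is not shown in this post, and all names here are hypothetical:

```rust
use std::collections::HashMap;

// Hypothetical decoded form of a message on the cachee:invalidate channel.
struct Invalidation {
    origin: String, // instance ID of the publisher
    key: String,    // cache key to evict
}

struct Instance {
    id: String,
    l1: HashMap<String, Vec<u8>>, // stand-in for the DashMap L1
}

impl Instance {
    /// Apply a pub/sub invalidation: evict the key from local L1 unless the
    /// message originated from this instance (prevents self-eviction loops,
    /// since the publisher already evicted locally before broadcasting).
    fn on_invalidation(&mut self, msg: &Invalidation) -> bool {
        if msg.origin == self.id {
            return false; // self-origin message, filtered by instance ID
        }
        self.l1.remove(&msg.key).is_some()
    }
}
```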

Scaling Projection

| Instances | Throughput (GET) | L1 Latency | Coherence Overhead |
| --- | --- | --- | --- |
| 1 | 660K+ ops/sec | 1.5µs | None |
| 3 | ~2M ops/sec | 1.5µs | ~1ms invalidation propagation |
| 10 | ~6.6M ops/sec | 1.5µs | ~1ms invalidation propagation |
| N | N x 660K ops/sec | 1.5µs | Redis pub/sub fan-out |
Virtually unlimited throughput
Each instance serves reads independently from local L1 memory, so adding instances scales read throughput linearly. The only shared state is Redis L2 (for cold starts) and the pub/sub channel (for invalidation). Write-heavy workloads are bounded by Redis pub/sub fan-out, which a single Redis node can typically sustain at millions of messages per second.
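The linear projection in the table reduces to simple arithmetic; the 660K figure is the measured single-instance GET rate from this post, and the real ceiling depends on the coherence overhead noted above:

```rust
// Measured single-instance GET throughput (ops/sec) from this benchmark.
const PER_INSTANCE_OPS: u64 = 660_000;

/// Projected aggregate GET throughput: each instance serves its own L1
/// reads independently, so throughput scales linearly with instance count.
fn projected_ops(instances: u64) -> u64 {
    instances * PER_INSTANCE_OPS
}
```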

Production Test Results — Mar 7, 2026

Integration Test on Graviton4 c8g.16xlarge
SET — 0.686ms (includes L2 write-through to Redis)
GET — L1 HIT, gzip content-encoding, served via the fast lane
DELETE — 200 OK, L1 + L2 cleared, pub/sub broadcast sent
GET (miss) — miss after DELETE (confirmed invalidation)
Pub/Sub — subscriber connected, channel active, ready for multi-instance
Latency Benchmark — 100 Serial Requests
L1 Average — 1.5µs (0.0015ms, server-side Instant::now())
L1 P50 — 1.4µs (0.0014ms, median response)
L1 P99 — 3.7µs (0.0037ms, tail latency)
L1 Hit Rate — 99.05% (104/105 hits, 0 errors)
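For reference, the P50/P99 figures can be reproduced from the raw Instant::now() deltas with a nearest-rank percentile. The post does not state which percentile method the benchmark uses, so this is one common choice, not necessarily the exact one:

```rust
/// Nearest-rank percentile: the ceil(p/100 * N)-th smallest sample
/// (1-indexed). Sorts in place; `samples` must be non-empty.
fn percentile(samples: &mut [u64], p: f64) -> u64 {
    assert!(!samples.is_empty() && (0.0..=100.0).contains(&p));
    samples.sort_unstable();
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank.saturating_sub(1)]
}
```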

Infrastructure

Compute

Instance: c8g.16xlarge (Graviton4)
Architecture: ARM64 (aarch64)
vCPUs: 64
Region: us-east-1
Container: Docker (121MB image)
Base image: debian:bookworm-slim

Cache Stack

L1 Engine: DashMap (lock-free concurrent)
L2 Backend: ElastiCache Redis
Compression: Brotli (q=4) + gzip (level 6) at write time
Auth: Inline constant-time (subtle)
HTTP Server: Axum + Hyper + Tokio
Coherence: Redis pub/sub + instance registry
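The auth entry refers to the subtle crate. A dependency-free sketch of the same constant-time idea (accumulate byte differences with XOR, never exit early) looks like this; production code should prefer subtle's ConstantTimeEq rather than hand-rolling it:

```rust
/// Constant-time byte comparison: examines every byte and avoids an early
/// exit, so response timing does not leak how many leading bytes of the
/// presented API key matched the real one.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false; // key length is not treated as secret here
    }
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // accumulate differences without branching
    }
    diff == 0
}
```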

Room for Improvement

Future Optimizations
io_uring — zero-copy I/O. Replace epoll with io_uring for syscall-free socket reads; potential 20-30% latency reduction on Linux 6.1+.
CPU Pinning — NUMA-aware workers. Pin Tokio worker threads to specific cores to eliminate cache-line bouncing on multi-socket systems.
Huge Pages — 2MB TLB entries. Reduce TLB misses for large DashMap instances; especially impactful at 10M+ keys.
DPDK / XDP — kernel bypass networking. Bypass the kernel network stack entirely; sub-microsecond responses possible for dedicated NIC setups.