
Redis Performance: Why Latency Doubles Under Load

April 24, 2026 | 13 min read | Engineering

Every engineer has been here. Your application runs fine in staging. Redis latency is a consistent 0.3ms. You deploy to production, traffic ramps up, and within hours your monitoring dashboard shows Redis P99 at 2ms, then 5ms, then 15ms. You Google "redis slow" and find a hundred results telling you to check your slow log, increase your connection pool, or upgrade your instance. None of them explain why Redis latency doubles under load in the first place.

The answer is not one thing. It is five things, each measurable, each with a different severity profile, and only three of which have a structural fix. This post walks through all five causes with measured data, explains which ones you can tune away and which ones are architectural limitations, and shows what the latency profile looks like after the structural fix.

The Baseline: Redis at Low Load

Redis at low load is genuinely fast. On an r7g.xlarge in the same AZ, a single-connection GET of a 1 KB value returns in 0.35ms at P50 and 0.61ms at P99. The P99/P50 ratio is 1.7x, which is healthy. The event loop is idle most of the time. There is no queueing. Each command gets processed the instant it arrives. This is the performance you see in benchmarks, in staging environments, and in the first few weeks of production.

Then traffic increases, and the five causes activate.

Cause 1: Single-Threaded Event Loop Contention

Redis processes all commands on a single thread. This is a deliberate design choice that eliminates lock contention and makes every operation atomic without explicit locking. At low load, it is an advantage. At high load, it becomes the primary bottleneck.

When multiple clients send commands concurrently, those commands queue in the event loop. Redis processes them sequentially: read command from client A, execute, write response to A, read command from client B, execute, write response to B. The time each command spends waiting for preceding commands to complete is pure queueing delay.

At 10,000 commands per second with an average processing time of 10 microseconds per command, the event loop is busy for 100 milliseconds per second -- 10% utilization. Queueing delay is negligible. At 100,000 commands per second, the event loop is busy for 1,000 milliseconds per second -- 100% utilization. The math does not work. Redis cannot process 100,000 commands per second if each command takes 10 microseconds on a single thread, because 100,000 * 10us = 1 second of work per second. In practice, simple commands (GET of small values) take 1-5 microseconds, so Redis can handle 200,000-1,000,000 simple ops/sec. But the moment you mix in commands that take longer -- MGET of 50 keys, LRANGE of a large list, GET of a 100 KB value -- the event loop backs up.

The Queueing Math

At 70% event loop utilization, queueing theory (M/D/1 model) predicts average queueing delay equals approximately 1.17x the service time. At 90% utilization, average queueing delay is 4.5x the service time. This means a command that takes 5 microseconds to execute waits 22.5 microseconds in the queue at 90% utilization. Your P50 goes from 0.3ms to 0.33ms (barely noticeable). Your P99, which catches the commands that arrived during a burst of expensive operations, goes from 0.6ms to 2-5ms. This is where "redis slow" complaints originate.
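The M/D/1 figures above fall straight out of the Pollaczek-Khinchine formula, and it is worth having them as a one-liner you can plug your own service times into. A minimal sketch (plain Python, no Redis required; the function name is ours):

```python
def md1_queueing_delay(utilization, service_time_us):
    """Mean queueing delay (us) for an M/D/1 queue, via the
    Pollaczek-Khinchine result: Wq = rho / (2 * (1 - rho)) * service_time."""
    rho = utilization
    return rho / (2 * (1 - rho)) * service_time_us

# A 5 us command at 70% and 90% event loop utilization:
print(round(md1_queueing_delay(0.70, 5.0), 2))  # 5.83 us (~1.17x service time)
print(round(md1_queueing_delay(0.90, 5.0), 2))  # 22.5 us (4.5x service time)
```

Note how the denominator `(1 - rho)` drives the blow-up: delay grows hyperbolically, not linearly, as utilization approaches 100%.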

Measured Impact

| Event Loop Utilization | P50 Latency | P99 Latency | P99/P50 Ratio |
|---|---|---|---|
| 10% | 0.31ms | 0.55ms | 1.8x |
| 30% | 0.33ms | 0.72ms | 2.2x |
| 50% | 0.37ms | 1.10ms | 3.0x |
| 70% | 0.45ms | 2.30ms | 5.1x |
| 90% | 0.82ms | 8.50ms | 10.4x |

At 90% utilization, the P99/P50 ratio is 10.4x. Your P99 is an order of magnitude worse than your P50. This is not a bug. This is the mathematical consequence of queueing on a single thread. Every additional client, every larger value, every more expensive command pushes utilization higher and makes the tail worse.

What You Can Do (Limited)

You can shard your workload across multiple Redis instances to reduce per-instance utilization. This trades operational complexity for lower per-shard load. You can also pipeline commands to amortize the round-trip overhead. But pipelining does not reduce event loop utilization -- it just batches the queueing. The fundamental constraint remains: one thread, sequential processing, queueing under load.
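The pipelining trade-off is easy to see numerically: batching collapses N round trips into one, but the server-side service time is paid in full either way. A toy model (hypothetical helper; the numbers are illustrative):

```python
def total_latency_ms(n_commands, rtt_ms, service_us, pipelined):
    """Wall-clock time to issue n_commands against a single Redis.
    Pipelining batches everything into one round trip, but the event
    loop work (n_commands * service_us) is identical in both cases."""
    service_ms = n_commands * service_us / 1000.0
    round_trips = 1 if pipelined else n_commands
    return round_trips * rtt_ms + service_ms

# 100 GETs, 0.3 ms RTT, 5 us of event loop work each:
print(total_latency_ms(100, 0.3, 5.0, pipelined=False))  # ~30.5 ms
print(total_latency_ms(100, 0.3, 5.0, pipelined=True))   # ~0.8 ms
```

The client-observed win is large, but the `service_ms` term -- the event loop's share -- is unchanged, which is exactly why pipelining does not buy you headroom on a saturated instance.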

Cause 2: Large Value Serialization Blocking the Event Loop

When Redis processes a GET for a 100 KB value, it must copy those 100 KB into the output buffer. This copy takes approximately 2-5 microseconds depending on whether the value is in the CPU cache (unlikely for large values in a large dataset). During those microseconds, the event loop is blocked. No other commands are processed.

At low load, a 5-microsecond blockage is invisible. At high load, it is catastrophic. Every concurrent client waiting for a small-value GET is delayed by that 5 microseconds. If ten large-value GETs arrive in quick succession, they create a 50-microsecond bubble in the event loop that delays every other command queued behind them.

This is the "head-of-line blocking" problem. One expensive operation at the head of the queue delays everything behind it. And the effect is non-linear: the probability of a large-value GET arriving during a burst increases with traffic rate, which means the frequency of head-of-line blocking events increases with load.
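Head-of-line blocking is easy to reproduce with a toy FIFO queue simulation (plain Python; the arrival spacing and service times are illustrative, not measured):

```python
def small_op_p99(n_ops, large_every=None):
    """Single-server FIFO queue with one arrival every 40 us.
    Small ops need 5 us of service; if large_every is set, every
    large_every-th op is a large GET needing 500 us. Returns the
    P99 latency (wait + service, in us) of the small ops only."""
    free_at = 0.0       # time the event loop becomes free
    latencies = []
    for i in range(n_ops):
        arrival = i * 40.0
        large = large_every is not None and i % large_every == 0
        service = 500.0 if large else 5.0
        start = max(arrival, free_at)
        free_at = start + service
        if not large:
            latencies.append(free_at - arrival)
    latencies.sort()
    return latencies[int(0.99 * len(latencies)) - 1]

print(small_op_p99(10_000))      # 5.0 -- queue never builds, zero waiting
print(small_op_p99(10_000, 20))  # 465.0 -- small ops stuck behind 500 us bubbles
```

With no large ops the queue never forms and small-op P99 equals the service time. Mix in 5% large ops and the small-op P99 is dominated entirely by time spent queued behind someone else's 500-microsecond operation.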

Measured Impact

We tested a workload mixing 95% small-value GETs (64 bytes) with 5% large-value GETs (100 KB) at varying request rates. The small-value P99 latency, which should be constant if only small values were in play, degrades as follows:

| Total Ops/Sec | Small-Value P99 (no large values) | Small-Value P99 (5% large values mixed in) | Degradation |
|---|---|---|---|
| 10,000 | 0.55ms | 0.62ms | +13% |
| 50,000 | 0.70ms | 1.40ms | +100% |
| 100,000 | 1.10ms | 3.80ms | +245% |

At 100,000 ops/sec with just 5% large values mixed in, the small-value P99 degrades by 245%. The large values poison the tail latency of the small values. This is one of the most common causes of "redis slow" reports, and it is invisible unless you segment your latency metrics by value size.

Cause 3: Network Bandwidth Saturation on the NIC

Every byte transferred to and from Redis consumes network bandwidth. An r7g.xlarge has up to 12.5 Gbps of network bandwidth. At 100,000 ops/sec with an average value size of 4 KB, Redis generates approximately 400 MB/sec of outbound traffic -- roughly 3.2 Gbps. That is 26% of the available bandwidth. Add client-to-Redis traffic (keys plus protocol overhead), and total NIC utilization approaches 35-40%.
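That arithmetic is worth scripting against your own instance sizes and value distributions. A small helper (hypothetical name; it counts response payloads only, so keys and RESP protocol overhead would add a few percent on top):

```python
def nic_utilization(ops_per_sec, value_bytes, nic_gbps):
    """Fraction of NIC bandwidth consumed by response payloads alone."""
    bits_per_sec = ops_per_sec * value_bytes * 8
    return bits_per_sec / (nic_gbps * 1e9)

# 100,000 ops/sec of 4 KB values on a 12.5 Gbps NIC:
print(round(nic_utilization(100_000, 4096, 12.5), 3))  # 0.262 -> ~26%
```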

At 60-70% NIC utilization, packet queuing begins in the kernel's transmit buffer. TCP congestion control may reduce the send rate. Individual GET responses that would normally transfer in one round-trip now span multiple TCP windows because the NIC cannot drain the transmit buffer fast enough.

The effect on latency is progressive. Below 50% NIC utilization, the network adds negligible delay. Above 50%, every additional 10% utilization adds approximately 0.1-0.3ms of queueing delay. At 80% NIC utilization, the network is adding 1-2ms to every operation, and you are one traffic spike away from packet drops and TCP retransmissions.

Measured Impact

| NIC Utilization | Additional Latency (P50) | Additional Latency (P99) |
|---|---|---|
| 20% | +0.01ms | +0.03ms |
| 40% | +0.02ms | +0.08ms |
| 60% | +0.10ms | +0.40ms |
| 80% | +0.50ms | +2.10ms |
| 90% | +1.80ms | +8.50ms |

At 90% NIC utilization, the network alone is adding 8.5ms to P99 latency. Combined with event loop queueing, your total P99 can exceed 15ms -- fifty times the baseline. And your monitoring dashboard shows "Redis is slow" when the actual cause is network saturation on the NIC of your Redis instance.

Cause 4: Memory Fragmentation from jemalloc Under Churn

Redis uses jemalloc as its memory allocator. jemalloc is excellent for long-running processes with stable allocation patterns. It is less excellent when key-value pairs are constantly created, updated with different sizes, and deleted -- which is exactly what a cache does.

When a 2 KB value is updated with a 3 KB value, jemalloc frees the 2 KB block and allocates a 3 KB block. The 2 KB block becomes a fragment. Over time, millions of these fragments accumulate. jemalloc's fragmentation ratio (the ratio of resident memory to active memory) can reach 1.3-1.5x for high-churn cache workloads. A Redis instance with 10 GB of actual data may consume 13-15 GB of resident memory.

Fragmentation does not directly increase command processing time. But it has two indirect effects on latency. First, it increases the working set size, which reduces the effectiveness of CPU caches. Values that would fit in L3 cache with zero fragmentation may spill to main memory with 1.5x fragmentation, adding 50-100ns per access. Second, when Redis reaches its maxmemory limit, it must evict keys before processing new writes. The eviction scan takes time proportional to the number of keys scanned, and this time is spent on the event loop, blocking all other commands.

Measured Impact

Fragmentation's impact is subtle and accumulative. In our tests, a Redis instance under steady-state churn (50% of keys updated per minute with variable-size values) showed the following degradation over time:

| Runtime | Fragmentation Ratio | P50 Latency | P99 Latency |
|---|---|---|---|
| 0 hours (fresh) | 1.00x | 0.31ms | 0.55ms |
| 24 hours | 1.15x | 0.33ms | 0.62ms |
| 72 hours | 1.28x | 0.36ms | 0.78ms |
| 168 hours (1 week) | 1.42x | 0.40ms | 0.95ms |

After one week, P99 has degraded by 73%. This is gradual enough that teams often do not notice until a restart "magically" fixes performance. The restart defragments memory because jemalloc starts fresh. Performance is great for a few days, then gradually degrades again. Teams call this "Redis needs periodic restarts" without understanding that memory fragmentation is the cause.

What You Can Do

Redis 4.0 introduced activedefrag, which performs online defragmentation. Enable it with CONFIG SET activedefrag yes. It runs during idle cycles and can reduce the fragmentation ratio to 1.05-1.10x. The trade-off is that defragmentation consumes CPU cycles on the event loop during idle periods, which can cause latency spikes if the instance is never truly idle. For high-traffic instances, activedefrag helps but does not eliminate the problem.
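A simple way to operationalize this is a check that parses `INFO memory` output and flags instances worth defragmenting. A sketch (the helper names and the 1.2x threshold are ours, taken from the diagnosis section below; `mem_fragmentation_ratio` is a real INFO field):

```python
def parse_fragmentation(info_memory_text):
    """Extract mem_fragmentation_ratio from `redis-cli info memory` output."""
    for line in info_memory_text.splitlines():
        if line.startswith("mem_fragmentation_ratio:"):
            return float(line.split(":", 1)[1])
    return None

def should_enable_defrag(ratio, threshold=1.2):
    """Heuristic from this post: above ~1.2x, fragmentation is costing latency."""
    return ratio is not None and ratio > threshold

sample = "used_memory:10737418240\r\nmem_fragmentation_ratio:1.42\r\n"
ratio = parse_fragmentation(sample)
print(ratio, should_enable_defrag(ratio))  # 1.42 True
```

Wire this into a periodic job and you catch the slow one-week degradation before anyone reaches for a restart.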

Cause 5: Cross-AZ Latency Variance

The four causes above apply to same-AZ deployments. Cross-AZ adds a fifth cause that is entirely outside your control: network path variance between availability zones.

Same-AZ latency is consistent because traffic stays within the same data center. Cross-AZ latency traverses the physical link between data centers, which has a base latency of 0.5-1.0ms depending on the AZ pair. But the variance is the problem. The base latency might be 0.7ms, but the P99 can spike to 2-3ms during periods of high cross-AZ traffic in the region.

AWS provides no SLA on cross-AZ latency, and the latency is not constant. It fluctuates based on the aggregate traffic in the AZ-to-AZ link, which includes traffic from every AWS customer in that region. Your Redis P99 includes this variance on top of all four causes above.

Measured Impact

| Deployment | P50 Latency | P99 Latency | P99/P50 |
|---|---|---|---|
| Same-AZ (baseline) | 0.31ms | 0.55ms | 1.8x |
| Cross-AZ (quiet period) | 0.85ms | 1.60ms | 1.9x |
| Cross-AZ (peak traffic) | 1.10ms | 3.80ms | 3.5x |

During peak traffic, cross-AZ P99 is nearly 7x the same-AZ baseline. And if half your application fleet is in a different AZ than Redis (which is common for resilience), half your cache traffic pays this penalty on every single operation.

The Structural Fix for Causes 1-3

Causes 1 through 3 -- event loop contention, large value serialization blocking, and NIC saturation -- share a common root cause: every cache read is a network operation. Every GET sends bytes over the network to Redis, waits for Redis to process the command on its single-threaded event loop, and receives bytes back over the network. The event loop contends because every client shares it. The NIC saturates because every value traverses it. Large values block because serialization happens on the event loop.

The structural fix is to stop sending hot-path reads over the network. An in-process L1 cache moves the hottest values into the application's own address space. A GET becomes a hash table lookup and a pointer dereference -- 31 nanoseconds, no event loop, no serialization, no network.
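The tiering pattern itself is small enough to sketch. This is a toy illustration, not Cachee's implementation -- a production L1 also needs bounded memory, eviction, and an invalidation strategy:

```python
class L1Cache:
    """Toy in-process L1 in front of a slower L2 (e.g. Redis).
    Hits are a dict lookup; misses fall through to L2 and fill L1."""

    def __init__(self, l2_get):
        self._store = {}
        self._l2_get = l2_get   # fallback fetch, e.g. redis_client.get
        self.hits = self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1
            return self._store[key]     # a reference: no copy, no network
        self.misses += 1
        value = self._l2_get(key)       # network round-trip to L2
        if value is not None:
            self._store[key] = value    # fill L1 for next time
        return value

l2 = {"user:1": b"alice"}               # stand-in for Redis
cache = L1Cache(l2.get)
cache.get("user:1"); cache.get("user:1")
print(cache.hits, cache.misses)         # 1 1
```

The first read misses and fills L1; every subsequent read is served from local memory without touching L2 at all.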

In-process L1 GET: 31ns. Network round-trips: zero. Event loop contention: zero.

Before and After: P99 Latency Under Load

We deployed Cachee as an in-process L1 cache in front of Redis on a workload running 80,000 ops/sec with a mix of 95% small values and 5% 100 KB values. The L1 hit rate stabilized at 87%. Here is the before/after comparison:

| Metric | Before (Redis Only) | After (Cachee L1 + Redis L2) | Improvement |
|---|---|---|---|
| P50 overall | 0.45ms | 0.031ms | 14.5x faster |
| P99 overall | 3.80ms | 0.38ms | 10x faster |
| P50 (L1 hits only) | n/a | 0.000031ms (31ns) | -- |
| P99 (L1 misses, Redis) | 3.80ms | 0.62ms | 6.1x faster |
| Redis event loop util | 72% | 9% | 8x reduction |
| Redis NIC throughput | 3.1 Gbps | 0.4 Gbps | 7.8x reduction |

The P99 improvement is 10x, but look at the second-order effects. Redis event loop utilization dropped from 72% to 9%. At 9%, there is no queueing. The event loop is idle 91% of the time. Every command that reaches Redis -- the 13% that miss L1 -- gets processed immediately without waiting behind other commands. This is why even the Redis-only P99 (for L1 misses) improved from 3.80ms to 0.62ms. The Redis instance is simply less loaded.

NIC throughput dropped from 3.1 Gbps to 0.4 Gbps. The network is no longer a factor. And because 87% of reads never touch Redis at all, the large-value serialization blocking problem disappears for the majority of operations. The 5% of requests that access 100 KB values are now a small fraction of an already-small Redis workload, not a blocking force on 80,000 ops/sec.

The Latency-by-Value-Size Table

This table shows Redis GET latency versus in-process latency across a range of value sizes. These numbers come from same-AZ benchmarks at moderate load (50% event loop utilization).

| Value Size | Redis P50 | Redis P99 | In-Process | Improvement |
|---|---|---|---|---|
| 64 B | 0.35ms | 0.85ms | 31ns | 11,290x |
| 1 KB | 0.42ms | 1.10ms | 31ns | 13,548x |
| 4 KB | 0.61ms | 1.65ms | 31ns | 19,677x |
| 10 KB | 0.92ms | 2.50ms | 31ns | 29,677x |
| 50 KB | 1.95ms | 5.20ms | 31ns | 62,903x |
| 100 KB | 3.50ms | 9.00ms | 31ns | 112,903x |

The in-process column is constant at 31 nanoseconds because the GET operation returns a pointer -- it never touches the value bytes. Whether the value behind that pointer is 64 bytes or 100 KB, the hash lookup and pointer dereference take the same amount of time. Redis, on the other hand, must serialize, transfer, and deserialize every byte, every time.
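You can see the reference-not-copy property in any in-process store. Python's dict makes the point (its lookup is slower than 31ns, but it is equally independent of value size):

```python
store = {}
small = b"x" * 64
large = b"x" * 100_000   # 100 KB
store["a"], store["b"] = small, large

# Both lookups return the existing object -- no value bytes are copied,
# so lookup cost does not depend on whether the value is 64 B or 100 KB.
assert store["a"] is small
assert store["b"] is large
```

A networked cache can never do this: the bytes must cross a socket, so cost scales with size by construction.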

What About Causes 4 and 5?

Memory fragmentation (Cause 4) is partially mitigated by the L1 tier. With 87% of reads served from L1, far less traffic reaches Redis, fewer evictions fire, and the allocate/free churn on Redis drops accordingly. Lower churn means less fragmentation. In our tests, the fragmentation ratio after one week dropped from 1.42x (Redis-only) to 1.12x (with the L1 tier). Not eliminated, but significantly reduced.

Cross-AZ latency variance (Cause 5) is fully eliminated for L1 hits. 87% of your reads are now local memory accesses -- no network, no cross-AZ penalty. The 13% that miss L1 still pay the cross-AZ cost, but that is 13% of your traffic, not 100%. And since Redis is lightly loaded, the cross-AZ P99 for those misses is better than the original cross-AZ P99 at full load.

How to Diagnose Which Cause Is Hitting You

Before implementing any fix, diagnose which of the five causes dominates your latency degradation. Each has a distinct signature.

Event loop contention (Cause 1): Check redis-cli info stats for instantaneous_ops_per_sec. Multiply by average command processing time (check slowlog for the distribution). If the product approaches 1,000,000 microseconds per second (i.e., the event loop is near 100% utilized), this is your problem. The tell-tale sign is P99/P50 ratio above 5x.

Large value blocking (Cause 2): Run redis-cli --bigkeys to find large values. Then correlate P99 spikes with access patterns for those keys. If large-value GETs correlate temporally with P99 spikes in small-value GETs, head-of-line blocking is the cause. The tell-tale sign is bimodal latency distribution -- most requests are fast, but periodic spikes are 5-10x the baseline.

NIC saturation (Cause 3): Check the CloudWatch NetworkBytesIn and NetworkBytesOut metrics for your ElastiCache instance. If outbound bandwidth is above 50% of the instance type's maximum, the NIC is contributing to latency. The tell-tale sign is latency that correlates with throughput and does not improve with pipelining or connection pooling.

Memory fragmentation (Cause 4): Check redis-cli info memory for mem_fragmentation_ratio. If it is above 1.2, fragmentation is contributing to latency. The tell-tale sign is latency that gradually increases over days and improves after a restart.

Cross-AZ (Cause 5): Compare latency from same-AZ clients versus cross-AZ clients. If cross-AZ is consistently 2-3x worse, this is the cause. The tell-tale sign is latency variance that you cannot explain from Redis server-side metrics.
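The five signatures above can be folded into a rough triage helper. This is a sketch with illustrative metric names you would populate from your own monitoring (not Redis INFO field names); the thresholds are the ones quoted in this post:

```python
def diagnose(m):
    """Map observed signals to the five causes described above.
    `m` is a dict of self-collected measurements (hypothetical keys)."""
    causes = []
    p50 = m.get("p50_ms", 0.0)
    tail_ratio = m.get("p99_ms", 0.0) / p50 if p50 else 0.0
    if m.get("event_loop_util", 0.0) > 0.7 or tail_ratio > 5:
        causes.append("1: event loop contention")
    if m.get("bimodal_latency", False):
        causes.append("2: large-value head-of-line blocking")
    if m.get("nic_util", 0.0) > 0.5:
        causes.append("3: NIC saturation")
    if m.get("mem_fragmentation_ratio", 1.0) > 1.2:
        causes.append("4: memory fragmentation")
    if m.get("cross_az_p99_ratio", 1.0) >= 2:
        causes.append("5: cross-AZ variance")
    return causes

print(diagnose({"event_loop_util": 0.9, "mem_fragmentation_ratio": 1.42}))
# ['1: event loop contention', '4: memory fragmentation']
```

Multiple causes often fire together -- a loaded instance typically shows Causes 1 and 3 at once -- so treat the output as a checklist, not a single answer.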

# Quick diagnostic commands
redis-cli info stats | grep instantaneous_ops
redis-cli info memory | grep mem_fragmentation_ratio
redis-cli --bigkeys
redis-cli slowlog get 25
redis-cli info clients | grep connected_clients

The Bottom Line

Redis latency doubles under load because of five measurable causes. Three of them -- event loop contention, large value serialization, and NIC saturation -- are structural consequences of the network cache architecture. They cannot be tuned away. The fix is architectural: move hot-path reads to an in-process L1 cache where a GET is 31 nanoseconds with zero event loop contention, zero serialization, and zero network transfer. The result is 10x P99 improvement on the hot path and a 6x improvement on the remaining Redis traffic because the Redis instance is no longer overloaded.

Fix the structural cause, not the symptoms. 31ns reads, zero queueing, zero NIC load.
