
Redis Connection Timeout: Debug and Fix

April 29, 2026 | 14 min read | Engineering

It is 2 AM. Your pager goes off. The alerts say "Redis connection timeout." Your application logs are full of ETIMEDOUT, ECONNREFUSED, or "connection pool exhausted." Users are seeing 500 errors. You need to diagnose and fix this in the next fifteen minutes, and you do not have time to read a textbook on Redis internals. This guide is the answer you are looking for.

Redis connection timeouts have five common causes. They are ranked below by how frequently we have seen them in production incidents across dozens of deployments. For each cause, we provide the diagnostic command to confirm the hypothesis, the output to look for, and the fix. Start at Cause 1 and work down. In most outages, the problem is one of the first two causes.

Before diving into diagnostics, run these three commands immediately. They give you the state of your Redis instance in under 5 seconds and will narrow your search.

# Run these three commands FIRST during any Redis timeout incident:

# 1. Is Redis responding at all?
redis-cli -h your-redis-host PING
# Expected: PONG
# If no response: Redis is down or unreachable (skip to network checks)

# 2. What does the client state look like?
redis-cli -h your-redis-host INFO clients
# Look at: connected_clients, blocked_clients, rejected_connections

# 3. What does the server state look like?
redis-cli -h your-redis-host INFO stats
# Look at: total_connections_received, rejected_connections,
#          instantaneous_ops_per_sec

5 common timeout causes | 3 triage commands | 15 min target fix time

Cause 1: Connection Pool Exhaustion

The Symptoms

Your application logs show "connection pool exhausted," "timeout waiting for idle connection," or "could not acquire connection from pool within timeout." The Redis server itself is fine -- if you SSH into the server and run redis-cli PING, it responds instantly. The problem is on the client side: your application is trying to use more simultaneous Redis connections than your connection pool allows, and new requests are queuing behind the pool limit.

This is the most common cause of Redis timeouts in production. It is also the most confusing, because Redis appears healthy from the server side. The bottleneck is the client's connection pool, not Redis itself. It typically manifests during traffic spikes when the number of concurrent requests exceeds the pool size, or during slow Redis commands that hold connections longer than expected (see Cause 4).

The Diagnostic

# On the Redis server:
redis-cli INFO clients

# Key fields:
# connected_clients:247    <-- How many connections are currently open
# maxclients:10000          <-- Redis's max connection limit
# blocked_clients:0         <-- Clients blocked on BLPOP/BRPOP/etc.
# rejected_connections:0    <-- Connections rejected because maxclients was hit

# If connected_clients is WELL BELOW maxclients, the problem is
# your client-side pool, not Redis.

# On the application side, check your pool configuration:
# Node.js (ioredis): pool size is implicit (1 connection per client by default)
# Python (redis-py): max_connections parameter (default: 2**31, effectively unlimited)
# Java (Jedis): maxTotal in JedisPoolConfig (default: 8 -- often too low!)
# Java (Lettuce): multiplexes commands over one shared Netty connection by default (no pool needed)

The Fix

Increase your connection pool size to match your actual concurrency. The right pool size is the number of concurrent Redis operations your application performs during peak traffic. If your application handles 500 concurrent requests and each request makes 2 Redis calls sequentially, you need at most 500 connections (not 1,000, because the two calls are sequential within each request). In practice, start with a pool size equal to your web server's worker/thread count. If you have 50 Puma workers, set the Redis pool to 50. If you have 200 Go goroutines making Redis calls, set the pool to 200.

However, increasing the pool size is a band-aid if your fundamental problem is too many concurrent Redis operations. A more sustainable fix is to reduce the number of Redis operations per request. Use MGET instead of multiple GET calls. Use pipelines to batch commands. And most importantly, move hot reads to an in-process L1 cache so they never touch Redis at all. If 80% of your Redis reads are for a few thousand hot keys, an L1 cache eliminates 80% of your connection pool pressure with zero additional Redis infrastructure.
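
As a concrete sketch of this batching, here is what collapsing several round-trips into one looks like in Python with redis-py (the host and key names are placeholders):

# Collapsing N round-trips into one (Python, redis-py -- illustrative sketch):
import redis

r = redis.Redis(host="your-redis-host")

# BEFORE: three round-trips, three pool checkouts
name = r.get("user:1:name")
email = r.get("user:1:email")
plan = r.get("user:1:plan")

# AFTER: one round-trip, one pool checkout, via MGET
name, email, plan = r.mget("user:1:name", "user:1:email", "user:1:plan")

# AFTER: one round-trip for mixed commands, via a pipeline
with r.pipeline(transaction=False) as pipe:
    pipe.get("user:1:name")
    pipe.ttl("flag:beta")
    name, flag_ttl = pipe.execute()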

// Jedis pool configuration (Java) -- before and after:

// BEFORE: default pool (8 connections, frequently exhausted)
JedisPoolConfig config = new JedisPoolConfig();
// config.maxTotal defaults to 8

// AFTER: sized to match concurrency
JedisPoolConfig config = new JedisPoolConfig();
config.setMaxTotal(50);          // Match your thread pool size
config.setMaxIdle(50);           // Keep connections warm
config.setMinIdle(10);           // Pre-warm 10 connections
config.setMaxWaitMillis(2000);   // Fail fast: 2 second timeout
config.setTestOnBorrow(true);    // Validate connections before use
config.setTestWhileIdle(true);   // Detect dead connections
config.setTimeBetweenEvictionRunsMillis(30000); // Evict idle every 30s
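
For Python services, a rough redis-py equivalent is sketched below (assuming redis-py 4.x or newer; the numbers mirror the Jedis example and should be tuned to your own worker count):

# redis-py pool sized to concurrency (sketch -- tune to your worker count):
import redis

pool = redis.BlockingConnectionPool(
    host="your-redis-host",
    port=6379,
    max_connections=50,        # like maxTotal: match your thread/worker count
    timeout=2,                 # like maxWaitMillis: fail fast after 2 seconds
    socket_timeout=1,          # per-command socket timeout
    health_check_interval=30,  # like testWhileIdle: re-validate idle connections
)
r = redis.Redis(connection_pool=pool)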

Cause 2: Network Latency Spike

The Symptoms

Redis commands that normally take 100-300 microseconds are suddenly taking 5-50 milliseconds. The Redis server's CPU is low (under 30%), memory is well within limits, and SLOWLOG shows no slow commands. The latency is coming from the network, not from Redis. This is common when your application and Redis are in different availability zones, when there is VPC peering overhead, when a network path is congested, or when a security group or ACL change has unexpectedly altered the routing path.

The Diagnostic

# Measure raw network latency to Redis:
redis-cli -h your-redis-host --latency
# Output: min: 0, max: 3, avg: 0.42 (1523 samples)
# This shows latency in MILLISECONDS.
# If avg is > 1ms on a same-AZ connection, you have a network problem.
# If avg is > 5ms on a cross-AZ connection, you have a network problem.

# Measure latency over time to spot spikes:
redis-cli -h your-redis-host --latency-history -i 5
# Shows latency samples every 5 seconds. Look for periodic spikes.

# Check network path:
traceroute your-redis-host
# Look for: extra hops, high-latency hops, packet loss

# Check MTU (mismatched MTU causes fragmentation and retransmits):
ping -M do -s 1472 your-redis-host
# If this fails but ping -s 1400 works, you have an MTU mismatch.
# Fix: set MTU to 1500 on both ends, or 9001 for jumbo frames.

# Check if security groups are blocking or throttling:
# AWS: check the security group attached to your Redis/ElastiCache instance
# Ensure your application's security group is allowed on port 6379

The Fix

If your application and Redis are in different availability zones, move them to the same AZ. Cross-AZ latency on AWS is typically 0.5-1.5 milliseconds, compared to 0.1-0.3 milliseconds for same-AZ. This alone can reduce your Redis P99 by 2-5x. If cross-AZ placement is required for redundancy, use a read replica in your application's AZ and direct reads to it.

Check for MTU mismatches. If your VPC uses jumbo frames (MTU 9001) but a NAT gateway or VPN tunnel in the path uses standard frames (MTU 1500), packets larger than 1500 bytes will be fragmented. IP-level fragmentation is transparent to TCP, but it is not free: each fragmented packet must be reassembled at the receiver, which adds 0.1-0.5 milliseconds. For Redis, this typically only affects large values (over 1400 bytes), but if you are storing JSON objects or serialized data structures, many of your values may exceed this threshold.

Verify that no security group or network ACL changes were recently made. A misconfigured security group can cause connection timeouts if it drops packets silently (no REJECT, just DROP), which causes the client to wait for the TCP timeout (typically 15-30 seconds) before failing. Check VPC Flow Logs for REJECT entries on port 6379 to your Redis instance.
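
On the client side, set explicit timeouts so a silent DROP surfaces in a second or two instead of hanging for the full TCP timeout. A hedged redis-py sketch:

# Fail fast instead of waiting out the kernel's TCP timeout (sketch):
import redis

r = redis.Redis(
    host="your-redis-host",
    socket_connect_timeout=2,  # abandon TCP connect attempts after 2s
    socket_timeout=1,          # abandon any single command after 1s
    retry_on_timeout=True,     # allow one retry for transient blips
)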

Cause 3: Redis maxmemory Hit with noeviction Policy

The Symptoms

Your application logs show OOM command not allowed when used memory > 'maxmemory' on write operations. Read operations still work. Redis is responding but refusing to accept new writes. Any application code path that writes to Redis (SET, HSET, LPUSH, SADD, etc.) fails with an error that your application may interpret as a timeout (depending on your client library's error handling).

This happens when Redis reaches its configured maxmemory limit and the eviction policy is set to noeviction. With this policy, Redis does not remove any existing keys to make room. It simply rejects new writes. This is the default policy on some Redis configurations and is almost never what you want for a cache workload. A cache that refuses to cache new data is not a cache; it is a read-only snapshot that grows staler by the second.

The Diagnostic

# Check memory status:
redis-cli INFO memory

# Key fields:
# used_memory_human:6.43G      <-- Current memory usage
# maxmemory_human:6.00G        <-- Configured limit
# maxmemory_policy:noeviction  <-- THE PROBLEM

# used_memory > maxmemory means Redis is over the limit.
# With noeviction policy, all writes are rejected.

# Check how many OOM errors have occurred:
redis-cli INFO stats | grep rejected
# rejected_connections:0     <-- Connection rejections (different issue)
# Look for OOM entries in your Redis log file instead.

# Check eviction stats:
redis-cli INFO stats | grep evicted
# evicted_keys:0  <-- With noeviction, this is always 0. That is the problem.

The Fix

Change the eviction policy to allkeys-lfu (or allkeys-lru if you prefer recency-based eviction). This allows Redis to evict the least frequently used keys when it needs to make room for new writes. Your cache will always accept new data, and it will evict cold data to make room.

# Immediate fix (takes effect instantly, no restart needed):
redis-cli CONFIG SET maxmemory-policy allkeys-lfu

# Persist the change across restarts:
# Add to redis.conf:
# maxmemory-policy allkeys-lfu

# If you are also out of memory, increase maxmemory:
redis-cli CONFIG SET maxmemory 8gb

# Or free memory immediately by deleting known-cold keys:
redis-cli SCAN 0 COUNT 100
# Identify keys that can be safely deleted and remove them.

# Verify the fix:
redis-cli INFO memory
# maxmemory_policy should now show allkeys-lfu
# evicted_keys should start incrementing as Redis evicts cold data

For the longer term, investigate why Redis is reaching maxmemory in the first place. Run redis-cli --bigkeys to find unexpectedly large keys. Check for keys without TTLs that are accumulating over time. Consider whether your working set has genuinely grown beyond your Redis instance's capacity, in which case you need a larger instance or a cluster, not just a policy change. And consider whether an in-process L1 cache can absorb enough hot reads to reduce the working set that Redis needs to hold.
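
One way to find TTL-less keys is a SCAN-based audit, sketched here in Python with redis-py (illustrative; the per-key TTL call keeps the sketch simple, but a production audit would batch those lookups in a pipeline):

# Audit for keys that never expire (SCAN-based, so it will not block Redis):
import redis

r = redis.Redis(host="your-redis-host")

no_ttl = 0
for key in r.scan_iter(count=100):
    if r.ttl(key) == -1:   # -1 means the key exists but has no expiry
        no_ttl += 1
print(f"{no_ttl} keys have no TTL")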

Cause 4: Slow Command Blocking the Event Loop

The Symptoms

Redis latency spikes intermittently. Most commands complete in microseconds, but periodically all commands stall for 5-100 milliseconds. The stalls are correlated across all clients: every client experiences the stall at the same time. This is because Redis processes commands on a single thread (even Redis 7+ with I/O threads uses a single thread for command execution). A single slow command blocks all other commands until it completes. If a KEYS * scan takes 50 milliseconds on a database with 10 million keys, every other command waits 50 milliseconds.

The Diagnostic

# Check the slow log for recent offenders:
redis-cli SLOWLOG GET 25

# Example output:
# 1) 1) (integer) 142           <-- Entry ID
#    2) (integer) 1745900400     <-- Timestamp (Unix)
#    3) (integer) 52341          <-- Duration in MICROSECONDS (52ms!)
#    4) 1) "KEYS"               <-- The command
#       2) "session:*"          <-- The argument
#    5) "10.0.1.45:52341"       <-- Client address

# Common slow commands:
# KEYS *             -- O(N) scan of entire keyspace. NEVER use in production.
# SMEMBERS large_set -- O(N) where N = set size. Use SSCAN instead.
# HGETALL large_hash -- O(N) where N = hash field count. Use HMGET instead.
# SORT               -- O(N+M*log(M)). Avoid on large datasets.
# LRANGE 0 -1        -- O(N) for the entire list. Use bounded LRANGE.
# DEBUG SLEEP        -- Literally blocks Redis. Should never be in production.

# Lower the slow log threshold to catch more commands:
redis-cli CONFIG SET slowlog-log-slower-than 100
# Now any command taking > 100 microseconds is logged.
# Default is 10000 (10ms), which misses most problems.

# Check current slow log length:
redis-cli SLOWLOG LEN

# Check if specific commands are being used heavily:
redis-cli INFO commandstats
# Look for: cmdstat_keys, cmdstat_smembers, cmdstat_hgetall, cmdstat_sort

The Fix

Identify and eliminate the slow commands. The fixes are command-specific: replace KEYS with SCAN, SMEMBERS with SSCAN, HGETALL with HMGET or HSCAN, and unbounded LRANGE with bounded ranges, as shown in the examples below.

If you cannot eliminate the slow command (perhaps a third-party library uses KEYS * internally), isolate it on a separate Redis instance so it does not block your primary cache. Run administrative and analytical commands against a read replica, not the primary. And on Redis 6.0 or newer, enable io-threads to at least parallelize the I/O portion of command handling, even though command execution itself remains single-threaded.

# Replace KEYS (blocks entire server) with SCAN (non-blocking):

# BAD: blocks event loop for 10-100ms on large keyspaces
KEYS session:*

# GOOD: cursor-based, non-blocking
SCAN 0 MATCH session:* COUNT 100
# Returns: cursor + batch of matching keys
# Repeat with returned cursor until cursor = 0

# In application code (Python, redis-py):
import redis

r = redis.Redis(host="your-redis-host")

cursor = 0
keys = []
while True:
    cursor, batch = r.scan(cursor, match="session:*", count=100)
    keys.extend(batch)
    if cursor == 0:
        break
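
The same cursor pattern applies to the other O(N) offenders. A brief redis-py sketch:

# Incremental alternatives to SMEMBERS and HGETALL (illustrative):
import redis

r = redis.Redis(host="your-redis-host")

# Instead of SMEMBERS large_set (one O(N) blocking call):
members = list(r.sscan_iter("large_set", count=100))

# Instead of HGETALL large_hash (hscan_iter yields (field, value) pairs):
fields = dict(r.hscan_iter("large_hash", count=100))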

Cause 5: TCP Backlog Full Under Connection Burst

The Symptoms

During a traffic spike or deployment (when many application instances restart simultaneously), you see a burst of ECONNREFUSED or ETIMEDOUT errors that resolves within 30-60 seconds. Redis is not overloaded -- the problem is that too many new TCP connections arrive at once and exceed the TCP backlog queue. New connection attempts are dropped by the kernel before Redis even sees them.

This is most common during rolling deployments, auto-scaling events, or traffic surges. When 50 application instances restart within 10 seconds and each instance opens 20 Redis connections, that is 1,000 new TCP connections in 10 seconds. If the TCP backlog is set to 128 (a common Linux default), connections 129 through 1,000 are silently dropped by the kernel. The client retries after a timeout (typically 1-5 seconds), creating a cascading delay.

The Diagnostic

# Check Redis's configured TCP backlog:
redis-cli CONFIG GET tcp-backlog
# Default: 511

# Check the kernel's actual backlog limit:
cat /proc/sys/net/core/somaxconn
# Default on many Linux systems: 128
# Redis uses min(tcp-backlog, somaxconn), so if somaxconn is 128,
# Redis's effective backlog is 128 even if tcp-backlog is 511.

# Check for SYN drops (connections dropped due to full backlog):
netstat -s | grep -i "listen"
# Look for: "times the listen queue of a socket overflowed"
# or: "SYNs to LISTEN sockets dropped"
# If these counters are increasing, the backlog is full.

# On the Redis server, check for connection rejection:
redis-cli INFO stats | grep rejected
# rejected_connections:0  <-- Redis-level rejections (maxclients reached)
# NOTE: TCP backlog drops happen BELOW Redis, so rejected_connections
# will be 0 even if the backlog is full. You must check kernel stats.

The Fix

Increase both the Redis tcp-backlog and the kernel's somaxconn to handle connection bursts. The right value depends on the maximum number of new connections you expect in a short window. For most production deployments, 4096 is a safe value that handles even aggressive rolling deployments.

# 1. Increase kernel backlog (requires root):
sudo sysctl -w net.core.somaxconn=4096

# Persist across reboots:
echo "net.core.somaxconn = 4096" | sudo tee -a /etc/sysctl.conf

# Also increase the SYN backlog:
sudo sysctl -w net.ipv4.tcp_max_syn_backlog=4096

# 2. Increase Redis tcp-backlog:
redis-cli CONFIG SET tcp-backlog 4096

# Persist in redis.conf:
# tcp-backlog 4096

# 3. Stagger your deployments:
# Instead of restarting all instances at once, use rolling deploys
# with a 10-30 second delay between instances. This spreads the
# connection burst over time.

# 4. Pre-warm connections:
# Open Redis connections during startup before accepting traffic.
# Most connection pool libraries support min-idle settings:
# JedisPoolConfig.setMinIdle(10)  -- pre-create 10 connections
# This shifts the connection burst from "under load" to "during startup"
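
In Python, where redis-py has no min-idle setting to our knowledge, you can approximate pre-warming by issuing a few concurrent PINGs at startup, which forces the pool to open real TCP connections before traffic arrives (an illustrative sketch):

# Pre-warm a redis-py pool at startup (sketch):
import redis
from concurrent.futures import ThreadPoolExecutor

pool = redis.BlockingConnectionPool(host="your-redis-host", max_connections=50)
r = redis.Redis(connection_pool=pool)

def prewarm(n=10):
    # n concurrent PINGs make the pool create n connections up front,
    # shifting the connection burst from "under load" to "during startup".
    with ThreadPoolExecutor(max_workers=n) as ex:
        list(ex.map(lambda _: r.ping(), range(n)))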

The deeper fix is to reduce the total number of Redis connections your application needs. Connection pooling with appropriately sized pools (Cause 1) helps. But the most effective approach is to reduce the number of operations that require a Redis connection in the first place. An in-process L1 cache that absorbs 80% of reads means your application needs 80% fewer simultaneous Redis connections, which means the connection pool can be 80% smaller, which means the TCP backlog is 80% less likely to fill up during a burst.

Quick Reference: Diagnostic Decision Tree

When Redis timeouts occur, work through this decision tree in order. Each step either identifies the cause or eliminates it.

Check                              | Command             | If True            | Cause
-----------------------------------|---------------------|--------------------|--------------------------------
Redis responds to PING?            | redis-cli PING      | No response        | Redis is down or unreachable
connected_clients near maxclients? | INFO clients        | Within 90%         | Connection limit / pools too large
used_memory near maxmemory?        | INFO memory         | Over 95%           | Cause 3: OOM with noeviction
Slow commands in SLOWLOG?          | SLOWLOG GET 10      | Commands > 10ms    | Cause 4: Blocking commands
Network latency elevated?          | redis-cli --latency | avg > 1ms same-AZ  | Cause 2: Network issue
SYN drops in kernel stats?         | netstat -s          | Counter increasing | Cause 5: TCP backlog full
App logs show pool exhaustion?     | Application logs    | "pool exhausted"   | Cause 1: Pool too small

The Structural Fix: Reduce Redis Dependency

Each of the five causes above has a specific fix. But they all share a common root cause: your application depends on Redis for too many operations. Every Redis operation requires a network connection, a network round-trip, and a slot in the connection pool. The more operations you send to Redis, the more pressure you put on connections, the more sensitive you are to network latency, the more likely you are to hit maxmemory, the more likely a slow command will block your fast commands, and the more likely a deployment will overwhelm the TCP backlog.

The structural fix is to reduce the number of operations that require Redis at all. An in-process L1 cache absorbs hot reads before they reach the network. If 80% of your Redis reads are for a few thousand frequently-accessed keys (session tokens, feature flags, user permissions, configuration values), an L1 cache eliminates 80% of your Redis operations. That means 80% fewer connections needed, 80% less network bandwidth consumed, 80% less Redis CPU used, and 80% more headroom before you hit any of the five failure modes above.
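
To make the idea concrete, here is a toy read-through L1 in Python (illustrative only; a real implementation needs a bounded size and an invalidation strategy):

# Toy in-process L1 in front of Redis (sketch -- not production-ready):
import time
import redis

r = redis.Redis(host="your-redis-host")
_l1 = {}        # key -> (value, expires_at); unbounded here for brevity
L1_TTL = 5.0    # seconds; a short TTL bounds staleness

def cached_get(key):
    entry = _l1.get(key)
    if entry and entry[1] > time.monotonic():
        return entry[0]                   # L1 hit: no network, no pool checkout
    value = r.get(key)                    # L1 miss: one Redis round-trip
    _l1[key] = (value, time.monotonic() + L1_TTL)
    return value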

The 2 AM Incident Checklist

Print this and keep it near your on-call workstation. When Redis timeouts hit, run these five commands in order. Each takes under 5 seconds. Together, they will identify the cause in under 2 minutes.

1. redis-cli PING -- Is Redis alive?

2. redis-cli INFO clients -- Connection count and rejections?

3. redis-cli INFO memory -- Memory vs maxmemory?

4. redis-cli SLOWLOG GET 10 -- Any blocking commands?

5. redis-cli --latency -- Network latency normal?

The Bottom Line

Redis connection timeouts are almost never caused by Redis being slow. They are caused by connection pool exhaustion, network issues, memory limits with the wrong eviction policy, blocking commands on the single-threaded event loop, or TCP backlog saturation during connection bursts. Each has a specific fix. The structural fix that prevents all five is to reduce the number of operations that require a Redis connection. An in-process L1 cache at 31 nanoseconds absorbs hot reads before they touch the network, reducing connection pressure, network bandwidth, Redis CPU, and the blast radius of every failure mode on this list.

Stop debugging Redis timeouts at 2 AM. Move hot reads to L1 at 31 nanoseconds.

brew install cachee