How to Debug Cache Performance Issues in Production
Your application was fast yesterday. Today it's slow. Requests that took 50ms now take 2 seconds. The cache is the likely culprit, but where do you start? This guide provides a systematic approach to diagnosing and fixing cache performance issues in production without causing more problems.
Symptoms and Root Causes
Common cache performance symptoms and what they indicate:
| Symptom | Likely Cause |
|---|---|
| Slow response times | Low hit rate, cache stampede |
| Database overload | Cache misses, poor hit rate |
| Memory errors | Cache full, aggressive eviction |
| Intermittent slowness | Cache stampede at expiry |
| High CPU on cache server | Inefficient queries, large keys |
Step 1: Check Hit Rate
Hit rate is the first metric to examine. Below 85% typically indicates a problem.
# Redis hit rate
redis-cli INFO stats | grep -E "keyspace_hits|keyspace_misses"
# Calculate hit rate
keyspace_hits:125423
keyspace_misses:8234
hit_rate = 125423 / (125423 + 8234) = 93.8%
# Low hit rate (<85%) indicates:
# - Insufficient cache size
# - Poor TTL configuration
# - Cache not warming properly
# - Keys not structured optimally
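If you'd rather automate this check than compute it by hand, the same counters can be read from INFO stats in application code. A minimal sketch, assuming a Node.js service with an ioredis client (the connection details and the 85% threshold mirror the values above):
const Redis = require('ioredis');
const redis = new Redis(); // assumes Redis on localhost:6379
async function checkHitRate() {
  // INFO stats returns a newline-delimited "field:value" block
  const stats = await redis.info('stats');
  const field = (name) => Number(stats.match(new RegExp(`${name}:(\\d+)`))[1]);
  const hits = field('keyspace_hits');
  const misses = field('keyspace_misses');
  const total = hits + misses;
  const hitRate = total === 0 ? 1 : hits / total;
  if (hitRate < 0.85) {
    console.warn(`Cache hit rate is low: ${(hitRate * 100).toFixed(1)}%`);
  }
  return hitRate;
}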
Quick Hit Rate Fixes
# 1. Check cache memory usage
redis-cli INFO memory
# If memory is maxed out:
used_memory:8.5G
maxmemory:8G
# → Increase memory or optimize eviction
# 2. Check eviction stats
evicted_keys:45231 # High = cache too small
# 3. Examine key distribution
redis-cli --bigkeys
# Identifies large keys consuming memory
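The same idea works for the memory and eviction checks: poll INFO memory and INFO stats on a schedule and alert before the cache starts evicting hot data. A rough sketch using the same ioredis client as in the hit-rate sketch above (the 90% threshold is an assumption):
async function checkMemoryPressure() {
  const memory = await redis.info('memory');
  const stats = await redis.info('stats');
  const field = (block, name) => Number(block.match(new RegExp(`${name}:(\\d+)`))[1]);
  const used = field(memory, 'used_memory');
  const max = field(memory, 'maxmemory'); // 0 means no limit is configured
  const evicted = field(stats, 'evicted_keys');
  if (max > 0 && used / max > 0.9) {
    console.warn(`Memory at ${Math.round((used / max) * 100)}% of maxmemory; ${evicted} keys evicted so far`);
  }
}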
Step 2: Identify Cache Stampedes
A cache stampede occurs when many requests miss the same cold or just-expired key at the same moment, and every one of them falls through to the backend at once, overwhelming it.
Detecting Stampedes
# Symptom: Periodic latency spikes
# Check application logs for patterns
# Look for:
# - Latency spikes every N minutes (matching TTL)
# - Database query spikes
# - Multiple identical cache miss logs
# Example log pattern indicating stampede:
2025-12-21 10:00:00 Cache miss: product:123
2025-12-21 10:00:00 Cache miss: product:123
2025-12-21 10:00:00 Cache miss: product:123
[... 50 more identical misses in same second]
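One way to turn that log pattern into an alert is to count misses per key per second inside the application itself. A minimal in-process sketch (the 10-miss threshold and the crude cleanup strategy are arbitrary assumptions):
// Count cache misses per key within one-second buckets
const missCounts = new Map();
function recordCacheMiss(key) {
  const second = Math.floor(Date.now() / 1000);
  const bucket = `${key}:${second}`;
  const count = (missCounts.get(bucket) || 0) + 1;
  missCounts.set(bucket, count);
  if (count === 10) {
    // Many misses for the same key in the same second looks like a stampede
    console.warn(`Possible cache stampede on ${key}: ${count} misses this second`);
  }
  // Crude cleanup so the map doesn't grow without bound
  if (missCounts.size > 10000) missCounts.clear();
}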
Stampede Mitigation
// Add request coalescing
const inFlightRequests = new Map();
async function getCached(key) {
  const cached = await cache.get(key);
  if (cached) return cached;
  // Check if another request is already fetching this key
  if (inFlightRequests.has(key)) {
    return await inFlightRequests.get(key);
  }
  // Fetch once and share the result with all concurrent requests
  const promise = fetchFromDatabase(key)
    .then(async (data) => {
      await cache.set(key, data, { ttl: 300 });
      return data;
    })
    .finally(() => {
      // Always clear the entry, even if the fetch fails,
      // so a transient error doesn't poison future requests
      inFlightRequests.delete(key);
    });
  inFlightRequests.set(key, promise);
  return promise;
}
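Request coalescing handles concurrent misses within a single process. A complementary, widely used mitigation is to add jitter to TTLs so that keys written at the same time don't all expire in the same second. A small sketch (the ±10% jitter is an arbitrary choice):
// Spread expirations by randomizing the TTL around its base value
function jitteredTtl(baseTtlSeconds, jitterRatio = 0.1) {
  const jitter = baseTtlSeconds * jitterRatio;
  return Math.round(baseTtlSeconds - jitter + Math.random() * 2 * jitter);
}
// Inside the fetch path above:
// await cache.set(key, data, { ttl: jitteredTtl(300) });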
Step 3: Analyze Slow Cache Operations
Sometimes the cache itself is slow. Diagnose using slowlog and latency tracking.
# Redis SLOWLOG - shows slow commands
redis-cli SLOWLOG GET 10
# Example output:
1) 1) (integer) 12 # Log entry ID
2) (integer) 1640000000 # Timestamp
3) (integer) 15234 # Execution time (microseconds)
4) 1) "GET"
2) "user:profile:12345:preferences:settings"
# Slow operations indicate:
# - Large values being retrieved
# - Network latency
# - Blocking operations (KEYS, SMEMBERS on large sets)
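The slowlog can also be polled from application code so slow commands surface in your own monitoring rather than only in the CLI. A rough sketch with ioredis (the 10 ms threshold is an assumption; newer Redis versions append client address and name fields to each entry, which this sketch ignores):
async function reportSlowCommands(redis) {
  // Each SLOWLOG GET entry is [id, unixTimestamp, microseconds, [command, ...args], ...]
  const entries = await redis.slowlog('GET', 10);
  for (const [id, timestamp, micros, args] of entries) {
    if (micros > 10000) { // longer than 10 ms
      console.warn(`Slow Redis command #${id} (${micros}µs): ${args.join(' ')}`);
    }
  }
}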
Common Slow Operations
# BAD: KEYS command in production (blocks Redis)
KEYS user:* # Scans all keys - O(N) operation
# GOOD: Use SCAN instead
SCAN 0 MATCH user:* COUNT 100 # Non-blocking iteration
# BAD: Retrieving entire large set
SMEMBERS large:set # Returns all members at once
# GOOD: Use SSCAN for large sets
SSCAN large:set 0 COUNT 100
# BAD: Large value storage
SET config:data "{ ... 10MB JSON ... }"
# GOOD: Compress or break into smaller chunks
SET config:data:compressed [compressed 1MB data]
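In application code the SCAN replacement above maps naturally onto a stream. A short sketch using ioredis, which exposes scanStream for exactly this kind of non-blocking iteration (the match pattern is just an example):
const Redis = require('ioredis');
const redis = new Redis();
const stream = redis.scanStream({ match: 'user:*', count: 100 });
stream.on('data', (keys) => {
  // keys is one batch of matching key names; process incrementally
  for (const key of keys) {
    console.log('found', key);
  }
});
stream.on('end', () => console.log('scan complete'));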
Step 4: Monitor Memory Usage
Memory issues cause evictions, which hurt hit rate and performance.
# Check Redis memory stats
redis-cli INFO memory
# Key metrics:
used_memory_human:7.5G # Actual memory used
maxmemory_human:8.0G # Configured limit
mem_fragmentation_ratio:1.23 # >1.5 indicates fragmentation
# Memory policy
maxmemory_policy:allkeys-lru # How Redis evicts
# If memory is full:
# Option 1: Increase memory
# Option 2: Optimize data structures
# Option 3: Reduce TTLs
# Option 4: Implement better eviction policy
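For option 4, the eviction policy can be changed at runtime without a restart. allkeys-lfu (Redis 4.0+) evicts the least frequently used keys and often fits read-heavy caches better than allkeys-lru. A small sketch of the switch, assuming an ioredis client:
async function switchToLfuEviction(redis) {
  // Evict the least *frequently* used keys instead of the least recently used
  await redis.config('SET', 'maxmemory-policy', 'allkeys-lfu');
  const [, policy] = await redis.config('GET', 'maxmemory-policy');
  console.log('eviction policy is now', policy);
}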
Finding Memory Hogs
# Find largest keys
redis-cli --bigkeys
# Output shows:
[00.00%] Biggest string: user:123:session (512KB)
[00.00%] Biggest list: notifications:456 (2048 items)
[00.00%] Biggest hash: product:789 (10MB)
# Investigate large keys
redis-cli MEMORY USAGE product:789
# Shows: (integer) 10485760 bytes
# Fix: Compress or restructure
# Before:
SET product:789 [10MB JSON]
# After: Store only essential fields
HSET product:789 id 789 name "Product" price 49.99
# Reference full data in database
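For the compress-or-restructure fix, large JSON values can be gzipped before they go into Redis using Node's built-in zlib. A minimal sketch (the 10 KB threshold and key handling are assumptions; the reader detects gzip by its magic bytes):
const zlib = require('zlib');
async function setCompressed(redis, key, value, ttlSeconds) {
  const json = JSON.stringify(value);
  // Only compress values that are actually large
  const payload = json.length > 10 * 1024 ? zlib.gzipSync(json) : json;
  await redis.set(key, payload, 'EX', ttlSeconds);
}
async function getCompressed(redis, key) {
  const raw = await redis.getBuffer(key); // read as a Buffer so gzip bytes survive
  if (raw === null) return null;
  const isGzip = raw.length > 2 && raw[0] === 0x1f && raw[1] === 0x8b;
  return JSON.parse(isGzip ? zlib.gunzipSync(raw).toString() : raw.toString());
}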
Step 5: Detect Inefficient Key Patterns
Poor key naming and structure lead to slow operations and memory waste.
# Anti-pattern: Using KEYS to find related data
KEYS user:123:* # Scans ALL keys - very slow
# Better: Use hash tags for co-location
# The {123} hash tag keeps all of a user's keys in the same cluster slot
user:{123}:profile
user:{123}:preferences
user:{123}:sessions
# Best: Use Redis hashes for structured data
HSET user:123 profile {JSON} preferences {JSON}
# Anti-pattern: Very long key names
SET user_profile_data_for_authenticated_user_id_123_with_preferences "data"
# Better: Concise key names
SET user:123:profile "data"
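A small helper keeps key names short and consistent across a codebase (the entity:id:field scheme here is only an example):
// Build keys as entity:id[:field], e.g. user:123:profile
function cacheKey(entity, id, field) {
  return field ? `${entity}:${id}:${field}` : `${entity}:${id}`;
}
// cacheKey('user', 123, 'profile') -> 'user:123:profile'
// cacheKey('product', 789)         -> 'product:789'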
Step 6: Trace End-to-End Latency
Identify where time is spent in cache operations.
// Instrument cache operations
async function getCached(key) {
const startTime = Date.now();
// Network latency to cache
const cacheStart = Date.now();
const cached = await cache.get(key);
const cacheLatency = Date.now() - cacheStart;
if (cached) {
metrics.record('cache.latency', cacheLatency, { result: 'hit' });
return cached;
}
// Database latency on miss
const dbStart = Date.now();
const data = await database.query(key);
const dbLatency = Date.now() - dbStart;
// Cache write latency
const writeStart = Date.now();
await cache.set(key, data, { ttl: 300 });
const writeLatency = Date.now() - writeStart;
metrics.record('cache.latency', cacheLatency, { result: 'miss' });
metrics.record('db.latency', dbLatency);
metrics.record('cache.write_latency', writeLatency);
const totalLatency = Date.now() - startTime;
if (totalLatency > 100) {
logger.warn(`Slow cache operation: ${totalLatency}ms`, {
key,
cacheLatency,
dbLatency,
writeLatency
});
}
return data;
}
Step 7: Check Connection Pool Health
Connection pool exhaustion causes slow cache operations.
// Monitor connection pool stats
const poolStats = cache.getPoolStats();
console.log({
total: poolStats.totalConnections, // All connections
active: poolStats.activeConnections, // Currently in use
idle: poolStats.idleConnections, // Available
waiting: poolStats.waitingRequests // Queued requests
});
// Warning signs:
// - waiting > 0: Pool is saturated
// - active ≈ max: Need larger pool
// - idle = 0: Pool too small for load
// Fix: increase the pool size (exact option names vary by Redis client library)
const cache = new Redis({
host: 'localhost',
maxRetriesPerRequest: 3,
// Increase connection pool
connectionPool: {
min: 10,
max: 100 // Was 20, increased
}
});
Step 8: Examine Eviction Patterns
Understand what's being evicted and why.
# Track evicted keys
redis-cli INFO stats | grep evicted
evicted_keys:12453 # Total evictions since start
# Set up eviction monitoring
CONFIG SET notify-keyspace-events Ee
# E = keyevent notifications (__keyevent@<db>__ channels), e = evicted-key events
# Subscribe to eviction notifications
redis-cli --csv PSUBSCRIBE '__keyevent@0__:evicted'
# Log evicted keys to understand patterns
# High eviction of frequently-accessed keys = memory too small
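With notifications enabled, a small subscriber can log exactly which keys get evicted; if hot keys show up repeatedly, the cache is undersized. A sketch with ioredis (a subscribed connection can't run other commands, so it gets its own client; database 0 is assumed):
const Redis = require('ioredis');
const subscriber = new Redis(); // dedicated connection for pub/sub
subscriber.psubscribe('__keyevent@0__:evicted', (err) => {
  if (err) console.error('subscribe failed', err);
});
subscriber.on('pmessage', (pattern, channel, evictedKey) => {
  // The message payload is the name of the evicted key
  console.warn(`Evicted: ${evictedKey}`);
});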
Production Debugging Checklist
Immediate Actions (5 minutes)
- Check hit rate: redis-cli INFO stats
- Check memory: redis-cli INFO memory
- Check slow operations: redis-cli SLOWLOG GET 10
- Review recent deployments or traffic changes
Deep Investigation (30 minutes)
- Analyze access patterns with redis-cli MONITOR (sample briefly!)
- Check for cache stampedes in application logs
- Examine key distribution with redis-cli --bigkeys
- Review eviction policy and tune if needed
- Trace end-to-end latency in application code
Long-Term Fixes (hours to days)
- Optimize data structures (use hashes, compress large values)
- Implement request coalescing for stampede prevention
- Adjust TTLs based on observed access patterns
- Scale cache infrastructure (more memory/nodes)
- Add monitoring and alerting for key metrics
Essential Monitoring Metrics
Set up continuous monitoring for these metrics:
// Key cache metrics to track
const metrics = {
// Performance
hitRate: 0.95, // Target: >90%
p50Latency: 2, // milliseconds
p95Latency: 5, // milliseconds
p99Latency: 15, // milliseconds
// Capacity
memoryUsage: 0.75, // 75% of max
evictionRate: 100, // evictions/second
connectionPoolUtilization: 0.60,
// Health
errorRate: 0.001, // 0.1%
timeouts: 5, // timeouts/minute
connectionFailures: 0
};
// Alert thresholds
if (metrics.hitRate < 0.85) alert('Low cache hit rate');
if (metrics.p99Latency > 50) alert('High cache latency');
if (metrics.memoryUsage > 0.9) alert('Cache memory high');
if (metrics.evictionRate > 1000) alert('High eviction rate');
Conclusion
Debugging cache performance issues requires systematic analysis: start with hit rate, identify stampedes, analyze slow operations, monitor memory, trace latency, and check connection pools. Most issues fall into a few categories: insufficient memory, cache stampedes, inefficient operations, or poor key design.
The key is having good monitoring in place before problems occur. Track hit rate, latency percentiles, memory usage, and eviction rates continuously. When issues arise, use the debugging checklist to quickly identify and resolve root causes.
Automated Cache Performance Monitoring
Cachee.ai includes built-in performance monitoring with automatic anomaly detection and optimization recommendations.