Why Your Redis Latency Spikes at 3 AM (and How to Stop It)

Every night between 2 and 4 AM, your Redis P99 latency spikes from 1ms to 50ms. Your alerts fire. PagerDuty wakes someone up. By the time they pull up the dashboard, latency is back to normal. The engineer shrugs, marks the alert as resolved, and goes back to sleep. This happens three or four times a week. You have been ignoring it, chalking it up to “noisy monitoring.” But something is actually wrong — and it is costing you more than lost sleep. Here is what is happening, why it happens at night, and how to make your application completely immune to it.

The 3 AM Suspects

Redis does not randomly slow down at night. The timing is predictable because the causes are scheduled. There are four processes that converge in the early morning hours, each one capable of spiking latency on its own. Together, they create a perfect storm.

1. RDB Snapshot Fork

Redis persists data to disk using RDB snapshots. By default, it triggers a BGSAVE when a certain number of writes occur within a time window — for example, 10,000 writes in 60 seconds. During low-traffic overnight hours, write thresholds tend to accumulate and trigger right around the 2–4 AM window. When BGSAVE fires, Redis forks the entire process. The operating system must copy the page tables for the entire memory space. On a 10GB dataset, this fork operation takes 10–20 milliseconds. During that time, the main Redis thread is completely blocked. Every incoming command — every GET, every SET, every pipeline — waits. Your P99 jumps from 1ms to 50ms in a single frame because the main thread literally cannot process commands while the kernel duplicates page tables.

# Check when the last BGSAVE happened and how long it took
redis-cli INFO persistence
# Look for:
#   rdb_last_bgsave_time_sec: 2     (fork duration)
#   rdb_last_bgsave_status: ok
#   rdb_current_bgsave_time_sec: -1  (not running now)
        

2. AOF Rewrite

If you are using Append-Only File persistence, Redis periodically rewrites the AOF to compact it. The rewrite itself runs in a background child process, but it creates significant I/O contention. The child process writes the entire dataset to a temporary file while the parent process continues appending new commands to the old AOF. When the rewrite finishes, Redis swaps the files and must fsync the new AOF to disk. On systems with slow I/O or limited throughput (especially network-attached EBS volumes on AWS), this fsync can stall the main thread for tens of milliseconds. If the AOF rewrite coincides with the RDB snapshot — which it often does during off-peak hours — the combined I/O pressure doubles the latency impact.

# Check AOF rewrite status
redis-cli INFO persistence | grep aof
# aof_rewrite_in_progress: 1 means it's happening RIGHT NOW
# aof_last_rewrite_time_sec shows how long it took
        

3. Key Expiration Storms

This one is subtle and often overlooked. When your application deploys at 2 PM and sets TTLs of 12 hours, those keys all expire at 2 AM. When your deploy happens at 4 PM with 12-hour TTLs, they expire at 4 AM. Redis handles expiration in two ways: lazily (when a key is accessed) and actively (a background task that samples 20 random keys with TTLs every 100ms and deletes expired ones). When a large percentage of keys expire simultaneously, the active expiration loop runs aggressively. If more than 25% of sampled keys are expired, Redis loops again immediately — and it keeps looping until fewer than 25% are expired. During a mass expiration event, this loop can dominate the CPU for several hundred milliseconds, starving the event loop and spiking latency for every connected client.

            The deploy-time trap: If your application sets all cache TTLs to the same duration (e.g., 12 hours, 24 hours) relative to deploy time, you are guaranteed to hit an expiration storm at a predictable offset from every deploy. The keys do not expire gradually — they all expire at once, and Redis blocks while cleaning them up.
        

4. Cron Jobs Hammering the Cache

The 2–4 AM window is when every team schedules their batch jobs. Data pipelines rebuild caches. Analytics jobs run SCAN over millions of keys. Report generators issue hundreds of MGET commands. Each job thinks it is the only one using Redis at night. In reality, three or four cron jobs all hit Redis simultaneously, saturating the single-threaded event loop at the exact moment that RDB snapshots and AOF rewrites are also running. The collision is not a coincidence — it is a convergence of defaults.

Why fork() Is the Real Killer

Of all the 3 AM suspects, the RDB fork is the one that causes the most damage and is the hardest to mitigate. Understanding why requires knowing what fork() actually does at the operating system level.

When Redis calls fork() to create a child process for BGSAVE, the kernel must duplicate the parent’s page table — the data structure that maps virtual addresses to physical memory pages. Redis does not use huge pages by default, so a 10GB dataset spread across 4KB pages requires approximately 2.5 million page table entries. Copying those entries is an O(n) operation proportional to the size of the dataset, and it runs on the main thread. The child process gets a copy-on-write view of memory, which is efficient for the child — but the parent is frozen while the kernel does the copying.

The numbers are unforgiving. A 10GB dataset takes 10–20ms to fork. A 25GB dataset takes 25–50ms. A 50GB dataset can take 100ms or more. During that entire window, Redis is not processing commands. It is not even reading from sockets. Every client with an in-flight request sees that full fork duration added directly to their response time. If your P99 target is 5ms and your fork takes 40ms, you have blown your SLA by 8x — and there is nothing you can do about it while the fork is in progress.

# Monitor fork duration in real-time
redis-cli INFO stats | grep latest_fork_usec
# latest_fork_usec: 15432   (15ms fork — your P99 spike)
# Correlate this timestamp with your latency alerts
        

            The memory trap: Enabling Transparent Huge Pages (THP) makes fork faster by reducing the number of page table entries, but it makes copy-on-write dramatically worse — a single write to a 2MB huge page copies the entire 2MB instead of 4KB. Redis documentation explicitly recommends disabling THP. There is no free lunch with fork.
        

The Fixes That Help

Each of these mitigations reduces the severity of 3 AM spikes. None of them eliminate the problem entirely.

Disable RDB if you are using AOF. If you have AOF enabled, you already have durability. There is no reason to also run RDB snapshots, which trigger additional forks. Set save "" in your Redis config to disable all automatic RDB snapshots. This eliminates one fork source entirely.

# redis.conf: Disable RDB snapshots (AOF provides durability)
save ""
# Prevent fsync stalls during AOF rewrite
no-appendfsync-on-rewrite yes
        

Use no-appendfsync-on-rewrite yes. This tells Redis not to call fsync() on the AOF during a rewrite. You accept a small durability risk — if Redis crashes during the rewrite, you could lose the last few seconds of writes. But you eliminate the I/O contention that causes main-thread stalls during rewrites.

Add jitter to TTL expiry. Instead of setting every cache key to TTL = 43200 (12 hours), add a random offset: TTL = 43200 + random(0, 3600). This spreads expirations over an hour-long window instead of a single second. The active expiration loop never triggers its aggressive mode because expired keys are always below the 25% threshold.

# Instead of this:
redis.setex("user:123", 43200, data)

# Do this — add 0-3600 seconds of random jitter:
ttl = 43200 + random.randint(0, 3600)
redis.setex("user:123", ttl, data)
        

Stagger your cron jobs. Do not schedule data rebuilds, analytics scans, and report generators at the same time. Spread them across the overnight window with 15–30 minute gaps. Ensure no batch job runs during the RDB snapshot window. This is operational discipline, not a technical fix — and it breaks the first time a new team adds a cron job without checking the schedule.

These are all good practices. You should implement every one. But even with all four in place, you still have a process that forks to write AOF, you still have keys expiring, and you still have background operations competing with the event loop. The latency floor for a network-bound, single-threaded server remains unchanged. You have reduced the spikes from 50ms to maybe 15ms. You have not eliminated them.

The Fix That Eliminates the Problem

The reason 3 AM spikes hurt your users is that every cache read traverses the network to reach Redis. When Redis stalls for 20ms during a fork, every read that lands during that window sees a 20ms response. The stall propagates directly to your application’s P99 because there is nothing between your application and Redis to absorb the impact.

But what if 99% of your reads never reached Redis at all?

An L1 in-process cache sits inside your application’s memory space. When your code calls cache.get("user:123"), it resolves from a local hash table in 1.5 microseconds — no TCP connection, no network hop, no event loop contention. Redis can fork, rewrite its AOF, expire ten thousand keys, and process four cron jobs simultaneously. Your application does not notice because it is not reading from Redis. The hot path is entirely local.

This is the architecture that Cachee provides. It deploys as a transparent layer between your application and Redis, maintaining an L1 in-process cache that serves hot reads locally and syncs with Redis in the background. Predictive pre-warming learns your access patterns and pre-loads keys into L1 before they are requested, achieving a 99%+ L1 hit rate. The remaining 1% of reads — cold keys, first-access keys — still go to Redis. But that 1% can tolerate a 20ms fork stall because it is not on your critical path.

The 3 AM spikes still happen inside Redis. They just stop mattering. Your RDB snapshot can take 50ms to fork. Your AOF can rewrite for 30 seconds. Keys can expire in batches. None of it affects your application’s P99 because your application is reading from memory that is measured in microseconds, not milliseconds. Redis becomes a background persistence layer — important for durability, irrelevant for latency. You stop getting paged at 3 AM because there is nothing to page about.

            Redis maintenance becomes invisible. With a 99% L1 hit rate, Redis latency spikes affect 1% of reads. A 50ms spike on 1% of traffic adds 0.5ms to your blended P99 — well within any SLA. Without L1, that same spike adds 50ms to 100% of reads. The math is the entire argument.
        

Stop Getting Paged at 3 AM

Here is what the numbers look like before and after an L1 tier absorbs your hot reads.

100% L1 Hit Rate

50ms P99 Without L1 (3 AM)

0.004ms P99 With L1 (3 AM)

0 3 AM Pages/Month

Your Redis instance still forks. It still rewrites AOF files. Keys still expire. But your on-call engineer sleeps through the night because none of those operations touch your hot read path. The latency spike happens inside Redis. Your application never sees it.

Let Redis do its maintenance in peace. Stop routing every read through a process that needs to pause for housekeeping. Move the hot path out of the network and into the application, where a 1.5-microsecond lookup makes background persistence invisible.

Let Redis Do Its Maintenance in Peace.

See how 1.5µs L1 lookups make RDB forks, AOF rewrites, and TTL storms invisible to your users.

Start Free Trial Schedule Demo