
Reduce ElastiCache Costs: Engineering Playbook

April 27, 2026 | 14 min read | Engineering

The ElastiCache bill arrives and it is higher than expected. You Google "reduce elasticache costs." You find articles that tell you to use reserved instances and smaller node types. You apply those optimizations, save 15%, and the bill is still too high. Six months later, traffic has grown, the bill has grown with it, and you are back to searching for answers.

This is the wrong cycle. Reserved instances and right-sizing are marginal optimizations. They shave percentages off a fundamentally expensive architecture. The structural problem is that ElastiCache is a network cache, and you are using it as your only cache tier. Every read, no matter how hot, crosses the network. Every value, no matter how small, gets serialized and deserialized. Every request, no matter how frequently repeated, pays the same latency and cost.

This post is the engineering playbook for actually reducing ElastiCache costs -- not by 15%, but by 60-70%. The approach is to add an in-process L1 cache tier that absorbs hot reads at 31 nanoseconds, shrink the share of requests that reach ElastiCache from 100% to the 5-15% that genuinely need a network cache, and downsize your cluster to match the reduced load. ElastiCache is not bad. It is expensive for the wrong workload. Move the wrong workload out, and ElastiCache becomes affordable.

Why ElastiCache Costs More Than You Think

The number on the AWS bill is the instance cost. For a typical production deployment -- a 3-node r7g.xlarge cluster with one primary and two replicas -- the instance cost is approximately $0.86 per hour per node, or $1,883 per month for three nodes. That is the number your finance team sees. It is only 37% of the actual cost.

The Full Cost Breakdown

Let us walk through a real cost breakdown for a 3-node r7g.xlarge ElastiCache cluster serving a mid-size SaaS application at 50,000 operations per second.

Cost Component                                     | Monthly Cost | % of Total
ElastiCache instance hours (3x r7g.xlarge)         | $1,883       | 37%
Cross-AZ data transfer ($0.01/GB each way)         | $1,555       | 31%
Serialization CPU on app fleet                     | $390         | 8%
Over-provisioning headroom (40% idle capacity)     | $753         | 15%
Engineering time (monitoring, upgrades, incidents) | $500         | 10%
Total real cost                                    | $5,081       | 100%

The instance cost is $1,883. The real cost is $5,081. That is a 2.7x multiplier, and this is a modest deployment. Larger clusters with higher throughput see multipliers of 3-5x because cross-AZ data transfer scales linearly with throughput while instance costs have step functions.
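The multiplier is straight arithmetic over the table above. A quick check, with the table's figures:

```python
# Monthly cost components from the table above (USD/month)
components = {
    "instance_hours": 1883,
    "cross_az_transfer": 1555,
    "serialization_cpu": 390,
    "over_provisioning": 753,
    "engineering_time": 500,
}

visible = components["instance_hours"]   # what the AWS bill attributes to ElastiCache
real = sum(components.values())          # what you actually pay
multiplier = real / visible

print(f"visible ${visible:,}/mo, real ${real:,}/mo, {multiplier:.1f}x multiplier")
# -> visible $1,883/mo, real $5,081/mo, 2.7x multiplier
```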

$1,883 visible bill (instance hours) | $5,081 real cost (all components) | 2.7x hidden cost multiplier

Cross-AZ Data Transfer: The Silent Killer

Cross-AZ data transfer is the largest hidden cost and the one that surprises teams the most. AWS charges $0.01 per GB in each direction for traffic that crosses availability zone boundaries. A single ElastiCache operation with a 2 KB value generates approximately 2.5 KB of total transfer (value + protocol overhead + response). At 50,000 ops/sec, that is 125 MB/sec of ElastiCache traffic. If your application fleet is spread across two AZs (standard for high availability), approximately half of that traffic crosses AZ boundaries: 62.5 MB/sec cross-AZ, or 5.4 TB per day, or 162 TB per month. At $0.02/GB round-trip, that is $3,240 per month.

The number in our table ($1,555) assumes the same 50% cross-AZ ratio but smaller average value sizes. Either way, the point stands: cross-AZ transfer for cache traffic routinely rivals or exceeds the instance cost of the cache itself. It shows up as a generic "Data Transfer" line item on the AWS bill, and nobody attributes it to ElastiCache because there is no per-service breakdown.
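The arithmetic in the paragraph above, spelled out (decimal units, matching the text's MB/TB figures):

```python
# Cross-AZ transfer cost at 50,000 ops/sec, from the worked example above
ops_per_sec = 50_000
bytes_per_op = 2_500           # ~2.5 KB: value + protocol overhead + response
cross_az_ratio = 0.5           # app fleet split across two AZs
price_per_gb = 0.02            # $0.01/GB each direction, charged both ways

cross_az_mb_per_sec = ops_per_sec * bytes_per_op * cross_az_ratio / 1e6   # 62.5 MB/s
gb_per_month = cross_az_mb_per_sec * 86_400 * 30 / 1_000                  # 162,000 GB
monthly_cost = gb_per_month * price_per_gb

print(f"{gb_per_month / 1_000:.0f} TB/month cross-AZ, ${monthly_cost:,.0f}/month")
# -> 162 TB/month cross-AZ, $3,240/month
```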

Over-Provisioning: Paying for Peak

ElastiCache clusters must be provisioned for peak throughput, not average throughput. If your average load is 50,000 ops/sec but your peak during a marketing campaign or product launch is 120,000 ops/sec, you provision for 120,000. That means 58% of your cluster capacity sits idle during normal operation. You pay for it 24 hours a day, 7 days a week, because ElastiCache bills by the hour regardless of utilization.

The over-provisioning penalty is structural. You cannot scale ElastiCache up and down in seconds like you can with application instances behind an auto-scaler. Scaling a Redis cluster requires adding shards, migrating slots, and waiting for replication to catch up. This takes minutes to hours, not seconds. So you provision for the worst case and eat the cost during normal operation.
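The two idle-capacity figures used in this post reconcile as follows (the launch example in the prose assumes 58% idle; the cost table assumes a more conservative 40%):

```python
# Idle capacity when provisioning for peak (launch example from the text)
avg_ops, peak_ops = 50_000, 120_000
idle_fraction = 1 - avg_ops / peak_ops          # ~58% idle during normal operation

# The cost table's line item uses a more conservative 40% idle assumption:
instance_cost = 1883                            # $/month for 3x r7g.xlarge
headroom_cost = round(instance_cost * 0.40)     # -> 753, the table's figure
```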

The L1 Optimization

The core of this playbook is a single architectural change: add an in-process L1 cache in front of ElastiCache. The L1 cache sits inside your application process. It holds hot keys in local memory. It answers reads in 31 nanoseconds with zero network, zero serialization, and zero cross-AZ transfer. ElastiCache becomes the L2 -- the fallback for cache misses that the L1 cannot serve.

How It Works

On a cache read, the application first checks the L1 in-process cache. If the key is found (L1 hit), the value is returned from local memory in 31 nanoseconds. No network call. No serialization. No cross-AZ transfer. If the key is not found (L1 miss), the request falls through to ElastiCache. The value is fetched from ElastiCache over the network (300 microseconds to 2 milliseconds), returned to the application, and simultaneously promoted into the L1 cache for subsequent reads. Future reads for the same key will hit L1 instead of ElastiCache.

Writes always go to ElastiCache first. ElastiCache remains the source of truth. The L1 cache is a read-through acceleration layer, not a write buffer. L1 entries have a TTL (typically 5-60 seconds) that bounds how stale a cached value can be. When the TTL expires, the next read for that key will miss L1, fetch from ElastiCache, and re-populate L1 with the current value.
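The read-through flow described above fits in a small wrapper. This is a minimal sketch of the pattern, not Cachee's implementation; `l2` stands in for any client with `get`/`set` (such as a Redis client):

```python
import time

class L1Cache:
    """In-process read-through cache in front of a network (L2) cache.

    Reads check local memory first; misses fall through to L2 and the
    value is promoted into L1. Writes go to L2 first (source of truth).
    A per-entry TTL bounds how stale an L1 value can be.
    """

    def __init__(self, l2, ttl=30.0):
        self.l2 = l2
        self.ttl = ttl
        self._store = {}               # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None:
            value, expires_at = entry
            if time.monotonic() < expires_at:
                return value           # L1 hit: no network, no serialization
            del self._store[key]       # expired: fall through to L2
        value = self.l2.get(key)       # L1 miss: fetch from ElastiCache
        if value is not None:          # promote for subsequent reads
            self._store[key] = (value, time.monotonic() + self.ttl)
        return value

    def set(self, key, value):
        self.l2.set(key, value)        # ElastiCache remains the source of truth
        self._store[key] = (value, time.monotonic() + self.ttl)
```

Note the deliberate staleness window: between a write by another process and the local TTL expiring, this process serves the old value from L1. That is the trade the playbook makes for hot reads.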

The Math: Before and After

With an L1 cache absorbing 90% of reads, the numbers change dramatically. Here is the same 3-node r7g.xlarge cluster after adding L1.

Cost Component         | Before (No L1)      | After (90% L1 Hit) | Savings
ElastiCache instances  | $1,883/mo (3 nodes) | $628/mo (1 node)   | $1,255
Cross-AZ data transfer | $1,555/mo           | $156/mo            | $1,399
Serialization CPU      | $390/mo             | $39/mo             | $351
Over-provisioning      | $753/mo             | $125/mo            | $628
Engineering time       | $500/mo             | $300/mo            | $200
Total                  | $5,081/mo           | $1,248/mo          | $3,833/mo

The total cost drops from $5,081 to $1,248 per month -- a 75% reduction. Annualized, that is $45,996 in savings. The ElastiCache cluster shrinks from 3 nodes to 1 because 90% less traffic means 90% less capacity needed. The cross-AZ transfer drops by 90% because 90% of reads never leave the application process. The serialization CPU drops by 90% because 90% of reads do not require serialization. The over-provisioning drops because a 1-node cluster at 10% of the original load has much more headroom relative to its capacity.
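The table's totals check out line by line:

```python
# Verify the before/after table above: component totals and headline numbers
before = {"instances": 1883, "cross_az": 1555, "serialization": 390,
          "over_provisioning": 753, "engineering": 500}
after = {"instances": 628, "cross_az": 156, "serialization": 39,
         "over_provisioning": 125, "engineering": 300}

total_before = sum(before.values())          # 5,081
total_after = sum(after.values())            # 1,248
reduction = 1 - total_after / total_before   # ~75%
annual_savings = (total_before - total_after) * 12

print(f"{reduction:.0%} reduction, ${annual_savings:,}/year saved")
# -> 75% reduction, $45,996/year saved
```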

75% total cost reduction | $45,996 annual savings | 31 ns L1 read latency

What to Keep in ElastiCache

ElastiCache is not the wrong tool. It is the wrong tool for hot reads. After adding L1, you keep ElastiCache for the workloads where a network-shared cache is genuinely necessary. These workloads represent 5-15% of your total cache operations but 100% of your need for shared, mutable, persistent cache state.

Pub/Sub and Event Distribution

If you use ElastiCache for pub/sub messaging, stream processing, or list-based job queues, that workload stays on ElastiCache. These are inherently multi-process operations that require a shared broker. An in-process cache cannot replace cross-process messaging. Keep your pub/sub channels on ElastiCache and let the L1 layer handle the read-heavy workload that dominates your operations count.

Shared Mutable State

Distributed locks, global rate limiters, atomic counters, and any value that must be immediately consistent across all application instances require a shared network cache. When process A increments a counter, process B must see the new value immediately. L1 caches are per-process and eventually consistent (bounded by TTL). They cannot provide the strong consistency required for shared mutable state. Keep these keys on ElastiCache with direct reads that bypass L1.

Persistence and Durability

If you rely on ElastiCache persistence (AOF or RDB snapshots) to survive instance failures without data loss, that durability requirement stays on ElastiCache. L1 caches are ephemeral. They live in process memory and are destroyed on process restart. For workloads where cache loss means database load spikes and cold-start latency, ElastiCache's persistence provides the safety net.

Replication and Failover

ElastiCache provides automatic failover with replica promotion. If your primary node fails, a replica is promoted within seconds. This is a critical availability feature for shared state. L1 caches do not replicate -- each process has its own cache, and if the process restarts, the L1 cache is rebuilt from ElastiCache. The two tiers complement each other: ElastiCache provides durability and availability, L1 provides speed and cost reduction.

Migration: Step by Step

The migration from ElastiCache-only to L1+ElastiCache requires no application code changes if you use Cachee as the L1 layer. Cachee speaks the RESP protocol (Redis Serialization Protocol), which means your existing Redis client library connects to Cachee exactly as it connects to ElastiCache. The only change is the connection endpoint.

Step 1: Install Cachee on Application Instances

# On each application instance
brew tap h33ai-postquantum/tap
brew install cachee

Or on Linux (the more common case for production servers):

# Download the binary
curl -sSL https://cachee.ai/install.sh | sh

# Or via the deb/rpm package
apt install cachee  # Debian/Ubuntu
yum install cachee  # RHEL/Amazon Linux

Step 2: Configure Cachee with ElastiCache as Upstream

# Initialize Cachee with your ElastiCache endpoint as L2
cachee init \
  --upstream redis://your-cluster.cache.amazonaws.com:6379 \
  --l1-memory 512mb \
  --l1-ttl 30s \
  --listen 127.0.0.1:6380

This tells Cachee to listen on localhost:6380 for RESP connections, maintain a 512 MB in-process L1 cache with a 30-second TTL, and fall through to your ElastiCache cluster on L1 misses. The --l1-memory parameter controls how much RAM the L1 cache can use. 512 MB is a good starting point for most applications. The --l1-ttl parameter controls the maximum staleness of L1 entries. 30 seconds works well for session data, user profiles, and feature flags. Reduce it for data that changes more frequently.

Step 3: Start Cachee

# Start the Cachee daemon
cachee start

# Verify it is running
cachee status

Step 4: Update Your Application Connection String

Change your Redis connection string from the ElastiCache endpoint to localhost:6380. This is the only application change required, and it is a configuration change, not a code change.

# Before
REDIS_URL=redis://your-cluster.cache.amazonaws.com:6379

# After
REDIS_URL=redis://127.0.0.1:6380

Deploy this change to one application instance first. Monitor it for 24 hours. Check the L1 hit rate with cachee status. If the hit rate is above 70%, roll out to the remaining instances. If it is below 70%, increase the L1 memory budget or increase the TTL.

Step 5: Downsize the ElastiCache Cluster

After all application instances are running through Cachee, monitor your ElastiCache cluster for one week. You should see operations per second drop by 80-95%, memory utilization drop significantly, and cross-AZ data transfer attributable to port 6379 traffic drop proportionally. Once you have a week of data confirming the reduced load, downsize the cluster. If you were running 3 r7g.xlarge nodes, you can likely move to a single r7g.large or even r7g.medium. If you were running a 6-node cluster, you can likely move to 2 nodes.

Do not skip the monitoring period. Downsize based on observed data, not projections. Some workloads have weekly patterns (higher traffic Monday through Friday, lower on weekends) that a few days of monitoring would miss.

Migration Safety Note

During migration, your application is reading through Cachee to ElastiCache. If Cachee stops for any reason, your application loses its cache connection. To protect against this, configure your Redis client with a failover endpoint list: primary = localhost:6380 (Cachee), secondary = your-cluster.cache.amazonaws.com:6379 (ElastiCache direct). If Cachee is unreachable, the client falls back to ElastiCache directly. Performance degrades to pre-migration levels but availability is preserved.

Advanced: Per-Key TTL Optimization

Not all keys should have the same TTL in L1. Session tokens that are valid for 24 hours can safely have a 60-second L1 TTL. Feature flags that change once per deployment can have a 300-second L1 TTL. Rate limit counters that change on every request should bypass L1 entirely and go straight to ElastiCache.

Cachee supports per-key TTL rules based on key prefix patterns. You define rules in the configuration that map key patterns to TTL values, and Cachee applies the appropriate TTL when promoting a value into L1.

# cachee.toml
[l1.ttl_rules]
"session:*"     = "60s"
"feature:*"     = "300s"
"user:profile:*" = "30s"
"rate:*"        = "0s"     # bypass L1, always hit ElastiCache
"default"       = "15s"

The rate:* rule with a TTL of 0 seconds tells Cachee to never cache keys matching that pattern in L1. Rate limit counters are shared mutable state that must be immediately consistent across all application instances. Caching them in L1 would produce incorrect rate limiting because each process would count independently. By routing rate limit keys directly to ElastiCache, you get the correct shared counting behavior while still caching everything else in L1.
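The matching behavior can be sketched with glob-style patterns. First-match-wins ordering and the pattern syntax are assumptions here, not a statement about Cachee's actual matcher:

```python
from fnmatch import fnmatchcase

# TTL rules mirroring cachee.toml above; 0 means "bypass L1 entirely"
TTL_RULES = [
    ("session:*", 60),
    ("feature:*", 300),
    ("user:profile:*", 30),
    ("rate:*", 0),
]
DEFAULT_TTL = 15

def l1_ttl(key):
    """Return the L1 TTL (seconds) for a key; first matching rule wins."""
    for pattern, ttl in TTL_RULES:
        if fnmatchcase(key, pattern):
            return ttl
    return DEFAULT_TTL
```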

Per-key TTL optimization typically increases the effective L1 hit rate by 5-10 percentage points compared to a single global TTL. The improvement comes from long-TTL keys (feature flags, static configuration) that would be evicted and re-fetched unnecessarily under a short global TTL, and from bypassing keys that should not be in L1 at all, which frees memory for keys that benefit from caching.

Advanced: CacheeLFU vs ElastiCache LRU

ElastiCache uses LRU (Least Recently Used) or a configurable approximation of LRU for eviction. LRU evicts the key that was accessed least recently, regardless of how frequently it was accessed before that. This works well for workloads with temporal locality -- if you accessed a key recently, you are likely to access it again soon.

CacheeLFU (the eviction policy used by Cachee's L1 tier) evicts the key that was accessed least frequently. This is a better fit for cache workloads where a small number of hot keys are accessed thousands of times per second while a long tail of cold keys are accessed once or twice. Under LRU, a burst of cold-key accesses can evict hot keys from the cache, causing a temporary hit rate drop. Under CacheeLFU, hot keys are protected by their high access frequency. A cold key accessed once cannot evict a hot key accessed ten thousand times, regardless of recency.
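The frequency-protection property is easy to see in a toy implementation. This is a minimal LFU sketch for intuition only, not CacheeLFU itself (production LFU implementations use approximate counters and aging):

```python
class LFUCache:
    """Minimal least-frequently-used eviction sketch.

    On insert into a full cache, the least frequently accessed key is
    evicted, so a cold key seen once can never displace a hot key.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}   # key -> value
        self.freq = {}    # key -> access count

    def get(self, key):
        if key in self.store:
            self.freq[key] += 1
            return self.store[key]
        return None

    def set(self, key, value):
        if key not in self.store and len(self.store) >= self.capacity:
            coldest = min(self.freq, key=self.freq.get)   # evict least frequent
            del self.store[coldest]
            del self.freq[coldest]
        self.store[key] = value
        self.freq[key] = self.freq.get(key, 1)
```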

The practical difference is most visible during traffic spikes. When a marketing campaign or product launch drives a surge of new users, the cache sees a burst of new keys (new session tokens, new user profiles). Under LRU, these new keys push existing hot keys out of the cache. The hit rate drops temporarily, ElastiCache load spikes, and latency increases at exactly the moment when you need performance the most. Under CacheeLFU, the new keys enter the cache and the least frequently accessed keys are evicted. Hot keys that were already being accessed thousands of times per second remain in cache. The hit rate stays stable.

In our production benchmarks, CacheeLFU maintains a 3-7% higher hit rate than LRU under bursty workloads. Under steady-state workloads, the difference is smaller (1-2%) but consistently favors CacheeLFU. The reason is straightforward: frequency is a better predictor of future access than recency for cache workloads, and CacheeLFU captures frequency while LRU captures only recency.

Monitoring the Transition

After deploying L1 caching, you need to monitor five metrics to confirm the optimization is working and to catch any issues before they affect users.

Metric 1: L1 Hit Rate

Target: 80-95%. If your L1 hit rate is below 70%, investigate why. Common causes include an L1 memory budget that is too small (the cache is full and evicting keys before they are re-accessed), a TTL that is too short (keys expire before they are re-accessed), or a workload with too many unique keys and insufficient key reuse. The fix is usually increasing the memory budget. If your application has 100,000 unique hot keys averaging 1 KB each, you need at least 100 MB of L1 memory plus overhead.
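A rough sizing calculation for the example above (the 1.3x overhead factor is an assumption for per-entry bookkeeping, not a measured Cachee figure):

```python
# Rough L1 memory budget for 100,000 hot keys averaging 1 KB each
unique_hot_keys = 100_000
avg_value_bytes = 1024
overhead_factor = 1.3            # assumed per-entry metadata overhead

budget_mb = unique_hot_keys * avg_value_bytes * overhead_factor / 1024**2
print(f"Minimum L1 budget: ~{budget_mb:.0f} MB")
# -> Minimum L1 budget: ~127 MB
```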

Metric 2: ElastiCache Operations Per Second

This should drop proportionally to the L1 hit rate. If L1 is hitting at 90%, ElastiCache ops/sec should drop by approximately 90% compared to pre-migration levels. If the drop is smaller than expected, check whether write operations (which bypass L1) are a larger fraction of your total operations than you estimated. Write-heavy workloads benefit less from L1 read caching.

Metric 3: Application P99 Latency

P99 latency should improve because 80-95% of cache reads now complete in 31 nanoseconds instead of 300 microseconds to 2 milliseconds. The improvement is most visible in request paths that make multiple cache reads -- a request that previously made 5 cache reads at 500 microseconds each (2.5 milliseconds total) now makes 4-5 L1 hits at 31 nanoseconds each plus 0-1 L2 misses at 500 microseconds. The cache contribution to request latency drops from 2.5 milliseconds to approximately 500 microseconds in the worst case (one L2 miss) or 155 nanoseconds in the best case (all L1 hits).
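The latency arithmetic from the paragraph above, made explicit:

```python
# Cache contribution to a request that makes 5 cache reads (from the text)
reads = 5
l1_ns = 31                 # in-process L1 hit
l2_ns = 500_000            # 500 microsecond network round-trip

no_l1 = reads * l2_ns                       # before: every read crosses the network
all_l1_hits = reads * l1_ns                 # best case after L1
one_l2_miss = (reads - 1) * l1_ns + l2_ns   # worst case after L1

print(f"before: {no_l1 / 1e6:.1f} ms, best: {all_l1_hits} ns, "
      f"worst: {one_l2_miss / 1e3:.1f} us")
# -> before: 2.5 ms, best: 155 ns, worst: 500.1 us
```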

Metric 4: Cross-AZ Data Transfer

Monitor the "Data Transfer" line item in AWS Cost Explorer. Filter by VPC and look at cross-AZ transfer volumes. The total should drop proportionally to the L1 hit rate. This is the largest cost savings component, and it is the one that takes the longest to appear on the bill (AWS billing has a 24-48 hour delay for data transfer). Wait at least one billing cycle before calculating the actual savings.

Metric 5: ElastiCache CPU and Memory Utilization

After L1 deployment, ElastiCache CPU utilization should drop significantly. Memory utilization may stay similar (ElastiCache still holds all keys), but the CPU load from serving requests drops by 80-95%. This is the signal that you can safely downsize. When ElastiCache CPU utilization is consistently below 20%, you are over-provisioned and can move to a smaller instance type.

# Monitor all five metrics from the command line
cachee status --detailed

# Output:
# L1 Cache Status
#   Hit rate:        91.7%
#   Entries:         87,293
#   Memory:          412 MB / 512 MB
#   Eviction policy: CacheeLFU
#
# Upstream (ElastiCache) Status
#   Fallback rate:   8.3%
#   Ops/sec:         4,150 (was ~50,000)
#   Avg latency:     0.34ms
#
# Savings Estimate
#   Ops redirected:  45,850/sec to L1
#   Transfer saved:  ~112 MB/sec cross-AZ
#   CPU saved:       ~0.23 vCPUs (serialization)

Scaling the Savings

The savings described in this playbook scale with your ElastiCache spend. Here is what the numbers look like at three different starting points.

Starting ElastiCache Spend  | Real Cost (with hidden) | After L1 (90% hit) | Annual Savings
$500/mo (1 node, small)     | $1,350/mo               | $410/mo            | $11,280
$1,900/mo (3 nodes, mid)    | $5,081/mo               | $1,248/mo          | $45,996
$11,400/mo (6 nodes, large) | $31,200/mo              | $7,800/mo          | $280,800

At every scale, the payback period for adding an L1 tier is measured in days, not months. The total cost of implementing this playbook is the engineering time for the initial setup (2-4 hours), a monitoring period (1 week), and the cluster resize (1 hour with proper change management). Against annual savings of $11,000 to $280,000, the ROI is immediate.

The savings also compound over time. As your traffic grows, the L1 layer absorbs the growth at zero marginal cost (31-nanosecond reads from existing application memory). Without L1, traffic growth drives proportional increases in ElastiCache instance count, cross-AZ transfer, and serialization CPU. With L1, traffic growth drives proportional increases in L1 hit rate (more repeated keys) and only marginal increases in ElastiCache load (the cold tail grows slowly).

The Bottom Line

Your ElastiCache bill understates your ElastiCache costs by 2-3x. The hidden costs -- cross-AZ data transfer, serialization CPU, over-provisioning headroom, and engineering time -- often exceed the instance cost. Add an in-process L1 cache tier, absorb 90% of reads at 31 nanoseconds, downsize your cluster to match the reduced load, and keep ElastiCache for shared state, pub/sub, and persistence. The result is a 75% reduction in total cache costs with zero application code changes and improved P99 latency as a side effect.

Cut your ElastiCache bill by 75%. Add L1, downsize your cluster, keep what works.
