
Kubernetes Cache Sidecar: Complete Pattern

April 27, 2026 | 14 min read | Engineering

The default Kubernetes caching architecture looks like this: every pod connects to a centralized Redis cluster over the pod network. The application container sends a GET request. The request traverses the pod's virtual ethernet interface, the CNI network fabric, the Redis Service, the Redis pod's network stack, and the Redis process. The response makes the same journey back. Total round-trip: 300-500 microseconds on a well-configured cluster, 1-3 milliseconds on a congested one. Multiply by the number of cache operations per request (typically 5-20), and caching overhead adds 1.5-60 milliseconds of latency to every request.

The sidecar pattern eliminates this network path entirely. Instead of connecting to a centralized Redis cluster, each pod runs a Cachee sidecar container. The application container communicates with the sidecar via localhost or a shared Unix socket. Hot reads complete in 31 nanoseconds. There is no pod network traversal, no CNI overhead, no Service routing, no TCP handshake. The data is in the same pod, on the same node, accessed via loopback or inter-process communication.

This post walks through the complete sidecar caching pattern: the deployment manifest, the configuration, memory management, health checks, and the architectural decisions that determine when to use a sidecar versus a DaemonSet versus a centralized cluster.

Key numbers: 31 ns for a sidecar cache read vs. 500 us for Redis via a K8s Service -- a 16,129x latency reduction.

The Kubernetes Caching Problem

Kubernetes introduces specific caching challenges that do not exist in traditional deployments. In a traditional deployment, you have a fixed number of application servers connecting to a fixed Redis cluster. The network path is predictable. Connection pools are sized once. Latency is stable.

In Kubernetes, everything is dynamic. Pods scale up and down. Nodes are added and removed. The network fabric is virtualized. DNS resolution adds latency. Service routing adds latency. Every cache operation pays the tax of Kubernetes networking, which was designed for flexibility and isolation, not for sub-millisecond latency.

The Connection Explosion

Each application pod maintains a connection pool to Redis. A typical pool size is 10-20 connections. With 50 pods, that is 500-1000 TCP connections to your Redis cluster. With 200 pods during a traffic spike, it is 2000-4000 connections. Redis handles connections on a single thread. Each connection consumes memory for buffers and state. At 4000 connections, Redis is spending a meaningful fraction of its CPU time on connection management rather than serving commands.

The sidecar pattern eliminates this connection explosion. Each pod's cache is local. There are no cross-pod connections for cache reads. If you still need a shared cache for cross-pod consistency, the sidecar handles L1 caching locally and only falls through to the shared cache on misses -- reducing the shared cache connection count by 70-90%.

The Network Variance Problem

Kubernetes network latency is not constant. It varies based on pod placement (same node vs. cross-node), network plugin (Calico vs. Cilium vs. Flannel), traffic load, and network policies. A Redis GET that takes 200 microseconds on a quiet cluster can take 2 milliseconds during a traffic spike because the CNI is processing more packets. This variance makes it impossible to give tight latency guarantees for cache-dependent operations.

Sidecar cache reads are immune to cluster network conditions. The read stays within the pod. The loopback interface is not affected by CNI congestion, cross-node traffic, or network policy processing. The 31-nanosecond read time is consistent regardless of what else is happening on the cluster.

The Deployment Manifest

A Cachee sidecar is deployed as a second container in the pod spec. It shares the pod's network namespace (for localhost communication) and can optionally share a volume (for Unix socket communication). Here is the complete deployment manifest.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-application
  labels:
    app: my-application
spec:
  replicas: 10
  selector:
    matchLabels:
      app: my-application
  template:
    metadata:
      labels:
        app: my-application
    spec:
      containers:
      # Application container
      - name: app
        image: my-app:latest
        ports:
        - containerPort: 8080
        env:
        - name: CACHE_HOST
          value: "localhost"
        - name: CACHE_PORT
          value: "6380"
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"

      # Cachee sidecar container
      - name: cachee
        image: cachee/cachee:latest
        ports:
        - containerPort: 6380
        args:
        - "--port=6380"
        - "--max-memory=256mb"
        - "--eviction=cachee-lfu"
        - "--resp-compat=true"
        resources:
          requests:
            memory: "256Mi"
            cpu: "100m"
          limits:
            memory: "384Mi"
            cpu: "200m"
        readinessProbe:
          tcpSocket:
            port: 6380
          initialDelaySeconds: 2
          periodSeconds: 5
        livenessProbe:
          tcpSocket:
            port: 6380
          initialDelaySeconds: 5
          periodSeconds: 10

The sidecar listens on port 6380 (not 6379, to avoid confusion with external Redis). The application connects to localhost:6380. Because both containers share the pod's network namespace, localhost communication avoids the pod network entirely. The RESP compatibility flag enables the sidecar to speak the Redis protocol, so existing Redis client libraries work without modification.

Memory Configuration

The sidecar's memory limit is the cache capacity. In the manifest above, the sidecar is limited to 384Mi (with 256Mi guaranteed). The --max-memory=256mb flag tells Cachee to use at most 256MB for cached data. The remaining 128MB covers the sidecar process overhead (binary, stack, internal data structures).

How much memory should the sidecar get? That depends on the working set size of your application. If your application accesses 10,000 unique cache keys with an average value size of 5KB, the working set is approximately 50MB. A 256MB cache comfortably holds the entire working set with room for growth. If your working set is 2GB, a 256MB sidecar can only hold 12.5% of the data, and the hit rate will suffer.

The CacheeLFU eviction policy helps maximize hit rate under memory pressure. Frequency-based eviction keeps the most accessed keys in cache and evicts rarely-used keys. For most applications, a sidecar sized at 2-3x the hot working set (the subset of keys that account for 80% of reads) achieves a 90%+ hit rate.

| Working Set Size | Recommended Sidecar Memory | Expected Hit Rate |
|------------------|----------------------------|-------------------|
| 50 MB            | 128 MB                     | 95%+              |
| 200 MB           | 512 MB                     | 92%+              |
| 1 GB             | 2 GB                       | 90%+              |
| 5 GB             | 2 GB (with L2 fallthrough) | 70-80%            |
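
The sizing arithmetic above can be captured in a small helper. The 2-3x multiplier and the 10,000-keys-at-5KB example come from the text; the helper itself is an illustrative sketch, not part of Cachee.

```python
# Sizing sketch: estimate sidecar max-memory from the hot working set.
# The 2-3x multiplier over the hot working set follows the guidance
# above; the function name and defaults are hypothetical.

def recommend_sidecar_memory_mb(num_hot_keys, avg_value_bytes, multiplier=2.5):
    """Hot working set size in MB, scaled by the recommended multiplier."""
    working_set_mb = num_hot_keys * avg_value_bytes / (1024 * 1024)
    return working_set_mb * multiplier

# The example from the text: 10,000 keys x 5KB is roughly a 50 MB
# working set, so a cache in the low hundreds of MB is comfortable.
print(recommend_sidecar_memory_mb(10_000, 5 * 1024))  # ~122 MB
```

Round the result up to a memory size your platform allocates cleanly (128 MB, 256 MB), then apply the 1.5x container-limit headroom discussed later.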

Communication Patterns

The application container can communicate with the sidecar in three ways, each with different performance characteristics.

Option 1: Localhost TCP (RESP Protocol)

The application connects to the sidecar on localhost:6380 using the RESP protocol. Any Redis client library works. This is the simplest option and the one most teams should start with. The localhost TCP round-trip adds approximately 10-30 microseconds of overhead compared to a direct in-process lookup, but it eliminates the 300-500 microseconds of pod network traversal to a centralized Redis. Net improvement: 10-50x faster than centralized Redis.

# Python - standard Redis client, pointed at sidecar
import json
import redis

cache = redis.Redis(host='localhost', port=6380)
cache.set('user:42', json.dumps(user_data), ex=300)
user = json.loads(cache.get('user:42'))

Option 2: Unix Socket

For lower latency, the application can communicate via a Unix socket on a shared emptyDir volume. Unix socket communication avoids the TCP stack entirely and reduces round-trip overhead to approximately 2-5 microseconds. This requires mounting a shared volume in both containers.

# Add to pod spec:
volumes:
- name: cache-socket
  emptyDir: {}

# Add to both containers:
volumeMounts:
- name: cache-socket
  mountPath: /var/run/cachee

# Cachee args:
args:
- "--socket=/var/run/cachee/cachee.sock"

# Application connection:
cache = redis.Redis(unix_socket_path='/var/run/cachee/cachee.sock')

Option 3: Shared Memory

For the absolute lowest latency, the sidecar can expose a shared memory segment that the application reads directly. This eliminates all IPC overhead and achieves the 31-nanosecond read time. The application accesses the cache via a memory-mapped file on a shared tmpfs volume. This requires the Cachee client library (not a standard Redis client) and shared memory support in the application runtime.

# Add to pod spec:
volumes:
- name: cache-shm
  emptyDir:
    medium: Memory
    sizeLimit: 256Mi

# Both containers mount the shared memory volume
volumeMounts:
- name: cache-shm
  mountPath: /dev/shm/cachee

# Application uses Cachee client library
from cachee import SharedMemoryCache
cache = SharedMemoryCache('/dev/shm/cachee')
user = cache.get('user:42')  # 31ns

| Communication Method | Latency  | Client Compatibility              | Complexity |
|----------------------|----------|-----------------------------------|------------|
| Localhost TCP (RESP) | 10-30 us | Any Redis client                  | Low        |
| Unix socket          | 2-5 us   | Redis clients with socket support | Medium     |
| Shared memory        | 31 ns    | Cachee client library             | High       |

Health Checks and Observability

The sidecar needs proper health checks so Kubernetes can manage its lifecycle correctly. A sidecar that crashes or becomes unresponsive should not cause the pod to be killed -- the application should fall through to the centralized cache or the database. But a sidecar that is consistently unhealthy indicates a deployment problem that needs attention.

Readiness Probe

The readiness probe determines when the sidecar is ready to accept traffic. A TCP socket check on the cache port is sufficient. The sidecar is ready when it can accept connections. The initial delay should be 2-3 seconds to allow the sidecar to start and allocate its memory. The period should be 5 seconds -- frequent enough to detect problems quickly, infrequent enough to avoid probe overhead.

Liveness Probe

The liveness probe determines whether the sidecar is still functioning. A TCP socket check works for basic health. For deeper health checking, use an HTTP endpoint that verifies the cache can read and write.

livenessProbe:
  httpGet:
    path: /health
    port: 6381
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3

# Cachee /health response:
# {
#   "status": "ok",
#   "entries": 847293,
#   "memory_used": "231 MB",
#   "memory_limit": "256 MB",
#   "hit_rate": 0.913,
#   "uptime_seconds": 86400
# }

Metrics

The sidecar should expose Prometheus metrics for monitoring. Key metrics include cache hit rate (the most important metric -- if it drops below 80%, the sidecar is not providing sufficient value), memory utilization (approaching the limit means evictions are increasing), eviction rate (high eviction rate means the cache is too small for the working set), and latency percentiles (P50, P95, P99 for cache operations).

# Prometheus metrics endpoint (port 9090)
cachee_hits_total{pod="my-app-xyz"} 1847293
cachee_misses_total{pod="my-app-xyz"} 172107
cachee_hit_rate{pod="my-app-xyz"} 0.914
cachee_memory_bytes{pod="my-app-xyz"} 242221056
cachee_evictions_total{pod="my-app-xyz"} 12847
cachee_latency_seconds{pod="my-app-xyz",quantile="0.5"} 3.1e-08
cachee_latency_seconds{pod="my-app-xyz",quantile="0.95"} 4.2e-08
cachee_latency_seconds{pod="my-app-xyz",quantile="0.99"} 8.7e-08
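
The alerting thresholds suggested in this post (hit rate below 80%, memory above 90% of max-memory) can be expressed as Prometheus rules. The metric names match the sample output above; the rule names and the hard-coded 256 MB max-memory are illustrative and should match your own configuration.

```yaml
# Hypothetical Prometheus alerting rules for the sidecar metrics above.
groups:
- name: cachee-sidecar
  rules:
  - alert: CacheeSidecarLowHitRate
    expr: cachee_hit_rate < 0.80
    for: 10m
    annotations:
      summary: "Sidecar hit rate below 80% -- cache too small or access pattern changed"
  - alert: CacheeSidecarMemoryPressure
    # 256 MB max-memory from the example deployment; adjust to your config.
    expr: cachee_memory_bytes / (256 * 1024 * 1024) > 0.90
    for: 10m
    annotations:
      summary: "Sidecar memory above 90% of max-memory -- eviction pressure rising"
```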

Sidecar vs. DaemonSet vs. Centralized

The sidecar is not the only way to deploy a cache in Kubernetes. There are three deployment patterns, each suited for different use cases.

Sidecar (One Cache Per Pod)

Each pod gets its own cache instance. Cache data is local to the pod and not shared across pods. This is ideal for read-heavy workloads where each pod accesses a predictable subset of data. The advantages are maximum isolation (one pod's cache does not affect another's), predictable performance (no shared resource contention), and the lowest possible latency (localhost or shared memory). The disadvantages are memory duplication (if 10 pods cache the same key, it is stored 10 times) and cold start on scaling events (new pods start with empty caches).

DaemonSet (One Cache Per Node)

Each node gets one cache instance, shared by all pods on that node. Pods communicate with the node-local cache via the node's IP or a hostPath socket. This reduces memory duplication compared to sidecars (each key is cached once per node instead of once per pod) but introduces contention: multiple pods share the same cache process. The DaemonSet pattern is suitable when memory is constrained and the working set is large. It is not suitable when cache isolation between pods is important.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cachee-node-cache
spec:
  selector:
    matchLabels:
      app: cachee-node
  template:
    spec:
      containers:
      - name: cachee
        image: cachee/cachee:latest
        args:
        - "--port=6380"
        - "--max-memory=2gb"
        - "--eviction=cachee-lfu"
        ports:
        - containerPort: 6380
          hostPort: 6380
        resources:
          requests:
            memory: "2Gi"
            cpu: "500m"

Centralized (Redis/Valkey Cluster)

A shared cache cluster that all pods connect to over the network. This is the traditional approach and it remains the right choice for data that must be shared across pods in real-time -- session state, distributed locks, pub/sub messaging. The disadvantage is network latency: every operation crosses the pod network. The advantage is data consistency: all pods see the same cache state.

When to Use Which

| Pattern     | Latency        | Memory Efficiency    | Data Sharing | Best For                            |
|-------------|----------------|----------------------|--------------|-------------------------------------|
| Sidecar     | 31 ns - 30 us  | Low (duplicated)     | None         | Read-heavy, predictable working set |
| DaemonSet   | 50 us - 100 us | Medium               | Per-node     | Large working sets, memory-constrained |
| Centralized | 300 us - 3 ms  | High (no duplication)| Cluster-wide | Shared state, sessions, pub/sub     |

The L1/L2 Pattern in Kubernetes

The most effective Kubernetes caching architecture combines a sidecar (L1) with a centralized cluster (L2). Hot reads hit the sidecar at 31 nanoseconds. Sidecar misses fall through to the centralized cluster at 300-500 microseconds. Cluster misses fall through to the database at 5-15 milliseconds.

def get(key):
    # L1: Sidecar cache (31ns via shared memory)
    if value := sidecar.get(key):
        return value

    # L2: Centralized Redis/Valkey (300-500us via pod network)
    if value := redis.get(key):
        sidecar.set(key, value, ttl=60)  # Promote to L1
        return value

    # L3: Database (5-15ms)
    value = database.query(key)
    redis.set(key, value, ttl=3600)      # Populate L2
    sidecar.set(key, value, ttl=60)      # Populate L1
    return value

This architecture gives you the latency of in-process caching for hot data and the consistency of centralized caching for shared state. The sidecar handles 80-90% of reads at sub-microsecond latency. The centralized cluster handles 8-15% of reads at sub-millisecond latency. The database handles 2-5% of reads at millisecond latency.
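
These tier shares imply an expected read latency you can compute directly. A quick sketch, using midpoints of the figures quoted in this post (illustrative shares, not measurements):

```python
# Expected read latency across the three tiers, using midpoints of the
# figures quoted in this post (illustrative, not measured values).
tiers = [
    (0.85, 31e-9),    # L1 sidecar: ~85% of reads at 31 ns
    (0.115, 400e-6),  # L2 centralized: ~11.5% at ~400 us
    (0.035, 10e-3),   # L3 database: ~3.5% at ~10 ms
]
expected = sum(share * latency for share, latency in tiers)
print(f"expected read latency: {expected * 1e6:.0f} us")  # ~396 us
```

Note that the database tier dominates the average even at a 3.5% share, which is why a warm L2 that absorbs database misses matters more to mean latency than shaving nanoseconds off L1.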

The key insight is that the sidecar TTL should be shorter than the centralized cache TTL. In the example above, the sidecar TTL is 60 seconds and the centralized TTL is 3600 seconds (1 hour). This ensures that stale data in the sidecar is refreshed from the centralized cache (which has fresher data) rather than from the database. The centralized cache acts as a "warm" tier that absorbs the cost of database misses and distributes the result to all sidecars that request it.

Handling Pod Scaling

When Kubernetes scales your deployment up, new pods start with empty sidecar caches. Every read is a cache miss until the sidecar warms up. This "cold start" problem can cause a temporary spike in latency and database load as new pods populate their caches.

Three strategies mitigate cold start:

  1. L1/L2 fallthrough. The layered pattern naturally handles it: cold sidecar misses hit the centralized cache (which is warm), not the database. The cost is 300-500 microseconds per miss instead of 5-15 milliseconds.
  2. Cache pre-warming. When a pod starts, the sidecar proactively loads the most frequently accessed keys from the centralized cache. A pre-warm of the top 1,000 keys takes approximately 300 milliseconds and eliminates the majority of cold-start misses.
  3. Gradual traffic shifting. Use a readiness gate that delays routing traffic to the new pod until the sidecar has warmed up. The pod reports "not ready" until its cache hit rate exceeds a threshold (e.g., 70%).

# Pre-warm configuration
cachee start \
  --prewarm-source redis://redis-cluster:6379 \
  --prewarm-keys 1000 \
  --prewarm-timeout 500ms
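
The readiness-gate approach can be sketched as a small script run by an exec readiness probe. It assumes the /health endpoint shown earlier; the port (6381), URL, and 70% threshold are illustrative.

```python
# Readiness-gate sketch: report ready only once the sidecar's hit rate
# clears a threshold. Assumes the /health JSON payload shown earlier;
# the port and threshold are illustrative, not Cachee defaults.
import json
import urllib.request

def is_warm(health_json, threshold=0.70):
    """Decide readiness from the sidecar's /health JSON payload."""
    return json.loads(health_json).get("hit_rate", 0.0) >= threshold

def readiness_exit_code(health_url="http://localhost:6381/health"):
    """0 when warm, 1 otherwise -- suitable for an exec readiness probe."""
    try:
        with urllib.request.urlopen(health_url, timeout=1) as resp:
            return 0 if is_warm(resp.read()) else 1
    except OSError:
        return 1  # sidecar not reachable yet
```

Wire `readiness_exit_code` into an exec probe (exit code 0 = ready) on the application container, not the sidecar, so Kubernetes withholds traffic until the cache is warm without restarting anything.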

When Kubernetes scales your deployment down, terminating pods lose their sidecar caches. This is fine -- the data is ephemeral and can be reconstructed from the centralized cache or the database. The sidecar does not need graceful shutdown logic for cache data. It only needs to drain in-flight requests before terminating, which is handled by the standard Kubernetes termination grace period.

Resource Budgeting

The sidecar consumes resources that would otherwise be available to the application container. A sidecar with 256MB memory and 200m CPU reduces the resources available on the node for application workloads. On a node with 32GB of memory and 16 pods, sidecar memory totals 4GB -- 12.5% of node capacity dedicated to caching.

This is almost always a good trade-off. The sidecar eliminates network round-trips to a centralized cache, which means the application container spends less CPU time waiting for cache responses. A pod without a sidecar might use 800m CPU (500m for application logic + 300m for network I/O waiting). The same pod with a sidecar might use 600m CPU (500m for application logic + 100m for sidecar overhead) because the sidecar eliminates network wait time. The sidecar costs 200m CPU but saves 200m on the application container -- a net-neutral CPU impact with a massive latency improvement.

Memory is the real cost. The sidecar's memory is dedicated to caching and is not available for application heap, OS page cache, or other uses. Size the sidecar to the hot working set, not the total data set. A sidecar that caches 90% of reads with 256MB is better than a sidecar that caches 95% of reads with 2GB, because the marginal 5% improvement does not justify 8x the memory cost.
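
A quick marginal-value check makes the sizing trade-off concrete. The latency figures below are the illustrative numbers used throughout this post, and the two hit rates are the 256 MB vs. 2 GB scenarios from the paragraph above:

```python
# Average read latency at two L1 hit rates: 90% (256 MB sidecar) vs
# 95% (2 GB). Misses fall through to L2 at ~400 us; figures are the
# illustrative values used in this post.
HIT_NS, MISS_NS = 31, 400_000  # sidecar hit vs L2 fallthrough, in ns

def expected_ns(hit_rate):
    """Average read latency in nanoseconds at a given L1 hit rate."""
    return hit_rate * HIT_NS + (1 - hit_rate) * MISS_NS

small = expected_ns(0.90)  # ~40 us average with the 256 MB sidecar
large = expected_ns(0.95)  # ~20 us average with the 2 GB sidecar
```

Both configurations average well under the 500-microsecond centralized baseline; the 2 GB sidecar saves roughly 20 microseconds per read, which you can weigh against the 8x memory cost.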

Memory Limits Are Hard Limits

Set the sidecar's Kubernetes memory limit to at least 1.5x the configured cache max-memory. If the cache is configured for 256MB, set the container limit to 384MB. The difference covers process overhead, connection buffers, and temporary allocations during eviction. If the container exceeds its memory limit, Kubernetes OOM-kills it, which cold-starts the cache and causes a temporary latency spike.

Production Checklist

Before deploying the sidecar pattern to production, verify the following items. Each one addresses a failure mode we have observed in real deployments.

  1. Memory limits are set correctly. The container memory limit is at least 1.5x the cache max-memory. The cache max-memory is sized to the hot working set, not the total data set.
  2. Readiness and liveness probes are configured. The readiness probe prevents traffic from reaching the pod before the sidecar is ready. The liveness probe restarts the sidecar if it becomes unresponsive.
  3. The application handles sidecar failures gracefully. If the sidecar is unavailable, the application falls through to the centralized cache or the database. A sidecar failure should never cause an application error. It should cause a latency increase.
  4. Metrics are being collected. Cache hit rate, memory utilization, and eviction rate are visible in your monitoring system. Set alerts for hit rate dropping below 80% (cache is too small or access patterns changed) and memory utilization exceeding 90% (approaching eviction pressure).
  5. TTLs are set appropriately. Sidecar TTLs should be shorter than centralized cache TTLs. Centralized cache TTLs should be shorter than the data's actual change frequency.
  6. Pod disruption budgets account for cold starts. If you are using pre-warming, set the pod disruption budget to allow only one pod to restart at a time. This prevents a rolling restart from cold-starting all sidecars simultaneously, which would flood the centralized cache with miss traffic.
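
Checklist item 3 (graceful degradation when the sidecar fails) can be sketched as a thin wrapper around the read path. The client objects here are illustrative stand-ins; a real redis-py client raises redis.exceptions.ConnectionError rather than the builtin ConnectionError caught below.

```python
# Graceful-degradation sketch: a sidecar failure becomes a latency
# increase, never an application error. Client objects and TTLs are
# illustrative stand-ins, not a specific library's API.
import logging

log = logging.getLogger("cache")

def get_with_fallback(key, sidecar, central, database, l1_ttl=60):
    # L1: local sidecar -- swallow errors; a dead sidecar is not fatal.
    try:
        if (value := sidecar.get(key)) is not None:
            return value
    except ConnectionError:
        log.warning("sidecar unavailable, falling through to L2")

    # L2: centralized cache -- also non-fatal on failure.
    try:
        if (value := central.get(key)) is not None:
            try:
                sidecar.set(key, value, ex=l1_ttl)  # repopulate L1
            except ConnectionError:
                pass
            return value
    except ConnectionError:
        log.warning("central cache unavailable, falling through to DB")

    # L3: database is the source of truth; errors here do propagate.
    return database.query(key)
```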

The Bottom Line

The Kubernetes sidecar caching pattern puts a Cachee instance in every pod. Hot reads complete in 31 nanoseconds via shared memory or 10-30 microseconds via localhost TCP -- compared to 300-500 microseconds for a centralized Redis cluster over the pod network. Combined with a centralized L2 for shared state, the sidecar handles 80-90% of cache reads at sub-microsecond latency. The cost is 256MB of memory per pod. The return is a 16,000x latency reduction on cache hits.

31ns cache reads in every Kubernetes pod. Deploy Cachee as a sidecar.
