Caching Large Files: Images, Video, Blobs, and Why Redis Falls Apart Above 1KB
Your cache handles small keys well. Session tokens (64 bytes), feature flags (a few bytes), rate limit counters (8 bytes). Your Redis or Memcached cluster serves these in 0.3-0.5 milliseconds over the network. That latency is acceptable because the values are tiny. Serialization is instant. TCP transfer is one packet.
Then you need to cache something larger. A product image thumbnail (15 KB). A PDF preview (200 KB). An ML model embedding (4 KB). A post-quantum cryptographic signature (17-49 KB). A rendered HTML fragment (50 KB). A video poster frame (80 KB).
Your cache latency doubles. Then triples. Then your P99 goes from 0.5ms to 8ms and your on-call gets paged.
This article explains why, what happens at each payload size tier, and how to architect a cache layer that handles large values without degrading latency.
Why Latency Scales with Value Size in Network Caches
Every GET from Redis, Memcached, or ElastiCache follows the same path: your application serializes the key, opens a TCP connection (or reuses one from a pool), sends the request, waits for the response, and deserializes the value. For a 64-byte session token, the serialization and transfer time is negligible -- the dominant cost is the network round-trip itself (~0.3ms within the same AZ).
For a 50 KB value, the math changes. Serialization time scales linearly with payload size. TCP transfer time scales linearly. Deserialization scales linearly. The network round-trip is the same, but everything around it grows proportionally.
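This scaling can be made concrete with a toy model: a fixed round-trip plus a term proportional to value size. The RTT and effective-throughput constants below are illustrative assumptions for this sketch, not benchmark numbers.

```python
# Toy latency model for a networked cache GET (illustrative, not a benchmark).
# Assumed parameters: 0.3 ms same-AZ round trip, and serialize + TCP transfer
# + deserialize folded into one effective throughput of ~70 MB/s.

RTT_MS = 0.3                # fixed network round trip (assumption)
EFFECTIVE_MB_PER_MS = 0.07  # combined size-proportional cost (assumption)

def networked_get_ms(value_bytes: int) -> float:
    """Fixed round trip plus a term that grows linearly with value size."""
    size_mb = value_bytes / 1_000_000
    return RTT_MS + size_mb / EFFECTIVE_MB_PER_MS

for size in (64, 4_000, 50_000, 200_000, 1_000_000):
    print(f"{size:>9} B -> {networked_get_ms(size):.2f} ms")
```

The fixed term dominates for tiny values; the linear term dominates past a few tens of kilobytes, which is exactly the crossover the table below shows.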
| Value Size | Redis GET (same AZ) | Redis GET (cross-AZ) | Cachee L0 (in-process) |
|---|---|---|---|
| 64 B (session token) | 0.3ms | 0.8ms | 31ns |
| 1 KB (JWT) | 0.35ms | 0.9ms | 31ns |
| 4 KB (ML embedding) | 0.5ms | 1.2ms | 31ns |
| 15 KB (image thumbnail) | 0.8ms | 2.1ms | 31ns |
| 17 KB (SLH-DSA-128f sig) | 0.9ms | 2.3ms | 31ns |
| 49 KB (SLH-DSA-256f sig) | 1.4ms | 3.6ms | 31ns |
| 50 KB (HTML fragment) | 1.5ms | 3.8ms | 31ns |
| 200 KB (PDF preview) | 3.2ms | 8.5ms | 31ns |
| 1 MB (video poster) | 8ms | 22ms | 31ns |
The Cachee column is constant. 31 nanoseconds regardless of value size. An in-process cache does not serialize, does not transfer over TCP, does not deserialize. The value is a pointer dereference in the same address space. Whether the value is 64 bytes or 1 megabyte, the access cost is one hash lookup and one memory load.
The Five Tiers of Large-Value Caching
Tier 1: 1-10 KB (JWTs, API responses, small JSON)
Most applications hit this tier without realizing it. A JWT with PQ claims (ML-DSA-65 signature at 3,309 bytes) lands here. A typical API response with a few nested objects lands here. A GraphQL query result with 20 fields lands here.
At this size, Redis is still functional but no longer free. Every GET adds 0.35-0.5ms. In a request that hits the cache 3-4 times (auth token + user profile + feature flags + rate limit), the accumulated cache latency is 1.5-2ms. That is 10-20% of your total request budget on a fast API.
Recommendation: Move hot-path lookups (auth, session, rate limit) to in-process L1. Keep warm-path data (user preferences, settings) in Redis L2. The latency difference at 4 KB is 0.5ms vs 31ns -- a 16,000x factor.
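The L1/L2 split above can be sketched as a read-through cache: an in-process dict for hot keys, with misses falling back to a slower backing store. The class name and promote-on-miss policy here are illustrative, and a plain dict stands in for a real Redis client.

```python
# Minimal read-through sketch: hot keys live in an in-process dict (L1);
# misses fall back to a slower backing store and are promoted on return.

class TieredCache:
    def __init__(self, backing_store):
        self.l1 = {}                  # in-process: no network, no serialization
        self.backing = backing_store  # e.g. a Redis client in production

    def get(self, key):
        if key in self.l1:
            return self.l1[key]       # hot path: one hash lookup
        value = self.backing.get(key) # warm path: network round trip
        if value is not None:
            self.l1[key] = value      # promote so the next read is local
        return value

redis_standin = {"user:42:prefs": b'{"theme": "dark"}'}
cache = TieredCache(redis_standin)
cache.get("user:42:prefs")            # miss L1, fetch from backing, promote
assert "user:42:prefs" in cache.l1    # the second read now stays in-process
```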
Tier 2: 10-100 KB (images, thumbnails, HTML fragments, PQ signatures)
This is where network caches start to hurt. A 50 KB value takes 1.5ms from Redis in the same availability zone. Cross-AZ, it is 3.8ms. If you are caching rendered HTML fragments for server-side rendering, every page load pays this cost per fragment.
Post-quantum signatures live in this tier. SLH-DSA-128f signatures are 17,088 bytes. SLH-DSA-256f signatures are 49,856 bytes. If your application verifies PQ signatures and caches the results, you are either caching the full signature (17-49 KB per entry, expensive over the network) or caching a boolean verification result (1 byte, cheap but requires re-verification on eviction).
Recommendation: Cache the verification result, not the full signature. If you must cache the full value (CDN origin shields, SSR fragment caches), use in-process caching. At 50 KB, the difference between 31ns and 3.8ms is the difference between a responsive page and a perceptible delay.
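"Cache the verdict, not the signature" can be sketched by keying the cache on a digest of (message, signature) and storing a one-byte boolean instead of the 17-49 KB signature. `verify_slh_dsa` below is a hypothetical placeholder for a real PQ verifier, not an actual library call.

```python
# Sketch: store a 32-byte key -> 1-byte verdict instead of a 17-49 KB
# signature. verify_slh_dsa() is a placeholder for a real PQ crypto library.
import hashlib

verdict_cache: dict[bytes, bool] = {}

def verify_slh_dsa(message: bytes, signature: bytes) -> bool:
    # Placeholder: a real implementation would call a PQ crypto library here.
    return True

def cached_verify(message: bytes, signature: bytes) -> bool:
    digest = hashlib.sha256(message + signature).digest()  # 32-byte cache key
    if digest in verdict_cache:
        return verdict_cache[digest]          # hit: no re-verification needed
    ok = verify_slh_dsa(message, signature)   # miss: pay the expensive path once
    verdict_cache[digest] = ok
    return ok
```

The tradeoff named above still applies: evicting a verdict means re-verifying, so size the verdict cache for your active signature set.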
Tier 3: 100 KB - 1 MB (PDFs, video posters, model weights, rendered reports)
Network caches effectively break at this tier. A 200 KB PDF preview takes 3.2ms from Redis -- longer than most database queries. A 1 MB video poster frame takes 8ms. At this point, regenerating the value from the database can be faster than fetching it from your "fast" cache.
This is also where Redis memory efficiency collapses. Redis stores values as SDS (Simple Dynamic Strings) with per-key overhead of ~70 bytes for metadata, plus jemalloc allocation rounding. For a 200 KB value, the overhead is negligible (0.03%). For a 64-byte value, the overhead is 109%. At large value sizes, the memory efficiency is fine but the latency is not. At small value sizes, the memory is wasteful but the latency is acceptable. Redis cannot optimize for both simultaneously.
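The overhead percentages quoted above fall out of simple arithmetic, assuming the ~70 bytes of fixed per-key Redis metadata stated in the paragraph (SDS header, dict entry, expiry, and allocator rounding folded into one constant for illustration):

```python
# Per-key overhead as a fraction of value size, assuming ~70 bytes of
# fixed Redis metadata per entry (illustrative constant from the text).
OVERHEAD_BYTES = 70

def overhead_pct(value_bytes: int) -> float:
    return 100 * OVERHEAD_BYTES / value_bytes

print(f"64 B value:   {overhead_pct(64):.0f}% overhead")          # ~109%
print(f"200 KB value: {overhead_pct(200 * 1024):.2f}% overhead")  # ~0.03%
```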
Recommendation: Tiered architecture. In-process L0 for hot large values (the 100-200 items accessed most frequently). Redis L1 for warm data. Origin/database for cold. CacheeLFU admission control determines tier placement automatically -- frequently accessed 500 KB values stay in L0, rarely accessed ones fall to L1 or evict.
Tier 4: 1-10 MB (high-res images, document renders, model checkpoints)
At this tier, you should not be using a key-value cache at all for the raw bytes. Redis maxes out at 512 MB per value, but the practical limit is far lower -- a single 10 MB GET blocks the Redis event loop for the duration of the transfer, stalling all other clients on that shard.
The pattern here is metadata caching + streaming. Cache the metadata (URL, dimensions, content-type, ETag, last-modified) in your fast cache. Serve the bytes from object storage (S3, R2, GCS) with CDN. The cache hit path is: check L0 for metadata (31ns), return a redirect or pre-signed URL, let the CDN handle the bytes.
Recommendation: Never cache multi-megabyte values in a key-value store. Cache the pointer, not the payload. Your in-process cache stores the metadata, your object store serves the bytes. Total latency: 31ns (metadata lookup) + CDN transfer time.
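The "cache the pointer, not the payload" pattern looks roughly like this sketch: the in-process cache holds a few hundred bytes of metadata, and the bytes themselves come from object storage via CDN. `make_presigned_url` is a hypothetical stand-in for an object-store SDK call (e.g. S3's presigning API), and the field names are illustrative.

```python
# Metadata caching + streaming: cache ~200 bytes of metadata, not ~200 KB
# of payload. The CDN / object store serves the actual bytes.
metadata_cache = {}

def make_presigned_url(bucket: str, key: str) -> str:
    # Placeholder: a real app would call its object-store SDK here.
    return f"https://cdn.example.com/{bucket}/{key}?sig=..."

def serve_asset(asset_id: str) -> str:
    meta = metadata_cache.get(asset_id)       # nanosecond-scale local lookup
    if meta is None:
        # On miss, load metadata from the database/origin (stubbed here).
        meta = {"bucket": "renders", "key": f"{asset_id}.pdf",
                "content_type": "application/pdf", "etag": "abc123"}
        metadata_cache[asset_id] = meta       # cache the pointer, not the bytes
    return make_presigned_url(meta["bucket"], meta["key"])
```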
Tier 5: 10 MB+ (ML models, video segments, database snapshots)
This is not caching. This is memory-mapped storage. Use mmap, shared memory segments, or memory-mapped files. The operating system's page cache is your L1. Your application reads the mapped region directly. No serialization, no copies.
For ML model weights, frameworks like ONNX Runtime and PyTorch already use memory-mapped model loading. Do not attempt to shove a 2 GB model into Redis. The framework handles it.
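A minimal memory-mapping sketch, using a small temporary file as a stand-in for a large artifact: the OS page cache serves repeated reads, and the application sees the file as a byte slice with no copies through a cache process.

```python
# Memory-mapped read sketch: the OS page cache is the cache layer.
import mmap
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)            # stand-in for a multi-GB artifact

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        header = mapped[:16]           # slicing copies only the slice;
        # `mapped` itself is backed by the page cache, not heap memory
print(len(header))                     # -> 16
```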
The Post-Quantum Angle: Why PQ Key Sizes Change the Equation
The post-quantum transition pushes previously Tier-1 data into Tier-2 territory. Key material that was 32-64 bytes is now 1,568-4,896 bytes. Signatures that were 64 bytes are now 690-49,856 bytes.
A session store holding 1 million sessions transitions from 96 MB of key material (classical) to 4.49 GB (ML-KEM-768 + ML-DSA-65). At the SLH-DSA security level, that becomes 18.27 GB.
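The arithmetic behind those figures, using the published FIPS parameter sizes: ML-KEM-768 encapsulation key 1,184 bytes, ML-DSA-65 signature 3,309 bytes, SLH-DSA-128f signature 17,088 bytes, against a 96-byte classical baseline (e.g. an X25519 key plus an Ed25519 signature). The exact key/signature pairing per session is inferred to match the totals above.

```python
# Per-session PQ artifact sizes (bytes), 1M sessions.
SESSIONS = 1_000_000

classical = 96 * SESSIONS                     # e.g. X25519 key + Ed25519 sig
ml_kem_ml_dsa = (1_184 + 3_309) * SESSIONS    # ML-KEM-768 ek + ML-DSA-65 sig
ml_kem_slh_dsa = (1_184 + 17_088) * SESSIONS  # ML-KEM-768 ek + SLH-DSA-128f sig

print(f"classical:          {classical / 1e6:.0f} MB")    # 96 MB
print(f"ML-KEM + ML-DSA:    {ml_kem_ml_dsa / 1e9:.2f} GB")   # 4.49 GB
print(f"ML-KEM + SLH-DSA:   {ml_kem_slh_dsa / 1e9:.2f} GB")  # 18.27 GB
```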
This is not a hypothetical. NIST FIPS 203/204/205 are final. Chrome and Firefox ship ML-KEM in TLS 1.3 today. CNSA 2.0 mandates PQ for national security systems by 2030. The key sizes are coming, and they are coming to your cache layer first.
Redis at PQ Scale
Consider a rate limiter that caches the last ML-DSA-65 signature per API client for replay detection. At 3,309 bytes per signature and 100K active clients, the cache holds 331 MB of signature data. Every rate limit check is a Redis GET of 3.3 KB -- 0.5ms per check. At 10,000 requests per second, the rate limiter alone consumes 5 seconds of cumulative Redis latency per second. The system cannot keep up. In-process at 31ns per check, the same workload consumes 0.31 milliseconds of cumulative latency per second.
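The cumulative-latency arithmetic above is worth spelling out: per-check latency times checks per second gives latency accrued per wall-clock second, and anything above one second per second means the cache cannot keep up.

```python
# Cumulative cache latency per wall-clock second for the replay-detection
# workload described above.
RPS = 10_000                 # rate limit checks per second

redis_check_s = 0.5e-3       # 0.5 ms per 3.3 KB Redis GET
inproc_check_s = 31e-9       # 31 ns in-process lookup

print(f"Redis:      {RPS * redis_check_s:.2f} s of latency per second")  # 5.00 s
print(f"In-process: {RPS * inproc_check_s * 1e3:.2f} ms per second")     # 0.31 ms
```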
Architecture: How Cachee Handles Large Values
Cachee is an in-process cache engine written in Rust. It runs in the same address space as your application. There is no network hop, no serialization, no deserialization. A GET is a hash lookup in a lock-free DashMap and a pointer dereference. The cost is constant regardless of value size.
CacheeLFU: Admission That Understands Value Size
Not every large value deserves L0 residency. A 200 KB PDF preview accessed once per hour should not evict 3,000 session tokens accessed 100 times per second. CacheeLFU admission scoring factors access frequency against entry size. The scoring function: frequency / ln(age_since_last_access). Higher score = less likely to evict. A 200 KB entry accessed once gets a low score and evicts quickly. A 4 KB ML-DSA public key accessed 10,000 times per second gets a high score and stays permanently.
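A sketch of the stated scoring rule, score = frequency / ln(age since last access), with age in seconds. The guard for small ages is my addition to keep the logarithm positive; Cachee's real admission logic may differ in detail.

```python
# CacheeLFU-style admission score sketch: frequency / ln(age_seconds).
import math

def lfu_score(frequency: int, age_seconds: float) -> float:
    # Clamp age to e so ln(age) >= 1 (guard added for this sketch).
    return frequency / math.log(max(age_seconds, math.e))

# A 200 KB report read once an hour vs. a 4 KB public key read constantly:
cold_report = lfu_score(frequency=1, age_seconds=3600)
hot_pubkey = lfu_score(frequency=10_000, age_seconds=2)

assert hot_pubkey > cold_report   # the hot key wins L0 residency
```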
The admission sketch uses a count-min sketch with 4 rows of 65,536 atomic counters -- 512 KiB of constant memory regardless of whether the cache holds 100K or 10M keys. At 10M keys, this is 1,239x more memory-efficient than tracking frequency per-key in a DashMap.
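A minimal count-min sketch with the stated shape, 4 rows of 65,536 counters, shows why memory stays constant no matter how many keys pass through: recording and estimating only ever touch the fixed table. The salted-sha256 row hashing is illustrative, not Cachee's actual hash.

```python
# Minimal count-min sketch: 4 rows x 65,536 counters of constant memory.
import hashlib

ROWS, WIDTH = 4, 65_536
table = [[0] * WIDTH for _ in range(ROWS)]

def _index(row: int, key: str) -> int:
    # One independent-ish hash per row via a row-salted digest.
    h = hashlib.sha256(f"{row}:{key}".encode()).digest()
    return int.from_bytes(h[:4], "big") % WIDTH

def record(key: str) -> None:
    for row in range(ROWS):
        table[row][_index(row, key)] += 1

def estimate(key: str) -> int:
    # Taking the min across rows bounds the overcount from collisions.
    return min(table[row][_index(row, key)] for row in range(ROWS))

for _ in range(5):
    record("session:alice")
print(estimate("session:alice"))   # 5, barring collisions in all 4 rows
```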
Zero-Copy Reads
When your application calls cache.get("key"), Cachee returns a reference to the value in shared memory. There is no copy. The application reads the bytes directly from the cache's memory region. For a 200 KB value, this eliminates a 200 KB memcpy that even in-process hash maps would normally require. The read is the pointer dereference itself.
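The zero-copy idea can be illustrated in Python terms with `memoryview`, which hands out a window into an existing buffer instead of copying it. Cachee's Rust internals differ, but the principle is the same: the reader sees the cached bytes in place.

```python
# Zero-copy read analogy: a memoryview shares the underlying buffer.
payload = bytearray(200 * 1024)     # a 200 KB cached value
view = memoryview(payload)          # no copy: a reference to the same bytes

view_slice = view[:1024]            # still no copy, just narrower bounds
payload[0] = 42
assert view_slice[0] == 42          # the view observes the underlying bytes
assert len(view) == 200 * 1024
```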
Tiered Eviction
Cachee maintains three tiers internally:
- L0 (hot): 64-shard concurrent DashMap. 31ns reads. Self-promoting on GET -- every access bumps the entry's priority.
- L1 (warm): 128-shard lock-free map. 59ns reads. Entries demoted from L0 on memory pressure land here.
- L2 (fallthrough): Your existing Redis, database, or origin. Cachee fetches on L0+L1 miss and auto-promotes hot results.
Large values that are accessed frequently stay in L0. Large values accessed rarely drop to L1, then evict. The eviction decision is automatic -- CacheeLFU handles it based on observed access patterns, not manual TTL tuning.
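The promote/demote flow can be sketched with two dicts and a tiny fixed L0 capacity; real tier sizing and the CacheeLFU scoring that drives demotion are more involved than this recency-only toy.

```python
# Tiered promote/demote sketch: tiny L0, demotions land in L1,
# warm hits auto-promote back to L0. Capacities are illustrative.
from collections import OrderedDict

L0_CAP = 2
l0 = OrderedDict()   # hot tier: most recently used entries kept at the end
l1 = {}              # warm tier: landing zone for demoted entries

def put(key, value):
    l0[key] = value
    l0.move_to_end(key)
    if len(l0) > L0_CAP:
        demoted_key, demoted_val = l0.popitem(last=False)  # coldest entry
        l1[demoted_key] = demoted_val

def get(key):
    if key in l0:
        l0.move_to_end(key)          # self-promoting: access bumps priority
        return l0[key]
    if key in l1:
        put(key, l1.pop(key))        # auto-promote a warm hit back to L0
        return l0[key]
    return None                      # L2 fallthrough (Redis/origin) goes here

put("a", 1); put("b", 2); put("c", 3)  # "a" is coldest, demoted to L1
assert "a" in l1
assert get("a") == 1                   # warm hit promotes "a" back to L0
```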
The Result
Cachee serves a 50 KB HTML fragment at the same latency as a 64-byte session token: 31 nanoseconds. A 17 KB SLH-DSA signature at 31 nanoseconds. A 200 KB PDF preview at 31 nanoseconds. Value size does not enter the latency equation. Your application reads from local memory. The network is not involved.
When to Use What
| Value Size | Access Pattern | Where to Cache | Why |
|---|---|---|---|
| < 1 KB | Any | In-process L0 | Everything fits. No reason to go to network. |
| 1-100 KB | Hot (100+ reads/sec) | In-process L0 | Network latency is 16,000-50,000x slower. |
| 1-100 KB | Warm (1-10 reads/min) | Redis L1 | Memory cost of L0 residency not justified. |
| 100 KB - 1 MB | Hot | In-process L0 (top 100-200 items) | Only the hottest items justify the memory. |
| 100 KB - 1 MB | Warm/Cold | CDN + metadata in L0 | Cache the pointer, not the payload. |
| 1 MB+ | Any | Object storage + CDN | Not a cache problem. Memory-map if local. |
Getting Started
```shell
# Install
brew tap h33ai-postquantum/tap
brew install cachee

# Start with large-value support
cachee init
cachee start

# Cache a 50KB HTML fragment
cachee set "fragment:homepage:hero" "$(cat hero.html)"

# Retrieve at 31ns
cachee get "fragment:homepage:hero"

# Check hit rate and memory
cachee status
```
Cachee speaks RESP -- any Redis client works. Point your existing Redis client at localhost:6380 and your large-value caching runs at 31ns instead of 1.5ms. No code changes beyond the connection string.