Cache Serialization Cost: Why Large Values Lag
You added caching to speed up your application. It worked for session tokens and feature flags. Then you cached a 100 KB API response and the latency barely improved. You assumed the cache was cold, warmed it up, and measured again. Still slow. The value is in Redis. Redis is fast. Why is your cached read taking over a millisecond?
Because the cache read is not just a read. It is six distinct operations, and at least three of them scale linearly with the size of the value. Serialization -- the process of converting data structures to byte streams and back -- is the hidden tax on every cache operation. For small values, the tax is invisible. For large values, it dominates your latency budget.
This article dissects every step of a Redis GET, benchmarks each step in isolation, compares serialization formats, and explains why in-process caching eliminates the problem entirely.
The Six Steps Inside Every Redis GET
When your application calls redis.get("user:profile:12345"), the following operations execute in sequence. Each one has a cost. For a 64-byte session token, the total cost is dominated by the network round trip. For a 100 KB API response, the cost profile changes dramatically.
Step 1: Client Serializes the Key to RESP
The Redis Serialization Protocol (RESP) is a text-based protocol. Your client library converts the GET command and key into RESP wire format. For a simple key like "user:profile:12345", this produces:
*2\r\n$3\r\nGET\r\n$20\r\nuser:profile:12345\r\n
This step is fast and constant-size regardless of value size because you are only serializing the key and command. Cost: 0.1-0.5 microseconds. This step is not the problem.
Step 2: TCP Send
The serialized RESP command is sent over a TCP connection to the Redis server. If connection pooling is configured (it should be), this reuses an existing connection. The command is small (typically under 100 bytes), so it fits in a single TCP segment. Cost: variable, dominated by kernel syscall overhead (~1-5 microseconds for the write() syscall) plus network latency (50-300 microseconds same-AZ, 500-2000 microseconds cross-AZ). For this analysis, we will use 150 microseconds as a representative same-AZ network latency.
Step 3: Redis Finds the Value
Redis receives the command, parses it, and performs a hash table lookup. Redis uses a chained hash table with incremental rehashing. The lookup is O(1) amortized. For a database with 1 million keys, the lookup takes 0.5-2 microseconds, including hash computation and pointer chasing. This step does not depend on value size. Redis finds the pointer to the value; it does not touch the value data yet.
Step 4: Redis Serializes the Value to RESP
This is where value size starts to matter. Redis must convert the stored value into RESP wire format for transmission. For a bulk string response, RESP produces:
$[length]\r\n[data bytes]\r\n
The length prefix is trivial. The data bytes are a memcpy from Redis's internal SDS (Simple Dynamic String) buffer into the output buffer. For a 64-byte value, this memcpy is negligible -- a few cache lines, a few nanoseconds. For a 100 KB value, Redis must copy 100,000 bytes into the output buffer. On modern hardware, memcpy throughput is approximately 10-20 GB/s, so 100 KB copies in about 5-10 microseconds.
But the real cost is not the memcpy itself. It is the interaction with Redis's event loop. Redis is single-threaded. While it is writing 100 KB into the output buffer and flushing it to the TCP socket, it is not processing any other commands. For a 1 MB value, the serialization and write phase blocks the event loop for 50-100 microseconds. Every other client waiting for a response from this Redis instance is stalled for that duration.
Step 5: TCP Receive
The RESP-encoded response travels back over the network. For a 64-byte value, this is a single TCP segment. For a 100 KB value, this is approximately 70 TCP segments (at a typical MSS of 1,460 bytes). The kernel must reassemble these segments, copy them from kernel buffers to userspace, and deliver them to the client. Cost: network latency (150 microseconds) plus transfer time. Transfer time for 100 KB at 10 Gbps: ~80 microseconds. At 1 Gbps: ~800 microseconds. This step scales linearly with value size.
Step 6: Client Deserializes the RESP Response
The client library receives the raw RESP bytes and must parse them back into a usable data structure. For a simple string value, this is a buffer slice -- nearly free. But most applications do not store raw strings. They store serialized objects: JSON, MessagePack, Protobuf, or application-specific binary formats. The client library first parses the RESP envelope (trivial), then the application code deserializes the payload from its chosen format into an in-memory object.
This is where the real serialization cost lives. Deserializing 100 KB of JSON involves parsing every byte, allocating strings, building hash maps for objects, and converting numbers from text to native types. This is CPU-intensive work that scales linearly -- and sometimes super-linearly -- with value size.
Benchmarking Serialization Formats at Scale
We benchmarked four common serialization formats at multiple value sizes. The benchmark measures only the serialization and deserialization time -- no network, no Redis, no I/O. Pure CPU cost of converting an in-memory data structure (a nested user profile object with arrays, strings, integers, and nested objects) to bytes and back.
| Format | 1 KB Serialize | 1 KB Deserialize | 10 KB Ser | 10 KB Deser | 100 KB Ser | 100 KB Deser |
|---|---|---|---|---|---|---|
| JSON | 2.1 us | 3.8 us | 18 us | 35 us | 175 us | 340 us |
| MessagePack | 1.4 us | 2.1 us | 12 us | 19 us | 115 us | 185 us |
| Protobuf | 0.8 us | 1.2 us | 7 us | 11 us | 68 us | 108 us |
| RESP (raw bytes) | 0.3 us | 0.4 us | 2.5 us | 3.2 us | 24 us | 31 us |
At 1 KB, serialization cost is barely measurable against network latency. JSON round-trip (serialize + deserialize) costs 5.9 microseconds. The network costs 300+ microseconds. Serialization is 2% of total latency. Nobody notices.
At 100 KB, the picture inverts. JSON round-trip costs 515 microseconds. The network costs 300-400 microseconds. Serialization is now 55-60% of total latency. You have spent more CPU time converting data to and from bytes than you spent moving those bytes across the network.
The JSON Tax at 100 KB
A 100 KB JSON value round-trips through serialize + network + deserialize in approximately 915 microseconds (175us serialize + 300us network + 100us RESP overhead + 340us deserialize). Of that, 515 microseconds -- 56% -- is pure serialization work. Switching from JSON to Protobuf reduces the serialization portion to 176 microseconds, but that is still 30% of total latency. The only way to eliminate this cost entirely is to not serialize at all.
Why Deserialization Is More Expensive Than Serialization
In every format, deserialization is more expensive than serialization. The reasons are structural:
- Memory allocation: Serialization writes into a pre-allocated buffer. Deserialization must allocate new objects, strings, arrays, and hash maps on the heap. Each allocation involves a malloc call, potential contention on the allocator, and eventual GC pressure in managed languages.
- Parsing ambiguity: The serializer knows the data types at compile time (or from the schema). The deserializer must discover them from the byte stream. JSON does not even encode type information explicitly -- the parser must distinguish strings, numbers, booleans, and null by examining the first byte of each value.
- Validation: The deserializer must validate that the input is well-formed. A malformed JSON payload must produce an error, not a segfault. This validation requires checking every byte. The serializer, working from trusted in-memory data, performs no validation.
- String handling: JSON strings require escape processing on deserialization (converting
\"to",\\nto newlines,\\uXXXXto Unicode codepoints). This is a byte-by-byte operation with branches. Serialization escaping is simpler because the input is already a valid Unicode string.
For a 100 KB JSON payload with many string fields, deserialization allocates dozens of heap objects and processes thousands of escape-eligible characters. The cost is not just CPU cycles -- it is cache pollution. The deserializer touches memory in an unpredictable pattern (allocating objects scattered across the heap), which thrashes L1/L2 CPU caches and slows down subsequent application code.
The Full Cost Breakdown: Redis GET of a 100 KB Value
Putting all six steps together for a 100 KB value stored as JSON, accessed from the same availability zone:
| Step | Operation | Cost | % of Total |
|---|---|---|---|
| 1 | Client serializes key to RESP | 0.5 us | <0.1% |
| 2 | TCP send (command) | 155 us | 16.3% |
| 3 | Redis hash lookup | 1.5 us | 0.2% |
| 4 | Redis serializes value to RESP + write | 35 us | 3.7% |
| 5 | TCP receive (100 KB payload) | 230 us | 24.2% |
| 6 | Client deserializes JSON | 340 us | 35.8% |
| -- | Application-level JSON parse | 175 us | 18.4% |
| Total | 937 us |
The two serialization steps (Redis RESP serialization + client JSON deserialization) plus the application-level JSON parsing account for 550 microseconds -- 58.7% of total latency. The actual network transfer is 385 microseconds. The hash lookup that represents the "cache" part of the operation is 1.5 microseconds.
Your 100 KB cached value takes 937 microseconds to retrieve. Of that, less than 0.2% is the cache doing its job (finding the value). The remaining 99.8% is overhead: moving data across the network and converting it between formats.
In-Process: Zero Serialization Because There Is Nothing to Serialize
An in-process cache stores values in the same address space as your application. When you call cache.get("user:profile:12345"), the cache performs a hash lookup (the same 1.5 microsecond operation Redis performs) and returns a pointer to the value. There is no Step 1 (no RESP command to construct). No Step 2 (no TCP send). No Step 4 (no RESP serialization). No Step 5 (no TCP receive). No Step 6 (no deserialization).
The value is already in the application's memory. It is already an in-memory data structure -- the same struct, the same object, the same bytes that the application works with. Accessing it is a pointer dereference, not a data transformation. The cost is the hash lookup itself: 31 nanoseconds.
There is no serialization because there is nothing to convert. The cache entry IS the application object. Asking "how long does it take to serialize?" is like asking "how long does it take to convert a variable into itself?" The answer is zero. Not "very fast." Zero.
When Serialization Cost Matters and When It Does Not
Not every application caches 100 KB values. Not every application is latency-sensitive. Here is a practical guide for when serialization overhead is a real problem versus when you can ignore it.
Serialization Does Not Matter When:
- Values are under 1 KB. At 1 KB, JSON round-trip serialization costs 5.9 microseconds. Against a 300+ microsecond network round trip, this is noise. If all your cached values are small (session tokens, feature flags, counters, rate limit entries), serialization is not your bottleneck.
- Cache reads are not in the hot path. If the cache is used for background jobs, batch processing, or pre-computation, an extra 500 microseconds per read is irrelevant. Your batch job takes minutes anyway.
- You read each value once per request. A single 937-microsecond cache read is not noticeable in a request that takes 50ms total. The problem arises when you hit the cache multiple times per request, each time deserializing a large value.
- Your values are raw bytes. If you store pre-rendered HTML, images, or binary blobs, there is no application-level deserialization. The RESP overhead is small (24-31 microseconds at 100 KB). Network transfer dominates.
Serialization Matters When:
- Values are 10 KB+ and accessed in the hot path. At 10 KB with JSON, serialization costs 53 microseconds round-trip. At 100 KB, 515 microseconds. If your API endpoint hits the cache 3-4 times with large values, serialization alone adds 1.5-2ms to every request.
- You use JSON for cache serialization. JSON is the most common format and the most expensive. It is human-readable, which means it wastes bytes on field names, braces, brackets, quotes, and escape sequences. A 50 KB JSON object is often 20-30 KB as Protobuf. You are paying to serialize and transmit characters that carry no information.
- You are in a latency-sensitive domain. Financial trading, real-time bidding, game server tick processing, fraud detection -- any domain where microseconds matter. If your SLA is "respond in 5ms" and serialization costs 0.5ms, you have consumed 10% of your budget on format conversion.
- You hit the cache at high frequency. 100,000 cache reads per second at 100 KB each means your application spends 51.5 CPU-seconds per second on serialization alone. You need 52 cores just to deserialize cache responses. This is not theoretical. Large API gateways, CDN origin shields, and ML inference pipelines routinely hit these numbers.
- You cache post-quantum cryptographic material. ML-DSA-65 public keys are 1,952 bytes. SLH-DSA-128f signatures are 17,088 bytes. ML-KEM-768 ciphertexts are 1,088 bytes. If you cache these as base64-encoded JSON (a common pattern), the base64 encoding adds 33% overhead and the JSON wrapper adds field names and brackets. A 17 KB signature becomes 24 KB of JSON. Serialization cost: 45 microseconds round-trip. Multiply by the number of signature verifications per second.
Choosing a Serialization Format (If You Must Serialize)
If you cannot move to in-process caching (multi-instance deployments that need shared cache state), choosing the right serialization format reduces but does not eliminate the overhead.
| Format | Pros | Cons | Best For |
|---|---|---|---|
| JSON | Human-readable, universal support, debuggable | Slowest, largest wire size, no schema | Development, debugging, small values |
| MessagePack | Binary JSON, 1.5-2x faster than JSON, compact | Still schemaless, requires library | Drop-in JSON replacement |
| Protobuf | Schema-enforced, 2-3x faster, smallest wire size | Requires .proto files, code generation, versioning discipline | Production large-value caching |
| FlatBuffers/Cap'n Proto | Zero-copy access, no deserialization step | Complex API, alignment requirements, not human-debuggable | Extreme latency sensitivity |
| Raw bytes (RESP only) | No application serialization, direct buffer access | No structure, application must interpret bytes | Binary blobs, pre-rendered content |
FlatBuffers and Cap'n Proto deserve special mention. These formats store data in a wire-compatible format that can be accessed directly without deserialization. You read fields from the buffer in place, following offsets instead of parsing. This eliminates Step 6 (client deserialization) at the cost of a more complex programming model. However, you still pay for Steps 2-5 (network transfer). The serialization savings are real but the network cost remains.
The Architecture Decision: Eliminate Steps, Not Optimize Them
There are two approaches to the serialization problem:
- Optimize serialization. Switch from JSON to Protobuf. Use connection pooling. Enable pipelining. Use Unix domain sockets instead of TCP. Compress large values. Each optimization reduces a step's cost by some percentage.
- Eliminate serialization. Move the cache in-process. Remove Steps 1-6 entirely. The cost goes from 937 microseconds to 31 nanoseconds. No optimization of the individual steps can compete with removing them.
Option 1 is incremental improvement. Switching from JSON to Protobuf reduces serialization cost from 515 microseconds to 176 microseconds at 100 KB. That is a 2.9x improvement. Impressive in isolation. But total latency goes from 937 microseconds to 598 microseconds. You have optimized one component of a six-step pipeline. The other five steps still execute.
Option 2 is architectural elimination. Total latency goes from 937 microseconds to 31 nanoseconds. That is a 30,226x improvement. Not because in-process hash tables are magically faster, but because five of the six steps no longer exist. You do not optimize what you remove.
Zero Serialization with Cachee
Cachee stores values in-process as native memory. A GET is a DashMap lookup and a pointer dereference. No RESP encoding. No TCP transfer. No JSON parsing. No Protobuf decoding. The value is already the data structure your application uses. 31 nanoseconds at any value size. The 100 KB API response, the 17 KB PQ signature, the 200 KB rendered report -- all accessed at the same cost as a 64-byte session token.
Practical Steps
1. Measure Your Serialization Overhead
Before changing anything, measure. Instrument your Redis client to log the time spent in serialization and deserialization separately from network time. Most client libraries have hooks or middleware for this. You may be surprised -- teams often assume "Redis is slow" when the actual bottleneck is their JSON library.
# Python example: measure serialization cost
import time, json, redis
r = redis.Redis()
raw = r.get("large:key") # Returns bytes (RESP already parsed)
t0 = time.perf_counter_ns()
obj = json.loads(raw) # This is your deserialization cost
t1 = time.perf_counter_ns()
print(f"Deserialization: {(t1-t0)/1000:.1f} us for {len(raw)} bytes")
2. Profile Value Size Distribution
Run redis-cli --bigkeys or use MEMORY USAGE key to understand the size distribution of your cached values. If 90% of your keys are under 1 KB, serialization is not your problem. If 10% of your keys are over 10 KB and those keys are accessed frequently, those are your candidates for in-process caching.
3. Move Hot Large Values In-Process
You do not need to replace Redis entirely. Identify the large values that are accessed most frequently and move them to an in-process L0 tier. Keep Redis as L1 for shared state and less-frequently-accessed data. The hot path avoids serialization entirely. The warm path still uses Redis but handles only the values where serialization cost is tolerable.
# Install Cachee
brew tap h33ai-postquantum/tap
brew install cachee
# Start with RESP compatibility
cachee init
cachee start
# Point hot-path reads at localhost:6380 (Cachee)
# Point warm-path reads at your Redis cluster
# No serialization on the hot path. Zero.
4. If You Must Use Network Cache: Switch to Protobuf
If architectural constraints prevent in-process caching (shared state across multiple instances, for example), at minimum switch from JSON to Protobuf for values over 10 KB. The 2.9x serialization speedup is free performance. The schema enforcement also catches bugs at compile time that JSON catches at runtime (or never).
Conclusion
Serialization is invisible at small value sizes and dominant at large ones. The crossover happens around 10 KB, where serialization begins to consume more time than the network transfer itself. At 100 KB, serialization is the majority of your cache latency. At 1 MB, it is overwhelming.
The fix is not faster serialization. The fix is no serialization. An in-process cache eliminates the concept of format conversion from the read path. The value does not travel across a network. It does not transform between representations. It exists once, in memory, and your application reads it directly. That is why large values lag in network caches and why they do not in Cachee.
Stop serializing. Start reading from memory. 31ns for any value size.
Install Cachee View Benchmarks