
Rust vs Go for Cache Infrastructure: Why Rust Wins at the Margins

May 10, 2026 | 15 min read | Engineering

Go and Rust both power production cache infrastructure. Go has groupcache (the library Google built to cache dl.google.com downloads), Ristretto (Dgraph's high-performance cache), and go-cache (the simple in-memory store used in thousands of services). Rust has Cachee, mini-moka (inspired by Java's Caffeine), and quick-cache. Both languages handle millions of operations per second. Both compile to native code. Both have excellent concurrency primitives. If you measure only throughput -- operations per second at the median -- they look equivalent. The benchmark chart shows two bars of similar height, and the conclusion appears obvious: pick whichever language your team knows.

That conclusion is wrong, and it is wrong because throughput at the median is the wrong metric for cache infrastructure. The metric that matters for infrastructure is tail latency: P99 and P99.9. These percentiles capture what happens to the slowest 1% and 0.1% of requests. For application-level code, a P99 spike from 1ms to 10ms is invisible -- users do not perceive single-digit millisecond variations. For infrastructure-level code -- the cache layer that sits in the hot path of every request your application serves -- a P99 spike from 31 nanoseconds to 10 milliseconds is a 322,580x regression that cascades through your entire stack.

This post is not an argument that Go is a bad language. Go is an excellent language for building applications, APIs, CLIs, and networked services. The argument is specific: for the cache infrastructure layer, where tail latency predictability is the primary requirement, Rust's ownership model and zero-GC runtime produce fundamentally better latency distributions than Go's garbage-collected runtime. The difference is not in the average case. It is in the margins. And the margins are where infrastructure reliability lives.

322,580x: P99 regression during a Go GC pause
0: GC pauses in the Rust runtime
2-3x: Go memory overhead for GC metadata and heap headroom

The Garbage Collection Problem

Go uses a concurrent, tri-color mark-and-sweep garbage collector. The GC runs concurrently with your application, which means it does not stop the world for the full collection cycle. But it does stop the world briefly at two points: the mark setup phase and the mark termination phase. These stop-the-world (STW) pauses are typically 10-100 microseconds in Go 1.22+, down from milliseconds in earlier versions. The Go team has done remarkable work reducing GC pause times over the past decade.

For most applications, sub-100-microsecond STW pauses are irrelevant. A web server handling HTTP requests with 5ms average latency will not notice a 50-microsecond GC pause. The pause is 1% of the request latency. Users certainly do not notice it.

But a cache is not a web server. A cache's purpose is to eliminate latency. The entire reason the cache exists is to turn a 5ms database query into a sub-microsecond memory lookup. The cache is the lowest-latency component in the stack. When the Go GC pauses the cache for even 50 microseconds, it is 1,612x slower than the cache's normal 31-nanosecond read latency. That 50-microsecond pause becomes the dominant latency contribution for every request that hits the cache during the GC window.

The STW pauses are only part of the problem. The concurrent phases of Go's GC also affect cache performance. During the mark phase, the GC scans every live pointer in the heap. A cache with 10 million entries, each containing a key pointer, a value pointer, and metadata pointers, presents the GC with 30-50 million pointers to scan. The scanning itself consumes CPU cycles that would otherwise serve cache reads. Go's GC targets roughly 25% of available CPU for itself while marking (a fixed design goal of the collector; GOGC controls how often it runs, not how much CPU it takes), which means that during the concurrent mark phase your cache has roughly 25% less CPU for serving reads. Throughput drops by up to 25% during every GC cycle.

GC Frequency Scales With Allocation Rate

Go's GC triggers when heap size reaches a threshold proportional to the live heap size (controlled by GOGC, default 100, meaning GC triggers when the heap doubles). A cache that serves 1 million reads per second is not allocating much -- reads are lookups. But a cache that also handles 100,000 writes per second is allocating new key-value pairs, creating temporary objects for serialization and deserialization, and generating garbage from evicted entries. The higher the write rate, the faster the heap grows, and the more frequently the GC runs.

At moderate write rates (10K-100K writes/sec), Go's GC runs every 1-5 seconds. Each cycle produces two STW pauses and a concurrent phase that consumes 25% of CPU. At high write rates (100K+ writes/sec), the GC runs multiple times per second. The cumulative effect is a GC that is always running, always consuming CPU, and periodically freezing all goroutines. The cache's latency distribution develops a bimodal shape: most reads complete in nanoseconds, but a measurable percentage land during GC activity and complete in microseconds or milliseconds. The P99 reflects this bimodality.

// Go: Ristretto cache read -- clean, fast, simple
cache, _ := ristretto.NewCache(&ristretto.Config{
    NumCounters: 1e7,     // 10M counters for admission
    MaxCost:     1 << 30, // 1GB max size
    BufferItems: 64,      // ring buffer per Get
})

// Set: allocates key, value, and internal metadata on the heap
cache.Set("user:123:session", sessionData, int64(len(sessionData)))

// Get: fast hash lookup, returns interface{} (heap-allocated pointer)
value, found := cache.Get("user:123:session")
if found {
    session := value.(SessionData) // type assertion, no allocation
    // ... use session
}

// Problem: every Set allocates. Every eviction creates garbage.
// GC scans all 10M entries' pointers every cycle.
// At 100K writes/sec, GC runs every 1-3 seconds.
// During GC: 25% CPU lost + STW pauses at P99.
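
You can watch this cadence directly. The sketch below is illustrative, not a Ristretto benchmark: a plain map stands in for the cache, a pacing loop approximates roughly 100K writes per second of 256-byte values, and runtime.ReadMemStats reports how many GC cycles complete each second. With roughly 25-30MB of live data and roughly 25MB/s of fresh garbage, GOGC=100 triggers a collection about once a second; raise the write rate or value size and the cadence tightens.

package main

import (
    "fmt"
    "runtime"
    "time"
)

func main() {
    // Stand-in for a write-heavy cache: ~100K writes/sec of 256-byte values.
    go func() {
        m := make(map[int][]byte)
        for i := 0; ; i++ {
            m[i%100_000] = make([]byte, 256)
            if i%1000 == 0 {
                time.Sleep(10 * time.Millisecond) // ~1000 writes per 10ms
            }
        }
    }()

    // Report completed GC cycles once per second.
    var last uint32
    for range time.Tick(time.Second) {
        var ms runtime.MemStats
        runtime.ReadMemStats(&ms)
        fmt.Printf("GC cycles this second: %d (heap: %d MiB)\n", ms.NumGC-last, ms.HeapAlloc>>20)
        last = ms.NumGC
    }
}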

Rust: Zero GC, Predictable Latency

Rust does not have a garbage collector. Memory is managed through the ownership system, enforced at compile time. When a value goes out of scope, its memory is freed deterministically at the exact point in the code where the scope ends. There is no background process scanning the heap. There is no stop-the-world pause. There is no concurrent mark phase consuming CPU cycles. Freeing memory is still an allocator call, but it happens at a known point in the code, on the thread that owns the value, with no GC bookkeeping layered on top: the accounting of who frees what was settled at compile time.

For cache infrastructure, this means the latency distribution is unimodal. Every read takes approximately the same time: a hash computation, a table lookup, and a pointer dereference. There is no GC cycle that periodically distorts the distribution. P99 equals P50 plus normal variance from CPU cache misses, branch mispredictions, and OS scheduling jitter. The variance is nanoseconds, not milliseconds.

use cachee::L1Cache;
use std::sync::Arc;

// Cachee L1: in-process, lock-free reads, zero GC
let cache: Arc<L1Cache> = Arc::new(
    L1Cache::builder()
        .max_entries(10_000_000)  // 10M entries
        .build()
);

// Set: value is moved into the cache, no GC tracking
// Memory is owned by the cache data structure, freed on eviction
cache.set("user:123:session", session_data);

// Get: sharded concurrent read via DashMap
// Returns a reference, zero allocation, zero GC involvement
if let Some(session) = cache.get("user:123:session") {
    // Direct memory access, 31ns
    // No pointer scanning, no write barrier, no GC pause risk
    process_session(&session);
}

// Eviction: memory freed immediately when entry is removed
// No garbage queue, no finalizer, no background scan
// The freed memory is immediately available for reuse

Write barriers matter more than you think. Go's concurrent GC requires write barriers: small pieces of code injected at pointer writes that tell the GC the pointer graph has changed while it is marking. The barrier is active whenever a mark phase is in progress, so during GC, every pointer write inside cache.Set() executes write barrier code in addition to the actual cache insertion. In microbenchmarks, write barriers add 5-15% overhead to pointer-heavy write operations. For a cache processing 100,000 writes per second under a GC that is marking a large fraction of the time, that is write barrier code running tens of thousands of times per second, each execution costing a few nanoseconds that add up to measurable overhead. Rust has no write barriers because it has no GC to inform.

Memory: The Hidden 2-3x Tax

Go's garbage collector needs metadata to do its job. For every heap allocation, the runtime must be able to tell which words contain pointers and which contain scalar values so the mark phase knows what to trace; it keeps this information in per-span pointer bitmaps and type metadata maintained by the allocator. Together with size-class rounding and allocator bookkeeping, this adds roughly 16-24 bytes of overhead per small heap object on 64-bit systems.

For a cache entry consisting of a 32-byte key and a 256-byte value, the raw data is 288 bytes. In Go, the actual memory consumed also includes size-class rounding on each allocation, the string and slice headers that reference the key and value, GC pointer bitmaps, and alignment padding. The total is approximately 320-350 bytes per entry, roughly 10-20% overhead at this value size. For smaller values (32-byte key, 32-byte value), the overhead percentage is higher: 64 bytes of data, 100+ bytes consumed, 56%+ overhead.

But the larger memory tax comes from the GC's heap headroom requirement. Go's GC triggers when the heap reaches GOGC percent above the live heap size. At the default GOGC=100, the GC allows the heap to reach 2x the live heap before triggering collection. This means a cache with 10GB of live data requires 20GB of heap space to avoid continuous GC pressure. If you reduce GOGC to 50 (trigger at 1.5x), you reduce the headroom but increase GC frequency, causing more frequent STW pauses. If you increase GOGC to 200 (trigger at 3x), you reduce GC frequency but triple your memory requirement. There is no setting that gives you both low GC frequency and low memory usage. The trade-off is fundamental.
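
Both knobs are adjustable at runtime as well as through the environment. A minimal sketch of the levers described above; the 16 GiB limit is an arbitrary example value, not a recommendation:

package main

import (
    "fmt"
    "runtime/debug"
)

func main() {
    // GOGC=50: collect when the heap reaches ~1.5x live data.
    // Less headroom, more frequent mark phases and STW pauses.
    prev := debug.SetGCPercent(50)
    fmt.Println("previous GOGC:", prev)

    // GOMEMLIMIT (Go 1.19+): a soft cap on total memory. The GC runs
    // progressively harder as usage approaches the limit; it trades CPU
    // for memory and does not remove the headroom trade-off.
    debug.SetMemoryLimit(16 << 30) // 16 GiB, example value
}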

Rust caches use exactly the memory their data requires, plus the overhead of the data structure itself (hash table buckets, metadata per entry, alignment). There is no GC headroom requirement. A Cachee L1 cache with 10GB of data uses approximately 10.5-11GB of memory (the 0.5-1GB overhead is hash table structure and per-entry metadata). The same data in a Go cache uses 20-30GB depending on GOGC settings. At scale, this 2-3x difference determines how many entries fit in a given server's memory, which directly affects cache hit rates.

Metric                              | Go (Ristretto)          | Rust (Cachee) | Difference
Memory per 1M entries (256B values) | ~580 MB                 | ~290 MB       | 2.0x
Memory per 1M entries (1KB values)  | ~2.1 GB                 | ~1.05 GB      | 2.0x
Heap headroom (GOGC=100)            | 2x live data            | 0 (no GC)     | N/A
Total memory for 10GB live data     | ~20 GB                  | ~10.5 GB      | 1.9x
Per-object GC metadata              | 16-24 bytes             | 0 bytes       | N/A
Write barrier overhead              | 5-15% on pointer writes | 0%            | N/A

Allocation: Stack vs Heap

Go allocates on the heap whenever the compiler cannot prove a value stays local. Escape analysis determines whether a value can live on the stack instead, but it is conservative: if the compiler cannot prove a value does not escape the function, the value goes to the heap. In practice, many values that a human reader can see will never escape are still heap-allocated, because the analysis is not sophisticated enough to prove it. Every heap allocation feeds the GC's workload.

Rust allocates on the stack by default. Values are only heap-allocated when explicitly boxed (Box::new()), wrapped in a reference-counted pointer (Arc, Rc), or placed in a heap-allocating collection (Vec, HashMap). The programmer controls allocation placement. For cache infrastructure, this means temporary values used during cache operations -- hash computations, key comparisons, serialization buffers -- are stack-allocated and freed instantly when the function returns. They never touch the heap and never contribute to GC pressure (because there is no GC).

// Go: hash computation for cache key lookup
// This function may or may not heap-allocate depending on escape analysis
func computeHash(key string) uint64 {
    h := xxhash.New()       // may escape to heap
    h.Write([]byte(key))    // []byte conversion allocates on heap
    return h.Sum64()
}
// If h escapes, it becomes garbage after this function returns.
// The []byte(key) conversion copies because strings are immutable
// and []byte is mutable; unless escape analysis proves the slice
// never escapes, that copy is a heap allocation.

// Rust: hash computation for cache key lookup
// Everything is stack-allocated, nothing touches the heap
use std::hash::Hasher;
use twox_hash::XxHash64; // assuming the twox-hash crate

fn compute_hash(key: &str) -> u64 {
    let mut hasher = XxHash64::default(); // hasher state lives on the stack
    hasher.write(key.as_bytes());          // borrows key, zero copy
    hasher.finish()                        // returns u64, stack
}   // hasher dropped here, stack frame reclaimed, zero GC involvement

The []byte(key) conversion in Go is particularly relevant for cache infrastructure. Every cache lookup that takes a string key and hashes it through a []byte API converts the string to a byte slice, which in the common case allocates on the heap. At 1 million lookups per second, that is on the order of 1 million small heap allocations per second, all of which become garbage immediately after the hash is computed. This is a significant contributor to GC pressure in high-throughput Go caches. Ristretto mitigates it with internal pooling and batching, and hashing libraries expose string-specialized entry points that skip the conversion, but the string-to-[]byte boundary keeps reappearing elsewhere in the write path. Rust borrows the string's bytes in place with zero copies and zero allocations.
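
A quick way to see this on your own toolchain is a Go micro-benchmark with allocation reporting. This is a sketch, assuming the cespare xxhash package implied by the snippet above; exact allocation counts depend on the Go version and escape analysis, which is the point of measuring rather than asserting.

package cachehash

import (
    "testing"

    "github.com/cespare/xxhash/v2"
)

// Hash through the []byte API: the string-to-slice conversion is the
// allocation described above.
func BenchmarkHashViaByteSlice(b *testing.B) {
    b.ReportAllocs()
    key := "user:123:session"
    for i := 0; i < b.N; i++ {
        h := xxhash.New()
        h.Write([]byte(key)) // conversion copies the key; often a heap allocation
        _ = h.Sum64()
    }
}

// Hash via the string-specialized entry point: no conversion, no copy.
func BenchmarkHashViaString(b *testing.B) {
    b.ReportAllocs()
    key := "user:123:session"
    for i := 0; i < b.N; i++ {
        _ = xxhash.Sum64String(key)
    }
}

Run with go test -bench . -benchmem and compare the allocs/op column between the two.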

Benchmark: Tail Latency Under Load

We benchmarked Go Ristretto and Rust Cachee under identical conditions: 10 million cache entries, 256-byte values, 95:5 read-to-write ratio, running on a c6g.4xlarge (16 vCPU, 32GB RAM, Graviton3). Both caches were configured with the same maximum memory and the same number of concurrent reader and writer goroutines/threads. The benchmark ran for 120 seconds to capture multiple GC cycles in the Go implementation.

Metric               | Go Ristretto          | Rust Cachee     | Difference
P50                  | 89ns                  | 31ns            | 2.9x
P90                  | 142ns                 | 45ns            | 3.2x
P95                  | 318ns                 | 62ns            | 5.1x
P99                  | 4,200ns (4.2us)       | 89ns            | 47.2x
P99.9                | 1,850,000ns (1.85ms)  | 142ns           | 13,028x
P99.99               | 8,200,000ns (8.2ms)   | 310ns           | 26,452x
Max                  | 12,400,000ns (12.4ms) | 1,200ns (1.2us) | 10,333x
Throughput (ops/sec) | 11.2M                 | 32.3M           | 2.9x
Memory used          | 18.4 GB               | 6.2 GB          | 3.0x
GC pauses (120s)     | 47 cycles             | 0               | N/A

The P50 numbers show a 2.9x difference -- meaningful but not dramatic. Both are fast. If you looked only at P50, you might reasonably conclude that Go is "fast enough." But look at the tail. At P99, Go is 47x slower. At P99.9, it is 13,028x slower. At P99.99, it is 26,452x slower. The Go latency distribution has a long right tail caused by GC pauses and concurrent GC CPU contention. The Rust distribution is tight and predictable, with the max observation at 1.2 microseconds -- a CPU cache miss or OS scheduling event, not a GC pause.

The memory difference is equally significant. Go used 18.4GB to store the same 10 million entries that Rust stored in 6.2GB. This 3x difference is the combination of per-object GC metadata and heap headroom. On a 32GB server, the Go cache can hold approximately 17 million entries. The Rust cache can hold approximately 51 million entries. More entries in cache means higher hit rates, which means fewer database queries, which means lower end-to-end latency for every request in the application.
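
The shape of these distributions is straightforward to reproduce on your own workload. The following is a simplified sketch, not the harness used for the table above: it times a stand-in read in a loop, sorts the samples, and reports percentiles. At nanosecond scales the timer call itself adds tens of nanoseconds per sample, so treat the absolute numbers with care and focus on the spread between P50 and P99.9.

package main

import (
    "fmt"
    "sort"
    "time"
)

func percentile(sorted []time.Duration, p float64) time.Duration {
    return sorted[int(float64(len(sorted)-1)*p)]
}

func main() {
    const n = 1_000_000
    cache := map[string][]byte{"user:123:session": make([]byte, 256)}

    samples := make([]time.Duration, 0, n)
    for i := 0; i < n; i++ {
        start := time.Now()
        _ = cache["user:123:session"] // stand-in for the cache read under test
        samples = append(samples, time.Since(start))
    }

    sort.Slice(samples, func(i, j int) bool { return samples[i] < samples[j] })
    quantiles := []struct {
        label string
        p     float64
    }{{"P50", 0.50}, {"P99", 0.99}, {"P99.9", 0.999}, {"P99.99", 0.9999}}
    for _, q := range quantiles {
        fmt.Printf("%s: %v\n", q.label, percentile(samples, q.p))
    }
}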

P99.9 Is Your Real SLA

If your cache serves 1 million requests per second, P99 means 10,000 requests per second experience the tail latency. P99.9 means 1,000 requests per second. At Go's P99.9 of 1.85ms, a thousand requests per second experience a latency 59,677x worse than the cache's advertised read time. These are not edge cases. At scale, a thousand requests per second is a continuous stream of degraded experiences. Your SLA is not your P50. Your SLA is the latency that the worst-served 0.1% of your users experience, every second, forever.

When Go Is Fine

Not every cache needs nanosecond tail latency. Go is a good choice for cache infrastructure in several scenarios.

Application-level caching with network round trips. If your cache sits behind a network call (HTTP API, gRPC service), the network latency is 100-500 microseconds minimum. A 4-microsecond P99 GC pause on a 200-microsecond network call is 2% overhead. Nobody will notice. The network is the bottleneck, not the cache runtime. Go's developer productivity, tooling, and operational simplicity are genuine advantages when the cache runtime is not the critical path.

Low-write workloads. If your cache has a very low write rate (under 1,000 writes/sec), the GC runs infrequently. With fewer GC cycles, there are fewer pauses, and the tail latency impact is proportionally smaller. A Go cache that GC-pauses once every 30 seconds, for 50 microseconds, affects an insignificant fraction of requests.

Prototyping and iteration speed. Go compiles in seconds, has excellent error messages, and most Go developers can be productive on day one. If you are building a cache-backed service and need to iterate quickly on the application logic, Go lets you move faster. You can always replace the cache layer with a Rust implementation later if tail latency becomes a problem. The cache is typically a well-defined interface that can be swapped without rewriting the application.
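
That swap is easier if the application depends on a narrow interface rather than a concrete cache type. A hypothetical Go sketch; the names are illustrative, not an API from any of the libraries mentioned:

package cache

// The application codes against this interface. The backing store can start
// as an in-process Go cache and later move to a different implementation
// (including one wrapping a Rust core) without touching call sites.
type Cache interface {
    Get(key string) (value []byte, ok bool)
    Set(key string, value []byte, cost int64)
    Delete(key string)
}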

Team expertise. A well-written Go cache outperforms a poorly-written Rust cache. Rust's ownership model eliminates GC pauses, but it does not eliminate algorithmic mistakes, poor data structure choices, or unnecessary allocations. If your team is expert in Go and novice in Rust, the Go implementation may have better overall performance because the code quality is higher. Rust wins at the margins only when both implementations are competently written.

When Rust Is Required

Rust is the correct choice when the cache is infrastructure, not application. The distinction is specific.

In-process cache tiers. Cachee L1 runs inside the application's memory space, serving reads without any network round trip. The cache read latency is the application's floor -- no operation in the application can be faster than an L1 cache read. If the floor has GC-induced spikes, every operation in the application inherits those spikes. Rust ensures the floor is flat: 31ns, with P99 within 3x of P50.

High-throughput shared infrastructure. A cache service that serves hundreds of application instances cannot afford per-cycle GC pauses because the pause affects every consumer simultaneously. If the shared cache pauses for 1ms, every application waiting on that cache stalls for 1ms. With hundreds of consumers, the blast radius of a single GC pause is hundreds of simultaneous latency spikes across the fleet. Rust's zero-GC runtime eliminates this class of correlated failure entirely.

Latency SLAs in the contract. If your cache SLA specifies P99 latency, Go cannot guarantee it because GC pauses are not under your control. You can tune GOGC, use GOMEMLIMIT, pre-allocate objects, and use sync.Pool -- and a GC cycle will still produce a tail latency spike at some percentile. Rust can guarantee P99 latency because there is no GC to surprise you. The only sources of latency variance are hardware (CPU cache misses, TLB misses) and OS (scheduling, page faults), both of which are bounded and predictable.

Memory-constrained environments. When the cache must maximize entries per GB of RAM, Rust's 2-3x memory advantage directly translates to 2-3x more cache entries, which means higher hit rates, which means fewer database queries. On a 16GB container, Rust fits 25 million entries where Go fits 8 million. The hit rate difference between 25M and 8M entries is not 3x -- it follows a power-law distribution where each additional million entries captures a diminishing but still meaningful percentage of the long tail.

The Compound Effect

The Rust advantages compound. Lower memory usage per entry means more entries fit in cache. More entries means higher hit rates. Higher hit rates mean fewer database queries. Fewer database queries mean lower database load. Lower database load means the database responds faster to the queries that do reach it. Faster database responses mean faster cold-miss latency. The entire system is faster because the cache layer uses less memory per entry.

Similarly, predictable tail latency at the cache layer means predictable tail latency at the application layer. If the cache's P99 is 89ns and the application's P99 is 5ms, the cache contributes less than 0.002% to the tail latency. The application team can focus on optimizing their code, their database queries, and their network calls without worrying that the cache layer is injecting random millisecond-scale pauses. The cache is invisible, which is what cache infrastructure should be.

The choice between Rust and Go for cache infrastructure is not a language war. It is a question of where your cache sits in the architecture. If the cache is behind a network call and the network is the bottleneck, Go is fine. If the cache is in-process and its latency is the floor for everything above it, Rust is required. If your SLA is measured at P50, Go is fine. If your SLA is measured at P99.9, Rust is required. If memory is abundant, Go is fine. If memory is the constraint that determines your hit rate, Rust is required. The margins matter for infrastructure. Cachee is written in Rust because cache infrastructure lives at the margins, and the margins are where Rust wins.

The Bottom Line

Go and Rust both handle millions of cache operations per second. The throughput difference is 2-3x. The tail latency difference is 10,000-26,000x. Go's garbage collector causes P99.9 spikes of 1.85ms on a cache that promises 31ns reads. Rust has zero GC -- P99 is 89ns, P99.9 is 142ns, max is 1.2 microseconds. Go uses 2-3x more memory for the same data due to GC metadata and heap headroom. For application caching behind a network call, Go is fine. For infrastructure caching where P99 is the contract and memory determines hit rate, Rust is not a preference. It is a requirement.

Your cache's tail latency is your application's floor. Cachee delivers 31ns P50 and 89ns P99 with zero GC pauses, ever.
