Overview
Caches go stale. Invalidation messages get lost. TTLs are set too high. Background jobs crash. The result is silent data drift: your cache says one thing, your database says another, and your users see stale data without anyone knowing.
The SelfHealingEngine solves this by continuously sampling cached values against their source of truth. It uses Fisher-Yates sampling to select random keys, fetches the current value from the source, compares it to the cached value, and automatically repairs mismatches. Each key prefix gets a consistency score that quantifies how trustworthy that section of your cache is.
Enable self-healing when you cannot guarantee that every mutation reaches the cache. This includes systems with event-driven invalidation (where messages can be lost), write-through caches backed by eventually consistent sources, and any cache where TTLs are the primary freshness mechanism.
Architecture
Sampling with Fisher-Yates
Each verification interval, the engine selects a random sample of keys using a partial Fisher-Yates shuffle. This guarantees uniform sampling without replacement within each interval. The sample size is total_keys * sample_rate — at the default 1% sample rate with 1M keys, 10,000 keys are verified per interval.
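As a rough illustration of the sampling step, here is a minimal Python sketch of a partial Fisher-Yates shuffle. The function name `sample_keys` and its signature are assumptions made for this example and are not part of the engine's API. The sketch only shows why stopping the shuffle after k swaps yields a uniform sample without replacement.

```python
import random
from typing import List, Optional, Sequence


def sample_keys(keys: Sequence[str], sample_rate: float,
                rng: Optional[random.Random] = None) -> List[str]:
    """Pick int(len(keys) * sample_rate) distinct keys uniformly at random."""
    if rng is None:
        rng = random.Random()
    pool = list(keys)                 # copy so the caller's sequence is untouched
    n = len(pool)
    k = min(n, int(n * sample_rate))  # sample size = total_keys * sample_rate
    for i in range(k):
        # Swap slot i with a random slot from the not-yet-fixed tail [i, n).
        j = rng.randrange(i, n)
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:k]                   # the first k slots now hold the sample
```

Only k swaps are performed rather than a full shuffle, so the shuffle work scales with the sample size. This sketch still copies the full key list once, which a production implementation would likely avoid.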
Source Verification
For each sampled key, the engine resolves the key's prefix to a source configuration (typically an HTTP endpoint or database query template). It fetches the current value from the source and performs a byte-level comparison with the cached value.
- Match: The key is consistent. Increment the `consistent` counter.
- Mismatch: The cached value differs from the source. The engine writes the source value to the cache via `SET`, increments the `repaired` counter, and logs the drift event.
- Source unavailable: Skip the key. Do not count it against the consistency score. Log a source-failure metric.
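The three outcomes above could be sketched as follows. This is an illustrative model, not the engine's implementation: `SourceUnavailable`, `Counters`, `verify_key`, and the `fetch_source` callback are hypothetical names, and a plain dict stands in for the cache (the dict assignment plays the role of `SET`).

```python
from dataclasses import dataclass
from typing import Callable, Dict


class SourceUnavailable(Exception):
    """Hypothetical error a fetcher raises on timeout or source failure."""


@dataclass
class Counters:
    sampled: int = 0          # keys that count toward the consistency score
    consistent: int = 0
    repaired: int = 0
    source_failures: int = 0


def verify_key(key: str, cache: Dict[str, bytes],
               fetch_source: Callable[[str], bytes],
               counters: Counters, auto_repair: bool = True) -> str:
    try:
        source_value = fetch_source(key)
    except SourceUnavailable:
        # Skipped keys are not counted against the score.
        counters.source_failures += 1
        return "skipped"
    counters.sampled += 1
    if cache.get(key) == source_value:   # byte-level comparison
        counters.consistent += 1
        return "match"
    if auto_repair:
        cache[key] = source_value        # stands in for SET
        counters.repaired += 1
    return "mismatch"
```

Note that `sampled` is incremented only after the source responds, so a flaky source lowers throughput but does not drag down the score.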
Consistency Score
Each key prefix maintains a rolling consistency score: score = consistent / sampled. Scores are computed over a sliding window (default: last 1000 samples per prefix).
- 1.0: every sampled key matched its source.
- 0.99+: normal for most workloads (a few races between source writes and cache invalidation).
- <0.95: investigate your invalidation pipeline; significant drift is occurring.
- <0.90: your cache is serving materially stale data for this prefix.
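One way to picture the rolling score is a fixed-size window of match/mismatch results per prefix. The sketch below assumes a simple `deque`-based window; the class name `PrefixScore` is hypothetical, and returning `None` before any samples arrive is a choice made for this example, not documented engine behavior.

```python
from collections import deque
from typing import Deque, Optional


class PrefixScore:
    """Rolling consistency score over the last `window` samples of one prefix."""

    def __init__(self, window: int = 1000) -> None:
        # A bounded deque drops the oldest result once the window is full.
        self._results: Deque[bool] = deque(maxlen=window)

    def record(self, consistent: bool) -> None:
        self._results.append(consistent)

    @property
    def score(self) -> Optional[float]:
        if not self._results:
            return None  # assumption: no score until at least one sample
        # score = consistent / sampled, restricted to the window
        return sum(self._results) / len(self._results)
```

For example, with `window=4`, recording three matches and one mismatch gives a score of 0.75, and the mismatch stops affecting the score once four newer samples have pushed it out of the window.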
Configuration
| Parameter | Default | Description |
|---|---|---|
| `healing.enabled` | `false` | Enable the self-healing background loop |
| `healing.sample_rate` | `0.01` | Fraction of keys to sample per interval (0.01 = 1%) |
| `healing.verify_interval_ms` | `5000` | How often the sampling loop runs, in milliseconds |
| `healing.auto_repair` | `true` | Automatically write source value to cache on mismatch. When false, mismatches are logged but not repaired. |
| `healing.score_window` | `1000` | Number of samples per prefix used for rolling score calculation |
| `healing.source_timeout_ms` | `2000` | Timeout for source verification HTTP calls, in milliseconds |
Registering Sources
Each key prefix must be mapped to a source endpoint before self-healing can verify it.
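The registration call itself is not shown in this section, so the following is only a conceptual sketch of a prefix-to-source mapping. `SourceRegistry`, `register`, `resolve`, the `{id}` placeholder, and the example URLs are all hypothetical. Longest-prefix matching is likewise an assumption made here to keep overlapping prefixes unambiguous.

```python
from typing import Dict, Optional


class SourceRegistry:
    """Hypothetical mapping from key prefix to a source URL template."""

    def __init__(self) -> None:
        self._sources: Dict[str, str] = {}

    def register(self, prefix: str, url_template: str) -> None:
        self._sources[prefix] = url_template

    def resolve(self, key: str) -> Optional[str]:
        # Assumption: when several prefixes match, the longest one wins.
        matches = [p for p in self._sources if key.startswith(p)]
        if not matches:
            return None  # unregistered prefix: nothing to verify against
        prefix = max(matches, key=len)
        return self._sources[prefix].format(id=key[len(prefix):])


registry = SourceRegistry()
registry.register("user:", "https://api.example.com/users/{id}")
registry.register("user:session:", "https://api.example.com/sessions/{id}")
```

With these two registrations, `registry.resolve("user:42")` yields the users URL for id `42`, while a key such as `order:1` resolves to `None` and would be left unverified in this model.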
Metrics & Observability
| Metric | Type | Description |
|---|---|---|
| `healing.samples_total` | Counter | Total keys sampled across all intervals |
| `healing.repairs_total` | Counter | Total auto-repairs performed |
| `healing.source_failures` | Counter | Source verification calls that failed (timeout, error) |
| `healing.interval_duration_ms` | Histogram | Time taken to complete each sampling interval |
| `healing.score.{prefix}` | Gauge | Current consistency score per prefix (0.0–1.0) |
Self-healing generates read traffic to your source of truth. At a 1% sample rate, 1M keys, and a 5-second interval, the engine verifies 10,000 keys every 5 seconds, or roughly 2,000 source requests per second. Size your source capacity accordingly, or reduce the sample rate for large caches. The engine respects the `healing.source_timeout_ms` setting and backs off on persistent failures.
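The load estimate is simple arithmetic, and a small helper makes it easy to try other settings. Both function names below are made up for this example. The calculation assumes one source request per sampled key, spread evenly across the interval.

```python
def source_requests_per_second(total_keys: int, sample_rate: float,
                               verify_interval_ms: int) -> float:
    """Average source read rate generated by the sampling loop."""
    keys_per_interval = total_keys * sample_rate
    return keys_per_interval / (verify_interval_ms / 1000)


def max_sample_rate(total_keys: int, verify_interval_ms: int,
                    budget_rps: float) -> float:
    """Largest sample rate that keeps the average load within budget_rps."""
    return budget_rps * (verify_interval_ms / 1000) / total_keys


# Defaults from the configuration table, with 1M keys.
print(source_requests_per_second(1_000_000, 0.01, 5000))  # 2000.0
```

Under the same assumptions, a source that can spare 500 requests per second for verification would call for a sample rate of about 0.0025 with 1M keys and a 5-second interval.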