Overview

Caches go stale. Invalidation messages get lost. TTLs are set too high. Background jobs crash. The result is silent data drift: your cache says one thing, your database says another, and your users see stale data without anyone knowing.

The SelfHealingEngine solves this by continuously sampling cached values against their source of truth. It uses Fisher-Yates sampling to select random keys, fetches the current value from the source, compares it to the cached value, and automatically repairs mismatches. Each key prefix gets a consistency score that quantifies how trustworthy that section of your cache is.

When to Use

Enable self-healing when you cannot guarantee that every mutation reaches the cache. This includes systems with event-driven invalidation (messages can be lost), write-through caches backed by eventually consistent sources, and any cache where TTLs are the primary freshness mechanism.

Architecture

Rust
    use std::sync::Arc;
    use std::time::{Duration, Instant};

    use dashmap::DashMap;

    // CacheEngine, SourceConfig, and HealingMetrics are engine types not shown here.
    struct SelfHealingEngine {
        cache: Arc<CacheEngine>,
        sources: DashMap<String, SourceConfig>,    // prefix → source endpoint
        scores: DashMap<String, ConsistencyScore>,
        sample_rate: f64,                          // fraction of keys to sample per interval
        verify_interval: Duration,                 // how often to run the sampling loop
        metrics: HealingMetrics,
    }

    struct ConsistencyScore {
        prefix: String,
        sampled: u64,        // total keys sampled
        consistent: u64,     // keys that matched source
        repaired: u64,       // keys that were auto-repaired
        score: f64,          // consistent / sampled (0.0 – 1.0)
        last_check: Instant,
    }

Sampling with Fisher-Yates

Each verification interval, the engine selects a random sample of keys using a partial Fisher-Yates shuffle. This guarantees uniform sampling without replacement within each interval. The sample size is total_keys * sample_rate. At the default 1% sample rate with 1M keys, that is 10,000 keys verified per interval.
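The selection step can be sketched as follows. This is a minimal illustration, not the engine's actual code: `XorShift64`, `sample_size`, and `sample_keys` are hypothetical names, and the small xorshift generator is only a stand-in so the sketch has no dependencies. The sketch also assumes the sample size is rounded to the nearest whole key, which the text above does not specify.

```rust
/// Tiny xorshift64 generator, used only to keep this sketch dependency-free.
/// A real engine would more likely rely on a vetted RNG crate.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // xorshift must not start from zero, so fall back to a fixed constant.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Returns a value in `0..bound` (bound must be > 0). The modulo adds a
    /// slight bias, acceptable for a sketch but not for production sampling.
    fn below(&mut self, bound: usize) -> usize {
        (self.next_u64() % bound as u64) as usize
    }
}

/// Keys to verify in one interval: total_keys * sample_rate,
/// rounded to the nearest key (an assumption) and capped at total_keys.
fn sample_size(total_keys: usize, sample_rate: f64) -> usize {
    let rate = sample_rate.clamp(0.0, 1.0);
    ((total_keys as f64 * rate).round() as usize).min(total_keys)
}

/// Partial Fisher-Yates: fix only the first `k` slots, then return them.
/// Each slot `i` receives a key drawn uniformly from the unfixed tail `i..len`,
/// so no key can be selected twice.
fn sample_keys<'a>(keys: &'a mut [String], k: usize, rng: &mut XorShift64) -> &'a [String] {
    let k = k.min(keys.len());
    for i in 0..k {
        let j = i + rng.below(keys.len() - i);
        keys.swap(i, j);
    }
    &keys[..k]
}
```

Stopping after `k` swaps is what makes the shuffle partial: the cost is proportional to the sample size rather than to the total number of keys.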

Source Verification

For each sampled key, the engine resolves the key's prefix to a source configuration (typically an HTTP endpoint or database query template). It fetches the current value from the source and performs a byte-level comparison with the cached value.

  1. Match: The key is consistent. Increment the consistent counter.
  2. Mismatch: The cached value differs from the source. When healing.auto_repair is true, the engine writes the source value to the cache via SET and increments the repaired counter. The drift event is logged either way.
  3. Source unavailable: Skip the key and do not count it against the consistency score. Record the failure in the healing.source_failures metric.
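The three outcomes above can be sketched as a small classifier plus a counter update. The names `Outcome`, `Counters`, `classify`, and `apply` are hypothetical. The sketch also assumes that a mismatch still counts toward `sampled` and that a missing source response is modelled as `None`; neither detail is confirmed by the engine itself.

```rust
/// Result of comparing one sampled key against its source.
#[derive(Debug, PartialEq)]
enum Outcome {
    Match,
    Mismatch,
    SourceUnavailable,
}

/// Per-prefix counters, loosely mirroring the fields of ConsistencyScore.
#[derive(Debug, Default)]
struct Counters {
    sampled: u64,
    consistent: u64,
    repaired: u64,
    source_failures: u64,
}

/// Byte-level comparison. `source` is `None` when the fetch failed or timed out.
fn classify(cached: &[u8], source: Option<&[u8]>) -> Outcome {
    match source {
        None => Outcome::SourceUnavailable,
        Some(bytes) if bytes == cached => Outcome::Match,
        Some(_) => Outcome::Mismatch,
    }
}

/// Updates the counters for one sampled key and returns the bytes that should
/// be written back to the cache, if a repair is due.
fn apply<'a>(
    counters: &mut Counters,
    cached: &[u8],
    source: Option<&'a [u8]>,
    auto_repair: bool,
) -> Option<&'a [u8]> {
    match classify(cached, source) {
        Outcome::Match => {
            counters.sampled += 1;
            counters.consistent += 1;
            None
        }
        Outcome::Mismatch => {
            counters.sampled += 1;
            if auto_repair {
                counters.repaired += 1;
                source // caller would SET this value in the cache
            } else {
                None // drift is logged but not repaired
            }
        }
        Outcome::SourceUnavailable => {
            // Skipped: `sampled` is untouched, so the score is unaffected.
            counters.source_failures += 1;
            None
        }
    }
}
```

Keeping the comparison separate from the side effects makes the "source unavailable" rule easy to check: that branch never touches `sampled` or `consistent`.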

Consistency Score

Each key prefix maintains a rolling consistency score: score = consistent / sampled. Scores are computed over a sliding window whose size is set by healing.score_window (default: the last 1000 samples per prefix).
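One way to keep such a rolling score is a fixed-size queue of match results, sketched below. `RollingScore` is a hypothetical type, and returning `None` before any sample has been recorded is an assumption of this sketch rather than documented behaviour.

```rust
use std::collections::VecDeque;

/// Rolling consistency score over the most recent `window` samples.
struct RollingScore {
    window: usize,
    results: VecDeque<bool>, // true = cached value matched the source
    consistent: usize,       // number of `true` entries currently in the window
}

impl RollingScore {
    fn new(window: usize) -> Self {
        RollingScore {
            window: window.max(1), // a zero-sized window would never hold a sample
            results: VecDeque::new(),
            consistent: 0,
        }
    }

    /// Records one verification result, evicting the oldest when the window is full.
    fn record(&mut self, matched: bool) {
        if self.results.len() == self.window {
            if let Some(true) = self.results.pop_front() {
                self.consistent -= 1;
            }
        }
        self.results.push_back(matched);
        if matched {
            self.consistent += 1;
        }
    }

    /// score = consistent / sampled, restricted to the current window.
    fn score(&self) -> Option<f64> {
        if self.results.is_empty() {
            None
        } else {
            Some(self.consistent as f64 / self.results.len() as f64)
        }
    }
}
```

Maintaining the `consistent` count incrementally keeps both `record` and `score` constant-time, regardless of the window size.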

RESP
    # Get consistency score for a specific prefix
    CONSISTENCY SCORE "user:"
    # → { "prefix": "user:", "score": 0.9987, "sampled": 4200, "repaired": 5 }

    # Get all prefix scores
    CONSISTENCY SCORES
    # → [
    #     { "prefix": "user:", "score": 0.9987 },
    #     { "prefix": "product:", "score": 0.9942 },
    #     { "prefix": "session:", "score": 1.0000 }
    #   ]

    # Detailed stats including repair history
    CONSISTENCY STATS
    # → { "total_sampled": 15200, "total_repaired": 12, "avg_score": 0.9976,
    #     "source_failures": 3, "last_interval_ms": 5012 }

Score Interpretation

  1.0 = every sampled key matched its source.
  0.99+ = normal for most workloads (a few races between source writes and cache invalidation).
  <0.95 = investigate your invalidation pipeline; significant drift is occurring.
  <0.90 = your cache is serving materially stale data for this prefix.

Configuration

Parameter                    Default   Description
healing.enabled              false     Enable the self-healing background loop
healing.sample_rate          0.01      Fraction of keys to sample per interval (0.01 = 1%)
healing.verify_interval_ms   5000      How often the sampling loop runs, in milliseconds
healing.auto_repair          true      Automatically write source value to cache on mismatch. When false, mismatches are logged but not repaired.
healing.score_window         1000      Number of samples per prefix used for rolling score calculation
healing.source_timeout_ms    2000      Timeout for source verification HTTP calls, in milliseconds

Config Commands
    CONFIG SET healing.enabled true
    CONFIG SET healing.sample_rate 0.01
    CONFIG SET healing.verify_interval_ms 5000
    CONFIG SET healing.auto_repair true

Registering Sources

Each key prefix must be mapped to a source endpoint before self-healing can verify it.

RESP
    # Register source for user: prefix
    CONSISTENCY SOURCE "user:" "https://api.internal/users/{key}"
    # {key} is replaced with the actual cache key
    CONSISTENCY SOURCE "product:" "https://api.internal/products/{key}"
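How a sampled key might be turned into a source URL can be sketched as a prefix lookup followed by template substitution. `resolve_source_url` is a hypothetical helper. Letting the longest matching prefix win when several match is an assumption, and the sketch does no URL encoding of the key.

```rust
/// Resolves a cache key to its source URL.
/// `sources` holds (prefix, URL template) pairs as registered with CONSISTENCY SOURCE.
/// Returns `None` when no registered prefix matches, so the key cannot be verified.
fn resolve_source_url(sources: &[(&str, &str)], key: &str) -> Option<String> {
    sources
        .iter()
        .filter(|(prefix, _)| key.starts_with(*prefix))
        // Assumption: when several prefixes match, the longest one wins.
        .max_by_key(|(prefix, _)| prefix.len())
        // {key} is replaced with the full cache key, prefix included.
        .map(|(_, template)| template.replace("{key}", key))
}
```

With the two registrations shown above, the key `user:42` would resolve to `https://api.internal/users/user:42`, while a key such as `session:abc` has no source and would be skipped.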

Metrics & Observability

Metric                        Type       Description
healing.samples_total         Counter    Total keys sampled across all intervals
healing.repairs_total         Counter    Total auto-repairs performed
healing.source_failures       Counter    Source verification calls that failed (timeout, error)
healing.interval_duration_ms  Histogram  Time taken to complete each sampling interval
healing.score.{prefix}        Gauge      Current consistency score per prefix (0.0–1.0)

Source Load

Self-healing generates read traffic to your source of truth. At a 1% sample rate, 1M keys, and a 5-second interval, the engine verifies 10,000 keys every 5 seconds, which averages about 2,000 source requests per second. Size your source capacity accordingly, or reduce the sample rate for large caches. The engine respects the healing.source_timeout_ms setting and backs off on persistent failures.
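The load estimate is simple arithmetic, and it can be inverted to pick a sample rate that fits a request budget. Both functions below are hypothetical helpers for capacity planning, not part of the engine, and they assume requests are spread evenly across the interval.

```rust
/// Average source requests per second generated by the sampling loop:
/// (total_keys * sample_rate) keys every verify_interval_ms milliseconds.
fn source_requests_per_second(total_keys: u64, sample_rate: f64, verify_interval_ms: u64) -> f64 {
    let keys_per_interval = total_keys as f64 * sample_rate;
    let interval_seconds = verify_interval_ms as f64 / 1000.0;
    keys_per_interval / interval_seconds
}

/// Largest sample rate that keeps the average load at or below `budget_rps`,
/// capped at 1.0 (sampling every key each interval).
fn max_sample_rate(total_keys: u64, verify_interval_ms: u64, budget_rps: f64) -> f64 {
    if total_keys == 0 {
        return 1.0; // nothing to sample, so any rate fits the budget
    }
    let interval_seconds = verify_interval_ms as f64 / 1000.0;
    (budget_rps * interval_seconds / total_keys as f64).min(1.0)
}
```

For example, with 1M keys and a 5-second interval, a budget of 500 requests per second works out to a sample rate of 0.0025.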