Self-Healing Consistency Spec

Overview

Caches go stale. Invalidation messages get lost. TTLs are set too high. Background jobs crash. The result is silent data drift: your cache says one thing, your database says another, and your users see stale data without anyone knowing.

The SelfHealingEngine addresses this by periodically sampling cached values against their source of truth. It uses a partial Fisher-Yates shuffle to select random keys, fetches the current value from the source, compares it to the cached value, and repairs mismatches automatically (unless healing.auto_repair is disabled). Each key prefix gets a consistency score that quantifies how trustworthy that section of your cache is.

When to Use

Enable self-healing when you cannot guarantee that every mutation reaches the cache. This includes systems with event-driven invalidation (messages can be lost), write-through caches backed by eventually consistent sources, and any cache where TTLs are the primary freshness mechanism.

Architecture

Rust
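use std::sync::Arc;
use std::time::{Duration, Instant};
use dashmap::DashMap; // concurrent map from the `dashmap` crate

// CacheEngine, SourceConfig, and HealingMetrics are not shown here.
struct SelfHealingEngine {
    cache:           Arc<CacheEngine>,
    sources:         DashMap<String, SourceConfig>,     // prefix → source endpoint
    scores:          DashMap<String, ConsistencyScore>, // prefix → consistency score
    sample_rate:     f64,      // fraction of keys to sample per interval
    verify_interval: Duration, // how often to run the sampling loop
    metrics:         HealingMetrics,
}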

struct ConsistencyScore {
    prefix:      String,
    sampled:     u64,   // total keys sampled
    consistent:  u64,   // keys that matched source
    repaired:    u64,   // keys that were auto-repaired
    score:       f64,   // consistent / sampled (0.0 – 1.0)
    last_check:  Instant,
}
        

Sampling with Fisher-Yates

Each verification interval, the engine selects a random sample of keys using a partial Fisher-Yates shuffle. Within an interval, every key is equally likely to be chosen and no key is sampled twice (uniform sampling without replacement). The sample size is total_keys * sample_rate: at the default 1% sample rate with 1M keys, 10,000 keys are verified per interval.
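The selection step can be sketched as follows. This is a minimal illustration, not the engine's actual code: the XorShift64 generator, sample_size, and sample_keys are hypothetical names, and the small built-in generator is used only so the sketch needs no external crates.

```rust
/// Tiny xorshift64 generator, included only to keep the sketch dependency-free.
/// (Assumption: a real engine would use a proper RNG.)
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // xorshift state must never be zero
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Index in [0, bound). The modulo adds a slight bias, tolerable in a sketch.
    fn below(&mut self, bound: usize) -> usize {
        (self.next_u64() % bound as u64) as usize
    }
}

/// Number of keys to verify in one interval: total_keys * sample_rate.
fn sample_size(total_keys: usize, sample_rate: f64) -> usize {
    ((total_keys as f64) * sample_rate).round() as usize
}

/// Partial Fisher-Yates: shuffle only the first `k` positions and return them.
/// Each position is fixed once, so no key can appear twice in the sample.
fn sample_keys(keys: &mut [String], k: usize, rng: &mut XorShift64) -> Vec<String> {
    let n = keys.len();
    let k = k.min(n);
    for i in 0..k {
        // Pick j from the not-yet-fixed tail [i, n) and move it into slot i.
        let j = i + rng.below(n - i);
        keys.swap(i, j);
    }
    keys[..k].to_vec()
}
```

Stopping after k swaps rather than shuffling the whole key list keeps the work proportional to the sample size, which matters when the sample is 1% of a large keyspace.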

Source Verification

For each sampled key, the engine resolves the key's prefix to a source configuration (typically an HTTP endpoint or database query template). It fetches the current value from the source and performs a byte-level comparison with the cached value.

Match: The key is consistent. Increment the consistent counter.
Mismatch: The cached value differs from the source. When healing.auto_repair is true, the engine writes the source value to the cache via SET and increments the repaired counter. The drift event is logged in either case.
Source unavailable: Skip the key. Do not count it against the consistency score. Increment the healing.source_failures counter.
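The three outcomes above can be sketched as a single decision function. The names Outcome, Counters, and verify are hypothetical. The sketch also assumes that skipped keys are left out of the sampled count, which is one way to keep them from counting against the score.

```rust
/// Result of checking one sampled key against its source.
#[derive(Debug, PartialEq)]
enum Outcome {
    Match,
    Repaired,          // mismatch, source value written back
    DriftLogged,       // mismatch, auto-repair disabled
    SourceUnavailable, // source timed out or returned an error
}

#[derive(Default)]
struct Counters {
    sampled: u64,
    consistent: u64,
    repaired: u64,
    source_failures: u64,
}

/// `source` is Err when the source could not be reached.
/// `cached` is updated in place to stand in for a cache SET.
fn verify(
    cached: &mut Vec<u8>,
    source: Result<Vec<u8>, String>,
    auto_repair: bool,
    c: &mut Counters,
) -> Outcome {
    let fresh = match source {
        Ok(bytes) => bytes,
        Err(_) => {
            // Assumption: skipped keys are excluded from `sampled`,
            // so they cannot lower the score.
            c.source_failures += 1;
            return Outcome::SourceUnavailable;
        }
    };
    c.sampled += 1;
    if *cached == fresh {
        // Byte-level comparison succeeded.
        c.consistent += 1;
        Outcome::Match
    } else if auto_repair {
        *cached = fresh;
        c.repaired += 1;
        Outcome::Repaired
    } else {
        Outcome::DriftLogged
    }
}
```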

Consistency Score

Each key prefix maintains a rolling consistency score: score = consistent / sampled, computed over a sliding window of the most recent samples for that prefix (default: the last 1000, set by healing.score_window).
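One way to keep such a rolling score is a fixed-size queue of match results. The sketch below is an assumption about how a window could be maintained, not a description of the engine's internals. RollingScore is a hypothetical name.

```rust
use std::collections::VecDeque;

/// Rolling consistency score over the last `window` samples of one prefix.
struct RollingScore {
    window: usize,
    results: VecDeque<bool>, // true = cached value matched the source
    consistent: usize,       // number of `true` entries currently in the window
}

impl RollingScore {
    fn new(window: usize) -> Self {
        let window = window.max(1); // a zero-sized window would never evict
        RollingScore {
            window,
            results: VecDeque::with_capacity(window),
            consistent: 0,
        }
    }

    /// Record one sample, evicting the oldest once the window is full.
    fn record(&mut self, matched: bool) {
        if self.results.len() == self.window {
            if let Some(oldest) = self.results.pop_front() {
                if oldest {
                    self.consistent -= 1;
                }
            }
        }
        self.results.push_back(matched);
        if matched {
            self.consistent += 1;
        }
    }

    /// consistent / sampled over the window, or None before the first sample.
    fn score(&self) -> Option<f64> {
        if self.results.is_empty() {
            None
        } else {
            Some(self.consistent as f64 / self.results.len() as f64)
        }
    }
}
```

Because old samples fall out of the window, a prefix that drifted in the past recovers its score once recent samples match again.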

RESP
# Get consistency score for a specific prefix
CONSISTENCY SCORE "user:"
# → { "prefix": "user:", "score": 0.9987, "sampled": 4200, "repaired": 5 }

# Get all prefix scores
CONSISTENCY SCORES
# → [
#   { "prefix": "user:", "score": 0.9987 },
#   { "prefix": "product:", "score": 0.9942 },
#   { "prefix": "session:", "score": 1.0000 }
# ]

# Detailed stats including repair history
CONSISTENCY STATS
# → { "total_sampled": 15200, "total_repaired": 12, "avg_score": 0.9976,
#      "source_failures": 3, "last_interval_ms": 5012 }
        

Score Interpretation

1.0: Every sampled key matched its source.
0.99+: Normal for most workloads (a few races between source writes and cache invalidation).
<0.95: Investigate your invalidation pipeline; significant drift is occurring.
<0.90: Your cache is serving materially stale data for this prefix.
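For alerting, the thresholds above could be turned into a small classifier. ScoreBand and classify are hypothetical names. The range from 0.95 up to 0.99 is not described above, so the sketch labels it Unclassified rather than guessing at its meaning.

```rust
#[derive(Debug, PartialEq)]
enum ScoreBand {
    Perfect,         // 1.0: every sampled key matched
    Normal,          // 0.99 and above
    Unclassified,    // 0.95 up to 0.99: no guidance given for this range
    Investigate,     // below 0.95
    MateriallyStale, // below 0.90
}

/// Map a consistency score (0.0 to 1.0) to the interpretation bands.
fn classify(score: f64) -> ScoreBand {
    if score >= 1.0 {
        ScoreBand::Perfect
    } else if score >= 0.99 {
        ScoreBand::Normal
    } else if score >= 0.95 {
        ScoreBand::Unclassified
    } else if score >= 0.90 {
        ScoreBand::Investigate
    } else {
        ScoreBand::MateriallyStale
    }
}
```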

Configuration

Parameter	Default	Description
`healing.enabled`	false	Enable the self-healing background loop
`healing.sample_rate`	0.01	Fraction of keys to sample per interval (0.01 = 1%)
`healing.verify_interval_ms`	5000	How often the sampling loop runs, in milliseconds
`healing.auto_repair`	true	Automatically write source value to cache on mismatch. When false, mismatches are logged but not repaired.
`healing.score_window`	1000	Number of samples per prefix used for rolling score calculation
`healing.source_timeout_ms`	2000	Timeout for source verification HTTP calls

Config Commands
CONFIG SET healing.enabled true
CONFIG SET healing.sample_rate 0.01
CONFIG SET healing.verify_interval_ms 5000
CONFIG SET healing.auto_repair true
        

Registering Sources

Each key prefix must be mapped to a source endpoint before self-healing can verify it.

RESP
# Register source for user: prefix
CONSISTENCY SOURCE "user:" "https://api.internal/users/{key}"
# {key} is replaced with the actual cache key

CONSISTENCY SOURCE "product:" "https://api.internal/products/{key}"
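The prefix lookup and {key} substitution might look like the sketch below. resolve_source is a hypothetical helper. The sketch assumes that the longest matching prefix wins when several are registered and that the key is substituted verbatim with no URL encoding. Neither detail is specified above.

```rust
use std::collections::HashMap;

/// Returns the source URL for `key`, or None if no registered prefix matches.
/// `sources` maps a key prefix to a URL template containing `{key}`.
fn resolve_source(sources: &HashMap<String, String>, key: &str) -> Option<String> {
    sources
        .iter()
        .filter(|(prefix, _)| key.starts_with(prefix.as_str()))
        // Assumption: when several prefixes match, the longest one wins.
        .max_by_key(|(prefix, _)| prefix.len())
        // Assumption: the full cache key is inserted as-is, without URL encoding.
        .map(|(_, template)| template.replace("{key}", key))
}
```

Keys whose prefix has no registered source resolve to None, which matches the requirement that a prefix be registered before it can be verified.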
        

Metrics & Observability

Metric	Type	Description
`healing.samples_total`	Counter	Total keys sampled across all intervals
`healing.repairs_total`	Counter	Total auto-repairs performed
`healing.source_failures`	Counter	Source verification calls that failed (timeout, error)
`healing.interval_duration_ms`	Histogram	Time taken to complete each sampling interval
`healing.score.{prefix}`	Gauge	Current consistency score per prefix (0.0–1.0)

Source Load

Self-healing generates read traffic to your source of truth. At a 1% sample rate, 1M keys, and a 5-second interval, the engine verifies 10,000 keys every 5 seconds, or about 2,000 source requests per second. Size your source capacity accordingly, or reduce the sample rate for large caches. The engine respects the healing.source_timeout_ms setting and backs off on persistent failures.
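The load estimate is simple arithmetic, sketched here with two hypothetical helpers. Both assume requests are spread evenly across the interval and ignore retries and backoff.

```rust
/// Average source requests per second generated by the sampling loop:
/// (total_keys * sample_rate) / interval_in_seconds.
fn source_requests_per_second(total_keys: u64, sample_rate: f64, verify_interval_ms: u64) -> f64 {
    let keys_per_interval = total_keys as f64 * sample_rate;
    keys_per_interval / (verify_interval_ms as f64 / 1000.0)
}

/// Largest sample rate that keeps the average load at or below `max_rps`,
/// capped at 1.0 (sampling every key).
fn max_sample_rate(total_keys: u64, verify_interval_ms: u64, max_rps: f64) -> f64 {
    let request_budget = max_rps * (verify_interval_ms as f64 / 1000.0);
    (request_budget / total_keys as f64).min(1.0)
}
```

With the defaults and 1M keys, source_requests_per_second(1_000_000, 0.01, 5000) comes to about 2,000. To stay under a budget of 500 requests per second on the same cache, max_sample_rate(1_000_000, 5000, 500.0) suggests a sample rate of about 0.0025.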
