Computation Fingerprints · Supersession Chains · Content-Addressed

Data Lineage Verification

Where did this data come from? How was it transformed? Has it been modified?
Computation fingerprints answer all three. Cryptographically.

Lineage Verification: 31ns
Fingerprint Size: 32B
PQ Attestation Families: 3
Version History Depth
Definition

Data lineage verification is the cryptographic proof of where data came from, how it was transformed, and whether it has been modified since creation. In Cachee, every computation result carries a computation fingerprint — SHA3-256(input || computation || parameters || version) — that binds the result to its inputs. Data lineage only works if results come from verifiable computation. The result's storage address is derived from this fingerprint, so the address IS the proof: if the data at an address matches the hash, the data is authentic and unmodified.
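The fingerprint construction above can be sketched in a few lines. This is an illustration, not the Cachee wire format: the length-prefixed encoding used to keep the concatenation unambiguous is an assumption.

```python
import hashlib

def fingerprint(input_data: bytes, computation: str, parameters: str, version: str) -> str:
    """Sketch of SHA3-256(input || computation || parameters || version).
    Each field is length-prefixed (an assumption) so that the
    concatenation cannot be ambiguous across field boundaries."""
    h = hashlib.sha3_256()
    for part in (input_data, computation.encode(), parameters.encode(), version.encode()):
        h.update(len(part).to_bytes(8, "big"))  # 8-byte big-endian length prefix
        h.update(part)
    return h.hexdigest()

# The fingerprint doubles as the storage address: any reader can
# recompute the hash over the same fields to verify integrity.
addr = fingerprint(b'{"nav":42.7}', "aggregation", "window=30d", "1.3.0")
```

Because every field participates in the hash, changing any input byte, parameter, or version yields a different address.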

Why It Matters

Three Reasons You Need Verifiable Lineage

Regulatory

GDPR Article 30 / SOX Section 404

GDPR Article 30 requires records of all processing activities. SOX Section 404 requires provable data integrity for financial reporting. Without verifiable lineage, you have metadata logs that can be fabricated. With Cachee, you have cryptographic proof.

Operational

Debugging / Root Cause Analysis

A downstream system produces a wrong result. Which upstream input caused it? Traditional systems require manually tracing ETL pipelines. With computation fingerprints, every result points back to its exact inputs. The lineage graph is self-documenting.

Trust: AI Training Data Provenance

Model Governance / Data Marketplace

AI models are only as trustworthy as their training data. Without verifiable lineage, training data is a black box. With computation fingerprints, every training sample carries proof of origin, transformation history, and integrity — enabling verifiable model governance and trusted data marketplaces.

Architecture

Traditional Lineage vs Cachee

Traditional Lineage Tracking
- Metadata database (separate system)
- ETL logging (can be disconnected from data)
- Manual annotations (can be fabricated)
- Lineage separate from data (can diverge)

The lineage metadata can be modified independently of the data it describes. No integrity binding.

Cachee Computation Fingerprints
- Fingerprint = SHA3-256(inputs || computation)
- Address = fingerprint (content-addressed)
- Lineage IS the data (inseparable)
- PQ-signed by 3 independent families

The data and its lineage are cryptographically bound. Cannot diverge. Cannot be fabricated.
Visual

The Data Lineage Graph

Every result points back to its inputs. Every input points back to its source. The graph is self-verifying.

Input → Computation → Result → Verification

Dataset A (fp: 0x3a1f...b7c2) → ML Inference (model: risk_v2.1) → Risk Score (fp: 0x8a2f...c317) → Verified (3/3 PQ sigs valid)
Dataset B (fp: 0x7d4e...a891) → Aggregation (compute: sum_weighted) → Portfolio NAV (fp: 0x2c9b...df45) → Verified (3/3 PQ sigs valid)
Risk Score (fp: 0x8a2f...c317) + Portfolio NAV (fp: 0x2c9b...df45) → Compliance Check (compute: sox_404) → Final Report (fp: 0xe71a...9f03)

Every node carries a computation fingerprint. Every edge is a verifiable input-output relationship. The entire graph is self-proving.

Data Model

The Provenance Record

Every cached computation result in Cachee carries a provenance record. This record is not metadata stored in a separate system — it is part of the cache entry itself, signed by three PQ families.

struct Provenance {
    computed_by: String,               // "risk_model_v2.1" — what computed this
    computed_at: Timestamp,            // 2026-05-02T12:00:03Z — when
    computation_duration: u64,         // 4,821us — how long it took
    input_fingerprints: Vec<[u8; 32]>, // fingerprints of all inputs
    parameters: Map<String, String>,   // model thresholds, config
    software_version: String,          // exact binary hash
    hardware_id: String,               // platform identifier
    verified_by: [Signature; 3],       // ML-DSA + FALCON + SLH-DSA
    supersedes: Option<[u8; 32]>,      // fingerprint of previous version
}

The input_fingerprints field creates the lineage graph. Each result knows exactly which inputs produced it. The supersedes field creates the supersession chain — the version history of this result.
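How the input_fingerprints field makes the graph self-verifying can be sketched with a toy in-memory store (an illustration only, not Cachee's storage layer): each record hashes its payload together with its input fingerprints, and verification walks the graph recursively.

```python
import hashlib

# Toy content-addressed store: fingerprint -> {payload, input fingerprints}.
store = {}

def put(payload: bytes, input_fps=()) -> str:
    """Store a result whose fingerprint covers the payload AND its inputs."""
    h = hashlib.sha3_256(payload)
    for fp in input_fps:
        h.update(bytes.fromhex(fp))
    fp = h.hexdigest()
    store[fp] = {"payload": payload, "inputs": list(input_fps)}
    return fp

def verify_lineage(fp: str) -> bool:
    """A result is valid only if its own hash matches its address
    AND every input's lineage verifies in turn."""
    rec = store.get(fp)
    if rec is None:
        return False
    h = hashlib.sha3_256(rec["payload"])
    for ifp in rec["inputs"]:
        h.update(bytes.fromhex(ifp))
    return h.hexdigest() == fp and all(verify_lineage(i) for i in rec["inputs"])

a = put(b"dataset_A")
b = put(b"dataset_B")
report = put(b"final_report", input_fps=[a, b])
```

Tampering with any upstream record changes its hash, which breaks the match against its address and so fails verification of every downstream result.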

Version History

Supersession Chains

Every time a result is recomputed, the new version references the previous. The full history is preserved.

risk:portfolio_47 — Full Version History

v1: score 0.68 | model risk_v1.9 | 2026-04-15 09:30 | fp: 0x1a2b...3c4d
v2: score 0.71 | model risk_v2.0 | 2026-04-22 14:15 | fp: 0x5e6f...7a8b | supersedes: 0x1a2b...
v3 (current): score 0.73 | model risk_v2.1 | 2026-05-02 12:00 | fp: 0x8a2f...c317 | supersedes: 0x5e6f...

Every version is preserved. Query any version at any point in time. The supersession chain is the complete history.
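Walking a supersession chain is a simple linked-list traversal over the supersedes field. The records below are hypothetical, with fingerprints abbreviated for readability; they mirror the version history above.

```python
# Hypothetical version records: fingerprint -> record.
# `supersedes` holds the previous version's fingerprint, None for v1.
records = {
    "0x1a2b": {"score": 0.68, "model": "risk_v1.9", "supersedes": None},
    "0x5e6f": {"score": 0.71, "model": "risk_v2.0", "supersedes": "0x1a2b"},
    "0x8a2f": {"score": 0.73, "model": "risk_v2.1", "supersedes": "0x5e6f"},
}

def history(fp: str) -> list:
    """Follow `supersedes` links from the current version back to v1."""
    chain = []
    while fp is not None:
        chain.append(fp)
        fp = records[fp]["supersedes"]
    return chain  # newest first

history("0x8a2f")  # → ["0x8a2f", "0x5e6f", "0x1a2b"]
```

Because each link is a fingerprint rather than a mutable pointer, rewriting history would require forging a hash at every step of the chain.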

Architecture

Content-Addressed Storage

In traditional storage, you choose where to put data (a file path, a database row, a cache key). The address is arbitrary — it has no relationship to the content. In content-addressed storage, the address is derived from the content itself.

// Traditional: address is arbitrary
SET "my_key" "my_value"                  // address tells you nothing about content

// Content-addressed: address IS the content hash
address = SHA3-256("my_value" || computation || params)
SET address "my_value"                   // address proves content integrity

// Verification is implicit:
// If SHA3-256(data_at_address) == address, data is authentic
// No separate integrity check needed

This is the same principle behind Git (content-addressed commits), IPFS (content-addressed files), and Docker (content-addressed layers). Cachee applies it to computation results: the computation fingerprint is both the cache key and the integrity proof. The address IS the proof.
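The "address IS the proof" property can be demonstrated with a minimal content-addressed store (a sketch, not any real system's API): reads recompute the hash, so a successful read is itself the integrity check.

```python
import hashlib

class ContentAddressedStore:
    """Toy content-addressed store: the key is the SHA3-256 of the value,
    so a hash check on read doubles as an integrity proof."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        addr = hashlib.sha3_256(data).hexdigest()
        self._blobs[addr] = data
        return addr  # the writer does not choose the address

    def get(self, addr: str) -> bytes:
        data = self._blobs[addr]
        # Verification is implicit: recompute the hash and compare.
        if hashlib.sha3_256(data).hexdigest() != addr:
            raise ValueError("content does not match its address")
        return data

cas = ContentAddressedStore()
addr = cas.put(b"my_value")
```

Note the contrast with a location-addressed store: there, overwriting a value leaves the key intact; here, any change to the bytes produces a different address, so the old address can never silently serve new content.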

Location-Addressed

Traditional caches, databases, file systems

Address chosen by the writer. Content can change without address changing. Integrity requires separate verification. Lineage requires separate metadata. Data and proof are decoupled.

Content-Addressed

Cachee, Git, IPFS, Docker

Address derived from content. Content cannot change without address changing. Integrity is implicit. Lineage is embedded. Data and proof are one.

Comparison

Lineage Approaches Compared

| Property | Metadata Database | ETL Logging | Cachee Fingerprints |
| --- | --- | --- | --- |
| Integrity binding | None (metadata can diverge) | None (logs can be fabricated) | Cryptographic (SHA3-256) |
| Tamper evidence | No | Weak (append-only trust) | Yes (hash chain + PQ sigs) |
| Query performance | 10-100ms | Seconds (log search) | 31ns |
| Independent verification | No (trust the DB) | No (trust the logs) | Yes (offline, no account) |
| Schema evolution | Migration needed | Log format changes | Hash is schema-agnostic |
| Post-quantum | No | No | 3 PQ signature families |

Metadata databases and ETL logs describe lineage. Cachee fingerprints prove it. The difference is the difference between a claim and a cryptographic proof, and that lineage is preserved inside a tamper-proof audit trail.

cachee-lineage-demo
[1] $ cachee lineage risk:portfolio_47
    v3 (current) score=0.73 model=risk_v2.1 fp=0x8a2f...c317
      supersedes v2 (score=0.71 model=risk_v2.0)
      supersedes v1 (score=0.68 model=risk_v1.9)
 
[2] $ cachee lineage --inputs risk:portfolio_47
    input[0]: dataset_A fp=0x3a1f...b7c2 (market_data)
    input[1]: dataset_B fp=0x7d4e...a891 (positions)
    params: threshold=0.5 window=30d ver=2.1.0
 
[3] $ cachee-verify --lineage risk:portfolio_47
    Fingerprint: MATCH | Inputs: 2/2 VERIFIED | Sigs: 3/3 VALID
 
    Full lineage verified in 31ns. Inputs, computation, and result are bound.

Run it yourself: brew install cachee && cachee lineage-demo

Applications

Where Data Lineage Matters

🤖
AI Training Data
Prove the provenance of every training sample. Which dataset? Which version? Has it been modified since training? Computation fingerprints make model governance verifiable, not aspirational.
💰
Financial Data Pipelines
NAV calculations, risk aggregations, and regulatory reports flow through multi-stage pipelines. Each stage's output fingerprint becomes the next stage's input fingerprint. The entire pipeline is self-proving.
🔬
Scientific Reproducibility
Reproducibility crises end when every result carries proof of its computation. Same inputs + same code + same parameters = same fingerprint. Independently verifiable by any researcher.
🌐
GDPR Data Processing
Article 30 requires records of all processing activities. Computation fingerprints are machine-verifiable processing records — not human-authored logs that can be backdated or fabricated.
📦
Supply Chain Verification
Raw material inspections, manufacturing tests, quality scores — each step is fingerprinted with its inputs. The final product carries the full provenance chain from origin to delivery.
🔗
Data Marketplace Trust
Buyers need to trust that data is authentic before purchase. Computation fingerprints prove origin, transformation history, and integrity without revealing the data itself until payment.
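The reproducibility claim above (same inputs + same code + same parameters = same fingerprint) depends on a deterministic encoding of the computation's parameters. A sketch under assumed conventions: canonical JSON with sorted keys, which is an illustration rather than Cachee's actual encoding.

```python
import hashlib
import json

def result_fingerprint(inputs: list, code_version: str, params: dict) -> str:
    """Deterministic fingerprint over inputs, code version, and parameters.
    Canonical JSON with sorted keys (an assumption) makes the encoding
    independent of dictionary insertion order."""
    canon = json.dumps(
        {"inputs": inputs, "code": code_version, "params": params},
        sort_keys=True,
    ).encode()
    return hashlib.sha3_256(canon).hexdigest()

# Same inputs, code, and parameters — only the key order differs.
fp1 = result_fingerprint(["0x3a1f", "0x7d4e"], "risk_v2.1",
                         {"threshold": 0.5, "window": "30d"})
fp2 = result_fingerprint(["0x3a1f", "0x7d4e"], "risk_v2.1",
                         {"window": "30d", "threshold": 0.5})
# fp1 == fp2: a researcher re-running the computation reproduces the
# fingerprint; any changed parameter would change the hash.
```

This is why independent verification needs no trust in the original author: reproducing the fingerprint from published inputs and parameters is the verification.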
Install

Get Started

brew tap h33ai-postquantum/tap && brew install cachee
cachee init && cachee start

# Store a result with input lineage
SET result:calc_789 '{"nav":42.7}' FP compute=aggregation \
  inputs=dataset_A,dataset_B ver=1.3.0

# Query lineage (inputs, computation, versions)
cachee lineage result:calc_789

# Verify full lineage chain
cachee-verify --lineage result:calc_789

# Query historical version
cachee get result:calc_789 --version 1

Lineage tracking is automatic. Every SET with the FP flag records computation fingerprints. The inputs parameter binds the result to its source data. Supersession chains are maintained automatically when a key is updated, and can be reconstructed through replayable system state.

Know where your data came from. Prove it cryptographically. Verify in 31 nanoseconds.

Install Cachee Computation Fingerprinting

Deep Dives

Knowledge Base

Explore Verifiable Computation Infrastructure

Every page in the Cachee knowledge base. Proven computation, not cached data.

Post-Quantum Caching
The category definition. Run computation once, serve forever.
Computation Fingerprinting
Identity for results. Provenance, not just output.
Cache Attestation
Signed cache entries. Three PQ families per SET.
Tamper-Proof Audit Trails
SHA3-256 hash-chained immutable logging.
Verifiable Computation
Prove results without re-execution.
Replayable Systems
Reconstruct any state at any point in time.