Data Lineage Verification
Where did this data come from? How was it transformed? Has it been modified?
Computation fingerprints answer all three. Cryptographically.
Data lineage verification is the cryptographic proof of where data came from, how it was transformed, and whether it has been modified since creation. In Cachee, every computation result carries a computation fingerprint, SHA3-256(input || computation || parameters || version), that binds the result to its inputs, and the result's storage address is derived from this fingerprint. Lineage is only as trustworthy as the computation behind it, so results must come from verifiable computation. The address IS the proof: if the data at an address matches the hash, the data is authentic and unmodified.
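As a minimal sketch of the fingerprint formula above, the following Python hashes the four fields with SHA3-256 from the standard library. The length-prefixed framing, the sorted parameter order, and the `window` parameter are assumptions made for this illustration; Cachee's actual byte-level encoding is not specified here.

```python
import hashlib


def frame(field: bytes) -> bytes:
    """Length-prefix a field so ("ab", "c") and ("a", "bc") cannot collide."""
    return len(field).to_bytes(8, "big") + field


def computation_fingerprint(input_data: bytes, computation: str,
                            parameters: dict, version: str) -> bytes:
    """SHA3-256 over input || computation || parameters || version."""
    h = hashlib.sha3_256()
    h.update(frame(input_data))
    h.update(frame(computation.encode()))
    # Sort parameters so identical settings always serialize identically.
    for key, value in sorted(parameters.items()):
        h.update(frame(key.encode()))
        h.update(frame(value.encode()))
    h.update(frame(version.encode()))
    return h.digest()


# Hypothetical inputs, used only to show the behaviour.
fp = computation_fingerprint(b"dataset_A", "aggregation", {"window": "30d"}, "1.3.0")
fp_changed = computation_fingerprint(b"dataset_A", "aggregation", {"window": "7d"}, "1.3.0")
print(fp.hex())
print(fp == fp_changed)  # False: a changed parameter yields a different fingerprint
```

Framing each field before hashing matters because plain concatenation is ambiguous: without it, two different field splits could hash to the same fingerprint.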
Three Reasons You Need Verifiable Lineage
Regulatory
GDPR Article 30 requires records of all processing activities. SOX Section 404 requires management to assess and report on internal controls over financial reporting, which depends on demonstrable data integrity. Without verifiable lineage, you have metadata logs that can be fabricated. With Cachee, you have cryptographic proof.
Operational
A downstream system produces a wrong result. Which upstream input caused it? Traditional systems require manually tracing ETL pipelines. With computation fingerprints, every result points back to its exact inputs. The lineage graph is self-documenting.
Trust: AI Training Data Provenance
AI models are only as trustworthy as their training data. Without verifiable lineage, training data is a black box. With computation fingerprints, every training sample carries proof of origin, transformation history, and integrity — enabling verifiable model governance and trusted data marketplaces.
The Data Lineage Graph
Every result points back to its inputs. Every input points back to its source. The graph is self-verifying.
Input → Computation → Result → Verification
Every node carries a computation fingerprint. Every edge is a verifiable input-output relationship. The entire graph is self-proving.
The Provenance Record
Every cached computation result in Cachee carries a provenance record. This record is not metadata stored in a separate system: it is part of the cache entry itself, signed with three post-quantum (PQ) signature families.
struct Provenance {
    computed_by: String,               // what computed this, e.g. "risk_model_v2.1"
    computed_at: Timestamp,            // when, e.g. 2026-05-02T12:00:03Z
    computation_duration: u64,         // how long it took, in microseconds (e.g. 4,821)
    input_fingerprints: Vec<[u8;32]>,  // fingerprints of all inputs
    parameters: Map<String,String>,    // model thresholds, config
    software_version: String,          // exact binary hash
    hardware_id: String,               // platform identifier
    verified_by: [Signature; 3],       // ML-DSA + FALCON + SLH-DSA
    supersedes: Option<[u8;32]>,       // fingerprint of previous version
}
The input_fingerprints field creates the lineage graph. Each result knows exactly which inputs produced it. The supersedes field creates the supersession chain — the version history of this result.
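To make the lineage graph concrete, here is a small Python sketch that walks `input_fingerprints` from a result back to its sources. The `Entry` class, the `upstream` function, the in-memory `store`, and the stage names are hypothetical stand-ins for illustration, not Cachee's API.

```python
import hashlib
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Entry:
    computed_by: str
    input_fingerprints: list = field(default_factory=list)


def upstream(store: dict, fingerprint: bytes) -> list:
    """Breadth-first walk over input_fingerprints; nearest ancestors first."""
    seen = set()
    order = []
    queue = deque(store[fingerprint].input_fingerprints)
    while queue:
        fp = queue.popleft()
        if fp in seen:
            continue  # shared inputs are reported once
        seen.add(fp)
        order.append(fp)
        if fp in store:
            queue.extend(store[fp].input_fingerprints)
    return order


def label(name: str) -> bytes:
    """Stand-in fingerprint for the demo; real ones come from the computation."""
    return hashlib.sha3_256(name.encode()).digest()


raw_a, raw_b, cleaned, result = (label(n) for n in ("raw_a", "raw_b", "cleaned", "result"))
store = {
    raw_a: Entry("ingest"),
    raw_b: Entry("ingest"),
    cleaned: Entry("dedupe", [raw_a]),
    result: Entry("aggregation", [cleaned, raw_b]),
}
lineage = upstream(store, result)
print([store[fp].computed_by for fp in lineage])
```

When a downstream result looks wrong, a walk like this lists every upstream entry that could have contributed to it, starting with the direct inputs.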
Supersession Chains
Every time a result is recomputed, the new version references the previous. The full history is preserved.
risk:portfolio_47 — Full Version History
Every version is preserved. Query any version at any point in time. The supersession chain is the complete history.
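A supersession chain behaves like a linked list keyed by fingerprint. The sketch below, in Python, follows `supersedes` links from the newest version back to the first. The `Version` class, the helper names, the way demo fingerprints are derived, and the two earlier `nav` values are assumptions for illustration only.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Version:
    fingerprint: bytes
    value: str
    supersedes: Optional[bytes]


def history(store: dict, head: bytes) -> list:
    """Follow supersedes links from the newest version back to the first."""
    chain = []
    seen = set()
    current = head
    while current is not None:
        if current in seen:
            raise ValueError("supersession chain contains a cycle")
        seen.add(current)
        version = store[current]
        chain.append(version)
        current = version.supersedes
    return chain


def make_version(value: str, previous: Optional[Version]) -> Version:
    prev_fp = previous.fingerprint if previous else None
    # Demo fingerprint: hash of the value plus the previous fingerprint.
    digest = hashlib.sha3_256(value.encode() + (prev_fp or b"")).digest()
    return Version(digest, value, prev_fp)


v1 = make_version('{"nav":41.9}', None)
v2 = make_version('{"nav":42.3}', v1)
v3 = make_version('{"nav":42.7}', v2)
store = {v.fingerprint: v for v in (v1, v2, v3)}

for version in history(store, v3.fingerprint):
    print(version.value)  # newest first
```

Because each version names its predecessor by fingerprint, an earlier version cannot be altered without breaking the link held by the version that follows it.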
Content-Addressed Storage
In traditional storage, you choose where to put data (a file path, a database row, a cache key). The address is arbitrary — it has no relationship to the content. In content-addressed storage, the address is derived from the content itself.
// Traditional: address is arbitrary
SET "my_key" "my_value" // address tells you nothing about content
// Content-addressed: address IS the content hash
address = SHA3-256("my_value" || computation || params)
SET address "my_value" // address proves content integrity
// Verification is implicit:
// If SHA3-256(data_at_address || computation || params) == address, data is authentic
// No separate integrity check needed
This is the same principle behind Git (content-addressed commits), IPFS (content-addressed files), and Docker (content-addressed layers). Cachee applies it to computation results: the computation fingerprint is both the cache key and the integrity proof. The address IS the proof.
Location-Addressed
Address chosen by the writer. Content can change without address changing. Integrity requires separate verification. Lineage requires separate metadata. Data and proof are decoupled.
Content-Addressed
Address derived from content. Content cannot change without address changing. Integrity is implicit. Lineage is embedded. Data and proof are one.
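The contrast between the two models can be sketched in a few lines of Python. This toy store derives each address from the value together with its computation and parameter fields, following the pseudocode above, and re-derives it on every read. The class, its method names, and the byte framing are hypothetical and are not Cachee's implementation.

```python
import hashlib


def address_of(value: bytes, computation: bytes, params: bytes) -> bytes:
    """SHA3-256 over the length-prefixed value, computation, and params."""
    h = hashlib.sha3_256()
    for part in (value, computation, params):
        h.update(len(part).to_bytes(8, "big") + part)
    return h.digest()


class ContentAddressedStore:
    def __init__(self):
        self._entries = {}

    def put(self, value: bytes, computation: bytes, params: bytes) -> bytes:
        """Store the entry under the address derived from its content."""
        address = address_of(value, computation, params)
        self._entries[address] = (value, computation, params)
        return address

    def get(self, address: bytes) -> bytes:
        """Return the value only if the entry still hashes to its address."""
        value, computation, params = self._entries[address]
        if address_of(value, computation, params) != address:
            raise ValueError("entry does not match its address")
        return value


cas = ContentAddressedStore()
addr = cas.put(b"my_value", b"aggregation", b"ver=1.3.0")
print(cas.get(addr))

# Simulate tampering by overwriting the stored bytes behind the address.
cas._entries[addr] = (b"tampered", b"aggregation", b"ver=1.3.0")
try:
    cas.get(addr)
except ValueError as err:
    print("rejected:", err)
```

The writer never chooses the address, so there is no separate checksum to keep in sync: any reader holding the address can repeat the same hash and detect a mismatch.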
Lineage Approaches Compared
| Property | Metadata Database | ETL Logging | Cachee Fingerprints |
|---|---|---|---|
| Integrity binding | None (metadata can diverge) | None (logs can be fabricated) | Cryptographic (SHA3-256) |
| Tamper evidence | No | Weak (append-only trust) | Yes (hash chain + PQ sigs) |
| Query performance | 10-100ms | Seconds (log search) | 31ns |
| Independent verification | No (trust the DB) | No (trust the logs) | Yes (offline, no account) |
| Schema evolution | Migration needed | Log format changes | Hash is schema-agnostic |
| Post-quantum | No | No | 3 PQ signature families |
Metadata databases and ETL logs describe lineage; Cachee fingerprints prove it. That is the difference between a claim and a cryptographic proof. The lineage itself is preserved inside a tamper-proof audit trail.
Run it yourself: brew tap h33ai-postquantum/tap && brew install cachee && cachee lineage-demo
Get Started
brew tap h33ai-postquantum/tap && brew install cachee
cachee init && cachee start
# Store a result with input lineage
SET result:calc_789 '{"nav":42.7}' FP compute=aggregation \
inputs=dataset_A,dataset_B ver=1.3.0
# Query lineage (inputs, computation, versions)
cachee lineage result:calc_789
# Verify full lineage chain
cachee-verify --lineage result:calc_789
# Query historical version
cachee get result:calc_789 --version 1
Lineage tracking is automatic. Every SET with the FP flag records computation fingerprints. The inputs parameter binds the result to its source data. Supersession chains are maintained automatically when a key is updated, and they can be reconstructed through replayable system state.
Know where your data came from. Prove it cryptographically. Query it in 31 nanoseconds.