
How to Prove What an AI Agent Actually Did

An AI agent made a decision that cost your company money. Maybe it approved a transaction it should not have. Maybe it sent a communication with incorrect information. Maybe it accessed data it was not authorized to access. Maybe it delegated a task to another agent that made the actual error. Whatever happened, you need to know exactly what the agent did. Not what it was supposed to do. Not what its logs say it did. What it actually did.

Which tools did it call? In what order? What data did it receive back from each tool? What intermediate reasoning did it perform between tool calls? What output did it produce at each step? Who or what authorized each action? Did it delegate to sub-agents, and if so, what did they do? Were there retries, fallbacks, or error-handling paths that influenced the outcome?

Today, in the vast majority of AI agent deployments, you cannot answer these questions. Not because you did not build the agent carefully. Not because you did not instrument it. But because agent execution is fundamentally ephemeral. Context windows clear. Tool call records are scattered across the APIs the agent called. Intermediate reasoning exists only in the model's context and is discarded when the conversation ends. Delegation chains are not recorded as coherent graphs. The complete execution path, from initial request to final output, does not exist as a single, queryable, verifiable record.

This is the agent accountability gap. It is growing wider as agents become more autonomous, more capable, and more deeply integrated into business operations. And it is going to produce a crisis.

Why Agent Execution Is Uniquely Hard to Audit

Traditional software is deterministic and traceable. Function A calls function B, which calls function C. The call stack is recorded. The inputs and outputs at each level are available in the debugger. The execution path is reproducible. Given the same inputs and the same code, you get the same outputs every time. AI agents break every assumption in this model.

Non-Deterministic Execution Paths

An AI agent does not follow a fixed code path. It decides, at runtime, which tools to call, in what order, with what parameters. The same input can produce different execution paths depending on the model's reasoning, which is influenced by temperature, context window content, and the stochastic nature of language model inference. This means that even if you capture the initial input, you cannot reproduce the execution path without also capturing the model's internal state at each decision point. The execution path is an emergent property of the interaction between the model, its context, and the tools available to it. It is not predetermined by code.

Distributed Tool Calls

An agent's actions are distributed across the APIs and services it calls. When an agent queries a database, the database logs the query. When it calls an API, the API logs the request. When it sends an email, the email service logs the send. But these logs are in different systems, with different formats, different retention policies, and different access controls. Reconstructing the agent's complete execution path requires correlating logs across every system the agent touched, aligning timestamps, handling clock skew, and filling in the gaps where the agent's own reasoning connected one tool call to the next.

This is not a log aggregation problem that can be solved by shipping everything to a central system. The agent's reasoning, the part that connects one tool call to the next and explains why the agent chose to call tool B after tool A returned result X, exists only in the model's context window. It is not logged anywhere. It is not stored anywhere. When the context window clears, this reasoning is gone forever.

Delegation and Multi-Agent Chains

Modern agent architectures involve delegation. A planning agent delegates to a research agent, which delegates to a retrieval agent, which calls a search API. The output flows back up the chain: search results to retrieval agent, synthesized knowledge to research agent, completed research to planning agent, final output to user. Each agent in this chain has its own context window, its own reasoning, its own tool calls. The delegation chain is a tree, not a line. And the tree is not recorded as a coherent structure anywhere.
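
To make the shape concrete, here is a minimal Python sketch of that delegation chain as an explicit tree. The class and field names are invented for illustration; the point is that nothing in a typical agent stack builds or persists this structure for you.

```python
from dataclasses import dataclass, field

@dataclass
class DelegationNode:
    """One agent or tool invocation in a delegation tree (illustrative only)."""
    name: str
    task: str
    children: list["DelegationNode"] = field(default_factory=list)

    def delegate(self, child: "DelegationNode") -> "DelegationNode":
        """Record a delegation edge and return the child for chaining."""
        self.children.append(child)
        return child

# The chain from the example above, plus a second branch to show that
# delegation forms a tree rather than a line.
planner = DelegationNode("planning-agent", "answer user request")
research = planner.delegate(DelegationNode("research-agent", "gather background"))
retrieval = research.delegate(DelegationNode("retrieval-agent", "fetch documents"))
retrieval.delegate(DelegationNode("search-api", "run search query"))
planner.delegate(DelegationNode("summarizer-agent", "draft final output"))

def print_tree(node: DelegationNode, depth: int = 0) -> None:
    """Walk the tree depth-first, printing one line per node."""
    print("  " * depth + f"{node.name}: {node.task}")
    for child in node.children:
        print_tree(child, depth + 1)

print_tree(planner)
```

In production systems this tree exists only implicitly, smeared across the context windows of the participating agents, which is why it must be reconstructed after the fact.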

When something goes wrong in a multi-agent chain, determining which agent made the error requires reconstructing the entire delegation tree, understanding the context at each level, and identifying where the incorrect information was introduced, propagated, or amplified. Without a coherent record of the tree, this reconstruction is forensic guesswork.

Authorization Without Accountability

Agents act with delegated authority. A user authorizes an agent to perform tasks on their behalf. The agent then calls tools, accesses data, and produces outputs using this delegated authority. But the authorization chain is rarely recorded with the agent's actions. You know that the user authorized the agent. You know that the agent called a tool. But you may not have a verifiable record that connects the user's authorization to the specific tool call. The gap between "user authorized agent to perform tasks" and "agent called this specific API with these parameters at this time" is the gap between authorization and accountability.

Agent execution is a distributed, non-deterministic, multi-party process where the reasoning that connects actions is ephemeral and the authorization that enables actions is disconnected from the actions themselves. This is not an observability problem. This is an accountability architecture problem.

The Cost of the Accountability Gap

The accountability gap is not an abstract concern. It produces concrete costs that increase as agent autonomy increases.

Incident Response

When an agent produces an incorrect output, the first question is "what happened." Without a verifiable execution record, answering this question requires an investigation. Engineers review whatever logs they can find across multiple systems. They try to reconstruct the agent's execution path from fragments. They make inferences about what the agent's reasoning might have been based on the inputs and outputs they can observe. This investigation takes hours or days. It produces conclusions with varying degrees of confidence. And it often cannot definitively determine what happened because the critical information, the agent's intermediate reasoning, no longer exists.

With a verifiable execution graph, answering "what happened" is a query. You retrieve the complete execution record for the agent's session. Every tool call, every data access, every intermediate reasoning step, every delegation, every output, with hash-chained proof at every node. The investigation takes minutes, not days. The conclusions are definitive, not probabilistic.

Liability

When an agent's action causes financial loss, the question of liability depends on what the agent did and whether it was authorized to do it. Without a verifiable execution record, establishing what the agent did is a matter of inference and assertion. The organization asserts, based on its logs, that the agent did X. But the logs are mutable, fragmented, and incomplete. The assertion may be challenged. With a verifiable execution graph, what the agent did is a matter of cryptographic proof. The execution record is signed, hash-chained, and independently verifiable. The assertion is not "we believe the agent did X." The assertion is "here is the cryptographically verified execution record showing the agent did X, and here is how you verify it yourself."

Regulatory Compliance

Regulators are beginning to ask about agent actions. The EU AI Act imposes record-keeping and traceability obligations on high-risk AI systems. Financial regulators are examining AI-driven trading and advisory actions. Healthcare regulators are scrutinizing AI diagnostic and treatment recommendation systems. In all of these contexts, the regulatory requirement is not "log what the agent did" but "demonstrate what the agent did in a way that can be independently verified." Logs cannot satisfy this requirement because their authenticity depends on trust. Verifiable execution records can satisfy this requirement because their authenticity depends on mathematics.

Customer Trust

As agents interact directly with customers, customers will ask what the agent did on their behalf. "Why did the agent recommend this product?" "Why did the agent deny my claim?" "Why did the agent send that email?" These questions require not just an answer but a verifiable answer. If the customer disputes the organization's account of what the agent did, who prevails? With logs, it is a matter of trust. With verifiable execution records, it is a matter of proof.

How Cachee Captures the Execution Graph

Cachee provides a verifiable execution graph for AI agent operations. This is not a logging system. It is a cryptographic accountability layer that captures every action an agent takes, binds it to the actions that preceded and followed it, and provides independent verifiability at every node.

Node-Level Attestation

Every action in the agent's execution is a node in the graph. Each node is a Cachee entry containing the action type (tool call, data access, reasoning step, delegation, output), the complete input to the action, the complete output from the action, the agent identity and authorization context, the timestamp, and the hash link to the previous node. Each node is individually signed by the computing authority at creation time. Each node's hash includes the hash of the previous node, creating the chain. Each node is independently verifiable. You can verify any single node without verifying the entire graph. You can verify the entire graph by walking the chain from any starting point.
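
A minimal sketch of what one such node could look like, assuming JSON canonicalization, SHA-256 for hashing, and an HMAC standing in for the computing authority's signature. None of this is Cachee's actual format; it only illustrates how the previous node's hash ends up inside the current node's hashed content.

```python
import hashlib
import hmac
import json
import time

# Stand-in for the computing authority's signing key. A real system would
# use an asymmetric signature, not a shared-secret HMAC.
AUTHORITY_KEY = b"demo-signing-key"

def make_node(action_type: str, inputs: dict, outputs: dict,
              agent_id: str, auth_context: dict, prev_hash: str) -> dict:
    """Build one attestation node whose hash covers the previous node's hash."""
    body = {
        "action_type": action_type,    # tool call, data access, reasoning, delegation, output
        "inputs": inputs,              # complete input to the action
        "outputs": outputs,            # complete output from the action
        "agent_id": agent_id,
        "auth_context": auth_context,  # part of the hashed content, not loose metadata
        "timestamp": time.time(),
        "prev_hash": prev_hash,        # the link that forms the chain
    }
    canonical = json.dumps(body, sort_keys=True).encode()
    node_hash = hashlib.sha256(canonical).hexdigest()
    signature = hmac.new(AUTHORITY_KEY, node_hash.encode(), hashlib.sha256).hexdigest()
    return {**body, "hash": node_hash, "signature": signature}

genesis = make_node("output", {}, {"note": "session start"}, "agent-7",
                    {"authorized_by": "user-42"}, prev_hash="0" * 64)
step = make_node("tool_call", {"tool": "search", "query": "Q3 totals"},
                 {"rows": 12}, "agent-7", {"authorized_by": "user-42"},
                 prev_hash=genesis["hash"])  # chained to the previous node
```

Because prev_hash is inside the hashed body, re-ordering, inserting, or deleting any node changes every hash downstream of it, which is what makes the chain tamper-evident.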

Graph Structure

Agent execution is not linear. It is a graph. An agent may call multiple tools in parallel. It may delegate to multiple sub-agents. It may retry failed operations. It may take branching paths based on intermediate results. Cachee captures this graph structure natively. Each node links to every node that directly preceded it, not just a single predecessor. Parallel tool calls are recorded as sibling nodes with a common parent. Delegation is recorded as a subgraph rooted at the delegation node. Retries are recorded as sequential nodes with the same parent. The graph structure is not inferred from timestamps or log correlation. It is explicitly recorded at execution time as part of the attestation.
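
Extending the sketch above from a chain to a graph, each node carries a list of predecessor hashes rather than a single one. The representation is again illustrative, not Cachee's; it shows how parallel calls become siblings and how a join step records both of its parents.

```python
import hashlib
import json

def make_graph_node(action_type: str, payload: dict,
                    parent_hashes: list[str]) -> dict:
    """An attestation node that may link to several direct predecessors."""
    body = {
        "action_type": action_type,
        "payload": payload,
        "parent_hashes": sorted(parent_hashes),  # all direct predecessors
    }
    body["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    return body

plan   = make_graph_node("reasoning", {"text": "need prices and inventory"}, [])
prices = make_graph_node("tool_call", {"tool": "prices_api"}, [plan["hash"]])
stock  = make_graph_node("tool_call", {"tool": "inventory_api"}, [plan["hash"]])
# Parallel calls are siblings with a common parent; the step that consumes
# both results lists both hashes, recording the join explicitly.
merge  = make_graph_node("reasoning", {"text": "combine results"},
                         [prices["hash"], stock["hash"]])
```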

Reasoning Capture

The agent's intermediate reasoning, the part that connects one action to the next, is captured as explicit nodes in the graph. When the agent decides to call a tool, the reasoning that led to that decision is recorded as a node. When the agent interprets a tool's output and decides what to do next, that interpretation is recorded as a node. This is the critical information that is lost in traditional logging architectures. The reasoning is ephemeral in the model's context window, but in Cachee it is a permanent, hash-chained, signed entry in the execution graph. When you need to understand why the agent called tool B after tool A returned result X, the answer is in the reasoning node between them. It is not an inference. It is a record.
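
A sketch of what capturing reasoning as first-class nodes might look like inside an agent loop. The record_node helper and in-memory LEDGER are hypothetical stand-ins for the attestation store; the node shape follows the single-link sketch above.

```python
import hashlib
import json

LEDGER: list[dict] = []  # in-memory stand-in for the attestation store

def record_node(action_type: str, payload: dict, prev_hash: str) -> str:
    """Append one hash-chained node and return its hash (illustrative only)."""
    body = {"action_type": action_type, "payload": payload, "prev_hash": prev_hash}
    node_hash = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    LEDGER.append({**body, "hash": node_hash})
    return node_hash

def reasoned_tool_call(reasoning: str, tool: str, args: dict, prev_hash: str) -> str:
    # The reasoning that led to the tool choice becomes its own node,
    # chained immediately before the call it explains.
    r_hash = record_node("reasoning", {"text": reasoning}, prev_hash)
    # The tool call follows it, so "why was tool B called after tool A?"
    # is answered by walking exactly one node back.
    return record_node("tool_call", {"tool": tool, "args": args}, r_hash)

head = record_node("output", {"note": "session start"}, "0" * 64)
head = reasoned_tool_call("need Q3 revenue before drafting the summary",
                          "sql_query", {"q": "SELECT sum(amount) FROM orders"}, head)
```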

Authorization Binding

Every node in the execution graph includes the authorization context. Who authorized the agent? What permissions did the agent have? Did the agent's actions fall within its authorized scope? This authorization context is not metadata attached to the node. It is part of the hashed content. It is bound to the action by the same cryptographic mechanisms that bind the input and output. You cannot modify the authorization context without breaking the hash. You cannot claim that an action was authorized by a different party without detection. The authorization chain from user to agent to action is cryptographically unbroken.
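
The tamper-evidence claim is easy to demonstrate with the node shape used in these sketches: verification recomputes the hash over the full body, authorization context included, so any edit to that context fails verification. (Hypothetical code, same caveats as above.)

```python
import hashlib
import json

def compute_hash(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def verify(node: dict) -> bool:
    """Recompute the hash over everything except the stored hash itself."""
    body = {k: v for k, v in node.items() if k != "hash"}
    return compute_hash(body) == node["hash"]

body = {
    "action_type": "tool_call",
    "auth_context": {"authorized_by": "user-42", "scope": "read:orders"},
    "payload": {"tool": "orders_api"},
    "prev_hash": "0" * 64,
}
node = {**body, "hash": compute_hash(body)}

assert verify(node)                            # the untouched node verifies
node["auth_context"]["scope"] = "write:orders"
assert not verify(node)                        # editing the authorization breaks the hash
```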

With Cachee, reconstructing what an agent did is a query, not an investigation. The complete execution graph, every action, every delegation, every tool call, every reasoning step, with cryptographic proof at every node, is available as a single, traversable, verifiable data structure.

From Investigation to Query

Consider the difference in practice. Without Cachee, an agent incident investigation looks like this: An engineer receives an alert that an agent produced an incorrect output. The engineer identifies the agent session from the application logs. The engineer searches the application database for the agent's tool calls. The engineer searches each tool's API logs for the corresponding requests. The engineer attempts to correlate timestamps across systems to establish ordering. The engineer discovers gaps where the agent's reasoning connected tool calls but was not logged. The engineer makes inferences about what the agent was thinking based on the inputs and outputs they can observe. The engineer writes a report with qualified conclusions. The investigation takes 4 to 8 hours. The conclusions are probabilistic.

With Cachee, the same investigation looks like this: An engineer receives an alert that an agent produced an incorrect output. The engineer queries Cachee for the execution graph of the agent session. The engineer receives the complete graph including every tool call with inputs and outputs, every reasoning step, every delegation, every authorization context, with cryptographic verification at every node. The engineer traverses the graph to the point where the incorrect output originated. The engineer examines the reasoning node that preceded the incorrect action. The investigation takes 5 minutes. The conclusions are definitive.
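
In code, that traversal is a few lines. Reusing the in-memory LEDGER and node shape from the reasoning-capture sketch above (still hypothetical, not Cachee's query API), walking back from the flagged output to the reasoning node that preceded it looks like this:

```python
def walk_back(nodes_by_hash: dict, start_hash: str, want_type: str) -> dict | None:
    """Follow prev_hash links until a node of the wanted action type is found."""
    h = start_hash
    while h in nodes_by_hash:
        node = nodes_by_hash[h]
        if node["action_type"] == want_type:
            return node
        h = node["prev_hash"]
    return None

nodes_by_hash = {n["hash"]: n for n in LEDGER}  # LEDGER from the earlier sketch
flagged = LEDGER[-1]["hash"]                    # the node named in the alert
cause = walk_back(nodes_by_hash, flagged, "reasoning")
if cause is not None:
    print("Reasoning preceding the incorrect action:", cause["payload"]["text"])
```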

This is not a 10x improvement in investigation speed. It is a categorical change in the nature of investigation. The investigation goes from "reconstruct what probably happened from fragments" to "traverse the verified execution record." The output goes from "we believe the agent did X based on available evidence" to "the cryptographically verified execution record shows the agent did X."

Building Accountable Agent Systems

Agent accountability is not a feature you can add after the fact. It must be built into the agent's execution architecture. Every tool call must pass through the attestation layer. Every reasoning step must be captured. Every delegation must be recorded. Every authorization must be bound to the actions it enables.

This requires a shift in how agents are built. Instead of treating observability as a logging concern that is bolted on after the agent works, observability becomes a first-class architectural concern. The agent does not call a tool and then log that it called the tool. The agent calls the tool through Cachee, and the call is attested, hash-chained, and signed as part of the execution itself.
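
A sketch of the difference, using the hypothetical record_node helper from earlier. The wrapper is the attestation layer: the tool's input and output are hashed into the chain in the same step that produced them, rather than logged separately afterward.

```python
from typing import Any, Callable

def call_through_attestation(tool: Callable[..., Any], args: dict,
                             prev_hash: str) -> tuple[Any, str]:
    """Invoke a tool so that the call itself produces the attested record."""
    result = tool(**args)
    # Input and output enter the hash chain as part of the execution step,
    # not as an after-the-fact log line.
    new_hash = record_node("tool_call",
                           {"tool": tool.__name__, "args": args, "result": result},
                           prev_hash)
    return result, new_hash

def get_balance(account_id: str) -> float:
    return 1234.56  # placeholder tool for the example

balance, head = call_through_attestation(get_balance, {"account_id": "A-1"}, "0" * 64)
```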

This shift has a cost: every action passes through an additional layer. But the cost of this layer is measured in microseconds. The cost of not having it is measured in hours of investigation, uncertain conclusions, regulatory risk, and the inability to answer the most basic question about your AI agent: "What did it actually do?"

Agents are becoming more autonomous. They are making more consequential decisions. They are operating with less human oversight. The accountability gap is growing wider. The organizations that close this gap now, by building verifiable execution into their agent architectures from the start, will be the organizations that can operate agents at scale with confidence. The organizations that do not will be the organizations explaining to regulators, courts, and customers that they cannot prove what their agents did.

The question is not whether you will need to prove what your agent did. The question is whether you will be able to.

Make Your Agents Accountable

Cachee captures the complete execution graph for AI agents. Every action, every delegation, every tool call, with hash-chained cryptographic proof at every step. Accountability by design.
