Deterministic Replay for AI Agents — Pre-Deploy Validation
Deterministic Replay re-executes historical production decisions from the Tenet Reasoning Ledger against a candidate agent version — a new model, an updated prompt, or a modified policy — before deployment. Synthetic benchmarks test performance on your test set. Deterministic Replay tests performance on your production reality: the edge cases, outliers, and long-tail input distributions that define your actual business. A loan approval agent that passes your benchmark suite may still regress on the specific edge cases that characterize your real applicant population. Deterministic Replay closes this gap.
Why Production Data Beats Synthetic Benchmarks
Production AI agents fail on scenarios you didn't think to include in your benchmark — the edge cases in your actual user base, the input distributions specific to your vertical, the combinations that look normal in aggregate but produce wrong decisions in practice. Deterministic Replay exposes regressions on real production data before they reach production users. Synthetic benchmarks are built by humans who anticipate the scenarios they expect to see. Production data is built by users who generate the scenarios that actually occur. For financial AI agents, this means the rare income and debt configurations that stress-test policy boundaries. For clinical AI agents, this means the comorbidity combinations that complicate triage logic. For insurance underwriters, this means the claim types that sit at the boundary of coverage rules. These are exactly the scenarios that fail silently on synthetic benchmarks and surface as costly errors in production. Tenet Deterministic Replay uses stored context snapshots from your Reasoning Ledger — the exact state your agent processed at decision time — to replay those specific edge cases against any new agent version before you deploy it.
Three Deterministic Replay Use Cases
Pre-deploy model validation: replay the last 30 days of production decisions against a new model checkpoint before routing live traffic to the new version. If the new checkpoint changes outcomes on more than a threshold percentage of prior decisions, or changes high-stakes decision types at any rate, you have a concrete, data-backed reason to delay deployment or investigate the divergence (a deployment-gate sketch based on this workflow follows below).

Prompt change validation: compare the behavioral delta of a prompt update against your real decision history. A seemingly minor clarification to your system prompt may shift agent reasoning on specific input types in ways that eval suites fail to surface. Deterministic Replay quantifies the impact on actual production decisions and surfaces which specific cases diverge.

Policy backtesting: replay historical decisions against a new compliance threshold, a revised policy rule, or an updated regulatory guideline to understand the retroactive impact before it becomes a live regulatory exposure. If your legal team proposes tightening a lending policy, backtesting against 90 days of real decisions shows the exact scope of impact before any change is deployed.
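To make the first use case concrete, here is a minimal sketch of a deployment gate built on a replay run. This is illustrative only: the Python package name, TenetClient, submit_replay, and the result fields are assumed names, not a documented Tenet SDK surface.

```python
from datetime import datetime, timedelta, timezone

from tenet import TenetClient  # hypothetical package and client name

MAX_DIVERGENCE = 0.01                # block deploys that change >1% of outcomes
HIGH_STAKES = {"loan_approval", "credit_limit_increase"}  # illustrative types

client = TenetClient(api_key="...")  # auth details are placeholders

# Replay the last 30 days of production decisions against the candidate version.
now = datetime.now(timezone.utc)
job = client.submit_replay(
    start=now - timedelta(days=30),
    end=now,
    candidate_version="loan-agent-v2.4.0-rc1",
)
result = job.wait()  # block until the replay batch completes

# Gate 1: overall divergence rate must stay under the threshold.
if result.divergence_rate > MAX_DIVERGENCE:
    raise SystemExit(f"blocked: {result.divergence_rate:.2%} of decisions diverged")

# Gate 2: any divergence on a high-stakes decision type blocks the deploy outright.
high_stakes_diffs = [d for d in result.divergent_cases if d.decision_type in HIGH_STAKES]
if high_stakes_diffs:
    raise SystemExit(f"blocked: {len(high_stakes_diffs)} high-stakes decisions changed")

print("replay gate passed; candidate cleared for deployment")
```

Wired into CI before the deploy step, a gate like this turns the threshold policy described above into an enforced check rather than a manual review.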
How Deterministic Replay Works
Tenet stores a full context snapshot for every agent decision: the exact input state, policy context, retrieved documents, and intermediate reasoning steps at decision time. These snapshots are written to the immutable Reasoning Ledger with SHA-256 hashing and Ed25519 signing. When you run Deterministic Replay, Tenet re-executes selected historical decisions against your candidate agent version using the stored snapshots as the exact input, bypassing live data retrieval entirely so the replay is truly deterministic. The comparison output covers the decisions that produce different outcomes, the point in the reasoning chain where each divergence occurs, the percentage of production decisions affected by the change, a severity classification by decision type and business impact, and a side-by-side diff of prior versus candidate reasoning for each divergent case. The replay result is itself stored as a validation artifact in the Reasoning Ledger, providing an immutable record that pre-deployment behavioral testing was conducted against production data, evidence that satisfies EU AI Act Article 9 risk management requirements.
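For intuition on the integrity guarantees, here is a minimal sketch of how a snapshot could be hashed and signed. SHA-256 and Ed25519 are the primitives named above; the snapshot fields, serialization scheme, and use of the Python cryptography package are assumptions for illustration, not the actual ledger implementation.

```python
import hashlib
import json

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Illustrative snapshot shape; field names are hypothetical.
snapshot = {
    "decision_id": "dec_8f3a91",
    "input_state": {"applicant_income": 84000, "dti": 0.31},
    "policy_context": {"policy_version": "lending-2025.03"},
    "retrieved_documents": ["doc_112", "doc_407"],
    "reasoning_steps": ["extracted income", "checked DTI < 0.36", "approved"],
}

# Canonical serialization so an identical snapshot always hashes identically.
payload = json.dumps(snapshot, sort_keys=True, separators=(",", ":")).encode()
digest = hashlib.sha256(payload).hexdigest()

# Sign the digest so any tampering with a stored snapshot is detectable.
signing_key = Ed25519PrivateKey.generate()
signature = signing_key.sign(digest.encode())

# Verification (e.g., by an auditor holding the public key) raises on mismatch.
signing_key.public_key().verify(signature, digest.encode())
print(f"sha256={digest[:16]}... signature verified")
```

The practical consequence is that a replay consumes byte-identical inputs to the original decision, and anyone holding the public key can confirm those inputs were not altered after the fact.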
Compliance Use Case: Pre-Deployment Validation Evidence
EU AI Act Article 9 requires providers of high-risk AI systems to implement risk management measures, including systematic testing under realistic conditions before deployment. For AI agents in loan approval, medical triage, insurance underwriting, or hiring, replay against production decision history is the closest available approximation to realistic conditions: it uses the actual edge cases, distributions, and input patterns from your real user base rather than synthetic test sets assembled by your development team. The replay report documents that behavioral testing was conducted before deployment, which exact production scenarios were replayed, what percentage of decisions diverged, and how each divergence was assessed and resolved. SR 11-7, the model risk management guidance from US banking regulators, requires model validation to address the model's actual use, including testing in the context of the specific types of decisions the model will support; OCC model validation expectations align with this standard. Deterministic Replay against your own production decision history satisfies this expectation more directly than generic benchmark evaluations.
Detecting Regressions That Evals Miss
Standard evals test the scenarios you anticipated: the canonical cases in your evaluation dataset. Deterministic Replay tests the scenarios your agent actually encountered in production, including the ones you did not anticipate, because production reality always contains cases that developers did not pre-populate into test sets. The long tail of unusual inputs that synthetic benchmarks underrepresent is precisely where consequential regressions accumulate: the rare financial profile that sits at the exact boundary of approval criteria, the clinical note structure that produces ambiguous triage scores, the claim description that activates multiple competing policy rules simultaneously. These are not exotic failures; they are the normal distribution of real-world complexity. Teams using Tenet Deterministic Replay consistently identify behavioral regressions on specific input types that their eval suites missed entirely. When you find that a new model checkpoint changes decisions on 2% of historical loans, specifically loans in the 680-720 credit score range with variable income, you can investigate that pattern before it becomes a disparate impact finding. When your eval shows 99.2% consistency, you may not have tested the 0.8% that matters.
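As a hedged sketch of how the 2% example might be surfaced, the snippet below slices divergent cases into credit-score bands to expose a concentrated pattern. The record shape and field names are assumptions, not a documented Tenet report schema.

```python
from collections import Counter

def score_band(score: int, width: int = 40) -> str:
    """Bucket a credit score into a fixed-width band label, e.g. '680-720'."""
    lo = (score // width) * width
    return f"{lo}-{lo + width}"

# Hypothetical divergent-case records, as a replay report might return them.
divergent_cases = [
    {"decision_type": "loan_approval", "credit_score": 702},
    {"decision_type": "loan_approval", "credit_score": 688},
    {"decision_type": "loan_approval", "credit_score": 754},
]

# Count divergences per band to reveal where the candidate version shifts.
bands = Counter(
    score_band(case["credit_score"])
    for case in divergent_cases
    if case["decision_type"] == "loan_approval"
)

for band, count in bands.most_common():
    print(f"{band}: {count} divergent decisions")
```

The same slicing logic applies to any input characteristic: income type, triage acuity, claim category. The point is that divergences cluster, and the cluster, not the aggregate rate, is what you investigate.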
Integration and Setup
Deterministic Replay requires only that the Tenet Ghost SDK is capturing decisions in production. Once the Reasoning Ledger accumulates decision snapshots (meaningful signal is typically available within the first week of production traffic), you can run Deterministic Replay from the dashboard or via the Tenet REST API. The API workflow, sketched below: select a time range of past decisions, optionally filter to specific decision types or input characteristics, specify your candidate agent endpoint or version identifier, and submit the replay job. Tenet routes the stored context snapshots from the Reasoning Ledger to your candidate agent and captures the outputs for comparison. Results appear in the dashboard within minutes for most batch sizes, showing the divergence rate, a breakdown by decision type, side-by-side reasoning diffs for divergent cases, and a severity classification. The replay artifact is automatically stored as a validation record. No separate replay infrastructure, no new data stores, no additional instrumentation. Ghost SDK writes add under 5ms of overhead using fire-and-forget async writes, and the Reasoning Ledger accumulates automatically as your agent operates in production.
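A minimal sketch of that workflow over HTTP follows. The base URL, endpoint paths, payload fields, and polling shape are illustrative assumptions, not confirmed Tenet API documentation.

```python
import time

import requests

BASE = "https://api.tenet.example/v1"            # placeholder base URL
HEADERS = {"Authorization": "Bearer <API_KEY>"}  # placeholder credential

# 1. Submit a replay job: time range, optional filters, candidate version.
resp = requests.post(
    f"{BASE}/replays",
    headers=HEADERS,
    json={
        "start": "2025-05-01T00:00:00Z",
        "end": "2025-05-31T23:59:59Z",
        "filters": {"decision_type": "loan_approval"},
        "candidate": {"endpoint": "https://agents.internal/loan-agent-rc"},
    },
)
resp.raise_for_status()
job_id = resp.json()["id"]

# 2. Poll until the batch completes (a webhook would also fit here).
while True:
    job = requests.get(f"{BASE}/replays/{job_id}", headers=HEADERS).json()
    if job["status"] in ("completed", "failed"):
        break
    time.sleep(10)

# 3. Inspect the divergence summary stored with the validation artifact.
summary = job["summary"]
print(summary["divergence_rate"], summary["by_decision_type"])
```

Because the replay job reads only from the Reasoning Ledger and calls your candidate endpoint, the same three steps cover all three use cases above; only the filters and the candidate under test change.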