Simulation and Dry-Run¶
Overview¶
ai-tool-guard provides two mechanisms for evaluating policies without executing real tools:
- Global dry-run mode — configure a
ToolGuardinstance withdryRun: trueto intercept all tool calls and return a safe placeholder instead of running the underlyingexecute()function. - Batch simulation — use the standalone
simulate()function to replay a recorded trace of tool calls through a policy configuration, producing a fullSimulationResultwith per-call decisions and summary statistics.
These features are designed for policy testing before deployment, regression testing after policy changes, and analyzing audit traces to understand what would have been blocked.
Basic Usage¶
Global Dry-Run¶
Set dryRun: true on createToolGuard. All tools wrapped by this guard will return { dryRun: true, toolName, args } instead of executing.
import { createToolGuard } from 'ai-tool-guard';
const guard = createToolGuard({
rules: [...],
dryRun: true,
});
const tools = guard.guardTools({
deleteRecord: { tool: deleteRecordTool, riskLevel: 'critical' },
});
// Safe to call — no deletion happens.
const result = await tools.deleteRecord.execute({ id: '123' }, execOptions);
// result => { dryRun: true, toolName: 'deleteRecord', args: { id: '123' } }
Policy evaluation still runs in dry-run mode. DecisionRecords are produced and passed to onDecision, and OTel spans are emitted. The only thing skipped is the actual execute() call.
Batch Simulation¶
Use simulate() to evaluate a recorded trace against a policy without any live tool calls:
import { simulate } from 'ai-tool-guard/policy';
import type { RecordedToolCall } from 'ai-tool-guard/policy';
const trace: RecordedToolCall[] = [
{ toolName: 'readFile', args: { path: '/etc/passwd' } },
{ toolName: 'writeFile', args: { path: '/tmp/out.txt', content: 'hello' } },
{ toolName: 'deleteRecord', args: { id: '42' }, userAttributes: { role: 'admin' } },
];
const result = await simulate(trace, {
rules: [
{
id: 'block-sensitive-reads',
toolPatterns: ['readFile'],
verdict: 'deny',
condition: (ctx) => String(ctx.args.path).startsWith('/etc/'),
},
],
});
console.log(result.summary);
// { total: 3, allowed: 2, denied: 1, requireApproval: 0 }
for (const { toolCall, decision } of result.blocked) {
console.log(`Blocked: ${toolCall.toolName} — ${decision.reason}`);
}
RecordedToolCall¶
A RecordedToolCall represents one entry in a simulation trace:
interface RecordedToolCall {
/** Name of the tool that was (or would be) called. */
toolName: string;
/** Arguments the model supplied. */
args: Record<string, unknown>;
/** Optional override for user attributes during simulation. */
userAttributes?: Record<string, unknown>;
}
The userAttributes field lets you replay the same tool call under different identity contexts — for example, to verify that an admin role bypasses a restriction that blocks regular users.
SimulationResult¶
simulate() returns a SimulationResult:
interface SimulationResult {
/** Full decision records for every tool call in the trace, in order. */
decisions: DecisionRecord[];
/** Aggregate counts. */
summary: {
total: number;
allowed: number;
denied: number;
requireApproval: number;
};
/**
* Tool calls that would not have been allowed outright.
* Includes both "deny" and "require-approval" verdicts.
*/
blocked: Array<{
toolCall: RecordedToolCall;
decision: DecisionRecord;
}>;
}
Every element in decisions corresponds to the RecordedToolCall at the same index in the input trace. The blocked array is a filtered view containing only the non-allow decisions, paired with their originating tool call for convenient reporting.
simulate() Function Signature¶
async function simulate(
trace: RecordedToolCall[],
options: GuardOptions,
toolConfigs?: Record<string, ToolGuardConfig>,
): Promise<SimulationResult>
| Parameter | Type | Description |
|---|---|---|
trace |
RecordedToolCall[] |
Ordered list of tool calls to evaluate. |
options |
GuardOptions |
Policy configuration: rules, backend, risk level defaults, etc. |
toolConfigs |
Record<string, ToolGuardConfig> |
Optional per-tool risk level and category overrides, keyed by tool name. |
All evaluations run with dryRun: true internally. The options.dryRun flag does not need to be set explicitly.
Note
simulate() runs evaluations sequentially in trace order, not in parallel. This matches the serial execution model of a single-threaded agent and ensures that stateful policy rules (e.g., ones that accumulate failure counts) behave consistently.
Use Cases¶
Testing a Policy Before Deployment¶
Write a simulation test that asserts expected verdicts for known tool call patterns:
import { simulate } from 'ai-tool-guard/policy';
import { productionPolicyOptions } from './policy-config.js';
const result = await simulate(
[
{ toolName: 'executeSQL', args: { query: 'DROP TABLE users;' } },
{ toolName: 'executeSQL', args: { query: 'SELECT * FROM orders WHERE id = 1;' } },
],
productionPolicyOptions,
{
executeSQL: { riskLevel: 'high', riskCategories: ['data-delete', 'data-read'] },
},
);
// Assert the destructive query is blocked.
const [dropDecision, selectDecision] = result.decisions;
console.assert(dropDecision.verdict === 'deny');
console.assert(selectDecision.verdict === 'allow');
Comparing Two Policy Configurations¶
Run the same trace through two different policy configurations and compare their outputs:
import { simulate } from 'ai-tool-guard/policy';
const [resultA, resultB] = await Promise.all([
simulate(productionTrace, policyConfigV1),
simulate(productionTrace, policyConfigV2),
]);
const v1Denials = resultA.summary.denied;
const v2Denials = resultB.summary.denied;
console.log(`Policy V1 denied: ${v1Denials}`);
console.log(`Policy V2 denied: ${v2Denials}`);
console.log(`Delta: ${v2Denials - v1Denials} (${v2Denials > v1Denials ? 'stricter' : 'more permissive'})`);
Audit Analysis¶
Replay a recorded production trace to understand which calls would have been blocked by a new policy:
import { simulate } from 'ai-tool-guard/policy';
import type { RecordedToolCall } from 'ai-tool-guard/policy';
// Load trace from audit log (e.g., written by onDecision callback).
const auditLog = JSON.parse(fs.readFileSync('audit.json', 'utf-8'));
const trace: RecordedToolCall[] = auditLog.map((entry: any) => ({
toolName: entry.toolName,
args: entry.args,
userAttributes: entry.attributes,
}));
const result = await simulate(trace, newPolicyOptions);
console.log(`Replayed ${result.summary.total} calls.`);
console.log(`New policy would have blocked ${result.blocked.length} of them.`);
for (const { toolCall, decision } of result.blocked) {
console.log(` ${toolCall.toolName}: ${decision.reason}`);
}
How It Works¶
simulate()iterates over the trace array sequentially.- For each
RecordedToolCall, it constructs aPolicyContextwithdryRun: trueand the call'suserAttributes(defaulting to{}). - It calls the internal
evaluatePolicy()function with the context, theGuardOptions, and any per-tool config fromtoolConfigs. - The resulting
DecisionRecordis collected. If the verdict is notallow, the call is also added to theblockedarray. - After processing all calls, the summary counts are computed from the collected
decisionsarray.
No approval handlers, rate limiters, injection detectors, or output filters run during simulation. Only the policy engine is invoked.