.gemini/skills/behavioral-evals/references/creating.md
| Rig Type | Import From | Architecture | Use When |
|---|---|---|---|
| evalTest | ./test-helper.js | Subprocess. Runs the CLI in a separate process + waits for exit. | Standard workspace tests. Do not use setBreakpoint; auditing history (readToolLogs) is safer. |
| appEvalTest | ./app-test-helper.js | In-Process. Runs directly inside the runner loop. | UI/Ink rendering. Safe for setBreakpoint triggers. |
Evals must simulate realistic agent environments to effectively test decision-making.
- package.json for NodeJS environments.
- tsconfig.json, GEMINI.md).

Before asserting a new capability or locking in a fix, verify that the test fails first.
Verifies the agent intends to use a tool BEFORE executing it. Useful for interactive prompts or safety checks.
// ⚠️ Only works with appEvalTest (AppRig)
setup: async (rig) => {
rig.setBreakpoint(['ask_user']);
},
assert: async (rig) => {
const confirmation = await rig.waitForPendingConfirmation('ask_user');
expect(confirmation).toBeDefined();
}
When asserting multiple triggers (e.g., "enters plan mode then asks question"):
assert: async (rig) => {
let confirmation = await rig.waitForPendingConfirmation([
'enter_plan_mode',
'ask_user',
]);
if (confirmation?.toolName === 'enter_plan_mode') {
rig.acceptConfirmation('enter_plan_mode');
confirmation = await rig.waitForPendingConfirmation('ask_user');
}
expect(confirmation?.toolName).toBe('ask_user');
};
Audit exact operations to ensure efficiency (e.g., no redundant reads).
assert: async (rig, result) => {
await rig.waitForTelemetryReady();
const toolLogs = rig.readToolLogs();
const writeCall = toolLogs.find(
(log) => log.toolRequest.name === 'write_file',
);
expect(writeCall).toBeDefined();
};
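The "no redundant reads" audit mentioned above can be sketched as a small helper over the tool-log array. This is only an illustration under stated assumptions: the examples in this document show `toolRequest.name` and nothing else, so the `args` field (a JSON string), the `file_path` argument key, and the `read_file` tool name are all hypothetical here and should be checked against the real log shape.

```typescript
// Hypothetical log shape. Only `toolRequest.name` is confirmed by the examples;
// `args` as a JSON string is an assumption made for this sketch.
interface AuditedToolLog {
  toolRequest: { name: string; args: string };
}

// Returns the file paths that were read more than once.
function findRedundantReads(
  logs: AuditedToolLog[],
  readTool = 'read_file', // assumed tool name
): string[] {
  const counts = new Map<string, number>();
  for (const log of logs) {
    if (log.toolRequest.name !== readTool) continue;
    // Assumed argument layout: {"file_path": "..."}
    const parsed = JSON.parse(log.toolRequest.args) as { file_path?: string };
    const path = parsed.file_path ?? '<unknown>';
    counts.set(path, (counts.get(path) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .filter(([, count]) => count > 1)
    .map(([path]) => path);
}

// Sample data standing in for rig.readToolLogs().
const sampleAuditLogs: AuditedToolLog[] = [
  { toolRequest: { name: 'read_file', args: '{"file_path":"a.ts"}' } },
  { toolRequest: { name: 'read_file', args: '{"file_path":"a.ts"}' } },
  { toolRequest: { name: 'write_file', args: '{"file_path":"a.ts"}' } },
  { toolRequest: { name: 'read_file', args: '{"file_path":"b.ts"}' } },
];

const redundantReads = findRedundantReads(sampleAuditLogs);
```

Inside an `assert` hook, the result could be checked with something like `expect(findRedundantReads(toolLogs)).toEqual([])`, assuming the real logs expose the arguments in a comparable form.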
To evaluate tools connected via MCP without hitting live endpoints, load a mock
server configuration in the setup hook.
setup: async (rig) => {
rig.addMockMcpServer('workspace-server', 'google-workspace');
},
assert: async (rig) => {
await rig.waitForTelemetryReady();
const toolLogs = rig.readToolLogs();
const workspaceCall = toolLogs.find(
(log) => log.toolRequest.name === 'mcp_workspace-server_docs.getText'
);
expect(workspaceCall).toBeDefined();
};
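The logged name in the example above looks like it is composed from the mock server name and the tool name. A minimal sketch of that lookup follows, assuming the pattern `mcp_<server>_<tool>`. That pattern is inferred only from the single string `'mcp_workspace-server_docs.getText'` and is not a confirmed naming rule; `mcpToolLogName` and `findMcpCall` are hypothetical helpers.

```typescript
// Minimal log shape used by the examples in this document.
interface McpToolLog {
  toolRequest: { name: string };
}

// Assumed convention, inferred from one example string only:
// "mcp_" + server name + "_" + tool name.
function mcpToolLogName(serverName: string, toolName: string): string {
  return `mcp_${serverName}_${toolName}`;
}

// Finds the first logged call to the given tool on the given mock server.
function findMcpCall(
  logs: McpToolLog[],
  serverName: string,
  toolName: string,
): McpToolLog | undefined {
  const expected = mcpToolLogName(serverName, toolName);
  return logs.find((log) => log.toolRequest.name === expected);
}

// Sample data standing in for rig.readToolLogs().
const sampleMcpLogs: McpToolLog[] = [
  { toolRequest: { name: 'read_file' } },
  { toolRequest: { name: 'mcp_workspace-server_docs.getText' } },
];

const workspaceCallSketch = findMcpCall(
  sampleMcpLogs,
  'workspace-server',
  'docs.getText',
);
```

Keeping the name construction in one place means a change in the prefix convention only needs to be fixed once across evals.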
Breakpoints (setBreakpoint) pause execution. In standard evalTest,
rig.run() waits for the process to exit before assertions run, so a process
paused at a breakpoint never exits and the test hangs indefinitely.
Use breakpoints only with appEvalTest or interactive simulations.

Always set a budget boundary in the EvalCase so that a runaway loop cannot
burn through quota:
evalTest('USUALLY_PASSES', {
name: '...',
timeout: 60000, // 1 minute safety limit
// ...
});
Check if a tool is called early using index checks:
assert: async (rig) => {
const toolLogs = rig.readToolLogs();
const toolCallIndex = toolLogs.findIndex(
(log) => log.toolRequest.name === 'cli_help',
);
expect(toolCallIndex).toBeGreaterThan(-1);
expect(toolCallIndex).toBeLessThan(5); // Among the first 5 logged tool calls
};
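The same index technique extends to relative ordering, for example checking that one tool was called before another. The sketch below is self-contained and uses sample data; `calledBefore` is a hypothetical helper, not part of the rig, and it compares only the first occurrence of each tool.

```typescript
// Minimal log shape used by the examples in this document.
interface OrderedToolLog {
  toolRequest: { name: string };
}

// True when both tools were called and the first call to `first`
// precedes the first call to `second`.
function calledBefore(
  logs: OrderedToolLog[],
  first: string,
  second: string,
): boolean {
  const firstIndex = logs.findIndex((log) => log.toolRequest.name === first);
  const secondIndex = logs.findIndex((log) => log.toolRequest.name === second);
  return firstIndex > -1 && secondIndex > -1 && firstIndex < secondIndex;
}

// Sample data standing in for rig.readToolLogs().
const sampleOrderLogs: OrderedToolLog[] = [
  { toolRequest: { name: 'cli_help' } },
  { toolRequest: { name: 'write_file' } },
];

const helpCameFirst = calledBefore(sampleOrderLogs, 'cli_help', 'write_file');
```

In an `assert` hook this could be used as `expect(calledBefore(toolLogs, 'cli_help', 'write_file')).toBe(true)`.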