# Benchmarking
The goose benchmarking system allows you to evaluate goose performance on complex tasks with one or more system configurations.

This guide covers how to use the `goose bench` command to run benchmarks and analyze results.
## Running a Benchmark

First, list the available evaluation selectors:

```bash
goose bench selectors
```
Next, generate a starter configuration file and inspect it:

```bash
goose bench init-config -n bench-config.json
cat bench-config.json
```
```json
{
  "models": [
    {
      "provider": "databricks",
      "name": "goose",
      "parallel_safe": true
    }
  ],
  "evals": [
    {
      "selector": "core",
      "parallel_safe": true
    }
  ],
  "repeat": 1
}
```

…etc. The full set of options is documented below.
Finally, run the benchmark:

```bash
goose bench run -c bench-config.json
```
## Configuration Options

The benchmark configuration is specified in a JSON file with the following structure:
```json
{
  "models": [
    {
      "provider": "databricks",
      "name": "goose",
      "parallel_safe": true,
      "tool_shim": {
        "use_tool_shim": false,
        "tool_shim_model": null
      }
    }
  ],
  "evals": [
    {
      "selector": "core",
      "post_process_cmd": null,
      "parallel_safe": true
    }
  ],
  "include_dirs": [],
  "repeat": 2,
  "run_id": null,
  "eval_result_filename": "eval-results.json",
  "run_summary_filename": "run-results-summary.json",
  "env_file": null
}
```
Each model entry in the `models` array specifies:

- `provider`: The model provider (e.g., `"databricks"`)
- `name`: Model identifier
- `parallel_safe`: Whether the model can be run in parallel
- `tool_shim`: Optional configuration for tool shimming
  - `use_tool_shim`: Enable/disable tool shimming
  - `tool_shim_model`: Optional model to use for tool shimming

Each evaluation entry in the `evals` array specifies:

- `selector`: The evaluation suite to run (e.g., `"core"`)
- `post_process_cmd`: Optional path to a post-processing script
- `parallel_safe`: Whether the evaluation can run in parallel

The remaining top-level options are:

- `include_dirs`: Additional directories to include in the evaluation
- `repeat`: Number of times to repeat each evaluation
- `run_id`: Optional identifier for the benchmark run
- `eval_result_filename`: Name of the evaluation results file
- `run_summary_filename`: Name of the summary results file
- `env_file`: Optional path to an environment file

The `include_dirs` parameter makes the items at all listed paths available to every evaluation.
It accomplishes this by placing a copy of each listed item in every evaluation's working directory, as reflected in the output hierarchy shown below.
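For example, a config that benchmarks two models against the `core` suite and repeats each evaluation three times might look like this (the second provider and model name are illustrative placeholders; every field shown is from the structure above):

```json
{
  "models": [
    {
      "provider": "databricks",
      "name": "goose",
      "parallel_safe": true
    },
    {
      "provider": "openai",
      "name": "gpt-4o",
      "parallel_safe": true
    }
  ],
  "evals": [
    {
      "selector": "core",
      "parallel_safe": true
    }
  ],
  "repeat": 3
}
```

Each selected evaluation is repeated `repeat` times per model, and each repetition gets its own `run-${i}` directory in the output hierarchy described below.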
## Customizing Runs

You can customize runs in several ways.

Attach a post-processing script to an evaluation:

```json
{
  "evals": [
    {
      "selector": "core",
      "post_process_cmd": "/path/to/process-script.sh",
      "parallel_safe": true
    }
  ]
}
```
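A minimal sketch of what such a script might look like, assuming it is invoked in the evaluation's working directory (check the goose source for the exact invocation contract; the `processed/` directory here is purely illustrative):

```bash
#!/usr/bin/env bash
# Illustrative post-processing script: collect any JSON result files
# produced by the evaluation into a processed/ subdirectory.
set -euo pipefail

mkdir -p processed
# Copy result files for later analysis; tolerate the case where none exist.
cp -v ./*.json processed/ 2>/dev/null || true
```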
Make additional data available to every evaluation:

```json
{
  "include_dirs": [
    "/path/to/custom/eval/data"
  ]
}
```
Load environment variables from a file:

```json
{
  "env_file": "/path/to/env-file"
}
```
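Assuming the standard `KEY=VALUE` dotenv format, an env file might hold provider credentials; the variable names and values below are illustrative placeholders:

```bash
# Illustrative env-file contents: credentials for the model provider.
DATABRICKS_HOST=https://example.cloud.databricks.com
DATABRICKS_TOKEN=dapi-xxxxxxxxxxxxxxxx
```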
## Output

The benchmark generates two main output files within a file hierarchy similar to the following. Results from running each model/provider pair are stored within their own directory:
```
benchmark-${datetime}/
    ${model}-${provider}[-tool-shim[-${shim-model}]]/
        run-${i}/
            ${an-include_dir-asset}
            run-results-summary.json
            core/developer/list_files/
                ${an-include_dir-asset}
                eval-results.json
```
- `eval-results.json`: Contains detailed results from each individual evaluation.
- `run-results-summary.json`: A collection of all eval results across all suites.
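To poke at the results from the shell, a generic `jq` query works without assuming anything about the files' schema (the glob below just follows the directory layout shown above):

```bash
# List the top-level keys of each run summary to see what it contains.
jq 'keys' benchmark-*/*/run-*/run-results-summary.json
```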
## Debug Logging

For detailed logging, you can enable debug mode:

```bash
RUST_LOG=debug goose bench run -c bench-config.json
```
## Tool Shimming

Tool shimming allows you to use models without native tool-calling support with goose, provided Ollama is installed on the system. See this guide for important details on tool shimming.
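To enable it, set the `tool_shim` options on a model entry. A minimal sketch, assuming an Ollama-hosted model (the provider and both model names below are illustrative placeholders; the `tool_shim` fields are from the configuration structure above):

```json
{
  "models": [
    {
      "provider": "ollama",
      "name": "deepseek-r1",
      "parallel_safe": true,
      "tool_shim": {
        "use_tool_shim": true,
        "tool_shim_model": "mistral-nemo"
      }
    }
  ]
}
```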