# Benchmark Scripts

This directory contains scripts for running and analyzing Goose benchmarks.
## run-benchmarks.sh

This script runs Goose benchmarks across multiple provider:model pairs and analyzes the results.
### Prerequisites

- `jq` command-line tool for JSON processing (optional, but recommended for result analysis)

### Usage

```bash
./scripts/run-benchmarks.sh [options]
```
Options:

- `-p, --provider-models`: Comma-separated list of provider:model pairs (e.g., `openai:gpt-4o,anthropic:claude-sonnet-4`)
- `-s, --suites`: Comma-separated list of benchmark suites to run (e.g., `core,small_models`)
- `-o, --output-dir`: Directory to store benchmark results (default: `./benchmark-results`; see the sketch after the examples)
- `-d, --debug`: Use debug build instead of release build
- `-h, --help`: Show help message

### Examples

```bash
# Run with release build (default)
./scripts/run-benchmarks.sh --provider-models 'openai:gpt-4o,anthropic:claude-sonnet-4' --suites 'core,small_models'

# Run with debug build
./scripts/run-benchmarks.sh --provider-models 'openai:gpt-4o' --suites 'core' --debug
```
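Results go to `./benchmark-results` by default; `--output-dir` can point each run at its own directory instead, for example a dated one. A minimal sketch (the dated path is illustrative, not something the script requires):

```bash
# Keep each run's results in its own dated directory (directory name is illustrative)
./scripts/run-benchmarks.sh --provider-models 'anthropic:claude-sonnet-4' --suites 'core' \
  --output-dir "./benchmark-results/$(date +%Y-%m-%d)"
```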
### What it does

For each provider:model pair, the script sets the GOOSE_PROVIDER and GOOSE_MODEL environment variables, then runs and analyzes the requested benchmark suites.

### Output

The script creates the following files in the output directory:
- `summary.md`: A summary of all benchmark results
- `{provider}-{model}.json`: Raw JSON output from each benchmark run
- `{provider}-{model}-analysis.txt`: Analysis of each benchmark run

### Exit codes

- `0`: All benchmarks completed successfully
- `1`: One or more benchmarks failed
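Because failures surface through the exit status, the script can gate a CI step directly. A minimal sketch (the provider, suite, and message are illustrative):

```bash
# Fail the job when any benchmark fails; the summary file points at details
if ! ./scripts/run-benchmarks.sh --provider-models 'openai:gpt-4o' --suites 'core'; then
  echo "Benchmark failures detected; see ./benchmark-results/summary.md" >&2
  exit 1
fi
```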
## parse-benchmark-results.sh

This script analyzes a single benchmark JSON result file and identifies any failures.

### Usage

```bash
./scripts/parse-benchmark-results.sh path/to/benchmark-results.json
```
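run-benchmarks.sh already writes a `{provider}-{model}-analysis.txt` per run, but the raw JSON files can also be re-analyzed in bulk. A sketch, assuming the default output directory and the `{provider}-{model}.json` naming described above:

```bash
# Re-run the analysis over every raw result file (default output directory assumed)
for f in ./benchmark-results/*.json; do
  echo "== $f =="
  ./scripts/parse-benchmark-results.sh "$f"
done
```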
### Output

The script outputs an analysis of the benchmark results to stdout, including any failing metrics.

### Exit codes

- `0`: All metrics passed successfully
- `1`: One or more metrics failed
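With `jq` installed (see Prerequisites above), the raw JSON results can also be inspected by hand. A minimal sketch; the result schema isn't documented here, so start by listing the top-level keys (the file name is illustrative):

```bash
# Show the top-level structure of a raw result file before digging deeper
jq 'keys' ./benchmark-results/openai-gpt-4o.json

# Pretty-print the whole file once you know what you're looking for
jq '.' ./benchmark-results/openai-gpt-4o.json
```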