specification/v0_9/eval/README.md
This directory contains a framework for evaluating A2UI (v0.9) against various LLMs.
This version embeds the JSON schemas directly into the prompt and instructs the LLM to output a JSON object within a markdown code block. The framework then extracts and validates this JSON.
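The extraction step described above can be sketched as follows. This is a minimal illustration, not the framework's actual code: the function name `extractJsonFromMarkdown` is hypothetical, and schema validation is left out because it depends on the A2UI schemas themselves.

```typescript
// Hypothetical helper (an assumption, not taken from the framework source).
// Builds the fence marker at runtime so this example stays a single code block.
const FENCE = "`".repeat(3);

// Matches the first fenced block, with an optional "json" language tag.
const FENCED_BLOCK = new RegExp(FENCE + "(?:json)?\\s*\\n([\\s\\S]*?)" + FENCE);

function extractJsonFromMarkdown(response: string): unknown {
  const match = response.match(FENCED_BLOCK);
  if (!match) {
    throw new Error("No fenced code block found in the model response");
  }
  // JSON.parse throws a SyntaxError if the block is not valid JSON.
  return JSON.parse(match[1]);
}
```

The parsed value would then be checked against the relevant JSON schema before being scored.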
To use the models, you need to set the following environment variables with your API keys:
GEMINI_API_KEY
OPENAI_API_KEY
ANTHROPIC_API_KEY

You can set these in a .env file in the root of the project, or in your shell's configuration file (e.g., .bashrc, .zshrc).
A .env.example file is provided as a template:
cp .env.example .env
# Edit .env with your API keys (do not commit .env)
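Before a long run it can help to confirm that the keys are actually set. The sketch below is one way to do that; the function `missingApiKeys` is hypothetical and is not part of the framework, and it assumes all three keys are wanted, which may not hold if you only run a single provider's models.

```typescript
// The three variable names come from this README; everything else is illustrative.
const REQUIRED_KEYS = ["GEMINI_API_KEY", "OPENAI_API_KEY", "ANTHROPIC_API_KEY"] as const;

// Returns the names of keys that are unset or blank in the given environment.
function missingApiKeys(env: Record<string, string | undefined>): string[] {
  return REQUIRED_KEYS.filter((key) => (env[key] ?? "").trim() === "");
}
```

In a Node.js script you would typically call `missingApiKeys(process.env)` after the .env file has been loaded and exit early if the result is non-empty.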
You also need to install dependencies before running:
pnpm install
To run the flow, use the following command:
pnpm run evalAll
You can run the script for a single model and data point by using the --model and --prompt command-line flags. This is useful for quick tests and debugging.
pnpm run eval --model=<model_name> --prompt=<prompt_name>
To run the test with the gemini-2.5-flash-lite model and the loginForm prompt, use the following command:
pnpm run eval --model=gemini-2.5-flash-lite --prompt=loginForm
By default, the script prints a progress bar and the final summary table to the console. Detailed logs are written to output.log in the results directory.
--log-level=<level>: Sets the console logging level (default: info). Options: error, warn, info, http, verbose, debug, silly. Note: the log file (output.log in the results directory) always captures debug-level logs regardless of this setting.
--results=<output_dir>: Preserves output files in the given directory (default: results/output-<model>, or results/output-combined if multiple models are specified). To specify a custom directory, use --results=my_results.
--clean-results: If set, cleans the results directory before running tests.
--runs-per-prompt=<number>: Number of times to run each prompt (default: 1).
--model=<model_name>: Run only the specified model(s) (default: all models). Can be specified multiple times.
--prompt=<prompt_name>: Run only the specified prompt (default: all prompts).

Run with debug output in console:
pnpm run eval -- --log-level=debug
Run 5 times per prompt and clean previous results:
pnpm run eval -- --runs-per-prompt=5 --clean-results
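The default for --results depends on how many models are selected. A minimal sketch of that rule is shown below; `defaultResultsDir` is a hypothetical name, and treating "no --model flag" (all models) the same as "multiple models" is an assumption rather than something this README states.

```typescript
// Illustrative only: mirrors the documented defaults for --results.
function defaultResultsDir(models: string[]): string {
  // Exactly one model: results/output-<model>.
  if (models.length === 1) {
    return `results/output-${models[0]}`;
  }
  // Several models (or, by assumption, none specified): a combined directory.
  return "results/output-combined";
}
```

For example, running with only --model=gemini-2.5-flash-lite would map to results/output-gemini-2.5-flash-lite under this rule.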
The framework includes a two-tiered rate limiting system:
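Proactive limiting: requests are throttled to stay within the configured rate limits (see src/models.ts).
Reactive pausing: requests are paused when a RESOURCE_EXHAUSTED (429) error is received, resuming only after the requested retry duration.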
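The reactive behaviour (pausing after a 429 and resuming once the retry duration has elapsed) can be sketched as a small circuit breaker. This is an illustration under stated assumptions, not the code in src/models.ts: the class name `ModelCircuitBreaker`, its methods, and the per-model granularity are all hypothetical.

```typescript
// Hypothetical sketch of a "pause until retry time" circuit breaker.
class ModelCircuitBreaker {
  // Maps a model name to the timestamp (ms) until which it is paused.
  private pausedUntil = new Map<string, number>();
  private now: () => number;

  // The clock is injectable so the behaviour can be tested deterministically.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Call when a RESOURCE_EXHAUSTED (429) error arrives with a retry duration.
  trip(model: string, retryAfterMs: number): void {
    this.pausedUntil.set(model, this.now() + retryAfterMs);
  }

  // True while the model is still inside its pause window.
  isPaused(model: string): boolean {
    const until = this.pausedUntil.get(model);
    return until !== undefined && this.now() < until;
  }
}
```

A caller would check `isPaused(model)` before each request and call `trip(model, retryAfterMs)` whenever a 429 response supplies a retry duration.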