Tutorial: TensorZero Evaluations

examples/evaluations/tutorial/README.md

This directory contains the code for the TensorZero Evaluations Guide.

Getting Started

TensorZero

We provide a configuration file (./config/tensorzero.toml) that specifies:

A write_haiku function that generates a haiku, with gpt_4o and gpt_4o_mini variants.
Evaluators for the write_haiku function, including exact match and assorted LLM judges.
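
To see which functions and variants a configuration declares, a short script can parse the TOML file. The sketch below assumes the usual TensorZero layout of `[functions.<name>.variants.<variant>]` tables. The inline sample is illustrative rather than a copy of ./config/tensorzero.toml, and its `type` and `model` values are assumptions.

```python
try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport for Python 3.10

# Illustrative fragment only: the real ./config/tensorzero.toml has more fields
# (templates, evaluators, and so on) and its exact values may differ.
SAMPLE_CONFIG = """
[functions.write_haiku]
type = "chat"

[functions.write_haiku.variants.gpt_4o]
type = "chat_completion"
model = "openai::gpt-4o"

[functions.write_haiku.variants.gpt_4o_mini]
type = "chat_completion"
model = "openai::gpt-4o-mini"
"""


def list_variants(config_text: str) -> dict[str, list[str]]:
    """Map each function name to the sorted names of its variants."""
    config = tomllib.loads(config_text)
    functions = config.get("functions", {})
    return {
        name: sorted(spec.get("variants", {}))
        for name, spec in functions.items()
    }


print(list_variants(SAMPLE_CONFIG))
```

To inspect the real file, pass the contents of ./config/tensorzero.toml to `list_variants` in place of `SAMPLE_CONFIG`.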

Prerequisites

Install Docker.
Install Python 3.10+.
Install the Python dependencies. We recommend uv: run uv sync.
Generate an API key for OpenAI (OPENAI_API_KEY).
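
Before moving on, the tool prerequisites above can be checked with a few lines of Python. This is a minimal sketch using only the standard library. The function name `missing_prerequisites` is hypothetical, and it only verifies that the tools are on `PATH`, not that Docker is running or that the API key is valid.

```python
import shutil
import sys
from typing import Callable, Optional


def missing_prerequisites(
    version: tuple[int, int] = sys.version_info[:2],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return a human-readable list of prerequisites that look unmet."""
    problems = []
    if version < (3, 10):
        problems.append("Python 3.10+ is required")
    if which("docker") is None:
        problems.append("docker not found on PATH")
    if which("uv") is None:
        problems.append("uv not found on PATH (recommended, not required)")
    return problems


for problem in missing_prerequisites():
    print(problem)
```

The `version` and `which` parameters exist only so the check can be exercised without changing the machine. Calling the function with no arguments inspects the current interpreter and `PATH`.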

Setup

Create a .env file with the OPENAI_API_KEY environment variable (see .env.example for an example).
Run docker compose up to launch the TensorZero Gateway, the TensorZero UI, and a development ClickHouse database.
Run the main.py script to generate 100 haikus.

Evaluations

Create a Dataset

Let's generate a dataset composed of our 100 haikus.

Open the UI, navigate to "Datasets", and select "Build Dataset" (http://localhost:4000/datasets/builder).
Create a new dataset called haiku_dataset. Select your write_haiku function, "None" as the metric, and "Inference" as the dataset output.

Run an Evaluation — CLI

Let's evaluate our gpt_4o variant using the TensorZero Evaluations CLI tool.

Launch an evaluation with the CLI:

```bash
docker compose run --rm evaluations \
    --function-name write_haiku \
    --evaluator-names valid_haiku,metaphor_count,exact_match,compare_haikus \
    --dataset-name haiku_dataset \
    --variant-name gpt_4o \
    --concurrency 5
```
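
To run the same evaluation for several variants without retyping the command, it can be wrapped in a small Python script. The sketch below reuses only the flags shown above; the helper names `build_eval_command` and `run_evaluations` are hypothetical.

```python
import subprocess

EVALUATORS = ["valid_haiku", "metaphor_count", "exact_match", "compare_haikus"]


def build_eval_command(variant_name: str, concurrency: int = 5) -> list[str]:
    """Assemble the docker compose invocation shown above for one variant."""
    return [
        "docker", "compose", "run", "--rm", "evaluations",
        "--function-name", "write_haiku",
        "--evaluator-names", ",".join(EVALUATORS),
        "--dataset-name", "haiku_dataset",
        "--variant-name", variant_name,
        "--concurrency", str(concurrency),
    ]


def run_evaluations(variants: list[str]) -> None:
    """Run the evaluation once per variant, stopping at the first failure."""
    for variant in variants:
        # check=True raises CalledProcessError if the CLI exits with a non-zero status.
        subprocess.run(build_eval_command(variant), check=True)
```

For example, `run_evaluations(["gpt_4o", "gpt_4o_mini"])` would evaluate both variants in sequence. Passing the arguments as a list avoids shell quoting issues.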

Evaluate a Dataset — UI

Let's evaluate our gpt_4o_mini variant using the TensorZero Evaluations UI, and compare the results.

Navigate to "Evaluations" (http://localhost:4000/evaluations) and select "New Run".
Launch an evaluation with the gpt_4o_mini variant.
Select the previous evaluation run in the dropdown to compare the results.