examples/babyai/README.md
BabyAI is a grid world environment designed to test the sample efficiency of grounded language acquisition.
Each task is described in natural language (e.g. "put the red ball next to the blue ball").
To complete a task, the agent must execute a sequence of actions given partial observations of the environment.
The available actions are: go forward, turn right, turn left, pick up, drop, and toggle.
An example observation:

```
you carry a yellow ball
a wall 2 steps right
a red ball 1 step forward
```
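Since the model replies in free-form text, the agent loop has to map each response back onto one of the six discrete actions. Here is a minimal sketch of such a parser (a hypothetical helper for illustration; it is not part of BALROG or this example's code):

```python
# Hypothetical helper: map a free-form LLM response to one of
# BabyAI's six discrete actions. Not part of BALROG or TensorZero.
VALID_ACTIONS = {"go forward", "turn right", "turn left", "pick up", "drop", "toggle"}

def parse_action(response: str) -> str:
    """Return the valid action that appears earliest in the response."""
    text = response.lower()
    # Record the position of every valid action mentioned in the text.
    hits = [(text.find(action), action) for action in VALID_ACTIONS if action in text]
    if not hits:
        raise ValueError(f"No valid action found in: {response!r}")
    # Pick the action with the smallest (earliest) index.
    return min(hits)[1]
```

A parser like this keeps malformed responses from crashing an episode: an unparseable reply can be caught and retried instead of being sent to the environment.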
We use the BALROG agentic LLM benchmark's implementation of the BabyAI environment to demonstrate how you can use TensorZero to build an LLM application that solves such tasks.
We provide a TensorZero configuration file to tackle BALROG's BabyAI benchmark.
Our setup implements the function `act` with multiple variants (e.g. `baseline`, `reasoning`, `history`).
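As a rough sketch, such a function might be declared in `tensorzero.toml` along the following lines. The variant names come from this example, but the model, template paths, and exact fields below are illustrative assumptions, not the actual configuration shipped with the example:

```toml
# Illustrative sketch only -- the example's real tensorzero.toml may differ.
[functions.act]
type = "chat"

[functions.act.variants.baseline]
type = "chat_completion"
model = "openai::gpt-4o-mini"
system_template = "functions/act/baseline/system_template.minijinja"

[functions.act.variants.history_and_reasoning]
type = "chat_completion"
model = "openai::gpt-4o-mini"
system_template = "functions/act/history_and_reasoning/system_template.minijinja"
```

Defining one function with several variants lets you A/B test prompting strategies against the same benchmark while TensorZero logs every inference for later fine-tuning.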
To run this example:

1. Install `cmake` (e.g. `brew install cmake` or `sudo apt-get install cmake`).
2. Install the Python dependencies with `uv`: `uv sync`
3. Set the `OPENAI_API_KEY` environment variable.
4. Run `docker compose up` to launch the TensorZero Gateway, the TensorZero UI, and a development ClickHouse database.
5. Run the `babyai.ipynb` Jupyter notebook.

Here are our results showing the success rate, episode return, episode length, and input tokens. We find that the `history_and_reasoning` variant performs best and that using our fine-tuning recipe can improve its performance.
The notebook will evaluate the performance of multiple variants that use the `gpt-4o-mini` model.
You'll notice that adding history to the prompt improves performance.
Later, you can use a fine-tuning recipe to improve the performance of the `history_and_reasoning` variant.
The simplest way to fine-tune a model is to use the TensorZero UI (available at http://localhost:4000).
The fine-tuned variant achieves a materially higher success rate and episode return.