qwencoder-eval/tool_calling_eval/README.md
This guide provides instructions for evaluating the model on the Tau-Bench and Berkeley Function-Calling Leaderboard (BFCL-v3) benchmarks.
## Prerequisites

Our evaluation scripts use [LiteLLM](https://github.com/BerriAI/litellm) to interface with various model endpoints. Before you begin, ensure you have a LiteLLM server running that exposes an OpenAI-compatible API for the model you want to evaluate. For detailed setup instructions, refer to the [LiteLLM documentation](https://docs.litellm.ai/).
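If you do not already have such an endpoint, one common option is to put the LiteLLM proxy in front of an existing OpenAI-compatible deployment (e.g. vLLM). The command below is a minimal sketch; the model name, upstream `api_base`, and port are placeholder assumptions, not values from this repository:

```bash
# Minimal sketch: expose an existing OpenAI-compatible deployment through the
# LiteLLM proxy. Model name, api_base, and port are placeholders.
litellm --model openai/Qwen3-Coder-480B-A35B-Instruct \
        --api_base http://localhost:8000/v1 \
        --port 4000
```

With a setup like this, the environment variables in the next step would point at the proxy's address and whatever key the proxy is configured to accept.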
Once your API endpoint is active, export the following environment variables in your terminal:

```bash
export OPENAI_API_BASE="YOUR_API_BASE_URL"
export OPENAI_API_KEY="YOUR_API_KEY"
```
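As a quick sanity check before launching a full run, you can query the endpoint's model listing (this assumes your base URL includes the `/v1` prefix and the endpoint implements the standard OpenAI `models` route):

```bash
# Quick connectivity check against the OpenAI-compatible endpoint.
curl -s "$OPENAI_API_BASE/models" \
  -H "Authorization: Bearer $OPENAI_API_KEY"
```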
## Tau-Bench

Navigate to the `tau-bench` directory, create a virtual environment, and install the required packages:

```bash
cd tau-bench
uv venv
source .venv/bin/activate
uv pip install -e .
```
Execute the provided scripts to run the evaluation on the retail and airline domains:

```bash
# Evaluate the retail domain
bash retail-qwen3-coder.bash

# Evaluate the airline domain
bash airline-qwen3-coder.bash
```
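These wrapper scripts are thin shells around tau-bench's own runner. If you need to adjust parameters directly, an equivalent invocation might look like the sketch below; the flag names are assumptions based on the upstream tau-bench CLI, so confirm them with `python run.py --help` in your checkout:

```bash
# Sketch of a direct tau-bench invocation (flag names assumed from the
# upstream CLI; verify with `python run.py --help`).
python run.py \
  --env retail \
  --agent-strategy tool-calling \
  --model Qwen3-Coder-480B-A35B-Instruct \
  --model-provider openai \
  --user-model gpt-4o \
  --user-model-provider openai \
  --max-concurrency 8
```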
## BFCL-v3

The Berkeley Function-Calling Leaderboard (BFCL) assesses the model's function-calling capabilities.

Navigate to the `berkeley-function-call-leaderboard` directory, create a virtual environment, and install the dependencies:

```bash
cd berkeley-function-call-leaderboard
uv venv
source .venv/bin/activate
uv pip install -e .
```
First, generate the model's responses. Then, run the evaluation script to score the results:

```bash
# Generate model responses for Qwen3-Coder
bfcl generate --model Qwen3-Coder-480B-A35B-Instruct --num-threads 16

# Evaluate the results
bfcl evaluate --model Qwen3-Coder-480B-A35B-Instruct
```
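By default this covers every BFCL test category. When debugging your endpoint it can be faster to score a subset first; the sketch below assumes the `--test-category` flag from the upstream BFCL CLI (run `bfcl generate --help` to confirm):

```bash
# Generate and score only the "simple" category (assumed --test-category flag).
bfcl generate --model Qwen3-Coder-480B-A35B-Instruct --test-category simple --num-threads 16
bfcl evaluate --model Qwen3-Coder-480B-A35B-Instruct --test-category simple
```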
For more details on the benchmarks, please visit their official repositories:

- Tau-Bench: https://github.com/sierra-research/tau-bench
- BFCL: https://github.com/ShishirPatil/gorilla/tree/main/berkeley-function-call-leaderboard