benchmark/deepplanning/travelplanning/README.md
This domain can be run as part of the unified benchmark or independently.
Note: The unified environment is set up in the project root directory.
```bash
# Navigate to project root (if you're in travelplanning/)
cd ..

# Create a new conda environment (Python 3.10 recommended)
conda create -n deepplanning python=3.10 -y

# Activate the environment
conda activate deepplanning

# Install all required packages from the unified requirements.txt
pip install -r requirements.txt

# Return to the travelplanning directory
cd travelplanning
```
Required Files (download from the HuggingFace Dataset):

- `database/database_zh.zip` - Chinese database
- `database/database_en.zip` - English database

First, download both archives from HuggingFace and place them in `travelplanning/database/`. Then extract the compressed travel databases:
```bash
# Navigate to the database directory
cd database

# Extract both language databases
unzip database_zh.zip   # Chinese database (flights, hotels, restaurants, attractions)
unzip database_en.zip   # English database

# Return to the travelplanning directory
cd ..
```
Note: Model configuration is shared across all domains and located in the project root.
Edit `models_config.json` in the project root directory (one level up from `travelplanning/`):
```json
{
  "models": {
    "qwen-plus": {
      "model_name": "qwen-plus",
      "model_type": "openai",
      "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
      "api_key_env": "DASHSCOPE_API_KEY"
    },
    "gpt-4o-2024-11-20": {
      "model_name": "gpt-4o-2024-11-20",
      "model_type": "openai",
      "base_url": "https://api.openai.com/v1",
      "api_key_env": "OPENAI_API_KEY"
    }
  }
}
```
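As an illustrative sketch (not the repository's actual config loader), resolving a model entry from this file and pulling its API key from the environment might look like this; the function name `resolve_model` is hypothetical:

```python
import json
import os


def resolve_model(config_path: str, name: str) -> dict:
    """Look up a model entry and resolve its API key from the environment.

    Illustrative only; the benchmark's real loader may differ.
    """
    with open(config_path) as f:
        config = json.load(f)
    entry = config["models"][name]
    api_key = os.environ.get(entry["api_key_env"])
    if not api_key:
        raise RuntimeError(
            f"Set {entry['api_key_env']} before running model {name!r}"
        )
    return {**entry, "api_key": api_key}
```

This mirrors the contract implied by the config: each entry names an `api_key_env` variable rather than storing the key itself, so keys never live in the JSON file.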
Important Note about qwen-plus:

The `qwen-plus` configuration is required because it is used by default in the conversion stage (`evaluation/convert_report.py`) to parse and format agent-generated travel plans. To use a different model for conversion, change the `conversion_model` variable in `evaluation/convert_report.py`.

Supported Model Types:

- `openai`: OpenAI and OpenAI-compatible models (GPT-4, Qwen, DeepSeek, etc.)

Note: API keys are configured in the project root directory.
Create a .env file in the project root directory or set environment variables:
```bash
# Option 1: Create a .env file in the project root
cd ..
cp .env.example .env
# Edit .env and add your API keys

# Option 2: Set environment variables directly
export DASHSCOPE_API_KEY="your_dashscope_api_key"
export OPENAI_API_KEY="your_openai_api_key"
```
Set environment variables to configure the run:
```bash
BENCHMARK_MODEL="qwen-plus" \
BENCHMARK_LANGUAGE="" \
BENCHMARK_WORKERS=10 \
BENCHMARK_MAX_LLM_CALLS=400 \
BENCHMARK_START_FROM="inference" \
BENCHMARK_OUTPUT_DIR="" \
bash run.sh
```
Available Environment Variables:

- `BENCHMARK_MODEL`: Model name from `models_config.json`
- `BENCHMARK_LANGUAGE`: Language version (`zh`, `en`, or empty for both)
- `BENCHMARK_WORKERS`: Number of parallel workers
- `BENCHMARK_MAX_LLM_CALLS`: Maximum LLM calls per task
- `BENCHMARK_START_FROM`: Start point (`inference`, `conversion`, `evaluation`)
- `BENCHMARK_OUTPUT_DIR`: Custom output directory
- `BENCHMARK_VERBOSE`: Enable verbose output (`true`/`false`)
- `BENCHMARK_DEBUG`: Enable debug mode (`true`/`false`)

Or edit the default values in `run.sh` for permanent changes:
Find and modify these lines in `run.sh`. The `${VAR:-default}` syntax uses the environment variable if it is set and non-empty, otherwise the value after `:-`, so change the value after `:-` to set a new default:

```bash
MODEL="${BENCHMARK_MODEL:-${TRAVEL_AGENT_MODEL:-qwen-plus}}"  # Change qwen-plus
LANGUAGE="${BENCHMARK_LANGUAGE:-zh}"                          # Change zh
WORKERS="${BENCHMARK_WORKERS:-40}"                            # Change 40
MAX_LLM_CALLS="${BENCHMARK_MAX_LLM_CALLS:-400}"               # Change 400
START_FROM="${BENCHMARK_START_FROM:-inference}"               # Change inference
OUTPUT_DIR="${BENCHMARK_OUTPUT_DIR:-}"                        # Set custom path
```
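For intuition, bash's `${VAR:-default}` fallback behaves like an environment lookup that also treats an empty value as unset. A minimal Python equivalent (illustrative only, not part of the benchmark; `env_or` is a made-up helper name):

```python
import os


def env_or(name: str, default: str) -> str:
    """Mimic bash's ${NAME:-default}: use the env var if set and non-empty."""
    value = os.environ.get(name, "")
    return value if value else default


# e.g., the default worker count unless BENCHMARK_WORKERS overrides it
workers = env_or("BENCHMARK_WORKERS", "40")
```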
Then simply run:
```bash
bash run.sh
```
Smart Caching & Resume Functionality:

When using `run.sh` with `START_FROM="inference"`, the script automatically:

1. Scans the `reports/` folder to find missing report files (e.g., `id_0_report.txt`, `id_1_report.txt`, etc.)
2. Scans the `converted_plans/` folder to find missing converted plan files (e.g., `id_0_converted.json`, `id_1_converted.json`, etc.)
3. Re-runs inference and conversion only for the missing tasks

This allows you to safely interrupt and resume long-running evaluations without losing progress.
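A minimal sketch of how this kind of missing-task detection can work (illustrative; the script's actual logic may differ):

```python
import os
import re


def missing_task_ids(results_dir: str, total_tasks: int) -> list:
    """Return task ids that have no id_<n>_report.txt under results_dir/reports/."""
    reports_dir = os.path.join(results_dir, "reports")
    done = set()
    if os.path.isdir(reports_dir):
        for fname in os.listdir(reports_dir):
            m = re.fullmatch(r"id_(\d+)_report\.txt", fname)
            if m:
                done.add(int(m.group(1)))
    return [i for i in range(total_tasks) if i not in done]
```

Because completion is inferred purely from files on disk, an interrupted run picks up exactly where it left off on the next invocation.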
The benchmark runs in three stages:

**Stage 1: Inference**

What it does:

- Reads travel queries from `data/travelplanning_query_{lang}.json`
- Runs the agent to generate travel plans

Output:

```
results/{model}_{lang}/
├── trajectories/          # Agent execution traces
│   └── id_0_trajectory.json
└── reports/               # Human-readable reports
    └── id_0_report.txt
```
**Stage 2: Conversion**

What it does:

- Uses an LLM (`qwen-plus` by default, configurable) to convert plans into structured JSON
- Saves the results to the `converted_plans/` directory

Why conversion is needed: The agent generates human-readable plans in Markdown format, but the evaluation code requires structured JSON data to automatically score compliance with constraints and calculate metrics.

Output:

```
results/{model}_{lang}/
└── converted_plans/       # Structured travel plans
    └── id_0_converted.json
```
**Stage 3: Evaluation**

What it does:

- Scores each converted plan against commonsense and personalized constraints, then aggregates overall metrics

Output:

```
results/{model}_{lang}/
└── evaluation/
    ├── evaluation_summary.json   # Overall metrics and statistics
    ├── id_0_score.json           # Individual task scores
    ├── id_1_score.json
    └── ...                       # One score file per task
```
```bash
cat results/{model}_{lang}/evaluation/evaluation_summary.json
```
Example Output:

```json
{
  "total_test_samples": 120,
  "evaluation_success_count": 115,
  "metrics": {
    "delivery_rate": 0.958,
    "commonsense_score": 0.875,
    "personalized_score": 0.742,
    "composite_score": 0.809,
    "case_acc": 0.683
  }
}
```
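In this example, `delivery_rate` matches `evaluation_success_count / total_test_samples` (115/120 ≈ 0.958). A small helper for loading a summary and pulling out the headline numbers might look like this (illustrative sketch, not shipped with the benchmark; `summarize` is a hypothetical name):

```python
import json


def summarize(path: str) -> dict:
    """Load an evaluation summary and return the headline numbers."""
    with open(path) as f:
        summary = json.load(f)
    total = summary["total_test_samples"]
    success = summary["evaluation_success_count"]
    return {
        # Fraction of tasks that produced an evaluable plan
        "success_rate": round(success / total, 3),
        **summary["metrics"],
    }
```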
```bash
# View detailed score for a specific task
cat results/{model}_{lang}/evaluation/id_0_score.json

# View human-readable report for a specific task
cat results/{model}_{lang}/reports/id_0_report.txt
```
The summary includes error statistics showing common failure patterns:
```json
"error_statistics": [
  {
    "rank": 1,
    "error_type": "[Hard] train_seat_status",
    "count": 15,
    "affected_samples": ["0", "12", "25", ...]
  }
]
```
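If you want to recompute such a tally yourself from the per-task score files, a sketch along these lines works, assuming each `id_N_score.json` lists its violated constraints under an `"errors"` key (a hypothetical field name for illustration; adapt it to the actual score schema):

```python
import glob
import json
import os
from collections import Counter


def tally_errors(eval_dir: str) -> list:
    """Count error types across id_*_score.json files, most common first.

    Assumes each score file stores violated constraints under an
    "errors" key; change the key to match the real score schema.
    """
    counts = Counter()
    for path in glob.glob(os.path.join(eval_dir, "id_*_score.json")):
        with open(path) as f:
            score = json.load(f)
        counts.update(score.get("errors", []))
    return counts.most_common()
```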