benchmark/deepplanning/shoppingplanning/README.md
This domain can be run as part of the unified benchmark or independently.
Note: The unified environment is set up in the project root directory.
# Navigate to project root (if you're in shoppingplanning/)
cd ..
# Create a new conda environment (recommended Python 3.10)
conda create -n deepplanning python=3.10 -y
# Activate the environment
conda activate deepplanning
# Install all required packages from the unified requirements.txt
pip install -r requirements.txt
# Return to shoppingplanning directory
cd shoppingplanning
Required Files:
database_zip/database_level1.tar.gz - Level 1 shopping database
database_zip/database_level2.tar.gz - Level 2 shopping database
database_zip/database_level3.tar.gz - Level 3 shopping database
Download from: HuggingFace Dataset
First, download the required data files from HuggingFace and place them in the project:
shoppingplanning/database_zip/: put database_level1.tar.gz, database_level2.tar.gz, and database_level3.tar.gz here.
After downloading, extract the compressed shopping databases:
# Extract database files for all levels
cd database_zip
tar -xzf database_level1.tar.gz -C ..
tar -xzf database_level2.tar.gz -C ..
tar -xzf database_level3.tar.gz -C ..
cd ..
Note: Model configuration is shared across all domains and located in the project root.
Edit models_config.json in the project root directory (one level up from shoppingplanning/):
{
"models": {
"qwen-plus": {
"model_name": "qwen-plus",
"model_type": "openai",
"base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
"api_key_env": "DASHSCOPE_API_KEY",
"temperature": 0.0
},
"gpt-4o-2024-11-20": {
"model_name": "gpt-4o-2024-11-20",
"model_type": "openai",
"base_url": "https://api.openai.com/v1",
"api_key_env": "OPENAI_API_KEY",
"temperature": 0.0
}
}
}
Supported Model Types:
openai: OpenAI and compatible models (GPT-4, Qwen, DeepSeek, etc.)
Note: API keys are configured in the project root directory.
Create a .env file in the project root directory or set environment variables:
# Option 1: Create .env file in project root
# Navigate to project root
cd ..
cp .env.example .env
# Edit .env and add your API keys
# Option 2: Set environment variables directly
export DASHSCOPE_API_KEY="your_dashscope_api_key"
export OPENAI_API_KEY="your_openai_api_key"
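Before launching a run, it can help to confirm that every model entry refers to an API key that is actually set. The following is a minimal sketch, assuming models_config.json has the shape shown above. `load_config` and `missing_api_keys` are hypothetical helpers written for illustration; they are not part of the benchmark code.

```python
import json
import os


def load_config(path="models_config.json"):
    """Read the shared model configuration (path assumed relative to the project root)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def missing_api_keys(config, env=None):
    """Return the names of models whose api_key_env variable is unset or empty."""
    env = os.environ if env is None else env
    missing = []
    for name, spec in config.get("models", {}).items():
        key_var = spec.get("api_key_env")
        # A missing api_key_env field is treated the same as an unset variable.
        if not key_var or not env.get(key_var):
            missing.append(name)
    return missing
```

Calling `missing_api_keys(load_config())` from the project root returns an empty list when every key is available. This sketch only inspects the process environment, so keys that exist solely in a .env file will be reported as missing unless that file has been loaded or exported first.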
Set environment variables to configure the run:
SHOPPING_AGENT_MODEL="qwen-plus" \
SHOPPING_LEVELS="1 2 3" \
SHOPPING_WORKERS=50 \
SHOPPING_MAX_LLM_CALLS=400 \
bash run.sh
Available Environment Variables:
SHOPPING_AGENT_MODEL: Model name(s) from models_config.json (space-separated for multiple models)
SHOPPING_LEVELS: Levels to run (space-separated, e.g., "1 2 3")
SHOPPING_WORKERS: Number of parallel workers
SHOPPING_MAX_LLM_CALLS: Maximum LLM calls per sample
Or edit default values in run.sh for permanent changes:
Find and modify these lines in run.sh (change the values after the last :-):
TEST_LEVELS="${BENCHMARK_LEVELS:-${SHOPPING_LEVELS:-1 2 3}}" # Change levels
WORKERS="${BENCHMARK_WORKERS:-${SHOPPING_WORKERS:-50}}" # Change workers
MAX_LLM_CALLS="${BENCHMARK_MAX_LLM_CALLS:-${SHOPPING_MAX_LLM_CALLS:-400}}" # Change max LLM calls
SHOPPING_AGENT_MODEL="${BENCHMARK_MODEL:-${SHOPPING_AGENT_MODEL:-qwen-plus}}" # Change model
Then simply run:
bash run.sh
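The nested `${A:-${B:-default}}` expansions in run.sh define a precedence order: the BENCHMARK_* variable wins, then the SHOPPING_* variable, then the hard-coded default. As a rough illustration of that lookup order (not code from the repository), the same logic can be sketched in Python. `resolve_setting` is a hypothetical name.

```python
import os


def resolve_setting(benchmark_var, shopping_var, default, env=None):
    """Mimic ${BENCHMARK_X:-${SHOPPING_X:-default}} from run.sh.

    Like the shell ':-' operator, a variable that is unset OR empty
    falls through to the next candidate.
    """
    env = os.environ if env is None else env
    return env.get(benchmark_var) or env.get(shopping_var) or default


# Example lookups with the variable names and defaults used in run.sh:
workers = resolve_setting("BENCHMARK_WORKERS", "SHOPPING_WORKERS", "50")
model = resolve_setting("BENCHMARK_MODEL", "SHOPPING_AGENT_MODEL", "qwen-plus")
```

In practice this means a SHOPPING_* value set on the command line is ignored whenever the matching BENCHMARK_* variable is also set and non-empty.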
How it works:
Each run works in its own database directory (e.g., database_run_qwen-plus_level1_20250105143022_12345/). This allows multiple concurrent runs without interference.
Results are archived to database_infered/ after completion.
Evaluation reports are saved to result_report/ for each level (reports are always saved, even if the model is invalid).
Overall statistics are written to result_report/{model_name}_statistics.json.
Note on concurrent runs: Each run uses an isolated database directory, so you can safely run multiple benchmarks simultaneously (e.g., testing different models in parallel).
The benchmark runs in two main stages:
What it does:
Reads the test cases from data/level_{level}_query_meta.json
Writes each case's results to database/case_{id}/
Output:
database/
├── case_0/
│   ├── messages.json           # Agent execution traces
│   ├── cart.json               # Final shopping cart
│   └── validation_cases.json   # Ground truth
├── case_1/
│   └── ...
└── ...
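Given the layout above, a quick sanity check after inference is to confirm that every case directory contains the three listed files. The sketch below assumes only the directory structure shown here; `find_cases_missing_files` is a hypothetical helper and says nothing about the contents of the JSON files.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = ("messages.json", "cart.json", "validation_cases.json")


def find_cases_missing_files(database_dir):
    """Map each case_* directory name to the expected files it lacks."""
    problems = {}
    for case_dir in sorted(Path(database_dir).glob("case_*")):
        if not case_dir.is_dir():
            continue
        missing = [name for name in EXPECTED_FILES if not (case_dir / name).is_file()]
        if missing:
            problems[case_dir.name] = missing
    return problems
```

An empty dictionary means every case directory has all three files. Anything else points at cases worth inspecting before running evaluation.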
What it does:
Output:
result_report/database_{MODEL}_level{LEVEL}_{TIMESTAMP}/
├── summary_report.json      # Overall metrics and statistics
├── case_0_report.json       # Individual case detailed reports
├── case_1_report.json
└── ...                      # One report file per case
After running all levels for a model, the script automatically calculates overall statistics across all levels. This provides a comprehensive view of model performance across different difficulty levels.
# View overall statistics for a model
cat result_report/{MODEL}_statistics.json
Example Output:
{
"model_name": "qwen-plus",
"statistics_time": "2026-01-05T12:30:45.123456",
"levels": {
"level_1": {
"folder_name": "database_qwen-plus_level1_202601051200",
"total_cases": 50,
"successful_cases": 45,
"failed_cases": 5,
"total_matched_products": 200,
"total_expected_products": 210,
"total_extra_products": 10,
"average_case_score": 0.90,
"overall_match_rate": 0.952,
"incomplete_cases": 0,
"incomplete_rate": 0.0,
"valid": true
},
"level_2": {
"folder_name": "database_qwen-plus_level2_202601051300",
"total_cases": 50,
"successful_cases": 30,
"failed_cases": 20,
"total_matched_products": 150,
"total_expected_products": 180,
"total_extra_products": 25,
"average_case_score": 0.60,
"overall_match_rate": 0.833,
"incomplete_cases": 2,
"incomplete_rate": 0.04,
"valid": true
},
"level_3": {
"folder_name": "database_qwen-plus_level3_202601051400",
"total_cases": 50,
"successful_cases": 20,
"failed_cases": 30,
"total_matched_products": 100,
"total_expected_products": 200,
"total_extra_products": 40,
"average_case_score": 0.40,
"overall_match_rate": 0.500,
"incomplete_cases": 5,
"incomplete_rate": 0.10,
"valid": true
}
},
"total": {
"total_cases": 150,
"successful_cases": 95,
"failed_cases": 55,
"total_matched_products": 450,
"total_expected_products": 590,
"total_extra_products": 75,
"successful_rate": 0.6333,
"match_rate": 0.7627,
"weighted_average_case_score": 0.6333,
"incomplete_cases": 7,
"incomplete_rate": 0.0467,
"valid": true,
"levels_completed": [1, 2, 3]
}
}
Key Metrics Explained:
successful_rate: Overall percentage of cases that achieved perfect scores (all products and coupons matched)
match_rate: Overall percentage of expected products that were correctly matched. This is the main metric reported in the paper.
weighted_average_case_score: Average case score weighted by the number of cases in each level. This is the main metric reported in the paper.
levels_completed: List of levels included in the statistics
valid: Whether the model is considered valid (incomplete_rate ≤ 10% for all levels)
Note: Evaluation reports are always saved regardless of the valid status. This allows for debugging and analysis even when a model has high incomplete rates (e.g., due to early termination or errors). The valid flag in the report indicates whether the results should be considered reliable for benchmarking.
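To make the metric definitions concrete, the totals in the example output can be recomputed from the per-level entries. This is a minimal sketch based only on the field names and descriptions above; `aggregate_levels` is a hypothetical function, and the benchmark's own aggregation script may differ in details such as rounding.

```python
def aggregate_levels(levels, max_incomplete_rate=0.10):
    """Combine per-level statistics into overall metrics.

    `levels` maps a level name to a dict with the fields shown in the
    example statistics file.
    """
    stats = list(levels.values())
    total_cases = sum(s["total_cases"] for s in stats)
    successful = sum(s["successful_cases"] for s in stats)
    matched = sum(s["total_matched_products"] for s in stats)
    expected = sum(s["total_expected_products"] for s in stats)
    # Each level's average case score is weighted by its number of cases.
    weighted = sum(s["average_case_score"] * s["total_cases"] for s in stats)
    return {
        "successful_rate": round(successful / total_cases, 4) if total_cases else 0.0,
        "match_rate": round(matched / expected, 4) if expected else 0.0,
        "weighted_average_case_score": round(weighted / total_cases, 4) if total_cases else 0.0,
        # Valid only if every level stays within the incomplete-rate limit.
        "valid": all(s["incomplete_rate"] <= max_incomplete_rate for s in stats),
    }
```

With the three levels from the example (95 successful cases out of 150, 450 matched products out of 590 expected), this reproduces the reported successful_rate of 0.6333 and match_rate of 0.7627.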
# View the summary report for a single level
cat result_report/database_{MODEL}_level{LEVEL}_{TIMESTAMP}/summary_report.json
Example Output:
{
"evaluation_time": "2026-01-04T12:09:18.522300",
"overall_statistics": {
"total_cases": 50,
"successful_cases": 11,
"failed_cases": 39,
"average_score": 0.22,
"average_case_score": 0.22,
"max_score": 1.0,
"min_score": 0.0,
"total_matched_products": 152,
"total_expected_products": 215,
"total_extra_products": 54,
"overall_match_rate": 0.707,
"incomplete_cases": 0,
"incomplete_rate": 0.0,
"valid": true
},
"case_results": [
{
"case_name": "case_1",
"success": false,
"score": 0.8,
"matched_count": 4,
"expected_count": 5,
"extra_products_count": 1,
"case_score": 0.0,
"is_completed": true
}
],
"detailed_results": [...]
}
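The overall_statistics block can be cross-checked against the case_results list. The sketch below is an assumption about how the fields relate, inferred from the example numbers (11 successful cases out of 50 gives 0.22, and 152 matched out of 215 expected gives about 0.707). It is not the evaluator's actual code, and `summarize_cases` is a hypothetical name.

```python
def summarize_cases(case_results):
    """Recompute headline statistics from a list of case_results entries."""
    total = len(case_results)
    successful = sum(1 for c in case_results if c["success"])
    matched = sum(c["matched_count"] for c in case_results)
    expected = sum(c["expected_count"] for c in case_results)
    return {
        "total_cases": total,
        "successful_cases": successful,
        "failed_cases": total - successful,
        # In the example, case_score is 0.0 for a case that is not fully successful.
        "average_case_score": sum(c["case_score"] for c in case_results) / total if total else 0.0,
        "overall_match_rate": matched / expected if expected else 0.0,
        "incomplete_cases": sum(1 for c in case_results if not c["is_completed"]),
    }
```

Running this on the case_results loaded from summary_report.json and comparing the output with overall_statistics is a quick way to spot a truncated or hand-edited report.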
# View detailed report for a specific case
cat result_report/database_{MODEL}_level{LEVEL}_{TIMESTAMP}/case_0_report.json
Example Case Report:
{
"case_name": "case_1",
"evaluation_time": "2026-01-04T12:09:18.174467",
"summary": {
"score": 0.8,
"matched_count": 4,
"expected_count": 5,
"extra_products_count": 1,
"coupon_score": 0.0
},
"query": "User shopping query...",
"matched_products": ["706395e1", "3b5b2e0e", ...],
"matched_coupons": [],
"ground_truth_coupons": [],
"unmatched_ground_truth_products": [...],
"extra_products": [...],
"ground_truth_products": [...]
}
Inference results are archived to database_infered/ after each model inference.
Evaluation reports are saved in result_report/.