# SWE-bench Official Integration
This is the official SWE-bench integration for creating verified submissions to the SWE-bench leaderboard. SWE-bench is a benchmark that evaluates large language models on real-world software engineering tasks drawn from GitHub issues.
## Quick Start

```bash
# Install the official SWE-bench package and the Hugging Face datasets library
pip install datasets swebench
```
```bash
# Download the SWE-bench Lite dataset
python download_swebench.py

# Set up the evaluation environment
python setup_evaluation.py
```
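`download_swebench.py` wraps the Hugging Face `datasets` API. As a rough sketch of what that step does (the `princeton-nlp/SWE-bench_Lite` dataset ID is the official one; the `data/swe-bench-lite` output path is an assumption about this repo's layout):

```python
# Minimal sketch of the download step: fetch SWE-bench Lite from the
# Hugging Face Hub and cache it locally for the runner to use.
from datasets import load_dataset

def download_lite(out_dir: str = "data/swe-bench-lite") -> None:
    # The test split holds the 300 evaluation instances.
    dataset = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
    dataset.save_to_disk(out_dir)
    print(f"Saved {len(dataset)} instances to {out_dir}")

if __name__ == "__main__":
    download_lite()
```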
```bash
# Run on SWE-bench Lite (300 instances)
python run_swebench.py --dataset lite --model claude-flow

# Run on a specific instance
python run_swebench.py --instance "django__django-11099"

# Generate the submission file
python generate_submission.py --output predictions.json
```
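Before submitting, predictions can be scored locally with the official evaluation harness that ships in the `swebench` package. The invocation below reflects recent `swebench` releases; flag names may differ across versions:

```bash
# Evaluate predictions.json against SWE-bench Lite with the official harness.
# --run_id names this evaluation run; adjust --max_workers for your machine.
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path predictions.json \
    --max_workers 4 \
    --run_id claude-flow-eval
```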
## Submission Format

The submission must be a `predictions.json` file that maps each instance ID to its prediction:

```json
{
  "django__django-11099": {
    "model_patch": "diff --git a/file.py b/file.py\n...",
    "model_name_or_path": "claude-flow"
  }
}
```
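For reference, a file in this shape can be assembled with a few lines of Python. Reading patches from `predictions/*.diff` is an assumption about this repo's layout; only the output schema above is mandated:

```python
# Sketch of assembling predictions.json in the required format.
import json
from pathlib import Path

def build_predictions(patch_dir: str = "predictions") -> dict:
    predictions = {}
    for patch_file in Path(patch_dir).glob("*.diff"):
        instance_id = patch_file.stem  # e.g. "django__django-11099"
        predictions[instance_id] = {
            "model_patch": patch_file.read_text(),
            "model_name_or_path": "claude-flow",
        }
    return predictions

if __name__ == "__main__":
    with open("predictions.json", "w") as f:
        json.dump(build_predictions(), f, indent=2)
```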
## Directory Structure

```
swe-bench-official/
├── download_swebench.py    # Download the official dataset
├── setup_evaluation.py     # Set up the test environment
├── run_swebench.py         # Main runner
├── claude_flow_agent.py    # Claude Flow SWE-bench agent
├── generate_submission.py  # Create the submission file
├── data/                   # Downloaded datasets
├── predictions/            # Generated predictions
└── logs/                   # Execution logs
```
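`claude_flow_agent.py` is not reproduced in this README. For orientation, a plausible per-instance contract, assuming the standard SWE-bench dataset columns, might look like the sketch below; `generate_patch` and the `data/swe-bench-lite` path are placeholders, not the agent's actual API:

```python
# Hypothetical sketch of the per-instance agent contract. The dataset
# fields (instance_id, repo, base_commit, problem_statement) are real
# SWE-bench columns; generate_patch stands in for the Claude Flow agent.
from datasets import load_from_disk

def generate_patch(repo: str, base_commit: str, problem_statement: str) -> str:
    """Placeholder: check out repo@base_commit, work the issue, and
    return a unified diff of the changes."""
    raise NotImplementedError

def run_all(data_dir: str = "data/swe-bench-lite") -> dict:
    predictions = {}
    for instance in load_from_disk(data_dir):
        patch = generate_patch(
            instance["repo"],               # e.g. "django/django"
            instance["base_commit"],        # commit the patch must apply to
            instance["problem_statement"],  # the GitHub issue text
        )
        predictions[instance["instance_id"]] = {
            "model_patch": patch,
            "model_name_or_path": "claude-flow",
        }
    return predictions
```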