# SWE-bench Official Integration: Implementation Complete
We have successfully integrated the official SWE-bench evaluation system with the swarm-bench CLI tool. The implementation is complete and ready for full evaluation runs.
The integration (`official_integration.py`) exposes these CLI commands:

- `swarm-bench swe-bench official` - Run the official evaluation
- `swarm-bench swe-bench official --limit 10` - Run a limited test
- `swarm-bench swe-bench official --validate` - Validate predictions

The system uses the correct claude-flow command format:
```bash
npx claude-flow@alpha swarm "task" --strategy optimization --agents 8 --non-interactive
npx claude-flow@alpha hive-mind spawn "task" --agents 8 --non-interactive
```
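As a rough sketch of how the integration can drive these commands from Python (the `run_swarm_task` helper and its timeout are illustrative, not the actual `official_integration.py` API):

```python
import subprocess

def run_swarm_task(task: str, agents: int = 8, timeout: int = 600) -> str:
    """Invoke claude-flow in non-interactive swarm mode and return its stdout.

    A sketch only; the real invocation logic lives in official_integration.py.
    """
    cmd = [
        "npx", "claude-flow@alpha", "swarm", task,
        "--strategy", "optimization",
        "--agents", str(agents),
        "--non-interactive",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    result.check_returncode()  # raise if claude-flow exited non-zero
    return result.stdout
```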
Based on our testing, the optimal configuration is applied by the optimization runner:
```bash
python run_swe_bench_optimized.py --quick
python run_swe_bench_optimized.py --evaluate
```
Or run directly through the `swarm-bench` CLI:

```bash
swarm-bench swe-bench official
swarm-bench swe-bench official --limit 10 | tee swe-bench.log
```
The system generates:
- `predictions.json` - Submission file for the leaderboard, in the format sketched below
- `evaluation_report_*.json` - Detailed metrics
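The official SWE-bench harness expects each prediction to carry `instance_id`, `model_name_or_path`, and `model_patch` fields. A minimal sketch of writing an entry in that shape (the instance ID, model name, and patch text below are placeholders):

```python
import json

# Illustrative entry; the instance_id, model name, and patch are placeholders.
predictions = [
    {
        "instance_id": "astropy__astropy-12907",
        "model_name_or_path": "claude-flow-swarm",
        "model_patch": "diff --git a/... (unified diff produced by the swarm)",
    }
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```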
Run evaluation:

```bash
swarm-bench swe-bench official --limit 300
```
Validate predictions:
```bash
swarm-bench swe-bench official --validate
```
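For a standalone sanity check along the same lines, a minimal sketch (assuming `predictions.json` is in the current directory and uses the three-field format shown above):

```python
import json

REQUIRED_KEYS = {"instance_id", "model_name_or_path", "model_patch"}

with open("predictions.json") as f:
    predictions = json.load(f)

for i, entry in enumerate(predictions):
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        print(f"entry {i} ({entry.get('instance_id', '?')}): missing {missing}")
    elif not entry["model_patch"].strip():
        print(f"entry {i} ({entry['instance_id']}): empty patch")

print(f"checked {len(predictions)} predictions")
```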
Submit to leaderboard:
Submit `predictions.json`.

The system has been optimized with the claude-flow swarm configuration shown above (`--strategy optimization --agents 8`).
If patches aren't being generated:
- Verify the claude-flow installation: `npx claude-flow@alpha --version`
- Use `--limit 1` to test a single instance
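A small sketch that scripts the first check (the version-string format is whatever claude-flow prints; nothing here is part of the integration itself):

```python
import subprocess

def claude_flow_available() -> bool:
    """Return True if `npx claude-flow@alpha --version` exits cleanly."""
    try:
        result = subprocess.run(
            ["npx", "claude-flow@alpha", "--version"],
            capture_output=True, text=True, timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    print(result.stdout.strip())
    return result.returncode == 0

if __name__ == "__main__":
    print("claude-flow OK" if claude_flow_available() else "claude-flow missing")
```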
Directory layout:

```
benchmark/
├── src/swarm_benchmark/swe_bench/
│   ├── official_integration.py   # Main integration
│   ├── engine.py                 # Original engine
│   ├── datasets.py               # Dataset handling
│   ├── evaluator.py              # Evaluation logic
│   ├── metrics.py                # Performance tracking
│   └── optimizer.py              # Configuration optimization
├── run_swe_bench_optimized.py    # Optimization runner
├── run_real_swe_bench.py         # Simple runner
└── swe-bench-official/           # Results directory
```
The implementation is complete and ready for full evaluation runs and leaderboard submission.
The system correctly uses claude-flow commands and generates predictions in the required format for official submission.