# SWE-bench Benchmark Suite
SWE-bench is a comprehensive benchmark suite designed to evaluate software engineering capabilities of AI systems. This implementation integrates SWE-bench with Claude Flow's swarm benchmark system.
## Quick Start

```bash
# Run basic SWE-bench suite
python run_swe_bench.py

# Run with specific configuration
python run_swe_bench.py --config configs/swe_bench_config.yaml

# Run optimization pipeline
python optimize_swe_bench.py --iterations 10

# Generate performance report
python generate_swe_report.py --output reports/
```
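For orientation, here is a minimal sketch of what `configs/swe_bench_config.yaml` might contain. The key names below are illustrative assumptions, not a documented schema; check the shipped config file before relying on them.

```yaml
# Hypothetical configuration sketch -- key names are assumptions,
# not the project's documented schema.
dataset:
  path: datasets/swe_bench_lite.json   # assumed dataset location
  max_tasks: 100
execution:
  parallel_tasks: 5          # matches the optimization target below
  timeout_per_task_s: 30
reporting:
  output_dir: reports/
```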
## Performance Targets

| Metric | Baseline | Target | Optimized |
|---|---|---|---|
| Task Success Rate | 60% | 80% | TBD |
| Average Time/Task | 30s | 15s | TBD |
| Token Efficiency | 5000 tokens | 3000 tokens | TBD |
| Memory Usage | 500MB | 300MB | TBD |
| Parallel Tasks | 1 | 5 | TBD |
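The first three metrics can be derived from per-task result records. The sketch below assumes each record carries `resolved`, `duration_s`, and `tokens_used` fields; the actual result schema may differ.

```python
import json
from pathlib import Path

def summarize(results_path: str) -> dict:
    """Compute success rate, mean time, and mean tokens per task.

    Assumes each record has 'resolved' (bool), 'duration_s' (float),
    and 'tokens_used' (int) fields -- the real schema may differ.
    """
    records = json.loads(Path(results_path).read_text())
    if not records:
        return {}
    n = len(records)
    return {
        "task_success_rate": sum(r["resolved"] for r in records) / n,
        "avg_time_per_task_s": sum(r["duration_s"] for r in records) / n,
        "avg_tokens_per_task": sum(r["tokens_used"] for r in records) / n,
    }
```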
## Directory Structure

```
swe-bench/
├── configs/      # Configuration files
├── datasets/     # SWE-bench task datasets
├── evaluators/   # Task evaluation logic
├── executors/    # Task execution engines
├── optimizers/   # Performance optimization
├── reports/      # Generated reports
└── tests/        # Test suites
```
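A new component under `evaluators/` might take the shape below. This interface is a guess at the layout for illustration only, not the project's actual base class.

```python
# evaluators/patch_evaluator.py -- hypothetical example; the real
# evaluator interface in this repo may differ.
from dataclasses import dataclass

@dataclass
class EvalResult:
    task_id: str
    resolved: bool
    notes: str = ""

class PatchEvaluator:
    """Checks whether a generated patch resolves a SWE-bench task."""

    def evaluate(self, task_id: str, patch: str) -> EvalResult:
        # Placeholder check: a real evaluator would apply the patch
        # and run the task's test suite in an executor.
        resolved = bool(patch.strip())
        return EvalResult(task_id=task_id, resolved=resolved)
```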
## Claude Flow Integration

The SWE-bench suite leverages Claude Flow's advanced swarm features, including parallel task execution and the optimization pipeline shown above.
## Results

Results are tracked in:

- `reports/swe_bench_results_*.json`
- `swe_bench.db`
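A quick way to inspect tracked results: the JSON glob comes from the paths above, while the structure of `swe_bench.db` is not documented here, so the sketch lists its tables rather than assuming a schema.

```python
import glob
import json
import sqlite3

# Load the most recent JSON report (path pattern from the list above).
reports = sorted(glob.glob("reports/swe_bench_results_*.json"))
if reports:
    with open(reports[-1]) as f:
        results = json.load(f)
    print(f"{reports[-1]}: {len(results)} task records")

# Inspect the SQLite store without assuming its layout: list the
# tables first, since the actual schema is not documented here.
con = sqlite3.connect("swe_bench.db")
tables = con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
).fetchall()
print("tables in swe_bench.db:", [t[0] for t in tables])
con.close()
```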
## Contributing

See CONTRIBUTING.md for guidelines on adding new benchmarks or improving existing ones.