# Real Benchmark Engine Architecture
The Real Benchmark Engine is a comprehensive system designed to execute and measure actual claude-flow commands, capturing detailed performance metrics, resource usage, and quality assessments. This architecture enables systematic testing of all 17 SPARC modes and 6 swarm strategies across 5 coordination modes.
The `RealBenchmarkEngine` class is the main orchestrator and manages the entire benchmarking lifecycle:
```python
class RealBenchmarkEngine:
    """Orchestrator responsibilities:

    - Locates the claude-flow executable
    - Manages benchmark execution
    - Coordinates resource monitoring
    - Handles result aggregation
    - Manages temporary workspaces
    """
```
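A minimal sketch of the first and last of those duties, locating the executable and provisioning a workspace; the field names and error handling are illustrative assumptions, not the actual implementation:

```python
import shutil
import tempfile

class EngineSetupSketch:
    """Hypothetical setup logic for executable discovery and workspaces."""

    def __init__(self, config=None):
        self.config = config
        # Locate the claude-flow executable on PATH.
        self.claude_flow_path = shutil.which("claude-flow")
        if self.claude_flow_path is None:
            raise FileNotFoundError("claude-flow executable not found on PATH")
        # Dedicated temporary workspace, removed when the run completes.
        self.workspace = tempfile.mkdtemp(prefix="swarm-benchmark-")
```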
Key features of the monitoring subsystem:

- Captures fine-grained, per-process resource usage data
- Samples each benchmark process from a background thread (sketched below)
- Tracks overall system resources alongside the per-process figures
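A minimal background sampler, assuming `psutil` as the sampling backend; the engine's actual monitor may differ:

```python
import threading
import time
import psutil  # assumption: psutil is the sampling backend

class ResourceMonitor:
    """Samples CPU and memory for one process on a background thread."""

    def __init__(self, pid: int, interval: float = 0.1):
        self.proc = psutil.Process(pid)
        self.interval = interval
        self.samples = []  # (cpu_percent, rss_bytes) tuples
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            try:
                cpu = self.proc.cpu_percent(interval=None)
                rss = self.proc.memory_info().rss
                self.samples.append((cpu, rss))
            except psutil.NoSuchProcess:
                break  # the benchmark process has exited
            time.sleep(self.interval)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```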
Each benchmark flows through a fixed pipeline:

Task Definition → Command Building → Process Execution → Resource Monitoring → Result Collection → Quality Assessment → Persistence
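A hypothetical end-to-end driver showing how those stages chain together; every method name here is assumed for illustration:

```python
async def run_pipeline(engine, task):
    """Illustrative walk through the pipeline stages; not the real API."""
    cmd = engine.build_command(task)                    # Command Building
    rc, out, err = await engine.run_process(cmd)        # Process Execution (monitored)
    result = engine.collect_result(task, rc, out, err)  # Result Collection
    result.quality = engine.assess_quality(result)      # Quality Assessment
    engine.persist(result)                              # Persistence
    return result
```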
Commands are constructed intelligently from the task definition: the SPARC mode or swarm strategy selects the claude-flow invocation, and execution parameters such as timeouts shape its arguments.
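A sketch of such a builder; the claude-flow subcommands and flags shown are assumptions for illustration, not a documented interface:

```python
def build_command(executable: str, objective: str,
                  mode: str | None = None,
                  strategy: str | None = None) -> list[str]:
    """Assemble a claude-flow invocation for one benchmark task."""
    if mode:
        # Assumed SPARC-mode form, e.g. `claude-flow sparc run sparc-coder "..."`.
        return [executable, "sparc", "run", mode, objective]
    if strategy:
        # Assumed swarm form, e.g. `claude-flow swarm "..." --strategy development`.
        return [executable, "swarm", objective, "--strategy", strategy]
    return [executable, "swarm", objective]
```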
Subprocess management is asynchronous, built on asyncio's process control.
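A minimal sketch of spawning and awaiting a benchmarked command; the kill-on-timeout behavior is an assumption about how the engine handles overruns:

```python
import asyncio

async def run_process(cmd: list[str], timeout: float = 300.0):
    """Spawn a command, capture its output, and enforce a timeout."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()  # assumption: overrunning benchmarks are killed and marked failed
        await proc.wait()
        raise
    return proc.returncode, stdout, stderr
```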
Every result receives a multi-dimensional quality assessment via the `QualityMetrics` model:

- `accuracy_score`: command execution correctness
- `completeness_score`: output comprehensiveness
- `consistency_score`: result reliability
- `relevance_score`: task alignment
- `overall_quality`: weighted composite score
The quality estimation algorithm combines the four dimension scores into `overall_quality`.
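A sketch assuming a simple weighted average; the weights shown are illustrative, not the engine's actual values:

```python
# Illustrative weights only; the real values are internal to the engine.
WEIGHTS = {
    "accuracy_score": 0.4,
    "completeness_score": 0.3,
    "consistency_score": 0.2,
    "relevance_score": 0.1,
}

def overall_quality(scores: dict[str, float]) -> float:
    """Weighted composite of the individual quality dimensions (0.0-1.0)."""
    return sum(scores[name] * w for name, w in WEIGHTS.items())
```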
Parallelism is configurable: benchmarks can run sequentially or concurrently, with a bound on how many execute at once.
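A sketch of bounded concurrency with `asyncio`; the semaphore limit stands in for whatever knob the configuration actually exposes:

```python
import asyncio

async def execute_batch_sketch(engine, objectives, max_parallel: int = 4):
    """Run several benchmarks concurrently, at most max_parallel at a time."""
    sem = asyncio.Semaphore(max_parallel)

    async def run_one(objective):
        async with sem:
            return await engine.run_benchmark(objective)

    return await asyncio.gather(*(run_one(o) for o in objectives))
```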
The engine integrates seamlessly with the existing benchmark models:

- `Task`: enhanced with real execution parameters
- `Result`: populated with actual metrics
- `Benchmark`: comprehensive execution records
- `Agent`: real claude-flow process representation

```python
# Single benchmark
result = await engine.run_benchmark("Create a REST API")

# Batch execution
results = await engine.execute_batch(task_list)

# One objective across every mode
all_results = await engine.benchmark_all_modes("Build a web app")
```
All 17 SPARC modes are supported, each with specialized command handling (for example, `sparc-coder` in the CLI usage below).
Results are persisted in two output formats (a persistence sketch follows the list):

- JSON: structured data
- SQLite: queryable storage
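A minimal sketch of dual-format persistence; the record fields and SQLite schema are assumptions:

```python
import json
import sqlite3

def persist(result: dict, json_path: str, db_path: str) -> None:
    """Write one benchmark record as JSON and as a queryable SQLite row."""
    # JSON: the full structured record.
    with open(json_path, "w") as f:
        json.dump(result, f, indent=2)

    # SQLite: a flat summary row for cross-run queries (schema assumed).
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(objective TEXT, mode TEXT, duration_s REAL, overall_quality REAL)"
    )
    con.execute(
        "INSERT INTO results VALUES (?, ?, ?, ?)",
        (result["objective"], result.get("mode"),
         result["duration_s"], result["overall_quality"]),
    )
    con.commit()
    con.close()
```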
```bash
# Run real benchmarks
python -m swarm_benchmark real --objective "Build a CLI tool" --all-modes

# Specific mode testing
python -m swarm_benchmark real --mode sparc-coder --objective "Create a parser"

# Swarm strategy testing
python -m swarm_benchmark real --strategy development --mode hierarchical
```
```python
from swarm_benchmark.core.real_benchmark_engine import RealBenchmarkEngine

engine = RealBenchmarkEngine(config)
results = await engine.run_benchmark("Create a web scraper")
```
CI/CD pipeline integration (GitHub Actions step):

```yaml
- name: Run Claude-Flow Benchmarks
  run: |
    python -m swarm_benchmark real \
      --objective "${{ matrix.objective }}" \
      --timeout 300 \
      --output-format json sqlite
```
The Real Benchmark Engine provides a robust, extensible framework for systematically evaluating claude-flow performance across all supported modes and strategies. Its modular architecture enables easy extension while maintaining reliability and accuracy in measurements.