Real Claude-Flow Benchmark Architecture

Overview

The Real Benchmark Engine is a comprehensive system designed to execute and measure actual claude-flow commands, capturing detailed performance metrics, resource usage, and quality assessments. This architecture enables systematic testing of all 17 SPARC modes and 7 swarm strategies across 5 coordination modes.

Architecture Components

1. Core Engine (RealBenchmarkEngine)

The main orchestrator that manages the entire benchmarking lifecycle:

```python
class RealBenchmarkEngine:
    """Main orchestrator for the entire benchmarking lifecycle.

    Responsibilities:
    - Locate the claude-flow executable
    - Manage benchmark execution
    - Coordinate resource monitoring
    - Aggregate results
    - Manage temporary workspaces
    """
```

Key Features:

  • Automatic claude-flow discovery across multiple paths
  • Subprocess-based execution with full isolation
  • Configurable parallelism and timeout handling
  • Comprehensive error handling and recovery
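
The discovery step, for instance, might look like the following sketch; the candidate paths are illustrative assumptions, not the engine's actual search order.

```python
# Hypothetical discovery helper; candidate locations are assumptions.
import shutil
from pathlib import Path

def find_claude_flow() -> str | None:
    """Look for claude-flow on PATH first, then in common local spots."""
    found = shutil.which("claude-flow")
    if found:
        return found
    for candidate in (
        Path.cwd() / "node_modules" / ".bin" / "claude-flow",
        Path.home() / ".npm-global" / "bin" / "claude-flow",
    ):
        if candidate.is_file():
            return str(candidate)
    return None
```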

2. Resource Monitoring System

ProcessMetrics

Captures fine-grained resource usage data:

  • CPU utilization (per-process and children)
  • Memory consumption (RSS, peak usage)
  • I/O operations (read/write bytes and counts)
  • Temporal sampling for trend analysis
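
A single sampling pass could be sketched with psutil as below; the field names are illustrative rather than the actual ProcessMetrics schema, and `io_counters()` is not available on macOS.

```python
# Illustrative per-process sample; not the engine's actual schema.
import time
from dataclasses import dataclass

import psutil

@dataclass
class ProcessSample:
    timestamp: float
    cpu_percent: float  # parent plus children, summed
    rss_bytes: int      # resident set size
    read_bytes: int
    write_bytes: int

def sample_process(proc: psutil.Process) -> ProcessSample:
    procs = [proc] + proc.children(recursive=True)
    # cpu_percent(interval=None) returns 0.0 on the first call and a
    # meaningful rate on subsequent calls, which suits a sampling loop.
    cpu = sum(p.cpu_percent(interval=None) for p in procs)
    rss = sum(p.memory_info().rss for p in procs)
    io = proc.io_counters()  # unavailable on macOS
    return ProcessSample(time.time(), cpu, rss, io.read_bytes, io.write_bytes)
```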

ResourceMonitor

Background thread-based monitoring:

  • Configurable sampling interval (default: 100ms)
  • Hierarchical process tracking (parent + children)
  • Non-blocking operation
  • Graceful handling of process termination
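
A thread-based monitor along these lines is sketched below, reusing the `sample_process` helper from the previous sketch; the real class's interface may differ.

```python
# Minimal background monitor sketch; reuses psutil and the
# ProcessSample/sample_process definitions from the sketch above.
import threading

class MonitorSketch:
    def __init__(self, proc: psutil.Process, interval: float = 0.1):
        self.proc, self.interval = proc, interval  # 100 ms default
        self.samples: list[ProcessSample] = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # Event.wait doubles as the sampling sleep and the stop signal,
        # so the loop never blocks the caller.
        while not self._stop.wait(self.interval):
            try:
                self.samples.append(sample_process(self.proc))
            except psutil.NoSuchProcess:
                break  # target exited; stop gracefully

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> list:
        self._stop.set()
        self._thread.join()
        return self.samples
```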

SystemMonitor

Overall system resource tracking:

  • Baseline measurements
  • System-wide CPU and memory usage
  • Disk and network I/O statistics
  • Comparative analysis capabilities
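
Baseline-and-delta measurement might be implemented as in this sketch; the snapshot keys are illustrative.

```python
# System-wide counters before and after a run; keys are assumptions.
import psutil

def system_snapshot() -> dict:
    disk, net = psutil.disk_io_counters(), psutil.net_io_counters()
    return {
        "disk_read": disk.read_bytes, "disk_write": disk.write_bytes,
        "net_sent": net.bytes_sent, "net_recv": net.bytes_recv,
    }

baseline = system_snapshot()
# ... run the benchmark under test ...
after = system_snapshot()
io_deltas = {k: after[k] - baseline[k] for k in baseline}
cpu_now = psutil.cpu_percent(interval=None)  # point-in-time rate, not a delta
mem_pct = psutil.virtual_memory().percent
```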

3. Execution Pipeline

Task Definition → Command Building → Process Execution → Resource Monitoring → Result Collection → Quality Assessment → Persistence

Command Building

Intelligent command construction based on:

  • Task strategy type (SPARC mode vs. Swarm)
  • Coordination mode selection
  • Configuration parameters
  • Non-interactive flag enforcement
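
A hypothetical builder is shown below; the subcommands and flags mirror the CLI examples later in this document but should be treated as assumptions.

```python
# Sketch only: translate a task definition into an argument vector.
def build_command(executable: str, task) -> list:
    if task.strategy == "sparc":
        cmd = [executable, "sparc", "run", task.mode, task.objective]
    else:  # swarm strategies
        cmd = [executable, "swarm", task.objective,
               "--strategy", task.strategy,
               "--mode", task.coordination_mode]
    cmd.append("--non-interactive")  # enforced for unattended benchmarking
    return cmd
```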

Process Execution

Asynchronous subprocess management:

  • asyncio-based process control
  • Configurable timeout enforcement
  • Stdout/stderr capture
  • Environment variable injection
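
The pattern described above might be realized as in this sketch; the function name and return shape are illustrative.

```python
# Timeout-bounded subprocess execution with asyncio.
import asyncio
import os

async def run_command(cmd, timeout, extra_env=None):
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
        env={**os.environ, **(extra_env or {})},  # environment injection
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()        # enforce the configured timeout
        await proc.wait()
        raise
    return proc.returncode, stdout.decode(), stderr.decode()
```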

4. Quality Metrics System

Multi-dimensional quality assessment:

```python
from dataclasses import dataclass

@dataclass
class QualityMetrics:
    accuracy_score: float      # command execution correctness
    completeness_score: float  # output comprehensiveness
    consistency_score: float   # result reliability
    relevance_score: float     # task alignment
    overall_quality: float     # weighted composite score
```
Quality estimation algorithm:

  1. Base scoring on execution success
  2. Output analysis for completeness
  3. Error pattern detection
  4. Weighted aggregation
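
As a rough illustration of these four steps, with weights and heuristics that are assumptions rather than the engine's actual values:

```python
# Illustrative scoring pass; patterns and weights are assumptions.
ERROR_PATTERNS = ("traceback", "error:", "failed")

def estimate_quality(returncode: int, output: str) -> float:
    accuracy = 1.0 if returncode == 0 else 0.0             # 1. success base
    completeness = min(len(output) / 2000, 1.0)            # 2. output-size proxy
    text = output.lower()
    consistency = 0.0 if any(p in text for p in ERROR_PATTERNS) else 1.0  # 3.
    weights = (0.4, 0.3, 0.3)                              # 4. weighted aggregate
    return sum(w * s for w, s in zip(weights, (accuracy, completeness, consistency)))
```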

5. Parallel Execution Framework

Configurable parallelism modes:

Sequential Mode

  • Ordered task execution
  • Predictable resource usage
  • Simplified debugging

Parallel Mode

  • Semaphore-based concurrency control
  • Configurable max parallel tasks
  • Resource-aware scheduling
  • Exception isolation
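
The concurrency control could look like this sketch, where `run_one` stands in for whatever coroutine executes a single task.

```python
# Semaphore-bounded batch execution with exception isolation.
import asyncio

async def execute_batch(tasks, run_one, max_parallel: int = 4):
    sem = asyncio.Semaphore(max_parallel)

    async def guarded(task):
        async with sem:  # at most max_parallel tasks in flight
            return await run_one(task)

    # return_exceptions=True isolates failures: a crashed task becomes
    # an exception object in the results instead of cancelling the batch.
    return await asyncio.gather(*(guarded(t) for t in tasks),
                                return_exceptions=True)
```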

6. Data Models Integration

Seamless integration with existing benchmark models:

  • Task: Enhanced with real execution parameters
  • Result: Populated with actual metrics
  • Benchmark: Comprehensive execution records
  • Agent: Real claude-flow process representation

Execution Modes

1. Single Task Benchmarking

```python
result = await engine.run_benchmark("Create a REST API")
```

2. Batch Execution

```python
results = await engine.execute_batch(task_list)
```

3. Comprehensive Mode Testing

```python
all_results = await engine.benchmark_all_modes("Build a web app")
```

SPARC Mode Support

All 17 SPARC modes with specialized handling:

Core Development Modes

  • coder: Code generation benchmarks
  • architect: System design evaluation
  • reviewer: Code review simulation
  • tdd: Test-driven development cycles

Analysis & Research Modes

  • researcher: Information gathering tests
  • analyzer: Code analysis benchmarks
  • optimizer: Performance optimization
  • debugger: Issue resolution testing

Creative & Support Modes

  • designer: UI/UX design tasks
  • innovator: Creative problem solving
  • documenter: Documentation generation
  • tester: Test suite creation

Orchestration Modes

  • orchestrator: Multi-agent coordination
  • swarm-coordinator: Swarm management
  • workflow-manager: Process automation
  • batch-executor: Parallel execution
  • memory-manager: Knowledge management

Swarm Strategy Support

Strategies (7 types)

  1. auto: Adaptive strategy selection
  2. research: Information gathering swarms
  3. development: Code creation swarms
  4. analysis: Code analysis swarms
  5. testing: Quality assurance swarms
  6. optimization: Performance improvement swarms
  7. maintenance: System maintenance swarms

Coordination Modes (5 types)

  1. centralized: Single coordinator model
  2. distributed: Peer-to-peer coordination
  3. hierarchical: Multi-level management
  4. mesh: Full interconnection
  5. hybrid: Adaptive coordination

Monitoring & Metrics

Performance Metrics

  • Execution time (wall clock)
  • Queue time (if applicable)
  • Throughput (tasks/second)
  • Success/failure rates
  • Retry counts
  • Coordination overhead

Resource Metrics

  • CPU utilization (average, peak)
  • Memory usage (average, peak)
  • I/O operations (read/write)
  • Network traffic (if measurable)
  • Process count
  • Thread count

Quality Metrics

  • Output quality scoring
  • Task completion assessment
  • Error rate analysis
  • Consistency evaluation

Output Formats

JSON Reports

Structured data with:

  • Complete execution details
  • Resource usage graphs
  • Quality assessments
  • Comparative analysis

SQLite Database

Queryable storage for:

  • Historical trending
  • Cross-benchmark analysis
  • Performance regression detection
  • Resource usage patterns
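
One possible shape for such a store, using Python's built-in sqlite3 module; the schema is an assumption for illustration.

```python
# Hypothetical results table and a trend query for regression checks.
import sqlite3

conn = sqlite3.connect("benchmarks.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS results (
        id INTEGER PRIMARY KEY,
        mode TEXT, strategy TEXT,
        started_at REAL, duration_s REAL,
        peak_rss_bytes INTEGER,
        success INTEGER, overall_quality REAL
    )
""")
for mode, avg_s in conn.execute(
        "SELECT mode, AVG(duration_s) FROM results GROUP BY mode"):
    print(f"{mode}: {avg_s:.1f}s average")
```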

Error Handling

Graceful Degradation

  • Process timeout handling
  • Resource exhaustion management
  • Network failure resilience
  • Filesystem error recovery

Error Classification

  • Execution errors
  • Resource errors
  • Quality failures
  • System errors
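
One natural way to encode this classification is an exception hierarchy; the class names below are hypothetical, not the engine's actual types.

```python
class BenchmarkError(Exception):
    """Base class for all benchmark failures."""

class ExecutionError(BenchmarkError):
    """Command failed, crashed, or timed out."""

class ResourceError(BenchmarkError):
    """CPU, memory, or disk exhaustion during a run."""

class QualityFailure(BenchmarkError):
    """Run completed but output fell below the quality threshold."""

class BenchmarkSystemError(BenchmarkError):
    """Filesystem, network, or other OS-level failure."""
```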

Best Practices

1. Resource Management

  • Use temporary workspaces for isolation
  • Clean up after execution
  • Monitor system resources
  • Set appropriate timeouts

2. Parallel Execution

  • Configure based on system capacity
  • Use semaphores for rate limiting
  • Monitor resource contention
  • Handle partial failures

3. Quality Assessment

  • Define task-specific quality criteria
  • Use multiple quality dimensions
  • Validate output correctness
  • Track quality trends

4. Performance Optimization

  • Batch similar tasks
  • Reuse process resources
  • Cache command templates
  • Minimize I/O operations

Integration Points

1. CLI Integration

```bash
# Run real benchmarks
python -m swarm_benchmark real --objective "Build a CLI tool" --all-modes

# Specific mode testing
python -m swarm_benchmark real --mode sparc-coder --objective "Create a parser"

# Swarm strategy testing
python -m swarm_benchmark real --strategy development --mode hierarchical
```

2. API Integration

```python
from swarm_benchmark.core.real_benchmark_engine import RealBenchmarkEngine

engine = RealBenchmarkEngine(config)
results = await engine.run_benchmark("Create a web scraper")
```

3. Continuous Integration

```yaml
# CI/CD pipeline integration
- name: Run Claude-Flow Benchmarks
  run: |
    python -m swarm_benchmark real \
      --objective "${{ matrix.objective }}" \
      --timeout 300 \
      --output-format json sqlite
```

Future Enhancements

1. Advanced Monitoring

  • GPU usage tracking
  • Container resource limits
  • Distributed execution support
  • Cloud provider integration

2. Quality Improvements

  • ML-based quality scoring
  • Semantic output analysis
  • Task-specific validators
  • Human evaluation integration

3. Scalability Features

  • Distributed benchmarking
  • Result streaming
  • Incremental processing
  • Cloud-native deployment

4. Analysis Capabilities

  • Real-time dashboards
  • Anomaly detection
  • Performance prediction
  • Optimization recommendations

Conclusion

The Real Benchmark Engine provides a robust, extensible framework for systematically evaluating claude-flow performance across all supported modes and strategies. Its modular architecture enables easy extension while maintaining reliability and accuracy in measurements.