Back to Ruflo

Benchmark Analysis Guide

v2/benchmark/docs/analysis.md

3.6.3012.4 KB
Original Source

Benchmark Analysis Guide

Learn how to analyze and interpret benchmark results to optimize your swarm performance.

📊 Understanding Benchmark Results

Result Structure

Each benchmark produces a comprehensive result structure:

json
{
  "benchmark_id": "uuid",
  "name": "benchmark-name",
  "status": "completed",
  "duration": 0.25,
  "config": { ... },
  "tasks": [ ... ],
  "results": [ ... ],
  "metrics": {
    "performance_metrics": { ... },
    "quality_metrics": { ... },
    "resource_usage": { ... },
    "coordination_metrics": { ... }
  }
}

Key Metrics Explained

Performance Metrics

  • execution_time: Total time to complete the task (seconds)
  • queue_time: Time spent waiting in queue
  • throughput: Tasks completed per second
  • coordination_overhead: Time spent on agent coordination
  • communication_latency: Average inter-agent communication delay

Quality Metrics

  • accuracy_score: How accurate the results are (0-1)
  • completeness_score: How complete the solution is (0-1)
  • consistency_score: Consistency across multiple runs (0-1)
  • relevance_score: How relevant the output is to the objective (0-1)
  • overall_quality: Weighted average of all quality metrics (0-1)

Resource Metrics

  • cpu_percent: CPU usage percentage
  • memory_mb: Memory usage in megabytes
  • peak_memory_mb: Maximum memory used
  • network_bytes: Total network traffic
  • resource_efficiency: Resource utilization efficiency (0-1)

🔍 Analysis Commands

Basic Analysis

bash
# Analyze a specific benchmark
swarm-benchmark analyze <benchmark-id>

# Example output:
# Performance Summary:
# - Execution Time: 0.25s
# - Success Rate: 95%
# - Average Quality: 0.87
# - Resource Efficiency: 0.82

Detailed Analysis

bash
# Comprehensive analysis
swarm-benchmark analyze <benchmark-id> --detailed

# Specific analysis types
swarm-benchmark analyze <benchmark-id> --type performance
swarm-benchmark analyze <benchmark-id> --type quality
swarm-benchmark analyze <benchmark-id> --type resource
swarm-benchmark analyze <benchmark-id> --type coordination

Comparative Analysis

bash
# Compare multiple benchmarks
swarm-benchmark compare id1 id2 id3

# Compare specific metrics
swarm-benchmark compare id1 id2 --metrics execution_time,quality_score

# Visual comparison
swarm-benchmark compare id1 id2 --format chart --export comparison.png

📈 Performance Analysis

Execution Time Breakdown

Understand where time is spent:

python
# Time distribution analysis
total_time = benchmark.duration
execution_time = sum(r.execution_time for r in results)
coordination_time = sum(r.coordination_overhead for r in results)
queue_time = sum(r.queue_time for r in results)

print(f"Execution: {execution_time/total_time*100:.1f}%")
print(f"Coordination: {coordination_time/total_time*100:.1f}%")
print(f"Queue: {queue_time/total_time*100:.1f}%")

Identifying Bottlenecks

bash
# Find performance bottlenecks
swarm-benchmark analyze <id> --bottlenecks

# Output:
# Performance Bottlenecks:
# 1. High coordination overhead (18% of total time)
#    - Consider switching from mesh to hierarchical mode
# 2. Long queue times for analysis tasks
#    - Increase agent pool size or use parallel execution
# 3. Memory peaks during data processing
#    - Enable streaming or batch processing

Track performance over time:

bash
# Analyze performance trends
swarm-benchmark analyze --trend --strategy development --days 7

# Generate trend report
swarm-benchmark report --type trends --period weekly

🎯 Quality Analysis

Quality Score Components

Understanding quality metrics:

Overall Quality = (
    accuracy_score * 0.35 +
    completeness_score * 0.30 +
    consistency_score * 0.20 +
    relevance_score * 0.15
)

Improving Quality Scores

bash
# Analyze quality issues
swarm-benchmark analyze <id> --quality-breakdown

# Example output:
# Quality Analysis:
# - Accuracy: 0.92 ✅
# - Completeness: 0.78 ⚠️  (Missing test cases)
# - Consistency: 0.85 ✓
# - Relevance: 0.90 ✅
# 
# Recommendations:
# 1. Increase max_retries for better completeness
# 2. Enable review mode for consistency
# 3. Use more specific objectives for relevance

Quality Comparison

bash
# Compare quality across strategies
swarm-benchmark analyze --compare-quality \
  --strategies research,development,analysis

# Quality by coordination mode
swarm-benchmark analyze --quality-by-mode \
  --task-type "API development"

💻 Resource Analysis

Resource Utilization Patterns

bash
# Detailed resource analysis
swarm-benchmark analyze <id> --resource-details

# Resource usage over time
swarm-benchmark analyze <id> --resource-timeline --export resources.csv

Resource Efficiency Metrics

python
# Calculate resource efficiency
resource_efficiency = (
    (tasks_completed / total_tasks) * 
    (1 - (avg_cpu_usage / 100)) * 
    (1 - (avg_memory_usage / memory_limit))
)

Optimization Recommendations

bash
# Get resource optimization suggestions
swarm-benchmark analyze <id> --optimize-resources

# Example output:
# Resource Optimization Suggestions:
# 1. CPU Usage: 85% average (high)
#    - Consider increasing task_timeout
#    - Use distributed mode to spread load
# 2. Memory Usage: 450MB/1024MB
#    - Optimal, no changes needed
# 3. Network: 15MB transferred
#    - Enable compression for large data

🔗 Coordination Analysis

Coordination Efficiency

Analyze how well agents work together:

bash
# Coordination analysis
swarm-benchmark analyze <id> --coordination-metrics

# Output:
# Coordination Analysis:
# - Mode: hierarchical
# - Agents: 8
# - Coordination Overhead: 12%
# - Communication Latency: 45ms avg
# - Task Distribution: balanced
# - Agent Utilization: 78%

Mode Effectiveness

bash
# Compare coordination modes for task type
swarm-benchmark analyze --mode-effectiveness \
  --task-pattern "Build*" \
  --min-samples 10

# Best modes by task type
swarm-benchmark report --coordination-recommendations

📊 Advanced Analysis Techniques

Statistical Analysis

python
# Python script for statistical analysis
import json
import numpy as np
from scipy import stats

# Load benchmark results
with open('benchmark_results.json') as f:
    data = json.load(f)

# Calculate statistics
execution_times = [r['execution_time'] for r in data['results']]
mean_time = np.mean(execution_times)
std_time = np.std(execution_times)
ci_95 = stats.t.interval(0.95, len(execution_times)-1, 
                          mean_time, std_time/np.sqrt(len(execution_times)))

print(f"Mean execution time: {mean_time:.3f}s")
print(f"95% CI: [{ci_95[0]:.3f}, {ci_95[1]:.3f}]")

Pattern Recognition

Identify patterns in benchmark results:

bash
# Find patterns in failures
swarm-benchmark analyze --failure-patterns --days 30

# Success patterns by configuration
swarm-benchmark analyze --success-patterns \
  --group-by strategy,mode

Predictive Analysis

bash
# Predict execution time for configuration
swarm-benchmark predict \
  --strategy development \
  --mode hierarchical \
  --agents 8 \
  --task-complexity high

# Predicted: 0.35s (±0.05s)
# Based on 50 similar benchmarks

📈 Visualization and Reporting

Generate Visual Reports

bash
# Performance dashboard
swarm-benchmark report --type dashboard \
  --period monthly \
  --export dashboard.html

# Strategy comparison chart
swarm-benchmark report --type comparison \
  --strategies all \
  --metrics execution_time,quality_score \
  --export strategy_comparison.png

Custom Analysis Scripts

python
# custom_analysis.py
import matplotlib.pyplot as plt
import json

# Load benchmark data
def analyze_benchmark(benchmark_id):
    with open(f'reports/{benchmark_id}.json') as f:
        data = json.load(f)
    
    # Extract metrics
    metrics = data['metrics']
    
    # Create visualization
    fig, axes = plt.subplots(2, 2, figsize=(12, 10))
    
    # Performance pie chart
    axes[0, 0].pie(
        [metrics['execution_time'], metrics['coordination_overhead'], metrics['queue_time']],
        labels=['Execution', 'Coordination', 'Queue'],
        autopct='%1.1f%%'
    )
    axes[0, 0].set_title('Time Distribution')
    
    # Quality scores bar chart
    quality = data['quality_metrics']
    axes[0, 1].bar(quality.keys(), quality.values())
    axes[0, 1].set_title('Quality Scores')
    
    # Resource usage over time
    # ... (additional visualizations)
    
    plt.tight_layout()
    plt.savefig(f'analysis_{benchmark_id}.png')

🎯 Analysis Best Practices

1. Regular Analysis

bash
# Daily analysis script
#!/bin/bash
DATE=$(date +%Y%m%d)
swarm-benchmark analyze --since yesterday \
  --report daily \
  --export "reports/daily_$DATE.html"

2. Baseline Comparison

Always compare against baselines:

bash
# Set baseline
swarm-benchmark baseline set <benchmark-id>

# Compare against baseline
swarm-benchmark analyze <new-id> --compare-baseline

3. Multi-dimensional Analysis

Consider multiple factors:

bash
# Comprehensive analysis
swarm-benchmark analyze <id> \
  --dimensions performance,quality,resource,coordination \
  --export comprehensive_analysis.json

4. Actionable Insights

Focus on actionable recommendations:

bash
# Get specific recommendations
swarm-benchmark analyze <id> --recommendations

# Example output:
# Top 3 Recommendations:
# 1. Switch to distributed mode (est. 25% faster)
# 2. Increase quality threshold to 0.9 (improve accuracy)
# 3. Enable parallel execution (reduce time by 40%)

📊 Analysis Workflows

Performance Optimization Workflow

  1. Baseline Measurement

    bash
    swarm-benchmark run "Task" --name baseline
    swarm-benchmark baseline set <id>
    
  2. Identify Issues

    bash
    swarm-benchmark analyze <id> --bottlenecks
    
  3. Test Improvements

    bash
    swarm-benchmark run "Task" --mode distributed --parallel
    
  4. Compare Results

    bash
    swarm-benchmark compare <baseline-id> <new-id>
    
  5. Validate Improvements

    bash
    swarm-benchmark analyze <new-id> --compare-baseline
    

Quality Improvement Workflow

  1. Quality Assessment

    bash
    swarm-benchmark analyze <id> --quality-breakdown
    
  2. Apply Improvements

    bash
    swarm-benchmark run "Task" \
      --quality-threshold 0.95 \
      --review \
      --max-retries 5
    
  3. Verify Quality

    bash
    swarm-benchmark analyze <new-id> --quality-validation
    

🔍 Troubleshooting with Analysis

Common Issues and Solutions

bash
# High failure rate
swarm-benchmark analyze <id> --failure-analysis
# → Increase timeouts, add retries

# Poor quality scores
swarm-benchmark analyze <id> --quality-issues
# → Enable review, increase threshold

# Resource exhaustion
swarm-benchmark analyze <id> --resource-problems
# → Reduce agent count, enable limits

# Coordination overhead
swarm-benchmark analyze <id> --coordination-issues
# → Simplify mode, reduce communication

📚 Export and Integration

Export Formats

bash
# Export for further analysis
swarm-benchmark export <id> --format csv --include-raw
swarm-benchmark export <id> --format json --pretty
swarm-benchmark export <id> --format sql --table benchmarks

Integration with Analytics Tools

python
# Export to pandas DataFrame
import pandas as pd
import json

def benchmark_to_dataframe(benchmark_file):
    with open(benchmark_file) as f:
        data = json.load(f)
    
    # Flatten results
    records = []
    for result in data['results']:
        record = {
            'benchmark_id': data['id'],
            'task_id': result['task_id'],
            'execution_time': result['execution_time'],
            'quality_score': result['quality_metrics']['overall_quality'],
            'cpu_usage': result['resource_usage']['cpu_percent'],
            'memory_usage': result['resource_usage']['memory_mb']
        }
        records.append(record)
    
    return pd.DataFrame(records)

# Analyze with pandas
df = benchmark_to_dataframe('benchmark_results.json')
print(df.describe())
print(df.groupby('strategy').mean())

🎉 Summary

Effective benchmark analysis helps you:

  • Identify performance bottlenecks
  • Improve quality scores
  • Optimize resource usage
  • Select optimal strategies and modes
  • Make data-driven decisions

Remember: Regular analysis and comparison are key to continuous improvement!