v2/docs/architecture/github-workflows-optimization-strategy.md
Document Version: 1.0
Date: 2025-11-24
Project: claude-code-flow
Status: Architecture Recommendation
This document provides a comprehensive architectural strategy to optimize GitHub Actions workflows for the claude-code-flow project. Based on analysis of recent workflow runs showing consistent failures in integration tests, rollback manager, and CI/CD pipeline, this strategy focuses on:
Recent Workflow Results (Last 20 runs):
Key Issues Identified:
Problems:
Cost Impact: ~12-15 minutes per run, 5 failed runs daily = 60-75 minutes wasted
New Structure:
```yaml
jobs:
  quality-and-security:  # Combines security + lint + typecheck
  test-and-build:        # Combines test + build
  deploy:                # Only runs on main branch
```
Rationale: Reduce overhead, improve speed, simplify dependency management
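```yaml
quality-and-security:
  steps:
    - name: Install once
      run: npm ci

    # Run in parallel within a single job
    - name: Parallel quality checks
      run: |
        npm run lint & lint_pid=$!
        npm run typecheck & typecheck_pid=$!
        npm audit --audit-level=high & audit_pid=$!
        # A bare `wait` returns 0 even if a background job failed,
        # so wait on each PID and collect the exit statuses.
        status=0
        wait "$lint_pid" || status=1
        wait "$typecheck_pid" || status=1
        wait "$audit_pid" || status=1
        exit "$status"
```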
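One detail worth checking when checks are backgrounded like this: in bash, a bare `wait` returns 0 even when a background job failed, whereas `wait <pid>` returns that job's exit status. A minimal sketch of the difference, with `true` and `false` standing in for the npm scripts (the function names are illustrative and not part of the project):

```bash
# Bare `wait`: the failure of a background job is not reflected in the status.
run_with_bare_wait() {
  true &
  false &
  wait
}

# Per-PID `wait`: any failing job makes the function return non-zero.
run_with_pid_wait() {
  status=0
  true & first_pid=$!
  false & second_pid=$!
  wait "$first_pid" || status=1
  wait "$second_pid" || status=1
  return "$status"
}

run_with_bare_wait && echo "bare wait: reported success despite a failed job"
run_with_pid_wait || echo "per-PID wait: reported the failure"
```

With the bare form, a failing lint or typecheck run would leave the step green, so the per-PID form is the safer default for a quality gate.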
Benefits:
```yaml
- name: Security audit
  run: npm audit --audit-level=high || echo "⚠️ Vulnerabilities found, review required"
  continue-on-error: true

- name: License check
  run: npx license-checker --summary || true
  continue-on-error: true
```
Rationale: Non-critical checks shouldn't block deployments
```yaml
- name: Cache dependencies
  uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ runner.os }}-${{ hashFiles('package-lock.json') }}
    restore-keys: npm-${{ runner.os }}-
```
Benefits: 30-50% faster dependency installation
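To make the key structure concrete, the sketch below builds a key of the same shape (`npm-<os>-<lockfile hash>`) locally. It assumes a Linux machine with `sha256sum`; `cache_key` is a hypothetical helper, and although `hashFiles()` is also SHA-256 based, its exact digest will not match plain `sha256sum` output.

```bash
# Build a cache key shaped like the workflow's: npm-<os>-<hash of lockfile>.
# Illustration only: hashFiles() in Actions produces a different digest.
cache_key() {
  lockfile=$1
  os=${2:-Linux}
  hash=$(sha256sum "$lockfile" | cut -d' ' -f1)
  printf 'npm-%s-%s\n' "$os" "$hash"
}

tmp=$(mktemp -d)
printf '{"lockfileVersion": 3}\n' > "$tmp/package-lock.json"
before=$(cache_key "$tmp/package-lock.json")

# Simulate a dependency change by rewriting the lockfile.
printf '{"lockfileVersion": 3, "packages": {}}\n' > "$tmp/package-lock.json"
after=$(cache_key "$tmp/package-lock.json")

[ "$before" = "$after" ] || echo "lockfile changed: exact key misses, prefix npm-Linux- can still restore"
rm -rf "$tmp"
```

The exact key only hits while `package-lock.json` is unchanged; after a dependency update, the `restore-keys` prefix lets the job start from the most recent older cache instead of an empty one.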
ADR-001: Consolidate CI Jobs
ADR-002: Remove Matrix Testing
Critical Problems:
`node -e "..."` with fake data
Example of Problematic Pattern:
```bash
# Lines 200-226: Fake communication test
node -e "
  async function testCommunication() {
    const results = {
      messagesSent: Math.floor(Math.random() * 50) + 10, // Random!
      messagesReceived: Math.floor(Math.random() * 50) + 10,
      successRate: 0.95 + Math.random() * 0.05 // Always succeeds!
    };
    console.log('Communication test results:', JSON.stringify(results, null, 2));
  }
"
```
Cost Impact: 20-30 minutes per run, 100% failure rate, provides NO VALUE
Current: Simulated coordination
```yaml
# REMOVE: Lines 165-195 - Fake swarm initialization
run: |
  timeout 300s node -e "
    console.log('Swarm initialized with topology: mesh');
    for (let i = 0; i < count; i++) {
      console.log('Agent spawned');
    }
  "
```
Recommended: Actual integration testing
```yaml
- name: Real agent coordination test
  run: |
    # Test actual CLI functionality
    ./bin/claude-flow swarm init --topology mesh
    ./bin/claude-flow agent spawn --type coder --count 2

    # Verify agents can communicate
    ./bin/claude-flow task orchestrate --task "Simple coordination test"
```
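What separates a real test from the simulated ones is that the result comes from the command's exit status rather than from printed text. A small sketch of a wrapper that makes this explicit; `run_check` is a hypothetical helper, and `true`, `false`, and `sh -c` stand in for the CLI calls so the snippet runs anywhere:

```bash
# Run a command, report the outcome, and propagate its real exit status.
run_check() {
  description=$1
  shift
  if "$@"; then
    echo "PASS: $description"
  else
    rc=$?
    echo "FAIL: $description (exit $rc)"
    return "$rc"
  fi
}

run_check "stand-in for a passing command" true
run_check "stand-in for a failing command" false || echo "the workflow step would fail here"

# In the workflow the same wrapper would take the real command, for example:
#   run_check "swarm init" ./bin/claude-flow swarm init --topology mesh
```

Because the wrapper returns the original exit code, a broken CLI command fails the step instead of being hidden behind a log line.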
Current: Complex multi-agent matrix with fake results
```yaml
strategy:
  matrix: ${{ fromJson(needs.integration-setup.outputs.agent-matrix) }}
```
Recommended: Simple, real test scenarios
```yaml
strategy:
  matrix:
    scenario: [swarm-init, agent-spawn, task-orchestrate, memory-ops]
```
DELETE these jobs entirely:
- integration-setup (lines 40-136) - Creates fake database and metadata
- test-agent-coordination (lines 138-288) - All simulated
- test-memory-integration (lines 290-410) - Fake memory operations
- test-fault-tolerance (lines 412-538) - Random failure scenarios
- test-performance-integration (lines 540-679) - Fake performance data
- integration-test-report (lines 681-880) - Reports on fake data

REPLACE with:
```yaml
jobs:
  real-integration-tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        test:
          - name: "Swarm Initialization"
            command: "npm run test:integration -- swarm"
          - name: "Agent Coordination"
            command: "npm run test:integration -- coordination"
          - name: "Memory Operations"
            command: "npm run test:integration -- memory"
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - run: npm ci
      - name: Run ${{ matrix.test.name }}
        run: ${{ matrix.test.command }}
        timeout-minutes: 5
```
Current: timeout 300s (5 minutes) for fake operations
Recommended: timeout-minutes: 3 for real tests
ADR-003: Remove Simulated Integration Tests
ADR-004: Real Integration Test Requirements
`npm run test:integration` scripts
Problems:
Cost Impact: Runs on every workflow failure, adds complexity, 100% failure rate
Current: 125 lines of complex detection logic
```yaml
failure-detection:
  outputs:
    rollback-required: ...
    failure-type: ...
    failure-severity: ...
    rollback-target: ...
```
Recommended: Simple condition-based approach
```yaml
jobs:
  rollback-assessment:
    if: github.event.workflow_run.conclusion == 'failure' && github.event.workflow_run.name == 'CI/CD Pipeline'
    steps:
      - name: Check if rollback needed
        run: |
          # Simple check: if main branch CI fails, notify team
          echo "⚠️ CI failed on main branch"
          echo "Manual review required before rollback"
```
Current: Automatic commits, force pushes, tags
Recommended: Notification-only workflow
Rationale:
```yaml
name: 🚨 CI Failure Notification

on:
  workflow_run:
    workflows: ["CI/CD Pipeline"]
    types: [completed]
    branches: [main]

jobs:
  notify-failure:
    if: github.event.workflow_run.conclusion == 'failure'
    runs-on: ubuntu-latest
    steps:
      - name: Create issue for CI failure
        uses: actions/github-script@v7
        with:
          # For workflow_run events, github.sha is the default branch head,
          # so the failed run's commit is read from workflow_run.head_sha.
          script: |
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: '🚨 CI Failed on Main Branch',
              body: `CI/CD Pipeline failed on main branch.
            **Commit:** ${{ github.event.workflow_run.head_sha }}
            **Workflow Run:** ${{ github.event.workflow_run.html_url }}
            Please investigate and determine if rollback is needed.`,
              labels: ['ci-failure', 'urgent']
            });
```
Benefits:
ADR-005: Disable Automated Rollbacks
Problems:
Cost Impact: Duplicates CI work, adds 15-20 minutes per run
Current: Separate 667-line workflow
Recommended: Integrate into main CI as quality score step
```yaml
# In ci.yml
jobs:
  quality-and-security:
    steps:
      - run: npm run lint
      - run: npm run typecheck

      - name: Calculate quality score
        run: |
          # grep -c prints 0 and exits 1 when nothing matches; `|| echo 0`
          # would append a second 0 and break the arithmetic, so use `|| true`.
          LINT_ERRORS=$(npm run lint 2>&1 | grep -c "error" || true)
          TS_ERRORS=$(npm run typecheck 2>&1 | grep -c "error" || true)
          SCORE=$((100 - LINT_ERRORS * 2 - TS_ERRORS * 3))

          echo "Quality Score: $SCORE/100"
          if [ "$SCORE" -lt 85 ]; then
            echo "⚠️ Quality score below threshold"
            exit 1
          fi
```
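The scoring arithmetic can be pulled into small functions so the threshold is easy to check locally. This is a sketch using the same weights as the step above (2 points per lint error, 3 per type error); the function names and the floor at zero are assumptions added for illustration.

```bash
# Count lines containing "error" on stdin. grep -c prints 0 and exits 1 when
# nothing matches, so `|| true` keeps the single printed 0.
count_errors() {
  grep -c "error" || true
}

# Score = 100 - 2 * lint errors - 3 * type errors, floored at 0 (assumption).
quality_score() {
  lint_errors=$1
  ts_errors=$2
  score=$((100 - lint_errors * 2 - ts_errors * 3))
  if [ "$score" -lt 0 ]; then
    score=0
  fi
  echo "$score"
}

lint_count=$(printf 'a.ts: error no-unused-vars\nb.ts: error semi\n' | count_errors)
clean_count=$(printf 'all files pass\n' | count_errors)
echo "lint errors: $lint_count, clean run: $clean_count"
echo "score with 3 lint and 2 type errors: $(quality_score 3 2)"
```

With these weights, 3 lint errors and 2 type errors give 88 and pass the 85 threshold, while 5 lint errors and 2 type errors give 84 and fail it.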
DELETE these jobs:
- code-accuracy-scoring - Duplicates `npm run lint` and `npm run typecheck`
- test-coverage-scoring - Already done in CI
- documentation-scoring - Trivial checks
- performance-regression-scoring - Unreliable comparison

KEEP:
Current: Complex weighted scoring with JSON artifacts
Recommended: Simple pass/fail with clear thresholds
```yaml
- name: Quality gate check
  run: |
    set -e
    npm run lint           # Must pass
    npm run typecheck      # Must pass
    npm run test:coverage  # Must have >80% coverage
```
ADR-006: Merge Truth Scoring into CI
Problems:
Current:
```yaml
strategy:
  matrix:
    os: [ubuntu-latest, macos-latest, windows-latest]
    node: [18, 20]
```
Recommended:
```yaml
strategy:
  matrix:
    node-version: [20]  # Single LTS release only
runs-on: ubuntu-latest  # Single platform (job-level key, not part of strategy)
```
Rationale:
Current: Performance tests in verification workflow
Recommended: Separate scheduled workflow
```yaml
# .github/workflows/performance-benchmarks.yml
name: Performance Benchmarks

on:
  schedule:
    - cron: '0 2 * * 0'  # Weekly on Sunday
  workflow_dispatch:

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run test:benchmark
```
Benefits:
Current: Link checking, package validation, file existence
Recommended: Basic file existence only
```yaml
- name: Check documentation
  run: |
    test -f README.md && test -f LICENSE && test -f CHANGELOG.md
    echo "✅ Core documentation present"
```
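A chained `test -f` fails without saying which file is absent. If that diagnostic matters, a slightly longer variant can name the missing files. This is a sketch only; `check_docs` is a hypothetical helper, and the demo uses a scratch directory so it does not depend on the repository layout.

```bash
# Report every missing file under a directory; return non-zero if any is absent.
check_docs() {
  dir=$1
  shift
  missing=0
  for f in "$@"; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}

demo=$(mktemp -d)
touch "$demo/README.md" "$demo/LICENSE"
check_docs "$demo" README.md LICENSE CHANGELOG.md || echo "documentation check failed"

touch "$demo/CHANGELOG.md"
check_docs "$demo" README.md LICENSE CHANGELOG.md && echo "✅ Core documentation present"
rm -rf "$demo"
```

In the workflow the call would be `check_docs . README.md LICENSE CHANGELOG.md`, which keeps the check limited to file existence as recommended.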
ADR-007: Single Platform Testing
ADR-008: Move Performance to Scheduled Workflow
Problems:
Decision: Delete test.yml entirely
Rationale:
Implementation:
```bash
# Remove test.yml
rm .github/workflows/test.yml

# Update branch protection rules to use ci.yml instead
```
ADR-009: Remove Duplicate Test Workflow
Current Performance: ✅ 100% success rate
Assessment: This workflow is well-designed and doesn't need optimization
Recommendations: Keep as-is, no changes needed
Priority: URGENT
Delete fake integration tests (ADR-003)
Disable automated rollback (ADR-005)
Remove duplicate test.yml (ADR-009)
Expected Benefits:
Priority: HIGH
Consolidate CI pipeline (ADR-001, ADR-002)
Merge truth scoring into CI (ADR-006)
Simplify verification pipeline (ADR-007, ADR-008)
Expected Benefits:
Priority: MEDIUM
Optimize caching strategy
Add failure retry logic
Documentation updates
| Metric | Current Value |
|---|---|
| Average CI Duration | ~15 minutes |
| Integration Tests Duration | ~25 minutes |
| Workflow Failure Rate | 75% (3 of 4 workflows failing) |
| Daily Wasted Compute Time | ~120 minutes |
| Lines of Workflow Code | ~2,500 lines |
| Active Workflows | 7 workflows |
| Metric | Target Value | Improvement |
|---|---|---|
| Average CI Duration | ~5 minutes | 67% faster |
| Integration Tests Duration | ~5 minutes | 80% faster |
| Workflow Failure Rate | <10% | 87% improvement |
| Daily Wasted Compute Time | ~10 minutes | 92% reduction |
| Lines of Workflow Code | ~800 lines | 68% reduction |
| Active Workflows | 4 workflows | 43% reduction |
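The improvement column is the relative reduction from the current value to the target, (current - target) / current, rounded to a whole percent. A short sketch that reproduces the figures from the two tables; `pct_reduction` is a hypothetical helper name:

```bash
# Rounded percentage reduction from $1 (current value) to $2 (target value).
pct_reduction() {
  awk -v cur="$1" -v tgt="$2" 'BEGIN { printf "%.0f\n", (cur - tgt) / cur * 100 }'
}

echo "CI duration (15 -> 5 min):          $(pct_reduction 15 5)%"
echo "Integration tests (25 -> 5 min):    $(pct_reduction 25 5)%"
echo "Failure rate (75% -> 10%):          $(pct_reduction 75 10)%"
echo "Wasted compute (120 -> 10 min):     $(pct_reduction 120 10)%"
echo "Workflow code (2500 -> 800 lines):  $(pct_reduction 2500 800)%"
echo "Active workflows (7 -> 4):          $(pct_reduction 7 4)%"
```

The failure-rate row uses 10% as the target, so its 87% is a lower bound given that the stated goal is below 10%.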
Reliability:
Speed:
Cost:
Maintainability:
✅ Can implement immediately:
Risk Level: LOW
Impact: HIGH
Recommendation: Implement in Phase 1
⚠️ Require testing:
Risk Level: MEDIUM
Impact: MEDIUM-HIGH
Recommendation: Implement with monitoring in Phase 2
🔴 Require careful planning:
Restore previous workflows:
```bash
git revert <commit-sha>
git push origin main
```
Monitor for 24 hours:
Adjust strategy if needed:
Simulated Tests → Real Integration Tests
Duplicate Workflows → Single CI Pipeline
Over-Engineering → Simplification
False Reliability → True Quality Gates
Week 0 (Before Implementation):
Week 1 (Phase 1):
Week 2 (Phase 2):
Week 3 (Phase 3):
CI Health Dashboard:
Metrics to Monitor:
- CI success rate (target: >95%)
- Average duration (target: <5min)
- P95 duration (target: <8min)
- Failure categories (test, build, lint, etc.)
- Cost per run (GitHub Actions minutes)
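As a rough illustration of how the first three metrics could be derived, the sketch below takes one line per run in the form `<conclusion> <duration_seconds>` and prints success rate, average duration, and a nearest-rank P95. The input format and the `ci_metrics` name are assumptions; how the run data is exported is left open.

```bash
# stdin: one run per line, "<conclusion> <duration_seconds>".
ci_metrics() {
  sort -k2,2n | awk '
    { d[NR] = $2; total += $2; if ($1 == "success") ok++ }
    END {
      if (NR == 0) exit 1
      idx = int((NR * 95 + 99) / 100)   # nearest-rank P95: ceil(0.95 * NR)
      printf "success_rate=%.0f%% avg=%.0fs p95=%.0fs\n", ok / NR * 100, total / NR, d[idx]
    }'
}

# Five sample runs (made-up numbers for illustration only).
printf '%s\n' \
  'success 240' \
  'success 300' \
  'failure 420' \
  'success 270' \
  'success 270' | ci_metrics
```

For the sample data this reports an 80% success rate, a 300 second average, and a 420 second P95, which can then be compared against the >95%, <5min, and <8min targets.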
Alert Thresholds:
This optimization strategy will transform the claude-code-flow GitHub Actions workflows from a complex, unreliable system to a streamlined, maintainable CI/CD pipeline. The three-phase implementation plan minimizes risk while maximizing benefits.
Key Outcomes:
Next Steps:
| Current Workflows | Status | Optimized Workflows | Status |
|---|---|---|---|
| ci.yml (7 jobs) | ❌ Failing | ci-optimized.yml (3 jobs) | ✅ Designed |
| test.yml | ❌ Duplicate | Deleted | ✅ Removed |
| integration-tests.yml | ❌ Fake tests | integration-real.yml | ✅ Designed |
| rollback-manager.yml | ❌ Dangerous | ci-failure-notify.yml | ✅ Designed |
| truth-scoring.yml | ⚠️ Redundant | Merged into CI | ✅ Simplified |
| verification-pipeline.yml | ⚠️ Slow | verification-simple.yml | ✅ Designed |
| status-badges.yml | ✅ Working | Keep as-is | ✅ No change |
```yaml
# .github/workflows/ci-optimized.yml
name: CI Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  NODE_VERSION: '20'

jobs:
  quality-and-security:
    name: Quality & Security
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run quality checks in parallel
        run: |
          # Run checks in parallel for speed; wait on each PID so a failing
          # check fails the step (a bare `wait` always returns 0)
          npm run lint & lint_pid=$!
          npm run typecheck & typecheck_pid=$!
          status=0
          wait "$lint_pid" || status=1
          wait "$typecheck_pid" || status=1

          # Security checks (non-blocking)
          npm audit --audit-level=high || echo "⚠️ Security review needed"
          exit "$status"

      - name: Calculate quality score
        run: |
          # grep -c already prints 0 when nothing matches, so use `|| true`
          ERRORS=$(npm run lint 2>&1 | grep -c "error" || true)
          SCORE=$((100 - ERRORS * 5))
          echo "Quality Score: $SCORE/100"
          [ "$SCORE" -ge 85 ] || exit 1

  test-and-build:
    name: Test & Build
    runs-on: ubuntu-latest
    needs: quality-and-security
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run tests with coverage
        run: npm run test:coverage

      - name: Build project
        run: npm run build:ts

      - name: Verify CLI
        run: |
          ./bin/claude-flow --version
          ./bin/claude-flow --help

      - name: Upload artifacts
        uses: actions/upload-artifact@v4
        with:
          name: build-artifacts
          path: |
            dist/
            coverage/
          retention-days: 7

  deploy:
    name: Deploy
    runs-on: ubuntu-latest
    needs: test-and-build
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    steps:
      - uses: actions/checkout@v4

      - name: Download artifacts
        uses: actions/download-artifact@v4
        with:
          name: build-artifacts

      - name: Deploy
        run: echo "✅ Ready for deployment"
```
```yaml
# .github/workflows/integration-real.yml
name: Integration Tests

on:
  push:
    branches: [main, develop]
  pull_request:
  workflow_dispatch:

jobs:
  integration-tests:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        test-suite:
          - swarm
          - coordination
          - memory
          - cli
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'

      - run: npm ci

      - name: Run ${{ matrix.test-suite }} integration tests
        run: npm run test:integration -- ${{ matrix.test-suite }}
        timeout-minutes: 5

      - name: Upload test results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: integration-results-${{ matrix.test-suite }}
          path: test-results/
          retention-days: 7
```
```yaml
# .github/workflows/ci-failure-notify.yml
name: CI Failure Notification

on:
  workflow_run:
    workflows: ["CI Pipeline"]
    types: [completed]
    branches: [main]

jobs:
  notify-failure:
    if: github.event.workflow_run.conclusion == 'failure'
    runs-on: ubuntu-latest
    steps:
      - name: Create failure issue
        uses: actions/github-script@v7
        with:
          script: |
            const { data: issues } = await github.rest.issues.listForRepo({
              owner: context.repo.owner,
              repo: context.repo.repo,
              labels: 'ci-failure',
              state: 'open'
            });

            // Don't create duplicate issues
            if (issues.length > 0) {
              console.log('CI failure issue already exists');
              return;
            }

            // For workflow_run events, context.sha is the default branch head,
            // so the failed run's commit is read from the event payload.
            await github.rest.issues.create({
              owner: context.repo.owner,
              repo: context.repo.repo,
              title: '🚨 CI Failed on Main Branch',
              body: `CI/CD Pipeline failed on main branch.

            **Commit:** ${context.payload.workflow_run.head_sha}
            **Workflow:** ${context.payload.workflow_run.html_url}
            **Time:** ${new Date().toISOString()}

            ## Action Required
            1. Review the failed workflow run
            2. Determine root cause
            3. Decide if rollback is needed
            4. Close this issue when resolved

            cc @team-leads`,
              labels: ['ci-failure', 'urgent', 'main-branch']
            });
```
| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-11-24 | System Architect | Initial comprehensive optimization strategy |
End of Document