# Test Harness
Comprehensive API-based E2E testing system for benchmarking the Kortix agent system.
## Setup

```bash
cd backend
supabase db push
python api.py
```
> **Note:** The test harness automatically creates a test user (`[email protected]`) if it doesn't exist. No manual user setup is required.
## Quick Start

```bash
# Core test (real LLM)
curl -X POST http://localhost:8000/v1/admin/test-harness/run \
  -H "X-Admin-Api-Key: $KORTIX_ADMIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "mode": "core_test",
    "concurrency": 3,
    "model": "kortix/basic"
  }'
```

```bash
# Get results
curl http://localhost:8000/v1/admin/test-harness/runs/{run_id} \
  -H "X-Admin-Api-Key: $KORTIX_ADMIN_API_KEY"
```
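The results call above can also be made from Python. A minimal sketch using the standard library (the helper names and the use of `urllib` are illustrative assumptions, not part of the harness):

```python
import json
import urllib.request

# Base URL assumed from the curl examples in this README.
BASE_URL = "http://localhost:8000/v1/admin/test-harness"


def build_results_request(run_id: str, api_key: str) -> urllib.request.Request:
    """Build the GET request for a run's results."""
    return urllib.request.Request(
        f"{BASE_URL}/runs/{run_id}",
        headers={"X-Admin-Api-Key": api_key},
    )


def fetch_results(run_id: str, api_key: str) -> dict:
    """Fetch and decode a run's results (requires the API server to be up)."""
    with urllib.request.urlopen(build_results_request(run_id, api_key)) as resp:
        return json.load(resp)
```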
## Test Prompts

The test harness includes 13 deterministic test prompts covering a range of task categories.
View all prompts:

```bash
curl http://localhost:8000/v1/admin/test-harness/prompts \
  -H "X-Admin-Api-Key: $KORTIX_ADMIN_API_KEY"
```
## API Reference

### `POST /v1/admin/test-harness/run`

Start a new benchmark test.

Request:

```json
{
  "mode": "core_test",
  "prompt_ids": ["file_ops_1", "shell_1"],
  "concurrency": 5,
  "model": "kortix/basic",
  "num_executions": 100,
  "metadata": {"branch": "main", "commit": "abc123"}
}
```

Response:

```json
{
  "run_id": "uuid",
  "status": "running",
  "message": "Test started successfully"
}
```
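A run with this request shape can be started from a script. The sketch below builds the POST request with the standard library (the helper name and defaults are assumptions; only the endpoint, headers, and payload fields come from this README):

```python
import json
import urllib.request


def build_start_request(api_key: str, **options) -> urllib.request.Request:
    """Build the POST request that starts a benchmark run."""
    # Defaults mirror the request example above; any field can be overridden
    # via keyword arguments (e.g. prompt_ids, num_executions, metadata).
    payload = {"mode": "core_test", "concurrency": 5, "model": "kortix/basic"}
    payload.update(options)
    return urllib.request.Request(
        "http://localhost:8000/v1/admin/test-harness/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "X-Admin-Api-Key": api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` and decoding the JSON body yields the `run_id` and `status` shown in the response example.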
### `GET /v1/admin/test-harness/runs/{run_id}`

Get test results and summary.

Response:

```json
{
  "run_id": "uuid",
  "status": "completed",
  "summary": {
    "total_prompts": 10,
    "successful": 9,
    "failed": 1,
    "avg_duration_ms": 5234,
    "avg_cold_start_ms": 450,
    "avg_tool_call_time_ms": 1200,
    "tool_call_breakdown": {...},
    "slowest_tool_calls": [...]
  },
  "results": [...]
}
```
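A CI gate can derive a verdict from the `summary` fields shown above. A minimal sketch (the helper name and any threshold are arbitrary; only the field names come from the response example):

```python
def pass_rate(summary: dict) -> float:
    """Fraction of prompts that succeeded, per the summary block."""
    total = summary["total_prompts"]
    return summary["successful"] / total if total else 0.0


summary = {"total_prompts": 10, "successful": 9, "failed": 1}
rate = pass_rate(summary)  # 0.9 for the example summary above
```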
### List recent test runs

Query parameters:

- `limit`: Max results (default: 20)
- `run_type`: Filter by `core_test` or `stress_test`

### Cancel an active test run
### 🚨 Emergency stop

Cancel ALL active test runs. Use this in emergency situations to immediately stop all running tests.
Response:

```json
{
  "message": "Emergency stop completed - cancelled 2 test runs",
  "cancelled_count": 2,
  "cancelled_runs": ["run_id_1", "run_id_2"],
  "errors": null
}
```
### `GET /v1/admin/test-harness/prompts`

List all available test prompts.
## Architecture

For each test run:

```
GitHub Actions → Admin API → TestHarnessRunner → /agent/start
                                                      ↓
                                             SSE Stream Parser
                                                      ↓
                                             MetricsCollector
                                                      ↓
                                             Supabase Storage
```
## Extending

### Adding test prompts

Edit `prompts.py`:

```python
TestPrompt(
    id="my_test_1",
    text="Your test prompt here",
    category="custom",
    expected_tools=["tool_name"],
    min_tool_calls=1,
    max_duration_ms=30000,
    description="What this tests",
)
```
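As a sketch of how a prompt's thresholds might be checked against a run's result (the `check_result` helper and the result field names below are assumptions for illustration, not the harness's actual API):

```python
def check_result(prompt: dict, result: dict) -> list[str]:
    """Return a list of threshold violations; an empty list means pass."""
    failures = []
    # Too few tool calls compared to the prompt's minimum.
    if result["tool_calls"] < prompt["min_tool_calls"]:
        failures.append("too few tool calls")
    # Ran longer than the prompt's time budget.
    if result["duration_ms"] > prompt["max_duration_ms"]:
        failures.append("exceeded max_duration_ms")
    # Expected tools that were never invoked.
    missing = set(prompt["expected_tools"]) - set(result["tools_used"])
    if missing:
        failures.append(f"missing expected tools: {sorted(missing)}")
    return failures
```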
### Adding metrics

Extend `BenchmarkResult` in `metrics.py` to track additional metrics.
### Mock LLM responses

The `mock_llm.py` module provides deterministic responses. Customize `_determine_tool_calls()` to add new tool patterns.
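A tool pattern in this style might look like the following sketch (the phrases and tool names are invented examples, not the actual mappings in `mock_llm.py`):

```python
# Invented keyword → tool mapping, for illustration only.
TOOL_PATTERNS = {
    "create a file": "create_file",
    "run the command": "execute_command",
    "search the web": "web_search",
}


def determine_tool_calls(prompt_text: str) -> list[str]:
    """Pick deterministic mock tool calls by substring match, in table order."""
    text = prompt_text.lower()
    return [tool for phrase, tool in TOOL_PATTERNS.items() if phrase in text]
```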
## Security

- All endpoints require the `X-Admin-Api-Key` header
- The test user (`[email protected]`) is automatically created with minimal permissions

## Troubleshooting

**Test hangs indefinitely:**
**High failure rate:**
**Timeout errors:** increase `max_duration_ms` for the affected prompts.