eval/README.md
Evaluate whether GitNexus code intelligence improves AI agent performance on real software engineering tasks. Runs SWE-bench instances across multiple models and compares baseline (no graph) vs GitNexus-enhanced configurations.
Hypothesis: Giving AI agents structural code intelligence (call graphs, execution flows, blast radius analysis) improves their ability to resolve real GitHub issues — measured by resolve rate, cost, and efficiency.
Evaluation modes:
| Mode | What the agent gets |
|---|---|
| `baseline` | Standard bash tools (grep, find, cat, sed) — control group |
| `native` | Baseline + explicit GitNexus tools via eval-server (~100ms) |
| `native_augment` | Native tools + grep results automatically enriched with graph context (recommended) |
Recommended: Use `native_augment` mode. It mirrors the Claude Code model — the agent gets both explicit GitNexus tools (fast bash commands) AND automatic enrichment of grep results with callers, callees, and execution flows. The agent decides when to use explicit tools vs. rely on enriched search output.
Models supported: any model with a YAML config in `configs/models/` (e.g. `claude-sonnet` and `claude-haiku`, as used in the commands below); see the model config example below to add your own.
Setup:

```bash
cd eval

# Install dependencies
pip install -e .

# Set up API keys — copy the template and fill in your keys
cp .env.example .env
# Then edit .env and paste your key(s)
```
All models are routed through OpenRouter by default, so a single OPENROUTER_API_KEY is all you need. To use provider APIs directly (Anthropic, ZhipuAI, etc.), edit the model YAML in configs/models/ and set the corresponding key in .env.
```bash
# Pull SWE-bench Docker images (pulled on-demand, but you can pre-pull)
docker pull swebench/sweb.eval.x86_64.django_1776_django-16527:latest
```
Set GITNEXUS_EVAL_DEBUG=1 to include full Python tracebacks in run summaries and logs. By default, errors are sanitized to avoid leaking host paths or stack traces.
Running evaluations:

```bash
# Fastest way to verify everything works
python run_eval.py debug -m claude-haiku -i django__django-16527 --subset lite

# 5 instances, Claude Sonnet, native_augment mode (default)
python run_eval.py single -m claude-sonnet --subset lite --slice 0:5

# Baseline comparison (no GitNexus)
python run_eval.py single -m claude-sonnet --mode baseline --subset lite --slice 0:5

# Full Lite benchmark, 4 parallel workers
python run_eval.py single -m claude-sonnet --subset lite -w 4

# All models x all modes
python run_eval.py matrix --subset lite -w 4

# Key comparison: baseline vs native_augment
python run_eval.py matrix -m claude-sonnet -m claude-haiku --modes baseline --modes native_augment --subset lite --slice 0:50
```
Analyzing results:

```bash
# Summary table
python -m analysis.analyze_results results/

# Compare modes for a specific model
python -m analysis.analyze_results compare-modes results/ -m claude-sonnet

# GitNexus tool usage analysis
python -m analysis.analyze_results gitnexus-usage results/

# Export as CSV for further analysis
python -m analysis.analyze_results summary results/ --format csv > results.csv

# Run official SWE-bench test evaluation
python -m analysis.analyze_results summary results/ --swebench-eval
```
```bash
# List available model and mode configs
python run_eval.py list-configs
```
Directory layout:

```
eval/
  run_eval.py                       # Main entry point (single, matrix, debug commands)
  agents/
    gitnexus_agent.py               # GitNexusAgent: extends DefaultAgent with augmentation + metrics
  environments/
    gitnexus_docker.py              # Docker env with GitNexus + eval-server + standalone tool scripts
  bridge/
    gitnexus_tools.sh               # Bash wrappers (legacy — now standalone scripts are installed directly)
    mcp_bridge.py                   # Legacy MCP bridge (kept for reference)
  prompts/
    system_baseline.jinja           # System: persona + format rules
    instance_baseline.jinja         # Instance: task + workflow
    system_native.jinja             # System: + GitNexus tool reference
    instance_native.jinja           # Instance: + GitNexus debugging workflow
    system_native_augment.jinja     # System: + GitNexus tools + grep enrichment docs
    instance_native_augment.jinja   # Instance: + GitNexus workflow + risk assessment
  configs/
    models/                         # Per-model YAML configs
    modes/                          # Per-mode YAML configs (baseline, native, native_augment)
  analysis/
    analyze_results.py              # Post-run comparative analysis
  results/                          # Output directory (gitignored)
```
mini-swe-agent requires two Jinja templates per mode: a system template (persona + format rules) and an instance template (the task, injected via `{{task}}`, plus the workflow). Each mode has a `system_{mode}.jinja` + `instance_{mode}.jinja` pair, and the agent loads both automatically based on the configured mode.
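For illustration, rendering one mode's pair looks roughly like this (a sketch; the real loading is handled by mini-swe-agent from the mode config, and the `task` value here is a placeholder):

```python
from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("prompts"))
mode = "native_augment"

# System prompt: persona + format rules (+ GitNexus tool docs for this mode)
system_prompt = env.get_template(f"system_{mode}.jinja").render()

# Instance prompt: the SWE-bench issue text is injected via {{task}}
instance_prompt = env.get_template(f"instance_{mode}.jinja").render(task="<issue text>")
```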
When the Docker environment starts:

- `gitnexus analyze` runs (or restores from cache)
- the `gitnexus eval-server` daemon starts (persistent HTTP server, keeps LadybugDB warm)
- standalone tool scripts are installed to `/usr/local/bin/` — works with `subprocess.run` (no `.bashrc` needed)

Tool call flow:

```
Agent → bash command → /usr/local/bin/gitnexus-query
  → curl localhost:4848/tool/query    (fast path: eval-server, ~100ms)
  → npx gitnexus query                (fallback: cold CLI, ~5-10s)
```
Each tool script in /usr/local/bin/ is standalone — no sourcing, no env inheritance needed. This is critical because mini-swe-agent runs every command via subprocess.run in a fresh subshell.
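Roughly what that execution model looks like (a sketch of the harness side, not the actual mini-swe-agent code; the `gitnexus-query` arguments are hypothetical):

```python
import subprocess

# Each agent action is a one-shot shell invocation: no rc files, no shared shell state.
# Anything the agent calls (gitnexus-query, gitnexus-augment, ...) must therefore be an
# executable already on PATH, which is why the scripts live in /usr/local/bin/.
action = "gitnexus-query callers some_function"  # hypothetical agent command
result = subprocess.run(
    ["bash", "-c", action],
    capture_output=True, text=True, timeout=60,
)
observation = result.stdout + result.stderr
```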
The eval-server is a lightweight HTTP daemon that keeps LadybugDB warm and serves GitNexus tool queries on localhost:4848, so each tool call returns in ~100ms instead of paying the ~5-10s cold CLI startup.
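As a rough illustration of the fast path, a request to the daemon looks something like this (the endpoint comes from the flow above; the request and response payload shapes are assumptions):

```python
import json
import urllib.request

# Hypothetical payload shape; only the endpoint (localhost:4848/tool/query) is taken from the flow above.
payload = json.dumps({"tool": "query", "args": ["callers", "some_function"]}).encode()
req = urllib.request.Request(
    "http://localhost:4848/tool/query",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=10) as resp:
    print(resp.read().decode())  # graph query result, served from the warm index
```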
SWE-bench repos repeat (Django has 200+ instances at different commits). The harness caches GitNexus indexes per (repo, commit) hash in ~/.gitnexus-eval-cache/ to avoid redundant re-indexing.
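Conceptually, the cache is keyed per (repo, commit) pair, along these lines (a sketch; the harness's actual key derivation and layout may differ):

```python
import hashlib
from pathlib import Path

def index_cache_dir(repo: str, commit: str) -> Path:
    """Cache directory for one (repo, commit) pair under ~/.gitnexus-eval-cache/."""
    key = hashlib.sha256(f"{repo}@{commit}".encode()).hexdigest()[:16]
    return Path.home() / ".gitnexus-eval-cache" / key

# All instances that share a repo and base commit reuse the same cached index.
print(index_cache_dir("django/django", "<base-commit-sha>"))
```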
When the agent runs grep or rg, the observation is post-processed: the agent class calls gitnexus-augment on the search pattern and appends [GitNexus] annotations showing callers, callees, and execution flows for matched symbols. This mirrors the Claude Code / Cursor hook integration.
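A minimal sketch of that post-processing step (illustrative only; the real logic lives in `GitNexusAgent`, and the exact `gitnexus-augment` invocation is an assumption):

```python
import subprocess

def augment_observation(command: str, observation: str) -> str:
    """Append graph context to grep/rg output, mirroring the enrichment described above."""
    if not command.lstrip().startswith(("grep", "rg")):
        return observation
    pattern = command.split()[-1]  # crude: treat the last argument as the search pattern
    proc = subprocess.run(
        ["gitnexus-augment", pattern],
        capture_output=True, text=True, timeout=30,
    )
    if proc.returncode == 0 and proc.stdout.strip():
        observation += "\n\n[GitNexus]\n" + proc.stdout.strip()
    return observation
```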
Create a YAML file in configs/models/:
```yaml
# configs/models/my-model.yaml
model:
  model_name: "openrouter/provider/model-name"
  cost_tracking: "ignore_errors"   # if not in litellm's cost DB
  model_kwargs:
    max_tokens: 8192
    temperature: 0
```
The model name follows litellm conventions.
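To sanity-check a name before a full run, a quick litellm call like the following should succeed with your `.env` keys loaded (the model string is a placeholder; substitute your config's `model_name`):

```python
import litellm

# Uses the same "openrouter/<provider>/<model>" naming as the YAML config above.
resp = litellm.completion(
    model="openrouter/provider/model-name",  # placeholder; use your model_name
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
print(resp.choices[0].message.content)
```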
| Metric | Description |
|---|---|
| Patch Rate | % of instances where agent produced a patch |
| Resolve Rate | % of instances where patch passes tests (requires --swebench-eval) |
| Total Cost | API cost across all instances |
| Avg Cost/Instance | Cost efficiency |
| API Calls | Number of LLM calls |
| GN Tool Calls | How many GitNexus tools the agent used |
| Augment Hits | How many grep/find results got enriched |
| Augment Hit Rate | % of search commands that got useful enrichment |
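For reference, the derived rates relate to per-instance records roughly as follows (a sketch; the field names here are hypothetical, not the harness's actual result schema):

```python
def summarize(records: list[dict]) -> dict:
    """Aggregate hypothetical per-instance records into the table's headline metrics."""
    n = len(records)
    searches = sum(r.get("search_commands", 0) for r in records)
    hits = sum(r.get("augment_hits", 0) for r in records)
    return {
        "patch_rate": sum(r["patched"] for r in records) / n,
        "resolve_rate": sum(r["resolved"] for r in records) / n,   # needs --swebench-eval
        "avg_cost_per_instance": sum(r["cost_usd"] for r in records) / n,
        "augment_hit_rate": hits / searches if searches else 0.0,
    }
```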