Harbor is a benchmark evaluation framework for autonomous LLM agents. It provides standardized infrastructure for running agents against benchmarks like SWE-bench, LiveCodeBench, Terminal-Bench, and others.
Harbor enables you to evaluate LLM agents on complex coding tasks, tracking their trajectories using the ATIF (Agent Trajectory Interchange Format) specification.
Opik integrates with Harbor to log traces for all trial executions, including agent trajectory steps, verifier rewards as feedback scores, and token usage and cost metrics.
Comet provides a hosted version of the Opik platform; simply create an account and grab your API key. You can also run the Opik platform locally; see the installation guide for more information.
First, ensure you have both `opik` and `harbor` installed:

```bash
pip install opik harbor
```
Configure the Opik Python SDK for your deployment type. See the Python SDK Configuration guide for detailed instructions on:

- **CLI configuration**: `opik configure`
- **Code configuration**: `opik.configure()`
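If you prefer configuring from code, a minimal sketch for Opik Cloud looks like this (the API key and workspace values are placeholders):

```python
import opik

# Configure the SDK for Opik Cloud; both values below are placeholders.
opik.configure(
    api_key="YOUR_API_KEY",
    workspace="YOUR_WORKSPACE",
)
```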
Harbor requires configuration for the agent and benchmark you want to evaluate. Refer to the Harbor documentation for details on setting up your job configuration.
The easiest way to use Harbor with Opik is through the `opik harbor` CLI command. This automatically enables Opik tracking for all trial executions without modifying your code.
```bash
# Run a benchmark with Opik tracking
opik harbor run -d terminal-bench@head -a terminus_2 -m gpt-4.1

# Use a configuration file
opik harbor run -c config.yaml

# Set project name via environment variable
export OPIK_PROJECT_NAME=my-benchmark
opik harbor run -d swebench@lite
```
All Harbor CLI commands are available as subcommands of `opik harbor`:
```bash
# Run a job (alias for jobs start)
opik harbor run [HARBOR_OPTIONS]

# Job management
opik harbor jobs start [HARBOR_OPTIONS]
opik harbor jobs resume -p ./jobs/my-job

# Single trial
opik harbor trials start -p ./my-task -a terminus_2

# View available options
opik harbor --help
opik harbor run --help
```
Here's a complete example running a SWE-bench evaluation with Opik tracking:
```bash
# Configure Opik
opik configure

# Set project name
export OPIK_PROJECT_NAME=swebench-claude-sonnet

# Run SWE-bench evaluation with tracking
opik harbor run \
  -d swebench-lite@head \
  -a claude-code \
  -m claude-3-5-sonnet-20241022
```
Harbor supports integrating your own custom agents without modifying the Harbor source code. There are two types of agents you can create:

- Agents that are installed and run directly inside the task environment
- Agents that interact with the environment through the `BaseEnvironment` interface, typically by executing bash commands

For details on implementing custom agents, see the Harbor Agents documentation.
To run a custom agent with Opik tracking, use the `--agent-import-path` flag:

```bash
opik harbor run -d "terminal-bench@head" --agent-import-path path.to.agent:MyCustomAgent
```
When building custom agents, you can use Opik's `@track` decorator on methods within your agent implementation. These decorated functions will automatically be captured as spans within the trial trace, giving you detailed visibility into your agent's internal logic:
```python
from harbor.agents.base import BaseAgent
from opik import track


class MyCustomAgent(BaseAgent):
    @staticmethod
    def name() -> str:
        return "my-custom-agent"

    @track
    async def plan_next_action(self, observation: str) -> str:
        # This function will appear as a span in Opik.
        # Add your planning logic here and return the next command to run.
        action = "echo 'plan the next step here'"
        return action

    @track
    async def execute_tool(self, tool_name: str, args: dict) -> str:
        # This will also be tracked as a nested span.
        # _run_tool is your own helper for dispatching tool calls.
        result = await self._run_tool(tool_name, args)
        return result

    async def run(self, instruction: str, environment, context) -> None:
        # Your main agent loop
        done = False
        while not done:
            observation = await environment.exec("pwd")
            action = await self.plan_next_action(observation)
            result = await self.execute_tool("bash", {"command": action})
            # Set `done` once the task is complete
            done = True
```
This allows you to trace not just the ATIF trajectory steps, but also the internal decision-making processes of your custom agent.
Each trial completion creates an Opik trace for the full trial execution. The integration automatically creates spans for each step in the agent's trajectory, so every trajectory step appears as its own span, giving you detailed visibility into the agent-environment interaction.
Harbor's verifier produces rewards like `{"pass": 1, "tests_passed": 5}`. These are automatically converted to Opik feedback scores, so you can filter, sort, and compare trials directly in the Opik UI.
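As a rough sketch, you can also query trials by score from the Python SDK; this assumes the `search_traces` API with a filter string, and the project name is a placeholder:

```python
import opik

client = opik.Opik()

# Fetch traces whose "pass" feedback score equals 1
passing = client.search_traces(
    project_name="swebench-claude-sonnet",
    filter_string='feedback_scores."pass" = 1',
)
print(f"{len(passing)} passing trials")
```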
The Harbor integration automatically extracts token usage and cost from ATIF trajectory metrics. If your agent records `prompt_tokens`, `completion_tokens`, and `cost_usd` in step metrics, these are captured in Opik spans.
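For illustration, a step's metrics would carry these fields; the dictionary below is a hypothetical sketch, with only the three field names taken from the integration:

```python
# Hypothetical per-step metrics; the integration reads these three fields.
step_metrics = {
    "prompt_tokens": 1200,
    "completion_tokens": 350,
    "cost_usd": 0.0042,
}
```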
| Variable | Description |
|---|---|
| `OPIK_PROJECT_NAME` | Default project name for traces |
| `OPIK_API_KEY` | API key for Opik Cloud |
| `OPIK_WORKSPACE` | Workspace name (for Opik Cloud) |