Most teams build an agent, ship it, and hope for the best. When something breaks, they dig through logs, guess at the fix, and redeploy. There's no systematic way to learn from production failures or prove that a change actually helped.
Opik takes a different approach. Every production trace becomes a potential learning signal. Every failure becomes a test case. Every fix is verified before it ships. The result is an agent that gets measurably better with every iteration — not because you're doing more work, but because the platform closes the loop for you.
The flywheel is the core loop that makes your agent better over time. Each turn through the cycle adds a new test case, fixes a real failure, and verifies the fix before it reaches production. The more you use it, the faster and more reliable it becomes.
<Tip>
You don't write test suites from scratch. You build them incrementally from real production failures. Your test coverage always reflects the actual failure modes of your agent.
</Tip>

<Steps>
### Capture production traces

Every request your agent handles is captured as a trace in Opik. LLM calls, tool invocations, retrieval steps, token usage, latencies: the full execution path is logged automatically.
```python
import opik

# Decorate the agent's entrypoint so every request is logged as a trace.
@opik.track(entrypoint=True, project_name="my-agent")
def my_agent(query: str) -> str:
    context = retrieve_context(query)
    response = call_llm(query, context)
    return response
```
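Opik builds the span tree from nesting: any function decorated with `@opik.track` that runs inside the entrypoint is recorded as a child span. A minimal sketch, with placeholder bodies standing in for your real retrieval and model calls:

```python
import opik

# Decorated helpers called from the tracked entrypoint become child
# spans, so the trace shows the full execution path automatically.
@opik.track
def retrieve_context(query: str) -> str:
    # Placeholder: swap in your real retrieval (vector search, API, ...).
    return "relevant help-article text"

@opik.track
def call_llm(query: str, context: str) -> str:
    # Placeholder: swap in your real model call.
    return f"Answer to {query!r}, grounded in: {context}"
```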
**Where this happens in Opik:** The Traces dashboard shows every request, with filtering by status, latency, cost, and custom tags.

### Debug the failure with Ollie
When you spot a trace that looks wrong — a hallucinated answer, ignored context, a failed tool call — open Ollie from the trace view. Describe what looks off, and Ollie walks the span tree to find the root cause.
Ollie doesn't just summarize the trace. It reads the full execution path, compares failing runs to successful ones, and identifies exactly which step went wrong and why.
Example prompts:

- "Why does this response contradict the retrieved context?"
- "Compare this trace with a successful run for a similar query. What's different?"
- "Which step produced the wrong answer?"
**Where this happens in Opik:** The Ollie debug panel is available from any trace view.

### Turn the failure into a test case
Once you understand the failure, turn it into a test case. Add the trace to a test suite with a natural-language assertion that captures the expected behavior. This is how your test coverage grows — not from a separate test-writing phase, but from real production failures.
Example assertions:

- "The response must include the specific steps from the help article"
- "The answer must be grounded in the retrieved context, not invented"
- "If the question is out of scope, the agent must say so instead of guessing"
You can add test cases through Ollie, the Opik UI, or the SDK:
```python
import opik

client = opik.Opik()

# Suites are created on first use and reused on later calls.
suite = client.get_or_create_test_suite(name="customer-support-qa")

# Each test case pairs a real production input with a natural-language
# assertion describing the expected behavior.
suite.add_test_case(
    input={"query": "How do I reset my password?"},
    assertions=["The response must include the specific steps from the help article"],
)
```
**Where this happens in Opik:** Test Suites in the Evaluation section.

### Fix the root cause
With the root cause identified and a test case in place, fix the issue. This might mean updating a prompt, adjusting tool definitions, or changing retrieval parameters.
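What the fix looks like depends on the root cause. As one illustrative sketch (the `SYSTEM_PROMPT` constant is a stand-in for however your own agent stores its prompt, not part of any Opik API), a hallucination fix might be a one-line grounding instruction:

```python
# Illustrative only: SYSTEM_PROMPT stands in for your own prompt config.

# Before: nothing anchors the model to the retrieved context.
SYSTEM_PROMPT = "You are a helpful support agent."

# After: the model is pinned to the retrieved help-article content.
SYSTEM_PROMPT = (
    "You are a helpful support agent. Answer only from the provided "
    "context. If the context does not cover the question, say so "
    "instead of guessing."
)
```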
With `opik connect` running, Ollie can read your source files and propose code changes directly. You see the diff, approve it, and the file is updated on your machine. Nothing changes without your approval.
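A minimal sketch of starting the connector from your project directory, assuming the command takes no required arguments:

```bash
opik connect
```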
This is the step where Ollie's full power shows — it has context from the trace (what went wrong), the test suite (what "correct" looks like), and your code (where to make the change). See Ollie and your codebase for the full workflow.
**Where this happens in Opik:** The Ollie chat panel with `opik connect` active.

### Verify before shipping
Before shipping, run the test suite against your updated agent. The suite checks every test case — including the one you just added — so you confirm the fix works and nothing else regressed.
```bash
opik test run --suite customer-support-qa
```
You get a pass/fail summary for every assertion. If something fails, you're back to step 2 — but now with a tighter feedback loop because the test case already exists.
**Where this happens in Opik:** Experiment comparison shows side-by-side results across runs.
</Steps>

The flywheel isn't just a process; it's a compounding investment:

- `opik connect` lets Ollie read and edit your code