.agents/skills/opik-integrations/workflow.md
The end-to-end playbook for building or updating an Opik SDK integration. Run the phases in order.
Pick the language reference up front — python.md or typescript.md — and keep it open throughout.
Either way, always end with the high-level report (Phase 8). Never invent results — every "supported" claim must be backed by a passing test or an MCP-verified trace; everything unverified goes under "not supported / not verified".
Modes: new (no integration exists), update (extend an existing one), or maintain (verify/repair an existing one). Maintain mode skips Phases 2–4 and runs Investigate → Verify → Test against current code to catch drift.

Get the environment ready yourself before investigating. Record what you did for the report; never print secret values.
- `sdks/python/.venv` for Python; `npm i` in the integration package for TS). Capture the resolved version. If the install is broken or pulls an incompatible core dep (e.g. it bumps pydantic and breaks litellm), pin to a known-good version and restore the disturbed dep.
- `sdks/python/tests/pytest.ini` (e.g. `MISTRALAI_API_KEY`); also check the shell env. If the library's own default env var differs from the test var (e.g. SDK wants `MISTRAL_API_KEY` but the test sets `MISTRALAI_API_KEY`), read the test var explicitly and pass it to the client.
- `OPIK_PROJECT_NAME` (e.g. `<name>-integration-demo`).
- `fake_backend`) tests, with live MCP verification marked "not verified".

Understand the target library before touching Opik. Prefer fanning out parallel explore agents over the library's installed source / docs.
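For the credential step of environment preparation, the "read the test var explicitly and pass it to the client" rule can be sketched as below. This is a minimal illustration, not part of the Opik SDK: `resolve_api_key` and `describe_key` are hypothetical helper names, and falling back to the library's default variable when the test variable is unset is an assumption.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    test_var: str,
    sdk_default_var: str,
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Prefer the variable the test suite sets; fall back to the SDK default.

    Hypothetical helper. The fallback order is an assumption.
    """
    source = os.environ if env is None else env
    return source.get(test_var) or source.get(sdk_default_var) or None


def describe_key(name: str, value: Optional[str]) -> str:
    """Report presence only, so the secret value never reaches logs or reports."""
    return f"{name}: {'set' if value else 'missing'}"


# Usage (commented out so the sketch has no side effects):
# key = resolve_api_key("MISTRALAI_API_KEY", "MISTRAL_API_KEY")
# print(describe_key("MISTRALAI_API_KEY", key))
# ...then pass `key` explicitly to the client constructor of the target library.
```

Passing the key explicitly avoids depending on whichever variable name the library happens to read by default.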
Answer all of these:
- `chat.complete`, `embed`, `rerank`). List sync and async variants.
- `opik.types.LLMProvider`).

Read the closest sibling integration in full: it is the template, and most decisions are already made there.
Clone-ability checkpoint (do this before designing). "Clone the closest sibling" only holds if the target is actually a clean analog. Before committing, confirm it — and if any of these are true, surface it as a design decision even in autonomous mode instead of silently picking:
- `Client` and a v2 `ClientV2` with different method signatures and response shapes). Decide which to support, and whether both are in scope.

A target that looks like "just another provider" but has any of the above is not a clean clone; say so before writing code.
Produce two artifacts in the scratchpad before designing:
A minimal runnable example script that exercises the target library directly (no Opik yet) — one non-streaming call, one streaming call, and any second method you plan to trace. This is what you'll later run in Phase 5 to verify logging.
A findings note — a short mapping table:
| Opik span field | Source in the library's request/response |
|---|---|
| input | … |
| output | … |
| usage | … |
| model | … |
| provider | … |
| span type | llm / tool / general |
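The mapping table above can also be captured as a small function while investigating, which makes gaps obvious early. The sketch below assumes an OpenAI-style chat payload (`messages`, `choices`, `usage` with `prompt_tokens` / `completion_tokens` / `total_tokens`); those shapes and the name `map_response_to_span_fields` are hypothetical and must be replaced with what the target library actually returns.

```python
from typing import Any, Dict


def map_response_to_span_fields(
    request: Dict[str, Any],
    response: Dict[str, Any],
    provider: str,
) -> Dict[str, Any]:
    """Map a (hypothetical) chat request/response pair to Opik span fields."""
    usage = response.get("usage") or {}
    return {
        # Keep input/output small and well-shaped, not a raw object dump.
        "input": {"messages": request.get("messages", [])},
        "output": {"choices": response.get("choices", [])},
        "usage": {
            "prompt_tokens": usage.get("prompt_tokens"),
            "completion_tokens": usage.get("completion_tokens"),
            "total_tokens": usage.get("total_tokens"),
        },
        # Prefer the model the server reports; fall back to the requested one.
        "model": response.get("model") or request.get("model"),
        "provider": provider,
        "type": "llm",
    }
```

Filling this in for one non-streaming response is usually enough to show which fields the library does not expose at all.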
Decide and write down:
- `track_<name>` for Python patching; `trackXxx` / `XxxCallbackHandler` / `XxxExporter` for TS).

In interactive mode, present this and wait for approval before writing code. In autonomous mode, proceed, but capture the design and any open questions (e.g. dedicated integration vs. an OpenAI-compatible docs page) in the Phase 8 report so the reviewer sees the decisions.
- `peerDependency` (TS); never import it at SDK package top level.

This is the proof that the integration actually logs correctly. Do not rely on reading code.
Verifiability is a hard gate for new/update. If you cannot run this phase — no API credential for the target, or no reachable backend the MCP can read — stop and surface it before writing integration code, in autonomous mode too. Do not produce an integration and then present it as done with verification "skipped": unverified integration code is the exact failure this skill exists to prevent. Offer the user the choice to supply a credential, proceed explicitly-unverified (clearly labelled, tests key-gated and skipped), or pick a different target. Only maintain mode on already-passing code may relax this.
- `~/.opik.config` (e.g. MCP → a hosted `*.dev.comet.com`, local config → localhost). Before relying on the MCP, confirm they match: log a trace, then try to read its project through the MCP. If the MCP can't see it, the backends differ. Either reconfigure the script's env (`OPIK_API_KEY`, `OPIK_URL_OVERRIDE`/base URL, `OPIK_WORKSPACE`, `OPIK_PROJECT_NAME`) to log into the MCP's backend, or fall back to SDK read-back (next note). Don't silently assume the MCP sees your trace.
- `flush()` before exit.
- `list` (`entity_type` `trace`, filtered by the project) then `read` (`entity_type` `trace`, which inlines spans; read span for detail).
- `type` is correct (`llm` for model calls)
- `input` / `output` captured and well-shaped (not empty, not the raw object dump)
- `usage` present with prompt/completion/total tokens
- `model` and `provider` set correctly
- `error_info` and still re-raises

SDK read-back fallback (equivalent evidence). When the MCP can't read the backend you can write to, verify against that backend over REST instead: `client = opik.Opik(); client.search_traces(project_name=...)` then `client.search_spans(trace_id=...)`, and assert the same checklist (type, input/output, usage, model, provider). This is a real backend round-trip; note in the report that read-back was via the SDK, not the MCP tool, and why.
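Whichever read path is used, the checklist itself can be made mechanical. The sketch below assumes the spans read back have already been converted to plain dictionaries; `check_llm_span` is a hypothetical helper, and the usage key names (`prompt_tokens`, `completion_tokens`, `total_tokens`) are an assumption about how the token counts are stored.

```python
from typing import Any, Dict, List

# Assumed key names for the prompt/completion/total token counts.
USAGE_KEYS = ("prompt_tokens", "completion_tokens", "total_tokens")


def check_llm_span(span: Dict[str, Any]) -> List[str]:
    """Return checklist failures for one span; an empty list means it passes."""
    problems: List[str] = []
    if span.get("type") != "llm":
        problems.append("type is not 'llm'")
    for field in ("input", "output"):
        if not span.get(field):
            problems.append(f"{field} is empty")
    usage = span.get("usage") or {}
    for key in USAGE_KEYS:
        if key not in usage:
            problems.append(f"usage.{key} is missing")
    for field in ("model", "provider"):
        if not span.get(field):
            problems.append(f"{field} is not set")
    return problems
```

Recording the returned list per span in the report keeps the "verified" claim tied to concrete evidence rather than a visual skim. The error-path item (`error_info` logged and the exception still re-raised) needs a separate failing call and is not covered by this helper.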
Add coverage with the language's harness — see the test section of the language reference, which delegates to the python-sdk / typescript-sdk testing skills.
- `sdks/python/tests/library_integration/<name>/`, using `fake_backend` and testlib `TraceModel`/`SpanModel` trees with `ANY_*` matchers. Assert input/output/usage/model/provider. Gate real API calls behind an `ensure_<name>_configured` fixture. Name tests `test_<what>__<case>__<expected>`.
- `*.test.ts` with vitest, mocking the API layer (or MSW), `await flush()` before asserting, fake timers for batching. Mirror the sibling integration's test file.

Author the Fern page following the write-docs skill for MDX/components, plus these integration-specific conventions:
- `track_openai`, or LiteLLM) and an entry already in `fern/versions/latest.yml`. If so, this is an update: lead the page with the new native integration and demote the workaround to an "Alternative" section (see how `mistral.mdx` keeps LiteLLM). Don't create a duplicate page or a second routing entry.
- `apps/opik-documentation/documentation/fern/docs-v2/integrations/<name>.mdx` for Python, `<name>-typescript.mdx` for TypeScript.
- `Observability for <Lib> (Python) with Opik` vs `(TypeScript)`.
- `openai.mdx` / `langchain.mdx`): intro/tips → account setup → getting started (install, configure Opik, configure the library) → basic usage (the wrap/handler call + a screenshot) → advanced usage → cost tracking → supported methods.
- `- page:` entry under the right language → category section (Frameworks, Model Providers, …) in `fern/versions/latest.yml`. Do not edit `docs.yml`.
- `<Card>` to `docs-v2/integrations/overview.mdx` under the matching section. Cards are title + href only; section icons are Font Awesome, so there is no per-integration icon to create.
- `<API_KEY>`), never real keys.

End every run with a high-level report. Keep it scannable: it's for a reviewer deciding whether to ship, not a changelog. Use this template:
## <Library> integration — report
**Mode:** new | update | maintain **Language:** Python | TypeScript
**Pattern:** method-patching | proxy | callback | OTel exporter **Entrypoint:** `track_<name>(...)`
**Library version prepared:** <name>==<version>
### What was done
- <files created/changed — bullets, grouped by integration / tests / docs>
- <prep actions: deps installed, version pins, fixtures/env added>
### Verification
- **MCP:** <project name + trace ids read back, or "not verified — <reason>">
- **Tests:** <N passing — list cases>; ruff/mypy <clean | issues>
### Flows supported & test coverage
Always enumerate **every user-facing flow** the integration handles and map each to its coverage — don't collapse this into a one-line "supported". A flow is a distinct way a user invokes the library; enumerate the cross-product that applies: each traced method, sync vs async, streaming vs non-streaming, nested under `@track`, the error path, and any option that changes behavior (custom `project_name`, `provider` override, tool/function calls, structured output). For each flow state whether it's implemented, which test covers it (by name), and whether it was MCP-verified.
| Flow | Implemented | Test | MCP-verified |
|---|---|---|---|
| `chat.complete` (sync, non-stream) | ✅ | `test_<name>_complete__happyflow` | ✅ trace `<id>` |
| `chat.complete_async` | ✅ | `test_<name>_complete_async__happyflow` | — |
| `chat.stream` (sync) | ✅ | `test_<name>_stream__happyflow` | ✅ trace `<id>` |
| `chat.stream_async` | ✅/❌ | … | … |
| nested under `@track` | ✅/❌ | … | … |
| error → `error_info` logged | ✅ | `test_<name>__error...` | — |
| token usage captured | ✅ | asserted in above | ✅ |
| custom `project_name` | ✅/❌ | param case | — |
Explicitly flag the gaps: any flow **implemented but not covered by a test**, and any flow **not implemented at all** (list it in the next section). The goal is that a reviewer can see, per flow, exactly what was proven.
### What's NOT supported / limitations
- <methods intentionally not patched, flows implemented-but-untested, known gaps, provider-not-in-LLMProvider caveats, env/backend blockers>
### Follow-ups
- <suggested next steps: more methods, TS counterpart, PR split, etc.>
### How to use
<minimal code snippet>
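The flow table in the report template can be seeded mechanically so that no combination is silently dropped. The sketch below enumerates only the method × sync/async × streaming part of the cross-product; `enumerate_flows` and `render_flow_table` are hypothetical helper names, and the `TODO` cells are placeholders to be filled from real test names and trace ids, never guessed. Rows for nesting under `@track`, the error path, and behavior-changing options still have to be added by hand.

```python
from itertools import product
from typing import List, Sequence, Tuple

Flow = Tuple[str, str, str]


def enumerate_flows(methods: Sequence[str]) -> List[Flow]:
    """Cross every traced method with sync/async and stream/non-stream."""
    return list(product(methods, ("sync", "async"), ("non-stream", "stream")))


def render_flow_table(flows: Sequence[Flow]) -> str:
    """Render the flows as a Markdown table with placeholder cells."""
    lines = [
        "| Flow | Implemented | Test | MCP-verified |",
        "|---|---|---|---|",
    ]
    for method, mode, streaming in flows:
        lines.append(f"| `{method}` ({mode}, {streaming}) | TODO | TODO | TODO |")
    return "\n".join(lines)
```

Combinations that do not apply to a method (for example, a method with no streaming variant) should be removed explicitly rather than left as `TODO`.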
Implementation at merge quality, MCP verification passed (Phase 5 checklist) or its absence reported, tests added and passing, docs page authored and routed, and the Phase 8 report delivered. Then run `make claude` if you added/edited files under `.agents/`.