.agents/skills/playwright-e2e/opik-app-context.md
This document provides domain knowledge about the Opik application for AI agents generating E2E tests.
Opik is an LLM observability and evaluation platform. It helps developers trace, evaluate, and monitor their LLM-powered applications. Think of it as "application performance monitoring (APM) for LLM apps."
Opik runs locally at the following endpoints:

- Frontend UI: `http://localhost:5173`
- Backend API: `http://localhost:5173/api` (proxied through the frontend)
- Test helper service: `http://localhost:5555` (Python SDK bridge for tests)

Every URL in Opik is workspace-scoped:

- Local (open source): the workspace is `default`, so URLs look like `http://localhost:5173/default/...`
- Cloud: `https://cloud.opik.com/{username}/...`

The `BasePage` class handles workspace-aware navigation automatically.
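For illustration, workspace-aware navigation might look like the sketch below. The `gotoWorkspacePage` helper and `WORKSPACE` constant are hypothetical; the real `BasePage` API may differ.

```typescript
import { Page } from '@playwright/test';

// Sketch only: the real BasePage may expose a different API.
// The base URL and workspace name come from the local setup described above.
const BASE_URL = 'http://localhost:5173';
const WORKSPACE = 'default';

// Navigate to a workspace-scoped route, e.g. gotoWorkspacePage(page, 'projects')
// resolves to http://localhost:5173/default/projects.
async function gotoWorkspacePage(page: Page, path: string): Promise<void> {
  await page.goto(`${BASE_URL}/${WORKSPACE}/${path}`);
}
```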
Core entities and where they live in the UI:

- **Projects** (`/{workspace}/projects`): Container for traces. Every trace belongs to a project.
- **Traces** (`/{workspace}/projects/{project-id}/traces`): A trace represents one complete execution of an LLM pipeline. Traces belong to projects.
- **Spans**: Sub-operations within a trace (e.g., an LLM call, a retrieval step). Viewed in the trace detail sidebar.
- **Threads** (`/{workspace}/projects/{project-id}` -> Logs tab -> Threads toggle): Conversation threads group related traces into multi-turn conversations. Accessed via the Logs tab inside a project, then toggling to the "Threads" view.
- **Datasets** (`/{workspace}/datasets`): Collections of input/output pairs used for evaluation.
- **Dataset Items**: Individual data records within a dataset. Managed inside the dataset detail page.
- **Experiments** (`/{workspace}/experiments`): Evaluation runs that test an LLM pipeline against a dataset. Each experiment belongs to a dataset.
- **Experiment Items**: Individual evaluation results within an experiment.
- **Prompts** (`/{workspace}/prompts`): Versioned prompt templates. Each prompt can have multiple commits (versions).
- **Feedback Definitions** (`/{workspace}/configuration?tab=feedback-definitions`): Definitions for scoring traces/spans. Two types: numerical (a min/max range) and categorical (named values, e.g., `{good: 1, bad: 0}`). See the sketch after this list.
- **AI Provider Configs** (`/{workspace}/configuration?tab=ai-provider`): API key management for LLM providers (OpenAI, Anthropic, etc.). Required for the Playground and Online Scoring features.
- **Playground** (`/{workspace}/playground`): Interactive LLM prompt testing interface. Select a model, enter a prompt, run it, and see the response.
- **Online Scoring Rules**: Automated scoring rules that evaluate new traces as they arrive. Configured per project.
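To make the two feedback definition types concrete, here is an illustrative sketch of how they might be modeled in test fixtures. The type and field names are assumptions for illustration, not the Opik API schema.

```typescript
// Illustrative shapes for the two feedback definition types.
// Field names are assumptions for test fixtures, not the Opik API schema.
type NumericalFeedbackDefinition = {
  name: string;
  type: 'numerical';
  min: number;
  max: number;
};

type CategoricalFeedbackDefinition = {
  name: string;
  type: 'categorical';
  categories: Record<string, number>;
};

const relevance: NumericalFeedbackDefinition = {
  name: 'relevance',
  type: 'numerical',
  min: 0,
  max: 10,
};

const quality: CategoricalFeedbackDefinition = {
  name: 'quality',
  type: 'categorical',
  categories: { good: 1, bad: 0 }, // matches the {good: 1, bad: 0} example above
};
```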
The entity hierarchy:

Project
├── Traces
│   ├── Spans
│   └── Feedback Scores (on traces)
└── Online Scoring Rules

Dataset
├── Dataset Items
└── Experiments
    └── Experiment Items

Prompts
└── Commits (versions)

Configuration
├── Feedback Definitions
└── AI Provider Configs
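For planning test data, the hierarchy can be mirrored as simple fixture shapes; the sketch below is illustrative only and does not reflect the Opik SDK's own types.

```typescript
// Illustrative fixture shapes mirroring the hierarchy above.
// These are planning aids for tests, not the Opik SDK's own types.
interface TraceFixture {
  name: string;
  spans: string[];                          // span names
  feedbackScores: Record<string, number>;   // scores attached to the trace
}

interface ProjectFixture {
  name: string;
  traces: TraceFixture[];
  onlineScoringRules: string[];             // rule names
}

interface ExperimentFixture {
  name: string;
  items: Record<string, unknown>[];         // experiment items
}

interface DatasetFixture {
  name: string;
  items: Record<string, unknown>[];         // dataset items
  experiments: ExperimentFixture[];
}

interface PromptFixture {
  name: string;
  commits: string[];                        // prompt versions
}
```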
Every entity can be managed through two interfaces:
- **UI**: the web app at `http://localhost:5173`
- **SDK**: the `opik` Python package

Tests should verify both directions: entities created through the SDK should appear in the UI, and entities created through the UI should be readable through the SDK.
TypeScript E2E tests cannot directly use the Python SDK. Instead, they use a Flask HTTP service that wraps SDK calls:
TypeScript Test --> TestHelperClient (HTTP) --> Flask Service (localhost:5555) --> Python SDK --> Opik Backend
The `TestHelperClient` class in `helpers/test-helper-client.ts` provides typed methods for all SDK operations.
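As an illustration of that pattern, a minimal HTTP wrapper could look like the sketch below. The class name, endpoint path, and payload are assumptions, not the actual `TestHelperClient` surface.

```typescript
import { APIRequestContext } from '@playwright/test';

// Sketch only: the real TestHelperClient (helpers/test-helper-client.ts)
// defines the actual endpoints and method names.
const HELPER_BASE_URL = 'http://localhost:5555';

class SdkBridgeClient {
  constructor(private readonly request: APIRequestContext) {}

  // Hypothetical endpoint: ask the Flask service to create a project via the Python SDK.
  async createProject(name: string): Promise<void> {
    const response = await this.request.post(`${HELPER_BASE_URL}/projects`, {
      data: { name },
    });
    if (!response.ok()) {
      throw new Error(`Helper service returned ${response.status()}`);
    }
  }
}
```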
After creating/updating entities via the SDK, there may be a short delay before they appear in the UI. The test infrastructure handles this with:
- `helperClient.waitForProjectVisible(name, retries)`
- `helperClient.waitForDatasetVisible(name, retries)`
- `helperClient.waitForTracesVisible(projectName, count, retries)`

Always use these wait methods between SDK creation and UI verification.
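Putting the pieces together, a test following the SDK-to-UI direction might look like this sketch. The `createProject` method, the route, and the text assertion are illustrative assumptions; `waitForProjectVisible` is one of the wait helpers listed above.

```typescript
import { test, expect } from '@playwright/test';

// Sketch only: the helper client's creation method and the UI assertion are assumptions.
declare const helperClient: {
  createProject(name: string): Promise<void>;
  waitForProjectVisible(name: string, retries: number): Promise<void>;
};

test('project created via the SDK bridge is visible in the UI', async ({ page }) => {
  const projectName = `e2e-project-${Date.now()}`;

  // 1. Create the entity through the SDK bridge.
  await helperClient.createProject(projectName);

  // 2. Bridge the SDK -> UI delay with the provided wait helper.
  await helperClient.waitForProjectVisible(projectName, 5);

  // 3. Verify in the workspace-scoped UI.
  await page.goto('http://localhost:5173/default/projects');
  await expect(page.getByText(projectName)).toBeVisible();
});
```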