sdks/python/design/TESTING.md
The Opik Python SDK has a comprehensive test suite organized into multiple categories:
```
tests/
├── conftest.py                  # Root fixtures (context cleanup, client shutdown)
├── pytest.ini                   # Pytest configuration
├── test_requirements.txt        # Test dependencies
│
├── testlib/                     # Shared testing utilities
│   ├── models.py                                 # Test data models (TraceModel, SpanModel, etc.)
│   ├── backend_emulator_message_processor.py     # Fake backend
│   ├── assert_helpers.py                         # Assertion utilities
│   ├── any_compare_helpers.py                    # Flexible matchers (ANY, ANY_BUT_NONE)
│   ├── fake_message_factory.py                   # Message creation helpers
│   ├── noop_file_upload_manager.py               # No-op file uploader
│   └── environment.py                            # Environment utilities
│
├── unit/                        # Unit tests (no external dependencies)
│   ├── conftest.py              # Unit test fixtures
│   ├── api_objects/             # Tests for API objects
│   │   ├── test_opik_client.py
│   │   ├── dataset/
│   │   ├── experiment/
│   │   ├── trace/
│   │   └── ...
│   ├── decorator/               # Decorator tests
│   │   ├── test_tracker_outputs.py    # Comprehensive decorator tests
│   │   ├── test_dynamic_tracing.py
│   │   ├── test_span_context_manager.py
│   │   └── ...
│   ├── evaluation/              # Evaluation framework tests
│   │   ├── test_evaluate.py
│   │   ├── metrics/             # Metric tests
│   │   └── ...
│   ├── message_processing/      # Message processing tests
│   │   ├── test_message_streaming.py
│   │   ├── batching/
│   │   └── ...
│   └── ...                      # Other unit tests
│
├── library_integration/         # Integration tests with fake backend
│   ├── conftest.py              # Shared fixtures
│   ├── openai/                  # OpenAI integration tests
│   │   ├── requirements.txt
│   │   ├── constants.py
│   │   ├── test_openai_responses.py
│   │   └── ...
│   ├── anthropic/               # Anthropic integration tests
│   ├── langchain/               # LangChain integration tests
│   ├── bedrock/                 # AWS Bedrock tests
│   ├── litellm/                 # LiteLLM tests
│   └── ...                      # Other integrations
│
├── e2e/                         # End-to-end tests (real backend)
│   ├── conftest.py              # E2E fixtures
│   ├── verifiers.py             # Backend verification helpers
│   ├── test_tracing.py          # Core tracing tests
│   ├── test_dataset.py          # Dataset tests
│   ├── test_prompt.py           # Prompt tests
│   ├── evaluation/              # Evaluation E2E tests
│   └── ...
│
├── e2e_library_integration/     # E2E library integration (real backend)
│   ├── conftest.py              # E2E lib integration fixtures
│   ├── litellm/                 # LiteLLM E2E tests
│   ├── adk/                     # ADK E2E tests
│   └── ...
│
└── e2e_smoke/                   # Quick smoke tests
    ├── dry_run_import.py
    └── smoke_tests_runner.sh
```
## Unit Tests (`tests/unit/`)

**Purpose**: Fast, isolated tests with no external dependencies.

**Characteristics**: no real backend is required; traces and spans are captured in memory by the backend emulator, exposed through the `fake_backend` fixture.

**Key Fixtures**:
```python
@pytest.fixture
def fake_backend(patch_streamer):
    """
    Replaces Streamer with fake backend emulator.
    Captures messages and builds trace/span trees.
    Access via: fake_backend.trace_trees, fake_backend.span_trees
    """
```
**Example Structure**:

```python
def test_track__one_nested_function__happyflow(fake_backend):
    @opik.track
    def f_inner(x):
        return "inner-output"

    @opik.track
    def f_outer(x):
        f_inner("inner-input")
        return "outer-output"

    f_outer("outer-input")
    opik.flush_tracker()

    # Verify against expected tree structure
    EXPECTED_TRACE_TREE = TraceModel(
        id=ANY_BUT_NONE,
        name="f_outer",
        spans=[
            SpanModel(
                id=ANY_BUT_NONE,
                name="f_outer",
                spans=[
                    SpanModel(id=ANY_BUT_NONE, name="f_inner", spans=[]),
                ],
            )
        ],
    )

    assert_equal(EXPECTED_TRACE_TREE, fake_backend.trace_trees[0])
```
**What to Test**: anything that can be verified without a real backend — decorator behavior, client API objects, evaluation logic, message processing and batching.
## Library Integration Tests (`tests/library_integration/`)

**Purpose**: Test integrations with external libraries using the fake backend.

**Characteristics**: real calls to the external library/provider (API keys required), while Opik data is captured by the fake backend instead of a real Opik deployment.

**Directory Structure**:
```
library_integration/
├── openai/
│   ├── requirements.txt      # OpenAI-specific dependencies
│   ├── constants.py          # Test constants (models, etc.)
│   ├── test_openai_responses.py
│   └── test_openai_chat_completions.py
├── anthropic/
├── langchain/
└── ...
```
**Example Structure**:

```python
def test_openai_client_responses_create__happyflow(fake_backend):
    client = openai.OpenAI()
    wrapped_client = track_openai(client, project_name="test")

    # Real OpenAI API call
    response = wrapped_client.responses.create(
        model=MODEL_FOR_TESTS,
        input=[{"role": "user", "content": "Hello"}],
    )
    opik.flush_tracker()

    # Verify trace structure with fake backend
    assert len(fake_backend.trace_trees) == 1
    trace = fake_backend.trace_trees[0]
    assert trace.name == "responses_create"
    assert trace.spans[0].type == "llm"
    assert trace.spans[0].provider == "openai"
```
**What to Test**: that the integration produces the expected trace/span structure — span type, provider, model, usage, metadata, tags, and project name.

**Requirements Files**:

Each integration has its own `requirements.txt`:

```txt
# openai/requirements.txt
openai>=1.0.0
```

```txt
# langchain/requirements.txt
langchain>=0.1.0
langchain-openai>=0.1.0
```
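Because each integration's dependencies are optional, a test module can also guard itself when the package is not installed. A minimal sketch using pytest's standard mechanism (the suite may instead rely on CI installing the per-integration `requirements.txt`):

```python
import pytest

# Skips every test in this module when the optional dependency is absent.
openai = pytest.importorskip("openai")
```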
## E2E Tests (`tests/e2e/`)

**Purpose**: Test core functionality against a real Opik backend.

**Characteristics**: require a configured backend (see the environment variables section below); slower than unit and library integration tests.

**Key Fixtures**:
```python
@pytest.fixture()
def opik_client(configure_e2e_tests_env, shutdown_cached_client_after_test):
    """Real Opik client for E2E tests"""
    opik_client_ = opik.Opik(_use_batching=True)

    yield opik_client_

    opik_client_.end()


@pytest.fixture
def dataset_name(opik_client):
    """Generate unique dataset name"""
    name = f"e2e-tests-dataset-{random_chars()}"
    yield name
```
**Example Structure**:

```python
def test_trace_creation_and_retrieval(opik_client, temporary_project_name):
    # Create trace
    trace = opik_client.trace(
        name="test_trace",
        input={"query": "test"},
        project_name=temporary_project_name,
    )
    opik_client.flush()

    # Verify against real backend
    verify_trace(
        opik_client,
        trace_id=trace.id,
        name="test_trace",
        input={"query": "test"},
        project_name=temporary_project_name,
    )
```
**What to Test**: end-to-end flows against the real backend — tracing, datasets, prompts, and evaluation.

**Verifiers** (`verifiers.py`):

```python
def verify_trace(opik_client, trace_id, name, input, output, ...):
    """Wait for trace to appear in backend and verify fields"""
    if not synchronization.until(
        lambda: opik_client.get_trace_content(id=trace_id) is not None,
        allow_errors=True,
    ):
        raise AssertionError(f"Failed to get trace {trace_id}")

    trace = opik_client.get_trace_content(id=trace_id)
    assert trace.name == name
    assert trace.input == input
    # ... more assertions


def verify_span(opik_client, span_id, ...):
    """Similar verification for spans"""


def verify_experiment_items(opik_client, experiment_id, expected_items):
    """Verify experiment items match expected"""
```
## E2E Library Integration Tests (`tests/e2e_library_integration/`)

**Purpose**: Test library integrations against a real Opik backend.

**Characteristics**: real library/provider calls combined with a real Opik backend — the slowest and most expensive test category.

**Example Structure**:
```python
def test_litellm_chat_model_e2e(opik_client_unique_project_name):
    """Test LiteLLM integration with real backend"""
    from litellm import completion
    from opik.integrations.litellm import track_litellm

    track_litellm()

    # Real LiteLLM call (which calls real LLM provider)
    response = completion(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Hello"}],
    )
    opik.flush_tracker()

    # Verify in real backend
    traces = opik_client_unique_project_name.search_traces()
    assert len(traces) > 0
```
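Backend ingestion is asynchronous, so a bare `len(traces) > 0` check can be flaky. A polling helper in the spirit of the `synchronization.until` pattern used by the E2E verifiers makes the check robust; the `wait_for` helper below is a hypothetical sketch, not part of the suite:

```python
import time
from typing import Callable


def wait_for(predicate: Callable[[], bool], timeout: float = 10.0, interval: float = 0.5) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False


# Usage inside the test above:
# assert wait_for(lambda: len(opik_client_unique_project_name.search_traces()) > 0)
```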
**When to Use**: sparingly — when an integration must be verified end to end against a real backend rather than the fake one.
## Smoke Tests (`tests/e2e_smoke/`)

**Purpose**: Quick sanity checks that the SDK can be imported and basic operations work.

**Example**:
```python
# dry_run_import.py
import opik
import opik.evaluation.metrics as metrics

# Verify basic imports work
client = opik.Opik()
```
## Test Data Models (`testlib/models.py`)

Domain-specific models for test assertions:
```python
@dataclasses.dataclass
class SpanModel:
    """Represents expected span structure"""
    id: str
    name: Optional[str] = None
    input: Any = None
    output: Any = None
    type: str = "general"
    usage: Optional[Dict[str, Any]] = None
    spans: List["SpanModel"] = dataclasses.field(default_factory=list)
    # ... more fields


@dataclasses.dataclass
class TraceModel:
    """Represents expected trace structure"""
    id: str
    name: Optional[str]
    input: Any = None
    output: Any = None
    spans: List[SpanModel] = dataclasses.field(default_factory=list)
    # ... more fields


@dataclasses.dataclass
class FeedbackScoreModel:
    """Represents expected feedback score"""
    id: str
    name: str
    value: float
    reason: Optional[str] = None
```
## Backend Emulator (`testlib/backend_emulator_message_processor.py`)

**Purpose**: Emulate backend behavior for unit and library integration tests.

**Key Features**: records every processed message and builds trace and span trees from them; duplicate create/update messages can be merged (`merge_duplicates`).
```python
class BackendEmulatorMessageProcessor(BaseMessageProcessor):
    def __init__(self, merge_duplicates: bool = True):
        self.processed_messages: List[messages.BaseMessage] = []
        self._trace_trees: List[TraceModel] = []
        self._span_trees: List[SpanModel] = []
        # ... internal state

    @property
    def trace_trees(self) -> List[TraceModel]:
        """Build and return trace trees from processed messages"""

    @property
    def span_trees(self) -> List[SpanModel]:
        """Build and return span trees from processed messages"""

    def process(self, message: messages.BaseMessage) -> None:
        """Process message and update internal state"""
```
**Usage**:

```python
def test_example(fake_backend):
    # Execute code that creates traces/spans
    @opik.track
    def my_function():
        return "result"

    my_function()
    opik.flush_tracker()

    # Access built trees
    assert len(fake_backend.trace_trees) == 1
    assert fake_backend.trace_trees[0].name == "my_function"
```
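Besides the assembled trees, the emulator's raw `processed_messages` list is available for lower-level assertions. A short sketch:

```python
def test_example__messages_are_recorded(fake_backend):
    @opik.track
    def my_function():
        return "result"

    my_function()
    opik.flush_tracker()

    # Every message the SDK would have sent to the backend is recorded here.
    assert len(fake_backend.processed_messages) > 0
```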
## Flexible Matchers (`testlib/any_compare_helpers.py`)

Special matchers for flexible assertions:
```python
ANY = SpecialValue("ANY")                    # Matches anything
ANY_BUT_NONE = SpecialValue("ANY_BUT_NONE")  # Matches anything except None
ANY_STRING = StringMatcher()                 # String-specific matcher
ANY_DICT = DictMatcher()                     # Dict-specific matcher

# Usage
assert_equal(
    expected=TraceModel(
        id=ANY_BUT_NONE,          # Don't care about ID, but must exist
        name="test",
        start_time=ANY_BUT_NONE,  # Don't care about time, but must exist
        input={"key": "value"},   # Exact match
    ),
    actual=fake_backend.trace_trees[0],
)

# String matchers
ANY_STRING.starting_with("gpt-")
ANY_STRING.ending_with(".txt")
ANY_STRING.containing("test")
```
## Assertion Helpers (`testlib/assert_helpers.py`)

```python
def assert_equal(expected, actual):
    """
    Deep equality check with support for:
    - SpecialValue matchers (ANY, ANY_BUT_NONE)
    - Nested dataclasses
    - Lists and dicts
    - Provides detailed diff on mismatch
    """


def assert_dict_has_keys(dict_obj, required_keys):
    """Verify dict contains all required keys"""
```
## Root Fixtures (`tests/conftest.py`)

```python
@pytest.fixture(autouse=True)
def clear_context_storage():
    """Automatically clear context after each test"""
    yield
    context_storage.clear_all()


@pytest.fixture(autouse=True)
def shutdown_cached_client_after_test():
    """Clean up cached Opik client after each test"""
    yield
    if opik_client.get_client_cached.cache_info().currsize > 0:
        opik_client.get_client_cached().end()
        opik_client.get_client_cached.cache_clear()


@pytest.fixture
def fake_backend(patch_streamer):
    """Fake backend for unit/library integration tests"""
    streamer, fake_message_processor = patch_streamer
    # ... setup
    yield fake_message_processor
    # ... cleanup


@pytest.fixture
def patch_streamer():
    """Create streamer with fake backend"""
    fake_processor = BackendEmulatorMessageProcessor()
    fake_upload_manager = NoopFileUploadManager()
    streamer = streamer_constructors.construct_streamer(
        message_processor=fake_processor,
        n_consumers=1,
        use_batching=True,
        file_uploader=fake_upload_manager,
        max_queue_size=None,
    )
    yield streamer, fake_processor
    streamer.close(timeout=5)
```
## E2E Fixtures (`tests/e2e/conftest.py`)

```python
@pytest.fixture()
def opik_client(configure_e2e_tests_env):
    """Real Opik client with batching enabled"""
    client = opik.Opik(_use_batching=True)
    yield client
    client.end()


@pytest.fixture
def dataset_name(opik_client):
    """Generate unique dataset name for test"""
    name = f"e2e-tests-dataset-{random_chars()}"
    yield name


@pytest.fixture
def temporary_project_name(opik_client):
    """Create and cleanup temporary project"""
    name = f"e2e-tests-temporary-project-{random_chars()}"
    yield name
    # Cleanup
    project_id = opik_client.rest_client.projects.retrieve_project(name=name).id
    opik_client.rest_client.projects.delete_project_by_id(project_id)
```
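The `dataset_name` fixture exists so that dataset tests never collide across runs. A minimal sketch of how it might be used, assuming the client's `get_or_create_dataset`, `insert`, and `get_items` APIs:

```python
def test_dataset_insert__happyflow(opik_client, dataset_name):
    # dataset_name is unique per test run, so repeated runs do not collide.
    dataset = opik_client.get_or_create_dataset(name=dataset_name)
    dataset.insert([{"input": {"question": "What is Opik?"}}])

    # The backend may need a moment to ingest; a polling wait can be added if flaky.
    items = dataset.get_items()
    assert len(items) == 1
```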
## Library Integration Fixtures

```python
# tests/library_integration/conftest.py
@pytest.fixture(autouse=True)
def reset_tracing_to_config_default():
    """Reset tracing config between tests"""
    opik.reset_tracing_to_config_default()
    yield
    opik.reset_tracing_to_config_default()


# tests/library_integration/openai/conftest.py
@pytest.fixture
def ensure_openai_configured():
    """Verify OpenAI API key is configured"""
    if not os.getenv("OPENAI_API_KEY"):
        pytest.skip("OPENAI_API_KEY not configured")
```
## Example Tests

### Unit Test: Nested Tracked Functions

**Location**: `tests/unit/decorator/test_tracker_outputs.py`

```python
def test_track__one_nested_function__happyflow(fake_backend):
    """
    Test naming convention:
    test_WHAT__CASE_DESCRIPTION__EXPECTED_RESULT
    """
    @opik.track
    def f_inner(x):
        return "inner-output"

    @opik.track
    def f_outer(x):
        f_inner("inner-input")
        return "outer-output"

    f_outer("outer-input")
    opik.flush_tracker()  # Wait for async processing

    # Build expected tree structure
    EXPECTED_TRACE_TREE = TraceModel(
        id=ANY_BUT_NONE,
        name="f_outer",
        input={"x": "outer-input"},
        output={"output": "outer-output"},
        start_time=ANY_BUT_NONE,
        end_time=ANY_BUT_NONE,
        spans=[
            SpanModel(
                id=ANY_BUT_NONE,
                name="f_outer",
                input={"x": "outer-input"},
                output={"output": "outer-output"},
                spans=[
                    SpanModel(
                        id=ANY_BUT_NONE,
                        name="f_inner",
                        input={"x": "inner-input"},
                        output={"output": "inner-output"},
                        spans=[],
                    )
                ],
            )
        ],
    )

    assert len(fake_backend.trace_trees) == 1
    assert_equal(EXPECTED_TRACE_TREE, fake_backend.trace_trees[0])
```
### Library Integration Test: OpenAI Responses

**Location**: `tests/library_integration/openai/test_openai_responses.py`

```python
@pytest.mark.parametrize(
    "project_name, expected_project_name",
    [
        (None, OPIK_PROJECT_DEFAULT_NAME),
        ("custom-project", "custom-project"),
    ],
)
def test_openai_client_responses_create__happyflow(
    fake_backend, project_name, expected_project_name
):
    # Setup integration
    client = openai.OpenAI()
    wrapped_client = track_openai(client, project_name=project_name)

    # Real API call
    response = wrapped_client.responses.create(
        model=MODEL_FOR_TESTS,
        input=[{"role": "user", "content": "Tell a fact"}],
        max_output_tokens=50,
    )
    opik.flush_tracker()

    # Build expected structure
    EXPECTED_TRACE_TREE = TraceModel(
        id=ANY_BUT_NONE,
        name="responses_create",
        input={"input": ANY_BUT_NONE},
        output={"output": ANY_BUT_NONE, "reasoning": ANY},
        tags=["openai"],
        metadata=ANY_DICT,
        start_time=ANY_BUT_NONE,
        end_time=ANY_BUT_NONE,
        project_name=expected_project_name,
        spans=[
            SpanModel(
                id=ANY_BUT_NONE,
                type="llm",
                name="responses_create",
                provider="openai",
                model=ANY_STRING.starting_with(MODEL_FOR_TESTS),
                usage=ANY_BUT_NONE,
                metadata=ANY_DICT,
                tags=["openai"],
                start_time=ANY_BUT_NONE,
                end_time=ANY_BUT_NONE,
                spans=[],
            )
        ],
    )

    assert len(fake_backend.trace_trees) == 1
    assert_equal(EXPECTED_TRACE_TREE, fake_backend.trace_trees[0])

    # Optional: Verify specific metadata keys if needed
    assert_dict_has_keys(
        fake_backend.trace_trees[0].spans[0].metadata,
        ["created_from", "model"],
    )
```
### E2E Test: Trace with Spans

**Location**: `tests/e2e/test_tracing.py`

```python
def test_trace_creation_with_spans(opik_client, temporary_project_name):
    # Create trace
    trace = opik_client.trace(
        name="parent_trace",
        input={"query": "test"},
        project_name=temporary_project_name,
    )

    # Create spans
    span_1 = opik_client.span(
        name="span_1",
        trace_id=trace.id,
        input={"step": 1},
    )
    span_2 = opik_client.span(
        name="span_2",
        trace_id=trace.id,
        parent_span_id=span_1.id,
        input={"step": 2},
    )
    opik_client.flush()

    # Verify in backend
    verify_trace(
        opik_client,
        trace_id=trace.id,
        name="parent_trace",
        input={"query": "test"},
        project_name=temporary_project_name,
    )
    verify_span(
        opik_client,
        span_id=span_1.id,
        name="span_1",
        trace_id=trace.id,
        parent_span_id=None,
    )
    verify_span(
        opik_client,
        span_id=span_2.id,
        name="span_2",
        trace_id=trace.id,
        parent_span_id=span_1.id,
    )
```
### Unit Test: Exception Handling

```python
def test_track__function_raises_exception__error_info_captured(fake_backend):
    @opik.track
    def failing_function():
        raise ValueError("Test error")

    with pytest.raises(ValueError, match="Test error"):
        failing_function()

    opik.flush_tracker()

    # Build expected structure with error_info
    EXPECTED_TRACE_TREE = TraceModel(
        id=ANY_BUT_NONE,
        name="failing_function",
        start_time=ANY_BUT_NONE,
        end_time=ANY_BUT_NONE,
        spans=[
            SpanModel(
                id=ANY_BUT_NONE,
                name="failing_function",
                start_time=ANY_BUT_NONE,
                end_time=ANY_BUT_NONE,
                error_info={
                    "exception_type": "ValueError",
                    "message": ANY_STRING.containing("Test error"),
                    "traceback": ANY_BUT_NONE,
                },
                spans=[],
            )
        ],
    )

    assert len(fake_backend.trace_trees) == 1
    assert_equal(EXPECTED_TRACE_TREE, fake_backend.trace_trees[0])
```
### Library Integration Test: OpenAI Streaming

```python
def test_openai_streaming_response(fake_backend):
    client = openai.OpenAI()
    wrapped_client = track_openai(client)

    # Stream response
    stream = wrapped_client.chat.completions.create(
        model=MODEL_FOR_TESTS,
        messages=[{"role": "user", "content": "Count to 5"}],
        stream=True,
    )

    # Consume stream
    for chunk in stream:
        pass  # Consume all chunks

    opik.flush_tracker()

    # Verify accumulated data using models
    EXPECTED_TRACE_TREE = TraceModel(
        id=ANY_BUT_NONE,
        name="chat_completions_create",
        start_time=ANY_BUT_NONE,
        end_time=ANY_BUT_NONE,
        spans=[
            SpanModel(
                id=ANY_BUT_NONE,
                name="chat_completions_create",
                type="llm",
                provider="openai",
                model=ANY_STRING.starting_with(MODEL_FOR_TESTS),
                usage=ANY_BUT_NONE,   # Usage accumulated from chunks
                output=ANY_BUT_NONE,  # Output accumulated from chunks
                start_time=ANY_BUT_NONE,
                end_time=ANY_BUT_NONE,
                spans=[],
            )
        ],
    )

    assert len(fake_backend.trace_trees) == 1
    assert_equal(EXPECTED_TRACE_TREE, fake_backend.trace_trees[0])
```
### Unit Test: Evaluation Metric

```python
def test_hallucination_metric__happyflow():
    metric = Hallucination()

    result = metric.score(
        input="What is the capital of France?",
        output="Paris is the capital of France.",
        context=["Paris is the capital and largest city of France."],
    )

    assert isinstance(result, ScoreResult)
    assert 0 <= result.value <= 1
    assert result.name == "hallucination_metric"
    assert result.reason is not None
```
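Metric tests do not have to call an LLM. A custom metric built on the public `BaseMetric`/`ScoreResult` classes can be unit-tested in isolation; the `EqualsIgnoreCase` metric below is a hypothetical example, not part of the SDK:

```python
from opik.evaluation.metrics import base_metric, score_result


class EqualsIgnoreCase(base_metric.BaseMetric):
    """Hypothetical metric: scores 1.0 when output matches reference, ignoring case."""

    def score(self, output: str, reference: str, **ignored_kwargs) -> score_result.ScoreResult:
        value = 1.0 if output.lower() == reference.lower() else 0.0
        return score_result.ScoreResult(value=value, name=self.name)


def test_equals_ignore_case_metric__matching_strings__score_is_one():
    # track=False keeps the metric call itself from being traced in a unit test.
    metric = EqualsIgnoreCase(name="equals_ignore_case", track=False)

    result = metric.score(output="Paris", reference="paris")

    assert isinstance(result, score_result.ScoreResult)
    assert result.value == 1.0
    assert result.name == "equals_ignore_case"
```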
## Best Practices

### Test Naming

Follow the pattern: `test_WHAT__CASE_DESCRIPTION__EXPECTED_RESULT`

```python
# ✅ Good
def test_track__one_nested_function__happyflow(fake_backend): ...
def test_track__function_raises_exception__error_info_captured(fake_backend): ...
def test_evaluate__with_custom_metric__scores_computed_correctly(fake_backend): ...

# ❌ Bad
def test_tracking(): ...
def test_error(): ...
def test_evaluate(): ...
```
### Unit Test Pattern (fake backend)

```python
def test_my_feature(fake_backend):
    # 1. Execute code that creates traces/spans
    @opik.track
    def my_function(x):
        return x * 2

    result = my_function(5)
    opik.flush_tracker()  # Always flush!

    # 2. Build expected structure
    EXPECTED_TRACE_TREE = TraceModel(
        id=ANY_BUT_NONE,
        name="my_function",
        input={"x": 5},
        output={"output": 10},
        start_time=ANY_BUT_NONE,
        end_time=ANY_BUT_NONE,
        spans=[
            SpanModel(
                id=ANY_BUT_NONE,
                name="my_function",
                input={"x": 5},
                output={"output": 10},
                start_time=ANY_BUT_NONE,
                end_time=ANY_BUT_NONE,
                spans=[],
            )
        ],
    )

    # 3. Assert
    assert len(fake_backend.trace_trees) == 1
    assert_equal(EXPECTED_TRACE_TREE, fake_backend.trace_trees[0])
```
### E2E Test Pattern

```python
def test_my_e2e_feature(opik_client, temporary_project_name):
    # 1. Create resources
    trace = opik_client.trace(
        name="test_trace",
        project_name=temporary_project_name,
    )
    opik_client.flush()

    # 2. Verify using verifiers
    verify_trace(
        opik_client,
        trace_id=trace.id,
        name="test_trace",
        project_name=temporary_project_name,
    )
```
### Parametrized Tests

```python
@pytest.mark.parametrize(
    "input_value, expected_output",
    [
        (5, 10),
        (10, 20),
        (0, 0),
    ],
)
def test_double_function__various_inputs__correct_outputs(
    fake_backend, input_value, expected_output
):
    @opik.track
    def double(x):
        return x * 2

    result = double(input_value)
    opik.flush_tracker()

    assert len(fake_backend.trace_trees) == 1
    assert fake_backend.trace_trees[0].spans[0].output == {"output": expected_output}
```
## Adding a New Library Integration

Each integration should have:

- `requirements.txt` with integration dependencies
- `conftest.py` with integration-specific fixtures
- `constants.py` for test constants (models, etc.)

```txt
# library_integration/myintegration/requirements.txt
myintegration>=1.0.0
```

```python
# library_integration/myintegration/conftest.py
import os

import pytest


@pytest.fixture
def ensure_myintegration_configured():
    if not os.getenv("MYINTEGRATION_API_KEY"):
        pytest.skip("MYINTEGRATION_API_KEY not configured")
```

```python
# library_integration/myintegration/test_myintegration.py
def test_myintegration_basic(fake_backend, ensure_myintegration_configured):
    # Test implementation
    ...
```
## Running Tests

```bash
# All tests
pytest tests/

# Unit tests only (fast)
pytest tests/unit/

# Library integration tests
pytest tests/library_integration/

# E2E tests
pytest tests/e2e/

# Specific integration
pytest tests/library_integration/openai/
```
## Environment Variables

Some library integration and E2E tests require certain environment variables to be configured:

```bash
# Backend configuration
export OPIK_URL_OVERRIDE="http://localhost:5000"
export OPIK_API_KEY="your_api_key"

# LLM provider keys (for library integration tests)
export OPENAI_API_KEY="..."
export ANTHROPIC_API_KEY="..."
export GOOGLE_API_KEY="..."
```
## Additional Tips

- Use `ANY` and `ANY_BUT_NONE` matchers for non-critical fields
- Use `@pytest.mark.parametrize` to cover multiple cases with one test

For more information, see: