# Python Provider
The Python provider enables you to create custom evaluation logic using Python scripts. This allows you to integrate Promptfoo with any Python-based model, API, or custom logic.
:::tip Python Overview
For an overview of all Python integrations (providers, assertions, test generators, prompts), see the Python integration guide.
:::
Common use cases:

- Wrapping REST APIs or SDKs that promptfoo doesn't support natively
- Running local models (for example, Hugging Face pipelines)
- Adding custom preprocessing, guardrails, or retry logic around model calls
- Creating mock providers for testing evaluation pipelines

Before using the Python provider, ensure you have:

- Python 3 installed and available on your `PATH` (or configured via `PROMPTFOO_PYTHON`)
- promptfoo installed (`npx promptfoo@latest` works without a global install)
Let's create a simple Python provider that echoes back the input with a prefix.
```python
# echo_provider.py
def call_api(prompt, options, context):
    """Simple provider that echoes the prompt with a prefix."""
    config = options.get('config', {})
    prefix = config.get('prefix', 'Tell me about: ')
    return {
        "output": f"{prefix}{prompt}"
    }
```
```yaml
# promptfooconfig.yaml
providers:
  - id: 'file://echo_provider.py'

prompts:
  - 'Tell me a joke'
  - 'What is 2+2?'
```
```bash
npx promptfoo@latest eval
```
That's it! You've created your first custom Python provider.
Python providers use persistent worker processes. Your script is loaded once when the worker starts, not on every call. This makes subsequent calls much faster, especially for scripts with heavy imports like ML models.
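Because of this, put expensive setup at module scope so it runs once per worker. A minimal sketch (the `embeddings_index.json` file is hypothetical, standing in for any heavy resource):

```python
# heavy_provider.py
import json

# Module-level code runs once when the worker starts, not on every call.
with open('embeddings_index.json') as f:  # hypothetical large file
    INDEX = json.load(f)

def call_api(prompt, options, context):
    # Every call reuses the already-loaded index.
    match = INDEX.get(prompt, 'no match')
    return {"output": f"Nearest entry: {match}"}
```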
When Promptfoo evaluates a test case with a Python provider, it calls your function with three arguments:

1. `prompt`: The final prompt string
2. `options`: Provider configuration from your YAML
3. `context`: Variables and metadata for the current test

```
┌─────────────┐     ┌──────────────┐     ┌─────────────┐
│  Promptfoo  │────▶│ Your Python  │────▶│ Your Logic  │
│ Evaluation  │     │   Provider   │     │ (API/Model) │
└─────────────┘     └──────────────┘     └─────────────┘
       ▲                    │
       │                    ▼
       │            ┌──────────────┐
       └────────────│   Response   │
                    └──────────────┘
```
Your Python script must implement one or more of these functions. Both synchronous and asynchronous versions are supported:
Synchronous Functions:
```python
def call_api(prompt: str, options: dict, context: dict) -> dict:
    """Main function for text generation tasks."""
    pass

def call_embedding_api(prompt: str, options: dict, context: dict) -> dict:
    """For embedding generation tasks."""
    pass

def call_classification_api(prompt: str, options: dict, context: dict) -> dict:
    """For classification tasks."""
    pass
```
Asynchronous Functions:
```python
async def call_api(prompt: str, options: dict, context: dict) -> dict:
    """Async main function for text generation tasks."""
    pass

async def call_embedding_api(prompt: str, options: dict, context: dict) -> dict:
    """Async function for embedding generation tasks."""
    pass

async def call_classification_api(prompt: str, options: dict, context: dict) -> dict:
    """Async function for classification tasks."""
    pass
```
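Async functions are useful for non-blocking HTTP calls. A minimal sketch using `httpx` (the endpoint, payload shape, and `EXAMPLE_API_KEY` variable are illustrative assumptions, not part of promptfoo):

```python
import os
import httpx

async def call_api(prompt, options, context):
    """Minimal async provider sketch."""
    config = options.get('config', {})
    async with httpx.AsyncClient(timeout=30) as client:
        response = await client.post(
            config.get('endpoint', 'https://api.example.com/generate'),  # hypothetical endpoint
            json={'prompt': prompt},
            headers={'Authorization': f"Bearer {os.getenv('EXAMPLE_API_KEY', '')}"},  # hypothetical key
        )
        response.raise_for_status()
        data = response.json()
    return {"output": data.get("text", "")}  # response field name is an assumption
```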
### `prompt` Parameter

The prompt can be either:

- A simple string: `"What is the capital of France?"`
- A JSON-encoded conversation: `'[{"role": "user", "content": "Hello"}]'`

```python
import json

def call_api(prompt, options, context):
    # Check if prompt is a conversation
    try:
        messages = json.loads(prompt)
        # Handle as chat messages
        for msg in messages:
            print(f"{msg['role']}: {msg['content']}")
    except json.JSONDecodeError:
        # Handle as simple string
        print(f"Prompt: {prompt}")
```
### `options` Parameter

Contains your provider configuration and metadata:

```python
{
    "id": "file://my_provider.py",
    "config": {
        # Your custom configuration from promptfooconfig.yaml
        "model_name": "gpt-3.5-turbo",
        "temperature": 0.7,
        "max_tokens": 100,
        # Automatically added by promptfoo:
        "basePath": "/absolute/path/to/config"  # Directory containing your promptfooconfig.yaml
    }
}
```
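One common use for `basePath` is loading files relative to your config directory instead of the current working directory. A minimal sketch (the `system_prompt.txt` file is hypothetical):

```python
import os

def call_api(prompt, options, context):
    config = options.get('config', {})
    base_path = config.get('basePath', '.')
    # Resolve the file next to promptfooconfig.yaml, not the working directory.
    path = os.path.join(base_path, 'system_prompt.txt')  # hypothetical file
    with open(path) as f:
        system_prompt = f.read()
    return {"output": f"{system_prompt}\n{prompt}"}
```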
### `context` Parameter

Provides information about the current test case:

```python
{
    "vars": {
        "user_input": "Hello world",
        "system_prompt": "You are a helpful assistant"
    },
    "prompt": {
        "raw": "...",
        "label": "...",
    },
    "test": {
        "vars": { ... },
        "metadata": {
            "pluginId": "...",    # Redteam plugin (e.g. "promptfoo:redteam:harmful:hate")
            "strategyId": "...",  # Redteam strategy (e.g. "jailbreak", "prompt-injection")
        },
    },
}
```
For redteam evals, use `context['test']['metadata']['pluginId']` and `context['test']['metadata']['strategyId']` to identify which plugin and strategy generated the test case.
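For example, you might attach this metadata to your response so it shows up in results. A minimal sketch (`run_model` is a hypothetical helper):

```python
def call_api(prompt, options, context):
    metadata = (context.get('test') or {}).get('metadata') or {}
    plugin_id = metadata.get('pluginId', '')
    strategy_id = metadata.get('strategyId', '')

    output = run_model(prompt)  # hypothetical helper that calls your model
    return {
        "output": output,
        # Echo the redteam metadata back for easier analysis.
        "metadata": {"pluginId": plugin_id, "strategyId": strategy_id},
    }
```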
:::note
Non-serializable fields (`logger`, `getCache`, `filters`, `originalProvider`) are removed before passing context to Python. Additional fields like `evaluationId`, `testCaseId`, `testIdx`, `promptIdx`, and `repeatIndex` are also available.
:::
Your function must return a dictionary with these fields:
```python
def call_api(prompt, options, context):
    # Required field
    result = {
        "output": "Your response here"
    }

    # Optional fields
    result["tokenUsage"] = {
        "total": 150,
        "prompt": 50,
        "completion": 100
    }
    result["cost"] = 0.0025  # in dollars
    result["cached"] = False
    result["logProbs"] = [-0.5, -0.3, -0.1]
    result["latencyMs"] = 150  # custom latency in milliseconds
    result["conversationEnded"] = False
    result["conversationEndReason"] = "thread_closed"

    # Error handling
    if something_went_wrong:
        result["error"] = "Description of what went wrong"

    return result
```
The types passed into the Python script function and the `ProviderResponse` return type are defined as follows:
```python
from typing import Any, Dict, List, Optional, Union

class ProviderOptions:
    id: Optional[str]
    config: Optional[Dict[str, Any]]

class CallApiContextParams:
    vars: Dict[str, str]
    prompt: Optional[Dict[str, Any]]  # Prompt template (raw, label, config)
    test: Optional[Dict[str, Any]]    # Full test case including metadata

class TokenUsage:
    total: int
    prompt: int
    completion: int

class ProviderResponse:
    output: Optional[Union[str, Dict[str, Any]]]
    error: Optional[str]
    tokenUsage: Optional[TokenUsage]
    cost: Optional[float]
    cached: Optional[bool]
    logProbs: Optional[List[float]]
    latencyMs: Optional[int]  # overrides measured latency
    conversationEnded: Optional[bool]
    conversationEndReason: Optional[str]
    metadata: Optional[Dict[str, Any]]

class ProviderEmbeddingResponse:
    embedding: List[float]
    tokenUsage: Optional[TokenUsage]
    cached: Optional[bool]

class ProviderClassificationResponse:
    classification: Dict[str, Any]
    tokenUsage: Optional[TokenUsage]
    cached: Optional[bool]
```
:::tip
Always include the `output` field in your response, even if it's an empty string when an error occurs.
:::
For multi-turn red team strategies, return `conversationEnded: True` (with an optional `conversationEndReason`) when your target intentionally closes the active thread, so promptfoo stops probing gracefully instead of continuing into timeout/error turns.
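A minimal sketch, assuming your target's response includes some signal that it closed the thread (the `target_send` helper and `thread_closed` field are hypothetical):

```python
def call_api(prompt, options, context):
    reply = target_send(prompt)  # hypothetical helper that calls your target
    result = {"output": reply.get("text", "")}

    # Tell promptfoo to stop probing gracefully once the thread is closed.
    if reply.get("thread_closed"):  # hypothetical response field
        result["conversationEnded"] = True
        result["conversationEndReason"] = "thread_closed"
    return result
```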
Here's a complete provider that calls the OpenAI API:

```python
# openai_provider.py
import os
import json
from openai import OpenAI

def call_api(prompt, options, context):
    """Provider that calls OpenAI API."""
    config = options.get('config', {})

    # Initialize client
    client = OpenAI(
        api_key=os.getenv('OPENAI_API_KEY'),
        base_url=config.get('base_url', 'https://api.openai.com/v1')
    )

    # Parse messages if needed
    try:
        messages = json.loads(prompt)
    except json.JSONDecodeError:
        messages = [{"role": "user", "content": prompt}]

    # Make API call
    try:
        response = client.chat.completions.create(
            model=config.get('model', 'gpt-3.5-turbo'),
            messages=messages,
            temperature=config.get('temperature', 0.7),
            max_tokens=config.get('max_tokens', 150)
        )
        return {
            "output": response.choices[0].message.content,
            "tokenUsage": {
                "total": response.usage.total_tokens,
                "prompt": response.usage.prompt_tokens,
                "completion": response.usage.completion_tokens
            }
        }
    except Exception as e:
        return {
            "output": "",
            "error": str(e)
        }
```
A provider that runs a local Hugging Face model:

```python
# local_model_provider.py
from transformers import pipeline

# Initialize model once (runs when the worker starts)
generator = pipeline('text-generation', model='gpt2')

def preprocess_prompt(prompt, context):
    """Add context-specific preprocessing."""
    template = context['vars'].get('template', '{prompt}')
    return template.format(prompt=prompt)

def call_api(prompt, options, context):
    """Provider using a local Hugging Face model."""
    config = options.get('config', {})

    # Preprocess
    processed_prompt = preprocess_prompt(prompt, context)

    # Generate
    result = generator(
        processed_prompt,
        max_length=config.get('max_length', 100),
        temperature=config.get('temperature', 0.7),
        do_sample=True
    )

    return {
        "output": result[0]['generated_text'],
        "cached": False
    }
```
And a mock provider for testing evaluation pipelines:

```python
# mock_provider.py
import time
import random

def call_api(prompt, options, context):
    """Mock provider for testing evaluation pipelines."""
    config = options.get('config', {})

    # Simulate processing time
    delay = config.get('delay', 0.1)
    time.sleep(delay)

    # Simulate different response types
    if "error" in prompt.lower():
        return {
            "output": "",
            "error": "Simulated error for testing"
        }

    # Generate mock response
    responses = config.get('responses', [
        "This is a mock response.",
        "Mock provider is working correctly.",
        "Test response generated successfully."
    ])
    response = random.choice(responses)
    mock_tokens = len(prompt.split()) + len(response.split())

    return {
        "output": response,
        "tokenUsage": {
            "total": mock_tokens,
            "prompt": len(prompt.split()),
            "completion": len(response.split())
        },
        "cost": mock_tokens * 0.00001
    }
```
Configure your provider in `promptfooconfig.yaml`:

```yaml
providers:
  - id: 'file://my_provider.py'
    label: 'My Custom Provider' # Optional display name
    config:
      # Any configuration your provider needs
      api_key: '{{ env.CUSTOM_API_KEY }}'
      endpoint: 'https://api.example.com'
      model_params:
        temperature: 0.7
        max_tokens: 100
```
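Inside your provider, these values arrive under `options['config']`. A minimal sketch of reading the nested values above:

```python
def call_api(prompt, options, context):
    config = options.get('config', {})
    endpoint = config.get('endpoint')
    params = config.get('model_params', {})
    temperature = params.get('temperature', 0.7)
    # ...use endpoint and temperature when calling your API...
    return {"output": f"Would call {endpoint} at temperature {temperature}"}
```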
:::info Promptfoo Cloud Feature
Available in Promptfoo Cloud deployments.
:::
Link your local provider configuration to a cloud target using `linkedTargetId`:
```yaml
providers:
  - id: 'file://my_provider.py'
    config:
      linkedTargetId: 'promptfoo://provider/12345678-1234-1234-1234-123456789abc'
```
See Linking Local Targets to Cloud for setup instructions.
You can load configuration from external files:
```yaml
providers:
  - id: 'file://my_provider.py'
    config:
      # Load entire config from JSON
      settings: file://config/model_settings.json

      # Load from YAML
      prompts: file://config/prompts.yaml

      # Load from a Python function
      preprocessing: file://config/preprocess.py:get_config

      # Nested file references
      models:
        primary: file://config/primary_model.json
        fallback: file://config/fallback_model.yaml
```
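A Python config file referenced with a function name, like `file://config/preprocess.py:get_config` above, should export that function and return the config value. A minimal sketch (the settings themselves are illustrative):

```python
# config/preprocess.py
def get_config():
    # Return any JSON-serializable value to use as the config entry.
    return {
        "strip_whitespace": True,  # hypothetical setting
        "max_input_chars": 4000,   # hypothetical setting
    }
```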
Supported formats:

- JSON (`.json`) - Parsed as objects/arrays
- YAML (`.yaml`, `.yml`) - Parsed as objects/arrays
- Text (`.txt`, `.md`) - Loaded as strings
- Python (`.py`) - Must export a function returning config
- JavaScript (`.js`, `.mjs`) - Must export a function returning config

Python providers use persistent worker processes that stay alive between calls, making subsequent calls faster.
Control the number of workers per provider:
```yaml
providers:
  # Default: 1 worker
  - id: file://my_provider.py

  # Multiple workers for parallel execution
  - id: file://api_wrapper.py
    config:
      workers: 4
```
Or set globally:
```bash
export PROMPTFOO_PYTHON_WORKERS=4
```
When to use 1 worker (default):

- Providers that keep global state, such as sessions or conversation history
- Multi-turn flows (common in red team evaluations) where every request must hit the same process

When to use multiple workers:

- Stateless providers, especially I/O-bound API wrappers that benefit from parallel execution
- High-volume evaluations where throughput matters

Note that global state is not shared across workers. If your script uses global variables for session management (common in conversational flows like red team evaluations), use `workers: 1` to ensure all requests hit the same worker, as shown in the sketch below.
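For instance, a provider that accumulates conversation history in a module-level dictionary only works when every request lands in the same worker; a minimal sketch (the `sessionId` var is illustrative):

```python
# stateful_provider.py
# Module-level state lives inside one worker process; use workers: 1.
SESSIONS = {}

def call_api(prompt, options, context):
    session_id = context.get('vars', {}).get('sessionId', 'default')  # hypothetical var
    history = SESSIONS.setdefault(session_id, [])
    history.append(prompt)
    # The turn count only increments correctly if all calls share this worker.
    return {"output": f"Turn {len(history)}: {prompt}"}
```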
Default timeout is 5 minutes (300 seconds). Increase if needed:
```yaml
providers:
  - id: file://slow_model.py
    config:
      timeout: 600000 # 10 minutes, in milliseconds
```
Or set globally for all providers:
```bash
export REQUEST_TIMEOUT_MS=600000 # 10 minutes
```
You can specify a custom Python executable in several ways:
Option 1: Per-provider configuration
```yaml
providers:
  - id: 'file://my_provider.py'
    config:
      pythonExecutable: /path/to/venv/bin/python
```
Option 2: Global environment variable
```bash
# Use specific Python version globally
export PROMPTFOO_PYTHON=/usr/bin/python3.11
npx promptfoo@latest eval
```
Promptfoo automatically detects your Python installation in this priority order:

1. `PROMPTFOO_PYTHON` environment variable (if set)
2. `pythonExecutable` in your provider config
3. `where python`, filtering out Microsoft Store stubs (Windows only)
4. `python -c "import sys; print(sys.executable)"` to find the actual Python path
5. Common fallbacks: `python`, `python3`, `py -3`, `py` on Windows; `python3`, `python` elsewhere

This enhanced detection is especially helpful on Windows, where the Python launcher (`py.exe`) might not be available.
```bash
# Use specific Python version
export PROMPTFOO_PYTHON=/usr/bin/python3.11

# Add custom module paths
export PYTHONPATH=/path/to/my/modules:$PYTHONPATH

# Enable Python debugging with pdb
export PROMPTFOO_PYTHON_DEBUG_ENABLED=true

# Run evaluation
npx promptfoo@latest eval
```
Override the default function name:
```yaml
providers:
  - id: 'file://my_provider.py:generate_response'
    config:
      model: 'custom-model'
```

```python
# my_provider.py
def generate_response(prompt, options, context):
    # Your custom function
    return {"output": "Custom response"}
```
To support multiple prompt formats in a single provider:

```python
import json

def call_api(prompt, options, context):
    """Handle various prompt formats."""
    try:
        # Try parsing as JSON
        data = json.loads(prompt)
        if isinstance(data, list):
            # Chat format
            return handle_chat(data, options)
        if isinstance(data, dict):
            # Structured prompt
            return handle_structured(data, options)
    except json.JSONDecodeError:
        pass
    # Plain text (handle_chat, handle_structured, and handle_text are your own helpers)
    return handle_text(prompt, options)
```
You can also implement safety guardrails around your model:

```python
def call_api(prompt, options, context):
    """Provider with safety guardrails."""
    config = options.get('config', {})

    # Check for prohibited content
    prohibited_terms = config.get('prohibited_terms', [])
    for term in prohibited_terms:
        if term.lower() in prompt.lower():
            return {
                "output": "I cannot process this request.",
                "guardrails": {
                    "flagged": True,
                    "reason": "Prohibited content detected"
                }
            }

    # Process normally (generate_response and check_output_safety are your own helpers)
    result = generate_response(prompt)

    # Post-process checks
    if check_output_safety(result):
        return {"output": result}
    return {
        "output": "[Content filtered]",
        "guardrails": {"flagged": True}
    }
```
Python providers automatically emit OpenTelemetry spans when tracing is enabled. This provides visibility into Python provider execution as part of your evaluation traces.
Requirements:

```bash
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-http
```
Enable tracing:

```yaml
tracing:
  enabled: true
  otlp:
    http:
      enabled: true
```
Install the Python OpenTelemetry packages and enable the wrapper instrumentation:
```bash
export PROMPTFOO_ENABLE_OTEL=true
```
When wrapper OTEL instrumentation is enabled, the Python provider wrapper:

- Emits a span for each provider call
- Records token usage attributes when you include `tokenUsage` in your response

The spans follow GenAI semantic conventions, with attributes like `gen_ai.request.model`, `gen_ai.usage.input_tokens`, and `gen_ai.usage.output_tokens`.
This span covers the provider call itself. If you need internal workflow telemetry for tools, agents, or handoffs, create custom child spans or export framework-native traces into Promptfoo. See the OpenAI Agents Python SDK guide for a full example that makes `trajectory:*` assertions work with the Python `openai-agents` SDK.
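For example, you can wrap an internal step in a child span with the standard OpenTelemetry API; a minimal sketch (the span name, attribute, and `retrieve`/`generate_answer` helpers are illustrative):

```python
from opentelemetry import trace

tracer = trace.get_tracer("my_provider")

def call_api(prompt, options, context):
    # With PROMPTFOO_ENABLE_OTEL set, this nests under the wrapper's provider span.
    with tracer.start_as_current_span("retrieve_documents") as span:
        docs = retrieve(prompt)  # hypothetical retrieval helper
        span.set_attribute("docs.count", len(docs))
    return {"output": generate_answer(prompt, docs)}  # hypothetical helper
```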
When calling external APIs, implement retry logic in your script to handle rate limits and transient failures:
```python
import time
import requests

def call_api(prompt, options, context):
    """Provider with retry logic for external API calls."""
    config = options.get('config', {})
    max_retries = config.get('max_retries', 3)

    for attempt in range(max_retries):
        try:
            response = requests.post(
                config['api_url'],
                json={'prompt': prompt},
                timeout=30
            )

            # Handle rate limits
            if response.status_code == 429:
                wait_time = int(response.headers.get('Retry-After', 2 ** attempt))
                time.sleep(wait_time)
                continue

            response.raise_for_status()
            return response.json()

        except requests.exceptions.RequestException as e:
            if attempt == max_retries - 1:
                return {"output": "", "error": f"Failed after {max_retries} attempts: {str(e)}"}
            time.sleep(2 ** attempt)  # Exponential backoff

    # All attempts were rate limited
    return {"output": "", "error": f"Rate limited after {max_retries} attempts"}
```
Custom providers handle multimodal content the same way whether the media comes from a standard eval or a red team strategy: read the media variable from context['vars'] and translate it into the target API's expected payload shape.
For standard evals, provide the media value through tests[].vars, defaultTest.vars, a dataset column, or a dynamic variable:
```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
providers:
  - id: file://multimodal_provider.py

prompts:
  - '{{image}} {{question}}'

tests:
  - vars:
      image: 'data:image/png;base64,iVBORw0KGgo...'
      question: Describe this image.
```
In this case, `context['vars']['image']` contains the configured value. It may be raw base64, a `data:` URL, an external URL, or another representation your provider knows how to forward.
For red team runs, image, audio, and video strategies generate media and store it in the template variable named by `redteam.injectVar`. The rendered prompt also contains the media value, but `context['vars']` is safer because it preserves variable boundaries and avoids parsing a very long prompt.
| Red team strategy | `context['vars'][inject_var]` | Extra context | Forwarding notes |
| --- | --- | --- | --- |
| `image` | Raw PNG base64, no `data:` prefix | `context['vars']['image_text']`, `context['test']['metadata']['originalText']` | Wrap as `data:image/png;base64,...` for APIs that expect data URLs. |
| `audio` | Raw MP3 base64 from remote generation, no `data:` prefix | `context['test']['metadata']['originalText']` | Requires remote generation. Forward with MIME type `audio/mpeg` or your provider's equivalent audio format. |
| `video` | Raw MP4 base64 when local FFmpeg generation succeeds | `context['vars']['video_text']`, `context['test']['metadata']['originalText']` | Install FFmpeg and set `PROMPTFOO_DISABLE_REMOTE_GENERATION=true` or `PROMPTFOO_DISABLE_REDTEAM_REMOTE_GENERATION=true` for real MP4 bytes. If generation falls back, the value may decode to the original text instead of an MP4. |
Audio and video have opposite generation requirements today: audio requires remote generation, while real MP4 video requires the local FFmpeg path. Run separate scans if you need to verify both remote audio and local MP4 handling.
Here's a provider that forwards an image and a question to the OpenAI chat completions API:

```python
import os
import requests

def call_api(prompt, options, context):
    api_key = os.environ.get('OPENAI_API_KEY')
    if not api_key:
        return {'error': 'OPENAI_API_KEY is required'}

    image_base64 = context['vars'].get('image', '')
    question = context['vars'].get('question', 'Describe this image')

    # Red team image runs provide raw PNG base64. Eval vars may already provide a URL.
    image_url = (
        image_base64
        if image_base64.startswith(('data:', 'http://', 'https://'))
        else f'data:image/png;base64,{image_base64}'
    )

    response = requests.post(
        'https://api.openai.com/v1/chat/completions',
        headers={'Authorization': f'Bearer {api_key}'},
        json={
            'model': 'gpt-5',
            'messages': [{
                'role': 'user',
                'content': [
                    {'type': 'image_url', 'image_url': {'url': image_url}},
                    {'type': 'text', 'text': question},
                ],
            }],
        },
    )
    if not response.ok:
        return {'error': f'OpenAI API error {response.status_code}: {response.text}'}

    result = response.json()
    output = result.get('choices', [{}])[0].get('message', {}).get('content')
    if output:
        return {'output': output}
    return {'error': f'OpenAI API returned no output: {result}'}
```
For red team runs, set `redteam.injectVar` to the same template variable:
```yaml
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
providers:
  - id: file://multimodal_provider.py

prompts:
  - '{{image}} {{question}}'

defaultTest:
  vars:
    question: Describe this image.

redteam:
  purpose: A vision assistant that answers questions about images.
  injectVar: image
  plugins:
    - harmful:hate
  strategies:
    - image
    - id: basic
      config:
        enabled: false
```
:::note
`injectVar` defaults to the last template variable in your prompt. With `{{image}} {{question}}`, it defaults to `question`, not `image`. Always set `injectVar` explicitly when using media strategies.
:::
Static variables and dataset-driven media may already provide a `data:` URL or a different MIME type, so check the value before prepending `data:image/png;base64,`. Avoid logging full media strings; screenshots, audio, and video can be large or sensitive. For debugging, log the length, detected MIME type, a hash, or the first few bytes after decoding instead of the full base64 payload.
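A minimal sketch of that kind of media-safe debug logging (the `forward_to_target` helper is hypothetical):

```python
import hashlib
import sys

def describe_media(value: str) -> str:
    """Log-safe fingerprint of a potentially huge media string."""
    kind = 'url' if value.startswith(('data:', 'http://', 'https://')) else 'base64?'
    digest = hashlib.sha256(value.encode()).hexdigest()[:12]
    return f"kind={kind} len={len(value)} sha256={digest}"

def call_api(prompt, options, context):
    image = context['vars'].get('image', '')
    # Log a compact fingerprint instead of the full payload.
    print(f"image var: {describe_media(image)}", file=sys.stderr)
    return forward_to_target(image, prompt)  # hypothetical helper
```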
See the multimodal red team guide and JavaScript provider multimodal docs for more examples.
Common issues and solutions:

| Issue | Solution |
| --- | --- |
| `spawn py -3 ENOENT` errors | Set the `PROMPTFOO_PYTHON` env var or use `pythonExecutable` in config |
| "Python 3 not found" errors | Ensure the `python` command works or set `PROMPTFOO_PYTHON` |
| "Module not found" errors | Set `PYTHONPATH` or use `pythonExecutable` for virtual environments |
| Script not executing | Check that the file path is relative to `promptfooconfig.yaml` |
| No output visible | Use `LOG_LEVEL=debug` to see print statements |
| JSON parsing errors | Ensure the prompt format matches your parsing logic |
| Timeout errors | Optimize initialization code; load models once |
Enable debug logging:
```bash
LOG_LEVEL=debug npx promptfoo@latest eval
```
Add logging to your provider:
```python
import sys

def call_api(prompt, options, context):
    print(f"Received prompt: {prompt}", file=sys.stderr)
    print(f"Config: {options.get('config', {})}", file=sys.stderr)
    # Your logic here
```
Test your provider standalone:
```python
# test_provider.py
from my_provider import call_api

result = call_api(
    "Test prompt",
    {"config": {"model": "test"}},
    {"vars": {}}
)
print(result)
```
Use Python debugger (pdb) for interactive debugging:
```bash
export PROMPTFOO_PYTHON_DEBUG_ENABLED=true
```
With this environment variable set, you can use `import pdb; pdb.set_trace()` in your Python code to set breakpoints:
```python
def call_api(prompt, options, context):
    import pdb; pdb.set_trace()  # Execution will pause here
    # Your provider logic
    return {"output": result}
```
This allows interactive debugging directly in your terminal during evaluation runs.
If you're currently using an HTTP provider, you can wrap your API calls:
```python
# http_wrapper.py
import requests

def call_api(prompt, options, context):
    config = options.get('config', {})
    response = requests.post(
        config.get('url'),
        json={"prompt": prompt},
        headers=config.get('headers', {})
    )
    # Assumes the API already returns a provider-shaped response like {"output": "..."}
    return response.json()
```
The Python provider follows the same interface as JavaScript providers:
```javascript
// JavaScript
module.exports = {
  async callApi(prompt, options, context) {
    return { output: `Echo: ${prompt}` };
  },
};
```

```python
# Python equivalent
def call_api(prompt, options, context):
    return {"output": f"Echo: {prompt}"}
```