site/docs/guides/evaluate-google-adk.md
Use Google ADK's Python SDK with Promptfoo by wrapping your app as a Python provider. That keeps the ADK runtime in process, so Promptfoo can inspect the same sessions, artifacts, and native OpenTelemetry spans that the agent produced.
:::note This guide targets stable ADK 1.x. Google's public docs also advertise ADK Python 2.0 beta releases, but those releases have breaking API and session-schema changes. Validate a 2.0 integration separately before moving production evals onto it. :::
npx promptfoo@latest init --example integration-google-adk
cd integration-google-adk
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
export GOOGLE_API_KEY=your_google_api_key_here
npx promptfoo@latest eval -c promptfooconfig.yaml --no-cache
npx promptfoo@latest eval -c promptfooconfig.workflow.yaml --no-cache
npx promptfoo@latest view
The example defaults to gemini-2.5-flash. If you want to use another ADK-supported model, set ADK_MODEL. Provider-style model strings such as openai/gpt-5.4-mini require the optional ADK extensions:
pip install 'google-adk[extensions]>=1.32.0,<2'
export ADK_MODEL=openai/gpt-5.4-mini
If Promptfoo runs outside the activated virtual environment, set the interpreter explicitly:
PROMPTFOO_PYTHON=.venv/bin/python npx promptfoo@latest eval -c promptfooconfig.yaml --no-cache
promptfooconfig.yaml evaluates one conversational task across three user turns against a single Agent:
root_agent = Agent(
name="weather_agent",
model=model,
instruction="...travel weather assistant. Call get_weather... Call save_trip_note...",
tools=[get_weather, save_trip_note],
before_agent_callback=before_agent_callback,
)
app = App(name=APP_NAME, root_agent=root_agent, plugins=[audit_plugin])
The runtime wires up an InMemorySessionService, an InMemoryArtifactService, an app-wide BasePlugin (AuditPlugin), and the native ADK OpenTelemetry exporter. promptfooconfig.workflow.yaml evaluates a SequentialAgent flow with two child agents and asserts that the weather lookup runs before the briefing.
An in-process Python provider lets one Promptfoo row drive a multi-turn task, read session state, load artifacts, observe plugin and callback side effects, and assert against ADK's native trajectory spans — all without going over the wire. The HTTP shape around adk api_server cannot expose any of those without a parallel inspection channel.
Pick the HTTP provider when the deployed HTTP contract itself is what you want to validate (auth, request shape, status codes). Pick the Python provider for everything else.
ADK 1.x already emits OpenTelemetry spans such as:
invocationinvoke_agent weather_agentcall_llmexecute_tool get_weatherThe example provider preserves Promptfoo's W3C traceparent, starts a small provider span beneath it, and exports ADK's child spans to Promptfoo's built-in OTLP receiver. ADK already records gen_ai.tool.name and tool-call arguments, so Promptfoo can normalize them into trajectory steps without a custom span translator.
assert:
- type: trajectory:tool-used
value:
- get_weather
- save_trip_note
- type: trajectory:tool-args-match
value:
name: get_weather
args:
city: London
mode: partial
- type: trajectory:tool-sequence
value:
steps:
- get_weather
- save_trip_note
Use trace-span-count and trace-error-spans alongside trajectory assertions when you also want to prove that the ADK runtime emitted the expected framework spans and stayed error-free.
After the eval, inspect the row in the Trace Timeline. The bundled conversational run should show:
invoke_agent weather_agent once per user turncall_llm spans around the model hopsexecute_tool get_weatherexecute_tool save_trip_notegen_ai.tool.name on tool spansgcp.vertex.agent.tool_call_argsState is mutated through ToolContext. Artifacts go through the same context but a different method:
async def save_trip_note(city: str, summary: str, tool_context: ToolContext):
artifact = types.Part.from_bytes(
data=f"# Trip note for {city}\n\n{summary}\n".encode("utf-8"),
mime_type="text/markdown",
)
await tool_context.save_artifact(filename=f"{city}-trip-note.md", artifact=artifact)
tool_context.state["last_saved_artifact"] = f"{city}-trip-note.md"
The provider drains both back out and returns them as one JSON payload. Deterministic contains and is-json assertions can then check internal effects without a model grader:
{
"artifact_names": ["london-trip-note.md"],
"plugin_events": ["before_run:...", "after_run:..."],
"session_state": {
"callback_invocations": 3,
"last_city": "London",
"last_saved_artifact": "london-trip-note.md"
}
}
One eval row now covers the assistant's reply and the agent's internal bookkeeping.
When the order of work is fixed, encode it as a workflow agent instead of relying on the LLM to route. The bundled example chains a lookup agent into a briefing agent:
weather_lookup_agent = Agent(
name="weather_lookup_agent",
model=model,
tools=[get_weather],
output_key="weather_snapshot", # writes the final response to session state
)
briefing_agent = Agent(name="briefing_agent", model=model, instruction="Use weather_snapshot...")
workflow = SequentialAgent(
name="trip_planning_workflow",
sub_agents=[weather_lookup_agent, briefing_agent],
)
The provider pattern is unchanged — _run_workflow_provider still calls runner.run_async. Swap SequentialAgent for ParallelAgent, LoopAgent, a custom BaseAgent, or any tree that uses sub_agents / AgentTool. Add a trace-span-count for each named workflow agent, and keep trajectory:tool-used focused on the tool calls that must happen.
The same provider shape covers the rest of the stable ADK 1.x surface. Map each ADK feature to the assertion type that proves it works:
| ADK feature | Assertion to add |
|---|---|
output_schema or output_key | is-json with a JSON schema in value, plus contains on returned state |
MemoryService | paired test cases that share a session id vs use distinct ones |
| MCP, OpenAPI, or authenticated tools | trajectory:tool-used, trajectory:tool-args-match, is-refusal |
| code executors | trace-span-count on execute_tool * plus safety contains / regex |
| multi-agent trees | one trace-span-count per invoke_agent <name> you require |
| long-running / resumable apps | contains on session_state snapshots before and after resume |
ADK ships its own adk eval stack — use it for ADK-native eval sets and ADK-specific metrics. Promptfoo is the better fit when one harness has to compare ADK against other frameworks, run red teams against the same surface, or assert on OpenTelemetry traces alongside the final output.
SessionService, ArtifactService, or MemoryService when persistence is part of the behavior under test.google-adk[extensions] set adds hundreds of MB of LiteLLM and provider SDKs. Install it only when you need provider-prefixed model strings (openai/..., anthropic/...), and expect upstream warnings unrelated to your eval.