benchmark/tau2/llm/README.md
This directory contains the OpenViking TAU-2 LLM benchmark entry point. The reproduction surface is intentionally narrow:
- no_memory: same-seed TAU-2 baseline without OpenViking memory injection;
- template_indexed_trajectory_top4_prewrite_top2: the current best template-indexed trajectory memory treatment.

The template-indexed trajectory treatment trains OpenViking Memory V2 from
TAU-2 train conversations, retrieves generated trajectories, and uses the
trajectory embedding template {{ trajectory_name }}\n\n{{ retrieval_anchor }}
instead of broad procedure bodies for retrieval. It injects trajectory top4 at
the first user turn and top2 before write-like tool calls, with the generic
memory scope prompt enabled.
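
As a rough illustration of the retrieval text and injection schedule described above, the sketch below fills the trajectory embedding template and decides how many retrieved trajectories to inject at each point. It is not the runner's code: the function names, the event labels, and the WRITE_LIKE_TOOLS set are hypothetical, and the \n\n in the template is assumed to mean a blank line between the two fields.

```python
from typing import List, Optional

# Illustrative sketch only; every name here is hypothetical, not the runner's API.
TEMPLATE = "{{ trajectory_name }}\n\n{{ retrieval_anchor }}"

# Assumed examples of "write-like" tools; the real classification lives in the benchmark code.
WRITE_LIKE_TOOLS = {"cancel_pending_order", "update_reservation_flights"}


def render_embedding_text(trajectory_name: str, retrieval_anchor: str) -> str:
    """Fill the two template slots; only this short text is embedded for retrieval."""
    return (
        TEMPLATE
        .replace("{{ trajectory_name }}", trajectory_name)
        .replace("{{ retrieval_anchor }}", retrieval_anchor)
    )


def injection_top_k(event: str, tool_name: Optional[str] = None) -> int:
    """Return how many trajectories to inject at a given point in the dialogue."""
    if event == "first_user_turn":
        return 4  # trajectory top4 at the first user turn
    if event == "before_tool_call" and tool_name in WRITE_LIKE_TOOLS:
        return 2  # top2 before write-like tool calls
    return 0  # no injection anywhere else


def select_for_injection(ranked: List[str], event: str, tool_name: Optional[str] = None) -> List[str]:
    """Take the top-k entries from an already ranked retrieval result."""
    return ranked[: injection_top_k(event, tool_name)]
```

Keeping the embedded text to a name plus an anchor, rather than a full procedure body, is what the treatment name "template-indexed" refers to in this sketch.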
Category rerank, experience-memory routes, fixed-count-only ablations, character-budget ablations, and official-user parity controls are intentionally left out of this README and config set so reproduction agents do not mistake diagnostic routes for current evidence.
benchmark/tau2/llm/
├── config/
│ ├── baseline.yaml
│ ├── fixed_first_user_bootstrap.yaml
│ ├── no_memory.yaml
│ ├── scope_prompts/
│ │ └── generic_memory_scope.md
│ └── template_indexed_trajectory.yaml
├── scripts/
│ ├── build_fixed_first_user_fixture.py
│ ├── run_eval.py
│ ├── setup_tau2_repo.sh
│ └── tau2_common.py
└── run_full_eval.sh
baseline.yaml is a shared protocol/defaults file, not a runnable evidence
cell by itself. Use no_memory.yaml for the baseline-only run and
template_indexed_trajectory.yaml for the paired no-memory + trajectory run.
Generated eval artifacts are written to benchmark/tau2/llm/result/<run_id>/.
Memory corpus artifacts are cached outside the run id at
benchmark/tau2/llm/result/memory_corpora/ by default.
This benchmark delegates task simulation and scoring to an external TAU-2 checkout. Point the runner at that checkout and CLI explicitly when they are not on the default path:
export TAU2_REPO=/path/to/tau2-bench
export TAU2_CLI=/path/to/tau2
For a local one-command setup, clone and install TAU-2 into ignored benchmark directories:
benchmark/tau2/llm/scripts/setup_tau2_repo.sh
source benchmark/tau2/llm/.env.tau2
The default OpenViking TAU-2 memory evidence protocol is
fixed_first_user_full8: retail + airline, 8 repeats, same seeds,
confirmation-aware user simulator, and fixed first-user fixtures for both
domains. Later user simulator turns remain live.
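
To make the size of a full8 run concrete, the sketch below enumerates the cells such a protocol would produce. The cell naming and the memory_cells path follow the pattern visible in the bootstrap result paths in this README; treating each repeat as its own cell, and the plan_cells helper itself, are assumptions rather than the runner's actual planner.

```python
from itertools import product

RESULT_ROOT = "benchmark/tau2/llm/result"


def cell_id(run_id: str, domain: str, strategy_id: str, repeat: int) -> str:
    """Cell name pattern inferred from the bootstrap paths in this README."""
    return f"{run_id}_{domain}_{strategy_id}_r{repeat}"


def plan_cells(run_id, domains, strategy_ids, repeat_count):
    """Enumerate every (domain, strategy, repeat) cell; repeats are 1-based."""
    cells = []
    for domain, strategy_id, repeat in product(domains, strategy_ids, range(1, repeat_count + 1)):
        cid = cell_id(run_id, domain, strategy_id, repeat)
        cells.append({
            "cell_id": cid,
            "domain": domain,
            "strategy_id": strategy_id,
            "repeat": repeat,
            # Raw TAU-2 result JSON, following the layout shown for bootstrap runs.
            "results_json": f"{RESULT_ROOT}/{run_id}/memory_cells/{cid}/{cid}.json",
        })
    return cells


full8 = plan_cells(
    "template_indexed_trajectory_full8",
    ["retail", "airline"],
    ["no_memory", "template_indexed_trajectory_top4_prewrite_top2"],
    8,
)
print(len(full8))  # 2 domains x 2 strategies x 8 repeats = 32
```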
The confirmation-aware simulator behavior is available from sierra-research/tau2-bench#297. Pin the local TAU-2 checkout to a ref that includes that behavior when reproducing these numbers:
benchmark/tau2/llm/scripts/setup_tau2_repo.sh \
--ref refs/pull/297/head
source benchmark/tau2/llm/.env.tau2
When using Doubao through an OpenAI-compatible endpoint, set OPENAI_API_KEY
and OPENAI_API_BASE for LiteLLM before running upstream TAU-2.
Strict reproduction requires fixed first-user fixtures:
export TAU2_RETAIL_FIXED_FIRST_USER_FILE=/path/to/retail/fixed_first_user_fixture.json
export TAU2_AIRLINE_FIXED_FIRST_USER_FILE=/path/to/airline/fixed_first_user_fixture.json
--strict-preflight fails when eval.require_fixed_first_user=true and either
fixture is missing.
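
The check described above can be pictured roughly as follows. This is a minimal sketch, assuming the config is already loaded into a dict and that "missing" means the variable is unset or does not point at a file; the helper names are hypothetical and the real preflight may check more than this.

```python
import os

# Environment variables named in this README, keyed by domain.
FIXTURE_ENV = {
    "retail": "TAU2_RETAIL_FIXED_FIRST_USER_FILE",
    "airline": "TAU2_AIRLINE_FIXED_FIRST_USER_FILE",
}


def missing_fixtures(env, domains=("retail", "airline")):
    """Return the env var names whose fixture is unset or not an existing file."""
    missing = []
    for domain in domains:
        var = FIXTURE_ENV[domain]
        path = env.get(var, "")
        if not path or not os.path.isfile(path):
            missing.append(var)
    return missing


def strict_preflight(config, env):
    """Raise if fixed first-user fixtures are required but unavailable."""
    required = config.get("eval", {}).get("require_fixed_first_user", False)
    if not required:
        return  # nothing to enforce for live-user configs
    missing = missing_fixtures(env)
    if missing:
        raise RuntimeError("missing fixed first-user fixture(s): " + ", ".join(missing))
```

Calling strict_preflight(config, os.environ) before any TAU-2 process starts would surface a missing fixture early instead of partway through a run.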
For a fresh checkout, run one live-user bootstrap pass per domain:
benchmark/tau2/llm/run_full_eval.sh \
--config benchmark/tau2/llm/config/fixed_first_user_bootstrap.yaml \
--domain retail \
--run-id fixed_first_user_bootstrap_retail \
--strict-preflight \
--execute
benchmark/tau2/llm/run_full_eval.sh \
--config benchmark/tau2/llm/config/fixed_first_user_bootstrap.yaml \
--domain airline \
--run-id fixed_first_user_bootstrap_airline \
--strict-preflight \
--execute
Then convert each bootstrap results.json into a fixture:
RETAIL_RESULTS=benchmark/tau2/llm/result/fixed_first_user_bootstrap_retail/memory_cells/fixed_first_user_bootstrap_retail_retail_no_memory_r1/fixed_first_user_bootstrap_retail_retail_no_memory_r1.json
AIRLINE_RESULTS=benchmark/tau2/llm/result/fixed_first_user_bootstrap_airline/memory_cells/fixed_first_user_bootstrap_airline_airline_no_memory_r1/fixed_first_user_bootstrap_airline_airline_no_memory_r1.json
python benchmark/tau2/llm/scripts/build_fixed_first_user_fixture.py \
--repo "$TAU2_REPO" \
--results-json "$RETAIL_RESULTS" \
--domain retail \
--task-split-name test \
--output benchmark/tau2/llm/result/fixed_first_user_fixtures/retail/fixed_first_user_fixture.json \
--require-full-split
python benchmark/tau2/llm/scripts/build_fixed_first_user_fixture.py \
--repo "$TAU2_REPO" \
--results-json "$AIRLINE_RESULTS" \
--domain airline \
--task-split-name test \
--output benchmark/tau2/llm/result/fixed_first_user_fixtures/airline/fixed_first_user_fixture.json \
--require-full-split
Export the generated fixture paths for subsequent strict runs:
export TAU2_RETAIL_FIXED_FIRST_USER_FILE="$PWD/benchmark/tau2/llm/result/fixed_first_user_fixtures/retail/fixed_first_user_fixture.json"
export TAU2_AIRLINE_FIXED_FIRST_USER_FILE="$PWD/benchmark/tau2/llm/result/fixed_first_user_fixtures/airline/fixed_first_user_fixture.json"
Plan the no-memory baseline without running TAU-2:
python benchmark/tau2/llm/scripts/run_eval.py \
--config benchmark/tau2/llm/config/no_memory.yaml \
--plan-only
Plan the paired current-evidence config without running TAU-2:
python benchmark/tau2/llm/scripts/run_eval.py \
--config benchmark/tau2/llm/config/template_indexed_trajectory.yaml \
--plan-only
Run a tiny no-memory smoke:
benchmark/tau2/llm/run_full_eval.sh \
--config benchmark/tau2/llm/config/no_memory.yaml \
--domain retail \
--strategy-id no_memory \
--num-tasks 1 \
--repeat-count 1 \
--strict-preflight \
--execute
Run a tiny template-indexed trajectory smoke against a clean local OpenViking service:
benchmark/tau2/llm/run_full_eval.sh \
--config benchmark/tau2/llm/config/template_indexed_trajectory.yaml \
--domain retail \
--strategy-id template_indexed_trajectory_top4_prewrite_top2 \
--num-tasks 1 \
--train-num-tasks 1 \
--repeat-count 1 \
--strict-preflight \
--execute
Start the OpenViking service before executing memory cells, and verify it with
ov status. For trajectory memory evidence, start the service from this branch
and inspect generated trajectory files; changing search_uri alone does not
prove the template-indexed trajectory prompt was used.
Run the no-memory full8 baseline:
benchmark/tau2/llm/run_full_eval.sh \
--config benchmark/tau2/llm/config/no_memory.yaml \
--run-id no_memory_full8 \
--strict-preflight \
--execute
Run the paired no-memory + current trajectory evidence config:
benchmark/tau2/llm/run_full_eval.sh \
--config benchmark/tau2/llm/config/template_indexed_trajectory.yaml \
--run-id template_indexed_trajectory_full8 \
--strict-preflight \
--execute
The main result is written to
benchmark/tau2/llm/result/template_indexed_trajectory_full8/scoreboard.json.
Per-cell execution records live under cell_results/, raw TAU-2 result JSON
lives under memory_cells/, and corpus identity / generated memory checks live
under memory_corpora/.
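
For scripts that post-process a run, the layout above can be captured in a small helper. This is only a sketch of the paths named in this README; the run_artifacts function is hypothetical, and it assumes the default location where memory_corpora/ sits outside the run id.

```python
from pathlib import PurePosixPath

RESULT_ROOT = PurePosixPath("benchmark/tau2/llm/result")


def run_artifacts(run_id: str) -> dict:
    """Map artifact kinds to their expected locations for one run id."""
    run_dir = RESULT_ROOT / run_id
    return {
        "scoreboard": run_dir / "scoreboard.json",   # strategy/domain aggregate
        "cell_results": run_dir / "cell_results",    # per-cell execution records
        "memory_cells": run_dir / "memory_cells",    # raw TAU-2 result JSON
        # Corpus artifacts are cached outside the run id by default.
        "memory_corpora": RESULT_ROOT / "memory_corpora",
    }


paths = run_artifacts("template_indexed_trajectory_full8")
print(paths["scoreboard"])
```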
Memory cells run through a small TAU-2 agent adapter in this directory.
The current trajectory config uses:
- train_memory_mode: experience_only, which selects the Memory V2 session-commit path that writes generated memory artifacts;
- train_transcript_format: role_tool_blocks, which preserves role-prefixed messages plus tool-call/tool-response blocks during training;
- train_include_system_prompt: true, which includes the domain policy in the training session;
- train_skip_failed_sessions: true, which avoids learning from failed train sessions;
- search_memory_type: trajectories, which retrieves generated trajectory memory during eval.

The runner prepares each distinct domain + corpus_id once and reuses it across
eval run ids when the cached corpus_manifest.json is present. Different
corpora may be prepared in parallel with benchmark.corpus_prepare_concurrency;
session commits inside one corpus remain serial to preserve OpenViking write
semantics.
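
The prepare-once behavior can be sketched as below: a corpus with a cached corpus_manifest.json is reused, distinct corpora are prepared in parallel, and commits inside one corpus stay serial. The directory layout under root, the manifest contents, and the commit_session callback are all assumptions made for illustration; the real runner talks to an OpenViking service.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor


def prepare_corpus(corpus_dir, sessions, commit_session):
    """Prepare one corpus; skip all work when the manifest is already cached."""
    manifest_path = os.path.join(corpus_dir, "corpus_manifest.json")
    if os.path.isfile(manifest_path):
        with open(manifest_path, encoding="utf-8") as fh:
            return json.load(fh)  # cache hit: reused across eval run ids
    os.makedirs(corpus_dir, exist_ok=True)
    committed = 0
    for session in sessions:  # serial on purpose: one write at a time per corpus
        commit_session(session)
        committed += 1
    manifest = {"committed_sessions": committed}  # hypothetical manifest shape
    with open(manifest_path, "w", encoding="utf-8") as fh:
        json.dump(manifest, fh)
    return manifest


def prepare_all(root, corpora, commit_session, concurrency=2):
    """corpora maps (domain, corpus_id) to that corpus's list of train sessions."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {
            key: pool.submit(
                prepare_corpus, os.path.join(root, key[0], key[1]), sessions, commit_session
            )
            for key, sessions in corpora.items()
        }
        return {key: future.result() for key, future in futures.items()}
```

Keying the work by (domain, corpus_id) means two strategies that share a corpus trigger only one prepare, which mirrors the "once per distinct domain + corpus_id" rule.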
By default, trajectory extraction is transcript-only: the runner replays TAU-2 messages into an OpenViking session and does not expose held-out reward or assertion results to the extractor.
Eval cells run in parallel with benchmark.strategy_concurrency by default and
can be overridden with --strategy-concurrency. This only parallelizes read-only
TAU-2 eval cells; corpus writes inside one corpus are still serialized by the
prepare step.
For exploratory gates, prefer a bounded run with --cell-timeout-seconds.
Timed-out cells are recorded with return code 124, timed_out=true, and are
excluded from scoreboard metrics, which keeps smoke runs from silently becoming
long-running evidence jobs.
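
A bounded cell run can be sketched with the standard library as follows. The record shape (returncode, timed_out) mirrors the fields named above, but run_cell and scoreable are hypothetical helpers, not the runner's implementation.

```python
import subprocess
import sys

TIMEOUT_RETURN_CODE = 124  # the code this README describes for timed-out cells


def run_cell(command, timeout_seconds):
    """Run one cell command and record whether it hit the timeout."""
    try:
        completed = subprocess.run(command, timeout=timeout_seconds)
        return {"returncode": completed.returncode, "timed_out": False}
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising TimeoutExpired.
        return {"returncode": TIMEOUT_RETURN_CODE, "timed_out": True}


def scoreable(records):
    """Keep only the records that may contribute to scoreboard metrics."""
    return [record for record in records if not record["timed_out"]]


if __name__ == "__main__":
    quick = run_cell([sys.executable, "-c", "pass"], timeout_seconds=60)
    slow = run_cell([sys.executable, "-c", "import time; time.sleep(30)"], timeout_seconds=1)
    print(quick, slow, len(scoreable([quick, slow])))
```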
If eval.user_simulator_policy is omitted, the runner defaults to the official
TAU-2 user simulator. The bundled OpenViking memory benchmark configs set
confirmation_aware instead, because a memory benchmark should not treat a
user confirmation as task completion before the backend write has happened.
confirmation_aware applies a small idempotent prompt patch to the configured
TAU-2 checkout before planning or running. The patch appends only the behavioral
confirmation boundary to the TAU-2 user simulator guidelines; metadata such as
the upstream PR link is kept in run artifacts, not in the simulator prompt.
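
An idempotent append of this kind can be sketched as below. It assumes idempotency is achieved by checking whether the boundary text is already present, which may differ from how the runner actually detects a patched checkout; the function name and the sample boundary wording are placeholders, not the real patch text.

```python
def apply_confirmation_patch(guidelines: str, boundary_text: str) -> str:
    """Append the boundary text once; a second call leaves the prompt unchanged."""
    boundary = boundary_text.strip()
    if boundary in guidelines:
        return guidelines  # already patched, nothing to do
    # Append only the behavioral text; no PR links or other metadata go in the prompt.
    return guidelines.rstrip("\n") + "\n\n" + boundary + "\n"


# Placeholder wording for illustration only.
SAMPLE_BOUNDARY = "Do not end the conversation until the agent reports the change was made."

once = apply_confirmation_patch("# User simulator guidelines\n", SAMPLE_BOUNDARY)
twice = apply_confirmation_patch(once, SAMPLE_BOUNDARY)
print(once == twice)  # True
```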
Optional fixed-first-user fixtures keep the first simulated user turn stable while preserving live simulator behavior after that turn.
Only completed retail + airline runs with the same config, same seeds/repeats,
and non-empty artifacts should be read as benchmark evidence. Partial runs,
single-task probes, or missing OpenViking corpus identity are diagnostics.
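
These criteria can be turned into a quick self-check before quoting numbers. The sketch below assumes one summary dict per domain run; every key name (config_hash, seeds, repeat_count, completed, artifact_count, uses_memory, corpus_id) is hypothetical and would need to be mapped onto whatever the run artifacts actually record.

```python
REQUIRED_DOMAINS = {"retail", "airline"}


def evidence_problems(runs):
    """Return the reasons a set of per-domain runs is only diagnostic (empty if none)."""
    problems = []
    if {r["domain"] for r in runs} != REQUIRED_DOMAINS:
        problems.append("need completed runs for both retail and airline")
    if len({r["config_hash"] for r in runs}) > 1:
        problems.append("configs differ between runs")
    if len({(tuple(r["seeds"]), r["repeat_count"]) for r in runs}) > 1:
        problems.append("seeds or repeat counts differ between runs")
    for r in runs:
        if not r["completed"]:
            problems.append(f"{r['domain']}: run did not complete")
        if r["artifact_count"] == 0:
            problems.append(f"{r['domain']}: artifacts are empty")
        if r["uses_memory"] and not r.get("corpus_id"):
            problems.append(f"{r['domain']}: missing OpenViking corpus identity")
    return problems
```

An empty list means the runs meet the stated bar; anything else should be reported as a diagnostic result.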
Executed runs write per-cell JSON under cell_results/ and a strategy/domain
aggregate under scoreboard.json. Memory training artifacts are shared by
domain and strategy under memory_corpora/, so repeated eval cells reuse the
same fresh corpus instead of rewriting it.