benchmark/tau2/README.md
This directory contains a small OpenViking-style entry point for TAU-2 memory evaluation. The scope is intentionally narrow.
Category rerank and other harness-only diagnostics are intentionally left out.
benchmark/tau2/
├── config/
│ ├── baseline.yaml
│ ├── official.yaml
│ ├── prewrite.yaml
│ └── trajectory.yaml
├── scripts/
│ ├── run_eval.py
│ ├── setup_tau2_repo.sh
│ └── tau2_common.py
└── run_full_eval.sh
Generated eval artifacts are written to benchmark/tau2/result/<run_id>/.
Memory corpus artifacts are cached outside the run id at
benchmark/tau2/result/memory_corpora/ by default.
This benchmark delegates task simulation and scoring to an external TAU-2 checkout. Point the runner at that checkout and CLI explicitly when they are not on the default path:
export TAU2_REPO=/path/to/tau2-bench
export TAU2_CLI=/path/to/tau2
The default OpenViking TAU-2 memory evidence protocol is
fixed_first_user_full8: retail + airline, 8 repeats, same seeds, a
confirmation-aware user simulator, and fixed-first-user fixtures for both
domains. Later user simulator turns remain live. Set the fixture paths before
running the default configs:
export TAU2_RETAIL_FIXED_FIRST_USER_FILE=/path/to/retail/fixed_first_user_fixture.json
export TAU2_AIRLINE_FIXED_FIRST_USER_FILE=/path/to/airline/fixed_first_user_fixture.json
--strict-preflight fails when eval.require_fixed_first_user=true and either
fixture is missing. Use config/official.yaml for an explicit non-fixed,
official-live-user control.
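The strict-preflight rule above can be sketched roughly as follows. This is an illustration, not the runner's actual code: the helper name check_fixed_first_user_fixtures and its return shape are hypothetical, and only the two environment variable names and the require_fixed_first_user flag come from this README.

```python
import os
from pathlib import Path

# Environment variable names as documented in this README.
FIXTURE_ENV_VARS = {
    "retail": "TAU2_RETAIL_FIXED_FIRST_USER_FILE",
    "airline": "TAU2_AIRLINE_FIXED_FIRST_USER_FILE",
}


def check_fixed_first_user_fixtures(require_fixed_first_user, env=None):
    """Return a list of problems; an empty list means the check passed.

    Hypothetical helper: the real preflight may check more than this.
    """
    env = os.environ if env is None else env
    if not require_fixed_first_user:
        # Non-fixed configs (for example config/official.yaml) need no fixture.
        return []
    problems = []
    for domain, var in FIXTURE_ENV_VARS.items():
        value = env.get(var, "")
        if not value:
            problems.append(f"{domain}: {var} is not set")
        elif not Path(value).is_file():
            problems.append(f"{domain}: {var} points to a missing file: {value}")
    return problems
```

A strict caller would exit non-zero when the returned list is non-empty, while a non-strict caller would only record it next to the run plan.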
For a local one-command setup, clone and install TAU-2 into ignored benchmark directories:
benchmark/tau2/scripts/setup_tau2_repo.sh
source benchmark/tau2/.env.tau2
Plan the default benchmark without running TAU-2:
python benchmark/tau2/scripts/run_eval.py --config benchmark/tau2/config/baseline.yaml --plan-only
Add --preflight or --strict-preflight when you want the runner to write a
small environment/config check next to the run plan.
After setup, verify the local TAU-2 link and write a one-cell run plan:
benchmark/tau2/run_full_eval.sh \
--config benchmark/tau2/config/baseline.yaml \
--strict-preflight \
--domain retail \
--strategy-id memory_v2_experience_only \
--task-id 5 \
--repeat-count 1
Plan a one-cell Memory V2 pre-write smoke:
benchmark/tau2/run_full_eval.sh \
--config benchmark/tau2/config/baseline.yaml \
--domain retail \
--strategy-id memory_v2_prewrite \
--num-tasks 1 \
--repeat-count 1
Plan a one-cell trajectory memory smoke:
benchmark/tau2/run_full_eval.sh \
--config benchmark/tau2/config/trajectory.yaml \
--domain retail \
--strategy-id memory_v2_trajectory_view \
--num-tasks 1 \
--train-num-tasks 1 \
--repeat-count 1
Run the Memory V2 8-trial matrix (retail + airline x 2 strategies x 8 repeats):
benchmark/tau2/run_full_eval.sh \
--config benchmark/tau2/config/baseline.yaml \
--execute
The PR-B headline and content-shape ablation use
config/prb_content_matrix_new_prompt.yaml. It runs the no-memory control plus
trajectory top4, experience top2, and representative 4000-character budget
ablation routes across retail + airline with 8 repeats.
First run one tiny end-to-end smoke against a clean local OpenViking service:
benchmark/tau2/run_full_eval.sh \
--config benchmark/tau2/config/prb_content_matrix_new_prompt.yaml \
--domain retail \
--strategy-id new_traj_fixed_first_user_prewrite \
--num-tasks 1 \
--train-num-tasks 1 \
--repeat-count 1 \
--strict-preflight \
--execute
Then run the full PR-B matrix:
benchmark/tau2/run_full_eval.sh \
--config benchmark/tau2/config/prb_content_matrix_new_prompt.yaml \
--run-id prb_content_matrix_new_prompt_full8 \
--strict-preflight \
--execute
The main result is written to
benchmark/tau2/result/prb_content_matrix_new_prompt_full8/scoreboard.json.
Per-cell outputs live under cell_results/; corpus identity and generated
memory checks live under memory_corpora/.
For a small E2E smoke, keep both the eval and train slices tiny:
benchmark/tau2/run_full_eval.sh \
--config benchmark/tau2/config/baseline.yaml \
--domain retail \
--strategy-id memory_v2_experience_only \
--num-tasks 1 \
--train-num-tasks 1 \
--repeat-count 1 \
--execute
When using Doubao through an OpenAI-compatible endpoint, set OPENAI_API_KEY
and OPENAI_API_BASE for LiteLLM before running upstream TAU-2.
Start the OpenViking service before executing memory cells, and verify it with
ov status. For evidence runs, use a clean OpenViking workspace/config and set
OPENVIKING_URL explicitly so local template overrides do not pollute the
Memory V2 baseline. For trajectory memory evidence, start the service from this
branch and inspect generated trajectory files; changing search_uri alone does
not prove the new trajectory prompt was used.
Memory V2 cells run through a small TAU-2 agent adapter in this directory:
scope_prompt_file; the runner still accepts scope_prompt_files for custom
local experiments.
For exploratory gates, prefer a bounded run with --cell-timeout-seconds.
Timed-out cells are recorded with return code 124, timed_out=true, and are
excluded from scoreboard metrics, which keeps smoke runs from silently becoming
long-running evidence jobs.
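The timeout behavior described above can be approximated with a short sketch. It assumes each cell is a subprocess. The names run_cell and scoreable and the result dictionary shape are hypothetical; only the return code 124, the timed_out flag, and the exclusion from metrics come from this README.

```python
import subprocess

# Return code this README documents for timed-out cells.
TIMEOUT_RETURN_CODE = 124


def run_cell(command, timeout_seconds):
    """Run one cell command and return a small result record (hypothetical shape)."""
    try:
        completed = subprocess.run(
            command, timeout=timeout_seconds, capture_output=True, text=True
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing keeps running.
        return {"return_code": TIMEOUT_RETURN_CODE, "timed_out": True}
    return {"return_code": completed.returncode, "timed_out": False}


def scoreable(results):
    """Keep only cells that finished in time; timed-out cells stay out of metrics."""
    return [result for result in results if not result["timed_out"]]
```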
The existing train_memory_mode: experience_only value selects the Memory V2
session-commit path. search_memory_type selects which generated memory bucket
is retrieved during eval (experiences by default, trajectories for
config/trajectory.yaml). The runner prepares each distinct
domain + corpus_id once and reuses it across eval run ids when the cached
corpus_manifest.json is present. Different corpora may be prepared in
parallel with benchmark.corpus_prepare_concurrency; session commits inside one
corpus remain serial to preserve OpenViking write semantics.
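As a rough sketch of the prepare step described above: distinct corpora run in parallel, commits inside one corpus stay serial, and a cached corpus_manifest.json short-circuits the work. The function names, the root/domain/corpus_id directory layout, the manifest fields, and the commit callback are all assumptions made for illustration.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def prepare_corpus(root, domain, corpus_id, sessions, commit):
    """Prepare one corpus unless a cached corpus_manifest.json already exists."""
    corpus_dir = Path(root) / domain / corpus_id  # hypothetical layout
    manifest = corpus_dir / "corpus_manifest.json"
    if manifest.is_file():
        return "reused"
    corpus_dir.mkdir(parents=True, exist_ok=True)
    for session in sessions:
        # Serial on purpose: one session commit at a time within a corpus.
        commit(domain, corpus_id, session)
    manifest.write_text(
        json.dumps({"domain": domain, "corpus_id": corpus_id, "sessions": len(sessions)}),
        encoding="utf-8",
    )
    return "prepared"


def prepare_all(root, corpora, commit, concurrency=2):
    """Prepare distinct (domain, corpus_id) pairs in parallel.

    `corpora` maps (domain, corpus_id) to its list of training sessions;
    `concurrency` plays the role of benchmark.corpus_prepare_concurrency.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {
            key: pool.submit(prepare_corpus, root, key[0], key[1], sessions, commit)
            for key, sessions in corpora.items()
        }
        return {key: future.result() for key, future in futures.items()}
```

Writing the manifest only after the last commit means an interrupted prepare is retried rather than mistaken for a complete corpus.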
By default, trajectory extraction is transcript-only: the runner replays TAU-2 messages into an OpenViking session and does not expose held-out reward or assertion results to the extractor. The PR-B evidence config can also use a structured role/tool transcript, include the domain policy in the training session, skip failed train sessions when building positive procedure memory, and cap injected memory by total character budget for content-shape ablations.
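The total character budget mentioned above could work along these lines. This is one possible policy (keep whole items in ranked order, stop at the first one that does not fit) and not necessarily the one the runner uses; cap_memories_by_chars is a hypothetical name.

```python
def cap_memories_by_chars(memories, max_chars):
    """Keep ranked memory texts until the next one would exceed max_chars.

    Assumed policy: whole items only, no mid-item truncation.
    """
    kept = []
    used = 0
    for text in memories:
        if used + len(text) > max_chars:
            break
        kept.append(text)
        used += len(text)
    return kept
```

With a 4000-character budget and retrieved items of 3000, 900, and 500 characters, the first two are injected (3900 characters) and the third is dropped.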
Eval cells run in parallel with benchmark.strategy_concurrency by default and
can be overridden with --strategy-concurrency. This only parallelizes read-only
TAU-2 eval cells; corpus writes inside one corpus are still serialized by the
prepare step.
If eval.user_simulator_policy is omitted, the runner defaults to the official
TAU-2 user simulator. The bundled OpenViking memory benchmark config sets
confirmation_aware, because a memory benchmark should not treat user
confirmation as task completion before the backend write has happened.
confirmation_aware applies a small idempotent prompt patch to the configured
TAU-2 checkout before planning or running. The patch appends only the behavioral
confirmation boundary to the TAU-2 user simulator guidelines; metadata such as
the upstream PR link is kept in run artifacts, not in the simulator prompt.
Reference: sierra-research/tau2-bench#297.
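An idempotent append of the kind described above might look like the following sketch. The helper name apply_prompt_patch and the detect-by-content approach are assumptions; the actual patch text and the guidelines file location inside the TAU-2 checkout are not reproduced here.

```python
from pathlib import Path


def apply_prompt_patch(guidelines_path, patch_text):
    """Append patch_text to the guidelines file once; return True if it changed.

    Hypothetical helper: idempotence is detected by looking for the patch text
    itself, so no marker or metadata is written into the prompt.
    """
    path = Path(guidelines_path)
    original = path.read_text(encoding="utf-8")
    patch = patch_text.strip()
    if patch in original:
        return False  # already patched: later calls are no-ops
    separator = "" if original.endswith("\n") else "\n"
    path.write_text(f"{original}{separator}{patch}\n", encoding="utf-8")
    return True
```

The boolean return value gives the caller something to record in run artifacts, for example whether the checkout was already patched before this run.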
Optional fixed-first-user fixtures keep the first simulated user turn stable while preserving live simulator behavior after that turn:
export TAU2_RETAIL_FIXED_FIRST_USER_FILE=/path/to/retail_fixture.json
export TAU2_AIRLINE_FIXED_FIRST_USER_FILE=/path/to/airline_fixture.json
Use config/official.yaml with a clean TAU-2 checkout when you need an
official-user-simulator parity run. If the checkout was already patched, the
artifact records that boundary instead of labeling the run pure official.
Only completed retail + airline runs with the same config, same seeds/repeats,
and non-empty artifacts should be read as benchmark evidence. Partial runs,
single-task probes, or missing OpenViking corpus identity are diagnostics.
Executed runs write per-cell JSON under cell_results/ and a strategy/domain
aggregate to scoreboard.json. Memory training artifacts are shared by
domain and strategy under memory_corpora/, so repeated eval cells reuse the
same fresh corpus instead of rewriting it.