benchmark/vaka/vikingbot/README.md
This benchmark evaluates Vaka multi-turn long-memory CSV results.
The default input is relative to this benchmark directory:
benchmark/vaka/vikingbot/data/vaka_locomo.csv
You can also pass a custom path with --input.
Rows are grouped by global session_id blocks:
session_id 1-10 is case_0001session_id 11-20 is case_0002session_id 21-30 is case_0003The default split is global across the whole CSV:
session_id 1-70 are committed/imported as memorysession_id 71 through the max session in the CSV are evaluation turnsquery + deepsearch_answer turnImport memory sessions 1-70 into OpenViking:
python3 benchmark/vaka/vikingbot/import_to_ov.py
import_to_ov.py imports each memory row as a two-message conversation:
query is the user message and deepsearch_answer is the assistant message.
All imported memory uses the same OpenViking identity by default:
account=default, user_id=default, and agent_id=default.
Evaluation sessions 71+ are not imported as memory by default.
Use a custom single user/agent when needed:
python3 benchmark/vaka/vikingbot/import_to_ov.py --user-id vaka --agent-id vaka
Prepare the judge input CSV:
python3 benchmark/vaka/vikingbot/run_eval.py
Judge the prepared answers:
uv run python benchmark/vaka/vikingbot/judge.py --parallel 10
Calculate stats:
python3 benchmark/vaka/vikingbot/stat_judge_result.py
Or run all steps:
bash benchmark/vaka/vikingbot/run_full_eval.sh
Skip OpenViking import and only do offline CSV preparation/judge/stat:
bash benchmark/vaka/vikingbot/run_full_eval.sh --skip-import
If the judge dependencies are only available through the project environment:
bash benchmark/vaka/vikingbot/run_full_eval.sh --python "uv run python"
run_eval.py does not call Vaka again. It treats the CSV deepsearch_answer column as
the generated answer to evaluate.
If standard_answer is present, judge.py grades against it. If judge_standard is
present, judge.py treats it as a rubric. If both are empty, the judge evaluates whether
the answer follows the current query while preserving relevant memory from global
session_id 1-70 and prior evaluation turns.