benchmark/retrieval/grep/vikingdb_bm25/README.md
Benchmark suite for evaluating OpenViking's grep retrieval with VikingDB BM25 engine.
vikingdb_bm25/
├── ai_wiki.txt # Source text for synthetic data generation
├── effectiveness/ # Retrieval effectiveness (recall/precision/F1)
│ ├── step1_add_resource.py
│ └── step2_quality.py
└── performance/ # Retrieval performance (latency + returned match count at scale)
├── step0_prepare_data.py
├── step1_add_resource.py
├── step2_reindex.py
└── step3_benchmark.py
Tests whether grep can find all matching files in real code repositories.
Data source: Real code repos (download manually, place under ~/.openviking/data/benchmark/).
| Step | Script | Description |
|---|---|---|
| 1 | step1_add_resource.py | Import code repos (with indexing, single import) |
| 2 | step2_quality.py | Compare grep results vs ground truth (fs engine, cached) |
# Step 1: Import repos (with VLM/embedding, single import)
cd effectiveness/
python3 step1_add_resource.py --source ~/.openviking/data/benchmark/OpenViking-main
# Step 2: Evaluate retrieval quality
# First run MUST use engine=fs in ov.conf to generate ground truth cache:
# 1. Set ov.conf: "grep": {"engine": "fs"}
# 2. Restart server
python3 step2_quality.py --keywords grep reindex SyncHTTPClient
# Subsequent runs can use any engine (ground truth is read from cache):
# 1. Set ov.conf: "grep": {"engine": "auto", "switch_to_remote_threshold": 0}
# 2. Restart server
python3 step2_quality.py --keywords grep reindex SyncHTTPClient
# Optional: --regenerate-ground-truth (force recompute, requires engine=fs)
Tests grep speed and returned match count on a large synthetic dataset (default: 200K files).
Data source: Generated from ai_wiki.txt with target words injected at known probabilities.
| Step | Script | Description |
|---|---|---|
| 0 | step0_prepare_data.py | Generate synthetic dataset (dir_xxx/wiki_xxx.txt) |
| 1 | step1_add_resource.py | Import data (no VLM/embedding, fast) |
| 2 | step2_reindex.py | Async reindex via openviking-server (concurrency=16, polling) |
| 3 | step3_benchmark.py | Measure latency and returned match count with node_limit=256 |
15 words across 5 probability tiers:
These word groups are defined in performance/step0_prepare_data.py and reused by performance/step3_benchmark.py.
| Probability | Words | Expected hits (per 200K files) |
|---|---|---|
| 1% | heliofract, prismcache, fluxkernel | ~2,000 |
| 0.1% | auroracode, kiteshade, glyphvector | ~200 |
| 0.1% | cortexmint, latticewave, spiralsync | ~200 |
| 0.05% | ripplehash, embertrace, novaframe | ~100 |
| 0.01% | zephyrloom, quartzrelay, nebulaindex | ~20 |
cd performance/
# Step 0: Generate data (default: 200 dirs x 1000 files = 200K files)
python3 step0_prepare_data.py
# Optional: append more data for scale-out without overwriting existing dirs
python3 step0_prepare_data.py --start-dir 100 --num-dirs 100
# Step 1: Import without indexing (fast)
python3 step1_add_resource.py
# Step 2: Build vector indexes (requires openviking-server running)
python3 step2_reindex.py
# Optional: --concurrency N (default: 16)
# Step 3: Benchmark — run with different engine configs
# Run A: fs engine
# 1. Set ov.conf: "grep": {"engine": "fs"}
# 2. Restart server
python3 step3_benchmark.py --engine-label fs
# Run B: auto engine (bm25)
# 1. Set ov.conf: "grep": {"engine": "auto", "switch_to_remote_threshold": 0}
# 2. Restart server
python3 step3_benchmark.py --engine-label auto --compare step3_result_fs.json
ov.conf and restart the server between benchmark runs--start-dir,
then rerun Step 1 and Step 2.