# Evaluation
In LLaVA-1.5, we evaluate models on a diverse set of 12 benchmarks. To ensure reproducibility, we evaluate the models with greedy decoding. We do not evaluate using beam search, so that the inference process stays consistent with the real-time outputs of the chat demo.
Currently, we mostly utilize the official toolkit or server for the evaluation.
## Evaluate on Custom Datasets

You can evaluate LLaVA on your custom datasets by converting your dataset to LLaVA's jsonl format and evaluating with `model_vqa.py`.

Below we provide a general guideline for evaluating datasets with some common formats; a concrete example follows the list.
1. Short-answer (e.g. VQAv2, MME).

```
<question>
Answer the question using a single word or phrase.
```

2. Option-only for multiple-choice (e.g. MMBench, SEED-Bench).

```
<question>
A. <option_1>
B. <option_2>
C. <option_3>
D. <option_4>
Answer with the option's letter from the given choices directly.
```

3. Natural QA (e.g. LLaVA-Bench, MM-Vet). No postprocessing is needed.
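For concreteness, here is a small, purely illustrative sketch: the directory, file names, questions, and checkpoint below are placeholders, and the flags mirror the ones used by the provided evaluation scripts (assuming `llava.eval.model_vqa` reads `question_id`, `image`, and `text` fields from each jsonl line).

```Shell
# Illustrative only: write two custom questions in LLaVA's jsonl format
# (one JSON object per line), then run single-GPU greedy inference.
mkdir -p ./playground/data/eval/custom
cat > ./playground/data/eval/custom/questions.jsonl << 'EOF'
{"question_id": 0, "image": "example_1.jpg", "text": "What color is the car?\nAnswer the question using a single word or phrase."}
{"question_id": 1, "image": "example_2.jpg", "text": "Describe the image in detail."}
EOF

CUDA_VISIBLE_DEVICES=0 python -m llava.eval.model_vqa \
    --model-path liuhaotian/llava-v1.5-13b \
    --question-file ./playground/data/eval/custom/questions.jsonl \
    --image-folder ./playground/data/eval/custom/images \
    --answers-file ./playground/data/eval/custom/answers.jsonl \
    --temperature 0 \
    --conv-mode vicuna_v1
```

The predictions land in `answers.jsonl`, one JSON object per question, which you can then score with whatever metric fits your dataset.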
## Scripts

Before preparing task-specific data, **you MUST first download `eval.zip`**. It contains custom annotations, scripts, and the prediction files with LLaVA v1.5. Extract it to `./playground/data/eval`. This also provides a general structure for all datasets.
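As a minimal sketch of that step (assuming `eval.zip` sits in the repository root and its top level maps directly onto the per-benchmark folders):

```Shell
# Sketch only: unpack the shared evaluation assets into ./playground/data/eval.
mkdir -p ./playground/data/eval
unzip -q eval.zip -d ./playground/data/eval
ls ./playground/data/eval   # expect per-benchmark folders such as vqav2, gqa, mmbench, ...
```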
### VQAv2

1. Download `test2015` (the COCO test2015 images; see the sketch at the end of this section) and put it under `./playground/data/eval/vqav2`.
2. Multi-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/vqav2.sh
```
3. Submit the results to the evaluation server: `./playground/data/eval/vqav2/answers_upload`.
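A hedged sketch of step 1 (the URL is the standard COCO download location; verify it before relying on it):

```Shell
# Sketch only: download and unpack the COCO test2015 images for VQAv2.
mkdir -p ./playground/data/eval/vqav2
wget http://images.cocodataset.org/zips/test2015.zip
unzip -q test2015.zip -d ./playground/data/eval/vqav2   # yields ./playground/data/eval/vqav2/test2015
```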
### GQA

1. Download the GQA data and evaluation scripts following the official instructions and put them under `./playground/data/eval/gqa/data`. You may need to modify `eval.py` to work around the missing assets in the GQA v1.2 release.
2. Multi-GPU inference and evaluate.
```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/gqa.sh
```
### VizWiz

1. Download `test.json` and extract `test.zip` to `test`. Put them under `./playground/data/eval/vizwiz`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/vizwiz.sh
```
3. Submit the results to the evaluation server: `./playground/data/eval/vizwiz/answers_upload`.

### ScienceQA

1. Under `./playground/data/eval/scienceqa`, download `images`, `pid_splits.json`, and `problems.json` from the `data/scienceqa` folder of the ScienceQA repo.
2. Single-GPU inference and evaluate.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/sqa.sh
```
### TextVQA

1. Download `TextVQA_0.5.1_val.json` and the images, and extract them to `./playground/data/eval/textvqa` (see the sketch at the end of this section).
2. Single-GPU inference and evaluate.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/textvqa.sh
```
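A hedged sketch of step 1; the URLs below are the standard TextVQA download locations, so double-check them against the official TextVQA site before relying on them:

```Shell
# Sketch only: fetch the TextVQA validation annotations and images.
mkdir -p ./playground/data/eval/textvqa
wget -P ./playground/data/eval/textvqa https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
unzip -q train_val_images.zip -d ./playground/data/eval/textvqa   # expected to yield train_images/
```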
### POPE

1. Download `coco` from POPE and put it under `./playground/data/eval/pope`.
2. Single-GPU inference and evaluate.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/pope.sh
```
### MME

1. Download the MME data following the official instructions; the downloaded images go under `MME_Benchmark_release_version`.
2. Put the official `eval_tool` and `MME_Benchmark_release_version` under `./playground/data/eval/MME` (see the layout sketch after this list).
3. Single-GPU inference and evaluate.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mme.sh
```
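Assuming the standard folder names from the MME release, the resulting layout looks roughly like this:

```
./playground/data/eval/MME/
├── MME_Benchmark_release_version/   # benchmark images and annotations
└── eval_tool/                       # official calculation tool
```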
### MMBench

1. Download `mmbench_dev_20230712.tsv` and put it under `./playground/data/eval/mmbench`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench.sh
```
3. Submit the results to the evaluation server: `./playground/data/eval/mmbench/answers_upload/mmbench_dev_20230712`.

### MMBench-CN

1. Download `mmbench_dev_cn_20231003.tsv` and put it under `./playground/data/eval/mmbench`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmbench_cn.sh
```
3. Submit the results to the evaluation server: `./playground/data/eval/mmbench/answers_upload/mmbench_dev_cn_20231003`.

### SEED-Bench

1. Follow the official instructions to download the images and videos. Put the images under `./playground/data/eval/seed_bench/SEED-Bench-image`.
2. Extract the frame in the middle of each downloaded video and put the frames under `./playground/data/eval/seed_bench/SEED-Bench-video-image`. We provide our script `extract_video_frames.py`, modified from the official one (an illustrative ffmpeg sketch follows this list).
3. Multi-GPU inference and evaluate.
```Shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/seed.sh
```
4. Optionally, submit the results to the leaderboard: `./playground/data/eval/seed_bench/answers_upload`, using the official Jupyter notebook.
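The provided `extract_video_frames.py` is the intended route for step 2; purely as an illustration of the middle-frame idea, a single video could also be handled with ffprobe/ffmpeg (hypothetical file name):

```Shell
# Sketch only: grab the single frame at a video's temporal midpoint.
duration=$(ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 some_video.mp4)
midpoint=$(python -c "print(${duration} / 2)")
ffmpeg -ss "${midpoint}" -i some_video.mp4 -frames:v 1 \
    ./playground/data/eval/seed_bench/SEED-Bench-video-image/some_video.png
```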
### LLaVA-Bench-in-the-Wild

1. Extract the contents of `llava-bench-in-the-wild` to `./playground/data/eval/llava-bench-in-the-wild` (see the sketch after this list).
2. Single-GPU inference and evaluate.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/llavabench.sh
```
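If you fetch the dataset from Hugging Face (it is published there as `liuhaotian/llava-bench-in-the-wild`; an assumption worth verifying), a sketch of step 1 using git-lfs:

```Shell
# Sketch only: clone the dataset repository from Hugging Face (needs git-lfs for the images).
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild \
    ./playground/data/eval/llava-bench-in-the-wild
```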
### MM-Vet

1. Extract `mm-vet.zip` to `./playground/data/eval/mmvet`.
2. Single-GPU inference.
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/mmvet.sh
```
3. Evaluate the predictions in `./playground/data/eval/mmvet/results` using the official Jupyter notebook.

## More Benchmarks

Below are awesome benchmarks for multimodal understanding from the research community that are not initially included in the LLaVA-1.5 release.
### Q-Bench

1. Download `llvisionqa_dev.json` (for the `dev` subset) and `llvisionqa_test.json` (for the `test` subset). Put them under `./playground/data/eval/qbench`.
2. Download and extract the images, and put them under `./playground/data/eval/qbench/images_llviqionqa`.
3. Single-GPU inference (change `dev` to `test` for evaluation on the test set).
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/qbench.sh dev
```
4. Submit the results: `./playground/data/eval/qbench/llvisionqa_dev_answers.jsonl`.

### Chinese-Q-Bench

1. Download `质衡-问答-验证集.json` (for the `dev` subset) and `质衡-问答-测试集.json` (for the `test` subset). Put them under `./playground/data/eval/qbench`.
2. Download and extract the images, and put them under `./playground/data/eval/qbench/images_llviqionqa`.
3. Single-GPU inference (change `dev` to `test` for evaluation on the test set).
```Shell
CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/qbench_zh.sh dev
```
4. Submit the results: `./playground/data/eval/qbench/llvisionqa_zh_dev_answers.jsonl`.