sample_apps/generative_benchmarking/README.md
This project provides a comprehensive toolkit for generating custom benchmarks and replicating the results outlined in our technical report.
Benchmarking is used to evaluate how well a model is performing, with the aim to generalize that performance to broader real-world scenarios. However, the widely-used benchmarks today often rely on artificially clean datasets and generic domains, with the added concern that they have likely already been seen by embedding models in training.
We introduce generative benchmarking as a way to address these limitations. Given a set of documents, we synthetically generate queries that are representative of the ground truth.
This repository offers tools to:
generate_benchmark.ipynb
A comprehensive guide to generating a custom benchmark based on your data
compare.ipynb
A framework for comparing results, which is useful when evaluating different embedding models or configurations
data/
Example data to immediately test out the notebooks with
functions/
Functions used to run notebooks, includes various embedding functions and llm prompts
results/
Folder for saving benchmark results, includes results produced from example data
pip install -r requirements.txt
poetry install
conda env create -f environment.yml
conda activate generative-benchmarking-env