docs/gen-ai/Benchmark && Evaluation.md
It's critical to evaluate the performance of the GenAI model once it's available. That effort is two-fold: benchmarking and evaluation.
This document covers the evaluation side: how to evaluate the model on various eval datasets.
To get results that are as comparable as possible with other LLMs, we evaluate the model the same way the Open LLM Leaderboard does, which uses lm-evaluation-harness as the evaluation framework.
For details on which evaluation datasets are used, please refer to the Open LLM Leaderboard.
Because lm-evaluation-harness is written in Python, it cannot be used directly from .NET. Therefore we use the following steps as a workaround:
- Host the model behind an OpenAI-compatible web API endpoint from .NET (see the sketch after this list).
- Run lm-evaluation-harness against that endpoint to evaluate the model using openai mode.
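
Below is a minimal sketch of the hosting step, assuming an ASP.NET Core minimal API; it is not the actual implementation. The `GenerateCompletion` helper and the `CompletionRequest` record are hypothetical placeholders for whatever inference API the .NET GenAI model exposes, and the response covers only the fields the harness needs to read back.

```csharp
// Sketch: expose the model as an OpenAI-compatible /v1/completions endpoint
// so that lm-evaluation-harness can call it in openai mode.
using System.Text.Json.Serialization;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.MapPost("/v1/completions", (CompletionRequest request) =>
{
    // Hypothetical inference call; replace with the real .NET GenAI generation API.
    string completion = GenerateCompletion(request.Prompt, request.MaxTokens);

    // Return the subset of the OpenAI completions response shape the harness parses.
    return Results.Json(new
    {
        id = Guid.NewGuid().ToString(),
        @object = "text_completion",
        model = request.Model,
        choices = new[] { new { text = completion, index = 0, finish_reason = "stop" } }
    });
});

app.Run();

// Placeholder for the model call; not a real API.
static string GenerateCompletion(string prompt, int maxTokens) => "TODO: run the model here";

// Only the request fields this sketch cares about, mapped from OpenAI's snake_case names.
record CompletionRequest(
    [property: JsonPropertyName("model")] string Model,
    [property: JsonPropertyName("prompt")] string Prompt,
    [property: JsonPropertyName("max_tokens")] int MaxTokens);
```

Once the endpoint is running, lm-evaluation-harness can be pointed at it; the exact model type and arguments depend on the harness version (recent versions, for example, provide a `local-completions` model type that accepts a `base_url` model argument), so check the harness documentation for the invocation details.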