doc/development/ai_features/ai_evaluation_guidelines.md
Unlike traditional software systems, which behave more or less predictably, AI-powered systems can produce significantly different outputs in response to minor input changes. This unpredictability stems from the non-deterministic nature of AI-generated responses. Traditional software testing methods are not designed to handle such variability, which is why AI evaluation has become essential. AI evaluation is a data-driven, quantitative process that analyzes AI outputs to assess system performance, quality, and reliability.
The Centralized Evaluation Framework (CEF) provides a streamlined, unified approach to evaluating AI features at GitLab. It is essential to our strategy for ensuring the quality of our AI-powered features.
Conceptually, an evaluation has three parts: metrics and success criteria, an evaluation design, and an evaluation dataset. Each part plays a role in the evaluation process, as described below.
Define metrics to determine when the target AI feature or component is working correctly. The chosen metrics should align with the success metrics that indicate when desired business outcomes have been met.
The following are examples of metrics that might be relevant:
For some targets, domain-specific metrics are essential, and may matter more than the general metrics listed here. In some cases, choosing the right metric is a gradual, iterative process of discovery and experimentation that involves multiple teams and feedback from users.
Where possible, establish a clear threshold for each metric, such as a minimum acceptable performance level. For example:
It might not be feasible to define a threshold for novel metrics, particularly domain-specific ones. In general, we rely on user expectations to define thresholds for acceptable performance. In some cases, we know what users will expect before releasing a feature and can set thresholds accordingly. In other cases, we must wait for user feedback before we know what threshold to set.
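To make this concrete, the following Python sketch shows one way to record metric thresholds and check evaluation results against them. The metric names and values are hypothetical, not prescribed by the CEF.

```python
# Hypothetical metric thresholds: minimum acceptable mean score per metric.
# The names and values below are illustrative only.
THRESHOLDS = {
    "correctness": 0.80,  # at least 80% of answers judged correct
    "relevance": 0.75,
}

def check_thresholds(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the names of metrics that fall below their threshold."""
    return [
        name
        for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    ]

# "relevance" falls below its 0.75 threshold; "correctness" passes.
failing = check_thresholds({"correctness": 0.85, "relevance": 0.70}, THRESHOLDS)
```

A check like this makes the acceptance criteria explicit and machine-verifiable, which is useful when deciding whether a change is safe to ship.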
When designing an evaluation, you define how to measure the performance of the target AI feature or component against the acceptance criteria. This involves choosing the right evaluators: functions that score the target AI's performance on specific metrics. Designing evaluations can also involve creating scenarios that test the target AI feature or component under realistic conditions. You can implement different scenarios as distinct categories of dataset examples, or as variations in how the evaluation invokes the target AI feature or component.
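As an illustration of what an evaluator is, here is a minimal sketch in Python. The function signature, the result shape, and the exact-match scoring are assumptions for illustration; real evaluators in the CEF and LangSmith have their own interfaces and often use more sophisticated scoring, such as an LLM acting as judge.

```python
def exact_match_evaluator(ai_output: str, expected_output: str) -> dict:
    """Score the target AI's output against a reference answer.

    Returns a metric name and a score: 1.0 for an exact match
    (ignoring case and surrounding whitespace), 0.0 otherwise.
    """
    matched = ai_output.strip().lower() == expected_output.strip().lower()
    return {"key": "exact_match", "score": 1.0 if matched else 0.0}

result = exact_match_evaluator("  MergeRequest  ", "mergerequest")
# result["score"] is 1.0
```

Exact match is only appropriate when there is a single correct answer; most AI features need fuzzier evaluators that tolerate valid variations in wording.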
Scenarios to consider include:
A well-structured dataset enables consistent testing and validation of an AI system or component across different scenarios and use cases.
For an overview of working with datasets in the CEF and LangSmith, see the dataset management documentation.
For more detailed information on creating and preparing datasets for evaluation, see our dataset creation guidelines and instructions for uploading datasets to LangSmith.
If you are evaluating a prompt, a quick way to get started is to use our dataset generator. It generates a synthetic evaluation dataset directly from an AI Gateway prompt definition. You can watch a quick demonstration.
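For a sense of what dataset examples can look like, here is a hypothetical sketch in Python. The field names (`inputs`, `expected_output`, `scenario`) are illustrative assumptions; follow the dataset creation guidelines for the actual schema used by your feature.

```python
import json

# A hypothetical evaluation dataset: each example pairs an input with a
# reference output, and tags the scenario it exercises.
dataset = [
    {
        "inputs": {"question": "How do I create a merge request?"},
        "expected_output": "Use the New merge request button ...",
        "scenario": "happy_path",
    },
    {
        "inputs": {"question": ""},
        "expected_output": "Ask the user to clarify the question.",
        "scenario": "empty_input",
    },
]

# Datasets are often serialized as JSON Lines, one example per line.
jsonl = "\n".join(json.dumps(example) for example in dataset)
```

Tagging each example with a scenario makes it easy to break evaluation results down by category and spot where the target underperforms.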
When an evaluation is executed, the CEF invokes the target AI feature or component at least once for each input example in the evaluation dataset. The framework then invokes evaluators to score the AI output, and provides you with the results of the evaluation.
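Conceptually, that run looks like the following Python sketch. The `target` and `evaluator` callables and the result structure are placeholders; the CEF handles this orchestration for you.

```python
from statistics import mean
from typing import Callable

def run_evaluation(
    dataset: list[dict],
    target: Callable[[dict], str],
    evaluators: list[Callable[[str, str], dict]],
) -> dict[str, float]:
    """Invoke the target once per example, score each output with every
    evaluator, and return the mean score per metric."""
    scores: dict[str, list[float]] = {}
    for example in dataset:
        output = target(example["inputs"])
        for evaluator in evaluators:
            result = evaluator(output, example["expected_output"])
            scores.setdefault(result["key"], []).append(result["score"])
    return {key: mean(values) for key, values in scores.items()}

# Stubs standing in for the AI feature and a real evaluator.
dataset = [
    {"inputs": {"question": "2+2?"}, "expected_output": "4"},
    {"inputs": {"question": "3+3?"}, "expected_output": "6"},
]
target = lambda inputs: "4"
evaluator = lambda out, exp: {"key": "exact_match", "score": 1.0 if out == exp else 0.0}

results = run_evaluation(dataset, target, [evaluator])
# results == {"exact_match": 0.5}: the stub target answers one of two examples correctly.
```

In practice, the target is invoked against a live system or model, so runs also surface latency and reliability issues alongside the metric scores.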
You can use the Evaluation Runner to run an evaluation in a CI pipeline from a merge request. It spins up a new GDK instance in a remote environment, runs an evaluation using the CEF, and reports the results in the CI job log. See the guide for how to use the Evaluation Runner.
See the step-by-step guide for conducting evaluations using the CEF.
The CEF uses LangSmith to store and analyze evaluation results. See the LangSmith guide for how to analyze an experiment.
For feature-specific guidance, see the Analyze Results section of that feature's documentation for running evaluations locally. You can also find information about interpreting evaluation metrics in the GitLab Duo Chat evaluation documentation.
We're updating the documentation on executing and interpreting the results of existing evaluation pipelines (see #671).
Similar to the AI feature development process, iterating on evaluation means returning to previous steps as indicated by the evaluation results. Prompt engineering is key to this step. However, it might also involve adding examples to the dataset, editing existing examples, adjusting the design of the evaluations, or reviewing and revising the metrics and success criteria.