wren-ai-service/eval/README.md
This document describes the evaluation framework for the Wren AI service. The evaluation framework is designed to assess the performance of the Wren AI service based on the following components:
- a `.env.dev` file with the required credentials
- `just up` to initiate the necessary development services
- `config.yaml` located in the `wren-ai-service/eval/` directory

The dataset curation process prepares the evaluation dataset for the Wren AI service. You can follow the steps below to start the curation app:

- copy `.env.example` to `.env` and fill in the environment variables
- run the following command in the `wren-ai-service` folder: `just curate_eval_data`

To prepare a dataset for evaluation, run:

just prep <dataset-name>
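As a concrete sketch, the full flow from environment setup to dataset preparation might look like the following (commands are assumed to run from the `wren-ai-service` folder; the exact location of `.env.example` follows the repository layout):

```bash
# copy the example environment file and fill in the required credentials
cp .env.example .env

# start the development services needed by the curation app
just up

# launch the dataset curation app
just curate_eval_data

# prepare an evaluation dataset (spider1.0 is the default if no dataset is specified)
just prep spider1.0
```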
Currently, we support two datasets for evaluation:
- `spider1.0`: the Spider dataset (default if no dataset is specified)
- `bird`: the Bird dataset

The command performs two main steps:

1. Downloads the specified dataset to `wren-ai-service/tools/dev/etc/<dataset-name>`
2. Prepares and saves the evaluation datasets to `wren-ai-service/eval/dataset`

The output files follow these naming conventions:

- `spider_<db_name>_eval_dataset.toml`
- `bird_<db_name>_eval_dataset.toml`

Each evaluation dataset contains the questions, SQL queries, and relevant context needed for testing the system's text-to-SQL capabilities.
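For example, preparing the Bird dataset and locating its outputs might look like this (paths are relative to the `wren-ai-service` folder; the actual `<db_name>` values depend on the databases included in the dataset):

```bash
# download and prepare the Bird dataset
just prep bird

# the raw dataset is downloaded under:
#   tools/dev/etc/bird
# and the prepared evaluation datasets are written as TOML files, e.g.:
#   eval/dataset/bird_<db_name>_eval_dataset.toml
ls eval/dataset/
```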
Before starting the prediction and evaluation process, you need to configure the datasource correctly. This ensures that the system can access the necessary data for making predictions and evaluations.
For the Spider or Bird datasets, a built-in datasource is used. This means that the data is stored locally and accessed through a specific path. You need to specify the eval_data_db_path in the config.yaml file. This path tells the system where to find the database files.
Here's an example of how to set this up in the config.yaml file:
eval_data_db_path: "etc/bird/minidev/MINIDEV/dev_databases"
When working with custom MDLs that utilize BigQuery as their datasource, it's crucial to properly configure your system to access the necessary datasets. This involves setting specific parameters in the config.yaml file or the .env.dev file. Both methods are effective, but using the .env.dev file is particularly beneficial for keeping sensitive credentials secure.
You can use the following command to encode the credentials:
cat <path/to/credentials.json> | base64
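Note that some `base64` implementations wrap their output across multiple lines, which would break a single-line value; if that happens, disable wrapping (the exact flag depends on your platform's `base64`):

```bash
# GNU coreutils (most Linux distributions): -w 0 disables line wrapping
base64 -w 0 < path/to/credentials.json

# BSD/macOS base64 does not wrap output by default
base64 -i path/to/credentials.json
```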
To enable access to your BigQuery dataset, add the following parameters to your `config.yaml` file. This configuration will guide the system in locating and authenticating with your BigQuery resources:
bigquery_project_id: "your_project_id"
bigquery_dataset_id: "your_dataset_id"
bigquery_credentials: "your_credentials" # this is a base64 encoded string of the credentials
Alternatively, for the `.env.dev` file, you can use the following parameters:
BIGQUERY_PROJECT_ID="your_project_id"
BIGQUERY_DATASET_ID="your_dataset_id"
BIGQUERY_CREDENTIALS="your_credentials" # this is a base64 encoded string of the credentials
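As a sketch, you could populate these values from the command line (the `-w 0` flag assumes GNU `base64`; the project ID, dataset ID, and credentials path are placeholders):

```bash
# append the BigQuery settings to the .env.dev file
cat >> .env.dev <<EOF
BIGQUERY_PROJECT_ID="your_project_id"
BIGQUERY_DATASET_ID="your_dataset_id"
BIGQUERY_CREDENTIALS="$(base64 -w 0 < path/to/credentials.json)"
EOF
```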
The prediction process runs the evaluation data through the Wren AI service to produce prediction results. It creates traces and a session on Langfuse so the results are available to the user. You can use the following command to run predictions on an evaluation dataset under the eval/dataset directory:
just predict <evaluation-dataset>
Also, sub-pipeline predictions are supported by specifying the pipeline name:
just predict <evaluation-dataset> <pipeline-name>
Currently, we support the following pipelines: 'ask', 'generation', and 'retrieval'. If no pipeline name is specified, the default is the 'ask' pipeline.
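For example (the dataset file name below is illustrative; the exact argument form follows the `just predict` recipe in the repository):

```bash
# run the default 'ask' pipeline against a prepared evaluation dataset
just predict spider_<db_name>_eval_dataset.toml

# run only the 'retrieval' sub-pipeline against the same dataset
just predict spider_<db_name>_eval_dataset.toml retrieval
```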
The evaluation process is used to assess the prediction results of the Wren AI service. It compares the prediction results with the ground truth and calculates the evaluation metrics. This process will also add a trace in the same session on Langfuse to make the evaluation results available to the user. You can use the following command to evaluate the prediction results under the outputs/predictions directory:
just eval <prediction-result>
Note: If you would like to enable LLM-based semantic comparison between SQL queries to improve the accuracy metric, fill in the OpenAI API key in the `.env` file under `wren-ai-service/eval` and append `--semantics` to the command, as follows:
just eval <prediction-result> --semantics
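Putting it together, a full prediction-and-evaluation run might look like this (file names are illustrative; the prediction result file is generated by the predict step, and the exact base directory of `outputs/predictions` follows the repository layout):

```bash
# 1. produce predictions for a prepared dataset (default 'ask' pipeline)
just predict spider_<db_name>_eval_dataset.toml

# 2. find the generated prediction file
ls outputs/predictions/

# 3. evaluate it, with LLM-based semantic SQL comparison enabled
just eval <prediction-result> --semantics
```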
The evaluation results are presented on Langfuse.
This section describes the terms used in the evaluation framework:
This section describes the evaluation metrics used in the evaluation framework: