docs-mintlify/admin/ai/evals.mdx
<Warning>
Evals are currently in preview, and the user experience and file format may still change. Reach out to the Cube support team to activate this feature for your account.
</Warning>

Evals let you benchmark your agent's answers against a known-correct ground truth, on any branch. You author a set of questions, each with the SQL or certified query that represents the right answer, run your agent against them, and get a per-question pass/fail plus an accuracy score for the run. That lets you see, objectively, whether a data-model or agent change made the agent better or worse.
You'll find evals in the model IDE under the Evaluate tab, with two sub-tabs: Evaluations (runs) and Questions (the benchmark set).

| Term | What it is |
|---|---|
| Question | A natural-language question plus its ground truth (the correct answer, as SQL or a certified-query reference). Authored as code in your data model. |
| Evaluation (run) | One execution of the agent against the whole question set, on a specific branch and agent. |
| Result | The agent's answer to a single question in a run, graded against that question's ground truth. |
| Accuracy | passed / total for a run, shown as NN% (passed/total). |
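
As a rough illustration of the Accuracy row, the sketch below computes passed / total and formats it in the NN% (passed/total) shape. The function name `format_accuracy`, the rounding to a whole percent, and the handling of an empty run are assumptions made for this example, not documented product behavior.

```python
def format_accuracy(passed: int, total: int) -> str:
    """Format a run's accuracy as 'NN% (passed/total)'.

    Assumption: the percentage is rounded to a whole number; the product
    may round or truncate differently.
    """
    if total < 0 or not 0 <= passed <= total:
        raise ValueError("expected 0 <= passed <= total")
    if total == 0:
        # Assumption: an empty run has no meaningful accuracy.
        return "n/a (0/0)"
    percent = round(100 * passed / total)
    return f"{percent}% ({passed}/{total})"


print(format_accuracy(17, 20))  # 85% (17/20)
```
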

Questions live in your data model repository, versioned and branched like the rest of it, under `agents/eval_questions/*.yml`.

Each file has a top-level `eval_questions` list. A question needs a unique `name`, a `question`, and exactly one ground truth: a `certifiedQuery` reference or inline `sql`.
```yaml
# agents/eval_questions/revenue.yml
eval_questions:
  - name: revenue_by_quarter
    question: What was our revenue by quarter over the last two years?
    certifiedQuery: revenue_by_quarter  # reference an existing certified query by name

  - name: arr_last_4_years
    question: What was our ARR over the last 4 years?
    sql: |  # ...or inline SQL ground truth
      SELECT date_trunc('year', created_at) AS year, SUM(arr) AS arr
      FROM subscriptions GROUP BY 1 ORDER BY 1
```
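
The structural rules for a question file (a top-level `eval_questions` list, a unique name, a question, and exactly one ground truth) can be sketched as a small check. This is a minimal illustration, not Cube's validator: `validate_question_file` is a hypothetical helper, it takes the already-parsed YAML as a Python dict so that it needs no third-party parser, and it does not check whether a `certifiedQuery` reference resolves.

```python
from typing import Any

GROUND_TRUTH_KEYS = ("certifiedQuery", "sql")


def validate_question_file(doc: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the file looks valid."""
    questions = doc.get("eval_questions")
    if not isinstance(questions, list):
        return ["missing top-level eval_questions list"]

    errors: list[str] = []
    seen: set[str] = set()
    for index, item in enumerate(questions):
        if not isinstance(item, dict):
            errors.append(f"entry {index}: expected a mapping")
            continue

        name = item.get("name")
        if not isinstance(name, str) or not name:
            label = f"entry {index}"
            errors.append(f"{label}: missing name")
        else:
            label = name
            if name in seen:
                errors.append(f"{label}: duplicate name")
            seen.add(name)

        if not item.get("question"):
            errors.append(f"{label}: missing question")

        # Exactly one ground truth: certifiedQuery or sql, never both or neither.
        present = [key for key in GROUND_TRUTH_KEYS if item.get(key)]
        if len(present) != 1:
            errors.append(
                f"{label}: expected exactly one of certifiedQuery or sql, "
                f"found {len(present)}"
            )
    return errors


# Hypothetical input: the second entry reuses a name and sets both ground truths.
example = {
    "eval_questions": [
        {"name": "revenue_by_quarter", "question": "Revenue by quarter?",
         "certifiedQuery": "revenue_by_quarter"},
        {"name": "revenue_by_quarter", "question": "ARR by year?",
         "sql": "SELECT 1", "certifiedQuery": "arr_by_year"},
    ]
}
for problem in validate_question_file(example):
    print(problem)
```
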
- `certifiedQuery` references a certified query by name. Define it under `agents/certified_queries/` (or via Certify this query in chat). A reference that doesn't resolve to an existing certified query is flagged as a validation error.
- `sql` is inline ground-truth SQL, run through the same Cube SQL API the agent uses (so `MEASURE(...)` and friends work).
- An optional `space` key scopes a file's questions to a named space (defaults to `auto`). Question names are unique per space.

<Note>
While in preview, the Questions tab is a read-only view of these files. To add or edit questions, edit the YAML in the IDE; there's no in-product question editor yet.
</Note>

On the Evaluations tab, click New evaluation and choose:

- The agent: `auto` (the implicit auto-agent) or a configured agent name.

The run starts immediately and you can close the dialog; it executes in the background. The run list shows live progress and then the outcome:

| Column | Meaning |
|---|---|
| Evaluation name | When the run was created. |
| Environment | Where it ran — dev (your personal dev-mode branch, shown as "Name Dev Mode"), staging, or prod (the deploy branch, e.g. master or main). |
| Agent | The agent used. |
| Execution status | Running, Completed, or Failed. |
| Accuracy | NN% (passed/total). |
| Created by | Who triggered the run. |
| Last updated | When it finished. |

Open a run to see per-question results: the question list on the left, with a pass/fail icon for each, and the selected question's detail on the right. Each result has one of four verdicts: pass, fail, review, or error.

Grading is execution-based, not text-based, the same approach used by industry text-to-SQL benchmarks such as BIRD and Spider 2.0. The agent's SQL and the ground-truth SQL are both executed, and their result sets are compared. So an answer that's worded or written differently but produces the same data still passes.
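The comparison tolerates differences that don't change the data:

- A difference in numeric representation (6646 vs. 6646.0) doesn't fail.
- Column names are not compared, so `revenue` vs. `total` aliases don't matter. Extra columns the agent adds are ignored.

Verdicts:

| Verdict | When |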
|---|---|
| pass | The agent's result set matches the ground truth. |
| fail | It ran but the result set doesn't match (see the score reason). |
| review | Nothing to compare automatically — the question has no ground truth, or the agent didn't run a query. Compare manually. |
| error | The agent run failed, the ground-truth query failed, or a referenced certified query wasn't found. |
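
To make the grading rules concrete, here is a minimal sketch of an execution-based comparison and the verdict it leads to. It is an illustration under stated assumptions, not Cube's grader: the names `normalize`, `columns`, `results_match`, and `verdict` are hypothetical, rows are compared in the order returned, numbers are rounded to six decimal places before comparison, and an error takes precedence over a review.

```python
from decimal import Decimal
from typing import Any, Optional, Sequence

Row = Sequence[Any]
Rows = Sequence[Row]


def normalize(value: Any) -> Any:
    """Make numerically equal values compare equal (6646 vs. 6646.0)."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float, Decimal)):
        # Assumption: six decimal places is enough precision for grading.
        return round(float(value), 6)
    return value


def columns(rows: Rows) -> list[tuple]:
    """Transpose rows into columns of normalized values."""
    return [tuple(normalize(v) for v in col) for col in zip(*rows)]


def results_match(expected: Rows, agent: Rows) -> bool:
    """True if every ground-truth column appears, by value, in the agent's result.

    Column names are never consulted, and unmatched agent columns are ignored.
    Assumption: rows are compared in the order returned.
    """
    if len(expected) != len(agent):
        return False
    available = columns(agent)
    for wanted in columns(expected):
        if wanted not in available:
            return False
        available.remove(wanted)  # one agent column satisfies one ground-truth column
    return True


def verdict(expected: Optional[Rows], agent: Optional[Rows], errored: bool = False) -> str:
    """Map a graded question to pass, fail, review, or error."""
    if errored:
        return "error"   # a run or query failed, or a certified query was not found
    if expected is None or agent is None:
        return "review"  # no ground truth, or the agent didn't run a query
    return "pass" if results_match(expected, agent) else "fail"


truth = [("2023", 6646), ("2024", 7120)]
answer = [("2023", 6646.0, "USD"), ("2024", 7120.0, "USD")]  # extra column, floats

print(verdict(truth, answer))       # pass
print(verdict(truth, answer[:1]))   # fail
print(verdict(None, answer))        # review
```

Comparing column by column, rather than by name, is one simple way to honor the rule that aliases don't matter while still ignoring extra columns.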