Evals

<Warning>

Evals are currently in preview, and the user experience and file format may still change. Reach out to the Cube support team to activate this feature for your account.

</Warning>

Evals let you benchmark your agent's answers against a known-correct ground truth, on any branch. You author a set of questions, each with the SQL or certified query that represents the right answer, run your agent against them, and get a per-question pass/fail plus an accuracy score for the run — so you can see, objectively, whether a data-model or agent change made the agent better or worse.

You'll find evals in the model IDE under the Evaluate tab, with two sub-tabs: Evaluations (runs) and Questions (the benchmark set).

Concepts

| Term | What it is |
| --- | --- |
| Question | A natural-language question plus its ground truth (the correct answer, as SQL or a certified-query reference). Authored as code in your data model. |
| Evaluation (run) | One execution of the agent against the whole question set, on a specific branch and agent. |
| Result | The agent's answer to a single question in a run, graded against that question's ground truth. |
| Accuracy | passed / total for a run, shown as NN% (passed/total). |
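
As a minimal sketch of how the accuracy figure is derived, the snippet below computes passed / total and formats it as NN% (passed/total). The helper name `format_accuracy` and the rounding to a whole percent are assumptions for illustration; the product may round differently.

```python
def format_accuracy(passed: int, total: int) -> str:
    """Format a run's accuracy as 'NN% (passed/total)'.

    Hypothetical helper: rounding to a whole percent is an assumption.
    """
    if total < 0 or not 0 <= passed <= total:
        raise ValueError("expected 0 <= passed <= total")
    # Guard against division by zero for an empty question set.
    percent = round(100 * passed / total) if total else 0
    return f"{percent}% ({passed}/{total})"


print(format_accuracy(17, 20))  # 85% (17/20)
```
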

Authoring benchmark questions

Questions live in your data model repository under agents/eval_questions/*.yml, versioned and branched like the rest of the model. Each file has a top-level eval_questions list. A question needs a unique name, a question, and exactly one ground truth: either a certifiedQuery reference or inline sql.

```yaml
# agents/eval_questions/revenue.yml
eval_questions:
  - name: revenue_by_quarter
    question: What was our revenue by quarter over the last two years?
    certifiedQuery: revenue_by_quarter        # reference an existing certified query by name

  - name: arr_last_4_years
    question: What was our ARR over the last 4 years?
    sql: |                                    # ...or inline SQL ground truth
      SELECT date_trunc('year', created_at) AS year, SUM(arr) AS arr
      FROM subscriptions GROUP BY 1 ORDER BY 1
```

  • certifiedQuery references a certified query by name. Define it under agents/certified_queries/ (or via Certify this query in chat). A reference that doesn't resolve to an existing certified query is flagged as a validation error.
  • sql is inline ground-truth SQL, run through the same Cube SQL API the agent uses (so MEASURE(...) and friends work).
  • Omitting both — or setting both — is a validation error.
  • An optional top-level space key scopes a file's questions to a named space (defaults to auto). Question names are unique per space.
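
The rules above can be sketched as a small check over an already-parsed question file. This is illustrative only and not Cube's validator: the function name `validate_questions`, its signature, and the error messages are hypothetical, and it checks name uniqueness within a single file rather than across a whole space.

```python
def validate_questions(doc: dict, certified_names: set) -> list:
    """Return a list of validation errors for one parsed question file.

    Hypothetical helper: the name, signature, and messages are illustrative.
    """
    errors = []
    seen = set()
    space = doc.get("space", "auto")  # optional top-level key, defaults to auto
    for q in doc.get("eval_questions", []):
        name = q.get("name", "<unnamed>")
        label = f"{space}/{name}"
        if "name" not in q or "question" not in q:
            errors.append(f"{label}: name and question are required")
        if name in seen:
            errors.append(f"{label}: duplicate question name")
        seen.add(name)
        has_certified = "certifiedQuery" in q
        has_sql = "sql" in q
        if has_certified == has_sql:
            # Both set, or both omitted.
            errors.append(f"{label}: set exactly one of certifiedQuery or sql")
        elif has_certified and q["certifiedQuery"] not in certified_names:
            errors.append(f"{label}: unknown certified query {q['certifiedQuery']!r}")
    return errors


doc = {
    "eval_questions": [
        {"name": "revenue_by_quarter", "question": "Revenue by quarter?",
         "certifiedQuery": "revenue_by_quarter"},
        {"name": "arr_last_4_years", "question": "ARR over 4 years?"},
    ]
}
for error in validate_questions(doc, {"revenue_by_quarter"}):
    print(error)
```

Here the second question omits both ground-truth keys, so it is the only one reported.
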
<Note>

While in preview, the Questions tab is a read-only view of these files. To add or edit questions, edit the YAML in the IDE — there's no in-product question editor yet.

</Note>

Running an evaluation

On the Evaluations tab, click New evaluation and choose:

  • Branch: which branch's data model and agent configuration to run against. Defaults to the active branch.
  • Agent: auto (the implicit auto-agent) or a configured agent name.

The run starts immediately and you can close the dialog — it executes in the background. The run list shows live progress and then the outcome:

| Column | Meaning |
| --- | --- |
| Evaluation name | When the run was created. |
| Environment | Where it ran: dev (your personal dev-mode branch, shown as "Name Dev Mode"), staging, or prod (the deploy branch, e.g. master or main). |
| Agent | The agent used. |
| Execution status | Running, Completed, or Failed. |
| Accuracy | NN% (passed/total). |
| Created by | Who triggered the run. |
| Last updated | When it finished. |

Reading the results

Open a run to see per-question results: the question list on the left, with a pass/fail icon for each, and the selected question's detail on the right.

  • Assessment: pass, fail, review, or error.
  • Score reason: a tag categorizing why a question didn't pass. One of Row count mismatch, Missing columns, Value mismatch, Unexpected rows, Query error, Ground truth query failed, Ground truth not found, or Agent error.
  • Failure analysis: a plain-English explanation, e.g. "The agent returned 3 rows, but the ground truth has 5 rows."
  • Model output · SQL vs. Ground truth SQL answer: the agent's query side by side with the ground truth, so you can spot the difference.
  • Response: the agent's full text answer, rendered as Markdown.

How grading works

Grading is execution-based, not text-based — the same approach used by industry text-to-SQL benchmarks such as BIRD and Spider 2.0. The agent's SQL and the ground-truth SQL are both executed, and their result sets are compared. So an answer that's worded or written differently but produces the same data still passes.

The comparison is:

  • Sort-invariant — row order never matters.
  • Numeric-tolerant — values are compared to 4 significant figures, so float/representation noise (6646 vs. 6646.0) doesn't fail.
  • Column-name-agnostic and lenient on extra columns — each ground-truth column must be reproduced by some agent column, matched by its values, so revenue vs. total aliases don't matter. Extra columns the agent adds are ignored.
  • No standalone row-count gate — row count falls out of the comparison: a "top 5" question is enforced because the golden result has exactly 5 rows.
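
To make the comparison rules concrete, here is a minimal sketch of an execution-based result-set check. It is not Cube's grader: the names `normalize` and `results_match` are hypothetical, result sets are assumed to be lists of dicts with uniform keys, and rounding each value independently to 4 significant figures is just one way to read the numeric tolerance.

```python
from collections import Counter
from itertools import product


def normalize(value):
    """Round numbers to 4 significant figures; leave other values as-is."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(f"{float(value):.4g}")
    return value


def column_values(rows, col):
    """Multiset of normalized values in one column (row order ignored)."""
    return Counter(normalize(row[col]) for row in rows)


def results_match(agent_rows, truth_rows):
    """True if every ground-truth column is reproduced by some agent column."""
    truth_cols = list(truth_rows[0]) if truth_rows else []
    agent_cols = list(agent_rows[0]) if agent_rows else []
    # For each ground-truth column, find agent columns holding the same values.
    # Column names are never compared, and unmatched agent columns are ignored.
    candidates = [
        [a for a in agent_cols
         if column_values(agent_rows, a) == column_values(truth_rows, t)]
        for t in truth_cols
    ]
    truth_bag = Counter(
        tuple(normalize(row[t]) for t in truth_cols) for row in truth_rows
    )
    # Try each column assignment and compare whole rows as multisets, so
    # values must line up within rows and the row counts must agree.
    for mapping in product(*candidates):
        agent_bag = Counter(
            tuple(normalize(row[a]) for a in mapping) for row in agent_rows
        )
        if agent_bag == truth_bag:
            return True
    return False


truth = [{"year": 2023, "arr": 6646}, {"year": 2024, "arr": 7100.0}]
agent = [
    {"yr": 2024, "total": 7100, "note": "extra"},
    {"yr": 2023, "total": 6646.0, "note": "extra"},
]
print(results_match(agent, truth))  # True
```

The example passes despite different row order, different aliases, an extra column, and 6646 vs. 6646.0. There is no explicit row-count check: a missing or extra row simply makes the multisets unequal.
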

Verdicts:

| Verdict | When |
| --- | --- |
| pass | The agent's result set matches the ground truth. |
| fail | It ran but the result set doesn't match (see the score reason). |
| review | Nothing to compare automatically: the question has no ground truth, or the agent didn't run a query. Compare manually. |
| error | The agent run failed, the ground-truth query failed, or a referenced certified query wasn't found. |
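
The verdict table can be read as a simple decision function. The sketch below is hypothetical: the function and parameter names are made up for illustration, and checking for errors before the review case is an assumption about precedence, not documented behavior.

```python
def verdict(*, agent_error=False, truth_error=False, has_truth=True,
            agent_ran_query=True, results_match=False) -> str:
    """Map the outcome of one graded question to a verdict.

    truth_error stands for either a failed ground-truth query or a
    certified-query reference that was not found.
    """
    if agent_error or truth_error:
        return "error"
    if not has_truth or not agent_ran_query:
        return "review"  # nothing to compare automatically
    return "pass" if results_match else "fail"


print(verdict(results_match=True))      # pass
print(verdict(agent_ran_query=False))   # review
```
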

Preview limitations

  • Evals must be activated for your account by the Cube support team.
  • Questions are authored as code only; the Questions tab is read-only.
  • Very large question sets can be slow to run.
  • Grading is execution-based on the result set; it does not semantically judge prose answers.