This page explains one thing:
what the left side edits, and what the right side proves.
Once that boundary is clear, the buttons become much easier to understand.
| Action | Where it happens | Main focus | Does it modify the left workspace? |
|---|---|---|---|
| Analysis | Left side | prompt structure, clarity, constraints | no, but it can suggest edits for the workspace |
| Optimize / Iterate | Left side | rewrite or improve the prompt directly | yes |
| Test | Right side | real execution output | no |
| Result Evaluation | one right-side column | whether this one execution reached the goal | no, but it can suggest edits for the workspace |
| Compare Evaluation | multiple right-side columns | differences across real outputs | no, but it can suggest edits for the workspace |
Left-side analysis asks: “Is this prompt written clearly enough?” It focuses on the prompt itself: its structure, clarity, and constraints.

Right-side evaluation asks: “How good was this real execution?” It focuses on the real output produced in a result column, together with the inputs that produced it.
To keep these two roles from blurring, left-side analysis does not treat right-side test input as evidence. That means analysis judges only what is written in the left workspace.

If you want to judge whether a prompt actually worked on a real result, use right-side evaluation.
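One way to picture the boundary is as two different checks fed by two different inputs. The sketch below is only an illustration, and every name in it (ExecutionRecord, analyze_prompt, evaluate_result) is hypothetical rather than part of the product: left-side analysis sees nothing but the editable prompt text, while right-side evaluation sees a full execution record from one result column.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecutionRecord:
    """Everything one right-side result column captured during a test run."""
    prompt_version: str        # e.g. "Workspace", "v1", "v2"
    test_input: Optional[str]  # test message, variable values, input images, ...
    output: str                # the real output shown in the column


def analyze_prompt(workspace_text: str) -> list[str]:
    """Left side: suggestions based only on what is written in the workspace."""
    raise NotImplementedError


def evaluate_result(record: ExecutionRecord, goal: str) -> str:
    """Right side: judges one real execution against the stated goal."""
    raise NotImplementedError
```

The split signatures make the rule above concrete: analyze_prompt cannot see test input at all, so it can never mistake a test message for the prompt you are editing.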
| Workspace | Main right-side test input | Most important evidence during evaluation |
|---|---|---|
| System Prompt Workspace | one test message | system prompt + test message + output |
| User Prompt Workspace | usually no extra input | executed prompt + output |
| Variable Workspace | shared variable form | executed prompt + variable values + output |
| Context Workspace | full conversation + shared variables + optional tools | full execution snapshot + output |
| Text-to-Image Workspace | image model | prompt version + image model + real generated image |
| Image-to-Image Workspace | input image + image model | input image + prompt version + real generated image |
| Multi-Image Workspace | ordered input images + image model | image set / image order + prompt version + real generated image |
Use Result Evaluation when you want to judge one column on its own. The typical question is: did this single execution reach the goal?
Use Compare Evaluation when you already have two or more columns and want to compare their differences. The typical question is: what actually differs across these real outputs, and which is closer to the goal?
Compare Evaluation compares real output evidence, not version labels.
For image workspaces, remember one extra rule: the input image (and, for multi-image workspaces, the order of the images) is part of the evidence. So if you change the input image, or change the order of multi-image inputs, and then run a compare evaluation, the conclusion can become misleading very quickly.
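As a mental model (not the actual implementation, and with made-up names such as ImageColumn and same_image_baseline), you can think of compare evaluation on image workspaces as only being meaningful when the compared columns share the same input images in the same order:

```python
from dataclasses import dataclass, field


@dataclass
class ImageColumn:
    """One right-side result column in an image workspace."""
    prompt_version: str
    input_images: list[str] = field(default_factory=list)  # image ids, in order


def same_image_baseline(columns: list[ImageColumn]) -> bool:
    """True only if every column used the same input images in the same order.

    Assumes at least one column is present.
    """
    baseline = columns[0].input_images
    return all(col.input_images == baseline for col in columns)
```

If this check would fail, the differences you see in the generated images may come from the inputs, not from the prompt versions you intended to compare.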
The Workspace option on the right means the current editable content on the left.
It is not the same as “latest saved version”.
Think of it like this:

- Workspace: the content you are editing on the left right now
- v1 / v2 / v3: saved versions

Evaluation dialogs can include an optional Focus Brief.
If you describe the concern you care about most, the evaluation will prioritize that concern instead of returning a generic summary.
Evaluation suggestions are not bound to one version branch. The rule is that suggestions are drawn from the real output evidence and applied to whatever is currently editable in the left workspace, regardless of which saved version produced that output.
A compare run works on 2-4 real columns on the right.

When you use Run All, the available result columns are started in parallel where possible. This makes comparison setup faster, but the evaluation rule stays the same: compare outputs that share the same prompt/model baseline unless you intentionally want to test that variable.
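Conceptually, Run All behaves like the small sketch below, where run_column is a hypothetical stand-in for whatever actually executes one column: start every available column, wait for all of them, and only then compare the ones that share a baseline.

```python
from concurrent.futures import ThreadPoolExecutor


def run_all(columns, run_column):
    """Start every available result column in parallel and collect the outputs
    in their original order so they can be compared afterwards."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_column, columns))
```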