mkdocs/docs/en/user/model-testing-strategy.md
This page does not explain provider fields. It answers two questions:
The left-side model is responsible for:
Prioritize:
It does not have to match your production model exactly.
The right-side model is responsible for:
If you already know your target model, use it on the right side first.
If you want to know whether a prompt change actually helped, compare versions first:
original / workspace / vN
If you want to know whether the same prompt is stable across models, compare models:
The least helpful starting point is changing both at once:
If both change together, it becomes hard to tell what actually caused the change.
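The one-variable-at-a-time rule can be sketched as a small check. This is only an illustration: `RunConfig`, its field names, and the model names in the usage note are hypothetical and are not part of the application.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunConfig:
    """One side of a comparison. The field names are hypothetical."""
    prompt_version: str  # e.g. "original", "workspace", or "v3"
    model: str           # whichever model runs the test


def changed_fields(a: RunConfig, b: RunConfig) -> list[str]:
    """Return the names of the fields whose values differ between two runs."""
    return [f.name for f in fields(RunConfig) if getattr(a, f.name) != getattr(b, f.name)]


def is_clean_comparison(a: RunConfig, b: RunConfig) -> bool:
    """A comparison is easy to interpret only when exactly one field changes."""
    return len(changed_fields(a, b)) == 1
```

For example, comparing `RunConfig("original", "model-a")` with `RunConfig("v2", "model-a")` changes only the prompt version, so any difference in output can be attributed to the prompt. Changing both fields at once would fail the check.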
When comparing prompt versions, keep the variable values the same.
When comparing versions of a single target message, keep the rest of the conversation context unchanged.
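One way to confirm that two runs share the same baseline is to fingerprint everything except the message under test. The sketch below assumes variables are a plain dictionary and the conversation is a list of role/content dictionaries; `context_fingerprint` and `target_index` are hypothetical names used only for illustration.

```python
import hashlib
import json


def context_fingerprint(variables: dict, messages: list, target_index: int) -> str:
    """Hash the variable values and every message except the one being compared."""
    # Leave out the target message: it is the one thing allowed to change.
    surrounding = [m for i, m in enumerate(messages) if i != target_index]
    payload = json.dumps(
        {"variables": variables, "context": surrounding},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

If two runs produce the same fingerprint, the variable values and surrounding conversation matched, so differences in output come from the target message version rather than from its context.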
Image workspaces differ from text workspaces because the left and right sides already use different model types.
The left side still uses a text model for:
The right side uses an image model for:
original / workspace / vN
Keep the same input image whenever possible. If the input image changes, your comparison baseline changes too.
In addition to keeping the prompt and model fixed, try to keep these stable too:
If you change the order of image 1 / image 2 / image 3 without updating the prompt, your comparison becomes unreliable very quickly.
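A quick way to catch an accidental reorder is to compare the ordered list of input images between two runs. This is a minimal sketch that assumes each image can be identified by a file name or similar label; `reordered_positions` is a hypothetical helper, not an application feature.

```python
def reordered_positions(baseline: list[str], candidate: list[str]) -> list[int]:
    """Return the 1-based positions whose image differs between two runs."""
    length = max(len(baseline), len(candidate))
    changed = []
    for i in range(length):
        before = baseline[i] if i < len(baseline) else None
        after = candidate[i] if i < len(candidate) else None
        if before != after:
            # Positions are 1-based so they line up with "image 1", "image 2", ...
            changed.append(i + 1)
    return changed
```

Any position returned here is one where a prompt phrase such as "image 1" now points at a different picture, so either restore the order or update the prompt before comparing results.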
Before comparing versions, a safer sequence is:
original / workspace / vN
If you mainly connect to public HTTPS APIs, the browser version is usually enough.
If you mainly connect to local or internal services that are affected by browser restrictions, the desktop app is usually more reliable.
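The choice between the two can be approximated from the API base URL. The rules below are assumptions made for illustration (localhost, loopback or private IP addresses, and plain HTTP are treated as likely to hit browser restrictions); `suggest_client` is a hypothetical helper and the URLs in the usage note are placeholders.

```python
import ipaddress
from urllib.parse import urlparse


def suggest_client(base_url: str) -> str:
    """Return "browser" or "desktop" as a rough suggestion for a given API URL."""
    parsed = urlparse(base_url)
    host = parsed.hostname or ""

    # Assumption: a service on this machine counts as "local".
    if host == "localhost":
        return "desktop"

    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        address = None  # The host is a domain name, not an IP literal.

    # Assumption: loopback and private ranges count as local or internal.
    if address is not None and (address.is_loopback or address.is_private):
        return "desktop"

    # Public HTTPS endpoints are the case where the browser version is usually enough.
    if parsed.scheme == "https":
        return "browser"

    # Assumption: anything else (for example plain HTTP) is safer from the desktop app.
    return "desktop"
```

For instance, `suggest_client("https://api.example.com/v1")` suggests the browser version, while `suggest_client("http://localhost:11434")` suggests the desktop app. Treat the result as a starting point, since an internal service behind a public-looking domain name would not be detected by this heuristic.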