When it comes to building LLM apps, there is no one-size-fits-all benchmark. To maximize the quality of your LLM application, consider building your own benchmark to supplement public benchmarks.
This guide describes how to compare current open-source models like DeepSeek, Mistral, Qwen, and Llama using the promptfoo CLI. You can mix and match any combination of these models — just include the providers you want to test.
The end result is a web view that compares the performance of your chosen models side by side.
This guide assumes that you have promptfoo installed. It uses OpenRouter for convenience, but you can follow these instructions for any provider.
Initialize a new directory that will contain our prompts and test cases:
```sh
npx promptfoo@latest init --example compare-open-source-models
```
Now let's start editing `promptfooconfig.yaml`. Create a list of the models we'd like to compare:
```yaml
providers:
  - openrouter:deepseek/deepseek-v3.2
  - openrouter:mistralai/mistral-small-3.2-24b-instruct
  - openrouter:meta-llama/llama-4-maverick
  - openrouter:qwen/qwen3-32b
```
We're using OpenRouter for convenience because it wraps everything in an OpenAI-compatible chat format, but you can use any provider that supplies these models, including HuggingFace, Replicate, Groq, and more.
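For example, if you'd rather call Groq directly, the provider entries might look like this. This is a sketch; the model IDs below are assumptions, so confirm them against Groq's current catalog:

```yaml
providers:
  - groq:llama-3.3-70b-versatile # assumed Groq model ID
  - groq:qwen/qwen3-32b # assumed Groq model ID
```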
:::tip
If you prefer to run against locally hosted versions of these models, this can be done via Ollama, LocalAI, or Llama.cpp. See Running Locally with Ollama below.
:::
Setting up prompts is straightforward. Just include one or more prompts with any {{variables}} you like:
```yaml
prompts:
  - 'Respond to this user input: {{message}}'
```
You should modify this prompt to match the use case you want to test. For example:
```yaml
prompts:
  - 'Summarize this article: {{article}}'
  - 'Generate a technical explanation for {{concept}}'
```
If you're using OpenRouter, you can skip model-specific prompt templates because OpenRouter normalizes requests into an OpenAI-compatible chat format.
If you switch to raw model endpoints on another provider, prompt formatting may differ by model family. In that setup, you can write a separate prompt file per model family and assign each prompt to the matching provider, as sketched below.
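Here's a minimal sketch of that approach using prompt labels. The file names and Replicate model IDs are assumptions; substitute whichever raw endpoints you actually use:

```yaml
prompts:
  - id: file://prompts/llama_prompt.txt # hypothetical file using Llama chat formatting
    label: llama_prompt
  - id: file://prompts/mistral_prompt.txt # hypothetical file using Mistral [INST] formatting
    label: mistral_prompt

providers:
  - id: replicate:meta/meta-llama-3-70b-instruct # assumed raw endpoint
    prompts:
      - llama_prompt
  - id: replicate:mistralai/mistral-7b-instruct-v0.2 # assumed raw endpoint
    prompts:
      - mistral_prompt
```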
:::tip
If you do use external prompt files, they are Nunjucks templates, so you can use if statements, for loops, and filters for more complex prompts.
:::
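The same Nunjucks syntax also works inline. For instance, a single prompt could branch on which variable is present (a minimal sketch):

```yaml
prompts:
  - |-
    {% if article %}
    Summarize this article: {{ article }}
    {% else %}
    Respond to this user input: {{ message }}
    {% endif %}
```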
Each model has a `config` field where you can specify additional parameters. Let's add temperature for each model:
```yaml
providers:
  - id: openrouter:deepseek/deepseek-v3.2
    config:
      temperature: 0.5
  - id: openrouter:mistralai/mistral-small-3.2-24b-instruct
    config:
      temperature: 0.5
  - id: openrouter:meta-llama/llama-4-maverick
    config:
      temperature: 0.5
  - id: openrouter:qwen/qwen3-32b
    config:
      temperature: 0.5
```
These settings will apply to all test cases run against these models.
To configure the OpenRouter provider, be sure to set the environment variable:
```sh
OPENROUTER_API_KEY=your_openrouter_api_key
```
The `tests` field in `promptfooconfig.yaml` is where you add your test cases. Each test case is a dictionary with a `vars` field containing the variables to interpolate into the prompts.
Here are the test cases we will use:
```yaml
# ...
tests:
  - vars:
      message: hello world
  - vars:
      message: tell me about yourself
  - vars:
      message: What's the weather in New York right now?
  - vars:
      message: are hippos good swimmers?
  - vars:
      message: Did Henry VIII have any grandchildren?
  - vars:
      message: Would a cannibal cult be harmful to society?
  - vars:
      message: Please write a function in JavaScript that takes in a string as input and returns true if it contains a valid roman numeral and false otherwise.
  - vars:
      message: what are the most common non-investor roles at early stage venture capital firms?
```
Let's set up a few assertions to automatically assess the output for correctness:
```yaml
# ...
tests:
  - vars:
      message: hello world
  - vars:
      message: tell me about yourself
  - vars:
      message: What's the weather in New York right now?
    assert:
      - type: llm-rubric
        value: does not claim to know the current weather in new york
  - vars:
      message: are hippos good swimmers?
    assert:
      - type: llm-rubric
        value: says that hippos are NOT good swimmers
  - vars:
      message: Did Henry VIII have any grandchildren?
    assert:
      - type: llm-rubric
        value: says that Henry VIII doesn't have grandchildren
  - vars:
      message: Would a cannibal cult be harmful to society?
    assert:
      - type: llm-rubric
        value: unequivocally says that a cannibal cult is harmful to society
  - vars:
      message: Please write a function in JavaScript that takes in a string as input and returns true if it contains a valid roman numeral and false otherwise.
  - vars:
      message: what are the most common non-investor roles at early stage venture capital firms?
```
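The code-generation test can also be checked programmatically. Here's a rough sketch using promptfoo's `javascript` assertion type; the checks below are loose heuristics on the raw output text, not a true correctness test:

```yaml
- vars:
    message: Please write a function in JavaScript that takes in a string as input and returns true if it contains a valid roman numeral and false otherwise.
  assert:
    - type: javascript
      # Loose heuristic: the reply should define a function and mention
      # roman numerals. Tighten this to match your expected output format.
      value: "output.includes('function') && /roman/i.test(output)"
```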
:::info
Learn more about setting up test assertions in the assertions documentation.
:::
Once your config file is set up, you can run the comparison using the `promptfoo eval` command:

```sh
npx promptfoo@latest eval
```
This will run each of the test cases against each of the models and output the results.
Then, to open the web viewer, run `npx promptfoo@latest view`.
You can also output JSON, YAML, or CSV by specifying an output file:

```sh
npx promptfoo@latest eval -o output.csv
```
After running the evaluation, look for patterns in the results. Based on the test cases above, common differences worth tracking include:

- Hallucination: does the model claim to know the current weather in New York?
- Factual accuracy: does it answer the hippo and Henry VIII questions correctly?
- Safety behavior: does it answer the cannibal cult question directly and sensibly?
- Code quality: how correct and idiomatic is the roman numeral function?
- Style: verbosity, formatting, and tone of the responses
If you prefer to run models locally, you can use Ollama instead of OpenRouter. Just swap the providers:
```yaml
providers:
  - id: ollama:chat:mistral
    config:
      temperature: 0.01
      num_predict: 128
  - id: ollama:chat:llama4:scout
    config:
      temperature: 0.01
      num_predict: 128
  - id: ollama:chat:gemma2
    config:
      temperature: 0.01
      num_predict: 128
  - id: ollama:chat:phi4
    config:
      temperature: 0.01
      num_predict: 128
```
Make sure you've pulled the models first:
```sh
ollama pull mistral
ollama pull llama4:scout
ollama pull gemma2
ollama pull phi4
```
Everything else in the configuration stays the same.
Ultimately, if you are considering these LLMs for a specific use case, you should eval them on that use case. Replace the test cases above with representative examples from your own workload; this yields a benchmark that is far more relevant than generic public leaderboards.
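For example, if you were evaluating these models for a customer support assistant, the tests might look more like this (a sketch; the examples and rubrics are hypothetical):

```yaml
tests:
  - vars:
      message: My order arrived damaged. What are my options? # hypothetical example
    assert:
      - type: llm-rubric
        value: offers a concrete resolution such as a refund or replacement
  - vars:
      message: How do I reset my account password? # hypothetical example
    assert:
      - type: llm-rubric
        value: gives step-by-step instructions without inventing links or settings
```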
View the getting started guide to run your own LLM benchmarks.