site/blog/search-rubric-assertions.md
In mid-2025, two U.S. federal judges withdrew or corrected written opinions after lawyers noticed that the decisions quoted cases and language that did not exist. In one chambers, draft research produced using generative AI had slipped into a published ruling. (Reuters)
None of these errors looked obviously wrong on the page. They read like normal legal prose until someone checked the underlying facts.
This is the core problem: LLMs sound confident even when they are wrong or stale. Traditional assertions can check format and style, but they cannot independently verify that an answer matches the world right now.
Promptfoo's new search-rubric assertion does that. It lets a separate "judge" model with web search verify time-sensitive facts in your evals.
A few years ago, "does this answer look reasonable" was often good enough. Today, people use LLMs for research, customer-facing answers, and production code, where a stale or invented fact has real consequences.

Models trained on 2024 or early 2025 data will happily answer questions about stock prices, software versions, current officeholders, and the weather. But you cannot trust them to know what is true right now.
You need a way to systematically check that kind of answer against the web while you run evals and CI.
That is what search-rubric is for.
## What search-rubric actually does

Conceptually, search-rubric is llm-rubric plus a search-enabled judge model.
At a high level:
"Provides the current AAPL stock price within 2% and includes the currency."{ pass: boolean, score: number, reason: string }.You do not write any of the search logic yourself. You just describe what "correct enough" means.
```yaml
prompts:
  - 'What is the current stock price of {{company}}?'

providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      company: Apple
      ticker: AAPL
    assert:
      - type: search-rubric
        value: |
          States the current {{ticker}} stock price that:
          1. Is within 3% of the actual market price
          2. Includes the currency (USD or $)
          3. Mentions if the market is currently open or closed
        threshold: 0.8
```
When this runs, Promptfoo uses a separate search-enabled model as the grader. If the system under test (SUT) hallucinates or returns a stale training-data price, the assertion fails with an explanation.
What to expect: Models like gpt-4o-mini without web search will often refuse to answer real-time questions ("I don't have access to real-time data"). The search-rubric grader correctly flags this as a failure since no actual price was provided. To test models that confidently answer (and potentially hallucinate), use a more capable model or one with web search enabled as the SUT.
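For example, you could point the providers list at a search-enabled SUT so it can actually fetch live data. This is a minimal sketch; the provider ID and web_search_preview tool follow the grading example later in this post, and the exact config depends on your account and model access.

```yaml
# Sketch: a search-enabled system under test (provider ID and tool name
# mirror the grading example later in this post; adjust for your setup).
providers:
  - id: openai:responses:gpt-5.1
    config:
      tools:
        - type: web_search_preview
```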
For people who care about the plumbing, take an assertion like:
```yaml
- type: search-rubric
  value: 'Names Satya Nadella as the current CEO of Microsoft'
```
Promptfoo grades it with a judge that has web search configured, preferring providers in this order:

- grading.provider, if set
- anthropic:messages:claude-opus-4-6 with web_search_20250305
- openai:responses:gpt-5.1 with web_search_preview
- google:gemini-3-pro-preview with googleSearch
- perplexity:sonar-pro (built-in search)
- xai:responses:grok-4.3 with the web_search tool enabled

Internally, Promptfoo uses a web-search-aware rubric prompt that looks roughly like:
```
You are grading output according to a user-specified rubric. You may search
the web to check current information. Respond with:

{ "reason": string, "pass": boolean, "score": number }
```
The model receives:
```
<Output>
{{output}}
</Output>

<Rubric>
{{rubric}}
</Rubric>
```
The prompt instructs the grader to call web search whenever the rubric references current or time-sensitive information. The grader's verdict maps onto the assertion result:

- pass is a boolean decision.
- score is a 0.0-1.0 confidence score.
- You can set a threshold on the score.
- reason and optional search metadata are stored on the assertion.

This is deliberately simple. The judge model is an agent with exactly one job: check whether the answer is consistent with reality as seen on the web, under a rubric you define.
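As a concrete illustration of the threshold knob (the rubric text and the 0.9 value here are illustrative, not from a real suite), a strict time-sensitive check might look like:

```yaml
# Illustrative: require a high-confidence pass from the search-enabled grader.
assert:
  - type: search-rubric
    value: 'States the current USD to EUR exchange rate within 2% of the actual rate'
    threshold: 0.9 # fail the assertion unless the grader's score is at least 0.9
```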
## search-rubric vs llm-rubric

You should not turn on search for every test. It adds latency and cost. Use it where the world moves fast.
| Use case | Prefer llm-rubric | Prefer search-rubric |
|---|---|---|
| Tone, UX copy, narrative quality | ✓ | |
| Prompt adherence, safety, style checks | ✓ | |
| Static APIs, math, pure reasoning | ✓ | |
| Stock prices, FX, crypto | | ✓ |
| Current weather and travel conditions | | ✓ |
| Latest software versions (Node, React) | | ✓ |
| Case citations and regulations | | ✓ |
| "Who won...?" style news questions | | ✓ |
A practical pattern is:
- llm-rubric for most qualitative checks.
- search-rubric only for tests that intentionally touch the outside world.

A combined setup might look like the sketch below.
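For instance, a single test can pair a non-search llm-rubric check for tone with a search-rubric check for the one fact that actually changes. The rubric wording here is illustrative.

```yaml
# Illustrative: qualitative check graded without search, factual check graded with it.
assert:
  - type: llm-rubric
    value: 'Answers in a friendly, concise tone'
  - type: search-rubric
    value: 'Names the Node.js release line that is currently in Active LTS'
    threshold: 0.8
```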
Here are some real-world patterns where hallucinations hurt, and how search-rubric handles them.

```yaml
# Verify real-time S&P 500 data
assert:
  - type: search-rubric
    value: |
      Provides S&P 500 index value that:
      - Is within 1% of the current market value
      - States whether markets are open or closed
      - Mentions the time reference (for example, "as of 10:32 ET")
    threshold: 0.9
```
If the model grabs last Friday's close while markets are moving, the assertion fails and the grader explains why.
There are now multiple public cases of fake citations entering court filings and even judicial opinions through misuse of AI.
```yaml
assert:
  - type: search-rubric
    value: |
      Correctly describes Miranda v. Arizona including:
      - Accurate citation (384 U.S. 436)
      - Correct year (1966)
      - Core holding on the right to remain silent
```
If the answer invents a citation or misstates the holding, the search-enabled grader should catch it.
```yaml
assert:
  - type: search-rubric
    value: |
      States the FDA approval timeline for Leqembi that:
      - Notes accelerated approval in January 2023
      - Notes traditional approval in July 2023
      - Describes its use for early-stage Alzheimer's disease
    threshold: 0.9
```
Because the rubric encodes the expected timeline, the grader must confirm the dates against current FDA or reputable medical sources.
Node.js LTS moves quickly. As of late 2025, Node 24.x is the newest Active LTS release, and older LTS lines like 22.x and 20.x are in Maintenance LTS rather than the recommended track for new projects. (Node.js)
```yaml
assert:
  - type: search-rubric
    value: |
      Names a current Node.js LTS version and:
      - Identifies it as an LTS release
      - Does not recommend an end-of-life version
```
This catches answers like "Node 18 is the latest LTS" that look reasonable but are wrong in 2025.
You can use any provider as the grader, as long as it can both run web searches and return structured grading output.
By late 2025, all of the major model providers had some flavor of first-class web search or grounding API, each with its own pricing line item. That is great for capabilities, but it also means your evals need to understand when they are exercising these tools and whether they are returning current, correct information.
As of November 2025:

- Anthropic exposes the web_search_20250305 tool on the API. (Anthropic)
- OpenAI offers web search tools (web_search and web_search_preview) on the Responses API.
- Google grounds Gemini responses with the googleSearch tool.
- xAI enables live search through search_parameters.

Prices change quickly, so always check the provider's official pricing page before wiring this into a large CI suite.
You have two knobs: which model does the grading (grading.provider) and which search tool it is allowed to call (the tools list under providerOptions):
```yaml
prompts:
  - 'What is the weather in Tokyo right now?'

grading:
  provider: openai:responses:gpt-5.1
  providerOptions:
    config:
      tools:
        - type: web_search_preview

tests:
  - assert:
      - type: search-rubric
        value: |
          Describes current Tokyo weather including:
          - Temperature with units (C or F)
          - General conditions (for example, sunny, cloudy, rainy)
          - Any active weather warnings if present
```
If you do not specify a grading.provider, Promptfoo will try to pick a sensible default based on available API keys, working through the preference list above (Anthropic with web_search_20250305 first, then the other search-capable providers). If no search-capable provider can be found, search-rubric will throw a clear error instead of silently ignoring web search.
Every search-rubric assertion involves an extra judge-model call plus one or more web searches on top of the original model call, which adds latency and per-search cost to every CI run.

That sounds expensive, but you rarely need search for all tests. For example, a 100-test suite where 20 tests use search-rubric is usually a few dollars per run, even on top-tier models.
During development, you can enable caching:
```bash
promptfoo eval --cache
```
Promptfoo will reuse previous grading outputs so you do not pay or wait for repeated web searches while you iterate.
This is the part HN will reasonably worry about.
search-rubric is only as good as the search index behind your provider. You should spot-check the grader's reasons on a few answers you already know before trusting it in CI.
Rubrics like "within 5 percent of the current BTC price" or "names at least two recent vulnerabilities from the last year" force you to make your own tradeoffs explicit.
That is a feature, but it takes work.
A suite that hits web search 500 times on every CI run will cost real money. Start with a handful of critical paths, then expand.
The grader is still an LLM with its own failure modes. Search reduces hallucinations but does not eliminate them. Use threshold to require a high score for sensitive checks, and keep some non-LLM assertions in place.
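One way to hedge is to pair a strict search-rubric threshold with a deterministic assertion such as contains, so at least one check never depends on a model's judgment. The AAPL rubric below is illustrative.

```yaml
# Sketch: a deterministic check alongside a strict search-backed rubric.
assert:
  # Non-LLM assertion: fails fast if the ticker symbol is missing entirely.
  - type: contains
    value: 'AAPL'
  # Search-backed rubric with a high bar for sensitive checks.
  - type: search-rubric
    value: 'States the current AAPL stock price within 3% of the actual market price'
    threshold: 0.9
```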
From scratch:
```bash
npm install -g promptfoo@latest
# or, if you use npx
npx promptfoo init
```
Then add a simple search-backed check (saved here as simple-search-test.yaml):
```yaml
prompts:
  - 'The CEO of Microsoft is {{name}}'

providers:
  - id: openai:gpt-5.1

grading:
  provider: anthropic:messages:claude-opus-4-6
  providerOptions:
    config:
      tools:
        - type: web_search_20250305
          name: web_search
          max_uses: 5

tests:
  - vars:
      name: 'Satya Nadella'
    assert:
      - type: search-rubric
        value: 'Confirms that {{name}} is the current CEO of Microsoft'

  - vars:
      name: 'Bill Gates' # Intentionally wrong
    assert:
      - type: search-rubric
        value: 'States the correct current CEO of Microsoft and identifies this answer as incorrect'
        threshold: 0.8
```
Run it:
```bash
npx promptfoo eval -c simple-search-test.yaml
```
You will see not just pass or fail, but detailed reasons from the grading model about what it found on the web.
Search-backed grading is not a silver bullet. It will not stop people from misusing AI in production or copying answers blindly into court filings.
What it does give you is a repeatable way to say:
"For this class of prompts, the answers are checked against the real world every time we run CI."
That turns "trust me, it usually works" into something closer to an actual contract.
You can read the full configuration reference in the Search-Rubric documentation. If you ship anything where incorrect real-world facts cost money, reputation, or legal risk, it is worth wiring at least a handful of these tests into your pipeline.
<script type="application/ld+json" dangerouslySetInnerHTML={{__html: ` { "@context": "https://schema.org", "@type": "TechArticle", "headline": "Real-Time Fact Checking for LLM Outputs", "datePublished": "2025-11-28", "author": { "@type": "Person", "name": "Michael" }, "keywords": "LLM testing, web search, fact checking, real-time verification", "description": "Promptfoo's search-rubric assertion uses models with web search to verify time-sensitive facts like stock prices, weather, software versions, and legal citations during testing." } `}} />