site/blog/rlvr-explained.md
If your model can solve a problem in 8 tries, RLVR trains it to succeed in 1 try. Recent research shows this is primarily search compression, not expanded reasoning capability. Training concentrates probability mass on paths the base model could already sample.
This matters because you need to measure what you're actually getting. Most RLVR gains come from sampling efficiency, with a smaller portion from true learning. This guide covers when RLVR works, three critical failure modes, and how to distinguish compression from capability expansion.
<!-- truncate -->

RLVR replaces learned reward models with programmatic verifiers:
```python
# Simplified example - production code needs error handling
from typing import Any

def verifier(output: str, ground_truth: Any) -> float:
    """Returns 1.0 if correct, 0.0 if incorrect."""
    return 1.0 if check_correctness(output, ground_truth) else 0.0
```
This approach eliminates reward model training (skipping weeks of work on preference pairs) and provides deterministic feedback (same input always produces the same reward). You get fast iteration because verifier logic changes don't require retraining. Verifiers are scalable if engineered carefully, though SQL execution and unit tests can take seconds.
| Method | Reward Signal | Best For | Major Limitation |
|---|---|---|---|
| RLHF | Human preferences | Subjective quality | Expensive, slow |
| DPO | Preference pairs | Style, tone | Needs good pairs |
| RLVR | Programmatic check | Verifiable tasks | Needs verifiers |
Note: This comparison focuses on post-training methods for reasoning models. Other approaches like RLAIF and Constitutional AI use different paradigms.
1. Generate K candidate solutions for each prompt
   - Input: "What is 37 × 29?"
   - Outputs: [1073, 1072, 1073, 1074, 1073, 1071, 1073, 1073]
2. Verify each output
   - Rewards: [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0]
3. Update policy to favor high-reward trajectories (using GRPO or a similar algorithm)
4. Repeat with new prompts
If 5 out of 8 attempts succeed, the model learns which reasoning paths led to correct answers.
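Putting those steps together, a minimal sketch of one training iteration might look like this; `model.generate`, `verifier`, and `grpo_update` are hypothetical stand-ins for your generation, verification, and policy-update code:

```python
# Minimal sketch of one RLVR iteration (not a production training loop).
# `model.generate`, `verifier`, and `grpo_update` are hypothetical helpers.
def rlvr_step(model, prompt, ground_truth, k=8):
    outputs = [model.generate(prompt) for _ in range(k)]    # 1. sample K candidates
    rewards = [verifier(o, ground_truth) for o in outputs]  # 2. verify each output
    baseline = sum(rewards) / len(rewards)                  # group-relative baseline
    advantages = [r - baseline for r in rewards]            # 3. favor above-average rollouts
    grpo_update(model, prompt, outputs, advantages)         # policy-gradient update
    return baseline                                         # fraction of correct attempts
```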
RLVR works where ground truth exists. It fails for creative writing, brand voice, or nuanced argumentation. Human preference data remains superior for subjective quality. RLVR is standard reinforcement learning with deterministic reward functions (the technique isn't new, but applying it to LLM post-training at scale is).
- Databricks Text2SQL: 73.5% BIRD test accuracy reported in July; a later paper reports 75.68% with few-sample self-consistency. Both use execution-based verifiers, not pattern matching.
- DeepSeek R1: scales GRPO with rule-based rewards (format compliance, verifiable correctness) for math, code, and logic. Details are in the R1 paper and the Nature write-up.
- OpenAI o3 and o4-mini: the April 16, 2025 release emphasizes scaling RL and tool use, with strong results on AIME, SWE-bench, and Codeforces. OpenAI's public materials do not provide HumanEval deltas.
- Compression vs capability: Tsinghua finds RLVR mostly improves sampling efficiency rather than expanding the reasoning boundary; Scale formalizes "self-distillation" vs "capability gain."
Your model will learn to cheat any test that isn't comprehensive.
A verifier catching 60% of errors creates a 40% gap. Models find and exploit these gaps.
Real Example from SQL Generation:
```python
# Weak verifier: only checks syntax
def weak_sql_verifier(sql_query):
    try:
        parse(sql_query)  # Just parses, doesn't execute
        return 1.0
    except Exception:
        return 0.0

# Result: models generate syntactically valid but wrong queries
# "SELECT * FROM users WHERE 1=1" gets reward 1.0
```
Strong verifier (execution-based):

```python
def strong_sql_verifier(sql_query, expected_results, db_connection):
    try:
        actual = db_connection.execute(sql_query).fetchall()
        expected_set = set(map(tuple, expected_results))
        actual_set = set(map(tuple, actual))
        return 1.0 if actual_set == expected_set else 0.0
    except Exception:
        return 0.0
```
Mitigation: Build adversarial test suites for verifiers and measure false negative rates on known-bad outputs. Add secondary checks like intent verification and format validation to catch what execution testing misses.
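One concrete way to measure that false-negative rate is to keep a set of outputs you already know are wrong and count how many the verifier still rewards. A minimal sketch, assuming a hypothetical `known_bad` list of (output, ground_truth) pairs:

```python
# Sketch: fraction of known-bad outputs the verifier fails to flag.
# `known_bad` is a hypothetical list of (output, ground_truth) pairs
# that should all receive reward 0.0.
def verifier_miss_rate(verifier, known_bad):
    missed = sum(1 for output, truth in known_bad if verifier(output, truth) > 0.5)
    return missed / len(known_bad)
```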
Your gains might be an accidental side effect of training, not a result of your verifier.
Research from June 2025 found Qwen2.5-Math-7B improved 21.4% on MATH-500 with random rewards, nearly matching the 29.1% gain from ground truth rewards.
The RL update process, even with random rewards, implicitly guides the model's attention. The model isn't learning from the random reward; the training process itself encourages exploring and refining certain internal pathways. In Qwen's case, "code reasoning" (thinking in code without execution) becomes more frequent (65% → 90%).

These effects were strongest on Qwen2.5-Math and did not consistently replicate on Llama3 or OLMo2. Later research suggests Qwen's unusual sensitivity may indicate training data contamination rather than genuine capability surfacing. On contamination-free datasets, only accurate rewards deliver gains; random rewards provide no benefit. Always validate RLVR gains on held-out, distribution-shifted test sets.
You can validate your verifier with this test:
```python
import random

def random_baseline_test(model, dataset, real_verifier):
    # Train with the real verifier
    real_results = train_rlvr(model, dataset, real_verifier)
    # Compute the reward hit-rate p under the real verifier
    real_rewards = [real_verifier(ex.output, ex.answer) for ex in dataset]
    p = sum(real_rewards) / len(real_rewards)
    # Train again with random rewards matching that base rate
    random_verifier = lambda output, answer: 1.0 if random.random() < p else 0.0
    random_results = train_rlvr(model, dataset, random_verifier)
    # If random rewards produce meaningful gains, your verifier isn't the cause
    if random_results['improvement'] > 0.05:
        print("⚠️ WARNING: Spurious reward sensitivity detected")
        print(f"Random-reward gain: {random_results['improvement']:.2f}, "
              f"real-reward gain: {real_results['improvement']:.2f}")
        print("Test across multiple model families (Llama, OLMo, Qwen)")
        return False
    return True
```
Some "RLVR gains" are artifacts of the training process. Validate on held-out data from different distributions. Test across multiple model families.
Entropy collapse is the silent killer of RLVR generalization.
Recent research shows that as GRPO training progresses and entropy declines, in-distribution test accuracy rises while out-of-distribution performance deteriorates. The model isn't learning generalizable reasoning patterns; it's overfitting to the training distribution. When entropy collapses too early, the policy becomes trapped in narrow reasoning modes that succeed on training data but fail on novel problems.
Value-free methods like GRPO are particularly vulnerable because they use batch statistics as baselines. Using robust baselines like medians instead of means helps prevent instability when reward distributions have outliers.
Monitor these metrics:
```python
import numpy as np

def check_training_health(training_log):
    for checkpoint in training_log:
        rewards = checkpoint['rewards']
        entropy = checkpoint['entropy']
        kl_div = checkpoint['kl_divergence']
        if np.std(rewards) < 0.1:
            print("⚠️ Reward variance collapsed")
        if entropy < 2.0:  # Domain-specific threshold
            print("⚠️ Entropy too low (mode collapse)")
        if kl_div > 10.0:
            print("⚠️ KL divergence exploding")
```
RLVR uses standard RL with deterministic reward functions. Your choice of algorithm determines stability and efficiency.
GRPO (Group Relative Policy Optimization):
The advantage calculation uses group statistics as the baseline:
A_i = R(s_i, a_i) - baseline(R_group)
Where R(s_i, a_i) is the verifier's reward (0.0 or 1.0), and the baseline is computed from all samples in the batch (typically mean or median). Outputs with above-average rewards get positive advantages; below-average get negative. The policy gradient then increases probability of high-advantage trajectories.
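As a rough sketch (not the full GRPO update, which also normalizes by the group's standard deviation and applies a clipped, KL-regularized objective), the group-relative advantages for a batch of verifier rewards can be computed like this; the `use_median` flag reflects the robust-baseline variant mentioned earlier:

```python
import numpy as np

# Rough sketch of group-relative advantages from binary verifier rewards.
def group_relative_advantages(rewards, use_median=False):
    rewards = np.asarray(rewards, dtype=float)        # e.g. [1, 0, 1, 0, 1, 0, 1, 1]
    baseline = np.median(rewards) if use_median else rewards.mean()
    return rewards - baseline                         # above-baseline rollouts get positive advantage
```

With the example rewards above (5 of 8 correct), the mean baseline is 0.625, so correct rollouts receive an advantage of +0.375 and incorrect ones -0.625.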
Recent analysis formalizes GRPO's dynamics as a KL-regularized contrastive loss and proves "success amplification": the probability of success after training is guaranteed to exceed the initial probability, regardless of starting point.
Why value-free RL is popular for RLVR: Verifiers provide clean binary signals, eliminating the need for value function approximation. This reduces training complexity and speeds convergence.
DEPO reports comparable performance with only 20% of training data, yielding 1.85× and 1.66× speedups on AIME24/25 for R1-Distill Qwen-7B vs GRPO trained on the full set. Key techniques:
Offline curation:
```python
def curate_training_data(dataset, base_model):
    """Select examples where the base model struggles but is close."""
    curated = []
    for example in dataset:
        pass_at_k = evaluate_pass_at_k(base_model, example, k=8)
        # Sweet spot: 30-70% pass@k
        if 0.3 <= pass_at_k <= 0.7:
            curated.append(example)
    return curated
```
Rollout pruning: Keep the top 50% of rollouts by reward (a minimal sketch follows below).

Difficulty scheduling: Start easy, gradually increase difficulty.
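The pruning step, under the assumption that each rollout is an (output, reward) pair:

```python
# Sketch: keep only the top half of rollouts by reward before the policy update.
# `rollouts` is assumed to be a list of (output, reward) pairs.
def prune_rollouts(rollouts, keep_fraction=0.5):
    ranked = sorted(rollouts, key=lambda pair: pair[1], reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]
```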
These techniques reduce compute by 60-70% with minimal performance loss.
Evidence:
Example:
```text
Before RLVR:
  pass@1: 40%, pass@8: 75%
After RLVR:
  pass@1: 65% (+25pp), pass@8: 77% (+2pp)
Analysis:
  ├─ Gap closed: 25pp / 35pp = 71% compression
  └─ Ceiling lift: 2pp = minimal capability gain
```
Evidence from June 2025 research:
But: Separating guidance effects from RLVR effects is difficult.
Recent research suggests most RLVR gains break down into a large sampling-efficiency (compression) component and a smaller genuine-capability component.
The exact ratio varies by model family, verifier coverage, and whether you use guidance techniques. Tsinghua and Scale both formalize this as "self-distillation" vs "capability gain."
How to measure what you're getting:
```python
def analyze_rlvr_gains(base_model, rlvr_model, dataset, k=16):
    base_results = {
        'pass@1': evaluate_pass_at_1(base_model, dataset),
        'pass@k': evaluate_pass_at_k(base_model, dataset, k=k)
    }
    rlvr_results = {
        'pass@1': evaluate_pass_at_1(rlvr_model, dataset),
        'pass@k': evaluate_pass_at_k(rlvr_model, dataset, k=k)
    }
    compression_gain = rlvr_results['pass@1'] - base_results['pass@1']
    capability_gain = rlvr_results['pass@k'] - base_results['pass@k']
    initial_gap = base_results['pass@k'] - base_results['pass@1']
    compression_ratio = compression_gain / initial_gap if initial_gap > 0 else 0
    return {
        'compression_ratio': compression_ratio,
        'capability_gain': capability_gain
    }
```
If compression_ratio > 0.7, you're mostly getting search efficiency, not learning.
Note on cost estimates: The figures below are illustrative order-of-magnitude estimates for planning purposes. Actual costs depend on model size, dataset size, iteration count, cloud provider, and engineering rates. For your specific use case, build a spreadsheet model with: token counts, $/1K tokens, rollout counts (K samples per prompt), number of training steps, verifier execution cost, and engineering hours. Treat these numbers as directional, not precise.
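To make that concrete, here is an illustrative back-of-the-envelope calculator; every default value below is a placeholder to replace with your own numbers:

```python
# Illustrative RLVR cost model; all defaults are placeholders, not benchmarks.
def estimate_rlvr_cost(num_prompts=5_000, k=8, epochs=3,
                       tokens_per_rollout=1_500, usd_per_1k_tokens=0.002,
                       verifier_usd_per_call=0.0001,
                       eng_hours=80, eng_rate_usd=150):
    rollouts = num_prompts * k * epochs
    generation = rollouts * tokens_per_rollout / 1_000 * usd_per_1k_tokens
    verification = rollouts * verifier_usd_per_call
    engineering = eng_hours * eng_rate_usd
    return {'generation_usd': round(generation, 2),
            'verification_usd': round(verification, 2),
            'engineering_usd': engineering,
            'total_usd': round(generation + verification + engineering, 2)}
```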
You trade generality (RLHF works for any task) for efficiency (RLVR is 3x cheaper on verifiable tasks). Quality differences: RLHF captures nuanced preferences, RLVR provides deterministic correctness.
Use RLVR when: you have objective correctness criteria and can build a verifier with high error coverage (math, code, SQL, structured data extraction).

Stick with DPO/RLHF when: quality is subjective (creative writing, brand voice, nuanced argumentation) or no reliable verifier exists.
Verifier quality determines RLVR effectiveness. These patterns work across domains.
```python
# Simplified example - add error handling for production
import re

def math_verifier(output: str, expected_answer: float) -> float:
    """Extract and normalize numerical answers."""
    # Extract numbers (handles formats like "1,073" or "1073.0")
    numbers = re.findall(r'-?\d+(?:,\d{3})*(?:\.\d+)?', output)
    if not numbers:
        return 0.0
    final_answer = float(numbers[-1].replace(',', ''))
    return 1.0 if abs(final_answer - expected_answer) < 0.01 else 0.0
```
```python
# CRITICAL: Never use `exec` with untrusted code in production without sandboxing.
# Use Docker containers, gVisor, or hermetic runners with:
# - No network access
# - No file I/O
# - Strict timeouts (e.g., 5s)
# - Resource limits (CPU, memory)
def code_verifier_unsafe_example(generated_code: str, test_cases: list) -> float:
    """THIS IS UNSAFE - for illustration only."""
    try:
        namespace = {}
        exec(generated_code, namespace)  # DANGER: arbitrary code execution
        for test in test_cases:
            exec(test, namespace)
        return 1.0
    except Exception:
        return 0.0

# In production: use isolated execution environments
# Example: subprocess with timeout, Docker, or cloud functions
```
What works: deterministic checking, fast execution (under 100ms), clear failure modes.

What doesn't: style preferences, efficiency requirements, partial credit schemes.
```python
from collections import Counter

def sql_execution_verifier(generated_sql: str, expected_results: list, db) -> float:
    """Execute and compare result sets.

    WARNING: Executing model-generated SQL is dangerous.
    - Use read-only database connections
    - Whitelist allowed tables/schemas
    - Set query timeouts
    - Never interpolate model output into queries (SQL injection risk)
    """
    try:
        actual = db.execute(generated_sql).fetchall()
        # Use Counter for multiset equality (handles duplicates and order)
        # Set equality drops duplicates; many benchmarks care about row counts
        actual_multiset = Counter(map(tuple, actual))
        expected_multiset = Counter(map(tuple, expected_results))
        return 1.0 if actual_multiset == expected_multiset else 0.0
    except Exception:
        return 0.0
```
Multiple correct SQL queries can produce the same results. Execution-based verification handles this naturally; you don't need to enumerate all valid queries.
Databricks case study: Their later paper reports 75.68% BIRD test accuracy combining execution verifiers with schema validation. Execution checking scales better than query pattern matching.
Pattern: Rubrics as Rewards (Scale AI approach)
```python
def writing_rubric_verifier(output: str, requirements: dict) -> float:
    """Multi-criterion scoring for structured writing."""
    score = 0.0
    for criterion_name, check_fn in requirements.items():
        if check_fn(output):
            score += 1.0 / len(requirements)
    return score

# Example: technical documentation
requirements = {
    'has_introduction': lambda text: 'introduction' in text.lower()[:300],
    'meets_length': lambda text: 300 <= len(text.split()) <= 500,
    'includes_examples': lambda text: text.lower().count('example') >= 2,
    'has_code_block': lambda text: '```' in text,
}
```
DPO often outperforms rubric-based RLVR on creative tasks where quality is subjective. Rubrics work for structured documents (reports, documentation), not creative writing or marketing copy.
Good verifiers are: deterministic, fast (ideally under 100ms), and explicit about their failure modes.
Critical security requirements for production verifiers:
Code execution verifiers: never run `exec()` on model output without sandboxing (no network access, no file I/O, strict timeouts, resource limits).

SQL verifiers: use read-only connections, whitelist allowed tables and schemas, set query timeouts, and never interpolate model output into query strings.
API-based verifiers:
Anti-patterns:
After training, validate two things: (1) Is the model better? (2) Is your verifier reliable?
Verifiers provide training rewards. Evaluation harnesses test if training worked. These are separate pipelines.
```python
def evaluate_pass_at_k(model, test_set, k=8):
    results = {'pass@1': 0, 'pass@k': 0, 'total': len(test_set)}
    for problem in test_set:
        solutions = model.generate(problem, num_samples=k)
        rewards = [verifier(sol, problem.answer) for sol in solutions]
        results['pass@1'] += int(rewards[0] > 0.5)
        results['pass@k'] += int(any(r > 0.5 for r in rewards))
    results['pass@1'] = 100 * results['pass@1'] / results['total']
    results['pass@k'] = 100 * results['pass@k'] / results['total']
    return results
```
Interpret results:
| pass@1 | pass@k | Meaning |
|---|---|---|
| ↑↑ | ↑ | Real capability gain + compression |
| ↑↑ | → | Pure search compression |
| ↑ | ↓ | Mode collapse (RED FLAG) |
Build adversarial test suites:
```python
adversarial_cases = [
    {'output': 'The answer is 42', 'expected': 1073,
     'should_pass': False, 'test': 'incorrect_answer'},
    {'output': 'I cannot solve this', 'expected': 1073,
     'should_pass': False, 'test': 'no_answer'},
    {'output': '37 × 29 = 1,073', 'expected': 1073,
     'should_pass': True, 'test': 'correct_with_formatting'},
]

def test_verifier_coverage(verifier, test_cases):
    failures = []
    for case in test_cases:
        result = verifier(case['output'], case['expected'])
        passed = (result > 0.5)
        if passed != case['should_pass']:
            failures.append(case['test'])
    coverage = 1 - (len(failures) / len(test_cases))
    return coverage, failures
```
Target coverage: Over 90% (below 70% is exploitable)
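For example, running the suite above against the `math_verifier` defined earlier:

```python
# Run the adversarial suite against the math verifier from earlier.
coverage, failures = test_verifier_coverage(math_verifier, adversarial_cases)
print(f"coverage: {coverage:.0%}, failed cases: {failures}")
# The math verifier handles all three cases here; real suites should include
# hundreds of known-bad and edge-case outputs.
```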
After training, use tools like Promptfoo or custom scripts to validate your model. These tools test if training worked; they don't provide training rewards.
RLVR is promising, but it leaves critical questions unanswered:
Q1: How do we handle partial verifiers at scale? Current: Intent checks, tripwires. Needed: Automated coverage analysis, self-improving verifiers.
Q2: Do RLVR gains transfer across model families? Evidence shows mixed results. Spurious reward sensitivity varies by family. We need cross-family benchmarking standards.
Q3: What are the scaling laws for RLVR? For pretraining, we have Chinchilla laws. For RLVR: unknown. How do gains scale with compute? When do returns diminish?
Q4: Can we expand beyond deterministic verifiers? Current RLVR needs binary correctness. Can we extend to fuzzy verifiers, learned verifiers, or hybrid human-AI verification?
Research teams are exploring multi-verifier composition (chaining multiple checks with weighted scoring) to address partial verifier coverage. Self-play approaches have models generate harder problems for themselves to sustain exploration during training.
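A sketch of what such a composition might look like, assuming each check is a `(verifier_fn, weight)` pair with the same signature as the verifiers above:

```python
# Sketch: weighted combination of several verifiers into one reward.
# `checks` is assumed to be a list of (verifier_fn, weight) pairs where each
# verifier returns a score between 0.0 and 1.0.
def composed_reward(output, ground_truth, checks):
    total_weight = sum(weight for _, weight in checks)
    score = sum(weight * check(output, ground_truth) for check, weight in checks)
    return score / total_weight
```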
On the tooling front, teams are building verifier template libraries and automated coverage testing frameworks. For high-stakes applications, expect auditing standards and regulatory frameworks as RLVR moves into medical, legal, and financial domains.
```text
Do you have objective correctness criteria?
│
├─ YES → Can you write a verifier covering >90% of errors?
│        │
│        ├─ YES → RLVR is worth trying
│        │        └─ Start: small pilot, compare to DPO
│        │
│        └─ NO → Fix verifier coverage first
│                └─ Build adversarial test suite
│
└─ NO → Stick with DPO/RLHF
        └─ Unless: task decomposes into verifiable sub-tasks
```
Evidence to date suggests that for most applications, RLVR's gains are dominated by search compression rather than expanded reasoning capability. You're optimizing search, not expanding intelligence: the model could already find the right answer, and RLVR concentrates probability on the paths that reach it.
If you can write a verifier, you can scale learning. Where ground truth doesn't exist, RLVR fails and human preference data remains superior.
The engineering challenge: proving what you've actually gained. Run pass@k analysis to distinguish compression from capability. Run random baseline tests to check for spurious rewards. Test across multiple model families.
The next time you see a model's performance jump after an RL run, ask the hard question: did you build a better thinker, or did you just build a faster guesser? The answer determines whether your product is truly intelligent or just a fragile house of cards.
Implementation:
Research: