dev/estimate_gpt3_core.ipynb
Authors: Claude Code Opus 4.5, Andrej Karpathy
Date: Jan 2026
The CORE metric (introduced in the DCLM paper) is a composite benchmark that evaluates pretrained language models across 22 diverse tasks spanning world knowledge, language understanding, commonsense reasoning, symbolic problem solving, and reading comprehension. It provides a single score that captures a model's general capabilities.
We want to compare nanochat models against the GPT-3 model family from OpenAI's "Language Models are Few-Shot Learners" paper (2020). However, there's a problem: GPT-3 models were never evaluated on CORE (which didn't exist in 2020), and the models were never publicly released, so we can't evaluate them ourselves.
We estimate CORE scores for GPT-3 by:
1. Identifying the tasks that the GPT-3 paper and CORE evaluate in comparable few-shot settings (6 tasks survive the filter).
2. Calibrating a mapping from those 6 centered accuracies to the full 22-task CORE score, using the 4 GPT-2 models for which we have measured CORE scores.
3. Applying that mapping to the accuracies reported in the GPT-3 paper.
This notebook documents our methodology in detail for reproducibility.
import numpy as np
from pathlib import Path
import pandas as pd
# For nice table display
pd.set_option('display.precision', 4)
pd.set_option('display.max_columns', 20)
CORE consists of 22 tasks evaluated in specific few-shot settings. The key innovation is centering: raw accuracies are adjusted to account for random guessing baselines.
$$\text{centered accuracy} = \frac{\text{accuracy} - \text{baseline}}{1 - \text{baseline}}$$
The final CORE score is simply the mean of all 22 centered accuracies.
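As a quick sanity check of the centering formula, here is a toy example with made-up numbers (a hypothetical 4-way multiple-choice task at 60% raw accuracy):
raw_acc, baseline = 0.60, 0.25   # hypothetical 4-way multiple-choice task, random guessing gives 25%
centered = (raw_acc - baseline) / (1.0 - baseline)
print(f"centered accuracy = {centered:.3f}")   # ≈ 0.467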
| Category | Tasks |
|---|---|
| World Knowledge | Jeopardy, ARC Easy, ARC Challenge, BigBench QA Wikidata |
| Language Understanding | HellaSwag (0-shot & 10-shot), LAMBADA, Winograd, Winogrande, BigBench Language ID |
| Commonsense Reasoning | COPA, CommonsenseQA, PIQA, OpenBookQA |
| Symbolic Problem Solving | BigBench Dyck, Operators, CS Algorithms, Repeat Copy Logic, AGI Eval LSAT-AR |
| Reading Comprehension | SQuAD, CoQA, BoolQ |
We carefully compared the evaluation methodology between GPT-3 and CORE for each task. Key considerations:
- The number of few-shot examples K used by each evaluation.
- The metric reported (e.g. accuracy vs. F1).
- Answer normalization and prompt format (e.g. unconditional normalization, fill-in-the-blank formats).
We applied a conservative filter: both evaluations must use K=0 (zero-shot) or both must use K>0 (few-shot). We excluded tasks that mix zero-shot with few-shot, as this introduces systematic differences.
| Task | GPT-3 K | CORE K | Reason for Exclusion |
|---|---|---|---|
| Winograd | 7 | 0 | Mixing K>0 with K=0 |
| Winogrande | 50 | 0 | Mixing K>0 with K=0 |
| COPA | 32 | 0 | Mixing K>0 with K=0 |
| OpenBookQA | 100 | 0 | Mixing K>0 with K=0, also uses unconditional normalization |
| BoolQ | 32 | 10 | High sensitivity to K (17% gap between 0-shot and few-shot in GPT-3) |
| CoQA | 5 | 0 | Different metric (F1 vs accuracy) |
| LAMBADA few-shot | 15 | 0 | GPT-3 uses special fill-in-blank format |
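The conservative filter itself can be written as a one-line predicate (a sketch of the rule above; `k_settings_compatible` is just for illustration):
# Keep a task only if GPT-3 and CORE both evaluate it zero-shot, or both evaluate it few-shot
def k_settings_compatible(gpt3_k, core_k):
    return (gpt3_k == 0) == (core_k == 0)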
The remaining CORE tasks simply don't appear in the GPT-3 paper at all (many, such as the BigBench tasks, didn't exist in 2020), so there is nothing to match against.
# The 6 tasks we selected for overlap
selected_tasks = pd.DataFrame([
{'Task': 'HellaSwag 0-shot', 'GPT-3 K': 0, 'CORE K': 0, 'Match': 'Both zero-shot'},
{'Task': 'LAMBADA', 'GPT-3 K': 0, 'CORE K': 0, 'Match': 'Both zero-shot'},
{'Task': 'HellaSwag 10-shot', 'GPT-3 K': 20, 'CORE K': 10, 'Match': 'Both few-shot (K differs slightly)'},
{'Task': 'PIQA', 'GPT-3 K': 50, 'CORE K': 10, 'Match': 'Both few-shot'},
{'Task': 'ARC Easy', 'GPT-3 K': 50, 'CORE K': 10, 'Match': 'Both few-shot'},
{'Task': 'ARC Challenge', 'GPT-3 K': 50, 'CORE K': 10, 'Match': 'Both few-shot'},
])
selected_tasks
Rationale for K differences: Looking at GPT-3's own data, the difference between different K values is typically small. Here's the evidence from the GPT-3 175B model:
| Task | 0-shot | Few-shot | K | Δ |
|---|---|---|---|---|
| HellaSwag | 78.9% | 79.3% | 20 | +0.4% |
| PIQA | 81.0% | 82.3% | 50 | +1.3% |
| ARC Easy | 68.8% | 70.1% | 50 | +1.3% |
| ARC Challenge | 51.4% | 51.5% | 50 | +0.1% |
| Winograd | 88.3% | 88.6% | 7 | +0.3% |
| COPA | 91.0% | 92.0% | 32 | +1.0% |
For most tasks, the gap between 0-shot and few-shot (with K ranging from 7 to 50) is only 0.1-1.3%. The difference between two few-shot settings such as K=10 and K=50 should be smaller still, which makes our task selection reasonable.
Note: Some tasks show larger sensitivity (Winogrande: +7.5%, BoolQ: +17%), which is why we excluded them.
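For convenience, here is the sensitivity table above as a small DataFrame (values transcribed from the GPT-3 paper's 175B results), with the deltas recomputed:
# GPT-3 175B sensitivity to few-shot K (values transcribed from the table above)
k_sensitivity = pd.DataFrame(
    [
        ('HellaSwag',     0.789, 0.793, 20),
        ('PIQA',          0.810, 0.823, 50),
        ('ARC Easy',      0.688, 0.701, 50),
        ('ARC Challenge', 0.514, 0.515, 50),
        ('Winograd',      0.883, 0.886, 7),
        ('COPA',          0.910, 0.920, 32),
    ],
    columns=['Task', '0-shot', 'Few-shot', 'K'],
)
k_sensitivity['Δ'] = k_sensitivity['Few-shot'] - k_sensitivity['0-shot']
k_sensitivity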
We have actual CORE scores for all 4 GPT-2 models. These serve as our calibration data.
# Random baselines for centering (from CORE specification)
BASELINES = {
'hellaswag_zeroshot': 0.25,
'lambada_openai': 0.0,
'hellaswag': 0.25,
'piqa': 0.50,
'arc_easy': 0.25,
'arc_challenge': 0.25,
}
TASK_ORDER = ['hellaswag_zeroshot', 'lambada_openai', 'hellaswag', 'piqa', 'arc_easy', 'arc_challenge']
TASK_NAMES = ['HellaSwag 0-shot', 'LAMBADA', 'HellaSwag 10-shot', 'PIQA', 'ARC Easy', 'ARC Challenge']
def center_accuracy(acc, baseline):
"""Convert raw accuracy to centered accuracy."""
return (acc - baseline) / (1.0 - baseline)
def parse_csv(filepath):
"""Parse a CORE results CSV file."""
results = {}
with open(filepath) as f:
for line in f:
parts = [p.strip() for p in line.strip().split(',')]
if len(parts) >= 3 and parts[0] != 'Task':
task = parts[0]
try:
acc = float(parts[1]) if parts[1] else None
centered = float(parts[2]) if parts[2] else None
results[task] = {'accuracy': acc, 'centered': centered}
except ValueError:
pass
return results
# Load GPT-2 CORE results
knowledge_dir = Path("/home/ubuntu/.cache/nanochat/eval_bundle")
gpt2_models = [
('GPT-2', 'openai-community-gpt2.csv', 124e6),
('GPT-2 Medium', 'openai-community-gpt2-medium.csv', 355e6),
('GPT-2 Large', 'openai-community-gpt2-large.csv', 774e6),
('GPT-2 XL', 'openai-community-gpt2-xl.csv', 1558e6),
]
gpt2_data = []
for name, filename, params in gpt2_models:
results = parse_csv(knowledge_dir / filename)
core = results['CORE']['centered']
task_accs = [results[task]['accuracy'] for task in TASK_ORDER]
gpt2_data.append({
'name': name,
'params': params,
'task_accs': task_accs,
'core': core,
})
# Display as DataFrame
gpt2_df = pd.DataFrame([
{
'Model': d['name'],
'Params': f"{d['params']/1e6:.0f}M",
**{name: f"{acc:.1%}" for name, acc in zip(TASK_NAMES, d['task_accs'])},
'CORE': f"{d['core']:.4f}"
}
for d in gpt2_data
])
print("GPT-2 Family: Raw Accuracies and CORE Scores")
gpt2_df
# Build feature matrix (centered accuracies)
X_gpt2 = []
y_gpt2 = []
for data in gpt2_data:
centered_accs = []
for task, acc in zip(TASK_ORDER, data['task_accs']):
centered = center_accuracy(acc, BASELINES[task])
centered_accs.append(centered)
X_gpt2.append(centered_accs)
y_gpt2.append(data['core'])
X_gpt2 = np.array(X_gpt2)
y_gpt2 = np.array(y_gpt2)
# Display centered accuracies
centered_df = pd.DataFrame(
X_gpt2,
columns=TASK_NAMES,
index=[d['name'] for d in gpt2_data]
)
centered_df['Mean'] = X_gpt2.mean(axis=1)
centered_df['CORE'] = y_gpt2
print("GPT-2 Family: Centered Accuracies")
centered_df
Observation: The mean of the 6 centered accuracies is consistently higher than the actual CORE score. This makes sense because CORE includes 16 additional tasks (many quite difficult) that pull down the average.
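To put a number on that gap for each calibration model:
# The gap between the 6-task mean and the full 22-task CORE score, per GPT-2 model
gap = X_gpt2.mean(axis=1) - y_gpt2
print("Mean-of-6 minus CORE:", np.round(gap, 4))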
We extract the 6 task accuracies from the GPT-3 paper's Appendix H (master results table).
Source: Table H.1 in "Language Models are Few-Shot Learners" (Brown et al., 2020)
# GPT-3 accuracies from the paper
# Format: [hellaswag_0shot, lambada_0shot, hellaswag_fewshot, piqa_fewshot, arc_easy_fewshot, arc_challenge_fewshot]
gpt3_models = [
('GPT-3 Small', 125e6, [0.337, 0.427, 0.335, 0.643, 0.427, 0.255]),
('GPT-3 Medium', 350e6, [0.436, 0.543, 0.431, 0.694, 0.510, 0.284]),
('GPT-3 Large', 760e6, [0.510, 0.604, 0.513, 0.720, 0.581, 0.323]),
('GPT-3 XL', 1.3e9, [0.547, 0.636, 0.549, 0.743, 0.591, 0.367]),
('GPT-3 2.7B', 2.7e9, [0.628, 0.671, 0.629, 0.754, 0.621, 0.395]),
('GPT-3 6.7B', 6.7e9, [0.674, 0.703, 0.673, 0.778, 0.658, 0.437]),
('GPT-3 13B', 13e9, [0.709, 0.725, 0.713, 0.799, 0.691, 0.448]),
('GPT-3 175B', 175e9, [0.789, 0.762, 0.793, 0.823, 0.701, 0.515]),
]
# Display raw accuracies
gpt3_df = pd.DataFrame([
{
'Model': name,
'Params': f"{params/1e9:.1f}B" if params >= 1e9 else f"{params/1e6:.0f}M",
**{task_name: f"{acc:.1%}" for task_name, acc in zip(TASK_NAMES, accs)}
}
for name, params, accs in gpt3_models
])
print("GPT-3 Family: Raw Accuracies from Paper")
gpt3_df
# Compute centered accuracies for GPT-3
X_gpt3 = []
for name, params, accs in gpt3_models:
centered_accs = [center_accuracy(acc, BASELINES[task]) for task, acc in zip(TASK_ORDER, accs)]
X_gpt3.append(centered_accs)
X_gpt3 = np.array(X_gpt3)
# Display
gpt3_centered_df = pd.DataFrame(
X_gpt3,
columns=TASK_NAMES,
index=[m[0] for m in gpt3_models]
)
gpt3_centered_df['Mean'] = X_gpt3.mean(axis=1)
print("GPT-3 Family: Centered Accuracies")
gpt3_centered_df
We fit two types of models:
1. Simple: a 1-D linear regression from the mean of the 6 centered accuracies to the CORE score.
2. Ridge: a regularized linear regression from all 6 centered accuracies to the CORE score.
We only have 4 calibration points (GPT-2 models) but 6 features + 1 intercept = 7 parameters. Without regularization, we get a perfect fit but with unstable, extreme weights. Ridge regression shrinks weights toward zero, preventing overfitting.
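A quick check of the underdetermination claim, using the GPT-2 feature matrix built above: the normal-equation matrix has rank 4, fewer than its 7 columns.
# 4 calibration points vs. 7 parameters (6 weights + intercept): the system is underdetermined
X_aug_check = np.column_stack([np.ones(len(y_gpt2)), X_gpt2])   # 4 x 7 design matrix
print("rank of X^T X:", np.linalg.matrix_rank(X_aug_check.T @ X_aug_check), "of", X_aug_check.shape[1])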
def simple_linear_regression(x, y):
"""Simple 1D linear regression: y = a*x + b"""
mean_x, mean_y = np.mean(x), np.mean(y)
a = np.sum((x - mean_x) * (y - mean_y)) / np.sum((x - mean_x) ** 2)
b = mean_y - a * mean_x
return a, b
def ridge_regression(X, y, alpha=0.1):
"""
Ridge regression: minimize ||Xw - y||² + α||w||²
We don't regularize the intercept.
"""
n_samples, n_features = X.shape
X_aug = np.column_stack([np.ones(n_samples), X])
reg_matrix = alpha * np.eye(n_features + 1)
reg_matrix[0, 0] = 0 # Don't regularize intercept
coeffs = np.linalg.solve(X_aug.T @ X_aug + reg_matrix, X_aug.T @ y)
return coeffs[0], coeffs[1:] # intercept, weights
def compute_r_squared(y_true, y_pred):
"""Compute R² score."""
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
return 1 - ss_res / ss_tot
# Compute average of 6 centered accuracies
avg_centered_gpt2 = X_gpt2.mean(axis=1)
# Fit linear regression
slope, intercept = simple_linear_regression(avg_centered_gpt2, y_gpt2)
print(f"Simple Model: CORE = {slope:.4f} × avg_centered + {intercept:.4f}")
# Validate
y_pred_simple = slope * avg_centered_gpt2 + intercept
r2_simple = compute_r_squared(y_gpt2, y_pred_simple)
validation_df = pd.DataFrame({
'Model': [d['name'] for d in gpt2_data],
'Avg Centered': avg_centered_gpt2,
'Predicted': y_pred_simple,
'Actual': y_gpt2,
'Error': y_pred_simple - y_gpt2
})
print(f"\nR² = {r2_simple:.4f}")
validation_df
Result: R² = 0.996 — excellent fit with just 2 parameters. The simple averaging approach works very well.
We try different regularization strengths (α) to find a good balance between fit and stability.
# Try different regularization strengths
alphas = [0.0, 0.001, 0.01, 0.1, 1.0]
results = []
for alpha in alphas:
intercept_r, weights = ridge_regression(X_gpt2, y_gpt2, alpha=alpha)
y_pred = X_gpt2 @ weights + intercept_r
r2 = compute_r_squared(y_gpt2, y_pred)
weight_norm = np.sqrt(np.sum(weights ** 2))
results.append({
'α': alpha,
'R²': r2,
'||weights||': weight_norm,
'Intercept': intercept_r,
'Weights': weights.copy()
})
alpha_df = pd.DataFrame([{k: v for k, v in r.items() if k != 'Weights'} for r in results])
print("Effect of Regularization Strength:")
alpha_df
# Show weights for each alpha
print("Task Weights by Regularization Strength:")
weights_df = pd.DataFrame(
[r['Weights'] for r in results],
columns=TASK_NAMES,
index=[f"α={r['α']}" for r in results]
)
weights_df
Observations:
- With α=0 the fit is essentially perfect, but the weights are large and unstable (the classic overfit with 7 parameters and 4 data points).
- As α grows, the weight norm shrinks and the weights become more uniform across tasks, at a small cost in R².
- α=0.01 keeps R²≈0.99 with moderate weights of roughly 0.1 per task, so we use it below.
# Use α=0.01 as our chosen regularization
# This gives R²≈0.99 with reasonable, stable weights (~0.1 each task)
CHOSEN_ALPHA = 0.01
intercept_ridge, weights_ridge = ridge_regression(X_gpt2, y_gpt2, alpha=CHOSEN_ALPHA)
print(f"Ridge Model (α={CHOSEN_ALPHA}):")
print(f" Intercept: {intercept_ridge:.4f}")
print(f" Weights:")
for name, w in zip(TASK_NAMES, weights_ridge):
print(f" {name:20s}: {w:+.4f}")
# Validate
y_pred_ridge = X_gpt2 @ weights_ridge + intercept_ridge
r2_ridge = compute_r_squared(y_gpt2, y_pred_ridge)
print(f"\nR² = {r2_ridge:.4f}")
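For symmetry with the simple model's validation table, the per-model ridge predictions on the calibration set (using the variables already defined above):
ridge_validation_df = pd.DataFrame({
    'Model': [d['name'] for d in gpt2_data],
    'Predicted': y_pred_ridge,
    'Actual': y_gpt2,
    'Error': y_pred_ridge - y_gpt2,
})
ridge_validation_df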
Which single task is the best predictor of CORE? We fit separate linear regressions for each task.
# Fit separate linear regression for each task
individual_results = []
for i, task_name in enumerate(TASK_NAMES):
x_task = X_gpt2[:, i]
slope_ind, intercept_ind = simple_linear_regression(x_task, y_gpt2)
y_pred_ind = slope_ind * x_task + intercept_ind
r2_ind = compute_r_squared(y_gpt2, y_pred_ind)
individual_results.append({
'Task': task_name,
'R²': r2_ind,
'Slope': slope_ind,
'Intercept': intercept_ind
})
individual_df = pd.DataFrame(individual_results).sort_values('R²', ascending=False)
print("Individual Task Correlations with CORE:")
individual_df
Key Finding: All 6 tasks have very high correlation with CORE (R² > 0.96), but PIQA is the single best predictor with R² = 0.9961 — actually slightly better than the simple averaging approach (R² = 0.9960)!
This is useful if you want a quick proxy for CORE with minimal evaluation cost. However, for robustness we still recommend using all 6 tasks or the averaged approaches.
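If you do want the quick proxy, it can be wrapped in a tiny helper (a sketch; `estimate_core_from_piqa` is not part of the analysis above):
# Rough CORE proxy from a single PIQA accuracy (hypothetical helper, for quick estimates only)
piqa_fit = next(r for r in individual_results if r['Task'] == 'PIQA')
def estimate_core_from_piqa(piqa_acc):
    """Map a raw PIQA accuracy (random baseline 0.5) to an approximate CORE score."""
    centered = center_accuracy(piqa_acc, BASELINES['piqa'])
    return piqa_fit['Slope'] * centered + piqa_fit['Intercept']
print(f"example: PIQA 75% -> CORE ≈ {estimate_core_from_piqa(0.75):.3f}")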
We apply both models to GPT-3 data and report the average as our final estimate.
# Apply all three approaches
avg_centered_gpt3 = X_gpt3.mean(axis=1)
gpt3_core_simple = slope * avg_centered_gpt3 + intercept
gpt3_core_ridge = X_gpt3 @ weights_ridge + intercept_ridge
# Approach 3: Best individual predictor (PIQA)
piqa_idx = TASK_NAMES.index('PIQA')
piqa_model = [r for r in individual_results if r['Task'] == 'PIQA'][0]
gpt3_core_piqa = piqa_model['Slope'] * X_gpt3[:, piqa_idx] + piqa_model['Intercept']
# Average of approaches 1 and 2
gpt3_core_final = (gpt3_core_simple + gpt3_core_ridge) / 2
# Create results table with all approaches
results_df = pd.DataFrame({
'Model': [m[0] for m in gpt3_models],
'Params': [f"{m[1]/1e9:.1f}B" if m[1] >= 1e9 else f"{m[1]/1e6:.0f}M" for m in gpt3_models],
'Simple': gpt3_core_simple,
'Ridge': gpt3_core_ridge,
'PIQA only': gpt3_core_piqa,
'Avg(1,2)': gpt3_core_final
})
print("GPT-3 CORE Estimates (all three approaches):")
results_df
# Combine with GPT-2 for complete picture
all_models = []
for data in gpt2_data:
params = data['params']
all_models.append({
'Model': data['name'],
'Family': 'GPT-2',
'Params': params,
'Params_str': f"{params/1e9:.1f}B" if params >= 1e9 else f"{params/1e6:.0f}M",
'CORE': data['core'],
'Source': 'Measured'
})
for (name, params, _), core in zip(gpt3_models, gpt3_core_final):
all_models.append({
'Model': name,
'Family': 'GPT-3',
'Params': params,
'Params_str': f"{params/1e9:.1f}B" if params >= 1e9 else f"{params/1e6:.0f}M",
'CORE': core,
'Source': 'Estimated'
})
# Sort by params and display
all_models.sort(key=lambda x: x['Params'])
final_df = pd.DataFrame(all_models)[['Model', 'Params_str', 'CORE', 'Source']]
final_df.columns = ['Model', 'Params', 'CORE', 'Source']
print("Complete CORE Scores (GPT-2 measured, GPT-3 estimated):")
final_df
comparisons = [
('~125M', 'GPT-2', gpt2_data[0]['core'], 'GPT-3 Small', gpt3_core_final[0]),
('~350M', 'GPT-2 Medium', gpt2_data[1]['core'], 'GPT-3 Medium', gpt3_core_final[1]),
('~760M', 'GPT-2 Large', gpt2_data[2]['core'], 'GPT-3 Large', gpt3_core_final[2]),
('~1.3-1.5B', 'GPT-2 XL', gpt2_data[3]['core'], 'GPT-3 XL', gpt3_core_final[3]),
]
comparison_df = pd.DataFrame([
{
'Size': size,
'GPT-2 CORE': gpt2_core,
'GPT-3 CORE': gpt3_core,
'Δ': gpt3_core - gpt2_core,
'Improvement': f"{100 * (gpt3_core - gpt2_core) / gpt2_core:+.1f}%"
}
for size, _, gpt2_core, _, gpt3_core in comparisons
])
print("GPT-3 vs GPT-2 at Similar Model Sizes:")
comparison_df
We estimated CORE scores for GPT-3 models by:
1. Selecting 6 tasks that both the GPT-3 paper and CORE evaluate in comparable few-shot settings.
2. Calibrating two mappings (simple average regression and ridge regression) from those tasks' centered accuracies to the full CORE score, using the 4 GPT-2 models with measured CORE.
3. Applying both mappings to GPT-3's published accuracies and averaging the two predictions.
GPT-3 consistently outperforms GPT-2 at similar model sizes by approximately 0.03-0.05 CORE (14-30% relative improvement)
PIQA is the best single predictor of CORE (R²=0.9961). If you need a quick proxy for CORE with minimal evaluation cost, PIQA alone works nearly as well as averaging all 6 tasks.
The improvement likely comes from GPT-3's much larger and more diverse training dataset and greater training compute, along with minor architectural changes (e.g. a longer context window).
Final estimated CORE scores:
| Model | Params | Estimated CORE |
|---|---|---|
| GPT-3 Small | 125M | 0.148 |
| GPT-3 Medium | 350M | 0.216 |
| GPT-3 Large | 760M | 0.266 |
| GPT-3 XL | 1.3B | 0.291 |
| GPT-3 2.7B | 2.7B | 0.329 |
| GPT-3 6.7B | 6.7B | 0.361 |
| GPT-3 13B | 13B | 0.385 |
| GPT-3 175B | 175B | 0.427 |
Limitations: the mapping is calibrated on only 4 GPT-2 models, it uses just 6 of the 22 CORE tasks, the few-shot K values differ between the GPT-3 paper and CORE for some of those tasks, and the GPT-3 accuracies are taken from the paper rather than re-measured. Despite these limitations, the estimates are useful for approximate comparisons between nanochat models and the GPT-3 family.
# Export as a simple dict for use elsewhere
gpt3_core_estimates = {
'GPT-3 Small (125M)': round(gpt3_core_final[0], 4),
'GPT-3 Medium (350M)': round(gpt3_core_final[1], 4),
'GPT-3 Large (760M)': round(gpt3_core_final[2], 4),
'GPT-3 XL (1.3B)': round(gpt3_core_final[3], 4),
'GPT-3 2.7B': round(gpt3_core_final[4], 4),
'GPT-3 6.7B': round(gpt3_core_final[5], 4),
'GPT-3 13B': round(gpt3_core_final[6], 4),
'GPT-3 175B': round(gpt3_core_final[7], 4),
}
print("GPT-3 CORE Estimates (for copy-paste):")
import json
print(json.dumps(gpt3_core_estimates, indent=4))