Estimating CORE Metric for GPT-3 Models

dev/estimate_gpt3_core.ipynb

Authors: Claude Code Opus 4.5, Andrej Karpathy

Date: Jan 2026

Motivation

The CORE metric (introduced in the DCLM paper) is a composite benchmark that evaluates pretrained language models across 22 diverse tasks spanning world knowledge, language understanding, commonsense reasoning, symbolic problem solving, and reading comprehension. It provides a single score that captures a model's general capabilities.

We want to compare nanochat models against the GPT-3 model family from OpenAI's "Language Models are Few-Shot Learners" paper (2020). However, there's a problem: GPT-3 models were never evaluated on CORE (which didn't exist in 2020), and the models were never publicly released, so we can't evaluate them ourselves.

Our Approach

We estimate CORE scores for GPT-3 by:

  1. Identifying overlapping tasks between the GPT-3 paper and CORE that were evaluated with similar methodology
  2. Using GPT-2 as calibration data — we have actual CORE scores for all 4 GPT-2 models, along with their accuracies on the overlapping tasks
  3. Fitting a regression model from the overlapping task scores to the full CORE score
  4. Applying the model to GPT-3 using their reported task scores

This notebook documents our methodology in detail for reproducibility.

Setup

python
import numpy as np
from pathlib import Path
import pandas as pd

# For nice table display
pd.set_option('display.precision', 4)
pd.set_option('display.max_columns', 20)

Part 1: Understanding CORE

CORE consists of 22 tasks evaluated in specific few-shot settings. The key innovation is centering: raw accuracies are adjusted to account for random guessing baselines.

$$\text{centered accuracy} = \frac{\text{accuracy} - \text{baseline}}{1 - \text{baseline}}$$

The final CORE score is simply the mean of all 22 centered accuracies.
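A worked example of the centering formula, assuming a 4-way multiple-choice task (random-guess baseline 0.25):

```python
def center_accuracy(acc, baseline):
    """Map raw accuracy so that chance scores 0 and perfect scores 1."""
    return (acc - baseline) / (1.0 - baseline)

# 55% raw accuracy is 40% of the way from chance to perfect:
print(center_accuracy(0.55, 0.25))  # ~0.4
# Scoring exactly at chance earns zero credit:
print(center_accuracy(0.25, 0.25))  # 0.0
```

Note that a model scoring below chance gets a negative centered accuracy, which is intentional.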

CORE Tasks

| Category | Tasks |
| --- | --- |
| World Knowledge | Jeopardy, ARC Easy, ARC Challenge, BigBench QA Wikidata |
| Language Understanding | HellaSwag (0-shot & 10-shot), LAMBADA, Winograd, Winogrande, BigBench Language ID |
| Commonsense Reasoning | COPA, CommonsenseQA, PIQA, OpenBookQA |
| Symbolic Problem Solving | BigBench Dyck, Operators, CS Algorithms, Repeat Copy Logic, AGI Eval LSAT-AR |
| Reading Comprehension | SQuAD, CoQA, BoolQ |

Part 2: Task Overlap Analysis

We carefully compared the evaluation methodology between GPT-3 and CORE for each task. Key considerations:

  1. Number of few-shot examples (K): GPT-3 often uses more examples than CORE
  2. Task format: Some tasks use different prompting strategies
  3. Scoring method: GPT-3 uses unconditional probability normalization for some tasks
  4. Data split: dev vs test set

Selection Criteria

We applied a conservative filter: both evaluations must use K=0 (zero-shot) or both must use K>0 (few-shot). We excluded tasks that mix zero-shot with few-shot, as this introduces systematic differences.
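The filter can be expressed as a simple predicate over the two K settings:

```python
def k_settings_compatible(gpt3_k: int, core_k: int) -> bool:
    # Keep a task only if both evaluations are zero-shot,
    # or both are few-shot (the magnitude of K may differ).
    return (gpt3_k == 0) == (core_k == 0)

print(k_settings_compatible(0, 0))    # True: e.g. LAMBADA (both zero-shot)
print(k_settings_compatible(50, 10))  # True: e.g. PIQA (both few-shot)
print(k_settings_compatible(32, 0))   # False: e.g. COPA (mixed, excluded)
```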

Tasks We Excluded

| Task | GPT-3 K | CORE K | Reason for Exclusion |
| --- | --- | --- | --- |
| Winograd | 7 | 0 | Mixing K>0 with K=0 |
| Winogrande | 50 | 0 | Mixing K>0 with K=0 |
| COPA | 32 | 0 | Mixing K>0 with K=0 |
| OpenBookQA | 100 | 0 | Mixing K>0 with K=0, also uses unconditional normalization |
| BoolQ | 32 | 10 | High sensitivity to K (17% gap between 0-shot and few-shot in GPT-3) |
| CoQA | 5 | 0 | Different metric (F1 vs accuracy) |
| LAMBADA few-shot | 15 | 0 | GPT-3 uses special fill-in-blank format |

Tasks Not in GPT-3 Paper

These CORE tasks simply don't appear in GPT-3 (many didn't exist in 2020):

  • All 6 BigBench tasks (Dyck, Operators, CS Algorithms, Repeat Copy Logic, Language ID, QA Wikidata)
  • Jeopardy, CommonsenseQA, AGI Eval LSAT-AR
  • SQuAD v1 (GPT-3 uses v2)

Final Selected Tasks (6 tasks)

python
# The 6 tasks we selected for overlap
selected_tasks = pd.DataFrame([
    {'Task': 'HellaSwag 0-shot', 'GPT-3 K': 0, 'CORE K': 0, 'Match': 'Both zero-shot'},
    {'Task': 'LAMBADA', 'GPT-3 K': 0, 'CORE K': 0, 'Match': 'Both zero-shot'},
    {'Task': 'HellaSwag 10-shot', 'GPT-3 K': 20, 'CORE K': 10, 'Match': 'Both few-shot (K differs slightly)'},
    {'Task': 'PIQA', 'GPT-3 K': 50, 'CORE K': 10, 'Match': 'Both few-shot'},
    {'Task': 'ARC Easy', 'GPT-3 K': 50, 'CORE K': 10, 'Match': 'Both few-shot'},
    {'Task': 'ARC Challenge', 'GPT-3 K': 50, 'CORE K': 10, 'Match': 'Both few-shot'},
])
selected_tasks

Rationale for K differences: In GPT-3's own results, the gap between K settings is typically small. Here's the evidence from the GPT-3 175B model:

| Task | 0-shot | Few-shot | K | Δ |
| --- | --- | --- | --- | --- |
| HellaSwag | 78.9% | 79.3% | 20 | +0.4% |
| PIQA | 81.0% | 82.3% | 50 | +1.3% |
| ARC Easy | 68.8% | 70.1% | 50 | +1.3% |
| ARC Challenge | 51.4% | 51.5% | 50 | +0.1% |
| Winograd | 88.3% | 88.6% | 7 | +0.3% |
| COPA | 91.0% | 92.0% | 32 | +1.0% |

For most tasks, the gap between 0-shot and few-shot (with K=20-50) is only 0.1-1.3%. This suggests that differences between K=10 and K=50 would be even smaller, making our task selection reasonable.

Note: Some tasks show larger sensitivity (Winogrande: +7.5%, BoolQ: +17%), which is why we excluded them.

Part 3: Calibration Data (GPT-2 Family)

We have actual CORE scores for all 4 GPT-2 models. These serve as our calibration data.

python
# Random baselines for centering (from CORE specification)
BASELINES = {
    'hellaswag_zeroshot': 0.25,
    'lambada_openai': 0.0,
    'hellaswag': 0.25,
    'piqa': 0.50,
    'arc_easy': 0.25,
    'arc_challenge': 0.25,
}

TASK_ORDER = ['hellaswag_zeroshot', 'lambada_openai', 'hellaswag', 'piqa', 'arc_easy', 'arc_challenge']
TASK_NAMES = ['HellaSwag 0-shot', 'LAMBADA', 'HellaSwag 10-shot', 'PIQA', 'ARC Easy', 'ARC Challenge']

def center_accuracy(acc, baseline):
    """Convert raw accuracy to centered accuracy."""
    return (acc - baseline) / (1.0 - baseline)

def parse_csv(filepath):
    """Parse a CORE results CSV file."""
    results = {}
    with open(filepath) as f:
        for line in f:
            parts = [p.strip() for p in line.strip().split(',')]
            if len(parts) >= 3 and parts[0] != 'Task':
                task = parts[0]
                try:
                    acc = float(parts[1]) if parts[1] else None
                    centered = float(parts[2]) if parts[2] else None
                    results[task] = {'accuracy': acc, 'centered': centered}
                except ValueError:
                    pass
    return results
python
# Load GPT-2 CORE results
knowledge_dir = Path("/home/ubuntu/.cache/nanochat/eval_bundle")

gpt2_models = [
    ('GPT-2', 'openai-community-gpt2.csv', 124e6),
    ('GPT-2 Medium', 'openai-community-gpt2-medium.csv', 355e6),
    ('GPT-2 Large', 'openai-community-gpt2-large.csv', 774e6),
    ('GPT-2 XL', 'openai-community-gpt2-xl.csv', 1558e6),
]

gpt2_data = []
for name, filename, params in gpt2_models:
    results = parse_csv(knowledge_dir / filename)
    core = results['CORE']['centered']
    task_accs = [results[task]['accuracy'] for task in TASK_ORDER]
    gpt2_data.append({
        'name': name,
        'params': params,
        'task_accs': task_accs,
        'core': core,
    })

# Display as DataFrame
gpt2_df = pd.DataFrame([
    {
        'Model': d['name'],
        'Params': f"{d['params']/1e6:.0f}M",
        **{name: f"{acc:.1%}" for name, acc in zip(TASK_NAMES, d['task_accs'])},
        'CORE': f"{d['core']:.4f}"
    }
    for d in gpt2_data
])
print("GPT-2 Family: Raw Accuracies and CORE Scores")
gpt2_df
python
# Build feature matrix (centered accuracies)
X_gpt2 = []
y_gpt2 = []

for data in gpt2_data:
    centered_accs = []
    for task, acc in zip(TASK_ORDER, data['task_accs']):
        centered = center_accuracy(acc, BASELINES[task])
        centered_accs.append(centered)
    X_gpt2.append(centered_accs)
    y_gpt2.append(data['core'])

X_gpt2 = np.array(X_gpt2)
y_gpt2 = np.array(y_gpt2)

# Display centered accuracies
centered_df = pd.DataFrame(
    X_gpt2,
    columns=TASK_NAMES,
    index=[d['name'] for d in gpt2_data]
)
centered_df['Mean'] = X_gpt2.mean(axis=1)
centered_df['CORE'] = y_gpt2
print("GPT-2 Family: Centered Accuracies")
centered_df

Observation: The mean of the 6 centered accuracies is consistently higher than the actual CORE score. This makes sense because CORE includes 16 additional tasks (many quite difficult) that pull down the average.

Part 4: GPT-3 Data

We extract the 6 task accuracies from the GPT-3 paper's Appendix H (master results table).

Source: Table H.1 in "Language Models are Few-Shot Learners" (Brown et al., 2020)

python
# GPT-3 accuracies from the paper
# Format: [hellaswag_0shot, lambada_0shot, hellaswag_fewshot, piqa_fewshot, arc_easy_fewshot, arc_challenge_fewshot]
gpt3_models = [
    ('GPT-3 Small', 125e6, [0.337, 0.427, 0.335, 0.643, 0.427, 0.255]),
    ('GPT-3 Medium', 350e6, [0.436, 0.543, 0.431, 0.694, 0.510, 0.284]),
    ('GPT-3 Large', 760e6, [0.510, 0.604, 0.513, 0.720, 0.581, 0.323]),
    ('GPT-3 XL', 1.3e9, [0.547, 0.636, 0.549, 0.743, 0.591, 0.367]),
    ('GPT-3 2.7B', 2.7e9, [0.628, 0.671, 0.629, 0.754, 0.621, 0.395]),
    ('GPT-3 6.7B', 6.7e9, [0.674, 0.703, 0.673, 0.778, 0.658, 0.437]),
    ('GPT-3 13B', 13e9, [0.709, 0.725, 0.713, 0.799, 0.691, 0.448]),
    ('GPT-3 175B', 175e9, [0.789, 0.762, 0.793, 0.823, 0.701, 0.515]),
]

# Display raw accuracies
gpt3_df = pd.DataFrame([
    {
        'Model': name,
        'Params': f"{params/1e9:.1f}B" if params >= 1e9 else f"{params/1e6:.0f}M",
        **{task_name: f"{acc:.1%}" for task_name, acc in zip(TASK_NAMES, accs)}
    }
    for name, params, accs in gpt3_models
])
print("GPT-3 Family: Raw Accuracies from Paper")
gpt3_df
python
# Compute centered accuracies for GPT-3
X_gpt3 = []
for name, params, accs in gpt3_models:
    centered_accs = [center_accuracy(acc, BASELINES[task]) for task, acc in zip(TASK_ORDER, accs)]
    X_gpt3.append(centered_accs)

X_gpt3 = np.array(X_gpt3)

# Display
gpt3_centered_df = pd.DataFrame(
    X_gpt3,
    columns=TASK_NAMES,
    index=[m[0] for m in gpt3_models]
)
gpt3_centered_df['Mean'] = X_gpt3.mean(axis=1)
print("GPT-3 Family: Centered Accuracies")
gpt3_centered_df

Part 5: Regression Models

We fit two types of models:

  1. Simple Approach: Average the 6 centered accuracies, then fit a linear regression to CORE
  2. Multivariate Approach: Use all 6 features with Ridge regularization

Why Regularization?

We only have 4 calibration points (GPT-2 models) but 6 features + 1 intercept = 7 parameters. Without regularization, we get a perfect fit but with unstable, extreme weights. Ridge regression shrinks weights toward zero, preventing overfitting.
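The effect can be demonstrated on a toy problem with the same shape as our setup (4 samples, 6 features). The data here is synthetic, purely to illustrate how α shrinks the weights:

```python
import numpy as np

# Synthetic underdetermined problem: 4 samples, 6 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))
y = rng.normal(size=4)

def ridge_fit(X, y, alpha):
    """Closed-form ridge with an unregularized intercept."""
    n, d = X.shape
    Xa = np.column_stack([np.ones(n), X])
    R = alpha * np.eye(d + 1)
    R[0, 0] = 0.0  # leave the intercept unregularized
    return np.linalg.solve(Xa.T @ Xa + R, Xa.T @ y)

w_weak = ridge_fit(X, y, alpha=0.01)
w_strong = ridge_fit(X, y, alpha=10.0)
# Stronger regularization shrinks the weight vector toward zero:
print(np.linalg.norm(w_weak[1:]) > np.linalg.norm(w_strong[1:]))  # True
```

With α=0 the normal equations here would be singular (more parameters than samples); even a small α makes the solve well-posed.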

python
def simple_linear_regression(x, y):
    """Simple 1D linear regression: y = a*x + b"""
    mean_x, mean_y = np.mean(x), np.mean(y)
    a = np.sum((x - mean_x) * (y - mean_y)) / np.sum((x - mean_x) ** 2)
    b = mean_y - a * mean_x
    return a, b

def ridge_regression(X, y, alpha=0.1):
    """
    Ridge regression: minimize ||Xw - y||² + α||w||²
    We don't regularize the intercept.
    """
    n_samples, n_features = X.shape
    X_aug = np.column_stack([np.ones(n_samples), X])
    reg_matrix = alpha * np.eye(n_features + 1)
    reg_matrix[0, 0] = 0  # Don't regularize intercept
    coeffs = np.linalg.solve(X_aug.T @ X_aug + reg_matrix, X_aug.T @ y)
    return coeffs[0], coeffs[1:]  # intercept, weights

def compute_r_squared(y_true, y_pred):
    """Compute R² score."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot
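A quick sanity check of the R² helper, with illustrative values (the function is redefined here so the snippet runs standalone):

```python
import numpy as np

def compute_r_squared(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

y = np.array([0.12, 0.19, 0.24, 0.28])
print(compute_r_squared(y, y))                          # 1.0 (perfect fit)
print(compute_r_squared(y, np.full_like(y, y.mean())))  # 0.0 (mean predictor)
```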

Approach 1: Simple Averaging

python
# Compute average of 6 centered accuracies
avg_centered_gpt2 = X_gpt2.mean(axis=1)

# Fit linear regression
slope, intercept = simple_linear_regression(avg_centered_gpt2, y_gpt2)
print(f"Simple Model: CORE = {slope:.4f} × avg_centered + {intercept:.4f}")

# Validate
y_pred_simple = slope * avg_centered_gpt2 + intercept
r2_simple = compute_r_squared(y_gpt2, y_pred_simple)

validation_df = pd.DataFrame({
    'Model': [d['name'] for d in gpt2_data],
    'Avg Centered': avg_centered_gpt2,
    'Predicted': y_pred_simple,
    'Actual': y_gpt2,
    'Error': y_pred_simple - y_gpt2
})
print(f"\nR² = {r2_simple:.4f}")
validation_df

Result: R² = 0.996 — excellent fit with just 2 parameters. The simple averaging approach works very well.
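With only four calibration points, a leave-one-out check is a cheap way to probe whether the 2-parameter model generalizes rather than memorizes. This check is not part of the notebook; the values below are hypothetical stand-ins for the four GPT-2 (avg centered, CORE) pairs:

```python
import numpy as np

def simple_linear_regression(x, y):
    mean_x, mean_y = np.mean(x), np.mean(y)
    a = np.sum((x - mean_x) * (y - mean_y)) / np.sum((x - mean_x) ** 2)
    b = mean_y - a * mean_x
    return a, b

# Hypothetical stand-ins for the 4 calibration points
avg_centered = np.array([0.18, 0.26, 0.31, 0.36])
core = np.array([0.12, 0.19, 0.24, 0.28])

loo_errors = []
for i in range(len(core)):
    mask = np.arange(len(core)) != i  # hold out point i
    a, b = simple_linear_regression(avg_centered[mask], core[mask])
    loo_errors.append(a * avg_centered[i] + b - core[i])

# Small held-out errors suggest the fit is not just memorizing 4 points
print(np.round(loo_errors, 4))
```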

Approach 2: Multivariate Ridge Regression

We try different regularization strengths (α) to find a good balance between fit and stability.

python
# Try different regularization strengths
alphas = [0.0, 0.001, 0.01, 0.1, 1.0]

results = []
for alpha in alphas:
    intercept_r, weights = ridge_regression(X_gpt2, y_gpt2, alpha=alpha)
    y_pred = X_gpt2 @ weights + intercept_r
    r2 = compute_r_squared(y_gpt2, y_pred)
    weight_norm = np.sqrt(np.sum(weights ** 2))
    results.append({
        'α': alpha,
        'R²': r2,
        '||weights||': weight_norm,
        'Intercept': intercept_r,
        'Weights': weights.copy()
    })

alpha_df = pd.DataFrame([{k: v for k, v in r.items() if k != 'Weights'} for r in results])
print("Effect of Regularization Strength:")
alpha_df
python
# Show weights for each alpha
print("Task Weights by Regularization Strength:")
weights_df = pd.DataFrame(
    [r['Weights'] for r in results],
    columns=TASK_NAMES,
    index=[f"α={r['α']}" for r in results]
)
weights_df

Observations:

  • α=0 (no regularization): Perfect fit (R²=1.0) but extreme weights (+18, -22) — clearly overfitting
  • α=0.001: Still near-perfect fit with very large weights
  • α=0.01: Excellent fit (R²=0.99) with reasonable weights (~0.1 each) — good choice
  • α=0.1: Good fit (R²=0.84) with uniform weights (~0.06 each) — conservative
  • α=1.0: Poor fit (R²=0.25) — over-regularized
python
# Use α=0.01 as our chosen regularization
# This gives R²≈0.99 with reasonable, stable weights (~0.1 each task)
CHOSEN_ALPHA = 0.01
intercept_ridge, weights_ridge = ridge_regression(X_gpt2, y_gpt2, alpha=CHOSEN_ALPHA)

print(f"Ridge Model (α={CHOSEN_ALPHA}):")
print(f"  Intercept: {intercept_ridge:.4f}")
print(f"  Weights:")
for name, w in zip(TASK_NAMES, weights_ridge):
    print(f"    {name:20s}: {w:+.4f}")

# Validate
y_pred_ridge = X_gpt2 @ weights_ridge + intercept_ridge
r2_ridge = compute_r_squared(y_gpt2, y_pred_ridge)
print(f"\nR² = {r2_ridge:.4f}")

Approach 3: Individual Task Analysis

Which single task is the best predictor of CORE? We fit separate linear regressions for each task.

python
# Fit separate linear regression for each task
individual_results = []
for i, task_name in enumerate(TASK_NAMES):
    x_task = X_gpt2[:, i]
    slope_ind, intercept_ind = simple_linear_regression(x_task, y_gpt2)
    y_pred_ind = slope_ind * x_task + intercept_ind
    r2_ind = compute_r_squared(y_gpt2, y_pred_ind)
    individual_results.append({
        'Task': task_name,
        'R²': r2_ind,
        'Slope': slope_ind,
        'Intercept': intercept_ind
    })

individual_df = pd.DataFrame(individual_results).sort_values('R²', ascending=False)
print("Individual Task Correlations with CORE:")
individual_df

Key Finding: All 6 tasks have very high correlation with CORE (R² > 0.96), but PIQA is the single best predictor with R² = 0.9961 — actually slightly better than the simple averaging approach (R² = 0.9960)!

This is useful if you want a quick proxy for CORE with minimal evaluation cost. However, for robustness we still recommend using all 6 tasks or the averaged approaches.
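A sketch of what such a single-task proxy looks like. The slope and intercept below are hypothetical placeholders for illustration, not the fitted values from this notebook:

```python
def center_accuracy(acc, baseline):
    return (acc - baseline) / (1.0 - baseline)

def core_from_piqa(piqa_acc, slope=0.85, intercept=-0.05):
    # PIQA is a binary choice task, so its random baseline is 0.50.
    # slope/intercept are illustrative placeholders only.
    return slope * center_accuracy(piqa_acc, 0.50) + intercept

print(round(core_from_piqa(0.72), 4))  # 0.324 with the placeholder coefficients
```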

Part 6: Final Estimates for GPT-3

We apply both models to GPT-3 data and report the average as our final estimate.

python
# Apply all three approaches
avg_centered_gpt3 = X_gpt3.mean(axis=1)
gpt3_core_simple = slope * avg_centered_gpt3 + intercept
gpt3_core_ridge = X_gpt3 @ weights_ridge + intercept_ridge

# Approach 3: Best individual predictor (PIQA)
piqa_idx = TASK_NAMES.index('PIQA')
piqa_model = [r for r in individual_results if r['Task'] == 'PIQA'][0]
gpt3_core_piqa = piqa_model['Slope'] * X_gpt3[:, piqa_idx] + piqa_model['Intercept']

# Average of approaches 1 and 2
gpt3_core_final = (gpt3_core_simple + gpt3_core_ridge) / 2

# Create results table with all approaches
results_df = pd.DataFrame({
    'Model': [m[0] for m in gpt3_models],
    'Params': [f"{m[1]/1e9:.1f}B" if m[1] >= 1e9 else f"{m[1]/1e6:.0f}M" for m in gpt3_models],
    'Simple': gpt3_core_simple,
    'Ridge': gpt3_core_ridge,
    'PIQA only': gpt3_core_piqa,
    'Avg(1,2)': gpt3_core_final
})
print("GPT-3 CORE Estimates (all three approaches):")
results_df

Final CORE Estimates for GPT-3

python
# Combine with GPT-2 for complete picture
all_models = []

for data in gpt2_data:
    params = data['params']
    all_models.append({
        'Model': data['name'],
        'Family': 'GPT-2',
        'Params': params,
        'Params_str': f"{params/1e9:.1f}B" if params >= 1e9 else f"{params/1e6:.0f}M",
        'CORE': data['core'],
        'Source': 'Measured'
    })

for (name, params, _), core in zip(gpt3_models, gpt3_core_final):
    all_models.append({
        'Model': name,
        'Family': 'GPT-3',
        'Params': params,
        'Params_str': f"{params/1e9:.1f}B" if params >= 1e9 else f"{params/1e6:.0f}M",
        'CORE': core,
        'Source': 'Estimated'
    })

# Sort by params and display
all_models.sort(key=lambda x: x['Params'])
final_df = pd.DataFrame(all_models)[['Model', 'Params_str', 'CORE', 'Source']]
final_df.columns = ['Model', 'Params', 'CORE', 'Source']
print("Complete CORE Scores (GPT-2 measured, GPT-3 estimated):")
final_df

Head-to-Head: GPT-2 vs GPT-3 at Similar Sizes

python
comparisons = [
    ('~125M', 'GPT-2', gpt2_data[0]['core'], 'GPT-3 Small', gpt3_core_final[0]),
    ('~350M', 'GPT-2 Medium', gpt2_data[1]['core'], 'GPT-3 Medium', gpt3_core_final[1]),
    ('~760M', 'GPT-2 Large', gpt2_data[2]['core'], 'GPT-3 Large', gpt3_core_final[2]),
    ('~1.3-1.5B', 'GPT-2 XL', gpt2_data[3]['core'], 'GPT-3 XL', gpt3_core_final[3]),
]

comparison_df = pd.DataFrame([
    {
        'Size': size,
        'GPT-2 CORE': gpt2_core,
        'GPT-3 CORE': gpt3_core,
        'Δ': gpt3_core - gpt2_core,
        'Improvement': f"{100 * (gpt3_core - gpt2_core) / gpt2_core:+.1f}%"
    }
    for size, _, gpt2_core, _, gpt3_core in comparisons
])
print("GPT-3 vs GPT-2 at Similar Model Sizes:")
comparison_df

Conclusions

Methodology

We estimated CORE scores for GPT-3 models by:

  1. Identifying 6 tasks with comparable evaluation methodology between GPT-3 and CORE
  2. Using GPT-2's measured CORE scores as calibration data
  3. Fitting three regression approaches:
    • Simple: Average the 6 metrics, then linear regression (R²=0.996)
    • Ridge: Use all 6 features with regularization (R²=0.992)
    • PIQA only: Single best predictor (R²=0.996)
  4. Averaging the Simple and Ridge approaches for final estimates

Key Findings

  1. GPT-3 consistently outperforms GPT-2 at similar model sizes by approximately 0.03-0.05 CORE (14-30% relative improvement)

  2. PIQA is the best single predictor of CORE (R²=0.9961). If you need a quick proxy for CORE with minimal evaluation cost, PIQA alone works nearly as well as averaging all 6 tasks.

  3. The improvement likely comes from:

    • More training data (300B tokens vs ~100B for GPT-2)
    • Better data quality and filtering
    • Larger context length (2048 vs 1024)
  4. Final estimated CORE scores:

| Model | Params | Estimated CORE |
| --- | --- | --- |
| GPT-3 Small | 125M | 0.148 |
| GPT-3 Medium | 350M | 0.216 |
| GPT-3 Large | 760M | 0.266 |
| GPT-3 XL | 1.3B | 0.291 |
| GPT-3 2.7B | 2.7B | 0.329 |
| GPT-3 6.7B | 6.7B | 0.361 |
| GPT-3 13B | 13B | 0.385 |
| GPT-3 175B | 175B | 0.427 |

Caveats

  1. These are estimates, not measured values. True CORE scores could differ.
  2. We only have 4 calibration points, limiting statistical power.
  3. The 6 overlapping tasks may not perfectly represent all 22 CORE tasks.
  4. Slight differences in evaluation methodology (K values, splits) add uncertainty.

Despite these limitations, the estimates are useful for approximate comparisons between nanochat models and the GPT-3 family.

Appendix: Export Final Estimates

python
# Export as a simple dict for use elsewhere
gpt3_core_estimates = {
    'GPT-3 Small (125M)': round(gpt3_core_final[0], 4),
    'GPT-3 Medium (350M)': round(gpt3_core_final[1], 4),
    'GPT-3 Large (760M)': round(gpt3_core_final[2], 4),
    'GPT-3 XL (1.3B)': round(gpt3_core_final[3], 4),
    'GPT-3 2.7B': round(gpt3_core_final[4], 4),
    'GPT-3 6.7B': round(gpt3_core_final[5], 4),
    'GPT-3 13B': round(gpt3_core_final[6], 4),
    'GPT-3 175B': round(gpt3_core_final[7], 4),
}

print("GPT-3 CORE Estimates (for copy-paste):")
import json
print(json.dumps(gpt3_core_estimates, indent=4))