ML Paper Writing Philosophy & Best Practices

This reference compiles writing advice from prominent ML researchers including Neel Nanda, Andrej Karpathy, Sebastian Farquhar, Zachary Lipton, and Jacob Steinhardt.

The Narrative Principle
Time Allocation
Abstract Writing Formula
Introduction Structure
Sentence-Level Clarity
Micro-Level Writing Tips
Word Choice and Precision
Mathematical Writing
Figure Design
Common Mistakes to Avoid
Pre-Submission Checklist

The Narrative Principle

From Neel Nanda

"A paper is a short, rigorous, evidence-based technical story with a takeaway readers care about."

The narrative rests on three pillars that must be crystal clear by the end of your introduction:

The "What": One to three specific novel claims fitting within a cohesive theme. Vague contributions like "we study X" fail immediately—reviewers need precise, falsifiable claims.

The "Why": Rigorous empirical evidence that convincingly supports those claims, including strong baselines honestly tuned and experiments that distinguish between competing hypotheses rather than merely showing "decent results."

The "So What": Why readers should care, connecting your contribution to problems the community recognizes as important.

From Andrej Karpathy

"A paper is not a random collection of experiments you report on. The paper sells a single thing that was not obvious or present before. The entire paper is organized around this core contribution with surgical precision."

This applies whether you're presenting a new architecture, a theoretical result, or improved understanding of existing methods—NeurIPS explicitly notes that "originality does not necessarily require an entirely new method."

Practical Implication: If you cannot state your contribution in one sentence, you don't yet have a paper. Everything else—experiments, related work, discussion—exists only to support that core claim.

Time Allocation

From Neel Nanda

Spend approximately the same amount of time on each of:

The abstract
The introduction
The figures
Everything else combined

This isn't hyperbole—most reviewers form preliminary judgments before reaching your methods section. Readers encounter your paper in a predictable pattern: title → abstract → introduction → figures → maybe the rest.

Reviewer Reading Patterns

Studies of reviewer behavior show:

Abstract is read 100% of the time
Introduction is skimmed by 90%+ of reviewers
Figures are examined before methods by most reviewers
Full methods are read only if interest is established

Implication: Front-load your paper's value. Don't bury the contribution.

Abstract Writing Formula

Sebastian Farquhar's 5-Sentence Formula

What you achieved: "We introduce...", "We prove...", "We demonstrate..."
Why this is hard and important
How you do it (with specialist keywords for discoverability)
What evidence you have
Your most remarkable number/result

Example (Good Abstract)

We prove that gradient descent on overparameterized neural networks
converges to global minima at a linear rate. [What]
This resolves a fundamental question about why deep learning works
despite non-convex optimization landscapes. [Why hard/important]
Our proof relies on showing that the Neural Tangent Kernel remains
approximately constant during training, reducing the problem to
kernel regression. [How with keywords]
We validate our theory on CIFAR-10 and ImageNet, showing that
predicted convergence rates match experiments within 5%. [Evidence]
This is the first polynomial-time convergence guarantee for
networks with practical depth and width. [Remarkable result]

What to Avoid

From Zachary Lipton: "If the first sentence can be pre-pended to any ML paper, delete it."

Delete these openings:

"Large language models have achieved remarkable success..."
"Deep learning has revolutionized..."
"In recent years, neural networks have..."

Start with your specific contribution instead.

Introduction Structure

Requirements

1-1.5 pages maximum (in two-column format)
Methods should start by page 2-3
Must include 2-4 bullet contribution list (max 1-2 lines each)

Structure Template

markdown

1. Opening Hook (2-3 sentences)
   - State the problem your paper addresses
   - Why it matters RIGHT NOW

2. Background/Challenge (1 paragraph)
   - What makes this problem hard?
   - What have others tried? Why is it insufficient?

3. Your Approach (1 paragraph)
   - What do you do differently?
   - Key insight that enables your contribution

4. Contribution Bullets (2-4 items)
   - Be specific and falsifiable
   - Each bullet: 1-2 lines maximum

5. Results Preview (2-3 sentences)
   - Most impressive numbers
   - Scope of evaluation

6. Paper Organization (optional, 1-2 sentences)
   - "Section 2 presents... Section 3 describes..."

Contribution Bullets: Good vs Bad

Good:

We prove that X converges in O(n log n) time under assumption Y
We introduce Z, a 3-layer architecture that reduces memory by 40%
We demonstrate that A outperforms B by 15% on benchmark C

Bad:

We study the problem of X (not a contribution)
We provide extensive experiments (too vague)
We make several contributions to the field (says nothing)

Sentence-Level Clarity

From Gopen & Swan: "The Science of Scientific Writing"

The seminal 1990 paper by George Gopen and Judith Swan establishes that readers have structural expectations about where information appears in prose. Violating these expectations forces readers to spend energy on structure rather than content.

"If the reader is to grasp what the writer means, the writer must understand what the reader needs."

The 7 Principles of Reader Expectations

Principle 1: Subject-Verb Proximity

Keep grammatical subject and verb close together. Anything intervening reads as interruption of lesser importance.

Weak: "The model, which was trained on 100M tokens and fine-tuned on domain-specific data using LoRA with rank 16, achieves state-of-the-art results"

Strong: "The model achieves state-of-the-art results after training on 100M tokens and fine-tuning with LoRA (rank 16)"

Principle 2: Stress Position (Save the Best for Last)

Readers naturally emphasize the last words of a sentence. Place your most important information there.

Weak: "Accuracy improves by 15% when using attention"

Strong: "When using attention, accuracy improves by 15%"

Principle 3: Topic Position (First Things First)

The beginning of a sentence establishes perspective. Put the "whose story" element first—readers expect the sentence to be about whoever shows up first.

Weak: "A novel attention mechanism that computes alignment scores is introduced"

Strong: "To address the alignment problem, we introduce a novel attention mechanism"

Principle 4: Old Information Before New

Put familiar information (old) in the topic position for backward linkage; put new information in the stress position for emphasis.

Weak: "Sparse attention was introduced by Child et al. The quadratic complexity of standard attention motivates this work."

Strong: "Standard attention has quadratic complexity. To address this, Child et al. introduced sparse attention."

Principle 5: One Unit, One Function

Each unit of discourse (sentence, paragraph, section) should serve a single function. If you have two points, use two units.

Principle 6: Articulate Action in the Verb

Express the action of each sentence in its verb, not in nominalized nouns.

Weak: "We performed an analysis of the results" (nominalization)

Strong: "We analyzed the results" (action in verb)

Principle 7: Context Before New Information

Provide context before asking the reader to consider anything new. This applies at all levels—sentence, paragraph, section.

Weak: "Equation 3 shows that convergence is guaranteed when the learning rate satisfies..."

Strong: "For convergence to be guaranteed, the learning rate must satisfy the condition in Equation 3..."

Summary Table

Principle	Rule	Mnemonic
Subject-Verb Proximity	Keep subject and verb close	"Don't interrupt yourself"
Stress Position	Emphasis at sentence end	"Save the best for last"
Topic Position	Context at sentence start	"First things first"
Old Before New	Familiar → unfamiliar	"Build on known ground"
One Unit, One Function	Each paragraph = one point	"One idea per container"
Action in Verb	Use verbs, not nominalizations	"Verbs do, nouns sit"
Context Before New	Explain before presenting	"Set the stage first"

Micro-Level Writing Tips

From Ethan Perez (Anthropic)

These practical micro-level tips improve clarity at the sentence and word level.

Pronoun Management

Minimize pronouns ("this," "it," "these," "that"). When pronouns are necessary, use them as adjectives with a noun:

Weak: "This shows that the model converges."

Strong: "This result shows that the model converges."

Weak: "It improves performance."

Strong: "This modification improves performance."

Verb Placement

Position verbs early in sentences for better parsing:

Weak: "The gradient, after being computed and normalized, updates the weights."

Strong: "The gradient updates the weights after being computed and normalized."

Apostrophe Unfolding

Transform possessive constructions for clarity:

Original: "X's Y" → Unfolded: "The Y of X"

Before: "The model's accuracy on the test set"

After: "The accuracy of the model on the test set"

This isn't always better, but when sentences feel awkward, try unfolding.

Words to Eliminate

Delete these filler words in almost all cases:

"actually"
"a bit"
"fortunately" / "unfortunately"
"very" / "really"
"quite"
"basically"
"essentially"
Excessive connectives ("however," "moreover," "furthermore" when not needed)
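
The list above can be turned into a quick draft check. The following is a minimal sketch, assuming plain-text input; `FILLER_WORDS`, `FILLER_PATTERN`, and `find_fillers` are hypothetical names, and the connectives are left out because whether they are excessive needs human judgment.

```python
import re

# Filler words from the list above (connectives omitted on purpose).
FILLER_WORDS = [
    "actually", "a bit", "fortunately", "unfortunately",
    "very", "really", "quite", "basically", "essentially",
]

# Whole-word, case-insensitive match for any filler word.
FILLER_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(w) for w in FILLER_WORDS) + r")\b",
    re.IGNORECASE,
)


def find_fillers(text):
    """Return (line_number, word) pairs for every filler word in the text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in FILLER_PATTERN.finditer(line):
            hits.append((lineno, match.group(0).lower()))
    return hits


if __name__ == "__main__":
    draft = "This is actually very good.\nUnfortunately, it fails."
    for lineno, word in find_fillers(draft):
        print(f"line {lineno}: consider deleting '{word}'")
```

Treat each hit as a prompt to reread the sentence, not as an automatic deletion; the guidance says "almost all cases," not all.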

Sentence Construction Rules

One idea per sentence - If struggling to express an idea in one sentence, it needs two
No repeated sounds - Avoid similar-sounding words in the same sentence
Every sentence adds information - Delete sentences that merely restate
Active voice always - Specify the actor ("We find..." not "It is found...")
Expand contractions - "don't" → "do not" for formality

Paragraph Architecture

First sentence: State the point clearly
Middle sentences: Support with evidence
Last sentence: Reinforce or transition

Don't bury key information in the middle of paragraphs.

Word Choice and Precision

From Zachary Lipton

Eliminate hedging unless genuine uncertainty exists:

Delete "may" and "can" unless necessary
"provides very tight approximation" drips with insecurity
"provides tight approximation" is confident

Avoid vacuous intensifiers:

Delete: very, extremely, highly, significantly (unless statistical)
These words signal insecurity, not strength

From Jacob Steinhardt

Precision over brevity: Replace vague terms with specific ones.

Vague	Specific
performance	accuracy, latency, throughput
improves	increases accuracy by X%, reduces latency by Y
large	1B parameters, 100M tokens
fast	3x faster, 50ms latency
good results	92% accuracy, 0.85 F1

Consistent terminology: Referring to the same concept with different terms creates confusion.

Choose one and stick with it:

"model" vs "network" vs "architecture"
"training" vs "learning" vs "optimization"
"sample" vs "example" vs "instance"

Vocabulary Signaling

Avoid words signaling incremental work:

Never: "combine," "modify," "expand," "extend"
Instead: "develop," "propose," "introduce"

Why: "We combine X and Y" sounds like you stapled two existing ideas together. "We develop a method that leverages X for Y" sounds like genuine contribution.

Mathematical Writing

From Ethan Perez

Unfold apostrophes for clarity:

Weak: "X's Y"
Strong: "The Y of X"

Example: "the model's accuracy" → "the accuracy of the model"

General Principles

State all assumptions formally before theorems
Provide intuitive explanations alongside proofs
Use consistent notation throughout the paper
Define symbols at first use

Notation Conventions

latex

% Scalars: lowercase italic
$x$, $y$, $\alpha$, $\beta$

% Vectors: lowercase bold
$\mathbf{x}$, $\mathbf{v}$

% Matrices: uppercase bold
$\mathbf{W}$, $\mathbf{X}$

% Sets: uppercase calligraphic
$\mathcal{X}$, $\mathcal{D}$

% Functions: roman for named functions
$\mathrm{softmax}$, $\mathrm{ReLU}$
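
A short illustration of these conventions used together may help. This is only a sketch: the macro names `\vx`, `\mW`, and `\sD` are hypothetical shorthands, not part of any venue's style file, and the classifier is an arbitrary example.

```latex
% Hypothetical shorthand macros (names are illustrative only)
\newcommand{\vx}{\mathbf{x}}   % vector: lowercase bold
\newcommand{\mW}{\mathbf{W}}   % matrix: uppercase bold
\newcommand{\sD}{\mathcal{D}}  % set: uppercase calligraphic

% A linear classifier written with the conventions above
Given a dataset $\sD = \{(\vx_i, y_i)\}_{i=1}^{n}$ with inputs
$\vx_i \in \mathcal{X}$, the model predicts
\[
  \hat{\mathbf{y}} = \mathrm{softmax}(\alpha \, \mW \vx),
\]
where $\alpha$ is a scalar scaling factor.
```

Defining each symbol class once as a macro makes it easier to keep notation consistent if a convention changes late in writing.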

Figure Design

From Neel Nanda

Figures should tell a coherent story even if the reader skips the text. Many readers DO skip the text initially.

Design Principles

Figure 1 is crucial: Often the first thing readers examine after abstract
Self-contained captions: Reader should understand figure without main text
No title inside figure: The caption serves this function (ICML/NeurIPS rule)
Vector graphics: PDF/EPS for plots, PNG (600 DPI) only for photographs

Accessibility Requirements

8% of men have color vision deficiency. Your figures must work for them.

Solutions:

Use colorblind-safe palettes: Okabe-Ito or Paul Tol
Avoid red-green combinations
Verify figures work in grayscale
Use different line styles (solid, dashed, dotted) in addition to colors
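
The solutions above can be sketched in plain Python. This sketch assumes the commonly published Okabe-Ito hex values; `series_styles` and `hex_to_gray` are hypothetical helper names, and the Rec. 601 luma weights are one common grayscale approximation, not what any particular printer or viewer uses.

```python
from itertools import cycle

# Okabe-Ito palette (hex values as commonly published).
OKABE_ITO = {
    "black": "#000000",
    "orange": "#E69F00",
    "sky_blue": "#56B4E9",
    "bluish_green": "#009E73",
    "yellow": "#F0E442",
    "blue": "#0072B2",
    "vermillion": "#D55E00",
    "reddish_purple": "#CC79A7",
}

# Solid, dashed, dotted, dash-dot (Matplotlib linestyle strings).
LINE_STYLES = ["-", "--", ":", "-."]


def hex_to_gray(hex_color):
    """Approximate grayscale value (0-255) using Rec. 601 luma weights."""
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (1, 3, 5))
    return 0.299 * r + 0.587 * g + 0.114 * b


def series_styles(n):
    """Return n (color, linestyle) pairs so series differ in color and line style."""
    colors = cycle(OKABE_ITO.values())
    styles = cycle(LINE_STYLES)
    return [(next(colors), next(styles)) for _ in range(n)]


if __name__ == "__main__":
    for color, style in series_styles(4):
        print(color, repr(style), f"gray={hex_to_gray(color):.0f}")
```

Each pair can be passed to Matplotlib as `ax.plot(x, y, color=color, linestyle=style)`. Comparing the printed gray values gives a rough indication of which series would be hard to tell apart in a grayscale printout.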

Tools

python

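# SciencePlots: publication-ready Matplotlib styles (pip install SciencePlots)
import matplotlib.pyplot as plt
import scienceplots  # noqa: F401 (importing registers the styles; required since SciencePlots 2.0)

# IEEE-style figures
plt.style.use(['science', 'ieee'])

# Or, for Nature-style figures (a later call replaces the earlier style)
plt.style.use(['science', 'nature'])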

Common Mistakes to Avoid

Structure Mistakes

Mistake	Solution
Introduction too long (>1.5 pages)	Move background to Related Work
Methods buried (after page 3)	Front-load contribution, cut intro
Missing contribution bullets	Add 2-4 specific, falsifiable claims
Experiments without explicit claims	State what each experiment tests

Writing Mistakes

Mistake	Solution
Generic abstract opening	Start with your specific contribution
Inconsistent terminology	Choose one term per concept
Passive voice overuse	Use active voice: "We show" not "It is shown"
Hedging everywhere	Be confident unless genuinely uncertain

Figure Mistakes

Mistake	Solution
Raster graphics for plots	Use vector (PDF/EPS)
Red-green color scheme	Use colorblind-safe palette
Title inside figure	Put title in caption
Captions require main text	Make captions self-contained

Citation Mistakes

Mistake	Solution
Paper-by-paper Related Work	Organize methodologically
Missing relevant citations	Cite generously; reviewers are often authors of the related work
AI-generated citations	Always verify each reference against a bibliographic API
Inconsistent citation format	Use BibLaTeX with consistent keys
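
One way to verify a citation via an API is sketched below. It assumes the public Crossref REST API (`https://api.crossref.org/works` with the `query.bibliographic` and `rows` parameters); the function names and the 0.9 similarity threshold are hypothetical choices for illustration, not a recommendation from the sources above.

```python
import difflib
import json
import re
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works"


def crossref_query_url(title, rows=3):
    """Build a Crossref search URL for a cited title."""
    params = urllib.parse.urlencode({"query.bibliographic": title, "rows": rows})
    return f"{CROSSREF_WORKS}?{params}"


def normalize(title):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", title.lower()).split())


def titles_match(cited, found, threshold=0.9):
    """Fuzzy title comparison; the threshold is an assumed, tunable value."""
    ratio = difflib.SequenceMatcher(None, normalize(cited), normalize(found)).ratio()
    return ratio >= threshold


def citation_exists(title):
    """Query Crossref (needs network access) and look for a matching title."""
    with urllib.request.urlopen(crossref_query_url(title), timeout=10) as resp:
        items = json.load(resp)["message"]["items"]
    return any(titles_match(title, t) for item in items for t in item.get("title", []))
```

A miss is only a prompt to check the reference by hand, since not every preprint is indexed by a given service; a hit still warrants confirming authors, year, and venue.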

Pre-Submission Checklist

Before submitting, verify:

Narrative:

Can state contribution in one sentence
Three pillars (What/Why/So What) clear in intro
Every experiment supports a specific claim

Structure:

Abstract follows 5-sentence formula
Introduction ≤1.5 pages
Methods start by page 2-3
2-4 contribution bullets included
Limitations section present

Writing:

Consistent terminology throughout
No generic opening sentences
Hedging removed unless necessary
All figures have self-contained captions

Technical:

All citations verified via API
Error bars included with methodology
Compute resources documented
Code/data availability stated
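
For the error-bar item, a minimal sketch of one common methodology follows: the mean and standard error of the mean over independent runs. The helper names and the accuracy values are hypothetical; other choices, such as bootstrap confidence intervals, are equally valid as long as the paper states which one was used.

```python
import math
import statistics


def mean_and_stderr(values):
    """Return (mean, standard error of the mean) over independent runs."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    mean = statistics.mean(values)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    stderr = statistics.stdev(values) / math.sqrt(len(values))
    return mean, stderr


def format_result(values, digits=3):
    """Format a result together with the methodology behind the error bar."""
    mean, stderr = mean_and_stderr(values)
    return f"{mean:.{digits}f} ± {stderr:.{digits}f} (mean ± s.e.m., n={len(values)})"


if __name__ == "__main__":
    # Hypothetical test accuracies from three random seeds.
    print(format_result([0.90, 0.92, 0.94]))
```

Stating the number of runs and the type of error bar in the caption keeps the figure or table self-contained.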
