skills/research/research-paper-writing/references/writing-guide.md
This reference compiles writing advice from prominent ML researchers including Neel Nanda, Andrej Karpathy, Sebastian Farquhar, Zachary Lipton, and Jacob Steinhardt.
"A paper is a short, rigorous, evidence-based technical story with a takeaway readers care about."
The narrative rests on three pillars that must be crystal clear by the end of your introduction:
The "What": One to three specific novel claims fitting within a cohesive theme. Vague contributions like "we study X" fail immediately—reviewers need precise, falsifiable claims.
The "Why": Rigorous empirical evidence that convincingly supports those claims, including strong baselines honestly tuned and experiments that distinguish between competing hypotheses rather than merely showing "decent results."
The "So What": Why readers should care, connecting your contribution to problems the community recognizes as important.
"A paper is not a random collection of experiments you report on. The paper sells a single thing that was not obvious or present before. The entire paper is organized around this core contribution with surgical precision."
This applies whether you're presenting a new architecture, a theoretical result, or improved understanding of existing methods—NeurIPS explicitly notes that "originality does not necessarily require an entirely new method."
Practical Implication: If you cannot state your contribution in one sentence, you don't yet have a paper. Everything else—experiments, related work, discussion—exists only to support that core claim.
Spend approximately the same amount of time on each of:
This isn't hyperbole—most reviewers form preliminary judgments before reaching your methods section. Readers encounter your paper in a predictable pattern: title → abstract → introduction → figures → maybe the rest.
Studies of reviewer behavior show:
Implication: Front-load your paper's value. Don't bury the contribution.
We prove that gradient descent on overparameterized neural networks
converges to global minima at a linear rate. [What]
This resolves a fundamental question about why deep learning works
despite non-convex optimization landscapes. [Why hard/important]
Our proof relies on showing that the Neural Tangent Kernel remains
approximately constant during training, reducing the problem to
kernel regression. [How with keywords]
We validate our theory on CIFAR-10 and ImageNet, showing that
predicted convergence rates match experiments within 5%. [Evidence]
This is the first polynomial-time convergence guarantee for
networks with practical depth and width. [Remarkable result]
From Zachary Lipton: "If the first sentence can be pre-pended to any ML paper, delete it."
Delete these openings:
Start with your specific contribution instead.
1. Opening Hook (2-3 sentences)
   - State the problem your paper addresses
   - Why it matters RIGHT NOW
2. Background/Challenge (1 paragraph)
   - What makes this problem hard?
   - What have others tried? Why is it insufficient?
3. Your Approach (1 paragraph)
   - What do you do differently?
   - Key insight that enables your contribution
4. Contribution Bullets (2-4 items)
   - Be specific and falsifiable
   - Each bullet: 1-2 lines maximum
5. Results Preview (2-3 sentences)
   - Most impressive numbers
   - Scope of evaluation
6. Paper Organization (optional, 1-2 sentences)
   - "Section 2 presents... Section 3 describes..."
Good:
Bad:
The seminal 1990 paper by George Gopen and Judith Swan, "The Science of Scientific Writing," establishes that readers have structural expectations about where information appears in prose. Violating these expectations forces readers to spend energy on structure rather than content.
"If the reader is to grasp what the writer means, the writer must understand what the reader needs."
Principle 1: Subject-Verb Proximity
Keep the grammatical subject and its verb close together. Anything that intervenes reads as an interruption of lesser importance.
Weak: "The model, which was trained on 100M tokens and fine-tuned on domain-specific data using LoRA with rank 16, achieves state-of-the-art results"
Strong: "The model achieves state-of-the-art results after training on 100M tokens and fine-tuning with LoRA (rank 16)"
Principle 2: Stress Position (Save the Best for Last)
Readers naturally emphasize the last words of a sentence. Place your most important information there.
Weak: "Accuracy improves by 15% when using attention"
Strong: "When using attention, accuracy improves by 15%"
Principle 3: Topic Position (First Things First)
The beginning of a sentence establishes perspective. Put the "whose story" element first—readers expect the sentence to be about whoever shows up first.
Weak: "A novel attention mechanism that computes alignment scores is introduced"
Strong: "To address the alignment problem, we introduce a novel attention mechanism"
Principle 4: Old Information Before New
Put familiar information (old) in the topic position for backward linkage; put new information in the stress position for emphasis.
Weak: "Sparse attention was introduced by Child et al. The quadratic complexity of standard attention motivates this work."
Strong: "Standard attention has quadratic complexity. To address this, Child et al. introduced sparse attention."
Principle 5: One Unit, One Function
Each unit of discourse (sentence, paragraph, section) should serve a single function. If you have two points, use two units.
Principle 6: Articulate Action in the Verb
Express the action of each sentence in its verb, not in nominalized nouns.
Weak: "We performed an analysis of the results" (nominalization)
Strong: "We analyzed the results" (action in verb)
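A rough way to hunt for nominalizations in a draft is a pattern search for a light verb followed by an abstract noun. The sketch below is a heuristic, not a grammar checker: the verb and suffix lists are illustrative assumptions, and `find_nominalizations` is a hypothetical helper name. Expect both false positives and misses.

```python
import re

# Assumed, illustrative lists: light verbs that often carry a nominalized
# action, and noun suffixes that often mark a nominalization.
LIGHT_VERBS = r"(?:perform(?:ed|s)?|conduct(?:ed|s)?|carr(?:y|ied|ies) out|ma(?:ke|de|kes))"
NOMINAL_SUFFIXES = r"(?:tion|sion|ment|ance|ence|sis)"

PATTERN = re.compile(
    rf"\b{LIGHT_VERBS}\s+(?:an?|the)\s+(\w+{NOMINAL_SUFFIXES})\b",
    re.IGNORECASE,
)


def find_nominalizations(text: str) -> list[str]:
    """Return phrases such as 'performed an analysis' that may hide the real verb."""
    return [match.group(0) for match in PATTERN.finditer(text)]


print(find_nominalizations("We performed an analysis of the results"))
print(find_nominalizations("We analyzed the results"))
```

Each hit is a candidate for rewriting with the action in the verb; whether the rewrite reads better remains a judgment call.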
Principle 7: Context Before New Information
Provide context before asking the reader to consider anything new. This applies at all levels—sentence, paragraph, section.
Weak: "Equation 3 shows that convergence is guaranteed when the learning rate satisfies..."
Strong: "For convergence to be guaranteed, the learning rate must satisfy the condition in Equation 3..."
| Principle | Rule | Mnemonic |
|---|---|---|
| Subject-Verb Proximity | Keep subject and verb close | "Don't interrupt yourself" |
| Stress Position | Emphasis at sentence end | "Save the best for last" |
| Topic Position | Context at sentence start | "First things first" |
| Old Before New | Familiar → unfamiliar | "Build on known ground" |
| One Unit, One Function | Each paragraph = one point | "One idea per container" |
| Action in Verb | Use verbs, not nominalizations | "Verbs do, nouns sit" |
| Context Before New | Explain before presenting | "Set the stage first" |
These practical micro-level tips improve clarity at the sentence and word level.
Minimize pronouns ("this," "it," "these," "that"). When pronouns are necessary, use them as adjectives with a noun:
Weak: "This shows that the model converges."
Strong: "This result shows that the model converges."
Weak: "It improves performance."
Strong: "This modification improves performance."
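Bare pronouns are easy to miss when rereading your own draft, so a simple search can help. The following sketch flags a capitalized pronoun followed directly by a verb. The verb list is an assumption chosen for illustration, and `find_bare_pronouns` is a hypothetical name; extend the list to suit your drafts.

```python
import re

# Assumed, illustrative verb list; a pronoun followed directly by one of
# these verbs has no noun attached to it.
VERBS = ("shows", "show", "improves", "improve", "suggests", "suggest",
         "means", "mean", "indicates", "indicate", "is", "are")

BARE_PRONOUN = re.compile(
    r"\b(This|These|It|That)\s+(" + "|".join(VERBS) + r")\b"
)


def find_bare_pronouns(text: str) -> list[str]:
    """Return 'pronoun + verb' pairs where the pronoun stands alone."""
    return [match.group(0) for match in BARE_PRONOUN.finditer(text)]


print(find_bare_pronouns("This shows that the model converges. It improves performance."))
print(find_bare_pronouns("This result shows that the model converges."))
```

The pattern is case-sensitive on purpose, so it mostly catches sentence-initial pronouns, where the antecedent is most often unclear.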
Position verbs early in sentences for better parsing:
Weak: "The gradient, after being computed and normalized, updates the weights."
Strong: "The gradient updates the weights after being computed and normalized."
Transform possessive constructions for clarity:
Original: "X's Y" → Unfolded: "The Y of X"
Before: "The model's accuracy on the test set"
After: "The accuracy of the model on the test set"
This isn't always better, but when sentences feel awkward, try unfolding.
Delete these filler words in almost all cases:
Don't bury key information in the middle of paragraphs.
Eliminate hedging unless genuine uncertainty exists:
Avoid vacuous intensifiers:
Precision over brevity: Replace vague terms with specific ones.
| Vague | Specific |
|---|---|
| performance | accuracy, latency, throughput |
| improves | increases accuracy by X%, reduces latency by Y |
| large | 1B parameters, 100M tokens |
| fast | 3x faster, 50ms latency |
| good results | 92% accuracy, 0.85 F1 |
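The table above can double as a checklist for a draft. As a minimal sketch, the script below searches text for the vague terms in the left column and reports the table's example replacements. The function name `flag_vague_terms` is hypothetical, and the matching is whole-word and case-insensitive by assumption.

```python
import re

# Vague terms and example replacements, copied from the table above.
VAGUE_TERMS = {
    "performance": "accuracy, latency, throughput",
    "improves": "increases accuracy by X%, reduces latency by Y",
    "large": "1B parameters, 100M tokens",
    "fast": "3x faster, 50ms latency",
    "good results": "92% accuracy, 0.85 F1",
}


def flag_vague_terms(text: str) -> list[tuple[str, str]]:
    """Return (vague term, example of a specific alternative) pairs found in text."""
    hits = []
    for term, suggestion in VAGUE_TERMS.items():
        # Whole-word match so that "faster" does not trigger "fast".
        if re.search(rf"\b{re.escape(term)}\b", text, re.IGNORECASE):
            hits.append((term, suggestion))
    return hits


for term, suggestion in flag_vague_terms("Our method is fast and improves performance."):
    print(f"'{term}' is vague; consider something like: {suggestion}")
```

A flagged term is not automatically wrong. "Performance" is fine once the metric has been named; the check only points at places worth a second look.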
Consistent terminology: Referring to the same concept with different terms creates confusion.
Choose one and stick with it:
Avoid words signaling incremental work:
Why: "We combine X and Y" sounds like you stapled two existing ideas together. "We develop a method that leverages X for Y" sounds like a genuine contribution.
Unfold apostrophes for clarity:
Example: "the model's accuracy" → "the accuracy of the model"
% Scalars: lowercase italic
$x$, $y$, $\alpha$, $\beta$
% Vectors: lowercase bold
$\mathbf{x}$, $\mathbf{v}$
% Matrices: uppercase bold
$\mathbf{W}$, $\mathbf{X}$
% Sets: uppercase calligraphic
$\mathcal{X}$, $\mathcal{D}$
% Functions: roman for named functions
$\mathrm{softmax}$, $\mathrm{ReLU}$
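To see how these conventions combine in one expression, consider a single softmax layer and a gradient step. The symbols below ($\mathbf{W}$, $\mathbf{b}$, $\mathbf{x}$, $\mathbf{y}$, $\mathcal{X}$, $d$, $k$, $\ell$) are chosen for illustration only, and `\mathbb` assumes the `amssymb` package is loaded.

```latex
% Bold lowercase vectors, bold uppercase matrix, calligraphic set, roman function name
$\mathbf{y} = \mathrm{softmax}(\mathbf{W}\mathbf{x} + \mathbf{b}),
 \quad \mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^{d},
 \quad \mathbf{W} \in \mathbb{R}^{k \times d}$

% Lowercase italic scalars: learning rate \alpha and loss value \ell
$\mathbf{W} \leftarrow \mathbf{W} - \alpha \, \nabla_{\mathbf{W}} \ell$
```

A reader who knows the conventions can tell the type of every symbol at a glance, without looking up its definition.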
Figures should tell a coherent story even if the reader skips the text. Many readers DO skip the text initially.
Roughly 8% of men and about 0.5% of women have some form of color vision deficiency. Your figures must work for these readers.
Solutions:
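# SciencePlots: Publication-ready styles
import matplotlib.pyplot as plt
# Importing scienceplots registers the styles (required since SciencePlots 2.0)
import scienceplots
plt.style.use(['science', 'ieee'])
# Or for Nature-style
plt.style.use(['science', 'nature'])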
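For the colorblind requirement, one common approach is to encode each series redundantly, with color plus linestyle plus marker, and to draw colors from the Okabe-Ito palette. The sketch below is one way to do that under stated assumptions: `series_styles` and `plot_curves` are hypothetical helpers, and the axis labels and output filename are placeholders.

```python
from itertools import cycle

# Okabe-Ito palette: eight colors widely used as a colorblind-safe set.
OKABE_ITO = ["#000000", "#E69F00", "#56B4E9", "#009E73",
             "#F0E442", "#0072B2", "#D55E00", "#CC79A7"]
LINESTYLES = ["-", "--", "-.", ":"]
MARKERS = ["o", "s", "^", "D"]


def series_styles(n: int) -> list[dict]:
    """Return n style dicts; the first four differ in color, linestyle, and marker."""
    colors, lines, marks = cycle(OKABE_ITO), cycle(LINESTYLES), cycle(MARKERS)
    return [
        {"color": next(colors), "linestyle": next(lines), "marker": next(marks)}
        for _ in range(n)
    ]


def plot_curves(curves: dict[str, list[float]], path: str = "curves.pdf") -> None:
    """Plot named curves with redundant encodings and save them as a vector PDF."""
    # Imported here so the palette helper stays usable without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(4, 3))
    for (label, ys), style in zip(curves.items(), series_styles(len(curves))):
        # markevery thins the markers so dense curves stay readable.
        ax.plot(range(len(ys)), ys, label=label,
                markevery=max(1, len(ys) // 8), **style)
    ax.set_xlabel("Epoch")                # placeholder label
    ax.set_ylabel("Validation accuracy")  # placeholder label
    ax.legend(frameon=False)
    fig.savefig(path, bbox_inches="tight")  # .pdf extension gives vector output
    plt.close(fig)
```

Calling `plot_curves({"Baseline": [0.61, 0.70, 0.74], "Ours": [0.65, 0.78, 0.83]})` with made-up numbers like these writes `curves.pdf`. Because every line also differs in dash pattern and marker, the figure should stay legible when printed in grayscale.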
| Mistake | Solution |
|---|---|
| Introduction too long (>1.5 pages) | Move background to Related Work |
| Methods buried (after page 3) | Front-load contribution, cut intro |
| Missing contribution bullets | Add 2-4 specific, falsifiable claims |
| Experiments without explicit claims | State what each experiment tests |
| Mistake | Solution |
|---|---|
| Generic abstract opening | Start with your specific contribution |
| Inconsistent terminology | Choose one term per concept |
| Passive voice overuse | Use active voice: "We show" not "It is shown" |
| Hedging everywhere | Be confident unless genuinely uncertain |
| Mistake | Solution |
|---|---|
| Raster graphics for plots | Use vector (PDF/EPS) |
| Red-green color scheme | Use colorblind-safe palette |
| Title inside figure | Put title in caption |
| Captions require main text | Make captions self-contained |
| Mistake | Solution |
|---|---|
| Paper-by-paper Related Work | Organize methodologically |
| Missing relevant citations | Reviewers have often authored related papers, so cite generously |
| AI-generated citations | Always verify via APIs |
| Inconsistent citation format | Use BibLaTeX with consistent keys |
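"Always verify via APIs" can be as simple as looking up each DOI and comparing the returned title with the one in your bibliography. The sketch below assumes the public Crossref REST API (`https://api.crossref.org/works/<doi>`), which only covers DOI-registered works, so preprints without a DOI need a different source. All function names are hypothetical.

```python
import json
import re
import urllib.parse
import urllib.request
from typing import Optional


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation and whitespace so titles compare robustly."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def titles_match(claimed: str, found: str) -> bool:
    """True when two titles are equal after normalization."""
    return normalize_title(claimed) == normalize_title(found)


def fetch_crossref_title(doi: str, timeout: float = 10.0) -> Optional[str]:
    """Look up a DOI via Crossref; return the registered title or None."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            message = json.load(response)["message"]
    except (OSError, KeyError, ValueError):
        # Unknown DOI (HTTP 404), network failure, or unexpected payload.
        return None
    titles = message.get("title") or []
    return titles[0] if titles else None


def citation_is_verified(doi: str, claimed_title: str) -> bool:
    """True only if the DOI resolves and its title matches the claimed one."""
    found = fetch_crossref_title(doi)
    return found is not None and titles_match(claimed_title, found)
```

A `False` result does not prove a citation is fabricated, since the lookup can fail for network reasons. It does mean the entry needs a manual check before submission.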
Before submitting, verify:
Narrative:
Structure:
Writing:
Technical: