scientific-skills/venue-templates/references/ml_conference_style.md
Comprehensive writing guide for NeurIPS, ICML, ICLR, CVPR, ECCV, ICCV, and other major machine learning and computer vision conferences.
Last Updated: 2024
ML conferences prioritize novelty, rigorous empirical evaluation, and reproducibility. Papers are evaluated on clear contribution, strong baselines, comprehensive ablations, and honest discussion of limitations.
"Show don't tell—your experiments should demonstrate your claims, not just your prose."
Primary Goal: Advance the state of the art with novel methods validated through rigorous experimentation.
| Characteristic | Description |
|---|---|
| Technical | Dense with methodology details |
| Precise | Exact terminology, no ambiguity |
| Empirical | Claims backed by experiments |
| Direct | State contributions clearly |
| Honest | Acknowledge limitations |
Transformers have achieved remarkable success in sequence modeling but
suffer from quadratic computational complexity, limiting their application
to long sequences. We introduce FlashAttention-2, an IO-aware exact
attention algorithm that achieves 2x speedup over FlashAttention and up
to 9x speedup over standard attention on sequences up to 16K tokens. Our
key insight is to reduce memory reads/writes by tiling and recomputation,
achieving optimal IO complexity. On the Long Range Arena benchmark,
FlashAttention-2 enables training with 8x longer sequences while matching
standard attention accuracy. Combined with sequence parallelism, we train
GPT-style models on sequences of 64K tokens at near-linear cost. We
release optimized CUDA kernels achieving 80% of theoretical peak FLOPS
on A100 GPUs. Code is available at [anonymous URL].
❌ "We propose a novel method for X" (vague, no results) ❌ "Our method outperforms baselines" (no specific numbers) ❌ "This is an important problem" (self-evident claims)
✅ Include specific metrics: "achieves 94.5% accuracy, 3.2% improvement" ✅ Include scale: "on 1M samples" or "16K token sequences" ✅ Include comparison: "2x faster than previous SOTA"
ML introductions have a distinctive structure with numbered contributions.
Paragraph 1: Problem Motivation
"Large language models have demonstrated remarkable capabilities in
natural language understanding and generation. However, their quadratic
attention complexity presents a fundamental bottleneck for processing
long documents, multi-turn conversations, and reasoning over extended
contexts. As models scale to billions of parameters and context lengths
extend to tens of thousands of tokens, efficient attention mechanisms
become critical for practical deployment."
Paragraph 2: Limitations of Existing Approaches
"Prior work has addressed this through sparse attention patterns,
linear attention approximations, and low-rank factorizations. While
these methods reduce theoretical complexity, they often sacrifice
accuracy, require specialized hardware, or introduce approximation
errors that compound in deep networks. Exact attention remains
preferable when computational resources permit."
Paragraph 3: Your Approach (High-Level)
"We observe that the primary bottleneck in attention is not computation
but rather memory bandwidth—reading and writing the large N×N attention
matrix dominates runtime on modern GPUs. We propose FlashAttention-2,
which eliminates this bottleneck through a novel tiling strategy that
computes attention block-by-block without materializing the full matrix."
Paragraph 4: Contribution List (CRITICAL)
This is mandatory and distinctive for ML conferences:
Our contributions are as follows:
• We propose FlashAttention-2, an IO-aware exact attention algorithm
that achieves optimal memory complexity O(N²d/M) where M is GPU
SRAM size.
• We provide theoretical analysis showing that our algorithm achieves
2-4x fewer HBM accesses than FlashAttention on typical GPU
configurations.
• We demonstrate 2x speedup over FlashAttention and up to 9x over
standard PyTorch attention across sequence lengths from 256 to 64K
tokens.
• We show that FlashAttention-2 enables training with 8x longer
contexts on the same hardware, unlocking new capabilities for
long-range modeling.
• We release optimized CUDA kernels and PyTorch bindings at
[anonymous URL].
| Good Contribution Bullets | Bad Contribution Bullets |
|---|---|
| Specific, quantifiable | Vague claims |
| Self-contained | Requires reading paper to understand |
| Distinct from each other | Overlapping bullets |
| Emphasize novelty | State obvious facts |
METHOD
├── Problem Formulation
├── Method Overview / Architecture
├── Key Technical Components
│ ├── Component 1 (with equations)
│ ├── Component 2 (with equations)
│ └── Component 3 (with equations)
├── Theoretical Analysis (if applicable)
└── Implementation Details
Include clear pseudocode for reproducibility:
Algorithm 1: FlashAttention-2 Forward Pass
─────────────────────────────────────────
Input: Q, K, V ∈ ℝ^{N×d}, block sizes B_r, B_c
Output: O ∈ ℝ^{N×d}
1: Divide Q into T_r = ⌈N/B_r⌉ blocks
2: Divide K, V into T_c = ⌈N/B_c⌉ blocks
3: Initialize O = 0, ℓ = 0, m = -∞
4: for i = 1 to T_r do
5: Load Q_i from HBM to SRAM
6: for j = 1 to T_c do
7: Load K_j, V_j from HBM to SRAM
8: Compute S_ij = Q_i K_j^T
9: Update running max and sum
10: Update O_i incrementally
11: end for
12: Write O_i to HBM
13: end for
14: return O
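For intuition, here is a minimal NumPy sketch of the pattern Algorithm 1 describes: tiling over query and key blocks with an online softmax so the full N×N score matrix is never materialized. The function name, default block sizes, and the omission of masking, dropout, and the backward pass are illustrative simplifications, not the authors' CUDA implementation.

```python
import numpy as np

def tiled_attention(Q, K, V, block_q=64, block_k=64):
    """Exact attention computed block-by-block with an online softmax."""
    N, d = Q.shape
    O = np.zeros((N, d))
    for i in range(0, N, block_q):
        Qi = Q[i:i + block_q]                      # query block (stays "in SRAM")
        m = np.full(Qi.shape[0], -np.inf)          # running row-wise max
        l = np.zeros(Qi.shape[0])                  # running softmax denominator
        acc = np.zeros((Qi.shape[0], d))           # unnormalized output accumulator
        for j in range(0, N, block_k):
            Kj, Vj = K[j:j + block_k], V[j:j + block_k]
            S = Qi @ Kj.T / np.sqrt(d)             # scores for this block pair
            m_new = np.maximum(m, S.max(axis=1))   # update running max
            P = np.exp(S - m_new[:, None])         # unnormalized block probabilities
            scale = np.exp(m - m_new)              # rescale previous partial results
            l = l * scale + P.sum(axis=1)
            acc = acc * scale[:, None] + P @ Vj
            m = m_new
        O[i:i + block_q] = acc / l[:, None]        # normalize once per query block
    return O
```

Up to floating-point error, the result matches dense softmax(QKᵀ/√d)V, while peak memory scales with the block size rather than with N².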
EXPERIMENTS
├── Experimental Setup
│ ├── Datasets and Benchmarks
│ ├── Baselines
│ ├── Implementation Details
│ └── Evaluation Metrics
├── Main Results
│ └── Table/Figure with primary comparisons
├── Ablation Studies
│ └── Component-wise analysis
├── Analysis
│ ├── Scaling behavior
│ ├── Qualitative examples
│ └── Error analysis
└── Computational Efficiency
The experiments section is critical for acceptance. Present main results in a clear, comprehensively formatted table:
Table 1: Results on Long Range Arena Benchmark (accuracy %)

| Method | ListOps | Text | Retrieval | Image | Path | Avg |
|---|---|---|---|---|---|---|
| Transformer | 36.4 | 64.3 | 57.5 | 42.4 | 71.4 | 54.4 |
| Performer | 18.0 | 65.4 | 53.8 | 42.8 | 77.1 | 51.4 |
| Linear Attn | 16.1 | 65.9 | 53.1 | 42.3 | 75.3 | 50.5 |
| FlashAttention | 37.1 | 64.5 | 57.8 | 42.7 | 71.2 | 54.7 |
| FlashAttn-2 | 37.4 | 64.7 | 58.2 | 42.9 | 71.8 | 55.0 |
Use ablations to show which components of your method matter:
Table 2: Ablation Study on FlashAttention-2 Components

| Variant | Speedup | Memory |
|---|---|---|
| Full FlashAttention-2 | 2.0x | 1.0x |
| - without sequence parallelism | 1.7x | 1.0x |
| - without recomputation | 1.3x | 2.4x |
| - without block tiling | 1.0x | 4.0x |
| FlashAttention-1 (baseline) | 1.0x | 1.0x |
**Efficient Attention Mechanisms.** Prior work on efficient attention
falls into three categories: sparse patterns (Beltagy et al., 2020;
Zaheer et al., 2020), linear approximations (Katharopoulos et al., 2020;
Choromanski et al., 2021), and low-rank factorizations (Wang et al.,
2020). Our work differs in that we focus on IO-efficient exact
attention rather than approximations.
**Memory-Efficient Training.** Gradient checkpointing (Chen et al., 2016)
and activation recomputation (Korthikanti et al., 2022) reduce memory
by trading compute. We adopt similar ideas but apply them within the
attention operator itself.
A limitations discussion is increasingly required at NeurIPS, ICML, and ICLR. State limitations honestly:
**Limitations.** While FlashAttention-2 provides substantial speedups,
several limitations remain. First, our implementation is optimized for
NVIDIA GPUs and does not support AMD or other hardware. Second, the
speedup is most pronounced for medium to long sequences; for very short
sequences (<256 tokens), the overhead of our kernel launch dominates.
Third, we focus on dense attention; extending our approach to sparse
attention patterns remains future work. Finally, our theoretical
analysis assumes specific GPU memory hierarchy parameters that may not
hold for future hardware generations.
Most ML conferences require a reproducibility checklist covering:
Hyperparameters:
"We train with Adam (β₁=0.9, β₂=0.999, ε=1e-8) and learning rate 3e-4
with linear warmup over 1000 steps and cosine decay. Batch size is 256
across 8 A100 GPUs. We train for 100K steps (approximately 24 hours)."
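Where space permits (for example in an appendix), spelling the schedule out as code removes ambiguity. Below is a minimal PyTorch sketch of the optimizer and schedule described above; the placeholder model and step counts are only there to make the snippet self-contained and are not part of the stated setup.

```python
import math
import torch

model = torch.nn.Linear(512, 512)  # placeholder model, for illustration only
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4,
                             betas=(0.9, 0.999), eps=1e-8)

warmup_steps, total_steps = 1_000, 100_000

def lr_lambda(step):
    # Linear warmup to the base LR, then cosine decay to zero.
    if step < warmup_steps:
        return step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

Call `scheduler.step()` once per optimizer step so that warmup and decay are measured in steps, as stated in the text.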
Random Seeds:
"All experiments are averaged over 3 random seeds (0, 1, 2) with
standard deviation reported in parentheses."
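A small sketch of how such numbers might be produced, assuming a hypothetical run_experiment helper that trains and evaluates with a given seed (in practice you would also seed Python, NumPy, and your framework explicitly):

```python
import numpy as np

def run_experiment(seed: int) -> float:
    """Hypothetical stand-in for a full train/eval run with the given seed."""
    rng = np.random.default_rng(seed)
    return 94.0 + float(rng.normal(scale=0.3))  # placeholder metric, not real data

scores = [run_experiment(seed) for seed in (0, 1, 2)]
print(f"{np.mean(scores):.1f} ({np.std(scores):.1f})")  # report mean (std) across seeds
```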
Compute:
"Experiments were conducted on 8 NVIDIA A100-80GB GPUs. Total training
time was approximately 500 GPU-hours."
Write self-contained figure and table captions that can be understood without reading the main text.
ML conferences have author response (rebuttal) periods for addressing reviewer concerns.
- venue_writing_styles.md - Master style overview
- conferences_formatting.md - Technical formatting requirements
- reviewer_expectations.md - What ML reviewers seek