skills/ab-test-setup/references/sample-size-guide.md
Reference for calculating sample sizes and test duration.
Baseline conversion rate: If your page converts at 5%, that's your baseline.
MDE (Minimum Detectable Effect): The smallest improvement you care about detecting. Set this based on the lift that would justify shipping the change and the traffic you can afford; the smaller the MDE, the more sample you need (see the tables below).
Statistical significance (95%): Caps the false-positive rate at 5%; if there is no real difference between variants, you will wrongly declare a winner about 5% of the time.
Statistical power (80%): Means that if there's a real effect of size MDE, you have an 80% chance of detecting it.
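For reference, here is a minimal sketch of the standard two-proportion sample-size formula that calculators like Evan Miller's implement. The function name and defaults are my own, and its output can differ from the rounded (often more conservative) figures in the tables below, since calculators vary in their corrections and assumptions.

```python
# Sketch: classic two-proportion sample-size formula (normal approximation).
# Function name and parameter defaults are illustrative, not from this guide.
import math
from statistics import NormalDist

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # e.g. 5% with a 20% lift -> 6%
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% significance
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

print(sample_size_per_variant(0.05, 0.20))  # 5% baseline, 20% relative lift
```

Treat the result as a starting point; published tables can be noticeably larger when they bake in continuity corrections or sequential-testing adjustments.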
**Baseline conversion rate: 1%**

| Lift to Detect | Sample per Variant | Total Sample |
|---|---|---|
| 5% (1% → 1.05%) | 1,500,000 | 3,000,000 |
| 10% (1% → 1.1%) | 380,000 | 760,000 |
| 20% (1% → 1.2%) | 97,000 | 194,000 |
| 50% (1% → 1.5%) | 16,000 | 32,000 |
| 100% (1% → 2%) | 4,200 | 8,400 |
**Baseline conversion rate: 3%**

| Lift to Detect | Sample per Variant | Total Sample |
|---|---|---|
| 5% (3% → 3.15%) | 480,000 | 960,000 |
| 10% (3% → 3.3%) | 120,000 | 240,000 |
| 20% (3% → 3.6%) | 31,000 | 62,000 |
| 50% (3% → 4.5%) | 5,200 | 10,400 |
| 100% (3% → 6%) | 1,400 | 2,800 |
**Baseline conversion rate: 5%**

| Lift to Detect | Sample per Variant | Total Sample |
|---|---|---|
| 5% (5% → 5.25%) | 280,000 | 560,000 |
| 10% (5% → 5.5%) | 72,000 | 144,000 |
| 20% (5% → 6%) | 18,000 | 36,000 |
| 50% (5% → 7.5%) | 3,100 | 6,200 |
| 100% (5% → 10%) | 810 | 1,620 |
**Baseline conversion rate: 10%**

| Lift to Detect | Sample per Variant | Total Sample |
|---|---|---|
| 5% (10% → 10.5%) | 130,000 | 260,000 |
| 10% (10% → 11%) | 34,000 | 68,000 |
| 20% (10% → 12%) | 8,700 | 17,400 |
| 50% (10% → 15%) | 1,500 | 3,000 |
| 100% (10% → 20%) | 400 | 800 |
**Baseline conversion rate: 20%**

| Lift to Detect | Sample per Variant | Total Sample |
|---|---|---|
| 5% (20% → 21%) | 60,000 | 120,000 |
| 10% (20% → 22%) | 16,000 | 32,000 |
| 20% (20% → 24%) | 4,000 | 8,000 |
| 50% (20% → 30%) | 700 | 1,400 |
| 100% (20% → 40%) | 200 | 400 |
Duration (days) = (Sample per variant × Number of variants) / (Daily traffic × % exposed)
- Scenario 1: High-traffic page
- Scenario 2: Medium-traffic page
- Scenario 3: Low-traffic with partial exposure
Even with sufficient sample size, run tests for at least one to two full weeks, so results cover every day of the week and a complete business cycle.
Avoid running tests longer than 4-8 weeks: cookie deletion, returning visitors, and seasonality gradually pollute the sample.
- [Evan Miller's Calculator](https://www.evanmiller.org/ab-testing/sample-size.html)
- [Optimizely's Calculator](https://www.optimizely.com/sample-size-calculator/)
- [AB Test Guide Calculator](https://www.abtestguide.com/calc/)
- [VWO Duration Calculator](https://vwo.com/tools/ab-test-duration-calculator/)
With more than 2 variants (A/B/n tests), you need more sample:
| Variants | Multiplier |
|---|---|
| 2 (A/B) | 1x |
| 3 (A/B/C) | ~1.5x |
| 4 (A/B/C/D) | ~2x |
| 5+ | Consider reducing variants |
Why? More comparisons increase the chance of false positives. With three variants you're comparing B vs A, C vs A, and potentially C vs B, so every added variant multiplies the number of chances to see a spurious "winner."
Apply Bonferroni correction or use tools that handle this automatically.
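As a minimal sketch of the Bonferroni correction mentioned above (the p-values are invented for illustration):

```python
# Bonferroni correction: test each comparison at alpha / number_of_comparisons
# so the overall false-positive rate stays at alpha. P-values are invented.
def bonferroni_significant(p_values, alpha=0.05):
    adjusted_alpha = alpha / len(p_values)
    return [p < adjusted_alpha for p in p_values]

# Three comparisons (e.g. B vs A, C vs A, C vs B) at overall alpha = 0.05,
# so each is tested at ~0.0167:
print(bonferroni_significant([0.030, 0.012, 0.200]))  # [False, True, False]
```

Note that 0.030 would count as significant in a plain A/B test but fails the adjusted threshold; that stricter bar is exactly why A/B/n tests need more sample.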
- **Problem:** Not enough sample to detect realistic effects. **Fix:** Be realistic about MDE, get more traffic, or don't test.
- **Problem:** Waiting for sample size when you already have significance. **Fix:** This is actually fine; you committed to a sample size, honor it.
- **Problem:** Using the wrong conversion rate for the calculation. **Fix:** Use the specific metric and page, not site-wide averages.
- **Problem:** Calculating for full traffic, then analyzing segments. **Fix:** If you plan segment analysis, calculate sample for the smallest segment.
- **Problem:** Dividing traffic too many ways. **Fix:** Prioritize ruthlessly and run fewer concurrent tests.
Options when you can't get enough traffic: raise your MDE and test only bold changes, test on a higher-traffic page or an upstream metric, extend the test window, or skip the test and ship with monitoring.
If you must check results before reaching sample size, use sequential testing: a statistical method that adjusts for multiple looks at the data.
- Daily traffic to page: _____
- Baseline conversion rate: _____
- MDE I care about: _____
- Sample needed per variant: _____ (from the tables above)
- Days to run: (Sample per variant × Variants) / (Daily traffic × % exposed) = _____

- If days > 60: Consider alternatives
- If days > 30: Acceptable for high-impact tests
- If days < 14: Likely feasible
- If days < 7: Easy to run, consider running longer anyway
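The worksheet's thresholds can be sketched as a small helper. The "Feasible" fallback for the 14-30 day range is my own addition, since the guide leaves that range unstated.

```python
# Map a computed test duration onto the worksheet's verdicts.
# The 14-30 day fallback verdict is an assumption, not from the guide.
def feasibility_verdict(days):
    if days > 60:
        return "Consider alternatives"
    if days > 30:
        return "Acceptable for high-impact tests"
    if days < 7:
        return "Easy to run, consider running longer anyway"
    if days < 14:
        return "Likely feasible"
    return "Feasible"

print(feasibility_verdict(36_000 / 2_000))  # 18 days -> "Feasible"
```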