skills/statistical-power/references/effect_sizes.md
The effect size is the input that makes or breaks a power analysis, and it is the one people most often get wrong. Power computed from a guessed or inflated effect is worse than no power analysis, because it carries false authority. This file covers how to pick a defensible value and how to convert between the metrics different tests use.
Power the study to detect the smallest effect that would actually change a decision or matter scientifically or clinically (the smallest effect size of interest, SESOI), not the effect you hope or expect to see. Ways to set it:
Powering on the SESOI is the most defensible choice: if the true effect is larger, you're even better powered; if it's smaller, you've decided it doesn't matter.
Pilot studies and published effects are biased upward (publication bias, the "winner's curse," and the fact that significant pilots are the ones that get followed up). Powering on a raw pilot d routinely underpowers the real study. If you must use a prior estimate:
Cohen's small/medium/large are arbitrary and field-blind. They were never meant as substitutes for domain knowledge. Use them only when nothing better exists, state explicitly that you did, and prefer "small" unless you have a reason — most real effects in many fields are small.
Whatever you pick, report how required n varies across a plausible range of effects (e.g. a power curve, or a small table of n at d = 0.3, 0.4, 0.5). A single n hides the dominant source of uncertainty, which is the effect size itself. That sensitivity curve or table is the actual deliverable of a good power analysis.
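A minimal sketch of such a sensitivity table, assuming a two-sided, two-sample comparison of means with equal group sizes. It uses the normal approximation n = 2(z_{1-α/2} + z_{power})² / d² rather than an exact t-based solver, so an exact calculation will typically ask for one or two more participants per group. The function name `n_per_group` and the defaults (alpha = 0.05, power = 0.80) are illustrative choices, not taken from this file.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


# Sensitivity table over a plausible range of standardized effects.
for d in (0.3, 0.4, 0.5):
    print(f"d = {d:.1f}  ->  n per group ~ {n_per_group(d)}")
```

Moving from d = 0.5 to d = 0.3 nearly triples the required n, which is why the table says more than any single number.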
| Metric | Used for | Small | Medium | Large |
|---|---|---|---|---|
| d | mean differences (t-tests) | 0.20 | 0.50 | 0.80 |
| f | ANOVA | 0.10 | 0.25 | 0.40 |
| f² | regression / multiple R² | 0.02 | 0.15 | 0.35 |
| r | correlation | 0.10 | 0.30 | 0.50 |
| η² (eta-squared) | ANOVA variance explained | 0.01 | 0.06 | 0.14 |
| h | proportions (arcsine) | 0.20 | 0.50 | 0.80 |
| w | chi-square | 0.10 | 0.30 | 0.50 |
| OR | 2×2 odds ratio | ~1.5 | ~2.5 | ~4.3 |
(OR benchmarks depend heavily on the base rate and on context; treat them as rough only.)
d ↔ r (two-group comparison ↔ point-biserial)
r = d / sqrt(d^2 + 4) # equal groups
d = 2r / sqrt(1 - r^2)
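The two formulas above are inverses of each other under the equal-group-size assumption. A short sketch that checks the round trip (the function names `d_to_r` and `r_to_d` are illustrative):

```python
from math import sqrt


def d_to_r(d):
    """Cohen's d -> point-biserial r, assuming two equal-sized groups."""
    return d / sqrt(d ** 2 + 4)


def r_to_d(r):
    """Point-biserial r -> Cohen's d, assuming two equal-sized groups."""
    return 2 * r / sqrt(1 - r ** 2)


for d in (0.2, 0.5, 0.8):
    r = d_to_r(d)
    # Converting back should recover the original d up to floating-point error.
    print(f"d = {d:.2f}  ->  r = {r:.3f}  ->  d = {r_to_d(r):.2f}")
```

Note that d = 0.5 converts to r of about 0.24, not the "medium" r = 0.30 in the benchmark table: the benchmarks are conventions per metric, not exact conversions of one another.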
d ↔ Cohen's f (k groups; for two equal groups f = d/2)
f = d / 2 # two groups
f ↔ η²
f = sqrt(eta2 / (1 - eta2))
eta2 = f^2 / (1 + f^2)
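As a quick check, the f and η² benchmarks in the table can be compared through these formulas. A small sketch (function names are illustrative):

```python
from math import sqrt


def eta2_to_f(eta2):
    """Eta-squared (proportion of variance explained) -> Cohen's f."""
    return sqrt(eta2 / (1 - eta2))


def f_to_eta2(f):
    """Cohen's f -> eta-squared."""
    return f ** 2 / (1 + f ** 2)


# Cohen's f benchmarks mapped onto the eta-squared scale.
for f in (0.10, 0.25, 0.40):
    print(f"f = {f:.2f}  ->  eta2 = {f_to_eta2(f):.3f}")
```

The results (roughly 0.010, 0.059, 0.138) round to the 0.01 / 0.06 / 0.14 row of the table, so the two sets of benchmarks describe the same effects.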
f² ↔ R² (regression)
f2 = R2 / (1 - R2) # whole model
f2 = dR2 / (1 - R2_full) # increment from added predictors
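A brief sketch of both f² forms, with `dR2` written out as the difference between the full-model and reduced-model R². The function names and the example R² values are hypothetical.

```python
def f2_model(r2):
    """f-squared for the whole model, from its R-squared."""
    return r2 / (1 - r2)


def f2_increment(r2_full, r2_reduced):
    """f-squared for predictors added to a reduced model.

    The denominator uses the FULL model's unexplained variance.
    """
    return (r2_full - r2_reduced) / (1 - r2_full)


# Hypothetical example: covariates explain 20%, the added predictors raise it to 25%.
print(f"whole model: f2 = {f2_model(0.25):.3f}")
print(f"increment:   f2 = {f2_increment(0.25, 0.20):.3f}")
```

The increment form is usually the relevant one when the question is whether a new predictor adds anything beyond existing covariates; using the whole-model f² there would overstate the effect.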
proportions → Cohen's h
h = 2*asin(sqrt(p1)) - 2*asin(sqrt(p2))
In Python: statsmodels.stats.proportion.proportion_effectsize(p1, p2).
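For reference, the same formula using only the standard library. The function name `cohens_h` is illustrative; its result should agree with the statsmodels helper named above.

```python
from math import asin, sqrt


def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))


# The same 10-point raw difference gives different h depending on where it sits.
print(f"0.50 vs 0.40: h = {cohens_h(0.50, 0.40):.3f}")
print(f"0.15 vs 0.05: h = {cohens_h(0.15, 0.05):.3f}")
```

Because the arcsine transform stretches the scale near 0 and 1, a fixed raw difference is a larger standardized effect at extreme base rates. That is a reason to state both proportions, not just their difference, when justifying h.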
proportions → Cohen's w (for chi-square, against expected p0_i)
w = sqrt( sum( (p_i - p0_i)^2 / p0_i ) )
For any r×c contingency table, w = sqrt(chi2 / N), and w = V * sqrt(min(r-1, c-1)) where V is
Cramér's V. For a 2×2 table this reduces to w = V = φ.
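A minimal sketch of the goodness-of-fit form of w, computed from the proportions expected under the alternative and under the null. The function name `cohens_w` and the example proportions are hypothetical.

```python
from math import sqrt


def cohens_w(p_alt, p_null):
    """Cohen's w from cell proportions under the alternative and the null."""
    if len(p_alt) != len(p_null):
        raise ValueError("p_alt and p_null must have the same number of cells")
    return sqrt(sum((p - p0) ** 2 / p0 for p, p0 in zip(p_alt, p_null)))


# Hypothetical four-category example: null is uniform, alternative is mildly skewed.
p_null = [0.25, 0.25, 0.25, 0.25]
p_alt = [0.30, 0.30, 0.20, 0.20]
print(f"w = {cohens_w(p_alt, p_null):.2f}")
```

Both vectors should each sum to 1; the sketch does not enforce that.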
odds ratio → log-odds (for logistic-regression power by simulation)
beta = log(OR) # coefficient to plug into the simulated linear predictor
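A sketch of how that coefficient enters a simulated linear predictor, assuming a single binary predictor and a chosen baseline event rate. The names `baseline_p`, `beta0`, and `p_exposed`, and the values used, are hypothetical.

```python
from math import exp, log


def logit(p):
    """Probability -> log-odds."""
    return log(p / (1 - p))


def expit(x):
    """Log-odds -> probability."""
    return 1 / (1 + exp(-x))


odds_ratio = 2.0   # assumed effect to power for
baseline_p = 0.20  # assumed event rate when the predictor is 0

beta0 = logit(baseline_p)  # intercept fixes the baseline rate
beta = log(odds_ratio)     # slope is the log odds ratio

# Event probability when the binary predictor equals 1.
p_exposed = expit(beta0 + beta * 1)
print(f"baseline p = {baseline_p:.3f}, exposed p = {p_exposed:.3f}")
```

The same OR implies a very different absolute risk difference at different baseline rates, so the intercept is as much a part of the effect-size assumption as the OR itself.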
standardized → raw
A standardized effect is only as good as the SD you divide by. If you know the raw
difference and the SD, work in raw units and convert at the end:
d = (mean1 - mean2) / sd_pooled. For paired designs, dz uses the SD of the
differences, which depends on the within-pair correlation — see
closed_form_recipes.md.
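A sketch of the raw-to-standardized step. The pooled-SD formula is the usual one for two independent groups. The dz line assumes equal SDs in both conditions, in which case the SD of the differences is sd * sqrt(2 * (1 - rho)); that relation is an assumption of this sketch, not something this file states. All function names and numbers are hypothetical.

```python
from math import sqrt


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    return sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def cohens_d(mean1, mean2, sd_pooled):
    """Standardized mean difference for independent groups."""
    return (mean1 - mean2) / sd_pooled


def cohens_dz(mean_diff, sd, rho):
    """Paired-design effect size.

    Assumes equal SDs in both conditions, so the SD of the
    differences is sd * sqrt(2 * (1 - rho)).
    """
    return mean_diff / (sd * sqrt(2 * (1 - rho)))


sd = pooled_sd(15, 30, 15, 30)
print(f"d  = {cohens_d(105, 100, sd):.3f}")
for rho in (0.2, 0.5, 0.8):
    print(f"rho = {rho:.1f}  ->  dz = {cohens_dz(5, 15, rho):.3f}")
```

With a within-pair correlation of 0.5, dz equals d; above that the paired design has the larger standardized effect, below it the smaller.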