skills/statistical-power/SKILL.md
Power analysis answers one of the most consequential questions in study planning: how large a sample do you need to reliably detect an effect of a given size, and what could you detect with the sample you can afford? An underpowered study wastes resources and produces inconclusive or irreproducible results; an overpowered one wastes participants and money and, in clinical work, exposes more people to risk than necessary. Getting this right before data collection is the single highest-leverage statistical decision in a project.
Four quantities are locked together for any given test: sample size (n), effect size, significance level (α), and power (1 − β). Fix any three and the fourth is determined. Every calculation in this skill is some rearrangement of that relationship.
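To make the "fix three, solve for the fourth" relationship concrete, here is a minimal sketch for a two-sided, two-sample comparison of means. It uses the normal approximation with the standard library only, so it is not the bundled scripts/power.py; the function names are illustrative, and the exact t-based solvers in statsmodels return a slightly larger n.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def n_per_group(d, alpha=0.05, power=0.80):
    """Solve for n given effect size, alpha, and power (normal approximation)."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_beta = Z.inv_cdf(power)
    return 2 * (z_alpha + z_beta) ** 2 / d ** 2

def achieved_power(d, n, alpha=0.05):
    """Solve for power given effect size, n per group, and alpha.

    Ignores the negligible chance of rejecting in the wrong direction.
    """
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(d * sqrt(n / 2) - z_alpha)

def min_detectable_d(n, alpha=0.05, power=0.80):
    """Solve for the effect size given n per group, alpha, and power."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_beta = Z.inv_cdf(power)
    return (z_alpha + z_beta) * sqrt(2 / n)

n = n_per_group(0.5)
print(f"n per group (rounded up): {ceil(n)}")
print(f"power at that n:          {achieved_power(0.5, n):.3f}")
print(f"MDE at that n:            {min_detectable_d(n):.3f}")
```

All three functions rearrange the same equation, d·sqrt(n/2) = z(1 − α/2) + z(power), which is why feeding the output of one into another returns the inputs you started with.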
This skill covers the two ways to do power analysis:
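- Closed-form (analytic) calculations for standard tests: references/closed_form_recipes.md.
- Simulation-based power for everything else: references/simulation_based_power.md.

For choosing and converting effect sizes (usually the hardest part), see references/effect_sizes.md.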
Use uv. Pin versions in production; unpinned is fine for exploration.
uv pip install "statsmodels>=0.14.6" "scipy>=1.11" "pingouin>=0.6" "numpy>=1.26" matplotlib pandas
# For simulation-based power of advanced models (optional, add as needed):
uv pip install lifelines # survival
# mixed models and GLMs come with statsmodels
Compatibility note: use statsmodels>=0.14.6 with scipy>=1.11 to avoid _lazywhere import errors on SciPy 1.16+. Pingouin 0.5+ renamed power-function arguments to match the names used below.
Power calculations are only as trustworthy as the effect size you feed them. Do not invent a number. Use, in rough order of preference:
Whatever you pick, run a sensitivity analysis: report how required n changes across a plausible range of effect sizes, not a single point. A power analysis presented as one number hides its biggest source of uncertainty. See references/effect_sizes.md for benchmarks and conversions between d, f, r, η², odds ratios, and Cohen's h/w.
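A sensitivity analysis of this kind can be sketched in a few lines. The example below is an assumption-laden illustration: it uses the standard-library normal approximation for a two-sided, two-sample comparison rather than the bundled scripts, and the grid of effect sizes is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Per-group n for a two-sample comparison (normal approximation)."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * z_sum ** 2 / d ** 2)

# Hypothetical plausible range around a planning value of d = 0.50
plausible_d = [0.30, 0.40, 0.50, 0.60, 0.70]
table = {d: n_per_group(d) for d in plausible_d}

for d, n in table.items():
    print(f"d = {d:.2f} -> n per group ~ {n}")
```

Because n scales with 1/d², small changes in the assumed effect move the required sample a long way, which is the reason to report the whole range.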
Avoid post-hoc ("observed") power. Computing power from the effect size you just estimated is circular: for a given test it is a deterministic function of the p-value and tells you nothing new. If a study is already done and you want to know what it could have detected, report a sensitivity analysis (MDE at the achieved n) or, better, the confidence interval around the observed effect. Reviewers commonly object to observed power, so if you are asked for it, do not produce it without flagging the issue.
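The circularity is easy to demonstrate. The sketch below assumes a two-sided z-test (a simplification chosen so the standard library suffices) and computes "observed" power from nothing but the p-value; the function name is illustrative.

```python
from statistics import NormalDist

Z = NormalDist()

def observed_power_from_p(p, alpha=0.05):
    """'Observed' power of a two-sided z-test, computed only from its p-value."""
    z_obs = Z.inv_cdf(1 - p / 2)          # |z| implied by the p-value
    z_crit = Z.inv_cdf(1 - alpha / 2)     # rejection threshold
    # Chance a replicate exceeds the threshold if the true effect equals z_obs
    return Z.cdf(z_obs - z_crit) + Z.cdf(-z_obs - z_crit)

for p in (0.01, 0.05, 0.20, 0.50):
    print(f"p = {p:.2f} -> observed power = {observed_power_from_p(p):.3f}")
```

No sample size or effect size enters the function, so the result carries no information beyond the p-value itself; a result sitting exactly at p = α always has observed power of about one half.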
The bundled scripts/power.py wraps statsmodels into one consistent interface so you don't have to remember which solver belongs to which test. Run from the skill directory or add scripts/ to sys.path.
from power import sample_size, power, mde, power_curve
# 1. How many per group to detect Cohen's d = 0.5, two-sided, 80% power?
sample_size(test="t_ind", effect_size=0.5, power=0.80, alpha=0.05)
# -> required n per group
# 2. Two groups, 3:1 allocation (e.g. more controls than cases)
sample_size(test="t_ind", effect_size=0.5, power=0.80, ratio=3.0)
# 3. Fixed n=30/group — what's the minimum detectable d at 80% power?
mde(test="t_ind", nobs1=30, power=0.80, alpha=0.05)
# 4. One-way ANOVA, 4 groups, detect Cohen's f = 0.25
sample_size(test="anova", effect_size=0.25, k_groups=4, power=0.80)
# 5. Two proportions: 0.40 vs 0.55 (auto-converts to Cohen's h)
sample_size(test="two_proportions", prop1=0.40, prop2=0.55, power=0.80)
# 6. Correlation: detect r = 0.30
sample_size(test="correlation", effect_size=0.30, power=0.80)
# 7. Power curve for the grant figure
power_curve(test="t_ind", effect_size=0.5, n_range=range(10, 120, 5),
            save="power_curve.png")
Supported test= values: t_ind (two independent means), t_paired/t_one (paired or one-sample mean), anova (one-way), two_proportions, one_proportion, correlation, chi2 (goodness-of-fit / contingency via effect size w), linear_regression (R² increment / f²). Full argument tables and the underlying statsmodels calls are in references/closed_form_recipes.md.
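For the two-proportions case, the conversion to Cohen's h mentioned in example 5 can be checked by hand. This is a minimal sketch of the arcsine transformation and the matching normal-approximation sample size, written with the standard library; it is not the code inside scripts/power.py, and the function names are illustrative.

```python
from math import asin, ceil, sqrt
from statistics import NormalDist

def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * asin(sqrt(p2)) - 2 * asin(sqrt(p1))

def n_per_group_from_h(h, alpha=0.05, power=0.80):
    """Per-group n for two independent proportions, equal allocation."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * z_sum ** 2 / h ** 2)

h = cohens_h(0.40, 0.55)
print(f"h = {h:.3f}")
print(f"n per group ~ {n_per_group_from_h(h)}")
```

The same 15-point difference gives a different h elsewhere on the scale (for example 0.05 vs 0.20), which is why the two proportions are passed separately rather than as a single difference.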
Closed-form power exists only for a handful of simple tests. For logistic/Poisson regression, mixed-effects / repeated-measures models, cluster-randomized trials, survival analysis, mediation, multi-way interactions, or any non-standard analysis, the right tool is simulation. The logic is always the same three steps:
scripts/simulate_power.py provides a reusable harness plus worked examples (two-group difference, logistic regression, cluster-randomized trial with an ICC, and a linear mixed model). The core is just:
from simulate_power import simulate_power
def gen_and_test(n, rng):
    # build a dataset of size n under the assumed effect, run the planned test,
    # return True if the result is significant
    ...
est = simulate_power(gen_and_test, n=200, n_sims=2000, alpha=0.05)
print(f"Power at n=200: {est.power:.3f} (95% CI {est.ci_low:.3f}-{est.ci_high:.3f})")
Report the Monte Carlo confidence interval on the estimate (the harness returns it) so the reader knows whether 0.81 vs. 0.79 is signal or simulation noise. See references/simulation_based_power.md for the full patterns, including how to search for the n that hits target power and how to model dropout and clustering.
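As a self-contained illustration of the simulate, test, and count pattern, the sketch below estimates power for a two-group mean difference and attaches a Wilson interval to the estimate. It is a stand-in written with the standard library, not the bundled harness: estimate_power, the large-sample z-test, and the default d = 0.5 are all assumptions made for the example.

```python
import random
from math import sqrt
from statistics import NormalDist

def mean_and_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def gen_and_test(n, rng, d=0.5, alpha=0.05):
    """Simulate one two-group study and report whether it is significant."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [rng.gauss(d, 1.0) for _ in range(n)]
    m0, v0 = mean_and_var(control)
    m1, v1 = mean_and_var(treated)
    z = (m1 - m0) / sqrt(v0 / n + v1 / n)   # large-sample approximation
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)

def estimate_power(test_fn, n, n_sims=2000, seed=1):
    """Share of significant simulated studies, with a 95% Wilson interval."""
    rng = random.Random(seed)
    p = sum(test_fn(n, rng) for _ in range(n_sims)) / n_sims
    z = NormalDist().inv_cdf(0.975)
    denom = 1 + z * z / n_sims
    centre = (p + z * z / (2 * n_sims)) / denom
    half = z * sqrt(p * (1 - p) / n_sims + z * z / (4 * n_sims ** 2)) / denom
    return p, centre - half, centre + half

power_hat, ci_low, ci_high = estimate_power(gen_and_test, n=64)
print(f"Power at n=64: {power_hat:.3f} (95% CI {ci_low:.3f}-{ci_high:.3f})")
```

With 2000 simulations the interval is roughly ±0.02 wide near 80% power, so differences smaller than that between two designs should be treated as simulation noise unless n_sims is increased.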
These routinely make the difference between an adequately powered study and an underpowered one. Apply them explicitly and state that you did.
- Dropout and attrition: inflate enrolment with n_enroll = ceil(n_analyzed / (1 − dropout_rate)). A 20% dropout rate means enrolling 25% more than the formula returns.
- Clustering: multiply n by the design effect DEFF = 1 + (m − 1)·ICC, where m is cluster size and ICC the intraclass correlation. Treating clustered data as independent is pseudoreplication and badly overstates power; for cluster-randomized designs, simulate instead.
- Unequal allocation: pass ratio= so the calculation reflects it.
- Compute with scripts/power.py (closed-form) or scripts/simulate_power.py (simulation).

A defensible power statement contains every input, so a reader could reproduce it. Adapt:
A priori power analysis was conducted to determine the sample size needed to detect
a [between-group difference of Cohen's d = 0.50], which we considered the smallest
effect of clinical interest. With α = .05 (two-sided) and power = .80, a two-sample
t-test requires n = 64 per group (128 total; computed with statsmodels 0.14).
Allowing for 20% attrition, we will enrol 160 participants. A sensitivity analysis
showed required n ranges from 45 to 100 per group across plausible effects
d = 0.40–0.60 (Figure X).
For simulation: also state the data-generating assumptions (baseline rate, residual SD, ICC, cluster sizes), the number of simulations, and the Monte Carlo CI.
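The enrolment figure in the template, and the design-effect adjustment for clustered data, can be reproduced with a short sketch. The helper names are illustrative, and the cluster size and ICC below are hypothetical values chosen only to show the arithmetic.

```python
from fractions import Fraction
from math import ceil

def enrol_for_dropout(n_analyzed, dropout_rate):
    """n_enroll = ceil(n_analyzed / (1 - dropout_rate)), in exact arithmetic."""
    retained = 1 - Fraction(str(dropout_rate))   # avoids float rounding in ceil
    return ceil(Fraction(n_analyzed) / retained)

def design_effect(cluster_size, icc):
    """DEFF = 1 + (m - 1) * ICC."""
    return 1 + (cluster_size - 1) * icc

n_total = 2 * 64                                  # 64 per group from the template
print("Enrol for 20% attrition:", enrol_for_dropout(n_total, 0.20))

deff = design_effect(cluster_size=20, icc=0.05)   # hypothetical m and ICC
print(f"DEFF = {deff:.2f}, clustered n = {ceil(n_total * deff)}")
```

Dividing by the retention rate, rather than multiplying by 1 + dropout_rate, is what makes 20% attrition cost 25% more enrolment. Even a modest ICC of 0.05 nearly doubles the required sample at 20 participants per cluster.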
- scripts/power.py: unified closed-form interface (sample_size, power, mde, power_curve) over statsmodels/pingouin for all standard tests.
- scripts/simulate_power.py: Monte Carlo power harness with simulate_power() and find_sample_size(), plus worked examples (two-group, logistic regression, cluster-randomized, linear mixed model).
- references/closed_form_recipes.md: per-test argument tables and exact statsmodels/pingouin calls, including proportions, chi-square, and regression.
- references/simulation_based_power.md: full simulation patterns for GLMs, mixed models, cluster designs, survival, and dropout.
- references/effect_sizes.md: choosing effect sizes (SESOI), Cohen's benchmarks, and conversions between d, f, r, η²/f², OR, h, and w.