Back to Claude Scientific Skills

Statistical Power & Sample Size

skills/statistical-power/SKILL.md

2.52.012.8 KB
Original Source

Statistical Power & Sample Size

Overview

Power analysis answers one of the most consequential questions in study planning: how large a sample do you need to reliably detect an effect of a given size, and what could you detect with the sample you can afford? An underpowered study wastes resources and produces inconclusive or irreproducible results; an overpowered one wastes participants, money, and (in clinical work) exposes more people to risk than necessary. Getting this right before data collection is the single highest-leverage statistical decision in a project.

Four quantities are locked together for any given test: sample size (n), effect size, significance level (α), and power (1 − β). Fix any three and the fourth is determined. Every calculation in this skill is some rearrangement of that relationship.

This skill covers the two ways to do power analysis:

  • Closed-form formulas (fast, exact for standard tests) — see references/closed_form_recipes.md.
  • Simulation / Monte Carlo (works for any design or model you can simulate and analyze) — see references/simulation_based_power.md.

For choosing and converting effect sizes — usually the hardest part — see references/effect_sizes.md.

When to Use This Skill

  • Determining required sample size before collecting data (a priori power analysis)
  • Finding the minimum detectable effect (MDE) for a fixed, already-determined sample size
  • Producing power curves (power vs. n, or power vs. effect size) for a grant or protocol
  • Justifying a sample size for an IRB submission, grant, or pre-registration
  • Powering designs with unequal group sizes or non-1:1 allocation
  • Powering anything without a textbook formula (mixed models, logistic/Poisson regression, cluster-randomized trials, survival analysis, mediation, interactions) via simulation
  • Accounting for multiple comparisons, attrition/dropout, or clustering in the sample-size estimate

Installation

Use uv. Pin versions in production; unpinned is fine for exploration.

bash
uv pip install "statsmodels>=0.14.6" "scipy>=1.11" "pingouin>=0.6" "numpy>=1.26" matplotlib pandas
# For simulation-based power of advanced models (optional, add as needed):
uv pip install lifelines            # survival
# mixed models and GLMs come with statsmodels

Compatibility note: use statsmodels>=0.14.6 with scipy>=1.11 to avoid _lazywhere import errors on SciPy 1.16+. Pingouin 0.5+ renamed power-function arguments to match the names used below.


The one decision that drives everything: the effect size

Power calculations are only as trustworthy as the effect size you feed them. Do not invent a number. Use, in rough order of preference:

  1. A minimally important effect — the smallest effect that would actually change a decision or matter scientifically/clinically (the "smallest effect size of interest", SESOI). This is the most defensible basis: you power to detect what matters, not what you hope to see.
  2. A pilot or prior-study estimate, but shrink it — published and pilot effects are inflated by publication bias and the winner's curse. Powering on a raw pilot estimate routinely underpowers the real study.
  3. A convention (Cohen's small/medium/large) only as a last resort, and say so explicitly.

Whatever you pick, run a sensitivity analysis: report how required n changes across a plausible range of effect sizes, not a single point. A power analysis presented as one number hides its biggest source of uncertainty. See references/effect_sizes.md for benchmarks and conversions between d, f, r, η², odds ratios, and Cohen's h/w.

Avoid post-hoc ("observed") power. Computing power from the effect size you just estimated is circular: it is a deterministic function of the p-value and tells you nothing new. If a study is already done and you want to know what it could have detected, report a sensitivity analysis (MDE at the achieved n) or, better, the confidence interval around the observed effect. This is a common reviewer complaint — do not produce observed power even if asked without flagging the issue.


Quick recipes (closed-form)

The bundled scripts/power.py wraps statsmodels into one consistent interface so you don't have to remember which solver belongs to which test. Run from the skill directory or add scripts/ to sys.path.

python
from power import sample_size, power, mde, power_curve

# 1. How many per group to detect Cohen's d = 0.5, two-sided, 80% power?
sample_size(test="t_ind", effect_size=0.5, power=0.80, alpha=0.05)
# -> required n per group

# 2. Two groups, 3:1 allocation (e.g. more controls than cases)
sample_size(test="t_ind", effect_size=0.5, power=0.80, ratio=3.0)

# 3. Fixed n=30/group — what's the minimum detectable d at 80% power?
mde(test="t_ind", nobs1=30, power=0.80, alpha=0.05)

# 4. One-way ANOVA, 4 groups, detect Cohen's f = 0.25
sample_size(test="anova", effect_size=0.25, k_groups=4, power=0.80)

# 5. Two proportions: 0.40 vs 0.55 (auto-converts to Cohen's h)
sample_size(test="two_proportions", prop1=0.40, prop2=0.55, power=0.80)

# 6. Correlation: detect r = 0.30
sample_size(test="correlation", effect_size=0.30, power=0.80)

# 7. Power curve for the grant figure
power_curve(test="t_ind", effect_size=0.5, n_range=range(10, 120, 5),
            save="power_curve.png")

Supported test= values: t_ind (two independent means), t_paired/t_one (paired or one-sample mean), anova (one-way), two_proportions, one_proportion, correlation, chi2 (goodness-of-fit / contingency via effect size w), linear_regression (R² increment / f²). Full argument tables and the underlying statsmodels calls are in references/closed_form_recipes.md.


When there is no formula: simulate

Closed-form power exists only for a handful of simple tests. For logistic/Poisson regression, mixed-effects / repeated-measures models, cluster-randomized trials, survival analysis, mediation, multi-way interactions, or any non-standard analysis, the right tool is simulation. The logic is always the same three steps:

  1. Simulate a dataset from your assumed truth (the effect you want to detect, plus realistic noise, baseline rates, cluster structure, etc.).
  2. Analyze it with the exact test/model you plan to use on the real data.
  3. Repeat many times (≥1,000; 5,000–10,000 for a stable estimate near 80%). Power is the fraction of replicates in which the test is significant.

scripts/simulate_power.py provides a reusable harness plus worked examples (two-group difference, logistic regression, cluster-randomized trial with an ICC, and a linear mixed model). The core is just:

python
from simulate_power import simulate_power

def gen_and_test(n, rng):
    # build a dataset of size n under the assumed effect, run the planned test,
    # return True if the result is significant
    ...

est = simulate_power(gen_and_test, n=200, n_sims=2000, alpha=0.05)
print(f"Power at n=200: {est.power:.3f} (95% CI {est.ci_low:.3f}-{est.ci_high:.3f})")

Report the Monte Carlo confidence interval on the estimate (the harness returns it) so the reader knows whether 0.81 vs. 0.79 is signal or simulation noise. See references/simulation_based_power.md for the full patterns, including how to search for the n that hits target power and how to model dropout and clustering.


Adjustments people forget

These routinely make the difference between an adequately powered study and an underpowered one. Apply them explicitly and state that you did.

  • Multiple comparisons. If the analysis tests m hypotheses with a Bonferroni-style correction, power each test at the corrected α (e.g. α/m), which raises n. Better: power on the family-wise or FDR-controlled procedure directly via simulation. Ignoring this silently underpowers every secondary endpoint.
  • Attrition / dropout / unusable samples. Power gives the n you need analyzed. Inflate the enrolled n: n_enroll = ceil(n_analyzed / (1 − dropout_rate)). A 20% dropout rate means enrolling 25% more than the formula returns.
  • Clustering (design effect). When observations are nested (patients within clinics, cells within animals, repeated measures within subject), the effective sample size is smaller than the raw count. Inflate by the design effect DEFF = 1 + (m − 1)·ICC, where m is cluster size and ICC the intraclass correlation. Treating clustered data as independent is pseudoreplication and badly overstates power — for cluster-randomized designs, simulate instead.
  • One- vs. two-sided. Two-sided is the default and almost always the right choice; a one-sided test buys power only by refusing to detect an effect in the unexpected direction. Justify any one-sided test.
  • Unequal allocation. Equal groups are most efficient for a fixed total n. If allocation is fixed by design (e.g. 2:1 treatment:control), pass ratio= so the calculation reflects it.

Workflow

  1. State the design and the planned analysis. The test you will run determines the power method. If the analysis is a mixed model or GLM, go straight to simulation.
  2. Choose the effect size on a defensible basis (SESOI > shrunk pilot > convention) and write down the justification.
  3. Set α and target power. Conventional defaults are α = 0.05 (two-sided) and power = 0.80; 0.90 is common for confirmatory/clinical work. State them.
  4. Compute with scripts/power.py (closed-form) or scripts/simulate_power.py (simulation).
  5. Sensitivity analysis. Recompute across a range of plausible effect sizes and produce a power curve. This is the deliverable, not a single number.
  6. Apply adjustments for dropout, clustering, and multiplicity.
  7. Report following the template below.

Reporting template

A defensible power statement contains every input, so a reader could reproduce it. Adapt:

A priori power analysis was conducted to determine the sample size needed to detect
a [between-group difference of Cohen's d = 0.50], which we considered the smallest
effect of clinical interest. With α = .05 (two-sided) and power = .80, a two-sample
t-test requires n = 64 per group (128 total; computed with statsmodels 0.14).
Allowing for 20% attrition, we will enrol 160 participants. A sensitivity analysis
showed required n ranges from 45 to 105 per group across plausible effects
d = 0.40–0.60 (Figure X).

For simulation: also state the data-generating assumptions (baseline rate, residual SD, ICC, cluster sizes), the number of simulations, and the Monte Carlo CI.


Common pitfalls

  1. Inventing the effect size or copying an inflated pilot estimate — the most common way power analyses go wrong.
  2. Reporting a single n instead of a sensitivity range / power curve.
  3. Post-hoc / observed power — circular and uninformative; use sensitivity analysis or the effect-size CI instead.
  4. Ignoring clustering (pseudoreplication) — counting cells/measurements as if they were independent subjects.
  5. Forgetting dropout — powering the analyzed n but enrolling the same number.
  6. Confusing α with power, or one-sided with two-sided.
  7. Powering only the primary endpoint while reporting secondary/interaction tests that need far larger n.
  8. Using a t-test formula for a model you won't actually fit (e.g. planning a logistic regression with a means-based calculation) — match the power method to the planned analysis.

Resources

Scripts

  • scripts/power.py — unified closed-form interface (sample_size, power, mde, power_curve) over statsmodels/pingouin for all standard tests.
  • scripts/simulate_power.py — Monte Carlo power harness with simulate_power() and find_sample_size(), plus worked examples (two-group, logistic regression, cluster-randomized, linear mixed model).

References

  • references/closed_form_recipes.md — per-test argument tables and exact statsmodels/pingouin calls, including proportions, chi-square, and regression.
  • references/simulation_based_power.md — full simulation patterns for GLMs, mixed models, cluster designs, survival, and dropout.
  • references/effect_sizes.md — choosing effect sizes (SESOI), Cohen's benchmarks, and conversions between d, f, r, η²/f², OR, h, and w.
  • experimental-design — once you know n, lay out the actual study (randomization, blocking, factorial/DOE, crossover, sequential designs).
  • statistical-analysis — assumption checks, running the test, effect sizes, and APA reporting after data collection.
  • statsmodels / pymc — fitting the models referenced here.

Key references

  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.).
  • Lakens, D. (2022). Sample Size Justification. Collabra: Psychology, 8(1).
  • Arnold, B. F. et al. (2011). Simulation methods to estimate design power. BMC Medical Research Methodology, 11:94.