Effect Sizes and Power Analysis

This document provides guidance on calculating, interpreting, and reporting effect sizes, as well as conducting power analyses for study planning.

Why Effect Sizes Matter

Statistical significance ≠ practical significance: p-values only tell if an effect exists, not how large it is
Sample size dependent: With large samples, trivial effects become "significant"
Interpretation: Effect sizes provide magnitude and practical importance
Meta-analysis: Effect sizes enable combining results across studies
Power analysis: Required for sample size determination

Golden rule: ALWAYS report effect sizes alongside p-values.

Effect Sizes by Analysis Type

T-Tests and Mean Differences

Cohen's d (Standardized Mean Difference)

Formula:

Independent groups: d = (M₁ - M₂) / SD_pooled
Paired groups: d = M_diff / SD_diff

Interpretation (Cohen, 1988):

Small: |d| = 0.20
Medium: |d| = 0.50
Large: |d| = 0.80

Context-dependent interpretation:

In education: d = 0.40 is typical for successful interventions
In psychology: d = 0.40 is considered meaningful
In medicine: Small effect sizes can be clinically important

Python calculation:

python

import pingouin as pg
import numpy as np

# Independent t-test with effect size
result = pg.ttest(group1, group2, correction=False)
cohens_d = result['cohen-d'].values[0]

# Manual calculation
mean_diff = np.mean(group1) - np.mean(group2)
pooled_std = np.sqrt((np.var(group1, ddof=1) + np.var(group2, ddof=1)) / 2)
cohens_d = mean_diff / pooled_std

# Paired t-test
result = pg.ttest(pre, post, paired=True)
cohens_d = result['cohen-d'].values[0]

Confidence intervals for d:

python

from pingouin import compute_effsize_from_t

d, ci = compute_effsize_from_t(t_statistic, nx=n1, ny=n2, eftype='cohen')

Hedges' g (Bias-Corrected d)

Why use it: Cohen's d has slight upward bias with small samples (n < 20)

Formula: g = d × correction_factor, where correction_factor = 1 - 3/(4df - 1)

Python calculation:

python

result = pg.ttest(group1, group2, correction=False)
hedges_g = result['hedges'].values[0]

Use Hedges' g when:

Sample sizes are small (n < 20 per group)
Conducting meta-analyses (standard in meta-analysis)

Glass's Δ (Delta)

When to use: When one group is a control with known variability

Formula: Δ = (M₁ - M₂) / SD_control

Use cases:

Clinical trials (use control group SD)
When treatment affects variability

ANOVA

Eta-squared (η²)

What it measures: Proportion of total variance explained by factor

Formula: η² = SS_effect / SS_total

Interpretation:

Small: η² = 0.01 (1% of variance)
Medium: η² = 0.06 (6% of variance)
Large: η² = 0.14 (14% of variance)

Limitation: Biased with multiple factors (sums to > 1.0)

Python calculation:

python

import pingouin as pg

# One-way ANOVA
aov = pg.anova(dv='value', between='group', data=df)
eta_squared = aov['SS'][0] / aov['SS'].sum()

# Or use pingouin directly
aov = pg.anova(dv='value', between='group', data=df, detailed=True)
eta_squared = aov['np2'][0]  # Note: pingouin reports partial eta-squared

Partial Eta-squared (η²_p)

What it measures: Proportion of variance explained by factor, excluding other factors

Formula: η²_p = SS_effect / (SS_effect + SS_error)

Interpretation: Same benchmarks as η²

When to use: Multi-factor ANOVA (standard in factorial designs)

Python calculation:

python

aov = pg.anova(dv='value', between=['factor1', 'factor2'], data=df)
# pingouin reports partial eta-squared by default
partial_eta_sq = aov['np2']

Omega-squared (ω²)

What it measures: Less biased estimate of population variance explained

Why use it: η² overestimates effect size; ω² provides better population estimate

Formula: ω² = (SS_effect - df_effect × MS_error) / (SS_total + MS_error)

Interpretation: Same benchmarks as η², but typically smaller values

Python calculation:

python

def omega_squared(aov_table):
    ss_effect = aov_table.loc[0, 'SS']
    ss_total = aov_table['SS'].sum()
    ms_error = aov_table.loc[aov_table.index[-1], 'MS']  # Residual MS
    df_effect = aov_table.loc[0, 'DF']

    omega_sq = (ss_effect - df_effect * ms_error) / (ss_total + ms_error)
    return omega_sq

Cohen's f

What it measures: Effect size for ANOVA (analogous to Cohen's d)

Formula: f = √(η² / (1 - η²))

Interpretation:

Small: f = 0.10
Medium: f = 0.25
Large: f = 0.40

Python calculation:

python

eta_squared = 0.06  # From ANOVA
cohens_f = np.sqrt(eta_squared / (1 - eta_squared))

Use in power analysis: Required for ANOVA power calculations

Correlation

Pearson's r / Spearman's ρ

Interpretation:

Small: |r| = 0.10
Medium: |r| = 0.30
Large: |r| = 0.50

Important notes:

r² = coefficient of determination (proportion of variance explained)
r = 0.30 means 9% shared variance (0.30² = 0.09)
Consider direction (positive/negative) and context

Python calculation:

python

import pingouin as pg

# Pearson correlation with CI
result = pg.corr(x, y, method='pearson')
r = result['r'].values[0]
ci = [result['CI95%'][0][0], result['CI95%'][0][1]]

# Spearman correlation
result = pg.corr(x, y, method='spearman')
rho = result['r'].values[0]

Regression

R² (Coefficient of Determination)

What it measures: Proportion of variance in Y explained by model

Interpretation:

Small: R² = 0.02
Medium: R² = 0.13
Large: R² = 0.26

Context-dependent:

Physical sciences: R² > 0.90 expected
Social sciences: R² > 0.30 considered good
Behavior prediction: R² > 0.10 may be meaningful

Python calculation:

python

from sklearn.metrics import r2_score
from statsmodels.api import OLS

# Using statsmodels
model = OLS(y, X).fit()
r_squared = model.rsquared
adjusted_r_squared = model.rsquared_adj

# Manual
r_squared = 1 - (SS_residual / SS_total)

Adjusted R²

Why use it: R² artificially increases when adding predictors; adjusted R² penalizes model complexity

Formula: R²_adj = 1 - (1 - R²) × (n - 1) / (n - k - 1)

When to use: Always report alongside R² for multiple regression

Standardized Regression Coefficients (β)

What it measures: Effect of one-SD change in predictor on outcome (in SD units)

Interpretation: Similar to Cohen's d

Small: |β| = 0.10
Medium: |β| = 0.30
Large: |β| = 0.50

Python calculation:

python

from scipy import stats

# Standardize variables first
X_std = (X - X.mean()) / X.std()
y_std = (y - y.mean()) / y.std()

model = OLS(y_std, X_std).fit()
beta = model.params

f² (Cohen's f-squared for Regression)

What it measures: Effect size for individual predictors or model comparison

Formula: f² = R²_AB - R²_A / (1 - R²_AB)

Where:

R²_AB = R² for full model with predictor
R²_A = R² for reduced model without predictor

Interpretation:

Small: f² = 0.02
Medium: f² = 0.15
Large: f² = 0.35

Python calculation:

python

# Compare two nested models
model_full = OLS(y, X_full).fit()
model_reduced = OLS(y, X_reduced).fit()

r2_full = model_full.rsquared
r2_reduced = model_reduced.rsquared

f_squared = (r2_full - r2_reduced) / (1 - r2_full)

Categorical Data Analysis

Cramér's V

What it measures: Association strength for χ² test (works for any table size)

Formula: V = √(χ² / (n × (k - 1)))

Where k = min(rows, columns)

Interpretation (for k > 2):

Small: V = 0.07
Medium: V = 0.21
Large: V = 0.35

For 2×2 tables: Use phi coefficient (φ)

Python calculation:

python

from scipy.stats.contingency import association

# Cramér's V
cramers_v = association(contingency_table, method='cramer')

# Phi coefficient (for 2x2)
phi = association(contingency_table, method='pearson')

Odds Ratio (OR) and Risk Ratio (RR)

For 2×2 contingency tables:

	Outcome +	Outcome -
Exposed	a	b
Unexposed	c	d

Odds Ratio: OR = (a/b) / (c/d) = ad / bc

Interpretation:

OR = 1: No association
OR > 1: Positive association (increased odds)
OR < 1: Negative association (decreased odds)
OR = 2: Twice the odds
OR = 0.5: Half the odds

Risk Ratio: RR = (a/(a+b)) / (c/(c+d))

When to use:

Cohort studies: Use RR (more interpretable)
Case-control studies: Use OR (RR not available)
Logistic regression: OR is natural output

Python calculation:

python

import statsmodels.api as sm

# From contingency table
odds_ratio = (a * d) / (b * c)

# Confidence interval
table = np.array([[a, b], [c, d]])
oddsratio, pvalue = stats.fisher_exact(table)

# From logistic regression
model = sm.Logit(y, X).fit()
odds_ratios = np.exp(model.params)  # Exponentiate coefficients
ci = np.exp(model.conf_int())  # Exponentiate CIs

Bayesian Effect Sizes

Bayes Factor (BF)

What it measures: Ratio of evidence for alternative vs. null hypothesis

Interpretation:

BF₁₀ = 1: Equal evidence for H₁ and H₀
BF₁₀ = 3: H₁ is 3× more likely than H₀ (moderate evidence)
BF₁₀ = 10: H₁ is 10× more likely than H₀ (strong evidence)
BF₁₀ = 100: H₁ is 100× more likely than H₀ (decisive evidence)
BF₁₀ = 0.33: H₀ is 3× more likely than H₁
BF₁₀ = 0.10: H₀ is 10× more likely than H₁

Classification (Jeffreys, 1961):

1-3: Anecdotal evidence
3-10: Moderate evidence
10-30: Strong evidence
30-100: Very strong evidence
100: Decisive evidence

Python calculation:

python

import pingouin as pg

# Bayesian t-test
result = pg.ttest(group1, group2, correction=False)
# Note: pingouin doesn't include BF; use other packages

# Using JASP or BayesFactor (R) via rpy2
# Or implement using numerical integration

Power Analysis

Concepts

Statistical power: Probability of detecting an effect if it exists (1 - β)

Conventional standards:

Power = 0.80 (80% chance of detecting effect)
α = 0.05 (5% Type I error rate)

Four interconnected parameters (given 3, can solve for 4th):

Sample size (n)
Effect size (d, f, etc.)
Significance level (α)
Power (1 - β)

A Priori Power Analysis (Planning)

Purpose: Determine required sample size before study

Steps:

Specify expected effect size (from literature, pilot data, or minimum meaningful effect)
Set α level (typically 0.05)
Set desired power (typically 0.80)
Calculate required n

Python implementation:

python

from statsmodels.stats.power import (
    tt_ind_solve_power,
    zt_ind_solve_power,
    FTestAnovaPower,
    NormalIndPower
)

# T-test power analysis
n_required = tt_ind_solve_power(
    effect_size=0.5,  # Cohen's d
    alpha=0.05,
    power=0.80,
    ratio=1.0,  # Equal group sizes
    alternative='two-sided'
)

# ANOVA power analysis
anova_power = FTestAnovaPower()
n_per_group = anova_power.solve_power(
    effect_size=0.25,  # Cohen's f
    ngroups=3,
    alpha=0.05,
    power=0.80
)

# Correlation power analysis
from pingouin import power_corr
n_required = power_corr(r=0.30, power=0.80, alpha=0.05)

Post Hoc Power Analysis (After Study)

⚠️ CAUTION: Post hoc power is controversial and often not recommended

Why it's problematic:

Observed power is a direct function of p-value
If p > 0.05, power is always low
Provides no additional information beyond p-value
Can be misleading

When it might be acceptable:

Study planning for future research
Using effect size from multiple studies (not just your own)
Explicit goal is sample size for replication

Better alternatives:

Report confidence intervals for effect sizes
Conduct sensitivity analysis
Report minimum detectable effect size

Sensitivity Analysis

Purpose: Determine minimum detectable effect size given study parameters

When to use: After study is complete, to understand study's capability

Python implementation:

python

# What effect size could we detect with n=50 per group?
detectable_effect = tt_ind_solve_power(
    effect_size=None,  # Solve for this
    nobs1=50,
    alpha=0.05,
    power=0.80,
    ratio=1.0,
    alternative='two-sided'
)

print(f"With n=50 per group, we could detect d ≥ {detectable_effect:.2f}")

Reporting Effect Sizes

APA Style Guidelines

T-test example:

"Group A (M = 75.2, SD = 8.5) scored significantly higher than Group B (M = 68.3, SD = 9.2), t(98) = 3.82, p < .001, d = 0.77, 95% CI [0.36, 1.18]."

ANOVA example:

"There was a significant main effect of treatment condition on test scores, F(2, 87) = 8.45, p < .001, η²p = .16. Post hoc comparisons using Tukey's HSD revealed..."

Correlation example:

"There was a moderate positive correlation between study time and exam scores, r(148) = .42, p < .001, 95% CI [.27, .55]."

Regression example:

"The regression model significantly predicted exam scores, F(3, 146) = 45.2, p < .001, R² = .48. Study hours (β = .52, p < .001) and prior GPA (β = .31, p < .001) were significant predictors."

Bayesian example:

"A Bayesian independent samples t-test provided strong evidence for a difference between groups, BF₁₀ = 23.5, indicating the data are 23.5 times more likely under H₁ than H₀."

Effect Size Pitfalls

Don't only rely on benchmarks: Context matters; small effects can be meaningful
Report confidence intervals: CIs show precision of effect size estimate
Distinguish statistical vs. practical significance: Large n can make trivial effects "significant"
Consider cost-benefit: Even small effects may be valuable if intervention is low-cost
Multiple outcomes: Effect sizes vary across outcomes; report all
Don't cherry-pick: Report effects for all planned analyses
Publication bias: Published effects are often overestimated

Quick Reference Table

Analysis	Effect Size	Small	Medium	Large
T-test	Cohen's d	0.20	0.50	0.80
ANOVA	η², ω²	0.01	0.06	0.14
ANOVA	Cohen's f	0.10	0.25	0.40
Correlation	r, ρ	0.10	0.30	0.50
Regression	R²	0.02	0.13	0.26
Regression	f²	0.02	0.15	0.35
Chi-square	Cramér's V	0.07	0.21	0.35
Chi-square (2×2)	φ	0.10	0.30	0.50

Resources

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.)
Lakens, D. (2013). Calculating and reporting effect sizes
Ellis, P. D. (2010). The Essential Guide to Effect Sizes