scientific-skills/scientific-critical-thinking/references/statistical_pitfalls.md
Misconception: p = .05 means 5% chance the null hypothesis is true.
Reality: P-value is the probability of observing data this extreme (or more) if the null hypothesis is true. It says nothing about the probability the hypothesis is true.
Correct interpretation: "If there were truly no effect, we would observe data at least this extreme only 5% of the time."
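The distinction can be made concrete with a toy base-rate calculation. The numbers below (a 10% prior rate of true effects, 80% power, alpha = .05) are illustrative assumptions, not values from this reference:

```python
# Toy calculation: P(null is true | p < .05) under assumed base rates.
# Assumptions (illustrative): 10% of tested hypotheses are real effects,
# tests have 80% power, and the significance threshold is .05.
prior_true = 0.10   # fraction of tested hypotheses that are real effects
power = 0.80        # P(significant | effect is real)
alpha = 0.05        # P(significant | null is true)

p_sig_and_null = (1 - prior_true) * alpha   # false positives
p_sig_and_real = prior_true * power         # true positives
p_null_given_sig = p_sig_and_null / (p_sig_and_null + p_sig_and_real)

print(f"P(null | significant) = {p_null_given_sig:.2f}")  # 0.36, not 0.05
```

Even with a significant result, the probability the null is true here is 36%, not 5%: the p-value and P(null | data) are different quantities.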
Misconception: p > .05 proves there's no effect.
Reality: Absence of evidence ≠ evidence of absence. Non-significant results may indicate:
- A true absence of effect
- Insufficient statistical power (sample too small)
- Excessive measurement noise
Better approach:
- Report effect sizes with confidence intervals
- Use equivalence testing (e.g., TOST) when the goal is to demonstrate the absence of a meaningful effect
Misconception: Statistical significance means practical importance.
Reality: With large samples, trivial effects become "significant." A statistically significant 0.1 IQ point difference is meaningless in practice.
Better approach:
- Report and interpret effect sizes alongside p-values
- Ask whether the effect is large enough to matter in practice
Misconception: p = .049 and p = .051 are meaningfully different because one crosses the .05 threshold.
Reality: These represent nearly identical evidence. The .05 threshold is arbitrary.
Better approach:
- Report exact p-values and treat the evidence as continuous
- Avoid describing results as "significant" vs. "non-significant" as if they differed in kind
Misconception: One-tailed tests are free extra power.
Reality: One-tailed tests assume effects can only go one direction, which is rarely true. They're often used to artificially boost significance.
When appropriate: Only when effects in one direction are theoretically impossible or equivalent to null.
Problem: Testing 20 independent hypotheses at p < .05 gives a ~64% chance (1 − 0.95^20) of at least one false positive.
Examples:
- Testing many outcome measures, time points, or subgroups
- Genome-wide or brain-wide scans involving thousands of tests
Solutions:
- Control the family-wise error rate (e.g., Bonferroni or Holm correction)
- Control the false discovery rate (e.g., Benjamini-Hochberg)
- Pre-register the primary hypothesis
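The family-wise error rate for independent tests follows directly from the complement rule, and the Bonferroni correction is a one-line fix; a minimal sketch:

```python
# Family-wise error rate for k independent tests at alpha = .05,
# and the Bonferroni-corrected per-test threshold that restores it.
alpha, k = 0.05, 20
fwer = 1 - (1 - alpha) ** k
print(f"P(at least one false positive) = {fwer:.3f}")  # 0.642
print(f"Bonferroni per-test threshold: {alpha / k}")   # 0.0025
```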
Problem: Testing many subgroups until finding significance.
Why problematic:
- Each subgroup test is another comparison, inflating the false-positive rate
- Subgroups are smaller, so "significant" subgroup effects are often noise
Solutions:
- Pre-specify subgroup analyses
- Test for interactions rather than running separate subgroup tests
- Correct for the number of subgroups examined
Problem: Analyzing many outcomes, reporting only significant ones.
Detection signs:
- Reported outcomes differ from those in the protocol or preregistration
- Oddly specific or unconventional outcome definitions
Solutions:
- Preregister primary and secondary outcomes
- Report all measured outcomes, significant or not
Problem: Small samples have low probability of detecting true effects.
Consequences:
- True effects are missed (Type II errors)
- Significant results from underpowered studies tend to overestimate effect sizes (the "winner's curse")
Solutions:
- Conduct an a priori power analysis
- Collect larger samples or pool data across studies
Problem: Calculating power after seeing results is circular and uninformative.
Why useless:
- Observed power is a direct function of the observed p-value, so it adds no new information
- A non-significant result will always correspond to low observed power
Better approach:
- Plan power before data collection
- Interpret the confidence interval to see which effect sizes the data rule out
Problem: Trusting results from very small samples.
Issues:
- Estimates are unstable and sensitive to outliers
- Significant effects from small samples tend to be inflated
Guidelines:
- Justify the sample size with a power analysis
- Treat small-sample findings as preliminary until replicated
Problem: Focusing only on significance, not magnitude.
Why problematic:
- A p-value says nothing about how large or important an effect is
- Readers cannot judge practical relevance from significance alone
Solutions:
- Report standardized effect sizes (e.g., Cohen's d, r, odds ratios) with confidence intervals
Problem: Treating Cohen's d = 0.5 as "medium" without context.
Reality:
- Cohen's benchmarks (0.2 small, 0.5 medium, 0.8 large) were offered as rough defaults, not universal standards
- The same d can be trivial in one field and enormous in another
Better approach:
- Compare the effect to typical effect sizes in the specific literature
- Translate the effect into practically meaningful units
Problem: "Only explains 5% of variance" = unimportant.
Reality:
- A small proportion of explained variance can still matter enormously; a treatment that explains little variance in mortality can still save many lives
- Variance explained depends on the range and reliability of the measures
Consideration: Context matters more than percentage alone.
Problem: Inferring causation from correlation.
Alternative explanations:
- Reverse causation (Y causes X)
- Confounding (Z causes both X and Y)
- Selection bias or chance
Criteria for causation:
- Temporal precedence (cause before effect)
- Plausible mechanism and dose-response relationship
- Ruling out confounders, ideally via randomized experiments
Problem: Inferring individual-level relationships from group-level data.
Example: That countries with higher chocolate consumption have more Nobel laureates does not mean eating chocolate makes you win a Nobel.
Why problematic: Group-level correlations may not hold at individual level.
Problem: Trend appears in groups but reverses when combined (or vice versa).
Example: Treatment appears worse overall but better in every subgroup.
Cause: Confounding variable distributed differently across groups.
Solution: Consider confounders and look at appropriate level of analysis.
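A minimal numeric sketch of the reversal (the counts follow a classic textbook example, not data from this document):

```python
# Hypothetical counts illustrating Simpson's paradox: treatment A beats
# B within every severity subgroup, yet looks worse when subgroups are
# pooled, because A was given mostly to the harder (severe) cases.
data = {  # subgroup: {treatment: (successes, total)}
    "mild":   {"A": (81, 87),   "B": (234, 270)},
    "severe": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, total):
    return successes / total

for grp, arms in data.items():
    print(grp, f"A = {rate(*arms['A']):.2f}", f"B = {rate(*arms['B']):.2f}")

pooled = {}
for t in ("A", "B"):
    s = sum(data[g][t][0] for g in data)
    n = sum(data[g][t][1] for g in data)
    pooled[t] = rate(s, n)

print("pooled", f"A = {pooled['A']:.2f}", f"B = {pooled['B']:.2f}")
# A wins in both subgroups but loses overall.
```

The confounder here is case severity: comparing within subgroups, or adjusting for severity, resolves the reversal.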
Problem: Model fits sample data well but doesn't generalize.
Causes:
- Too many parameters relative to the sample size
- Data-driven model selection on the same data used for evaluation
Solutions:
- Use cross-validation or a held-out test set
- Prefer simpler models; use regularization
Problem: Predicting outside the range of observed data.
Why dangerous:
- The fitted relationship may change shape outside the observed range
- There are no data to check the model where the prediction is made
Solution: Only interpolate; avoid extrapolation.
Problem: Using statistical tests without checking assumptions.
Common violations:
- Non-normal residuals
- Unequal variances (heteroscedasticity)
- Non-independent observations (e.g., clustering, repeated measures)
Solutions:
- Inspect diagnostic plots before interpreting results
- Use robust, nonparametric, or mixed-model alternatives when assumptions fail
Problem: "We controlled for X and it wasn't significant, so it's not a confounder."
Reality: Non-significant covariates can still be important confounders. Significance ≠ confounding.
Solution: Include theoretically important covariates regardless of significance.
Problem: When predictors are highly correlated, true effects may appear non-significant.
Manifestations:
- Unstable coefficients that flip sign across model specifications
- Large standard errors despite a good overall model fit
Detection:
- Inspect the predictor correlation matrix
- Compute variance inflation factors (VIF); values above ~10 are a common warning sign
Solutions:
- Drop or combine redundant predictors
- Use regularization (ridge/lasso) or dimension reduction
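For the two-predictor case, the variance inflation factor reduces to 1/(1 − r²), where r is the correlation between the predictors; a pure-stdlib sketch with simulated predictors (all names and parameters are illustrative):

```python
import math
import random

# Two-predictor VIF sketch: when x2 is nearly a copy of x1, their
# correlation r is close to 1 and VIF = 1 / (1 - r^2) explodes,
# signalling unstable coefficient estimates. (Simulated data.)
random.seed(1)
x1 = [random.gauss(0, 1) for _ in range(500)]
x2 = [a + random.gauss(0, 0.2) for a in x1]   # x2 ~ x1 plus small noise

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

r = pearson(x1, x2)
vif = 1 / (1 - r * r)
print(f"r = {r:.3f}, VIF = {vif:.1f}")  # VIF far above 10: trouble
```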
Problem: Conducting multiple t-tests instead of ANOVA.
Why wrong: Inflates Type I error rate dramatically.
Correct approach:
- Run an omnibus ANOVA first
- Follow up with post-hoc comparisons that control the family-wise error rate (e.g., Tukey's HSD)
Problem: Using Pearson's r for curved relationships.
Why misleading: r measures linear relationships only.
Solutions:
- Always plot the data first
- Use Spearman's rank correlation for monotonic nonlinear relationships
- Fit an explicitly nonlinear model or transform the variables
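A sketch of the contrast using pure stdlib; the exponential relationship below is an arbitrary choice of monotone nonlinear function:

```python
import math

# A strongly monotonic but nonlinear relationship: Spearman's rank
# correlation is 1.0, while Pearson's r understates the association
# because it only measures linear fit.
x = list(range(1, 21))
y = [math.exp(v / 3) for v in x]   # monotone increasing, convex

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson(ranks(xs), ranks(ys))  # assumes no ties

print(f"Pearson  r = {pearson(x, y):.3f}")   # well below 1
print(f"Spearman  = {spearman(x, y):.3f}")   # 1.000: perfectly monotone
```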
Problem: Chi-square test with expected cell counts < 5.
Why wrong: Violates test assumptions, p-values inaccurate.
Solutions:
- Use Fisher's exact test
- Combine sparse categories or collect more data
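Fisher's exact test for a 2x2 table can be computed directly from the hypergeometric distribution; a sketch with a made-up small-sample table where a chi-square approximation would be unreliable:

```python
from math import comb

# Fisher's exact test (two-sided) for a 2x2 table:
#          outcome+  outcome-
# group1      a         b
# group2      c         d
a, b, c, d = 1, 9, 7, 3   # made-up table with small expected counts

def fisher_two_sided(a, b, c, d):
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def p_table(x):  # hypergeometric probability of a table with a = x
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = p_table(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    # Two-sided p: sum over tables as or less probable than observed.
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs + 1e-12)

print(f"Fisher exact p = {fisher_two_sided(a, b, c, d):.4f}")  # 0.0198
```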
Problem: Using independent samples test for paired data (or vice versa).
Why wrong:
- Paired designs violate the independence assumption of the independent-samples test
- Ignoring the pairing discards the correlation between observations, usually costing power
Solution: Match the test to the design (paired test for paired data, independent test for independent groups).
Misconception: "95% chance the true value is in this interval."
Reality: The true value either is or isn't in this specific interval. If we repeated the study many times, 95% of resulting intervals would contain the true value.
Better interpretation: "We're 95% confident this interval contains the true value."
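The repeated-sampling interpretation can be checked by simulation; known sigma and a z-interval are simplifying assumptions here:

```python
import math
import random

# Simulating the frequentist meaning of a 95% CI: across repeated
# samples, about 95% of the intervals cover the fixed true mean.
# Any single interval either covers it or it doesn't.
random.seed(42)
TRUE_MU, SIGMA, N, REPS = 50.0, 10.0, 100, 2000
z = 1.96  # normal critical value; sigma treated as known for simplicity

covered = 0
for _ in range(REPS):
    sample = [random.gauss(TRUE_MU, SIGMA) for _ in range(N)]
    m = sum(sample) / N
    half = z * SIGMA / math.sqrt(N)
    if m - half <= TRUE_MU <= m + half:
        covered += 1

coverage = covered / REPS
print(f"coverage = {coverage:.3f}")  # close to 0.95
```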
Problem: Assuming overlapping confidence intervals mean no significant difference.
Reality: Overlapping CIs are less stringent than difference tests. Two CIs can overlap while the difference between groups is significant.
Guideline: Whether one group's point estimate falls inside the other group's interval is more informative than whether the intervals themselves overlap.
Problem: Focusing only on whether CI includes zero, not precision.
Why important: Wide CIs indicate high uncertainty. "Significant" effects with huge CIs are less convincing.
Consider: Both significance and precision.
Problem: Making Bayesian statements from frequentist analyses.
Examples:
- "There is a 95% probability the true effect lies in this interval"
- "The probability that the null hypothesis is true is .03"
Solution: Use frequentist language for frequentist results, or run an actual Bayesian analysis if probability statements about hypotheses are wanted.
Problem: Treating all hypotheses as equally likely initially.
Reality: Extraordinary claims need extraordinary evidence. Prior plausibility matters.
Consider:
- The prior plausibility of the hypothesis and its proposed mechanism
- Base rates of true effects in the field
- Bayesian analyses with explicitly stated priors
Problem: Splitting continuous variables at arbitrary cutoffs.
Consequences:
- Loss of statistical power and information
- Individuals just above and below the cutoff are treated as categorically different
- Results can depend heavily on the arbitrary cutoff chosen
Exceptions: Clinically meaningful cutoffs with strong justification.
Better: Keep continuous or use multiple categories.
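A quick simulation of the information loss from a median split; the linear data-generating model and its parameters are assumptions chosen for illustration:

```python
import math
import random

# Median-splitting a continuous predictor attenuates its correlation
# with the outcome. (Simulated linear relationship, made-up parameters.)
random.seed(0)
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
y = [v + random.gauss(0, 1) for v in x]          # true linear relation

med = sorted(x)[n // 2]
x_split = [1.0 if v > med else 0.0 for v in x]   # median split

def pearson(xs, ys):
    m1, m2 = sum(xs) / n, sum(ys) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(xs, ys))
    s1 = math.sqrt(sum((a - m1) ** 2 for a in xs))
    s2 = math.sqrt(sum((b - m2) ** 2 for b in ys))
    return cov / (s1 * s2)

r_cont = pearson(x, y)
r_split = pearson(x_split, y)
print(f"continuous r   = {r_cont:.2f}")   # near 1/sqrt(2) ~ 0.71
print(f"median-split r = {r_split:.2f}")  # noticeably attenuated
```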
Problem: Testing many transformations until finding significance.
Why problematic: Inflates the Type I error rate; it is a form of p-hacking.
Better approach:
- Pre-specify transformations based on theory or prior data
- Report results for all transformations examined
Problem: Automatically deleting all cases with any missing data.
Consequences:
- Reduced sample size and power
- Biased estimates unless data are missing completely at random
Better approaches:
- Multiple imputation or maximum-likelihood methods
- Examine and report why data are missing
Problem: Not considering why data are missing.
Types:
- MCAR: missing completely at random (missingness unrelated to any variable)
- MAR: missing at random (missingness explained by observed variables)
- MNAR: missing not at random (missingness depends on the unobserved value itself)
Solution: Analyze patterns, use appropriate methods, consider sensitivity analyses.
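A toy MNAR simulation showing how complete-case analysis biases an estimate; all parameters below are made up:

```python
import random

# MNAR sketch: high values are more likely to be missing, so dropping
# incomplete cases biases the estimated mean downward.
random.seed(7)
true_values = [random.gauss(100, 15) for _ in range(10000)]

# Each value goes missing with probability rising in the value itself.
observed = [v for v in true_values
            if random.random() > (v - 70) / 100]

true_mean = sum(true_values) / len(true_values)
cc_mean = sum(observed) / len(observed)   # complete-case estimate
print(f"true mean          = {true_mean:.1f}")
print(f"complete-case mean = {cc_mean:.1f}")  # systematically too low
```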
Problem: Only reporting significant results or favorable analyses.
Consequences:
- A biased literature with inflated effect sizes
- Failed replications and wasted research effort
Solutions:
- Preregister analyses; use registered reports
- Report all planned outcomes and analyses, significant or not
Problem: Reporting p-values inconsistently, e.g., stating "p = .049" exactly while p = .051 becomes "p > .05" or "n.s."
Why problematic: Obscures how close values are to the threshold and makes p-hacking harder to detect.
Better: Always report exact p-values.
Problem: Not making data available for verification or reanalysis.
Consequences:
- Errors and questionable analyses cannot be detected
- Findings cannot be verified, reanalyzed, or reused
Best practice: Share data unless privacy concerns prohibit.
Problem: Testing model on same data used to build it.
Consequence: Overly optimistic performance estimates.
Solutions:
- Hold out a test set that is never used during model fitting
- Use cross-validation to estimate out-of-sample performance
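A minimal illustration of why evaluation needs held-out data: a "model" that memorizes training labels looks perfect in-sample and performs at chance out-of-sample. The labels here are deliberately pure noise:

```python
import random

# A lookup-table "model" that memorizes its training labels. Since the
# labels are random noise, any honest evaluation should be ~50%, yet
# in-sample accuracy is perfect.
random.seed(3)
X = [random.random() for _ in range(200)]
y = [random.choice([0, 1]) for _ in range(200)]   # labels = pure noise

train_X, train_y = X[:100], y[:100]
test_X, test_y = X[100:], y[100:]

memory = dict(zip(train_X, train_y))              # "model" = memorization

def predict(x):
    return memory.get(x, 0)                       # default guess: 0

train_acc = sum(predict(a) == t for a, t in zip(train_X, train_y)) / 100
test_acc = sum(predict(a) == t for a, t in zip(test_X, test_y)) / 100
print(f"train accuracy = {train_acc:.2f}")  # 1.00: looks great
print(f"test accuracy  = {test_acc:.2f}")   # near 0.50: chance
```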
Problem: Information from test set leaking into training.
Examples:
- Scaling or normalizing features using statistics computed on the full dataset
- Selecting features using all the data before splitting
- Imputing missing values using information from the test set
Consequence: Inflated performance metrics.
Prevention: All preprocessing decisions made using only training data.
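A sketch of leakage-free standardization, where the scaling parameters come from the training split only; names and numbers are illustrative:

```python
import math
import random

# Leakage-free preprocessing: estimate the mean/sd on the training
# split only, then apply those same parameters to the test split.
# Estimating them on ALL the data would leak test-set information.
random.seed(5)
data = [random.gauss(20, 4) for _ in range(100)]
train, test = data[:80], data[80:]

mu = sum(train) / len(train)                       # fit on train only
sd = math.sqrt(sum((v - mu) ** 2 for v in train) / len(train))

train_scaled = [(v - mu) / sd for v in train]
test_scaled = [(v - mu) / sd for v in test]        # reuse train params

train_mean = sum(train_scaled) / len(train_scaled)
test_mean = sum(test_scaled) / len(test_scaled)
print(f"train mean after scaling = {train_mean:.4f}")  # 0 by construction
# The test mean will NOT be exactly 0 after scaling -- and that is the
# correct, leak-free behavior.
print(f"test mean after scaling  = {test_mean:.4f}")
```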
Problem: Combining studies with different designs, populations, or measures.
Balance: Need homogeneity but also comprehensiveness.
Solutions:
- Define inclusion criteria before searching
- Use random-effects models and report heterogeneity (e.g., I²)
- Explore differences with subgroup or moderator analyses
Problem: Published studies overrepresent significant results.
Consequences: Overestimated effects in meta-analyses.
Detection:
- Funnel plot asymmetry and tests such as Egger's regression
- An implausible excess of just-significant results
Solutions:
- Search trial registries, grey literature, and unpublished studies
- Apply adjustment methods (e.g., trim-and-fill) as sensitivity analyses