scientific-skills/statistical-analysis/references/bayesian_statistics.md
This document provides guidance on conducting and interpreting Bayesian statistical analyses, which offer an alternative framework to frequentist (classical) statistics.
| Aspect | Frequentist | Bayesian |
|---|---|---|
| Probability interpretation | Long-run frequency of events | Degree of belief/uncertainty |
| Parameters | Fixed but unknown | Random variables with distributions |
| Inference | Based on sampling distributions | Based on posterior distributions |
| Primary output | p-values, confidence intervals | Posterior probabilities, credible intervals |
| Prior information | Not formally incorporated | Explicitly incorporated via priors |
| Hypothesis testing | Reject/fail to reject null | Probability of hypotheses given data |
| Sample size | Often requires minimum | Can work with any sample size |
| Interpretation | Indirect (probability of data given H₀) | Direct (probability of hypothesis given data) |
Frequentist: "If the null hypothesis is true, what is the probability of observing data this extreme or more extreme?"
Bayesian: "Given the observed data, what is the probability that the hypothesis is true?"
The Bayesian question is more intuitive and directly addresses what researchers want to know.
Formula:
P(θ|D) = P(D|θ) × P(θ) / P(D)
In words:
Posterior = Likelihood × Prior / Evidence
Where:
- P(θ|D): the posterior, the probability of parameter θ given the observed data D
- P(D|θ): the likelihood, the probability of the data given θ
- P(θ): the prior, the belief about θ before seeing the data
- P(D): the evidence (marginal likelihood), a normalizing constant
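As a worked sketch (illustrative numbers, not from the original text), a conjugate Beta prior on a success probability updates in closed form after binomial data:

```python
from scipy import stats

# Prior belief about a success probability: Beta(2, 2), mildly centered on 0.5
a_prior, b_prior = 2, 2

# Observed data: 7 successes in 10 trials
successes, trials = 7, 10

# Conjugate update: posterior is Beta(a + successes, b + failures)
posterior = stats.beta(a_prior + successes, b_prior + (trials - successes))

print(f"Posterior mean: {posterior.mean():.3f}")                 # 0.643
print(f"95% credible interval: {posterior.ppf([0.025, 0.975])}")
```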
When to use: When you have substantial prior knowledge from previous studies, meta-analyses, pilot data, or expert elicitation
Example: Meta-analysis shows effect size d ≈ 0.40, SD = 0.15
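A minimal sketch of encoding that meta-analytic estimate as a PyMC prior (the model context and the name `effect` are illustrative):

```python
import pymc as pm

with pm.Model():
    # Informative prior centered on the meta-analytic effect size
    effect = pm.Normal('effect', mu=0.40, sigma=0.15)
```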
Advantages: sharper posteriors and greater sensitivity, especially with small samples
Disadvantages: conclusions depend on the prior, which must be justified, reported, and checked with a sensitivity analysis
When to use: Default choice for most applications
Characteristics: centered on a neutral value (often zero) and wide enough to cover all plausible effects while ruling out absurd ones
Example priors: Normal(0, 1) for a standardized effect; Half-Normal(1) for a standard deviation
Advantages: regularizes estimates and stabilizes sampling without strongly driving the conclusions
When to use: When attempting to be "objective"
Example: Uniform(-∞, ∞), an improper flat prior over the whole real line
⚠️ Caution: flat priors are not truly uninformative (they become informative under reparameterization), can yield improper posteriors, and often cause sampling problems
Better alternative: Use weakly informative priors
Always conduct: Test how results change with different priors
Process: refit the model under several plausible priors and compare the resulting posteriors
Reporting: state which priors were tested and whether the substantive conclusions changed
Python example:
```python
import pymc as pm

# Candidate priors for the same parameter: (name, (mu, sigma))
priors = [
    ('weakly_informative', (0.0, 1.0)),
    ('diffuse', (0.0, 10.0)),
    ('informative', (0.5, 0.3)),
]

results = {}
for name, (mu, sigma) in priors:
    with pm.Model():
        effect = pm.Normal('effect', mu=mu, sigma=sigma)
        # ... rest of model (likelihood on the observed data)
        trace = pm.sample()
    results[name] = trace
```
What it is: Ratio of evidence for two competing hypotheses
Formula:
BF₁₀ = P(D|H₁) / P(D|H₀)
Interpretation:
| BF₁₀ | Evidence |
|---|---|
| >100 | Decisive for H₁ |
| 30-100 | Very strong for H₁ |
| 10-30 | Strong for H₁ |
| 3-10 | Moderate for H₁ |
| 1-3 | Anecdotal for H₁ |
| 1 | No evidence |
| 1/3-1 | Anecdotal for H₀ |
| 1/10-1/3 | Moderate for H₀ |
| 1/30-1/10 | Strong for H₀ |
| 1/100-1/30 | Very strong for H₀ |
| <1/100 | Decisive for H₀ |
Advantages over p-values: quantifies evidence for H₀ as well as for H₁, compares the hypotheses directly, and can be monitored as data accumulate
Python calculation:
```python
import pingouin as pg

# pingouin implements the JZS (Jeffreys-Zellner-Siow) Bayes factor for
# t-tests and correlations; for a wider range of designs, use the R
# package BayesFactor or the JASP software.
t, n1, n2 = 2.5, 30, 30  # example t-statistic and group sizes
bf10 = pg.bayesfactor_ttest(t, nx=n1, ny=n2, r=0.707)  # r: Cauchy prior scale
print(f"BF10 = {bf10:.2f}")
```
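For reference, the JZS Bayes factor can also be computed directly by numerical integration (following Rouder et al., 2009); treat this sketch as illustrative rather than a validated implementation:

```python
import numpy as np
from scipy import integrate

def bf10_from_t(t, n1, n2, r_scale=0.707):
    """Approximate JZS Bayes factor (BF10) from a two-sample t-statistic."""
    nu = n1 + n2 - 2                # degrees of freedom
    n_eff = n1 * n2 / (n1 + n2)     # effective sample size

    # Marginal likelihood under H0 (standardized effect fixed at zero)
    m0 = (1 + t**2 / nu) ** (-(nu + 1) / 2)

    # Under H1 the Cauchy prior is expressed as a scale mixture of
    # normals, and the mixing variable g is integrated out numerically
    def integrand(g):
        c = 1 + n_eff * g * r_scale**2
        return (c**-0.5
                * (1 + t**2 / (c * nu)) ** (-(nu + 1) / 2)
                * (2 * np.pi)**-0.5 * g**-1.5 * np.exp(-1 / (2 * g)))

    m1, _ = integrate.quad(integrand, 0, np.inf)
    return m1 / m0
```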
Purpose: Define a range of effect sizes small enough to count as practically negligible
Process: specify the ROPE before the analysis, fit the model, then check how much of the posterior falls inside it
If ≥95% of the posterior falls in the ROPE: accept practical equivalence
If ≥95% of the posterior falls outside the ROPE: reject equivalence; otherwise, withhold judgment
Advantage: Directly tests for practical significance
Python example:
```python
import numpy as np

# posterior_samples: 1-D array of posterior draws for the parameter

# Define ROPE: effects within ±0.1 are treated as negligible
rope_lower, rope_upper = -0.1, 0.1

# Proportion of posterior samples falling inside the ROPE
in_rope = np.mean((posterior_samples > rope_lower) &
                  (posterior_samples < rope_upper))
print(f"{in_rope*100:.1f}% of posterior in ROPE")
```
What it is: Interval containing parameter with X% probability
95% Credible Interval interpretation:
"There is a 95% probability that the true parameter lies in this interval."
This is how confidence intervals are commonly misread; the frequentist interval does not actually support this interpretation.
Types:
- Equal-tailed interval (ETI): cuts off equal probability (e.g., 2.5%) in each tail
- Highest density interval (HDI): the narrowest interval containing the stated probability mass
Python calculation:
```python
import numpy as np
import arviz as az

# Equal-tailed interval
eti = np.percentile(posterior_samples, [2.5, 97.5])

# Highest density interval
hdi = az.hdi(posterior_samples, hdi_prob=0.95)
```
Interpreting posterior distributions:
Central tendency: report the posterior mean, median, or mode as the point estimate
Uncertainty: report the posterior SD and a credible interval; wider intervals mean more uncertainty
Shape: check for skewness or multimodality, which a single point estimate would hide
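A minimal sketch of these summaries from a 1-D array of posterior draws (`posterior_samples`, as in the earlier examples):

```python
import numpy as np
import arviz as az

print(f"Mean:   {np.mean(posterior_samples):.3f}")
print(f"Median: {np.median(posterior_samples):.3f}")
print(f"SD:     {np.std(posterior_samples):.3f}")
print(f"95% HDI: {az.hdi(posterior_samples, hdi_prob=0.95)}")
```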
Visualization:
```python
import matplotlib.pyplot as plt
import arviz as az

# Posterior plot with HDI
az.plot_posterior(trace, hdi_prob=0.95)

# Trace plot (check convergence)
az.plot_trace(trace)

# Forest plot (multiple parameters)
az.plot_forest(trace)

plt.show()
```
Purpose: Compare two groups (Bayesian alternative to t-test)
Outputs: posterior distributions for each group mean, the mean difference with a credible interval, and the probability that one group's mean exceeds the other's
Python implementation:
```python
import numpy as np
import pymc as pm
import arviz as az

# Bayesian independent-samples comparison (group1, group2: 1-D data arrays)
with pm.Model() as model:
    # Priors for group means
    mu1 = pm.Normal('mu1', mu=0, sigma=10)
    mu2 = pm.Normal('mu2', mu=0, sigma=10)

    # Prior for pooled standard deviation
    sigma = pm.HalfNormal('sigma', sigma=10)

    # Likelihood
    y1 = pm.Normal('y1', mu=mu1, sigma=sigma, observed=group1)
    y2 = pm.Normal('y2', mu=mu2, sigma=sigma, observed=group2)

    # Derived quantity: mean difference
    diff = pm.Deterministic('diff', mu1 - mu2)

    # Sample posterior
    trace = pm.sample(2000, tune=1000, return_inferencedata=True)

# Analyze results
print(az.summary(trace, var_names=['mu1', 'mu2', 'diff']))

# Probability that group1 > group2
prob_greater = np.mean(trace.posterior['diff'].values > 0)
print(f"P(μ₁ > μ₂) = {prob_greater:.3f}")

# Plot posterior of the difference with zero as reference
az.plot_posterior(trace, var_names=['diff'], ref_val=0)
```
Purpose: Compare three or more groups
Model:
```python
import pymc as pm

# data: outcome array; group_idx: integer group index per observation;
# n_groups: number of groups
with pm.Model() as anova_model:
    # Hyperpriors
    mu_global = pm.Normal('mu_global', mu=0, sigma=10)
    sigma_between = pm.HalfNormal('sigma_between', sigma=5)
    sigma_within = pm.HalfNormal('sigma_within', sigma=5)

    # Group means (hierarchical)
    group_means = pm.Normal('group_means',
                            mu=mu_global,
                            sigma=sigma_between,
                            shape=n_groups)

    # Likelihood
    y = pm.Normal('y',
                  mu=group_means[group_idx],
                  sigma=sigma_within,
                  observed=data)

    trace = pm.sample(2000, tune=1000, return_inferencedata=True)

# Posterior contrast between the first two groups
contrast_1_2 = (trace.posterior['group_means'][:, :, 0]
                - trace.posterior['group_means'][:, :, 1])
```
Purpose: Estimate correlation between two variables
Advantage: Provides distribution of correlation values
Python implementation:
```python
import numpy as np
import pymc as pm
import arviz as az

# x, y: 1-D arrays, standardized (mean 0, SD 1) so mu=[0, 0] is appropriate
with pm.Model() as corr_model:
    # Prior on correlation
    rho = pm.Uniform('rho', lower=-1, upper=1)

    # Build the 2x2 correlation matrix
    cov_matrix = pm.math.stack([[1.0, rho],
                                [rho, 1.0]])

    # Likelihood (bivariate normal)
    obs = pm.MvNormal('obs',
                      mu=[0, 0],
                      cov=cov_matrix,
                      observed=np.column_stack([x, y]))

    trace = pm.sample(2000, tune=1000, return_inferencedata=True)

# Summarize correlation
print(az.summary(trace, var_names=['rho']))

# Probability that the correlation is positive
prob_positive = np.mean(trace.posterior['rho'].values > 0)
```
Purpose: Model relationship between predictors and outcome
Advantages: full posterior distributions for every coefficient, natural regularization through priors, and predictions that carry their uncertainty
Python implementation:
```python
import pymc as pm
import arviz as az

# X: (n_obs, n_predictors) design matrix; y: outcome vector
with pm.Model() as regression_model:
    # Mutable data container so predictors can be swapped for prediction
    X_data = pm.Data('X', X)

    # Priors for coefficients
    alpha = pm.Normal('alpha', mu=0, sigma=10)  # Intercept
    beta = pm.Normal('beta', mu=0, sigma=10, shape=n_predictors)
    sigma = pm.HalfNormal('sigma', sigma=10)

    # Expected value
    mu = alpha + pm.math.dot(X_data, beta)

    # Likelihood
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)

    trace = pm.sample(2000, tune=1000, return_inferencedata=True)

# Posterior predictive checks
with regression_model:
    ppc = pm.sample_posterior_predictive(trace, extend_inferencedata=True)
az.plot_ppc(trace)

# Predictions with uncertainty (X_new must match y in length unless the
# likelihood's shape is also updated)
with regression_model:
    pm.set_data({'X': X_new})
    posterior_pred = pm.sample_posterior_predictive(trace)
```
When to use: nested or grouped data, such as students within schools, patients within hospitals, or repeated measures within subjects
Key concept: Partial pooling, in which group-level estimates are shrunk toward the overall mean by an amount learned from the data; this stabilizes estimates for small groups
Example: varying intercepts:
```python
import pymc as pm

# y: outcome array; group_idx: integer group index per observation
with pm.Model() as hierarchical_model:
    # Hyperpriors
    mu_global = pm.Normal('mu_global', mu=0, sigma=10)
    sigma_between = pm.HalfNormal('sigma_between', sigma=5)
    sigma_within = pm.HalfNormal('sigma_within', sigma=5)

    # Group-level intercepts
    alpha = pm.Normal('alpha',
                      mu=mu_global,
                      sigma=sigma_between,
                      shape=n_groups)

    # Likelihood
    y_obs = pm.Normal('y_obs',
                      mu=alpha[group_idx],
                      sigma=sigma_within,
                      observed=y)

    trace = pm.sample()
```
WAIC (Widely Applicable Information Criterion): estimates out-of-sample predictive accuracy from the full posterior; reported on the elpd scale, where higher is better
LOO (Leave-One-Out Cross-Validation): approximates leave-one-out predictive accuracy via Pareto-smoothed importance sampling; generally preferred over WAIC and flags unreliable observations via the Pareto k diagnostic
Python calculation:
```python
import arviz as az

# Requires the pointwise log-likelihood in the InferenceData
# (PyMC: pm.sample(..., idata_kwargs={'log_likelihood': True}))
waic = az.waic(trace)
loo = az.loo(trace)
print(f"WAIC: {waic.elpd_waic:.2f}")
print(f"LOO: {loo.elpd_loo:.2f}")

# Compare multiple models (higher elpd = better predictions)
comparison = az.compare({
    'model1': trace1,
    'model2': trace2,
    'model3': trace3,
})
print(comparison)
```
R-hat (Gelman-Rubin statistic): compares between-chain and within-chain variance; values should be close to 1.00 (rule of thumb: < 1.01)
Effective Sample Size (ESS): the number of effectively independent draws after accounting for autocorrelation; several hundred per parameter is a common minimum
Trace plots: chains should overlap and look like stationary noise, with no trends, drifts, or stuck regions
Python checking:
```python
import arviz as az

# Automatic summary with R-hat and ESS columns
print(az.summary(trace, var_names=['parameter']))

# Visual diagnostics
az.plot_trace(trace)
az.plot_rank(trace)  # Rank plots: roughly uniform ranks indicate good mixing
```
Purpose: check whether the model generates data that resemble the observed data
Process: simulate replicated datasets from the posterior predictive distribution, then compare them to the observed data visually and with summary statistics
Python implementation:
```python
import numpy as np
import pymc as pm
import arviz as az

with model:
    # Attach posterior predictive draws to the existing trace
    pm.sample_posterior_predictive(trace, extend_inferencedata=True)

# Visual check
az.plot_ppc(trace, num_pp_samples=100)

# Quantitative check: posterior predictive p-value for the mean
y_rep = trace.posterior_predictive['y_obs'].values  # (chain, draw, obs)
pred_means = y_rep.reshape(-1, y_rep.shape[-1]).mean(axis=1)
obs_mean = np.mean(observed_data)
p_value = np.mean(pred_means >= obs_mean)  # Bayesian (posterior predictive) p-value
```
"A Bayesian independent samples t-test was conducted to compare groups A and B. Weakly informative priors were used: Normal(0, 1) for the mean difference and Half-Cauchy(0, 1) for the pooled standard deviation. The posterior distribution of the mean difference had a mean of 5.2 (95% CI [2.3, 8.1]), indicating that Group A scored higher than Group B. The Bayes Factor BF₁₀ = 23.5 provided strong evidence for a difference between groups, and there was a 99.7% probability that Group A's mean exceeded Group B's mean."
"A Bayesian linear regression was fitted with weakly informative priors (Normal(0, 10) for coefficients, Half-Cauchy(0, 5) for residual SD). The model explained substantial variance (R² = 0.47, 95% CI [0.38, 0.55]). Study hours (β = 0.52, 95% CI [0.38, 0.66]) and prior GPA (β = 0.31, 95% CI [0.17, 0.45]) were credible predictors (95% CIs excluded zero). Posterior predictive checks showed good model fit. Convergence diagnostics were satisfactory (all R-hat < 1.01, ESS > 1000)."
Use Bayesian when: you have meaningful prior information, small samples, hierarchical structure, a need for direct probability statements about hypotheses, or a desire to quantify evidence for the null
Frequentist may be sufficient when: samples are large, the design is standard, no usable prior information exists, and the audience expects p-values and confidence intervals