Regression diagnostics


This example file shows how to use a few of the statsmodels regression diagnostic tests in a real-life context. You can learn about more tests and find more information about them on the Regression Diagnostics page.

Note that most of the tests described here only return a tuple of numbers, without any annotation. A full description of the outputs is always included in the docstring and in the online statsmodels documentation. For presentation purposes, we use the lzip(name, test) construct to pretty-print short descriptions in the examples below.
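As a minimal illustration of that construct (the values here are placeholders, not output from a fitted model):

```python
from statsmodels.compat import lzip

# lzip is a zip that returns a list instead of an iterator,
# which is convenient for display in a notebook
name = ["Chi^2", "Two-tail probability"]
test = (4.2, 0.12)  # placeholder values for illustration
lzip(name, test)  # [('Chi^2', 4.2), ('Two-tail probability', 0.12)]
```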

Estimate a regression model

```python
%matplotlib inline
```

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.stats.api as sms
from statsmodels.compat import lzip

# Load data
url = "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/HistData/Guerry.csv"
dat = pd.read_csv(url)

# Fit regression model (using the natural log of one of the regressors)
results = smf.ols("Lottery ~ Literacy + np.log(Pop1831)", data=dat).fit()

# Inspect the results
print(results.summary())
```

Normality of the residuals

Omnibus test:

```python
name = ["Chi^2", "Two-tail probability"]
test = sms.omni_normtest(results.resid)
lzip(name, test)
```

Jarque-Bera test:

Kurtosis below is the sample kurtosis, not the excess kurtosis. A sample from the normal distribution has kurtosis equal to 3.

```python
name = ["Jarque-Bera test", "Chi^2 two-tail prob.", "Skew", "Kurtosis"]
test = sms.jarque_bera(results.resid)
lzip(name, test)
```
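To see the difference between the two kurtosis definitions, here is a quick check on synthetic normal draws (the seed and sample size are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)

# Pearson ("sample") kurtosis: about 3 for normally distributed data
sample_kurt = stats.kurtosis(x, fisher=False)

# Fisher ("excess") kurtosis: about 0 for normally distributed data
excess_kurt = stats.kurtosis(x)
```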

Autocorrelation

Durbin-Watson test:

The Durbin-Watson statistic always lies between 0 and 4; the closer it is to 2, the weaker the evidence of autocorrelation in the sample.

```python
name = ["Durbin-Watson statistic"]
test = [sms.durbin_watson(results.resid)]
lzip(name, test)
```
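A sketch of this behavior on synthetic data (the series length and AR coefficient are arbitrary choices):

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
e = rng.standard_normal(500)

# White noise: the statistic should come out close to 2
dw_white = durbin_watson(e)

# AR(1) series with strong positive autocorrelation: well below 2
ar = np.empty_like(e)
ar[0] = e[0]
for t in range(1, len(e)):
    ar[t] = 0.8 * ar[t - 1] + e[t]
dw_ar = durbin_watson(ar)
```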

Breusch–Godfrey test for serial correlation:

```python
name = [
    "Breusch-Godfrey Lagrange multiplier test statistic",
    "p-value",
    "f-value",
    "f p-value",
]
test = sms.acorr_breusch_godfrey(results)
lzip(name, test)
```

Multicollinearity

Condition number

```python
name = ["Condition Number"]
test = [np.linalg.cond(results.model.exog)]
lzip(name, test)
```
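By default, np.linalg.cond reports the 2-norm condition number, i.e. the ratio of the largest to the smallest singular value of the design matrix. A small synthetic check (the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack(
    [np.ones(50), rng.standard_normal(50), rng.standard_normal(50)]
)

# Condition number computed directly from the singular values
s = np.linalg.svd(X, compute_uv=False)
cond_manual = s.max() / s.min()

# Same quantity via np.linalg.cond
cond_np = np.linalg.cond(X)
```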

Influence tests

Once created, an object of class OLSInfluence holds attributes and methods that allow users to assess the influence of each observation. For example, we can compute and extract the first few rows of DFbetas by:

```python
from statsmodels.stats.outliers_influence import OLSInfluence

test_class = OLSInfluence(results)
test_class.dfbetas[:5, :]
```

Explore other options by typing dir(test_class).

Useful information on leverage can also be plotted:

```python
from statsmodels.graphics.regressionplots import plot_leverage_resid2

fig, ax = plt.subplots(figsize=(8, 6))
fig = plot_leverage_resid2(results, ax=ax)
```

Other plotting options can be found on the Graphics page.

Heteroskedasticity tests

Breusch-Pagan test:

```python
name = ["Lagrange multiplier statistic", "p-value", "f-value", "f p-value"]
test = sms.het_breuschpagan(results.resid, results.model.exog)
lzip(name, test)
```

Goldfeld-Quandt test:

```python
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(results.resid, results.model.exog)
lzip(name, test)
```

Linearity

Harvey-Collier multiplier test for the null hypothesis that the linear specification is correct:

```python
name = ["t value", "p value"]
test = sms.linear_harvey_collier(results)
lzip(name, test)
```