Back to Statsmodels

Generalized Linear Models

examples/notebooks/glm.ipynb

0.15.0.dev04.3 KB
Original Source

Generalized Linear Models

python
%matplotlib inline
python
import numpy as np
import statsmodels.api as sm
from matplotlib import pyplot as plt
from scipy import stats

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

GLM: Binomial response data

Load Star98 data

In this example, we use the Star98 dataset which was taken with permission from Jeff Gill (2000) Generalized linear models: A unified approach. Codebook information can be obtained by typing:

python
print(sm.datasets.star98.NOTE)

Load the data and add a constant to the exogenous (independent) variables:

python
data = sm.datasets.star98.load()
data.exog = sm.add_constant(data.exog, prepend=False)

The dependent variable is N by 2 (Success: NABOVE, Failure: NBELOW):

python
print(data.endog.head())

The independent variables include all the other variables described above, as well as the interaction terms:

python
print(data.exog.head())

Fit and summary

python
glm_binom = sm.GLM(data.endog, data.exog, family=sm.families.Binomial())
res = glm_binom.fit()
print(res.summary())

Quantities of interest

python
print("Total number of trials:", data.endog.iloc[:, 0].sum())
print("Parameters: ", res.params)
print("T-values: ", res.tvalues)

First differences: We hold all explanatory variables constant at their means and manipulate the percentage of low income households to assess its impact on the response variables:

python
means = data.exog.mean(axis=0)
means25 = means.copy()
means25.iloc[0] = stats.scoreatpercentile(data.exog.iloc[:, 0], 25)
means75 = means.copy()
means75.iloc[0] = lowinc_75per = stats.scoreatpercentile(data.exog.iloc[:, 0], 75)
resp_25 = res.predict(means25)
resp_75 = res.predict(means75)
diff = resp_75 - resp_25

The interquartile first difference for the percentage of low income households in a school district is:

python
print("%2.4f%%" % (diff.iloc[0] * 100))

Plots

We extract information that will be used to draw some interesting plots:

python
nobs = res.nobs
y = data.endog.iloc[:, 0] / data.endog.sum(1)
yhat = res.mu

Plot yhat vs y:

python
from statsmodels.graphics.api import abline_plot
python
fig, ax = plt.subplots()
ax.scatter(yhat, y)
line_fit = sm.OLS(y, sm.add_constant(yhat, prepend=True)).fit()
abline_plot(model_results=line_fit, ax=ax)


ax.set_title("Model Fit Plot")
ax.set_ylabel("Observed values")
ax.set_xlabel("Fitted values");

Plot yhat vs. Pearson residuals:

python
fig, ax = plt.subplots()

ax.scatter(yhat, res.resid_pearson)
ax.hlines(0, 0, 1)
ax.set_xlim(0, 1)
ax.set_title("Residual Dependence Plot")
ax.set_ylabel("Pearson Residuals")
ax.set_xlabel("Fitted values")

Histogram of standardized deviance residuals:

python
from scipy import stats

fig, ax = plt.subplots()

resid = res.resid_deviance.copy()
resid_std = stats.zscore(resid)
ax.hist(resid_std, bins=25)
ax.set_title("Histogram of standardized deviance residuals");

QQ Plot of Deviance Residuals:

python
from statsmodels import graphics

graphics.gofplots.qqplot(resid, line="r")

GLM: Gamma for proportional count response

Load Scottish Parliament Voting data

In the example above, we printed the NOTE attribute to learn about the Star98 dataset. statsmodels datasets ships with other useful information. For example:

python
print(sm.datasets.scotland.DESCRLONG)

Load the data and add a constant to the exogenous variables:

python
data2 = sm.datasets.scotland.load()
data2.exog = sm.add_constant(data2.exog, prepend=False)
print(data2.exog.head())
print(data2.endog.head())

Model Fit and summary

python
glm_gamma = sm.GLM(
    data2.endog, data2.exog, family=sm.families.Gamma(sm.families.links.Log())
)
glm_results = glm_gamma.fit()
print(glm_results.summary())

Artificial data

python
nobs2 = 100
x = np.arange(nobs2)
np.random.seed(54321)
X = np.column_stack((x, x**2))
X = sm.add_constant(X, prepend=False)
lny = np.exp(-(0.03 * x + 0.0001 * x**2 - 1.0)) + 0.001 * np.random.rand(nobs2)

Fit and summary (artificial data)

python
gauss_log = sm.GLM(lny, X, family=sm.families.Gaussian(sm.families.links.Log()))
gauss_log_results = gauss_log.fit()
print(gauss_log_results.summary())