docs/source/diagnostic.rst
:orphan:
.. _diagnostics:
Regression Diagnostics and Specification Tests
==============================================

Introduction
------------

In many cases of statistical analysis, we are not sure whether our statistical model is correctly specified. For example, when using OLS, linearity and homoscedasticity are assumed, and some test statistics additionally assume that the errors are normally distributed or that we have a large sample. Since our results depend on these statistical assumptions, the results are only correct if our assumptions hold (at least approximately).

One solution to the problem of uncertainty about the correct specification is to use robust methods, for example robust regression or robust covariance (sandwich) estimators. The second approach is to test whether our sample is consistent with these assumptions.

The following briefly summarizes specification and diagnostic tests for linear regression.

Heteroscedasticity Tests
------------------------

For these tests the null hypothesis is that all observations have the same error variance, i.e. errors are homoscedastic. The tests differ in which kind of heteroscedasticity is considered as the alternative hypothesis. They also vary in their power against different types of heteroscedasticity.

:py:func:`het_breuschpagan <statsmodels.stats.diagnostic.het_breuschpagan>`
    Lagrange Multiplier Heteroscedasticity Test by Breusch-Pagan

:py:func:`het_white <statsmodels.stats.diagnostic.het_white>`
    Lagrange Multiplier Heteroscedasticity Test by White

:py:func:`het_goldfeldquandt <statsmodels.stats.diagnostic.het_goldfeldquandt>`
    test whether variance is the same in two subsamples
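
The snippet below is a minimal sketch of how these tests are called; the simulated data, seed, and variable names are illustrative, not part of the API. ``het_breuschpagan`` takes the residuals and the exogenous variables of the auxiliary regression and returns the LM statistic and its p-value, plus an F-statistic variant with its p-value::

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan

    # illustrative data: error variance grows with x
    rng = np.random.default_rng(12345)
    x = rng.uniform(0, 10, size=200)
    y = 1.0 + 2.0 * x + rng.normal(scale=0.5 * x)
    exog = sm.add_constant(x)
    results = sm.OLS(y, exog).fit()

    lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, exog)
    print(lm_pvalue)  # small p-value -> reject homoscedasticity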

Autocorrelation Tests
---------------------

These tests check whether the regression residuals are autocorrelated; the null hypothesis is no autocorrelation. They assume that observations are ordered by time.

:py:func:`durbin_watson <statsmodels.stats.diagnostic.durbin_watson>`
    Durbin-Watson test for no autocorrelation of residuals; printed with ``summary()``

:py:func:`acorr_ljungbox <statsmodels.stats.diagnostic.acorr_ljungbox>`
    Ljung-Box test for no autocorrelation of residuals

:py:func:`acorr_breusch_godfrey <statsmodels.stats.diagnostic.acorr_breusch_godfrey>`
    Breusch-Godfrey Lagrange Multiplier test for no autocorrelation of residuals

missing
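
As an illustrative sketch (the AR(1) simulation is made up for this example), ``acorr_ljungbox`` can be applied to residuals; in recent statsmodels versions it returns a DataFrame with the test statistic and p-value per lag::

    import numpy as np
    from statsmodels.stats.diagnostic import acorr_ljungbox

    # illustrative AR(1) series; deviations from the mean are autocorrelated
    rng = np.random.default_rng(0)
    e = rng.normal(size=500)
    y = np.empty(500)
    y[0] = e[0]
    for t in range(1, 500):
        y[t] = 0.6 * y[t - 1] + e[t]

    resid = y - y.mean()
    print(acorr_ljungbox(resid, lags=[5, 10]))  # lb_stat and lb_pvalue per lag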

Non-Linearity Tests
-------------------

:py:func:`linear_harvey_collier <statsmodels.stats.diagnostic.linear_harvey_collier>`
    test for the null hypothesis that the linear specification is correct

:py:func:`acorr_linear_rainbow <statsmodels.stats.diagnostic.acorr_linear_rainbow>`
    Rainbow test for the null hypothesis that the linear specification is correct

:py:func:`acorr_linear_lm <statsmodels.stats.diagnostic.acorr_linear_lm>`
    Lagrange Multiplier test for the null hypothesis that the linear
    specification is correct; tests against specific functional alternatives

:py:func:`spec_white <statsmodels.stats.diagnostic.spec_white>`
    White's two-moment specification test, with the null hypothesis that the
    model is homoscedastic and correctly specified
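
A minimal sketch (the quadratic data are illustrative): ``linear_harvey_collier`` takes a fitted OLS results instance and returns a t-statistic and p-value; a small p-value suggests the linear specification is inadequate::

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import linear_harvey_collier

    # illustrative data: the true relation is quadratic, not linear
    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(0, 5, size=100))
    y = 1.0 + x**2 + rng.normal(scale=0.5, size=100)
    res = sm.OLS(y, sm.add_constant(x)).fit()

    tstat, pvalue = linear_harvey_collier(res)
    print(pvalue)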

Tests for Structural Change, Parameter Stability
------------------------------------------------

Test whether all or some regression coefficients are constant over the entire data sample.

Known Change Point
^^^^^^^^^^^^^^^^^^
OneWayLS :
missing

Unknown Change Point
^^^^^^^^^^^^^^^^^^^^
:py:func:`breaks_cusumolsresid <statsmodels.stats.diagnostic.breaks_cusumolsresid>`
    cusum test for parameter stability based on OLS residuals

:py:func:`breaks_hansen <statsmodels.stats.diagnostic.breaks_hansen>`
    test for model stability, breaks in parameters for OLS; Hansen 1992

:py:func:`recursive_olsresiduals <statsmodels.stats.diagnostic.recursive_olsresiduals>`
    Calculate recursive OLS with residuals and the cusum test statistic. This is
    currently mainly a helper function for recursive-residual-based tests.
    However, since it uses recursive updating and does not estimate separate
    problems, it should also be quite efficient as an expanding OLS function.

missing
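
As an illustrative sketch (the simulated level shift is made up for this example), ``breaks_cusumolsresid`` takes the OLS residuals and returns the test statistic, its p-value, and critical values::

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import breaks_cusumolsresid

    # illustrative data with a level shift halfway through the sample
    rng = np.random.default_rng(2)
    x = rng.normal(size=200)
    y = 1.0 + 0.5 * x + rng.normal(size=200)
    y[100:] += 2.0
    res = sm.OLS(y, sm.add_constant(x)).fit()

    sup_b, pval, crit = breaks_cusumolsresid(res.resid, ddof=res.df_model)
    print(pval)  # small p-value indicates parameter instability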

Multicollinearity Tests
-----------------------

conditionnum (statsmodels.stattools)

numpy.linalg.cond

Variance Inflation Factors
    This is currently together with the influence and outlier measures
    (with some links to other tests here: http://www.stata.com/help.cgi?vif)
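
A minimal sketch (the nearly collinear columns are constructed for the example): ``variance_inflation_factor`` in ``statsmodels.stats.outliers_influence`` takes the full design matrix and the index of one column::

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # illustrative design matrix with two nearly collinear regressors
    rng = np.random.default_rng(3)
    x1 = rng.normal(size=100)
    x2 = x1 + rng.normal(scale=0.1, size=100)
    exog = sm.add_constant(np.column_stack([x1, x2]))

    # column 0 is the constant; large VIFs flag multicollinearity
    for i in range(1, exog.shape[1]):
        print(i, variance_inflation_factor(exog, i))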

Normality and Distribution Tests
--------------------------------

:py:func:`jarque_bera <statsmodels.stats.tools.jarque_bera>`
    test for normal distribution of residuals; printed with ``summary()``

Normality tests in scipy stats
    need to find list again

:py:func:`omni_normtest <statsmodels.stats.tools.omni_normtest>`
    test for normal distribution of residuals; printed with ``summary()``

:py:func:`normal_ad <statsmodels.stats.diagnostic.normal_ad>`
    Anderson-Darling test for normality with estimated mean and variance

:py:func:`kstest_normal <statsmodels.stats.diagnostic.kstest_normal>`, :py:func:`lilliefors <statsmodels.stats.diagnostic.lilliefors>`
    Lilliefors test for normality; this is a Kolmogorov-Smirnov test for
    normality with estimated mean and variance. ``lilliefors`` is an alias for
    ``kstest_normal``

qqplot, scipy.stats.probplot
other goodness-of-fit tests for distributions in scipy.stats and enhancements
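
A minimal sketch of calling the normality tests on residuals (the heavy-tailed simulated errors are illustrative); both functions return a test statistic and a p-value::

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import kstest_normal, normal_ad

    # illustrative regression with heavy-tailed (t-distributed) errors
    rng = np.random.default_rng(4)
    x = rng.normal(size=150)
    y = 1.0 + 2.0 * x + rng.standard_t(df=3, size=150)
    res = sm.OLS(y, sm.add_constant(x)).fit()

    print(normal_ad(res.resid))      # (statistic, pvalue)
    print(kstest_normal(res.resid))  # (statistic, pvalue)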

Outlier and Influence Diagnostic Measures
-----------------------------------------

These measures try to identify observations that are outliers, with large residuals, or observations that have a large influence on the regression estimates. Robust regression, RLM, can be used both to estimate in an outlier-robust way and to identify outliers. The advantage of RLM is that the estimation results are not strongly influenced even if there are many outliers, while most of the other measures are better at identifying individual outliers and might not be able to identify groups of outliers.

:py:class:`RLM <statsmodels.robust.robust_linear_model.RLM>`
    example from ``example_rlm.py``::

        import statsmodels.api as sm

        # Example for using Huber's T norm with the default
        # median absolute deviation scaling
        data = sm.datasets.stackloss.load()
        data.exog = sm.add_constant(data.exog)
        huber_t = sm.RLM(data.endog, data.exog, M=sm.robust.norms.HuberT())
        hub_results = huber_t.fit()
        print(hub_results.weights)

    The weights give an idea of how much a particular observation is
    down-weighted according to the scaling asked for.

:py:class:`Influence <statsmodels.stats.outliers_influence.OLSInfluence>`
    Class in ``stats.outliers_influence``; most standard measures for outliers
    and influence are available as methods or attributes, given a fitted
    OLS model. This is mainly written for OLS; some, but not all, measures
    are also valid for other models.

    Some of these statistics can be calculated from an OLS results instance,
    others require that an OLS is estimated for each left-out observation.

    See also `Wikipedia <https://en.wikipedia.org/wiki/Cook%27s_distance>`_ on Cook's distance (with some other links).
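
A minimal sketch of using :py:class:`OLSInfluence <statsmodels.stats.outliers_influence.OLSInfluence>`, reusing the stackloss data from the RLM example above::

    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import OLSInfluence

    data = sm.datasets.stackloss.load()
    exog = sm.add_constant(data.exog)
    res = sm.OLS(data.endog, exog).fit()

    infl = OLSInfluence(res)
    cooks_d, cooks_pval = infl.cooks_distance   # Cook's distance per observation
    print(cooks_d)
    print(infl.resid_studentized_external)      # externally studentized residuals
    print(infl.summary_frame())                 # standard measures in one table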