docs/source/large_data.rst
.. module:: statsmodels.base.distributed_estimation .. currentmodule:: statsmodels.base.distributed_estimation
Big data is something of a buzzword in the modern world. While statsmodels works well with small and moderately-sized data sets that can be loaded in memory--perhaps tens of thousands of observations--use cases exist with millions of observations or more. Depending your use case, statsmodels may or may not be a sufficient tool.
statsmodels and most of the software stack it is written on operates in memory. Resultantly, building models on larger data sets can be challenging or even impractical. With that said, there are 2 general strategies for building models on larger data sets with statsmodels.
If your system is capable of loading all the data, but the analysis you are attempting to perform is slow, you might be able to build models on horizontal slices of the data and then aggregate the individual models once fit.
A current limitation of this approach is that it generally does not support
patsy <https://patsy.readthedocs.io/en/latest/>_ so constructing your
design matrix (known as exog) in statsmodels, is a little challenging.
A detailed example is available
here <examples/notebooks/generated/distributed_estimation.ipynb>_.
.. autosummary:: :toctree: generated/
DistributedModel DistributedResults
If your entire data set is too large to store in memory, you might try storing
it in a columnar container like Apache Parquet <https://parquet.apache.org/>_
or bcolz <http://bcolz.blosc.org/en/latest/>_. Using the patsy formula
interface, statsmodels will use the __getitem__ function (i.e. data['Item'])
to pull only the specified columns.
.. code-block:: python
import pyarrow as pa
import pyarrow.parquet as pq
import statsmodels.formula.api as smf
class DataSet(dict):
def __init__(self, path):
self.parquet = pq.ParquetFile(path)
def __getitem__(self, key):
try:
return self.parquet.read([key]).to_pandas()[key]
except:
raise KeyError
LargeData = DataSet('LargeData.parquet')
res = smf.ols('Profit ~ Sugar + Power + Women', data=LargeData).fit()
Additionally, you can add code to this example DataSet object to return only
a subset of the rows until you have built a good model. Then, you can refit
your final model on more data.