Modin provides an implementation of the `distributed XGBoost`_ machine learning
algorithm on Modin DataFrames. Please note that this feature is experimental, and its behavior or
interfaces may change.

Modin comes with all the required dependencies by default, except for the ``xgboost`` package.
Currently, distributed XGBoost on Modin is supported only on the Ray execution engine, so see
the :doc:`installation page </getting_started/installation>` for more information on installing Modin with the Ray engine.

To install the ``xgboost`` package, you can use ``pip``:

.. code-block:: bash

    pip install xgboost

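
Before running the examples, it can help to confirm that the packages mentioned above are importable in the current environment. The following is a minimal sketch that uses only the Python standard library. The helper name ``find_missing_packages`` is hypothetical and is not part of Modin.

```python
import importlib.util


def find_missing_packages(names):
    """Return the top-level package names that cannot be imported."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages used in this guide.
missing = find_missing_packages(["xgboost", "ray", "modin"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Any name reported as missing can then be installed with ``pip`` as described above.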

Distributed XGBoost functionality is located in the ``modin.experimental.xgboost`` module.
``modin.experimental.xgboost`` provides a drop-in replacement API for the ``train`` and ``Booster.predict`` functions of ``xgboost``.

.. automodule:: modin.experimental.xgboost
    :noindex:
    :members: train

.. autoclass:: modin.experimental.xgboost.Booster
    :noindex:
    :members: predict


Data is passed to ``modin.experimental.xgboost`` functions via a Modin ``DMatrix`` object.

.. automodule:: modin.experimental.xgboost
    :noindex:
    :members: DMatrix

Currently, the Modin ``DMatrix`` supports only ``modin.pandas.DataFrame`` as input.

The XGBoost part of Modin uses Ray resources in the same way as all other Modin functions.

To start the Ray runtime on a single node:

.. code-block:: python

    import ray
    ray.init()


If you already have a Ray cluster running, you can connect to it as follows:

.. code-block:: python

    import ray
    ray.init(address='auto')

Detailed information about initializing the Ray runtime can be found on the `starting ray`_ page.
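
The two initialization modes above can be combined in a small helper, sketched below. The function names ``build_ray_init_kwargs`` and ``start_ray`` are hypothetical and are not part of Modin or Ray. The sketch assumes only the ``address`` and ``ignore_reinit_error`` arguments of ``ray.init`` and the ``ray.is_initialized`` function.

```python
def build_ray_init_kwargs(cluster_address=None):
    """Return keyword arguments for ray.init().

    With no address, Ray starts a local single-node runtime; with an
    address such as "auto", it connects to an existing cluster.
    """
    # Tolerate repeated initialization, e.g. when re-running a notebook cell.
    kwargs = {"ignore_reinit_error": True}
    if cluster_address is not None:
        kwargs["address"] = cluster_address
    return kwargs


def start_ray(cluster_address=None):
    """Initialize Ray once, either locally or against an existing cluster."""
    import ray  # imported lazily so the helper above has no Ray dependency

    if not ray.is_initialized():
        ray.init(**build_ray_init_kwargs(cluster_address))


# Show the arguments that would be passed in each mode.
print(build_ray_init_kwargs())
print(build_ray_init_kwargs("auto"))
```

Calling ``start_ray()`` starts a single-node runtime, while ``start_ray("auto")`` connects to a running cluster.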

In the example below, we train an XGBoost model using `the Iris Dataset`_ and get predictions on the same data.
All processing runs in single-node mode.

.. code-block:: python

    from sklearn import datasets

    import ray
    ray.init()  # Start the Ray runtime for single-node

    import modin.pandas as pd
    import modin.experimental.xgboost as xgb

    # Load the Iris dataset from scikit-learn
    iris = datasets.load_iris()

    # Create Modin DataFrames for features and labels
    X = pd.DataFrame(iris.data)
    y = pd.DataFrame(iris.target)

    # Create DMatrix objects (the same data is used for training and prediction)
    dtrain = xgb.DMatrix(X, y)
    dtest = xgb.DMatrix(X, y)

    # Set training parameters
    xgb_params = {
        "eta": 0.3,
        "max_depth": 3,
        "objective": "multi:softprob",
        "num_class": 3,
        "eval_metric": "mlogloss",
    }
    steps = 20

    # Dictionary that receives the evaluation results
    evals_result = dict()

    # Run training
    model = xgb.train(
        xgb_params,
        dtrain,
        steps,
        evals=[(dtrain, "train")],
        evals_result=evals_result,
    )

    print(f'Evals results:\n{evals_result}')

    # Predict results
    prediction = model.predict(dtest)

    print(f'Prediction results:\n{prediction}')

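
With the ``multi:softprob`` objective, each prediction row holds one probability per class rather than a single label. The sketch below shows one way to turn such rows into class labels and compute an accuracy score. The helper names are hypothetical, the sample values are illustrative and are not output from the model above, and the commented usage assumes that ``prediction`` is a Modin DataFrame with one column per class.

```python
def probabilities_to_labels(prob_rows):
    """Return the index of the largest probability in each row."""
    return [max(range(len(row)), key=lambda i: row[i]) for row in prob_rows]


def accuracy(predicted, expected):
    """Return the fraction of predictions that match the expected labels."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(int(p == e) for p, e in zip(predicted, expected))
    return matches / len(expected)


# Illustrative values only, not output from the model above.
sample_probs = [[0.8, 0.1, 0.1], [0.2, 0.3, 0.5], [0.1, 0.7, 0.2]]
sample_targets = [0, 2, 2]

labels = probabilities_to_labels(sample_probs)
print(labels)
print(accuracy(labels, sample_targets))

# With the objects from the example above, the calls might look like this
# (assumption: ``prediction`` is a Modin DataFrame with one column per class):
# labels = probabilities_to_labels(prediction.to_numpy().tolist())
# print(accuracy(labels, iris.target.tolist()))
```

Because the example predicts on the same data it was trained on, an accuracy computed this way reflects the fit to the training set, not generalization to unseen data.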
.. _Dataframe: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
.. _starting ray: https://docs.ray.io/en/master/starting-ray.html
.. _the Iris Dataset: https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html
.. _distributed XGBoost: https://medium.com/intel-analytics-software/distributed-xgboost-with-modin-on-ray-fc17edef7720