Back to Modin

pandas on Dask

docs/development/using_pandas_on_dask.rst

0.37.12.6 KB
Original Source

pandas on Dask

This section describes usage related documents for the pandas on Dask component of Modin.

Modin uses pandas as a primary memory format of the underlying partitions and optimizes queries ingested from the API layer in a specific way to this format. Thus, there is no need to care of choosing it but you can explicitly specify it anyway as shown below.

One of the execution engines that Modin uses is Dask. To enable the pandas on Dask execution you should set the following environment variables:

.. code-block:: bash

export MODIN_ENGINE=dask export MODIN_STORAGE_FORMAT=pandas

or turn them on in source code:

.. code-block:: python

import modin.config as cfg cfg.Engine.put('dask') cfg.StorageFormat.put('pandas')

Using Modin on Dask locally

If you want to run Modin on Dask locally using a single node, just set Modin engine to Dask and continue working with a Modin DataFrame as if it was a pandas DataFrame. You can either initialize a Dask client on your own and Modin connects to the existing Dask cluster or allow Modin itself to initialize a Dask client.

.. code-block:: python

import modin.pandas as pd import modin.config as modin_cfg

modin_cfg.Engine.put("dask") df = pd.DataFrame(...)

Using Modin on Dask in a Cluster

If you want to run Modin on Dask in a cluster, you should set up a Dask cluster and initialize a Dask client. Once the Dask client is initialized, Modin will be able to connect to it and use the Dask cluster.

.. code-block:: python

from distributed import Client import modin.pandas as pd import modin.config as modin_cfg

Define your cluster here

cluster = ... client = Client(cluster)

modin_cfg.Engine.put("dask") df = pd.DataFrame(...)

To get more information on how to deploy and run a Dask cluster, visit the Deploy Dask Clusters_ page.

Conversion between Modin DataFrame and Dask DataFrame

Modin DataFrame can be converted to/from Dask DataFrame with no-copy partition conversion. This allows you to take advantage of both Modin and Dask libraries for maximum performance.

.. code-block:: python

import modin.pandas as pd import modin.config as modin_cfg from modin.pandas.io import to_dask, from_dask

modin_cfg.Engine.put("dask") df = pd.DataFrame(...)

Convert Modin to Dask DataFrame

dask_df = to_dask(df)

Convert Dask to Modin DataFrame

modin_df = from_dask(dask_df)

.. _Deploy Dask Clusters: https://docs.dask.org/en/stable/deploying.html