.. _graph_manipulation:

Advanced graph manipulation

There are some situations where computations with Dask collections will result in suboptimal memory usage (e.g. an entire Dask DataFrame is loaded into memory). This may happen when Dask’s scheduler doesn’t automatically delay the computation of nodes in a task graph to avoid occupying memory with their output for prolonged periods of time, or in scenarios where recalculating nodes is much cheaper than holding their output in memory.

This page highlights a set of graph manipulation utilities which can be used to help avoid these scenarios. In particular, the utilities described below rewrite the underlying Dask graph for Dask collections, producing equivalent collections with different sets of keys.

Consider the following example:

.. code-block:: python

import dask.array as da x = da.random.default_rng().normal(size=500_000_000, chunks=100_000) x_mean = x.mean() y = (x - x_mean).max().compute()

The above example computes the largest value of a distribution after removing its bias. This involves loading the chunks of x into memory in order to compute x_mean. However, since the x array is needed later in the computation to compute y, the entire x array is kept in memory. For large Dask Arrays this can be very problematic.

To alleviate the need for the entire x array to be kept in memory, one could rewrite the last line as follows:

.. code-block:: python

from dask.graph_manipulation import bind xb = bind(x, x_mean) y = (xb - x_mean).max().compute()

Here we use :func:~dask.graph_manipulation.bind to create a new Dask Array, xb, which produces exactly the same output as x, but whose underlying Dask graph has different keys than x, and will only be computed after x_mean has been calculated.

This results in the chunks of x being computed and immediately individually reduced by mean; then recomputed and again immediately pipelined into the subtraction followed by reduction with max. This results in a much smaller peak memory usage as the full x array is no longer loaded into memory. However, the tradeoff is that the compute time increases as x is computed twice.

API

.. currentmodule:: dask.graph_manipulation

.. autosummary::

checkpoint wait_on bind clone

Definitions


.. autofunction:: checkpoint
.. autofunction:: wait_on
.. autofunction:: bind
.. autofunction:: clone