.. _tune-storage-options:

How to Configure Persistent Storage in Ray Tune
================================================

.. seealso::

    Before diving into storage options, take a look at
    :ref:`the different types of data stored by Tune <tune-persisted-experiment-data>`.

Tune allows you to configure persistent storage options so that experiment outputs
and checkpoints persist across a distributed Ray cluster. Tune supports three storage scenarios:

1. Cloud storage accessible from all nodes (e.g., AWS S3 or Google Cloud Storage)
2. A network filesystem mounted on all nodes (e.g., AWS EFS or Google Cloud Filestore)
3. No external persistent storage (single-node experiments only)

.. note::

    A network filesystem or cloud storage can also be configured for single-node
    experiments. This can be useful to persist your experiment results in external storage
    if, for example, the instance you run your experiment on clears its local storage
    after termination.

.. seealso::

    See :class:`~ray.tune.SyncConfig` for the full set of configuration options as well as more details.

.. _tune-cloud-checkpointing:

Configuring Tune with cloud storage (AWS S3, Google Cloud Storage)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If all nodes in a Ray cluster have access to cloud storage, e.g. AWS S3 or Google Cloud Storage (GCS),
then all experiment outputs can be saved in a shared cloud bucket.
We can configure cloud storage by telling Ray Tune to **upload to a remote** ``storage_path``:

.. code-block:: python

    from ray import tune

    tuner = tune.Tuner(
        trainable,
        run_config=tune.RunConfig(
            name="experiment_name",
            storage_path="s3://bucket-name/sub-path/",
        ),
    )
    tuner.fit()

In this example, all experiment results can be found in the shared storage at ``s3://bucket-name/sub-path/experiment_name`` for further processing.

.. note::

    The head node will not have access to all experiment results locally. If you want to process
    e.g. the best checkpoint further, you will first have to fetch it from the cloud storage.

    Experiment restoration should also be done using the experiment directory at the cloud storage
    URI, rather than the local experiment directory on the head node. See :ref:`here for an example <tune-syncing-restore-from-uri>`.

Configuring Tune with a network filesystem (NFS)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If all Ray nodes have access to a network filesystem, e.g. AWS EFS or Google Cloud Filestore,
they can all write experiment outputs to this directory.
All we need to do is **set the shared network filesystem as the path to save results**.

.. code-block:: python

    from ray import tune

    tuner = tune.Tuner(
        trainable,
        run_config=tune.RunConfig(
            name="experiment_name",
            storage_path="/mnt/path/to/shared/storage/",
        ),
    )
    tuner.fit()

In this example, all experiment results can be found in the shared storage at ``/mnt/path/to/shared/storage/experiment_name`` for further processing.
.. _tune-default-syncing:
Configure Tune without external persistent storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On a single-node cluster
************************
If you're just running an experiment on a single node (e.g., on a laptop), Tune will use the
local filesystem as the default storage location for checkpoints and other artifacts.
Results are saved to ``~/ray_results`` in a sub-directory with a unique auto-generated name by default,
unless you customize this with ``storage_path`` and ``name`` in :class:`~ray.tune.RunConfig`.

.. code-block:: python

    from ray import tune

    tuner = tune.Tuner(
        trainable,
        run_config=tune.RunConfig(
            storage_path="/tmp/custom/storage/path",
            name="experiment_name",
        ),
    )
    tuner.fit()

In this example, all experiment results can be found locally at ``/tmp/custom/storage/path/experiment_name`` for further processing.
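The experiment directory is the join of ``storage_path`` and ``name``. As a minimal sketch of that rule, assuming a POSIX filesystem (the ``experiment_dir`` helper is hypothetical, not part of the Tune API):

```python
import os
from typing import Optional


def experiment_dir(storage_path: Optional[str], name: str) -> str:
    # Hypothetical helper: where Tune places this experiment's results.
    # With no storage_path configured, Tune defaults to ~/ray_results.
    base = storage_path or os.path.expanduser("~/ray_results")
    return os.path.join(base, name)


print(experiment_dir("/tmp/custom/storage/path", "experiment_name"))
# /tmp/custom/storage/path/experiment_name
```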
On a multi-node cluster (Deprecated)
************************************

.. warning::

    When running on multiple nodes, using the local filesystem of the head node as the persistent storage location is *deprecated*.
    If you save trial checkpoints and run on a multi-node cluster, Tune raises an error by default if NFS or cloud storage is not set up.
    See `this issue <https://github.com/ray-project/ray/issues/37177>`_ for more information.

Examples
--------
The following examples show how to configure storage locations and synchronization options,
and how to resume each experiment if it gets interrupted.
See :ref:`tune-fault-tolerance-ref` for more information on resuming experiments.
In each example, we'll give a practical explanation of how *trial checkpoints* are saved
across the cluster and the external storage location (if one is provided).
See :ref:`tune-persisted-experiment-data` for an overview of other experiment data that Tune needs to persist.
Example: Running Tune with cloud storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Let's assume that you're running this example script from your Ray cluster's head node.
In the example below, ``my_trainable`` is a Tune :ref:`trainable <trainable-docs>`
that implements saving and loading checkpoints.

.. code-block:: python

    from ray import tune
    from your_module import my_trainable

    tuner = tune.Tuner(
        my_trainable,
        run_config=tune.RunConfig(
            # Name of your experiment
            name="my-tune-exp",
            # Configure how experiment data and checkpoints are persisted.
            # We recommend cloud storage checkpointing as it survives the cluster when
            # instances are terminated and has better performance.
            storage_path="s3://my-checkpoints-bucket/path/",
            checkpoint_config=tune.CheckpointConfig(
                # We'll keep the best five checkpoints at all times
                # (with the highest "max-auc" scores, a metric reported by the trainable)
                checkpoint_score_attribute="max-auc",
                checkpoint_score_order="max",
                num_to_keep=5,
            ),
        ),
    )
    # This starts the run!
    results = tuner.fit()

In this example, trial checkpoints will be saved to: ``s3://my-checkpoints-bucket/path/my-tune-exp/<trial_name>/checkpoint_<step>``
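That layout can be sketched with plain path arithmetic. The trial name and the zero-padded checkpoint index below are illustrative placeholders, not values Tune is guaranteed to generate:

```python
import posixpath

# Illustrative values; Tune generates the actual trial name and checkpoint index.
storage_path = "s3://my-checkpoints-bucket/path/"
experiment_name = "my-tune-exp"
trial_name = "my_trainable_abc12_00000"
step = 5

checkpoint_dir = posixpath.join(
    storage_path,
    experiment_name,
    trial_name,
    f"checkpoint_{step:06d}",
)
print(checkpoint_dir)
# s3://my-checkpoints-bucket/path/my-tune-exp/my_trainable_abc12_00000/checkpoint_000005
```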
.. _tune-syncing-restore-from-uri:
If this run stopped for any reason (e.g., user ``Ctrl+C``, or termination due to out-of-memory issues),
you can resume it at any time, starting from the experiment state saved in the cloud:

.. code-block:: python

    from ray import tune

    tuner = tune.Tuner.restore(
        "s3://my-checkpoints-bucket/path/my-tune-exp",
        trainable=my_trainable,
        resume_errored=True,
    )
    tuner.fit()

There are a few options for restoring an experiment:
``resume_unfinished``, ``resume_errored`` and ``restart_errored``.
Please see the documentation of
:meth:`~ray.tune.Tuner.restore` for more details.
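As an illustration of how the three flags differ (the ``restore_kwargs`` helper and its mutual-exclusion check are hypothetical sketches, not part of the Tune API):

```python
def restore_kwargs(resume_unfinished=True, resume_errored=False, restart_errored=False):
    """Hypothetical helper bundling Tuner.restore() resume options.

    - resume_unfinished: continue trials that were still running when interrupted
    - resume_errored: continue errored trials from their latest checkpoint
    - restart_errored: restart errored trials from scratch
    """
    # A trial cannot both resume from its checkpoint and restart from scratch.
    if resume_errored and restart_errored:
        raise ValueError("resume_errored and restart_errored are mutually exclusive")
    return {
        "resume_unfinished": resume_unfinished,
        "resume_errored": resume_errored,
        "restart_errored": restart_errored,
    }

# Usage sketch:
# tuner = tune.Tuner.restore(path, trainable=my_trainable, **restore_kwargs(resume_errored=True))
```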
Advanced configuration
----------------------
See :ref:`Ray Train's section on advanced storage configuration <train-storage-advanced>`.
All of the configurations there also apply to Ray Tune.