(clusters-vm-ml-example)=
:::{note}
To learn the basics of Ray on VMs, we recommend taking a look
at the {ref}`introductory guide <vm-cluster-quick-start>` first.
:::
In this guide, we show you how to run a sample Ray machine learning workload on AWS. Similar steps can be used to deploy on GCP or Azure as well.
We will run Ray's {ref}`XGBoost training benchmark <xgboost-benchmark>` with a 100 gigabyte training set.
To learn more about using Ray's XGBoostTrainer, check out {ref}`the XGBoostTrainer documentation <train-xgboost>`.
For the workload in this guide, it is recommended to use the following setup:
The corresponding cluster configuration file is as follows:
**If you would like to try running the workload with autoscaling enabled**,
change ``min_workers`` of worker nodes to 0.
After the workload is submitted, 9 worker nodes will
scale up to accommodate the workload. These nodes will scale back down after the workload is complete.
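As a rough illustration of the change described above, the following sketch sets `min_workers` to 0 for every non-head node type in a cluster configuration that has already been loaded into a Python dictionary. The node type names (`head_node`, `worker_node`) and the sample values are hypothetical. Only the `head_node_type`, `available_node_types`, `min_workers`, and `max_workers` keys are meant to mirror the cluster YAML schema.

```python
import copy


def enable_scale_from_zero(config: dict) -> dict:
    """Return a copy of a cluster config with min_workers=0 on all worker node types."""
    updated = copy.deepcopy(config)
    head_type = updated.get("head_node_type")
    for name, node_type in updated.get("available_node_types", {}).items():
        # Leave the head node type untouched; only worker types scale from zero.
        if name != head_type:
            node_type["min_workers"] = 0
    return updated


# Hypothetical stand-in for the parsed contents of cluster.yaml.
example_config = {
    "head_node_type": "head_node",
    "available_node_types": {
        "head_node": {},
        "worker_node": {"min_workers": 9, "max_workers": 9},
    },
}

autoscaling_config = enable_scale_from_zero(example_config)
print(autoscaling_config["available_node_types"]["worker_node"])
```

Keeping `max_workers` at 9 lets the autoscaler grow the cluster to the same size as the fixed configuration once the workload requests resources.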
Now we're ready to deploy the Ray cluster with the configuration defined above. Before running the command, make sure your AWS credentials are configured correctly.
```shell
ray up -y cluster.yaml
```
A Ray head node and 9 Ray worker nodes will be created.
We will use {ref}`Ray Job Submission <jobs-overview>` to kick off the workload.
First, we connect to the Job server. Run the following blocking command in a separate shell.
```shell
ray dashboard cluster.yaml
```
This will forward remote port 8265 to port 8265 on localhost.
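Before submitting anything, it can help to confirm that the forwarded port accepts connections. The following is a minimal sketch using only the Python standard library. The helper name `is_port_open` and the two-second timeout are illustrative choices, not part of Ray.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 8265, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if is_port_open():
    print("Port 8265 is reachable; the dashboard forward appears to be up.")
else:
    print("Port 8265 is not reachable; is `ray dashboard cluster.yaml` still running?")
```

A successful TCP connection only shows that something is listening on the port; it does not verify that the listener is the Ray Job server.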
We'll use the {ref}`Ray Job Python SDK <ray-job-sdk>` to submit the XGBoost workload.
To submit the workload, run the above Python script. The script is available in the Ray repository.
```shell
# Download the above script.
curl https://raw.githubusercontent.com/ray-project/ray/releases/2.0.0/doc/source/cluster/doc_code/xgboost_submit.py -o xgboost_submit.py
# Run the script.
python xgboost_submit.py
```
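For orientation, a submission through the Ray Job Python SDK can look roughly like the sketch below. It is not the contents of `xgboost_submit.py`. It assumes that Ray is installed locally and that the Job server is reachable at `http://localhost:8265`. The helper names are illustrative, and any entrypoint passed to `submit_workload` (for example `"python my_workload.py"`) is a hypothetical placeholder rather than the benchmark command.

```python
DEFAULT_ADDRESS = "http://localhost:8265"


def submit_workload(entrypoint: str, address: str = DEFAULT_ADDRESS) -> str:
    """Submit a shell command as a Ray Job and return its submission id."""
    # Imported lazily so the helpers below can be used without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    return client.submit_job(entrypoint=entrypoint)


def follow_logs_command(submission_id: str, address: str = DEFAULT_ADDRESS) -> str:
    """Build the CLI command that follows the logs of a submitted job."""
    return f"ray job logs '{submission_id}' --address=\"{address}\" --follow"


# Placeholder id in the same format the CLI prints; a real id comes from submit_workload().
print(follow_logs_command("raysubmit_xxxxxxxxxxxxxxxx"))
```

With the port forward running, `submit_workload("python my_workload.py")` would return a submission id that can be passed to `follow_logs_command`.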
The benchmark may take up to 30 minutes to run. Use the following tools to observe its progress.
To follow the job's logs, use the command printed by the above submission script.
```shell
# Substitute the Ray Job's submission id.
ray job logs 'raysubmit_xxxxxxxxxxxxxxxx' --address="http://localhost:8265" --follow
```
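The job's state can also be checked programmatically. As a sketch, assuming a client object that exposes `get_job_status(submission_id)` in the way the Ray Job SDK's `JobSubmissionClient` does, the helper below polls until the job reaches a terminal state. The helper name, the poll interval, and the timeout are illustrative choices, not part of Ray.

```python
import time

# Terminal job states, compared by name.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(client, submission_id: str, poll_interval_s: float = 10.0,
                 timeout_s: float = 3600.0) -> str:
    """Poll client.get_job_status until the job reaches a terminal state."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = client.get_job_status(submission_id)
        # Accept either an enum member (use its value) or a plain string.
        name = str(getattr(status, "value", status))
        if name in TERMINAL_STATES:
            return name
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {submission_id} still {name} after {timeout_s}s")
        time.sleep(poll_interval_s)
```

With a real client this would be called as `wait_for_job(JobSubmissionClient("http://localhost:8265"), submission_id)`. Because the function only relies on `get_job_status`, it can be exercised with a stub client as well.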
View `localhost:8265` in your browser to access the Ray Dashboard.
Observe autoscaling status and Ray resource usage with the following command:
```shell
ray exec cluster.yaml 'ray status'
```
Once the benchmark is complete, the job log will display the results:
```
Results: {'training_time': 1338.488839321999, 'prediction_time': 403.36653568099973}
```
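The reported values are in seconds. As a small sketch, assuming the log line keeps the `Results: {...}` shape shown above, the dictionary can be parsed and converted to minutes with the standard library. The helper name `parse_results` is illustrative.

```python
import ast


def parse_results(log_line: str) -> dict:
    """Extract the dictionary from a 'Results: {...}' log line."""
    prefix = "Results:"
    if not log_line.startswith(prefix):
        raise ValueError(f"Expected a line starting with {prefix!r}")
    # literal_eval parses Python literals only; it does not execute code.
    return ast.literal_eval(log_line[len(prefix):].strip())


line = "Results: {'training_time': 1338.488839321999, 'prediction_time': 403.36653568099973}"
for key, seconds in parse_results(line).items():
    print(f"{key}: {seconds / 60:.1f} min")
```

For the sample line, this works out to roughly 22.3 minutes of training and 6.7 minutes of prediction.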
The performance of the benchmark is sensitive to the underlying cloud infrastructure,
so you might not match {ref}`the numbers quoted in the benchmark docs <xgboost-benchmark>`.
The file `model.json` on the Ray head node contains the parameters for the trained model.
Other result data will be available in the `ray_results` directory on the head node.
Refer to the {ref}`XGBoostTrainer documentation <train-xgboost>` for details.
If autoscaling is enabled, Ray worker nodes will scale down after the specified idle timeout.
Delete your Ray cluster with the following command:
```shell
ray down -y cluster.yaml
```