docs/Training-on-Microsoft-Azure.md
:warning: Note: We no longer use this guide ourselves and so it may not work correctly. We've decided to keep it up just in case it is helpful to you.
This page contains instructions for setting up training on Microsoft Azure through either Azure Container Instances or Virtual Machines. Only headless training has been tested; non-headless training may or may not work.
A pre-configured virtual machine image is available in the Azure Marketplace and is nearly completely ready for training. You can start by deploying the Data Science Virtual Machine for Linux (Ubuntu) into your Azure subscription.
Note that, if you choose to deploy the image to an N-Series GPU optimized VM, training will, by default, run on the GPU. If you choose any other type of VM, training will run on the CPU.
Setting up your own instance requires a number of package installations. These are described in the instructions for setting up a custom Virtual Machine.
Copy the ml-agents sub-folder of this ml-agents repo to the remote Azure instance, and set it as the working directory.
Install the required packages. PyTorch:
pip3 install torch==1.7.0 -f https://download.pytorch.org/whl/torch_stable.html
and MLAgents:
python -m pip install mlagents==1.1.0
To verify that all steps worked correctly:
from mlagents_envs.environment import UnityEnvironment
env = UnityEnvironment(file_name="<your_env>", seed=1, side_channels=[])
Where <your_env> is the path to your environment executable (e.g. /home/UserName/Build/yourFile).
You should receive a message confirming that the environment was loaded successfully.
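For a slightly fuller check, the snippet above can be wrapped in a small script that also resets the environment once and lists its behaviors. This is a minimal sketch, assuming the `mlagents_envs` API of the release pinned above. The helper name `smoke_test`, the up-front path check, and the `.x86_64`/`.x86` suffixes it tries are illustrative assumptions, not part of ML-Agents.

```python
import os


def smoke_test(env_path, seed=1):
    """Load a Unity build, reset it once, and return its behavior names.

    Hypothetical helper: `env_path` is the <your_env> placeholder above.
    """
    # Fail early with a clear error if the build was never uploaded.
    # The suffixes are an assumption about how Linux builds are named.
    candidates = [env_path, env_path + ".x86_64", env_path + ".x86"]
    if not any(os.path.exists(p) for p in candidates):
        raise FileNotFoundError("No environment build found at %r" % env_path)

    # Imported lazily so the path check works before mlagents is installed.
    from mlagents_envs.environment import UnityEnvironment

    env = UnityEnvironment(
        file_name=env_path, seed=seed, side_channels=[], no_graphics=True
    )
    try:
        env.reset()
        # behavior_specs is a mapping keyed by behavior name.
        return list(env.behavior_specs)
    finally:
        # Always shut the Unity process down, even if reset() fails.
        env.close()
```

Calling `smoke_test("/home/UserName/Build/yourFile")` should return a non-empty list if the build loads correctly.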
Note: When running your environment in headless mode, you must append --no-graphics to your mlagents-learn command; training will not run otherwise.
To test this, abort a training run and check whether it reports "Model Saved" or "Aborted", or whether an .onnx file was generated in the results folder.
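The check for an exported model can also be scripted. The sketch below searches a results folder for `.onnx` files. The folder name `results` matches the TensorBoard command used on this page, while the `results/<run_id>` sub-folder layout and the helper name `find_exported_models` are assumptions made for illustration.

```python
from pathlib import Path


def find_exported_models(results_dir="results", run_id=None):
    """Return all .onnx files under the results folder, sorted by path.

    Hypothetical helper: pass `run_id` to narrow the search to one run,
    assuming runs are stored as results/<run_id>.
    """
    root = Path(results_dir)
    if run_id is not None:
        root = root / run_id
    if not root.is_dir():
        # Nothing has been written yet (or the path is wrong).
        return []
    return sorted(root.rglob("*.onnx"))


if __name__ == "__main__":
    models = find_exported_models()
    print("exported models:", [str(m) for m in models])
```

An empty list after an aborted run suggests that no model was saved.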
To run your training on the VM:
mlagents-learn <trainer_config> --env=<your_app> --run-id=<run_id> --train
Where <your_app> is the path to your app (e.g.
~/unity-volume/3DBallHeadless) and <run_id> is a name of your choice that
identifies the training run.
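If you launch training from a script rather than by hand, it can help to assemble the command in one place. The following is a rough sketch that only mirrors the flags shown on this page (`--env`, `--run-id`, `--train`, `--no-graphics`). The function name `build_learn_command` and the example values `config.yaml` and `azure-run-01` are hypothetical.

```python
import os
import shlex


def build_learn_command(trainer_config, env_path, run_id, no_graphics=True):
    """Return the mlagents-learn invocation as an argument list."""
    cmd = [
        "mlagents-learn",
        trainer_config,
        # Expand "~" here because the shell will not do it after "--env=".
        "--env=" + os.path.expanduser(env_path),
        "--run-id=" + run_id,
        "--train",
    ]
    if no_graphics:
        # Needed for headless builds, as noted above.
        cmd.append("--no-graphics")
    return cmd


if __name__ == "__main__":
    # Placeholder values; substitute your own config, build path and run id.
    cmd = build_learn_command(
        "config.yaml", "~/unity-volume/3DBallHeadless", "azure-run-01"
    )
    print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids quoting problems with paths that contain spaces.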
If you chose to run on an N-Series VM with GPU support, you can verify that
the GPU is being used by running nvidia-smi from the command line.
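The same check can be made from Python, for example to log GPU load during a long run. This sketch assumes `nvidia-smi` is on the PATH of a GPU VM and uses its `--query-gpu` and `--format` options. The helper name `gpu_utilization` is made up for this example.

```python
import shutil
import subprocess


def gpu_utilization(executable="nvidia-smi"):
    """Return one "name, utilization, memory used" line per GPU.

    Returns None when the tool is not installed, for example on a CPU-only VM.
    """
    if shutil.which(executable) is None:
        return None
    result = subprocess.run(
        [
            executable,
            "--query-gpu=name,utilization.gpu,memory.used",
            "--format=csv,noheader",
        ],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    print(gpu_utilization())
```

A utilization figure that stays at 0 % while training is running is a hint that training is running on the CPU instead.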
Once you have started training, you can use TensorBoard to observe the training.
Start by opening the appropriate port for web traffic to connect to your VM.
Note that you don't need to create a new Network Security Group; instead, go to the Networking tab under Settings for your VM.
Unless you started the training as a background process, connect to your VM from another terminal instance.
Run the following command from your terminal:
tensorboard --logdir results --host 0.0.0.0
You should now be able to open a browser and navigate to
<Your_VM_IP_Address>:6006 to view the TensorBoard report.
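If the page does not load, a quick way to tell a firewall problem from a TensorBoard problem is to test the port from your own machine. The sketch below assumes TensorBoard is listening on its default port, 6006, since the command above does not pass a port. The helper name `port_is_open` is illustrative.

```python
import socket


def port_is_open(host, port=6006, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Replace with your VM's public IP address.
    print(port_is_open("127.0.0.1"))
```

If this returns False from your own machine but True when run on the VM against 127.0.0.1, the inbound rule for the port is the likely cause.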
Azure Container Instances allow you to spin up a container, on demand, that will run your training and then be shut down. This ensures you aren't leaving a billable VM running when it isn't needed. Using ACI enables you to offload training of your models without needing to install Python and TensorFlow on your own computer.
This page contains instructions for setting up a custom Virtual Machine on Microsoft Azure so you can run ML-Agents training in the cloud.
Start by deploying an Azure VM with Ubuntu Linux (tests were done with 16.04 LTS). To use GPU support, use an N-Series VM.
SSH into your VM.
Start with the following commands to install the Nvidia driver:
wget http://us.download.nvidia.com/tesla/375.66/nvidia-diag-driver-local-repo-ubuntu1604_375.66-1_amd64.deb
sudo dpkg -i nvidia-diag-driver-local-repo-ubuntu1604_375.66-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda-drivers
sudo reboot
After a minute you should be able to reconnect to your VM and install the CUDA toolkit:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda-8-0
You'll next need to download cuDNN from the Nvidia developer site. This requires a registered account.
Navigate to http://developer.nvidia.com, create an account, and verify it.
Download (to your own computer) cuDNN from this url.
Copy the deb package to your VM:
scp libcudnn6_6.0.21-1+cuda8.0_amd64.deb <VMUserName>@<VMIPAddress>:libcudnn6_6.0.21-1+cuda8.0_amd64.deb
SSH back to your VM and execute the following:
sudo dpkg -i libcudnn6_6.0.21-1+cuda8.0_amd64.deb
export LD_LIBRARY_PATH=/usr/local/cuda/lib64/:/usr/lib/x86_64-linux-gnu/:$LD_LIBRARY_PATH
. ~/.profile
sudo reboot
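An `export` typed at the prompt only lasts for that shell session, so it is worth confirming after the reboot that the CUDA directories are still on `LD_LIBRARY_PATH`. The following is a small sketch of such a check. The two directories are taken from the export line above, and the helper name `missing_library_dirs` is hypothetical.

```python
import os

# Directories from the export command above.
REQUIRED_DIRS = ["/usr/local/cuda/lib64", "/usr/lib/x86_64-linux-gnu"]


def missing_library_dirs(ld_library_path, required=REQUIRED_DIRS):
    """Return the required directories that are absent from the given value."""
    # normpath removes trailing slashes so "/a/b/" and "/a/b" compare equal.
    present = {os.path.normpath(p) for p in ld_library_path.split(":") if p}
    return [d for d in required if os.path.normpath(d) not in present]


if __name__ == "__main__":
    missing = missing_library_dirs(os.environ.get("LD_LIBRARY_PATH", ""))
    print("missing:", missing if missing else "none")
```

If any directory is reported missing, one option is to add the export line to `~/.profile` so that it is applied at every login.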
After a minute, you should be able to SSH back into your VM. After doing so, run the following:
sudo apt install python-pip
sudo apt install python3-pip
At this point, you need to install TensorFlow. Which package you install depends on whether you train on the GPU:
pip3 install tensorflow-gpu==1.4.0 keras==2.0.6
Or on the CPU:
pip3 install tensorflow==1.4.0 keras==2.0.6
You'll then need to install additional dependencies:
pip3 install pillow
pip3 install numpy
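Once the installs have finished, you can confirm that Python sees each package before starting a training run. This is a minimal sketch using only the standard library. The module list reflects the packages installed above (note that pillow is imported as `PIL`), and the helper name `missing_modules` is made up for this example.

```python
import importlib.util

# Import names of the packages installed above; pillow is imported as PIL.
MODULES = ["tensorflow", "keras", "PIL", "numpy"]


def missing_modules(names=MODULES):
    """Return the module names that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", missing if missing else "none")
```

Run it with `python3`, the interpreter that `pip3` installs into, so that the check looks in the same site-packages directory.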