⚠️ Repo Archive Notice

As of Nov 13, 2020, charts in this repo will no longer be updated. For more information, see the Helm Charts Deprecation and Archive Notice, and Update.

Horovod

Horovod is a distributed training framework for TensorFlow, developed by Uber. Its goal is to make distributed deep learning fast and easy to use. Horovod is also available as a Docker image, which streamlines the installation process.

DEPRECATION NOTICE

This chart is deprecated and no longer supported.

Introduction

This chart bootstraps Horovod, a distributed TensorFlow framework, on a Kubernetes cluster using the Helm package manager. It deploys the Horovod workers as a StatefulSet and the Horovod master as a Job, then discovers the host list automatically.

Prerequisites

  • Kubernetes cluster v1.8+

Build Docker Image

You can download the official Horovod Dockerfile and modify it to suit your requirements, e.g. to select a different CUDA, TensorFlow, or Python version.

$ mkdir horovod-docker
$ wget -O horovod-docker/Dockerfile https://raw.githubusercontent.com/uber/horovod/master/Dockerfile
$ docker build -t horovod:latest horovod-docker

Prepare ssh keys

# Set up the SSH key pair in a temporary directory
export SSH_KEY_DIR=$(mktemp -d)
cd "$SSH_KEY_DIR"
yes | ssh-keygen -N "" -f id_rsa

Create the values.yaml

To run Horovod with GPUs, create a values.yaml like the one below:

# cat << EOF > ~/values.yaml
---
ssh:
  useSecrets: true
  hostKey: |-
$(cat $SSH_KEY_DIR/id_rsa | sed 's/^/    /g')

  hostKeyPub: |-
$(cat $SSH_KEY_DIR/id_rsa.pub | sed 's/^/    /g')

resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1

worker:
  number: 2
  image:
    repository: uber/horovod
    tag: 0.12.1-tf1.8.0-py3.5
master:
  image:
    repository: uber/horovod
    tag: 0.12.1-tf1.8.0-py3.5
  args:
    - "mpirun -np 3 --hostfile /horovod/generated/hostfile --mca orte_keep_fqdn_hostnames t --allow-run-as-root --display-map --tag-output --timestamp-output sh -c 'python /examples/tensorflow_mnist.py'"
EOF
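
The `$(cat ... | sed 's/^/    /g')` substitutions in the heredoc above indent every line of the key file by four spaces, so that a multi-line key sits inside the YAML block scalar opened by `|-`. A minimal sketch of that step, using a made-up three-line stand-in (`fake_key`, a hypothetical name) instead of a real key:

```shell
# Stand-in for a multi-line private key (hypothetical content, not a real key).
fake_key=$(printf '%s\n' '-----BEGIN KEY-----' 'abc123' '-----END KEY-----')

# Prefix each line with four spaces, as the heredoc above does.
indented=$(printf '%s\n' "$fake_key" | sed 's/^/    /g')

# Assemble the YAML fragment; the key lines end up nested under hostKey.
printf 'ssh:\n  hostKey: |-\n%s\n' "$indented"
```

If the indentation were missing, the key lines would start at column zero and the YAML parser would no longer treat them as part of `hostKey`.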

In most cases the overlay network degrades Horovod performance significantly, so the host network should be used instead. To run Horovod with the host network and GPUs, create a values.yaml like the one below:

# cat << EOF > ~/values.yaml
---
useHostNetwork: true

ssh:
  useSecrets: true
  port: 32222
  hostKey: |-
$(cat $SSH_KEY_DIR/id_rsa | sed 's/^/    /g')

  hostKeyPub: |-
$(cat $SSH_KEY_DIR/id_rsa.pub | sed 's/^/    /g')

resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1

worker:
  number: 2
  image:
    repository: uber/horovod
    tag: 0.12.1-tf1.8.0-py3.5
master:
  image:
    repository: uber/horovod
    tag: 0.12.1-tf1.8.0-py3.5
  args:
    - "mpirun -np 3 --hostfile /horovod/generated/hostfile --mca orte_keep_fqdn_hostnames t --allow-run-as-root --display-map --tag-output --timestamp-output sh -c 'python /examples/tensorflow_mnist.py'"
EOF

Note: compared with the first example, useHostNetwork is set to true and ssh.port is set to a port other than 22.
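
Both examples pair `worker.number: 2` with `mpirun -np 3`. One plausible reading, which is an assumption inferred from these examples and not something this README states, is that the process count covers the workers plus the master, with one process per GPU. Under that assumption the value can be derived instead of hard-coded; `WORKERS` and `GPUS_PER_POD` are hypothetical variable names:

```shell
WORKERS=2        # matches worker.number in values.yaml
GPUS_PER_POD=1   # matches nvidia.com/gpu under resources.limits

# Assumption: one MPI process per GPU, on every worker plus the master.
NP=$(( (WORKERS + 1) * GPUS_PER_POD ))

echo "-np ${NP}"
```

If you change `worker.number`, check that the `-np` value in `master.args` still matches the hosts listed in the generated hostfile.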

Installing the Chart

To install the chart with the release name mnist:

$ helm install --values ~/values.yaml --name mnist stable/horovod

Uninstalling the Chart

To uninstall/delete the mnist deployment:

$ helm delete mnist

The command removes all the Kubernetes components associated with the chart and deletes the release.

Upgrading an existing Release to a new major version

A major chart version change (such as v1.2.3 -> v2.0.0) indicates an incompatible, breaking change that requires manual action.

1.0.0

This version removes the chart label from spec.selector.matchLabels, which has been immutable since StatefulSet apps/v1beta2. The label had been added inadvertently, causing any subsequent upgrade to fail. See https://github.com/helm/charts/issues/7726.

To upgrade, first delete the Horovod StatefulSet. Assuming your release is named my-release:

$ kubectl delete statefulsets.apps --cascade=false my-release
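
The full upgrade sequence could look roughly like the sketch below. It only prints the commands by default; the `run` helper and the `DRY_RUN` variable are hypothetical conveniences, and the `helm upgrade` arguments assume the same values file and chart name used in the install step.

```shell
RELEASE=my-release   # release name used in the text above

# Print each command; set DRY_RUN=0 to actually execute them.
DRY_RUN=${DRY_RUN:-1}
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

# 1. Remove the StatefulSet object while orphaning (keeping) its pods.
run kubectl delete statefulsets.apps --cascade=false "$RELEASE"
# 2. Upgrade the release so the StatefulSet is recreated from the new chart.
run helm upgrade --values ~/values.yaml "$RELEASE" stable/horovod
```

Reviewing the printed commands before setting `DRY_RUN=0` is a cheap safeguard, since deleting the wrong StatefulSet is hard to undo.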

Configuration

The following table lists the configurable parameters of the Horovod chart and their default values.

| Parameter | Description | Default |
|---|---|---|
| useHostNetwork | Host network | false |
| ssh.port | The ssh port | 22 |
| ssh.useSecrets | Determine if using the secrets for ssh | false |
| worker.number | The worker's number | 5 |
| worker.image.repository | horovod worker image | uber/horovod |
| worker.image.pullPolicy | pullPolicy for the worker | IfNotPresent |
| worker.image.tag | tag for the worker | 0.12.1-tf1.8.0-py3.5 |
| resources | pod resource requests & limits | {} |
| worker.env | worker's environment variables | {} |
| master.image.repository | horovod master image | uber/horovod |
| master.image.tag | tag for the master | 0.12.1-tf1.8.0-py3.5 |
| master.image.pullPolicy | image pullPolicy for the master image | IfNotPresent |
| master.args | master's args | {} |
| master.env | master's environment variables | {} |
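
For small changes, individual parameters can be overridden on the command line with Helm's `--set` flag instead of a full values file. The sketch below builds such a command and prints it for review without running it; the parameter names come from the table above, while the values and the `WORKERS`, `SSH_PORT` and `SET_ARGS` variable names are illustrative assumptions.

```shell
# Example values only; adjust to your cluster.
WORKERS=3
SSH_PORT=32222

# Helm accepts several overrides in one --set flag, separated by commas.
SET_ARGS="useHostNetwork=true,ssh.port=${SSH_PORT},worker.number=${WORKERS}"

# Print the command for review; it is not executed here.
echo "helm install --name mnist --set ${SET_ARGS} stable/horovod"
```

Multi-line values such as the SSH keys are awkward to pass this way, so the values.yaml approach shown earlier remains the better fit for them.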