stable/distributed-tensorflow/README.md

⚠️ Repo Archive Notice

As of Nov 13, 2020, charts in this repo will no longer be updated. For more information, see the Helm Charts Deprecation and Archive Notice, and Update.

Distributed TensorFlow

TensorFlow is an open source software library for numerical computation using data flow graphs. It supports distributed computing, allowing a job to be split across servers in either a 'data parallel' or 'model parallel' fashion. This means data scientists can scale distributed training out to hundreds of GPUs using TensorFlow.
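The 'data parallel' pattern this chart targets can be sketched in plain Python (illustrative only, not the chart's code): each worker computes a gradient on its own shard of the data, and a parameter server averages those gradients to update the shared weights.

```python
# Illustrative sketch of data-parallel training (not the chart's actual code):
# each worker computes a gradient on its own data shard, and a parameter
# server averages the gradients to update the shared model weight.

def worker_gradient(weight, shard):
    # Gradient of mean squared error for the toy model y = weight * x,
    # computed only on this worker's shard of (x, target) pairs.
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)

def parameter_server_step(weight, shards, learning_rate=0.01):
    # Average the per-worker gradients, then apply one update step.
    grads = [worker_gradient(weight, shard) for shard in shards]
    avg_grad = sum(grads) / len(grads)
    return weight - learning_rate * avg_grad

# Two "workers", each holding a shard of data generated from y = 3 * x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
weight = 0.0
for _ in range(200):
    weight = parameter_server_step(weight, shards)
```

After the loop the shared weight converges to the underlying slope (3.0), even though no worker ever saw the full dataset; that is the essence of data parallelism.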

DEPRECATION NOTICE

This chart is deprecated and no longer supported.

Prerequisites

  • Kubernetes cluster v1.12+

Chart Details

This chart will create a TensorFlow cluster, and distribute a computation graph across that cluster.
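For orientation, distributed TensorFlow processes discover each other through a cluster specification listing the worker and parameter-server (ps) addresses; each process receives the same spec plus its own role and index, commonly via the `TF_CONFIG` environment variable. A minimal sketch of that JSON shape, assuming hypothetical pod hostnames and ports (the chart's templates determine the real names):

```python
import json

# Hypothetical pod hostnames and ports; the chart's templates generate the
# real Kubernetes service names. Every process gets the same "cluster" map,
# and a "task" entry naming its own role ("worker" or "ps") and index.
cluster = {
    "worker": ["mnist-worker-0:9090", "mnist-worker-1:9090"],
    "ps": ["mnist-ps-0:8080", "mnist-ps-1:8080"],
}
tf_config = json.dumps({"cluster": cluster, "task": {"type": "worker", "index": 0}})
```

In a Kubernetes deployment like this chart's, each Pod would be handed its own `task` entry while sharing the identical `cluster` map.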

Installing the Chart

  • To install the chart with the release name mnist:

    ```bash
    $ helm install mnist incubator/distributed-tensorflow
    ```
  • To install with custom values via file:

    ```bash
    $ helm install --values values.yaml mnist incubator/distributed-tensorflow
    ```

    Below is an example of a custom values file, values.yaml, with GPU support.

    ```yaml
    worker:
      number: 2
      podManagementPolicy: Parallel
      image:
        repository: dysproz/distributed-tf
        tag: 1.6.0-gpu
      port: 9090
      gpuCount: 1

    ps:
      number: 2
      podManagementPolicy: Parallel
      image:
        repository: dysproz/distributed-tf
        tag: 1.6.0
        pullPolicy: IfNotPresent
      port: 8080

    # optimize for training
    hyperparams:
      batchsize: 20
      learningrate: 0.001
      trainsteps: 10000
    ```

Note: details of the Docker image are available on Docker Hub.
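Instead of a values file, individual values can also be overridden on the command line with Helm's `--set` flag (the keys match the values.yaml example above; the values shown here are arbitrary):

```shell
$ helm install mnist incubator/distributed-tensorflow \
    --set worker.gpuCount=1 \
    --set hyperparams.trainsteps=5000
```

`--set` overrides take precedence over values supplied with `--values`, which is convenient for one-off experiments without editing the file.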

Uninstalling the Chart

  • To uninstall/delete the mnist deployment:

    ```bash
    $ helm delete mnist
    ```

The command removes all the Kubernetes components associated with the chart and deletes the release.

Configuration

The following table lists the configurable parameters of the Distributed TensorFlow chart and their default values.

| Parameter | Description | Default |
| --------- | ----------- | ------- |
| `worker.image.repository` | TensorFlow worker server's image repository | `dysproz/distributed-tf` |
| `worker.image.tag` | TensorFlow worker server's image tag | `gpu` |
| `worker.image.pullPolicy` | Image pullPolicy for the worker | `IfNotPresent` |
| `worker.gpuCount` | GPUs to be allocated and allowed for the Pods | `0` |
| `worker.env` | Key-value environment variables | `None` |
| `ps.image.repository` | TensorFlow parameter server's image repository | `dysproz/distributed-tf` |
| `ps.image.tag` | TensorFlow parameter server's image tag | `1.7.0-gpu` |
| `ps.image.pullPolicy` | Image pullPolicy for the ps | `IfNotPresent` |
| `ps.env` | Key-value environment variables | `None` |
| `volumes` | List of volumes defined in the cluster (in standard k8s format) | host path to `/tmp/mnist` |
| `volumeMounts` | Volumes mounted into Pods (in standard k8s format) | host path volume mounted into `/tmp/mnist-log` |
| `hyperparams.batchsize` | Batch size | `20` |
| `hyperparams.learningrate` | Learning rate | `0.001` |
| `hyperparams.trainsteps` | Train steps | `0` (continuous run) |
| `hyperparams.datadir` | Data directory | `None` |
| `hyperparams.logdir` | Logging directory | `None` |
| `hyperparams.hiddenunits` | Hidden units in the neural network | `None` |