# nvidia-device-plugin kOps Hook
This kOps hook container may be used to enable nodes with GPUs to work with Kubernetes. It is targeted specifically at AWS GPU instance types.

It installs the following from web sources. Using this hook indicates that you agree to the NVIDIA license.

- `nvidia-device-plugin.service`, along with setup scripts. The service runs and executes the setup scripts in the host directory `/nvidia-device-plugin`.

Although this hook may work across many combinations of software versions and images, it has only been tested with the following:

- kope.io/k8s-1.10-debian-stretch-amd64-hvm-ebs-2018-05-27
- kope.io/k8s-1.15-debian-stretch-amd64-hvm-ebs-2020-01-17
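To spot-check that the hook actually ran on a node, the commands below can be run on a GPU node over SSH. The service and directory names come from the description above; `nvidia-smi` is the standard NVIDIA driver utility.

```sh
systemctl status nvidia-device-plugin.service   # unit installed by the hook
ls /nvidia-device-plugin                        # host directory holding the setup scripts
nvidia-smi                                      # should list the GPU once the driver is installed
```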
Note that Debian jessie images are not supported: nvidia-docker requires docker-ce >= 18.0, which is not available in the Debian jessie package repository. In addition, the Debian jessie kernel was compiled with gcc-7 while the system packages install gcc-4, which makes the NVIDIA driver compilation fail.

This kOps hook was developed against the following version combinations.
| kOps Version | Kubernetes Version | GPU Mode | OS Image |
|---|---|---|---|
| 1.15.0 | 1.15.5 | deviceplugin | kope.io/k8s-1.15-debian-stretch-amd64-hvm-ebs-2020-01-17 |
| 1.10-beta.1 | 1.10 | deviceplugin | kope.io/k8s-1.10-debian-stretch-amd64-hvm-ebs-2018-05-27 |
| 1.9.1 | 1.11 | deviceplugin | kope.io/k8s-1.10-debian-stretch-amd64-hvm-ebs-2018-05-27 |
| 1.9.1 | 1.10 | legacy | kope.io/k8s-1.10-debian-stretch-amd64-hvm-ebs-2018-05-27 |
To self-host the docker image, build and push it to your own registry:

```sh
DOCKER_REGISTRY=<your-registry> make image push
```

Then create a cluster with GPU nodes, for example:

```sh
kops create cluster \
  --name gpu.example.k8s.local \
  --zones us-east-1c \
  --node-size p2.xlarge \
  --node-count 1 \
  --image kope.io/k8s-1.15-debian-stretch-amd64-hvm-ebs-2020-01-17 \
  --kubernetes-version 1.15.5
```
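Optionally, confirm that the cluster is up and the nodes are ready before configuring the hook:

```sh
kops validate cluster --name gpu.example.k8s.local
```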
Configuring the hook should be safe to do for all machines, because the hook auto-detects whether the machine is an AWS GPU instance type and will no-op otherwise (the detection idea is sketched below).
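The following is a minimal sketch of that auto-detection idea, assuming the EC2 instance metadata service and a hard-coded list of GPU instance families; the hook's actual logic and instance-type list may differ.

```sh
#!/bin/sh
# Sketch only: exit early (no-op) unless the instance type is a known AWS GPU family.
INSTANCE_TYPE=$(curl -s http://169.254.169.254/latest/meta-data/instance-type)
case "${INSTANCE_TYPE}" in
  p2.*|p3.*|g2.*|g3.*) echo "GPU instance detected: ${INSTANCE_TYPE}" ;;
  *) echo "Not an AWS GPU instance type; exiting without changes."; exit 0 ;;
esac
```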
Choose between the DevicePlugin GPU Mode and the Legacy Accelerators GPU Mode, described below.

## DevicePlugin GPU Mode

This mode is for Kubernetes >= 1.11.0, or for clusters supporting DevicePlugins.
```yaml
# > kops edit instancegroup nodes

# CUDA 10.0
spec:
  image: kope.io/k8s-1.15-debian-stretch-amd64-hvm-ebs-2020-01-17
  hooks:
  - execContainer:
      # Replace DOCKER_REGISTRY with the registry used to host the image
      image: DOCKER_REGISTRY/nvidia-device-plugin:0.2.0-cuda10.0

# CUDA 9.1
spec:
  image: kope.io/k8s-1.10-debian-stretch-amd64-hvm-ebs-2018-05-27
  hooks:
  - execContainer:
      image: dcwangmit01/nvidia-device-plugin:0.1.0

  ### The settings below are only necessary for Kubernetes <= 1.11.0, where
  ### DevicePlugins are not enabled by default.
  # kubelet:
  #   featureGates:
  #     # Enable DevicePlugins
  #     DevicePlugins: "true"
  #     # Disable Accelerators (may interfere with DevicePlugins)
  #     Accelerators: "false"
```
## Legacy Accelerators GPU Mode

The Legacy Accelerators GPU mode is equivalent to the original GPU hook. Note that Accelerators are deprecated in Kubernetes >= 1.11.0.
```yaml
# > kops edit instancegroup nodes
spec:
  image: kope.io/k8s-1.10-debian-stretch-amd64-hvm-ebs-2018-05-27
  hooks:
  - execContainer:
      image: dcwangmit01/nvidia-device-plugin:0.1.0
      environment:
        NVIDIA_DEVICE_PLUGIN_MODE: legacy
  kubelet:
    featureGates:
      # Disable DevicePlugins (may interfere with Accelerators)
      DevicePlugins: "false"
      # Enable Accelerators
      Accelerators: "true"
```
Apply the configuration changes and roll the nodes:

```sh
kops update cluster gpu.example.k8s.local --yes
kops rolling-update cluster gpu.example.k8s.local --yes
```
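After the rolling update completes, GPU capacity should appear under each GPU node's allocatable resources. A quick check (the resource keys are the ones used in the pod examples below; the DevicePlugin key only shows up once the daemonset from the next step is running):

```sh
# DevicePlugin GPU Mode (requires the daemonset installed below):
kubectl get nodes -o jsonpath="{.items[*].status.allocatable['nvidia\.com/gpu']}"

# Legacy Accelerators GPU Mode:
kubectl get nodes -o jsonpath="{.items[*].status.allocatable['alpha\.kubernetes\.io/nvidia-gpu']}"
```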
For the DevicePlugin GPU Mode only, load the DevicePlugin daemonset for your specific environment. This is not required for the Legacy Accelerators GPU Mode.
```sh
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta5/nvidia-device-plugin.yml

# (Optional) Set a permissive toleration so the daemonset may run anywhere,
# in case you have tainted your GPU nodes.
kubectl patch daemonset nvidia-device-plugin-daemonset --namespace kube-system \
  -p '{ "spec": { "template": { "spec": { "tolerations": [ { "operator": "Exists" } ] } } } }'
```
Run a test pod that requests a GPU:

```sh
# CUDA 10.0
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: tf-gpu
spec:
  containers:
  - name: gpu
    image: tensorflow/tensorflow:2.0.1-gpu
    command: [ "/bin/bash", "-ce", "tail -f /dev/null" ]
    # ^ From 2.0.0 onwards, CMD is not set
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        memory: 1024Mi
        nvidia.com/gpu: 1 # requesting 1 GPU
EOF
```
```sh
# CUDA 9.1
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: tf-gpu
spec:
  containers:
  - name: gpu
    image: tensorflow/tensorflow:1.9.0-gpu
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        memory: 1024Mi
        # ^ Set memory in case default limits are set low
        nvidia.com/gpu: 1 # requesting 1 GPU
        # ^ For Legacy Accelerators mode this key must be renamed
        #   'alpha.kubernetes.io/nvidia-gpu'
  tolerations:
  # This toleration will allow the pod to run anywhere.
  # By default this is permissive in case you have tainted your GPU nodes.
  - operator: "Exists"
  # ^ If you have a specific taint to target, comment out the above and
  #   modify the example below.
  ### Example tolerations
  # - key: "dedicated"
  #   operator: "Equal"
  #   value: "gpu"
  #   effect: "NoExecute"
EOF
```
```sh
# Check that nodes are detected to have GPUs
kubectl describe nodes | grep -E 'gpu:\s.*[1-9]'

# Check the logs of the TensorFlow container to ensure that it ran
kubectl logs tf-gpu  # no output since 2.0.0

# Show GPU info from within the pod (DevicePlugin mode only)
kubectl exec -it tf-gpu -- nvidia-smi

# Show that TensorFlow detects GPUs from within the pod (DevicePlugin mode only)
kubectl exec -it tf-gpu -- \
  python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())'
```
Finally, clean up:

```sh
kubectl delete pod/tf-gpu
kubectl delete -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta5/nvidia-device-plugin.yml
kops delete cluster --name gpu.example.k8s.local --yes
```