## ⚠️ Repo Archive Notice

As of Nov 13, 2020, charts in this repo will no longer be updated. For more information, see the Helm Charts Deprecation and Archive Notice, and Update.

# Hadoop Chart

Hadoop is a framework for running large scale distributed applications.

This chart is primarily intended for YARN and MapReduce job execution, where HDFS is used only to transport small artifacts within the framework, not as a distributed filesystem. Data should be read from cloud-based datastores such as Google Cloud Storage, S3, or Swift.

## DEPRECATION NOTICE

This chart is deprecated and no longer supported.

## Chart Details

## Installing the Chart

To install the chart with the release name hadoop that utilizes 50% of the available node resources:

```shell
$ helm install --name hadoop $(stable/hadoop/tools/calc_resources.sh 50) stable/hadoop
```

Note that you need at least 2GB of free memory per NodeManager pod; if your cluster isn't large enough, not all pods will be scheduled.

The optional `calc_resources.sh` script is a convenience helper that sets `yarn.numNodes` and `yarn.nodeManager.resources` appropriately to utilize all nodes in the Kubernetes cluster and a given percentage of their resources. For example, with a 3-node n1-standard-4 GKE cluster and an argument of `50`, this would create 3 NodeManager pods claiming 2 cores and 7.5Gi of memory each.
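The arithmetic the helper performs can be sketched as follows. This is a hypothetical illustration, not the real script: the actual `calc_resources.sh` discovers node counts and sizes from the cluster, whereas here they are hard-coded to match the n1-standard-4 example above.

```shell
#!/bin/sh
# Sketch of the calc_resources.sh arithmetic (assumed, not the real script).
PERCENT=50          # argument: percentage of each node's resources to claim
NODES=3             # assumed cluster size (the real script queries the cluster)
NODE_CORES=4        # n1-standard-4: 4 vCPUs
NODE_MEM_MI=15360   # n1-standard-4: 15Gi of memory, expressed in Mi

# Per-NodeManager claims at the requested percentage.
CPU_CLAIM=$(( NODE_CORES * PERCENT / 100 ))    # -> 2 cores
MEM_CLAIM=$(( NODE_MEM_MI * PERCENT / 100 ))   # -> 7680 Mi (7.5Gi)

# Emit flags in the shape helm install expects.
echo "--set yarn.numNodes=${NODES}" \
     "--set yarn.nodeManager.resources.requests.cpu=${CPU_CLAIM}" \
     "--set yarn.nodeManager.resources.requests.memory=${MEM_CLAIM}Mi"
```

With these inputs the script prints flags claiming 2 cores and 7680Mi per NodeManager, matching the worked example in the paragraph above.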

## Persistence

To install the chart with persistent volumes:

```shell
$ helm install --name hadoop $(stable/hadoop/tools/calc_resources.sh 50) \
  --set persistence.nameNode.enabled=true \
  --set persistence.nameNode.storageClass=standard \
  --set persistence.dataNode.enabled=true \
  --set persistence.dataNode.storageClass=standard \
  stable/hadoop
```

Change the value of `storageClass` to match your volume driver; `standard` works for Google Container Engine clusters.

## Configuration

The following table lists the configurable parameters of the Hadoop chart and their default values.

| Parameter | Description | Default |
| --------- | ----------- | ------- |
| `image.repository` | Hadoop image (source) | `danisla/hadoop` |
| `image.tag` | Hadoop image tag | `2.9.0` |
| `image.pullPolicy` | Pull policy for the images | `IfNotPresent` |
| `hadoopVersion` | Version of hadoop libraries being used | `2.9.0` |
| `antiAffinity` | Pod antiaffinity, `hard` or `soft` | `hard` |
| `hdfs.nameNode.pdbMinAvailable` | PDB for HDFS NameNode | `1` |
| `hdfs.nameNode.resources` | Resources for the HDFS NameNode | `requests:memory=256Mi,cpu=10m,limits:memory=2048Mi,cpu=1000m` |
| `hdfs.dataNode.replicas` | Number of HDFS DataNode replicas | `1` |
| `hdfs.dataNode.pdbMinAvailable` | PDB for HDFS DataNode | `1` |
| `hdfs.dataNode.resources` | Resources for the HDFS DataNode | `requests:memory=256Mi,cpu=10m,limits:memory=2048Mi,cpu=1000m` |
| `hdfs.webhdfs.enabled` | Enable WebHDFS REST API | `false` |
| `yarn.resourceManager.pdbMinAvailable` | PDB for the YARN ResourceManager | `1` |
| `yarn.resourceManager.resources` | Resources for the YARN ResourceManager | `requests:memory=256Mi,cpu=10m,limits:memory=2048Mi,cpu=1000m` |
| `yarn.nodeManager.pdbMinAvailable` | PDB for the YARN NodeManager | `1` |
| `yarn.nodeManager.replicas` | Number of YARN NodeManager replicas | `2` |
| `yarn.nodeManager.parallelCreate` | Create all NodeManager StatefulSet pods in parallel (K8s 1.7+) | `false` |
| `yarn.nodeManager.resources` | Resource limits and requests for YARN NodeManager pods | `requests:memory=2048Mi,cpu=1000m,limits:memory=2048Mi,cpu=1000m` |
| `persistence.nameNode.enabled` | Enable/disable persistent volume | `false` |
| `persistence.nameNode.storageClass` | Name of the StorageClass to use per your volume provider | `-` |
| `persistence.nameNode.accessMode` | Access mode for the volume | `ReadWriteOnce` |
| `persistence.nameNode.size` | Size of the volume | `50Gi` |
| `persistence.dataNode.enabled` | Enable/disable persistent volume | `false` |
| `persistence.dataNode.storageClass` | Name of the StorageClass to use per your volume provider | `-` |
| `persistence.dataNode.accessMode` | Access mode for the volume | `ReadWriteOnce` |
| `persistence.dataNode.size` | Size of the volume | `200Gi` |
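Instead of passing many `--set` flags, the parameters above can be collected in a values file and passed to `helm install -f`. The sketch below writes a hypothetical override file (the file name `hadoop-values.yaml` and the chosen values are illustrative; the parameter paths come from the table above):

```shell
# Write an override file using parameter paths from the configuration table.
cat > hadoop-values.yaml <<'EOF'
yarn:
  nodeManager:
    replicas: 4          # scale up from the default of 2
hdfs:
  webhdfs:
    enabled: true        # expose the WebHDFS REST API
persistence:
  nameNode:
    enabled: true
    storageClass: standard
EOF

# Then install with (requires a running cluster and Helm v2):
#   helm install --name hadoop -f hadoop-values.yaml stable/hadoop
```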

The Zeppelin Notebook chart can use this chart's Hadoop ConfigMap to connect to the cluster and run jobs with the YARN executor:

```shell
helm install --set hadoop.useConfigMap=true stable/zeppelin
```
