docs/setup/deployment/spark_cluster_mode.md
{% include JB/setup %}
Apache Spark supports three cluster manager types: Standalone, Apache Mesos, and Hadoop YARN. This document guides you through building and configuring each of the three Spark cluster manager types with Apache Zeppelin using Docker scripts, so install Docker on your machine first.
Spark standalone is a simple cluster manager included with Spark that makes it easy to set up a cluster. You can set up a Spark standalone environment with the steps below.
Note: Since Apache Zeppelin and Spark use the same port `8080` for their web UIs, you might need to change `zeppelin.server.port` in `conf/zeppelin-site.xml`.
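For example, the relevant property in `conf/zeppelin-site.xml` might look like the sketch below (the property name comes from the note above; the port value `8180` is just an arbitrary free port picked for illustration):

```xml
<property>
  <name>zeppelin.server.port</name>
  <value>8180</value>
</property>
```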
You can find the docker script files under `scripts/docker/spark-cluster-managers`.
```bash
cd $ZEPPELIN_HOME/scripts/docker/spark-cluster-managers/spark_standalone
docker build -t "spark_standalone" .
```

```bash
docker run -it \
-p 8080:8080 \
-p 7077:7077 \
-p 8888:8888 \
-p 8081:8081 \
-h sparkmaster \
--name spark_standalone \
spark_standalone bash;
```
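Before wiring up Zeppelin, you can probe the published ports from the host with a small bash sketch (this helper loop is an illustration for this guide, not part of the docker scripts; it relies on bash's `/dev/tcp` redirection):

```bash
# Probe the ports published by the spark_standalone container; "open" means
# something is listening (Spark master UI 8080, master RPC 7077, worker UI 8081).
for port in 8080 7077 8081; do
  if (exec 3<>"/dev/tcp/localhost/$port") 2>/dev/null; then
    echo "port $port: open"
  else
    echo "port $port: closed"
  fi
done
```

If a port reports `closed`, give the container a few seconds to finish starting the Spark daemons and try again.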
Note that the `sparkmaster` hostname used here to run the docker container should be defined in your `/etc/hosts`.

Set the Spark master as `spark://<hostname>:7077` in the Zeppelin Interpreters setting page.

After running a single paragraph with the Spark interpreter in Zeppelin, browse `http://<hostname>:8080` and check whether the Spark cluster is running well.

You can also verify that Spark is running in the container with the command below.

```bash
ps -ef | grep spark
```
You can set up a Spark on YARN docker environment with the steps below.
Note: Since Apache Zeppelin and Spark use the same port `8080` for their web UIs, you might need to change `zeppelin.server.port` in `conf/zeppelin-site.xml`.

You can find the docker script files under `scripts/docker/spark-cluster-managers`.
```bash
cd $ZEPPELIN_HOME/scripts/docker/spark-cluster-managers/spark_yarn_cluster
docker build -t "spark_yarn" .
```

```bash
docker run -it \
-p 5000:5000 \
-p 9000:9000 \
-p 9001:9001 \
-p 8088:8088 \
-p 8042:8042 \
-p 8030:8030 \
-p 8031:8031 \
-p 8032:8032 \
-p 8033:8033 \
-p 8080:8080 \
-p 7077:7077 \
-p 8888:8888 \
-p 8081:8081 \
-p 50010:50010 \
-p 50075:50075 \
-p 50020:50020 \
-p 50070:50070 \
--name spark_yarn \
-h sparkmaster \
spark_yarn bash;
```
Note that the `sparkmaster` hostname used here to run the docker container should be defined in your `/etc/hosts`.

You can verify that the Spark and YARN processes are running in the container with the command below.

```bash
ps -ef
```

You can also check the web UI of each application: HDFS on `http://<hostname>:50070/`, YARN on `http://<hostname>:8088/cluster`, and Spark on `http://<hostname>:8080/`.
Set the following configurations in `conf/zeppelin-env.sh`.

```bash
export HADOOP_CONF_DIR=[your_hadoop_conf_path]
export SPARK_HOME=[your_spark_home_path]
```
`HADOOP_CONF_DIR` (the Hadoop configuration path) is defined in `/scripts/docker/spark-cluster-managers/spark_yarn_cluster/hdfs_conf`.
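For instance, with the `spark_yarn` docker image the values might look like the sketch below (the `HADOOP_CONF_DIR` value matches the `hdfs_conf` path mentioned above; the `SPARK_HOME` value is a hypothetical install path, so verify the actual location inside your container):

```bash
# HADOOP_CONF_DIR matches the hdfs_conf path shipped with the docker scripts;
# SPARK_HOME below is a hypothetical install path - check your image.
export HADOOP_CONF_DIR=/scripts/docker/spark-cluster-managers/spark_yarn_cluster/hdfs_conf
export SPARK_HOME=/usr/local/spark
```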
Don't forget to set `spark.master` to `yarn-client` in the Zeppelin Interpreters setting page.

After running a single paragraph with the Spark interpreter in Zeppelin, browse `http://<hostname>:8088/cluster/apps` and check whether the Zeppelin application is running well.
You can set up a Spark on Mesos docker environment with the steps below.
```bash
cd $ZEPPELIN_HOME/scripts/docker/spark-cluster-managers/spark_mesos
docker build -t "spark_mesos" .
```

```bash
docker run --net=host -it \
-p 8080:8080 \
-p 7077:7077 \
-p 8888:8888 \
-p 8081:8081 \
-p 8082:8082 \
-p 5050:5050 \
-p 5051:5051 \
-p 4040:4040 \
-h sparkmaster \
--name spark_mesos \
spark_mesos bash;
```
Note that the `sparkmaster` hostname used here to run the docker container should be defined in your `/etc/hosts`.

You can verify that the Spark and Mesos processes are running in the container with the command below.

```bash
ps -ef
```

You can also check the web UI of each application: Mesos on `http://<hostname>:5050/cluster` and Spark on `http://<hostname>:8080/`.
Set the following configurations in `conf/zeppelin-env.sh`.

```bash
export MESOS_NATIVE_JAVA_LIBRARY=[PATH OF libmesos.so]
export SPARK_HOME=[PATH OF SPARK HOME]
```

Don't forget to set `spark.master` to `mesos://127.0.1.1:5050` in the Zeppelin Interpreters setting page.
After running a single paragraph with the Spark interpreter in Zeppelin, browse `http://<hostname>:5050/#/frameworks` and check whether the Zeppelin application is running well.
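If the `127.0.1.1` address looks surprising, you can inspect how your container resolves names with a quick check (a hedged sketch; Debian-based images commonly map the hostname to `127.0.1.1` in `/etc/hosts`, which is where that master address comes from):

```bash
# Print the hostname and the loopback entries in /etc/hosts; check whether
# the hostname maps to 127.0.0.1 or 127.0.1.1 before choosing spark.master.
hostname
grep '^127\.' /etc/hosts
```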
If you have a problem with the hostname, use the `--add-host` option when executing `docker run`:

```
## use `--add-host=moby:127.0.0.1` option to resolve
## since docker container couldn't resolve `moby`

: java.net.UnknownHostException: moby: moby: Name or service not known
    at java.net.InetAddress.getLocalHost(InetAddress.java:1496)
    at org.apache.spark.util.Utils$.findLocalInetAddress(Utils.scala:789)
    at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress$lzycompute(Utils.scala:782)
    at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress(Utils.scala:782)
```
If you have a problem with the Mesos master, try `mesos://127.0.0.1` instead of `mesos://127.0.1.1`:

```
I0103 20:17:22.329269 340 sched.cpp:330] New master detected at [email protected]:5050
I0103 20:17:22.330749 340 sched.cpp:341] No credentials provided. Attempting to register without authentication
W0103 20:17:22.333531 340 sched.cpp:736] Ignoring framework registered message because it was sent from '[email protected]:5050' instead of the leading master '[email protected]:5050'
W0103 20:17:24.040252 339 sched.cpp:736] Ignoring framework registered message because it was sent from '[email protected]:5050' instead of the leading master '[email protected]:5050'
W0103 20:17:26.150250 339 sched.cpp:736] Ignoring framework registered message because it was sent from '[email protected]:5050' instead of the leading master '[email protected]:5050'
W0103 20:17:26.737604 339 sched.cpp:736] Ignoring framework registered message because it was sent from '[email protected]:5050' instead of the leading master '[email protected]:5050'
W0103 20:17:35.241714 336 sched.cpp:736] Ignoring framework registered message because it was sent from '[email protected]:5050' instead of the leading master '[email protected]:5050'
```