How I Got Apache Spark to Sort Of (Not Really) Work on my PicoCluster of 5 Raspberry PI
I've read several blog posts about people running Apache Spark on a Raspberry Pi. It didn't seem too hard, so I thought I'd have a go at it. But the results were disappointing. Bear in mind that I am a Spark novice, so I probably have some setting wrong. I ran into two issues: memory and heartbeats.
So, this is what I did.
I based my work on these pages:
* https://darrenjw2.wordpress.com/2015/04/17/installing-apache-spark-on-a-raspberry-pi-2/
* https://darrenjw2.wordpress.com/2015/04/18/setting-up-a-standalone-apache-spark-cluster-of-raspberry-pi-2/
* http://www.openkb.info/2014/11/memory-settings-for-spark-standalone_27.html
I created five SD cards according to my previous blog post (see http://affy.blogspot.com/2016/06/how-did-i-prepare-my-picocluster-for.html).
Installation of Apache Spark
* install Oracle Java and Python
for i in `seq 1 5`; do (ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi0${i}.local sudo apt-get install -y oracle-java8-jdk python2.7 &); done
* download Spark
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz
* Copy Spark to all RPIs
for i in `seq 1 5`; do (scp -q -oStrictHostKeyChecking=no -oCheckHostIP=no spark-1.6.2-bin-hadoop2.6.tgz pirate@pi0${i}.local:. && echo "Copy complete to pi0${i}" &); done
* Uncompress Spark
for i in `seq 1 5`; do (ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi0${i}.local tar xfz spark-1.6.2-bin-hadoop2.6.tgz && echo "Uncompress complete to pi0${i}" &); done
* Remove tgz file
for i in `seq 1 5`; do (ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi0${i}.local rm spark-1.6.2-bin-hadoop2.6.tgz); done
* Add the following to your .bashrc file on each RPI. I couldn't figure out how to put this into a loop (a possible sketch follows the export line).
export SPARK_LOCAL_IP="$(ip route get 1 | awk '{print $NF;exit}')"
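Here is an untested sketch of how it might be looped: the quoted here-document should keep the $(...) from being expanded locally before it reaches each Pi.
for i in `seq 1 5`; do
  ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi0${i}.local "cat >> ~/.bashrc" <<'EOF'
export SPARK_LOCAL_IP="$(ip route get 1 | awk '{print $NF;exit}')"
EOF
done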
* Run Standalone Spark Shell
ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi01.local
cd spark-1.6.2-bin-hadoop2.6
bin/run-example SparkPi 10
bin/spark-shell --master local[4]
# This takes several minutes to display a prompt.
# While the shell is running, visit http://pi01.local:4040/
scala> sc.textFile("README.md").count
# After the job is complete, visit the monitor page.
scala> exit
* Run PySpark Shell
bin/pyspark --master local[4]
>>> sc.textFile("README.md").count()
>>> exit()
CLUSTER
Now for the clustering...
* Enable password-less SSH between nodes
ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi01.local
for i in `seq 1 5`; do avahi-resolve --name pi0${i}.local -4 | awk ' { t = $1; $1 = $2; $2 = t; print; } ' | sudo tee --append /etc/hosts; done
echo "$(ip route get 1 | awk '{print $NF;exit}') $(hostname).local" | sudo tee --append /etc/hosts
ssh-keygen
for i in `seq 1 5`; do ssh-copy-id pirate@pi0${i}.local; done
* Configure Spark for Cluster
cd spark-1.6.2-bin-hadoop2.6/conf
Create a slaves file with the following contents (or generate it with the one-liner shown after the list):
pi01.local
pi02.local
pi03.local
pi04.local
pi05.local
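The one-liner, for reference; it writes exactly the five lines above:
for i in `seq 1 5`; do echo "pi0${i}.local"; done > slaves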
cp spark-env.sh.template spark-env.sh
In spark-env.sh:
Set SPARK_MASTER_IP to the result of "ip route get 1 | awk '{print $NF;exit}'"
SPARK_WORKER_MEMORY=512m
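For reference, the edited spark-env.sh should end up looking roughly like this (the address is whatever the command returns on your master node; on my network it was 192.168.1.8):
# spark-env.sh
SPARK_MASTER_IP=192.168.1.8
SPARK_WORKER_MEMORY=512m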
* Copy the Spark environment script to the other RPIs
for i in `seq 2 5`; do scp spark-env.sh pirate@pi0${i}.local:spark-1.6.2-bin-hadoop2.6/conf/; done
* Start the cluster
cd ..
sbin/start-all.sh
* Visit the monitor page
http://192.168.1.8:8080
And everything is working so far! But ...
* Start a Spark Shell
bin/spark-shell --executor-memory 500m --driver-memory 500m --master spark://pi01.local:7077 --conf spark.executor.heartbeatInterval=45s
And this fails...
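If you want to keep experimenting, the knobs that seem most relevant to the memory and heartbeat problems are the executor/driver memory and the heartbeat and network timeouts. A sketch of what one might try next (unverified; I never got it stable, and spark.network.timeout should stay larger than the heartbeat interval):
bin/spark-shell --executor-memory 400m --driver-memory 400m \
  --master spark://pi01.local:7077 \
  --conf spark.executor.heartbeatInterval=60s \
  --conf spark.network.timeout=600s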