How I Got Apache Spark to Sort Of (Not Really) Work on my PicoCluster of 5 Raspberry PI
I've read several blog posts about people running Apache Spark on a Raspberry Pi. It didn't seem too hard, so I thought I'd have a go at it. But the results were disappointing. Bear in mind that I am a Spark novice, so I probably have some setting wrong. I ran into two issues: memory and heartbeats.
So, this is what I did.
I based my work on these pages:
* https://darrenjw2.wordpress.com/2015/04/17/installing-apache-spark-on-a-raspberry-pi-2/
* https://darrenjw2.wordpress.com/2015/04/18/setting-up-a-standalone-apache-spark-cluster-of-raspberry-pi-2/
* http://www.openkb.info/2014/11/memory-settings-for-spark-standalone_27.html
I created five SD cards according to my previous blog post (see http://affy.blogspot.com/2016/06/how-did-i-prepare-my-picocluster-for.html).
Installation of Apache Spark
* install Oracle Java and Python
for i in `seq 1 5`; do (ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi0${i}.local sudo apt-get install -y oracle-java8-jdk python2.7 &); done
* download Spark
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz
* Copy Spark to all RPIs
for i in `seq 1 5`; do (scp -q -oStrictHostKeyChecking=no -oCheckHostIP=no spark-1.6.2-bin-hadoop2.6.tgz pirate@pi0${i}.local:. && echo "Copy complete to pi0${i}" &); done
* Uncompress Spark
for i in `seq 1 5`; do (ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi0${i}.local tar xfz spark-1.6.2-bin-hadoop2.6.tgz && echo "Uncompress complete to pi0${i}" &); done
* Remove tgz file
for i in `seq 1 5`; do (ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi0${i}.local rm spark-1.6.2-bin-hadoop2.6.tgz); done
* Add the following to your .bashrc file on each RPI. I couldn't figure out how to put this into a loop (a possible sketch follows the export line).
export SPARK_LOCAL_IP="$(ip route get 1 | awk '{print $NF;exit}')"
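Here is an untested sketch of how it might be looped: the quoted here-document should keep the $(...) from being expanded locally before it reaches each Pi.
for i in `seq 1 5`; do
  ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi0${i}.local "cat >> ~/.bashrc" <<'EOF'
export SPARK_LOCAL_IP="$(ip route get 1 | awk '{print $NF;exit}')"
EOF
done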
* Run Standalone Spark Shell
ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi01.local
cd spark-1.6.2-bin-hadoop2.6
bin/run-example SparkPi 10
bin/spark-shell --master local[4]
# This takes several minutes to display a prompt.
# While the shell is running, visit http://pi01.local:4040/
scala> sc.textFile("README.md").count
# After the job is complete, visit the monitor page.
scala> exit
* Run PySpark Shell
bin/pyspark --master local[4]
>>> sc.textFile("README.md").count()
>>> exit()
CLUSTER
Now for the clustering...
* Enable password-less SSH between nodes
ssh -oStrictHostKeyChecking=no -oCheckHostIP=no pirate@pi01.local
for i in `seq 1 5`; do avahi-resolve --name pi0${i}.local -4 | awk ' { t = $1; $1 = $2; $2 = t; print; } ' | sudo tee --append /etc/hosts; done
echo "$(ip route get 1 | awk '{print $NF;exit}') $(hostname).local" | sudo tee --append /etc/hosts
ssh-keygen
for i in `seq 1 5`; do ssh-copy-id pirate@pi0${i}.local; done
* Configure Spark for Cluster
cd spark-1.6.2-bin-hadoop2.6/conf
Create a slaves file with the following contents (or generate it with the one-liner shown after the list):
pi01.local
pi02.local
pi03.local
pi04.local
pi05.local
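The one-liner, for reference; it writes exactly the five lines above:
for i in `seq 1 5`; do echo "pi0${i}.local"; done > slaves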
cp spark-env.sh.template spark-env.sh
In spark-env.sh:
Set SPARK_MASTER_IP to the result of "ip route get 1 | awk '{print $NF;exit}'"
SPARK_WORKER_MEMORY=512m
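For reference, the edited spark-env.sh should end up looking roughly like this (the address is whatever the command returns on your master node; on my network it was 192.168.1.8):
# spark-env.sh
SPARK_MASTER_IP=192.168.1.8
SPARK_WORKER_MEMORY=512m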
* Copy the Spark environment script to the other RPIs
for i in `seq 2 5`; do scp spark-env.sh pirate@pi0${i}.local:spark-1.6.2-bin-hadoop2.6/conf/; done
* Start the cluster
cd ..
sbin/start-all.sh
* Visit the monitor page
http://192.168.1.8:8080
And everything is working so far! But ...
* Start a Spark Shell
bin/spark-shell --executor-memory 500m --driver-memory 500m --master spark://pi01.local:7077 --conf spark.executor.heartbeatInterval=45s
And this fails...
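If you want to keep experimenting, the knobs that seem most relevant to the memory and heartbeat problems are the executor/driver memory and the heartbeat and network timeouts. A sketch of what one might try next (unverified; I never got it stable, and spark.network.timeout should stay larger than the heartbeat interval):
bin/spark-shell --executor-memory 400m --driver-memory 400m \
  --master spark://pi01.local:7077 \
  --conf spark.executor.heartbeatInterval=60s \
  --conf spark.network.timeout=600s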