Lead Developer Phil Kendall on getting started with Spark on EMR.
In June, Spark, the up-and-coming big data processing framework, became a first-class citizen on Amazon Elastic MapReduce (EMR). Last month, Amazon announced EMR release 4.0.0, which “brings many changes to the platform”. However, some of those changes lead to a couple of “gotchas” when trying to run Spark on EMR, so this post is a quick walk-through of the issues I found when getting started with Spark on EMR and the (mostly!) solutions to those issues.
Running the demo
Jon Fritz’s blog post announcing the availability of Spark on EMR contained a nice, simple example of getting a Spark application up and running on EMR. Unfortunately, if you try to run through that demo on the EMR 4.0.0 release, you get an error when trying to fetch the flightsample jar from S3:
Exception in thread "main" java.lang.RuntimeException: Local file does not exist.
    at com.amazon.elasticmapreduce.scriptrunner.ScriptRunner.fetchFile(ScriptRunner.java:30)
This one turns out to be not too hard to fix – the EMR 4.0.0 release has simply moved the location of the hdfs utility so that it’s now on the normal PATH rather than being installed in the hadoop user’s home directory. That can be fixed trivially by removing the absolute path, but while we’re in the area, we can also upgrade to the new command-runner rather than script-runner. Once you’ve made both of those changes, the Custom JAR step should look like this:
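As a rough sketch (the class and jar names below are placeholders, not the demo’s actual values – check Jon Fritz’s post for the real ones), the updated Custom JAR step fields would be along these lines:

```
JAR location: command-runner.jar
Arguments:    spark-submit --deploy-mode cluster --class <your main class> s3://<bucket>/<application>.jar
```

The key differences from the original demo are that the jar is command-runner.jar rather than script-runner.jar, and that the command being run is referenced by name on the PATH rather than by an absolute path under /home/hadoop.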
…and you can then happily run through the rest of the demo.
Spark Streaming on Elastic MapReduce
The next thing you might try is to get Spark Streaming running on EMR. On the face of it, this looks nice and easy – just push the jar containing your streaming application onto the cluster and away you go. And your application starts… and then just sits there, steadfastly refusing to do anything at all. Experienced Spark Streaming folk will quite possibly recognise this as a symptom of the executors not having enough cores to run their workloads – each receiver you create occupies a core, so you need to ensure there are enough cores in your cluster to run both the receivers and the processing of the data. You’d hope this wouldn’t be a problem, as the m3.xlarge instances you get by default when creating an EMR cluster have 4 cores each, so there must be something else going on here.
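The arithmetic behind that “sits there doing nothing” symptom can be sketched out in a few lines (the function here is purely illustrative; the defaults assumed are Spark-on-YARN’s two executors with one core each):

```python
# Back-of-the-envelope core budget for a Spark Streaming job:
# each receiver pins one executor core, and only what's left over
# is available to actually process the received batches.

def cores_left_for_processing(executors, cores_per_executor, receivers):
    """Cores remaining for batch processing after receivers take theirs."""
    return executors * cores_per_executor - receivers

# Default Spark-on-YARN allocation: two executors, one core each.
# With two receivers, nothing is left to process data - the job stalls.
print(cores_left_for_processing(executors=2, cores_per_executor=1, receivers=2))  # 0

# With all 4 cores of each m3.xlarge available to its executor:
print(cores_left_for_processing(executors=2, cores_per_executor=4, receivers=2))  # 6
```

Zero cores left for processing is exactly the silent-stall behaviour described above: the receivers run, but no batch ever gets computed.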
The issue here turns out to be the default Spark configuration when running on YARN, which is what EMR uses for its cluster management – by default, each executor is allocated only one core, so your nice cluster with two 4-core machines was actually sitting there with three quarters of its processors doing nothing. Getting around this is what the “-x” option mentioned in Jon Fritz’s blog post did – it ensured that Spark used all the available resources on the cluster – but that option isn’t available with EMR 4.0.0. The equivalent for the new version is mentioned in the “Additional EMR Configuration Options for Spark” section of the EMR 4.0.0 announcement: you need to set the “maximizeResourceAllocation” property. To do that, select “Go to advanced options” when creating the cluster, expand the “Edit software settings (optional)” section and then add in the appropriate configuration string: “classification=spark,properties=[maximizeResourceAllocation=true]”. This does unfortunately mean that the “quick options” path for creating a cluster is pretty much useless when using Spark, as you’re always going to want to set this option or a variant of it.
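If you’re creating the cluster from the AWS CLI rather than the console, the same setting can be supplied as a JSON configuration object in EMR 4.x’s standard configuration format (a sketch of the equivalent of the console string above):

```json
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
```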
Getting to the Spark web UI
When you’re running a Spark application, you may well be used to using the Spark web UI to keep an eye on your job. However, getting to the web UI on an EMR cluster isn’t as easy as it might appear at first glance. You can happily point your web browser to http://&lt;cluster master DNS address&gt;:4040/ as usual, but that returns a redirect to http://ip-&lt;numbers&gt;.&lt;region&gt;.compute.internal:20888/proxy/application_&lt;n&gt;_&lt;n&gt;/ containing a reference to the internal DNS name of the machine, which isn’t too helpful if you’re outside the VPC in which the cluster is running. I haven’t found a perfect solution to this one yet, but you can simply replace “ip-&lt;numbers&gt;.&lt;region&gt;.compute.internal” with the external DNS name of the master – so you’re pointing at something like http://&lt;cluster master DNS address&gt;:20888/proxy/application_&lt;n&gt;_&lt;n&gt;/ – and from there you can happily browse around the web UI.
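That hostname swap is mechanical enough to script if you’re doing it often. A minimal sketch (the helper name and the example hostnames are mine, not anything AWS provides – it just replaces the host part of the redirect URL while keeping the port and proxy path intact):

```python
from urllib.parse import urlsplit, urlunsplit

def externalise_proxy_url(redirect_url, master_public_dns):
    """Swap the internal EMR hostname in a YARN proxy redirect for the
    cluster master's public DNS name, preserving the port and path."""
    parts = urlsplit(redirect_url)
    netloc = f"{master_public_dns}:{parts.port}" if parts.port else master_public_dns
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

# Example with made-up hostnames:
url = "http://ip-10-0-0-1.eu-west-1.compute.internal:20888/proxy/application_1_1/"
print(externalise_proxy_url(url, "ec2-52-0-0-1.eu-west-1.compute.amazonaws.com"))
# -> http://ec2-52-0-0-1.eu-west-1.compute.amazonaws.com:20888/proxy/application_1_1/
```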
Onward and upward
With all that, I’m pretty much up and running with Spark on Elastic MapReduce 4. Now it’s back to the actual Spark applications again…