A Sparkling View from the Canals

Control F1 sent Lead Developer Phil Kendall and Senior Developer Kevin Wood over to Amsterdam for the first European edition of Spark Summit. Here’s their summary of the conference.

One of the themes from Strata + Hadoop World in London earlier this year was the rise of Apache Spark as the new darling of the big data processing world. If anything, that trend has accelerated since May, but it has also shifted direction slightly – while the majority of the companies talking about Spark at Strata + Hadoop World were innovative, disruptive small businesses, at Spark Summit there were a lot of big enterprises who were either building their big data infrastructure on Spark or migrating it from “classical” Hadoop MapReduce to Spark. From a business point of view, that’s probably the headline of the conference, but here are some of the more technical bits:

The Future of MapReduce

MapReduce is dead. It’s going to hang on for a few years yet because of the number of production deployments out there, but I don’t think you would have been able to find anyone at the conference who was planning to use MapReduce for any new deployments. Of course, it should be remembered that this was the Spark Summit, so it won’t have been a representative sample, but when some of the biggest players in the big data space, like Cloudera and Hortonworks, are jumping on the bandwagon, I certainly think this is the way things are going.

In consequence, the Lambda Architecture is on its way out as well. Nobody ever really liked having to maintain two entirely separate systems for processing their data, but at the time there really wasn’t a better way. This is a movement which started to gain momentum with Jay Kreps’ “Questioning the Lambda Architecture” article last year, but now that we have an enterprise-ready framework which can handle both the streaming and batch sides of the processing coin, it’s time to move on to something with less overhead – quite possibly the SMACK stack of Spark, Mesos, Akka, Cassandra and Kafka, something which Helena Edelson implored us to do during her talk. Just hope your kids don’t go around saying “My mum/dad works with SMACK”.
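The appeal of unifying the two sides is easy to see: under the Lambda Architecture you maintain one code path for batch and another for streaming, and the two inevitably drift apart. A unified engine lets a single piece of transformation logic serve both. Here’s a rough sketch of the idea in plain Scala – deliberately not the real Spark APIs, so all the names below are illustrative only:

```scala
// The core business logic is written once, independent of how the data arrives.
def countWords(lines: Seq[String]): Map[String, Int] =
  lines
    .flatMap(_.split("\\s+"))
    .filterNot(_.isEmpty)
    .groupBy(identity)
    .map { case (word, occurrences) => (word, occurrences.size) }

// Batch path: run over the full historical dataset in one go.
val batch = Seq("spark beats mapreduce", "spark streams too")
val batchCounts = countWords(batch)

// Streaming path: the same function applied to successive micro-batches,
// merging each result into a running total (conceptually what Spark
// Streaming's stateful operations do for you at scale).
val microBatches = Seq(Seq("spark beats mapreduce"), Seq("spark streams too"))
val streamCounts = microBatches.foldLeft(Map.empty[String, Int]) { (acc, mb) =>
  countWords(mb).foldLeft(acc) { case (m, (w, c)) =>
    m.updated(w, m.getOrElse(w, 0) + c)
  }
}

// Both paths agree, with no duplicated logic to keep in sync.
assert(batchCounts == streamCounts)
```

The point isn’t the word count itself but the shape: one function, two delivery mechanisms, zero duplicated business logic.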

The Future of Languages for Spark

Scala is the language for interacting with Spark. While the conference was pretty much split down the middle between folks using Scala and folks using Python, the direction the Spark world is heading was perhaps most obviously demonstrated by the spontaneous round of applause Vincent Saulys got for his “Please use Scala!” comment during his keynote presentation. The theme was very much that while people were moving from Python to Scala, nobody was going the other way. The newcomer on the block, meanwhile, is SparkR, which has the potential to open up Spark to the large community of data scientists out there who already know R. The support in Spark 1.5 probably isn’t quite there yet to really open that door, but improvements are coming in Spark 1.6, and the developers are actively looking for feedback from the R community as to which features should be a priority – so it won’t be long before you see a lot of people using Spark from R.

The Future of Spark APIs

DataFrames are the future for Spark applications. As with MapReduce, nobody’s going to kill off the low-level way of working directly with resilient distributed datasets (RDDs), but the DataFrames API (which is essentially equivalent to Spark SQL) is going to be where a lot of the new work gets done. The major initiative here at the moment is Project Tungsten, which brings a whole host of optimisations at the DataFrame level. Why is Spark moving this way? Because it’s easier to optimise when you’re higher up the stack – if you have a holistic view of what the programmer is attempting to accomplish, you can generally optimise it a lot better than if you’re looking at the individual atomic operations (the maps, sorts, reduces and whatever else of RDDs). SQL showed the value of introducing a declarative language for “little” data problems in the 1970s and 1980s; will DataFrames be that solution for big data? Given their position in all of R, Python (via Pandas) and Spark, I’d be willing to bet a small amount of money on “yes”.
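To make the “holistic view” point concrete, here’s a toy sketch in plain Scala (no Spark involved, so treat the numbers and names as illustrative). When operations run exactly as written, an expensive step can do work that a later filter throws away; an optimiser that understands intent – as Spark SQL’s Catalyst optimiser does for DataFrames – can reorder them:

```scala
// Count how often the "expensive" per-record transformation runs.
var expensiveCalls = 0
def enrich(n: Int): (Int, Int) = { expensiveCalls += 1; (n, n * n) }

val data = (1 to 1000).toSeq

// RDD-style: the pipeline runs exactly as written, enriching all
// 1000 records before discarding most of them.
val naive = data.map(enrich).filter { case (n, _) => n % 100 == 0 }
val naiveCalls = expensiveCalls

// What a declarative optimiser can do instead: push the filter down
// so only the surviving records are enriched.
expensiveCalls = 0
val optimised = data.filter(_ % 100 == 0).map(enrich)
val optimisedCalls = expensiveCalls

assert(naive == optimised)          // same answer...
assert(optimisedCalls < naiveCalls) // ...for a fraction of the work
```

With RDDs, only the programmer can make that rearrangement; with DataFrames, the engine is free to, because you told it *what* you want rather than *how* to compute it.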

On a related topic, if you’ve done any work with Spark, you’ve probably seen the Spark stack diagram by now. However, I thought Reynold Xin’s “different take” on the diagram during his keynote was really powerful – as a developer, this expresses what matters to me – the APIs I can use to interact with Spark. To a very real extent, I don’t care what’s happening under the hood: I just need to know that the mightily clever folks contributing to Spark are working their black magic in that “backend” box which makes everything run super-fast.

The Future of Spark at Control F1

I don’t think it will come as a surprise to anyone who has been following our blog that we’re big fans of Spark here at Control F1. Don’t be surprised if you see it in some of our work in the near future 🙂

