How data and intelligence are not the same thing

Data is increasingly seen as a panacea for improving business performance and providing insight. Think: data on customers, employees, machines, logistics and so on. But let’s not forget that technology is merely a tool, and data is the raw material we mine. It’s what we do with that material that matters. Duncan Davies, Commercial Director for Notify Solutions, wants to see more organisations mining the data and engaging with it to dig out the diamonds of insight and innovation.

Not a day goes by without a media story about data. Sometimes it’s about the use and control of personal data (see Facebook), sometimes it’s about the theft of data (see Yahoo) and sometimes it’s about how blindly we trust ‘the data’ (see predictions on Brexit and Trump).

There’s a mistaken view that harvesting data is what it’s all about. That having the data in a cool database (ideally ‘up in the cloud’) will just throw out “answers”. Even better, if you can apply some ‘artificial intelligence’ to that data…that’s really sexy. Bringing in software tools that help gather all this data is seen as the end goal. The budget is spent. The software is deployed. And bang!

I’d argue that a lot of companies stop at that point and simply rely on the basic dashboards and standard reports they are given. Don’t get me wrong: some of the world’s greatest inventions have been stumbled upon completely by accident, or while the inventor was looking for something else (Steven Johnson calls this Serendipity and Exaptation in his great book ‘Where Good Ideas Come From’). But by and large, this pursuit of data for its own sake risks losing sight of the bigger fundamental: the ability to ask the right questions. We gather all this incredible data and it just sits there waiting. Containing diamonds that might never be found.

Asking the right questions drives what data you look to gather, and what you’re looking into that data for. Even better to then share that data within the organisation so that different perspectives can be applied. But you have to do this proactively; putting data in a glass cabinet and hoping people will glance at it as they pass isn’t going to work. It needs a cultural shift from measuring everything to measuring what matters.

As an innovative technology business, we’re obsessed with making data faster and easier to access (and it’s in the Cloud!). But there’s still a crucial step that businesses need to grasp, which is that data only equates to intelligence when it’s properly interpreted and delivered in a useable way.

This is a challenge to the sort of apps that are now being produced to collect data. Our Health & Safety app, Notify, is awesome at gathering really good data very quickly, and sending it to a ‘back end database’ for actioning and then into a funky dashboard for review. We’re working on predictive algorithms that will alert users to impending issues, but in truth without the skills to interpret data and information, our users still risk missing the opportunity to engage with the data, to use it to pose questions and to drive behavioural change. We can help (we’re lucky enough to have our own data scientist) but I’d argue it’s the company that best knows the questions to ask.

Taking the data, asking questions of it, and then making measurable changes based on what it tells you: now that’s intelligence.

So the next time you look at a product and get excited about all the data it can collect for you, make sure you ask yourself whether you have the organisational skills and resources to do something useful with it that truly helps your business improve and innovate.

Adventures in Spark on Elastic MapReduce 4

Lead Developer Phil Kendall on getting started with Spark on EMR.

In June, Spark, the up and coming big data processing framework, became a first class citizen on Amazon Elastic MapReduce (EMR). Last month, Amazon announced EMR release 4.0.0 which “brings many changes to the platform”. However, some of those changes lead to a couple of “gotchas” when trying to run Spark on EMR, so this post is a quick walk through the issues I found when getting started with Spark on EMR and (mostly!) solutions to those issues.

Running the demo

Jon Fritz’s blog post announcing the availability of Spark on EMR contained a nice simple example of getting a Spark application up and running on EMR. Unfortunately, if you try to run through that demo on the EMR 4.0.0 release, you get an error when trying to fetch the flightsample jar from S3:

Exception in thread "main" java.lang.RuntimeException: Local file does not exist.

This one turns out to be not too hard to fix – the EMR 4.0.0 release has just moved the location of the hdfs utility so it’s now on the normal PATH rather than being installed in the hadoop user’s home directory. That can trivially be fixed by just removing the absolute path, but while we’re in the area, we can also upgrade to using the new command-runner rather than script-runner. Once you’ve made both of those changes, the Custom JAR step should look like this:


…and you can then happily run through the rest of the demo.
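As a rough illustration of the two changes described above (no absolute path for hdfs, and command-runner instead of script-runner), here is a minimal sketch of such a step expressed as the dictionary shape the EMR API accepts for steps. The S3 path, the destination directory and the step name are placeholders of my own, not the values from the demo, and the `hdfs dfs -get` invocation is an assumption based on the description of the step.

```python
import json

# Placeholder values: substitute the jar location and destination from the demo.
JAR_S3_PATH = "s3://<bucket>/<path>/flightsample.jar"  # hypothetical path
LOCAL_DEST = "/mnt/"                                   # hypothetical destination


def build_fetch_step(s3_path, dest):
    """Build an EMR step definition that copies a jar from S3 via command-runner."""
    return {
        "Name": "Fetch application jar",  # hypothetical step name
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner replaces script-runner on EMR 4.x
            "Jar": "command-runner.jar",
            # hdfs is on the normal PATH in EMR 4.0.0, so no absolute path
            "Args": ["hdfs", "dfs", "-get", s3_path, dest],
        },
    }


step = build_fetch_step(JAR_S3_PATH, LOCAL_DEST)
print(json.dumps(step, indent=2))
```

In the console, the same information goes into the Custom JAR step's "JAR location" and "Arguments" fields.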

Spark Streaming on Elastic MapReduce

The next thing you might try is to get Spark Streaming running on EMR. On the face of it, this looks to be nice and easy – just push your jar containing the streaming application onto the cluster and away you go. And your application starts… and then just sits there, steadfastly refusing to do anything at all. Experienced Spark Streaming folk will quite possibly recognise this as a symptom of the executors not having enough cores to run their workloads – each receiver you create occupies a core, so you need to ensure that there are enough cores in your cluster to run the receivers and to process the data. To some extent, you’d hope this isn’t a problem as the m3.xlarge instances that you get by default when creating an EMR cluster each have 4 cores, so there must be something else going on here.
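The rule of thumb above (each receiver occupies a core, and at least one more core is needed to process the data) can be sketched as a quick sanity check. The function names are my own and this is only an illustration of the arithmetic, not anything Spark provides.

```python
def cores_left_for_processing(allocated_cores, receivers):
    """Cores remaining once each receiver has taken one core for itself."""
    return allocated_cores - receivers


def can_make_progress(allocated_cores, receivers):
    """A streaming job needs at least one core beyond those used by receivers."""
    return cores_left_for_processing(allocated_cores, receivers) > 0


# One core allocated, one receiver: nothing is left to process the data.
print(can_make_progress(allocated_cores=1, receivers=1))
# Eight cores allocated, one receiver: seven cores are free for processing.
print(can_make_progress(allocated_cores=8, receivers=1))
```

Note that what matters is the number of cores actually allocated to the application's executors, which may be lower than the number of cores physically present in the cluster.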

The issue here turns out to be the default Spark configuration when running on YARN, which is what EMR uses for its cluster management – each executor is by default allocated only one core, so your nice cluster with two 4 core machines in it was actually sitting there with three quarters of its processors doing nothing. Getting around this is what the “-x” option mentioned in Jon Fritz’s blog post did – it ensured that Spark used all the available resources on the cluster, but that setting isn’t available with EMR 4.0.0.

The equivalent option for the new version is mentioned in the “Additional EMR Configuration Options for Spark” section of the EMR 4.0.0 announcement: you need to set the “maximizeResourceAllocation” property. To do that, select “Go to advanced options” when creating the cluster, expand the “Edit software settings (optional)” section and then add in the appropriate configuration string: “classification=spark,properties=[maximizeResourceAllocation=true]”. This does unfortunately mean that the “quick options” for creating a cluster are pretty much useless when using Spark, as you’re always going to want to be setting this option or a variant of it.
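If you create clusters from a script rather than the console, the same setting can be supplied as a JSON configuration object instead of the shorthand string. The snippet below is a minimal sketch that just builds and prints that JSON; the helper name is my own, and how you pass the result to your cluster creation tooling is left as an assumption.

```python
import json


def spark_max_allocation_config():
    """Return the EMR 4.x configuration list enabling maximizeResourceAllocation."""
    return [
        {
            "Classification": "spark",
            # Property values are strings in EMR configuration objects
            "Properties": {"maximizeResourceAllocation": "true"},
        }
    ]


config = spark_max_allocation_config()
print(json.dumps(config, indent=2))
```

The printed JSON is the long form of “classification=spark,properties=[maximizeResourceAllocation=true]”.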

Getting to the Spark web UI

When you’re running a Spark application, you may well be used to using the Spark web UI to keep an eye on your job. However, getting to the web UI on an EMR cluster isn’t as easy as it might appear at first glance. You can happily point your web browser to http://<cluster master DNS address>:4040/ as usual, but that returns a redirect to http://ip-<numbers>.<region>.compute.internal:20888/proxy/application_<n>_<n>/ containing a reference to the internal DNS name of the machine which isn’t too helpful if you’re outside the VPC inside which the cluster is running. I haven’t found a perfect solution to this one yet, but you can just replace “ip-<numbers>.<region>.compute.internal” with the external DNS name of the master – so you’re pointing at something like http://<cluster master DNS address>:20888/proxy/application_<n>_<n>/ – and then you can happily browse around the web UI from there.
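The manual hostname swap described above is easy to automate. Here is a small sketch that takes the redirect URL and substitutes the master's external DNS name while keeping the port and path; the function name and the example hostnames are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def rewrite_proxy_url(redirect_url, master_public_dns):
    """Swap the internal hostname in a redirect URL for the master's public DNS name."""
    parts = urlsplit(redirect_url)
    # Keep the original port (20888 for the proxy) if one was given
    port = f":{parts.port}" if parts.port else ""
    return urlunsplit(
        (parts.scheme, master_public_dns + port, parts.path, parts.query, parts.fragment)
    )


# Hypothetical example values
internal = "http://ip-10-0-0-1.eu-west-1.compute.internal:20888/proxy/application_1_0001/"
public_dns = "ec2-203-0-113-10.eu-west-1.compute.amazonaws.com"
print(rewrite_proxy_url(internal, public_dns))
```

This only rewrites the URL; the proxy port still has to be reachable from wherever your browser is running.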

Onward and upward

With all that, I’ve pretty much got up and running with Spark on Elastic MapReduce 4. Now, it’s back to the actual Spark applications again…