Big Data madness and reality

“Big Data” has been hyped a lot, and due to this now is burning down to get some reality checking. It’s been already overdue, but it is now happening.

I have seen a lot of people be very excited about NoSQL databases; and these databases do have their use cases. However, you just cannot put everything into NoSQL mode and expect it to just work. That is why we have recently seen NewSQL databases, query languages for NoSQL databases etc. - and in fact, they all move towards relational database management again.

Sound new database systems seem to be mostly made up of three aspects: in-memory optimization instead of optimizing for on-disk operation (memory just has become a lot cheaper the last 10 years), the execution of queries on the servers that store the actual data (you may want to call this “stored procedures” if you like SQL, or “map-reduce” if you like buzzwords), and optimized memory layouts (many of the benefits of “colum store” databases come from having a single, often primitive, datatype for the full column to scan, instead of alternating datatypes in records.

However, here is one point you need to consider:
is your data actually this “big”? Big as in: Google scale.

I see people use big data and Hadoop a lot when they just shouldn’t. I see a lot of people run Hadoop in a VM on their laptop. Ouch.

The big data technologies are not a one-size-fits-all solution. They are the supersize-me solution, and supersize just does not fit every task.

When you look at the cases where Hadoop is really successful, it is mostly in keeping the original raw data, and enabling people to re-scan the full data again when e.g. their requirements changed. This is where Hadoop is really good at: managing 100 TB of data, and allowing you to quickly filter out the few GB that you really need for your current task.

For the actual analysis - or when you don’t have 100 TB, and a large cluster anyway - then just don’t try to hadoopify everything.

Here is a raw number from a current experiment. I have a student work on indexing image data; he is implementing some state of the art techniques. For these, a large number of image features are extracted, and then clustering is used to find “typical” features to improve image search.

The benchmarking image set is 56 GB (I have others with 2 TB to use next). The subset the student is currently processing is 13 GB. Extracting 2.3 million 128 dimensional feature vectors reduces this to about 1.4 GB. As you can see, the numbers drop quickly.

State of the art seems to be to load the data into Hadoop, and run clustering (actually, this is more of a vector quantization than clustering) into 1000 groups. Mahout is the obvious candidate to run this on Hadoop.

However, as I’ve put a lot of effort into the data mining software ELKI, I considered also to try processing it in ELKI.

By cutting the data into 10 MB blocks, Mahout/Hadoop can run the clustering in 52x parallel mappers. k-Means is an iterative algorithm, so it needs multiple processes. I have fixed the number of iterations to 10, which should produce a good enough approximation for my use cases.

K-means is embarrassingly parallel, so one would expect the cluster to really shine at this task. Well, here are some numbers:

Mahout k-Means took 51 minutes on average per iteration (The raw numbers are 38:24, 62:29, 46:21, 56:20, 52:15, 45:11, 57:12, 44:58, 52:01, 50:26, so you can see there is a surpisingly high amount of variance there).
ELKI k-Means on a single CPU core took 4 minutes 25 seconds per iteration, and 45 minutes total, including parsing the input data from an ascii file. Maybe I will try a parallel implementation next.

So what is happening? Why is ELKI beating Mahout by a factor of 10x?

It’s (as always) a mixture of a number of things:

ELKI is quite well optimized. The Java Hotspot VM does a good job at optimizing this code, and I have seen it to be on par with R’s k-means, which is written in C. I’m not sure if Mahout has received a similar amount of optimization yet. (In fact, 15% of the Mahout runtime was garbage collection runtime - indicating that it creates too many objects.)
ELKI can use the data in a uniform way, similar to a column store database. It’s literally crunching the raw double[] arrays. Mahout on the other hand - as far as I can tell - is getting the data from a sequence file, which then is deserialized into a complex object. In addition to the actual data, it might be expecting sparse and dense vectors mixed etc.
Size: this data set fits well into memory. Once this no longer holds, ELKI will no longer be an option. Then MapReduce/Hadoop/Mahout wins. In particular, such an implementation will by design not keep the whole data set in memory, but need to de-serialize it from disk again on each iteration. This is overhead, but saves memory.
Design: MapReduce is designed for huge clusters, where you must expect nodes to crash during your computation. Well, chances are that my computer will survive 45 minutes, so I do not need this for this data size. However, when you really have large data, and need multiple hours on 1000 nodes to process it, then this becomes important to survive losing a node. The cost is that all interim results are written to the distributed filesystem. This extra I/O comes, again, at a cost.

Let me emphasize this: I’m not saying, Hadoop/Mahout is bad. I’m saying: this data set is not big enough to make Mahout beneficial.

Conclusions: As long as your data fits into your main memory and takes just a few hours to compute, don’t go for Hadoop.
It will likely be faster on a single node by avoiding the overhead associated with (current) distributed implementations.

Sometimes, it may also be a solution to use the cluster only for preparing the data, then get it to a powerful workstation, and analyze it there. We did do this with the images: for extracting the image features, we used a distributed implementation (not on Hadoop, but on a dozen PCs).

I’m not saying it will stay that way. I have plans for starting “Project HELKI”, aka “ELKI on Hadoop”. Because I do sometimes hit the barrier of what I can compute on my 8 core CPU in an “reasonable” amount of time. And of course, Mahout will improve over time, and hopefully lose some of its “Java boxing” weight.

But before trying to run everything on a cluster, always try to run it on a single node first. You can still figure out how to scale up later, once you really know what you need.

And last but not least, consider whether scaling up really makes sense. K-means results don’t really get better with a larger data set. They are averages, and adding more data will likely only change the last few digits. Now if you have an implementation that doesn’t pay attention to numerical issues, you might end up with more numerical error than you gained from that extra data.

In fact, k-means can effectively be accelerated by processing a sample first, and only refining this on the full data set then. And: sampling is the most important “big data” approach. In particular when using statistical methods, consider whether more data can really be expected to improve your quality.

Better algorithms: K-means is a very crude heuristic. It optimizes the sum of squared deviations, which may be not too meaningful for your problem. It is not very noise tolerant, either. And there are thousands of variants. For example bisecting k-means, which no longer is embarrassingly parallel (i.e. it is not as easy to implement on MapReduce), but took just 7 minutes doing 20 iterations for each split. The algorithm can be summarized as starting with k=2, then always splitting the largest cluster in two until you have the desired k. For many use cases, this result will be just as good as the full k-means result.

Don’t get me wrong. There are true big data problems. Web scale search, for example. Or product and friend recommendations at Facebook scale. But chances are that you don’t have this amount of data. ~~Google probably doesn’t employ k-means at that scale either.~~ (actually, Google runs k-means on 5 mio keypoints for vector quantization; which, judging my experience here, can still be done one a single host; in particular with hierarchical approaches such as bisecting k-means)

Don’t choose the wrong tool for the wrong problem!