"Big Data" has been hyped a lot, and due to this now is burning down to
get some reality checking. It's been already overdue, but it is now happening.
I have seen a lot of people get very excited about NoSQL databases, and
these databases do have their use cases. However, you cannot just put
everything into NoSQL and expect it to work. That is why we have recently
seen NewSQL databases, query languages for NoSQL databases, and so on -
and in fact, they all move back towards relational database management.
Sound new database systems seem to be built around three aspects:
in-memory optimization instead of optimizing for on-disk operation (memory
has become a lot cheaper over the last 10 years); executing queries on the
servers that store the actual data (call this "stored procedures" if you
like SQL, or "map-reduce" if you like buzzwords); and optimized memory
layouts (many of the benefits of "column store" databases come from
scanning a single, often primitive, datatype for the full column, instead
of alternating datatypes in records).
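To make the last point concrete, here is a minimal Python sketch (purely
illustrative, not code from any of the systems discussed) contrasting a
row layout of mixed-type records with a flat column of primitive doubles:

```python
from array import array

# Row layout: one record per row, mixed datatypes inside each record.
rows = [{"id": i, "value": float(i)} for i in range(1000)]

# Column layout: one flat array of primitive doubles for the "value" column.
value_column = array("d", (float(i) for i in range(1000)))

# Scanning the column touches one contiguous buffer of a single datatype;
# scanning the rows must dereference and unbox every record.
row_sum = sum(r["value"] for r in rows)
col_sum = sum(value_column)
assert row_sum == col_sum  # same values, same result
```

The column scan is what lets the database stream over homogeneous data
without per-record type dispatch.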
However, here is one point you need to consider: is your data actually
this "big"? Big as in: Google scale. I see people use big data and Hadoop
a lot when they just shouldn't. I see a lot of people run Hadoop in a VM
on their laptop.
The big data technologies are not a one-size-fits-all solution. They are
the supersize-me solution, and supersize just does not fit every task.
When you look at the cases where Hadoop is really successful, it is mostly
in keeping the original raw data, and enabling people to re-scan the full
data when, say, their requirements have changed. This is what Hadoop is
really good at: managing 100 TB of data, and allowing you to quickly filter
out the few GB that you really need for your current task.
For the actual analysis - or when you don't have 100 TB and a large
cluster anyway - just don't try to hadoopify everything.
Here is a raw number from a current experiment. I have a student working
on indexing image data; he is implementing some state-of-the-art
techniques. For these, a large number of image features are extracted, and
clustering is then used to find "typical" features to improve image search.
The benchmarking image set is 56 GB (I have others with 2 TB to use next).
The subset the student is currently processing is 13 GB. Extracting
2.3 million 128-dimensional feature vectors reduces this to about 1.4 GB.
As you can see, the numbers drop quickly.
State of the art seems to be to load the data into Hadoop and run the
clustering (actually, this is more of a vector quantization than a
clustering) into 1000 groups. Mahout is the obvious candidate to run this
on Hadoop. However, as I have put a lot of effort into the data mining
software ELKI, I also considered processing the data there.
By cutting the data into 10 MB blocks, Mahout/Hadoop can run the
clustering in 52 parallel mappers. k-means is an iterative algorithm, so
it needs multiple passes over the data. I have fixed the number of
iterations to 10, which should produce a good enough approximation for my
use cases.
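For illustration, here is a minimal Python sketch (not Mahout's actual
code) of how one k-means iteration decomposes into a map step (assign each
point to its nearest center) and a reduce step (recompute each center as a
mean), repeated for a fixed number of iterations:

```python
def assign(point, centers):
    """Map step: index of the nearest center for one point."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, center))
             for center in centers]
    return dists.index(min(dists))

def kmeans_iteration(points, centers):
    """One iteration: assign every point (map), then recompute each
    center as the mean of its assigned points (reduce)."""
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        j = assign(p, centers)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    # Empty clusters keep their old center.
    return [[s / counts[j] for s in sums[j]] if counts[j] else centers[j]
            for j in range(k)]

points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers = [[0.0, 0.0], [10.0, 10.0]]
for _ in range(10):  # a fixed number of iterations, as in the text
    centers = kmeans_iteration(points, centers)
# centers are now [[0.0, 0.5], [10.0, 10.5]]
```

Both steps shard trivially across mappers and reducers, which is why
k-means looks like an ideal MapReduce workload on paper.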
k-means is embarrassingly parallel, so one would expect the cluster to
really shine at this task. Well, here are some numbers:
- Mahout k-means took 51 minutes on average per iteration.
(The raw numbers are 38:24, 62:29, 46:21, 56:20, 52:15, 45:11, 57:12,
44:58, 52:01, 50:26, so you can see there is a surprisingly high amount of
variance.)
- ELKI k-means on a single CPU core took 4 minutes 25 seconds per
iteration, and 45 minutes total, including parsing the input data from an
ASCII file. Maybe I will try a parallel implementation next.
So what is happening? Why is ELKI beating Mahout by a factor of 10? It is
(as always) a mixture of things:
- ELKI is quite well optimized. The Java HotSpot VM does a good job at
optimizing this code, and I have seen it be on par with R's k-means, which
is written in C. I'm not sure whether Mahout has received a similar amount
of optimization yet. (In fact, 15% of the Mahout runtime was garbage
collection - indicating that it creates too many objects.)
- ELKI can use the data in a uniform way, similar to a column store
database: it is literally crunching raw double arrays. Mahout, on the
other hand - as far as I can tell - reads the data from a sequence file,
which is then deserialized into a complex object. In addition to the
actual data, it may have to expect a mixture of sparse and dense vectors.
- Size: this data set fits well into memory. Once this no longer holds,
ELKI will no longer be an option, and MapReduce/Hadoop/Mahout wins. In
particular, such an implementation will, by design, not keep the whole
data set in memory, but deserialize it from disk again on each iteration.
This is overhead, but it saves memory.
- Design: MapReduce is designed for huge clusters, where you must expect
nodes to crash during the computation. Well, chances are that my computer
will survive 45 minutes, so I do not need this at this data size. However,
when you really have large data, and need multiple hours on 1000 nodes to
process it, then surviving the loss of a node becomes important. The price
is that all interim results are written to the distributed filesystem, and
this extra I/O comes, again, at a cost.
Let me emphasize this: I'm not saying Hadoop/Mahout is bad. I'm saying:
this data set is not big enough to make Mahout beneficial.
Conclusions: as long as your data fits into main memory and takes just a
few hours to compute, don't go for Hadoop. The computation will likely be
faster on a single node, because it avoids the overhead of (current)
distributed implementations.
Sometimes, it may also be a solution to use the cluster only for preparing
the data, then move it to a powerful workstation and analyze it there. We
did this with the images: for extracting the image features, we used a
distributed implementation (not on Hadoop, but on a dozen PCs).
I'm not saying it will stay that way. I have plans for starting "Project
HELKI", aka "ELKI on Hadoop", because I do sometimes hit the barrier of
what I can compute on my 8-core CPU in a "reasonable" amount of time. And
of course, Mahout will improve over time, and hopefully lose some of its
"Java boxing" weight.
But before trying to run everything on a cluster, always try to run it
on a single node first. You can still figure out how to scale up later, once
you really know what you need.
And last but not least, consider whether scaling up really makes sense.
k-means results don't really get better with a larger data set. The
centers are averages, and adding more data will likely only change the
last few digits. Worse, if your implementation doesn't pay attention to
numerical issues, you might end up with more numerical error than you
gained precision from the extra data.
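A small Python sketch of this effect (illustrative; the exact drift
depends on the data and the order of summation): a mean is a sum divided
by a count, and a naive running sum accumulates rounding error that a
compensated summation such as math.fsum avoids.

```python
import math

# Ten million additions of a value that is not exactly representable
# as a binary double.
data = [0.1] * 10_000_000
naive = sum(data)         # rounding error grows with the number of additions
stable = math.fsum(data)  # correctly rounded sum via exact partial sums
drift = naive - stable    # small but nonzero: error from naive accumulation
```

With more data, the naive sum performs more additions and drifts further,
which is exactly the trade-off described above.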
In fact, k-means can effectively be accelerated by processing a sample
first, and only refining the result on the full data set afterwards. And
sampling is the most important "big data" approach of all. In particular
when using statistical methods, consider whether more data can really be
expected to improve your quality.
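Here is a minimal Python sketch of that sample-then-refine strategy
(illustrative only; the farthest-point start and the synthetic
two-Gaussian data are my own choices for the example):

```python
import random

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))

def lloyd(points, centers, iters):
    """Plain k-means (Lloyd) iterations; empty clusters keep their center."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else c
                   for g, c in zip(groups, centers)]
    return centers

random.seed(0)
data = ([[random.gauss(0, 0.1), random.gauss(0, 0.1)] for _ in range(10000)]
        + [[random.gauss(5, 0.1), random.gauss(5, 0.1)] for _ in range(10000)])

# 1. Cluster a small sample cheaply to get good starting centers.
sample = random.sample(data, 200)
c0 = sample[0]
c1 = max(sample, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, c0)))
centers = lloyd(sample, [c0, c1], iters=10)

# 2. A couple of refinement passes over the full data finish the job.
centers = lloyd(data, centers, iters=2)
```

Most of the iterations run on 200 points instead of 20000, yet the final
centers are refined against the complete data set.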
Better algorithms: k-means is a very crude heuristic. It optimizes the sum
of squared deviations, which may not be very meaningful for your problem,
and it is not very noise-tolerant either. And there are thousands of
variants. For example bisecting k-means, which is no longer embarrassingly
parallel (i.e. it is not as easy to implement on MapReduce), but took just
7 minutes doing 20 iterations for each split. The algorithm can be
summarized as starting with k=2, then repeatedly splitting the largest
cluster in two until you have the desired k. For many use cases, the
result will be just as good as that of regular k-means.
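The summarized algorithm can be sketched in a few lines of Python
(illustrative only; the farthest-point start for each 2-means split is my
own choice for the example):

```python
def two_means(points, iters=20):
    """Split one set of points in two with plain 2-means
    (20 iterations per split, as in the timing above)."""
    c0 = points[0]
    c1 = max(points, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, c0)))
    centers = [list(c0), list(c1)]
    for _ in range(iters):
        halves = [[], []]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            halves[d.index(min(d))].append(p)
        centers = [[sum(col) / len(h) for col in zip(*h)] if h else c
                   for h, c in zip(halves, centers)]
    return halves

def bisecting_kmeans(points, k):
    """Start with one cluster; repeatedly split the largest until k."""
    clusters = [points]
    while len(clusters) < k:
        largest = max(clusters, key=len)
        clusters.remove(largest)
        clusters.extend(two_means(largest))
    return clusters

# Two well-separated 1-d groups; asking for k=3 forces a second split.
data = ([[float(i)] for i in range(8)]
        + [[float(i)] for i in range(100, 108)])
parts = bisecting_kmeans(data, 3)
# sorted cluster sizes: [4, 4, 8]
```

Note that each split depends on the previous one, which is exactly why
this variant no longer maps cleanly onto MapReduce.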
Don't get me wrong. There are true big data problems: web-scale search,
for example, or product and friend recommendations at Facebook scale. But
chances are that you don't have this amount of data, and Google probably
doesn't run k-means at that scale either. (Actually, Google runs k-means
on 5 million keypoints for vector quantization; which, judging from my
experience here, can still be done on a single host, in particular with
hierarchical approaches such as bisecting k-means.)
Don't choose the wrong tool for the wrong problem!