Everybody is trying (or pretending) to be “big data” these days. Yet I have the impression that there are very few true success stories. People fight with the technology just to scale, and have not even arrived at the point of doing detailed analysis - or even meta-analysis, i.e. checking whether the new solutions actually perform better than the old “small data” approaches.

In fact, a lot of the “big data” approaches simply argue “you can’t do it with the existing solutions, so there is nothing to compare against” - which is not exactly true.


My experience with large data (not big; it still fits into main memory) is that you have to be quite careful with your methods. Just scaling up existing methods does not always yield the expected results. The larger your data set, the more opportunities there are for tiny problems to ruin your computations.

“Big data” is often based on the assumption that just by throwing more data at your problem, your results will automatically become more precise. This is not true. On the contrary: the larger your data, the more likely it is to contain some contamination that can ruin everything.

We tend to assume that numerical issues like these have long been resolved. There are solutions, but they come at a price: they are slower, have some limitations, and are harder to implement.

Unfortunately, such numerical issues lurk just about everywhere in data analysis. I’ll demonstrate with a very basic example. Assume we want to compute the mean of the series [1e20, 3, -1e20]. Computing a mean - everybody should be able to do that, right? Well, let’s agree that the true answer is 1, as the first and last term cancel out. Now let’s try some variants:

  • Python, naive: sum([1e20, 3, -1e20])/3 yields 0.0
  • Python, NumPy sum: numpy.sum([1e20, 3, -1e20])/3 yields 0.0
  • Python, NumPy mean: numpy.mean([1e20, 3, -1e20]) yields 0.0
  • Python, less-known function: math.fsum([1e20, 3, -1e20])/3 yields 1.0
  • Java, naive: System.out.println( (1e20+3-1e20)/3 ); yields 0.0
  • R, mean: mean( c(1e20,3,-1e20) ) yields 0
  • R, sum: sum( c(1e20,3,-1e20) )/3 yields 0
  • Octave, mean: mean([1e20,3,-1e20]) yields 0
  • Octave, sum: sum([1e20,3,-1e20])/3 yields 0

So what is happening here? All of these functions (except Python’s less-known math.fsum) use double precision. In double precision, 1e20 + 3 = 1e20, because a double can only retain 15-16 significant digits. To actually get the correct result, you need to keep track of the rounding error using additional doubles.
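To see why the 3 simply vanishes, it helps to look at the spacing of adjacent doubles around 1e20; in a recent Python (math.ulp exists since 3.9) you can check it directly:

> import math
> math.ulp(1e20)
16384.0
> 1e20 + 3 == 1e20
True

The gap between 1e20 and its neighbouring doubles is 16384, so adding 3 rounds straight back to 1e20 - the 3 is gone before the -1e20 ever gets a chance to cancel.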

Now you may argue that this only happens when there are large differences in magnitude. Unfortunately, that is not true. It also surfaces when you have a large number of observations! Again, I’m using Python to demonstrate (because math.fsum gives us an accurate reference).

> a = numpy.array(range(-1000000,1000001)) * 0.000001
> min(a), max(a), numpy.sum(a), math.fsum(a)
(-1.0, 1.0, -2.3807511517759394e-11, 0.0)

As math.fsum demonstrates, solutions do exist - the exact sum here is 0 (the series is symmetric around 0), which math.fsum recovers while numpy.sum is off by roughly 2.4e-11. One such solution is Shewchuk’s algorithm (which is probably what powers math.fsum). For many cases, Kahan summation will also be sufficient; it essentially gives you twice the precision of doubles.
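Since Kahan summation is mentioned, here is a minimal sketch of it, together with Neumaier’s improved variant. The names kahan_sum and neumaier_sum are mine, not library routines:

import math

def kahan_sum(values):
    """Classic Kahan (compensated) summation: carry the rounding error in a second double."""
    total = 0.0
    compensation = 0.0
    for x in values:
        y = x - compensation            # apply the correction from the previous step
        t = total + y                   # low-order bits of y are lost here ...
        compensation = (t - total) - y  # ... and recovered here
        total = t
    return total

def neumaier_sum(values):
    """Neumaier's variant: also handles the case where the next term dwarfs the running total."""
    total = 0.0
    compensation = 0.0
    for x in values:
        t = total + x
        if abs(total) >= abs(x):
            compensation += (total - t) + x   # low-order bits of x were lost
        else:
            compensation += (x - t) + total   # low-order bits of total were lost
        total = t
    return total + compensation

Note that on the drastic example above, plain Kahan is still defeated: kahan_sum([1e20, 3, -1e20]) also comes out as 0.0, because a single compensation term cannot rescue the 3 once the running total swamps it. neumaier_sum returns 3.0, and hence a mean of 1. For long sums of values of similar magnitude, plain Kahan is usually sufficient - and math.fsum remains the exact, if slower, reference.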

Note that these issues become even worse once subtraction is involved, for example when computing the variance. Never use the famous E[X^2]-E[X]^2 formula. It is mathematically correct, but when your data is not centered (i.e. E[X] is far from 0 and much larger than your standard deviation), you will see all kinds of odd errors, including a negative variance - which then yields a NaN standard deviation:

> b = a + 1e15
> numpy.var(a), numpy.var(b)
(0.33333366666666647, 0.33594164452917774)
> numpy.mean(a**2)-numpy.mean(a)**2, numpy.mean(b**2)-numpy.mean(b)**2
(0.33333366666666647, -21532835718365184.0)

(As you can see, numpy.var does not use the naive single-pass formula; presumably it uses the classic, straightforward two-pass approach.)
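For reference, here is a minimal sketch of two stable alternatives: the textbook two-pass formula and Welford’s single-pass update, which is handy when the data arrives as a stream or does not fit in memory. The names variance_two_pass and variance_welford are mine:

def variance_two_pass(data):
    """Two passes: first the mean, then the mean squared deviation (population variance)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n

def variance_welford(data):
    """Welford's single-pass update: numerically stable without a second pass over the data."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / n  # population variance, matching numpy.var's default

Both avoid the catastrophic cancellation of E[X^2]-E[X]^2 because they subtract the mean before squaring.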

So why do we not always use the accurate computations? Because floating point with fixed precision is fast, and most of the time - when dealing with well-conditioned numbers - it is easily accurate enough. To show the performance difference:

> import timeit
> for f in ["sum(a)", "math.fsum(a)"]:
>     print(timeit.timeit(f, setup="import math; a=range(0,1000)"))
30.6121790409
202.994441986

So unless we need that extra precision (e.g. because we have messy data with outliers of large magnitude), we may prefer the simpler approach, which in the benchmark above is over six times faster - at least as far as pure CPU performance is concerned; once I/O comes into play, the difference may well disappear altogether. That is probably why all of these functions except fsum show the same inaccuracy: performance. In 99% of situations, the problems simply do not arise.


Long story, short takeaway: when dealing with large data, pay extra attention to the quality of your results. In fact, do so even when handling small data that is dirty, contains outliers, or mixes very different magnitudes. Don’t assume that computer math is exact - floating point arithmetic is not.
Don’t just blindly scale up approaches that seemed to work on small data; analyze them carefully. And last but not least, consider whether adding more data will actually give you extra precision.