Everything is “big data” now, and everything is “data science”, because these terms lack a proper, falsifiable definition.

A number of attempts to define them exist, but they usually consist only of a few “nice to haves” strung together. For Big Data, it’s the 3+ V’s, and for Data Science, this diagram on Wikipedia is a typical example.

This is not surprising: effectively, these terms are all marketing, not scientific attempts at defining a research domain.

Actually, my favorite definition is this, except that it should maybe read pink pony in the middle, instead of unicorn.

Data science has been called “the sexiest job” so often that this has recently led to an integer overflow.

The problem with these definitions is that they are open-ended. They name some examples (like “volume”), but they essentially leave you free to call anything you like “big data” or “data science”. This is, of course, a marketer’s dream buzzword. There is nothing saying that “picking my nose” is not big data science.


If we ever want to get to a usable definition and get rid of all the hype, we should consider a more precise definition, even when this means making it more exclusive (funnily enough, some people have already called the open-ended definitions above “elitist” …).

Big data:

  • Must involve distributed computation on multiple servers
  • Must intermix computation and data management
  • Must advance over the state of the art of relational databases, data warehousing and cloud computing in 2005
  • Must enable results that were unavailable with earlier approaches, or that would take substantially longer (runtime or latency)
  • Must be disruptively more data-driven

Data science:

  • Must incorporate domain knowledge (e.g. business, geology, etc.).
  • Must take computational aspects into account (scalability etc.).
  • Must involve scientific techniques such as hypothesis testing and result validation.
  • Results must be falsifiable.
  • Should involve more mathematics and statistics than earlier approaches.
  • Should involve more data management than earlier approaches (indexing, sketching and hashing etc.).
  • Should involve machine learning, AI or knowledge discovery algorithms.
  • Should involve visualization and rapid prototyping for software development.
  • Must satisfy at least one of these shoulds at a disruptive level.
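
Read as a checklist, the data science criteria above could be sketched roughly as follows. This is only an illustration: the names `MUSTS`, `SHOULDS`, `Project` and `is_data_science` are made up here, and deciding whether a criterion is actually met (or met at a “disruptive” level) remains a human judgment that the code simply takes as input.

```python
from dataclasses import dataclass, field

# Hypothetical labels for the "must" criteria in the list above.
MUSTS = (
    "domain_knowledge",
    "computational_aspects",
    "scientific_techniques",
    "falsifiable_results",
)

# Hypothetical labels for the "should" criteria in the list above.
SHOULDS = (
    "mathematics_statistics",
    "data_management",
    "learning_algorithms",
    "visualization_prototyping",
)


@dataclass
class Project:
    # Musts that a human reviewer judged to be satisfied.
    musts_met: set = field(default_factory=set)
    # Shoulds that a human reviewer judged to be satisfied at a disruptive level.
    disruptive_shoulds: set = field(default_factory=set)


def is_data_science(project: Project) -> bool:
    """All musts hold, and at least one should holds at a disruptive level."""
    all_musts = all(m in project.musts_met for m in MUSTS)
    any_disruptive = any(s in project.disruptive_shoulds for s in SHOULDS)
    return all_musts and any_disruptive


# Example: every must is met, but nothing is disruptive, so it does not qualify.
business_as_usual = Project(musts_met=set(MUSTS))
print(is_data_science(business_as_usual))
```

The point of writing it down this way is that the rule becomes falsifiable: a project either passes the check or it does not, instead of qualifying by default.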

But this is all far from a proper definition, partially because these fields are so much in flux, but largely because they’re just too ill-defined.

There is a lot of overlap that we should try to flesh out. For example, data science is not just statistics, because it is much more concerned with how data is organized and how the computations can be made efficient. Yet statistics is often much better at integrating domain knowledge. People coming from computation, on the other hand, usually care too little about domain knowledge and the falsifiability of their results - they’re happy if they can compute anything.

Last but not least, nobody will be in favor of such a rigid definition and its requirements, because most likely you will have to strip that “data scientist” label off your business card - and why bite the hand that feeds? Most of what I do certainly would not qualify as data science or big data anymore with an “elitist” definition. While this doesn’t lessen my scientific results, it makes them less marketable.

Essentially, this is like a global “gentlemen’s agreement”: buzz these words while they last, then move on to the next similar “trending topic”.


Maybe we should just leave these terms to the marketing folks, and let them inflate the bubble until it bursts. Instead, we should just stick to the established and better-defined terms…

  • When you are doing statistics, call it statistics.
  • When you are doing unsupervised learning, call it machine learning.
  • When your focus is distributed computation, call it distributed computing.
  • When you do data management, continue to call it data management and databases.
  • When you do data indexing, call it data indexing.
  • When you are doing unsupervised data mining, call it cluster analysis, outlier detection, …
  • Whatever it is, try to use a precise term, instead of a buzzword.

Thank you.

Of course, sometimes you will have to play Buzzword Bingo. Nobody is going to stop you. But I will assume that you are just “playing buzzword bingo” unless you get more precise.

Once you have results that are massively better and have really disrupted science, you can still call it “data science” later on.

As you have seen, I’ve been harping on the word “disruptive” a lot. As long as you are doing “business as usual” and focusing on off-the-shelf solutions, it will not be disruptive. And then it won’t be big data science, or a big data approach that yields major gains. It will be just “business as usual” with different labels, and it will return results as usual.

Let’s face it. We don’t just want big data or data science. What everybody is looking for is disruptive results, and those require a radical approach, not a slight modification of what you have been doing all along that merely involves slightly more computers.