Your big data toolchain is a big security risk!

This post is a follow-up to my earlier post on the “sad state of sysadmin in the age of containers”. While I was drafting this post, that story got picked up by HackerNews, Reddit and Twitter, sending a lot of comments and emails my way. Surprisingly many of the comments are supportive of my impression - I would have expected to see much more insults along the lines “you just don’t like my-favorite-tool, so you rant against using it”. But a lot of people seem to share my concerns. Thanks, you surprised me!

Here is the new ~~rant~~ post, in the slightly different context of big data:

Everybody is doing “big data” these days. Or at least, pretending to do so to upper management. A lot of the time, there is no big data. People do more data anylsis than before, and therefore stick the “big data” label on them to promote themselves and get green light from management, isn’t it?

“Big data” is not a technical term. It is a business term, referring to any attempt to get more value out of your business by analyzing data you did not use before. From this point of view, most of such projects are indeed “big data” as in “data-driven revenue generation” projects. It may be unsatisfactory to those interested in the challenges of volume and the other “V’s”, but this is the reality how the term is used.

But even in those cases where the volume and complexity of the data would warrant the use of all the new ~~toys~~ tools, people overlook a major problem: security of their systems and of their data.

The currently offered “big data technology stack” is all but secure. Sure, companies try to earn money with security add-ons such as Kerberos authentication to sell multi-tenancy, and with offering their version of Hadoop (their “Hadoop distribution”).

The security problem is deep inside the “stack”. It comes from the way this world ticks: the world of people that constantly follow the latest tool-of-the-day. In many of the projects, you no longer have mostly Linux developers that co-function as system administrators, but you see a lot of Apple iFanboys now. They live in a world where technology is outdated after half a year, so you will not need to support product longer than that. They love reinstalling their development environment frequently - because each time, they get to change something. They also live in a world where you would simply get a new model if your machine breaks down at some point. (Note that this will not work well for your big data project, restarting it from scratch every half year…)
And while Mac users have recently been surprisingly unaffected by various attacks (and unconcerned about e.g. GoToFail, or the fail to fix the rootpipe exploit) the operating system is not considered to be very secure. Combining this with users who do not care is an explosive mixture…

This type of developer, who is good at getting a prototype website for a startup kicking in a short amount of time, rolling out new features every day to beta test on the live users is what currently makes the Dotcom 2.0 bubble grow. It’s also this type of user that mainstream products aim at - he has already forgotten what was half a year ago, but is looking for the next tech product to announced soon, and willing to buy it as soon as it is available…

This attitude causes a problem at the very heart of the stack: in the way packages are built, upgrades (and safety updates) are handled etc.

nobody is interested in consistency or reproducability anymore.

Someone commented on my blog that all these tools “seem to be written by 20 year old” kids. He probably is right. It wouldn’t be so bad if we had some experienced sysadmins with a cluebat around. People that have experience on how to build systems that can be maintained for 10 years, and securely deployed automatically, instead of relying on puppet hacks, wget and unzipping of unsigned binary code.

I know that a lot of people don’t want to hear this, but:

Your Hadoop system contains unsigned binary code in a number of places, that people downloaded, uploaded and redownloaded a countless number of times. There is no guarantee that .jar ever was what people think it is.

Hadoop has a huge set of dependencies, and little of this has been seriously audited for security - and in particular not in a way that would allow you to check that your binaries are built from this audited code anyway.

There might be functionality hidden in the code that just sits there and waits for a system with a hostname somewhat like “yourcompany.com” to start looking for its command and control server to steal some key data from your company. The way your systems are built they probably do not have much of a firewall guarding against such. Much of the software may be constantly calling home, and your DevOps would not notice (nor would they care, anyway).

The mentality of “big data stacks” these days is that of Windows Shareware in the 90s. People downloading random binaries from the Internet, not adequately checked for security (ever heard of anybody running an AntiVirus on his Hadoop cluster?) and installing them everywhere.

And worse: not even keeping track of what they installed over time, or how. Because the tools change every year. But what if that developer leaves? You may never be able to get his stuff running properly again!

Fire-and-forget.

I predict that within the next 5 years, we will have a number of security incidents in various major companies. This is industrial espionage heaven. A lot of companies will cover it up, but some leaks will reach mass media, and there will be a major backlash against this hipster way of stringing together random components. There is a big “Hadoop bubble” growing, that will eventually burst.

In order to get into a trustworthy state, the big data toolchain needs to:

Consolidate. There are too many tools for every job. There are even too many tools to manage your too many tools, and frontends for your frontends.
Lose weight. Every project depends on way too many other projects, each of which only contributes a tiny fragment for a very specific use case. Get rid of most dependencies!
Modularize. If you can’t get rid of a dependency, but it is still only of interest to a small group of users, make it an optional extension module that the user only has to install if he needs this particular functionality.
Buildable. Make sure that everybody can build everything from scratch, without having to rely on Maven or Ivy or SBT downloading something automagically in the background. Test your builds offline, with a clean build directory, and document them! Everything must be rebuildable by any sysadmin in a reproducible way, so he can ensure a bug fix is really applied.
Distribute. Do not rely on binary downloads from your CDN as sole distribution channel. Instead, encourage and support alternate means of distribution, such as the proper integration in existing and trusted Linux distributions.
Maintain compatibility. successful big data projects will not be fire-and-forget. Eventually, they will need to go into production and then it will be necessary to run them over years. It will be necessary to migrate them to newer, larger clusters. And you must not lose all the data while doing so.
Sign. Code needs to be signed, end-of-story.
Authenticate. All downloads need to come with a way of checking the downloaded files agree with what you uploaded.
Integrate. The key feature that makes Linux systems so very good at servers is the all-round integrated software management. When you tell the system to update - and you have different update channels available, such as a more conservative “stable/LTS” channel, a channel that gets you the latest version after basic QA, and a channel that gives you the latest versions shortly after their upload to help with QA. It covers almost all software on your system, so it does not matter whether the security fix is in your kernel, web server, library, auxillary service, extension module, scripting language etc. - it will pull this fix and update you in no time.

Now you may argue that Hortonworks, Cloudera, Bigtop etc. already provide packages. Well … they provide crap. They have something they call a “package”, but it fails by any quality standards. Technically, a Wartburg is a car; but not one that would pass todays safety regulations…
For example, they only support Ubuntu 12.04 - a three year old Ubuntu is the latest version they support… Furthermore, these packages are roughly the same. Cloudera eventually handed over their efforts to “the community” (in other words, they gave up on doing it themselves, and hoped that someone else would clean up their mess); and Hortonworks HDP (any maybe Pivotal HD, too) is derived from these efforts, too. Much of what they do is offering some extra documentation and training for the packages they built using Bigtop with minimal effort.
The “spark” .deb packages of Bigtop, for example, are empty. They forgot to include the .jars in the package. Do I really need to give more examples of bad packaging decisions? All bigtop packages now depend on their own version of groovy - for a single script. Instead of rewriting this script in an already required language - or in a way that it would run on the distribution-provided groovy version - they decided to make yet another package, bigtop-groovy.

When I read about Hortonworks and IBM announcing their “Open Data Platform”, I could not care less. As far as I can tell, they are only sticking their label on the existing tools anyway. Thus, I’m also not surprised that Cloudera and MapR do not join this rebranding effort - given the low divergence of Hadoop, who would need such a label anyway?

So why does this matter? Essentially, if anything does not work, you are currently toast. Say there is a bug in Hadoop that makes it fail to process your data. Your business is belly-up because of that, no data is processed anymore, your are vegetable. Who is going to fix it? All these “distributions” are built from the same, messy, branch. There is probably only a dozen of people around the world who have figured this out well enough to be able to fully build this toolchain. Apparently, none of the “Hadoop” companies are able to support a newer Ubuntu than 2012.04 - are you sure they have really understood what they are selling? I have doubts. All the freelancers out there, they know how to download and use Hadoop. But can they get that business-critical bug fix into the toolchain to get you up and running again? This is much worse than with Linux distributions. They have build daemons - servers that continuously check they can compile all the software that is there. You need to type two well-documented lines to rebuild a typical Linux package from scratch on your workstation - any experienced developer can follow the manual, and get a fix into the package. There are even people who try to recompile complete distributions with a different compiler to discover compatibility issues early that may arise in the future.

In other words, the “Hadoop distribution” they are selling you is not code they compiled themselves. It is mostly .jar files they downloaded from unsigned, unencrypted, unverified sources on the internet. They have no idea how to rebuild these parts, who compiled that, and how it was built. At most, they know for the very last layer. You can figure out how to recompile the Hadoop .jar. But when doing so, your computer will download a lot of binaries. It will not warn you of that, and they are included in the Hadoop distributions, too.

As is, I can not recommend to trust your business data into Hadoop.
It is probably okay to copy the data into HDFS and play with it - in particular if you keep your cluster and development machines isolated with strong firewalls - but be prepared to toss everything and restart from scratch. It’s not ready yet for prime time, and as they keep on adding more and more unneeded cruft, it does not look like it will be ready anytime soon.

One more examples of the immaturity of the toolchain:
The scala package from scala-lang.org cannot be cleanly installed as an upgrade to the old scala package that already exists in Ubuntu and Debian (and the distributions seem to have given up on compiling a newer Scala due to a stupid Catch-22 build process, making it very hacky to bootstrap scala and sbt compilation).
And the “upstream” package also cannot be easily fixed, because it is not built with standard packaging tools, but with an automagic sbt helper that lacks important functionality (in particular, access to the Replaces: field, or even cleaner: a way of splitting the package properly into components) instead - obviously written by someone with 0 experience in packaging for Ubuntu or Debian; and instead of using the proven tools, he decided to hack some wrapper that tries to automatically do things the wrong way…

I’m convinced that most “big data” projects will turn out to be a miserable failure. Either due to overmanagement or undermanagement, and due to lack of experience with the data, tools, and project management… Except that - of course - nobody will be willing to admit these failures. Since all these projects are political projects, they by definition must be successful, even if they never go into production, and never earn a single dollar.