Stop abusing lambda expressions - this is not functional programming

I know, all the Scala fanboys are going to hate me now. But:
Stop overusing lambda expressions.
Most of the time when you are using lambdas, you are not even doing functional programming, because you often are violating one key rule of functional programming: no side effects.
For example:
collection.forEach(System.out::println);
is of course very cute to use, and is (wow) 10 characters shorter than:
for (Object o : collection) System.out.println(o);
but this is not functional programming because it has side effects.
What you are writing are anonymous methods/objects, using a shorthand notation. It's sometimes convenient, it is usually short, and unfortunately often unreadable, once you start cramming complex problems into this framework.
It does not offer efficiency improvements, unless you have the property of side-effect freeness (and a language compiler that can exploit this, or parallelism that can then call the function concurrently in arbitrary order and still yield the same result).
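To make the point about side-effect freeness concrete, here is a minimal sketch. The class and method names are made up for illustration and do not come from any library. Because the function passed to the stream touches nothing but its argument, the sequential and the parallel run must give the same answer:

```java
import java.util.stream.IntStream;

// Hypothetical demo class; the names are illustrative only.
final class PureLambdaDemo {
    // A pure function: the result depends only on the argument.
    static long square(int x) {
        return (long) x * x;
    }

    // square() has no side effects, so the stream may evaluate it in any
    // order and on any thread without changing the sum.
    static long sumOfSquares(int n, boolean parallel) {
        IntStream range = IntStream.rangeClosed(1, n);
        if (parallel) {
            range = range.parallel();
        }
        return range.mapToLong(PureLambdaDemo::square).sum();
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1000, false));
        System.out.println(sumOfSquares(1000, true));
    }
}
```

A lambda that printed each value or appended to a shared list would lose this guarantee: the output order (or even the list contents) could differ between runs.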
Here is an example of how not to use lambdas:
DZone Java 8 Factorial (with boilerplate such as the Pair class omitted):
Stream<Pair> allFactorials = Stream.iterate(
  new Pair(BigInteger.ONE, BigInteger.ONE),
  x -> new Pair(
    x.num.add(BigInteger.ONE),
    x.value.multiply(x.num.add(BigInteger.ONE))));
return allFactorials.filter(
  (x) -> x.num.equals(num)).findAny().get().value;
When you are fresh out of the functional programming class, this may seem like a good idea to you... (and in contrast to the examples mentioned above, this is really a functional program).
But such code is a pain to read, and will not scale well either. Rewriting this to classic Java yields:
BigInteger cur = BigInteger.ONE, acc = BigInteger.ONE;
while(cur.compareTo(num) < 0) { // stop once cur == num, so that acc == num!
  cur = cur.add(BigInteger.ONE); // Unfortunately, BigInteger is immutable!
  acc = acc.multiply(cur);
}
return acc;
Sorry, but the traditional loop is much more readable. It will still not perform very well, because BigInteger was not designed for efficiency. It does not even make sense to allow BigInteger for num: the factorial of 2**63-1, the maximum of a Java long, would need on the order of 10^20 bytes to store, i.e. tens of exabytes.
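As a rough sanity check of that storage estimate, here is a back-of-the-envelope sketch using Stirling's approximation, ln(n!) ≈ n·ln(n) − n + ½·ln(2πn). The class name is hypothetical, and double precision is plenty for an order-of-magnitude figure:

```java
// Hypothetical helper for an order-of-magnitude estimate; not a benchmark.
final class FactorialSizeEstimate {
    // Approximate number of bytes needed to store n! as a binary integer,
    // based on Stirling's approximation of ln(n!).
    static double approxBytesForFactorial(double n) {
        double lnFactorial = n * Math.log(n) - n + 0.5 * Math.log(2 * Math.PI * n);
        double bits = lnFactorial / Math.log(2); // convert nats to bits
        return bits / 8.0;
    }

    public static void main(String[] args) {
        double n = (double) Long.MAX_VALUE; // 2^63 - 1, rounded to the nearest double
        System.out.printf("approx. %.2e bytes%n", approxBytesForFactorial(n));
    }
}
```

For the largest Java long this comes out somewhat below 10^20 bytes, which is far beyond anything you could store, let alone multiply.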
I did some benchmarking: one hundred random values of num (of course the same for all methods) from the range 1 to 1000.
I also included this even more traditional version:
BigInteger acc = BigInteger.ONE;
for(long i = 2; i <= x; i++) {
  acc = acc.multiply(BigInteger.valueOf(i));
}
return acc;
Here are the results (microbenchmark, using JMH, 10 warm-up iterations, 20 measurement iterations of 1 second each):
functional    1000     100  avgt   20  9748276,035 ± 222981,283  ns/op
biginteger    1000     100  avgt   20  7920254,491 ± 247454,534  ns/op
traditional   1000     100  avgt   20  6360620,309 ± 135236,735  ns/op
As you can see, this "functional" approach above is about 50% slower than the classic for-loop. This will be mostly due to the Pair and additional BigInteger objects created and garbage collected.
Apart from being substantially faster, the iterative approach is also much simpler to follow. (To some extent it is faster because it is also easier for the compiler!)
There was a recent blog post by Robert Bräutigam that discussed exception throwing in Java lambdas and the pitfalls associated with this. The discussed approach involves abusing generics for throwing unknown checked exceptions in the lambdas, ouch.

Don't get me wrong. There are cases where the use of lambdas is perfectly reasonable. There are also cases where it adheres to the "functional programming" principle. For example, a stream.filter(x ->"John Doe")) can be a readable shorthand when selecting or preprocessing data. If it is really functional (side-effect free), then it can safely be run in parallel and give you some speedup.
Also, Java lambdas were carefully designed, and the HotSpot VM tries hard to optimize them. That is why Java lambdas are not full closures (they can only capture effectively final variables) - that would be much less performant. Also, the stack traces of Java lambdas remain somewhat readable (although still much worse than those of traditional code). This blog post by Takipi showcases how bad the stack traces become (in the Java example, the stream function is more to blame than the actual lambda - nevertheless, the actual lambda application shows up as the cryptic LmbdaMain$$Lambda$1/821270929.apply(Unknown Source) without line number information). Java 8 compiles lambdas using the invokedynamic bytecode (added to the JVM in Java 7) so that they can be optimized better - earlier JVM-based languages may not yet make good use of this.
But you really should use lambdas only for one-liners. If it is a more complex method, you should give it a name to encourage reuse and improve debugging.
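A small sketch of what "give it a name" can look like. The validation rule and all names here are invented for illustration. The multi-step check lives in a named static method, and the stream only references it:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical example; the rule itself is made up.
final class NamedPredicateDemo {
    // Too long for a readable inline lambda, so it gets a name.
    // A named method is reusable, testable, and shows up in stack traces.
    static boolean isPlausibleUserName(String s) {
        if (s == null || s.isEmpty() || s.length() > 32) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!Character.isLetterOrDigit(c) && c != '_') {
                return false;
            }
        }
        return true;
    }

    // The pipeline stays a one-liner because the logic is referenced by name.
    static List<String> plausibleNames(List<String> input) {
        return input.stream()
                .filter(NamedPredicateDemo::isPlausibleUserName)
                .collect(Collectors.toList());
    }
}
```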
Beware of the cost of .boxed() streams!
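To show where the boxing cost comes from, here is a minimal sketch with hypothetical class and method names. Both methods compute the same sum, but the second one wraps every element into a Long object before reducing:

```java
import java.util.stream.IntStream;

// Hypothetical demo class; the names are illustrative only.
final class BoxedStreamDemo {
    // Primitive stream: values stay primitive all the way through.
    static long sumPrimitive(int n) {
        return IntStream.rangeClosed(1, n).asLongStream().sum();
    }

    // Boxed stream: each element becomes a Long object, and the reduction
    // unboxes and re-boxes on every step.
    static long sumBoxed(int n) {
        return IntStream.rangeClosed(1, n).asLongStream().boxed().reduce(0L, Long::sum);
    }

    public static void main(String[] args) {
        System.out.println(sumPrimitive(100_000));
        System.out.println(sumBoxed(100_000));
    }
}
```

How much slower the boxed variant is depends on the JVM and the workload, so measure with a proper benchmark harness before drawing conclusions.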
And do not overuse lambdas. Most often, non-Lambda code is just as compact, and much more readable. Similar to foreach-loops, you do lose some flexibility compared to the "raw" APIs such as Iterators:
for(Iterator<Something> it = collection.iterator(); it.hasNext(); ) {
  Something s = it.next();
  if (someTest(s)) continue; // Skip
  if (otherTest(s)) it.remove(); // Remove
  if (thirdTest(s)) process(s); // Call-out to a complex function
  if (fourthTest(s)) break; // Stop early
}
In many cases, this code is preferable to the lambda hacks we see pop up everywhere these days. The above code is efficient, and readable.
If you can solve it with a for loop, use a for loop!
Code quality is not measured by how much functionality you can do without typing a semicolon or a newline!
On the contrary: the key ingredient to writing high-performance code is the memory layout (usually) - something you need to do low-level.
Instead of going crazy about Lambdas, I'm more looking forward to real value types (similar to a struct in C, reference-free objects) maybe in Java 9 (Project Valhalla), as they will allow reducing the memory impact for many scenarios considerably. I'd prefer a mutable design, however - I understand why an immutable design is proposed, but the use cases I have in mind become much less elegant when having to overwrite instead of modify all the time.
2016-03-01 10:19 - Categories: English Coding Java

Protect your file server from the Locky trojan

The "Locky" trojan and similar trojans apparently can cause havoc on your file servers (you may have heard the reports of hospitals that had to pay thousands of dollars to be able to decrypt their files).
Obviously, this is a good reason to double-check your backups.
But as a Linux admin, you may want to consider additional security measures. Here is one suggestion (untested, because I do not run a Samba file server):
Enable logging in the Samba file server, and monitor the log file for the known file names created by Locky, i.e. files with the extension .locky or named _Locky_recover_instructions.txt.
If a user creates such a file, immediately ban his IP from accessing your file server, and send out an alert to the admin and the affected user.
This probably won't prevent much damage on the user's PC, but it should at least prevent it from doing much on your file server.
There also exist security modules such as "samba-virusfilter" that could probably be extended to cover this, too.
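The monitoring idea could be sketched roughly as follows. This is untested against a real Samba setup. It assumes, hypothetically, that some earlier step has already reduced the log to one created-file path per line. Banning the IP and sending the alert are left out. The class name is made up:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Locale;

// Hypothetical sketch; the input format (one file path per line) is an assumption.
final class LockyIndicator {
    // Returns true for the file names mentioned in the text:
    // anything ending in ".locky", or "_Locky_recover_instructions.txt".
    static boolean isSuspiciousFileName(String path) {
        // Strip directory components; accept both '/' and '\' separators.
        int cut = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        String name = path.substring(cut + 1).toLowerCase(Locale.ROOT);
        return name.endsWith(".locky")
                || name.equals("_locky_recover_instructions.txt");
    }

    // Reads paths from standard input and prints an alert line for each hit.
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        for (String line; (line = in.readLine()) != null; ) {
            String path = line.trim();
            if (isSuspiciousFileName(path)) {
                System.out.println("ALERT: possible Locky activity: " + path);
            }
        }
    }
}
```

The file name list would need updating as variants of the trojan appear, so treat this as a tripwire, not as protection.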

Sorry, I cannot provide you with step-by-step instructions because I am a Linux-only user. I do not run a Samba file server. I have only had conversations with friends about this trojan.
2016-02-26 10:16 - Categories: English Linux Security

ELKI 0.7.0 on Maven and GitHub

Version 0.7.0 of our data mining toolkit ELKI is now available on the project homepage, GitHub and Maven.
You can also clone this example project to get started easily.
What is new in ELKI 0.7.0? Too much, see the release notes, please!
What is ELKI exactly?
ELKI is a Java-based data mining toolkit. We focus on cluster analysis and outlier detection, because there are plenty of tools available for classification already. But there is a kNN classifier, and a number of frequent itemset mining algorithms in ELKI, too.
ELKI is highly modular. You can combine almost everything with almost everything else. In particular, you can combine algorithms such as DBSCAN, with arbitrary distance functions, and you can choose from many index structures to accelerate the algorithm. But because we separate them well, you can add a new index, or a new distance function, or a new data type, and still benefit from the other parts. In other tools such as R, you cannot easily add a new distance function into an arbitrary algorithm and get good performance - all the fast code in R is written in C and Fortran; and cannot be easily extended this way. In ELKI, you can define a new data type, new distance function, new index, and still use most algorithms. (Some algorithms may have prerequisites that e.g. your new data type does not fulfill, of course).
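To illustrate the general idea of separating the algorithm from the distance function, here is a generic sketch. This is NOT ELKI's actual API. The interface, class and method names are all hypothetical. The search routine never needs to know which distance it is given:

```java
import java.util.List;

// Hypothetical interface for illustration; not ELKI's API.
interface DistanceFunction<T> {
    double distance(T a, T b);
}

// Hypothetical algorithm: a linear-scan nearest neighbor search that works
// with any data type and any distance function.
final class NearestNeighbor {
    static <T> T nearest(T query, List<T> data, DistanceFunction<? super T> dist) {
        T best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (T candidate : data) {
            double d = dist.distance(query, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    // Two interchangeable distances on double[] vectors of equal length.
    static double euclidean(double[] x, double[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    static double manhattan(double[] x, double[] y) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            sum += Math.abs(x[i] - y[i]);
        }
        return sum;
    }
}
```

Swapping `NearestNeighbor::euclidean` for `NearestNeighbor::manhattan` changes the result without touching the search code. An index structure would slot in behind the same kind of interface.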
ELKI is also very fast. Of course good C code can be faster - but then it usually is not as modular and easy to extend anymore.
ELKI is documented. We have JavaDoc, and we annotate classes with their scientific references (see a list of all references we have). So you know which algorithm a class is supposed to implement, and can look up details there. This makes it very useful for science.
ELKI is not: a turnkey solution. It aims at researchers, developers and data scientists. If you have a SQL database, and want to do a point-and-click analysis of your data, please get a business solution instead with commercial support.
2015-11-27 18:27 - Categories: English Research Coding Technology

Ubuntu broke Java because of Unity

Unity, that is the Ubuntu user interface, that nobody else uses.

Since it is an Ubuntu-only thing, few applications have native support for its OSX-style hipster "global" menus.

For Java, someone once wrote a hack called java-swing-ayatana, or "jayatana", that is preloaded into the JVM via the environment variable JAVA_TOOL_OPTIONS. The hack seems to be unmaintained now.

Unfortunately, this hack seems to be broken now (Google has thousands of problem reports), and causes a NullPointerException or similar crashes in many applications; likely due to a change in OpenJDK 8.

Now all Java Swing applications appear to be broken for Ubuntu users, if they have the jayatana package installed. Congratulations!
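A quick way to see whether a given JVM is picking up the hack is to look at that environment variable from inside Java. This is a minimal sketch with hypothetical names. It only checks whether the variable mentions "jayatana" and makes no assumption about the exact agent path:

```java
import java.util.Locale;

// Hypothetical diagnostic helper; not part of any library.
final class JayatanaCheck {
    // True if the given JAVA_TOOL_OPTIONS value refers to jayatana.
    static boolean mentionsJayatana(String javaToolOptions) {
        return javaToolOptions != null
                && javaToolOptions.toLowerCase(Locale.ROOT).contains("jayatana");
    }

    public static void main(String[] args) {
        String opts = System.getenv("JAVA_TOOL_OPTIONS");
        if (mentionsJayatana(opts)) {
            System.out.println("jayatana is injected via JAVA_TOOL_OPTIONS: " + opts);
        } else {
            System.out.println("JAVA_TOOL_OPTIONS does not mention jayatana.");
        }
    }
}
```

If it is set, starting the application with JAVA_TOOL_OPTIONS unset in the environment is a simple way to test whether the crash goes away.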

And of course, you see bug reports everywhere. Matlab seems to no longer work for some, NetBeans appears to have issues, and I got a number of bug reports on ELKI because of Ubuntu. Thank you, not.

2015-09-29 09:57 - Categories: English Linux Java

@Zigo: Why I don't package Hadoop myself

A quick reply to Zigo's post:
Well, I looked at the Bigtop efforts because I needed Hadoop packages. But they are not very useful. They have lots of issues (including empty packages, naming conflicts etc.).
I filed a few bugs, and I even uploaded my fixes to Github. Some of that went unnoticed, because Sean Owen of Cloudera decided to remove all Debian packaging from Spark. But in the end, even with these fixes, the resulting packages do not live up to Debian quality standards (not to say, they would outright violate policy).
If you wanted to package Hadoop properly, you should ditch Apache Bigtop, and instead use the existing best practices for packaging. Using any of the Bigtop work just makes your job harder, by pulling in additional dependencies like their modified Groovy.
But whatever you do, you will be stuck in .jar dependency hell. Whatever you look at, it pulls in another batch of dependencies, that all need to be properly packaged, too. Here is the dependency chain of Hadoop:
[INFO] +- org.apache.hadoop:hadoop-hdfs:jar:2.6.0:compile
[INFO] |  +-
[INFO] |  +- org.mortbay.jetty:jetty:jar:6.1.26:compile
[INFO] |  +- org.mortbay.jetty:jetty-util:jar:6.1.26:compile
[INFO] |  +- com.sun.jersey:jersey-core:jar:1.9:compile
[INFO] |  +- com.sun.jersey:jersey-server:jar:1.9:compile
[INFO] |  |  \- asm:asm:jar:3.1:compile
[INFO] |  +- commons-cli:commons-cli:jar:1.2:compile
[INFO] |  +- commons-codec:commons-codec:jar:1.4:compile
[INFO] |  +- commons-io:commons-io:jar:2.4:compile
[INFO] |  +- commons-lang:commons-lang:jar:2.6:compile
[INFO] |  +- commons-logging:commons-logging:jar:1.1.3:compile
[INFO] |  +- commons-daemon:commons-daemon:jar:1.0.13:compile
[INFO] |  +- javax.servlet.jsp:jsp-api:jar:2.1:compile
[INFO] |  +- log4j:log4j:jar:1.2.17:compile
[INFO] |  +-
[INFO] |  +- javax.servlet:servlet-api:jar:2.5:compile
[INFO] |  +- org.codehaus.jackson:jackson-core-asl:jar:1.9.13:compile
[INFO] |  +- org.codehaus.jackson:jackson-mapper-asl:jar:1.9.13:compile
[INFO] |  +- tomcat:jasper-runtime:jar:5.5.23:compile
[INFO] |  +- xmlenc:xmlenc:jar:0.52:compile
[INFO] |  +- io.netty:netty:jar:3.6.2.Final:compile
[INFO] |  +- xerces:xercesImpl:jar:2.9.1:compile
[INFO] |  |  \- xml-apis:xml-apis:jar:1.3.04:compile
[INFO] |  \- org.htrace:htrace-core:jar:3.0.4:compile
[INFO] +- org.apache.hadoop:hadoop-auth:jar:2.6.0:compile
[INFO] |  +- org.slf4j:slf4j-api:jar:1.7.5:compile
[INFO] |  +- org.apache.httpcomponents:httpclient:jar:4.2.5:compile
[INFO] |  |  \- org.apache.httpcomponents:httpcore:jar:4.2.4:compile
[INFO] |  +-
[INFO] |  |  +-
[INFO] |  |  +-
[INFO] |  |  \-
[INFO] |  +- org.apache.zookeeper:zookeeper:jar:3.4.6:compile
[INFO] |  |  +- org.slf4j:slf4j-log4j12:jar:1.7.5:compile
[INFO] |  |  \- jline:jline:jar:0.9.94:compile
[INFO] |  \- org.apache.curator:curator-framework:jar:2.6.0:compile
[INFO] +- org.apache.hadoop:hadoop-common:jar:2.6.0:compile
[INFO] |  +- org.apache.hadoop:hadoop-annotations:jar:2.6.0:compile
[INFO] |  |  \-
[INFO] |  +- org.apache.commons:commons-math3:jar:3.1.1:compile
[INFO] |  +- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO] |  +- commons-net:commons-net:jar:3.1:compile
[INFO] |  +- commons-collections:commons-collections:jar:3.2.1:compile
[INFO] |  +- com.sun.jersey:jersey-json:jar:1.9:compile
[INFO] |  |  +- org.codehaus.jettison:jettison:jar:1.1:compile
[INFO] |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
[INFO] |  |  |  \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
[INFO] |  |  |     +-
[INFO] |  |  |     \- javax.activation:activation:jar:1.1:compile
[INFO] |  |  +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
[INFO] |  |  \- org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
[INFO] |  +-
[INFO] |  |  \- com.jamesmurty.utils:java-xmlbuilder:jar:0.4:compile
[INFO] |  +- commons-configuration:commons-configuration:jar:1.6:compile
[INFO] |  |  +- commons-digester:commons-digester:jar:1.8:compile
[INFO] |  |  |  \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
[INFO] |  |  \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
[INFO] |  +- org.apache.avro:avro:jar:1.7.4:compile
[INFO] |  |  +- com.thoughtworks.paranamer:paranamer:jar:2.3:compile
[INFO] |  |  \- org.xerial.snappy:snappy-java:jar:
[INFO] |  +-
[INFO] |  +- com.jcraft:jsch:jar:0.1.42:compile
[INFO] |  +- org.apache.curator:curator-client:jar:2.6.0:compile
[INFO] |  +- org.apache.curator:curator-recipes:jar:2.6.0:compile
[INFO] |  +-
[INFO] |  \- org.apache.commons:commons-compress:jar:1.4.1:compile
[INFO] |     \- org.tukaani:xz:jar:1.0:compile
[INFO] +- org.apache.hadoop:hadoop-core:jar:1.2.1:compile
[INFO] |  +- org.apache.commons:commons-math:jar:2.1:compile
[INFO] |  +- tomcat:jasper-compiler:jar:5.5.23:compile
[INFO] |  +- org.mortbay.jetty:jsp-api-2.1:jar:6.1.14:compile
[INFO] |  |  \- org.mortbay.jetty:servlet-api-2.5:jar:6.1.14:compile
[INFO] |  +- org.mortbay.jetty:jsp-2.1:jar:6.1.14:compile
[INFO] |  |  \- ant:ant:jar:1.6.5:compile
[INFO] |  +- commons-el:commons-el:jar:1.0:compile
[INFO] |  +- hsqldb:hsqldb:jar:
[INFO] |  +- oro:oro:jar:2.0.8:compile
[INFO] |  \- org.eclipse.jdt:core:jar:3.1.1:compile
So the first step for packaging Hadoop would be to check which of these dependencies are not yet packaged in Debian... I guess 1/3 is not.
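As a starting point for such a check, the tree output can be reduced to a sorted list of unique groupId:artifactId pairs. The sketch below uses a hypothetical class name and assumes the line format shown above. Comparing the list against the Debian archive is left out:

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; parses lines in the style of the dependency tree above.
final class DependencyTreeParser {
    // Matches the "+- group:artifact:" or "\- group:artifact:" part of a line
    // such as "[INFO] |  +- commons-cli:commons-cli:jar:1.2:compile".
    private static final Pattern COORDINATE =
            Pattern.compile("[+\\\\]- ([^:\\s]+):([^:\\s]+):");

    // Collects each groupId:artifactId once, ignoring versions and scopes.
    // Lines without a coordinate are skipped.
    static SortedSet<String> uniqueArtifacts(List<String> lines) {
        SortedSet<String> result = new TreeSet<>();
        for (String line : lines) {
            Matcher m = COORDINATE.matcher(line);
            if (m.find()) {
                result.add(m.group(1) + ":" + m.group(2));
            }
        }
        return result;
    }
}
```

Dropping the version is deliberate: the first question is whether a library is packaged at all, and version mismatches are the next layer of pain.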
Maybe, we should just rip out some of these dependencies with a cluebat. For the stupid reason of making a web frontend (which doesn't provide a lot of functionality, and I doubt many people use it at all), Hadoop embeds not just one web server, but two: Jetty and Netty...
Things would also be easier if e.g. S3 support, htrace, the web frontend, and different data serializations were properly put into modules. Then you could postpone S3 support, for example.
As I said, the deeper you dig, the crazier it gets.
If the OpenDataPlatform efforts of Hortonworks, Pivotal and IBM were anything but a marketing gag, they would try to address these technical issues. Instead, they make things worse by specifying yet another fatter core, including Ambari, Apache's attempt to automatically make a mess out of your servers - essentially, they are now adding the ultimate root shell, for all those cases where unaudited puppet commands and "curl | sudo bash" were not bad enough:
  command1 = as_sudo(["cat","/etc/passwd"]) + " | grep user"
(from the Ambari python documentation)
The closer you look, the more you would rather die than use this.
P.S. I have updated the libtrove3-java package (Java collections for primitive types; but no longer the fastest such library), so that it is now in the local maven repository (/usr/share/maven-repo) and that it can be rebuilt reproducibly (the build user name is no longer in the jar manifest).
2015-05-03 21:17 - Categories: English Linux

Your big data toolchain is a big security risk!

This post is a follow-up to my earlier post on the "sad state of sysadmin in the age of containers". While I was drafting this post, that story got picked up by HackerNews, Reddit and Twitter, sending a lot of comments and emails my way. Surprisingly many of the comments are supportive of my impression - I would have expected to see many more insults along the lines of "you just don't like my-favorite-tool, so you rant against using it". But a lot of people seem to share my concerns. Thanks, you surprised me!
Here is the new rant post, in the slightly different context of big data:

Everybody is doing "big data" these days. Or at least, pretending to do so to upper management. A lot of the time, there is no big data. People do more data analysis than before, and therefore stick the "big data" label on it to promote themselves and get the green light from management.
"Big data" is not a technical term. It is a business term, referring to any attempt to get more value out of your business by analyzing data you did not use before. From this point of view, most of such projects are indeed "big data" as in "data-driven revenue generation" projects. It may be unsatisfactory to those interested in the challenges of volume and the other "V's", but this is how the term is used in practice.
But even in those cases where the volume and complexity of the data would warrant the use of all the new toys (pardon: tools), people overlook a major problem: security of their systems and of their data.

The currently offered "big data technology stack" is anything but secure. Sure, companies try to earn money with security add-ons such as Kerberos authentication to sell multi-tenancy, and with offering their version of Hadoop (their "Hadoop distribution").
The security problem is deep inside the "stack". It comes from the way this world ticks: the world of people that constantly follow the latest tool-of-the-day. In many of the projects, you no longer have mostly Linux developers that co-function as system administrators, but you see a lot of Apple iFanboys now. They live in a world where technology is outdated after half a year, so you will not need to support a product longer than that. They love reinstalling their development environment frequently - because each time, they get to change something. They also live in a world where you would simply get a new model if your machine breaks down at some point. (Note that this will not work well for your big data project, restarting it from scratch every half year...)
And while Mac users have recently been surprisingly unaffected by various attacks (and unconcerned about e.g. GoToFail, or the failure to fix the rootpipe exploit) the operating system is not considered to be very secure. Combining this with users who do not care is an explosive mixture...
This type of developer, who is good at getting a prototype website for a startup kicking in a short amount of time, rolling out new features every day to beta test on the live users, is what currently makes the Dotcom 2.0 bubble grow. It's also this type of user that mainstream products aim at - he has already forgotten what was half a year ago, but is looking for the next tech product to be announced soon, and willing to buy it as soon as it is available...
This attitude causes a problem at the very heart of the stack: in the way packages are built, upgrades (and security updates) are handled etc. - nobody is interested in consistency or reproducibility anymore.
Someone commented on my blog that all these tools "seem to be written by 20 year old" kids. He probably is right. It wouldn't be so bad if we had some experienced sysadmins with a cluebat around. People that have experience on how to build systems that can be maintained for 10 years, and securely deployed automatically, instead of relying on puppet hacks, wget and unzipping of unsigned binary code.
I know that a lot of people don't want to hear this, but:
Your Hadoop system contains unsigned binary code in a number of places, that people downloaded, uploaded and redownloaded a countless number of times. There is no guarantee that a .jar ever was what people think it is.
Hadoop has a huge set of dependencies, and little of this has been seriously audited for security - and in particular not in a way that would allow you to check that your binaries are built from this audited code anyway.
There might be functionality hidden in the code that just sits there and waits for a system with a hostname somewhat like "" to start looking for its command and control server to steal some key data from your company. The way your systems are built they probably do not have much of a firewall guarding against such. Much of the software may be constantly calling home, and your DevOps would not notice (nor would they care, anyway).
The mentality of "big data stacks" these days is that of Windows Shareware in the 90s. People downloading random binaries from the Internet, not adequately checked for security (ever heard of anybody running an AntiVirus on his Hadoop cluster?) and installing them everywhere.
And worse: not even keeping track of what they installed over time, or how. Because the tools change every year. But what if that developer leaves? You may never be able to get his stuff running properly again!
I predict that within the next 5 years, we will have a number of security incidents in various major companies. This is industrial espionage heaven. A lot of companies will cover it up, but some leaks will reach mass media, and there will be a major backlash against this hipster way of stringing together random components.
There is a big "Hadoop bubble" growing, that will eventually burst.
In order to get into a trustworthy state, the big data toolchain needs to:
  • Consolidate. There are too many tools for every job. There are even too many tools to manage your too many tools, and frontends for your frontends.
  • Lose weight. Every project depends on way too many other projects, each of which only contributes a tiny fragment for a very specific use case. Get rid of most dependencies!
  • Modularize. If you can't get rid of a dependency, but it is still only of interest to a small group of users, make it an optional extension module that the user only has to install if he needs this particular functionality.
  • Be buildable. Make sure that everybody can build everything from scratch, without having to rely on Maven or Ivy or SBT downloading something automagically in the background. Test your builds offline, with a clean build directory, and document them! Everything must be rebuildable by any sysadmin in a reproducible way, so he can ensure a bug fix is really applied.
  • Distribute. Do not rely on binary downloads from your CDN as sole distribution channel. Instead, encourage and support alternate means of distribution, such as the proper integration in existing and trusted Linux distributions.
  • Maintain compatibility. Successful big data projects will not be fire-and-forget. Eventually, they will need to go into production and then it will be necessary to run them over years. It will be necessary to migrate them to newer, larger clusters. And you must not lose all the data while doing so.
  • Sign. Code needs to be signed, end-of-story.
  • Authenticate. All downloads need to come with a way of checking the downloaded files agree with what you uploaded.
  • Integrate. The key feature that makes Linux systems so very good at servers is the all-round integrated software management. You tell the system to update, and you have different update channels available, such as a more conservative "stable/LTS" channel, a channel that gets you the latest version after basic QA, and a channel that gives you the latest versions shortly after their upload to help with QA. It covers almost all software on your system, so it does not matter whether the security fix is in your kernel, web server, library, auxiliary service, extension module, scripting language etc. - it will pull this fix and update you in no time.
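As a concrete illustration of the "Authenticate" item, here is a minimal sketch of comparing a download against a published SHA-256 checksum, using only the JDK's MessageDigest. The class and method names are hypothetical, and it assumes the expected checksum reached you over a channel you trust. A checksum fetched from the same untrusted server proves little, which is why signatures are the stronger tool:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper; a sketch of checksum verification, not a signing scheme.
final class ChecksumCheck {
    // Returns the SHA-256 digest of the data as lowercase hex.
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(data);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // True if the data hashes to the expected hex string (case-insensitive).
    static boolean matches(byte[] data, String expectedHex) {
        byte[] actual = sha256Hex(data).getBytes(StandardCharsets.US_ASCII);
        byte[] expected = expectedHex.trim().toLowerCase(Locale.ROOT)
                .getBytes(StandardCharsets.US_ASCII);
        return MessageDigest.isEqual(actual, expected);
    }
}
```

In a real build, the bytes would come from the downloaded .jar and the expected value from a signed release file, and a mismatch should abort the build, not just log a warning.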
Now you may argue that Hortonworks, Cloudera, Bigtop etc. already provide packages. Well ... they provide crap. They have something they call a "package", but it fails by any quality standards. Technically, a Wartburg is a car; but not one that would pass today's safety regulations...
For example, they only support Ubuntu 12.04 - a three year old Ubuntu is the latest version they support... Furthermore, these packages are roughly the same. Cloudera eventually handed over their efforts to "the community" (in other words, they gave up on doing it themselves, and hoped that someone else would clean up their mess); and Hortonworks HDP (and maybe Pivotal HD, too) is derived from these efforts, too. Much of what they do is offering some extra documentation and training for the packages they built using Bigtop with minimal effort.
The "spark" .deb packages of Bigtop, for example, are empty. They forgot to include the .jars in the package. Do I really need to give more examples of bad packaging decisions? All bigtop packages now depend on their own version of groovy - for a single script. Instead of rewriting this script in an already required language - or in a way that it would run on the distribution-provided groovy version - they decided to make yet another package, bigtop-groovy.
When I read about Hortonworks and IBM announcing their "Open Data Platform", I could not care less. As far as I can tell, they are only sticking their label on the existing tools anyway. Thus, I'm also not surprised that Cloudera and MapR do not join this rebranding effort - given the low divergence of Hadoop, who would need such a label anyway?
So why does this matter? Essentially, if anything does not work, you are currently toast. Say there is a bug in Hadoop that makes it fail to process your data. Your business is belly-up because of that, no data is processed anymore, you are a vegetable. Who is going to fix it? All these "distributions" are built from the same, messy, branch. There are probably only a dozen people around the world who have figured this out well enough to be able to fully build this toolchain. Apparently, none of the "Hadoop" companies are able to support a newer Ubuntu than 12.04 - are you sure they have really understood what they are selling? I have doubts. All the freelancers out there, they know how to download and use Hadoop. But can they get that business-critical bug fix into the toolchain to get you up and running again? This is much worse than with Linux distributions. They have build daemons - servers that continuously check they can compile all the software that is there. You need to type two well-documented lines to rebuild a typical Linux package from scratch on your workstation - any experienced developer can follow the manual, and get a fix into the package. There are even people who try to recompile complete distributions with a different compiler to discover compatibility issues early that may arise in the future.
In other words, the "Hadoop distribution" they are selling you is not code they compiled themselves. It is mostly .jar files they downloaded from unsigned, unencrypted, unverified sources on the internet. They have no idea how to rebuild these parts, who compiled that, and how it was built. At most, they know for the very last layer. You can figure out how to recompile the Hadoop .jar. But when doing so, your computer will download a lot of binaries. It will not warn you of that, and they are included in the Hadoop distributions, too.
As is, I cannot recommend trusting Hadoop with your business data.
It is probably okay to copy the data into HDFS and play with it - in particular if you keep your cluster and development machines isolated with strong firewalls - but be prepared to toss everything and restart from scratch. It's not ready yet for prime time, and as they keep on adding more and more unneeded cruft, it does not look like it will be ready anytime soon.

One more example of the immaturity of the toolchain:
The scala package from cannot be cleanly installed as an upgrade to the old scala package that already exists in Ubuntu and Debian (and the distributions seem to have given up on compiling a newer Scala due to a stupid Catch-22 build process, making it very hacky to bootstrap scala and sbt compilation).
And the "upstream" package also cannot be easily fixed, because it is not built with standard packaging tools, but with an automagic sbt helper that lacks important functionality (in particular, access to the Replaces: field, or even cleaner: a way of splitting the package properly into components) instead - obviously written by someone with 0 experience in packaging for Ubuntu or Debian; and instead of using the proven tools, he decided to hack some wrapper that tries to automatically do things the wrong way...

I'm convinced that most "big data" projects will turn out to be a miserable failure. Either due to overmanagement or undermanagement, and due to lack of experience with the data, tools, and project management... Except that - of course - nobody will be willing to admit these failures. Since all these projects are political projects, they by definition must be successful, even if they never go into production, and never earn a single dollar.
2015-04-26 15:41 - Categories: English tech Linux

The sad state of sysadmin in the age of containers

System administration is in a sad state. It is a mess.
I'm not complaining about old-school sysadmins. They know how to keep systems running, manage update and upgrade paths.
This rant is about containers, prebuilt VMs, and the incredible mess they cause because their concept lacks notions of "trust" and "upgrades".
Consider for example Hadoop. Nobody seems to know how to build Hadoop from scratch. It's an incredible mess of dependencies, version requirements and build tools.
None of these "fancy" tools can still be built by a traditional make command. Every tool has to come up with its own, incompatible, and non-portable "method of the day" of building.
And since nobody is still able to compile things from scratch, everybody just downloads precompiled binaries from random websites. Often without any authentication or signature.
NSA and virus heaven. You don't need to exploit any security hole anymore. Just make an "app" or "VM" or "Docker" image, and have people load your malicious binary to their network.
The Hadoop Wiki Page of Debian is a typical example. Essentially, people gave up in 2010 on being able to build Hadoop from source for Debian and offer nice packages.
To build Apache Bigtop, you apparently first have to install puppet3. Let it download magic data from the internet. Then it tries to run sudo puppet to enable the NSA backdoors (for example, it will download and install an outdated precompiled JDK, because it considers you too stupid to install Java.) And then hope the gradle build doesn't throw a 200 line useless backtrace.
I am not joking. It will try to execute commands such as e.g.
/bin/bash -c "wget ; dpkg -x ./scala-2.10.3.deb /"
Note that it doesn't even install the package properly, but extracts it to your root directory. The download does not check any signature, not even SSL certificates. (Source: Bigtop puppet manifests)
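For contrast, checking a download against a known-good digest before unpacking it takes only a few lines. Here is a minimal sketch in Java, assuming the expected SHA-256 digest was obtained over a trusted channel. The class and method names are made up for illustration, and a plain digest check is still weaker than verifying a proper package signature:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

/** Hypothetical helper: refuse to use downloaded bytes unless their digest matches. */
final class DownloadCheck {
  /** Returns the lowercase hex SHA-256 digest of the given bytes. */
  static String sha256Hex(byte[] data) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      StringBuilder hex = new StringBuilder();
      for (byte b : md.digest(data)) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is required to be available on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  /** True only if the data hashes to the expected digest (constant-time comparison). */
  static boolean matches(byte[] data, String expectedHex) {
    byte[] actual = sha256Hex(data).getBytes(StandardCharsets.US_ASCII);
    byte[] expected = expectedHex.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.US_ASCII);
    return MessageDigest.isEqual(actual, expected);
  }

  public static void main(String[] args) {
    // Stand-in for the bytes of a downloaded file.
    byte[] payload = "abc".getBytes(StandardCharsets.US_ASCII);
    String trusted = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad";
    System.out.println(matches(payload, trusted) ? "digest ok" : "digest mismatch, do not install");
  }
}
```

The point is not this particular helper, but that the check is cheap; skipping it is a choice, not a technical necessity.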
Even if your build works, it will involve Maven downloading unsigned binary code from the internet and using that for building.
Instead of writing clean, modular architecture, everything these days morphs into a huge mess of interlocked dependencies. Last I checked, the Hadoop classpath was already over 100 jars. I bet it is now 150, without even using any of the HBaseGiraphFlumeCrunchPigHiveMahoutSolrSparkElasticsearch (or any other of the Apache chaos) mess yet.
Stack is the new term for "I have no idea what I'm actually using".
Maven, ivy and sbt are the go-to tools for having your system download unsigned binary data from the internet and run it on your computer.
And with containers, this mess gets even worse.
Ever tried to security update a container?
Essentially, the Docker approach boils down to downloading an unsigned binary, running it, and hoping it doesn't contain any backdoor into your company's network.
Feels like downloading Windows shareware in the 90s to me.
When will the first docker image appear which contains the Ask toolbar? The first internet worm spreading via flawed docker images?

Back then, years ago, Linux distributions were trying to provide you with a safe operating system. With signed packages, built from a web of trust. Some even work on reproducible builds.
But then, everything got Windows-ized. "Apps" were the rage, which you download and run, without being concerned about security, or the ability to upgrade the application to the next version. Because "you only live once".
Update: it was pointed out that this started way before Docker: »Docker is the new 'curl | sudo bash'«. That's right, but it's now pretty much mainstream to download and run untrusted software in your "datacenter". That is bad, really bad. Before, admins would try hard to prevent security holes, now they call themselves "devops" and happily introduce them to the network themselves!
2015-03-12 14:04 - Categories: English Linux Debian

Year 2014 in Review as Seen by a Trend Detection System

We ran our trend detection tool Signi-Trend (published at KDD 2014) on news articles collected for the year 2014. We removed the category of financial news, which is overrepresented in the data set. Below are the (described) results, from the top 50 trends (I will push the raw result to appspot if possible due to file limits).
I have highlighted the top 10 trends in bold, but otherwise ordered them chronologically.
Updated: due to an error in a regexp, I had filtered out too many stories. The new results use more articles.

2014-01-29: Obama's state of the union address
2014-02-07: Sochi Olympics gay rights protests
2014-02-08: Sochi Olympics first results
2014-02-19: Violence in Ukraine and Maidan in Kiev
2014-02-20: Wall street reaction to Facebook buying WhatsApp
2014-02-22: Yanukovich leaves Kiev
2014-02-28: Crimea crisis begins
2014-03-01: Crimea crisis escalates further
2014-03-02: NATO meeting on Crimea crisis
2014-03-04: Obama presents U.S. fiscal budget 2015 plan
2014-03-08: Malaysia Airlines MH-370 missing in South China Sea
2014-03-08: MH-370: many Chinese on board of missing airplane
2014-03-15: Crimean status referendum (upcoming)
2014-03-18: Crimea now considered part of Russia by Putin
2014-03-21: Russian stocks fall after U.S. sanctions.
2014-04-02: Chile quake and tsunami warning
2014-04-09: False positive? experience + views
2014-04-13: Pro-russian rebels in Ukraine's Sloviansk
2014-04-17: Russia-Ukraine crisis continues
2014-04-22: French deficit reduction plan pressure
2014-04-28: Soccer World Cup coverage: team lineups
2014-05-14: MERS reports in Florida, U.S.
2014-05-23: Russia feels sanctions impact
2014-05-25: EU elections
2014-06-06: World cup coverage
2014-06-13: Islamic state Camp Speicher massacre in Iraq
2014-06-14: Soccer world cup: Spain surprisingly destroyed by Netherlands
2014-07-05: Soccer world cup quarter finals
2014-07-17: Malaysian Airlines MH-17 shot down over Ukraine
2014-07-18: Russian blamed for 298 dead in airline downing
2014-07-19: Independent crash site investigation demanded
2014-07-20: Israel shelling Gaza causes 40+ casualties in a day
2014-08-07: Russia bans food imports from EU and U.S.
2014-08-08: Obama orders targeted air strikes in Iraq
2014-08-20: IS murders journalist James Foley, air strikes continue
2014-08-30: EU increases sanctions against Russia
2014-09-05: NATO summit with respect to IS and Ukraine conflict
2014-09-11: Scottish referendum upcoming - poll results are close
2014-09-23: U.N. on legality of U.S. air strikes in Syria against IS
2014-09-26: Star manager Bill Gross leaves Allianz/PIMCO for Janus
2014-10-22: Ottawa parliament shooting
2014-10-26: EU banking review
2014-11-05: U.S. Senate and governor elections
2014-11-12: Foreign exchange manipulation investigation results
2014-11-17: Japan recession
2014-12-11: CIA prisoner and U.S. torture centers revealed
2014-12-15: Sydney cafe hostage siege
2014-12-17: U.S. and Cuba relations improve unexpectedly
2014-12-18: Putin criticizes NATO, U.S., Kiev
2014-12-28: AirAsia flight QZ-8501 missing

As you can guess, we are really happy with this result - just like the result for 2013, it mentions (almost) all the key events.
There probably is one "false positive" there: 2014-04-09 has a lot of articles talking about "experience" and "views", but not all refer to the same topic (we did not do topic modeling yet).
There are also some events missing that we would have liked to appear; many of these narrowly missed the top 50, but do appear in the top 100, such as the Sony cyberattack (#51) and the Ferguson riots on November 11 (#66).
You can also explore the results online in a snapshot.
2015-01-22 20:00 - Categories: English Research

Big data predictions for 2015

My big data predictions for 2015:
  1. Big data will continue to fail to deliver for most companies.
    This has several reasons, including in particular: 1: lack of data to analyze that actually benefits from big data tools and approaches (and which is not better analyzed with traditional tools). 2: lack of talent, and failure to attract analytics talent. 3: stuck in old IT, and too inflexible to allow using modern tools (if you want to use big data, you will need a flexible "in-house development" type of IT that can install tools, try them, abandon them, without going up and down the management chains) 4: too much marketing. As long as big data is being run by the marketing department, not by developers, it will fail.
  2. Project consolidation: we have seen hundreds of big data software projects in recent years. Plenty of them on Apache, too. But the current state is a mess: there is massive redundancy, and lots and lots of projects are more-or-less abandoned. Cloudera ML, for example, is dead: superseded by Oryx and Oryx 2. More projects will be abandoned, because we have way too many (including far too many NoSQL databases that fail to outperform SQL solutions like PostgreSQL). As is, we have dozens of competing NoSQL databases, dozens of competing ML tools, dozens of everything.
  3. Hype: the hype will continue, but eventually (when there is too much negative press on the term "big data" due to failed projects and inflated expectations) move on to other terms. The same is also happening to "data science", so I guess the next will be "big analytics", "big intelligence" or something like that.
  4. Less openness: we have seen lots of open-source projects. However, many decided to go with Apache-style licensing - always ready to close down their sharing, and no longer share their development. In 2015, we'll see this happen more often, as companies try to make money off their reputation. At some point, copyleft licenses like GPL may return to popularity due to this.
2015-01-13 16:01 - Categories: English Research

Java sum-of-array comparisons

This is a follow-up to the post by Daniel Lemire on a close topic.
Daniel Lemire has experimented with boxing a primitive array in an interface, and has been trying to measure the cost.
I must admit I was a bit sceptical about his results, because I have seen Java successfully inlining code in various situations.
For an experimental library I occasionally work on, I had been spending quite a bit of time on benchmarking. Previously, I had used Google Caliper for it (I even wrote an evaluation tool for it to produce better statistics). However, Caliper hasn't seen many updates recently, and there is a very attractive similar tool at OpenJDK now, too: the Java Microbenchmark Harness, JMH (it can actually be used for benchmarking at other scales, too).
Now that I have experience in both, I must say I consider JMH superior, and I have switched over my microbenchmarks to it. One of the nice things is that it doesn't make this distinction of micro vs. macrobenchmarks, and the runtime of your benchmarks is easier to control.
I largely recreated his task using JMH. The benchmark task is easy: compute the sum of an array; the question is how much the cost is when allowing different data structures than double[].
My results, however, are quite different. The statistics reported by JMH indicate that the differences may not be significant, which suggests that Java manages to inline the code properly.
adapterFor       1000000  thrpt  50  836,898 ± 13,223  ops/s
adapterForL      1000000  thrpt  50  842,464 ± 11,008  ops/s
adapterForR      1000000  thrpt  50  810,343 ±  9,961  ops/s
adapterWhile     1000000  thrpt  50  839,369 ± 11,705  ops/s
adapterWhileL    1000000  thrpt  50  842,531 ±  9,276  ops/s
boxedFor         1000000  thrpt  50  848,081 ±  7,562  ops/s
boxedForL        1000000  thrpt  50  840,156 ± 12,985  ops/s
boxedForR        1000000  thrpt  50  817,666 ±  9,706  ops/s
boxedWhile       1000000  thrpt  50  845,379 ± 12,761  ops/s
boxedWhileL      1000000  thrpt  50  851,212 ±  7,645  ops/s
forSum           1000000  thrpt  50  845,140 ± 12,500  ops/s
forSumL          1000000  thrpt  50  847,134 ±  9,479  ops/s
forSumL2         1000000  thrpt  50  846,306 ± 13,654  ops/s
forSumR          1000000  thrpt  50  831,139 ± 13,519  ops/s
foreachSum       1000000  thrpt  50  843,023 ± 13,397  ops/s
whileSum         1000000  thrpt  50  848,666 ± 10,723  ops/s
whileSumL        1000000  thrpt  50  847,756 ± 11,191  ops/s
The suffix is the iteration type: sum using for loops, with a local variable for the length (L), or in reverse order (R); while loops (again with a local variable for the length). The prefix is the data layout: the primitive array, the array using a static adapter (which is the approach I have been using in many implementations in cervidae) and using a "boxed" wrapper class around the array (roughly the approach that Daniel Lemire has been investigating). On the primitive array, I also included the foreach loop approach (for(double v:array){).
If you look at the standard deviations, the results are pretty much identical, except for reverse loops. This is not surprising, given the strong inlining capabilities of Java - all of these variants will lead to nearly the same machine code after warmup and HotSpot optimization.
I do not have a full explanation of the differences the others have been seeing. There is no "polymorphism" occurring here (at runtime) - there is only a single Array implementation in use; but this was the same with his benchmark.
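To make the three data layouts concrete, here is a rough sketch of what is being compared. This is not the actual benchmark code: the names ArrayAdapter, DOUBLE_ARRAY, Array and BoxedArray are invented for illustration, only the "for loop with local length" variant is shown, and a real measurement needs a harness such as JMH rather than a plain main method.

```java
/** Sketch of the three data layouts compared above; all names are hypothetical. */
final class SumLayouts {
  /** Static adapter: knows how to read from some array-like type A. */
  interface ArrayAdapter<A> {
    int size(A array);
    double get(A array, int index);
  }

  /** The only adapter implementation in use, shared as a stateless instance. */
  static final ArrayAdapter<double[]> DOUBLE_ARRAY = new ArrayAdapter<double[]>() {
    @Override public int size(double[] array) { return array.length; }
    @Override public double get(double[] array, int index) { return array[index]; }
  };

  /** "Boxed" layout: the array is hidden behind an interface. */
  interface Array {
    int size();
    double get(int index);
  }

  /** The only Array implementation in use, wrapping a primitive array. */
  static final class BoxedArray implements Array {
    private final double[] data;
    BoxedArray(double[] data) { this.data = data; }
    @Override public int size() { return data.length; }
    @Override public double get(int index) { return data[index]; }
  }

  /** Primitive array, for loop with the length in a local variable. */
  static double forSumL(double[] data) {
    double sum = 0;
    for (int i = 0, len = data.length; i < len; i++) {
      sum += data[i];
    }
    return sum;
  }

  /** Same loop, but every access goes through the static adapter. */
  static <A> double adapterForL(A data, ArrayAdapter<A> adapter) {
    double sum = 0;
    for (int i = 0, len = adapter.size(data); i < len; i++) {
      sum += adapter.get(data, i);
    }
    return sum;
  }

  /** Same loop, but every access goes through the wrapper object. */
  static double boxedForL(Array data) {
    double sum = 0;
    for (int i = 0, len = data.size(); i < len; i++) {
      sum += data.get(i);
    }
    return sum;
  }

  public static void main(String[] args) {
    double[] data = {1.5, 2.5, 3.0};
    System.out.println(forSumL(data));
    System.out.println(adapterForL(data, DOUBLE_ARRAY));
    System.out.println(boxedForL(new BoxedArray(data)));
  }
}
```

All three loops do the same additions in the same order; the only difference is the indirection that the JIT compiler has to see through.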
Here is a visualization of the results (sorted by average):
Result boxplots
As you can see, most results are indiscernible. The measurement standard deviation is higher than the individual differences. If you run the same benchmark again, you will likely get a different ranking.
Note that performance may - drastically - drop once you use multiple adapters or boxing classes in the same hot codepath. Java Hotspot keeps statistics on the classes it sees, and as long as it only sees 1-2 different types, it performs quite aggressive optimizations instead of doing "virtual" method calls.
2014-12-22 23:04 - Categories: English Coding Java

Installing Debian with sysvinit

First let me note that I am using systemd, so these things here are untested by me. See e.g. Petter's and Simon's blog entries on the same overall topic.
According to the Debian installer maintainers, the only accepted way to install Debian with sysvinit is to use preseeding. This can either be done at the installer boot prompt by manually typing the magic spell:
preseed/late_command="in-target apt-get install -y sysvinit-core"
or by using a preseeding file (which is a really nice feature I used for installing my Hadoop nodes) to do the same:
d-i preseed/late_command string in-target apt-get install -y sysvinit-core
If you are a sysadmin, using preseeding can save you a lot of typing. Put all your desired configuration into preseeding files, put them on a webserver (best with a short name resolvable by local DNS). Let's assume you have set up the DNS name, and your DHCP is configured such that is on the DNS search list. You can also add a vendor extension to DHCP to serve a full URL. Manually enabling preseeding then means adding
auto url=d-i
to the installer boot command line (d-i is the hostname I suggested to set up in your DNS before, and the full URL would then be Preseeding is well documented in Appendix B of the installer manual, but nevertheless will require a number of iterations to get everything to work as desired for a fully automatic install like I used for my Hadoop nodes.

There might be an easier option.
I have filed a wishlist bug suggesting to use the tasksel mechanism to allow the user to choose sysvinit at installation time. However, it got turned down by the Debian installer maintainers quite rudely with a "No." - essentially this is a "shut the f... up and go away", which is in my opinion an inappropriate way to dismiss a reasonable user wishlist request.
Since I don't intend to use sysvinit anymore, I will not be pursuing this option further. It is, as far as I can tell, still untested. If it works, it might be the least-effort, least-invasive option to allow the installation of Jessie with sysvinit (except for the command line magic above).
If you have interest in sysvinit, you (because I don't use sysvinit) should now test if this approach works.
  1. Get the patch proposed to add a task-sysvinit package.
  2. Build an installer CD with this tasksel (maybe this documentation is helpful for this step).
  3. Test whether the patch works. Report results to the bug report above, so that others interested in sysvinit can find them easily.
  4. Find and fix bugs if it didn't work. Repeat.
  5. Publish the modified ("forked") installer, and get user feedback.
If you are then still up for a fight, you can try to convince the maintainers (or go the nasty way, and ask the CTTE for their opinion, to start another flamewar and make more maintainers give up) that this option should be added to the mainline installer. And hurry up, or you may at best get this into Jessie reloaded, 8.1. - chances are that the release manager will not accept such patches this late anymore. The sysvinit supporters should have investigated this option much, much earlier instead of losing time on the GR.
Again, I won't be doing this job for you. I'm happy with systemd. But patches and proof-of-concept is what makes open source work, not GRs and MikeeUSA's crap videos spammed to the LKML...
(And yes, I am quite annoyed by the way the Debian installer maintainers handled the bug report. This is not how open-source collaboration is supposed to work. I tried to file a proper wishlist bug report, suggesting a solution that I could not find discussed anywhere before, and got back just this "No. Shut up." answer. I'm not sure if I will be reporting a bug in debian-installer ever again, if this is the way they handle bug reports ...)
I do care about our users, though. If you look at popcon "vote" results, we have 4179 votes for sysvinit-core and 16918 votes for systemd-sysv (graph) indicating that of those already testing jessie and beyond - neglecting 65 upstart votes, and assuming that there is no bias to not-upgrade if you prefer sysvinit - about 20% appear to prefer sysvinit (in fact, they may even have manually switched back to sysvinit after being upgraded to systemd unintentionally?). These are users that we should listen to, and that we should consider adding an installer option for, too.
2014-11-25 09:24 - Categories: English Linux Debian

What the GR outcome means for the users

The GR outcome is: no GR necessary
This is good news.
Because it says: Debian will remain Debian, as it was the last 20 years.
For 20 years, we have tried hard to build the "universal operating system", and give users a choice. We've often had alternative software in the archive. Debian has come up with various tools to manage alternatives over time, and for example allows you to switch the system-wide Java.
You can still run Debian with sysvinit. There are plenty of Debian Developers who will fight for this to be possible in the future.
The outcome of this resolution says:
  • Using a GR to force others is the wrong approach of getting compatibility.
  • We've offered choice before, and we trust our fellow developers to continue to work towards choice.
  • Write patches, not useless GRs. We're coders, not bureaucrats.
  • We believe we can do this, without making it a formal MUST requirement. Or even a SHOULD requirement. Just do it.
The sysvinit proponents may perceive this decision as having "lost". But they just don't realize they won, too. Because the GR may easily have backfired on them. The GR was not "every package must support sysvinit". It was also "every sysvinit package must support systemd". Here is an example: eudev, a non-systemd fork of udev. It is not yet in Debian, but I'm fairly confident that someone will make a package of it after the release, for the next Debian. Given the text of the GR, this package might have been inappropriate for Debian, unless it also supports systemd. But systemd has its own udev - there is no reason to force eudev to work with systemd, is there?
Debian is about choice. This includes the choice to support different init systems as appropriate. Not accepting a proper patch that adds support for a different init would be perceived as a major bug, I'm assured.
A GR doesn't ensure choice. It only is a hammer to annoy others. But it doesn't write the necessary code to actually ensure compatibility.
If GNOME at some point decides that systemd as pid 1 is a must, the GR only would have left us three options: A) fork the previous version, B) remove GNOME altogether, C) remove all other init systems (so that GNOME is compliant). Does this add choice? No.
Now, we can preserve choice: if GNOME decides to go systemd-pid1-only, we can both include a forked GNOME, and the new GNOME (depending on systemd, which is allowed without the GR). Or any other solution that someone codes and packages...
Don't fear that systemd will magically become a must. Trust that the Debian Developers will continue what they have been doing the last 20 years. Trust that there are enough Debian Developers that don't run systemd. Because they do exist, and they'll file bugs where appropriate. Bugs and patches are the appropriate tools, not GRs (or trolling).
2014-11-19 20:58 - Categories: English Linux Debian

Generate iptables rules via pyroman

Vincent Bernat blogged on using Netfilter rulesets, pointing out that inserting the rules one-by-one using iptables calls may leave your firewall temporarily incomplete, eventually half-working, and that this approach can be slow.
He's right with that, but there are tools that do this properly. ;-)
Some years ago, for a multi-homed firewall, I wrote a tool called Pyroman. Using rules specified either in Python or XML syntax, it generates a firewall ruleset for you.
But it also addresses the points Vincent raised:
  • It uses iptables-restore to load the firewall more efficiently than by calling iptables a hundred times
  • It will backup the previous firewall, and roll-back on errors (or lack of confirmation, if you are remote and use --safe)
It also has a nice feature for use in staging: it can generate firewall rule sets offline, so that you can review them before use, or transfer them to a different host. Not all functionality is supported though (e.g. the Firewall.hostname constant usable in python conditionals will still be the name of the host you generate the rules on - you may want to add a --hostname parameter to pyroman).
pyroman --print-verbose will generate a script readable by iptables-restore except for one problem: it contains both the rules for IPv4 and for IPv6, separated by #### IPv6 rules. It will also annotate the origin of the rule, for example:
# /etc/pyroman/
-A rfc4890f -p icmpv6 --icmpv6-type 255 -j DROP
indicates that this particular line was produced due to line 82 in file /etc/pyroman/ This makes debugging easier. In particular it allows pyroman to produce a meaningful error message if the rules are rejected by the kernel: it will tell you which line caused the rule that was rejected.
For the next version, I will probably add --output-ipv4 and --output-ipv6 options to make this more convenient to use. So far, pyroman is meant to be used on the firewall itself.
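Until such options exist, the combined output can be split by hand at the separator. A minimal sketch in Java, assuming the #### IPv6 rules marker appears on a line of its own, that everything before it is IPv4, and everything after it is IPv6. The class name and the sample ruleset are made up for illustration and are not part of pyroman:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits a combined IPv4/IPv6 ruleset dump in two. */
final class RulesetSplitter {
  /** Separator between the IPv4 and the IPv6 part, as described above. */
  static final String IPV6_MARKER = "#### IPv6 rules";

  /** Returns a two-element list: the IPv4 section, then the IPv6 section. */
  static List<String> split(String combined) {
    StringBuilder v4 = new StringBuilder();
    StringBuilder v6 = new StringBuilder();
    StringBuilder current = v4;
    for (String line : combined.split("\\R")) {
      if (current == v4 && line.startsWith(IPV6_MARKER)) {
        current = v6; // switch sections; the marker line itself is dropped
        continue;
      }
      current.append(line).append('\n');
    }
    List<String> parts = new ArrayList<>();
    parts.add(v4.toString());
    parts.add(v6.toString());
    return parts;
  }

  public static void main(String[] args) {
    // Made-up sample in iptables-restore syntax, not real pyroman output.
    String combined = "*filter\n-A INPUT -p icmp -j ACCEPT\nCOMMIT\n"
        + IPV6_MARKER + "\n"
        + "*filter\n-A rfc4890f -p icmpv6 --icmpv6-type 255 -j DROP\nCOMMIT\n";
    List<String> parts = split(combined);
    System.out.print("IPv4 part:\n" + parts.get(0));
    System.out.print("IPv6 part:\n" + parts.get(1));
  }
}
```

The first part would then go to iptables-restore and the second to ip6tables-restore. Comment lines such as the origin annotations are kept, since both tools ignore lines starting with #.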
Note: if you have configured a firewall that you are happy with, you can always use iptables-save to dump the current firewall. But it will not preserve comments, obviously.
2014-11-18 09:46 - Categories: English Debian Security

GR vote on init coupling

Overregulation is bad, and the project is suffering from the recent Anti-Systemd hate campaigning.
There is nothing balanced about the original GR proposal. It is bullshit from a policy point of view (it means we must remove software from Debian that would not work with other inits, such as gnome-journal, by policy). At the same time, it uses manipulative language like "freedom to select a different init system" (as if this would otherwise be impossible) and "accidentally locked in". It is exactly this type of language and behavior which has made Debian quite poisonous the last months.
In fact, the GR pretty much says "I don't trust my fellow maintainers to do the right thing, therefore I want a new hammer to force my opinion on them". This is unacceptable in my opinion, and the GR will only demotivate contributors. Every Debian developer (I'm not talking about systemd upstream, but about Debian developers!) I've met would accept a patch that adds support for sysvinit to a package that currently doesn't. The proposed GR will not improve sysvinit support. It is a hammer to kick out software where upstream doesn't want to support sysvinit, but it won't magically add sysvinit support anywhere.
What some supporters of the GR may not have realized - it may as well backfire on them. Some packages that don't yet work with systemd would violate policy then, too... - in my opinion, it is much better to provide support on an "as good as possible, given available upstream support and patches" basis, instead of a "must" basis. The lock-in may come even faster if we make init system support mandatory: it may be more viable to drop software to satisfy the GR than to add support for other inits - and since systemd is the current default, software that doesn't support systemd is a good candidate to be dropped, isn't it? (Note that I do prefer to keep them, and have a policy that allows keeping them ...)
For these reasons I voted:
  1. Choice 4: GR not required
  2. Choice 3: Let maintainers do their work
  3. Choice 2: Recommended, but not mandatory
  4. Choice 5: No decision
  5. Choice 1: Ban packages that don't work with every init system
Fact is that Debian maintainers have always been trying hard to allow people to choose their favorite software. Until you give me an example where the Debian maintainer (not upstream) has refused to include sysvinit support, I will continue to trust my fellow DDs. I considered placing Choice 2 below "further discussion", but essentially this is a no-op GR anyway - in my opinion "should support other init systems" is already present in default Debian policy...
Say no to the haters.
And no, I'm not being unfair. One of the most verbose haters going by various pseudonyms such as Gregory Smith (on Linux Kernel Mailing list), Brad Townshend (LKML) and John Garret (LKML) has come forward with his original alias - it is indeed MikeeUSA, a notorious anti-feminist troll (see his various youtube "songs", some of them include this pseudonym). It's easy to verify yourself.
He has not contributed anything to the open source community. His songs and "games" are not worth looking at, and I'm not aware of any project that has accepted any of his "contributions". Yet, he uses several sock puppets to spread his hate.
The anti-systemd "crowd" (if it actually is more than a few notorious trolls) has lost all its credibility in my opinion. They spread false information, use false names, and focus on hate instead of improving source code. And worse, they tolerate such trolling in their ranks.
2014-11-09 15:38 - Categories: English Linux Debian

Clustering 23 mio Tweet locations

To test scalability of ELKI, I've clustered 23 million Tweet locations from the Twitter Statuses Sample API obtained over 8.5 months (due to licensing restrictions by Twitter, I cannot make this data available to you, sorry).
23 million points is a challenge for advanced algorithms. It's quite feasible by k-means; in particular if you choose a small k and limit the number of iterations. But k-means does not make a whole lot of sense on this data set - it is a forced quantization algorithm, but does not discover actual hotspots.
Density-based clustering algorithms such as DBSCAN and OPTICS are much more appropriate. DBSCAN is a bit tricky to parameterize - you need to find the right combination of radius and density for the whole world. Given that Twitter adoption and usage differ a lot between regions, it is very likely that you won't find a single parameter setting that is appropriate everywhere.
OPTICS is much nicer here. We only need to specify a minimum object count - I chose 1000, as this is a fairly large data set. For performance reasons (and this is where ELKI really shines) I chose a bulk-loaded R*-tree index for acceleration. To benefit from the index, the epsilon radius of OPTICS was set to 5000m. Also, ELKI allows using geodetic distance, so I can specify this value in meters and do not get many artifacts from coordinate projection.
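To illustrate what a radius in meters means on latitude/longitude data, here is a small sketch of a great-circle distance using the haversine formula. This is not ELKI's implementation: it assumes a spherical Earth with a mean radius of 6371 km, and the class and method names are hypothetical.

```java
/** Hypothetical great-circle distance helper (spherical Earth approximation). */
final class GeoDistance {
  /** Mean Earth radius in meters; an assumption of the spherical model. */
  static final double EARTH_RADIUS_M = 6_371_000.0;

  /** Haversine distance in meters between two points given in degrees. */
  static double haversineMeters(double lat1, double lng1, double lat2, double lng2) {
    double phi1 = Math.toRadians(lat1);
    double phi2 = Math.toRadians(lat2);
    double sinDPhi = Math.sin((phi2 - phi1) / 2);
    double sinDLambda = Math.sin(Math.toRadians(lng2 - lng1) / 2);
    double a = sinDPhi * sinDPhi + Math.cos(phi1) * Math.cos(phi2) * sinDLambda * sinDLambda;
    // Clamp to 1 to guard against rounding errors for antipodal points.
    return 2 * EARTH_RADIUS_M * Math.asin(Math.min(1.0, Math.sqrt(a)));
  }

  /** True if the two points are within the given radius in meters. */
  static boolean withinEpsilon(double lat1, double lng1, double lat2, double lng2, double epsilonMeters) {
    return haversineMeters(lat1, lng1, lat2, lng2) <= epsilonMeters;
  }

  public static void main(String[] args) {
    // One degree of longitude along the equator is roughly 111 km.
    System.out.println(haversineMeters(0, 0, 0, 1));
    // 0.01 degree apart is about 1.1 km, well inside a 5000 m radius.
    System.out.println(withinEpsilon(0, 0, 0, 0.01, 5000.0));
  }
}
```

With a distance like this, the same 5000 m radius covers the same area near the equator and at high latitudes, which is not the case for Euclidean distance on raw coordinates.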
To extract clusters from OPTICS, I used the Xi method, with xi set to 0.01 - a rather low value, also due to the fact of having a large data set.
The results are pretty neat - here is a screenshot (using KDE Marble and OpenStreetMap data, since Google Earth segfaults for me right now):
Screenshot of Clusters in central Europe
Some observations: unsurprisingly, many cities turn up as clusters. Also regional differences are apparent as seen in the screenshot: plenty of Twitter clusters in England, and low acceptance rate in Germany (Germans do seem to have objections about using Twitter; maybe they still prefer texting, which was quite big in Germany - France and Spain use Twitter a lot more than Germany).
Spam - some of the high usage in Turkey and Indonesia may be due to spammers using a lot of bots there. There also is a spam cluster in the ocean south of Lagos - some spammer uses random coordinates [0;1]; there are 36000 tweets there, so this is a valid cluster...
A benefit of OPTICS and DBSCAN is that they do not cluster every object - low density areas are considered as noise. Also, they support clusters of different shape (which may be lost in this visualization, which uses convex hulls!) and different size. OPTICS can also produce a hierarchical result.
Note that for these experiments, the actual Tweet text was not used. This has a rough correspondence to Twitter popularity "heatmaps", except that the clustering algorithms will actually provide a formalized data representation of activity hotspots, not only a visualization.
You can also explore the clustering result in your browser - the Google Drive visualization functionality seems to work much better than Google Earth.
If you go to Istanbul or Los Angeles, you will see some artifacts - odd shaped clusters with a clearly visible spike. This is caused by the Xi extraction of clusters, which is far from perfect. At the end of a valley in the OPTICS plot, it is hard to decide whether a point should be included or not. These errors are usually the last element in such a valley, and should be removed via postprocessing. But our OpticsXi implementation is meant to be as close as possible to the published method, so we do not intend to "fix" this.
Certain areas - such as Washington, DC, New York City, and Silicon Valley - do not show up as clusters. The reason is probably again the Xi extraction - these regions do not exhibit the steep density increase expected by Xi, but are too blurred in their surroundings to be a cluster.
Hierarchical results can be found e.g. in Brasilia and Los Angeles.
Compare the OPTICS results above to k-means results (below) - see why I consider k-means results to be a meaningless quantization?
k-means clusters
Sure, k-means is fast (30 iterations; not converged yet. Took 138 minutes on a single core, with k=1000. The parallel k-means implementation in ELKI took 38 minutes on a single node, Hadoop/Mahout on 8 nodes took 131 minutes, as slow as a single CPU core!). But you can see how sensitive it is to misplaced coordinates (outliers, but mostly spam), how many "clusters" are somewhere in the ocean, and that there is no resolution on the cities? The UK is covered by 4 clusters, with little meaning; and three of these clusters stretch all the way into Bretagne - k-means clusters clearly aren't of high quality here.
If you want to reproduce these results, you need to get the current ELKI beta version (0.6.5~20141030 - the output of cluster convex hulls was just recently added to the default codebase), and of course data. The settings I used are: coords.tsv.gz
-db.index tree.spatial.rstarvariants.rstar.RStarTreeFactory
-pagefile.pagesize 500
-spatial.bulkstrategy SortTileRecursiveBulkSplit
-algorithm clustering.optics.OPTICSXi
-opticsxi.xi 0.01
-algorithm.distancefunction geo.LngLatDistanceFunction
-optics.epsilon 5000.0 -optics.minpts 1000
-resulthandler KMLOutputHandler -out /tmp/out.kmz
and the total runtime for 23 million points on a single core was about 29 hours. The indexes helped a lot: less than 10000 distances were computed per point, instead of 23 million - the expected speedup over a non-indexed approach is 2400.
Don't try this with R or Matlab. Your average R clustering algorithm will try to build a full distance matrix, and you probably don't have the petabytes of memory needed to store this matrix (23 million squared entries at 8 bytes each is roughly 4 petabytes). Maybe start with a smaller data set first, then see how far you can afford to increase the data size.
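The back-of-the-envelope numbers behind these statements can be checked quickly. A small sketch, assuming a dense matrix of 8-byte doubles and taking the 10000 distance computations per point as an upper bound; the class and method names are made up for illustration:

```java
/** Hypothetical back-of-the-envelope calculations for the experiment above. */
final class MatrixSize {
  /** Bytes needed for a dense n-by-n matrix of doubles. */
  static long fullMatrixBytes(long n) {
    return n * n * Double.BYTES;
  }

  /** Speedup of an indexed run over computing all n distances per point. */
  static double indexSpeedup(long n, long distancesPerPoint) {
    return (double) n / distancesPerPoint;
  }

  public static void main(String[] args) {
    long n = 23_000_000L;
    // About 4.2e15 bytes, i.e. in the petabyte range.
    System.out.println(fullMatrixBytes(n) + " bytes for the full distance matrix");
    // With at most 10000 distances per point, the speedup is at least this factor.
    System.out.println(indexSpeedup(n, 10_000L) + "x fewer distance computations");
  }
}
```

Since fewer than 10000 distances were computed per point, the factor of 2300 computed here is a lower bound, in the same ballpark as the 2400 quoted above.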
2014-10-23 10:01 - Categories: English Web Research

Avoiding systemd isn't hard

The former contents of this blog post have been removed.
Systemd-haters pissed me off so much, that I'm no longer willing to provide information helpful for avoiding systemd.
Here is my message to all the anti-systemd-trolls: Go jump in a lake, we do not need haters in the open source community.
Hint: if you want people to care about you, stop insulting them. If you keep on pissing off people, you will not achieve anything!
2014-10-21 13:17 — Categories: English Debian

Beware of trolls - do not feed

A particularly annoying troll has been on his hate crusade against systemd for months now.
Unfortunately, he's particularly active on Debian mailing lists (but apparently also on Ubuntu and the Linux Kernel mailing list) and uses tons of fake accounts that he keeps setting up. Our listmasters have a hard time blocking all his hate, sorry.
Obviously, this is also the same troll that has been attacking Lennart Poettering.
There is evidence that this troll used to go by the name "MikeeUSA", and has had quite a reputation for anti-feminist hate for over 10 years now.
Please, do not feed this troll.
Here are some names he uses on YouTube: Gregory Smith, Matthew Bradshaw, Steve Stone.
Blacklisting is the best measure we have, unfortunately.
Even if you don't like the road systemd is taking or Lennart Poettering personally - the behaviour of that troll is unacceptable to say the least; and indicates some major psychological problems... also, I wouldn't be surprised if he is also involved in #GamerGate.
See this example (LKML) if you have any doubts. We seriously must not tolerate such poisonous people.
If you don't like systemd, the acceptable way of fighting it is to write good alternative software (and you should be able to continue using SysV init or openRC, unless there is a bug, in Debian - in this case, provide a bug fix). End of story.
2014-10-18 18:41 — Categories: English Debian

Google Earth on Linux

Google Earth for Linux appears to be largely abandoned by Google, unfortunately. The packages available for download cannot be installed on a modern amd64 Debian or Ubuntu system due to dependency issues.
In fact, the amd64 version is a 32 bit build, too. The packages are really low quality, the dependencies are outdated, locale support is busted etc.
So here are hacky instructions on how to install it nevertheless. But beware, these instructions are a really bad hack.
  1. These instructions are appropriate for version Do not use them for any other version. Things will have changed.
  2. Make sure your system has i386 architecture enabled. Follow the instructions in section "Configuring architectures" on the Debian MultiArch Wiki page to do so
  3. Install lsb-core, and try to install the i386 versions of these packages, too!
  4. Download the i386 version of the Google Earth package
  5. Install the package by forcing dependencies, via
    sudo dpkg --force-depends -i google-earth-stable_current_i386.deb
  6. As of now, your package manager will complain, and suggest removing the package again. To make it happy, we have to hack the installed packages list. This is ugly, and you should make a backup. You can totally bust your system this way... Fortunately, the change we're making is rather simple. As admin, edit the file /var/lib/dpkg/status. Locate the section Package: google-earth-stable. In this section, delete the line starting with Depends:. Don't add extra newlines or change anything else!
  7. Now the package manager should believe the dependencies of Google Earth are fulfilled, and no longer suggest removal. But essentially this means you have to take care of them yourself!
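The edit in step 6 can be sketched in code, which may help if you want to double check what exactly gets removed. This is only an illustration under assumptions: it works on a string copy of the status file rather than on /var/lib/dpkg/status itself, it assumes the Depends: field sits on a single line, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper: drops the Depends: line from one stanza of a dpkg status text. */
final class StatusDependsStripper {
    static String stripDepends(String status, String packageName) {
        List<String> out = new ArrayList<>();
        boolean inTarget = false;
        // Limit -1 keeps trailing empty lines, so the rest of the text is reproduced exactly.
        for (String line : status.split("\n", -1)) {
            if (line.startsWith("Package: ")) {
                // A new stanza starts; check whether it is the one we want to edit.
                inTarget = line.substring("Package: ".length()).trim().equals(packageName);
            } else if (line.isEmpty()) {
                // A blank line ends the current stanza.
                inTarget = false;
            }
            if (inTarget && line.startsWith("Depends:")) {
                continue; // Drop only this line; assumes the field is not continued on the next line.
            }
            out.add(line);
        }
        return String.join("\n", out);
    }
}
```

Comparing the input and output of stripDepends(text, "google-earth-stable") with a diff tool before touching the real file is a cheap safety check; other stanzas, including their Depends: lines, are left untouched.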
Some notes on using Google Earth:
  • Locales are busted. Use LC_NUMERIC=en_US.UTF-8 google-earth to start it. Otherwise, it will fail parsing coordinates, if you are in a locale that uses a different number format.
  • You may need to install the i386 versions of some libraries, in particular of your OpenGL drivers! I cannot provide you with a complete list.
  • Search doesn't work sometimes for me.
  • Occasionally, it reports "unknown" network errors.
  • If you upgrade Nvidia graphics drivers, you will usually have to reboot, or you will see graphics errors.
  • Some people have removed/replaced the bundled libQt* and libfreeimage* libraries, but that did not work for me.
2014-10-17 15:59 — Categories: English Web Debian

Analyzing Twitter - beware of spam

This year I started to broaden my research; and one data source of interest was text, because its lack of structure often makes it challenging. One of the data sources that everybody seems to use is Twitter: it has a nice API, and few restrictions on using it (except on resharing data). By default, you can get a 1% random sample from all tweets, which is more than enough for many use cases.
We've had some exciting results which a colleague of mine will be presenting tomorrow (Tuesday, Research 22: Topic Modeling) at the KDD 2014 conference:
SigniTrend: Scalable Detection of Emerging Topics in Textual Streams by Hashed Significance Thresholds
Erich Schubert, Michael Weiler, Hans-Peter Kriegel
20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
You can also explore some (static!) results online at

In our experiments, the "news" data set was more interesting. But after some work, we were able to get reasonable results out of Twitter as well. As you can see from the online demo, most of these fall into pop culture: celebrity deaths, sports, hip-hop. Not much that would change our lives; and even less that wasn't already captured by traditional media.
The focus of this post is on the preprocessing needed for getting good results from Twitter. Because it is much easier to get bad results!
The first thing you need to realize about Twitter is that due to the media attention/hype it gets, it is full of spam. I'm pretty sure the engineers at Twitter already try to reduce spam; block hosts and fraud apps. But a lot of the trending topics we discovered were nothing but spam.
Retweets - the "like" of Twitter - are an easy source to see what is popular, but are not very interesting if you want to analyze text. They just reiterate the exact same text (except for a "RT " prefix) as earlier tweets. We found results to be more interesting if we removed retweets. Our theory is that retweeting requires much less effort than writing a real tweet; and things that are trending "with effort" are more interesting than those that were just liked.
Teen spam. If you ever searched for a teen idol on Twitter, say this guy I hadn't heard of before, but who has 3.88 million followers on Twitter, and search for tweets addressed to him, you will get millions upon millions of results. Many of these tweets look like this: Now if you look at this tweet, there is this odd "x804" at the end. This is to defeat a simple spam filter by Twitter. Because this user did not tweet this just once: instead it is common amongst teens to spam their idols with follow requests by the dozen, probably using some JavaScript hack or a third party Twitter client. Occasionally, you see hundreds of such tweets, each sent within a few seconds of the previous one. If you get a 1% sample of these, you still get a few of them...
Even worse (for data analysis) than teen spammers are commercial spammers and wannabe "hackers" that exercise their "sk1llz" by spamming Twitter. To get a sample of such spam, just search for weight loss on Twitter. There is plenty of fresh spam there, usually consisting of some text pretending to be news, and an anonymized link (there is no need to use an URL shortener such as on Twitter, since Twitter has its own URL shortener; and you'll end up with double-shortened URLs). And the hacker spam is even worse (e.g. #alvianbencifa) as he seems to have trojaned hundreds of users, and his advertisement seems to be a nonexistent hash tag, which he tries to get into Twitter's "trending topics".
And then there are the bots. Plenty of bots spam Twitter with their analysis of trending topics, reinforcing the trending topics. In my opinion, bots such as trending topics indonesia are useless. No wonder there are only 280 followers. And of the trending topics reported, most of them seem to be spam topics...

Bottom line: if you plan on analyzing Twitter data, spend considerable time on preprocessing to filter out spam of various kinds. For example, we remove singletons and digits, then feed the data through a duplicate detector. We end up discarding 20%-25% of tweets. But we still get some of the spam, such as that hacker's spam.
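A minimal sketch of such a preprocessing step could look as follows. It is not our actual pipeline: it only drops retweets by their "RT " prefix, strips digits (so that counters such as "x804" no longer make spam tweets look distinct), and uses an exact-match set as a stand-in for a real duplicate detector. Singleton removal is left out, and all names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative tweet filter: drop retweets, strip digits, drop exact duplicates. */
final class TweetPreprocessor {
    /** Retweets repeat earlier text behind an "RT " prefix. */
    static boolean isRetweet(String text) {
        return text.startsWith("RT ");
    }

    /** Lowercase, remove digit runs and collapse whitespace. */
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT)
                   .replaceAll("[0-9]+", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    /** Keeps the first tweet of every group that normalizes to the same text. */
    static List<String> filter(List<String> tweets) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String tweet : tweets) {
            if (isRetweet(tweet)) {
                continue;
            }
            // Set.add returns false if the normalized text was seen before.
            if (seen.add(normalize(tweet))) {
                kept.add(tweet);
            }
        }
        return kept;
    }
}
```

On a continuous stream the set would grow without bound, so a real implementation would need a time window or a hashing scheme with bounded memory; near-duplicates that differ by more than digits also slip through this simple version.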
All in all, real data is just messy. People posting there have an agenda that might be opposite to yours. And if someone (or some company) promises you wonders from "Big data" and "Twitter", you better have them demonstrate their use first, before buying their services. Don't trust visions of what could be possible, because the first rule of data analysis is: Garbage in, garbage out.
2014-08-25 12:40 — Categories: English Web Research

Paypal Spam

Paypal has now joined the spammers.

For a few months now, Paypal has been sending a promotional email to all users every month. It gives the impression that there are transactions on the account (even if you do not use PayPal at all) and asks you to visit the website. These emails are not phishing, but the ulterior motive of Paypal is surely that you should use the service more again...

Here is the wording of the latest email:

Your account overview for June 2014

Hello XXX YYY,

with PayPal you keep track of things like the pro footballers on the World Cup pitch. To view your latest payments and check your current account balance, please log in. For security reasons, we never send confidential information in an email.

... so for "security reasons" they only send advertising slogans such as "pro footballers" and "World Cup pitch"?!?
Because there are no "latest payments"!

No thanks. I will simply close my PayPal account, which I do not use anyway, and which does not even have my current credit card number.

And no, you cannot turn them off. Quoting Paypal "customer service":

Thank you for your inquiry to PayPal.

You write that the notifications cannot be turned off.

Mr. XXX, I have checked your records and can confirm that you have turned off the sending of notifications.

However, we reserve the right to inform our users about important announcements regarding products or policies.

Important announcements? That truly does not apply here. Hidden advertising is more like it...

07.07.2014 11:08 — Categories: Deutsch Web

Kernel-density based outlier detection and the need for customization

Outlier detection (also: anomaly detection, change detection) is an unsupervised data mining task that tries to identify the unexpected.
Most outlier detection methods are based on some notion of density: in an appropriate data representation, "normal" data is expected to cluster, and outliers are expected to be further away from the normal data.
This intuition can be quantified in different ways. Common heuristics include kNN outlier detection and the Local Outlier Factor (which uses a density quotient). One of the directions in my dissertation was to understand (also from a statistical point of view) how the output and the formal structure of these methods can be best understood.
I will present two smaller results of this analysis at the SIAM Data Mining 2014 conference: instead of the very heuristic density estimation found in the above methods, we design a method (using the same generalized pattern) that uses a best practice from statistics: Kernel Density Estimation. We aren't the first to attempt this (cf. LDF), but we actually retain the properties of the kernel, whereas the authors of LDF tried to mimic the LOF method too closely, and this way damaged the kernel.
The other result presented in this work is the need to customize. When working with real data, using "library algorithms" will more often than not fail. The reason is that real data isn't as nicely behaved - it's dirty, it is seldom normally distributed. And the problem that we're trying to solve is often much narrower. For best results, we need to integrate our preexisting knowledge of the data into the algorithm. Sometimes we can do so by preprocessing and feature transformation. But sometimes, we can also customize the algorithm easily.
Outlier detection algorithms aren't black magic, or carefully adjusted. They follow a rather simple logic, and this means that we can easily take only parts of these methods, and adjust them as necessary for our problem at hand!
The article presented at SDM will demonstrate such a use case: analyzing 1.2 million traffic accidents in the UK. Here, we are not interested in "classic" density based outliers - this would be a rare traffic accident on a small road somewhere in Scotland. Instead, we're interested in unusual concentrations of traffic accidents, i.e. blackspots.
The generalized pattern can be easily customized for this task. While this data does not allow automatic evaluation, many outliers could be easily verified using Google Earth and search: often, historic imagery on Google Earth showed that the road layout was changed, or that there are many news reports about the dangerous road. The data can also be nicely visualized, and I'd like to share these examples with you. First, here is a screenshot from Google Earth for one of the hotspots (Cherry Lane Roundabout, north of Heathrow airport, which used to be a double cut-through roundabout - one of the cut-throughs has since been removed):
Screenshot of Cherry Lane Roundabout hotspot
Google Earth is best for exploring this result, because you can hide and show the density overlay to see the crossroad below; and you can go back in time to access historic imagery. Unfortunately, KML does not allow easy interactions (at least it didn't last time I checked).
I have also put the KML file on Google Drive. It will automatically display it on Google Maps (nice feature of Drive, kudos to Google!), but it should also allow you to download it. I've also explored the data on an Android tablet (but I don't think you can hide elements there, or access historic imagery as in the desktop application).
With a classic outlier detection method, this analysis would not have been possible. However, it was easy to customize the method; and the results are actually more meaningful: instead of relying on some heuristic to choose kernel bandwidth, I opted for choosing the bandwidth by physical arguments: 50 meters is a reasonable bandwidth for a crossroad / roundabout, and for comparison a radius of 2 kilometers is used to model the typical accident density in this region (there should be other crossroads within 2 km in Europe).
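The idea of comparing a local density to a regional one can be sketched as follows. This is a simplified illustration, not the method from the paper or the ELKI code: it assumes accident locations already projected to planar coordinates in meters, uses a plain 2D Gaussian kernel, and treats the 2 kilometers as a second kernel bandwidth. All names are made up.

```java
/** Illustrative density quotient: local (50 m) versus regional (2 km) kernel density. */
final class DensityRatio {
    /** 2D Gaussian kernel density estimate at (qx, qy); coordinates and bandwidth in meters. */
    static double density(double[][] points, double qx, double qy, double bandwidth) {
        double sum = 0.0;
        for (double[] p : points) {
            double dx = p[0] - qx;
            double dy = p[1] - qy;
            sum += Math.exp(-(dx * dx + dy * dy) / (2.0 * bandwidth * bandwidth));
        }
        // Normalize by the number of points and the kernel's integral in two dimensions.
        return sum / (points.length * 2.0 * Math.PI * bandwidth * bandwidth);
    }

    /** Values far above 1 mean accidents concentrate much more here than in the surrounding region. */
    static double ratio(double[][] points, double qx, double qy) {
        return density(points, qx, qy, 50.0) / density(points, qx, qy, 2000.0);
    }
}
```

If every accident in the region happened at the query location itself, the ratio would reach its maximum of (2000/50)^2 = 1600; a location with no accidents nearby yields a ratio close to zero. The sketch assumes a non-empty point set.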
Since I advocate reproducible science, the source code of the basic method will be in the next ELKI release. For the customization case studies, I plan to share them as a how-to or tutorial type of document in the ELKI wiki; probably also detailing data preprocessing and visualization aspects. The code for the customizations is not really suited for direct inclusion in the ELKI framework, but can serve as an example for advanced usage.
E. Schubert, A. Zimek, H.-P. Kriegel
Generalized Outlier Detection with Flexible Kernel Density Estimates
In Proceedings of the 14th SIAM International Conference on Data Mining (SDM), Philadelphia, PA, 2014.
So TLDR of the story: A) try to use more established statistics (such as KDE), and B) don't expect an off-the-shelf solution to do magic, but customize the method for your problem.
P.S. if you happen to know nice post-doc positions in academia:
I'm actively looking for a position to continue my research. I'm working on scaling these methods to larger data and to make them work with various real data that I can find. Open-source, modular and efficient implementations are very important to me, and one of the directions I'd like to investigate is porting these methods to a distributed setting, for example using Spark. In order to get closer to "real" data, I've started to make these approaches work e.g. on textual data, mixed type data, multimedia etc. And of course, I like teaching; which is why I would prefer a position in academia.
2014-04-22 14:39 — Categories: English Research

Google Summer of Code 2014 - not participating

Google Summer of Code 2014 is currently open for mentoring organizations to register.

I've decided to not apply for GSoC with my data mining open source project ELKI anymore.

  • You don't get any feedback for your application. So unless you were accepted, you have no idea if it is worth trying again (I tried twice).
  • As far as I can tell, you only have a chance if there is someone at Google advocating your project. For example, someone involved in your project actually working at Google.
  • I don't really trust Google anymore. They have been exploiting their market dominance too much, leveraging YouTube and Android to force people into Google+ etc. (see Google turning evil blog post).
  • I'll be looking for a Post-Doc position in fall, so the timing of the GSoC isn't ideal anyway.
2014-02-13 17:17 — Categories: English Coding Research

Debian chooses systemd as default init

It has been all over the place. The Debian CTTE has chosen systemd over upstart as default init system by the chairman's call. This decision was overdue, as it was pretty clear that the tie would not change, and thus it would be up to the chairman. There were no new positions presented, and nobody was being convinced of a different preference. The whole discussion had been escalating, and had started to harm Debian.

Some people may not want to hear this, but another 10 ballots and options would not have changed this outcome. Repeating essentially the same decision (systemd, upstart, or "other things nobody prefers") will do no good, but turn out the same result, a tie. Every vote count I saw would turn out this tie. Half of the CTTE members prefer systemd, the other half upstart.

The main problems, however, are these:

  • People are not realizing this is about the default init for jessie, not about the only init to support. Debian was and is about choice, but even then you need to make something default... If you prefer SysV init or openRC, you will still be able to use it (it will be just fewer people debugging these startup scripts).
  • Some people (not part of the CTTE) are just socially inept trolls, and have rightfully been banned from the list.
  • The discussion has become so heated that a number of important secondary decisions have not yet been discussed in a civil manner. For example, we will have some packages (for example, Gnome Logs), which are specific to a particular init system, but not part of the init system. A policy needs to be written with these cases in mind that states when and how a package may depend on a particular init system. IMHO, a package that cannot support other init systems without major changes should be allowed to depend on that system, but then meta-packages such as the Gnome meta packages must not require this application.

So please, everybody get back to work now. If there ever is enough reason to overturn this decision, there are formal ways to do this, and there are people in the CTTE (half of the CTTE, actually) that will take care of this.

Until then, live with the result of a 0.1 votes lead for systemd. Instead of pursuing a destructive hate campaign, why not improve your favorite init system instead?

Oh, and don't forget about the need to spell out a policy for init system support requirements of packages.

2014-02-11 22:07 — Categories: English Linux Debian

Minimizing usage of debian-multimedia packages

How to reduce your usage of debian-multimedia packages:

As you might be aware, the "deb-multimedia" packages have seen their share of chaos.

Originally, they were named "debian-multimedia", but it was then decided that they should better be called "deb-multimedia" as they are not an official part of Debian. The old domain was then grabbed by cybersquatters when it expired.

While a number of packages remain undistributable for Debian due to legal issues - such as decrypting DVDs - and thus "DMO" remains useful to many desktop users, please note that for many of the packages, a reasonable version exists within Debian "main".

So here is a way to prevent automatically upgrading to DMO versions of packages that also exist in Debian main:

We will use a configuration option known as apt pinning. We will lower the priority of DMO packages to below 500 (the default for regular archives), which means they will not be automatically upgraded to; but they can be installed when e.g. no version of this package exists within Debian main. I.e. the packages can still be easily installed, but apt will prefer the official versions.

For this we need to create a file I named /etc/apt/preferences.d/unprefer-dmo with the contents:

Package: *
Pin: release o=Unofficial Multimedia Packages
Pin-Priority: 123

As long as the DMO archive doesn't rename itself, this pin will work; it will continue to work if you use a mirror of the archive, as it is not tied to the URL.

It will not downgrade packages for you. If you want to downgrade, this will be a trickier process. You can use aptitude for this, and start with a filter (l key, as in "limit") of ?narrow(~i,~Vdmo). This will only show packages installed where the version number contains dmo. You can now patrol this list, enter the detail view of each package, and check the version list at the end for a non-dmo version.

I cannot claim this will be an easy process. You'll probably have a lot of conflicts to clear up, due to different libavcodec* API versions. If I recall correctly, I opted to uninstall some dmo packages such as ffmpeg that I do not really need. In other cases, the naming is slightly different: Debian main has handbrake, while dmo has handbrake-gtk.

A simple approach is to consider uninstalling all of these packages. Then reinstall as needed; since installing Debian packages is trivial, it does not hurt to deinstall something, does it? When reinstalling, Debian packages will be preferred over DMO packages.

I prefer to use DFSG-free software wherever possible. And as long as I can watch my favorite series (Tatort, started in 1970 and airing episode 900 this month) and the occasional DVD movie, I have all I need.

An even more powerful aptitude filter to review installed non-Debian software is ?narrow(~i,!~ODebian), or equivalently ?narrow(?installed,?not(?origin(Debian))). This will list all package versions, which cannot be installed from Debian sources. In particular, this includes versions that are no longer available (they may have been removed because of security issues!), software that has been installed manually from .deb files, or any other non-Debian source such as Google or DMO. (You should check the output of aptitude policy that no source claims to be o=Debian that isn't Debian though).

This filter is a good health check for your system. Debian packages receive a good amount of attention with respect to packaging quality, maintainability, security and all of these aspects. Third party packages usually failed the Debian quality check in at least one respect. For example, Sage isn't in Debian yet; mostly because it has so many dependencies, that making a reliable and upgradeable installation is a pain. Similarly, there is no Hadoop in Debian yet. If you look at Hadoop packaging efforts such as Apache Bigtop, they do not live up to Debian quality yet. In particular, the packages have high redundancy, and re-include various .jar copies instead of sharing access to an independently packaged dependency. If a security issue arises with any of these jars, all the Hadoop packages will need to be rebuilt.

As you can see, it is usually not because of Debian being too strict or too conservative about free software when software is not packaged. More often than not, it's because the software in question itself currently does not live up to Debian packaging quality requirements. And of course sometimes because there is too little community backing the software in question, that would improve the packaging. If you want your software to be included in Debian, try to make it packaging friendly. Don't force-include copies of other software, for example. Allow the use of system-installed libraries where possible, and provide upgrade paths and backwards compatibility. Make dependencies optional, so that your software can be packaged even if a dependency is not yet available - and does not have to be removed if the dependency becomes a security risk. Maybe learn how to package yourself, too. This will make it easier for you to push new features and bug fixes to your userbase: instead of requiring your users to manually upgrade, make your software packaging friendly. Then your users will be auto-upgraded by their operating system to the latest version.

Update: If you are using APT::Default-Release or other pins, these may override the above pin. You may need to use apt-cache policy to find out if the pins are in effect, and rewrite your default-release pin into e.g.

Package: *
Pin: release a=testing,o=Debian
Pin-Priority: 912
For debugging, use a different priority for each pin, so you can see which one is in effect. Notice the o=Debian in this pin, which makes it apply only to Debian repositories.

2014-02-09 12:35 — Categories: English Linux Debian

Definition of Data Science

Everything is "big data" now, and everything is "data science". Because these terms lack a proper, falsifiable definition.
A number of attempts to define them exist, but they usually only consist of a number of "nice to haves" strung together. For Big Data, it's the 3+ V's, and for Data Science, this diagram on Wikipedia is a typical example.
This is not surprising: effectively these terms are all marketing, not scientific attempts at defining a research domain.
Actually, my favorite definition is this, except that it should maybe read pink pony in the middle, instead of unicorn.
Data science has been called "the sexiest job" so often, this has recently led to an integer overflow.
The problem with these definitions is that they are open-ended. They name some examples (like "volume") but they essentially leave it open to call anything "big data" or "data science" that you would like to. This is, of course, a marketer's dream buzzword. There is nothing saying that "picking my nose" is not big data science.

If we ever want to get to a usable definition and get rid of all the hype, we should consider a more precise definition; even when this means making it more exclusive (funnily enough, some people already called above open-ended definitions "elitist" ...).
Big data:
  • Must involve distributed computation on multiple servers
  • Must intermix computation and data management
  • Must advance over the state-of-the-art of relational databases, data warehousing and cloud computing in 2005
  • Must enable results that were unavailable with earlier approaches, or that would take substantially longer (runtime or latency)
  • Must be disruptively more data-driven
Data science:
  • Must incorporate domain knowledge (e.g. business, geology, etc.).
  • Must take computational aspects into account (scalability etc.).
  • Must involve scientific techniques such as hypothesis testing and result validation.
  • Results must be falsifiable.
  • Should involve more mathematics and statistics than earlier approaches.
  • Should involve more data management than earlier approaches (indexing, sketching&hashing etc.).
  • Should involve machine learning, AI or knowledge discovery algorithms.
  • Should involve visualization and rapid prototyping for software development.
  • Must satisfy at least one of these shoulds in a disruptive level.
But this is all far from a proper definition. Partially because these fields are so much in flux; but largely because they're just too ill-defined.
There is a lot of overlap, that we should try to flesh out. For example, data science is not just statistics. Because it is much more concerned with how data is organized and how the computations can be made efficiently. Yet often, statistics is much better at integrating domain knowledge. People coming from computation, on the other hand, usually care too little about the domain knowledge and falsifiability of their results - they're happy if they can compute anything.
Last but not least, nobody will be in favor of such a rigid definition and requirements. Because most likely, you will have to strip that "data scientist" label off your business card - and why bite the hand that feeds? Most of what I do certainly would not qualify as data science or big data anymore with an "elitist" definition. While this doesn't lessen my scientific results, it makes them less marketable.
Essentially, this is like a global "gentleman's agreement". Buzz these words while they last, then move on to the next similar "trending topic".

Maybe we should just leave these terms to the marketing folks, and let them bubble them till it bursts. Instead, we should just stick to the established and better defined terms...
  • When you are doing statistics, call it statistics.
  • When you are doing unsupervised learning, call it machine learning.
  • When your focus is distributed computation, call it distributed computing.
  • When you do data management, continue to call it data management and databases.
  • When you do data indexing, call it data indexing.
  • When you are doing unsupervised data mining, call it cluster analysis, outlier detection, ...
  • Whatever it is, try to use a precise term, instead of a buzzword.
Thank you.
Of course, sometimes you will have to play Buzzword Bingo. Nobody is going to stop you. But I will understand that you are "playing buzzword bingo", unless you get more precise.
Once you then have results that are so massively better, and really disrupted science, then you can still call it "data science" later on.
As you have seen, I've been picking on the word "disruptive" a lot. As long as you are doing "business as usual", and focusing on off-the-shelf solutions, it will not be disruptive. And it then won't be big data science, or a big data approach that yields major gains. It will be just "business as usual" with different labels, and return results as usual.
Let's face it. We don't just want big data or data science. What everybody is looking for is disruptive results, which will require a radical approach, not a slight modification involving slightly more computers of what you have been doing all along.
2014-01-24 16:19 — Categories: English Research

The init wars

The init wars have recently caught a lot of media attention (e.g. heise, prolinux, phoronix). However, one detail is often overlooked: Debian is debating over the default, while all of them are actually already supported to a large extent. Most likely, at least two of them will be made mandatory to support IMHO.
The discussion seems to be quite heated, with lots of people trying to evangelize for their preferred system. This actually only highlights that we need to support more than one, as Debian has always been about choice. This may mean some extra work for the debian-installer developers, because choosing the init system at install time (instead of switching later) will be much easier. More often than not, when switching from one init system to another you will have to perform a hard reset.
If you want to learn about the options, please go to the formal discussion page, which does a good job at presenting the positions in a neutral way.
Here is my subjective view of the init systems:
  • SysV init is the current default, and thus deserves to be mentioned first. It is slow, because it is based on a huge series of shell scripts. It can often be fragile, but at the same time it is very transparent. For a UNIX system administrator, SysV init is probably the preferred choice. You only reboot your servers every year anyway.
  • upstart seems to be a typical Canonical project. It solves a great deal of problems, but apparently isn't good enough at it for everybody, and they fail at including anyone in their efforts. Other examples of these fails include Unity and Mir, where they also announced the project as-is, instead of trying to get other supporters on board early (AFAICT). The key problem to widespread upstart acceptance seems to be the Canonical Contributor License Agreement that many would-be contributors are unwilling to accept. The only alternative would be to fork upstart completely, to make it independent of Canonical. (Note that upstart nevertheless is GPL, which is why it can be used by Debian just fine. The CLA only makes getting patches and enhancements included in the official version hard.)
  • systemd is the rising star in the init world. It probably has the best set of features, and it has started to incorporate/replace a number of existing projects such as ConsoleKit. I.e. it not only manages services, but also user sessions. It can be loosely tied to the GNOME project, which has started to rely on it more and more (much to the unhappiness of Canonical, who used to be a key player for GNOME; note that officially, GNOME chose to not depend on systemd, yet I see this as the only reliable combination to get a complete GNOME system running, and since "systemd can eventually replace gnome-session" I foresee this tie becoming closer). As the main drawback, systemd as is will (apparently) only work with the Linux kernel, whereas Debian also has to support kFreeBSD, NetBSD, Hurd and the OpenSolaris kernels (some aren't officially supported by Debian, but by separate projects).
So my take: I believe the only reasonable default is systemd. It has the most active development community and widest set of features. But as it cannot support all architectures, we need mandatory support for an alternative init system, probably SysV. Getting both working reliably will be a pain, in particular since more and more projects (e.g. GNOME) tie themselves closely to systemd, and would then become Linux-only or require major patches.
I have only tried systemd, on a number of machines, and unfortunately I cannot report it as "prime time ready" yet. You do have the occasional upgrade problems and incompatibilities, as it is quite invasive. From screensavers activating during movies to double suspends, to being unable to shut down my system when logged in (systemd would treat the login manager as a separate session, and since I was not the sole user it would not allow me to shut down), I have seen quite a lot of annoyances happen. This is an obvious consequence of the active development on systemd. This means that we should make the decision early, because we will need a lot of time to resolve all these bugs for the release.
There are more disruptions coming on the way. Nobody seems to have talked about kDBUS yet, the integration of an IPC mechanism like DBUS into the Linux kernel. It IMHO has a good chance of making it into the Linux kernel rather soon, and I wouldn't be surprised if it became mandatory for systemd soon after. Which then implies that only a recent kernel (say, mid-2014) version might be fully supported by systemd soon.
I would also like to see less GNOME influence in systemd. I have pretty much given up on the GNOME community, which is moving in a UI direction that I hate: they seem to only care about tablets and mobile phones for dumb users, and slowly turn GNOME into an Android UI, selling a black background as a major UI improvement. I feel that the key GNOME development community does not care about developers and desktop users like me anymore (but dreams of being the next Android), and therefore I have abandoned GNOME and switched to XFCE.
I don't give upstart much of a chance. Of course there are some Debian developers already involved in its development (employed by Canonical), so this will cause some frustration. But so far, upstart is largely an Ubuntu-only solution. And just like Mir, I don't see much future in it; instead I foresee Ubuntu going systemd within a few years, because it will want to get all the latest GNOME features. Ubuntu relies on GNOME, and apparently GNOME already has chosen systemd over upstart (even though this is "officially" denied).
Sticking with SysV is obviously the easiest choice, but it does not make a lot of sense to me technically. It's okay for servers, but more and more desktop applications will start to rely on systemd. For legacy reasons, I would however like to retain good SysV support for at least 5-10 more years.

But what is the real problem? After all, this is a long overdue decision.
  • There is too much advocacy and evangelism, from every side. The CTTE isn't really left alone to make a technical decision; instead, the main factors have become political, unfortunately. You have all kinds of companies (such as Spotify) weigh in on the debate, too.
  • The tone has become quite aggressive and emotional, unfortunately. I can already foresee some comments on this blog post "you are a liar, because GNOME is now spelled Gnome!!1!".
  • Media attention. This upcoming decision has been picked up by various Linux media already, increasing the pressure on everybody.
  • Last but not least, the impact will be major. Debian is one of the largest distributions, and it is used as a base by Ubuntu and Steam, amongst others. Debian preferring one over the other will be a slap in somebody's face, unfortunately.
So how to solve it? Let the CTTE do their discussions, and stop flooding them with mails trying to influence them. There has been so much influencing going on that it may even backfire. I'm confident they will reach a reasonable decision, or they'll decide to poll all the DDs. If you want to influence the outcome, provide patches to anything that doesn't yet fully support your init system of choice! I'm sure there are hundreds of packages which have neither upstart nor systemd support yet (as is, I currently have 27 init.d scripts launched by systemd, for example). IMHO, nothing is more convincing than having things just work, and of course, contributing code. We are in open source development, and the one thing that gets you sympathy in the community is to contribute code to someone else's project. For example, contribute fully integrated power-management support to XFCE, if you include power management functionality.
As is, I have apparently 7 packages installed with upstart support, and 25 with systemd support. So either, everybody is crazy about systemd, or they have the better record of getting their scripts accepted upstream. (Note that this straw poll is biased - with systemd, the benefits of not using "legacy" init.d script may just be larger).
2014-01-22 16:39 - Categories: English Linux Debian

Please stop this "Datenschürfen"!

No, this is not about the NSA affair.
It is about this ugly German non-word for data mining.

Whenever I read this word, my hair stands on end (fortunately, it seems to be used only by mediocre journalists who write for quantity-driven news sites whose preferred color is red - all three of them).

  • It appears to be a coinage born of an urge to Germanize everything. I do not think one has to resort to such an emergency translation here. There truly are better alternatives; you just have to be more precise.
  • "Schürfen" may be a common term among mining experts, or in the Ruhr area, for extracting and obtaining ores. To many Germans today, however, it simply sounds like "scraping something off". So: "scraping off data"?
  • The Duden dictionary, by the way, lists ... "Data-Mining". It does not list any "Datenschürfen".
  • ... prominent in the Duden explanation is the word "Auswertung" (analysis). A better naive translation would thus at least be "Datenbankauswertung" (database analysis). But of course that is not sensational enough for our dear journalists: "Germany wants to become world champion in Datenschürfen!" Ouch, ouch. Sure, right after we have seen Sahra Wagenknecht naked in the Dschungelcamp, together with Ends-everything Pofalla, and the politicians have for the first time completed a major project successfully without blowing the schedule and the budget.
  • The established term, by the way, is "Wissensentdeckung in Datenbanken" (knowledge discovery in databases). It does not quite fit either - the "automatic" is missing - and it is of course utterly unsuitable for headlines.
  • Most of the time there are also more precise terms (and all too often more correct ones, because not everything that is labeled "data mining" actually contains advanced analysis):
    • Collecting data on a massive scale is not data mining, but simply exactly that: "collecting data on a massive scale".
    • "Privatsphäre" (privacy), too, already has its own, admittedly difficult, word.
    • "Überwachung" (surveillance) is better suited to describe the NSA. They have long been unable to analyze all of their data. At best they can barely manage to search and filter it!
    • "Suchen" (searching; by the way the old German word for "googling") is also often much more fitting. So is the word "Filtern" (filtering).
    • "Elektronische Datenverarbeitung" (electronic data processing)! Admittedly, "EDV" sounds old, but in the end this is exactly what is being done... and most readers should still know this abbreviation.

Speaking of which, since I am ranting anyway: it is exactly the same with "Big Data". Everything is Big Data. Even when I blow my nose, that is sometimes "Big Data". Just like "Data Science", this term is essentially undefined. In particular: what is it not? And even if there were a good, widely accepted definition, nobody would stick to it.

Here is my definition of Big Data:

Big Data is the nice try of transforming an existing technology portfolio into Big Bucks with maximum return on investment (ROI) by applying the buzzword of the day to it, hoping that the other Chief Marketing Officers (CMOs) have not come up with exactly the same brilliant idea.

2014-01-19 18:14 - Categories: Deutsch

Java Hotspot Compiler - a heavily underappreciated technology

When I had my first contacts with Java, probably around Java 1.1 or Java 1.2, it all felt clumsy and slow. And this is still the reputation that Java has among many developers: bloated source code and slow performance.
In recent years I've worked a lot with Java; it would not have been my personal first choice, but as this is usually the language the students know best, it was the best choice for this project, the data mining framework ELKI.
I've learned a lot about Java since, including debugging and optimizing Java code. ELKI contains a number of tasks that require good number crunching performance; something where Java particularly had the reputation of being slow.
I must say, this is not entirely fair. Sure, the pure matrix multiplication performance of Java is not up to Fortran (BLAS libraries are usually implemented in Fortran, and many tools such as R or NumPy will use them for the heavy lifting). But there are other tasks than matrix multiplication, too!
There are a number of things where Java could be improved a lot. Some of this will be coming with Java 8, others are still missing. I'd particularly like to see native BLAS support and multi-valued on-stack returns (to allow an intrinsic sincos, for example).

In this post, I want to emphasize that usually the Hotspot compiler does an excellent job.
A few years ago, I always laughed at those who claimed "Java code can even be faster than C code", because the JVM itself is written in C. Having had a deeper look at what the Hotspot compiler does, I now say: I'm not surprised that quite often, reasonably good Java code outperforms reasonably good C code.
In fact, I'd love to see a "hotspot" optimizer for C.
So what is it that makes Hotspot so fast? In my opinion, the key ingredient to Hotspot performance is aggressive inlining. And this is exactly why "reasonably well written" Java code can be faster than C code written at a similar level.
Let me explain this with an example. Assume we want to compute a pairwise distance matrix, but we want the code to be able to support arbitrary distance functions. The code will roughly look like this (not heavily optimized):
for (int i = 0; i < size; i++) {
  for (int j = i + 1; j < size; j++) {
    matrix[i][j] = computeDistance(data[i], data[j]);
  }
}
In C, if you want to be able to choose computeDistance at runtime, you would likely make it a function pointer, or in C++ use e.g. boost::function or a virtual method. In Java, you would use an interface method instead, i.e. distanceFunction.distance().
In C, your compiler will most likely emit a jmp *%eax instruction to jump to the method to compute the distance; with virtual methods in C++, it would load the target method from the vtable and then jmp there. Technically, it will likely be a "register-indirect absolute jump". Java will, however, try to inline this code at runtime, i.e. it will often insert the actual distance function used at the location of this call.
Why does this make a difference? CPUs have become quite good at speculative execution, prefetching and caching. Yet, it can still pay off to save those jmps as far as I can tell; and if it is just to allow the CPU to apply these techniques to predict another branch better. But there is also a second effect: the hotspot compiler will be optimizing the inlined version of the code, whereas the C compiler has to optimize the two functions independently (as it cannot know they will be only used in this combination).
Hotspot can be quite aggressive there. It will even inline and optimize when it is not 100% sure that these assumptions are correct. It will just add simple tests (e.g. adding some type checks) and jump back to the interpreter/compiler when these assumptions fail and then reoptimize again.
You can see the inlining effect in Java when you use the -Xcomp flag, which tells the Java VM to compile everything up front instead of interpreting and profiling it first. It cannot do as much speculative inlining there, as it does not know which method will be called and which class will be seen. Instead, it will have to compile the code using virtual method invocations, just like C++ would use for executing this. Except that in Java, every single method will be virtual (in C++, you have to be explicit). You will likely see a substantial performance drop when using this flag, so using it is not recommended. Instead, let Hotspot perform its inlining magic. It will inline a lot by default, in particular tiny methods such as getters and setters.
I'd love to see something similar in C or C++. There are some optimizations that can only be done at runtime, not at compile time. Maybe not even at linking time; but only with runtime type information, and that may also change over time (e.g. the user first computes a distance matrix for Euclidean distance, then for Manhattan distance).

Don't get me wrong. I'm not saying Java is perfect. There are a lot of common mistakes, such as using java.util.Collections for primitive types, which comes at a massive memory cost and garbage collection overhead. The first thing to debug when optimizing Java applications is to check for memory usage overhead. But all in all, good Java code can indeed perform well, and may even outperform C code, due to the inlining optimization I just discussed; in particular on large projects where you cannot fine-tune inlining in C anymore.

Sometimes, Hotspot may also fail. This is largely why I've been investigating these issues recently. In ELKI 0.6.0 I'm facing a severe performance regression with linear scans (which is actually the simpler code path, not using indexes but a simple loop as seen above). I had this with 0.5.0 before, but back then I was able to revert to an earlier version that still performed well (even though the code was much more redundant). This time, I would have had to revert a larger refactoring that I wanted to keep, unfortunately.
Because the regression was quite severe - from 600 seconds to 1500-2500 seconds (still clearly better than -Xcomp) - I first assumed I was facing an actual programming bug. Careful inspection down to the assembler code produced by the hotspot VM did not reveal any such error. Then I tried Java 8, and the regression was gone.
So apparently, it is not a programming error, but Java 7 failed at optimizing it even remotely as well as it did with the previous ELKI version!
If you are a Java guru interested in tracking down this regression, feel free to contact me. It's in an open source project, ELKI. I'd be happy to have good performance even for linear scans on Java 7. But I don't want to waste any more hours on this; instead, I plan to move on to Java 8, also for other reasons (lambda expressions, which will greatly reduce the amount of glue code needed). Plus, Java 8 is faster in my benchmarks.
2013-12-16 15:13 - Categories: English Coding

Numerical precision and "Big Data"

Everybody is trying (or pretending) to be "big data" these days. Yet, I have the impression that there are very few true success stories. People fight with the technology to scale, and have not even arrived at the point of doing detailed analysis - or even meta-analysis, i.e. whether the new solutions actually perform better than the old "small data" approaches.
In fact, a lot of the "big data" approaches just reason "you can't do it with the existing solutions, so we cannot compare". Which is not exactly true.

My experience from experiments with large data (not big; it still fits into main memory) is that you have to be quite careful with your methods. Just scaling up existing methods does not always yield the expected results. The larger your data set, the more tiny problems surface that can ruin your computations.
"Big data" is often based on the assumption that just by throwing more data at your problem, your results will automatically become more precise. This is not true. On the contrary: the larger your data, the more likely you have some contamination that can ruin everything.
We tend to assume that numerics and such issues have long been resolved. But while there are some solutions, it should be noted that they come at a price: they are slower, have some limitations, and are harder to implement.
Unfortunately, such numerical issues are just about everywhere in data analysis. I'll demonstrate this with a very basic example. Assume we want to compute the mean of the following series: [1e20, 3, -1e20]. Computing the mean, everybody should be able to do this, right? Well, let's agree that the true solution is 1, as the first and last term cancel out. Now let's try some variants:
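  • Python, naive: sum([1e20, 3, -1e20])/3 yields 0.0 (on the interpreters of the time; since Python 3.12 the built-in sum() uses compensated summation for floats and yields 1.0)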
  • Python, NumPy sum: numpy.sum([1e20, 3, -1e20])/3 yields 0.0
  • Python, NumPy mean: numpy.mean([1e20, 3, -1e20]) yields 0.0
  • Python, less-known function: math.fsum([1e20, 3, -1e20])/3 yields 1.0
  • Java, naive: System.out.println( (1e20+3-1e20)/3 ); yields 0.0
  • R, mean: mean( c(1e20,3,-1e20) ) yields 0
  • R, sum: sum( c(1e20,3,-1e20) )/3 yields 0
  • Octave, mean: mean([1e20,3,-1e20]) yields 0
  • Octave, sum: sum([1e20,3,-1e20])/3 yields 0
So what is happening here? All of these functions (except Python's lesser-known math.fsum) use plain double precision arithmetic. With double precision, 1e20 + 3 = 1e20, as a double can only retain 15-16 decimal digits of precision. To actually get the correct result, you need to keep track of your error using additional doubles.
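To see why the 3 vanishes next to 1e20, it helps to look at the gap between adjacent doubles at that magnitude. The following is only a small sketch; it assumes Python 3.9 or later (for math.ulp), and the variable names are mine:

```python
import math
import sys

big = 1e20
# Gap between 1e20 and the next larger representable double.
spacing = math.ulp(big)
# 3 is far less than half of that gap, so the sum rounds back to 1e20.
absorbed = (big + 3) == big
# Number of decimal digits a double is guaranteed to preserve.
digits = sys.float_info.dig

print(spacing, absorbed, digits)  # 16384.0 True 15
```

Anything smaller than about 8192 that is added to 1e20 is simply rounded away, which is exactly what happens to the 3 in the series above.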
Now you may argue, this would only happen when having large differences in magnitude. Unfortunately, this is not true. It also surfaces when you have a large number of observations! Again, I'm using python to exemplify (because math.fsum is accurate).
> import math, numpy
> from numpy import array, mean
> a = array(range(-1000000,1000001)) * 0.000001
> min(a), max(a), numpy.sum(a), math.fsum(a)
(-1.0, 1.0, -2.3807511517759394e-11, 0.0)
As can be seen from the math.fsum function, solutions exist. One example is Shewchuk's algorithm (which is probably what powers math.fsum). For many cases, Kahan summation will also be sufficient, which essentially gives you twice the precision of doubles.
Note that these issues become even worse once you use subtraction, such as when computing variance. Never use the famous E[X^2]-E[X]^2 formula. It is mathematically correct, but when your data is not centered (i.e. E[X] is not close to 0, but much larger than your standard deviation), you will see all kinds of odd errors, including negative variance, which may then yield a NaN standard deviation:
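For illustration, here is a minimal sketch of textbook Kahan summation next to Neumaier's variant. This is not the code behind math.fsum, and the function names are my own. Note that plain Kahan summation helps with many small terms, but it does not rescue the [1e20, 3, -1e20] series; Neumaier's variant does. The naive loop is written out explicitly because the built-in sum() of recent Python versions already compensates for floats.

```python
def naive_sum(values):
    """Plain left-to-right double precision summation."""
    total = 0.0
    for x in values:
        total += x
    return total

def kahan_sum(values):
    """Kahan summation: carry the rounding error of each addition along."""
    total = 0.0
    lost = 0.0  # low-order part that the previous addition dropped
    for x in values:
        y = x - lost
        t = total + y
        lost = (t - total) - y
        total = t
    return total

def neumaier_sum(values):
    """Neumaier's variant: also correct when a term is larger than the total."""
    total = 0.0
    comp = 0.0  # accumulated compensation, added once at the end
    for x in values:
        t = total + x
        if abs(total) >= abs(x):
            comp += (total - t) + x
        else:
            comp += (x - t) + total
        total = t
    return total + comp

series = [1e20, 3, -1e20]
print(naive_sum(series) / 3, kahan_sum(series) / 3, neumaier_sum(series) / 3)
# prints 0.0 0.0 1.0
```

On a long series of small values, such as a million times 0.1, the naive loop drifts visibly while both compensated versions stay within rounding distance of the exact result.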
> b = a + 1e15
> numpy.var(a), numpy.var(b)
(0.33333366666666647, 0.33594164452917774)
> mean(a**2)-mean(a)**2, mean(b**2)-mean(b)**2
(0.33333366666666647, -21532835718365184.0)
(as you can see, numpy.var does not use the naive single-pass formula; probably they use the classic straight forward two-pass approach)
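The difference between the two formulas can be sketched in a few lines of plain Python. I do not know how numpy.var is implemented internally; this is just the textbook two-pass approach next to the naive formula, with function names and toy data of my own choosing (population variance, i.e. dividing by n):

```python
def naive_variance(xs):
    """E[X^2] - E[X]^2: subtracts two huge, nearly equal numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_sq = sum(x * x for x in xs) / n
    return mean_sq - mean_x * mean_x

def two_pass_variance(xs):
    """First pass: mean. Second pass: mean of squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    return sum((x - mean_x) ** 2 for x in xs) / n

# Hypothetical toy data: the values 4, 7, 13, 16 shifted by 1e9.
# The exact population variance is 22.5, independent of the shift.
shifted = [1e9 + k for k in (4.0, 7.0, 13.0, 16.0)]
print(naive_variance(shifted), two_pass_variance(shifted))
```

The squares are around 1e18, where adjacent doubles are 128 apart, so the naive formula can only return a multiple of 128 here and can never hit 22.5. The two-pass version subtracts the mean first and works with small, exactly representable deviations.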
So why do we not always use the accurate computations? Well, we use floating point with fixed precision because it is fast. And most of the time, when dealing with well conditioned numbers, it is easily accurate enough. To show the performance difference:
> import timeit
> for f in ["sum(a)", "math.fsum(a)"]:
>     print(timeit.timeit(f, setup="import math; a=range(0,1000)"))
So unless we need that extra precision (e.g. because we have messy data with outliers of large magnitude), we might prefer the simpler approach, which is roughly 3-6x faster, at least as far as pure CPU performance is concerned. Once I/O comes into play, the difference might just disappear altogether. This is probably why all but the fsum function show the same inaccuracy: performance. In particular since in 99% of situations the problems won't arise.

Long story. Short takeaway: When dealing with large data, pay extra attention to the quality of your results. In fact, do so even when handling small data that is dirty and contains outliers and different magnitudes. Don't assume that computer math is exact; floating point arithmetic is not.
Don't just blindly scale up approaches that seemed to work on small data. Analyze them carefully. And last but not least, consider if adding more data will actually give you extra precision.
2013-11-02 23:47 - Categories: English Technology Research Coding

Big Data madness and reality

"Big Data" has been hyped a lot, and the hype is now burning down and getting some reality checking. This has been overdue, but it is now happening.
I have seen a lot of people be very excited about NoSQL databases; and these databases do have their use cases. However, you just cannot put everything into NoSQL mode and expect it to just work. That is why we have recently seen NewSQL databases, query languages for NoSQL databases etc. - and in fact, they all move towards relational database management again.
Sound new database systems seem to be mostly made up of three aspects: in-memory optimization instead of optimizing for on-disk operation (memory has just become a lot cheaper over the last 10 years), the execution of queries on the servers that store the actual data (you may want to call this "stored procedures" if you like SQL, or "map-reduce" if you like buzzwords), and optimized memory layouts (many of the benefits of "column store" databases come from having a single, often primitive, datatype for the full column to scan, instead of alternating datatypes in records).

However, here is one point you need to consider:
is your data actually this "big"? Big as in: Google scale.
I see people use big data and Hadoop a lot when they just shouldn't. I see a lot of people run Hadoop in a VM on their laptop. Ouch.
The big data technologies are not a one-size-fits-all solution. They are the supersize-me solution, and supersize just does not fit every task.
When you look at the cases where Hadoop is really successful, it is mostly in keeping the original raw data, and enabling people to re-scan the full data again when e.g. their requirements changed. This is what Hadoop is really good at: managing 100 TB of data, and allowing you to quickly filter out the few GB that you really need for your current task.
For the actual analysis, or when you don't have 100 TB and a large cluster anyway, just don't try to hadoopify everything.
Here is a raw number from a current experiment. I have a student work on indexing image data; he is implementing some state of the art techniques. For these, a large number of image features are extracted, and then clustering is used to find "typical" features to improve image search.
The benchmarking image set is 56 GB (I have others with 2 TB to use next). The subset the student is currently processing is 13 GB. Extracting 2.3 million 128 dimensional feature vectors reduces this to about 1.4 GB. As you can see, the numbers drop quickly.
State of the art seems to be to load the data into Hadoop, and run clustering (actually, this is more of a vector quantization than clustering) into 1000 groups. Mahout is the obvious candidate to run this on Hadoop.
However, as I've put a lot of effort into the data mining software ELKI, I also considered processing it in ELKI.
By cutting the data into 10 MB blocks, Mahout/Hadoop can run the clustering in 52 parallel mappers. k-Means is an iterative algorithm, so it needs multiple passes over the data. I have fixed the number of iterations to 10, which should produce a good enough approximation for my use cases.
K-means is embarrassingly parallel, so one would expect the cluster to really shine at this task. Well, here are some numbers:
  • Mahout k-Means took 51 minutes on average per iteration (the raw numbers are 38:24, 62:29, 46:21, 56:20, 52:15, 45:11, 57:12, 44:58, 52:01, 50:26, so you can see there is a surprisingly high amount of variance there).
  • ELKI k-Means on a single CPU core took 4 minutes 25 seconds per iteration, and 45 minutes total, including parsing the input data from an ASCII file. Maybe I will try a parallel implementation next.
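The averages can be rechecked from the raw numbers with a few lines of Python. This is only arithmetic on the timings listed above; the variable names are mine:

```python
# Per-iteration Mahout timings from the list above, as "minutes:seconds".
mahout_raw = ["38:24", "62:29", "46:21", "56:20", "52:15",
              "45:11", "57:12", "44:58", "52:01", "50:26"]

def to_seconds(stamp):
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)

mahout = [to_seconds(s) for s in mahout_raw]
mahout_mean = sum(mahout) / len(mahout)
# Sample standard deviation of the iteration times.
mahout_std = (sum((t - mahout_mean) ** 2 for t in mahout)
              / (len(mahout) - 1)) ** 0.5

elki = 4 * 60 + 25  # ELKI: 4 minutes 25 seconds per iteration
speedup = mahout_mean / elki

print(round(mahout_mean / 60, 1), round(mahout_std / 60, 1), round(speedup, 1))
```

This gives a mean of roughly 50.6 minutes per Mahout iteration, a spread of several minutes between iterations, and a per-iteration ratio a bit above 11 in favor of the single-core ELKI run.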

So what is happening? Why is ELKI beating Mahout by a factor of 10x?
It's (as always) a mixture of a number of things:
  • ELKI is quite well optimized. The Java Hotspot VM does a good job at optimizing this code, and I have seen it to be on par with R's k-means, which is written in C. I'm not sure if Mahout has received a similar amount of optimization yet. (In fact, 15% of the Mahout runtime was garbage collection runtime - indicating that it creates too many objects.)
  • ELKI can use the data in a uniform way, similar to a column store database. It's literally crunching the raw double[] arrays. Mahout on the other hand - as far as I can tell - is getting the data from a sequence file, which then is deserialized into a complex object. In addition to the actual data, it might be expecting sparse and dense vectors mixed etc.
  • Size: this data set fits well into memory. Once this no longer holds, ELKI will no longer be an option. Then MapReduce/Hadoop/Mahout wins. In particular, such an implementation will by design not keep the whole data set in memory, but need to de-serialize it from disk again on each iteration. This is overhead, but saves memory.
  • Design: MapReduce is designed for huge clusters, where you must expect nodes to crash during your computation. Well, chances are that my computer will survive 45 minutes, so I do not need this for this data size. However, when you really have large data, and need multiple hours on 1000 nodes to process it, then this becomes important to survive losing a node. The cost is that all interim results are written to the distributed filesystem. This extra I/O comes, again, at a cost.
Let me emphasize this: I'm not saying, Hadoop/Mahout is bad. I'm saying: this data set is not big enough to make Mahout beneficial.

Conclusions: As long as your data fits into your main memory and takes just a few hours to compute, don't go for Hadoop.
It will likely be faster on a single node by avoiding the overhead associated with (current) distributed implementations.
Sometimes, it may also be a solution to use the cluster only for preparing the data, then get it to a powerful workstation, and analyze it there. We did do this with the images: for extracting the image features, we used a distributed implementation (not on Hadoop, but on a dozen PCs).
I'm not saying it will stay that way. I have plans for starting "Project HELKI", aka "ELKI on Hadoop", because I do sometimes hit the barrier of what I can compute on my 8 core CPU in a "reasonable" amount of time. And of course, Mahout will improve over time, and hopefully lose some of its "Java boxing" weight.
But before trying to run everything on a cluster, always try to run it on a single node first. You can still figure out how to scale up later, once you really know what you need.
And last but not least, consider whether scaling up really makes sense. K-means results don't really get better with a larger data set. They are averages, and adding more data will likely only change the last few digits. Now if you have an implementation that doesn't pay attention to numerical issues, you might end up with more numerical error than you gained from that extra data.
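As a rough rule of thumb (assuming independent observations with a common standard deviation sigma, which real data only approximates), the standard error of a mean shrinks with the square root of the number of observations. A small sketch with a hypothetical sigma of 1.0:

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean of n independent observations."""
    return sigma / math.sqrt(n)

sigma = 1.0  # hypothetical per-observation standard deviation
for n in (10_000, 1_000_000, 100_000_000):
    print(n, standard_error(sigma, n))
```

Each factor of 100 in data size buys only one more decimal digit in the mean. That gain is easily eaten up again if the summation itself is not numerically careful.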
In fact, k-means can effectively be accelerated by processing a sample first, and only refining this on the full data set then. And: sampling is the most important "big data" approach. In particular when using statistical methods, consider whether more data can really be expected to improve your quality.
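The sample-first idea can be sketched in plain Python. This is not how ELKI or Mahout implement it; the function names, the toy data and the iteration counts are all assumptions made for illustration:

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centers):
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))

def lloyd(points, centers, iterations):
    """Run a fixed number of standard k-means (Lloyd) iterations."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else old
            for g, old in zip(groups, centers)
        ]
    return centers

def sse(points, centers):
    """Sum of squared deviations, the objective k-means optimizes."""
    return sum(sq_dist(p, centers[nearest(p, centers)]) for p in points)

def sample_then_refine(points, k, sample_size, rng,
                       sample_iters=10, full_iters=2):
    # Many cheap iterations on a small sample ...
    sample = rng.sample(points, sample_size)
    coarse = lloyd(sample, rng.sample(sample, k), sample_iters)
    # ... then only a few expensive passes over the full data set.
    return coarse, lloyd(points, coarse, full_iters)

rng = random.Random(0)
# Hypothetical toy data: three Gaussian blobs in 2-D.
points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
          for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(2000)]
coarse, refined = sample_then_refine(points, k=3, sample_size=300, rng=rng)
print(sse(points, coarse), sse(points, refined))
```

The refinement passes can only keep or lower the sum of squared deviations on the full data, and in practice the centers found on the sample are usually already close.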
Better algorithms: K-means is a very crude heuristic. It optimizes the sum of squared deviations, which may not be very meaningful for your problem. It is not very noise tolerant, either. And there are thousands of variants. For example bisecting k-means, which no longer is embarrassingly parallel (i.e. it is not as easy to implement on MapReduce), but took just 7 minutes doing 20 iterations for each split. The algorithm can be summarized as starting with k=2, then always splitting the largest cluster in two until you have the desired k. For many use cases, this result will be just as good as the full k-means result.
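The bisecting scheme described above can be sketched as follows. This is a minimal illustration and not the ELKI implementation; "largest" is taken to mean the cluster with the most points, and all names and the toy data are my own assumptions:

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def two_means(points, rng, iterations=20):
    """Split one cluster in two with plain k-means (k=2)."""
    centers = rng.sample(points, 2)
    halves = [list(points), []]
    for _ in range(iterations):
        halves = [[], []]
        for p in points:
            side = 1 if sq_dist(p, centers[1]) < sq_dist(p, centers[0]) else 0
            halves[side].append(p)
        # Keep the old center if one half ends up empty.
        centers = [mean(h) if h else c for h, c in zip(halves, centers)]
    return halves

def bisecting_kmeans(points, k, rng):
    clusters = [list(points)]
    while len(clusters) < k:
        # Always split the cluster that currently holds the most points.
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        largest = clusters.pop(idx)
        clusters.extend(two_means(largest, rng))
    return clusters

rng = random.Random(1)
# Hypothetical toy data: four Gaussian blobs in 2-D.
points = [(rng.gauss(cx, 0.4), rng.gauss(cy, 0.4))
          for cx, cy in [(0, 0), (6, 0), (0, 6), (6, 6)] for _ in range(500)]
clusters = bisecting_kmeans(points, k=4, rng=rng)
print([len(c) for c in clusters])
```

Each split only touches the points of one cluster, so later splits get cheaper and cheaper. The price is that the splits depend on each other, which is what makes the scheme harder to express as independent MapReduce jobs.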
Don't get me wrong. There are true big data problems. Web scale search, for example. Or product and friend recommendations at Facebook scale. But chances are that you don't have this amount of data. Google probably doesn't employ k-means at that scale either. (Actually, Google runs k-means on 5 million keypoints for vector quantization, which, judging from my experience here, can still be done on a single host, in particular with hierarchical approaches such as bisecting k-means.)
Don't choose the wrong tool for your problem!
2013-09-27 19:34 - Categories: English Coding Technology Research