ELKI is a data mining software project that I have been working on for the last few years as part of my PhD research. It is open source (AGPL-3 licensed) and available as both a Debian package and an Ubuntu package in the official repositories. So a simple aptitude install elki should get you going and give you a menu entry for ELKI. These packages come with the scripts elki to launch the MiniGUI and elki-cli to run from the command line.
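
For example, a simple k-means run from the command line looks roughly like this (the file name below is just a placeholder, and the exact parameter names can differ between ELKI versions - the MiniGUI lists the valid options for your version and can help assemble such a command line):

  elki-cli -dbc.in mydata.csv \
           -algorithm clustering.kmeans.KMeansLloyd \
           -kmeans.k 10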

The key feature that sets ELKI apart from existing open source data mining tools (e.g. Weka and R) is its support for index structures to speed up algorithms, combined with a very modular architecture that allows arbitrary combinations of data types, distance functions, index structures and algorithms. When looking for performance regressions and optimization potential in ELKI, I recently ran some benchmarks on a data set of 110,250 images described by 8-dimensional color histograms. This is a decently sized data set: runs take long enough (usually in the range of 1-10 minutes) to measure true hotspots. When including Weka and R in the comparison I was quite surprised: our k-means implementation runs at the same speed as R's implementation in C (and around twice the speed of the more flexible “flexclust” version). For some of the key algorithms (DBSCAN, OPTICS, LOF) we are an order of magnitude faster than Weka and R, and adding index support speeds up the computation by another factor of 5-10x. In the most extreme case - DBSCAN in Weka vs. DBSCAN with an R-tree in ELKI - the speedup was a factor of 330x: 2 minutes with ELKI as opposed to 11 hours with Weka (update 11.1.2013: 84 minutes, after some code cleanup).
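
To illustrate what “adding index support” means in practice, here is a sketch of the DBSCAN case with and without an R*-tree (the input file and the epsilon/minPts values are placeholders, and the class and parameter names are from memory - check the MiniGUI or the documentation for the exact spelling in your ELKI version):

  # DBSCAN with a linear scan:
  elki-cli -dbc.in histograms.csv \
           -algorithm clustering.DBSCAN \
           -dbscan.epsilon 0.1 -dbscan.minpts 20

  # The same run, accelerated by an R*-tree index on the database:
  elki-cli -dbc.in histograms.csv \
           -db.index tree.spatial.rstarvariants.rstar.RStarTreeFactory \
           -algorithm clustering.DBSCAN \
           -dbscan.epsilon 0.1 -dbscan.minpts 20

The algorithm itself is unchanged: the index is attached to the database layer, and any algorithm that issues range or kNN queries automatically benefits from it.
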
The reason I was surprised by these numbers is that I expected ELKI to perform much worse. It is written in Java (as opposed to R's kmeans, which is in C), uses a very flexible architecture which, for example, does not assume distances to be of type double, and has a lot of glue code in between. However, the Java HotSpot compiler obviously lives up to expectations: it manages to inline the whole distance computation into k-means, and compiles it to a level comparable to C. R executes vectorized operations quite fast, but non-native code, as in the LOF example, can become quite slow, too. (I would not take Weka as a reference; in particular with DBSCAN and OPTICS there seems to be something seriously broken. Update 11.1.2013: Eibe Frank from Weka had a look at Weka's DBSCAN and removed some unnecessary safety checks in the code, yielding a 7.5x speedup. Judging from a quick look at it, the OPTICS implementation is not even complete, and both implementations copy all data out of Weka into a custom linear database, process it there, then feed the result back into Weka. They should just drop that “extension” altogether. The much newer and more Weka-like LOF module is much more comparable.)

Note that we also have a different focus than Weka. Weka is really popular for machine learning, in particular for classification. In ELKI, we do not have a single classification algorithm, because Weka already covers that. Instead, ELKI focuses on cluster analysis and outlier detection. And ELKI has a lot of algorithms in this domain - I dare say the largest collection. In particular, they all live in the same framework, so they can easily be compared. R does of course have an impressive collection on CRAN, but in the end the packages do not really fit together.

Anyway, ELKI is a cool research project. It keeps growing; we have a number of students writing extensions as part of their theses. It has been extremely helpful for me in my own research, as I could quickly prototype algorithms, try different combinations, and reuse my existing evaluation and benchmarking setup. You need some time to get started (largely because of the modular architecture, Java generics and similar hurdles), but then it is a very powerful research tool.

But there are many more algorithms, published at some point, somewhere, yet rarely with source code available. We'd love to get all these published algorithms into ELKI, so researchers can try them out. And enhance them. And use them on their actual data. So far, ELKI has mostly been used for algorithmic research, but it is starting to move out into the “real” world. More and more people who are not computer scientists are starting to use ELKI to analyze their data, because it has algorithms that no other tools have.

I tried to get ELKI into the “Google Summer of Code”, but it was not accepted. Still, I'd really like to see it gain more traction outside the university world. There are a number of cool projects associated with ELKI that I will, unfortunately, not be able to do myself in the next years.

  • A web browser frontend would be cool. Maybe even connected to Google Refine, using Refine to preprocess the data and then migrating it into ELKI for analysis. The current visualization engine of ELKI uses SVG - this should be fairly easy to port to the web browser. Likely, the web browser will even be faster than the current Apache Batik renderer.
  • Visual programming frontend. Weka, RapidMiner, Orange: they all have visual-programming-style UIs. This seems to work quite well for modeling the data flow of an analysis. I'd love to see this for ELKI, too.
  • Cluster/Cloud backend. ELKI can already handle fairly large data sets on a big enough system. If someone spends extra effort on the index structures, the data won't even need to fit into main memory anymore. Yet everybody now wants “big data”, and parallel computation probably is the future. I'm currently working on some first Hadoop YARN based experiments with ELKI. But turning this into true “elk yarn” is a huge field. I will likely only lay some foundations (unless I get funding to continue working on this as a postdoc - I sure hope to get at least a few years of postdoc somewhere, as I really enjoy working with students on this kind of project).
  • New visualization engine. The current visualization engine, based on Apache Batik and SVG, is quite okay. It does what I need: a quick glance at the results, and the ability to export them in print quality for publications (in particular, I can easily edit the SVG files with Inkscape). But it is not really fancy (although we have a lot of cool visualizations). And it is slow. I haven't found a portable and fast graphics toolkit for Java yet that can produce SVG files. There is a lot of hype around Processing, for example, but it seems to be too much about art for my needs. In fact, I'd love to use something like Clutter or Cairo. But getting them to work on Windows and Mac OS X will likely be a pain.
  • Human Computer Interaction (HCI). This is, in my opinion, the biggest challenge we are facing with all the “big data” stuff. If you really go into big data (and not just run Hadoop and Mahout on a single system; yes - a lot of people seem to do this), you will at some point need to go beyond just crunching the numbers. So far, the challenges we are tackling are largely data summarization and selection. TeraSort is a cool project, and a challenge. Yet what do you actually get from sorting this large amount of data? What do you get from running k-means on a terabyte? When doing data mining on a small data set, you quickly learn that the main challenge actually is preprocessing the data and choosing parameters the right way, so that your result is not bogus. Unless you are doing simple prediction tasks, you often don't have a clearly defined objective. Sure, when predicting churn rates, you can just throw all the data into a big cloud and hope some enlightenment comes out. But when you are doing cluster analysis or outlier detection - unsupervised methods - the actual objective by definition cannot be hardcoded into a computer. The key objective then is to learn something new about the data set. But if you want the user to learn something about the data set, you will have to let the user guide the whole process, and you will have to present results to the user. Which gets immensely more difficult with larger data. Big data just no longer looks like that. And neither are the algorithms as simple as k-means or hierarchical clustering. Hierarchical clustering is good for teaching the basic ideas of clustering, but you will not be using a dendrogram for a huge data set. Plus, it has a naive complexity of O(n^3), O(n^2) for some special cases - too slow for truly big data.
    For the “big data future”, once we get over all the excitement of being able to just somehow crunch these numbers, we will need to seriously look into what to do with the results (in particular, how to present them to the user), and how to make the algorithms accessible and usable for non-techies. Right now, you cannot expect a social sciences researcher to be able to use a Hadoop cluster, let alone make sense of the results. But if you are smart enough to actually solve this, and open up “big data processing” to the average non-IT user, this will be big.
  • Oh, and of course there are hundreds of algorithms not yet available (accessible) as open source - not in ELKI, and usually not anywhere else either. Just to name a few from my wishlist (I could probably implement many of them in a few hours in ELKI, but I don't have the time to do so myself; plus, they are good student or starter projects for getting used to ELKI): BIRCH, CLARA, CLARANS, CLINK, COBWEB, CURE, DOC, DUSC, EDSC, INSCY, MAFIA, P3C, SCHISM, STATPC, SURFING, …

If you are a researcher in cluster analysis or outlier detection, consider contributing your algorithms to ELKI. Spend some time optimizing them and adding some documentation. Because if ELKI keeps growing and gaining popularity, it will be the benchmark platform of the future. And this can earn you citations, which are somewhat the currency of science these days. Algorithms available in the major toolkits simply get cited more, because people compare against them. See this list for an overview of the work cited by ELKI - scientific work that we reimplemented, at least to some extent, for ELKI. It is one of the services we provide with ELKI for researchers: not only the algorithm, but also the appropriate citation.