
for(DBIDIter id = relation.iterDBIDs(); id.valid(); id.advance()) is a typical for loop in ELKI, iterating over all objects of a relation, yet the whole loop requires creating (and GC'ing) only a single object. And actually, this is as literal as a for loop can get.
ModifiableDBIDs processedIDs = DBIDUtil.newHashSet(size); is another example. Essentially, this is like a HashSet<DBID>, except that it is a lot faster, because the object IDs do not need to live as Java objects but can be stored more efficiently internally (the only currently available implementation of the DBID layer uses primitive integers).
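To make the idea behind both examples concrete, here is a minimal sketch of a cursor-style iterator and a hash set that stores ids in a plain int[]. This is not ELKI's code: the names IntIdIter, IntIdSet and countDistinct are made up for illustration, and the set is deliberately simplistic (non-negative ids only, no resizing).

```java
import java.util.Arrays;

/** Toy illustration only: hypothetical types, not ELKI's DBID classes. */
final class IntIdSketch {
  /** Cursor-style iterator: one object per loop, no boxing per element. */
  static final class IntIdIter {
    private final int[] ids;
    private int pos = 0;

    IntIdIter(int[] ids) { this.ids = ids; }

    boolean valid() { return pos < ids.length; }

    void advance() { pos++; }

    int intValue() { return ids[pos]; }
  }

  /** Fixed-capacity hash set of non-negative ints, stored in a plain int[]. */
  static final class IntIdSet {
    private final int[] table; // -1 marks an empty slot
    private int size = 0;

    IntIdSet(int expectedSize) {
      int capacity = 4;
      while (capacity < 2 * expectedSize) {
        capacity <<= 1; // power of two, so the table stays at most half full
      }
      table = new int[capacity];
      Arrays.fill(table, -1);
    }

    /** Adds an id; returns true if it was not present before. */
    boolean add(int id) {
      if (id < 0) throw new IllegalArgumentException("ids must be non-negative");
      int mask = table.length - 1;
      int slot = id & mask;
      while (table[slot] != -1) { // linear probing
        if (table[slot] == id) return false;
        slot = (slot + 1) & mask;
      }
      if (2 * (size + 1) > table.length) throw new IllegalStateException("sketch does not resize");
      table[slot] = id;
      size++;
      return true;
    }

    int size() { return size; }
  }

  /** Counts distinct ids using one iterator object and one int[]-backed set. */
  static int countDistinct(int[] ids) {
    IntIdSet seen = new IntIdSet(ids.length);
    for (IntIdIter it = new IntIdIter(ids); it.valid(); it.advance()) {
      seen.add(it.intValue());
    }
    return seen.size();
  }

  public static void main(String[] args) {
    System.out.println(countDistinct(new int[] {3, 1, 3, 7, 1}));
  }
}
```

The point of the pattern is that the loop allocates exactly one iterator and the set allocates exactly one array, whereas a HashSet<Integer> would allocate a boxed object plus a hash entry per id.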
ELKI is a data mining software project that I have been working on for the last few years as part of my PhD research. It is open source (AGPL-3 licensed) and available as both a Debian package and an Ubuntu package in the official repositories. So a simple aptitude install elki should get you going and give you a menu entry for ELKI. These packages come with the scripts elki to launch the MiniGUI and elki-cli to run from the command line.
The key feature that sets ELKI apart from existing open source tools used in
data mining (e.g. Weka and R) is that it has support for index structures to
speed up algorithms, and a very modular architecture that allows various
combinations of data types, distance functions, index structures and
algorithms. When looking for performance regressions and optimization potential
in ELKI, I recently ran some benchmarks on a data set with 110250 images
described by 8-dimensional color histograms. This is a decently sized dataset:
it takes long enough (usually in the range of 1-10 minutes) to measure true
hotspots. When including Weka and R in the comparison I was quite surprised:
our k-means implementation runs at the same speed as R's implementation in C
(and around twice that of the more flexible "flexclus" version). For some of
the key algorithms (DBSCAN, OPTICS, LOF) we are an order of magnitude faster
than Weka and R, and adding index support speeds up the computation by another
factor of 5-10x. In the most extreme case - DBSCAN in Weka vs. DBSCAN with
R-tree in ELKI - the speedup was a factor of 330x, or 2 minutes (ELKI) as
opposed to 11 hours (Weka). (Update 11.1.2013: Weka now takes 84 minutes,
after some code cleanup.)
The reason why I was surprised is that I expected ELKI to perform much
worse. It is written in Java (as opposed to R's kmeans, which is in C), uses a
very flexible architecture which, for example, does not assume distances to be of
type double, and just has a lot of glue code in between. However, obviously,
the Java HotSpot compiler actually lives up to expectations and manages to
inline the whole distance computation into k-means, and then compiles it at a
level comparable to C. R executes vectorized operations quite fast, but on
non-native code, as in the LOF example, it can become quite slow, too. (I would not
take Weka as a reference; in particular with DBSCAN and OPTICS there seems to be
something seriously broken. [Update 11.1.2013: Eibe Frank from Weka
had a look at Weka DBSCAN and removed some unnecessary safety checks in the
code, yielding a 7.5x speedup.] Judging from a quick look at it, the OPTICS
implementation is not even complete, and both implementations
copy all data out of Weka into a custom linear database, process it there, then
feed the result back into Weka. They should just drop that "extension"
altogether. The much newer and Weka-like LOF module is much more comparable.)
Note that we also have a different focus than Weka. Weka is really popular for machine learning, in particular for classification. In ELKI, we do not have a single classification algorithm, because there is Weka for that. Instead, ELKI focuses on cluster analysis and outlier detection. And ELKI has a lot of algorithms in this domain; I dare say the largest collection. In particular, they are all in the same framework, so they can be easily compared. R does of course have an impressive collection in CRAN, but in the end the packages do not really fit together.
Anyway, ELKI is a cool research project. It keeps on growing, and we have a number of students writing extensions as part of their theses. It has been extremely helpful for me in my own research, as I could quickly prototype some algorithms, then try different combinations and use my existing evaluation and benchmarking. You need some time to get started (largely because of the modular architecture, Java generics and similar hurdles), but then it is a very powerful research tool.
But there are many more algorithms out there, published sometime, somewhere, but rarely with source code available. We'd love to get all these published algorithms into ELKI, so researchers can try them out. And enhance them. And use them for their actual data. So far, ELKI was mostly used for algorithmic research, but it's starting to move out into the "real" world. More and more people who are not computer scientists are starting to use ELKI to analyze their data, because it has algorithms that no other tool has.
I tried to get ELKI into the "Google Summer of Code", but it was not accepted. But I'd really like to see it gain more traction outside the university world. There are a number of cool projects associated with ELKI that I will not be able to do myself in the next few years, unfortunately.
If you are a researcher in cluster analysis or outlier detection, consider contributing your algorithms to ELKI. Spend some time optimizing them and adding some documentation. Because, if ELKI keeps on growing and gaining popularity, it will be the future benchmark platform. And this can give you citations, which are somewhat the currency of science these days. Algorithms available in the major toolkits simply get cited more, because people compare to them. See this list for an overview of work cited by ELKI - scientific work that we reimplemented at least to some extent for ELKI. It is one of the services that we provide with ELKI for researchers: not only the algorithm, but also the appropriate citation.
Does anyone know how to set up a partial subversion mirror?
Essentially we have a source code tree, containing various subdirectories,
including doc, test, src/tld/domain and
src/experimentalcode/usernames.
We'd like to allow public access to the main source subtree, while not exposing
the various user directories to the web server (the master subversion is
separate anyway; since I want to use Trac I need a local SVN copy!)
I'm currently trying to set up "tailor" to do this, but it is not trivial: it seems easy at first to just ignore any change in a particular folder. But when you encounter "move" aka "rename" operations that move a file from the non-exposed folder to the exposed folder, they will obviously fail on the "stripped" repository. So they need to be mapped to an "add" operation there!
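To spell out the mapping I have in mind, here is a rough sketch of the filtering rule. It does not use tailor's or Subversion's actual API; the Change and Kind types, the filter method and the file names are hypothetical and only model the decision logic. It also assumes that the content for a synthesized "add" can be fetched from the master repository, which the sketch does not do.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the filtering rule; not tailor's or Subversion's API. */
final class PartialMirrorSketch {
  enum Kind { ADD, MODIFY, DELETE, MOVE }

  /** One path-level change inside a changeset. */
  static final class Change {
    final Kind kind;
    final String from; // source path, only set for MOVE
    final String path; // affected path (destination for MOVE)

    Change(Kind kind, String from, String path) {
      this.kind = kind;
      this.from = from;
      this.path = path;
    }

    @Override
    public String toString() {
      return kind + " " + (from != null ? from + " -> " : "") + path;
    }
  }

  /** Keeps only changes visible under the exposed prefix, rewriting moves that cross it. */
  static List<Change> filter(List<Change> changes, String exposedPrefix) {
    List<Change> out = new ArrayList<>();
    for (Change c : changes) {
      boolean toExposed = c.path.startsWith(exposedPrefix);
      if (c.kind != Kind.MOVE) {
        if (toExposed) out.add(c); // plain change inside the public tree
        continue;
      }
      boolean fromExposed = c.from.startsWith(exposedPrefix);
      if (fromExposed && toExposed) {
        out.add(c); // rename within the public tree: keep as is
      } else if (toExposed) {
        out.add(new Change(Kind.ADD, null, c.path)); // hidden -> public becomes an add
      } else if (fromExposed) {
        out.add(new Change(Kind.DELETE, null, c.from)); // public -> hidden becomes a delete
      }
      // hidden -> hidden: dropped entirely
    }
    return out;
  }

  public static void main(String[] args) {
    List<Change> changes = Arrays.asList(
        new Change(Kind.MOVE, "src/experimentalcode/usernames/Foo.java", "src/tld/domain/Foo.java"),
        new Change(Kind.MODIFY, null, "src/experimentalcode/usernames/Bar.java"),
        new Change(Kind.MOVE, "src/tld/domain/Old.java", "src/experimentalcode/usernames/Old.java"));
    for (Change c : filter(changes, "src/tld/domain/")) {
      System.out.println(c);
    }
  }
}
```

Copies across the boundary would presumably need the same treatment as moves, and a changeset that ends up empty after filtering would have to be skipped.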
Any hints? I might have it working using some hacks in tailor, but I'd like to know any simple solution, if there is any ...