DBSCAN [wikipedia] and OPTICS [wikipedia] are two of the most well-known density based clustering algorithms. You can read more on them in the Wikipedia articles linked above.

An interesting property of density-based clustering is that these algorithms do not assume clusters to have a particular shape. Furthermore, the algorithms allow “noise” objects that do not belong to any of the clusters. K-means, for example, partitions the data space into Voronoi cells (some people claim it produces spherical clusters; that is incorrect). See Wikipedia for the true shape of K-means clusters and for an example data set that cannot be clustered by K-means. Internal measures for cluster evaluation also usually assume the clusters to be well-separated spheres (and do not allow noise/outlier objects), which is not surprising, as we tend to experiment with artificial data generated from a number of Gaussian distributions.

The key parameter of DBSCAN and OPTICS is “minPts”. It roughly controls the minimum size of a cluster. If you set it too low, everything will become a cluster (OPTICS with minPts=2 degenerates to a type of single-link clustering). If you set it too high, at some point there won’t be any clusters anymore, only noise. However, the parameter usually is not hard to choose. If you expect clusters to typically have 100 objects, for example, I’d start with a value of 10 or 20. If your clusters are expected to have 10000 objects, then maybe start experimenting with 500.

The more difficult parameter for DBSCAN is the radius (“epsilon”). In some cases it will be very obvious: say you are clustering users on a map, then you might know that a good radius is 1 km, or 10 km, whatever makes sense for your particular application. In other cases, the parameter will not be obvious, or you might need multiple values. That is when OPTICS comes into play.
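To make the role of the two parameters concrete, here is a minimal sketch of the density test at the heart of DBSCAN: a point is a “core point” if at least minPts objects lie within the radius. (Plain Java, independent of ELKI; the brute-force linear scan and the array-of-vectors data layout are just for illustration.)

import java.util.ArrayList;
import java.util.List;

public class CoreTest {
  // Brute-force range query: all points within distance eps of point q
  // (including q itself, which is the usual DBSCAN convention).
  static List<Integer> rangeQuery(double[][] data, int q, double eps) {
    List<Integer> result = new ArrayList<>();
    for(int i = 0; i < data.length; i++) {
      if(euclidean(data[q], data[i]) <= eps) {
        result.add(i);
      }
    }
    return result;
  }

  static double euclidean(double[] a, double[] b) {
    double agg = 0.;
    for(int d = 0; d < a.length; d++) {
      double diff = a[d] - b[d];
      agg += diff * diff;
    }
    return Math.sqrt(agg);
  }

  // The density criterion: q is a core point iff its eps-neighborhood
  // contains at least minPts objects. Raising minPts prunes accidental
  // accumulations; raising eps merges neighborhoods.
  static boolean isCorePoint(double[][] data, int q, double eps, int minPts) {
    return rangeQuery(data, q, eps).size() >= minPts;
  }
}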

OPTICS is based on a very clever idea: instead of fixing both minPts and the radius, we only fix minPts, and plot for each object the radius at which it would be considered dense by DBSCAN. In order to sort the objects in this plot, they are processed with a priority heap, so that nearby objects are nearby in the plot. This image on Wikipedia shows an example of such a plot.
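Schematically, the main loop looks roughly like this (a condensed sketch, not ELKI’s implementation; the helpers rangeQuery, coreDistance, distance and emit are assumed, and the heap uses lazy re-insertion instead of a decrease-key operation):

abstract class OpticsSketch {
  // Assumed helpers: neighbors within eps, the core distance (distance to
  // the minPts-th neighbor, infinite if there are fewer than minPts
  // neighbors within eps), the pairwise distance, and plot output.
  abstract int[] rangeQuery(int p, double eps);
  abstract double coreDistance(int p, double eps, int minPts);
  abstract double distance(int p, int q);
  abstract void emit(int p, double reachability);

  void optics(int n, double eps, int minPts) {
    boolean[] processed = new boolean[n];
    double[] reach = new double[n];
    java.util.Arrays.fill(reach, Double.POSITIVE_INFINITY);
    // Heap entries are { point id, reachability at insertion time };
    // stale duplicates are skipped via the processed[] check below.
    java.util.PriorityQueue<double[]> heap =
        new java.util.PriorityQueue<>((a, b) -> Double.compare(a[1], b[1]));
    for(int start = 0; start < n; start++) {
      if(processed[start]) {
        continue;
      }
      heap.add(new double[] { start, Double.POSITIVE_INFINITY });
      while(!heap.isEmpty()) {
        int p = (int) heap.poll()[0];
        if(processed[p]) {
          continue; // stale entry, already handled with a better reachability
        }
        processed[p] = true;
        emit(p, reach[p]); // one bar of the reachability plot
        double coreDist = coreDistance(p, eps, minPts);
        if(Double.isInfinite(coreDist)) {
          continue; // p is not a core point, do not expand from here
        }
        for(int q : rangeQuery(p, eps)) {
          if(!processed[q]) {
            double newReach = Math.max(coreDist, distance(p, q));
            if(newReach < reach[q]) {
              reach[q] = newReach;
              heap.add(new double[] { q, newReach }); // lazy decrease-key
            }
          }
        }
      }
    }
  }
}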

OPTICS comes at a cost compared to DBSCAN, largely because of the priority heap, but also because the nearest-neighbor queries are more complicated than the radius queries of DBSCAN. So it will be slower, but you no longer need to set the parameter epsilon. However, OPTICS won’t produce a strict partitioning. Primarily it produces the plot, and in many situations you will actually want to visually inspect the plot. There are methods to extract a hierarchical partitioning from this plot, based on detecting “steep” areas.
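The simplest extraction is a horizontal cut: pick a threshold, and every contiguous valley of the plot below it becomes a cluster, which roughly reproduces a DBSCAN result at that radius (the “steep area” methods are more refined than this). A minimal sketch, assuming the reachability distances are given as an array in OPTICS processing order:

// Hypothetical helper: assign cluster ids by cutting the reachability
// plot at a fixed threshold. reach[i] is the reachability distance of
// the i-th point in OPTICS processing order; -1 marks noise.
static int[] cutReachabilityPlot(double[] reach, double threshold) {
  int[] cluster = new int[reach.length];
  int current = -1;
  boolean openNew = true;
  for(int i = 0; i < reach.length; i++) {
    if(reach[i] > threshold) {
      cluster[i] = -1; // too sparse here: noise (a proper extraction would
      openNew = true;  // also check if this point starts the next cluster)
    }
    else {
      if(openNew) {
        current++; // descended below the threshold: a new valley begins
        openNew = false;
      }
      cluster[i] = current;
    }
  }
  return cluster;
}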

The open source ELKI data mining framework (package “elki” in Debian and Ubuntu) has a very fast and flexible implementation of both algorithms. I’ve benchmarked it against GNU R (the “fpc” package) and Weka, and the difference is enormous. ELKI without index support runs in roughly 11 minutes, with index down to 2 minutes for DBSCAN and 3 minutes for OPTICS. Weka takes 11 hours (update 11.1.2013: 84 minutes with the 1.0.3 revision of the DBSCAN extension, which resolved some performance issues), and GNU R/fpc takes 100 minutes (DBSCAN only, no OPTICS available). And the OPTICS implementation in Weka is not even complete (it does not support proper cluster extraction from the plot). Many of the other OPTICS implementations you can find with Google (e.g. in Python or MATLAB) seem to be based on this Weka version …

ELKI is open source. So if you want to peek at the code, here are direct links: DBSCAN.java, OPTICS.java.

Some parts of the code may be a bit confusing at first. The “Parameterizer” classes, for example, serve the purpose of allowing automatic UI generation, so there is quite a bit of meta code involved.
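From memory, the pattern looks roughly like this; this is a schematic, not actual ELKI source, and the details of the option classes may differ:

// Schematic of the Parameterizer pattern (simplified): a static inner
// class declares the options, so that command line parsing and UIs can
// be generated automatically, and then builds the configured instance.
public static class Parameterizer extends AbstractParameterizer {
  protected double epsilon;
  protected int minpts;

  @Override
  protected void makeOptions(Parameterization config) {
    super.makeOptions(config);
    DoubleParameter epsilonP = new DoubleParameter(EPSILON_ID);
    if(config.grab(epsilonP)) {
      epsilon = epsilonP.getValue();
    }
    IntParameter minptsP = new IntParameter(MINPTS_ID);
    if(config.grab(minptsP)) {
      minpts = minptsP.getValue();
    }
  }

  @Override
  protected DBSCAN makeInstance() {
    return new DBSCAN(epsilon, minpts); // simplified: the real class
  }                                     // also takes a distance function
}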

Plus, ELKI is quite extensively optimized. For example, it no longer uses Java Collections much. Java Iterators require returning an object on every call to next(). The C++-style iterators used by ELKI can instead expose multiple values, including primitive values, without allocating.

for(DBIDIter id = relation.iterDBIDs(); id.valid(); id.advance())

is a typical for loop in ELKI, iterating over all objects of a relation, and the whole loop requires creating (and garbage collecting) only a single object. And it is about as close to a literal for loop as it can get.
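For comparison, the same loop written against the Java Collections API would look like the following, where collection stands for a hypothetical Collection<DBID>; here next() must hand back an object on every step:

// The classic java.util.Iterator version of the loop: one object
// reference per element visited, instead of a single iterator object
// for the whole loop.
for(Iterator<DBID> iter = collection.iterator(); iter.hasNext(); ) {
  DBID id = iter.next(); // an object is returned on every step
  // ... use id ...
}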

ModifiableDBIDs processedIDs = DBIDUtil.newHashSet(size);

is another example. Essentially, this is like a HashSet<DBID>, except that it is a lot faster, because the object IDs do not need to live as Java objects, but can internally be stored more efficiently (the only currently available implementation of the DBID layer uses primitive integers).
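The underlying idea is easy to sketch: keep the integer IDs directly in an int[] hash table instead of wrapping each one in an object. A toy version (emphatically not the actual ELKI implementation: fixed capacity, no resizing, and -1 is reserved as the empty marker, so only non-negative IDs work):

// Toy open-addressing hash set for non-negative int keys, for
// illustration only. The keys live in a plain int[]; no wrapper
// objects are ever allocated.
class ToyIntHashSet {
  private final int[] table; // slots hold the key itself; -1 means empty

  ToyIntHashSet(int capacity) {
    table = new int[capacity];
    java.util.Arrays.fill(table, -1);
  }

  boolean add(int key) { // assumes the table never fills up completely
    int i = hash(key);
    while(table[i] != -1) {
      if(table[i] == key) {
        return false; // already present
      }
      i = (i + 1) % table.length; // linear probing
    }
    table[i] = key;
    return true;
  }

  boolean contains(int key) {
    int i = hash(key);
    while(table[i] != -1) {
      if(table[i] == key) {
        return true;
      }
      i = (i + 1) % table.length;
    }
    return false;
  }

  private int hash(int key) {
    // Cheap multiplicative scrambling, then map into the table.
    return ((key * 0x9E3779B1) >>> 8) % table.length;
  }
}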

Java advocates will often accuse you of premature optimization when you avoid creating objects for primitives. Yet in all my benchmarking, I have consistently seen the number of objects you allocate have a major impact, at least when the allocation happens inside a heavily used loop. Java collections with boxed primitives simply eat a lot of memory, and the memory management overhead often makes a huge difference. This is why libraries such as Trove (which ELKI uses a lot) exist: because memory usage does make a difference.
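The contrast is easy to demonstrate with Trove’s TIntHashSet (a sketch; one million insertions each):

import java.util.HashSet;
import gnu.trove.set.hash.TIntHashSet;

public class BoxingDemo {
  public static void main(String[] args) {
    // Boxed: each add() allocates an Integer wrapper (only values in
    // [-128, 127] come from the cache), and the set stores references.
    HashSet<Integer> boxed = new HashSet<>();
    for(int i = 0; i < 1000000; i++) {
      boxed.add(i);
    }
    // Primitive: Trove keeps the ints in plain arrays, so there are
    // no wrapper objects, less memory, and better cache locality.
    TIntHashSet primitive = new TIntHashSet();
    for(int i = 0; i < 1000000; i++) {
      primitive.add(i);
    }
    System.out.println(boxed.size() + " / " + primitive.size());
  }
}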

(Avoiding boxing/unboxing systematically in ELKI yielded approximately a 4x speedup. But obviously, ELKI involves a lot of numerical computations.)