Clustering 23 mio Tweet locations

To test scalability of ELKI, I’ve clustered 23 million Tweet locations from the Twitter Statuses Sample API obtained over 8.5 months (due to licensing restrictions by Twitter, I cannot make this data available to you, sorry.

23 million points is a challenge for advanced algorithms. It’s quite feasible by k-means; in particular if you choose a small k and limit the number of iterations. But k-means does not make a whole lot of sense on this data set - it is a forced quantization algorithm, but does not discover actual hotspots.

Density-based clustering such as DBSCAN and OPTICS are much more appropriate. DBSCAN is a bit tricky to parameterize - you need to find the right combination of radius and density for the whole world. Given that Twitter adoption and usage is quite different it is very likely that you won’t find a single parameter that is appropriate everywhere.

OPTICS is much nicer here. We only need to specify a minimum object count - I chose 1000, as this is a fairly large data set. For performance reasons (and this is where ELKI really shines) I chose a bulk-loaded R*-tree index for acceleration. To benefit from the index, the epsilon radius of OPTICS was set to 5000m. Also, ELKI allows using geodetic distance, so I can specify this value in meters and do not get much artifacts from coordinate projection.

To extract clusters from OPTICS, I used the Xi method, with xi set to 0.01 - a rather low value, also due to the fact of having a large data set.

The results are pretty neat - here is a screenshot (using KDE Marble and OpenStreetMap data, since Google Earth segfaults for me right now):

Some observations: unsurprisingly, many cities turn up as clusters. Also regional differences are apparent as seen in the screenshot: plenty of Twitter clusters in England, and low acceptance rate in Germany (Germans do seem to have objections about using Twitter; maybe they still prefer texting, which was quite big in Germany - France and Spain uses Twitter a lot more than Germany).

Spam - some of the high usage in Turkey and Indonesia may be due to spammers using a lot of bots there. There also is a spam cluster in the ocean south of Lagos - some spammer uses random coordinates [0;1]; there are 36000 tweets there, so this is a valid cluster…

A benefit of OPTICS and DBSCAN is that they do not cluster every object - low density areas are considered as noise. Also, they support clusters of different shape (which may be lost in this visualiation, which uses convex hulls!) and different size. OPTICS can also produce a hierarchical result.

Note that for these experiments, the actual Tweet text was not used. This has a rough correspondence to Twitter popularity “heatmaps”, except that the clustering algorithms will actually provide a formalized data representation of activity hotspots, not only a visualization.

You can also ~~explore the clustering result in your browser~~ - ~~the Google Drive visualization functionality seems to work much better than Google Earth~~ (Google apparently by now removed this visualization capabilities.)

If you go to Istanbul or Los Angeles, you will see some artifacts - odd shaped clusters with a clearly visible spike. This is caused by the Xi extraction of clusters, which is far from perfect. At the end of a valley in the OPTICS plot, it is hard to decide whether a point should be included or not. These errors are usually the last element in such a valley, and should be removed via postprocessing. But our OpticsXi implementation is meant to be as close as possible to the published method, so we do not intend to “fix” this.

Certain areas - such as Washington, DC, New York City, and the silicon valley - do not show up as clusters. The reason is probably again the Xi extraction - these region do not exhibit the steep density increase expected by Xi, but are too blurred in their surroundings to be a cluster.

Hierarchical results can be found e.g. in Brasilia and Los Angeles.

Compare the OPTICS results above to k-means results (below) - see why I consider k-means results to be a meaningless quantization?

Sure, k-means is fast (30 iterations; not converged yet. Took 138 minutes on a single core, with k=1000. The parallel k-means implementation in ELKI took 38 minutes on a single node, Hadoop/Mahout on 8 nodes took 131 minutes, as slow as a single CPU core!). But you can see how sensitive it is to misplaced coordinates (outliers, but mostly spam), how many “clusters” are somewhere in the ocean, and that there is no resolution on the cities? The UK is covered by 4 clusters, with little meaning; and three of these clusters stretch all the way into Bretagne - k-means clusters clearly aren’t of high quality here.

If you want to reproduce these results, you need to get the current ELKI beta version (0.6.5~20141030 - the output of cluster convex hulls was just recently added to the default codebase), and of course data. The settings I used are:

-dbc.in coords.tsv.gz
-db.index tree.spatial.rstarvariants.rstar.RStarTreeFactory
-pagefile.pagesize 500
-spatial.bulkstrategy SortTileRecursiveBulkSplit
-time
-algorithm clustering.optics.OPTICSXi
-opticsxi.xi 0.01
-algorithm.distancefunction geo.LngLatDistanceFunction
-optics.epsilon 5000.0 -optics.minpts 1000
-resulthandler KMLOutputHandler -out /tmp/out.kmz

and the total runtime for 23 million points on a single core was about 29 hours. The indexes helped a lot: less than 10000 distances were computed per point, instead of 23 million - the expected speedup over a non-indexed approach is 2400.

Don’t try this with R or Matlab. Your average R clustering algorithm will try to build a full distance matrix, and you probably don’t have an exabyte of memory to store this matrix. Maybe start with a smaller data set first, then see how long you can afford to increase the data size.