My thesis is about data mining or, to be a bit more precise, about clustering correlated data in high-dimensional vector spaces.

In detail, I’m working on methods to improve upon existing clustering algorithms such as 4C (Computing Clusters of Correlation Connected Objects) and ERiC (On Exploring Complex Relationships of Correlation Clusters), where you need to pick some parameters appropriately (e.g. k for an approach based on k nearest neighbours).

My approach is twofold. On the one hand, I’m improving upon the traditional covariance-based correlation (which is quite sensitive to noise), so that the parameters become easier to pick. On the other hand, I’m working on an approach to automatically fine-tune the parameters to further improve stability.
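For readers who haven't met the traditional covariance-based approach, here is a minimal sketch of it in Python: take the k nearest neighbours of a point, compute their covariance matrix and use its strongest eigenvector as the local correlation direction. This is only the baseline, not my improved method, and the function names (k_nearest, covariance, leading_eigenvector) are made up for this example. I'm using plain power iteration to keep it free of dependencies.

```python
import math


def k_nearest(points, query, k):
    """Return the k points closest to `query` (Euclidean distance)."""
    return sorted(points, key=lambda p: math.dist(p, query))[:k]


def covariance(points):
    """Covariance matrix of a list of equally long coordinate tuples."""
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[i] for p in points) / n for i in range(dim)]
    return [
        [sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
         for j in range(dim)]
        for i in range(dim)
    ]


def leading_eigenvector(matrix, iterations=200):
    """Approximate the strongest eigenvector by power iteration."""
    dim = len(matrix)
    # Arbitrary asymmetric start vector (an assumption of this sketch).
    v = [1.0 / (i + 1) for i in range(dim)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate neighbourhood, e.g. all points identical
            return v
        v = [x / norm for x in w]
    return v


# Toy data: ten points on the diagonal of the x-y plane.
points = [(float(t), float(t), 0.0) for t in range(10)]
neighbours = k_nearest(points, (0.0, 0.0, 0.0), 5)
direction = leading_eigenvector(covariance(neighbours))
print(direction)
```

Since every neighbour enters the covariance matrix with the same weight, a few noise points among the k neighbours can pull the direction away quite a bit, and that is exactly why picking k matters so much.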

For testing my computations I needed a visualization of this data. I was considering using gnuplot (and in fact I’m using gnuplot a lot), but for some situations I needed animation capabilities, and that’s where gnuplot becomes really messy.

So I decided to dive into SVG and JavaScript. Here’s my first SVG project:

Visualizing kNN correlation in SVG with Javascript

(Internet Exploder is not supported. I don’t have Windows, and for all I know it doesn’t really support SVG. Use a Gecko-based browser such as Firefox; Opera and Safari (at least on Windows) also seem to work. I didn’t get it to work on KHTML/Konqueror/WebKit. I’m just doing this for myself, so I have no need to support other browsers.)

It’s a 3D dataset, consisting of 300 points. 100 points are noise, 100 points are in a 2D cluster (green) and 100 points are on a 1D cluster embedded into this plane (I’m working on algorithms that support hierarchical clusters, so I needed a dataset with this property!).
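If you want to play with a dataset of the same shape, it could be generated roughly like this. This is a sketch, not the code I used: the origin, the two vectors spanning the plane, the line direction and the value ranges are all made-up assumptions. The only thing taken from the description above is the structure, i.e. 100 noise points, 100 points in a plane and 100 points on a line inside that plane.

```python
import random

# Hypothetical geometry, chosen only for illustration.
ORIGIN = (0.5, 0.5, 0.5)
U = (1.0, 0.0, 0.5)  # first vector spanning the plane
V = (0.0, 1.0, 0.5)  # second vector spanning the plane
W = (1.0, 1.0, 1.0)  # line direction; W = U + V, so the line lies in the plane


def point_on(origin, directions, coeffs):
    """Return origin + sum(coeff * direction) as a 3-tuple."""
    return tuple(
        o + sum(c * d[i] for c, d in zip(coeffs, directions))
        for i, o in enumerate(origin)
    )


def make_dataset(n=100, seed=0):
    """Build n noise points, n plane points and n line points."""
    rng = random.Random(seed)
    noise = [tuple(rng.uniform(0.0, 1.0) for _ in range(3)) for _ in range(n)]
    plane = [
        point_on(ORIGIN, (U, V),
                 (rng.uniform(-0.4, 0.4), rng.uniform(-0.4, 0.4)))
        for _ in range(n)
    ]
    line = [
        point_on(ORIGIN, (W,), (rng.uniform(-0.4, 0.4),))
        for _ in range(n)
    ]
    return {"noise": noise, "plane": plane, "line": line}


data = make_dataset()
print({name: len(points) for name, points in data.items()})
```

Making the line direction a combination of the two plane vectors is what gives the hierarchy: every point of the 1D cluster is also a point of the 2D cluster.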

There are two buttons in the UI: one toggles rotation, the other one toggles the playback of “k”. It will cycle k through a range of about 3 to 200. When the offset hits 20 (so k would be 22 or 23), the main correlation vectors (the big blue lines) already point along the 1D cluster. At an offset of around 80 they have already diverged quite a bit from the 1D cluster; at this point, the correlation is seeing the 2D plane quite well already.
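One plausible way to read this behaviour: for a point on the 1D cluster, a small neighbourhood consists mostly of other line points, while a large one has to pull in more and more points from the surrounding plane. The sketch below counts this on a small made-up stand-in (a regular line and a regular grid, not the dataset from the visualization, and without noise), so take the numbers as an illustration only.

```python
import math

# Hypothetical stand-in data, all in the plane z = 0:
# 100 line points spaced 0.01 apart along the x axis ...
line = [(i * 0.01, 0.0, 0.0) for i in range(-50, 50)]
# ... and 100 plane points on a 10 x 10 grid that avoids the x axis.
plane = [(-0.45 + 0.1 * i, -0.45 + 0.1 * j, 0.0)
         for i in range(10) for j in range(10)]

labelled = [(p, "line") for p in line] + [(p, "plane") for p in plane]


def line_fraction(query, k):
    """Fraction of the k nearest neighbours of `query` that lie on the line."""
    nearest = sorted(labelled, key=lambda item: math.dist(item[0], query))[:k]
    return sum(1 for _, label in nearest if label == "line") / k


query = (0.0, 0.0, 0.0)  # a point on the line
for k in (20, 80, 150):
    print(k, round(line_fraction(query, k), 2))
```

The share of line points drops as k grows, so a covariance computed over the larger neighbourhood is increasingly shaped by the plane rather than by the line.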

I could also show you the behaviour for points in the 2D plane (but outside of the 1D cluster) and noise points.

We’re preparing a paper for SSDBM 2008.

[Update: Safari works at least on Windows]