For some time now I’ve been working on an AI to classify web pages. I’ve spidered (and classified!) about half a million web pages that I’m using for training.

This adds up to about 40,000 categories and 100,000 useful tokens. I need 8 bytes of storage (two float values) for each combination, which makes a matrix of about 27 gigabytes. I’m using mmap to map my data files into memory for convenient access. On a 64-bit system, I can keep them mmapped all the time.

This results in the following funny top output:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
12443 erich     26  10 27.7g 309m  21m R  8.5 32.8  86:01.98 train

So according to top, the process is using 27 GB of virtual memory. That is actually just the mmapped data files. The 300 MB of resident memory is the training data it’s processing. At this point, all my data has been converted to 32-bit integers, so this is already condensed a bit.
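To make the mmap setup concrete, here is a minimal Python sketch of a file-backed matrix with two 32-bit floats per cell. It is not my actual training code: the names `open_matrix`, `cell_offset`, `write_cell` and `read_cell` are made up for illustration, and the row-major layout (one row per category) is an assumption. With the rounded counts above, 40,000 x 100,000 x 8 bytes comes to roughly 30 GiB, which is the same ballpark as the sizes reported here.

```python
import mmap
import os
import struct
import tempfile

# Assumed cell layout: two little-endian 32-bit floats = 8 bytes per cell.
CELL = struct.Struct("<2f")


def open_matrix(path, n_categories, n_tokens):
    """Create the data file if needed and map it read/write into memory."""
    size = n_categories * n_tokens * CELL.size
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Sets the file length; on filesystems with sparse-file support
        # the new range is a hole and takes no disk space yet.
        os.ftruncate(fd, size)
        return mmap.mmap(fd, size)
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed


def cell_offset(category, token, n_tokens):
    """Byte offset of a cell, assuming row-major order by category."""
    return (category * n_tokens + token) * CELL.size


def write_cell(matrix, category, token, n_tokens, a, b):
    CELL.pack_into(matrix, cell_offset(category, token, n_tokens), a, b)


def read_cell(matrix, category, token, n_tokens):
    return CELL.unpack_from(matrix, cell_offset(category, token, n_tokens))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        m = open_matrix(os.path.join(tmp, "matrix.bin"), 4, 5)
        write_cell(m, 2, 3, 5, 0.5, 2.0)
        print(read_cell(m, 2, 3, 5))
        m.close()
```

Only the pages that are actually touched get loaded, which is why the resident size can stay far below the virtual size of the mapping.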

It’s also obvious that ext3 supports sparse files: running “du” on my database directory currently gives 1.5 GB (it has just started filling the matrix), while adding the “apparent size” switch gives 29G.
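The same difference between allocated and apparent size can be checked from a script. This is a small sketch, assuming a Unix-like system where `st_blocks` counts 512-byte units and a filesystem that supports sparse files; `make_sparse` and `sizes` are hypothetical helper names.

```python
import os
import tempfile


def make_sparse(path, size):
    """Create a file of the given length without writing any data."""
    with open(path, "wb") as f:
        f.truncate(size)  # extends the file with a hole


def sizes(path):
    """Return (apparent_bytes, allocated_bytes) for a file."""
    st = os.stat(path)
    # st_size is the apparent size; st_blocks is what is actually allocated.
    return st.st_size, st.st_blocks * 512


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sparse.bin")
        make_sparse(path, 64 * 1024 * 1024)
        apparent, allocated = sizes(path)
        print("apparent:", apparent, "allocated:", allocated)
```

On a filesystem without sparse-file support the two numbers would be roughly equal instead.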

I hope that when the training run has finally completed (ETA is a couple of days), the AI will actually be able to classify web pages somewhat reliably. :-) You can think of it as a huge Bayesian spam filter, except that I’ve added some boosting that should improve results, made some computational improvements for speed, and scaled it up to 40,000 categories instead of just “spam”.
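For readers who have not seen a Bayesian filter with more than two classes, here is a toy sketch of the basic idea. It is a plain multinomial naive Bayes classifier with add-one smoothing, not my actual code: it has no boosting and none of the speed tricks, and the class name `NaiveBayes` and the example categories are made up.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Toy multinomial naive Bayes over token lists (add-one smoothing)."""

    def __init__(self):
        self.doc_counts = Counter()               # documents per category
        self.token_counts = defaultdict(Counter)  # token counts per category
        self.vocabulary = set()

    def train(self, category, tokens):
        self.doc_counts[category] += 1
        self.token_counts[category].update(tokens)
        self.vocabulary.update(tokens)

    def scores(self, tokens):
        """Log of P(category) * product of P(token | category), per category."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        result = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.token_counts[category]
            total_tokens = sum(counts.values())
            score = math.log(n_docs / total_docs)
            for token in tokens:
                # Add-one smoothing keeps unseen tokens from zeroing the score.
                score += math.log((counts[token] + 1) / (total_tokens + vocab_size))
            result[category] = score
        return result

    def classify(self, tokens):
        s = self.scores(tokens)
        return max(s, key=s.get)


if __name__ == "__main__":
    nb = NaiveBayes()
    nb.train("linux", ["kernel", "mmap", "ext3"])
    nb.train("cooking", ["flour", "oven", "butter"])
    print(nb.classify(["mmap", "kernel"]))
```

Each (category, token) pair needs its own statistics, which is where a matrix of categories times tokens comes from.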

Other uses I’m planning for the AI are automatic email filtering into multiple folders (best employed on Gmail, where an email can have multiple labels, not just one folder), automatic document classification for document and knowledge management systems, etc.

[Update: I haven’t yet decided if or when I’ll release the code under the GPL. Right now, the code could use some cleanup. And I’m considering starting a business around this data mining and classification stuff, so I might want to keep it closed for some time at least. The business will most likely be around integrating the code though, so a public release might well be possible.]