(no, not the animals, but web crawlers!)

For a pet project of mine, I’ve recently been spidering the web a bit myself. So far, I’ve processed over 100,000 websites. The machine doing the spidering is an old K6-450, so it’s not particularly fast…

My spider downloads each web page’s HTML and, where present, its framesets (but at most one level deep). It uses the text contents, image ‘alt’ attributes, the title, and some meta tags. The text contents are tokenized and stemmed.
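
The post doesn’t name a language or any libraries, so purely as a rough sketch of such a pipeline, here it is in Python, assuming `requests`, BeautifulSoup and NLTK’s Porter stemmer (my choices, not necessarily the author’s):

```python
import requests
from bs4 import BeautifulSoup
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def extract_tokens(url, depth=0):
    """Return stemmed tokens from a page's text, image 'alt' attributes,
    title and meta tags, following frames at most one level deep."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):   # drop non-visible text
        tag.decompose()

    texts = [soup.get_text(" ")]
    texts += [img.get("alt", "") for img in soup.find_all("img")]
    texts += [m.get("content", "") for m in soup.find_all("meta")
              if m.get("name", "").lower() in ("description", "keywords")]
    if soup.title and soup.title.string:
        texts.append(soup.title.string)

    tokens = [stemmer.stem(w)
              for t in texts for w in t.lower().split() if w.isalpha()]

    # Framesets: recurse into frame sources, but at most one level deep.
    if depth == 0:
        for frame in soup.find_all(["frame", "iframe"]):
            if frame.get("src"):
                tokens += extract_tokens(
                    requests.compat.urljoin(url, frame["src"]), depth=1)
    return tokens
```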

This results in some fun numbers (see below for how they might be computed):

  • The average web page uses about 194 different words.
  • The average token (after stemming!) is 6.8 characters long.
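
Stats like these fall out of the stemmed token lists quite directly; here `page_tokens` is a hypothetical mapping from URL to the token list a pipeline like the one sketched above would produce:

```python
def corpus_stats(page_tokens):
    """Average number of distinct words per page, and average token length."""
    distinct_per_page = [len(set(tokens)) for tokens in page_tokens.values()]
    all_tokens = [t for tokens in page_tokens.values() for t in tokens]
    avg_distinct = sum(distinct_per_page) / len(distinct_per_page)
    avg_token_len = sum(len(t) for t in all_tokens) / len(all_tokens)
    return avg_distinct, avg_token_len
```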

Each of the web pages I’m spidering has about 6.4 categories assigned to it. I’ll be using these pages and their categories as a training set to train an AI to classify websites.
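
The post doesn’t say what kind of AI this will be; as one plausible illustration, here’s a minimal multi-label setup in Python, assuming scikit-learn and made-up example data. Training one binary Naive Bayes classifier per category is a simple way to handle pages that carry several categories at once:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical training data: the stemmed tokens of each page, joined
# into one string, plus the human-assigned categories per page.
pages = ["stemmed token text of page one", "stemmed token text of page two"]
labels = [["sports", "news"], ["shopping", "computers", "news"]]

vectorizer = CountVectorizer()        # bag-of-words over the stemmed tokens
X = vectorizer.fit_transform(pages)

binarizer = MultiLabelBinarizer()     # one binary column per category
Y = binarizer.fit_transform(labels)

# One binary classifier per category, so a page can get several categories.
classifier = OneVsRestClassifier(MultinomialNB())
classifier.fit(X, Y)

predicted = binarizer.inverse_transform(
    classifier.predict(vectorizer.transform(["stemmed tokens of a new page"])))
```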

(I’ve also started a web page for the project, but it’s still pretty much empty so far, so it’s not worth looking at yet.)