These days I picked up the Debtags AI topic again (so yes, it's the development that is alive, not the AI).

Some people will remember this topic: it was a Google Summer of Code project, but it never really reached completion back then.

Since my thesis isn’t coming along too well (I don’t know how to explain to my professor what I’m trying to achieve…), I’ve been playing around with different things, and so I’ve picked up the Debtags AI again.

Back at the end of the GSoC, or sometime later last summer, I had started a rewrite of the Debtags AI in C++ for performance reasons. I also had some ideas on how to make it scale better and how to improve the results. I think I’ve reached an interesting state now: training time is down to a few minutes (remember, it was a few days when the GSoC started?), so I’ve managed to shave off another order of magnitude. Evaluating a text against the database (i.e. against every tag) takes a fraction of a second.

This AI is turning out quite well now. I have one more trick in mind that could improve the results on the algorithm side, and another on the data side. Enrico is already looking at how to integrate it with the web apps. :-)

And I also want to scale it up further. How about using it to classify web pages by their text content? I already have an idea on how to get some good training data… maybe I should consider writing a business plan instead of a diploma thesis…