Today I’ve been fighting Planet, the well-known blog aggregator tool. After a while I found out how and why it was scrambling Atom feeds so horribly.

I’m not sure if this actually is a Planet bug - maybe it works fine with older Python versions. The SGML parser of Python 2.4, however, fails on tags such as <br />, a very common case in blogs and thus in Atom feeds. Strange additional > brackets appeared in the output.

The reason is that the SGML parser, as of Python 2.4, looks for <tag/foo/ as an equivalent to <tag>foo</tag>, and thus treats <br/><br/> the same as <br>><br<br>>, with the inner characters somehow magically escaped…
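To see the mis-parse without running an old interpreter, here is a minimal sketch using only the re module. The pattern below is what I remember sgmllib's original short-tag regex to be, so treat it as an assumption and check it against your own sgmllib.py; the helper name old_shorttag_parts is made up for this demo.

```python
import re

# Assumption: this is the short-tag pattern shipped in Python 2's sgmllib
# (quoted from memory, verify against your own sgmllib.py).
OLD_SHORTTAG = re.compile('<([a-zA-Z][-.a-zA-Z0-9]*)/([^/]*)/')

def old_shorttag_parts(text):
    """Split text the way the old pattern would: (tag, data, leftover)."""
    m = OLD_SHORTTAG.match(text)
    if m is None:
        return None
    return m.group(1), m.group(2), text[m.end():]

# Two harmless empty elements are read as ONE short tag named "br"
# whose character data is "><br", with a stray ">" left over.
print(old_shorttag_parts('<br/><br/>'))
```

The leftover > is the extra bracket that shows up in the output, and the "><br" part is what ends up escaped as text.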

The fix is quite simple: add

sgmllib.shorttag = re.compile('<([a-zA-Z][-.a-zA-Z0-9]*)/(/*)>')

to your feedparser.py file in the obvious place (next to sgmllib.tagfind). This will break support for these true SGML short tags, but I’ve never heard of a blog feed using them anyway.
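A quick way to convince yourself of what the replacement pattern accepts is to try it in isolation. This is only a sketch with the re module, not a test of feedparser itself; the helper name new_shorttag_parts is hypothetical.

```python
import re

# The replacement pattern from the fix above.
NEW_SHORTTAG = re.compile('<([a-zA-Z][-.a-zA-Z0-9]*)/(/*)>')

def new_shorttag_parts(text):
    """Split text the way the new pattern would: (tag, data, leftover)."""
    m = NEW_SHORTTAG.match(text)
    if m is None:
        return None
    return m.group(1), m.group(2), text[m.end():]

# Each <br/> is now matched on its own, with empty data ...
print(new_shorttag_parts('<br/><br/>'))
# ... while a true SGML short tag such as <b/bold/ no longer matches.
print(new_shorttag_parts('<b/bold/'))
```

The second call returning None is exactly the lost short-tag support mentioned above.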

I told you that I’m not really sure whether this is a Planet bug: it might be a bug in Python’s sgmllib, too. But maybe Planet should just use an XML parser for XML files, and fall back to an SGML parser (or maybe a robust XML parser) for other files (unfortunately, many blogs - including mine - do not ensure correct XML). And Planet could use some proper XML handling, too, anyway… Right now, the code is so string-array-based, it makes me sick.
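The "XML first, lenient parser second" idea could look roughly like the sketch below. It is not how Planet or feedparser work today; parse_feed and its fallback argument are names I made up, and the fallback stands in for whatever tolerant parser you prefer.

```python
import xml.etree.ElementTree as ET

def parse_feed(text, fallback):
    """Try a strict XML parse; hand the raw text to `fallback` if it fails.

    `fallback` is a hypothetical callable wrapping a more tolerant parser.
    Returns a (parser_used, result) pair.
    """
    try:
        return 'xml', ET.fromstring(text)
    except ET.ParseError:
        # Not well-formed XML: let the lenient parser deal with it.
        return 'fallback', fallback(text)

# A well-formed feed goes through the XML parser ...
kind, tree = parse_feed('<feed><title>ok</title></feed>', fallback=str)
# ... a feed with an unclosed <br> ends up in the fallback.
kind2, raw = parse_feed('<feed><br></feed>', fallback=str)
```

That way well-formed feeds never touch the SGML code path at all.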

You might also want some extra magic to re-fold <br/> tags, so that they do not confuse older browsers.
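One possible form of that magic, assuming the parser hands back the empty element expanded as <br></br> (which some older browsers render as two line breaks): collapse the pair back into a single <br />. The function name refold_br is hypothetical, and a regex like this is only a sketch, not a general HTML rewriter.

```python
import re

# Matches an expanded, empty br element such as "<br></br>".
EXPANDED_BR = re.compile(r'<br\s*>\s*</br\s*>', re.IGNORECASE)

def refold_br(html):
    """Collapse <br></br> pairs back into a single <br /> tag."""
    return EXPANDED_BR.sub('<br />', html)

print(refold_br('line one<br></br>line two'))
```

A lone <br> without a closing tag is left untouched.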