Fixing Planet
Today I’ve been fighting Planet, the well-known blog aggregator tool. After a while I found out how and why it was scrambling Atom feeds so horribly.
I’m not sure whether this is actually a Planet bug; maybe it works fine with older Python versions. The SGML parser of Python 2.4, however, fails on tags such as <br />, a very common case in blogs and thus in Atom feeds. Strange additional > brackets appeared in the output.
The reason is that the SGML parser as of Python 2.4 looks for <tag/foo/ as an equivalent to <tag>foo</tag>, and thus treats <br/><br/> the same as <br>><br<br>>, with the inner characters somehow magically escaped…
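To make the misparse visible, here is a small sketch using only the re module. The pattern below is what I believe sgmllib used as its default short-tag regex in Python 2.4; I am quoting it from memory, so treat it as an assumption. The helper name split_shorttag is made up for this illustration.

```python
import re

# Assumption: sgmllib's default short-tag pattern in Python 2.4 (quoted from memory).
OLD_SHORTTAG = re.compile('<([a-zA-Z][-.a-zA-Z0-9]*)/([^/]*)/')

def split_shorttag(text):
    """Hypothetical helper: return (tag, data, rest) the way the old
    pattern carves up the start of text, or None if it does not match."""
    m = OLD_SHORTTAG.match(text)
    if m is None:
        return None
    return m.group(1), m.group(2), text[m.end():]

# A genuine SGML short tag: tag name, then data between two slashes.
print(split_shorttag('<tag/foo/'))    # ('tag', 'foo', '')

# Two self-closing tags: everything up to the second slash becomes "data",
# and one stray '>' is left over for the parser to emit as text.
print(split_shorttag('<br/><br/>'))   # ('br', '><br', '>')
```

The second result shows where the extra > comes from: the first <br/ opens a short tag, the text ><br is swallowed as its content, and the final > is left dangling.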
The fix is quite simple: add
sgmllib.shorttag = re.compile('<([a-zA-Z][-.a-zA-Z0-9]*)/(/*)>')
to your feedparser.py file in the obvious place (next to sgmllib.tagfind). This will break support for these true SGML short tags, but I’ve never heard of a blog feed using them anyway.
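A quick way to check what the replacement pattern accepts, again with nothing but the re module (sgmllib itself is gone in Python 3, so this sketch does not import it). The helper name matched_prefix is hypothetical.

```python
import re

# The replacement pattern from this post.
NEW_SHORTTAG = re.compile('<([a-zA-Z][-.a-zA-Z0-9]*)/(/*)>')

def matched_prefix(text):
    """Hypothetical helper: return the part of text the new pattern
    consumes as a short tag, or None if it does not match at all."""
    m = NEW_SHORTTAG.match(text)
    return m.group(0) if m else None

print(matched_prefix('<br/><br/>'))  # '<br/>'  (only the first tag is consumed)
print(matched_prefix('<tag/foo/'))   # None     (true SGML short tags no longer match)
print(matched_prefix('<br />'))      # None     (left to the normal start-tag handling)
```

So each self-closing tag is now consumed on its own, with empty content, and the price is exactly the lost support for <tag/foo/ mentioned above.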
As I said, I’m not really sure whether this is a Planet bug: it might be a bug in Python’s sgmllib, too. But maybe Planet should just use an XML parser for XML files, and fall back to an SGML parser (or maybe a robust XML parser) for other files (unfortunately, many blogs - including mine - do not ensure correct XML). And Planet could use some proper XML handling anyway… Right now, the code is so string-array-based that it makes me sick.
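The "XML first, lenient parser second" idea could look roughly like the sketch below. This is not how Planet or feedparser work; it is only an illustration built on modern Python's standard library, with html.parser standing in for the lenient SGML-style fallback. The names parse_feed and TagCollector are made up.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Lenient fallback: records start tags and text instead of failing."""
    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)

def parse_feed(text):
    """Hypothetical helper: return ('xml', root) for well-formed input,
    or ('lenient', collector) when the strict XML parser rejects it."""
    try:
        return 'xml', ET.fromstring(text)
    except ET.ParseError:
        collector = TagCollector()
        collector.feed(text)
        collector.close()
        return 'lenient', collector

kind, result = parse_feed('<feed><title>ok</title></feed>')
print(kind, result.tag)        # xml feed

kind, result = parse_feed('<feed><title>broken<br></feed>')
print(kind, result.tags)       # lenient ['feed', 'title', 'br']
```

Well-formed feeds get real XML handling this way, and only the broken ones go through the forgiving path.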
You might also want some extra magic to re-fold <br/> tags so that they do not confuse older browsers.