I’ve recently upgraded an old woody box to etch, and I did it live. I’ve been telling people for years that they need to find a replacement for this machine, but it’s still around and playing an important role.

And by ‘woody’ I mean ‘a woody that has been heavily played around with, with a couple of backports and exotic stuff like XFS and a custom 2.4.25 kernel with grsecurity’. The latter especially was why I was afraid of doing the upgrade.

As a first step, I incrementally upgraded everything to sarge (well, at least where the backports weren’t already newer than sarge). I installed the stock sarge kernel and prepared a reboot.

The reboot then happened ahead of schedule, because the machine died on its own: it has been crashing about once or twice a year, and it hadn’t happened for some time. Judging from the logs, it was that old problem again.

On the second try it came up with the new kernel: I always forget to enable the initial ramdisk in the bootloader when switching away from a custom kernel that has every needed module built in.

I then continued upgrading to etch, pretty much one service at a time.

Some things where upgrading didn’t work out of the box (I didn’t keep a full list; the box was running too many modified versions of packages for this to be useful as an upgrade report to the release team, and it’s a bit late in the release schedule, too):

  • Incompatible configuration file changes: amavis, courier, monit
  • OpenLDAP refusing to load one of the third-party schemas
  • Newer Courier POP/IMAP is incompatible with drac, breaking SMTP-after-POP
  • Configuration changes of saslauthd breaking SMTP-AUTH
  • IMP3, a webmail interface, doesn’t work with PHP5. It has been replaced by IMP4 in sarge, but I’d like my users to be able to keep using it for a while (and I need to find a way to migrate e.g. the address book over). I consider it a major upgrade annoyance that PHP5 no longer allows “return false;” in some situations (if you want to return by reference, you need to return a named variable). The behaviour of “get_class” has also changed in an incompatible way.

But other than that, the upgrade was mostly a job for apt-get, not for me.

The bad news: there is still something wrong with the machine. It doesn’t crash, but it spits messages like the following to dmesg:

amavisd-new: page allocation failure. order:5, mode:0xd0
 [<c013aa08>] __alloc_pages+0x2f8/0x370
 [<c013aaa5>] __get_free_pages+0x25/0x40
 [<c013e0b2>] kmem_getpages+0x22/0xc0
 [<c013ed0a>] cache_grow+0xba/0x180
 [<e0aebe1e>] xfs_bmap_read_extents+0x36e/0x540 [xfs]
 [<c013ef3a>] cache_alloc_refill+0x16a/0x220
 [<e0ae80fd>] xfs_bmap_alloc+0xe0d/0x1c60 [xfs]
 [<c013f3e4>] __kmalloc+0x74/0x80
 [<e0b389c9>] kmem_alloc+0x59/0xc0 [xfs]
...

So there is something wrong with the way XFS allocates memory. This has happened twice now since the reboot. I’ve suspected XFS of being related to the server’s crashes before (they tend to occur under load; the crash which ‘allowed’ me to switch to 2.6 actually showed the OOM killer of that old kernel killing some inappropriate processes). After the Easter holidays, I’ll reboot the machine with the etch kernel; maybe the problem will be gone then.
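For reading the first line of the trace above: the "order" in a page allocation failure is the base-2 logarithm of the number of physically contiguous pages that were requested, so order:5 means 32 pages in one piece. A request like that gets hard to satisfy once memory is fragmented. Here is a minimal Python sketch of the arithmetic. The 4 KiB page size is an assumption (typical for 32-bit x86, which the addresses in the trace suggest), and the helper name parse_alloc_failure is made up for illustration.

```python
import re

# Assumption: 4 KiB pages, as is typical on 32-bit x86.
PAGE_SIZE = 4096

# Matches the first line of a "page allocation failure" report as quoted above.
LINE_RE = re.compile(
    r"^(?P<comm>\S+): page allocation failure\. "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)


def parse_alloc_failure(line):
    """Return process name, order, mode and request size, or None if no match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    order = int(m.group("order"))
    pages = 1 << order  # order n means 2**n contiguous pages
    return {
        "comm": m.group("comm"),
        "order": order,
        "mode": int(m.group("mode"), 16),
        "pages": pages,
        "bytes": pages * PAGE_SIZE,
    }


info = parse_alloc_failure(
    "amavisd-new: page allocation failure. order:5, mode:0xd0"
)
print(info["comm"], info["pages"], "pages,", info["bytes"] // 1024, "KiB")
```

Under that page size assumption, the failing request was for 128 KiB of contiguous memory, triggered while amavisd-new was going through the XFS code path shown in the trace.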