Lost an ext3 filesystem

Recently, something happened to one of my external USB drives that I had so far only seen with ReiserFS (which I have since called ReisswolFS, a German pun on “Reißwolf”, the shredder). But it’s not ext3 that I blame.

The short story of what happened:

Resumed the system from ‘suspend’.
I copied some files onto the first file system.
I copied the same files to a second external disk (dual backup…)
I copied some files from the first disk, which triggered an access-beyond-end-of-disk error, and the filesystem was remounted read-only.
Unmounted the filesystem, started e2fsck
Started copying the files from the secondary filesystem
Got the same error on the second disk.
Cancelled e2fsck to keep it from doing more damage to the first disk.
Shut down and rebooted.
Memcheck, three iterations. Nothing.
Checked second disk, no errors in filesystem (!), copied the files I had issues accessing just fine.
Filesystem on disk #1 seriously trashed.
Had e2fsck try to recover the filesystem on disk #1.
Pretty much all data on disk #1 is now in lost+found, it seems as if all major folders were corrupted. Lots of corrupted file entries (character devices with random permissions and numbers) there, too.

What I will do now:

Reformat disk #1 and restore it from the other backup. (Extra backup for teh win! I also have a third copy from about two months ago, off-site.)
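Since a restore is only as good as the copy it comes from, it seems worth comparing checksums between the two disks once the restore is done. Here is a minimal sketch in Python of how that could look. The function names are my own invention, and it assumes both filesystems are mounted and readable; it is not what I actually ran.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root):
    """Map each regular file's path (relative to root) to its digest.

    Symlinks and special files are skipped on purpose: only file
    contents are compared here, not metadata.
    """
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.isfile(full) and not os.path.islink(full):
                digests[os.path.relpath(full, root)] = file_digest(full)
    return digests


def compare_trees(source_root, restored_root):
    """Return (missing, extra, differing) lists of relative paths.

    missing:   in source_root but not in restored_root
    extra:     in restored_root but not in source_root
    differing: present in both, but with different contents
    """
    src = tree_digests(source_root)
    dst = tree_digests(restored_root)
    missing = sorted(set(src) - set(dst))
    extra = sorted(set(dst) - set(src))
    differing = sorted(p for p in set(src) & set(dst) if src[p] != dst[p])
    return missing, extra, differing
```

Calling `compare_trees` with the two mount points (whatever they are on your system) should give three empty lists if the restore went well. Note that this reads every file on both disks, so it takes a while on a large backup.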

As you can see, something was wrong with the system, not with the file system.

I have a strong suspect for what caused this. In case you wondered why I included “resumed from suspend” above: I’ve been having system stability issues with resume ever since upgrading to the Intel driver 2.9.0 and KMS (Debian unstable+testing) with kernels up to 2.6.31. In about 1 out of 5 resumes, I get an Xorg or system lockup after anything from 1 to 60 minutes. Sometimes I also experience video corruption after a few minutes, garbling a terminal emulator until the next redraw. Just before writing this email I had a typical lockup while scrolling the terminal emulator, which has been a common trigger. In contrast, I haven’t seen any such crashes (or screen corruption) on a fresh boot.

The Freedesktop bug reporting the same issue was closed as “not our bug, blame it on the kernel”.

Note that the 2.6.32 release candidate changelogs contain many changes for the Intel DRI kernel driver, so the bug might already be fixed in the RC kernels.

The same report in the kernel Bugzilla is still ‘NEW’, though.

Related bug report in Debian, blaming it on KMS.

[Update: I’ve disabled KMS and upgraded to 2.6.32-rc8 and have not had such a crash since. But I can’t pin it on one or the other yet.]

[Update: I just tried another external hard disk …

[305032.148616] EXT3-fs: mounted filesystem with ordered data mode.
[305066.061708] usb 1-8.3.3: reset high speed USB device using ehci_hcd and address 27
[305081.132471] usb 1-8.3.3: device descriptor read/64, error -110
...
[305147.468857] sd 4:0:0:0: Device offlined - not ready after error recovery
[305147.468880] sd 4:0:0:0: [sdb] Unhandled error code
[305147.468886] sd 4:0:0:0: [sdb] Result: hostbyte=DID_ABORT driverbyte=DRIVER_OK
...
[305147.473500] WARNING: at /build/buildd-linux-2.6_2.6.32~rc8-1~experimental.1-i386-g1b8iG/linux-2.6-2.6.32~rc8/debian/build/source_i386_none/fs/buffer.c:1159 mark_buffer_dirty+0x20/0x7a()

It seems as if the USB disk stack still doesn’t really survive suspends? Let me try on a fresh boot later on.]
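To spot this pattern without reading the whole kernel log, a small filter over the `dmesg` output can help. Below is a rough sketch in Python. The pattern list is copied from the messages in the excerpt above, and the wording of such messages differs between kernel versions, so treat it as a starting point only; the names `SUSPECT_PATTERNS` and `suspect_lines` are made up for this example.

```python
import re

# Message fragments taken from the log excerpt above. Other kernels
# may phrase these differently, so adjust the list as needed.
SUSPECT_PATTERNS = [
    r"reset high speed USB device",
    r"device descriptor read/64, error -\d+",
    r"Device offlined",
    r"Unhandled error code",
    r"hostbyte=DID_ABORT",
]
SUSPECT_RE = re.compile("|".join(SUSPECT_PATTERNS))

# Matches the "[seconds.micros] message" prefix that dmesg prints.
TIMESTAMP_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def suspect_lines(log_text):
    """Return (timestamp_seconds, message) for each suspicious log line.

    Lines without a dmesg-style timestamp (such as "...") are ignored.
    """
    hits = []
    for line in log_text.splitlines():
        m = TIMESTAMP_RE.match(line)
        if not m:
            continue
        seconds, message = float(m.group(1)), m.group(2)
        if SUSPECT_RE.search(message):
            hits.append((seconds, message))
    return hits
```

Feeding it the text of the kernel log (for example, saved from `dmesg` to a file and read back in) gives a list of timestamps and messages, which makes it easier to see how long after mounting the disk the first USB reset showed up.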