Listening to music on CDs is a pain, because every CD gets boring quickly. Radio stations are fun for a while, but they also play the same bad songs over and over again. So I still keep a collection of MP3 songs (although all the CDs I converted recently are now Ogg Vorbis, since it offers better compression and my portable player can play Ogg, too).

So if you try to maintain a collection of your favourite MP3s, you’ll sooner or later end up having some duplicates. Or many of them.

The cleanest approach would of course be to just delete your whole collection and start converting your CDs again (this time paying more attention to tagging and such). But this would take a lot of time again…

I’ve written a small tool which uses the musicbrainz library to generate audio fingerprints (TRM ids) of songs; these are then stored in a small database, and another tool can list duplicates from that database. This of course still doesn’t catch all duplicates (some songs differ slightly between albums, and sometimes they result in the same TRM id, sometimes they do not), but it does a fairly good job. Beware of (rare) false positives, however.
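To make the idea concrete, here is a minimal sketch of the "store fingerprints, then list duplicates" step. It is not the code from trmdb.py: the SQLite storage, the table layout and the function names (open_db, add_track, find_duplicates) are all assumptions made for illustration, and the TRM ids are passed in as plain strings rather than computed.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the fingerprint database. Schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tracks ("
        "path TEXT PRIMARY KEY, trm TEXT NOT NULL)"
    )
    return db


def add_track(db, path, trm):
    """Remember the fingerprint of one file (overwrites an older entry)."""
    db.execute(
        "INSERT OR REPLACE INTO tracks (path, trm) VALUES (?, ?)",
        (path, trm),
    )


def find_duplicates(db):
    """Return {trm: [paths]} for every fingerprint shared by 2+ files."""
    rows = db.execute(
        "SELECT trm, path FROM tracks WHERE trm IN "
        "(SELECT trm FROM tracks GROUP BY trm HAVING COUNT(*) > 1) "
        "ORDER BY trm, path"
    )
    groups = {}
    for trm, path in rows:
        groups.setdefault(trm, []).append(path)
    return groups
```

Since identical ids can still be false positives, a listing like this is best treated as a set of candidates to listen to, not as a delete list.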

It should work on Windows, too. But I really have no clue how to set up musicbrainz, pyogg and pymad on Windows. Debian users can just apt-get them. Don’t ask me how to install them on Windows: I don’t use Windows.

Grab it from its temp home for the TRM-dupefinder (grab all .py files; trmdb.py is the main lib, the others are the applications).

Calculating all the audio fingerprints takes a long time, but the gen tool processes each file only once and continues where it left off, so you can do it in multiple runs. Then run the dupe tool at the end.
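The "continue where it left off" behaviour can be sketched roughly like this. Again, this is an illustration and not the actual gen tool: the scan function, the SQLite table and the fingerprint argument are hypothetical, with fingerprint standing in for whatever computes the TRM id of a file.

```python
import os
import sqlite3


def scan(db, root, fingerprint, extensions=(".mp3", ".ogg")):
    """Fingerprint every new audio file below root; return how many were done.

    `fingerprint` is a caller-supplied function (path -> id string); it is
    a placeholder for the slow TRM calculation.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS tracks ("
        "path TEXT PRIMARY KEY, trm TEXT NOT NULL)"
    )
    processed = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            # Skip files that an earlier run has already fingerprinted.
            seen = db.execute(
                "SELECT 1 FROM tracks WHERE path = ?", (path,)
            ).fetchone()
            if seen:
                continue
            trm = fingerprint(path)
            db.execute(
                "INSERT INTO tracks (path, trm) VALUES (?, ?)", (path, trm)
            )
            # Commit per file so an interrupted run loses at most one result.
            db.commit()
            processed += 1
    return processed
```

Committing after every file is slower than one big commit, but with fingerprints this expensive to compute, that overhead should hardly matter.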

Tunepimp is probably a smarter way to do this (and maybe verify tags in the same run), especially since it apparently can store the TRM id in the ID3 tag, which would save even more recalculation.

However, I didn’t get the python-tunepimp bindings to work properly.