Well, as has already been noted, there are a ton of apps out there that do this, but if you want to keep going, hell, why not? It's good experience either way.
One big suggestion: don't hash the whole file unless you need to. Hashing takes a while, especially on larger files, and on the first pass it pays to get through more files rather than reading each one completely. I had to write a similar program in C a long time ago, and I just opened each file, hashed the first 8K, and stuck that in the hash table. Only if two files collide on that partial hash do you hash the whole file to be sure. Most files (images, songs, etc.) will show a difference within the first two blocks. The exceptions are things like source code, where files can share identical opening bytes, and the full-file hash check will catch those. This should shave quite a bit of time off the script's current run time. You could also make the block size adjustable to tune your hash performance.
I'd write the patch for you but Python is Greek to me.
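That said, for anyone who wants a starting point, here is a rough sketch of the two-pass idea in Python. It isn't a patch against the original script: the function names (`hash_prefix`, `hash_full`, `find_duplicates`), the choice of SHA-256, and the 8192-byte default are all assumptions made for illustration, so adapt them to whatever the script already uses.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 8192  # assumed default (the "first 8K"); adjust to tune performance


def hash_prefix(path, block_size=BLOCK_SIZE):
    """Hash only the first block_size bytes of the file (hypothetical helper)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(block_size)).hexdigest()


def hash_full(path, chunk_size=1 << 20):
    """Hash the whole file, reading in chunks so memory use stays flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(paths, block_size=BLOCK_SIZE):
    """Return a list of groups (lists of paths) whose full contents hash the same."""
    # First pass: cheap partial hash for every file.
    by_prefix = defaultdict(list)
    for path in paths:
        by_prefix[hash_prefix(path, block_size)].append(path)

    # Second pass: full hash only for files whose partial hashes collided.
    duplicates = []
    for candidates in by_prefix.values():
        if len(candidates) < 2:
            continue  # unique prefix, so it cannot be a duplicate
        by_full = defaultdict(list)
        for path in candidates:
            by_full[hash_full(path)].append(path)
        duplicates.extend(group for group in by_full.values() if len(group) > 1)
    return duplicates
```

You'd call `find_duplicates` with whatever list of file paths the script already collects. Files with a unique partial hash are never read past the first block, which is where the time savings should come from.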