Wow! Sounds really impressive and useful! I hope you will share your script with us when you have it finished. Thanks for your input!
Will do... lots of testing is the order of the day for me right now.
I haven't forgotten about this. I have had little time lately, but I still do have thousands of duplicates to sort out. I only have 2 PCs but an OS upgrade and a disk failure have kept me busy. When I finally sort out my own duplicate photos I will post the script(s) back here.
Thanks for not forgetting! I haven't come round to sorting out my duplicates either, but it is still on my TO-DO list.
Dangerous, I wouldn't do it. Duplicates can be named file, File, file(1), _file, etc., and still have different content.
You can use an image viewer app like XnView: search for all image files on a partition, then sort them alphabetically. Duplicate names will show up together.
What about writing a few bash lines to iterate over all files, calculate a hash sum for each of them and generate a table with lines like

Code:
<hash> /path/to/file

Next use sort to sort according to hash value and pipe the result into uniq -d -w<some number>. This will give you a list of the duplicates. Iterate through that list and use rm to delete the duplicates. read will read the file one line at a time and put each line in a bash variable; delete the hash from the line and give the result to rm.
(If some files are present three times you have to do this again, as this will only remove one duplicate. The only risk is a hash collision, but you may want to check the duplicates list first.)
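The pipeline described above can be sketched roughly like this (a minimal sketch, assuming GNU md5sum, sort and uniq; -w32 works because an md5 hex digest is 32 characters long). It only prints the candidates; deletion is left commented out so the list can be reviewed first:

```shell
#!/usr/bin/env bash
# Hash every regular file under the given directory (default: current dir),
# sort by hash, and print one line per group of files with identical hashes.
root_dir="${1:-.}"

hashes=$(mktemp)
find "$root_dir" -type f -exec md5sum {} + | sort > "$hashes"

# -w32 compares only the 32-character md5 field at the start of each line;
# -d prints the first line of each group of duplicates, so removing those
# paths still leaves one copy of each file behind.
uniq -d -w32 "$hashes"

# To actually delete, strip the hash with read and hand the path to rm:
# uniq -d -w32 "$hashes" | while read -r hash path; do rm -- "$path"; done

rm -f "$hashes"
```

As the previous post notes, files present three or more times need repeated passes, since uniq -d reports only one line per duplicate group.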
Last edited by Impavidus; January 31st, 2013 at 08:05 PM.
Not thoroughly tested but seems to work:
Outputs the list of redundant files while preserving all uniques. E.g. if there are two identical files (same size AND same hash), only one of them will be in the output.

Code:
#!/usr/bin/env python
import os
import hashlib

# More advanced version of this function here:
# http://www.joelverhagen.com/blog/2011/02/md5-hash-of-file-in-python/
def md5Checksum(file_path):
    with open(file_path, 'rb') as open_file:
        file_data = open_file.read()
    check_sum = hashlib.md5(file_data)
    return check_sum.hexdigest()

# Working folder. Can be defined by a command line parameter,
# e.g.: root_dir = sys.argv[1]
root_dir = '/home'

all_files = {}
uniq_files = {}

# Loop through all files in the working folder and its subfolders:
for root, folders, files in os.walk(root_dir):
    for file_name in files:
        # Get absolute path:
        file_path = os.path.join(root, file_name)
        # Exclude symlinks:
        if not os.path.islink(file_path):
            # Calculate md5 hash:
            file_md5 = md5Checksum(file_path)
            # Get file size:
            file_size = os.stat(file_path).st_size
            # This dictionary will contain all files:
            all_files[file_path] = file_md5, file_size
            # This dictionary will contain all unique files,
            # including those that have duplicates:
            uniq_files[file_md5, file_size] = file_path

# Redundant files = all files - unique files:
dup_files = set(all_files.keys()) - set(uniq_files.values())
for file_path in dup_files:
    print(file_path)
Last edited by prodigy_; March 22nd, 2013 at 05:31 PM. Reason: Added comments