
Thread: Bash challenge: script to remove duplicate files

  1. #11
    Join Date: Jul 2012
    Location: /tropics/islands/statia
    Beans: 275
    Distro: Kubuntu 12.04 Precise Pangolin

    Re: Bash challenge: script to remove duplicate files

    Wow! Sounds really impressive and useful! I hope you will share your script with us when you have it finished. Thanks for your input!

  2. #12
    Join Date: Jul 2011
    Location: Spain
    Beans: 81
    Distro: Ubuntu 14.04 Trusty Tahr

    Re: Bash challenge: script to remove duplicate files

    Quote: I hope you will share your script with us when you have it finished.
    Will do... lots of testing is the order of the day for me right now....

  3. #13
    Join Date: Jul 2011
    Location: Spain
    Beans: 81
    Distro: Ubuntu 14.04 Trusty Tahr

    Re: Bash challenge: script to remove duplicate files

    I haven't forgotten about this. I have had little time lately, but I still do have thousands of duplicates to sort out. I only have 2 PCs but an OS upgrade and a disk failure have kept me busy. When I finally sort out my own duplicate photos I will post the script(s) back here.

  4. #14
    Join Date: Jul 2012
    Location: /tropics/islands/statia
    Beans: 275
    Distro: Kubuntu 12.04 Precise Pangolin

    Re: Bash challenge: script to remove duplicate files

    Thanks for not forgetting! I haven't come round to sorting out my duplicates either, but it is still on my TO-DO list.

  5. #15
    Rebelli0us (Extra Foam Sugar Free Ubuntu)
    Join Date: Feb 2008
    Beans: 722

    Re: Bash challenge: script to remove duplicate files

    Dangerous, I wouldn't do it. Apparent duplicates can be named file, File, file(1), _file, etc., and still have different content.

    You can use an image viewer app like xnview, search & find all image files in a partition and then sort them alphabetically. Duplicate names will show up together.
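    If you'd rather do the same name sorting from a terminal, something like this lists files grouped by name (just a rough sketch assuming GNU find; the path and extensions are only examples):
    Code:
    # List image files as "name<TAB>full path", sorted by name,
    # so files sharing a name end up on neighbouring lines.
    find /path/to/partition -type f \( -iname '*.jpg' -o -iname '*.png' \) \
        -printf '%f\t%p\n' | sort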

  6. #16
    Join Date: Aug 2011
    Location: 52.5° N 6.4° E
    Beans: 6,820
    Distro: Xubuntu 22.04 Jammy Jellyfish

    Re: Bash challenge: script to remove duplicate files

    What about writing a few bash lines to iterate over all files, calculate a hash sum for each of them and generate a table with lines like
    Code:
    <hash> /path/to/file
    Next use sort to sort according to hash value and pipe the result into uniq -d -w<some number>. This will give you a list of the duplicates. Iterate through that list and use rm to delete them: read will read the list one line at a time into a bash variable; strip the hash from each line and give the resulting path to rm.

    (If some files are present three times you have to do this again, as this will only remove one duplicate. The only risk is a hash collision, but you may want to check the duplicates list first.)
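    In case it helps, here's a rough, untested sketch of those steps (the directory and list file names are only examples, and -w32 assumes md5sum, whose hex digest is 32 characters long):
    Code:
    #!/bin/bash
    # Sketch of the approach described above (GNU coreutils assumed).
    dir="$HOME/Pictures"          # example directory
    list="/tmp/dupes.txt"         # example list file
    
    # Hash every regular file into "<hash>  /path/to/file" lines, sort by hash
    # and keep one line per group of identical hashes (-w32 compares only the
    # 32-character md5 field).
    find "$dir" -type f -exec md5sum {} + | sort | uniq -d -w32 > "$list"
    
    # Read the list one line at a time, strip the hash and hand the path to rm.
    # Filenames containing newlines would need extra care.
    while IFS= read -r line; do
        file="${line:34}"         # skip the 32-char hash and two separator spaces
        echo rm -- "$file"        # drop the echo once the list looks right
    done < "$list"
    With the echo in place it only prints what would be removed, so you can check the list before deleting anything, and as noted above you may need more than one pass if some files exist more than twice.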
    Last edited by Impavidus; January 31st, 2013 at 08:05 PM.

  7. #17
    prodigy_ (May the Ubuntu Be With You!)
    Join Date: Mar 2008
    Beans: 1,219

    Re: Bash challenge: script to remove duplicate files

    Not thoroughly tested but seems to work:
    Code:
    #!/usr/bin/env python
    
    import os
    import hashlib
    
    # More advanced version of this function here:
    # http://www.joelverhagen.com/blog/2011/02/md5-hash-of-file-in-python/
    def md5Checksum(file_path):
        with open(file_path, 'rb') as open_file:
            file_data = open_file.read()
            check_sum = hashlib.md5(file_data)
        return check_sum.hexdigest()
    
    # Working folder. Can be defined by a command line parameter,
    # e.g: root_dir = argv[1]
    root_dir = '/home'
    
    all_files = {}
    uniq_files = {}
    
    # Loop through all files in the working folder and its subfolders:
    for root, folders, files in os.walk(root_dir):
        for file_name in files:
            # Get absolute path:
            file_path = os.path.join(root, file_name)
            # Exclude symlinks:
            if not os.path.islink(file_path):
                # Calculate md5 hash:
                file_md5 = md5Checksum(file_path)
                # Get file size:
                file_size = os.stat(file_path).st_size
                # This dictionary will contain all files:
                all_files[file_path] = file_md5, file_size
                # This dictionary will contain all unique files,
                # including those that have duplicates:
                uniq_files[file_md5, file_size] = file_path
    
    # Redundant files = all files - unique files:
    dup_files = set(all_files.keys()) - set(uniq_files.values())
    for file_path in dup_files:
        print(file_path)
    It outputs the list of redundant files while preserving all uniques. For example, if there are two identical files (same size AND same hash), only one of them will be in the output.
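    If you then want to remove what it prints, you could pipe the output into a shell loop, e.g. (assuming the script was saved as find_dupes.py, which is just an example name; filenames containing newlines would need extra care):
    Code:
    python find_dupes.py | while IFS= read -r f; do
        echo rm -- "$f"    # drop the echo once you've reviewed the list
    done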
    Last edited by prodigy_; March 22nd, 2013 at 05:31 PM. Reason: Added comments
