
Thread: Bash challenge: script to remove duplicate files

  1. #11
    Join Date
    Jul 2012
    Location
    /tropics/islands/statia
    Beans
    275
    Distro
    Kubuntu 12.04 Precise Pangolin

    Re: Bash challenge: script to remove duplicate files

    Wow! Sounds really impressive and useful! I hope you will share your script with us when you have it finished. Thanks for your input!

  2. #12
    Join Date
    Jul 2011
    Location
    Spain
    Beans
    56
    Distro
    Ubuntu 12.04 Precise Pangolin

    Re: Bash challenge: script to remove duplicate files

    I hope you will share your script with us when you have it finished.
    Will do... lots of testing is the order of the day for me right now....

  3. #13
    Join Date
    Jul 2011
    Location
    Spain
    Beans
    56
    Distro
    Ubuntu 12.04 Precise Pangolin

    Re: Bash challenge: script to remove duplicate files

    I haven't forgotten about this. I have had little time lately, but I still do have thousands of duplicates to sort out. I only have 2 PCs but an OS upgrade and a disk failure have kept me busy. When I finally sort out my own duplicate photos I will post the script(s) back here.

  4. #14
    Join Date
    Jul 2012
    Location
    /tropics/islands/statia
    Beans
    275
    Distro
    Kubuntu 12.04 Precise Pangolin

    Re: Bash challenge: script to remove duplicate files

    Thanks for not forgetting! I haven't come round to sorting out my duplicates either, but it is still on my TO-DO list.

  5. #15
    Join Date
    Jan 2013
    Beans
    1

    Re: Bash challenge: script to remove duplicate files

    Quote Originally Posted by Statia
    I used the excellent tool fdupes to find duplicate photos and wrote its output to a file:

    Code:
    fdupes -r /home/statia/Pictures > ~/duplicates.txt
    Lines in this file look like this:

    Code:
    /home/statia/Pictures/Nevis 2009/img_1789.jpg
    /home/statia/Pictures/Unsorted/Canon A520 372.jpg
    
    /home/statia/Pictures/Nevis 2009/img_1853.jpg
    /home/statia/Pictures/Unsorted/Canon A520 374.jpg
    I have about 550 duplicates, but my girlfriend has more than 10,000, so I'd like to automate the task of removing the duplicates with a smart script.
    The script should take duplicates.txt as input and remove one of each duplicate.
    Files can be removed from directories with names like "Unsorted", "backup" and "recovered". It is probably easiest to set those names in a variable like $REMOVE_DIR.
    Duplicate pairs not in one of those directories can be left untouched for manual removal.
    Unfortunately, I am not that great with bash. Could anyone give me a hand?
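    For reference, a minimal bash sketch of what the quoted request describes might look like the following. The duplicates.txt location, the directory names and the blank-line-separated fdupes group format come from the quote above; the variable names and the dry-run echo are assumptions, not a tested solution:

    Code:
    #!/bin/bash
    # Read fdupes output (groups of identical files separated by blank lines)
    # and, within each group, remove every copy that lives in a "disposable"
    # directory, but only if at least one copy outside those directories remains.
    DUPLICATES_FILE="$HOME/duplicates.txt"
    REMOVE_DIRS="Unsorted|backup|recovered"

    group=()
    process_group() {
        local keep=0 f
        # Count the copies that are NOT inside a disposable directory.
        for f in "${group[@]}"; do
            [[ "$f" =~ /($REMOVE_DIRS)/ ]] || keep=$((keep + 1))
        done
        # Only delete disposable copies if another copy survives.
        if (( keep >= 1 )); then
            for f in "${group[@]}"; do
                if [[ "$f" =~ /($REMOVE_DIRS)/ ]]; then
                    echo "Would remove: $f"   # change echo to rm -- once verified
                fi
            done
        fi
    }

    while IFS= read -r line; do
        if [[ -z "$line" ]]; then
            (( ${#group[@]} )) && process_group
            group=()
        else
            group+=("$line")
        fi
    done < "$DUPLICATES_FILE"
    (( ${#group[@]} )) && process_group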

    I used to have trouble deleting duplicate files. When I tried Duplicate Files Deleter, it worked like a charm without harming other files, and the best part is that the program is free!

  6. #16
    Rebelli0us
    Join Date
    Feb 2008
    Beans
    722

    Re: Bash challenge: script to remove duplicate files

    Dangerous, I wouldn't do it. Files named file, File, file(1), _file, etc. can look like duplicates and still have different content.

    You can use an image viewer app like xnview to find all image files in a partition and then sort them alphabetically. Duplicate names will show up together.

  7. #17
    Join Date
    Aug 2011
    Location
    52° N 6° E
    Beans
    2,539
    Distro
    Xubuntu 14.04 Trusty Tahr

    Re: Bash challenge: script to remove duplicate files

    What about writing a few bash lines to iterate over all files, calculate a hash sum for each of them and generate a table with lines like
    Code:
    <hash> /path/to/file
    Next use sort to sort according to hash value and pipe the result into uniq -d -w<some number>. This will give you a list of the duplicates. Iterate through that list and use rm to delete the duplicates: read will read the list one line at a time into a bash variable; delete the hash from the line and give the resulting path to rm.

    (If some files are present three times you will have to do this again, as each pass only removes one duplicate per group. The only risk is a hash collision, so you may want to check the duplicates list first.)
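    For illustration, a rough sketch of that pipeline might look like the one below. The starting directory is an assumed example, md5sum is used for the hash (32 hex characters followed by two separator characters, hence -w32 and the offset of 34), and echo stands in for rm so the list can be checked before anything is deleted:

    Code:
    #!/bin/bash
    # Hash every regular file, sort by hash, and print the first file of each
    # group of identical hashes (uniq -d emits one line per duplicate group).
    TOPDIR="$HOME/Pictures"   # assumed example directory

    find "$TOPDIR" -type f -print0 \
        | xargs -0 md5sum \
        | sort \
        | uniq -d -w32 \
        | while IFS= read -r line; do
              # Drop the 32-character hash and the two separator characters,
              # leaving only the path; echo is a dry run, replace with rm --.
              echo "${line:34}"
          done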
    Last edited by Impavidus; January 31st, 2013 at 08:05 PM.

  8. #18
    prodigy_
    Join Date
    Mar 2008
    Beans
    1,219

    Re: Bash challenge: script to remove duplicate files

    Not thoroughly tested but seems to work:
    Code:
    #!/usr/bin/env python
    
    import os
    import hashlib
    
    # More advanced version of this function here:
    # http://www.joelverhagen.com/blog/2011/02/md5-hash-of-file-in-python/
    def md5Checksum(file_path):
        with open(file_path, 'rb') as open_file:
            file_data = open_file.read()
            check_sum = hashlib.md5(file_data)
        return check_sum.hexdigest()
    
    # Working folder. Can be defined by a command line parameter,
    # e.g: root_dir = argv[1]
    root_dir = '/home'
    
    all_files = {}
    uniq_files = {}
    
    # Loop through all files in the working folder and its subfolders:
    for root, folders, files in os.walk(root_dir):
        for file_name in files:
            # Get absolute path:
            file_path = os.path.join(root, file_name)
            # Exclude symlinks:
            if not os.path.islink(file_path):
                # Calculate md5 hash:
                file_md5 = md5Checksum(file_path)
                # Get file size:
                file_size = os.stat(file_path).st_size
                # This dictionary will contain all files:
                all_files[file_path] = file_md5, file_size
                # This dictionary will contain all unique files,
                # including those that have duplicates:
                uniq_files[file_md5, file_size] = file_path
    
    # Redundant files = all files - unique files:
    dup_files = set(all_files.keys()) - set(uniq_files.values())
    for file_path in dup_files:
        print(file_path)
    Outputs the list of redundant files while preserving all uniques. E.g. if there are two identical files (same size AND same hash), only one of them will be in the output.
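    One possible way to act on that output (the script name find_dupes.py is just an example) is to save the list, look it over, and only then delete, reading line by line so that paths with spaces survive intact:

    Code:
    python find_dupes.py > redundant.txt
    less redundant.txt                  # sanity-check the list first
    while IFS= read -r f; do rm -- "$f"; done < redundant.txt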
    Last edited by prodigy_; March 22nd, 2013 at 05:31 PM. Reason: Added comments
