
Thread: Bash challenge: script to remove duplicate files

  1. #11
    Join Date: Jul 2012
    Location: /tropics/islands/statia
    Beans: 275
    Distro: Kubuntu 12.04 Precise Pangolin

    Re: Bash challenge: script to remove duplicate files

    Wow! Sounds really impressive and useful! I hope you will share your script with us when you have it finished. Thanks for your input!

  2. #12
    Join Date: Jul 2011
    Location: Spain
    Beans: 81
    Distro: Ubuntu 14.04 Trusty Tahr

    Re: Bash challenge: script to remove duplicate files

    Quote: I hope you will share your script with us when you have it finished.
    Will do... lots of testing is the order of the day for me right now....

  3. #13
    Join Date: Jul 2011
    Location: Spain
    Beans: 81
    Distro: Ubuntu 14.04 Trusty Tahr

    Re: Bash challenge: script to remove duplicate files

    I haven't forgotten about this. I have had little time lately, but I still do have thousands of duplicates to sort out. I only have 2 PCs but an OS upgrade and a disk failure have kept me busy. When I finally sort out my own duplicate photos I will post the script(s) back here.

  4. #14
    Join Date: Jul 2012
    Location: /tropics/islands/statia
    Beans: 275
    Distro: Kubuntu 12.04 Precise Pangolin

    Re: Bash challenge: script to remove duplicate files

    Thanks for not forgetting! I haven't come round to sorting out my duplicates either, but it is still on my TO-DO list.

  5. #15
    Rebelli0us (Extra Foam Sugar Free Ubuntu)
    Join Date: Feb 2008
    Beans: 722

    Re: Bash challenge: script to remove duplicate files

    Dangerous, I wouldn't do it. Apparent duplicates can be named file, File, file(1), _file, etc., and still have different content.

    You can use an image viewer app like xnview, search & find all image files in a partition and then sort them alphabetically. Duplicate names will show up together.
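    If you'd rather do the same name sorting from a terminal, something like this lists files grouped by name (just a rough sketch assuming GNU find; the path and extensions are only examples):
    Code:
    # List image files as "name<TAB>full path", sorted by name,
    # so files sharing a name end up on neighbouring lines.
    find /path/to/partition -type f \( -iname '*.jpg' -o -iname '*.png' \) \
        -printf '%f\t%p\n' | sort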

  6. #16
    Join Date: Aug 2011
    Location: 52.5° N 6.4° E
    Beans: 6,820
    Distro: Xubuntu 22.04 Jammy Jellyfish

    Re: Bash challenge: script to remove duplicate files

    What about writing a few bash lines to iterate over all files, calculate a hash sum for each of them and generate a table with lines like
    Code:
    <hash> /path/to/file
    Next use sort to sort according to hash value and pipe the result into uniq -d -w<some number>. This will give you a list of the duplicates. Iterate through that list and use rm to delete them: read will read the list one line at a time into a bash variable; strip the hash from each line and give the resulting path to rm.

    (If some files are present three times you have to do this again, as this will only remove one duplicate. The only risk is a hash collision, but you may want to check the duplicates list first.)
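    In case it helps, here's a rough, untested sketch of those steps (the directory and list file names are only examples, and -w32 assumes md5sum, whose hex digest is 32 characters long):
    Code:
    #!/bin/bash
    # Sketch of the approach described above (GNU coreutils assumed).
    dir="$HOME/Pictures"          # example directory
    list="/tmp/dupes.txt"         # example list file
    
    # Hash every regular file into "<hash>  /path/to/file" lines, sort by hash
    # and keep one line per group of identical hashes (-w32 compares only the
    # 32-character md5 field).
    find "$dir" -type f -exec md5sum {} + | sort | uniq -d -w32 > "$list"
    
    # Read the list one line at a time, strip the hash and hand the path to rm.
    # Filenames containing newlines would need extra care.
    while IFS= read -r line; do
        file="${line:34}"         # skip the 32-char hash and two separator spaces
        echo rm -- "$file"        # drop the echo once the list looks right
    done < "$list"
    With the echo in place it only prints what would be removed, so you can check the list before deleting anything, and as noted above you may need more than one pass if some files exist more than twice.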
    Last edited by Impavidus; January 31st, 2013 at 08:05 PM.

  7. #17
    prodigy_ (May the Ubuntu Be With You!)
    Join Date: Mar 2008
    Beans: 1,219

    Re: Bash challenge: script to remove duplicate files

    Not thoroughly tested but seems to work:
    Code:
    #!/usr/bin/env python
    
    import os
    import hashlib
    
    # More advanced version of this function here:
    # http://www.joelverhagen.com/blog/2011/02/md5-hash-of-file-in-python/
    def md5Checksum(file_path):
        with open(file_path, 'rb') as open_file:
            file_data = open_file.read()
            check_sum = hashlib.md5(file_data)
        return check_sum.hexdigest()
    
    # Working folder. Can be defined by a command line parameter,
    # e.g: root_dir = argv[1]
    root_dir = '/home'
    
    all_files = {}
    uniq_files = {}
    
    # Loop through all files in the working folder and its subfolders:
    for root, folders, files in os.walk(root_dir):
        for file_name in files:
            # Get absolute path:
            file_path = os.path.join(root, file_name)
            # Exclude symlinks:
            if not os.path.islink(file_path):
                # Calculate md5 hash:
                file_md5 = md5Checksum(file_path)
                # Get file size:
                file_size = os.stat(file_path).st_size
                # This dictionary will contain all files:
                all_files[file_path] = file_md5, file_size
                # This dictionary will contain all unique files,
                # including those that have duplicates:
                uniq_files[file_md5, file_size] = file_path
    
    # Redundant files = all files - unique files:
    dup_files = set(all_files.keys()) - set(uniq_files.values())
    for file_path in dup_files:
        print(file_path)
    It outputs the list of redundant files while preserving all uniques. For example, if there are two identical files (same size AND same hash), only one of them will be in the output.
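    If you then want to remove what it prints, you could pipe the output into a shell loop, e.g. (assuming the script was saved as find_dupes.py, which is just an example name; filenames containing newlines would need extra care):
    Code:
    python find_dupes.py | while IFS= read -r f; do
        echo rm -- "$f"    # drop the echo once you've reviewed the list
    done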
    Last edited by prodigy_; March 22nd, 2013 at 05:31 PM. Reason: Added comments
