Page 1 of 2
Results 1 to 10 of 18

Thread: HOWTO: Remove duplicate files in a directory tree

  1. #1
    Join Date
    Apr 2006
    Beans
    19

    HOWTO: Remove duplicate files in a directory tree

    I recently had to merge an old backup onto a new drive, and of course ended up with a large number of duplicate files scattered all over the new drive. After some digging, I discovered the program "fdupes", which runs through a directory tree and finds duplicates by comparing file sizes and MD5 checksums, followed by a byte-by-byte comparison. This is great, because it finds duplicates even when the file names differ. It also has a -d switch, which asks you which files to keep and which to delete.
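
    As a rough illustration of the checksum idea described above (a sketch in Python, not how fdupes itself is implemented), files can be grouped by size first and then by MD5 digest. The names file_md5 and find_duplicates are made up for this example.

```python
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(top_dir):
    """Group regular files under top_dir by (size, MD5); return groups of 2+ paths."""
    by_size = defaultdict(list)
    for root, _dirs, files in os.walk(top_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_md5(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

    Checking sizes first means only files that share a size with another file need to be hashed. This sketch stops at the checksum and does not do a final byte-by-byte comparison.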

    However...... with roughly 12,000 duplicates in my directory tree, I was not going to sit there and answer the same question 12,000 times. So I wrote a little Python program to do the dirty work.

    Here's what I did, after installing fdupes via Synaptic:

    Code:
    fdupes -r /home/top_dir > xdupes.txt
    This will run through the directory tree starting at top_dir and output its results into the file xdupes.txt.
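
    For reference, fdupes writes each set of identical files as consecutive lines, one path per line, with a blank line after each set. A small sketch of reading that layout into Python lists follows; the function name parse_fdupes_report is my own, not part of fdupes.

```python
def parse_fdupes_report(text):
    """Split fdupes output into groups: lists of paths separated by blank lines."""
    groups, current = [], []
    for line in text.split("\n"):
        if line:
            current.append(line)
        elif current:
            groups.append(current)
            current = []
    if current:  # final group when the report lacks a trailing blank line
        groups.append(current)
    return groups
```

    Calling parse_fdupes_report(open("xdupes.txt").read()) gives one list per set of duplicates, which makes it easy to count the sets or to see which file in each set would be kept.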

    Next, I ran this little Python program:

    Code:
    import shlex    # shlex.quote needs Python 3.3 or later

    # Read the fdupes report: groups of identical files separated by blank lines.
    with open("xdupes.txt") as report:
        lines = [line.rstrip("\n") for line in report]

    with open("xdupes_rm.sh", "w") as script:
        # Pair each line with the one after it. A name is removed only when
        # another name follows it, so the last file of every group is kept.
        for name, following in zip(lines, lines[1:]):
            if name and following:
                # shlex.quote protects spaces, quotes, $ and other shell characters
                script.write("rm -- " + shlex.quote(name) + "\n")
    This then creates an output file named "xdupes_rm.sh" that contains all the duplicate file names read in from "xdupes.txt", except the LAST name in each group. For example, if fdupes has found files A, B, and C to be the same, then the file names A and B will be written to "xdupes_rm.sh" and C will be kept.

    Next, make it executable:

    Code:
    chmod u+x xdupes_rm.sh
    If you are unsure, open xdupes_rm.sh and xdupes.txt in a text editor and check that at least one file from each group of duplicates will be kept.
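
    With thousands of entries, checking by eye gets tedious, so here is a sketch of doing the same check in Python. It assumes the two file names used above and that every line of the removal script is a plain "rm" command; the function name check_removal_script is hypothetical.

```python
import shlex


def check_removal_script(report_path="xdupes.txt", script_path="xdupes_rm.sh"):
    """Return the duplicate groups that the removal script would delete entirely."""
    # Collect every path named on an "rm" line of the generated script.
    doomed = set()
    with open(script_path) as script:
        for line in script:
            tokens = shlex.split(line)
            if tokens and tokens[0] == "rm":
                doomed.update(t for t in tokens[1:] if t != "--")

    # Rebuild the groups from the fdupes report (blank line = end of group).
    wiped, group = [], []
    with open(report_path) as report:
        for line in list(report) + ["\n"]:  # extra blank line flushes the last group
            name = line.rstrip("\n")
            if name:
                group.append(name)
            else:
                if group and all(path in doomed for path in group):
                    wiped.append(group)
                group = []
    return wiped
```

    An empty list from check_removal_script() means every group keeps at least one file. Anything else lists the groups that would be wiped out completely.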

    Now run the removal script:

    Code:
    ./xdupes_rm.sh
    Note: With a large number of files it can take a LONG time. Fortunately, fdupes gives you visual feedback, so you know it's working.

  2. #2
    Join Date
    Feb 2005
    Beans
    56

    Re: HOWTO: Remove duplicate files in a directory tree

    Hello, I've been trying to use your suggestion but can't get the Python script to work:

    ./fdupes-batch.py
    ./fdupes-batch.py: line 1: import: command not found
    ./fdupes-batch.py: line 2: syntax error near unexpected token `('
    ./fdupes-batch.py: line 2: `text_file=open("xdupes.txt","r")'

    Would love to get it working, as I have 14,000+ duplicate emails to sort out.

    I created the file fdupes-batch.py with your Python script and chmod'ed it 777.


    many thanks

  3. #3
    Join Date
    Apr 2008
    Beans
    1

    Re: HOWTO: Remove duplicate files in a directory tree

    Thanks, works fine for me out of the box.
    @ carpman: python fdupes-batch.py, not sh fdupes-batch.py

  4. #4
    Join Date
    Nov 2007
    Beans
    8
    Distro
    Ubuntu

    Re: HOWTO: Remove duplicate files in a directory tree

    After some experimentation, I found a nice one-line command.
    Let's say your working directory is "abunchoffiles" and your setup looks like this:

    /home/
    /home/abunchoffiles/
    /home/duplicates/

    Code:
    fdupes -fr . | xargs mv  -t ../duplicates
    Using the above code, you can move all the duplicate files (recursively) into the directory "duplicates." I find this much safer (mentally, at least) than rm-ing them.
    I have run into a small problem with special characters in file names ($, (, ), etc.), but I imagine there's some way around it. I just haven't figured it out yet...
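
    One possible way around the special-character problem is to skip the shell and do the move in Python, where file names are never re-parsed. This is only a sketch: the function name move_duplicates and the ".1", ".2" suffix used for name clashes are my own choices.

```python
import os
import shutil


def move_duplicates(paths, target_dir):
    """Move each listed file into target_dir, renaming on name clashes."""
    os.makedirs(target_dir, exist_ok=True)
    moved = []
    for path in paths:
        if not path:
            continue  # fdupes separates groups with blank lines
        base = os.path.basename(path)
        dest = os.path.join(target_dir, base)
        counter = 1
        while os.path.exists(dest):
            # Hypothetical naming scheme for clashes: name.1, name.2, ...
            dest = os.path.join(target_dir, f"{base}.{counter}")
            counter += 1
        shutil.move(path, dest)
        moved.append(dest)
    return moved
```

    To use it, save the list with "fdupes -fr . > dupes.txt" and then call move_duplicates(open("dupes.txt").read().split("\n"), "../duplicates"). Duplicates found in different directories often share a name, which is why the sketch renames instead of overwriting.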

  5. #5
    Join Date
    Jul 2005
    Location
    Estonia
    Beans
    12
    Distro
    Dapper Drake Testing/

    Re: HOWTO: Remove duplicate files in a directory tree

    Hi,

    Does anyone know a graphical tool to remove duplicate photos under Ubuntu?

    Thanks,

  6. #6
    Join Date
    Apr 2006
    Beans
    19

    Re: HOWTO: Remove duplicate files in a directory tree

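    You have to run it as a Python script. The file has no "#!" line at the top, so running ./fdupes-batch.py directly makes the shell try to interpret it: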

    >python fdupes-batch.py

  7. #7
    Join Date
    May 2007
    Location
    Harbin, China
    Beans
    18
    Distro
    Kubuntu 9.10 Karmic Koala

    Re: HOWTO: Remove duplicate files in a directory tree

    A little late for you, Chill, I'm sure, but for the rest of you here looking for the answer: FSlint is a duplicate file finder with a nice GUI, and it's in our repositories. http://ndblocks.com/index.php?hl=f5&...rf-bs-svyrf%2F

    The link has a good description and howto. I used it on my ebooks folder and found I had ten copies of some books! Hope this helps you too.

    Philip
    Last edited by spider_0k; August 3rd, 2008 at 09:54 PM. Reason: I didn't check my spelling and grammer :-(

  8. #8
    Join Date
    May 2007
    Location
    Whidbey Island, WA
    Beans
    Hidden!
    Distro
    Kubuntu 9.04 Jaunty Jackalope

    Re: HOWTO: Remove duplicate files in a directory tree

    Spider,

    I would really like to see that link to a good description and howto for FSlint. I can see that FSlint is a powerful program, but it is anything but intuitive to use. The help pages and man page are so vague as to be useless.

    Anyway, enough ranting. When I navigate to your link I get:

    NDBlocks
    Resource Error: An error has occured while trying to browse through the proxy.
    It appears that you are trying to access a resource through this proxy from a remote Website.
    For security reasons, please use the form below to do so.
    So if you have that link still and can provide it that would be fabulous!

  9. #9
    Join Date
    May 2007
    Location
    Harbin, China
    Beans
    18
    Distro
    Kubuntu 9.10 Karmic Koala

    Re: HOWTO: Remove duplicate files in a directory tree

    Firstly, sorry. Very bad of me not to check my post before submission. I am based in China, so I often have to use a proxy (Firefox, Gladder extension) to get around even "normal" web sites. I had to search again for the link:

    http://www.ubuntugeek.com/fslint-too...tems-data.html

    Ubuntu Geek has a reasonable guide. That was the one I followed.

    If your needs are pretty straightforward but you need some visual confirmation first (after all, deleting is the scariest thing), then this is the app for you.

    As I said I didn't do anything 'special', but if you feel the need for more help .......

  10. #10
    Join Date
    Sep 2009
    Beans
    19
    Distro
    Ubuntu 10.10 Maverick Meerkat

    Re: HOWTO: Remove duplicate files in a directory tree

    Fdupes and FSlint are useless if your duplicates are not identical: see Robust image duplicate finder; I found GQview to be the best tool.
