kimes
June 21st, 2005, 01:31 AM
Let me explain my case. I've got lots and lots of multi-gigabyte files, and some of them might be duplicated. My task is to remove the exact duplicates. At first I thought I could check them by file size, but I soon realized that file size alone can't guarantee the files are EXACTLY the same. So I searched the internet and found 'md5' and 'crc', which also seem to be built-in(?) tools on Linux. I was happy, but then I realized that it takes too long to digest files that are several gigabytes each.
How can I handle this kind of situation?
I'm thinking of taking some number of bytes from the front of each file and digesting only those. It would take much less time, but the result wouldn't be reliable, since two files can share the same beginning and still differ later on.
Is there a better algorithm?
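To make the idea concrete, here is a rough sketch of how the size check and the partial digest might be combined. It is only an illustration under my own assumptions, not a tested solution: the names `digest`, `group_by`, `find_duplicates` and the `PREFIX_BYTES` cutoff are all made up for this example. Files are grouped by size first, then by a digest of their first bytes, and the full digest is computed only for files that still match after both cheap checks.

```python
import hashlib
import os
from collections import defaultdict

# Hypothetical cutoff: how many leading bytes the cheap pre-check reads.
PREFIX_BYTES = 64 * 1024


def digest(path, limit=None, chunk_size=1024 * 1024):
    """MD5 of the first `limit` bytes of a file, or of the whole file if limit is None."""
    h = hashlib.md5()
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            size = chunk_size if remaining is None else min(chunk_size, remaining)
            block = f.read(size)
            if not block:
                break
            h.update(block)
            if remaining is not None:
                remaining -= len(block)
    return h.hexdigest()


def group_by(paths, key):
    """Group paths by key(path) and keep only groups with more than one member."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]


def find_duplicates(paths):
    """Return lists of paths whose size, prefix digest and full digest all match."""
    result = []
    # Stage 1: size costs nothing to read and rules out most files.
    for same_size in group_by(paths, os.path.getsize):
        # Stage 2: digest only the first PREFIX_BYTES of each remaining file.
        for same_prefix in group_by(same_size, lambda p: digest(p, PREFIX_BYTES)):
            # Stage 3: full digest, but only for files that survived both checks.
            result.extend(group_by(same_prefix, digest))
    return result
```

With this layout the expensive full read happens only for files that already agree on size and on their first bytes, so a file with a unique size is never read at all. Files that really are duplicates still have to be read completely once. A matching full digest is very strong evidence but not a strict proof, so a final byte-by-byte comparison could be added before deleting anything.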