Finding and deleting duplicate image files

cerbere

New Member
Hi everyone,

I want to find (and optionally delete) duplicate images
in my directories. I have already written a Perl program
(here: http://ixedix.x10hosting.com/findsame.htm ) that lists
JPG files of the same size, as possible candidates for manual deletion.

To make the deletion automatic, I need to be pretty sure that the images
are the same. From what I read while googling the subject, it seems
that there is no checksum or similar signature (MD5, etc.) in the
JPG format. My idea now is to read a 256-byte block in the middle
of each potentially duplicate image, compare it to a block at the same
offset from the first image of that size, and delete the clone if
the blocks are identical.
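
Something like this is what I have in mind (just a sketch, untested):

Code:
use strict;
use warnings;

# Compare a 256-byte block taken from the middle of two files
# already known to have the same size.
sub same_middle_block {
    my ($file_a, $file_b, $size) = @_;
    my $offset = int($size / 2) - 128;   # center the 256-byte window
    $offset = 0 if $offset < 0;          # very small files

    my @blocks;
    for my $file ($file_a, $file_b) {
        open my $fh, '<:raw', $file or die "open $file: $!";
        seek $fh, $offset, 0 or die "seek $file: $!";
        my $block = '';
        read $fh, $block, 256;
        close $fh;
        push @blocks, $block;
    }
    return $blocks[0] eq $blocks[1];
}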

Of course, a byte-by-byte comparison of the entire files could do
the job, but it would be a lot slower.

Any suggestions on the fastest, safest way to do it?

Ancillary question: is there a "move file" function in Perl or PHP?
I'd like to move the duplicate files to a garbage bin before finally
deleting them for good.
 

lordskid

New Member
http://us.php.net/crc32#86628

Check this out. They say you can use this to generate CRC16 checksums of a file.

Now all you have to do is run this code on both JPGs; if the results match, the files are most likely the same, and the duplicate can be deleted.
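
In Perl it would look something like this (needs the String::CRC32 module from CPAN; the file names are just examples):

Code:
use strict;
use warnings;
use String::CRC32;

# CRC of the whole file: same size plus same CRC means the
# files are most likely identical.
sub file_crc {
    my ($file) = @_;
    open my $fh, '<:raw', $file or die "open $file: $!";
    my $data = do { local $/; <$fh> };   # slurp the whole file
    close $fh;
    return crc32($data);
}

print "probable duplicate\n"
    if file_crc('a.jpg') == file_crc('b.jpg');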
 

cerbere

New Member
Thanks for the suggestion, lordskid, but calculating a CRC
for each candidate file would be even slower than doing
a plain byte-by-byte comparison...

On second thought, that only holds if I have just 2 files
of a given length: with N files of the same size, I'd need up to
N(N-1)/2 pairwise comparisons but only N checksums. For more files,
maybe your idea is the best way.

Here is a fragment of my list of potential duplicates:

Code:
c:/2008Mai4/LFS001090.JPG   208163
c:/2008Nov2/1227225623.JPG   208163

c:/Sept2007Bis/1189442639290.JPG   209199
c:/Oct2007Ter/1192542190263.JPG   209199
c:/Oct2007Quad/1193434359791.JPG   209199
c:/Oct2007Penta/1193639838277.JPG   209199

c:/2008Mai2/117943817617.JPG   210542
c:/2008Juin3/1213585426621.JPG   210542

The number on each line is the file size.
 

misson

Community Paragon
Community Support
> On second thought, that only holds if I have just 2 files
> of a given length... For more files, maybe your idea is the best way.

That was going to be my suggestion: compare files directly when only 2 share a size, and look for a checksum collision when >= 3 do.
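
In Perl, the skeleton might look like this (untested; %by_size is assumed to already map each size to its list of paths, as in your listing):

Code:
use strict;
use warnings;
use Digest::MD5;
use File::Compare;

my %by_size;   # size => [paths], filled in by the directory scan

sub md5_of {
    my ($file) = @_;
    open my $fh, '<:raw', $file or die "open $file: $!";
    return Digest::MD5->new->addfile($fh)->hexdigest;
}

for my $size (keys %by_size) {
    my @files = @{ $by_size{$size} };
    next if @files < 2;
    if (@files == 2) {
        # Only two candidates: one direct compare is cheapest.
        print "dup: @files\n" if compare(@files) == 0;
    } else {
        # Three or more: hash each file once, then bucket by digest.
        my %by_digest;
        push @{ $by_digest{ md5_of($_) } }, $_ for @files;
        for my $group (values %by_digest) {
            print "dups: @$group\n" if @$group > 1;
        }
    }
}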

As for comparing random samples of the content, it's hard to even estimate how many samples you'd need to reach a given confidence level without knowing more about what's depicted in the images, but I'd be very surprised if you needed to compare more than 25-50% of the file contents, and even that estimate is very conservative. Assuming there's more variation in the center of the image, a good strategy would be to pick blocks from the middle of the file for a baseline JPEG and from the start of the file for a progressive JPEG.

Some JPEG markers might also be good candidates for comparison (e.g. the JFIF APP0 segment, if it contains a thumbnail), depending on what produced the images. Quantization tables, in particular, can vary widely from source to source. If the images were all produced by the same software, it might be better to skip the markers and compare just the scan data. Of course, reliably checking or skipping markers for fast comparison depends on their being at the start of the file.
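
For illustration, finding where the scan data starts means walking the marker segments until SOS; a rough, untested sketch (it ignores corner cases like padding bytes):

Code:
# Return the offset just past the SOS marker, where marker
# comparison stops and scan-data comparison would start.
sub sos_offset {
    my ($file) = @_;
    open my $fh, '<:raw', $file or die "open $file: $!";
    read $fh, my $soi, 2;
    return undef unless defined $soi && $soi eq "\xFF\xD8";   # SOI
    while (read($fh, my $marker, 2) == 2) {
        return undef unless substr($marker, 0, 1) eq "\xFF";
        last if substr($marker, 1, 1) eq "\xDA";   # SOS reached
        read $fh, my $lenbytes, 2;
        my $len = unpack 'n', $lenbytes;   # count includes these 2 bytes
        seek $fh, $len - 2, 1;             # skip the segment payload
    }
    return tell $fh;
}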

Are the files stored on an NTFS-formatted disk? I don't think there are any NTFS attributes (other than size) that would help, but someone more familiar with NTFS internals could prove me wrong.

Finally, how many files do you have to compare? You probably have the cycles to spare if this isn't a frequent task or the number of duplicates is small.

> Ancillary question: is there a "move file" function in Perl or PHP?
> I'd like to move the duplicate files to a garbage bin before finally
> deleting them for good.
For Perl, see File::Copy::move() or (failing that) rename. For PHP, see rename().
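For example (the garbage directory is just an example path and must already exist):

Code:
use File::Copy qw(move);

# Move a duplicate into a garbage directory instead of
# deleting it right away.
move('c:/2008Nov2/1227225623.JPG', 'c:/garbage/1227225623.JPG')
    or die "move failed: $!";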
 

cerbere

New Member
Thanks for the pointers, misson.

I have around 60000 images (art, technology, eye-candy),
with an average file size of, say, 200 KB. They are spread
over 3 drives.

I now realize that I must give more thought to the subject,
notably about the ">= 3" case. My PC is very low-end
("jurassic" comes to mind : 133 MHz Pentium II with 32 Mb of RAM)
so I must get smart with the comparison algorithm
if I want it to complete this year! The good thing is that I will
run it only every 3 months or so, until I can afford a new PC
with a HUGE drive (14 GB total right now).

So I will experiment with different approaches, and post what
I find here in a few days.

P.S. Shame on me for not having thought of "rename"...
 

misson

Community Paragon
Community Support
Given the nature of the problem, I don't expect changing the programming language will have a huge effect, but it's worth testing. Python's fairly fast, even for numeric processing. A mix of C and assembly might be the fastest, but compiler optimizations may be able to beat hand-coded assembly.

Where do the images come from? Do you need to rescan the entire image store each time? You also might be able to use one of the APPn markers to tag JPEGs with checksums to speed future comparisons.
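
Roughly like this, for instance (untested; "DUPCHK" is just a made-up label so the segment can be recognized later):

Code:
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Stamp an MD5 of the file into a private APP9 segment right
# after SOI; decoders skip APPn segments they don't recognize.
sub tag_with_checksum {
    my ($file) = @_;
    open my $in, '<:raw', $file or die "open $file: $!";
    my $data = do { local $/; <$in> };
    close $in;
    die "$file: not a JPEG" unless substr($data, 0, 2) eq "\xFF\xD8";

    my $payload = 'DUPCHK' . md5_hex($data);        # 6 + 32 bytes
    my $segment = "\xFF\xE9"                        # APP9 marker
                . pack('n', length($payload) + 2)   # length incl. itself
                . $payload;
    substr($data, 2, 0) = $segment;                 # insert after SOI

    open my $out, '>:raw', $file or die "write $file: $!";
    print $out $data;
    close $out;
}

Note that the stamped MD5 covers the file as it was before tagging, so later runs would compare the stamped values rather than re-hash the files.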
 