07 July 2008

My first Ruby script: finding duplicate photo files

I've started on the large task of organizing my digital photo files. My disorganization stems from a couple of things. First, when I began using a digital camera, I used iPhoto. I don't really use it now, but I've noticed that iPhoto makes some duplicate files. Second, I think I have duplicate files from backups and from migrating from old machines to new ones. Add the fact that most of the file names begin with DSCN or DCP, and it is a mess. So, to help me get started, I thought I would write a script that would create a report of any duplicate file names and their locations. Also, since I'm about one-third of the way through Everyday Scripting with Ruby by Brian Marick, I know just enough to be dangerous. After a couple of evenings of blundering my way through this, I finally have my first Ruby script.

require 'find'
# directories to search; default to the current directory
argv = ARGV.empty? ? %w{.} : ARGV
file_counts = Hash.new(0)
files = []
Find.find(*argv) do |fullname|
  # looking for image and movie files only
  next unless fullname =~ /\.(JPEG|JPG|GIF|MOV)$/i
  file = File.basename(fullname).downcase
  dir = File.dirname(fullname)
  files.push([file, dir])
  file_counts[file] += 1
end
file_counts.each do |file, count|
  if count > 1
    # print the number of occurrences and the file name
    printf("%5d %s\n", count, file)
    # since assoc only returns the first occurrence, we print it
    # then delete it so we can find the next occurrence
    while (a = files.assoc(file))
      # print each directory name
      printf("\t %s\n", a[1])
      files.delete(a)
    end
  end
end


To call this script, you can pass in a list of directories, or it will default to the current directory.
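For comparison, here is a rough sketch of the same report built around a hash of directory lists, which avoids the repeated assoc/delete scans over the array. The method names duplicate_names and print_report are made up for illustration, and the File.file? check is an extra assumption of mine (it skips directories whose names happen to end in .jpg); the original script does not do that.

```ruby
require 'find'

# Same extensions as the original script, matched case-insensitively.
IMAGE_PATTERN = /\.(jpe?g|gif|mov)\z/i

# Hypothetical helper: returns a hash mapping each lowercased file name
# to the list of directories it was found in, keeping only the names
# that were seen more than once.
def duplicate_names(roots)
  locations = Hash.new { |hash, name| hash[name] = [] }
  Find.find(*roots) do |fullname|
    # assumption: only regular files count, not directories
    next unless File.file?(fullname) && fullname =~ IMAGE_PATTERN
    locations[File.basename(fullname).downcase] << File.dirname(fullname)
  end
  locations.select { |_name, dirs| dirs.size > 1 }
end

# Hypothetical helper: prints the count and name, then each directory,
# in the same layout as the original script.
def print_report(duplicates, out = $stdout)
  duplicates.each do |name, dirs|
    out.printf("%5d %s\n", dirs.size, name)
    dirs.each { |dir| out.printf("\t %s\n", dir) }
  end
end
```

Calling print_report(duplicate_names(ARGV.empty? ? ['.'] : ARGV)) at the bottom of the file should give the same kind of report as the script above.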

2 comments:

Anonymous said...

This little duplicate-finding script is pretty neat. I ran it on a Windows XP machine against my C: drive. I copied the script from your blog, saved it on my machine as findDupFiles.rb, and ran it. I also redirected the output to a file instead of letting it scroll through my command window.
Example:
prompt>ruby findDupFiles.rb c:\ > duplicates.txt

I was surprised at just how many duplicated image names there were on my system. Just as a guide, there were 8,740 lines in the text file that was generated with the results!! Yikes, that is a lot of duplicate filenames.

Tip: One word of caution to others who may find this tool helpful. Although a filename may be duplicated, that does not necessarily mean the files in the separate locations are the same file. I suggest inspecting the files themselves before removing any of your duplicates.
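
One possible way to do that inspection in Ruby, as a minimal sketch and not part of the original script: group the paths that share a name by a digest of their contents. The method name identical_groups is hypothetical, and using SHA-256 from the standard digest library is my own choice. Files whose digests differ are definitely different; files whose digests match are almost certainly identical.

```ruby
require 'digest'

# Hypothetical helper: given a list of file paths, returns the groups of
# paths whose contents have the same SHA-256 digest. Paths whose contents
# are unique are left out.
def identical_groups(paths)
  paths.group_by { |path| Digest::SHA256.file(path).hexdigest }
       .values
       .select { |group| group.size > 1 }
end
```

Passing in the full paths of files that the report lists under one name would show which of them are true copies. Hashing reads every file in full, so this may be slow on a large photo collection.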

Keep up the good work!

Cynthia Sadler said...

When I ran it, my output file was 15000+ lines long. I don't know if this will really help me organize my files, but it was interesting nonetheless. I have a lot of work to do!