Linux Tutorial: Finding Duplicates

What’s the Problem?

Someone brought this data de-duplication problem to me:

There’s an ever-growing collection of video files. For reasons out of our control, the collection process leads to duplicate uploads. Each file gets a unique name, a unique timestamp, and otherwise unique metadata, but its contents are an identical copy of a file that has already been saved.

We need to find the duplicates, so that all but the first can be deleted.

What Do We Mean By “Duplicate?”

The same data has already been uploaded into another file with a different name and timestamp. It’s the same length, with every bit identical. So, not similar, not “it’s the same for a while”, but identical.
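
On any of these systems, the cmp utility is the natural arbiter of “identical”: it exits with status 0 only when two files match byte for byte. A quick sketch, using hypothetical file names:

# Compare two hypothetical files byte for byte.
# -s keeps cmp quiet; only its exit status matters.
if cmp -s /path/to/video-a.mp4 /path/to/video-b.mp4
then
        echo "identical"
else
        echo "different"
fi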

Hash Functions to the Rescue

We know that the hash values of two identical files will be the same. A good hash function also exhibits the avalanche effect: a single-bit change in the input changes, on average, half of the output bits. A collision, where two different inputs produce the same hash output, should be so rare that we would expect to hash an astronomical number of files before ever seeing one.
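
You can see the avalanche effect from the command line. Here is a small illustration (the input strings are arbitrary); changing a single character of the input yields a digest that bears no resemblance to the previous one:

# Hash two inputs that differ only in their final character.
# The two 256-bit digests will look completely unrelated.
printf 'duplicate video 1' | openssl sha256
printf 'duplicate video 2' | openssl sha256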

Let’s Put It In Words

We have a large collection of files. In the situation I’m describing, they happened to contain video data, but this doesn’t matter at all. We will:

  1. Make a list of the SHA-2-256 hash values of those files.
  2. Count how many times each hash output occurs. We expect each file’s hash to be unique, appearing just once in the list. (Steps 1 and 2 are sketched as a single pipeline below.)
  3. If any hash value appears more than once, it almost (but not quite) certainly marks a set of duplicate files. For each repeated hash, do bit-by-bit comparisons of the corresponding files to confirm which ones really are duplicates.
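
Before turning that into a full script, here is a minimal sketch of steps 1 and 2 as a single pipeline. It assumes the hash is the last whitespace-delimited field of each line of openssl output, and /path/to/data stands in for the real directory:

# Step 1: hash every regular file under the data directory.
# Step 2: count how often each digest occurs and show only the repeats.
find /path/to/data -type f -exec openssl sha256 "{}" \; |
        awk '{print $NF}' | sort | uniq -c | awk '$1 > 1'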

So far, we’re at the conceptual level of Learning Tree’s System and Network Security Introduction course. Let’s take it up a notch.

Let’s Actually Do This

The following works on Linux, Mac OS X, BSD, Solaris, AIX, HP-UX, basically anything that isn’t Windows. It is a shell script that finds possible duplicates and then confirms them with bit-by-bit comparisons. See the Linux Introduction course and then the shell programming course for better ways of solving this problem.

#!/bin/sh

# Change this to the directory where files are stored:
AREA=/path/to/data
# Recall that $$ will be replaced with the running shell's process ID:
HASHLIST=/tmp/$$-hashes
FILELIST=/tmp/$$-files

# Find the list of SHA-2-256 hashes and
# put it into the HASHLIST file.
find $AREA -type f -exec openssl sha256 "{}" \; > $HASHLIST

# Which ones appear more than once?
POSSIBLEDUPES="$( awk '{print $NF}' $HASHLIST | \
        sort | uniq -c | awk '$1 > 1 {print $2}')"

# Report those
for HASH in $POSSIBLEDUPES
do
        # Report a hash value with possible duplicates
        echo "Possible duplication for $DUPE"
        # Save the corresponding file names in FILELIST.
        grep $HASH $HASHLIST |
                sed 's/SHA256(//' |
                sed 's/)= .*//' |
                tee $FILELIST
        # Do pairwise comparisons with the corresponding files.
        # There are n files in the list.
        n=$(wc -l < $FILELIST)
        # Outer loop: i = 1, ... n-1
        i=1
        while [ $i -lt $n ]
        do
                # Inner loop: j = i+1, ... n
                j=$(( $i + 1 ))
                while [ $j -le $n ]
                do
                        # Extract ith and jth files in list
                        FILE1="$(head -$i $FILELIST | tail -1)"
                        FILE2="$(head -$j $FILELIST | tail -1)"
                        # If cmp returns 0, files were identical
                        cmp -s "${FILE1}" "${FILE2}"
                        if [ $? == "0" ]
                        then
                                echo " IDENTICAL FILES:"
                                echo "   ${FILE1}"
                                echo "   ${FILE2}"
                        fi
                        j=$(( $j + 1 ))
                done
                i=$(( $i + 1 ))
        done

        echo ""

done

rm -f $HASHLIST $FILELIST
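
One refinement worth mentioning: predictable names like /tmp/$$-hashes can be hijacked on a multi-user system. If your platform has mktemp (Linux and the BSDs do), something like the following is a safer way to create the two work files; this is a sketch, not part of the script above:

# Let mktemp choose unique, unpredictable temporary file names,
# and make sure they are removed even if the script is interrupted.
HASHLIST=$(mktemp /tmp/hashes.XXXXXX)
FILELIST=$(mktemp /tmp/files.XXXXXX)
trap 'rm -f "$HASHLIST" "$FILELIST"' EXIT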

I have assumed in the above that file names consist of nothing but upper- and lower-case letters, digits, the space character, and the safe punctuation marks “.”, “-”, and “_”. See the shell programming course for how to put long-handled tongs into your script and handle dangerous file name characters like those in this list (one safer pattern is sketched after the list):
' ` " ! ~ # $ & ( ) [ ] { } | \ ; : < >
