
Compare Multiple Files In Python

I have a set of directories, each containing n files. I need to compare the files within one directory and find whether any of them differ. I tried filecmp and d…
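(For reference, a pairwise comparison with filecmp might look roughly like the sketch below; the "hosts" directory name is just a placeholder, not something from the question.)

import filecmp
import os
from itertools import combinations

directory = "hosts"  # placeholder path
files = [os.path.join(directory, name) for name in os.listdir(directory)
         if os.path.isfile(os.path.join(directory, name))]

# Compare every pair of files; shallow=False compares contents, not just os.stat() data.
for a, b in combinations(files, 2):
    if not filecmp.cmp(a, b, shallow=False):
        print("%s and %s differ" % (a, b))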

Solution 1:

I thought I'd share how combining an md5 hash comparison with os.path.walk() can help you ferret out all the duplicates in a directory tree. The larger the number of directories and files gets, the more helpful it becomes to first group files by size, so you can rule out any files that can't be duplicates because their sizes differ. Hope this helps.

import os, sys
from hashlib import md5

nonDirFiles = []

def crawler(arg, dirname, fnames):
    '''Crawls directory 'dirname' and creates global
    list of paths (nonDirFiles) that are files, not directories'''
    d = os.getcwd()
    os.chdir(dirname)

    global nonDirFiles
    for f in fnames:
        if not os.path.isfile(f):
            continue
        else:
            nonDirFiles.append(os.path.join(dirname, f))
    os.chdir(d)

def startCrawl():
    x = raw_input("Enter Dir: ")
    print 'Scanning directory "%s"....' % x
    os.path.walk(x, crawler, nonDirFiles)

def findDupes():
    dupes = []
    hashes = {}
    for fileName in nonDirFiles:
        print 'Scanning file "%s"...' % fileName
        f = open(fileName, 'rb')  # read as bytes so the hash is platform-independent
        hasher = md5()
        hasher.update(f.read())
        f.close()
        hashValue = hasher.digest()

        if hashValue in hashes:
            dupes.append(fileName)
        else:
            hashes[hashValue] = fileName

    return dupes

if __name__ == "__main__":
    startCrawl()
    dupes = findDupes()
    print"These files are duplicates:"for d in dupes:print d

Solution 2:

Your question doesn't specify whether you need to determine the actual differences or only find which files are the same, so I will focus on grouping identical files together.

You can use hashing to group identical files together:

from hashlib import md5
from pprint import pprint

def get_filenames():
    return ('file1', 'file2', 'file3')

hashes = {}
for f in get_filenames():
    hd = md5(open(f).read()).hexdigest()
    hashes[hd] = hashes.get(hd, []) + [f]

pprint(hashes)
# Output:
# {'420248eb2e8226ac441cb7516fb7ff23': ['file2'],
#  '4f2d7139dc1aa23235e7fad418a5bd10': ['file1', 'file3']}

Given that your files contain lists of host names, you might like to sort each file's lines beforehand so that, for example,

file1 <- host1 
         host2
         host3

and

file3 <- host3 
         host2
         host1

are considered equivalent.
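One way to do that, as a rough sketch reusing the hypothetical file1/file2/file3 names from above, is to hash each file's sorted lines instead of its raw bytes:

from hashlib import md5
from pprint import pprint

def sorted_lines_digest(path):
    '''Hash the sorted lines of a file so that line order does not matter.
    Assumes the host lists are small enough to read into memory.'''
    with open(path, 'rb') as f:
        return md5(b''.join(sorted(f.read().splitlines(True)))).hexdigest()

hashes = {}
for name in ('file1', 'file2', 'file3'):
    hashes.setdefault(sorted_lines_digest(name), []).append(name)

pprint(hashes)  # file1 and file3 now fall into the same group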
