Compare Multiple File In Python
i have a set of directories with n number of files, I need to compare each of those files (within one directory) and find if there is any difference in them. I tried filecmp and d
Solution 1:
I thought I'd share how combining the md5 hash compare with os.path.walk() can help you ferret out all the duplicates in a directory tree. The larger the number of directories and files gets, the more helpful it might be to first sort files by size to rule out any files that can't duplicates because they are of different size. Hope this helps.
import os, sys
from hashlib import md5
nonDirFiles = []
defcrawler(arg, dirname, fnames):
'''Crawls directory 'dirname' and creates global
list of paths (nonDirFiles) that are files, not directories'''
d = os.getcwd()
os.chdir(dirname)
global nonDirFiles
for f in fnames:
ifnot os.path.isfile(f):
continueelse:
nonDirFiles.append(os.path.join(dirname, f))
os.chdir(d)
defstartCrawl():
x = raw_input("Enter Dir: ")
print'Scanning directory "%s"....' %x
os.path.walk(x, crawler, nonDirFiles)
deffindDupes():
dupes = []
outFiles = []
hashes = {}
for fileName in nonDirFiles:
print'Scanning file "%s"...' % fileName
f = file(fileName, 'r')
hasher = md5()
data = f.read()
hasher.update(data)
hashValue = hasher.digest()
if hashes.has_key(hashValue):
dupes.append(fileName)
else:
hashes[hashValue] = fileName
return dupes
if __name__ == "__main__":
startCrawl()
dupes = findDupes()
print"These files are duplicates:"for d in dupes:print d
Solution 2:
Your question doesn't specify whether you need to determine the differences or only find which files are the same/different - so I will focus on grouping together like files.
You can use hashing to group identical files together:
from hashlib import md5
from pprint import pprint
def get_filenames():
return ('file1', 'file2', 'file3')
hashes = {}
for f inget_filenames():
hd = md5(open(f).read()).hexdigest()
hashes[hd] = hashes.get(hd, []) + [f]
pprint(hashes)
{'420248eb2e8226ac441cb7516fb7ff23': ['file2'],
'4f2d7139dc1aa23235e7fad418a5bd10': ['file1', 'file3']}
Given that your files contain lists of host names you might like to sort the files beforehand so that, for example,
file1 <- host1
host2
host3
and
file3 <- host3
host2
host1
are considered equivalent.
Post a Comment for "Compare Multiple File In Python"