Comparing Large Files With Grep Or Python
Solution 1:
This method creates a set from the first file (listA). The only memory requirement is enough space to hold this set. It then iterates through each URL in the listB.txt file one line at a time (very memory efficient). If the URL is not in the set, it writes it to a new file (also very memory efficient).
filename_1 = 'listA.txt'
filename_2 = 'listB.txt'
filename_3 = 'listC.txt'
with open(filename_1, 'r') as f1, open(filename_2, 'r') as f2, open(filename_3, 'w') as fout:
    # Hold every line of the first file in a set for fast membership tests.
    s = set(val.strip() for val in f1)
    # Stream the second file and write out only the lines missing from the set.
    for row in f2:
        row = row.strip()
        if row not in s:
            fout.write(row + '\n')
Solution 2:
If you have sufficient memory, read the files into two lists, then convert the lists to sets, i.e. setA = set(listA). You can then use the various operators available on Python sets to do whatever operations you like, e.g. setA - setB.
I've used this before and it's very efficient.
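As a rough sketch of that idea (the helper names read_lines and compare are made up for illustration, and stripping whitespace from each line is an assumption borrowed from Solution 1), it could look something like this:

```python
def read_lines(path):
    """Return the set of whitespace-stripped lines in a text file."""
    with open(path) as f:
        return {line.strip() for line in f}


def compare(path_a, path_b):
    """Compare two files as sets of lines; both sets must fit in memory."""
    set_a = read_lines(path_a)
    set_b = read_lines(path_b)
    return {
        "only_in_a": set_a - set_b,  # set difference
        "only_in_b": set_b - set_a,
        "in_both": set_a & set_b,    # set intersection
    }
```

Calling compare('listA.txt', 'listB.txt') would return all three results at once, at the cost of holding both files in memory.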
Solution 3:
You will want to follow the solution here:
Get difference between two lists
But first, you will need to know how to load the file into a list, which is here:
How do I read a file line-by-line into a list?
Good luck. So something like this:
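with open('listA.txt') as a:
    # splitlines() drops the trailing newlines, so a last line without one still matches
    listA = a.read().splitlines()
with open('listB.txt') as b:
    listB = b.read().splitlines()
# The with blocks close the files automatically, so no explicit close() is needed.
diff = list(set(listB) - set(listA))
# One choice for printing
print('[%s]' % ', '.join(map(str, diff)))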
Solution 4:
If you can't fit even the smaller file into memory, Python is not going to help. The usual solution is to sort the inputs and use an algorithm which operates on just three entries at a time: it reads one entry from one file and one from the other, then decides from their sort order which file to read from next. It needs to keep three of them in memory at any time to decide which branch to take in the code.
GNU sort will fall back to a disk-based merge sort if it can't fit the data into memory, so it is basically restricted only by the available temporary disk space.
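#!/bin/sh
export LC_ALL=C  # use trad POSIX sort order
t=$(mktemp -t listA.XXXXXXXX) || exit 123
# remove the temporary file on exit or interruption
trap 'rm -f "$t"' EXIT HUP INT
sort listA.txt >"$t"
# -12 suppresses the lines unique to either input, leaving the lines common to both
sort listB.txt | comm -12 "$t" -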
If the input files are already sorted, obviously comm is all you need.
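For illustration only, the merge step that comm performs on sorted input can be sketched in Python roughly as follows. The function names common_sorted and common_lines are made up here, and the sketch assumes both files were sorted in byte order (as with LC_ALL=C), which is why it reads them in binary mode:

```python
def common_sorted(lines_a, lines_b):
    """Yield entries present in both inputs; both must be sorted the same way."""
    it_a, it_b = iter(lines_a), iter(lines_b)
    a = next(it_a, None)
    b = next(it_b, None)
    while a is not None and b is not None:
        if a < b:
            a = next(it_a, None)  # a is smaller than everything left in B
        elif a > b:
            b = next(it_b, None)  # b is smaller than everything left in A
        else:
            yield a               # same entry in both inputs
            a = next(it_a, None)
            b = next(it_b, None)


def common_lines(path_a, path_b):
    """Stream two byte-sorted files and yield the lines they share, as bytes."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        yield from common_sorted(
            (line.rstrip(b"\n") for line in fa),
            (line.rstrip(b"\n") for line in fb),
        )
```

Only the current line of each file is held at any time, so memory use does not grow with file size. The sorting itself is still the expensive part, and that is where sort's disk-based fallback matters.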
Bash (and I guess probably also Zsh and ksh) offers process substitution like comm <(sort listA.txt) <(sort listB.txt) but I'm not sure if that's robust under memory exhaustion.
As I'm sure you have already discovered, if the files are of radically different sizes, it makes sense to keep the smaller one in memory regardless of your approach (so switch the order of listA.txt and listB.txt if listB.txt is the smaller one, here and in your original grep command line; though I guess it will make less of a difference here).
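If you do go the in-memory route, a small sketch of "keep the smaller one in memory" might look like the following. The function name common_lines_small_in_memory is hypothetical, and file size on disk is used only as a rough proxy for how much memory the set will need:

```python
import os


def common_lines_small_in_memory(path_1, path_2):
    """Yield each line common to both files once, holding only the smaller file in a set."""
    # Pick the smaller file by size on disk (an approximation of its memory footprint).
    small, large = sorted((path_1, path_2), key=os.path.getsize)
    with open(small) as f:
        seen = {line.strip() for line in f}
    # Stream the larger file one line at a time.
    with open(large) as f:
        for line in f:
            line = line.strip()
            if line in seen:
                seen.discard(line)  # report each common line only once
                yield line
```

Swapping the files like this only works for symmetric questions such as "which lines are in both?". For "which lines of listB.txt are missing from listA.txt?" the roles of the two files are fixed, as in Solution 1.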