Concatenating Large Files, Piping, And A Bonus
Solution 1:
First things first; I think you've got your modes incorrect:

unzipfile1 = gzip.open(zipfile1, 'wb')

This opens zipfile1 for writing, not reading, and truncates the existing file in the process. I hope your data still exists.
Second, you do not want to try to work with the entire data all at once. You should work with the data in blocks of 16k or 32k or something. (The optimum size will vary based on many factors; make it configurable if this task has to be done many times, so you can time different sizes.)
What you're looking for is probably more like this untested code:

while block := unzipfile1.read(4096 * 4):
    p1.stdin.write(block)
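For that loop to make sense, unzipfile1 has to be opened for reading and p1 has to be a child process with a pipe for its stdin. Here's a minimal, untested sketch of that setup, borrowing the file and script names from the shell pipeline quoted at the end:

import gzip
import subprocess

# Open the compressed input for reading ('rb'), not writing.
unzipfile1 = gzip.open('file1.gz', 'rb')

# p1 is a child process we can feed through a pipe; the dataclean.py
# stage from the shell pipeline below is used here as an example.
p1 = subprocess.Popen(['./dataclean.py'],
                      stdin=subprocess.PIPE,
                      stdout=subprocess.PIPE)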
If you're trying to hook together multiple processes in a pipeline in Python, then it'll probably look more like this:

while block := unzipfile1.read(4096 * 4):
    p1.stdin.write(block)
    p2.stdin.write(p1.stdout.read(4096 * 4))
This gives the output from p1 to p2 as quickly as possible. I've made the assumption that p1 won't generate significantly more output than the input it was given. If p1's output will be, say, ten times greater than its input, then you should make another loop similar to this one to keep draining it.
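A less fragile way to hook the stages together (again, an untested sketch using the script names from the shell command below) is to connect p1's stdout directly to p2's stdin, the same way the shell does, so Python only has to feed the first stage:

import gzip
import shutil
import subprocess

BLOCK = 4096 * 4

# Stage commands borrowed from the shell pipeline below.
p1 = subprocess.Popen(['./dataclean.py'],
                      stdin=subprocess.PIPE,
                      stdout=subprocess.PIPE)
p2 = subprocess.Popen(['./dataprocess.pl'],
                      stdin=p1.stdout)   # p2 reads straight from p1
p1.stdout.close()  # so p1 gets SIGPIPE if p2 exits early

# Feed the decompressed data to p1 in blocks; copyfileobj does the
# same block-sized read/write loop as above.
with gzip.open('file1.gz', 'rb') as unzipfile1:
    shutil.copyfileobj(unzipfile1, p1.stdin, BLOCK)

p1.stdin.close()   # signal end of input so the pipeline can finish
p1.wait()
p2.wait()

Here p2 inherits the parent's stdout, just as the last stage of the shell pipeline writes to the terminal, and the processes push data between themselves without the Python script in the middle.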
But, I've got to say, this feels like a lot of extra work to replicate the shell script:
gzip -cd file1.gz file2.gz file3.gz | dataclean.py | dataprocess.pl
gzip(1) will automatically handle the block-sized data transfer as I've described above, and assuming your dataclean.py and dataprocess.pl scripts also work with data in blocks rather than performing full reads (as your original version of this script does), they should all run in parallel near the best of their abilities.
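For reference, "working with data in blocks" in a filter like dataclean.py just means reading stdin a chunk at a time instead of slurping the whole stream. A hypothetical skeleton, with clean() standing in for whatever the script actually does:

import sys

def clean(block):
    # Placeholder for the real cleaning logic.
    return block

# Read binary stdin a block at a time and write the result out
# immediately, so upstream and downstream stages can keep running.
while block := sys.stdin.buffer.read(4096 * 4):
    sys.stdout.buffer.write(clean(block))
sys.stdout.buffer.flush()

A line-oriented filter can just iterate over sys.stdin instead, which also streams rather than reading everything up front.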