
Concatenating Large Files, Piping, And A Bonus

There have been similar questions asked (and answered), but never really together, and I can't seem to get anything to work. Since I am just starting with Python, something easy to follow would be appreciated.

Solution 1:

First things first; I think you've got your modes incorrect:

unzipfile1 = gzip.open(zipfile1, 'wb')

This opens zipfile1 for writing, not reading. I hope your data still exists.
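For reading the compressed input, that presumably should have been something like:

unzipfile1 = gzip.open(zipfile1, 'rb')   # 'rb' reads the compressed data; 'wb' truncates the file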

Second, you do not want to try to work with the entire data all at once. You should work with the data in blocks of 16k or 32k or something. (The optimum size will vary based on many factors; make it configurable if this task has to be done many times, so you can time different sizes.)

What you're looking for is probably more like this (untested):

# read the decompressed stream in 16 KiB chunks and feed it to p1 (Python 3.8+)
while (block := unzipfile1.read(4096*4)):
    p1.stdin.write(block)

If you're trying to hook together multiple processes in a pipeline in Python, then it'll probably look more like this:

while (block := unzipfile1.read(4096*4)):
    p1.stdin.write(block)
    # forward whatever p1 has produced so far on to p2
    p2.stdin.write(p1.stdout.read1(4096*4))

This gives the output from p1 to p2 as quickly as possible. I've made the assumption that p1 won't generate significantly more output than it was given. If the output of p1 will be ten times greater than the input, then you should add another loop similar to this one to drain it.
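For these snippets, p1 and p2 are assumed to be subprocess.Popen objects with pipes attached, created along these lines (treating your dataclean.py and dataprocess.pl scripts as executables on your PATH):

import subprocess

# Assumed setup for the loops above: each child reads from a pipe we write to,
# and p1's stdout is a pipe so its output can be forwarded to p2.
p1 = subprocess.Popen(['dataclean.py'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
p2 = subprocess.Popen(['dataprocess.pl'], stdin=subprocess.PIPE)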


But, I've got to say, this feels like a lot of extra work to replicate the shell script:

gzip -cd file1.gz file2.gz file3.gz | dataclean.py | dataprocess.pl

gzip(1) will handle the block-sized data transfer automatically, as described above, and assuming your dataclean.py and dataprocess.pl scripts also work on the data in blocks rather than reading everything at once (as your original version of this script does), all three stages should run in parallel at close to their full speed.
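If you do want to drive it from Python, a minimal sketch of the same pipeline is below; it assumes the file names from the shell command above and that both scripts are executable and on your PATH. Connecting p1.stdout directly to p2's stdin lets the kernel move the data between them without any loop in Python.

import gzip
import shutil
import subprocess

# Wire p1's stdout straight into p2's stdin, as the shell would.
p1 = subprocess.Popen(['dataclean.py'], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
p2 = subprocess.Popen(['dataprocess.pl'], stdin=p1.stdout)
p1.stdout.close()  # allow p1 to get SIGPIPE if p2 exits early

# Decompress each input in turn and feed it to p1 in block-sized chunks.
for name in ('file1.gz', 'file2.gz', 'file3.gz'):
    with gzip.open(name, 'rb') as f:
        shutil.copyfileobj(f, p1.stdin, 4096*4)

p1.stdin.close()  # signal end of input
p1.wait()
p2.wait()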
