
Unpickle A Data Structure Vs. Build By Calling Readlines()

I have a use case where I need to build a list from the lines in a file. This operation will be performed potentially hundreds of times on a distributed network. I've been using the

Solution 1:

Would there be any performance increase if I did this?

Test it and see!

try:
    import cPickle as pickle  # faster C implementation on Python 2
except ImportError:
    import pickle
import timeit

def lines():
    with open('lotsalines.txt') as f:
        return f.readlines()

def pickles():
    with open('lotsalines.pickle', 'rb') as f:
        return pickle.load(f)

ds = lines()
with open('lotsalines.pickle', 'wb') as f:
    t = timeit.timeit(lambda: pickle.dump(ds, file=f, protocol=-1), number=1)
print('pickle.dump: {}'.format(t))

print('readlines:   {}'.format(timeit.timeit(lines, number=10)))
print('pickle.load: {}'.format(timeit.timeit(pickles, number=10)))

My 'lotsalines.txt' file is just that source duplicated until it's 655360 lines long, or 15532032 bytes.
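If you want to reproduce a test file like that, a minimal sketch is below. The function name and file paths are illustrative, not from the original post; it just duplicates a small seed file until it reaches a target line count.

```python
def make_big_file(seed_path, out_path, target_lines):
    """Duplicate the lines of seed_path into out_path until at least
    target_lines lines have been written. Returns the line count."""
    with open(seed_path) as f:
        seed = f.readlines()
    written = 0
    with open(out_path, 'w') as out:
        while written < target_lines:
            out.writelines(seed)
            written += len(seed)
    return written
```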

Apple Python 2.7.2:

readlines:   0.640027999878
pickle.load: 2.67698192596

And the pickle file is 19464748 bytes.

Python.org 3.3.0:

readlines:   1.5357899703085423
pickle.load: 1.5975534357130527

And it's 20906546 bytes.

So, Python 3 has sped up pickle quite a bit over Python 2, at least if you use pickle protocol 3, but it's still nowhere near as fast as a simple readlines. (And readlines has gotten a lot slower in 3.x, too.)
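The protocol matters because older pickle protocols produce larger, slower output; protocol=-1 (used in the benchmark above) selects the highest protocol available. A small sketch, using made-up sample data, to see the size difference per protocol:

```python
import pickle

# Illustrative data, roughly shaped like a list of file lines.
data = ['line %d\n' % i for i in range(1000)]

# Protocol 0 is ASCII and verbose; higher protocols are binary and
# more compact. pickle.HIGHEST_PROTOCOL is the best this interpreter supports.
for proto in range(pickle.HIGHEST_PROTOCOL + 1):
    blob = pickle.dumps(data, protocol=proto)
    print('protocol {}: {} bytes'.format(proto, len(blob)))
```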

But really, if you've got performance concerns, you should consider whether you need the list in the first place. A quick test shows that building a list of this size is almost half the cost of the readlines (timing list(range(655360)) in 3.x, list(xrange(655360)) in 2.x). And it uses a ton of memory (which is probably actually why it's slow, too). If you don't actually need the list—and usually you don't—just iterate over the file, getting lines as you need them.
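To make that last point concrete, here is a minimal sketch of processing a file lazily instead of materializing a list first. The function and its parameters are illustrative; the point is that iterating over the file object reads one buffered line at a time, so memory stays flat no matter how big the file is.

```python
def count_long_lines(path, min_len=10):
    """Count lines of at least min_len characters without ever
    holding the whole file in memory."""
    count = 0
    with open(path) as f:
        for line in f:  # lazily yields one line at a time
            if len(line) >= min_len:
                count += 1
    return count
```

Compare with `for line in open(path).readlines():`, which builds the entire list up front for no benefit when each line is only looked at once.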
