
classification
Title: gzip.GzipFile is slow
Components: Library (Lib)
Versions: Python 2.4

process
Status: closed
Resolution: fixed
Assigned To: bob.ippolito
Nosy List: akuchling, april, bob.ippolito, brett.cannon, jimjjewett, ronaldoussoren
Priority: low

Created on 2003-11-25 15:45 by ronaldoussoren, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (10)
msg19134 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2003-11-25 15:45
gzip.GzipFile is significantly (an order of magnitude) slower than using the gzip binary. I've been bitten by this several times, and have replaced "fd = gzip.open('somefile', 'r')" by "fd = os.popen('gzcat somefile', 'r')" on several occasions.

Would a patch that implemented GzipFile in C have any chance of being accepted?
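
A minimal sketch of that workaround, assuming a Unix system with a gzcat (or "gzip -dc") binary on the PATH; the file name and the processing loop are illustrative only:

import os

# Slow path: pure-Python decompression via gzip.GzipFile
# fd = gzip.open('somefile.gz', 'r')

# Workaround: let the external gzip binary do the decompression and
# read the already-decompressed text from a pipe
fd = os.popen('gzcat somefile.gz', 'r')
for line in fd:
    pass  # process each decompressed line here
fd.close()
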
msg19135 - (view) Author: Jim Jewett (jimjjewett) Date: 2003-11-25 17:35

Which compression level are you using?

It looks like most of the work is already done by zlib (which is in C), but GzipFile defaults to compression level 9.  Many other zips (including your gzcat?) default to a lower (but much faster) compression level.  
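
The compression level is chosen when the file is written; a short sketch of picking a lower level with the gzip module (file name and data are illustrative):

import gzip

# compresslevel defaults to 9 (smallest output, slowest);
# a lower level trades some compression ratio for write speed
out = gzip.open('somefile.gz', 'wb', compresslevel=6)
out.write('some text\n' * 1000)
out.close()
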
msg19136 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2003-11-25 21:03

The files are created using GzipFile. That speed is acceptable because it happens in a batch job; reading back is the problem, because that happens on demand while a user is waiting for the results.

gzcat is a *decompression* utility (specifically, it is "gzip -dc"), so the compression level is irrelevant for this discussion.

The Python code seems to do quite a bit of string manipulation; maybe that is causing the slowdown (I'm using fd.readline() in a fairly tight loop). I'll do some profiling to check what is taking so much time.

BTW, I'm doing this on Unix systems (Sun Solaris and Mac OS X).
msg19137 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2003-11-25 21:12

To be more precise:

$ ls -l gzippedfile
-rw-r--r--  1 ronald  admin  354581 18 Nov 10:21 gzippedfile

$ gzip -l gzippedfile
compressed  uncompr. ratio uncompressed_name
   354581   1403838  74.7% gzippedfile

The file contains about 45K lines of text (about 40 characters per line).

$ time gzip -dc gzippedfile >  /dev/null

real    0m0.100s
user    0m0.060s
sys     0m0.000s

$ python read.py gzippedfile > /dev/null

real    0m3.222s
user    0m3.020s
sys     0m0.070s

$ cat read.py
#!/usr/bin/env python

import sys
import gzip

fd = gzip.open(sys.argv[1], 'r')

ln = fd.readline()
while ln:
    sys.stdout.write(ln)
    ln = fd.readline()


The difference is also significant for larger files (i.e. the difference is not caused by different startup times).

msg19138 - (view) Author: Jim Jewett (jimjjewett) Date: 2003-11-25 22:05

In the library, I see a fair amount of work that doesn't really do anything except make sure you're getting exactly a line at a time.

Would it be an option to just read the whole file in at once, split it on newlines, and then loop over the list? (Or read it into a cStringIO, I suppose.)
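
A sketch of that read-everything approach, assuming the decompressed data fits comfortably in memory (the file name is illustrative):

import gzip

fd = gzip.open('somefile.gz', 'r')
# One big read lets zlib work on large chunks; splitting the result
# afterwards avoids the per-line bookkeeping done by readline()
data = fd.read()
fd.close()

for line in data.splitlines(True):  # True keeps the trailing newlines
    pass  # process each line here
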
msg19139 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2003-12-04 19:51

Looking at GzipFile.read and ._read, I think a large chunk of time is burned in the decompression of small chunks of data. It initially reads and decompresses 1024 bytes, and then, if that read did not hit EOF, it doubles the read size and continues until EOF is reached, then finishes up.

The problem is that for each read a call to _read is made that sets up a bunch of objects. I would not be surprised if the object creation and teardown are hurting the performance. I would also not be surprised if the reading of small chunks of data is part of the problem as well. This is all guesswork, though, since I did not run the profiler on this.

*But*, there might be a good reason for reading small chunks.  If 
you are decompressing a large file, you might run out of memory 
very quickly by reading the file into memory *and* decompressing 
at the same time.  Reading it in successively larger chunks means 
you don't hold the file's entire contents in memory at any one 
time.

So the question becomes whether causing your memory to get overloaded, with major thrashing on your swap space, is worth the performance increase. There is also the option of inlining _read into read(), but since read() calls it in two places, that seems like poor abstraction and would most likely not be accepted as a solution. It might be better to keep some temporary storage in an attribute for the objects that are used in every call to _read, and then delete the attribute once the reading is done. Or maybe allow an optional argument to read() that lets the caller specify the initial read size (which might also be a good way to see whether any of these ideas are reasonable: just modify the code to read the whole thing at once and measure from there).

I am in no position to make any of these calls, though, since I never use gzip. If someone cares to write up a patch to try to fix any of this, it will be considered.
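
A simplified sketch of the read pattern described above; this is not the actual gzip.py code, only the shape of the size-doubling loop (the real GzipFile signals EOF with an exception and keeps the data in internal buffers):

# Read a raw file object in exponentially growing chunks, starting small.
def read_in_growing_chunks(fileobj, initial=1024):
    chunks = []
    readsize = initial
    while True:
        data = fileobj.read(readsize)
        if not data:           # EOF reached
            break
        chunks.append(data)
        readsize *= 2          # double the chunk size for the next read
    return ''.join(chunks)
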
msg19140 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2003-12-23 17:10

It should be simple to check if the string operations are responsible 
-- comment out the 'self.extrabuf = self.extrabuf + data'
in _add_read_data.  If that makes a big difference, then _read 
should probably be building a list instead of modifying a string.
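
The suggestion, as a generic sketch: repeatedly extending a string copies the entire accumulated buffer on every step, while appending to a list and joining once at the end does far less copying. The chunk data below is a stand-in, not the module's real buffers:

chunks = ['x' * 1024] * 1000  # stand-in for successive decompressed blocks

# Current pattern: each assignment copies the whole accumulated buffer
buf = ''
for data in chunks:
    buf = buf + data

# Suggested pattern: collect the pieces and join them once at the end
parts = []
for data in chunks:
    parts.append(data)
buf2 = ''.join(parts)

assert buf == buf2
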
msg19141 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2003-12-28 16:25

Leaving out the assignment sure sped things up, but only because the input didn't contain lines anymore ;-)

I did an experiment where I replaced self.extrabuf by a list, but that actually slowed things down. This may be because there seemed to be very few chunks in the buffer (most of the time just 2).

According to profile.run('testit()') the function below spends about 
50% of its time in the readline method:

def testit():
    fd = gzip.open('testfile.gz', 'r')
    ln = fd.readline()
    cnt = bcnt = 0
    while ln:
        ln = fd.readline()
        cnt += 1
        bcnt += len(ln)
    print bcnt, cnt
    return bcnt,cnt

testfile.gz is a simple text file containing 40K lines of about 70 characters each.

Replacing the 'buffers' in readline by a string (instead of a list) 
slightly speeds things up (about 10%). 

Other experiments did not bring any improvement. Even writing a simple C function to split the buffer returned by self.read() didn't help much: splitline(strval, max) -> (match, rest), where match is strval up to the first newline and at most max characters, and rest is the remainder of strval.
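
A pure-Python sketch of the splitline helper just described; the C version is not part of the report, so this only mirrors the behaviour as stated (whether the newline itself belongs to match is an assumption here):

def splitline(strval, maxchars):
    # match: strval up to (and including) the first newline found within
    # the first maxchars characters, capped at maxchars; rest: the remainder
    end = strval.find('\n', 0, maxchars)
    if end < 0:
        end = maxchars
    else:
        end += 1  # include the newline in the match
    return strval[:end], strval[end:]

match, rest = splitline('first line\nsecond line\n', 100)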

msg19142 - (view) Author: April King (april) Date: 2005-05-04 16:18

readlines(X) is even worse, as all it does is call
readline() X times.

readline() is also biased towards files where each line is
less than 100 characters:

readsize = min(100, size)

So, if a line is longer than that, it calls read, which calls _read, and so on. I've found using popen to be roughly 20x faster than using the gzip module. That's pretty bad.
msg19143 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2007-01-05 14:42
Patch #1281707 improved readline() performance and has been applied.  I'll close this bug; please re-open if there are still performance issues.
History
Date                 User            Action  Args
2022-04-11 14:56:01  admin           set     github: 39601
2003-11-25 15:45:18  ronaldoussoren  create