classification
Title: gzip dies on gz files with many appended headers
Type: Stage:
Components: Library (Lib) Versions: Python 2.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: akuchling Nosy List: akuchling, eichin
Priority: normal Keywords:

Created on 2004-11-27 17:29 by eichin, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
make_gz_thing.py eichin, 2004-11-27 17:29 test case that demonstrates the bug
gzip-patch eichin, 2004-11-27 17:48 patch for GzipFile header-reading bug
Messages (4)
msg23346 - (view) Author: Mark Eichin (eichin) Date: 2004-11-27 17:29
One of the virtues of the gzip format is that a file can be reopened 
for append and the result is, as a whole, still valid.  This is 
accomplished by writing a new header on each reopen.  gzip.py (as 
tested on 2.1, 2.3, and a freshly built 2.4rc1) fails once a file has 
more than a certain number of appended headers.

The included test case generates (using gzip.py) such a file, runs 
gzip -tv on it to show that it is valid, and then tries to read it with 
gzip.py -- and it blows out, with 

OverflowError: long int too large to convert to int

in earlier releases, and with MemoryError in 2.4rc1.  What's going on 
is that gzip.GzipFile.read keeps doubling readsize and calling _read 
again; each _read call does call _read_gzip_header, but consumes only 
*one* header.  With readsize doubling on every call, older Pythons 
blow out because they don't autopromote past 2**32, and 2.4 blows out 
trying to call file.read with a huge value.  Basically, with more than 
30 or so headers, it fails.

The test case below is based on a real-world queueing case that 
generates over 200 appended headers - and isn't bounded in any 
useful way.  I'll think about ways to make GzipFile more clever, but 
I don't have a patch yet.
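
For readers without the attachment, here is a minimal sketch of the kind of file described above. It is not the attached make_gz_thing.py; the helper names, the record format, and the count of 200 are assumptions for illustration. It reopens one file for append many times, so each reopen adds a new gzip header, then reads the whole thing back. On a Python that includes the fix, the read should simply succeed.

```python
import gzip
import os
import tempfile


def make_appended_gz(path, members=200):
    """Reopen `path` for append `members` times (hypothetical helper).

    Every reopen in append mode writes a fresh gzip header, so the
    result is one file made of `members` concatenated gzip members.
    """
    for i in range(members):
        with gzip.open(path, "ab") as f:
            f.write(b"record %d\n" % i)


def read_all(path):
    """Read the whole multi-member file back in a single read() call."""
    with gzip.open(path, "rb") as f:
        return f.read()


path = os.path.join(tempfile.mkdtemp(), "appended.gz")
make_appended_gz(path)
lines = read_all(path).splitlines()
print(len(lines), lines[0], lines[-1])
```

Running `gzip -tv` on the generated file, as the original test case does, is a useful cross-check that the file itself is valid.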
msg23347 - (view) Author: Mark Eichin (eichin) Date: 2004-11-27 17:48
Oh, this is actually easy to fix: just clamp readsize.  After all, you don't 
*actually* want to try to read gigabyte chunks most of the time.  (The 
supplied patch allows one to override gzip.GzipFile.max_read_chunk if 
one really does.) Tested on 2.4rc1, and a version backported to 2.1 
works there too.
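
The clamping idea can be sketched with a toy reader. This is not the gzip.py code or the attached patch: the class, its `requested` list, and the 10 MiB cap are assumptions for illustration. Only the attribute name max_read_chunk comes from the message above.

```python
class ClampedReader:
    """Toy stand-in for a GzipFile-style read loop (hypothetical class)."""

    # Assumed cap of 10 MiB; override on the class or an instance if needed.
    max_read_chunk = 10 * 1024 * 1024

    def __init__(self, members):
        self._members = list(members)  # each item plays the role of one gzip member
        self.requested = []            # sizes passed to _read, kept for inspection

    def _read(self, size):
        # Mirrors the behaviour described in the report: one member per call.
        self.requested.append(size)
        if not self._members:
            raise EOFError
        return self._members.pop(0)

    def read(self):
        chunks = []
        readsize = 1024
        try:
            while True:
                chunks.append(self._read(readsize))
                # The fix: keep doubling, but never past max_read_chunk.
                readsize = min(self.max_read_chunk, readsize * 2)
        except EOFError:
            pass
        return b"".join(chunks)


reader = ClampedReader(b"x" for _ in range(200))
data = reader.read()
print(len(data), max(reader.requested))
```

Without the `min(...)` clamp, the requested size after 200 members would be 1024 * 2**200, which is the runaway growth the report describes.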
msg23348 - (view) Author: Mark Eichin (eichin) Date: 2004-11-27 23:28
Patch sent to patch-tracker as 1074381.
msg23349 - (view) Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2005-06-09 14:23
Patch applied to both HEAD and 2.4-maint branches; thanks!
History
Date User Action Args
2022-04-11 14:56:08 admin set github: 41236
2004-11-27 17:29:30 eichin create