
classification
Title: tarfile.py enhancements
Type: Stage:
Components: Library (Lib) Versions:
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: loewis Nosy List: lars.gustaebel, loewis, nnorwitz
Priority: normal Keywords: patch

Created on 2004-03-17 15:59 by lars.gustaebel, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
tarfile-patches.tar.gz lars.gustaebel, 2004-03-17 15:59 8GB-limit.patch, stream-detect-compr.patch
test.patch lars.gustaebel, 2004-07-21 12:55 testcase for the stream-detect-compr.patch
stream-detect-compr.patch lars.gustaebel, 2005-03-05 11:37 Updated version of stream-detect-compr.patch including the testcase.
Messages (8)
msg45584 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2004-03-17 15:59
I still develop tarfile.py sporadically on a separate
branch (http://www.gustaebel.de/lars/tarfile/), and so
there are two features from this branch that I'd like
to propose for inclusion in Python's tarfile.py:

1. Overcoming the 8GB file size limit (8GB-limit.patch)

At the moment it is not possible to add files to a tar
archive that exceed 8GB in size. Although this is POSIX
compliant, GNU tar offers an extension header for
large files that encodes file sizes in an 88-bit number
instead of the common 11-digit octal number. Like all
other GNU extensions in tarfile.py, this feature is
turned on and off using the TarFile.posix attribute.

2. Automatic compression detection for the stream
interface (stream-detect-compr.patch)

tarfile.py's stream interface (which can be used to
access tape devices or simply read a tar from stdin) is
a bit difficult to use because it's not able to detect
whether an archive is compressed or not. Compression
has to be explicitly specified using mode ("r|",
"r|gz", "r|bz2"). The patch introduces a fourth mode
"r|*" that makes automatic detection possible.
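
The size encoding described under item 1 can be sketched as follows. This is only an illustration, not the code from 8GB-limit.patch: it assumes the GNU convention of a 0x80 marker byte followed by an 11-byte (88-bit) big-endian integer, and the function names and the `posix` parameter are hypothetical stand-ins for the TarFile.posix switch.

```python
OCTAL_LIMIT = 8 ** 11  # 11 octal digits can hold sizes below 8 GiB


def encode_size_field(size: int, posix: bool = True) -> bytes:
    """Return a 12-byte tar header size field (illustrative sketch)."""
    if size < 0:
        raise ValueError("size must be non-negative")
    if size < OCTAL_LIMIT:
        # Classic form: 11 octal digits followed by a NUL terminator.
        return ("%011o" % size).encode("ascii") + b"\0"
    if posix:
        # Strict POSIX mode has no way to express larger sizes.
        raise ValueError("size does not fit into an 11-digit octal field")
    # Assumed GNU form: marker byte 0x80, then the size as an
    # 88-bit big-endian number (raises OverflowError beyond 88 bits).
    return b"\x80" + size.to_bytes(11, "big")


def decode_size_field(field: bytes) -> int:
    """Inverse of encode_size_field for both encodings."""
    if field[0] == 0x80:
        return int.from_bytes(field[1:], "big")
    return int(field.rstrip(b"\0 ").decode("ascii"), 8)
```

With `posix=False`, a size of exactly 8 GiB round-trips through the 12-byte field, while the default strict mode rejects it.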


Neither patch is vitally important, but the 8GB patch
in particular is useful IMO.
msg45585 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2004-07-20 22:28
I checked in the 8GB limit patch. Lib/tarfile.py 1.14.

I didn't check in the stream patch for 2 reasons:
1) I don't know the need.  Is this common?  I've never heard
of it.
2) The type parameter name was changed to comptype.  I wasn't
sure if this was necessary.  It potentially (albeit
unlikely) could break a program.  I'm not concerned about
changing the name of the attribute.

Lars, can you provide a good reason to add this part of the
patch?  If it's not likely to be used, I don't think it
should be added.  If it is added, there should also be a test.

Thanks.
msg45586 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2004-07-20 22:31
Lars, could you look at bug 949052 and provide any guidance?
 Thanks.
msg45587 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2004-07-21 07:54
tarfile.py's stream interface must be used if the user wants
to read an archive that is not a seekable file, e.g. stdin
or a tape device. ATM, it is the user's job to find out
whether the stream is compressed (mode="r|gz" or "r|bz2") or
uncompressed (mode="r|"), which makes the stream interface
kind of awkward and unusable for many users. The patch
introduces an additional mode "r|*" which does this job. I
admit it's just a convenience thing but I think the stream
interface is somehow too complicated without it.
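
A minimal sketch of what the "r|*" mode does for a caller, using in-memory buffers as a stand-in for stdin or a tape device. The helper names `build_archive` and `list_stream` are made up for this illustration, and the example assumes a tarfile module that already supports "r|*".

```python
import io
import tarfile


def build_archive(mode: str) -> bytes:
    """Create a one-member archive in memory; mode is "w:", "w:gz" or "w:bz2"."""
    payload = b"hello\n"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode) as tar:
        info = tarfile.TarInfo(name="hello.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def list_stream(data: bytes) -> list:
    """Read the archive front to back without naming the compression."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r|*") as tar:
        return [member.name for member in tar]


# The same reading code handles all three variants.
names = {mode: list_stream(build_archive(mode))
         for mode in ("w:", "w:gz", "w:bz2")}
```

Without "r|*", `list_stream` would need the caller to pass "r|", "r|gz" or "r|bz2" to match how the data was written.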

The reason why I changed the "type" argument to "comptype"
was just that the TarFile class uses "comptype" and the
_Stream class uses "type" for the same thing. It doesn't
need to be changed.

You're absolutely right about the testcase. I had enough time
to write one ;-)
msg45588 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2004-07-21 12:55
I just created tests for the stream-detect-compr.patch,
attached as test.patch.

BTW, I examined bug #949052, and opened a patch (#995126).
msg45589 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2005-03-04 19:58
Lars, the streaming patch is outdated. If you still think it
is necessary, please update the patch.

While I can understand what the feature "automatic
detection" does, I fail to see why you need a new syntax for
open. AFAICT, "r" is equivalent to the newly-proposed "r:*".
Why is it necessary to have two ways to spell the same thing?
msg45590 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2005-03-05 11:37
The asterisk notation is necessary only for the stream
interface. There are the three possible modes "r|", "r|gz"
and "r|bz2", and "r|*" is a placeholder for all of them
combined.
For symmetry reasons I thought I'd add the same thing to the
file interface as well. It also has these three modes "r:",
"r:gz" and "r:bz2", for which "r:*" could act as a wildcard.
Let's say "r:*" is the explicit version of "r".

I thought about something like the following example as a
use case:

def open_tar(filename, stream=False):
    mode = "r" + [":", "|"][stream] + "*"
    [...]
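
The relationship between "r", "r:*" and "r|*" can be checked with a short sketch like the one below. It assumes a tarfile module that includes both wildcard modes; the helper `member_names` and the file name sample.tar.gz are made up for this illustration.

```python
import io
import os
import tarfile
import tempfile


def member_names(path: str, mode: str) -> list:
    """Open the archive with the given read mode and return member names."""
    with tarfile.open(path, mode) as tar:
        return tar.getnames()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.tar.gz")
    payload = b"data"
    # Write a small gzip-compressed archive to read back.
    with tarfile.open(path, "w:gz") as tar:
        info = tarfile.TarInfo("sample.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

    implicit = member_names(path, "r")    # file interface, detection implied
    explicit = member_names(path, "r:*")  # file interface, detection spelled out
    streamed = member_names(path, "r|*")  # stream interface with detection
```

All three calls are expected to yield the same member list; only the access pattern differs between the file and stream interfaces.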

I have attached an updated patch including the test.
msg45591 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2005-03-05 12:48
Thanks for the patch and the explanation; committed as

libtarfile.tex 1.9
tarfile.py 1.27
test_tarfile.py 1.18
NEWS 1.1268
History
Date User Action Args
2022-04-11 14:56:03 admin set github: 40040
2004-03-17 15:59:38 lars.gustaebel create