classification
Title: [2.4 regression] seeking in codecs.reader broken
Type:
Stage:
Components: Extension Modules
Versions: Python 2.4

process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: doerwalter
Nosy List: doerwalter, doko, lemburg, smurf
Priority: high
Keywords:

Created on 2005-03-03 22:29 by doko, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name          Uploaded                       Description
diff.txt           doerwalter, 2005-03-04 11:44
codecs-seek.diff   doko, 2005-03-14 11:02         added doc string
Messages (8)
msg24449 - (view) Author: Matthias Klose (doko) * (Python committer) Date: 2005-03-03 22:29
[forwarded from https://bugzilla.ubuntu.com/show_bug.cgi?id=6972 ]

This is a regression; the following script (call as "scriptname some_textfile") fails.
It is obvious that the file starts with a number of random bytes from the previous run.

Uncommenting the two #XXX lines makes the bug go away. So does running it with Python 2.3.5.

import sys
import codecs
from random import random

data = codecs.getreader("utf-8")(open(sys.argv[1]))
df = data.read()
for t in range(30):
    #XXX data.seek(0,1)
    #XXX data.read()
    data.seek(0,0)
    dn=""
    for l in data:
        dn += l
        if random() < 0.1: break
    if not df.startswith(dn):
        print "OUCH",t
        print "BAD:", dn[0:100]
        print "GOOD:", df[0:100]
        sys.exit(1)

print "OK",len(df)
sys.exit(0)
msg24450 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2005-03-04 09:56
This is obviously related to the buffer logic that Walter added to support .readline().

In order to fix the problem, a .seek() method must be implemented that resets the buffers whenever called (before asking the stream to seek to the specified stream position).
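
A minimal sketch of the idea, not the actual codecs.py patch: the class name BufferedLineReader, its chunk_size parameter, and the charbuffer attribute are made up for illustration. It reads ahead in chunks to find line ends, so characters left over in the buffer must be discarded when seeking, otherwise they show up in front of the next read.

```python
import codecs
import io


class BufferedLineReader:
    """Hypothetical, simplified reader that buffers decoded characters."""

    def __init__(self, stream, encoding="utf-8", chunk_size=16):
        self.stream = stream
        self.decoder = codecs.getincrementaldecoder(encoding)()
        self.chunk_size = chunk_size
        self.charbuffer = ""  # decoded characters not yet handed out

    def readline(self):
        # Read ahead until the buffer holds a complete line or input ends.
        while "\n" not in self.charbuffer:
            chunk = self.stream.read(self.chunk_size)
            if not chunk:
                self.charbuffer += self.decoder.decode(b"", final=True)
                break
            self.charbuffer += self.decoder.decode(chunk)
        line, sep, rest = self.charbuffer.partition("\n")
        self.charbuffer = rest
        return line + sep

    def seek(self, offset, whence=0):
        # Drop buffered characters and decoder state first,
        # then ask the underlying stream to seek.
        self.charbuffer = ""
        self.decoder.reset()
        self.stream.seek(offset, whence)


reader = BufferedLineReader(io.BytesIO(b"first line\nsecond line\nthird line\n"))
first = reader.readline()
reader.seek(0)
again = reader.readline()
```

Without the two reset lines in seek(), the second readline() would start with the read-ahead left over from the first call, which matches the symptom in the original report.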
msg24451 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-03-04 11:44
How about the following patch? Unfortunately this breaks the codec in more obscure cases. Calling seek(0, 1) should have no effect, but with this patch it does. Maybe calling seek() should be prohibited? Calling a seek(1, 1) in a UTF-16 stream completely messes up the decoded text.
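
To illustrate the UTF-16 concern without involving a stream at all, here is a small sketch that decodes the same bytes once from an even offset and once from an odd one. The variable names are arbitrary.

```python
# "ABC" as UTF-16 little endian is b"A\x00B\x00C\x00".
data = "ABC".encode("utf-16-le")

# Decoding from a code-unit boundary gives the expected text.
aligned = data[0:4].decode("utf-16-le")

# Starting one byte later pairs each zero byte with the next letter,
# so every code unit decodes to an unrelated character.
shifted = data[1:5].decode("utf-16-le")
```

Here aligned is "AB", while shifted is "\u4200\u4300", two unrelated CJK characters.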
msg24452 - (view) Author: Matthias Urlichs (smurf) Date: 2005-03-08 13:20
Ahem -- seek(0,*whatever*) should still be allowed, whatever
else you do, please.

Reading UTF-16 from an odd position in a file isn't always
an error -- sometimes text is embedded in weird on-disk data
structures. As long as tell() returns something you can
seek() back to, nobody's got a right to complain -- file
position arithmetic in general is nonportable.
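
A sketch of the use case described above, assuming a made-up layout with a 3-byte binary header followed by UTF-16-LE text. The position comes from tell() on the underlying stream before the reader is created, and the behaviour shown is that of a current Python 3 codecs.StreamReader, whose seek() resets the reader.

```python
import codecs
import io

header = b"\x01\x02\x03"                  # hypothetical 3-byte binary header
payload = "hi there".encode("utf-16-le")  # text starts at an odd offset
stream = io.BytesIO(header + payload)

# Position the raw stream at the start of the text and remember it.
stream.seek(len(header))
start = stream.tell()

reader = codecs.getreader("utf-16-le")(stream)
text = reader.read()

# Seeking back to a position obtained from tell() and rereading
# gives the same text.
reader.seek(start)
text_again = reader.read()
```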
msg24453 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2005-03-08 14:00
Walter: the patch looks good. Please also add a doc-string mentioning the resetting of the codec in case .seek() is used.

Whether .seek() causes a mess or not is not within the responsibility of the codec. It's an application space decision to make; otherwise we would have to introduce the notion of seeking code points (rather than bytes), which I'd rather not do since this can break existing applications in many ways.
msg24454 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-03-08 14:40
OK, I'll check in the patch at the beginning of next week
(I'm currently away from CVS).
msg24455 - (view) Author: Matthias Klose (doko) * (Python committer) Date: 2005-03-14 11:02
added doc string
msg24456 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-03-14 19:22
Checked in as:
Lib/codecs.py 1.39
Lib/encodings/utf_16.py 1.6
Lib/test/test_codecs.py 1.21
I've also added a reset() method to the UTF-16 reader that
resets the decode method as well as a test in test_codecs.py.
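
As a rough illustration of why the UTF-16 reader needs this: it picks its byte order from the BOM on the first decode, so after seeking back to the start it has to forget that choice and detect the BOM again. The sketch below checks the resulting behaviour on a current Python 3; it does not reproduce the 2005 patch.

```python
import codecs
import io

# encode("utf-16") prepends a BOM in the platform's native byte order.
raw = "héllo\nworld\n".encode("utf-16")
reader = codecs.getreader("utf-16")(io.BytesIO(raw))

first_pass = reader.read()

# seek() resets the reader, so the BOM is detected and skipped again
# instead of being decoded as part of the text.
reader.seek(0)
second_pass = reader.read()
```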

Backported to release24-maint as:
Lib/codecs.py 1.35.2.4
Lib/encodings/utf_16.py 1.5.2.1
Lib/test/test_codecs.py 1.15.2.3

(Here the test is implemented differently, because the 2.4
branch doesn't have a BasicUnicodeTest test case in
test_codecs.py)
History
Date                  User   Action  Args
2022-04-11 14:56:10   admin  set     github: 41648
2005-03-03 22:29:53   doko   create