This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: codecs.open and iterators
Type:
Stage:
Components: Library (Lib)
Versions: Python 2.2

process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To: lemburg
Nosy List: doerwalter, facundobatista, lemburg, toddreed
Priority: normal
Keywords:

Created on 2003-03-19 23:02 by toddreed, last changed 2022-04-10 16:07 by admin. This issue is now closed.

Messages (5)
msg15212 - Author: Todd Reed (toddreed) Date: 2003-03-19 23:02
Greg Aumann originally posted this problem in comp.lang.python on
Nov 4, 2002, but I could not find a bug report.  I've simply copied
his news post, which explains the problem:
-----------
Recently I figured out how to use iterators and generators. Quite easy
to use and a great improvement.

But when I refactored some of my code I came across a discrepancy that
seems like it must be a bug. If you use the old file reading idiom with
a codec the lines are converted to unicode but if you use the new
iterators idiom then they retain the original encoding and the line is
returned in non unicode strings. Surely using the new "for line in
file:" idiom should give the same result as the old, "while 1: ...."

I came across this when using the pythonzh Chinese codecs but the below
code uses the cp1252 encoding to illustrate the problem because
everyone should have those codecs. The symptoms are the same with both
codecs.

I am using python 2.2.2 on win2k.

Is this definitely a bug, or is it an undocumented 'feature' of the
codecs module?

Greg Aumann

The following code illustrates the problem:
------------------------------------------------------------------------
"""Check readline iterator using a codec."""

import codecs

fname = 'tmp.txt'
f = file(fname, 'w')
for i in range(0x82, 0x8c):
    f.write( '%x, %s\n' % (i, chr(i)))
f.close()

def test_iter():
    print '\ntesting codec iterator.'
    f = codecs.open(fname, 'r', 'cp1252')
    for line in f:
        l = line.rstrip()
        print repr(l)
        print repr(l.decode('cp1252'))
    f.close()

def test_readline():
    print '\ntesting codec readline.'
    f = codecs.open(fname, 'r', 'cp1252')
    while 1:
        line = f.readline()
        if not line:
            break
        l = line.rstrip()
        print repr(l)
        try:
            print repr(l.decode('cp1252'))
        except AttributeError, msg:
            print 'AttributeError', msg
    f.close()

test_iter()
test_readline()
------------------------------------------------------------------------
This code gives the following output:
------------------------------------------------------------------------
testing codec iterator.
'82, \x82'
u'82, \u201a'
'83, \x83'
u'83, \u0192'
'84, \x84'
u'84, \u201e'
'85, \x85'
u'85, \u2026'
'86, \x86'
u'86, \u2020'
'87, \x87'
u'87, \u2021'
'88, \x88'
u'88, \u02c6'
'89, \x89'
u'89, \u2030'
'8a, \x8a'
u'8a, \u0160'
'8b, \x8b'
u'8b, \u2039'

testing codec readline.
u'82, \u201a'
AttributeError 'unicode' object has no attribute 'decode'
u'83, \u0192'
AttributeError 'unicode' object has no attribute 'decode'
u'84, \u201e'
AttributeError 'unicode' object has no attribute 'decode'
u'85, \u2026'
AttributeError 'unicode' object has no attribute 'decode'
u'86, \u2020'
AttributeError 'unicode' object has no attribute 'decode'
u'87, \u2021'
AttributeError 'unicode' object has no attribute 'decode'
u'88, \u02c6'
AttributeError 'unicode' object has no attribute 'decode'
u'89, \u2030'
AttributeError 'unicode' object has no attribute 'decode'
u'8a, \u0160'
AttributeError 'unicode' object has no attribute 'decode'
u'8b, \u2039'
AttributeError 'unicode' object has no attribute 'decode'
------------------------------------------------------------------------
msg15213 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2003-03-20 09:35
That's a bug in the iterator support which was added
to the codecs module: the .next() methods should not
call the .next() methods on the reader directly, but instead
redirect to the .readline() method.
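
The redirect described above can be sketched as follows. This is only an
illustration in present-day Python, not the actual codecs.py change: the
class name DecodingLineIterator is hypothetical, and the in-memory byte
stream stands in for the file from the report.

```python
import codecs
import io


class DecodingLineIterator:
    """Hypothetical wrapper whose iterator goes through readline()."""

    def __init__(self, stream, encoding):
        # codecs.getreader() returns the StreamReader class for the encoding.
        self.reader = codecs.getreader(encoding)(stream)

    def __iter__(self):
        return self

    def __next__(self):
        # Delegate to the decoding readline() instead of iterating over
        # the underlying byte stream directly.
        line = self.reader.readline()
        if not line:
            raise StopIteration
        return line


# Two of the cp1252 lines from the original test file.
raw = io.BytesIO(b"82, \x82\n83, \x83\n")
lines = [line.rstrip() for line in DecodingLineIterator(raw, "cp1252")]
print(lines)
```

Because every line passes through readline(), iteration yields decoded
text rather than the raw bytes of the underlying stream.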
msg15214 - Author: Facundo Batista (facundobatista) * (Python committer) Date: 2005-01-15 17:38
Could you please verify whether this problem persists in Python 2.3.4
or 2.4?

If yes, in which version? Can you provide a test case?

If the problem is solved, from which version?

Note that if you fail to answer in one month, I'll close this bug
as "Won't fix".

Thank you! 

.    Facundo
msg15215 - Author: Facundo Batista (facundobatista) * (Python committer) Date: 2005-01-15 17:38
I cannot test it so far; all I get is:

testing codec iterator.
u'82, \u201a'

Traceback (most recent call last):
  ...
  File "C:\Python24\lib\encodings\cp1252.py", line 22, in decode
    return codecs.charmap_decode(input,errors,decoding_map)
UnicodeEncodeError: 'ascii' codec can't encode character
u'\u201a' in position 4: ordinal not in range(128)

I'm on Win2k, sp2, with Py2.4
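
A likely reading of this traceback: in Python 2.4 the iterator already
returns unicode, and calling .decode('cp1252') on a unicode object first
encodes it implicitly with the default ASCII codec, which fails on
u'\u201a'. The sketch below reproduces only that implicit ASCII step in
present-day Python, where it has to be written out explicitly; it is an
illustration, not the code path from the traceback.

```python
# The first decoded line from the test file: "82, " followed by U+201A.
decoded_line = "82, \u201a"

error = None
try:
    # Stand-in for the implicit ASCII encode that Python 2 performed
    # before it could "decode" a unicode object a second time.
    decoded_line.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc

# The failing character sits at index 4, right after "82, ".
print(error.encoding, error.start)
```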
msg15216 - Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-01-17 20:55
Using Python 2.4 on Windows the test works perfectly if the
broken code that tries to decode the unicode again via
cp1252 (i.e. "print repr(l.decode('cp1252'))") is removed.
Here is the output:

testing codec iterator.
u'82, \u201a'
u'83, \u0192'
u'84, \u201e'
u'85, \u2026'
u'86, \u2020'
u'87, \u2021'
u'88, \u02c6'
u'89, \u2030'
u'8a, \u0160'
u'8b, \u2039'

testing codec readline.
u'82, \u201a'
u'83, \u0192'
u'84, \u201e'
u'85, \u2026'
u'86, \u2020'
u'87, \u2021'
u'88, \u02c6'
u'89, \u2030'
u'8a, \u0160'
u'8b, \u2039'

So the output is properly decoded unicode.

This bug has been fixed in codecs.py 1.28 (which went into
Python 2.3).

Closing the bug report.
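
For anyone who wants to re-check the fixed behaviour on a current
interpreter, here is a minimal sketch. It assumes Python 3 and uses
codecs.getreader() over a binary file as a stand-in for codecs.open; the
helper name open_reader is hypothetical and the temporary file replaces
tmp.txt from the report.

```python
import codecs
import os
import tempfile

# Bytes 0x82..0x8b, one per line, mirroring the original test file.
payload = b"".join(
    ("%x, " % i).encode("ascii") + bytes([i]) + b"\n"
    for i in range(0x82, 0x8c)
)

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(payload)


def open_reader(filename):
    # Hypothetical helper: a cp1252 StreamReader over the raw file.
    return codecs.getreader("cp1252")(open(filename, "rb"))


# New idiom: iterate over the reader.
with open_reader(path) as f:
    iterated = [line.rstrip() for line in f]

# Old idiom: call readline() until it returns an empty string.
via_readline = []
with open_reader(path) as f:
    while True:
        line = f.readline()
        if not line:
            break
        via_readline.append(line.rstrip())

os.remove(path)
print(iterated == via_readline)
```

Both idioms should produce the same ten decoded lines, matching the
output listed in the message above.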
History
Date                 User      Action  Args
2022-04-10 16:07:47  admin     set     github: 38188
2003-03-19 23:02:09  toddreed  create