classification
Title: read() / readline() blow up if file has even number of char.
Type:
Stage:
Components: Unicode
Versions: Python 2.4
process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To: lemburg
Nosy List: doerwalter, georg.brandl, lemburg, superwesman
Priority: normal
Keywords:

Created on 2005-12-09 21:43 by superwesman, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (7)
msg27018 - (view) Author: superwesman (superwesman) Date: 2005-12-09 21:43
Hello, I am having a problem with the read() and
readline() functions.  I'm using codecs.open() to open
a text file, then using either read() or readline() to
get its contents.  In Python 2.4.2, if the file has an
even number of characters, I get a UnicodeDecodeError.
In Python 2.4.1 this works regardless of the character
count.  I've pasted below a sample script and the
sample text file I was running.  This is the command I
executed at the Windows 2000 CMD prompt:

python sample.py sample.txt

Again, in 2.4.1, this works fine - in 2.4.2 it breaks
when the file-to-be-read has an odd number of characters.

Thanks.
-w

# start: sample.py

import codecs
import sys

print "open the file"
in_file = codecs.open( sys.argv[1], "r",
"unicode_internal" )
print "read the file"
the_file = in_file.read()
print "close the file"
in_file.close()
print "done"

# end: sample.py

# start: sample.txt
RESULTHOST=vivaldi
RESULTPORT=a
DB_XML=/test/art/jfw/config/DBList.xml
LOGCHECK_IGNORE=art_actions.txt

# end: sample.txt
msg27019 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2005-12-09 22:04
Why would you want to read a file using the Python internal
Unicode encoding (unicode_internal) ?

This is an encoding that is only used Python internally and
should not be used for anything else.
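
For ordinary text files, the usual alternative is to name the encoding the file was actually written in. The following is a minimal sketch, assuming the sample file is plain ASCII/UTF-8; `read_text` and the temporary stand-in file are hypothetical names used only for illustration, and `io.open` is used here in place of `codecs.open` (both accept an encoding name).

```python
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    # Decode with an explicit interchange encoding rather than an
    # interpreter-specific one. "utf-8" is an assumed default.
    with io.open(path, "r", encoding=encoding) as in_file:
        return in_file.read()


# Write a small stand-in for sample.txt (hypothetical content subset).
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out_file:
    out_file.write(b"RESULTHOST=vivaldi\nRESULTPORT=a\n")

contents = read_text(sample_path)
os.remove(sample_path)
print(contents)
```

With an encoding that matches the file, the number of bytes in the file no longer matters to the decoder.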
msg27020 - (view) Author: superwesman (superwesman) Date: 2005-12-09 23:17
I didn't realize that 'unicode_internal' was not a
legitimate value to pass into this function.  If
'unicode_internal' is not a valid 3rd parameter to
codecs.open(), shouldn't that function complain?  If it is a
valid option (that should only be used "Python internally" -
not sure what that means) then it should perform
consistently regardless of the number of characters in the
file, should it not?

Seems to me that pilot-error uncovered a bug.  If this is
not a valid choice, then codecs.open() should complain.  If
it is valid, it should perform consistently, IMHO.
msg27021 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2005-12-10 10:57
I'd suggest removing unicode_internal from the docs.
msg27022 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-12-12 13:30
With the Python 2.4.2 I get the following output both on
Linux and Windows:

open the file
read the file
close the file
done

This is totally independent of the type of line feeds in
sample.txt or the length of the file (even or odd).

> If it is a valid option (that should only be used
> "Python internally" - not sure what that means)
> then it should perform consistently regardless
> of the number of characters in the file, should it not?

unicode_internal just dumps the data bytes of the Unicode
object. This means that (depending on the way Python is
compiled) the length of a unicode_internal encoded byte
string will always be a multiple of 2 or 4. So a byte string
that has an odd number of bytes clearly is broken and
decoding would have the right to complain about that. In
2.4.2 it doesn't, because it's not clear to the StreamReader
API if there's more data available on subsequent calls to
read() (and the last odd byte is silently dropped).
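
The length argument and the streaming ambiguity can be sketched as follows. This is an illustration under stated assumptions, not the 2.4.2 implementation: `utf-16-le` and `utf-32-le` stand in for the 2-byte and 4-byte internal layouts on a little-endian build, since `unicode_internal` itself is not available in current Python versions.

```python
import codecs

text = u"RESULTPORT=a"

# Stand-ins for the two internal layouts (assumed little-endian):
# utf-16-le uses 2 bytes per code unit, utf-32-le 4 bytes per code point.
narrow = text.encode("utf-16-le")
wide = text.encode("utf-32-le")
print(len(narrow) % 2, len(wide) % 4)

# Drop one byte to get an odd-length, clearly broken byte string.
broken = narrow[:-1]

# One-shot decoding knows the input is complete, so it can complain.
try:
    broken.decode("utf-16-le")
    one_shot_failed = False
except UnicodeDecodeError:
    one_shot_failed = True

# An incremental decoder cannot tell whether more bytes will follow,
# so with final=False it holds the dangling byte back without raising.
decoder = codecs.getincrementaldecoder("utf-16-le")()
partial = decoder.decode(broken, final=False)
print(one_shot_failed, partial == text[:-1])
```

The incremental case mirrors the situation described above: a reader that is never told the stream has ended has no point at which to report the leftover byte.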

BTW, the data read by your script is probably not what you
might have expected. On a UCS-2 build the result is:

u'\u2023\u7473\u7261\u3a74\u7320\u6d61\u6c70\u2e65\u7874\u0a74\u4552\u5553\u544c\u4f48\u5453\u763d\u7669\u6c61\u6964\u520a\u5345\u4c55\u5054\u524f\u3d54\u0a61\u4244\u585f\u4c4d\u2f3d\u6574\u7473\u612f\u7472\u6a2f\u7766\u632f\u6e6f\u6966\u2f67\u4244\u694c\u7473\u782e\u6c6d\u4c0a\u474f\u4843\u4345\u5f4b\u4749\u4f4e\u4552\u613d\u7472\u615f\u7463\u6f69\u736e\u742e\u7478'

(or something similar depending on your line feeds).
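
The start of that result can be reproduced by pairing up the file's ASCII bytes into 16-bit code units. A small sketch, assuming a little-endian UCS-2 layout (approximated here with `utf-16-le`) and that the file began with the `# start: sample.txt` line followed by a single `\n`:

```python
# First line of the file as raw bytes (20 bytes, so 10 code units).
raw = b"# start: sample.txt\n"

# Each pair of ASCII bytes is read as one little-endian 16-bit unit,
# e.g. '#' (0x23) and ' ' (0x20) become U+2023.
misread = raw.decode("utf-16-le")
print([hex(ord(ch)) for ch in misread])

# Decoding with the encoding the file was written in gives the text back.
intended = raw.decode("ascii")
print(intended)
```

The ten code units produced this way match the first ten of the quoted result (`\u2023\u7473\u7261...\u0a74`).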
msg27023 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2005-12-12 13:39
Closing this bug report as "won't fix" (even though SF seems
to have removed this option from the tracker, or at least I
don't see it in Firefox).

Removing "unicode_internal" from the docs is not an option:
this is a valid encoding, albeit one that depends on the way
Python is built.
msg27024 - (view) Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-12-12 14:39
Strange, Firefox seems to have some layout problems. The
"Resolution" box has moved way to the right.
History
Date                 User         Action  Args
2022-04-11 14:56:14  admin        set     github: 42674
2005-12-09 21:43:57  superwesman  create