Issue 1377394: read() / readline() blow up if file has even number of char.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/42674

classification

Title:	read() / readline() blow up if file has even number of char.
Type:		Stage:
Components:	Unicode	Versions:	Python 2.4

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:	lemburg	Nosy List:	doerwalter, georg.brandl, lemburg, superwesman
Priority:	normal	Keywords:

Created on 2005-12-09 21:43 by superwesman, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (7)
msg27018 - (view)	Author: superwesman (superwesman)	Date: 2005-12-09 21:43
Hello, I am having a problem with the read() and readline() functions. I'm using codecs.open() to open a text file, then using either read() or readline() to get its contents. In Python 2.4.2, if the file has an even number of characters, I get a UnicodeDecodeError. In Python 2.4.1 this works regardless of the character count. I've pasted below a sample script and the sample text file I was running.

This is the command I executed at the Windows 2000 CMD prompt:

    python sample.py sample.txt

Again, in 2.4.1, this works fine - in 2.4.2 it breaks when the file-to-be-read has an odd number of characters.

Thanks.
-w

# start: sample.py
import codecs
import sys

print "open the file"
in_file = codecs.open( sys.argv[1], "r", "unicode_internal" )
print "read the file"
the_file = in_file.read()
print "close the file"
in_file.close()
print "done"
# end: sample.py

# start: sample.txt
RESULTHOST=vivaldi
RESULTPORT=a
DB_XML=/test/art/jfw/config/DBList.xml
LOGCHECK_IGNORE=art_actions.txt
# end: sample.txt
msg27019 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2005-12-09 22:04
Why would you want to read a file using the Python internal Unicode encoding (unicode_internal)? This is an encoding that is only used Python internally and should not be used for anything else.
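For a plain text file like sample.txt, a real text encoding is probably what was intended. The following is a minimal sketch only, assuming a current Python 3 and an ASCII file; the helper name `read_text` and the shortened sample data are hypothetical and not part of this report. codecs.open() takes the encoding name as its third argument in the same way.

```python
import os
import tempfile

# Shortened stand-in for sample.txt (assumption: the file is plain ASCII).
SAMPLE = "RESULTHOST=vivaldi\nRESULTPORT=a\n"


def read_text(path, encoding="ascii"):
    """Read a whole text file using an explicit, real text encoding."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    # Write the sample with fixed "\n" line endings so the byte count is stable.
    with open(path, "w", encoding="ascii", newline="\n") as handle:
        handle.write(SAMPLE)
    # The length of the file (even or odd) plays no role with this codec.
    contents = read_text(path)
```

With a byte-oriented codec such as ASCII, each byte maps to one character, so an odd or even file size makes no difference.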
msg27020 - (view)	Author: superwesman (superwesman)	Date: 2005-12-09 23:17
I didn't realize that 'unicode_internal' was not a legitimate value to pass into this function. If 'unicode_internal' is not a valid 3rd parameter to codecs.open(), shouldn't that function complain? If it is a valid option (that should only be used "Python internally" - not sure what that means) then it should perform consistently regardless of the number of characters in the file, should it not? Seems to me that pilot-error uncovered a bug. If this is not a valid choice, then codecs.open() should complain. If it is valid, it should perform consistently, IMHO.
msg27021 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2005-12-10 10:57
I'd suggest removing unicode_internal from the docs.
msg27022 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2005-12-12 13:30
With Python 2.4.2 I get the following output both on Linux and Windows:

open the file
read the file
close the file
done

This is totally independent of the type of line feeds in sample.txt or the length of the file (even or odd).

> If it is a valid option (that should only be used
> "Python internally" - not sure what that means)
> then it should perform consistently regardless
> of the number of characters in the file, should it not?

unicode_internal just dumps the data bytes of the Unicode object. This means that (depending on the way Python is compiled) the length of a unicode_internal encoded byte string will always be a multiple of 2 or 4. So a byte string that has an odd number of bytes is clearly broken, and decoding would have the right to complain about that. In 2.4.2 it doesn't, because it's not clear to the StreamReader API if there's more data available on subsequent calls to read() (and the last odd byte is silently dropped).

BTW, the data read by your script is probably not what you might have expected. On a UCS-2 build the result is:

u'\u2023\u7473\u7261\u3a74\u7320\u6d61\u6c70\u2e65\u7874\u0a74\u4552\u5553\u544c\u4f48\u5453\u763d\u7669\u6c61\u6964\u520a\u5345\u4c55\u5054\u524f\u3d54\u0a61\u4244\u585f\u4c4d\u2f3d\u6574\u7473\u612f\u7472\u6a2f\u7766\u632f\u6e6f\u6966\u2f67\u4244\u694c\u7473\u782e\u6c6d\u4c0a\u474f\u4843\u4345\u5f4b\u4749\u4f4e\u4552\u613d\u7472\u615f\u7463\u6f69\u736e\u742e\u7478'

(or something similar depending on your line feeds).
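The byte pairing described above can be reproduced roughly as follows. This is a sketch under stated assumptions, not the 2.4 implementation: it uses the `utf-16-le` codec on a current Python 3 as a stand-in for unicode_internal on a little-endian UCS-2 build, and the `b"abc"` input is made up for illustration. One difference to keep in mind: the stand-in raises once it is told the input is finished, whereas the 2.4.2 reader is reported above to drop the odd byte silently.

```python
import codecs

# First line of the sample data as raw ASCII bytes (20 bytes, an even count).
raw = b"# start: sample.txt\n"

# Stand-in for unicode_internal on a little-endian UCS-2 build (assumption):
# every two bytes are combined into one 16-bit code unit.
paired = raw.decode("utf-16-le")

# An incremental decoder cannot know whether more data will follow, so it
# holds back a trailing odd byte instead of failing right away.
decoder = codecs.getincrementaldecoder("utf-16-le")()
partial = decoder.decode(b"abc")  # b"ab" is decoded, b"c" stays buffered

# Only when told that the input is finished does the leftover byte
# turn into an error.
try:
    decoder.decode(b"", final=True)
    truncated_error = False
except UnicodeDecodeError:
    truncated_error = True
```

Here `paired` starts with the same code units as the result quoted above, since `#` and the following space combine into U+2023, `st` into U+7473, and so on.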
msg27023 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2005-12-12 13:39
Closing this bug report as "won't fix" (even though SF seems to have removed this option from the tracker, or at least I don't see it in Firefox). Removing "unicode_internal" from the docs is not an option: this is a valid encoding, albeit one that depends on the way Python is built.
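To illustrate the build dependence, here is a small sketch that assumes `utf-16-le` and `utf-32-le` on a current Python 3 as stand-ins for the internal layout of a UCS-2 and a UCS-4 build respectively (little-endian in both cases). The real unicode_internal codec used the machine's native byte order and no longer exists in recent Python 3 releases.

```python
# One line from the sample file, 12 characters long.
text = "RESULTPORT=a"

# Assumed stand-in for a UCS-2 (narrow) build: 2 bytes per character.
narrow = text.encode("utf-16-le")

# Assumed stand-in for a UCS-4 (wide) build: 4 bytes per character.
wide = text.encode("utf-32-le")

# The same text gives different byte strings, so data written on one
# kind of build cannot be read back reliably on the other.
same_bytes = narrow == wide[:len(narrow)]
```

The encoded length is always a multiple of the code unit size, which is why an odd-length byte string cannot be valid input on either kind of build.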
msg27024 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2005-12-12 14:39
Strange, Firefox seems to have some layout problems. The "Resolution" box has moved way to the right.

History
Date	User	Action	Args
2022-04-11 14:56:14	admin	set	github: 42674
2005-12-09 21:43:57	superwesman	create