Issue 844561: codecs.open().readlines(sizehint) bug

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/39564

classification

Title:	codecs.open().readlines(sizehint) bug
Type:		Stage:
Components:	Unicode	Versions:	Python 2.2

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	lemburg	Nosy List:	jepler, lemburg
Priority:	low	Keywords:

Created on 2003-11-18 17:22 by jepler, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
codecs_readlines_bug.py	jepler, 2003-11-18 17:22	Counts lines wrong with codecs.open()

Messages (8)
msg19029 - (view)	Author: Jeff Epler (jepler)	Date: 2003-11-18 17:22
codecs.open().readlines(sizehint) can return truncated lines. The attached script, which uses readlines(sizehint) to count the number of lines in a file, demonstrates the problem. Correct output would be 1000 in both cases, but different values are returned depending on sizehint because of the truncated lines.
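The attached script is not reproduced in this thread. As a rough illustration only, a line counter of the kind described might look like the sketch below, written for Python 3. The helper names `count_lines` and `demo` and the generated test data are assumptions, and `codecs.getreader("utf-8")` stands in for `codecs.open()`. On current Python versions both readers should return whole lines, so both counts are expected to agree.

```python
import codecs
import os
import tempfile

def count_lines(stream, sizehint):
    """Count lines by calling readlines(sizehint) until it returns nothing."""
    count = 0
    while True:
        lines = stream.readlines(sizehint)
        if not lines:
            break
        # Every item must be a whole line; a truncated item is the bug reported here.
        assert all(line.endswith("\n") for line in lines)
        count += len(lines)
    return count

def demo(sizehint, n=1000):
    """Return (built-in open() count, codecs reader count) for n generated lines."""
    # Hypothetical test data: n short lines, each with one multi-byte character.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            text = "".join("line %d \u00e9\n" % i for i in range(n))
            f.write(text.encode("utf-8"))
        with open(path, encoding="utf-8") as builtin_file:
            builtin_count = count_lines(builtin_file, sizehint)
        with open(path, "rb") as raw:
            reader = codecs.getreader("utf-8")(raw)
            codecs_count = count_lines(reader, sizehint)
        return builtin_count, codecs_count
    finally:
        os.remove(path)

if __name__ == "__main__":
    for hint in (10, 4096):
        print(hint, demo(hint))
```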
msg19030 - (view)	Author: Jeff Epler (jepler)	Date: 2003-11-18 17:28
The script triggers the assertion error on at least Python 2.3.2 (locally compiled) and Python 2.2.2 (Red Hat 9 RPM).
msg19031 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2004-02-25 23:04
It's hard to say whether this is a bug or not. The sizehint argument is not well documented, and the way you use it does not look like a proper way to use it. From the docs: """ If the optional sizehint argument is present, instead of reading up to EOF, whole lines totalling approximately sizehint bytes (possibly after rounding up to an internal buffer size) are read. """ In your example the underlying open() implementation seems to round up the sizehint value to include the whole line, while the codecs.open() version will only read sizehint bytes without any rounding (see the codecs.py implementation).
msg19032 - (view)	Author: Jeff Epler (jepler)	Date: 2004-02-26 01:14
To me, the phrase "whole lines totalling approximately sizehint" means that no item from readlines(sizehint) will be an incomplete line. I don't understand why this requirement isn't clearly indicated to you by the text you included in your comments.
msg19033 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2004-02-26 09:51
Good catch. I must have overlooked the "whole lines" bit :-) In that case, it's probably best to have .readlines() ignore the sizehint argument altogether. An efficient implementation is hard to do since the line breaking is not done at C level, but only after the data has been read.
msg19034 - (view)	Author: Jeff Epler (jepler)	Date: 2004-02-26 14:50
Ignoring sizehint and reading the whole file is probably better than truncating lines. This change would also fix another bug that I realized currently exists in the codecs readlines(sizehint): if it reads only part of a multi-byte character, you get a decoding error... A slightly more complicated approach would be to read sizehint bytes and then, while the result doesn't end in a newline, read one more byte and decode again. When sizehint is large enough, doing byte-at-a-time reading of the last half-line shouldn't be that bad for performance. No, I don't have a patch. Is there a way to differentiate between "the byte string ends with an incomplete multi-byte character" and "the byte string contains an invalid sequence of bytes"?
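The byte-at-a-time idea above could be sketched roughly as follows in Python 3. This is not a patch that was applied to codecs.py. The function name `read_whole_lines` is hypothetical, and the sketch relies on `codecs.getincrementaldecoder`, which was added to Python after this discussion, so that a multi-byte character split across reads is buffered instead of raising a decoding error.

```python
import codecs
import io

def read_whole_lines(raw, decoder, sizehint):
    """Read about sizehint bytes from raw, then extend byte by byte to a line end."""
    text = decoder.decode(raw.read(sizehint))
    while not text.endswith("\n"):
        byte = raw.read(1)
        if not byte:
            # EOF: flush the decoder; this raises if a character is left incomplete.
            text += decoder.decode(b"", final=True)
            break
        text += decoder.decode(byte)
    return text.splitlines(keepends=True)

def count_lines(data, sizehint, encoding="utf-8"):
    """Count the lines in data using repeated read_whole_lines() calls."""
    raw = io.BytesIO(data)
    # The decoder must persist across calls so buffered partial characters carry over.
    decoder = codecs.getincrementaldecoder(encoding)()
    count = 0
    while True:
        lines = read_whole_lines(raw, decoder, sizehint)
        if not lines:
            return count
        count += len(lines)

if __name__ == "__main__":
    sample = "".join("line %d \u00e9\n" % i for i in range(1000)).encode("utf-8")
    for hint in (7, 100, 4096):
        print(hint, count_lines(sample, hint))
```

With a small sizehint most of each line is read one byte at a time, which matches the performance caveat in the message: the approach only pays off when sizehint is large relative to the line length.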
msg19035 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2004-02-26 15:20
Ok, I'll fix codecs.py to ignore the sizehint argument then (should not break any code; at worst it might cause problems with MemoryOverflows). To answer your question: whether a byte string is incomplete or in error depends on the encoding and only the codec can decide what to do. While the codecs do differentiate and the error callback logic could be used to work out a correct solution, this would require a lot of work.
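For readers on later Python versions, the distinction asked about can be illustrated with the incremental decoder API, which did not exist when this was written. The sketch below is an assumption-laden example, not the approach taken in the fix: the `classify` helper is hypothetical, and it assumes a codec whose incremental decoder keeps undecoded trailing bytes in the first element of `getstate()`, as the UTF-8 one does.

```python
import codecs

def classify(data, encoding="utf-8"):
    """Return 'complete', 'incomplete', or 'invalid' for a byte string."""
    decoder = codecs.getincrementaldecoder(encoding)()
    try:
        # final=False tells the codec that more input may follow, so a
        # truncated multi-byte character is buffered rather than rejected.
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return "invalid"
    pending, _ = decoder.getstate()
    return "incomplete" if pending else "complete"

if __name__ == "__main__":
    euro = "\u20ac".encode("utf-8")   # three bytes in UTF-8
    print(classify(euro))             # whole character
    print(classify(euro[:2]))         # character cut off after two bytes
    print(classify(b"\xff"))          # byte that never occurs in UTF-8
```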
msg19036 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2004-02-26 15:26
Fixed in CVS.

History
Date	User	Action	Args
2022-04-11 14:56:01	admin	set	github: 39564
2003-11-18 17:22:40	jepler	create