Issue 1735418: file.read() truncating strings under Windows

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/45083

classification

Title:	file.read() truncating strings under Windows
Type:	behavior	Stage:	test needed
Components:	None	Versions:	Python 3.1, Python 2.6

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	ajaksu2, benjamin.peterson, cgkanchi, ilgiz, pitrou
Priority:	normal	Keywords:

Created on 2007-06-12 00:19 by cgkanchi, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
splitfile.py	cgkanchi, 2007-06-12 00:19	Source code demonstrating the bug

Messages (6)
msg32305	Author: cgkanchi (cgkanchi)	Date: 2007-06-12 00:19
On Python 2.4.4 and 2.5.1 under Windows, file.read() fails to read a varying number of characters from the last line(s) of text files when asked to read more than 800 characters from near the end of the file. For example, if the last word of a 500kb file is "superlative", file.read() might output "erlative". The file pointer at this stage is very close (a few words at most) to the end of the file. I ran into this problem while writing a program to split .txt ebooks into smaller files so that my ancient iPod could handle them. The behaviour is identical on both 2.4.4 and 2.5.1 under Windows, but does not appear under Mac OS X. I was unable to test it under Linux. To test the bug, I used various books from http://gutenberg.org . The one primarily used was Pride and Prejudice by Jane Austen.
msg32306	Author: Ilguiz Latypov (ilgiz)	Date: 2007-06-12 15:47
This is your coding bug.
(a) I would not trust tell(). Calculate the absolute position and use seek().
(b) Just from the documentation to Python's file-like objects I can assume that read() and tell() belong to different levels of API. The read() function has this in its documentation: "Note that this method may call the underlying C function fread() more than once in an effort to acquire as close to size bytes as possible". http://docs.python.org/lib/bltin-file-objects.html The tell() function's documentation refers to stdio's ftell(). This hints that tell() will return the position of the fread() buffer's end, not the read()'s end.
(c) It also appears that by adding 1 to the "current position - unget size" you are skipping the space character itself.
(d) The rfind() might return -1 if the search fails.
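Point (a) can be sketched as follows. This is a minimal illustration, not the attached splitfile.py (which is not reproduced in this thread): the function name `split_at_spaces` and its parameters are hypothetical, it is written for Python 3 rather than the 2.4/2.5 releases discussed here, and it opens the file in binary mode so that one byte returned always equals one byte of file offset.

```python
def split_at_spaces(path, chunk_size):
    """Return a list of chunks of at most chunk_size bytes, cut at spaces when possible.

    Hypothetical helper: the absolute offset is tracked by hand instead of
    being derived from tell().
    """
    chunks = []
    with open(path, "rb") as f:      # binary mode: no newline translation
        offset = 0                   # absolute position of the next chunk
        while True:
            f.seek(offset)
            block = f.read(chunk_size)
            if not block:
                break
            cut = block.rfind(b" ")
            if len(block) < chunk_size or cut == -1:
                # Last block of the file, or no space found: keep it whole.
                cut = len(block)
            else:
                cut += 1             # keep the space at the end of this chunk
            chunks.append(block[:cut])
            offset += cut
    return chunks
```

The -1 result from rfind() mentioned in (d) is handled explicitly here rather than being passed on to other arithmetic, and joining the returned chunks reproduces the file byte for byte.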
msg32307	Author: Ilguiz Latypov (ilgiz)	Date: 2007-06-12 15:51
(e) To have tell() on the same level with read(), try the unbuffered mode by specifying bufsize=0 in open(), http://docs.python.org/lib/built-in-funcs.html
msg32308	Author: cgkanchi (cgkanchi)	Date: 2007-06-14 17:59
>(e) To have tell() on the same level with read(), try the unbuffered mode
>by specifying bufsize=0 in open(),
>
> http://docs.python.org/lib/built-in-funcs.html
This does not work either. There is no change in the behaviour of the program.
>(a) I would not trust tell(). Calculate the absolute position and use
>seek().
That defeats the purpose of having native string handling in python. It means I have to do things the C way. Therefore, it is a bug in the implementation.
>(b) Just from the documentation to Python's file-like objects I can assume
>that read() and tell() belong to different levels of API. The read()
>function has this in its documentation:
>"Note that this method may call the underlying C function fread() more
>than once in an effort to acquire as close to size bytes as possible".
>http://docs.python.org/lib/bltin-file-objects.html
That should not make any difference whatsoever.
>The tell() function's documentation refers to stdio's ftell(). This hints
>that tell() will return the position of the fread() buffer's end, not the
>read()'s end.
Again, irrelevant.
>(c) It also appears that by adding 1 to the "current position - unget
>size" you are skipping the space character itself.
This is by design. I didn't want the space. Functionally, it makes no difference.
>(d) The rfind() might return -1 if the search fails.
This is by design as well, when there are no spaces in the remaining file, i.e., the file pointer is on the last word, a return value of -1 causes read() to read till EOF.
I did however find the solution in the python docs, but it is a workaround rather than a fix for a very obvious bug.
"tell() Return the file's current position, like stdio's ftell(). Note: On Windows, tell() can return illegal values (after an fgets()) when reading files with Unix-style line-endings. Use binary mode ('rb') to circumvent this problem."
Cheers, cgkanchi
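The reply to (d) relies on two documented behaviours: rfind() returns -1 when the substring is absent, and read() with a negative size reads to the end of the file. A small sketch of that interaction, using an in-memory `io.BytesIO` stand-in for the real file (the sample text and the offset 17 are made up for illustration, and how splitfile.py actually combines the two calls is an assumption):

```python
import io

# No space in the last word, so the search fails with -1.
last_space = "superlative".rfind(" ")

# A stand-in for a file whose pointer sits at the start of the last word.
f = io.BytesIO(b"the last word is superlative")
f.seek(17)

# A negative size tells read() to return everything up to EOF.
tail = f.read(last_space)
print(last_space, tail)   # -1 b'superlative'
```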
msg85624	Author: Daniel Diniz (ajaksu2) *	Date: 2009-04-06 09:42
Is this valid?
msg85631	Author: Antoine Pitrou (pitrou) *	Date: 2009-04-06 10:24
It's a bug in the program's logic. The program assumes that the file pointer will have advanced by the same number of bytes as were returned by read(), but it is false when opened in text mode ('r') since text mode under Windows will convert Windows newlines ('\r\n') into C newlines ('\n'). Also, please note this is a feature of Windows itself, not of Python. That's why you don't see it happening on e.g. Mac OS X. And that's why the fix, short of changing the program's logic, is to open in binary mode ('rb').
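The mismatch described above can be reproduced with a short sketch. It assumes Python 3, where the `io` layer performs the '\r\n' to '\n' translation itself on every platform when a file is opened in text mode with the default newline setting; in the Python 2 releases discussed in this issue the translation came from the Windows C runtime instead. The sample text is made up for illustration.

```python
import os
import tempfile

raw = b"Pride and\r\nPrejudice\r\n"   # two Windows-style line endings

fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(raw)

    # Text mode, default newline handling: '\r\n' is returned as '\n'.
    with open(path, "r", encoding="ascii") as f:
        text = f.read()

    # Binary mode: the bytes come back exactly as stored.
    with open(path, "rb") as f:
        data = f.read()
finally:
    os.remove(path)

# The text-mode result is shorter by one character per line ending, so
# counting returned characters underestimates how far the file pointer moved.
print(len(data), len(text))   # 22 20
```

Any arithmetic that subtracts the length of the returned string from the file position is therefore off by the number of line endings read, which is consistent with a few characters going missing near the end of the file.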

History
Date	User	Action	Args
2022-04-11 14:56:24	admin	set	github: 45083
2009-04-06 10:24:38	pitrou	set	status: open -> closed; resolution: not a bug; messages: + msg85631
2009-04-06 09:42:02	ajaksu2	set	versions: + Python 2.6, Python 3.1; nosy: + ajaksu2, pitrou, benjamin.peterson; messages: + msg85624; type: behavior; stage: test needed
2007-06-12 00:19:09	cgkanchi	create