This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: xreadlines caching, file iterator
Components: Interpreter Core
Versions: Python 2.3

process
Status: closed
Resolution: accepted
Assigned To: gvanrossum
Nosy List: gvanrossum, orenti, tim.peters
Priority: normal
Keywords: patch

Created on 2002-07-11 21:45 by orenti, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Files
File name Uploaded Description
xreadlinescache.patch orenti, 2002-07-11 21:45
xreadlinescache2.patch orenti, 2002-07-16 05:26
fileiterreadahead.patch orenti, 2002-08-05 06:27
fileiterreadahead2.patch orenti, 2002-08-05 20:01
fileiterreadahead3.patch orenti, 2002-08-06 05:03 xreadlines just returns self now
Messages (15)
msg40543 - (view) Author: Oren Tirosh (orenti) Date: 2002-07-11 21:45
Calling f.xreadlines() multiple times now returns the same 
xreadlines object.

A file becomes an iterator: __iter__() returns self, and next() calls 
the cached xreadlines object's next() method.
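
For context, here is a rough Python 2 sketch of the behaviour the patch proposes. The real patch changes the built-in C file object itself; the class and attribute names below are illustrative only.

    import xreadlines

    class CachedXreadlinesFile:
        """Illustrative stand-in for the patched built-in file object."""

        def __init__(self, f):
            self._file = f
            self._cached = None              # cached xreadlines object

        def xreadlines(self):
            # Repeated calls hand back the same cached object.
            if self._cached is None:
                self._cached = xreadlines.xreadlines(self._file)
            return self._cached

        def __iter__(self):
            # The file is its own iterator.
            return self

        def next(self):
            # next() delegates to the cached xreadlines object's next().
            return self.xreadlines().next()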

msg40544 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-07-15 14:38
I posted some comments to python-dev.
msg40545 - (view) Author: Oren Tirosh (orenti) Date: 2002-07-16 05:26
Now invalidates cache on a seek.
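
A minimal usage sketch of the intended semantics, assuming a patched interpreter and an existing file (the filename is made up):

    f = open('data.txt')
    first = f.next()      # next() fills and uses the cached xreadlines object
    f.seek(0)             # with this revision, seeking invalidates that cache
    again = f.next()      # so iteration restarts from the new file position
    assert first == again
    f.close()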

msg40546 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-07-17 01:33
I'm reviewing this and will check it in, or something like
it (probably).
msg40547 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-07-17 17:50
Alas, there's a fatal flaw. The file object and the
xreadlines object now both have pointers to each other,
creating an unbreakable cycle (since neither participates in
GC). Weak refs can't be used to resolve this dilemma. I
personally think that's enough to just stick with the status
quo (I was never more than +0 on the idea of making the file
an iterator anyway). But I'll leave it to Oren to come up
with another hack (please use this same SF patch).

Oren, if you'd like to give up, please say so and I'll close
the item in a jiffy. In fact, I positively encourage you to
give up. But I don't expect you to take this offer. :-)
msg40548 - (view) Author: Oren Tirosh (orenti) Date: 2002-08-05 06:27
This version of the patch still makes a file an iterator, but it no 
longer depends on xreadlines - it implements the readahead 
buffering inside the file object.

It is about 19% faster than xreadlines for normal text files and 
about 40% faster for files with 100k lines.

The methods readline and read do not use this readahead 
mechanism because it skews the current file position (just like 
xreadlines does).
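
As a rough pure-Python model of the read-ahead idea (the patch implements this in C inside the file object; the class name, buffer size and structure here are illustrative, not the patch itself):

    class ReadaheadIter:
        """Illustrative model of a read-ahead line iterator."""

        def __init__(self, f, bufsize=8192):
            self._file = f
            self._bufsize = bufsize
            self._buf = ''              # data read ahead but not yet returned

        def __iter__(self):
            return self

        def next(self):
            while True:
                nl = self._buf.find('\n')
                if nl >= 0:
                    line = self._buf[:nl + 1]
                    self._buf = self._buf[nl + 1:]
                    return line
                chunk = self._file.read(self._bufsize)
                if not chunk:
                    if self._buf:
                        line, self._buf = self._buf, ''
                        return line     # last line, no trailing newline
                    raise StopIteration
                self._buf = self._buf + chunk

Because the buffer holds data the caller has not yet seen, the underlying file position runs ahead of the lines actually returned - which is exactly why readline() and read() stay away from it.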
msg40549 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-08-05 14:52
This begins to look good.

What's a normal text file?  One with a million bytes? :-)

Have you made sure this works as expected in Universal
newline mode?

I'd like a patch that doesn't use #define WITH_READAHEAD_BUFFER.

You might also experiment with larger buffer sizes (I
predict that a larger buffer doesn't make much difference,
since it didn't for xreadlines, but it would be nice to
verify that and then add a comment; at least once a year
someone asks whether the buffer shouldn't be much larger).
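
One way to run that buffer-size experiment against the pure-Python model sketched earlier (the file name and sizes below are arbitrary; the numbers quoted in this thread were measured against the C patch):

    import time

    def time_line_iteration(path, bufsize):
        f = open(path)
        start = time.time()
        count = 0
        for line in ReadaheadIter(f, bufsize):
            count += 1
        f.close()
        return time.time() - start, count

    for size in (1024, 4096, 8192, 65536, 262144):
        elapsed, nlines = time_line_iteration('bigfile.txt', size)
        print '%7d-byte buffer: %d lines in %.3f seconds' % (size, nlines, elapsed)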
msg40550 - (view) Author: Oren Tirosh (orenti) Date: 2002-08-05 15:22
> What's a normal text file?  One with a million bytes? :-)
I meant 100kBYTE lines... Some apps actually use such long 
lines.

Yes, it works just fine with universal newlines.

Ok, the #ifdefs will go.

Strange, a bigger buffer seems to actually slow it down... I'll 
have to investigate this further.
msg40551 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-08-05 15:29
OK, I'll await a new patch.
msg40552 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2002-08-05 16:14
Just FYI, in apps that do "read + process" in a loop, a small 
buffer is often faster because the data has a decent shot at 
staying in L1 cache.  Make the buffer very large (100s of 
Kb), and it won't even stay in L2 cache.
msg40553 - (view) Author: Oren Tirosh (orenti) Date: 2002-08-05 20:01
Updated patch.

What to do about the xreadlines method? The patch doesn't 
touch it, but it could be made an alias for __iter__, and the 
dependency of file objects on the xreadlines module would be 
eliminated.

On my linux machine the highest performance is achieved for 
buffer sizes somewhere around 4096-8192.  Higher or lower 
values are significantly slower.
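
In Python terms the proposed alias amounts to something like this (a sketch only; the actual change would be an entry in the C file type's method table):

    class IterableFile:
        """Illustrative file-like object; read-ahead iteration omitted."""

        def __init__(self, f):
            self._file = f

        def __iter__(self):
            # The object is its own iterator.
            return self

        def next(self):
            line = self._file.readline()    # stand-in for the read-ahead code
            if not line:
                raise StopIteration
            return line

        # The proposed alias: xreadlines() returns exactly what __iter__()
        # does, so the xreadlines module is no longer needed.
        xreadlines = __iter__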

msg40554 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-08-05 20:07
Thanks!

Making xreadlines an alias for __iter__ sounds about right,
for backwards compatibility.

Then we should probably deprecate xreadlines, despite the
fact that it could be useful for other file-like objects;
it's just not a pretty enough interface.
msg40555 - (view) Author: Oren Tirosh (orenti) Date: 2002-08-06 05:03

msg40556 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-08-06 15:30
Hm, test_file fails on a technicality. I'll take it from
here. Thanks!
msg40557 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2002-08-06 15:56
Thanks!  Checked in.
History
Date                 User    Action  Args
2022-04-10 16:05:30  admin   set     github: 36878
2002-07-11 21:45:57  orenti  create