Issue 665835: filter() treatment of str and tuple inconsistent



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/37756

classification

Title:	filter() treatment of str and tuple inconsistent
Type:		Stage:
Components:	Interpreter Core	Versions:

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	rhettinger	Nosy List:	doerwalter, gvanrossum, rhettinger, tim.peters
Priority:	normal	Keywords:

Created on 2003-01-10 16:36 by doerwalter, last changed 2022-04-10 16:06 by admin. This issue is now closed.

Messages (20)
msg13987 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2003-01-10 16:36
class tuple2(tuple):
    def __getitem__(self, index):
        return 2*tuple.__getitem__(self, index)

class str2(str):
    def __getitem__(self, index):
        return chr(ord(str.__getitem__(self, index))+1)

print filter(lambda x: x>1, tuple2((1, 2)))
print filter(lambda x: x>"a", str2("ab"))

this prints:

(2,)
bc

i.e. the overridden __getitem__ is ignored in the first case, but honored in the second.
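For readers on Python 3, here is a rough port of the snippet above. It is a sketch and not part of the original report: it assumes CPython 3, where filter() returns a lazy iterator and both tuple and str have their own built-in iterators, so the outcome differs from the 2.x output quoted in the report.

```python
# Python 3 port of the two subclasses from the report (illustrative only).
class tuple2(tuple):
    def __getitem__(self, index):
        return 2 * tuple.__getitem__(self, index)


class str2(str):
    def __getitem__(self, index):
        return chr(ord(str.__getitem__(self, index)) + 1)


t = tuple2((1, 2))
s = str2("ab")

# Explicit indexing goes through the overridden __getitem__ ...
indexed = (t[0], t[1], s[0], s[1])

# ... but filter() iterates with the built-in iterators, which read the
# underlying data directly and never call the override.
filtered_tuple = tuple(filter(lambda x: x > 1, t))
filtered_str = "".join(filter(lambda x: x > "a", s))

print(indexed, filtered_tuple, filtered_str)
```

Under these assumptions both cases now behave alike: the override is bypassed for the tuple subclass and for the str subclass.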
msg13988 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2003-01-25 03:47
Logged In: YES user_id=80475 The problem isn't with filter(), which correctly calls iter() in both cases. Tuple objects have their own iterator, which loops over the elements directly with no intervening calls to __getitem__(). String objects do not define a custom iterator, so iter() wraps itself around consecutive calls to __getitem__(). The resolution is to provide string objects with their own iterator. As a side benefit, iteration will run just a tiny bit faster. The same applies to unicode objects. Guido, do you care about this and want me to fix it, or would you like to close it as "won't fix"?
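The fallback described here, iter() wrapping itself around consecutive __getitem__() calls when a type defines no iterator of its own, can be sketched as follows. The class name Countdown is hypothetical and the snippet is written for Python 3; it illustrates the protocol only and is not code from the patch.

```python
class Countdown:
    """Hypothetical sequence-like class that defines only __getitem__."""

    def __init__(self, start):
        self.start = start

    def __getitem__(self, index):
        # Raising IndexError is what tells the fallback iterator to stop.
        if index >= self.start:
            raise IndexError(index)
        return self.start - index


# No __iter__ is defined, so iter() falls back to calling
# obj[0], obj[1], obj[2], ... until IndexError is raised.
fallback_items = list(iter(Countdown(3)))
print(fallback_items)
```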
msg13989 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2003-01-25 13:36
Logged In: YES user_id=6380 I don't know which Python sources Raymond has been reading, but in the sources I've got in front of me, there are special cases for strings and tuples, and these don't use iter(). It so happens that the tuple special-case calls PyTuple_GetItem(), which doesn't call your __getitem__, while the string special-case calls the sq_item slot function, which (in your case) will be a wrapper that calls your __getitem__. A minimal fix would be to only call filtertuple for strict tuples. This changes the output type, but I don't think one should count on filter() of a tuple subclass returning a tuple (and it can't be made to return an instance of the subclass either -- we don't know the constructor signature). Similar fixes probably need to be made to map() and maybe reduce().
msg13990 - (view)	Author: Tim Peters (tim.peters) *	Date: 2003-01-25 13:45
Logged In: YES user_id=31435 Just noting that filter() is unique in special-casing the type of the input. It's always been surprising that way, and, e.g., filtering a string produces a string, but filtering a Unicode string produces a list. map() and reduce() don't play games like that, and always use the iteration protocol to march over their inputs.
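For contrast with the 2.x special-casing described here, a small sketch of how this looks in Python 3, assuming CPython 3 where filter() ignores the input type and always returns a lazy filter object. This is an observation about later versions, not about the code under discussion in this issue.

```python
inputs = ["abc", (1, 2, 3), [1, 2, 3]]

# Every input type yields the same kind of result object.
result_types = {type(filter(None, seq)).__name__ for seq in inputs}

# A string result has to be rebuilt explicitly by the caller.
kept = "".join(filter(str.isalpha, "a1b2"))

print(result_types, kept)
```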
msg13991 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2003-01-25 13:51
Logged In: YES user_id=6380 (But in addition to that, I don't mind having a custom string iterator -- as long as it calls __getitem__ properly. Hm, shouldn't the tuple iterator call __getitem__ properly too?)
msg13992 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2003-01-25 16:45
Logged In: YES user_id=80475 None of the existing iterators (incl dicts, lists, tuples, and files) use __getitem__. Most likely, user defined iterators also access the data structure directly (for flexibility and speed). Also, anything that uses PyTuple_GET_ITEM bypasses __getitem__. If string/unicode iterators are added, they should also go directly to the underlying data; otherwise, there is no point to it. Also, the proposal to change filtertuple() doesn't solve inconsistencies within filterstring(), which uses __getitem__ when there is a function call, but bypasses it when the function parameter is Py_None. I think the right answer is to change filterstring() to use an iterator and to implement string/unicode iterators that access the data directly (not using __getitem__). FYI for Tim: MvL noticed and fixed the unicode vs string difference. His patch, SF #636005, has not been applied yet.
msg13993 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2003-01-27 00:17
Logged In: YES user_id=6380 Hm... that means that iter() of any built-in type subclass overriding __getitem__ bypasses the override, unless the subclass also overrides __iter__. This sounds like a step in the wrong direction. I think the built-in iterators should be aware of subclasses overriding __getitem__ one way or another. I hadn't realized this when we started the trend of creating faster iterators for built-in types. :-(
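A minimal sketch of the situation described here, written for Python 3 with hypothetical class names (Doubled, DoubledIter): a list subclass that overrides only __getitem__ is bypassed by iteration, while one that also overrides __iter__ is honored.

```python
class Doubled(list):
    """Overrides __getitem__ only; the built-in list iterator ignores it."""

    def __getitem__(self, index):
        return 2 * list.__getitem__(self, index)


class DoubledIter(Doubled):
    """Also overrides __iter__, routing iteration through __getitem__."""

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]


bypassed = [x for x in Doubled([1, 2])]      # override not consulted
honored = [x for x in DoubledIter([1, 2])]   # override consulted via __iter__
print(bypassed, honored)
```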
msg13994 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2003-01-27 00:54
Logged In: YES user_id=80475 I understand. Ideally, all methods would respond to a single overridden method, but I think this is just a fact of life in object oriented programming. I can't remember where you gave an example of a d.__getitem__() subclass override, but you were careful to point out that other methods, like d.get() also needed to be overridden so that the modified access applied everywhere. Likewise, __iter__() or any other object access method must be assumed to access the underlying data structure directly and must be overridden. For instance, creating a dictionary with case insensitive lookups entails overriding __getitem__(k), get(k,default), and pop(k) -- no one of them can be presumed to inform the others.
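The case-insensitive dictionary mentioned above can be sketched like this. The class names are hypothetical, the snippet assumes Python 3 and string keys stored in lower case, and it shows only the three methods named in the message, so other access paths (for example `in` or setdefault()) are deliberately left unhandled.

```python
class LowerKeyDict(dict):
    """Overrides __getitem__ only; get() and pop() still see raw keys."""

    def __getitem__(self, key):
        return dict.__getitem__(self, key.lower())


class CaseInsensitiveDict(LowerKeyDict):
    """Overrides get() and pop() too, so all three lookups agree."""

    def get(self, key, default=None):
        return dict.get(self, key.lower(), default)

    def pop(self, key, *default):
        return dict.pop(self, key.lower(), *default)


partial = LowerKeyDict(name="spam")
via_getitem = partial["NAME"]    # goes through the override
via_get = partial.get("NAME")    # dict.get() does not call __getitem__

full = CaseInsensitiveDict(name="spam")
full_get = full.get("NAME")
full_pop = full.pop("NAME")
print(via_getitem, via_get, full_get, full_pop)
```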
msg13995 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2003-01-27 01:13
Logged In: YES user_id=80475 One other thought: A major reason for implementing __iter__ in the first place is that objects were overriding __getitem__ and disregarding the index -- the __getitem__ interface just didn't make sense for iteration in some situations. __iter__ was supposed to provide enormous flexibility in various ways to loop over a collection (inorder, preorder, postorder, priorityorder, sortedorder, hashorder, randomorder, etc). Making iter() default to using __getitem__ was only supposed to be an expedient for backwards compatibility. Always using __getitem__ diminishes the flexibility and speed advantages. Maybe the discussion belongs on python-dev. I'm sure a number of people feel strongly one way or the other. The question might as well be addressed head-on before 2.3 goes out the door.
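As an illustration of the flexibility argument, here is a sketch of a container whose iteration order has nothing to do with an integer index. SortedBag and its method names are hypothetical, and the snippet assumes Python 3.

```python
class SortedBag:
    """Hypothetical container that iterates in sorted order."""

    def __init__(self, items):
        self._items = list(items)   # stored in insertion order

    def __iter__(self):
        # Default traversal: sorted order, independent of storage order.
        return iter(sorted(self._items))

    def insertion_order(self):
        # An alternative traversal offered next to the default one.
        return iter(self._items)


bag = SortedBag([3, 1, 2])
sorted_view = list(bag)
stored_view = list(bag.insertion_order())
print(sorted_view, stored_view)
```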
msg13996 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2003-01-27 12:24
Logged In: YES user_id=89016 Another problem with filter() is that filterstring() (and the new filterunicode()) blindly assume that tp_as_sequence->sq_item returns a str or unicode object with len==1. This might fail with str or unicode subclasses:
----
class badstr(str):
    def __getitem__(self, index):
        return 42

s = filter(lambda x: x>=42, badstr("1234"))
print len(s), repr(s)
----
This prints 4 '\x00\x00\x00\x00'
msg13997 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2003-02-03 22:36
Logged In: YES user_id=6380 Walter: if you can fix the bug in your latest message here, go ahead and check it in. Seems like a case of a missing test. Raymond: it turns out that the iterator in Python 2.2 has the same problem with lists -- it special-cases lists. But for tuples, the iterator uses PySequence_GetItem; the fast tuple iterator in Python 2.3 introduces the problem for tuples though. I actually don't think there would be much disagreement that this behavior (ignoring __getitem__) is a bug. There may be disagreement over how important it is to fix it. Personally, I've generally been on the side of "it needn't be fixed if it slows down the common case", as long as a workaround (like overriding __iter__ alongside the __getitem__ override) exists. But I draw the line at being backwards incompatible with Python 2.2. Therefore I think the tuple iterator (and probably also the string iterator) needs to be fixed, and I still think that it would be best if the list iterator were also fixed. One way to do this would be for the tp_iter implementation to check whether self->ob_type->tp_as_sequence->sq_item is not equal to the list_item function (this is a good check to detect a __getitem__ override) and then return an instance of the generic sequence iterator instead of the list-specific iterator.
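The check proposed at the end of this message operates on C-level slots. A rough Python-level analogue, assuming Python 3 and using hypothetical helper names (generic_sequence_iter, override_aware_iter), might look like the sketch below. It compares the type's __getitem__ with list.__getitem__ in place of the sq_item comparison and is not the code that was eventually committed.

```python
def generic_sequence_iter(seq):
    """Index-based iteration: seq[0], seq[1], ... until IndexError."""
    index = 0
    while True:
        try:
            item = seq[index]
        except IndexError:
            return
        yield item
        index += 1


def override_aware_iter(seq):
    """Use the fast list iterator unless __getitem__ was overridden."""
    if type(seq).__getitem__ is list.__getitem__:
        return list.__iter__(seq)        # fast path, direct data access
    return generic_sequence_iter(seq)    # slow path, honors the override


class Plain(list):
    pass


class Doubled(list):
    def __getitem__(self, index):
        return 2 * list.__getitem__(self, index)


plain_items = list(override_aware_iter(Plain([1, 2])))
doubled_items = list(override_aware_iter(Doubled([1, 2])))
print(plain_items, doubled_items)
```

The fast path is kept for the common case of an unmodified subclass, which matches the concern in the message about not slowing it down.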
msg13998 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2003-02-04 17:15
Logged In: YES user_id=89016 OK, the problem of __getitem__ not returning str or unicode is fixed. Unfortunately the result is rather ugly. With the following class:

class u(unicode):
    def __getitem__(self, index):
        return u(2*unicode.__getitem__(self, index))

filter returns neither a list nor a u object, but a unicode object, defeating the whole purpose of the special treatment of str/unicode. If we remove the special treatment, this problem would go away; furthermore, __getitem__ returning objects that are not str/unicode instances wouldn't be a problem any longer.
msg13999 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2003-02-04 17:33
Logged In: YES user_id=6380 Yes, the special treatment of tuple, str and unicode is problematic. :-( I wish filter() had always returned a list for all input types. But it's too late to change that. However, I don't think that filter() should ever return a subclass of tuple, str or unicode. Note that slicing a subclass of these also doesn't return a subclass instance, unless the subclass specifically overrides __getslice__. I note that filter() of a tuple almost implements what I think it should, except that if it receives an empty tuple subclass, it returns it unchanged. The slicing and other methods (e.g. lower()) have all been modified to make a copy whose type is the base class; I think filter() should follow suit. Similar for filter() of strings and unicode.
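The base-class behavior of slicing and methods such as lower() can be checked with a short sketch. It assumes CPython 3 (where __getslice__ no longer exists and slicing goes through __getitem__), and the subclass names are hypothetical.

```python
class MyStr(str):
    pass


class MyTuple(tuple):
    pass


s = MyStr("Spam")
t = MyTuple((1, 2, 3))

# Slices and derived strings are plain str / tuple objects,
# not instances of the subclass.
result_types = (type(s[1:3]), type(s.lower()), type(t[:2]))
print(result_types)
```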
msg14000 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2003-02-04 20:34
Logged In: YES user_id=89016 The subclass problem has been fixed in: Python/bltinmodule.c 2.275 Lib/test/test_builtin.py 1.9 But now something strange happens:
---
class badstr(str):
   def __getitem__(self, index):
      return str.__getitem__(self, index).upper()

print filter(None, badstr("abc"))
print filter(lambda x: x, badstr("abc"))
---
This prints
---
abc
ABC
---
although according to the filter docstring they should be the same.
msg14001 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2003-02-04 21:06
Logged In: YES user_id=6380 So it does. I guess the special shortcut for None should only be taken when it's a proper str instance.
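The equivalence the docstring promises, filter(None, ...) behaving like filter() with an identity function, can be written as a small check. This sketch assumes Python 3 and uses a plain list, so it does not exercise the str-subclass shortcut discussed in this issue.

```python
values = [0, 1, "", "a", None, [], [0]]

# None as the function means "keep the items that are true".
with_none = list(filter(None, values))
with_identity = list(filter(lambda x: x, values))

print(with_none, with_identity)
```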
msg14002 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2003-02-10 13:27
Logged In: YES user_id=89016 OK, this has been fixed: function==None and function==lambda x:x now behave the same (for str and unicode; for tuples it's still broken, because PyTuple_GetItem() is used). (Checked in as Python/bltinmodule.c 2.278 and Lib/test/test_builtin.py 1.12.) Why can't we simply replace PyTuple_GetItem() with tp_as_sequence->sq_item in filtertuple()?
msg14003 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2003-02-10 15:20
Logged In: YES user_id=6380 Feel free to fix filtertuple() too. Just note that tp_as_sequence might be NULL, or ...->sq_item might be NULL. I'm not 100% sure that those can never be NULL for a tuple subclass.
msg14004 - (view)	Author: Walter Dörwald (doerwalter) *	Date: 2003-02-10 17:44
Logged In: YES user_id=89016 Checked in as: Lib/test/test_builtin.py 1.13 Python/bltinmodule.c 2.280
msg14005 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2003-02-10 17:52
Logged In: YES user_id=80475 I'm still working on fixing the iterators so that __getitem__ overrides are recognized by __iter__. Hope that simplifies your changes.
msg14006 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2003-04-24 16:53
Logged In: YES user_id=80475 Committed changes as: Objects/listobject.c 2.149 Objects/tupleobject.c 2.79 Lib/test/test_types.py 1.50

History
Date	User	Action	Args
2022-04-10 16:06:07	admin	set	github: 37756
2003-01-10 16:36:24	doerwalter	create