Issue 1564763: Unicode comparison change in 2.4 vs. 2.5

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/44021

classification

Title:	Unicode comparison change in 2.4 vs. 2.5
Type:		Stage:
Components:	Unicode	Versions:	Python 2.5

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:	lemburg	Nosy List:	arigo, lemburg, piman
Priority:	normal	Keywords:

Created on 2006-09-24 23:43 by piman, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
unicodestreq.py	piman, 2006-09-24 23:43	Sample of unicode comparison breakage

Messages (8)
msg29970	Author: Joe Wreschnig (piman)	Date: 2006-09-24 23:43
Python 2.5 changed the behavior of unicode comparisons in a significant way from Python 2.4, causing a test case failure in a module of mine. All tests passed with an earlier version of 2.5, though unfortunately I don't know what version in particular it started failing with.

The following code prints out all True on Python 2.4; the strings are compared case-insensitively, whether they are my lowerstr class, real strs, or unicodes. On Python 2.5, the comparison between lowerstr and unicode is false, but only in one direction. If I make lowerstr inherit from unicode rather than str, all comparisons are true again.

So at the very least, this is internally inconsistent. I also think changing the behavior between 2.4 and 2.5 constitutes a serious bug.
msg29971	Author: Armin Rigo (arigo) *	Date: 2006-09-25 21:11
This is an artifact of the change in the unicode class, which now has proper __eq__, __ne__, __lt__, etc. methods instead of the semi-deprecated __cmp__. The mixture of __cmp__ and the other methods is not very well-defined, which is why your code worked in 2.4: a bit by chance. In theory it should not have, according to the language reference. So although this is a behavior change from 2.4 to 2.5, I would argue that it is not a bug but a bug fix...

The reason is that, if we ignore the __eq__ vs __cmp__ issues, the operation 'a == b' is defined as follows: Python tries a.__eq__(b); if this returns NotImplemented, then Python tries b.__eq__(a). As an exception, if type(b) is a strict subclass of type(a), then Python tries them in the other order. This is why you get the 2.5 behavior: if lowerstr inherits from str, it is not a subclass of unicode, so u'abc' == lowerstr() tries u'abc'.__eq__() first, which answers immediately. On the other hand, if lowerstr inherits from unicode, then Python first tries lowerstr().__eq__(u'abc').

This part of the Python object model - when to reverse the order or not - is a bit obscure and not completely helpful... Subclassing built-in types generally only works up to a point. In your situation you should use a regular class that behaves in a string-like fashion, with an __eq__() method doing the case-insensitive comparison... if you can at all - there are places where you need a real string, so this "solution" might not be one either, but I don't see a better one :-(
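
The dispatch rule described in msg29971 can be sketched with plain classes. This is only an illustration written for Python 3: Text, LowerUnrelated, LowerText and _lower_eq are hypothetical stand-ins for unicode, lowerstr(str) and lowerstr(unicode), and none of them come from the attached file.

```python
# Hypothetical stand-in classes; none of these names appear in the issue.
class Text:
    """Plays the role of unicode: its __eq__ accepts any object with .value."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        if hasattr(other, "value"):
            return self.value == other.value  # case-sensitive
        return NotImplemented


def _lower_eq(self, other):
    """Case-insensitive comparison shared by the two classes below."""
    if hasattr(other, "value"):
        return self.value.lower() == other.value.lower()
    return NotImplemented


class LowerUnrelated:
    """Plays the role of lowerstr(str): not a subclass of Text."""

    def __init__(self, value):
        self.value = value

    __eq__ = _lower_eq


class LowerText(Text):
    """Plays the role of lowerstr(unicode): a strict subclass of Text."""

    __eq__ = _lower_eq


# The left operand's __eq__ answers first, so the case-sensitive rule wins.
left_wins = Text("baR") == LowerUnrelated("Bar")
# Reversed operands: the case-insensitive __eq__ is now on the left.
reversed_order = LowerUnrelated("Bar") == Text("baR")
# The right operand is a strict subclass, so its __eq__ is tried first.
subclass_first = Text("baR") == LowerText("Bar")
print(left_wins, reversed_order, subclass_first)
```

Only the first comparison is case-sensitive, which mirrors the "false, but only in one direction" result from the original report.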
msg29972	Author: Armin Rigo (arigo) *	Date: 2006-09-25 21:33
Sorry, I missed your comment: if lowerstr inherits from unicode then it just works. The reason is that 'abc'.__eq__(u'abc') returns NotImplemented, but u'abc'.__eq__('abc') returns True.

This is only inconsistent because of the asymmetry between strings and unicodes: strings can be transparently turned into unicodes but not the other way around -- so unicode.__eq__(x) can accept a string as the argument x and convert it to a unicode transparently, but str.__eq__(x) does not try to convert x to a string if it is a unicode. It's not a completely convincing explanation, but I think it at least shows how we arrived at the current situation in Python 2.5.
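
The same one-way conversion can still be observed in Python 3, which no longer has the str/unicode pair. The sketch below uses int and float as an analogy (int's __eq__ declines a float, while float's __eq__ accepts an int). ApproxInt is a hypothetical class invented for this example and plays the role of lowerstr inheriting from str.

```python
# int and float stand in for str and unicode (an analogy, not the original code).
one_way = (1).__eq__(1.0)    # int does not convert the float
other_way = (1.0).__eq__(1)  # float converts the int itself


class ApproxInt(int):
    """Hypothetical int subclass that rounds the other operand before comparing."""

    def __eq__(self, other):
        if isinstance(other, (int, float)):
            return int(self) == round(float(other))
        return NotImplemented

    __hash__ = int.__hash__  # defining __eq__ would otherwise unset __hash__


# The custom __eq__ runs only when the subclass instance is on the left.
custom_on_left = ApproxInt(1) == 1.4
# float.__eq__ accepts any int (subclasses included) and answers on its own,
# so ApproxInt.__eq__ is never consulted.
custom_on_right = 1.4 == ApproxInt(1)
print(one_way, other_way, custom_on_left, custom_on_right)
```

Because ApproxInt is not a subclass of float, the reversed-order exception does not apply, and the two directions disagree just as they did for lowerstr and unicode.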
msg29973	Author: Marc-Andre Lemburg (lemburg) *	Date: 2006-09-26 10:39
Armin, is it possible that the missing Py_TPFLAGS_HAVE_RICHCOMPARE type flag in the Unicode type is causing this? I just had a look at the code and it appears that the comparison code checks the flag rather than just looking at the slot itself (I didn't even know there was such a type flag).
msg29974	Author: Marc-Andre Lemburg (lemburg) *	Date: 2006-09-26 10:55
Ah, wrong track: Py_TPFLAGS_HAVE_RICHCOMPARE is set via Py_TPFLAGS_DEFAULT.
msg29975	Author: Marc-Andre Lemburg (lemburg) *	Date: 2006-09-26 11:13
In any case, the introduction of the Unicode tp_richcompare slot is likely the cause of this behavior:

    $ python2.5 lowerstr.py
    u'baR' == l'Bar'? False
    $ python2.4 lowerstr.py
    u'baR' == l'Bar'? True

Note that in both Python 2.4 and 2.5, the lowerstr.__eq__() method is not even called. This is probably due to the fact that Unicode can compare itself to strings, so the w.__eq__(v) part of the rich comparison is never tried.

Now, the Unicode .__eq__() converts the string to Unicode, so the right hand side becomes u'Bar' in both cases.

I guess a debugger session is due...
msg29976	Author: Armin Rigo (arigo) *	Date: 2006-09-27 08:58
Well, yes, that's what I tried to explain. I also tried to explain how the 2.5 behavior is the "right" one, and the previous 2.4 behavior is a mere accident of convoluted __eq__-vs-__cmp__ code paths in the comparison code. In other words, there is no chance to get the 2.4 behavior in, say, Python 3000, because the __cmp__-related convolutions will be gone and we will only have the "right" behavior left.
msg29977	Author: Marc-Andre Lemburg (lemburg) *	Date: 2006-09-27 10:22
Agreed. In Python 2.4, doing the u'baR' == l'Bar' comparison does try l'Bar' == u'baR', due to the special case in default_3way_compare() that I removed for Python 2.5. In Python 2.5 it doesn't, due to the new rich comparison code for Unicode.

I don't see any way to make Joe's code work with Python 2.5 other than using unicode as the base class, which is probably the right thing to do anyway in preparation for Python 3k.

Closing as won't fix.
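
A minimal sketch of the suggested direction, written for Python 3, where str is the text type that corresponds to 2.x unicode. The class name lowerstr follows the issue, but the body is an assumption: the attached unicodestreq.py is not reproduced on this page, so this is not the reporter's code.

```python
class lowerstr(str):
    """Case-insensitive string built on the text type itself (assumed body)."""

    def __eq__(self, other):
        if isinstance(other, str):
            # str.lower() returns plain str objects, so this does not recurse.
            return self.lower() == other.lower()
        return NotImplemented

    def __ne__(self, other):
        # str supplies its own case-sensitive __ne__, so override it as well.
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        # Keep hashing consistent with the case-insensitive equality.
        return hash(self.lower())


# lowerstr is a strict subclass of str, so its __eq__ is tried first
# no matter which side of the operator it appears on.
plain_on_left = "baR" == lowerstr("Bar")
plain_on_right = lowerstr("Bar") == "baR"
print(plain_on_left, plain_on_right)
```

Overriding __ne__ and __hash__ alongside __eq__ is a design choice of this sketch, made so that != and dictionary lookups agree with the case-insensitive ==.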

History
Date	User	Action	Args
2022-04-11 14:56:20	admin	set	github: 44021
2006-09-24 23:43:07	piman	create