classification
Title: Unicode comparison change in 2.4 vs. 2.5
Type: Stage:
Components: Unicode Versions: Python 2.5
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: lemburg Nosy List: arigo, lemburg, piman
Priority: normal Keywords:

Created on 2006-09-24 23:43 by piman, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
unicodestreq.py piman, 2006-09-24 23:43 Sample of unicode comparison breakage
Messages (8)
msg29970 - (view) Author: Joe Wreschnig (piman) Date: 2006-09-24 23:43
Python 2.5 changed the behavior of unicode comparisons
in a significant way from Python 2.4, causing a test
case failure in a module of mine. All tests passed with
an earlier version of 2.5, though unfortunately I don't
know what version in particular it started failing with.

The following code prints out all True on Python 2.4;
the strings are compared case-insensitively, whether
they are my lowerstr class, real strs, or unicodes. On
Python 2.5, the comparison between lowerstr and unicode
is false, but only in one direction.

If I make lowerstr inherit from unicode rather than
str, all comparisons are true again. So at the very
least, this is internally inconsistent. I also think
changing the behavior between 2.4 and 2.5 constitutes a
serious bug.
msg29971 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2006-09-25 21:11
This is an artifact of the change in the unicode class, which
now has the proper __eq__, __ne__, __lt__, etc. methods
instead of the semi-deprecated __cmp__.  The mixture of
__cmp__ and the other methods is not very well-defined.  That
is why your code worked in 2.4: largely by chance.

Indeed, in theory it should not, according to the language
reference.  So what I am saying is that although it is a
behavior change from 2.4 to 2.5, I would argue that it is not
a bug but a bug fix...

The reason is that if we ignore the __eq__ vs __cmp__ issues,
the operation 'a == b' is defined as: Python tries
a.__eq__(b); if this returns NotImplemented, then Python
tries b.__eq__(a).  As an exception, if type(b) is a strict
subclass of type(a), then Python tries in the other order. 
This is why you get the 2.5 behavior: if lowerstr inherits
from str, it is not a subclass of unicode, so u'abc' ==
lowerstr() tries u'abc'.__eq__() first, which returns a result
immediately, so lowerstr.__eq__() is never called.
On the other hand, if lowerstr inherits from unicode, then
Python first tries lowerstr().__eq__(u'abc').
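
The dispatch order described above can be traced with a small sketch. It is written for Python 3 and uses toy classes rather than the real str and unicode types, so every name in it (Text, Unrelated, Derived, trace) is hypothetical; it only illustrates when the right-hand operand's __eq__ gets a chance to run.

```python
calls = []  # records which __eq__ implementations actually run


class Text:
    """Stands in for the built-in type on the left-hand side."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        calls.append("Text.__eq__")
        if isinstance(other, (Text, Unrelated)):
            return self.value == other.value  # exact, case-sensitive
        return NotImplemented


class Unrelated:
    """Case-insensitive, but NOT a subclass of Text."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        calls.append("Unrelated.__eq__")
        return self.value.lower() == other.value.lower()


class Derived(Text):
    """Case-insensitive and a strict subclass of Text."""

    def __eq__(self, other):
        calls.append("Derived.__eq__")
        return self.value.lower() == other.value.lower()


def trace(left, right):
    """Return the result of left == right and the methods that ran."""
    calls.clear()
    result = left == right
    return result, list(calls)


# Left operand answers immediately; Unrelated.__eq__ is never tried.
print(trace(Text("baR"), Unrelated("Bar")))  # (False, ['Text.__eq__'])
# Right operand is a strict subclass, so its __eq__ is tried first.
print(trace(Text("baR"), Derived("Bar")))    # (True, ['Derived.__eq__'])
# With the operands swapped, the case-insensitive method runs first.
print(trace(Unrelated("Bar"), Text("baR")))  # (True, ['Unrelated.__eq__'])
```

The first and third calls reproduce the "false, but only in one direction" symptom from the report; the second shows why subclassing the left operand's type makes it go away.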

This part of the Python object model - when to reverse the
order or not - is a bit obscure and not completely helpful...
Subclassing built-in types generally works only partially.  In
your situation you should use a regular class that behaves in
a string-like fashion, with an __eq__() method doing the
case-insensitive comparison... if you can at all - there are
places where you need a real string, so this "solution" might
not be one either, but I don't see a better one :-(
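
A minimal sketch of the "regular class" approach suggested above, written for Python 3. The class name CaseInsensitiveText and its details are assumptions for illustration, not code from the reporter's module. Because the class is not a string subclass, the built-in string's __eq__ returns NotImplemented and Python falls back to the reflected method.

```python
class CaseInsensitiveText:
    """String-like wrapper (hypothetical) that compares case-insensitively."""

    def __init__(self, text):
        self._text = text

    def __str__(self):
        return self._text

    def __eq__(self, other):
        if isinstance(other, CaseInsensitiveText):
            other = other._text
        if isinstance(other, str):
            return self._text.lower() == other.lower()
        return NotImplemented  # let the other operand decide

    def __hash__(self):
        # Keep hashing consistent with the case-insensitive equality.
        return hash(self._text.lower())


wrapped = CaseInsensitiveText("Bar")
print(wrapped == "baR")  # True
print("baR" == wrapped)  # True: str.__eq__ gives up, so the wrapper is asked
print("baz" == wrapped)  # False
```

As the message notes, such an object is not a real string, so it cannot be passed to APIs that insist on one.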
msg29972 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2006-09-25 21:33
Sorry, I missed your comment: if lowerstr inherits from
unicode then it just works.  The reason is that
'abc'.__eq__(u'abc') returns NotImplemented, but
u'abc'.__eq__('abc') returns True.

This is only inconsistent because of the asymmetry between
strings and unicodes: strings can be transparently turned
into unicodes but not the other way around -- so
unicode.__eq__(x) can accept a string as the argument x
and convert it to a unicode transparently, but str.__eq__(x)
does not try to convert x to a string if it is a unicode.
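
The str/unicode pair no longer exists in Python 3, but int and float show the same kind of asymmetry and can serve as a runnable analogue. The sketch below is only an analogy under that assumption; LastDigit is a hypothetical class playing the role of lowerstr.

```python
# int does not know how to compare itself with a float ...
print((1).__eq__(1.0))  # NotImplemented
# ... but float accepts an int and converts it.
print((1.0).__eq__(1))  # True


class LastDigit(int):
    """Hypothetical int subclass: equal when the last decimal digits match."""

    def __eq__(self, other):
        return int(self) % 10 == int(other) % 10


# Subclass on the left: its own __eq__ runs.
forward = LastDigit(11) == 1.0
# float on the left: float.__eq__ handles the int subclass directly,
# so LastDigit.__eq__ is never consulted.
backward = 1.0 == LastDigit(11)

print(forward)   # True
print(backward)  # False
```

Here float plays the part of unicode (it accepts and converts the other type) and int plays the part of str (it declines), which is why a subclass of the declining type loses control of the comparison in one direction.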

It's not a completely convincing explanation, but I think it
at least shows how we arrived at the current situation in
Python 2.5.
msg29973 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2006-09-26 10:39
Armin, is it possible that the missing
Py_TPFLAGS_HAVE_RICHCOMPARE type flag in the Unicode type is
causing this?

I just had a look at the code and it appears that the
comparison code checks the flag rather than just looking at
the slot itself (didn't even know there was such a type flag).
msg29974 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2006-09-26 10:55
Ah, wrong track: Py_TPFLAGS_HAVE_RICHCOMPARE is set via
Py_TPFLAGS_DEFAULT.
msg29975 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2006-09-26 11:13
In any case, the introduction of the Unicode tp_richcompare
slot is likely the cause of this behavior:

$ python2.5 lowerstr.py
u'baR' == l'Bar'?       False
$ python2.4 lowerstr.py
u'baR' == l'Bar'?       True

Note that in both Python 2.4 and 2.5, the lowerstr.__eq__()
method is not even called. This is probably due to the fact
that Unicode can compare itself to strings, so the
w.__eq__(v) part of the rich comparison is never tried.

Now, the Unicode .__eq__() converts the string to Unicode,
so the right hand side becomes u'Bar' in both cases.

I guess a debugger session is due...
msg29976 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2006-09-27 08:58
Well, yes, that's what I tried to explain.  I also tried to
explain how the 2.5 behavior is the "right" one, and the
previous 2.4 behavior is a mere accident of convoluted
__eq__-vs-__cmp__ code paths in the comparison code.

In other words, there is no chance to get the 2.4 behavior
in, say, Python 3000, because the __cmp__-related
convolutions will be gone and we will only have the "right"
behavior left.
msg29977 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2006-09-27 10:22
Agreed.

In Python 2.4, doing the u'baR' == l'Bar' comparison does
try l'Bar' == u'baR' due to the special case in
default_3way_compare() I removed for Python 2.5. 

In Python 2.5 it doesn't due to the new rich comparison code
for Unicode.

I don't see any way to make Joe's code work with Python 2.5
other than using unicode as the base class, which is probably
the right thing to do anyway in preparation for Python 3k.
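
A sketch of that workaround as it would look in Python 3, where str is the Unicode type. The body of lowerstr here is a guess for illustration; the reporter's actual class is in the attached file and is not reproduced in this thread.

```python
class lowerstr(str):
    """Illustrative reconstruction: a str subclass that ignores case."""

    def __eq__(self, other):
        if isinstance(other, str):
            # str.lower() returns plain str objects, so this comparison
            # uses the built-in, case-sensitive str.__eq__.
            return self.lower() == other.lower()
        return NotImplemented

    def __ne__(self, other):
        # str defines its own __ne__, so it must be overridden as well.
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        return hash(self.lower())


print(lowerstr("Bar") == "baR")  # True
print("baR" == lowerstr("Bar"))  # True: the subclass method is tried first
print("baR" != lowerstr("Bar"))  # False
```

Both directions agree because lowerstr is a strict subclass of the left-hand operand's type, so its reflected method takes priority.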

Closing as won't fix.
History
Date User Action Args
2022-04-11 14:56:20 admin set github: 44021
2006-09-24 23:43:07 piman create