classification
Title: test_unicode fails in wide unicode build
Type:
Stage:
Components: Unicode
Versions: Python 2.3
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: lemburg
Nosy List: doerwalter, lemburg, mwh
Priority: normal
Keywords:

Created on 2002-05-11 16:25 by mwh, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Messages (8)
msg10730 - Author: Michael Hudson (mwh) (Python committer) Date: 2002-05-11 16:25
Assigned somewhat arbitrarily.

It's a roundtrip test, I think.
msg10731 - Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2002-05-13 13:38
The minimal failing testcase is:

>>> unicode(u"\udb00\udc00".encode("utf-8"), "utf-8") == u"\udb00\udc00"
False

which is strange, because they *seem* to be the same:

>>> u"\udb00\udc00"
u'\U000d0000'
>>> unicode(u"\udb00\udc00".encode("utf-8"), "utf-8")
u'\U000d0000'
msg10732 - Author: Michael Hudson (mwh) (Python committer) Date: 2002-05-13 13:58
>>> a = u"\udb00\udc00"
[20811 refs]
>>> b = unicode(a.encode("utf-8"), "utf-8")
[21061 refs]
>>> a, b 
(u'\U000d0000', u'\U000d0000')
[21063 refs]
>>> len(a), len(b)
(2, 1)
[21063 refs]

Erm...?
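
For reference, a minimal sketch (not from the thread; the helper name `combine_surrogates` is made up for illustration) of the standard UTF-16 arithmetic that maps the surrogate pair in `a` to a single code point. It may help explain why both values print the same even though their lengths differ: the pair is two code units that together denote the one code point held by `b`.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into the code point it encodes.

    Hypothetical helper for illustration only; it is not part of Python's
    unicode implementation.
    """
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xDB00, 0xDC00)
print(hex(code_point))  # 0xd0000
```

So `a` would hold the two surrogates separately while `b` holds the combined code point, which matches the lengths 2 and 1 above.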
msg10733 - Author: Michael Hudson (mwh) (Python committer) Date: 2002-05-13 14:06
Even better: 

$ ./python 
Adding parser accelerators ...
Done.
Python 2.2.1 (#1, May 13 2002, 15:02:01) 
[GCC 2.96 20000731 (Red Hat Linux 7.1 2.96-98)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> unicode(u"\udb00\udc00".encode("utf-8"), "utf-8") == u"\udb00\udc00"
0
[18762 refs]

but the test passes.  And there was me thinking that it
wasn't a problem on the release22-maint branch.
msg10734 - Author: Michael Hudson (mwh) (Python committer) Date: 2002-10-09 12:57
Hmm.  The test has stopped failing, so maybe we can close this.

I'd be happier if I knew why, though.
msg10735 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2002-10-10 15:30
I'm not exactly sure why things work again, but I do
know that I looked into this some time ago. Perhaps I
simply forgot to close the bug or one of the UTF-8
codec overhauls remedied the problem.

Here's what I get with Python 2.3 (UCS4 build):

>>> len(u'\U000d0000')
1
>>> len(u"\udb00\udc00")
2
>>> u'\U000d0000' == u"\udb00\udc00"
False
>>> len(unicode(u"\udb00\udc00".encode('utf-8'), 'utf-8'))
1
>>> len(unicode(u'\U000d0000'.encode('utf-8'), 'utf-8'))
1

This is what I get with Python 2.2.1:
>>> len(u'\U000d0000')
2
>>> len(u"\udb00\udc00")
2
>>> u'\U000d0000' == u"\udb00\udc00"
1
>>> len(unicode(u"\udb00\udc00".encode('utf-8'), 'utf-8'))
2
>>> len(unicode(u'\U000d0000'.encode('utf-8'), 'utf-8'))
2

There's still a difference there, but the UTF-8 codec behaves
consistently.
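
As a rough modern counterpart to the roundtrip check above, here is a sketch in Python 3 rather than the Python 2 used in this thread. The helper name `utf8_roundtrip` is an assumption for illustration, and the `surrogatepass` error handler is needed because Python 3's UTF-8 codec rejects lone surrogates by default. It does not reproduce the 2.2/2.3 behaviour shown above; it only illustrates what a consistent roundtrip looks like when the pair and the single code point are kept distinct.

```python
# Python 3 sketch; the thread itself uses Python 2 `unicode` objects.
pair = "\udb00\udc00"   # two separate surrogate code points
astral = "\U000d0000"   # the single code point the pair would encode in UTF-16


def utf8_roundtrip(text):
    """Encode to UTF-8 and decode again, letting lone surrogates through."""
    encoded = text.encode("utf-8", "surrogatepass")
    return encoded.decode("utf-8", "surrogatepass")


print(len(pair), len(astral))            # 2 1
print(pair == astral)                    # False
print(utf8_roundtrip(pair) == pair)      # True
print(utf8_roundtrip(astral) == astral)  # True
```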
msg10736 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2003-01-19 23:02
Michael, is the test still failing, or can I close this?
msg10737 - Author: Michael Hudson (mwh) (Python committer) Date: 2003-01-20 10:12
Let's get rid of it.  I still don't understand what
happened, but we can worry about that if it resurfaces.
History
Date                 User   Action  Args
2022-04-10 16:05:18  admin  set     github: 36592
2002-05-11 16:25:58  mwh    create