Issue 1704793: incorrect return value of unicodedata.lookup() - beyond BMP

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/44879

classification

Title:	incorrect return value of unicodedata.lookup() - beyond BMP
Type:		Stage:
Components:	Unicode	Versions:	Python 2.5

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	loewis	Nosy List:	georg.brandl, hyeshik.chang, lemburg, loewis, vlbrom
Priority:	high	Keywords:

Created on 2007-04-21 10:52 by vlbrom, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
unicodedata-lookup-ucs2fix.diff	hyeshik.chang, 2007-06-12 11:12	Proposed fix that encodes non-BMP characters as a surrogate pair.

Messages (9)
msg31851 - (view)	Author: vbr (vlbrom)	Date: 2007-04-21 10:52
There seems to be incorrect handling of Unicode characters beyond the BMP (code points higher than 0xFFFF) in the lookup() function of the unicodedata module on narrow Unicode Python builds (Python 2.5.1, Windows XPh):

>>> unicodedata.lookup("GOTHIC LETTER FAIHU")
u'\u0346'

The result should be u'\U00010346'. The beginning of the literal is truncated, which leads to ambiguity: in this case u'\u0346' is a combining diacritic, "COMBINING BRIDGE ABOVE".

By contrast, the \N{name} escape in Unicode string literals works well:

>>> u"\N{GOTHIC LETTER FAIHU}"
u'\U00010346'

Unfortunately, I haven't been able to find the problematic pieces of source code, so I'm not able to fix it. It seems that the correct information about the given code point is used initially, but in the end only the last four hex digits of the code point value are taken into account, using the "narrow" Unicode literal \uxxxx instead of \Uxxxxxxxx, while the same task is handled correctly by the unicodeescape codec used for Unicode string literals.

vbr
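A minimal sketch of the truncation hypothesized in this report, written for Python 3 (where chr() covers the full Unicode range) rather than the 2.5 narrow build being discussed. The helper `truncate_to_16_bits` is hypothetical: it only models keeping the low 16 bits of a code point and is not the code in unicodedata.

```python
import unicodedata

# Code point of GOTHIC LETTER FAIHU, as given in the report.
FAIHU = 0x10346


def truncate_to_16_bits(code_point):
    """Keep only the low 16 bits (hypothetical model of the reported bug)."""
    return code_point & 0xFFFF


truncated = truncate_to_16_bits(FAIHU)
print(hex(truncated))                    # 0x346
print(unicodedata.name(chr(truncated)))  # COMBINING BRIDGE ABOVE
```

Under this assumption, any non-BMP character would collide with the BMP character that shares its low 16 bits, which matches the ambiguity described in the report.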
msg31852 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2007-04-21 20:29
Confirmed with a linux-x86 UCS-4 build here.
msg31853 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2007-06-12 09:35
Martin, please have a look. Thanks.
msg31854 - (view)	Author: Hyeshik Chang (hyeshik.chang) *	Date: 2007-06-12 11:12
I attached a working fix for the problem. The patch encodes non-BMP characters as a surrogate pair in the lookup function. Surrogate pair encoding could be considered for inclusion in the standard Unicode API. How about providing UTF-32 codecs in the Python C-API to help with this kind of use? File Added: unicodedata-lookup-ucs2fix.diff
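For readers unfamiliar with the encoding the patch relies on, here is a rough sketch of the standard UTF-16 surrogate pair arithmetic. The function name `to_surrogate_pair` is made up for illustration, and this is not the contents of unicodedata-lookup-ucs2fix.diff, which is not reproduced in this thread.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary planes")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10346)     # GOTHIC LETTER FAIHU
print(hex(high), hex(low))                 # 0xd800 0xdf46
```

On a narrow build, a result encoded this way occupies two code units, which is why a lookup() that returns it would produce a string of length two.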
msg31855 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2007-06-13 04:25
gbrandl: what precisely can you confirm? In any UCS-4 build, the lookup should return the correct result, and it does so on my machine. An alternative solution to the change proposed by perky would be to raise a ValueError, similar to unichr().
msg31856 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2007-06-13 06:37
Indeed, it is UCS-2, sorry.
msg31857 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2007-07-27 18:33
I'm skeptical about applying this to 2.5.x: I think it could be surprising if you suddenly get length-two results. How about raising a ValueError instead if the resulting character is out of range?
msg31858 - (view)	Author: Hyeshik Chang (hyeshik.chang) *	Date: 2007-07-28 03:30
I agree with raising a ValueError for 2.5.x.
msg31859 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2007-07-28 07:05
In implementing it, I found that KeyError is better than ValueError, as this is the only exception that you currently get. This is now fixed in r56600 and r56601.
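The behaviour described in this message can be approximated as follows. This is a sketch under stated assumptions, not the change committed in r56600 and r56601, whose exact check and error message are not shown in the thread. It runs on Python 3, and `lookup_narrow` and `MAX_NARROW` are hypothetical names that emulate a narrow build's limit.

```python
import unicodedata

# Largest code point a single UCS-2 code unit can hold (assumed limit).
MAX_NARROW = 0xFFFF


def lookup_narrow(name):
    """Emulate lookup() on a narrow build that rejects non-BMP results."""
    result = unicodedata.lookup(name)  # already raises KeyError for unknown names
    if any(ord(ch) > MAX_NARROW for ch in result):
        # Same exception type as an unknown name, so callers handle one case.
        raise KeyError(name)
    return result


for name in ("LATIN SMALL LETTER A", "GOTHIC LETTER FAIHU"):
    try:
        print(name, "->", repr(lookup_narrow(name)))
    except KeyError:
        print(name, "-> KeyError")
```

Reusing KeyError means existing callers that already guard against unknown names need no new except clause, which appears to be the motivation given in the message.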

History
Date	User	Action	Args
2022-04-11 14:56:23	admin	set	github: 44879
2007-04-21 10:52:05	vlbrom	create