Issue 599377: re searches don't work with 4-byte unico

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/37080

classification

Title:	re searches don't work with 4-byte unico
Type:		Stage:
Components:	Library (Lib)	Versions:	Python 2.2

process

Created on 2002-08-23 19:16 by dcjim, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Messages (4)
msg12145 - (view)	Author: Jim Fulton (dcjim)	Date: 2002-08-23 19:16
For Python 2.2.1 or the CVS head, as of this posting, with Python configured for 4-byte Unicode (--enable-unicode=ucs4), searches against Unicode regular expressions that use characters above \xff don't seem to work. Here's an example:
    invalid_xml_char = re.compile(u'[\ud800-\udfff]')
    invalid_xml_char.search(u'\ud800')
This returns None, rather than a match.
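The reported search can be re-run on a current interpreter as a quick regression check. The snippet below is a minimal sketch, assuming Python 3.3 or later, where string literals are Unicode by default; the names `found_surrogate` and `clean_match` are introduced here for illustration and are not part of the original report.

```python
import re

# Character class covering the surrogate range, as in the report.
invalid_xml_char = re.compile('[\ud800-\udfff]')

# The search that returned None on the affected UCS-4 builds.
match = invalid_xml_char.search('\ud800')
found_surrogate = match is not None

# A string with no surrogate code points should not match at all.
clean_match = invalid_xml_char.search('plain text')

# Print only the outcome; writing a lone surrogate to stdout may fail to encode.
print(found_surrogate, clean_match)
```

On an interpreter that includes the fix, `found_surrogate` should be true and `clean_match` should be None.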
msg12146 - (view)	Author: Peter Schneider-Kamp (nowonder) *	Date: 2002-08-27 16:49
I could reproduce this behaviour exactly. No idea what is causing it, though.
msg12147 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2002-09-26 16:53
Added a workaround in sre_compile revisions 1.44 and 1.41.14.2: it disables big charsets for UCS-4 builds. I'm leaving this report open so that a proper fix can be designed.
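A workaround of this kind needs some way to tell a UCS-4 build from a UCS-2 one at run time. The sketch below shows one way to do that with `sys.maxunicode`. It is not the code from sre_compile: the helper names and the strategy labels are hypothetical and only illustrate the idea of skipping an optimization on wide builds.

```python
import sys


def is_wide_unicode_build():
    """Hypothetical helper: True when one character can hold a code point above 0xFFFF."""
    # sys.maxunicode is 0xFFFF on narrow (UCS-2) builds and 0x10FFFF on wide ones.
    return sys.maxunicode > 0xFFFF


def choose_charset_strategy():
    """Hypothetical helper: pick a label for how a character set would be compiled."""
    # Assumed labels: "plain" means the optimization is skipped,
    # "big-charset" means the optimized representation is used.
    return "plain" if is_wide_unicode_build() else "big-charset"


print(is_wide_unicode_build(), choose_charset_strategy())
```

Since Python 3.3 every build reports `sys.maxunicode` as 0x10FFFF, so the narrow branch is only reachable on older interpreters.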
msg12148 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2003-06-14 15:10
This is now fixed for Python 2.3, in _sre.c revision 2.89.