This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: re searches don't work with 4-byte unico
Type: Stage:
Components: Library (Lib) Versions: Python 2.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: loewis Nosy List: dcjim, loewis, nowonder
Priority: normal Keywords:

Created on 2002-08-23 19:16 by dcjim, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Messages (4)
msg12145 - Author: Jim Fulton (dcjim) (Python triager) Date: 2002-08-23 19:16
With Python 2.2.1 or the CVS head (as of this posting),
configured for 4-byte Unicode (--enable-unicode=ucs4),
searches with Unicode regular expressions that use
characters above \xff don't seem to work.

Here's an example:

  import re
  invalid_xml_char = re.compile(u'[\ud800-\udfff]')
  invalid_xml_char.search(u'\ud800')

returns None, rather than a match.
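
For anyone checking this on a current interpreter, here is a minimal sketch of the same test in Python 3 syntax. The helper name has_surrogate is made up for illustration and is not part of the report. On an interpreter without this bug, the first call should report a match and the second should not.

```python
import re

# Character class covering the surrogate range U+D800..U+DFFF,
# the same pattern used in the report above.
invalid_xml_char = re.compile('[\ud800-\udfff]')


def has_surrogate(text):
    """Hypothetical helper: True if text contains a surrogate code point."""
    return invalid_xml_char.search(text) is not None


print(has_surrogate('\ud800'))       # the failing case from the report
print(has_surrogate('plain text'))   # no surrogate present
```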
msg12146 - Author: Peter Schneider-Kamp (nowonder) * (Python triager) Date: 2002-08-27 16:49
I could reproduce this behaviour exactly. No idea what is
causing it, though.
msg12147 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-09-26 16:53
Added a work-around in sre_compile 1.44 and 1.41.14.2: it
disables big charsets for UCS-4 builds.

I'm leaving this report open so that a proper fix can be designed.
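
As a rough illustration of what a build-dependent guard like this could look like, the sketch below checks sys.maxunicode, which is 0xFFFF on narrow (UCS-2) builds and 0x10FFFF on wide (UCS-4) builds. The function name use_big_charset is hypothetical, and this is not the actual sre_compile code, only an assumption about the general shape of such a check.

```python
import sys


def use_big_charset(max_code_point=sys.maxunicode):
    """Hypothetical guard: allow the big-charset path only on narrow builds.

    max_code_point defaults to sys.maxunicode, the largest code point
    the running interpreter supports.
    """
    # Assumption: the optimization is only trusted when every code
    # point fits in 16 bits, i.e. on a UCS-2 build.
    return max_code_point <= 0xFFFF


print(use_big_charset())
```

Note that Python 3.3 and later always report sys.maxunicode as 0x10FFFF, so this check would always take the wide-build path there.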
msg12148 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2003-06-14 15:10
This is now fixed for Python 2.3, with _sre.c 2.89.
History
Date                 User   Action  Args
2022-04-10 16:05:37  admin  set     github: 37080
2002-08-23 19:16:04  dcjim  create