classification
Title: \b in unicode regex gives strange results
Components: Regular Expressions
Versions: Python 2.5

process
Status: closed
Resolution: not a bug
Assigned To: niemeyer
Nosy List: akaihola, georg.brandl, loewis, niemeyer
Priority: normal

Created on 2006-12-07 21:44 by akaihola, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (5)
msg30755 - Author: akaihola (akaihola) Date: 2006-12-07 21:44
The problem: This doesn't give a match:
>>> re.match(r'ä\b', 'ä ', re.UNICODE)

This works ok and gives a match:
>>> re.match(r'.\b', 'ä ', re.UNICODE)

Both of these work as well:
>>> re.match(r'a\b', 'a ', re.UNICODE)
>>> re.match(r'.\b', 'a ', re.UNICODE)

Docs say \b is defined as an empty string between \w and \W. These do match accordingly:
>>> re.match(r'\w', 'ä', re.UNICODE)
>>> re.match(r'\w', 'a', re.UNICODE)
>>> re.match(r'\W', ' ', re.UNICODE)

So something strange happens in my first example, and I can't help but assume it's a bug.

msg30756 - Author: akaihola (akaihola) Date: 2006-12-07 22:18
As a work-around I currently use a regex like r'ä(?=\W)'. Seems to work ok.

Also, the \b problem doesn't seem to exist in the \W\w case, i.e. at the beginning of words.
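
The lookahead work-around is not a drop-in replacement for \b: a lookahead such as (?=\W) needs a following character, while \b also matches at the very end of the string. The sketch below illustrates the difference. It assumes Python 3, where str patterns use Unicode character classes by default; the variable names are purely illustrative.

```python
import re

# In the middle of a string, both forms match after 'ä'.
boundary_mid = re.match(r'ä\b', 'ä ')
lookahead_mid = re.match(r'ä(?=\W)', 'ä ')

# At the end of the string, \b still matches, but (?=\W) has no
# character left to look at and therefore fails.
boundary_end = re.match(r'ä\b', 'ä')
lookahead_end = re.match(r'ä(?=\W)', 'ä')

# A negative lookahead, "not followed by a word character",
# also succeeds at the end of the string.
negative_end = re.match(r'ä(?!\w)', 'ä')

print(bool(boundary_mid), bool(lookahead_mid))
print(bool(boundary_end), bool(lookahead_end), bool(negative_end))
```

If the match must also succeed at the end of the input, (?!\w) is probably the closer substitute for a trailing \b.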

msg30757 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2006-12-08 17:18
Notice that the re.UNICODE flag is only meaningful if you are using Unicode strings; in the examples you give, you are using byte strings.

Please re-test with Unicode strings both as the expression and as the string to match.
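
For readers on Python 3, the same distinction between text and byte strings can be checked with the sketch below. It is only an approximation of the Python 2.5 situation discussed here: in Python 3, bytes patterns always use ASCII-only character classes, and combining re.UNICODE with a bytes pattern raises ValueError. The variable names are illustrative.

```python
import re

text = 'ä '                    # text string (str)
data = text.encode('utf-8')    # the same text as UTF-8 bytes: b'\xc3\xa4 '

# str pattern on str input: 'ä' is a word character, so \b matches.
str_match = re.match(r'ä\b', text)

# bytes pattern on bytes input: neither b'\xa4' nor b' ' is an ASCII
# word character, so there is no boundary and no match.
bytes_match = re.match('ä'.encode('utf-8') + rb'\b', data)

# Python 3 rejects re.UNICODE on a bytes pattern outright.
try:
    re.compile(rb'\w', re.UNICODE)
    unicode_flag_allowed = True
except ValueError:
    unicode_flag_allowed = False

print(bool(str_match), bool(bytes_match), unicode_flag_allowed)
```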

msg30758 - Author: Georg Brandl (georg.brandl) (Python committer) Date: 2006-12-08 20:51
FWIW, the first example works fine for me with and without Unicode strings.

msg30759 - Author: akaihola (akaihola) Date: 2006-12-14 00:30
Ok so this does work:
>>> re.match(ur'ä\b', u'ä ', re.UNICODE)

If I understand correctly, I was comparing UTF-8 encoded strings in my examples (my Ubuntu is UTF-8 by default) and regex special operators just don't work in that domain.
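
One plausible way to see why the byte-string examples behaved as reported is to treat each UTF-8 byte as if it were a code point of its own. This rests on an assumption, not something confirmed in this thread: that Python 2's re.UNICODE applied Unicode character classes to individual byte values. The sketch below, written for Python 3, emulates that by decoding the UTF-8 bytes as Latin-1; the variable names are illustrative.

```python
import re

# 'ä' in UTF-8 is the two bytes 0xC3 0xA4. Read as separate code points
# (emulated here via Latin-1), they become 'Ã' (a letter) and '¤' (a symbol).
raw = 'ä '.encode('utf-8').decode('latin-1')      # 'Ã¤ '
literal = 'ä'.encode('utf-8').decode('latin-1')   # 'Ã¤'

first_is_word = re.match(r'\w', raw[0]) is not None    # 'Ã' is a word character
second_is_word = re.match(r'\w', raw[1]) is not None   # '¤' is not

# 'Ã¤' followed by \b: '¤' and ' ' are both non-word, so no boundary.
literal_then_boundary = re.match(re.escape(literal) + r'\b', raw)

# '.' consumes only 'Ã'; the boundary between 'Ã' and '¤' then matches.
dot_then_boundary = re.match(r'.\b', raw)

print(first_is_word, second_is_word)
print(literal_then_boundary, dot_then_boundary.group())
```

Under that assumption, the first example fails because the boundary is tested after the second byte, while '.\b' and '\w' succeed because they only ever look at the first byte.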

History
Date                 User      Action  Args
2022-04-11 14:56:21  admin     set     github: 44315
2006-12-07 21:44:28  akaihola  create