Issue 1611131: \b in unicode regex gives strange results


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/44315

classification

Title:	\b in unicode regex gives strange results
Type:		Stage:
Components:	Regular Expressions	Versions:	Python 2.5

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:	niemeyer	Nosy List:	akaihola, georg.brandl, loewis, niemeyer
Priority:	normal	Keywords:

Created on 2006-12-07 21:44 by akaihola, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (5)
msg30755	Author: akaihola (akaihola)	Date: 2006-12-07 21:44
The problem:

This doesn't give a match:
>>> re.match(r'ä\b', 'ä ', re.UNICODE)

This works ok and gives a match:
>>> re.match(r'.\b', 'ä ', re.UNICODE)

Both of these work as well:
>>> re.match(r'a\b', 'a ', re.UNICODE)
>>> re.match(r'.\b', 'a ', re.UNICODE)

Docs say \b is defined as an empty string between \w and \W. These do match accordingly:
>>> re.match(r'\w', 'ä', re.UNICODE)
>>> re.match(r'\w', 'a', re.UNICODE)
>>> re.match(r'\W', ' ', re.UNICODE)

So something strange happens in my first example, and I can't help but assume it's a bug.
msg30756	Author: akaihola (akaihola)	Date: 2006-12-07 22:18
As a work-around I currently use a regex like r'ä(?=\W)'. Seems to work ok. Also, the \b problem doesn't seem to exist in the \W\w case, i.e. at the beginning of words.
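For readers comparing the two forms, here is a minimal sketch of how the lookahead work-around differs from \b. It assumes Python 3, where str literals are already Unicode (the thread itself is about Python 2.5), and the names `lookahead_or_end` and the other variables are purely illustrative. The main difference is that (?=\W) needs an actual non-word character to follow, so it does not match at the very end of the string, while \b does.

```python
import re

# Work-around from the message above: require a following non-word character.
lookahead = re.compile(r'ä(?=\W)')
# The form the reporter originally wanted to use.
boundary = re.compile(r'ä\b')

followed_by_space = 'ä '
at_end_of_string = 'ä'

# Both forms match when a space follows the letter.
lookahead_mid = lookahead.match(followed_by_space) is not None   # True
boundary_mid = boundary.match(followed_by_space) is not None     # True

# At the end of the string there is no character for \W to consume,
# so the lookahead fails while \b still reports a boundary.
lookahead_end = lookahead.match(at_end_of_string) is not None    # False
boundary_end = boundary.match(at_end_of_string) is not None      # True

# Hypothetical variant (not from the thread) that also accepts end of string.
lookahead_or_end = re.compile(r'ä(?=\W|\Z)')
```

If the word can end the input, the \Z alternative (or \b on real Unicode strings) is probably the safer choice.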
msg30757	Author: Martin v. Löwis (loewis) *	Date: 2006-12-08 17:18
Notice that the re.UNICODE flag is only meaningful if you are using Unicode strings; in the examples you give, you are using byte strings. Please re-test with Unicode strings both as the expression and as the string to match.
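The same distinction can be sketched in Python 3, with the caveat that Python 3 is an assumption here and behaves differently from the 2.5 interpreter in this report: str patterns are Unicode by default, bytes patterns only treat ASCII characters as word characters, and combining re.UNICODE with a bytes pattern is rejected outright. The variable names are illustrative.

```python
import re

text = 'ä '
pattern = r'ä\b'

# Unicode pattern and Unicode subject: 'ä' is a word character,
# so there is a boundary between 'ä' and the space.
unicode_match = re.match(pattern, text)

# The same data as UTF-8 bytes: 'ä' becomes the two bytes \xc3 \xa4.
# Neither \xa4 nor the space is an ASCII word character, so \b fails.
bytes_match = re.match(pattern.encode('utf-8'), text.encode('utf-8'))

# Python 3 refuses the UNICODE flag on a bytes pattern.
try:
    re.compile(b'a', re.UNICODE)
    unicode_flag_rejected = False
except ValueError:
    unicode_flag_rejected = True

print(unicode_match is not None)  # True
print(bytes_match is None)        # True
print(unicode_flag_rejected)      # True
```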
msg30758	Author: Georg Brandl (georg.brandl) *	Date: 2006-12-08 20:51
FWIW, the first example works fine for me with and without Unicode strings.
msg30759	Author: akaihola (akaihola)	Date: 2006-12-14 00:30
Ok so this does work:
>>> re.match(ur'ä\b', u'ä ', re.UNICODE)

If I understand correctly, I was comparing UTF-8 encoded strings in my examples (my Ubuntu is UTF-8 by default) and regex special operators just don't work in that domain.
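One possible reading of the original results, offered as a sketch rather than a confirmed diagnosis: if Python 2's re module, given a byte string and re.UNICODE, classified each byte as the code point with the same value (effectively Latin-1), then the UTF-8 encoding of 'ä' would look like the two characters 'Ã' (a letter) and '¤' (a currency sign, not a word character). That assumption is not stated anywhere in the thread. The Python 3 code below emulates it by decoding the UTF-8 bytes as Latin-1, and the outcome lines up with the observations in the first message.

```python
import re

# UTF-8 bytes of 'ä ' reinterpreted one byte per character (assumed model
# of how Python 2 saw the byte string under re.UNICODE).
utf8_bytes = 'ä '.encode('utf-8')               # b'\xc3\xa4 '
as_code_points = utf8_bytes.decode('latin-1')   # 'Ã¤ '

first, second = as_code_points[0], as_code_points[1]
first_is_word = re.match(r'\w', first) is not None    # 'Ã' is a letter: True
second_is_word = re.match(r'\w', second) is not None  # '¤' is a symbol: False

# r'ä\b' on bytes: after '¤' comes a space, two non-word characters in a
# row, so there is no boundary and no match.
literal_then_boundary = re.match(as_code_points[:2] + r'\b', as_code_points)

# r'.\b' on bytes: '.' consumes 'Ã', and the boundary sits between
# 'Ã' (word) and '¤' (non-word), so this matches.
dot_then_boundary = re.match(r'.\b', as_code_points)

print(literal_then_boundary is None)   # True
print(dot_then_boundary is not None)   # True
```

Under this model the "match" in the second example only covered the first byte of 'ä', not the whole character.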

History
Date	User	Action	Args
2022-04-11 14:56:21	admin	set	github: 44315
2006-12-07 21:44:28	akaihola	create