
classification
Title: re.LOCALE, umlaut and \w
Type:
Stage:
Components: Regular Expressions
Versions: Python 2.3
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: effbot
Nosy List: effbot, glchapman, loewis, peterno
Priority: normal
Keywords:

Created on 2003-02-22 00:06 by peterno, last changed 2022-04-10 16:07 by admin. This issue is now closed.

Messages (3)
msg14771 - Author: peter nordlund (peterno) Date: 2003-02-22 00:06
I am submitting this problem although I am not sure it is a real bug. It could be that I don't understand how the locale machinery works.

Anyway, I have spent quite some time browsing the net for good examples of code demonstrating how to use regular expressions in Python to match åäö with \w, but I have not found any complete examples.

If the code below behaves correctly, I suggest that the regexp documentation be improved by adding a complete example that shows how to use re.LOCALE. (The code behaves the same way with Python 2.2.2.)

#----------------------------------------
import locale
locale.setlocale(locale.LC_ALL,'swedish')
import re
# I expect reguml and regw to give the same result.
reguml=re.compile(r"[a-zä]", re.LOCALE)
regw=re.compile(r"\w", re.LOCALE)
# I expect reguml2 and regw2 to give the same result.
reguml2=re.compile(r"[a-zä]+", re.LOCALE)
regw2=re.compile(r"[\w]+", re.LOCALE)
str="abcä d\344e ä f ";

print reguml.findall(str) # Behaves as I expect.
print regw.findall(str) # Here I expect same result as above, but I don't get it.
print reguml2.findall(str) # Behaves as I expect.
print regw2.findall(str) # Behaves as I expect.
#----------------------------------------



>>> import locale
>>> locale.setlocale(locale.LC_ALL,'swedish')
'swedish'
>>> import re
>>> reguml=re.compile(r"[a-zä]", re.LOCALE) # I expect reguml and regw to give the same result.
>>> regw=re.compile(r"\w", re.LOCALE)
>>> reguml2=re.compile(r"[a-zä]+", re.LOCALE) # I expect reguml2 and regw2 to give the same result.
>>> regw2=re.compile(r"[\w]+", re.LOCALE)
>>> str="abcä d\344e ä f ";
>>>
>>> print reguml.findall(str) # Behaves as I expect.
['a', 'b', 'c', '\xe4', 'd', '\xe4', 'e', '\xe4', 'f']
>>> print regw.findall(str) # Here I expect same result as above, but I don't get it.
['a', 'b', 'c', 'd', 'e', 'f']
>>> print reguml2.findall(str) # Behaves as I expect.
['abc\xe4', 'd\xe4e', '\xe4', 'f']
>>> print regw2.findall(str) # Behaves as I expect.
['abc\xe4', 'd\xe4e', '\xe4', 'f']
---------------------------------------------------------
peternl:Python-2.3a2>> /work1/pkg/dev-tools/python/2.3a2/bin/python -V
Python 2.3a2
peternl:Python-2.3a2>>uname -a
Linux peternl.computervision.se 2.4.18-6mdk-petern #2 Thu May 23 06:40:30 CEST 2002 i686 unknown
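
For readers on Python 3, the expectation stated in the report (that \w and [\w] accept the same characters) can be checked without any locale setup, because str patterns are Unicode-aware by default. The following is only a minimal sketch and not part of the original report; the helper name find_both is hypothetical, and the sample is the report's string written as a Python 3 str.

```python
import re

# Same text as in the report; \u00e4 is "ä" (byte 0xE4 in Latin-1).
SAMPLE = "abc\u00e4 d\u00e4e \u00e4 f "


def find_both(text):
    """Return (matches of \\w, matches of [\\w]) for a str input."""
    single = re.findall(r"\w", text)
    in_class = re.findall(r"[\w]", text)
    return single, in_class


single, in_class = find_both(SAMPLE)
print(single)                        # single-character matches, umlauts included
print(single == in_class)            # both spellings should agree
print(re.findall(r"[\w]+", SAMPLE))  # whole words
```

With str patterns no re.LOCALE flag is involved, so this shows the expected result rather than the locale-dependent code path the report is about.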
msg14772 - Author: Greg Chapman (glchapman) Date: 2003-02-22 17:15
I believe this is fixed by this patch:

   http://www.python.org/sf/633359

At any rate, using a patched 2.2.2, regw behaves identically to reguml.
msg14773 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2003-04-19 08:14
This has been fixed with Greg's patch.
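
The property the fix restores, that \w and [\w] agree under re.LOCALE, can be sketched for Python 3 as below. This is an illustration under stated assumptions, not code from the patch: in Python 3 re.LOCALE is accepted only with bytes patterns, the candidate locale names are guesses whose availability varies by platform, and the helper names set_first_available_locale and w_forms_agree are hypothetical.

```python
import locale
import re

# Assumed names for a Swedish Latin-1 locale; none may exist on a given system.
CANDIDATE_LOCALES = ("sv_SE.ISO8859-1", "sv_SE.ISO-8859-1", "swedish")


def set_first_available_locale(names=CANDIDATE_LOCALES):
    """Try each name for LC_CTYPE; return the one that worked, else None."""
    for name in names:
        try:
            locale.setlocale(locale.LC_CTYPE, name)
            return name
        except locale.Error:
            continue
    return None


# Python 3 accepts re.LOCALE only for bytes patterns.
SINGLE_W = re.compile(rb"\w", re.LOCALE)
CLASS_W = re.compile(rb"[\w]", re.LOCALE)


def w_forms_agree(data):
    """Return True if \\w and [\\w] match the same bytes in data."""
    return SINGLE_W.findall(data) == CLASS_W.findall(data)


if __name__ == "__main__":
    chosen = set_first_available_locale()
    sample = b"abc\xe4 d\xe4e \xe4 f "  # the report's string as Latin-1 bytes
    print("locale in use:", chosen or "unchanged")
    print(SINGLE_W.findall(sample))
    print(w_forms_agree(sample))
```

Whether the byte 0xE4 counts as a word character depends on the locale actually in effect, so the sketch only checks that the two spellings agree and makes no claim about a fixed list of matches.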
History
Date                 User     Action  Args
2022-04-10 16:07:01  admin    set     github: 38028
2003-02-22 00:06:40  peterno  create