The effects of the re.UNICODE flag are incorrectly
documented in the library reference. Currently it says
(Section 4.2.3):
<snip>
U
UNICODE
Make \w, \W, \b, and \B dependent on the Unicode
character properties database. New in version 2.0.
</snip>
But this flag in fact also affects \d, \D, \s, and \S
at least since Python 2.1 (I have checked 2.1.3 on
Linux, 2.2.3, 2.3.5 and 2.4 on OS X and the source of
_sre.c makes this obvious). Proof:
Python 2.4 (#1, Feb 13 2005, 18:29:12)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1666)] on
darwin
Type "help", "copyright", "credits" or "license" for
more information.
>>> import re
>>> not re.match(r"\d", u"\u0966")
True
>>> re.match(r"\d", u"\u0966", re.UNICODE)
<_sre.SRE_Match object at 0x36ee20>
>>> not re.match(r"\s", u"\u2001")
True
>>> re.match(r"\s", u"\u2001", re.UNICODE)
<_sre.SRE_Match object at 0x36ee20>
\u0966 is some Indian digit, \u2001 is an em space.
I propose to change the docs to:
<snip>
U
UNICODE
Make \w, \W, \b, \B, \d, \D, \s, and \S dependent on
the Unicode character properties database. New in
version 2.0.
</snip>
Maybe the documentation of \d, \D, \s, and \S in
section 2.4.1 of the library reference should also be
adapted.
|