classification
Title: RE parser too loose with {m,n} construct
Type: Stage:
Components: Regular Expressions Versions: Python 2.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: niemeyer Nosy List: georg.brandl, josiahcarlson, niemeyer, rhettinger, skip.montanaro
Priority: normal Keywords:

Created on 2005-05-15 21:59 by skip.montanaro, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
sre-brace-diff-2 georg.brandl, 2005-06-01 21:32
sre-brace-diff georg.brandl, 2005-06-03 08:01
Messages (19)
msg25327 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2005-05-15 21:59
This seems wrong to me:

>>> re.match("(UNIX{})", "UNIX{}").groups()
('UNIX',)

With no numbers or commas, "{}" should not be considered
special in the pattern.  The docs identify three numeric
repetition possibilities: {m}, {m,} and {m,n}.  There's no
description of {} meaning anything.  Either the docs should
say {} implies {1,1}, {} should have no special meaning, or
an exception should be raised during compilation of the
regular expression.
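
For reference, a minimal sketch of the constructs under discussion, assuming a Python 3 interpreter that already includes the fix this issue eventually received. The three documented forms repeat the preceding item, while an empty "{}" is matched literally:

```python
import re

# The three documented repetition forms: {m}, {m,} and {m,n}.
exact = re.fullmatch(r"a{2}", "aa")        # exactly two
at_least = re.fullmatch(r"a{2,}", "aaaa")  # two or more
bounded = re.fullmatch(r"a{2,3}", "aaa")   # between two and three

# An empty "{}" is not a repeat specifier, so both braces match literally.
groups = re.match(r"(UNIX{})", "UNIX{}").groups()
print(groups)  # ('UNIX{}',)
```

On the Python 2.5-era parser described in the report, the last call returned ('UNIX',) instead.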
msg25328 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2005-06-01 16:54
It's interesting what other RE implementations do with this
ambiguity:
Perl treats {} as literal in REs, as Skip proposes.
Ruby does, too, but issues a warning about } being unescaped.
GNU (e)grep v2.5.1 allows a bare {} only if it is at the
start of a RE, but matches it literally then.
GNU sed v4.1.4 never allows it.
GNU awk v3.1.4 is gracious and acts like Perl.

Attached is a patch that fixes this behaviour in what
appears to be the "common sense" way.
msg25329 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2005-06-01 20:25
IMO, the simplest rule is that braces always be considered
special.  This accommodates future extensions, simplifies
the re compiler, and makes it easier to know what needs to
be escaped.
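
Whichever rule the parser adopts, escaping the braces sidesteps the ambiguity. A small sketch using the standard re.escape helper (the sample string is taken from the original report):

```python
import re

# re.escape backslash-escapes the braces, so they can only be read as literals.
escaped = re.escape("UNIX{}")
match = re.fullmatch(escaped, "UNIX{}")
print(escaped)
```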
msg25330 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2005-06-01 20:30
So, should a {} raise an error, or warn like in Ruby?
msg25331 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2005-06-01 21:07
I prefer Skip's third option, raising an exception during
compilation.  This is an re syntax error.  Treat it the same
way that we handle similar situations with regular Python:

>>> a[]
SyntaxError: invalid syntax
msg25332 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2005-06-01 21:32
Okay. Attaching patch which does that.

BTW, these things are currently allowed too (treated as
literals):

"{"
"{x"
"{x}"
"{x,y}"
"{1,x}"
etc.

The patch changes that, too.
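
A quick way to check how an interpreter treats the constructs listed above is sketched below. It assumes a Python 3 interpreter where non-repeat brace constructs are kept as literals; each construct is prefixed with "a" so that the brace follows an ordinary character:

```python
import re

# Brace constructs from the list above; none is a valid {m}, {m,} or {m,n}.
non_repeats = ["{", "{x", "{x}", "{x,y}", "{1,x}"]

# If a construct is treated literally, the pattern matches its own text.
results = {p: re.fullmatch("a" + p, "a" + p) is not None for p in non_repeats}
for construct, literal in results.items():
    print(construct, "literal" if literal else "not literal")
```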
msg25333 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2005-06-02 11:16
In the absence of strong technical reasons, I'd vote to do what Perl
does.  I believe the assumption all along has been that most people 
coming to Python who already know how to use regular expressions are 
Perl programmers.  It wouldn't seem to make sense to throw little land
mines in their paths.  I realize that explicit is better than implicit, but
practicality beats purity.
msg25334 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2005-06-03 08:01
I just realized that e.g. the string module uses unescaped
braces, so I think we should not become overly strict as it
would break much code...

Perhaps the original patch (sre-brace-diff) is better...
msg25335 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2005-06-03 15:13
Can you elaborate?  I fail to see what the string module
has to do with the re module.  Can you give an example
of code that would break?
msg25336 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2005-06-03 18:00
Raymond said that braces should always be considered
special. This includes constructs like "{(?P<braces>.*)}"
which the string module uses, and which would be a syntax
error then.
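
The kind of pattern at stake can be sketched as follows, using the construct quoted above (the input string "{name}" is only an illustrative assumption). Because the opening brace is not followed by a repeat count, it is read as a literal:

```python
import re

# Unescaped braces around a named group, as in the construct quoted above.
pattern = re.compile(r"{(?P<braces>.*)}")

match = pattern.match("{name}")
captured = match.group("braces")
print(captured)  # name
```

Under an "always special" rule, compiling this pattern would fail unless the braces were escaped.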
msg25337 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2005-06-03 18:46
Hmm, it looks like they cannot be treated differently
without breaking backwards compatibility.
msg25338 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2005-06-03 19:10
Then, I think, we should follow Perl's behaviour and treat
"{}" as a literal, just like every other brace construct
that isn't a repeat specifier.
msg25339 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2005-08-31 21:55
Any more objections against treating "{}" as literal?

The impact on existing code will be minimal, as I presume no
one will write "{}" in a RE instead of "{1,1}" (well, who
writes "{1,1}" anyway...).
msg25340 - (view) Author: Gustavo Niemeyer (niemeyer) * (Python committer) Date: 2005-08-31 22:11
I support Skip's opinion on following whatever Perl is currently doing, if
that won't lead to unexpected errors in currently running code which was
considered sane (expecting {} to behave like {1,1} is not sane :-).

Your original patch looks suboptimal though (look at the tests around
it). I'll fix it, or if you prefer to do it yourself, I may apply the
patch/review it/whatever. :-)
msg25341 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2005-08-31 22:16
No, you're the expert, so you'll get the honor of fixing it. :P
msg25342 - (view) Author: Gustavo Niemeyer (niemeyer) * (Python committer) Date: 2005-09-14 08:58
Fixed in:

Lib/sre_parse.py: 1.64 -> 1.65
Lib/test/test_re.py: 1.55 -> 1.56
Misc/NEWS: 1.1360 -> 1.1361

Notice that perl will also handle constructs like '{,2}' as
literals, while Python will consider them as '{0,2}'. I
think it's too late to change that one though, as this
behavior may be relied upon in code out there.
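
The remaining difference from Perl noted above can be sketched as follows, assuming a Python 3 interpreter, where an omitted lower bound defaults to zero:

```python
import re

# '{,2}' is read as '{0,2}': zero to two repetitions of the preceding item.
as_repeat = re.fullmatch(r"a{,2}", "aa") is not None
empty_ok = re.fullmatch(r"a{,2}", "") is not None

# It is therefore not matched as the literal text "{,2}".
as_literal = re.fullmatch(r"a{,2}", "a{,2}") is not None

print(as_repeat, empty_ok, as_literal)  # True True False
```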
msg25343 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2005-09-14 10:58
Will you backport the fix?
msg25344 - (view) Author: Josiah Carlson (josiahcarlson) * (Python triager) Date: 2005-09-15 06:07
Was it a bug, or was it merely confusing semantics?
msg25345 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2005-09-15 06:12
I would say bug.
History
Date User Action Args
2022-04-11 14:56:11 admin set github: 41987
2005-05-15 21:59:16 skip.montanaro create