This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: maximum recursion limit exceeded in match
Type: Stage:
Components: Regular Expressions Versions:
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: niemeyer Nosy List: effbot, gvanrossum, nanotech, niemeyer, peterdejong, tim.peters
Priority: low Keywords:

Created on 2001-12-14 10:54 by peterdejong, last changed 2022-04-10 16:04 by admin. This issue is now closed.

Messages (16)
msg8182 - (view) Author: P. de Jong (peterdejong) Date: 2001-12-14 10:54
RuntimeError: maximum recursion limit exceeded in match,
while trying to match a string of 16384 bytes (Python 2.0).

The error does not occur after eliminating some 100
characters from the string.

The error does not occur in Python 1.5.2, so I cannot
upgrade.

Peter de Jong
msg8183 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2001-12-14 13:53

Hi Peter!

This usually happens when the pattern contains * or + in a
way that causes more backtracking than one would naively
expect. Can you show us the pattern? There's usually an easy
way to rewrite the pattern so that it won't overflow -- and
it will be faster too...

BTW I would upgrade to Python 2.1.1 -- that's the most
stable release to date.
msg8184 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-12-14 15:57

Assigned to /F.  Echo Guido's belief that the regexp can 
almost certainly be rewritten to avoid this and run much 
faster at the same time.
msg8185 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2001-12-14 16:32

Three additional comments:

1) The fact that you get this under 2.0 indicates that your 
regular expression doesn't run very well under 1.5.2 
either; it can most likely be rewritten to be much faster, 
and use much less memory.

2) But if that's okay, you can always work around this by 
replacing "import re" with "import pre as re" (or "import 
pre; re = pre")

3) This will be fixed in future versions.

</F>
msg8186 - (view) Author: P. de Jong (peterdejong) Date: 2001-12-15 13:23

Hi Guido!

The regular expression used was:

'(?s)"(.*?)"(;|\n)'

Peter
msg8187 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2001-12-15 16:03

Do you really *want* that pattern to match input like this:

    "xxx"xxx";

??? 

If not, replace the (.*?) with ([^"]*) -- this will
dramatically reduce the amount of backtracking generated if
there are unbalanced quotes in the input.
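
A minimal sketch (added for illustration, using Guido's example
input above) of the difference: the non-greedy (.*?) crawls past
the embedded quote one character at a time and still matches,
while ([^"]*) cannot cross it, so the match simply fails:

import re

lazy    = re.compile('(?s)"(.*?)"(;|\n)')
negated = re.compile('(?s)"([^"]*)"(;|\n)')

text = '"xxx"xxx";'

# The lazy dot backtracks across the embedded quote.
print(lazy.match(text).group(1))    # xxx"xxx

# The negated class stops at the embedded quote: no match at all.
print(negated.match(text))          # None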
msg8188 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-12-15 19:08

Peter, assuming Guido is correct that you did not intend to 
match strings with embedded double quotes, here's a correct 
(and much more efficient) replacement for your regexp:

r'"[^"\n]*"[;\n]'

Note that (contra Guido's suggestion <wink>), a newline has 
to be part of the negated character class, because you 
asked for single-line mode so presumably didn't want 
your .*? to match any newlines either.
msg8189 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2001-12-16 01:30

Tim, Peter's regex starts with (?s) which is the same as
compiling with re.S or re.DOTALL which makes '.' match
newline -- the opposite of what you seem to think (?s)
means.
msg8190 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-12-16 04:34

Oops!  I used to make the same mistake in Perl too (which
is why I pushed to name the symbolic flag DOTALL instead of
SINGLELINE).  So the correct regexp is

r'"[^"]*"[;\n]'

Assuming, of course, that Peter does want embedded newlines 
to get sucked up.
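
A small sketch (example strings invented here) of the corrected
pattern; even without (?s), the negated class crosses newlines,
because only the quote character is excluded:

import re

pattern = re.compile(r'"[^"]*"[;\n]')

# Newlines inside the quotes are consumed by [^"]*.
print(pattern.match('"abc\ndef";') is not None)   # True

# An embedded quote stops the class, so this does not match.
print(pattern.match('"abc"def";'))                # None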
msg8191 - (view) Author: P. de Jong (peterdejong) Date: 2001-12-17 09:02

Tim and Guido,

I want to match things like:

"jlkjkl ""kjlklkjkl""ljk
;lkk;l";
or:
"jlkjkl ""kjlklkjkl""ljk
;lkk;l"

Maybe I can trust the embedded quotes to be doubled,
but I was not sure at the moment of writing the
expression. I do have to include newlines inside the 
enclosing quotes.

Peter de Jong

msg8192 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2001-12-17 14:08

Peter, if the internal quotes are doubled, try this:

r'"([^"]|"")*"[;\n]'

if you can't trust them to be doubled, you're in trouble --
your language is ambiguous. :-(
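
A sketch (input modelled on Peter's first sample) of the
doubled-quote pattern:

import re

pattern = re.compile(r'"([^"]|"")*"[;\n]')

text = '"jlkjkl ""kjlklkjkl""ljk\n;lkk;l";'

m = pattern.match(text)
print(m.group(0) == text)   # True: the whole field, doubled quotes included

Note that group 1 only holds the last repetition of the
alternation; to capture the field body an extra group around the
repeated part would be needed, e.g. r'"((?:[^"]|"")*)"[;\n]'.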
msg8193 - (view) Author: P. de Jong (peterdejong) Date: 2001-12-17 16:47

Guido,
Indeed, in many cases my language is ambiguous.
I will try your solution.
It is not clear to me what induces the recursion;
maybe something like
'(?s)"(.*?)(";|"\n)'
will not give the recursion cascade?
msg8194 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2001-12-17 17:00

I don't think that would make a difference. I don't
understand SRE enough to know where the recursion cascade
comes from; maybe Tim or Fredrik can explain? I believe it
has something to do with an unexpected behavior of the *?
operator.

Note that it's a well-known theoretical result that parsing
ambiguous languages can be O(n**3), so you might want to
rethink this decision.
msg8195 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2001-12-17 18:15

I'm afraid Guido's rewrite stacks a backtracking point for 
each character, so it can still die on strings of the 
length you're looking at.  For example, here's a ~16KB 
string that kills it on Windows:

test = '"' + ('a' * 128 + '""') * 128 + '";'

The only info I know of on how to write robust regexps is 
in Friedl's "Mastering Regular Expressions" book, which 
does an excellent job.  Using his "unrolling" pattern leads 
to the regexp

r'"[^"]*(""[^"]*)*"[;\n]'

which is an instance of the general

normal* (special normal*)*

pattern, and reduces the number of stacked backtracking 
points from the number of characters in the string to the 
number of special strings within it (given various 
preconditions that happen to be satisfied here -- you 
really need to read the book, as it resists a pithy 
summary).

That works fine with the test string above, and even if you 
change it to

test = '"' + ('a' * 5000 + '""') * 5000 + '";'

At that point you're matching a 25MB string, which should 
be big enough for most web use <wink>.
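
A runnable sketch of the test above using the unrolled pattern
(both strings match quickly on a modern interpreter; the
RuntimeError itself only reproduces on the recursive 2.0-2.2
engines):

import re

unrolled = re.compile(r'"[^"]*(""[^"]*)*"[;\n]')

# ~16KB: 128 runs of 128 'a's separated by doubled quotes.
test = '"' + ('a' * 128 + '""') * 128 + '";'
print(unrolled.match(test) is not None)   # True

# The ~25MB variant from the end of the message.
big = '"' + ('a' * 5000 + '""') * 5000 + '";'
print(unrolled.match(big) is not None)    # True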
msg8196 - (view) Author: Quentin Crain (nanotech) Date: 2002-09-10 16:12

I am hoping this is a useful comment/code:

import re

s1=('macro\n'+'a'*200+'\norcam\n')*10
s2=('macro\n'+'a'*20000+'\norcam\n')*10

p=re.compile(r'macro.*?orcam',re.DOTALL)

for x in re.findall(p,s1):
    print x

for x in re.findall(p,s2):
    print x

Using re.findall bit the users I support because it could
not scale with the data.

Thanks for your efforts!

    Quentin Crain
msg8197 - (view) Author: Gustavo Niemeyer (niemeyer) * (Python committer) Date: 2003-05-24 16:55

As Gary Herron correctly pointed out to me, this was fixed in
2.3 with the introduction of a new opcode to handle
single-character non-greedy matching.

This won't be fixed in 2.2.3, but hopefully will be
backported to 2.2.4 together with other regular expression
fixes.
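
A minimal sketch (assuming Python 2.3 or later, where the new
single-character non-greedy opcode applies) showing the original
pattern handling input of roughly the reported size without
hitting a recursion limit:

import re

pattern = re.compile('(?s)"(.*?)"(;|\n)')

# A quoted field of about the size that triggered the report.
text = '"' + 'x' * 16384 + '";'

m = pattern.match(text)
print(len(m.group(1)))   # 16384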
History
Date                 User         Action  Args
2022-04-10 16:04:46  admin        set     github: 35740
2001-12-14 10:54:20  peterdejong  create