Issue493252
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2001-12-14 10:54 by peterdejong, last changed 2022-04-10 16:04 by admin. This issue is now closed.
Messages (16) | |||
---|---|---|---|
msg8182 - (view) | Author: P. de Jong (peterdejong) | Date: 2001-12-14 10:54 | |
RuntimeError: maximum recursion limit exceeded in match, while trying to match a string of 16384 bytes. (Python 2.0) The error does not occur after eliminating some 100 characters from the string. The error does not occur in Python 1.5.2. So I cannot upgrade. Peter de Jong |
|||
msg8183 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2001-12-14 13:53 | |
Logged In: YES user_id=6380 Hi Peter! This usually happens when the pattern contains * or + in a way that causes more backtracking than one would naively expect. Can you show us the pattern? There's usually an easy way to rewrite the pattern so that it won't overflow -- and it will be faster too... BTW I would upgrade to Python 2.1.1 -- that's the most stable release to date. |
|||
msg8184 - (view) | Author: Tim Peters (tim.peters) * | Date: 2001-12-14 15:57 | |
Logged In: YES user_id=31435 Assigned to /F. Echo Guido's belief that the regexp can almost certainly be rewritten to avoid this and run much faster at the same time. |
|||
msg8185 - (view) | Author: Fredrik Lundh (effbot) * | Date: 2001-12-14 16:32 | |
Logged In: YES user_id=38376 Three additional comments: 1) The fact that you get this under 2.0 indicates that your regular expression doesn't run very well under 1.5.2 either; it can most likely be rewritten to be much faster, and use much less memory. 2) But if that's okay, you can always work around this by replacing "import re" with "import pre as re" (or "import pre; re = pre") 3) This will be fixed in future versions. </F> |
|||
msg8186 - (view) | Author: P. de Jong (peterdejong) | Date: 2001-12-15 13:23 | |
Logged In: YES user_id=402001 Hi Guido! The regular expression used was: '(?s)"(.*?)"(;|\n)' Peter |
|||
msg8187 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2001-12-15 16:03 | |
Logged In: YES user_id=6380 Do you really *want* that pattern to match input like this: "xxx"xxx"; ??? If not, replace the (.*?) with ([^"]*) -- this will dramatically reduce the amount of backtracking generated if there are unbalanced quotes in the input. |
|||
msg8188 - (view) | Author: Tim Peters (tim.peters) * | Date: 2001-12-15 19:08 | |
Logged In: YES user_id=31435 Peter, assuming Guido is correct that you did not intend to match strings with embedded double quotes, here's a correct (and much more efficient) replacement for your regexp: r'"[^"\n]*"[;\n]' Note that (contra Guido's suggestion <wink>), a newline has to be part of the negated character class, because you asked for single-line mode so presumably didn't want your .*? to match any newlines either. |
|||
msg8189 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2001-12-16 01:30 | |
Logged In: YES user_id=6380 Tim, Peter's regex starts with (?s) which is the same as compiling with re.S or re.DOTALL which makes '.' match newline -- the opposite of what you seem to think (?s) means. |
|||
msg8190 - (view) | Author: Tim Peters (tim.peters) * | Date: 2001-12-16 04:34 | |
Logged In: YES user_id=31435 Oops! I used to make the same mistake in Perl too (which is I pushed to name the symbolic flag DOTALL instead of SINGLELINE). So the correct regexp is r'"[^"]*"[;\n]' Assuming, of course, that Peter does want embedded newlines to get sucked up. |
|||
msg8191 - (view) | Author: P. de Jong (peterdejong) | Date: 2001-12-17 09:02 | |
Logged In: YES user_id=402001 Tim and Guido, I want to match things like: "jlkjkl ""kjlklkjkl""ljk ;lkk;l"; or: "jlkjkl ""kjlklkjkl""ljk ;lkk;l" Maybe I can trust the embedded quotes to be doubled, but I was not sure at the moment of writing the expression. I do have to include newlines inside the enclosing quotes. Peter de Jong |
|||
msg8192 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2001-12-17 14:08 | |
Logged In: YES user_id=6380 Peter, if the internal quotes are doubled, try this: r'"([^"]|"")*"[;\n]' if you can't trust them to be doubled, you're in trouble -- your language is ambiguous. :-( |
|||
msg8193 - (view) | Author: P. de Jong (peterdejong) | Date: 2001-12-17 16:47 | |
Logged In: YES user_id=402001 Guido, Indeed. in many cases my language is ambious. I will try your solution. It is not clear to me what induces the recursion; Maybe something like '(?s)"(.*?)(";|"\n)' will not give the recursion cascade? |
|||
msg8194 - (view) | Author: Guido van Rossum (gvanrossum) * | Date: 2001-12-17 17:00 | |
Logged In: YES user_id=6380 I don't think that would make a difference. I don't understand SRE enough to know where the recursion cascade comes from; maybe Tim or Fredrik can explain? I believe it has something to do with an unexpected behavior of the *? operator. Note that it's a well-known theoretical result that parsing ambiguous languages can be O(n**3), so you might want to rethink this decision. |
|||
msg8195 - (view) | Author: Tim Peters (tim.peters) * | Date: 2001-12-17 18:15 | |
Logged In: YES user_id=31435 I'm afraid Guido's rewrite stacks a backtracking point for each character, so it can still die on strings of the length you're looking at. For example, here's a ~16KB string that kills it on Windows: test = '"' + ('a' * 128 + '""') * 128 + '";' The only info I know of on how to write robust regexps is in Friedl's "Mastering Regular Expressions" book, which does an excellent job. Using his "unrolling" pattern leads to the regexp r'"[^"]*(""[^"]*)*"[;\n]' which is an instance of the general normal* (special normal*)* pattern, and reduces the number of stacked backtracking points from the number of characters in the string to the number of special strings within it (given various preconditions that happen to be satisfied here -- you really need to read the book, as it resists a pithy summary). That works fine with the test string above, and even if you change it to test = '"' + ('a' * 5000 + '""') * 5000 + '";' At that point you're matching a 25MB string, which should be big enough for most web use <wink>. |
|||
msg8196 - (view) | Author: Quentin Crain (nanotech) | Date: 2002-09-10 16:12 | |
Logged In: YES user_id=191231 I am hoping this is a useful comment/code: import re s1=('macro\n'+'a'*200+'\norcam\n')*10 s2=('macro\n'+'a'*20000+'\norcam\n')*10 p=re.compile(r'macro.*?orcam',re.DOTALL) for x in re.findall(p,s1): print x for x in re.findall(p,s2): print x Using re.findall bit the users I support because it could not scale with the data. Thanks for your efforts! Quentin Crain |
|||
msg8197 - (view) | Author: Gustavo Niemeyer (niemeyer) * | Date: 2003-05-24 16:55 | |
Logged In: YES user_id=7887 As Gary Herron correctly pointed me out, this was fixed in 2.3 with the introduction of a new opcode to handle single character non-greedy matching. This won't be fixed in 2.2.3, but hopefully will be backported to 2.2.4 together with other regular expression fixes. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-10 16:04:46 | admin | set | github: 35740 |
2001-12-14 10:54:20 | peterdejong | create |