This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: RE engine internal error with LARGE RE: scalability bug
Type: Stage:
Components: Regular Expressions Versions: Python 2.3
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: effbot Nosy List: effbot, fdborges
Priority: low Keywords:

Created on 2003-12-10 16:21 by fdborges, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (6)
msg19361 - (view) Author: Francisco Dellatorre Borges (fdborges) Date: 2003-12-10 16:21
I lot's of lines with the format:

(d+@w[^|]+|)+

I'll call the last bit after the @ of /features/. I
need to delete some of these,  so I have this code that
would produce a list matching what I would /not/ use,
pass it over re.escape and them  build a re using a
concatenation of the list and delete this from the text
before actually doing any parsing.

Problem is: I have about 220.000 different features and
I need to delete some 200.000 different ones from  my
files before doing something. 

So I tried to use a list of the 20.000 I want and then
delete anything that matches the <not> of it:

#---------------
# ftrlist is the stuff I *want* to keep: 
ftrlist = [re.escape(i) for i in ftrlist ]
re.compile(r'(?!(%s))'  %( '|'.join(ftrlist)) )
#--------------

but when I apply it I get something like:

RuntimeError: internal error in regular expression engine

I tried the same thing but with a smaller number of
elements, say 1000 ftrlist[:1000], and then it worked. 

So I guess there is a bug on the scalability of the re
engine when doing alternative searches.

Attached I'm sending a tar ball that reproduces this.
I'm gzipping it (hope sourceforge does not have a
problem with the resulting binary file).
msg19362 - (view) Author: Francisco Dellatorre Borges (fdborges) Date: 2003-12-10 16:24
Logged In: YES 
user_id=576554

Problem with the file name the first time. Here is the file.
msg19363 - (view) Author: Francisco Dellatorre Borges (fdborges) Date: 2003-12-10 16:31
Logged In: YES 
user_id=576554

apparently sf does not want me to load tar balls... or I'm
too dumb to do it.
msg19364 - (view) Author: Francisco Dellatorre Borges (fdborges) Date: 2003-12-10 16:35
Logged In: YES 
user_id=576554

Ok, I give up on loading the file the size is restricted as
one might have expected. If you want download the tar ball
that would reproduce the error from
www.let.rug.nl/~borges/ScalabilityREBug.tar.gz
msg19365 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2003-12-11 13:39
Logged In: YES 
user_id=38376

The SRE engine is usually compiled with 16-bit opcodes, 
which places a limit on the size of the compiled engine.  
Changing that to 32 bits might solve your problem...

But in this case, I don't think one huge RE is really the best 
tool for the task; it's as easy (and much more efficient) to 
use an RE to *extract* the feature code, and use a 
dictionary to keep track of the ones you want to keep.

(example: use re.sub("d+@(w[^|]+)|", callback, data) and let 
the callback look m.group(1) up in the dictionary; return 
m.group(0) if you want to keep the feature, "" if you don't).

Hope this helps!

</F>
msg19366 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2004-09-09 08:04
Logged In: YES 
user_id=38376

Alternative solution proposed; no response from submitter
in ten months.
History
Date User Action Args
2022-04-11 14:56:01adminsetgithub: 39689
2003-12-10 16:21:55fdborgescreate