
Classification
Title: xml.sax memory leak with ExpatParser
Type:
Stage:
Components: XML
Versions:

Process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: fdrake
Nosy List: dyoo, fdrake, loewis, tim.peters
Priority: normal
Keywords:

Created on 2002-03-26 23:24 by dyoo, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Files
File name              Uploaded                  Description
test_memory_leak.py    dyoo, 2002-03-28 22:37
expatreader.diff       fdrake, 2002-04-04 05:25
test_memory_leak_2.py  dyoo, 2002-04-04 07:14    Verified cyclic memory collection is ok
Messages (13)
msg10005 - (view) Author: Danny Yoo (dyoo) Date: 2002-03-26 23:24
I've isolated a memory leak in ExpatParser involving the
destruction of ContentHandlers.  I'm including my test
program, test_memory_leak.py, which exercises the behavior:
it generates a bunch of ContentHandlers and checks whether
they get destroyed reliably.


This appears to affect Python 2.1.1 and 2.1.2.
Thankfully, the leak appears to be fixed in 2.2.1c.
Here are some of the test runs:

### Python 2.1.1:
[dyoo@tesuque dyoo]$ /opt/Python-2.1.1/bin/python
test_memory_leak.py
This is a test of an apparent XML memory leak.
Test1:



Test2:
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
###



### Python 2.1.2:
[dyoo@tesuque dyoo]$ /opt/Python-2.1.2/bin/python
test_memory_leak.py
This is a test of an apparent XML memory leak.
Test1:
TestParser destructed.
TestParser destructed.



Test2:
###


### Python 2.2.1c
[dyoo@tesuque dyoo]$ /opt/Python-2.2.1c2/bin/python
test_memory_leak.py
This is a test of an apparent XML memory leak.
Test1:
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.



Test2:
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
###
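
The attached test_memory_leak.py isn't reproduced in this thread; the sketch below is only a guess at a test in that spirit (the class name TestParser is taken from the output above, everything else is assumed):

###
# Hypothetical reconstruction of a test along the lines of
# test_memory_leak.py (the real attachment is not shown here).  A handler
# with a __del__ that prints a line makes it easy to count, in runs like
# those above, how many handler objects were actually reclaimed.
import xml.sax

DOC = "<doc><child>hello</child></doc>"

class TestParser(xml.sax.ContentHandler):
    def __del__(self):
        print "TestParser destructed."

def test1():
    # One-shot interface: parse a small document ten times.
    for i in range(10):
        xml.sax.parseString(DOC, TestParser())

def test2():
    # Incremental interface: feed() the document by hand.
    for i in range(10):
        parser = xml.sax.make_parser()
        parser.setContentHandler(TestParser())
        parser.feed(DOC)
        # note: no parser.close() here

print "This is a test of an apparent XML memory leak."
print "Test1:"
test1()
print "Test2:"
test2()
###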



msg10006 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-03-27 12:23

There's no uploaded file!  You have to check the
checkbox labeled "Check to Upload & Attach File"
when you upload a file.

Please try again.

(This is a SourceForge annoyance that we can do
nothing about. :-( )
msg10007 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-03-27 12:24

Also, what kind of action do you expect?  Chances are minimal
that there will be a 2.1.3 release, so why bother?
msg10008 - (view) Author: Danny Yoo (dyoo) Date: 2002-03-28 22:37

Hi Martin,

Yikes; sorry about that.  I've attached the file.

---


I did some more experimentation with xml.sax, and there does
appear to be a serious problem with object destruction, even
with Python 2.2.1c.

I'm working with a fairly large XML file located on the TIGR
(The Institute for Genomic Research) ftp site.  A sample
file would be something like:

ftp://ftp.tigr.org/pub/data/a_thaliana/ath1/PSEUDOCHROMOSOMES/chr1.xml

(60 MB)

and I noticed that my scripts were leaking memory.  I've
isolated the problem to what looks like a garbage-collection
issue: my ContentHandlers don't appear to be getting
recycled.  Here's a simplified program:

###
import xml.sax
import glob
from cStringIO import StringIO


class FooParser(xml.sax.ContentHandler):
    def __init__(self):
        self.bigcontent = StringIO()

    def startElement(self, name, attrs):
        pass

    def endElement(self, name):
        pass

    def characters(self, chars):
        self.bigcontent.write(chars)


filename = '/home/arabidopsis/bacs/20020107/PSEUDOCHROMOSOME/chr1.xml'
i = 0
while 1:
    print "Iteration %d" % i
    xml.sax.parse(open(filename), FooParser())
    i = i + 1
###

I've watched 'top', and the memory usage continues growing.
Any suggestions?  Thanks!
msg10009 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2002-03-28 22:48

Assigned to Fred, after he begged me to <wink>.
msg10010 - (view) Author: Fred Drake (fdrake) (Python committer) Date: 2002-04-04 04:50

I don't remember whether the cycle detector was enabled by
default in 2.1.* -- that all seems so long ago!

The content handler ends up being part of a circular
reference cycle, with the ExpatParser acting as its own
locator object.  This happens because the parser references
the content handler, and hands the content handler a
reference to itself to squirrel away as the locator.
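
A minimal illustration of that cycle (Python 2.x era code; ContentHandler keeps the locator it is given in its _locator attribute):

###
# With the unpatched ExpatParser, the parser passes *itself* to
# setDocumentLocator(), so the handler's stored locator points back at the
# parser and the two objects keep each other alive.
import xml.sax
from cStringIO import StringIO

handler = xml.sax.ContentHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)     # parser -> handler
parser.parse(StringIO("<doc/>"))

# On the unpatched parser the stored locator *is* the parser, completing
# the handler -> parser half of the cycle.
print handler._locator is parser
###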

I see two approaches to removing this dependency.  The first
is simply to call setDocumentLocator(None) after calling
endDocument(), but that's fragile; it assumes the parse gets
that far.  The second is to use a separate object to provide
the locator to the content handler; this seems more robust
as it doesn't assume that the parse succeeds.

I'll start on a patch that uses the second approach.

Martin, do you see any other alternatives?  There will be a
2.1.3 release for other reasons, BTW, so this might make it in.
msg10011 - (view) Author: Fred Drake (fdrake) (Python committer) Date: 2002-04-04 05:07

Looking at the code, it's not quite as trivial as I'd
thought, but not terribly difficult either.  I started by
creating a locator that held a reference to the parser object
from xml.parsers.expat, but that of course has references
back to the ExpatParser (through its handler callbacks), so
the cycle still exists.

As long as we're trying to solve the problem for Python 2.1
and newer, though, we can use a locator object that has a
weakref to the ExpatParser object, thereby breaking the
cycle.  I like that.  ;-)
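
A rough sketch of that idea -- the actual change is the attached expatreader.diff; the class and method below are only illustrative:

###
# A locator that holds only a weak reference to the parser, so
# handler -> locator -> parser is no longer a strong cycle.  The parser
# would call handler.setDocumentLocator(WeakLocator(self)) instead of
# handing the handler a reference to itself.
import weakref
from xml.sax import xmlreader

class WeakLocator(xmlreader.Locator):
    def __init__(self, parser):
        self._ref = weakref.ref(parser)   # weak, so no cycle

    def getLineNumber(self):
        parser = self._ref()
        if parser is None:
            return -1
        return parser.getLineNumber()

    # getColumnNumber(), getPublicId() and getSystemId() would delegate in
    # the same way.
###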
msg10012 - (view) Author: Fred Drake (fdrake) (Python committer) Date: 2002-04-04 05:25

I've attached a patch.  I think this meets all the backward
compatibility requirements and is low-risk, and it removes
the circular reference.  So far I've only tested it against
the standard tests for Python 2.1.*; I'll try it tomorrow
with the sample test code, and think about a test that can
be added to the test suite.
msg10013 - (view) Author: Fred Drake (fdrake) (Python committer) Date: 2002-04-04 05:35

I'll note that the patch is against the release21-maint
branch of Python, and I've only tried it there.  It may need
changes for more recent versions of Python, but that branch
appears most critical since we're looking at a 2.1.3 release
next week.

OK, enough.  I'm heading to bed.
msg10014 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-04-04 06:21

I think the problem is elsewhere.  Danny's demo script is
clearly buggy: if you use the IncrementalParser interface,
you *must* invoke .close() at the end of the parse run;
otherwise you get cyclic garbage.

The cyclic garbage collector will pick up that garbage; just
invoke gc.collect() after test1 and test2 to see all
TestParsers destroyed.
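
A sketch of the corrected usage (a tiny in-memory document stands in for the real test input):

###
# Pair every feed() sequence with close(), and reclaim any remaining
# cyclic garbage explicitly with gc.collect().
import gc
import xml.sax

parser = xml.sax.make_parser()
parser.setContentHandler(xml.sax.ContentHandler())
parser.feed("<doc>")
parser.feed("<child>hello</child>")
parser.feed("</doc>")
parser.close()     # without this, the parser/handler cycle is left behind

gc.collect()       # sweeps up whatever cyclic garbage remains
###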

So I don't think any action on the Python code is necessary
as a bug fix; if there are remaining problems, they must be
in pyexpat.c.  I'll investigate revisions 2.52 and 2.54 as
candidates for backporting.
msg10015 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-04-04 07:05

Also, when parsing the large XML file: if you invoke
gc.collect() after each iteration, memory consumption will
go down rather than grow over time.  The reason GC does not
trigger automatically is that you allocate all the space
through strings.  GC is invoked after 1000 new container
objects have been allocated, but you exhaust memory before
that, so either lower the GC threshold or invoke GC on your
own.

For the specific application, it would be sufficient if
xml.sax.__init__.parse invoked parser.setContentHandler(None)
after parsing has completed; that alone should break the
cycle.
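
A sketch of those workarounds applied to a loop like the one in the earlier script (a small in-memory document stands in for the 60 MB TIGR file):

###
# Either lower the first GC threshold so collection triggers despite the
# mostly-string allocation pattern, or collect explicitly after each
# document.
import gc
import xml.sax
from cStringIO import StringIO

class FooParser(xml.sax.ContentHandler):
    def __init__(self):
        self.bigcontent = StringIO()
    def characters(self, chars):
        self.bigcontent.write(chars)

gc.set_threshold(100)              # option 1: collect far more eagerly

doc = "<doc><child>hello</child></doc>"
for i in range(10):
    print "Iteration %d" % i
    xml.sax.parseString(doc, FooParser())
    gc.collect()                   # option 2: collect explicitly each time

# (Having xml.sax.parse() call parser.setContentHandler(None) when it
#  finishes, as suggested above, would also break the cycle.)
###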

To solve the general problem, I like your suggestion of
using a separate locator.
msg10016 - (view) Author: Danny Yoo (dyoo) Date: 2002-04-04 07:14

Martin is right: I need to retract part of my bug report.
I just remembered that instances of classes with a __del__
method that are caught in a reference cycle aren't
automatically cleaned up by gc.collect().  I triggered a
pseudo-Heisenbug during my testing.
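
A small demonstration of that Python 2 behaviour (the class is hypothetical, purely for illustration):

###
# An object with a __del__ method that sits in a reference cycle is not
# freed by the collector; it is parked in gc.garbage instead, so its
# __del__ never runs automatically.
import gc

class Noisy:
    def __del__(self):
        print "Noisy destructed."

n = Noisy()
n.self_ref = n        # cycle: the instance references itself
del n

gc.collect()
print gc.garbage      # the Noisy instance sits here, undestructed
###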

After removing the __del__ method from my test class and
using parseString() instead of feed(), I've verified that
the XML parsing isn't the source of my memory leak.  (Test
file test_memory_leak_2.py included.)

However, after further investigation, I did find the true
source of my problems in MySQLdb:

http://sourceforge.net/tracker/index.php?func=detail&aid=536624&group_id=22307&atid=374932

Thank you again for looking into this.
msg10017 - (view) Author: Fred Drake (fdrake) (Python committer) Date: 2002-04-04 18:00

Checked in the fixed version as Lib/xml/sax/expatreader.py
revisions 1.26, 1.25.16.1, and 1.22.4.1 (this last means
it'll be in Python 2.1.3).

This or a similar change should be added to PyXML.
History
Date                 User   Action  Args
2022-04-10 16:05:09  admin  set     github: 36337
2002-03-26 23:24:47  dyoo   create