
Classification
Title: xml.sax memory leak with ExpatParser
Type:
Stage:
Components: XML
Versions:

Process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: fdrake
Nosy List: dyoo, fdrake, loewis, tim.peters
Priority: normal
Keywords:

Created on 2002-03-26 23:24 by dyoo, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Files
File name              Uploaded                  Description
test_memory_leak.py    dyoo, 2002-03-28 22:37
expatreader.diff       fdrake, 2002-04-04 05:25
test_memory_leak_2.py  dyoo, 2002-04-04 07:14    Verified cyclic memory collection is ok
Messages (13)
msg10005 - (view) Author: Danny Yoo (dyoo) Date: 2002-03-26 23:24
I've isolated a memory leak in ExpatParser involving the
destruction of ContentHandlers.  I'm including my test
program, test_memory_leak.py, which exercises the behavior:
it generates a bunch of ContentHandlers and checks whether
they get destroyed reliably.


This appears to affect Python 2.1.1 and 2.1.2.
Thankfully, the leak appears to be fixed in 2.2.1c.
Here are some of the test runs:

### Python 2.1.1:
[dyoo@tesuque dyoo]$ /opt/Python-2.1.1/bin/python
test_memory_leak.py
This is a test of an apparent XML memory leak.
Test1:



Test2:
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
###



### Python 2.1.2:
[dyoo@tesuque dyoo]$ /opt/Python-2.1.2/bin/python
test_memory_leak.py
This is a test of an apparent XML memory leak.
Test1:
TestParser destructed.
TestParser destructed.



Test2:
###


### Python 2.2.1c
[dyoo@tesuque dyoo]$ /opt/Python-2.2.1c2/bin/python
test_memory_leak.py
This is a test of an apparent XML memory leak.
Test1:
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.



Test2:
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
TestParser destructed.
###
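
The attached test_memory_leak.py isn't reproduced in this thread; the sketch below is only a guess at a test in that spirit (the class name TestParser is taken from the output above, everything else is assumed):

###
# Hypothetical reconstruction of a test along the lines of
# test_memory_leak.py (the real attachment is not shown here).  A handler
# with a __del__ that prints a line makes it easy to count, in runs like
# those above, how many handler objects were actually reclaimed.
import xml.sax

DOC = "<doc><child>hello</child></doc>"

class TestParser(xml.sax.ContentHandler):
    def __del__(self):
        print "TestParser destructed."

def test1():
    # One-shot interface: parse a small document ten times.
    for i in range(10):
        xml.sax.parseString(DOC, TestParser())

def test2():
    # Incremental interface: feed() the document by hand.
    for i in range(10):
        parser = xml.sax.make_parser()
        parser.setContentHandler(TestParser())
        parser.feed(DOC)
        # note: no parser.close() here

print "This is a test of an apparent XML memory leak."
print "Test1:"
test1()
print "Test2:"
test2()
###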



msg10006 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-03-27 12:23

There's no uploaded file!  You have to check the
checkbox labeled "Check to Upload & Attach File"
when you upload a file.

Please try again.

(This is a SourceForge annoyance that we can do
nothing about. :-( )
msg10007 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-03-27 12:24

Also, what kind of action do you expect?  Chances are minimal
that there will be a 2.1.3 release, so why bother?
msg10008 - (view) Author: Danny Yoo (dyoo) Date: 2002-03-28 22:37

Hi Martin,

Yikes; sorry about that.  I've attached the file.

---


I did some more experimentation with xml.sax, and there does
appear to be a serious problem with object destruction, even
with Python 2.2.1c.

I'm working with a fairly large XML file located on the TIGR
(The Institute for Genomic Research) ftp site.  A sample
file would be something like:

ftp://ftp.tigr.org/pub/data/a_thaliana/ath1/PSEUDOCHROMOSOMES/chr1.xml

(60 MB)

and I noticed that my scripts were leaking memory.  I've
isolated the problem to what looks like a garbage-collection
issue: my ContentHandlers don't appear to be getting
recycled.  Here's a simplified program:

###
import xml.sax
import glob
from cStringIO import StringIO


class FooParser(xml.sax.ContentHandler):
    def __init__(self):
        self.bigcontent = StringIO()

    def startElement(self, name, attrs):
        pass

    def endElement(self, name):
        pass

    def characters(self, chars):
        self.bigcontent.write(chars)


filename = '/home/arabidopsis/bacs/20020107/PSEUDOCHROMOSOME/chr1.xml'
i = 0
while 1:
    print "Iteration %d" % i
    xml.sax.parse(open(filename), FooParser())
    i = i + 1
###

I've watched 'top', and the memory usage continues growing.
Any suggestions?  Thanks!
msg10009 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2002-03-28 22:48

Assigned to Fred, after he begged me to <wink>.
msg10010 - (view) Author: Fred Drake (fdrake) (Python committer) Date: 2002-04-04 04:50

I don't remember whether the cycle detector was enabled by
default in 2.1.* -- that all seems so long ago!

The content handler ends up being part of a circular
reference cycle, with the ExpatParser acting as its own
locator object.  This happens because the parser references
the content handler, and hands the content handler a
reference to itself to squirrel away as the locator.
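
A minimal illustration of that cycle (Python 2.x era code; ContentHandler keeps the locator it is given in its _locator attribute):

###
# With the unpatched ExpatParser, the parser passes *itself* to
# setDocumentLocator(), so the handler's stored locator points back at the
# parser and the two objects keep each other alive.
import xml.sax
from cStringIO import StringIO

handler = xml.sax.ContentHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)     # parser -> handler
parser.parse(StringIO("<doc/>"))

# On the unpatched parser the stored locator *is* the parser, completing
# the handler -> parser half of the cycle.
print handler._locator is parser
###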

I see two approaches to removing this dependency.  The first
is simply to call setDocumentLocator(None) after calling
endDocument(), but that's fragile; it assumes the parse gets
that far.  The second is to use a separate object to provide
the locator to the content handler; this seems more robust
as it doesn't assume that the parse succeeds.

I'll start on a patch that uses the second approach.

Martin, do you see any other alternatives?  There will be a
2.1.3 release for other reasons, BTW, so this might make it in.
msg10011 - (view) Author: Fred Drake (fdrake) (Python committer) Date: 2002-04-04 05:07

Looking at the code, it's not quite as trivial as I'd
thought, but not terribly difficult either.  I started by
creating a locator that held a reference to the parser object
from xml.parsers.expat, but that of course has references
back to the ExpatParser (through its handler callbacks), so
the cycle still exists.

As long as we're trying to solve the problem for Python 2.1
and newer, though, we can use a locator object that has a
weakref to the ExpatParser object, thereby breaking the
cycle.  I like that.  ;-)
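
A rough sketch of that idea -- the actual change is the attached expatreader.diff; the class and method below are only illustrative:

###
# A locator that holds only a weak reference to the parser, so
# handler -> locator -> parser is no longer a strong cycle.  The parser
# would call handler.setDocumentLocator(WeakLocator(self)) instead of
# handing the handler a reference to itself.
import weakref
from xml.sax import xmlreader

class WeakLocator(xmlreader.Locator):
    def __init__(self, parser):
        self._ref = weakref.ref(parser)   # weak, so no cycle

    def getLineNumber(self):
        parser = self._ref()
        if parser is None:
            return -1
        return parser.getLineNumber()

    # getColumnNumber(), getPublicId() and getSystemId() would delegate in
    # the same way.
###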
msg10012 - (view) Author: Fred Drake (fdrake) (Python committer) Date: 2002-04-04 05:25

I've attached a patch.  I think this meets all the backward
compatibility requirements and is low-risk, and it removes
the circular reference.  So far I've only tested it against
the standard tests for Python 2.1.*; I'll try it tomorrow
with the sample test code, and think about a test that can
be added to the test suite.
msg10013 - (view) Author: Fred Drake (fdrake) (Python committer) Date: 2002-04-04 05:35

I'll note that the patch is against the release21-maint
branch of Python, and I've only tried it there.  It may need
changes for more recent versions of Python, but that branch
appears most critical since we're looking at a 2.1.3 release
next week.

OK, enough.  I'm heading to bed.
msg10014 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-04-04 06:21

I think the problem is elsewhere.  Danny's demo script is
clearly buggy: if you use the IncrementalParser interface,
you *must* invoke .close() at the end of the parse run;
otherwise you get cyclic garbage.

The cyclic garbage collector will pick up that garbage; just
invoke gc.collect() after test1 and test2 to see all
TestParsers destroyed.
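
A sketch of the corrected usage (a tiny in-memory document stands in for the real test input):

###
# Pair every feed() sequence with close(), and reclaim any remaining
# cyclic garbage explicitly with gc.collect().
import gc
import xml.sax

parser = xml.sax.make_parser()
parser.setContentHandler(xml.sax.ContentHandler())
parser.feed("<doc>")
parser.feed("<child>hello</child>")
parser.feed("</doc>")
parser.close()     # without this, the parser/handler cycle is left behind

gc.collect()       # sweeps up whatever cyclic garbage remains
###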

So I don't think any action on the Python code is necessary
as a bug fix; if there are remaining problems, they must be
in pyexpat.c.  I'll investigate revisions 2.52 and 2.54 as
candidates for backporting.
msg10015 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-04-04 07:05

Also, when parsing the large XML file: if you invoke
gc.collect() after each iteration, memory consumption will
go down rather than grow over time.  The reason GC does not
trigger automatically is that you allocate all the space
through strings.  GC is invoked after 1000 new container
objects have been allocated, but you exhaust memory before
that, so either lower the GC threshold or invoke GC on your
own.

For the specific application, it would be sufficient if
xml.sax.__init__.parse invoked parser.setContentHandler(None)
after parsing has completed; that alone should break the
cycle.
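
A sketch of those workarounds applied to a loop like the one in the earlier script (a small in-memory document stands in for the 60 MB TIGR file):

###
# Either lower the first GC threshold so collection triggers despite the
# mostly-string allocation pattern, or collect explicitly after each
# document.
import gc
import xml.sax
from cStringIO import StringIO

class FooParser(xml.sax.ContentHandler):
    def __init__(self):
        self.bigcontent = StringIO()
    def characters(self, chars):
        self.bigcontent.write(chars)

gc.set_threshold(100)              # option 1: collect far more eagerly

doc = "<doc><child>hello</child></doc>"
for i in range(10):
    print "Iteration %d" % i
    xml.sax.parseString(doc, FooParser())
    gc.collect()                   # option 2: collect explicitly each time

# (Having xml.sax.parse() call parser.setContentHandler(None) when it
#  finishes, as suggested above, would also break the cycle.)
###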

To solve the general problem, I like your suggestion of
using a separate locator.
msg10016 - (view) Author: Danny Yoo (dyoo) Date: 2002-04-04 07:14

Martin is right: I need to retract part of my bug report.
I just remembered that instances of classes with a __del__
method that are caught in a reference cycle aren't
automatically cleaned up by gc.collect().  I triggered a
pseudo-Heisenbug during my testing.
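
A small demonstration of that Python 2 behaviour (the class is hypothetical, purely for illustration):

###
# An object with a __del__ method that sits in a reference cycle is not
# freed by the collector; it is parked in gc.garbage instead, so its
# __del__ never runs automatically.
import gc

class Noisy:
    def __del__(self):
        print "Noisy destructed."

n = Noisy()
n.self_ref = n        # cycle: the instance references itself
del n

gc.collect()
print gc.garbage      # the Noisy instance sits here, undestructed
###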

After removing the __del__ method from my test class and
using parseString() instead of feed(), I've verified that
the XML parsing isn't the source of my memory leak.  (Test
file test_memory_leak_2.py included.)

However, after further investigation, I did find the true
source of my problems in MySQLdb:

http://sourceforge.net/tracker/index.php?func=detail&aid=536624&group_id=22307&atid=374932

Thank you again for looking into this.
msg10017 - (view) Author: Fred Drake (fdrake) (Python committer) Date: 2002-04-04 18:00

Checked in the fixed version as Lib/xml/sax/expatreader.py
revisions 1.26, 1.25.16.1, and 1.22.4.1 (this last means
it'll be in Python 2.1.3).

This or a similar change should be added to PyXML.
History
Date                 User   Action  Args
2022-04-10 16:05:09  admin  set     github: 36337
2002-03-26 23:24:47  dyoo   create