
classification
Title: sgmllib.sgmlparser is not thread safe
Type: Stage:
Components: Library (Lib) Versions:
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: andresriancho, georg.brandl, josiahcarlson
Priority: normal Keywords:

Created on 2006-08-29 02:32 by andresriancho, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
sgml-not-threadSafe.py andresriancho, 2006-08-29 02:32
Messages (3)
msg29696 - (view) Author: Andres Riancho (andresriancho) Date: 2006-08-29 02:32
Python version:
===============

dz0@fre3ak:~$ python
Python 2.4.3 (#2, Apr 27 2006, 14:43:58)
[GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)] on linux2

Problem description:
====================

sgmllib.SGMLParser is not thread safe. I discovered this
problem when trying to fetch and parse many HTML files
at the same time, using one parser shared by all threads.

An example of this bug can be found attached.

The HTML fed to the parser is the string
'<html></html>'*100. It was written this way to
simplify the code; note that replacing this string
with a "large" HTML document also makes it fail.

Solution:
=========

Make the library thread safe, or add a few lines to the
documentation saying that it is not thread safe.


Traceback:
==========
 python sgml-not-threadSafe.py
Started all threads
Successfully parsed html
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.4/threading.py", line 442, in __bootstrap
    self.run()
  File "/usr/lib/python2.4/threading.py", line 422, in run
    self.__target(*self.__args, **self.__kwargs)
  File "sgml-not-threadSafe.py", line 10, in parseHtml
    self._parser.feed( html )
  File "/usr/lib/python2.4/sgmllib.py", line 95, in feed
    self.goahead(0)
  File "/usr/lib/python2.4/sgmllib.py", line 129, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.4/sgmllib.py", line 262, in parse_starttag
    self.error('unexpected call to parse_starttag')
  File "/usr/lib/python2.4/sgmllib.py", line 102, in error
    raise SGMLParseError(message)
SGMLParseError: unexpected call to parse_starttag

Successfully parsed html
Successfully parsed html

Additional note
===============

To reproduce this bug, you may need to run the sample
code more than once. Thread scheduling is not always the
same, so the failure does not appear on every run: it
may take two, three or more attempts.
msg29697 - (view) Author: Josiah Carlson (josiahcarlson) * (Python triager) Date: 2006-09-03 19:02
sgmllib makes no claims about thread safety, which
implies that a parser instance is generally not
sharable between threads.

You can work around this issue by creating a new parser
instance for each thread that you want to parse.
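
A minimal sketch of that workaround follows. It is an illustration, not
the attached script: sgmllib no longer exists in Python 3, so the sketch
uses html.parser.HTMLParser as a stand-in, and the TagCounter class and
parse_html function are hypothetical names. The point is only that each
thread builds its own parser, so no parse state is shared.

```python
import threading
from html.parser import HTMLParser  # Python 3 stand-in; sgmllib was removed


class TagCounter(HTMLParser):
    """Hypothetical parser that counts start tags."""

    def __init__(self):
        super().__init__()
        self.start_tags = 0

    def handle_starttag(self, tag, attrs):
        self.start_tags += 1


def parse_html(html, results, index):
    # The parser is created inside the worker, so its internal buffer
    # and position are private to this thread.
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    results[index] = parser.start_tags


html = '<html></html>' * 100  # same input as in the report
results = [None] * 4
threads = [
    threading.Thread(target=parse_html, args=(html, results, i))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each worker writes to its own slot in results, so the list needs no
lock either.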

I suggest closing this as "Won't Fix".
msg29698 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2006-09-06 06:18
I agree with Josiah.

Each thread will have to use its own HTMLParser instance.
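
When threads are long-lived (for example in a pool), one way to give
each thread its own instance without rebuilding it per document is
threading.local. The sketch below is an assumption-laden illustration:
it uses Python 3's html.parser.HTMLParser rather than the Python 2
classes discussed here, and LinkCollector and extract_links are
hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser  # Python 3 stand-in, not sgmllib/htmllib

_local = threading.local()  # each thread sees its own attributes


class LinkCollector(HTMLParser):
    """Hypothetical parser that records href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")


def extract_links(html):
    # Lazily create one parser per thread; other threads never see it.
    parser = getattr(_local, "parser", None)
    if parser is None:
        parser = _local.parser = LinkCollector()
    parser.reset()      # clear parse state left over from the previous document
    parser.links = []   # reset() does not touch our own attribute
    parser.feed(html)
    parser.close()
    return parser.links


docs = ['<a href="/a">x</a>', '<a href="/b">y</a><a href="/c">z</a>', '<p>none</p>']
with ThreadPoolExecutor(max_workers=2) as pool:
    found = list(pool.map(extract_links, docs))
```

pool.map returns results in input order, so found lines up with docs
regardless of which thread handled which document.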
History
Date User Action Args
2022-04-11 14:56:19 admin set github: 43907
2006-08-29 02:32:15 andresriancho create