Issue 736428: allow HTMLParser error recovery


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/38487

classification

Title:	allow HTMLParser error recovery
Type:	enhancement	Stage:
Components:	Library (Lib)	Versions:

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:	allow HTMLParser to continue after a parse error View: 755660
Assigned To:		Nosy List:	ajaksu2, georg.brandl, kingswood, smroid
Priority:	normal	Keywords:

Created on 2003-05-12 11:37 by smroid, last changed 2022-04-10 16:08 by admin. This issue is now closed.

Messages (5)
msg60329 - (view)	Author: Steven Rosenthal (smroid)	Date: 2003-05-12 11:37
I'm using 2.3a2. HTMLParser correctly raises a "malformed start tag" error on:

    <meta NAME=DESCRIPTION Content=Lands' End quality... outerwear and more.>

Because my application is imprecise by nature (web scraping), I want to be able to continue after such errors. I can override the error() method to not raise an exception. To make this work, I also needed to alter HTMLParser.py, near line 316, to read as:

    self.updatepos(i, j)
    self.error("malformed start tag")
    return j # ADDED THIS LINE
    raise AssertionError("we should not get here!")

My enhancement request is for every place where self.error() is called, to ensure that the "override error() to not raise an exception" continuation strategy works as well as can be hoped.

Thanks,
Steve
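
The continuation strategy requested above can be sketched on a toy scanner. This is only an illustration under stated assumptions: ToyTagScanner, LenientScanner, ParseError and scan_leniently are hypothetical names invented for this sketch, not part of HTMLParser, and the tag syntax is deliberately simplified. It shows the one property the request depends on: every code path that calls self.error() must still return a position that moves forward if error() comes back.

```python
class ParseError(Exception):
    """Raised by the default error() handler of the toy scanner."""


class ToyTagScanner:
    """Hypothetical stand-in for a parser; NOT the real HTMLParser."""

    def __init__(self):
        self.tags = []
        self.errors = []

    def error(self, message):
        # Default behaviour: abort parsing. Subclasses may override this.
        raise ParseError(message)

    def parse_tag(self, data, i):
        # Scan one "<name>" tag starting at index i and return the index
        # at which scanning should resume.
        j = data.find(">", i)
        if j == -1:
            self.error("unterminated tag")
            # Reached only if error() returned: consume the rest.
            return len(data)
        name = data[i + 1:j]
        if not name.isalnum():
            self.error("malformed start tag")
            # Reached only if error() returned: skip the bad tag so the
            # caller still makes progress.
            return j + 1
        self.tags.append(name)
        return j + 1

    def feed(self, data):
        i = 0
        while i < len(data):
            if data[i] == "<":
                i = self.parse_tag(data, i)
            else:
                i += 1


class LenientScanner(ToyTagScanner):
    def error(self, message):
        # Record the problem instead of raising, so parsing continues.
        self.errors.append(message)


def scan_leniently(data):
    scanner = LenientScanner()
    scanner.feed(data)
    return scanner.tags, scanner.errors
```

With the default error() the first malformed tag aborts the scan; with the override, the bad tag is recorded and skipped while the well-formed tags around it are still collected.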
msg60330 - (view)	Author: Frank Vorstenbosch (kingswood)	Date: 2004-03-16 09:53
Fixed by my patch against 2.3.3. The patch adds recovery to ensure progress and tries not to miss any data in the input. The error() method is now commented as being overridable; just define def error(self, message): pass to ignore any parsing errors.
msg60331 - (view)	Author: Frank Vorstenbosch (kingswood)	Date: 2004-04-03 18:04
This problem is actually more widespread than previously indicated. Not only must every call to self.error() cope with that function returning, and recover (HTMLParser defines that every character in the input will be visited exactly once), but other modules are also affected. In particular, feeding HTML (from spam) containing a tag <!12345> into HTMLParser causes markupbase._scan_name to emit an error that now needs to recover.

The patch in #917188 may be better than the one suggested here, as it deals with all places where self.error() can return.

More is needed to fix the problem completely. In markupbase.py, at least this is necessary:

    --- markupbase.py.orig Sat Apr 03 17:43:48 2004
    +++ markupbase.py Sat Apr 03 18:02:48 2004
    @@ -377,6 +377,8 @@
             else:
                 self.updatepos(declstartpos, i)
                 self.error("expected name token")
    +            return None,rawdata.find(">",i)
         # To be overridden -- handlers for unknown objects
         def unknown_decl(self, data):
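
The recovery idea in the markupbase.py change above can be sketched as a standalone function. This is a rough illustration, not the module's code: scan_name, its on_error parameter and the _NAME pattern are assumptions made for this sketch. One difference is deliberate and is my own addition rather than part of the posted patch: str.find() returns -1 when no ">" remains, so the sketch clamps the resume position to the end of the input instead of returning -1.

```python
import re

# Pattern for a declaration name. An assumption for this sketch, not
# copied from markupbase.py.
_NAME = re.compile(r"[a-zA-Z][-_.a-zA-Z0-9]*\s*")


def scan_name(rawdata, i, on_error=None):
    """Return (name, next_index), or (None, next_index) after recovery.

    on_error is a hypothetical callback standing in for an overridden
    error() method that returns instead of raising.
    """
    m = _NAME.match(rawdata, i)
    if m:
        return m.group().strip().lower(), m.end()
    if on_error is None:
        # Mirrors the non-overridden case: the error aborts the scan.
        raise ValueError("expected name token")
    on_error("expected name token")
    # Resume at the next ">" so the caller can skip the bad declaration.
    j = rawdata.find(">", i)
    # find() returns -1 when no ">" remains; clamp to the end of input.
    return None, (j if j != -1 else len(rawdata))
```

Called on the `<!12345>` input from the report, starting just after `<!`, the function reports the error once and hands back the index of the closing `>`, so a caller can step past the declaration and keep going.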
msg81442 - (view)	Author: Daniel Diniz (ajaksu2) *	Date: 2009-02-09 06:16
Superseder: issue 755660.
msg85553 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2009-04-05 18:45
Setting as superseder.

History
Date	User	Action	Args
2022-04-10 16:08:42	admin	set	github: 38487
2009-04-05 18:45:17	georg.brandl	set	status: open -> closed nosy: + georg.brandl messages: + msg85553 superseder: allow HTMLParser to continue after a parse error resolution: duplicate
2009-02-09 06:16:29	ajaksu2	set	nosy: + ajaksu2 messages: + msg81442
2003-05-12 11:37:44	smroid	create