Issue505747
Created on 2002-01-19 14:37 by glchapman, last changed 2022-04-10 16:04 by admin. This issue is now closed.
Messages (6) | |||
---|---|---|---|
msg8887 - (view) | Author: Greg Chapman (glchapman) | Date: 2002-01-19 14:37 | |
Using Python 2.2, I tried to use websucker.py on this page: http://magix.fri.uni-lj.si/orange/start/
This resulted in an exception in ParserBase._scan_name because _declname_match failed. Examining the source for the page above, I see there are several tags that look like "<![endif]>", where the first character after "<!" is a '[', not an alpha as mandated by _declname_match. Perhaps this is badly formed HTML (I see it was produced by FrontPage), but if not, it appears that _scan_name may have to be modified.
FYI, here's the traceback from the exception:
Traceback (most recent call last):
  File "C:\Python22\Tools\webchecker\websucker.py", line 126, in ?
    sys.exit(main() or 0)
  File "C:\Python22\Tools\webchecker\websucker.py", line 43, in main
    c.run()
  File "C:\Python22\Tools\webchecker\webchecker.py", line 349, in run
    self.dopage(url)
  File "C:\Python22\Tools\webchecker\webchecker.py", line 403, in dopage
    page = self.getpage(url_pair)
  File "C:\Python22\Tools\webchecker\webchecker.py", line 507, in getpage
    return Page(text, url, maxpage=self.maxpage, checker=self)
  File "C:\Python22\Tools\webchecker\webchecker.py", line 671, in __init__
    self.parser.feed(self.text)
  File "c:\Python22\lib\sgmllib.py", line 95, in feed
    self.goahead(0)
  File "c:\Python22\lib\sgmllib.py", line 161, in goahead
    k = self.parse_declaration(i)
  File "c:\Python22\lib\markupbase.py", line 66, in parse_declaration
    decltype, j = self._scan_name(j, i)
  File "c:\Python22\lib\markupbase.py", line 313, in _scan_name
    self.error("expected name token")
  File "c:\Python22\lib\sgmllib.py", line 102, in error
    raise SGMLParseError(message)
sgmllib.SGMLParseError: expected name token |
|||
msg8888 - (view) | Author: Fred Drake (fdrake) | Date: 2002-02-15 06:13 | |
Ugh! I don't think that's legal HTML at all. I'll have to think about the right way to deal with it. |
|||
msg8889 - (view) | Author: Fred Drake (fdrake) | Date: 2002-06-14 01:39 | |
Ok, here's what I think. This is not an actual bug in the interpretation of HTML, and there has not been a recurring pattern of complaints about this. Given that we do not want to encourage the creation of broken HTML, this edge case will not be allowed to further complicate the code. |
|||
msg8890 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2003-03-30 14:52 | |
This has now been fixed with patch 545300, on grounds of conformance with SGML. |
|||
msg8891 - (view) | Author: Alan Ezust (ezust) | Date: 2004-11-09 16:20 | |
I am running into this problem too. Invalid HTML seems quite common in real-world web pages, and if you are running a scraper program, I guess it's to be expected that one will encounter invalid HTML from time to time. So in answer to your question about how to respond, I think what's most important is that you output a better error message. Then it won't be considered a bug in the library. The error should indicate where in the document the parse error was encountered. Second, I don't understand what getpos() returns, and how it relates to the parse error. It returns 1,2 when, in the particular page where I encountered the error, the problem was actually on line 12 (see http://www.cs.uvic.ca/~gshoja/ as an example). How do I get this information from the object? |
|||
msg8892 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2004-11-14 10:29 | |
ezust: Posting to an SF tracker item is inadequate for asking for help; please post to comp.lang.python instead. If you think there are bugs remaining in Python, please submit separate bug reports. |
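
The two points raised in the messages above (FrontPage-style "<![if ...]>" / "<![endif]>" sections, and what getpos() reports) can be illustrated with a minimal sketch. It assumes Python 3's html.parser.HTMLParser rather than sgmllib, which was removed in Python 3, so it is not the code path from the original traceback. The names LinkCollector, collect_links, and SAMPLE are hypothetical and exist only for this example. How the parser reports the bracketed sections themselves may differ between Python versions, so the sketch deliberately does not depend on that and only checks that links around them are still found.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, line number) pairs from <a> tags.

    Hypothetical helper for illustration; not part of websucker.py.
    """

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                # getpos() returns (lineno, offset); lineno starts at 1.
                lineno, _offset = self.getpos()
                self.links.append((value, lineno))


def collect_links(html):
    """Parse html and return the list of (href, lineno) pairs found."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Assumed sample input imitating the FrontPage output described above.
SAMPLE = (
    "<html><body>\n"
    "<![if !supportLists]>\n"
    '<a href="a.html">A</a>\n'
    "<![endif]>\n"
    '<a href="b.html">B</a>\n'
    "</body></html>\n"
)

if __name__ == "__main__":
    for href, lineno in collect_links(SAMPLE):
        print(f"line {lineno}: {href}")
```

In this sketch getpos() is read inside handle_starttag, so the line number recorded is the line of the tag being handled, not the position where feed() stopped.
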
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-10 16:04:54 | admin | set | github: 35953 |
2002-01-19 14:37:04 | glchapman | create |