classification
Title: Webchecker error on http://www.naleo.org
Type: Stage:
Components: Demos and Tools Versions:
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: jhylton Nosy List: jhylton, mcsolrac, mwh
Priority: normal Keywords:

Created on 2002-08-08 04:40 by mcsolrac, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Files
File name Uploaded Description
WSJArticle002_source.txt mcsolrac, 2002-08-08 04:40 Source code of an HTML document
Messages (5)
msg11864 - Author: Carlos Conti (mcsolrac) Date: 2002-08-08 04:40
Webchecker version 1.25.6.1 on Windows 2000 Professional.

Run webchecker with this argument:
http://www.naleo.org/WSJArticle002.htm
Webchecker will return this traceback:

Traceback (most recent call last):
  File "C:\Python22\Tools\webchecker\webchecker.py", line 858, in ?
    main()
  File "C:\Python22\Tools\webchecker\webchecker.py", line 222, in main
    c.run()
  File "C:\Python22\Tools\webchecker\webchecker.py", line 349, in run
    self.dopage(url)
  File "C:\Python22\Tools\webchecker\webchecker.py", line 403, in dopage
    page = self.getpage(url_pair)
  File "C:\Python22\Tools\webchecker\webchecker.py", line 507, in getpage
    return Page(text, url, maxpage=self.maxpage, checker=self)
  File "C:\Python22\Tools\webchecker\webchecker.py", line 671, in __init__
    self.parser.feed(self.text)
  File "C:\Python22\lib\sgmllib.py", line 95, in feed
    self.goahead(0)
  File "C:\Python22\lib\sgmllib.py", line 161, in goahead
    k = self.parse_declaration(i)
  File "C:\Python22\lib\markupbase.py", line 66, in parse_declaration
    decltype, j = self._scan_name(j, i)
  File "C:\Python22\lib\markupbase.py", line 313, in _scan_name
    self.error("expected name token")
  File "C:\Python22\lib\sgmllib.py", line 102, in error
    raise SGMLParseError(message)
sgmllib.SGMLParseError: expected name token

I believe this is because of the XML in the page's source code
(see WSJArticle002_source.txt attached to this bug report).

Even if the code in this page is poorly formatted, webchecker
should be able to continue checking the other links in this domain
rather than stopping. For example, webchecker could report
“unable to check http://www.naleo.org/WSJArticle002.htm”, print a
traceback like the one above, and then continue with the rest of
the domain.
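
The behaviour requested above could be sketched roughly as follows. This is not webchecker's actual code, nor the fix that was eventually committed: PageParseError, extract_links and check_pages are hypothetical names, the example.test URLs are made up, and the toy parser only imitates the "expected name token" failure. The point is the try/except around the per-page parse, so that one bad page is reported and the run goes on.

```python
import re


class PageParseError(Exception):
    """Hypothetical stand-in for a parser error such as SGMLParseError."""


def extract_links(text):
    """Toy link extractor (an assumption, not webchecker's parser).

    Raises PageParseError when "<!" is not followed by a letter or "-",
    loosely imitating the "expected name token" failure.
    """
    if re.search(r"<![^A-Za-z\-]", text):
        raise PageParseError("expected name token")
    return re.findall(r'href="([^"]*)"', text)


def check_pages(pages):
    """Parse every page; report unparsable ones instead of stopping."""
    links = {}
    unparsable = {}
    for url, text in pages.items():
        try:
            links[url] = extract_links(text)
        except PageParseError as exc:
            # Record the failure and keep going with the remaining pages.
            unparsable[url] = str(exc)
            print("unable to check %s: %s" % (url, exc))
    return links, unparsable


# Made-up pages: one parses cleanly, one trips the toy parser.
pages = {
    "http://example.test/good.htm": '<!DOCTYPE html><a href="a.htm">a</a>',
    "http://example.test/bad.htm": '<!?xml version="1.0"?><a href="b.htm">b</a>',
}
links, unparsable = check_pages(pages)
```

Catching only the parser's own exception type, rather than every exception, keeps genuine programming errors visible while still letting a malformed page be skipped.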
msg11865 - Author: Jeremy Hylton (jhylton) (Python triager) Date: 2002-08-08 19:20
I've seen a variety of parsing problems kill webchecker.  I
agree that these exceptions should be caught somewhere so
that they are not fatal.  Care to submit a patch?
msg11866 - Author: Carlos Conti (mcsolrac) Date: 2002-08-08 22:06
I'd love to submit a patch, but I am a newbie to both Python
and programming. I apologize if this space is only intended
for programmers; I am a QA engineer just getting acquainted
with the wonderful world of Python.
msg11867 - Author: Jeremy Hylton (jhylton) (Python triager) Date: 2002-08-13 13:36
No need to apologize.  Everyone is welcome to submit bug
reports here.  There are, however, lots of programmers who
submit bugs, so I find it helpful to ask :-).  I'll look
into this, but it's not the highest priority.
msg11868 - Author: Michael Hudson (mwh) (Python committer) Date: 2004-08-07 20:49
On #python-dev IRC, jlgijsbers reports that this was fixed by
revision 1.30 of webchecker.py.
History
Date User Action Args
2022-04-10 16:05:34 admin set github: 36999
2002-08-08 04:40:27 mcsolrac create