classification
Title: sgmllib.SGMLParser
Type:
Stage:
Components: Library (Lib)
Versions: Python 2.4

process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: effbot, pbirnie
Priority: normal
Keywords:

Created on 2005-02-06 14:04 by pbirnie, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (5)
msg24174 - (view) Author: Paul Birnie (pbirnie) Date: 2005-02-06 14:04
sgmllib.SGMLParser calls the start_* and end_* methods
correctly until it encounters

        <a title="link1" href="url1">One</a>
        <br/><a title="link2" href="someurl2">Two</a>
        <a title="link2" href="url3">Three</a> 

The <br/> seems to confuse the parser, and I only get 
callbacks for tag a twice (links 1 and 3).
  

msg24175 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2005-02-08 08:01
Logged In: YES 
user_id=38376

footnote: <br/> is an XML construct, and is not valid HTML.  
In HTML, "<tag/blah/" is short for "<tag>blah</tag>", so the 
BR section is parsed as

START br
DATA ><a title="link2" href="someurl2">Two<
END br
DATA a>

which is 100% correct.  For more on this topic, see:

http://www.cs.tut.fi/~jkorpela/html/empty.html
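
To make the event sequence above concrete, here is a toy sketch of the SGML "null end tag" shorthand, in which "<tag/" opens an element and the next "/" closes it. This is not sgmllib's implementation; the names NET_START and net_events are hypothetical, and the function assumes a fragment that contains nothing but this shorthand and plain data.

```python
import re

# Toy illustration only (hypothetical helper, not sgmllib code):
# "<tag/" opens an element and the next "/" closes it.
NET_START = re.compile(r"<([A-Za-z][A-Za-z0-9]*)/")


def net_events(text):
    """Return (kind, value) events for a fragment read with the NET shorthand."""
    events = []
    pos = 0
    while pos < len(text):
        match = NET_START.search(text, pos)
        if match is None:
            # No further "<tag/" opener: everything left is character data.
            events.append(("DATA", text[pos:]))
            break
        if match.start() > pos:
            events.append(("DATA", text[pos:match.start()]))
        tag = match.group(1)
        events.append(("START", tag))
        end = text.find("/", match.end())
        if end == -1:
            # Opener without a closing "/": the rest is data inside the element.
            events.append(("DATA", text[match.end():]))
            break
        if end > match.end():
            events.append(("DATA", text[match.end():end]))
        events.append(("END", tag))
        pos = end + 1
    return events


fragment = '<br/><a title="link2" href="someurl2">Two</a>'
for kind, value in net_events(fragment):
    print(kind, value)
```

The slash in "</a>" is the first "/" after "<br/", so it ends the br element, and the second link never reaches the parser as a tag.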
msg24176 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2005-02-08 08:03

footnote 2: if you need to deal with broken HTML, use 
TidyLib:

http://utidylib.berlios.de/
http://effbot.org/zone/element-tidylib.htm
msg24177 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2005-02-08 08:14

footnote 3: for the link case, also note that the HTMLParser 
module handles this in a more practical way (that is, it limits 
itself to SGML features that are actually used on the web).
msg24178 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2005-02-14 11:17

closing, due to lack of feedback.  using HTMLParser instead
of sgmllib should solve the problem.
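
As a rough sketch of that suggestion, the snippet below collects the href of every a tag from the reporter's markup. It assumes Python 3, where the HTMLParser module referred to in this thread lives at html.parser; the LinkCollector class is a hypothetical name chosen for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: records the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "a":
            self.hrefs.append(dict(attrs).get("href"))


markup = """
<a title="link1" href="url1">One</a>
<br/><a title="link2" href="someurl2">Two</a>
<a title="link2" href="url3">Three</a>
"""

collector = LinkCollector()
collector.feed(markup)
collector.close()
print(collector.hrefs)
```

Here <br/> is treated as an empty element rather than as the opening of a shorthand element, so all three links are reported.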
History
Date                 User     Action  Args
2022-04-11 14:56:09  admin    set     github: 41532
2005-02-06 14:04:25  pbirnie  create