classification
Title: HTMLParser chokes on my.yahoo.com output
Components: Library (Lib)
Versions: Python 2.2
process
Status: closed
Resolution: wont fix
Assigned To: fdrake
Nosy List: fdrake, georg.brandl, gvanrossum, nnorwitz, rjwalsh, tmick
Priority: normal

Created on 2003-06-26 21:11 by rjwalsh, last changed 2022-04-10 16:09 by admin. This issue is now closed.

Files
File name Uploaded Description
htmlparser-patch.txt gvanrossum, 2003-06-30 15:49 context diff, reconstructed.
Messages (9)
msg16611 - Author: Robert Walsh (rjwalsh) Date: 2003-06-26 21:11
The HTML parser chokes on the output produced by
http://my.yahoo.com/.  The problem appears to be that
the HTML Yahoo is producing contains stuff like this:

<option foo bar=>

The bar= without any value causes HTMLParser to get
confused.  I made the following patch to HTMLParser.py
and everything is now happy.  This may be illegal HTML,
but it appears to be popular.  Basically, this patch
tells it that the part after the = is optional.

--- HTMLParser.py.orig  2003-06-26 14:05:07.670049324 -0700
+++ HTMLParser.py       2003-06-26 14:05:14.440298260 -0700
@@ -36,7 +36,7 @@
         (?:'[^']*'                   # LITA-enclosed value
           |\"[^\"]*\"                # LIT-enclosed value
           |[^'\">\s]+                # bare value
-         )
+         )?
        )?
      )
    )*
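
The effect of the one-character change can be reproduced outside HTMLParser with a small sketch. Only the three value alternatives come from the patch above; the tag-name and attribute-name parts of the pattern, and the names `TEMPLATE`, `build`, `original`, and `patched`, are assumptions made so the example runs on its own. This is not a copy of the module.

```python
import re

# Start-tag pattern modeled on the fragment in the patch. The "%s" marks the
# spot where the patch adds "?" after the group of value alternatives.
# Everything outside the three value alternatives is assumed for illustration.
TEMPLATE = r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:\s+                             # whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |"[^"]*"                   # LIT-enclosed value
          |[^'">\s]+                 # bare value
        )%s
      )?
    )
  )*
  \s*                                # trailing whitespace
"""


def build(value_optional):
    """Compile the pattern with or without the patch's extra '?'."""
    suffix = "?" if value_optional else ""
    return re.compile(TEMPLATE % suffix, re.VERBOSE)


original = build(value_optional=False)
patched = build(value_optional=True)

tag = "<option foo bar=>"

# Without the patch the match stops before "=", so a caller that expects
# ">" right after the match sees an unexpected character.
print(original.match(tag).end(), repr(tag[original.match(tag).end()]))

# With the patch the dangling "=" is consumed and the match ends at ">".
print(patched.match(tag).end(), repr(tag[patched.match(tag).end()]))
```

With the unpatched pattern the match ends just before the `=`. With the patched pattern it ends at the closing `>`, which is what a parser that checks the character following the match would expect to find.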
msg16612 - Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2003-06-27 02:55
It's difficult to read the patch as posted since whitespace
is lost.  Please attach the patch as a file.  Thanks.
msg16613 - Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2003-06-30 15:49
Here it is (a one-char change). Looks harmless to me.
msg16614 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2005-06-01 12:24
Should it be applied, then?
msg16615 - Author: Robert Walsh (rjwalsh) Date: 2005-06-01 20:51
It's been so long since I looked at this that I don't believe
I even have the code any more.  It's just a one-character
change, though: can you recreate it yourself by just adding
the ? character to the end of line 39 in HTMLParser.py?
That is, unless it's moved in the meantime, of course.
msg16616 - Author: Robert Walsh (rjwalsh) Date: 2005-06-01 20:52
Crap.  Stupid SourceForge bug tracker puts the latest stuff
on top - I was replying to the wrong one.  The change can be
applied, in my opinion.
msg16617 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2005-08-31 22:09
Checked in as Lib/HTMLParser.py r1.16, 1.15.2.1.
msg16618 - Author: Trent Mick (tmick) (Python triager) Date: 2005-09-02 00:04
...and subsequently backed out in r1.15.2.2 and r1.17.

    Reverting previous checkin. This breaks too much of 
    HTMLParser to be applied without thought. Anyway, such 
    malformed HTML is better handled by something
    like BeautifulSoup.


Apologies, Reinhold, if you were getting to this. I just
happened to notice this while reading python-checkins. Cheers.
msg16619 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2005-09-02 11:01
Yes, thanks for noting, it was still on my todo list...
History
Date User Action Args
2022-04-10 16:09:26 admin set github: 38718
2003-06-26 21:11:15 rjwalsh create