
classification
Title: HTMLParser parses AT&T to AT
Type: Stage:
Components: Library (Lib) Versions: Python 2.3
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: jimjjewett, jrm, lhy719, loewis
Priority: normal Keywords:

Created on 2003-12-09 02:47 by lhy719, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (4)
msg19332 - Author: Hammer Lee (lhy719) Date: 2003-12-09 02:47
I use HTMLParser to parse HTML files. It makes a mistake
when the HTML content contains '&', as in <BR>AT&T
Research Labs Cambridge - WinVNC Version 3, 3, 3, 7.

HTMLParser parses "AT&T Research" to "AT Research".

It also happens with "ETTC&P EpSCTWeb_Fr Application Version
1, 0, 0, 1".

I'm a newbie in Python, and I don't know how to solve it.
msg19333 - Author: Jim Jewett (jimjjewett) Date: 2003-12-11 18:32

Technically, that isn't legal HTML; they're supposed to write
&amp; (follow the & with "amp;"), because & is an
escape character.

That said, it is a pretty common error in web pages. The
parser already recovers at the next space (instead of waiting
for a ";"), and I think it would be reasonable to just return the
"&T" when T doesn't turn out to be a known entity.

You would do this by overriding handle_entityref -- but to be 
honest, I suspect that you're "really" using some other library 
(or local code) which already does this, so you may have to 
make the modification there.
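
A minimal sketch of the override suggested above, written for a current Python 3, where the module is html.parser rather than the HTMLParser module of Python 2.3. The names TextCollector and extract_text are hypothetical and used only for illustration. convert_charrefs=False is passed on the assumption that entity references should still reach handle_entityref instead of being converted automatically.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text and keeps unknown entity references such as "&T" verbatim."""

    def __init__(self):
        # With convert_charrefs=False the parser calls handle_entityref
        # for every "&name" it recognises as an entity reference.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        if name in name2codepoint:
            # Known entity, e.g. "amp": emit the character it stands for.
            self.parts.append(chr(name2codepoint[name]))
        else:
            # Unknown entity, e.g. "T" from "AT&T": put the text back as written.
            self.parts.append("&" + name)


def extract_text(html):
    """Hypothetical helper: return the text content of an HTML fragment."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(extract_text("<BR>AT&T Research Labs Cambridge"))
```

Numeric references such as "&#38;" go through handle_charref, which this sketch leaves at its default and therefore drops.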
msg19334 - Author: Jordan R McCoy (jrm) Date: 2003-12-24 17:40

The HTML being parsed should use '&amp;' for the '&'; 
however, HTMLParser uses this regexp to identify entity 
references (line 20):

    entityref = re.compile(
        '&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')

which accepts any non-alphanumeric character as the terminator
rather than only the ';' required at the end by the HTML
specification. This may or may not be intentional.
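
The pattern quoted in this message can be tried directly to see why "&T " is picked up as an entity reference. This is only a small sketch: strict_entityref is a hypothetical variant that insists on the ';' and is not part of HTMLParser.

```python
import re

# Pattern as quoted in this message.
entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')

# Hypothetical stricter variant that accepts only ';' as the terminator.
strict_entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*);')

loose = entityref.search("AT&T Research")
strict = strict_entityref.search("AT&T Research")
escaped = strict_entityref.search("AT&amp;T Research")

print(loose.group(1))    # entity name captured by the quoted pattern
print(strict)            # the strict variant finds no entity here
print(escaped.group(1))  # a properly escaped ampersand still matches
```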
msg19335 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2003-12-30 11:21

What do you mean by "it parses it to AT Research"? It most
certainly does no such thing. Instead, it invokes
handle_entityref with the "T" entity, which you should process.

Closing as not-a-bug.
History
Date                 User    Action  Args
2022-04-11 14:56:01  admin   set     github: 39682
2003-12-09 02:47:41  lhy719  create