classification
Title: HTMLParser should support entities in attributes
Type:
Stage:
Components: None
Versions:

process
Status: closed
Resolution: accepted
Dependencies:
Superseder:
Assigned To: fdrake
Nosy List: aaronsw, fdrake, loewis
Priority: normal
Keywords: patch

Created on 2004-03-09 01:20 by aaronsw, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
replacement.py aaronsw, 2004-03-09 01:21 replacement unescape function for HTMLParser.py
Messages (4)
msg45480 - (view) Author: Aaron Swartz (aaronsw) Date: 2004-03-09 01:20
HTMLParser doesn't currently support entities in attributes, like this:

<span title="&#8221; is a nice character">foo</span>

This patch fixes that. Simply replace the unescape method in HTMLParser.py with:


import htmlentitydefs

def unescape(self, s):

	def replaceEntities(s):
		s = s.groups()[0]
		if s[0] == "#":
			s = s[1:]
			if s[0] in ['x','X']:
				c = int(s[1:], 16)
			else:
				c = int(s)
			return unichr(c)
			
		else:
			return unichr(htmlentitydefs.name2codepoint[c])

	return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
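
For readers on Python 3, where the module lives at html.parser, the behavior requested here can be checked with a small subclass. This is only a sketch and not part of the patch: AttrCollector is a hypothetical name, and the input uses the numeric character reference &#8221; (U+201D).

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper that records the attributes of every start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; values arrive already unescaped.
        self.seen.append((tag, attrs))


parser = AttrCollector()
parser.feed('<span title="&#8221; is a nice character">foo</span>')
parser.close()
print(parser.seen)
```

If entities in attributes are supported, the title value reaches handle_starttag with the reference already replaced by the right double quotation mark.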

msg45481 - (view) Author: Aaron Swartz (aaronsw) Date: 2004-03-09 01:21
Oops. The replacement function is attached.
msg45482 - (view) Author: Aaron Swartz (aaronsw) Date: 2004-03-09 01:21
Argh. Hopefully now.
msg45483 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2007-03-06 14:46
Thanks for the patch. Committed as r54165, with the following changes:

- added documentation changes
- added testsuite changes
- fixed incorrect usage of c in name2codepoint[c] (should be [s])
- included &apos; in the list of supported entities, for compatibility with older versions of HTMLParser
- fall back to replacing an unsupported entity reference with &name;
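
The code changes listed above can be sketched roughly as follows. This is a Python 3 translation written for illustration, not the code committed in r54165: it is a plain function rather than a method, it uses chr and html.entities in place of unichr and htmlentitydefs, and the names _ENTITY_RE and replace_entity are made up here. The handling of malformed numeric references (returning the text unchanged) is an extra assumption that the messages above do not mention.

```python
import re
from html.entities import name2codepoint

# Same pattern as in the submitted patch.
_ENTITY_RE = re.compile(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));")


def unescape(s):
    def replace_entity(match):
        ref = match.group(1)
        if ref[0] == "#":
            ref = ref[1:]
            try:
                if ref[0] in "xX":
                    code = int(ref[1:], 16)
                else:
                    code = int(ref)
                return chr(code)
            except (ValueError, OverflowError):
                # Assumption: leave malformed numeric references untouched.
                return match.group(0)
        if ref == "apos":
            # &apos; is not in name2codepoint, so it is handled explicitly.
            return "'"
        try:
            # Look up the entity name itself (the patch indexed with c by mistake).
            return chr(name2codepoint[ref])
        except KeyError:
            # Unsupported entity: fall back to the original &name; text.
            return "&" + ref + ";"

    return _ENTITY_RE.sub(replace_entity, s)


print(unescape("&#8221; is a nice character"))
print(unescape("&bogus; stays as it was"))
```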
History
Date User Action Args
2022-04-11 14:56:03 admin set github: 40012
2004-03-09 01:20:04 aaronsw create