classification
Title: htmlentitydefs.entitydefs assumes Latin-1 encoding
Type:
Stage:
Components: Library (Lib)
Versions:
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: edemaine, loewis
Priority: normal
Keywords:

Created on 2006-11-19 19:40 by edemaine, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (2)
msg30630 - Author: Erik Demaine (edemaine) Date: 2006-11-19 19:40
The code in htmlentitydefs.py that sets entitydefs uses chr whenever the codepoint is <= 0xff.  This should be <= 0x7f.

As it currently stands, htmlentitydefs.entitydefs['nbsp'] == '\xa0'.  But this is only "true" in the Latin-1 encoding.  For example, in UTF-8, the same character (u'\xa0') would be encoded as '\xc2\xa0'.  While I think it is reasonable for entitydefs to use the ASCII codec for characters encodable in that codec (<= 0x7f), I do not think it is reasonable to assume Latin-1 encoding.
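
The encoding difference described above can be checked directly. This is only a small sketch, written for Python 3 (an assumption; the report itself concerns Python 2 byte strings), and the variable names are illustrative:

```python
# U+00A0 (NO-BREAK SPACE), the character behind the 'nbsp' entity.
nbsp = "\u00a0"

# The same character yields different bytes under different codecs.
latin1_bytes = nbsp.encode("latin-1")   # single byte 0xA0
utf8_bytes = nbsp.encode("utf-8")       # two bytes 0xC2 0xA0

# A lone 0xA0 byte is not valid UTF-8, so a consumer that expects UTF-8
# cannot interpret the Latin-1 value as-is.
try:
    latin1_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(latin1_bytes, utf8_bytes, decodes_as_utf8)
```

Characters at or below 0x7f do not have this problem, since ASCII, Latin-1, and UTF-8 all encode them as the same single byte.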

This issue affects sgmllib.SGMLParser, for example, when handle_entityref calls handle_data.  The passed data can be '\xa0', which handle_data is forced to assume is Latin-1, when the source string might be encoded otherwise.
msg30631 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2006-11-19 19:59
This is not a bug. entitydefs is specified to contain Latin-1 byte strings in its documentation, and many applications rely on that.

If you have different processing needs, you may want to use htmlentitydefs.name2codepoint instead, or derive yet another table automatically from it.
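
One possible way to derive such a table from name2codepoint is sketched below. It assumes Python 3, where the module is named html.entities rather than htmlentitydefs. The helper build_entity_table is a hypothetical name, not part of the standard library, and the fallback to a numeric character reference for unencodable characters is a design choice of this sketch:

```python
# Python 3 module name; in Python 2 this was htmlentitydefs (assumption:
# the reader is running Python 3).
from html.entities import name2codepoint


def build_entity_table(encoding="utf-8"):
    """Hypothetical helper: map entity names to bytes in `encoding`.

    Characters the codec cannot represent fall back to an ASCII numeric
    character reference such as b'&#945;'.
    """
    table = {}
    for name, codepoint in name2codepoint.items():
        try:
            table[name] = chr(codepoint).encode(encoding)
        except UnicodeEncodeError:
            table[name] = ("&#%d;" % codepoint).encode("ascii")
    return table


utf8_table = build_entity_table("utf-8")
ascii_table = build_entity_table("ascii")
print(utf8_table["nbsp"], ascii_table["nbsp"], ascii_table["amp"])
```

With "ascii" as the encoding, only code points up to 0x7f are stored as raw bytes, which matches the behaviour requested in the original report. With "latin-1", the cutoff moves to 0xff.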
History
Date                 User      Action  Args
2022-04-11 14:56:21  admin     set     github: 44253
2006-11-19 19:40:19  edemaine  create