Issue 1599325: htmlentitydefs.entitydefs assumes Latin-1 encoding

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/44253

classification

Title:	htmlentitydefs.entitydefs assumes Latin-1 encoding
Type:		Stage:
Components:	Library (Lib)	Versions:

process

Created on 2006-11-19 19:40 by edemaine, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (2)
msg30630	Author: Erik Demaine (edemaine)	Date: 2006-11-19 19:40
The code in htmlentitydefs.py that sets entitydefs uses chr whenever the codepoint is <= 0xff. This should be <= 0x7f. As it currently stands, htmlentitydefs.entitydefs['nbsp'] == '\xa0'. But this is only "true" in the Latin-1 encoding. For example, in UTF-8, the same character (u'\xa0') would be encoded as '\xc2\xa0'. While I think it is reasonable for entitydefs to use the ASCII codec for characters encodable in that codec (<= 0x7f), I do not think it is reasonable to assume Latin-1 encoding. This issue affects sgmllib.SGMLParser, for example, when handle_entityref calls handle_data. The passed data can be '\xa0', which handle_data is forced to assume is Latin-1, even though the source string might be in a different encoding.
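The encoding mismatch described above can be reproduced without the module itself. The following is a minimal sketch in Python 3 syntax (the report concerns Python 2 byte strings); the variable names are illustrative only and do not come from the library.

```python
# U+00A0 (NO-BREAK SPACE), the character behind the 'nbsp' entity.
nbsp = "\u00a0"

# The same character produces different bytes under different encodings.
latin1_bytes = nbsp.encode("latin-1")  # a single byte
utf8_bytes = nbsp.encode("utf-8")      # two bytes

# A lone Latin-1 byte 0xA0 is not valid UTF-8, so a consumer that
# assumes UTF-8 cannot decode it.
try:
    latin1_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

print(latin1_bytes, utf8_bytes, decodes_as_utf8)
```

This is why a handle_data implementation that receives '\xa0' has to know, or guess, which encoding the byte belongs to.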
msg30631	Author: Martin v. Löwis (loewis) *	Date: 2006-11-19 19:59
This is not a bug. entitydefs is specified to contain Latin-1 byte strings in its documentation, and many applications rely on that. If you have different processing needs, you may want to use htmlentitydefs.name2codepoint instead, or derive yet another table automatically from it.
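One way to derive such a table from name2codepoint might look like the sketch below. It assumes Python 3, where the module is named html.entities rather than htmlentitydefs; on Python 2 the import would be from htmlentitydefs and unichr would replace chr. The helper build_entity_table is a hypothetical name chosen for illustration, not part of the standard library.

```python
from html.entities import name2codepoint  # 'htmlentitydefs' on Python 2


def build_entity_table(encoding="utf-8"):
    """Map each entity name to its character encoded in `encoding`.

    Hypothetical helper: it assumes every codepoint in name2codepoint
    is representable in the chosen encoding, which holds for UTF-8.
    """
    table = {}
    for name, codepoint in name2codepoint.items():
        table[name] = chr(codepoint).encode(encoding)
    return table


utf8_entities = build_entity_table("utf-8")
print(utf8_entities["nbsp"])
```

With a table like this, a parser subclass could pass correctly encoded bytes to handle_data instead of relying on the Latin-1 values in entitydefs.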