Patch for bug #1452246 htmllib doesn't properly
substitute entities
Continuation of patch #1087808 sgmllib.SGMLParser does
not unescape attribute values; patch
Substitute entities in argument values
import htmllib
import formatter
import StringIO
s = StringIO.StringIO()
p = htmllib.HTMLParser(formatter.AbstractFormatter(formatter.DumbWriter(s)))
p.feed('<img alt="&lt;&gt;&amp;">')
print s.getvalue()
will now print '<>&' instead of '&lt;&gt;&amp;'.
The patch modifies module sgmllib, class SGMLParser,
method parse_starttag. In this method, the entities are
substituted in the argument values. The substitutions
are based on existing property SGMLParser.entitydefs.
For parsing it uses the regular expression entityref.
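For readers without the patch at hand, the substitution step can be sketched roughly as follows. This is a standalone illustration, not the patch itself: the helper name substitute_entities is hypothetical, the pattern is assumed to match sgmllib.entityref, and the module-level dictionary stands in for SGMLParser.entitydefs.

```python
import re

# Assumed to be the same pattern as sgmllib.entityref:
# "&", an entity name, then one non-alphanumeric character.
entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')

# Stand-in for SGMLParser.entitydefs.
entitydefs = {'lt': '<', 'gt': '>', 'amp': '&', 'quot': '"', 'apos': "'"}

def substitute_entities(value):
    """Hypothetical helper: replace known entity references in a value."""
    def replace(match):
        name = match.group(1)
        if name not in entitydefs:
            # Unknown reference: leave the matched text untouched.
            return match.group(0)
        terminator = match.group(0)[-1]
        # A ";" belongs to the reference and is dropped; any other
        # terminating character is ordinary text and is kept.
        return entitydefs[name] + ('' if terminator == ';' else terminator)
    return entityref.sub(replace, value)
```

With these assumptions, substitute_entities('&lt;&gt;&amp;') gives '<>&', and a reference to a name missing from the dictionary comes back unchanged.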
Regarding the differences between this patch and patch
#1087808:
- use self.entitydefs to determine the set of entity
names that are supported;
- unknown entity references are left alone;
- the regular expression entityref is used to find
references;
- a documentation patch is not needed as the method
is internal.
Regarding the fact that the semicolon after the entity
name is not mandatory in SGML: given the way entityref
is defined, "&lt " will become "< ", while "&lt" will
stay "&lt", whether or not it is in an attribute value.
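The effect of the terminating character can be checked against the pattern alone. The snippet below is only a sketch: the pattern is assumed to be the one defined as sgmllib.entityref, and referenced_names is a hypothetical helper that lists the entity names the pattern finds.

```python
import re

# Assumed to be the same pattern as sgmllib.entityref. The final
# character class requires one non-alphanumeric character after the name.
entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')

def referenced_names(value):
    """Hypothetical helper: names of the entity references found in value."""
    return entityref.findall(value)

# A reference followed by a space or a semicolon is recognised.
with_space = referenced_names('&lt ')
with_semicolon = referenced_names('&lt;')
# A bare reference at the very end of the value has no terminating
# character, so the pattern does not match it.
at_end = referenced_names('&lt')
```

Here with_space and with_semicolon both contain 'lt', while at_end is empty, which is why a trailing reference without a terminator is left as it is.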
The patch also adds test cases in module
test/test_sgmllib.py, class SGMLParserTestCase, method
test_attr_values. In that method, the proper
substitution is tested.
Ray