I am using htmllib to parse web pages for plain text content. I
came across a web page that contains a script construct similar
to the example below. Note that the script itself writes out another
script. The parser appears to be confused by the mix of single and
double quotes between the real <script> and </script> tags.
I am using "Python 2.3 (#1, Sep 13 2003, 00:49:11) [GCC
3.3 20030304 (Apple Computer, Inc. build 1495)] on darwin" on a
PowerBook G4 running Mac OS X 10.3.8.
<html>
<body>
<h1> This is a test </h1>
<br>
<blockquote>
<script language="JavaScript">
rnum = Math.round( Math.random() * 100000 );
document.write( '<scr' + 'ipt src="http://www.a.org/' +
rnum + '/"></scr' + 'ipt>' );
</script>
</blockquote>
</body>
</html>
Here is the Python trace:
Traceback (most recent call last):
  File "cleanFeed.py", line 26, in ?
    clean = stripHtml.strip( feed )
  File "/Users/allan/Desktop/Mood for Today/stripHtml.py", line 144, in strip
    parser.feed(s)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/HTMLParser.py", line 327, in parse_endtag
    self.error("bad end tag: %s" % `rawdata[i:j]`)
  File "/System/Library/Frameworks/Python.framework/Versions/2.3/lib/python2.3/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "</scr' + 'ipt>", at line 1, column 309
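For comparison, here is a minimal sketch of a text stripper run against the same page. It assumes Python 3's html.parser (the successor to the HTMLParser module in the trace), not Python 2.3. The names TextExtractor, strip_html and PAGE are hypothetical and are not taken from stripHtml.py. The sketch drops everything between the real <script> and </script> tags, so the quoted '</scr' + 'ipt>' fragment should only ever be seen as script data.

```python
from html.parser import HTMLParser

# The sample page from the post, reproduced as a string.
PAGE = """<html>
<body>
<h1> This is a test </h1>
<br>
<blockquote>
<script language="JavaScript">
rnum = Math.round( Math.random() * 100000 );
document.write( '<scr' + 'ipt src="http://www.a.org/' +
rnum + '/"></scr' + 'ipt>' );
</script>
</blockquote>
</body>
</html>
"""


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = ("script", "style")

    def __init__(self):
        super().__init__()
        self.chunks = []      # text fragments found outside skipped elements
        self.skip_depth = 0   # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Script bodies arrive here as plain data; drop them while skipping.
        if self.skip_depth == 0:
            self.chunks.append(data)


def strip_html(markup):
    """Return the visible text of markup with whitespace runs collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join("".join(parser.chunks).split())


if __name__ == "__main__":
    print(strip_html(PAGE))
```

This does not change how the Python 2.3 parser behaves. It only shows the intended result, which is the heading text with the script body left out.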