urllib.splithost(url) requires that the URL passed in
be of the form '//host[:port]/path'. However, I have run
across some URLs of the form '//host[:port]?querystring'.
For these, splithost returns everything after the '//'
as the host and nothing as the path.
Section 3.2 of RFC 2396 (Uniform Resource Identifiers:
Generic Syntax) states that 'The authority component is
preceded by a double slash "//" and is terminated by
the next slash "/", question-mark "?", or by the end of
the URI.'
Also, this is how it defines a URI:
absoluteURI = scheme ":" ( hier_part | opaque_part )
hier_part = ( net_path | abs_path ) [ "?" query ]
net_path = "//" authority [ abs_path ]
abs_path = "/" path_segments
Based on the above, 'http://authority?query' is a
valid URL: net_path allows abs_path to be omitted, and
hier_part allows a query to follow directly.
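
As a cross-check of this reading of the grammar, here is a minimal sketch that assumes a modern Python 3 interpreter, where urllib.parse.urlsplit is available. This is not the Python 2.3 urllib.splithost function this report is about; it only illustrates the split that RFC 2396 implies for such a URL.

```python
from urllib.parse import urlsplit

# Assumption: Python 3, where urlsplit lives in urllib.parse.
parts = urlsplit('http://authority?query')

print(repr(parts.netloc))  # authority component
print(repr(parts.path))    # path component (empty for this URL)
print(repr(parts.query))   # query component
```

The authority ends at the '?', the path is empty, and the text after the '?' is reported as the query.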
In Python 2.3, the fix is to change line 939 of
urllib.py from:
_hostprog = re.compile('^//([^/]*)(.*)$')
to:
_hostprog = re.compile('^//([^/?]*)(.*)$')
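
The effect of the change can be checked in isolation with the re module. The sketch below copies the two patterns quoted above; the names OLD_HOSTPROG, NEW_HOSTPROG and split_with are hypothetical and exist only for this illustration, and the helper only approximates what splithost does with the match.

```python
import re

# Pattern quoted above as the current one in urllib.py.
OLD_HOSTPROG = re.compile('^//([^/]*)(.*)$')
# Proposed pattern: the host also stops at a '?'.
NEW_HOSTPROG = re.compile('^//([^/?]*)(.*)$')


def split_with(pattern, url):
    """Hypothetical helper: return (host, rest), or (None, url) if no match."""
    match = pattern.match(url)
    if match:
        return match.group(1, 2)
    return None, url


for url in ('//host:80/path', '//host:80?query=1'):
    print(url)
    print('  old:', split_with(OLD_HOSTPROG, url))
    print('  new:', split_with(NEW_HOSTPROG, url))
```

For '//host:80/path' both patterns give ('host:80', '/path'). For '//host:80?query=1' the old pattern gives ('host:80?query=1', ''), while the proposed one gives ('host:80', '?query=1').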
This appears to affect all Python versions; I just
happened to be using 2.3.