classification
Title: urllib.splithost parses incorrectly
Type:
Stage:
Components: Library (Lib)
Versions: Python 2.3
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: georg.brandl, onlynone
Priority: normal
Keywords:

Created on 2006-03-23 20:49 by onlynone, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (3)
msg27856 - Author: Steven Willis (onlynone) Date: 2006-03-23 20:49
urllib.splithost(url) requires that the URL passed in
be of the form '//host[:port]/path'. Yet I've run
across some URLs of the form
'//host[:port]?querystring'. This causes splithost to
return the entire string as the host and an empty path.


Section 3.2 of RFC 2396 (Uniform Resource Identifiers:
Generic Syntax) states that 'The authority component is
preceded by a double slash "//" and is terminated by
the next slash "/", question-mark "?", or by the end of
the URI.'

Also, this is how it defines a URI:

absoluteURI   = scheme ":" ( hier_part | opaque_part )
hier_part     = ( net_path | abs_path ) [ "?" query ]
net_path      = "//" authority [ abs_path ]
abs_path      = "/"  path_segments

Based on the above, 'http://authority?query' is
certainly a valid URL.
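
As a quick cross-check of that reading of the grammar, here is a minimal sketch using urllib.parse.urlsplit from modern Python 3. This is not the urllib.splithost function this report is about; it only shows that a standard parser treats '?' as the end of the authority.

```python
from urllib.parse import urlsplit

# A URL with an authority followed directly by a query, no path.
parts = urlsplit('http://authority?query')

# The authority ends at '?'; the path is empty.
print(parts.netloc)
print(repr(parts.path))
print(parts.query)
```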


In Python 2.3, you would just need to change line 939 of
urllib.py from:

        _hostprog = re.compile('^//([^/]*)(.*)$')

to:

        _hostprog = re.compile('^//([^/?]*)(.*)$')

This appears to affect all Python versions; I just
happened to be using 2.3.
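
The effect of the one-character change can be checked in isolation. The following is a minimal sketch, not the actual urllib source: it rebuilds the two patterns quoted above with plain re (written for Python 3), and the helper name split_host is hypothetical.

```python
import re

# Pattern quoted above from urllib.py in Python 2.3.
OLD_HOSTPROG = re.compile('^//([^/]*)(.*)$')
# Proposed pattern: the host also stops at the first '?'.
NEW_HOSTPROG = re.compile('^//([^/?]*)(.*)$')


def split_host(url, hostprog):
    """Hypothetical stand-in for splithost: return (host, rest)."""
    match = hostprog.match(url)
    if match:
        return match.group(1), match.group(2)
    return None, url


url = '//host.com?a=b:3b'
old_result = split_host(url, OLD_HOSTPROG)
new_result = split_host(url, NEW_HOSTPROG)

print(old_result)  # the whole string ends up in the host slot
print(new_result)  # host and query string are separated
```

URLs of the usual '//host[:port]/path' form split the same way under both patterns, since the host still stops at the first slash.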
msg27857 - Author: Steven Willis (onlynone) Date: 2006-03-24 17:12
The specific problem I was having was that the URL had a
colon in the query string. Since the query string was being
parsed as part of the host, the text after the colon was
being treated as the port when urllib.splitport was called
later. The following is a simple test case:

import urllib2
webpage = urllib2.urlopen("http://host.com?a=b:3b")

This raises "httplib.InvalidURL: nonnumeric port: '3b'".
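
The failure can be reproduced without any network access. The sketch below is a simplified model of host:port parsing, not the library's actual code: parse_port is a hypothetical helper, and ValueError stands in for httplib.InvalidURL. It is written for Python 3.

```python
def parse_port(hostport):
    """Hypothetical, simplified host:port parser (not the library code)."""
    host, sep, port = hostport.rpartition(':')
    if not sep:
        # No colon at all: there is no port to parse.
        return hostport, None
    try:
        return host, int(port)
    except ValueError:
        # ValueError stands in for httplib.InvalidURL here.
        raise ValueError("nonnumeric port: %r" % port)


# What the old pattern hands over as the "host" for
# http://host.com?a=b:3b
bad_host = 'host.com?a=b:3b'

try:
    parse_port(bad_host)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Once the host stops at the '?', the string passed on is just 'host.com', which has no colon and therefore no port to misparse.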
msg27858 - Author: Georg Brandl (georg.brandl) (Python committer) Date: 2006-03-26 21:00
Fixed in rev. 43330.
History
Date                 User      Action  Args
2022-04-11 14:56:16  admin     set     github: 43078
2006-03-23 20:49:08  onlynone  create