classification
Title: urlparse doesn't handle host?bla
Type:
Stage:
Components: Library (Lib)
Versions: Python 2.4
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: jepler, jlgijsbers, mrovner, msdemlei, paul.moore, staschuk
Priority: normal
Keywords:

Created on 2002-04-24 15:36 by msdemlei, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Messages (8)
msg10499 - (view) Author: Markus Demleitner (msdemlei) Date: 2002-04-24 15:36
The urlparse module (at least in 2.2 and 2.1, Linux) doesn't handle URLs
of the form http://www.maerkischeallgemeine.de?loc_id=49 correctly --
everything up to the 9 ends up in the host.  I didn't check the RFC, but
in the real world URLs like this do show up.  urlparse works fine when
there's a trailing slash on the host name:
http://www.maerkischeallgemeine.de/?loc_id=49

Example:
<pre>
>>> import urlparse
>>> urlparse.urlparse("http://www.maerkischeallgemeine.de/?loc_id=49")
('http', 'www.maerkischeallgemeine.de', '/', '', 'loc_id=49', '')
>>> urlparse.urlparse("http://www.maerkischeallgemeine.de?loc_id=49")
('http', 'www.maerkischeallgemeine.de?loc_id=49', '', '', '', '')
</pre>

This has serious implications for urllib, since
urllib.urlopen will fail for URLs like the second one,
and with a pretty mysterious exception ("host not
found") at that.
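
Since the report notes that parsing works once a slash follows the host name, one possible stopgap on an affected version is to insert that slash before parsing. The sketch below is only an illustration: `add_missing_slash` is a hypothetical helper, not part of urlparse or urllib, and it ignores fragments and URLs without a "://" separator.

```python
def add_missing_slash(url):
    """Turn scheme://host?query into scheme://host/?query (hypothetical helper)."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        # No "://" at all: leave the string untouched.
        return url
    query_pos = rest.find("?")
    slash_pos = rest.find("/")
    # Only act when a "?" appears before any "/" in the remainder.
    if query_pos != -1 and (slash_pos == -1 or query_pos < slash_pos):
        return scheme + sep + rest[:query_pos] + "/" + rest[query_pos:]
    return url


print(add_missing_slash("http://www.maerkischeallgemeine.de?loc_id=49"))
```

The rewritten string matches the form that the report shows being parsed correctly.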
msg10500 - (view) Author: Jeff Epler (jepler) Date: 2002-11-17 16:56
This actually appears to be permitted by RFC2396
[http://www.ietf.org/rfc/rfc2396.txt].  See section 3.2:

3.2. Authority Component

   Many URI schemes include a top hierarchical element for a naming
   authority, such that the namespace defined by the remainder of the
   URI is governed by that authority.  This authority component is
   typically defined by an Internet-based server or a scheme-specific
   registry of naming authorities.

      authority     = server | reg_name

   The authority component is preceded by a double slash "//" and is
   terminated by the next slash "/", question-mark "?", or by the end
   of the URI.  Within the authority component, the characters ";",
   ":", "@", "?", and "/" are reserved.
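
The termination rule quoted above can be sketched in a few lines. This is an illustrative reading of that one sentence of section 3.2, not the urlparse implementation; `authority_of` is a hypothetical name, and fragment handling is deliberately left out.

```python
def authority_of(uri):
    """Return the authority of a URI per the quoted rule, or None if absent."""
    scheme, colon, rest = uri.partition(":")
    # The authority must be introduced by "//" right after "scheme:".
    if not colon or not rest.startswith("//"):
        return None
    rest = rest[2:]
    end = len(rest)  # default: authority runs to the end of the URI
    for delimiter in "/?":
        pos = rest.find(delimiter)
        if pos != -1:
            end = min(end, pos)  # stop at the first "/" or "?"
    return rest[:end]


print(authority_of("http://www.maerkischeallgemeine.de?loc_id=49"))
# prints: www.maerkischeallgemeine.de
```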
msg10501 - (view) Author: Steven Taschuk (staschuk) Date: 2003-03-30 20:19
For comparison, RFC 1738 section 3.3:
   An HTTP URL takes the form:
      http://<host>:<port>/<path>?<searchpart>
   [...] If neither <path> nor <searchpart> is present,
   the "/" may also be omitted.
... which does not outright say the '/' may *not* be omitted if 
<path> is absent but <searchpart> is present (though imho 
that's implied).

But even if the / may not be omitted in this case, ? is not 
allowed in the authority component under either RFC 2396 or 
RFC 1738, so urlparse should either treat it as a delimiter or 
reject the URL as malformed.  The principle of being lenient in 
what you accept favours the former.

I've just submitted a patch (712317) for this.
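
To make the two options concrete, here is a rough sketch of both policies applied to a network location that still contains a '?'. It is not the code from patch 712317; `split_bad_netloc` and its `lenient` flag are hypothetical names used only for illustration.

```python
def split_bad_netloc(netloc, lenient=True):
    """Return (host, query) for a netloc that may contain a stray '?'."""
    if "?" not in netloc:
        return netloc, ""
    if not lenient:
        # Strict policy: '?' is not allowed inside the authority component.
        raise ValueError("'?' not allowed in network location: %r" % netloc)
    # Lenient policy: treat the first '?' as the authority/query delimiter.
    host, _, query = netloc.partition("?")
    return host, query


print(split_bad_netloc("www.maerkischeallgemeine.de?loc_id=49"))
# prints: ('www.maerkischeallgemeine.de', 'loc_id=49')
```

Calling it with `lenient=False` raises ValueError instead, which corresponds to rejecting the URL as malformed.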
msg10502 - (view) Author: Mike Rovner (mrovner) Date: 2004-01-27 01:13
According to RFC2396 (ftp://ftp.isi.edu/in-notes/rfc2396.txt) 
absoluteURI (part 3 URI Syntactic Components) can be:
"""
<scheme>://<authority><path>?<query>
each of which, except <scheme>, may be absent from a 
particular URI.
"""
Later on (3.2):
"""
The authority component is preceded by a double slash "//" 
and is terminated by the next slash "/", question-mark "?", 
or by the end of the URI.
"""
So URL "http://server?query" is perfectly legal and shall be 
allowed and patch 712317 rejected.
msg10503 - (view) Author: Johannes Gijsbers (jlgijsbers) * (Python triager) Date: 2004-10-23 07:03
Somehow I think I'm missing something. Please check my line
of reasoning:

1. http://foo?bar=baz is a legal URL.
2. urlparse's 'Network location' should be the same as
<authority> from rfc2396.
3. Inside <authority> an unescaped '?' is not allowed.
Rather: <authority> is terminated by the '?'.
4. Currently the 'network location' for http://foo?bar=baz
would be 'foo?bar=baz'.
5. If 'network location' should be the same as <authority>,
it should also be terminated by the '?'. 

So shouldn't urlparse.urlparse('http://foo?bar=baz') return
('http', 'foo', '', '', 'bar=baz', ''), as patch 712317
implements?
msg10504 - (view) Author: Mike Rovner (mrovner) Date: 2004-10-23 07:44
I'm sorry, I misunderstood the patch. If it accepts such a URL
and splits it at '?', that's perfectly fine.
It should not reject such a URL as malformed.
msg10505 - (view) Author: Paul Moore (paul.moore) * (Python committer) Date: 2004-11-08 20:48
This issue still exists in Python 2.3.4 and Python 2.4b2.
msg10506 - (view) Author: Johannes Gijsbers (jlgijsbers) * (Python triager) Date: 2005-01-09 15:33
Fixed by applying patch #712317 on maint24 and HEAD.
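
For readers on a current interpreter, the fixed behaviour can be checked with Python 3's urllib.parse, the successor of the urlparse module discussed in this thread. This is a quick check rather than a test of the original 2.4 patch; the URLs are the ones from the report and the discussion above.

```python
from urllib.parse import urlparse, urlsplit

with_slash = urlparse("http://www.maerkischeallgemeine.de/?loc_id=49")
without_slash = urlparse("http://www.maerkischeallgemeine.de?loc_id=49")

print(tuple(with_slash))
# ('http', 'www.maerkischeallgemeine.de', '/', '', 'loc_id=49', '')
print(tuple(without_slash))
# ('http', 'www.maerkischeallgemeine.de', '', '', 'loc_id=49', '')

# urlsplit returns a 5-tuple (no separate params field).
print(tuple(urlsplit("http://foo?bar=baz")))
# ('http', 'foo', '', 'bar=baz', '')
```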
History
Date                 User      Action  Args
2022-04-10 16:05:15  admin     set     github: 36493
2002-04-24 15:36:23  msdemlei  create