
classification
Title: urlparse.urljoin odd behaviour
Type: Stage:
Components: Library (Lib) Versions: Python 2.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: andresriancho, georg.brandl, the_j10
Priority: normal Keywords:

Created on 2006-08-25 13:04 by andresriancho, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (3)
msg29685 - Author: Andres Riancho (andresriancho) Date: 2006-08-25 13:04
Hi!

I think I have found a bug in the urljoin function of the urlparse module. I'm using Python 2.4.3 (#2, Apr 27 2006, 14:43:58), [GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)] on linux2. Here is a demo of the bug:

>>> import urlparse
>>> urlparse.urljoin('http://www.f00.com/', '//a')
'http://a'
>>> urlparse.urljoin('http://www.f00.com/', 'https://0000/somethingIsWrong')
'https://0000/somethingIsWrong'
>>> urlparse.urljoin('http://www.f00.com/', 'https://0000/somethingIsWrong')
'https://0000/somethingIsWrong'
>>> urlparse.urljoin('http://www.f00.com/', 'file:///etc/passwd')
'file:///etc/passwd'


The result of the first call to urljoin should be either 'http://www.f00.com/a' or 'http://www.f00.com//a'. The result of the second and third calls to urljoin should be 'http://www.f00.com/', or maybe an exception?

Please correct me if I'm wrong and this is some kind of feature, or if the bug was already reported. This bug can result in a security vulnerability; take this code as an example:

// viewImage.py //
import htmlTools    # Some fake module, just for the example
import urlparse     # module with bug.

htmlTools.startHtml()             # print <html>
params = htmlTools.getParams()    # get the query string parameters
htmlTools.printToHtml('<img src=' + urlparse.urljoin('http://myWebsite/', params['image']) + '>')
htmlTools.endHtml()               # print </html>
// viewImage.py //

The code should generate HTML that shows an image from the site http://myWebsite/, but with the urljoin bug, the image source can be manipulated to produce completely different HTML.
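
One way to guard against the scenario described above is to join first and then check that the result still points at the base site. The following is only a sketch: join_same_site is a hypothetical helper, not part of urlparse, and it assumes a Python version where urlsplit results expose the scheme and netloc attributes (2.5 or later). It also does nothing about HTML escaping, which the example page would need separately.

```python
try:
    from urllib.parse import urljoin, urlsplit  # Python 3
except ImportError:
    from urlparse import urljoin, urlsplit      # Python 2, as in this report


def join_same_site(base, ref):
    """Hypothetical helper: join ref to base, but refuse any result
    whose scheme or host differs from the base URL's."""
    joined = urljoin(base, ref)
    b, j = urlsplit(base), urlsplit(joined)
    if (j.scheme, j.netloc) != (b.scheme, b.netloc):
        raise ValueError('reference escapes base site: %r' % (ref,))
    return joined


print(join_same_site('http://myWebsite/', 'logo.png'))
```

With this check, references such as '//a' or 'file:///etc/passwd' raise ValueError instead of silently replacing the host or scheme.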

Cheers,

Andres Riancho
msg29686 - Author: Andrew Jones (the_j10) Date: 2006-08-29 11:29
The second argument to urljoin can be either an absolute URL or a relative URL, as specified by RFC 1808. So your first example, '//a', gives a position relative to the base: a leading '//' introduces a network location, so only the scheme is inherited from the base, resulting in 'http://a'. This is similar to how `cd /boot` takes you to a path relative to the filesystem's root '/'.

In the rest of your examples the second argument carries its own scheme name ('https' or 'file'). urljoin follows RFC 1808: a second argument that has a scheme name is treated as an absolute URL and returned as is.
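
The three cases discussed so far can be compared side by side. This is a small illustration only; the import fallback is there so it runs on both Python 2 (urlparse) and Python 3 (urllib.parse), and the variable names are made up for the example.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, as in this report

base = 'http://www.f00.com/'

# '//a' is a network-path reference: the base scheme is kept, the host is replaced.
net_path = urljoin(base, '//a')

# A reference with its own scheme is treated as absolute and returned unchanged.
absolute = urljoin(base, 'https://0000/somethingIsWrong')

# A plain relative path is resolved against the base.
relative = urljoin(base, 'a')

print(net_path)   # http://a
print(absolute)   # https://0000/somethingIsWrong
print(relative)   # http://www.f00.com/a
```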

This behavior is not very intuitive. Perhaps the urlparse module could be extended with a urlappend function that has the behavior you expected. Hmmm...
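
A rough sketch of what such a function might look like follows. urlappend does not exist in urlparse; the name and behavior here are assumptions made purely for illustration. It treats the second argument as a path segment no matter what it contains, assumes the base has no query string or fragment, and does not normalize '..' segments.

```python
def urlappend(base, ref):
    """Hypothetical urlappend: always treat ref as a path under base,
    collapsing the slashes at the seam into a single '/'."""
    return base.rstrip('/') + '/' + ref.lstrip('/')


print(urlappend('http://www.f00.com/', '//a'))   # http://www.f00.com/a
```

Under this interpretation a reference that carries a scheme is simply appended as text, so it never changes the host. Raising an exception for such input, as the report suggests, would be an equally reasonable design.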

-- Andrew
msg29687 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2006-10-12 11:15
The behavior is okay, but the docs didn't say that. I added
a note in rev. 52303, 52304 (2.5).
History
Date                 User           Action  Args
2022-04-11 14:56:19  admin          set     github: 43899
2006-08-25 13:04:08  andresriancho  create