This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: urljoin fails RFC tests
Type: Stage:
Components: Library (Lib) Versions: Python 2.4
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: brett.cannon Nosy List: aaronsw, brett.cannon, fdrake, jribbens, mbrierst, skip.montanaro
Priority: normal Keywords:

Created on 2001-08-12 05:10 by aaronsw, last changed 2022-04-10 16:04 by admin. This issue is now closed.

Files
File name Uploaded Description
uritests.py aaronsw, 2001-11-05 18:34 URI Test Suite
Messages (10)
msg5898 - (view) Author: Aaron Swartz (aaronsw) Date: 2001-08-12 05:10
I've put together a test suite for Python's urlparse 
module, based on the tests in Appendix C of 
RFC 2396 (the URI RFC). They're available at:

http://lists.w3.org/Archives/Public/uri/2001Aug/0013.html

The major problem seems to be that it treats 
queries and parameters as special components 
(not just normal parts of the path), making this 
related to:

http://sourceforge.net/tracker/?group_id=5470&atid=105470&func=detail&aid=210834
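
To illustrate the point about queries and parameters being treated as separate components, here is a minimal sketch. It assumes a modern Python 3 interpreter, where the old urlparse module lives at urllib.parse; the variable names are only illustrative.

```python
# Assumption: Python 3, where the old `urlparse` module is `urllib.parse`.
from urllib.parse import urlparse

base = 'http://a/b/c/d;p?q'
parts = urlparse(base)

# ';p' and '?q' are reported as their own components rather than
# as part of the path, which is what the tests in this report probe.
print(parts.path)    # /b/c/d
print(parts.params)  # p
print(parts.query)   # q
```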
msg5899 - (view) Author: Fred Drake (fdrake) (Python committer) Date: 2001-11-05 18:05
This looks like it's probably related to #478038; I'll try to
tackle them together.  Can you attach your tests to the bug
report on SF?  Thanks!
msg5900 - (view) Author: Aaron Swartz (aaronsw) Date: 2001-11-05 18:30
Sure, here they are:



import urlparse

base = 'http://a/b/c/d;p?q'

assert urlparse.urljoin(base, 'g:h') == 'g:h'
assert urlparse.urljoin(base, 'g') ==   'http://a/b/c/g'
assert urlparse.urljoin(base, './g') == 'http://a/b/c/g'
assert urlparse.urljoin(base, 'g/') ==  'http://a/b/c/g/'
assert urlparse.urljoin(base, '/g') ==  'http://a/g'
assert urlparse.urljoin(base, '//g') == 'http://g'
assert urlparse.urljoin(base, '?y') ==  'http://a/b/c/?y'
assert urlparse.urljoin(base, 'g?y') == 'http://a/b/c/g?y'
assert urlparse.urljoin(base, '#s') ==  'http://a/b/c/d;p?q#s'
assert urlparse.urljoin(base, 'g#s') == 'http://a/b/c/g#s'
assert urlparse.urljoin(base, 'g?y#s') == 'http://a/b/c/g?y#s'
assert urlparse.urljoin(base, ';x') == 'http://a/b/c/;x'
assert urlparse.urljoin(base, 'g;x') ==  'http://a/b/c/g;x'
assert urlparse.urljoin(base, 'g;x?y#s') == 'http://a/b/c/g;x?y#s'
assert urlparse.urljoin(base, '.') ==  'http://a/b/c/'
assert urlparse.urljoin(base, './') ==  'http://a/b/c/'
assert urlparse.urljoin(base, '..') ==  'http://a/b/'
assert urlparse.urljoin(base, '../') ==  'http://a/b/'
assert urlparse.urljoin(base, '../g') ==  'http://a/b/g'
assert urlparse.urljoin(base, '../..') ==  'http://a/'
assert urlparse.urljoin(base, '../../') ==  'http://a/'
assert urlparse.urljoin(base, '../../g') ==  'http://a/g'

assert urlparse.urljoin(base, '') == base

assert urlparse.urljoin(base, '../../../g')    ==  'http://a/../g'
assert urlparse.urljoin(base, '../../../../g') ==  'http://a/../../g'

assert urlparse.urljoin(base, '/./g') ==  'http://a/./g'
assert urlparse.urljoin(base, '/../g')         ==  'http://a/../g'
assert urlparse.urljoin(base, 'g.')            ==  'http://a/b/c/g.'
assert urlparse.urljoin(base, '.g')            ==  'http://a/b/c/.g'
assert urlparse.urljoin(base, 'g..')           ==  'http://a/b/c/g..'
assert urlparse.urljoin(base, '..g')           ==  'http://a/b/c/..g'

assert urlparse.urljoin(base, './../g')        ==  'http://a/b/g'
assert urlparse.urljoin(base, './g/.')         ==  'http://a/b/c/g/'
assert urlparse.urljoin(base, 'g/./h')         ==  'http://a/b/c/g/h'
assert urlparse.urljoin(base, 'g/../h')        ==  'http://a/b/c/h'
assert urlparse.urljoin(base, 'g;x=1/./y')     ==  'http://a/b/c/g;x=1/y'
assert urlparse.urljoin(base, 'g;x=1/../y')    ==  'http://a/b/c/y'

assert urlparse.urljoin(base, 'g?y/./x')       ==  'http://a/b/c/g?y/./x'
assert urlparse.urljoin(base, 'g?y/../x')      ==  'http://a/b/c/g?y/../x'
assert urlparse.urljoin(base, 'g#s/./x')       ==  'http://a/b/c/g#s/./x'
assert urlparse.urljoin(base, 'g#s/../x')      ==  'http://a/b/c/g#s/../x'

msg5901 - (view) Author: Aaron Swartz (aaronsw) Date: 2001-11-05 18:34
Oops, meant to attach it...
msg5902 - (view) Author: Jon Ribbens (jribbens) * Date: 2002-03-18 14:22
I think it would be better btw if '..' components taking 
you 'off the top' were stripped. RFC 2396 says this is 
valid behaviour, and it's what 'real' browsers do.

i.e.
  http://a/b/ + ../../../d == http://a/d
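
A minimal sketch of the stripping behaviour suggested here, working on path strings only. The helper name join_paths_clamped is hypothetical and is not part of urlparse; this is an illustration of the idea, not the module's implementation.

```python
def join_paths_clamped(base_path, rel_path):
    """Merge rel_path onto base_path, discarding any '..' that would
    climb above the root (hypothetical helper, paths only)."""
    # Keep the base "directory": drop everything after the last '/'.
    segments = base_path.split('/')[:-1] + rel_path.split('/')
    resolved = []
    for seg in segments:
        if seg == '..':
            # The leading '' stands for the root; never pop it.
            if len(resolved) > 1:
                resolved.pop()
        elif seg != '.':
            resolved.append(seg)
    # A trailing '.' or '..' refers to a directory, so end with '/'.
    if segments[-1] in ('.', '..'):
        resolved.append('')
    return '/'.join(resolved)


print(join_paths_clamped('/b/', '../../../d'))        # /d
print(join_paths_clamped('/b/c/d;p', '../../../g'))   # /g
print(join_paths_clamped('/b/c/d;p', '..'))           # /b/
```

Clamping at the root means the extra '..' segments are silently dropped rather than kept in the result, which is the difference from the 'http://a/../g' expectation in the attached tests.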
msg5903 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2002-03-23 05:34
Added Aaron's RFC 2396 tests to test_urlparse.py,
version 1.4; the two failing tests are commented out.

msg5904 - (view) Author: Michael Stone (mbrierst) Date: 2003-02-03 21:02
The two failing tests could not pass because RFC 1808 and RFC 2396 conflict when the relative URI is just ;x or just ?y.

RFC 2396 claims to update RFC 1808, so presumably it describes the correct behavior.  The patch in this message (I can't upload it on SourceForge here for some reason) brings urljoin's behavior in line with RFC 2396 and changes the affected test cases accordingly.  I think this bug can be closed once the patch is applied.  Let me know what you think.


Index: python/dist/src/Lib/urlparse.py
===================================================================
RCS file: /cvsroot/python/python/dist/src/Lib/urlparse.py,v
retrieving revision 1.39
diff -c -r1.39 urlparse.py
*** python/dist/src/Lib/urlparse.py	7 Jan 2003 02:09:16 -0000	1.39
--- python/dist/src/Lib/urlparse.py	3 Feb 2003 20:51:08 -0000
***************
*** 157,169 ****
      if path[:1] == '/':
          return urlunparse((scheme, netloc, path,
                             params, query, fragment))
!     if not path:
!         if not params:
!             params = bparams
!             if not query:
!                 query = bquery
          return urlunparse((scheme, netloc, bpath,
!                            params, query, fragment))
      segments = bpath.split('/')[:-1] + path.split('/')
      # XXX The stuff below is bogus in various ways...
      if segments[-1] == '.':
--- 157,165 ----
      if path[:1] == '/':
          return urlunparse((scheme, netloc, path,
                             params, query, fragment))
!     if not (path or params or query):
          return urlunparse((scheme, netloc, bpath,
!                            bparams, bquery, fragment))
      segments = bpath.split('/')[:-1] + path.split('/')
      # XXX The stuff below is bogus in various ways...
      if segments[-1] == '.':
Index: python/dist/src/Lib/test/test_urlparse.py
===================================================================
RCS file: /cvsroot/python/python/dist/src/Lib/test/test_urlparse.py,v
retrieving revision 1.11
diff -c -r1.11 test_urlparse.py
*** python/dist/src/Lib/test/test_urlparse.py	6 Jan 2003 20:27:03 -0000	1.11
--- python/dist/src/Lib/test/test_urlparse.py	3 Feb 2003 20:51:12 -0000
***************
*** 54,59 ****
--- 54,63 ----
              self.assertEqual(urlparse.urlunparse(urlparse.urlparse(u)), u)
  
      def test_RFC1808(self):
+         # updated by RFC 2396
+ #        self.checkJoin(RFC1808_BASE, '?y', 'http://a/b/c/d;p?y')
+ #        self.checkJoin(RFC1808_BASE, ';x', 'http://a/b/c/d;x')
+ 
          # "normal" cases from RFC 1808:
          self.checkJoin(RFC1808_BASE, 'g:h', 'g:h')
          self.checkJoin(RFC1808_BASE, 'g', 'http://a/b/c/g')
***************
*** 61,74 ****
          self.checkJoin(RFC1808_BASE, 'g/', 'http://a/b/c/g/')
          self.checkJoin(RFC1808_BASE, '/g', 'http://a/g')
          self.checkJoin(RFC1808_BASE, '//g', 'http://g')
-         self.checkJoin(RFC1808_BASE, '?y', 'http://a/b/c/d;p?y')
          self.checkJoin(RFC1808_BASE, 'g?y', 'http://a/b/c/g?y')
          self.checkJoin(RFC1808_BASE, 'g?y/./x', 'http://a/b/c/g?y/./x')
          self.checkJoin(RFC1808_BASE, '#s', 'http://a/b/c/d;p?q#s')
          self.checkJoin(RFC1808_BASE, 'g#s', 'http://a/b/c/g#s')
          self.checkJoin(RFC1808_BASE, 'g#s/./x', 'http://a/b/c/g#s/./x')
          self.checkJoin(RFC1808_BASE, 'g?y#s', 'http://a/b/c/g?y#s')
-         self.checkJoin(RFC1808_BASE, ';x', 'http://a/b/c/d;x')
          self.checkJoin(RFC1808_BASE, 'g;x', 'http://a/b/c/g;x')
          self.checkJoin(RFC1808_BASE, 'g;x?y#s', 'http://a/b/c/g;x?y#s')
          self.checkJoin(RFC1808_BASE, '.', 'http://a/b/c/')
--- 65,76 ----
***************
*** 103,111 ****
      def test_RFC2396(self):
          # cases from RFC 2396
  
!         ### urlparse.py as of v 1.32 fails on these two
!         #self.checkJoin(RFC2396_BASE, '?y', 'http://a/b/c/?y')
!         #self.checkJoin(RFC2396_BASE, ';x', 'http://a/b/c/;x')
  
          self.checkJoin(RFC2396_BASE, 'g:h', 'g:h')
          self.checkJoin(RFC2396_BASE, 'g', 'http://a/b/c/g')
--- 105,113 ----
      def test_RFC2396(self):
          # cases from RFC 2396
  
!         # conflict with RFC 1808, tests commented out there
!         self.checkJoin(RFC2396_BASE, '?y', 'http://a/b/c/?y')
!         self.checkJoin(RFC2396_BASE, ';x', 'http://a/b/c/;x')
  
          self.checkJoin(RFC2396_BASE, 'g:h', 'g:h')
          self.checkJoin(RFC2396_BASE, 'g', 'http://a/b/c/g')
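
The behavioural change in the urlparse.py hunk above can be sketched in isolation. This is a simplified illustration, not the patched function itself: resolve_empty_path is a hypothetical helper that works on (path, params, query) tuples, assumes the relative path is empty, and reduces the segment-merging step to trimming the last segment of the base path.

```python
def resolve_empty_path(base, rel, rfc2396=True):
    """Resolve a relative reference whose path is empty.

    base and rel are (path, params, query) tuples. Hypothetical helper
    that mirrors only the branch touched by the patch.
    """
    bpath, bparams, bquery = base
    path, params, query = rel
    if rfc2396:
        # Patched behaviour: reuse the base only if the reference
        # has no path, params or query at all.
        if not (path or params or query):
            return bpath, bparams, bquery
        # Otherwise merge with the base directory (simplified).
        return bpath[:bpath.rfind('/') + 1], params, query
    # Pre-patch behaviour (RFC 1808): keep the whole base path and
    # inherit params, then query, when the reference lacks them.
    if not params:
        params = bparams
        if not query:
            query = bquery
    return bpath, params, query


base = ('/b/c/d', 'p', 'q')
print(resolve_empty_path(base, ('', '', 'y')))                 # ('/b/c/', '', 'y')
print(resolve_empty_path(base, ('', '', 'y'), rfc2396=False))  # ('/b/c/d', 'p', 'y')
print(resolve_empty_path(base, ('', 'x', '')))                 # ('/b/c/', 'x', '')
print(resolve_empty_path(base, ('', 'x', ''), rfc2396=False))  # ('/b/c/d', 'x', '')
```

The four results correspond to http://a/b/c/?y, http://a/b/c/d;p?y, http://a/b/c/;x and http://a/b/c/d;x, the expectations that move between test_RFC1808 and test_RFC2396 in the diff.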
msg5905 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2003-05-12 00:35
mbrierst is right.  From C.1 of RFC 2396 (with http://a/b/c/d;p?q as the 
base):

    ?y            =  http://a/b/c/?y
    ;x            =  http://a/b/c/;x

And notice how this contradicts RFC 1808 (with <URL:http://a/b/c/d;p?q#f> 
as the base):

    ?y         = <URL:http://a/b/c/d;p?y>
    ;x         = <URL:http://a/b/c/d;x>

So there is clearly a conflict here.  RFC 2396 says "it revises and 
replaces the generic definitions in RFC 1738 and RFC 1808" (where 
"generic" just means the actual syntax), so RFC 2396's resolution 
should take precedence.

Now the question is whether applying the patch is the right thing to do (I am 
setting aside whether the patch itself is correct; I have not tested it yet).  This 
shouldn't break anything, since the whole point of urlparse.urljoin is to provide 
an abstracted way to create URIs without the user having to worry about all of 
these rules.  So I say that it should be changed.

Fred, do you mind if I reassign this patch to myself and deal with it?
msg5906 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2003-06-12 07:24
Since there is a chance that this might break code that depends 
on urljoin acting like RFC 1808 instead of RFC 2396, and 2.3 has 
hit beta, I am going to wait for 2.4 before I deal with this.
msg5907 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2003-10-12 04:42
rev. 1.42 of Lib/urlparse.py and rev. 1.13 of 
Lib/test/test_urlparse.py have mbrierst's fixes (thanks, Michael); 
I had to do a second commit to get the comment correct.
History
Date User Action Args
2022-04-10 16:04:19 admin set github: 34947
2001-08-12 05:10:12 aaronsw create