This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: urllib has trouble with Windows filenames
Type:
Stage:
Components: Library (Lib)
Versions: Python 2.4

process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: bobince, dpeastman, georg.brandl, shadowmorpher, zseil
Priority: normal
Keywords:

Created on 2006-02-22 06:03 by dpeastman, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (7)
msg27585 - Author: Donovan Eastman (dpeastman) Date: 2006-02-22 06:03
When you pass urllib the name of a local file including
a Windows drive letter (e.g. 'C:\dir\My File.txt')
URLopener.open() incorrectly interprets the drive
letter as the scheme of a URL.  Of course, given that
there is no scheme 'C', this fails.

I have solved this in my own code by putting the
following test before calling urllib.urlopen():

if url[1] == ':' and url[0].isalpha():
    url = 'file:' + url

Although this works fine in my particular case, it
seems like urllib should just simply "do the right
thing" without having to worry about it.  Therefore I
propose that urllib should automatically assume that
any URL that begins with a single alpha followed by a
colon is a local file.
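
The workaround above can be wrapped in a small helper. This is only a minimal sketch, not part of urllib: the name `drive_path_to_file_url` is hypothetical, and the one change from the two-line test is the use of slicing so that strings shorter than two characters do not raise an IndexError.

```python
def drive_path_to_file_url(url):
    """Prefix 'file:' when the string looks like a Windows drive path.

    Hypothetical helper illustrating the workaround in this report;
    it is not a urllib function.
    """
    # Slicing (instead of indexing) is safe on empty or one-character strings.
    if url[1:2] == ':' and url[:1].isalpha():
        return 'file:' + url
    return url
```

A caller would pass the result on to the opener, for example `drive_path_to_file_url(r'C:\dir\My File.txt')`, while strings that already carry a longer scheme such as `http:` come back unchanged.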

The only potential downside would be that it would
preclude the use of single letter scheme names.  I did
a little research on this.  RFC 3986 suggests, but does
not explicitly state, that scheme names must be more
than one character
(http://www.gbiv.com/protocols/uri/rfc/rfc3986.html#scheme).
That said, there are no currently recognized single
letter scheme names
(http://www.iana.org/assignments/uri-schemes.html) and
it seems very unlikely that there ever would be.

I would gladly write the code for this myself -- but I
suspect that it would take someone longer to review and
integrate my changes than it would to just write the code.

Thanks,
Donovan
msg27586 - Author: Koen van de Sande (shadowmorpher) Date: 2006-03-13 19:19
Why should the URL lib module support opening of local 
files? It already does so through the file: protocol prefix, 
and I do not see why it should support automatic detection of 
Windows filenames. AFAIK it does not do automatic detection 
of Unix filenames (one could recognize it from /home/
something), so why would Windows work differently?

I'm not an expert or anything, so I might be wrong.
msg27587 - Author: Donovan Eastman (dpeastman) Date: 2006-03-14 01:56
Reasons why urllib should open local files:
1) This allows you to write code that handles local files
and Internet files equally well -- without having to do any
special magic of your own.
2) The docs all say that it should.

I believe this would work just fine under Unix. In
URLopener.open() it looks for the protocol prefix and if it
can't find one, it assumes that it is a local file.

The problem on Windows is that you have these pesky drive
letters.  The form 'C:\location' ends up looking a lot like
the form 'http://location'.  Therefore it looks for a
protocol called 'c' -- which obviously isn't going to work.
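
The resemblance between the two forms can be illustrated with a generic URL parser. The sketch below uses Python 3's `urllib.parse.urlsplit` as a stand-in; that is an assumption made for illustration, since the report concerns `URLopener.open()` in Python 2.4, whose parsing code is different.

```python
from urllib.parse import urlsplit

# A Windows path: the drive letter is taken as a one-letter scheme.
windows_path = urlsplit(r'C:\location')

# An ordinary URL, for comparison.
http_url = urlsplit('http://location')

print(windows_path.scheme)
print(http_url.scheme)
```

Both strings yield a scheme, which is why a path with a drive letter cannot be told apart from a URL by looking for the first colon alone.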
msg27588 - Author: Donovan Eastman (dpeastman) Date: 2006-03-14 02:32
OK - Here's my suggested fix:
This can be fixed with a single if statement (and a comment
to explain it to confused unix programmers).

In splittype(), right after the line that reads: 
scheme = match.group(1)
add the following:
# ignore single char schemes to avoid confusion with win32 drive letters
if len(scheme) > 1:

...and indent the next line.

Alternatively, the if statement could read:
if len(scheme) > 1 or sys.platform != 'win32':
...which would allow single letter scheme names on
non-Windows systems.  I would argue that it is better to be
consistent and have it work the same way on all OS's.
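
To make the suggestion concrete, here is a self-contained sketch of a scheme splitter with the proposed check. It is not the actual urllib source: the name `split_scheme` and the regular expression are assumptions chosen for illustration, and only the `len(scheme) > 1` test comes from the proposal above.

```python
import re

# Assumed pattern: the text before the first ':' when it contains no '/'.
_SCHEME_RE = re.compile(r'^([^/:]+):')


def split_scheme(url):
    """Return (scheme, rest), or (None, url) when no scheme is recognized."""
    match = _SCHEME_RE.match(url)
    if match:
        scheme = match.group(1)
        # Ignore single-character schemes to avoid confusion with
        # win32 drive letters.
        if len(scheme) > 1:
            return scheme.lower(), url[len(scheme) + 1:]
    return None, url
```

With this check, `split_scheme(r'C:\location')` reports no scheme and returns the string untouched, while `split_scheme('http://location')` still separates `http` from the rest.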
msg27589 - Author: Andrew Clover (bobince) * Date: 2006-03-20 17:41
Filepaths aren't URIs and attempting to hide the difference
in the backend is doomed to fail (as it did for SAX).

Throw filenames with colons in, network paths, Mac paths and
RISC OS paths into the mix, and you've got a situation where
it is all but impossible to handle correctly.

In any case, the docs *don't* say you can pass in a filepath:

  If the URL does not have a scheme identifier, or if
  it has file: as its scheme identifier, this opens a
  local file

This means the string you pass in is unequivocally a URL
*not* a pathname... just that you can leave the scheme
prefix off for file: URLs. Effectively this is a relative URL.

r'C:\spam' is *not* a valid way to refer to a local file
using a relative URL. Pass it through pathname2url and
you'll get '///C|/spam', which is okay; 'C|/spam' and
'/C|/spam' will also work.

Even on Unix, a filepath won't always work when passed to
urlopen. Filenames can have percent signs in, which have to
be encoded in URLs, for example. Always use pathname2url or
you're going to trip up.
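
The percent-sign case can be shown with a short sketch. It assumes Python 3, where `pathname2url` and `url2pathname` live in `urllib.request` (in the Python 2.4 discussed here they were in `urllib`), and the file name is made up for the example. A relative name is used so that the result does not depend on platform-specific handling of drive letters or leading slashes.

```python
from urllib.request import pathname2url, url2pathname

# A made-up file name containing a space and a percent sign.
filename = 'report 100%.txt'

# pathname2url percent-encodes characters that are not safe in a URL path.
url_path = pathname2url(filename)

# url2pathname reverses the conversion.
round_trip = url2pathname(url_path)

print(url_path)
print(round_trip)
```

Passing `filename` straight to a URL opener would leave the bare `%` in place, where it would be read as the start of an escape sequence.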

(Suggest setting status INVALID, possible clarification to
docs to warn against passing a filepath to urlopen?)
msg27590 - Author: Ziga Seilnacht (zseil) * (Python committer) Date: 2006-04-13 00:12
There are already two platform specific functions
in urllib module just for this purpose: pathname2url
and url2pathname. See
http://docs.python.org/lib/module-urllib.html#l2h-3193.
I agree that this should be closed as invalid.
msg27591 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2006-05-03 05:35
I agree with zseil.
History
Date User Action Args
2022-04-11 14:56:15 admin set github: 42937
2006-02-22 06:03:16 dpeastman create