classification
Title: Unicode problem in os.path.getsize ?
Type: Stage:
Components: Library (Lib) Versions: Python 2.3
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: loewis, ronrivest, terry.reedy
Priority: normal Keywords:

Created on 2004-02-13 02:49 by ronrivest, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (6)
msg19977 - Author: Ronald L. Rivest (ronrivest) Date: 2004-02-13 02:49
I am running on Windows XP 5.1 using python version 2.3.
The following simple code fails on my system.

import os

for dirpath, dirnames, filenames in os.walk("C:/"):
    for name in filenames:
        pathname = os.path.join(dirpath, name)
        size = os.path.getsize(pathname)
        print size, pathname

I get an error from getsize that the file given by 
pathname does not exist.  When it breaks, the
variable "name" contains two question marks, which
makes me think that this is a Unicode problem.

In any case, shouldn't names returned by walk be
acceptable in all cases to getsize???

msg19978 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2004-02-14 00:47

Though it might be, I suspect that this is not a Python bug.  
Whether it is a Windows design or coding bug is another 
matter.

>variable "name" contains two question marks, which
>makes me think that this is a Unicode problem.

Since '?' is not legal in filenames, as you seem to know, I am 
more inclined to believe this is what the Windows function 
called by os.listdir and os.walk substitutes for illegal characters 
in the filename.  So of course getsize, which wraps os.stat(), 
which calls a system function, chokes on it.

Could be disk bit glitch, or bad program writing directly to 
directory block.  Happened to me once - difficult to get rid of.

What does Windows Explorer show when you visit that 
directory?  Ditto for 'dir' in a CommandPrompt window
(Start/Accessories)? 
msg19979 - Author: Ronald L. Rivest (ronrivest) Date: 2004-02-14 01:46

TJREEDY -- Thanks for the reply...

To answer your questions:
   (1) What does Windows show when I visit the directory?
        -- I have several files in this directory that have the same
            problem.  It is a hard, reproducible problem, not a
            transient glitch.  The files are mp3 files that have
            the name "prelude.mp3", except that the first "e" is
            replaced by two question marks (for Python) or by
            two "boxes" in Windows Explorer.  I would guess that
            this is some funky representation of the French "e"
            with an "accent aigu".
   (2) What does "dir" do in a Command Prompt?
        -- From a command prompt, I see two question marks
            at the problematic position.

Does Windows allow one to create filenames with characters
in the filename that are illegal for Windows?  

As I said in the original post, I find it very disturbing that
os.walk should return a filename that os.path.exists says
doesn't exist!  If you can walk the directory and find the
file, then os.path.exists (or, equivalently, os.path.getsize),
should find it!  This looks like a Python bug to me... no?

    Cheers,
    Ron Rivest

msg19980 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2004-02-16 14:55

This behaviour is standard behaviour of Win32, and,
disturbing as it may sound, is somewhat outside Python's
control.

When a file is found whose name cannot be represented in the
system code page (CP_ACP, the "ANSI" code page), then
non-representable characters are converted to question
marks. What's worse: "roughly-representable" characters are
sometimes converted to look-alike characters.

When such a file name is passed back to Win32, it will not
find the file, because the name now contains question marks
in place of the original characters.
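
The lossy conversion described above can be imitated in a few lines. This is only a sketch, written for Python 3: the "replace" error handler of Python's codecs stands in for the Win32 conversion (it is not the code Windows runs), cp1252 is assumed as the "ANSI" code page, and the file name is a hypothetical one chosen so that two of its characters fall outside that code page.

```python
# Illustration only: Python's "replace" error handler imitates the lossy
# wide-to-"ANSI" conversion; it is not the Win32 implementation itself.

def to_ansi(name, codepage="cp1252"):
    """Round-trip a Unicode name through a code page, turning every
    character the code page cannot represent into '?'."""
    return name.encode(codepage, errors="replace").decode(codepage)


# Hypothetical file name: Cyrillic 'е' (U+0435) followed by a combining
# acute accent (U+0301). Neither character exists in cp1252.
original = "pr\u0435\u0301lude.mp3"
narrow = to_ansi(original)

print(narrow)              # the name as a narrow-API caller would see it
print(narrow == original)  # False: the narrow name no longer identifies the file
```

Because the original characters are gone after the conversion, there is no way to recover the real name from the narrow one, which is why handing it back to the file system fails.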

With the "ANSI" API, there is really no solution. Instead,
you should use Unicode file names, i.e. write

for dirpath,dirnames,filenames in os.walk(u"C:/"):

Closing as "won't fix".
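
For readers on Python 3, where str is already Unicode and no u prefix is needed, the same loop might look like the sketch below. The helper name walk_sizes is made up for this illustration, and skipping files that raise OSError is an assumption about what a caller would want, not something taken from this report.

```python
import os


def walk_sizes(root):
    """Yield (size, pathname) for every file found under root."""
    # Passing a str root makes os.walk return str (Unicode) names.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            pathname = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(pathname)
            except OSError:
                # The file vanished or cannot be accessed; this sketch
                # simply skips it.
                continue
            yield size, pathname
```

A call such as `for size, pathname in walk_sizes("C:/"): print(size, pathname)` then replaces the original loop. On Python 2, the equivalent is to pass u"C:/" as the root, as suggested above.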
msg19981 - Author: Terry J. Reedy (terry.reedy) (Python committer) Date: 2004-02-16 16:42

Final comment:

dir and explorer can display stats of files with bad names 
because they get both simultaneously without trying to use 
the bad names.  The Command Prompt equivalent of listdir (or 
walk) followed by getsize (or stat) is 'dir /w' followed by 'dir 
badname', which should also give a "File not found" error 
message.

I believe this 'disturbing' behavior results from having filename 
rules that are not enforced by restricting directory disk block 
writes to os functions that respect the rules.

A roundabout fix: replace 'size = ...' with something like
try:
    size = ...
except WhateverErrorYouGot:
    file = os.popenx('dir %s' % dirpath).read()
    # x = whichever of 1,2,3,4 works
    <find line with badname>
    <parse out file size>

But prefixing 'u' to the root dir looks a lot easier if it gets you 
what you need.
msg19982 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2004-02-16 17:57

This is not true: dir and explorer both use the Unicode
("wide") API (FindFirstFileW). Explorer then tries to render
the file name correctly even if it is outside the code page.
If there is no glyph in the font, a square box is displayed.
dir.exe tries to convert the file name into the encoding of
the terminal (typically CP_OEMCP), and replaces characters
that cannot be converted with question marks on display.

Also, this behaviour is not caused by applications
performing direct IO to the directory disk block. First, XP
does not allow such IO, and second, very few applications
would know to write NTFS correctly. Instead, the problem is
caused by applications which use the "wide" API for file
names to create files, which is a problem for applications
that use the "narrow" API.

If Ron sees two square boxes where a single accented e should
be, the application creating the file most likely has messed
up the file name: Windows should be capable of representing
this letter with a single character, and explorer should be
capable of displaying it properly.
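
The point about a single character can be checked with the standard unicodedata module. The sketch below is for Python 3 and rests on assumptions: it takes cp1252 as the "ANSI" code page, and it uses a decomposed name (a plain e followed by a combining accent) merely as one example of how an accented letter can end up stored as more than one character. It does not claim to reproduce the name of Ron's file.

```python
import unicodedata


def fits_code_page(name, codepage="cp1252"):
    """Return True if every character of name exists in the code page."""
    try:
        name.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


# Hypothetical names: the same visible text stored in two different ways.
decomposed = "pre\u0301lude.mp3"                     # 'e' + U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single U+00E9

print(len(decomposed), len(composed))
print(fits_code_page(decomposed), fits_code_page(composed))
```

The composed form is one character shorter and fits the code page, while the decomposed form does not, because the combining accent has no cp1252 equivalent.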
History
Date                 User       Action  Args
2022-04-11 14:56:02  admin      set     github: 39935
2004-02-13 02:49:48  ronrivest  create