
classification
Title: OEM codepage chars in mbcs filenames can be misinterpreted
Type: Stage:
Components: Library (Lib) Versions:
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: Nosy List: loewis, mike_j_brown
Priority: normal Keywords:

Created on 2004-03-31 04:04 by mike_j_brown, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
oem_vs_mbcs_filename_demo.py mike_j_brown, 2004-03-31 10:28
Messages (3)
msg20395 - Author: Mike Brown (mike_j_brown) Date: 2004-03-31 04:04
My system: Windows XP, English (US) locale, Python 2.3.3

I believe the bug I am reporting here is this:

On Windows XP, when os.listdir() is called with a 
non-Unicode argument, characters that are not in the 
default locale's encoding but are in the default OEM 
code page get mapped to ASCII characters other than '?'. 
For example, Greek capital letter Sigma (U+03A3) is not 
in windows-1252 but is in cp437.

For example, things seem to work in a predictable way 
when I put windows-1252 characters into filenames (I 
do this in Explorer and then I see what 
os.listdir(r'C:\path\to\the\dir') returns):

— (U+2014) becomes \x97
• (U+2022) becomes \x95
é (U+00E9) becomes \xe9
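
The three mappings above match what Python's own cp1252 codec produces. The following is a minimal sketch in Python 3 (not the Python 2.3.3 used in the report), and it exercises only the codec, not os.listdir() or Windows itself:

```python
# Encode each sample character with Python's cp1252 codec (Python 3).
samples = ["\u2014", "\u2022", "\u00e9"]  # em dash, bullet, e with acute

encoded = {ch: ch.encode("cp1252") for ch in samples}

for ch, raw in encoded.items():
    print(f"U+{ord(ch):04X} -> {raw!r}")
```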

But things are much less predictable when I use 
characters from outside this range. I thought I'd try 
some Greek characters first. Some of them (the ones 
that happen to be in cp437, interestingly enough) come 
back as random ASCII letters:

Θ (U+0398) becomes "T"
Σ (U+03A3) becomes "S"
Φ (U+03A6) becomes "F"

Greek letters that are not in cp437 come back as 
question marks, as expected (I guess):
Τ (U+03A4) becomes "?"
Υ (U+03A5) becomes "?"

...as do some Hebrew letters and Japanese hiragana:
א (U+05D0) becomes "?"
ה (U+05D4) becomes "?"
ס (U+05E1) becomes "?"
た (U+305F) becomes "?"
う (U+3046) becomes "?"
あ (U+3042) becomes "?"
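
The code page membership described above can be checked with Python's bundled codecs. This is a rough sketch in Python 3; in_codepage is a hypothetical helper name, and the code only tests whether a strict encode succeeds. It does not reproduce the "T", "S" and "F" substitutions, which come from Windows rather than from Python.

```python
def in_codepage(ch, codec):
    """Return True if `ch` can be encoded exactly in `codec` (hypothetical helper)."""
    try:
        ch.encode(codec)  # strict mode raises if the character is absent
        return True
    except UnicodeEncodeError:
        return False

# Code points taken from the lists in this report.
code_points = [0x0398, 0x03A3, 0x03A6, 0x03A4, 0x03A5,
               0x05D0, 0x05D4, 0x05E1, 0x305F, 0x3046, 0x3042]

membership = {
    cp: (in_codepage(chr(cp), "cp1252"), in_codepage(chr(cp), "cp437"))
    for cp in code_points
}

for cp, (ansi, oem) in membership.items():
    print(f"U+{cp:04X}  cp1252={ansi}  cp437={oem}")
```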

I don't know if this is something that anyone cares 
about, since the filenames are useless anyway, but it 
does seem to be unintended behavior.

(And before you ask, it's just a theoretical exercise; I 
have no urgent need to use os.listdir with non-Unicode 
directory names on Windows.)
msg20396 - Author: Mike Brown (mike_j_brown) Date: 2004-03-31 10:28

I've added a script that demonstrates the issue.
msg20397 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2004-03-31 20:11

There is nothing we can do about this: the mapping of
characters outside the ANSI code page is done completely
inside Windows, using an undocumented algorithm. This
algorithm will typically replace characters with "similar"
ones.

E.g. U+0398 is GREEK CAPITAL LETTER THETA, which is similar
in sound to LATIN CAPITAL LETTER T. Similarity is sometimes
determined by sound, sometimes by glyph-likeness in a
typical font. If no similar character is available, Windows
puts in a question mark. The system call performing the
directory listing does not indicate whether such a mapping
has taken place.
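
Since the byte-string listing gives no indication that a substitution happened, one possible workaround is to obtain the names as Unicode and test each one against the ANSI code page directly. The sketch below is only an illustration in Python 3: lossy_names is a hypothetical helper, cp1252 is assumed to be the ANSI code page, and the hard-coded list stands in for an os.listdir() result. Python's strict codec raises an error instead of substituting, so it flags names that Windows would have altered without predicting what Windows would have put in their place.

```python
def lossy_names(names, codec="cp1252"):
    """Return the names that cannot be encoded exactly in `codec`.

    Hypothetical helper; `codec` is assumed to match the ANSI code page.
    """
    lossy = []
    for name in names:
        try:
            name.encode(codec)  # strict mode: raises instead of substituting
        except UnicodeEncodeError:
            lossy.append(name)
    return lossy

# Stand-in for the result of os.listdir() called with a Unicode (str) path.
listing = ["caf\u00e9.txt", "\u0398eta.txt", "\u03a4au.txt", "plain.txt"]

flagged = lossy_names(listing)
print(flagged)
```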

Closing this as third-party bug.
History
Date User Action Args
2022-04-11 14:56:03 admin set github: 40107
2004-03-31 04:04:59 mike_j_brown create