My system: Windows XP, English - US locale, Python
2.3.3
I believe the bug I am reporting is this:
On Windows XP, when os.listdir() is called with a
non-Unicode argument, characters that are not in the
default locale's encoding (e.g. Greek capital letter
Sigma, U+03A3, is not in windows-1252) but that are
in the default OEM code page (e.g. Sigma is in cp437)
get mapped to ASCII characters other than '?'.
For example, things seem to work in a predictable way
when I put windows-1252 characters into filenames (I
do this in Explorer and then I see what
os.listdir(r'C:\path\to\the\dir') returns):
— (U+2014) becomes \x97
• (U+2022) becomes \x95
é (U+00E9) becomes \xe9
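
The three byte values above are just the windows-1252 encodings of
those characters, which can be checked without touching the file
system. This is only a sketch in Python 3 syntax (the report itself
was made against Python 2.3.3), and cp1252_byte is a helper name made
up for illustration:

```python
# Assumption: Python 3, where str is Unicode and encode() returns bytes.
samples = ["\u2014", "\u2022", "\u00e9"]  # em dash, bullet, e with acute


def cp1252_byte(ch):
    """Return the windows-1252 encoding of a single character."""
    return ch.encode("cp1252")


for ch in samples:
    # Prints the code point next to its windows-1252 byte.
    print("U+%04X -> %r" % (ord(ch), cp1252_byte(ch)))
```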
But things are much less predictable when I use
characters from outside this range. I thought I'd try
some Greek characters first. Some of them (the ones
that happen to be in cp437, interestingly enough) come
back as random ASCII letters:
Θ (U+0398) becomes "T"
Σ (U+03A3) becomes "S"
Φ (U+03A6) becomes "F"
Greek letters that are not in cp437 come back as
question marks, as expected (I guess):
Τ (U+03A4) becomes "?"
Υ (U+03A5) becomes "?"
...as do some Hebrew letters and Japanese hiragana:
א (U+05D0) becomes "?"
ה (U+05D4) becomes "?"
ס (U+05E1) becomes "?"
た (U+305F) becomes "?"
う (U+3046) becomes "?"
あ (U+3042) becomes "?"
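
To double-check which of these characters are in windows-1252, which
are only in cp437, and which are in neither, something like the
following can be used. It is a rough sketch in Python 3 syntax, not
what os.listdir does internally; in_code_page and classify are
hypothetical helper names, and the two code pages are hard-coded to
match my system rather than queried from Windows:

```python
def in_code_page(ch, code_page):
    """Return True if ch can be encoded in the given code page."""
    try:
        ch.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False


def classify(ch, ansi="cp1252", oem="cp437"):
    """Sort a character into 'ansi', 'oem-only' or 'neither'.

    The defaults are assumptions for a US English Windows XP setup.
    """
    if in_code_page(ch, ansi):
        return "ansi"
    if in_code_page(ch, oem):
        return "oem-only"
    return "neither"


# Code points taken from the examples in this report.
for cp in (0x00E9, 0x0398, 0x03A3, 0x03A6, 0x03A4, 0x03A5,
           0x05D0, 0x3042):
    print("U+%04X: %s" % (cp, classify(chr(cp))))
```

The characters that come back as ASCII letters are exactly the ones
this reports as "oem-only"; the ones that come back as "?" are
reported as "neither".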
I don't know if this is something that anyone cares
about, since the filenames are useless anyway, but it
does seem to be unintended behavior.
(And before you ask, it's just a theoretical exercise; I
have no urgent need to use os.listdir with non-Unicode
directory names on Windows.)