classification
Title: length of unicode string changes print behaviour
Type:
Stage:
Components: IDLE
Versions: Python 2.4
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To: loewis
Nosy List: hover_boy, kbk, loewis, terry.reedy
Priority: normal
Keywords:

Created on 2006-02-22 09:45 by hover_boy, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
japanese.png hover_boy, 2006-02-22 09:45 screenshot
bug.py hover_boy, 2006-03-22 15:12 example file: seems not to be simply the length of the string
Messages (6)
msg27592 - Author: James (hover_boy) Date: 2006-02-22 09:45
Python 2.4.2 and IDLE (with Courier New font) on XP 
and the following code saved as a UTF-8 file 

if __name__ == "__main__": 
    print "零 一 二 三 四 五 六 七 八" 
    print "零 一 二 三 四 五 六 七 八 九 十 "

results in...

IDLE 1.1.2 
>>> ================================ RESTART ================================
>>> 
零 一 二 三 å›› 五 å…七 å…« 
零 一 二 三 四 五 六 七 八 九 十 
>>> 



msg27593 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2006-03-06 01:44
I am fairly ignorant of unicode and encodings, but I am 
surprised you got anything coherent without an encoding 
cookie comment at the top (see manual).  Have you tried 
that?  Other questions that might help someone answer:

What specific XP version?  SP2 installed? Country version?
Your results for
>>> sys.getdefaultencoding()
'ascii'
>>> sys.getfilesystemencoding()
'mbcs'
What happens if you reverse the order of the print 
statements?  (I.e., is it really the shorter string that 
does not work, or just the first?)
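
The encoding questions above can be gathered in one go. What follows is only a minimal sketch, written with Python 3 syntax so that it runs on a current interpreter; the thread itself concerns Python 2.4, where the reported values will differ. The function name report_encodings is made up for illustration.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings that influence how text is read and printed."""
    return {
        "default": sys.getdefaultencoding(),
        "filesystem": sys.getfilesystemencoding(),
        # A replaced stdout (as in some IDEs) may lack this attribute.
        "stdout": getattr(sys.stdout, "encoding", None),
        "locale": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in sorted(report_encodings().items()):
        print("%s: %s" % (name, value))
```

Posting this output along with the failing script makes it easier to tell whether the source encoding and the output encoding disagree.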

I don't know enough to know if this is really a bug.  If 
you don't get an answer here, you might try for more info 
on python-list/comp.lang.python.
msg27594 - Author: James (hover_boy) Date: 2006-03-22 15:12
msg27595 - Author: James (hover_boy) Date: 2006-03-22 15:21
I've attached an example file to demonstrate the problem 
better.

It seems not to be the length but something else which I 
haven't figured out yet.

I've also added the encoding comment and tried changing 
the default encoding in sitecustomize.py from latin-1 to 
utf-8, but neither seems to work.

thanks,

James.

XP Professional, SP2, English
msg27596 - Author: Kurt B. Kaiser (kbk) * (Python committer) Date: 2006-07-23 05:33
I don't have a font installed which will print
those characters.  When I load your sample file,
I see print statements which include unicode
characters like \u5341.  The printed output
contains the same unicode characters as the
input program.  Maybe Martin has an idea.
msg27597 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2006-07-23 19:42
This is not a bug. The program should not attempt to print
byte strings, since it cannot know what the encoding of the
byte strings is. Instead, the program should use Unicode
strings, such as

    print u"å…«å…«å…«å…«å…«å…«å…«å…«å…«å…«å…«å…«å…«å…«å…«å…«å…«å…«å…«å…«å…«å…«"

If you attempt to print byte strings, they have to be in the
encoding of stdout, or else the behaviour is unspecified.
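
A minimal sketch of that advice applied to the reporter's script, assuming the file is saved as UTF-8. The names SHORT_LINE and LONG_LINE are illustrative and not taken from bug.py. It is written so that it should run on Python 2 as well as on Python 3.3 or later, where the u prefix is accepted but redundant.

```python
# -*- coding: utf-8 -*-
# On Python 2 the cookie above must sit on line 1 or 2 of the file; it tells
# the interpreter how to decode the non-ASCII literals below.

# The u prefix makes these Unicode strings rather than byte strings, so the
# output layer can encode them itself instead of guessing at raw bytes.
SHORT_LINE = u"零 一 二 三 四 五 六 七 八"
LONG_LINE = u"零 一 二 三 四 五 六 七 八 九 十"

if __name__ == "__main__":
    print(SHORT_LINE)
    print(LONG_LINE)
```

With Unicode strings the interpreter knows which characters are meant, so the result no longer depends on whether the UTF-8 bytes happen to be decodable in the locale's encoding.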

In my installation/locale, sys.stdout.encoding is cp1250.
IDLE's OutputWindow.write has this code:

        # Tk assumes that byte strings are Latin-1;
        # we assume that they are in the locale's encoding
        if isinstance(s, str):
            try:
                s = unicode(s, IOBinding.encoding)
            except UnicodeError:
                # some other encoding; let Tcl deal with it
                pass

Of the strings specified in the source file, only strings
2..5 decode properly as cp1250; the others don't. Those
others are therefore passed directly to Tcl, which assumes
they are UTF-8 (with some fallback). The strings that look
"incorrect" are actually printed out as designed: decoded
using sys.stdout.encoding.
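
The effect can be reproduced outside IDLE with a rough sketch. It assumes cp1252 as the locale encoding (a guess for the reporter's English XP system; the installation described above uses cp1250) and uses a plain UTF-8 decode as a stand-in for "let Tcl deal with it", which simplifies whatever Tcl really does. The helper name idle_like_decode is hypothetical, and the code uses Python 3 syntax.

```python
def idle_like_decode(raw, locale_encoding="cp1252"):
    """Mimic the quoted fallback: try the locale encoding, else assume UTF-8."""
    try:
        return raw.decode(locale_encoding)
    except UnicodeError:
        # Stand-in for passing the bytes on to Tcl (an assumption, see above).
        return raw.decode("utf-8")


SHORT = "零 一 二 三 四 五 六 七 八"
LONG = "零 一 二 三 四 五 六 七 八 九 十"

# Every UTF-8 byte of SHORT is a defined cp1252 character, so the wrong
# decode succeeds silently and yields mojibake.
short_shown = idle_like_decode(SHORT.encode("utf-8"))

# The UTF-8 bytes of the last two characters include 0x9D, 0x8D and 0x81,
# which cp1252 leaves undefined; the decode fails and the fallback applies.
long_shown = idle_like_decode(LONG.encode("utf-8"))

if __name__ == "__main__":
    print(short_shown)
    print(long_shown)
```

Under these assumptions it is the content of the string, not its length, that decides which path is taken: a string displays correctly only when its bytes fail to decode in the locale's encoding.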
History
Date User Action Args
2022-04-11 14:56:15 admin set github: 42938
2006-02-22 09:45:16 hover_boy create