
classification
Title: Len too large with national characters
Type:
Stage:
Components: Library (Lib)
Versions: Python 2.4
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To: mwh
Nosy List: henrikwj, mwh
Priority: normal
Keywords:

Created on 2005-06-20 10:52 by henrikwj, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (5)
msg25587 - (view) Author: Henrik Winther Jensen (henrikwj) Date: 2005-06-20 10:52
It looks as if len returns the length of a UTF-8 string
even if the string only contains ASCII characters and the
default encoding is ASCII. This means that if you insert,
for example, one Danish ø in a string, len will return a
value of 2, i.e.

a='ø'
print len(a)

gives:
2
msg25588 - (view) Author: Michael Hudson (mwh) (Python committer) Date: 2005-06-20 12:12
How are you getting your Danish character into the string?  If it's by typing
it into a console, is your console in UTF-8 mode?
msg25589 - (view) Author: Henrik Winther Jensen (henrikwj) Date: 2005-06-20 13:06
Actually the problem persists whether I am reading from a
file or typing at the keyboard. I am using Python from the
command line in a Linux shell. I don't know what console that is,
but it is able to show the Danish characters on the screen as
well as read them from the keyboard.
msg25590 - (view) Author: Michael Hudson (mwh) (Python committer) Date: 2005-06-20 13:12
Well, what encoding is the file in?

I suspect that it's in utf-8, so when you open the file and
call read() you get utf-8 data and thus your danish
character is represented as two bytes.
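
As a rough illustration of the point above, the sketch below builds the
character from an escape (so the encoding of the source file or console
does not matter) and compares the byte count with the character count.
The variable names are made up for the example, and the code is written
to run on a current Python rather than on 2.4 specifically.

```python
# U+00F8 is the Danish letter in question, written as an escape.
text = u'\xf8'

# What a UTF-8 console or file would hand to Python: a byte string.
utf8_bytes = text.encode('utf-8')

print(len(utf8_bytes))   # counts bytes
print(len(text))         # counts characters

# Decoding the bytes gives back a one-character unicode string.
decoded = utf8_bytes.decode('utf-8')
print(len(decoded))
```

So len is counting whatever units the object holds: bytes for undecoded
data, characters once the data has been decoded.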

You might want to do 

import codecs
fileobj = codecs.open('filename.txt', encoding='utf-8')

and then fileobj.read() will return a unicode string of the
length you're expecting.
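
A minimal sketch of that suggestion, assuming the file really is UTF-8.
The temporary file and the names path, raw and text are invented here
only to make the example self-contained, and the with statement needs a
newer Python than 2.4.

```python
import codecs
import os
import tempfile

# Hypothetical sample file holding the UTF-8 encoding of the Danish letter.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'wb') as f:
    f.write(b'\xc3\xb8')

# Reading the raw bytes: len counts the two bytes of the encoded character.
with open(path, 'rb') as f:
    raw = f.read()

# Reading through codecs.open: the data is decoded, so len counts characters.
with codecs.open(path, encoding='utf-8') as f:
    text = f.read()

print(len(raw))
print(len(text))
```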

At any rate, I see no evidence of a Python bug here, so closing.
msg25591 - (view) Author: Henrik Winther Jensen (henrikwj) Date: 2005-06-20 13:41
Yes, you are right, the problem is that the console-thingy 
converts my iso8859 characters to utf-8. Thanks for the
explanation.
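
For comparison, a small sketch of why the same letter has a different
length in the two encodings mentioned here. It assumes ISO 8859-1 as the
specific ISO 8859 variant, which the thread does not state, and the
variable names are illustrative only.

```python
letter = u'\xf8'  # the Danish letter, written as an escape

latin1 = letter.encode('iso-8859-1')  # single-byte encoding
utf8 = letter.encode('utf-8')         # multi-byte encoding for non-ASCII

print(len(latin1))
print(len(utf8))

# Interpreting the UTF-8 bytes as ISO 8859-1 yields two characters,
# which is one way a length of 2 can show up unexpectedly.
misread = utf8.decode('iso-8859-1')
print(len(misread))
```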
History
Date User Action Args
2022-04-11 14:56:11 admin set github: 42102
2005-06-20 10:52:50 henrikwj create