classification
Title: Incorrect length of unicode strings using .encode('utf-8')
Components: Unicode
Versions: Python 2.4

process
Status: closed
Resolution: works for me
Assigned To: lemburg
Nosy List: edschofield, lemburg
Priority: normal

Created on 2004-11-16 11:58 by edschofield, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name: python unicode char length bug.txt
Uploaded: edschofield, 2004-11-16 11:58
Description: Code example exposing a bug in determining the length of utf-8 encoded strings
Messages (2)
msg23167 - Author: Ed Schofield (edschofield) Date: 2004-11-16 11:58
Python 2.3.4 and Python 2.4b2:

print "x = %-15s" %(x.encode('utf-8'),) + " more text"

gives an incorrect number of spaces when x is a
character such as à that takes two bytes in UTF-8.
There is no such problem if x itself is formatted
rather than the result of its encode(...) method.

The reason seems to be this: if x = u'\u00e0' (the
character à) and s=x.encode('utf-8'), then len(s) = 2,
so %-15s pads with 13 spaces instead of 14, which
misaligns the output of the print command above on a
UTF-8 terminal.
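
The length mismatch can be reproduced with a minimal sketch. It assumes Python 3.5 or later, where text is str and encoded data is bytes (the roles played by unicode and str in Python 2); the names padded_text and padded_bytes are illustrative and not taken from the attached file.

```python
# Python 3 sketch (assumption: the report itself targets Python 2.3/2.4).
x = "\u00e0"             # the character à, one code point
s = x.encode("utf-8")    # b'\xc3\xa0', two bytes

print(len(x))  # 1
print(len(s))  # 2

# %-15s pads to 15 units of whatever is being formatted:
# code points for text, bytes for encoded data.
padded_text = "%-15s" % x
padded_bytes = b"%-15s" % s

print(len(padded_text))                   # 15
print(len(padded_bytes))                  # 15
print(len(padded_bytes.decode("utf-8")))  # 14
```

The encoded value is padded to 15 bytes, which a UTF-8 terminal displays as only 14 characters.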

A slightly longer example is attached.
msg23168 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2004-11-16 12:12
As you already noted: the problem is that you are mixing Unicode
and strings in a way which is bound to fail.

You should use:

print (u"x = %-15s" %x + u" more text").encode('utf-8')

i.e., stay with Unicode as long as you can and only call encode
as the last step when doing I/O, just before passing the string
to an 8-bit stream.
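
The advice above might look like this in Python 3, as a rough sketch rather than the exact code from this thread. It assumes an in-memory io.BytesIO as a stand-in for the 8-bit stream; the names line, data and stream are illustrative.

```python
import io

x = "\u00e0"  # the character à

# Format and pad while everything is still text, so the width
# is counted in characters rather than bytes.
line = "x = %-15s" % x + " more text"

# Encode once, as the last step before writing to a byte stream.
data = line.encode("utf-8")

stream = io.BytesIO()  # stand-in for a file, socket, or terminal
stream.write(data + b"\n")

print(len(line))  # 29
print(len(data))  # 30
```

The padded line is 29 characters long; the extra byte in the encoded form comes from à and no longer affects the alignment.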
History
Date User Action Args
2022-04-11 14:56:08 admin set github: 41179
2004-11-16 11:58:42 edschofield create