Issue 1067294: Incorrect length of unicode strings using .encode('utf-8')

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/41179

classification

Title:	Incorrect length of unicode strings using .encode('utf-8')
Type:
Stage:
Components:	Unicode
Versions:	Python 2.4

process

Created on 2004-11-16 11:58 by edschofield, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
python unicode char length bug.txt	edschofield, 2004-11-16 11:58	Code example exposing a bug in determining the length of utf-8 encoded strings

Messages (2)
msg23167	Author: Ed Schofield (edschofield)	Date: 2004-11-16 11:58
In Python 2.3.4 and Python 2.4b2, the statement

    print "x = %-15s" %(x.encode('utf-8'),) + " more text"

prints the wrong number of spaces when x is a Unicode character whose UTF-8 encoding takes two bytes, such as à. The problem does not occur if x itself is formatted rather than the result of x.encode(...). The reason seems to be this: if x = u'\u00e0' (the character à) and s = x.encode('utf-8'), then len(s) == 2, so %-15s pads by bytes rather than by characters, which breaks the alignment of the print statement above on a UTF-8 terminal. A slightly longer example is attached.
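The length mismatch described in this report can be reproduced with a short sketch. It is written for Python 3 (3.5 or later, where bytes support %-formatting) rather than the Python 2 versions in the report, so the encoded value is a bytes object instead of a Python 2 str. The names padded_bytes, padded_text, chars_from_bytes and chars_from_text are illustrative and do not come from the attached file.

```python
# Python 3 sketch of the reported behaviour; names are illustrative only.
x = "\u00e0"              # the character à, a single code point
s = x.encode("utf-8")     # b'\xc3\xa0', two bytes

# Padding the encoded value counts bytes, so à uses two of the 15 slots.
padded_bytes = b"%-15s" % (s,)
# Padding the text value counts code points, so à uses one slot.
padded_text = "%-15s" % (x,)

# Width as a UTF-8 terminal would render it (one column per character here).
chars_from_bytes = len(padded_bytes.decode("utf-8"))
chars_from_text = len(padded_text)

print(len(x), len(s), chars_from_bytes, chars_from_text)  # 1 2 14 15
```

The byte-padded field comes out one column short, which matches the misalignment seen on a UTF-8 terminal.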
msg23168	Author: Marc-Andre Lemburg (lemburg)	Date: 2004-11-16 12:12
As you already noted, the problem is that you are mixing Unicode and 8-bit strings in a way that is bound to fail. You should use:

    print (u"x = %-15s" % x + u" more text").encode('utf-8')

i.e. stay with Unicode for as long as you can, and only call encode when doing I/O, as the last step before passing the string off to an 8-bit stream.
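A minimal sketch of this advice, again assuming Python 3 rather than the Python 2 syntax used in the reply: the whole line is formatted as text first and encoded once at the point of output. The io.BytesIO object is only a stand-in for an 8-bit stream such as a file or socket, and the names line, stream and written are illustrative.

```python
import io

x = "\u00e0"  # the character à

# Format while the data is still text, so the padding counts characters.
line = "x = %-15s" % x + " more text"

# Encode only at the I/O boundary, as the last step before writing.
stream = io.BytesIO()  # stand-in for an 8-bit stream
stream.write(line.encode("utf-8"))
written = stream.getvalue()

print(len(line), len(written))  # 29 30
```

The text line is 29 characters wide regardless of how many bytes à needs; only the encoded form grows to 30 bytes.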