Issue 676346: String formatting operation Unicode problem.

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/37858

classification

Title:	String formatting operation Unicode problem.
Type:		Stage:
Components:	Unicode	Versions:	Python 2.2

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:	lemburg	Nosy List:	dmgrime, facundobatista, lemburg
Priority:	low	Keywords:

Created on 2003-01-28 20:59 by dmgrime, last changed 2022-04-10 16:06 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
func.py	dmgrime, 2003-01-28 21:00	Test Case

Messages (4)
msg14292 - (view)	Author: David M. Grimes (dmgrime)	Date: 2003-01-28 20:59
When performing a string formatting operation using %s and a unicode argument, the argument evaluation is performed more than once. In certain environments (see example) this leads to excessive calls. It seems Python-2.2.2:Objects/stringobject.c:3394 is where PyObject_GetItem is used (for dictionary-like formatting args). Later, at :3509, there is a"goto unicode" when a string argument is actually unicode. At this point, everything resets and we do it all over again in PyUnicode_Format. There is an underlying assumption that the cost of the call to PyObject_GetItem is very low (since we're going to do them all again for unicode). We've got a Python-based templating system which uses a very simple Mix-In class to facilitate flexible page generation. At the core is a simple __getitem__ implementation which maps calls to getattr(): class mixin: def __getitem__(self, name): print '%r::__getitem__(%s)' % (self, name) hook = getattr(self, name) if callable(hook): return hook() else: return hook Obviously, the print is diagnostic. So, this basic mechanism allows one to write hierarchical templates filling in content found in "%(xxxx)s" escapes with functions returning strings. It has worked extremely well for us. BUT, we recently did some XML-based work which uncovered this strange unicode behaviour. Given the following classes: class w1u(mixin): v1 = u'v1' class w2u(mixin): def v2(self): return '%(v1)s' % w1u() class w3u(mixin): def v3(self): return '%(v2)s' % w2u() class w1(mixin): v1 = 'v1' class w2(mixin): def v2(self): return '%(v1)s' % w1() class w3(mixin): def v3(self): return '%(v2)s' % w2() And test case: print 'All string:' print '%(v3)s' % w3() print print 'Unicode injected at w1u:' print '%(v3)s' % w3u() print As we can see, the only difference between the w{1,2,3} and w{1,2,3}u classes is that w1u defines v1 as unicode where w1 uses a "normal" string. What we see is the string-based one shows 3 calls, as expected: All string: <__main__.w3 instance at 0x8150524>::__getitem__(v3) <__main__.w2 instance at 0x814effc>::__getitem__(v2) <__main__.w1 instance at 0x814f024>::__getitem__(v1) v1 But the unicode causes a tree-like recursion: Unicode injected at w1u: <__main__.w3u instance at 0x8150524>::__getitem__(v3) <__main__.w2u instance at 0x814effc>::__getitem__(v2) <__main__.w1u instance at 0x814f024>::__getitem__(v1) <__main__.w1u instance at 0x814f024>::__getitem__(v1) <__main__.w2u instance at 0x814effc>::__getitem__(v2) <__main__.w1u instance at 0x814f024>::__getitem__(v1) <__main__.w1u instance at 0x814f024>::__getitem__(v1) <__main__.w3u instance at 0x8150524>::__getitem__(v3) <__main__.w2u instance at 0x814effc>::__getitem__(v2) <__main__.w1u instance at 0x814f024>::__getitem__(v1) <__main__.w1u instance at 0x814f024>::__getitem__(v1) <__main__.w2u instance at 0x814effc>::__getitem__(v2) <__main__.w1u instance at 0x814f024>::__getitem__(v1) <__main__.w1u instance at 0x814f024>::__getitem__(v1) v1 I'm sure this isn't a "common" use of the string formatting mechanism, but it seems that evaluating the arguments multiple times could be a bad thing. It certainly is for us 8^) We're running this on a RedHat 7.3/8.0 setup, not that it appears to matter (from looking in stringojbect.c). Also appears to still be a problem in 2.3a1. Any comments? Help? Questions?
msg14293 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2003-01-28 22:23
Logged In: YES user_id=38388 I don't see how you can avoid fetching the Unicode argument a second time without restructuring the formatting code altogether. If you know that your arguments can be Unicode, you should start with a Unicode formatting string to begin with. That's faster and doesn't involve a fallback solution. If you still want to see this fixed, I'd suggest to submit a patch.
msg14294 - (view)	Author: Facundo Batista (facundobatista) *	Date: 2005-01-11 03:54
Logged In: YES user_id=752496 Please, could you verify if this problem persists in Python 2.3.4 or 2.4? If yes, in which version? Can you provide a test case? If the problem is solved, from which version? Note that if you fail to answer in one month, I'll close this bug as "Won't fix". Thank you! . Facundo
msg14295 - (view)	Author: Facundo Batista (facundobatista) *	Date: 2005-05-30 19:55
Logged In: YES user_id=752496 Deprecated. Reopen only if still happens in 2.3 or newer. . Facundo

History
Date	User	Action	Args
2022-04-10 16:06:15	admin	set	github: 37858
2003-01-28 20:59:42	dmgrime	create