
classification
Title: String formatting operation Unicode problem.
Type:
Stage:
Components: Unicode
Versions: Python 2.2

process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To: lemburg
Nosy List: dmgrime, facundobatista, lemburg
Priority: low
Keywords:

Created on 2003-01-28 20:59 by dmgrime, last changed 2022-04-10 16:06 by admin. This issue is now closed.

Files
File name   Uploaded                    Description
func.py     dmgrime, 2003-01-28 21:00   Test Case
Messages (4)
msg14292 - Author: David M. Grimes (dmgrime) Date: 2003-01-28 20:59
When performing a string formatting operation using %s
and a unicode argument, the argument evaluation is
performed more than once.  In certain environments (see
example) this leads to excessive calls.

It seems Python-2.2.2:Objects/stringobject.c:3394 is
where PyObject_GetItem is used (for dictionary-like
formatting args).  Later, at :3509, there is a "goto
unicode" when a string argument is actually unicode.
At this point, everything resets and we do it all over
again in PyUnicode_Format.
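
A toy model of that restart, written in present-day Python rather than C, may make the double lookup easier to see. Everything in it is hypothetical scaffolding: FakeUnicode stands in for a Python 2 unicode object, CountingMap just counts lookups, and format_one only mimics the control flow described above. It is not the code in stringobject.c.

```python
class FakeUnicode(str):
    """Stand-in for a Python 2 unicode object (illustration only)."""


class CountingMap:
    """Mapping wrapper that counts how often a key is fetched."""

    def __init__(self, data):
        self.data = data
        self.lookups = 0

    def __getitem__(self, key):
        self.lookups += 1
        return self.data[key]


def format_one(key, mapping):
    """Toy model of '%(key)s' % mapping with a byte-string template."""
    value = mapping[key]  # first lookup, made by the str formatter
    if isinstance(value, FakeUnicode):
        # Model of the "goto unicode" fallback: formatting starts over
        # in the unicode formatter, which fetches the key again.
        value = mapping[key]
        return FakeUnicode(value)
    return str(value)


plain = CountingMap({'v1': 'v1'})
uni = CountingMap({'v1': FakeUnicode('v1')})
format_one('v1', plain)
format_one('v1', uni)
```

In this model the plain-string mapping is consulted once and the unicode stand-in twice, which is the doubling the report is about.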

There is an underlying assumption that the cost of each
call to PyObject_GetItem is very low (since we're going
to make them all again for unicode).  We've got a
Python-based templating system which uses a very simple
Mix-In class to facilitate flexible page generation. 
At the core is a simple __getitem__ implementation
which maps calls to getattr():

class mixin:
    def __getitem__(self, name):
        print '%r::__getitem__(%s)' % (self, name)
        hook = getattr(self, name)
        if callable(hook):
            return hook()
        else:
            return hook

Obviously, the print is diagnostic.  So, this basic
mechanism allows one to write hierarchical templates
filling in content found in "%(xxxx)s" escapes with
functions returning strings.  It has worked extremely
well for us.

BUT, we recently did some XML-based work which
uncovered this strange unicode behaviour.  Given the
following classes:

class w1u(mixin):
    v1 = u'v1'

class w2u(mixin):
    def v2(self):
        return '%(v1)s' % w1u()

class w3u(mixin):
    def v3(self):
        return '%(v2)s' % w2u()

class w1(mixin):
    v1 = 'v1'

class w2(mixin):
    def v2(self):
        return '%(v1)s' % w1()

class w3(mixin):
    def v3(self):
        return '%(v2)s' % w2()

And test case:

print 'All string:'
print '%(v3)s' % w3()
print

print 'Unicode injected at w1u:'
print '%(v3)s' % w3u()
print


As we can see, the only difference between the w{1,2,3}
and w{1,2,3}u classes is that w1u defines v1 as unicode
where w1 uses a "normal" string.

What we see is that the string-based one shows 3 calls,
as expected:

All string:
<__main__.w3 instance at 0x8150524>::__getitem__(v3)
<__main__.w2 instance at 0x814effc>::__getitem__(v2)
<__main__.w1 instance at 0x814f024>::__getitem__(v1)
v1

But the unicode causes a tree-like recursion:

Unicode injected at w1u:
<__main__.w3u instance at 0x8150524>::__getitem__(v3)
<__main__.w2u instance at 0x814effc>::__getitem__(v2)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w2u instance at 0x814effc>::__getitem__(v2)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w3u instance at 0x8150524>::__getitem__(v3)
<__main__.w2u instance at 0x814effc>::__getitem__(v2)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w2u instance at 0x814effc>::__getitem__(v2)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
<__main__.w1u instance at 0x814f024>::__getitem__(v1)
v1
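
The trace fits a simple doubling pattern: v3 is fetched 2 times, v2 4 times and v1 8 times, 14 lookups in total against 3 for the all-string case. A small sketch of that arithmetic follows. It assumes every level restarts exactly once because the value it fetched turned out to be unicode; the function names are made up for illustration.

```python
def lookups_all_str(depth):
    """Each of the `depth` nested templates fetches its key once."""
    return depth


def lookups_unicode_leaf(depth):
    """Total lookups when the innermost value is unicode.

    Formatting one level fetches its key, which in turn formats the
    level below; the unicode result then triggers a restart that does
    the same work again: f(n) = 2 * (1 + f(n - 1)), with f(0) = 0.
    """
    if depth == 0:
        return 0
    return 2 * (1 + lookups_unicode_leaf(depth - 1))


def per_level_counts(depth):
    """Fetch count per level, outermost first: 2, 4, 8, ..."""
    return [2 ** level for level in range(1, depth + 1)]
```

For the three-level test case this gives 14, matching the trace, and in general 2**(depth + 1) - 2, so the cost grows exponentially with nesting depth.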

I'm sure this isn't a "common" use of the string
formatting mechanism, but it seems that evaluating the
arguments multiple times could be a bad thing.  It
certainly is for us 8^)

We're running this on a RedHat 7.3/8.0 setup, not that
it appears to matter (from looking in stringobject.c).
It also appears to still be a problem in 2.3a1.

Any comments?  Help?  Questions?
msg14293 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2003-01-28 22:23

I don't see how you can avoid fetching the Unicode
argument a second time without restructuring the
formatting code altogether.

If you know that your arguments can be Unicode, you
should start with a Unicode formatting string to begin
with. That's faster and doesn't involve a fallback
solution.
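
A sketch of this suggestion applied to the reporter's classes, with the diagnostic print replaced by a list so the lookups can be counted. It assumes the templates themselves can be made unicode; the capitalised class names and the calls list are illustrative, not part of the attached test case. Under Python 3 every string is already unicode, so the u prefix is redundant there and no fallback is involved.

```python
calls = []  # records every key passed to __getitem__


class Mixin(object):
    def __getitem__(self, name):
        calls.append(name)
        hook = getattr(self, name)
        if callable(hook):
            return hook()
        return hook


class W1u(Mixin):
    v1 = u'v1'


class W2u(Mixin):
    def v2(self):
        # Unicode template: no str-to-unicode restart is needed.
        return u'%(v1)s' % W1u()


class W3u(Mixin):
    def v3(self):
        return u'%(v2)s' % W2u()


result = u'%(v3)s' % W3u()
```

With unicode templates at every level, each key should be fetched once, so calls ends up as v3, v2, v1 rather than the 14 entries in the reported trace.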

If you still want to see this fixed, I'd suggest submitting
a patch.
msg14294 - Author: Facundo Batista (facundobatista) * (Python committer) Date: 2005-01-11 03:54

Please, could you verify if this problem persists in Python 2.3.4
or 2.4?

If yes, in which version? Can you provide a test case?

If the problem is solved, from which version?

Note that if you fail to answer in one month, I'll close this bug
as "Won't fix".

Thank you! 

.    Facundo
msg14295 - Author: Facundo Batista (facundobatista) * (Python committer) Date: 2005-05-30 19:55

Deprecated. Reopen only if this still happens in 2.3 or newer.

.    Facundo
History
Date                  User      Action   Args
2022-04-10 16:06:15   admin     set      github: 37858
2003-01-28 20:59:42   dmgrime   create