
classification
Title: reading shelves is really slow
Type: performance Stage: test needed
Components: Extension Modules Versions: Python 2.7
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: BreamoreBoy, ganssauge, rhettinger
Priority: normal Keywords:

Created on 2003-11-26 14:06 by ganssauge, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
all_idx2.shelve.bz2 ganssauge, 2003-11-27 10:42 The shelve in question
69228.profile ganssauge, 2003-11-27 10:43 The profiling data I made
slow_shelve.py ganssauge, 2003-11-28 16:01
Messages (10)
msg19147 - (view) Author: Gottfried Ganßauge (ganssauge) Date: 2003-11-26 14:06
My application uses a shelve-file which is created by 
another process using the same python version.
Before python2.3 using this shelve with the exact same 
application was almost twice as fast as a binary pickle 
containing the same data.
Now with python2.3 the same application is suddenly 
about 150 times slower than using the binary pickle.

The usage is as follows:
   idx_dict = shelve.open (idx_dict_name, "r")
   ...
   while not infile.eof:
      index = get_index_from_somewhere_else()
      if not idx_dict.has_key (index):
          do_something(index)
      else:
          do_something_else(index)

   idx_dict.close()
   
Profiling revealed that most of the time is spent within 
UserDict.
msg19148 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2003-11-27 09:17
Logged In: YES 
user_id=80475

I can reproduce a four-fold slowdown that persists even
after the UserDict.DictMixin lines are commented out of
shelve.py and bsddb.__init__.py.  For me, the only thing
that has changed is the underlying bsddb implementation.

Let's see if your system is going somewhere else to get its
shelving done.  After the first line, add:  idx_dict.has_key([])
Then post the traceback here.

Do that for both Py2.2 and for Py2.3.  Thank you.

Also, post what a typical record in the index looks like and tell me
how many entries are typically in idx_dict.  That way, I can
try to reproduce your timings with greater fidelity.

Which OS are you using, and what are the minor bugfix version
numbers of the Py2.2 and Py2.3 you are using?
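
The probe requested above can be sketched as follows. This is an illustration written for present-day Python 3, not the Py2.2/Py2.3 code under discussion: `has_key()` no longer exists there, so the sketch uses the `in` operator instead, and the shelf name "probe" is a throwaway placeholder. The idea is to pass a key that is certain to be rejected and then read the module names out of the resulting traceback to see which layers actually handled the call.

```python
import os
import shelve
import tempfile
import traceback

modules = []  # file names of the frames seen in the traceback

with tempfile.TemporaryDirectory() as tmp:
    # "probe" is a placeholder shelf created only for this sketch.
    db = shelve.open(os.path.join(tmp, "probe"))
    try:
        # A list is not a valid shelf key, so this lookup is expected to fail.
        [] in db
    except Exception as exc:
        # Walk the traceback and record which source files were involved.
        frames = traceback.extract_tb(exc.__traceback__)
        modules = [os.path.basename(frame.filename) for frame in frames]
    finally:
        db.close()

print(modules)
```

The printed list names the calling script followed by the library modules the lookup passed through, which is the same information the posted tracebacks convey.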
msg19149 - (view) Author: Gottfried Ganßauge (ganssauge) Date: 2003-11-27 10:32
Logged In: YES 
user_id=792746

I uploaded my profiling data, maybe it will help you ...
Here is the information you requested:
----------------><------------------------><------------
(gotti@gglinux 534) PYTHONPATH=../../../COMMON.DEVEL/Tools/python/lib.linux-i686-2.3 python Konvertierung/entsch_pass2.py HI69228 x HR all_idx2.shelve <hi69228.sgml
Traceback (most recent call last):
  File "Konvertierung/entsch_pass2.py", line 1026, in ?
    init_idx_dict (idx_dict_name)
  File "../../COMMON/lib/EDB.py", line 54, in init_idx_dict
    idx_dict.has_key([])
  File "/usr/lib/python2.3/shelve.py", line 104, in has_key
    return self.dict.has_key(key)
  File "/usr/lib/python2.3/bsddb/__init__.py", line 142, in has_key
    return self.db.has_key(key)
TypeError: String or Integer object expected for key, list found
(gotti@gglinux 535) PYTHONPATH=../../../COMMON.DEVEL/Tools/python/lib.linux-i686-2.2 python2.2 Konvertierung/entsch_pass2.py HI69228 x HR all_idx2.shelve <hi69228.sgml
Traceback (most recent call last):
  File "Konvertierung/entsch_pass2.py", line 1026, in ?
    init_idx_dict (idx_dict_name)
  File "../../COMMON/lib/EDB.py", line 54, in init_idx_dict
    idx_dict.has_key([])
  File "/usr/lib/python2.2/shelve.py", line 62, in has_key
    return self.dict.has_key(key)
TypeError: key type must be string
(gotti@gglinux 536) python -V
Python 2.3.2
(gotti@gglinux 537) python2.2 -V
Python 2.2.3
(gotti@gglinux 538) uname -a
Linux gglinux 2.4.22 #1 SMP Mon Nov 3 11:40:28 CET 2003 i686 unknown unknown GNU/Linux
(gotti@gglinux 538) cat /etc/debian_version
testing/unstable
(gotti@gglinux 539) python2.2 -c 'import shelve ; d = shelve.open("all_idx2.shelve", "r"); print len (d.keys()) ; print d.keys()[0], d [d.keys()[0]]'
34983
HI568817 None
(gotti@gglinux 540)  python2.3 -c 'import shelve ; d = shelve.open("all_idx2.shelve", "r"); print "# items in shelve:", len (d.keys()) ; print "Items look like: index", d.keys()[0], "value", d [d.keys()[0]]'
# items in shelve: 34983
Items look like: index HI568817 value None
msg19150 - (view) Author: Gottfried Ganßauge (ganssauge) Date: 2003-11-27 10:42
Logged In: YES 
user_id=792746

What the heck ... here is the shelve in question
msg19151 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2003-11-27 17:55
Logged In: YES 
user_id=80475

The fragment in the original posting showed the only
inner-loop shelve access was through has_key().   The
tracebacks show that UserDict is nowhere in the traceback
chain.  I conclude that the fragment does not represent what
is really going on in the problematic script. So, please
attach the profiled script, Konvertierung/entsch_pass2.py

The attached profile indicates that somewhere, there is a
line like:   for k,v in idx_dict.iteritems().  This is
surprising because shelves did not support iteritems() in
Py2.2.  That would mean that you've timed and compared
two different pieces of code.

Please show the shortest script with data that runs at
radically different speeds on Py2.2 vs Py2.3.

msg19152 - (view) Author: Gottfried Ganßauge (ganssauge) Date: 2003-11-28 16:01
Logged In: YES 
user_id=792746

I think I found the answer:

Apart from has_key(), I'm also using "dict != None".
If I leave that out in my test program both python variants 
run with the same speed.

The dict != None condition seems to trigger len(dict.keys()) 
and that seems to be way slower than before.

I definitely didn't time different scripts: the script is part of 
our CDROM production system and the only variables I had 
during my tests were python itself and the python path.

Find my test script attached...
msg19153 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2003-11-28 21:57
Logged In: YES 
user_id=80475

Yes, that was the culprit.

I'll look for a way to make __cmp__ a bit smarter.  In the
meantime, the proper way to check for None is always:  if
dict is None.
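
The difference between the two checks can be illustrated with a small sketch. `CountingStore` is a hypothetical stand-in written for Python 3; it is not the shelve or UserDict.DictMixin code, and its comparison method merely assumes the behaviour described in this thread, namely that comparing the mapping reads every key first. The counter shows that `!= None` pays for a full scan while `is not None` calls no method at all.

```python
class CountingStore:
    """Hypothetical disk-backed mapping; not the real shelve code."""

    def __init__(self, data):
        self._data = dict(data)
        self.full_scans = 0  # number of times every key was read

    def keys(self):
        self.full_scans += 1
        return list(self._data)

    def __eq__(self, other):
        # Assumed behaviour: materialise the whole mapping before comparing.
        snapshot = {key: self._data[key] for key in self.keys()}
        return snapshot == other

    def __ne__(self, other):
        return not self.__eq__(other)


store = CountingStore({"HI568817": None})

# Deliberately the slow form: dispatches to __ne__, which scans all keys.
equality_result = store != None  # noqa: E711
scans_after_equality = store.full_scans

# Identity check: compares object identity only, no user code runs.
identity_result = store is not None
scans_after_identity = store.full_scans

print(equality_result, scans_after_equality)
print(identity_result, scans_after_identity)
```

Both checks give the same answer here, but only the first one touches the stored keys; with tens of thousands of entries on disk that extra scan on every loop iteration is where the time would go.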
msg19154 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2003-12-07 11:55
Logged In: YES 
user_id=80475

I fixed-up your particular problem for Py2.3.3 and Py2.4.

Leaving the report open because there are other calls which 
have performance issues.
msg55408 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2007-08-29 01:57
Raymond - can we close this ticket?
msg110108 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-07-12 16:30
Raymond - can we close this ticket?
History
Date User Action Args
2022-04-11 14:56:01 admin set github: 39611
2010-07-12 19:41:33 rhettinger set status: open -> closed; resolution: out of date
2010-07-12 16:30:01 BreamoreBoy set nosy: + BreamoreBoy; messages: + msg110108
2009-02-16 06:25:10 skip.montanaro set nosy: - skip.montanaro
2009-02-14 12:32:22 ajaksu2 set stage: test needed; versions: + Python 2.7, - Python 2.3
2008-03-16 21:06:20 georg.brandl set type: performance
2007-08-29 01:57:38 skip.montanaro set nosy: + skip.montanaro; messages: + msg55408
2003-11-26 14:06:12 ganssauge create