classification
Title: Shelve slow after 7/8000 key
Type:
Stage:
Components: Library (Lib)
Versions:
process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To: gregory.p.smith
Nosy List: gregory.p.smith, jkew, marcoberi, skip.montanaro, theller, tim.peters
Priority: normal
Keywords:

Created on 2004-01-21 17:09 by marcoberi, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
test1.py marcoberi, 2004-01-21 17:09 Little test program (9 lines) to show the problem
test1skip.py skip.montanaro, 2004-01-22 00:28
test3skip.py skip.montanaro, 2004-01-22 18:02
Messages (24)
msg19737 - (view) Author: Marco Beri (marcoberi) Date: 2004-01-21 17:09
After about 8,000 insertions, shelve becomes really, really 
slow.
This happens only with 2.3.3 #51 on Windows, not with 
2.2 or with 2.3 on Linux.
I tried with writeback both True and False: same problem.
Help! :-))
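
The attached test1.py is not reproduced in this thread. A minimal sketch of a test in the same spirit, written for current Python rather than 2.3, might time the insertions in batches so that the point where things slow down becomes visible. The helper name `time_inserts` and the key and value shapes are assumptions made for illustration, not taken from the attachment:

```python
import os
import shelve
import tempfile
import time


def time_inserts(path, total=10000, batch=1000):
    """Insert `total` keys into a shelf; return seconds spent per batch.

    Hypothetical helper: the key and value shapes are assumptions,
    not taken from the attached test1.py.
    """
    timings = []
    shelf = shelve.open(path)
    try:
        start = time.perf_counter()
        for i in range(1, total + 1):
            shelf["key%d" % i] = {"n": i}
            if i % batch == 0:
                # Record the elapsed time for this batch and restart the clock.
                now = time.perf_counter()
                timings.append(now - start)
                start = now
    finally:
        shelf.close()
    return timings


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        results = time_inserts(os.path.join(tmp, "test.shelf"))
        for n, secs in enumerate(results, 1):
            print("batch %2d: %.3fs" % (n, secs))
```

If the slowdown is tied to the number of keys, the later batches should take visibly longer than the first ones.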
msg19738 - (view) Author: Thomas Heller (theller) * (Python committer) Date: 2004-01-21 18:24
Hm, are windows bugs automatically assigned to me ;-)??
msg19739 - (view) Author: Marco Beri (marcoberi) Date: 2004-01-21 23:57
Skip Montanaro discovered that whichdb reports bsddb185 
with Python 2.2 and dbhash with 2.3.3.
So why is it so slow after a few thousand keys?
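
For reference, a check of this kind can be sketched as follows. It assumes a current Python, where the old `whichdb` module lives on as `dbm.whichdb`. The helper name `backend_of` is made up for illustration:

```python
import dbm
import os
import shelve
import tempfile


def backend_of(path):
    """Create a small shelf at `path` and report which dbm module owns it."""
    with shelve.open(path) as shelf:
        shelf["probe"] = 1
    # whichdb returns a module name such as "dbm.gnu" or "dbm.dumb",
    # "" if the format is not recognised, or None if the file cannot be read.
    return dbm.whichdb(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(backend_of(os.path.join(tmp, "probe")))
```

The result depends on which dbm backends the interpreter was built with, so it can differ between platforms and Python versions.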
msg19740 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2004-01-22 00:28
Can't reproduce on Mac OS X.  I tried with 2.2, 2.3 and CVS using
attached test1skip.py (no writeback - 2.2 doesn't support it, no
import pickle - not used, no key prints - just muddies the water,
print whichdb's result).

The times are close enough to not worry me:

montanaro:tmp% time python2.3 test1.py
dbhash

real    0m1.927s
user    0m1.720s
sys     0m0.080s
montanaro:tmp% time python2.2 test1.py
dbhash

real    0m1.250s
user    0m0.850s
sys     0m0.360s
montanaro:tmp% time python test1.py
dbhash

real    0m2.179s
user    0m1.950s
sys     0m0.120s

Please try this modified version just to make sure we are both
looking at the same thing.

msg19741 - (view) Author: Marco Beri (marcoberi) Date: 2004-01-22 07:30
I tried your version: 31.36 seconds vs 0.65.
Just to be sure I tried on three different computers with 
Windows 2000: same gap.

[c:\tmp]timer & \Python23\python test1skip.py & timer
Timer 1 on:  8.21.58
dbhash
Timer 1 off:  8.22.29  Elapsed: 0.00.31,36

[c:\tmp]timer & \Python22\python test1skip.py & timer
Timer 1 on:  8.22.40
dbhash
Timer 1 off:  8.22.41  Elapsed: 0.00.00,65
msg19742 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-01-22 17:29
FYI, on a Win98SE box, test1skip.py took about 30 seconds 
under 2.3.3, and about 1 second under both 2.2.3 and 2.1.3.  
Under 2.3.3, no significant time is taken by a.close(), so it's 
all in the loop.  It prints "dbhash" under all versions.
msg19743 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2004-01-22 18:01
Try test3skip.py.  You run it like this:

    python test3skip.py hashopen
    python test3skip.py btopen

I ran it on win2k under cygwin so I could use the time command 
(but ran the Windows version of Python).  Using btopen was much 
faster.  I got rid of shelve to eliminate it and pickle as possible 
sources of problems.

$ time /cygdrive/c/Python23/python test3skip.py hashopen

real    0m6.801s
user    0m0.015s
sys     0m0.000s

Administrator@CYCLOPS ~/tmp
$ time /cygdrive/c/Python23/python test3skip.py btopen

real    0m0.345s
user    0m0.015s
sys     0m0.015s

I don't know if the relationship between real, user and sys time 
means anything on cygwin, but the reported real times are very 
repeatable and match my subjective feel of the elapsed time.  This 
suggests there's something fishy with either the underlying library 
or with __setitem__ when using hash files.

I'm assigning to Greg so he can take a peek.  As the bsddb/
pybsddb guy he might have some better insight (certainly better 
than me).
msg19745 - (view) Author: Marco Beri (marcoberi) Date: 2004-01-22 18:16
I get the same results as you under a normal cmd: 7.07 seconds vs 
0.46.

[c:\tmp]timer & \python23\python test3skip.py hashopen & timer
Timer 1 on: 19.13.22
Timer 1 off: 19.13.29  Elapsed: 0.00.07,07

[c:\tmp]timer & \python23\python test3skip.py btopen & timer
Timer 1 on: 19.13.45
Timer 1 off: 19.13.45  Elapsed: 0.00.00,46
msg19746 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2004-01-22 18:32
This problem is not specific to windows.  hashopen in the
test3skip.py test case is 10x slower than btopen on my
linux-alpha system.

I don't know why BerkeleyDB hash databases are so much
slower than B-Tree ones.  My best suggestion is:  if it
hurts, don't do that.  Use a btree rather than a hash database.

Running the python process under strace on linux reveals
nothing obvious (no system calls are being made during the
time hash open is consuming lots of cpu).

You'll have to ask sleepycat themselves if you want a real
answer as to why hash databases don't perform well.
msg19747 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-01-22 18:56
The original question is why a BDB hash is some 30x slower 
under 2.3 than under 2.2 or 2.1, and that does appear 
specific to Windows.

Skip threw btrees into this too, but that complication doesn't 
appear relevant to the original report (despite marcoberi's 
hearsay 2004-01-21 18:57 comment -- others posted actual 
output, making clear that dbhash is used under all Python 
versions in test1skip).

I'll note in passing that the test case inserts keys in already-
mostly-sorted order, which is a friendly order for a btree-
based mapping.  To get back to the original report, ignore 
everything here concerning test3skip and btrees.
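
For anyone who does want a fair hash-versus-btree comparison, one way to take insertion order out of the picture is to shuffle the keys before inserting them. This is only a sketch: the helper name `make_keys` and the fixed seed are arbitrary choices made for illustration:

```python
import random


def make_keys(count, shuffled=True, seed=12345):
    """Return `count` distinct string keys, optionally in random order.

    Sorted order favours a btree; a shuffled order does not. The seed is
    fixed only so that repeated runs insert in the same order.
    """
    keys = ["%08d" % i for i in range(count)]
    if shuffled:
        random.Random(seed).shuffle(keys)
    return keys
```

Running the same benchmark once with `shuffled=True` and once with `shuffled=False` would show how much of a btree's advantage comes from the insertion order alone.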
msg19748 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2004-01-22 19:12
python 2.2 and earlier on windows linked against some form
of bsddb 1.85.

python 2.3 and later link against modern BerkeleyDB (not
really related to bsddb 1.85 much at all other than by name
and a legacy api).  They are very different libraries with
very different capabilities and performance.

regardless, i don't have a windows development platform
anymore.  someone who does, please take this.

i suspect this is not something we can fix.  try asking
sleepycat why modern DB_HASH databases might be slower than
bsddb 1.85 hash databases on windows and see what they say.
msg19749 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2004-01-22 20:22
I guess I get similar results on Mac OS X after looking at it a bit.  
The differences are just not as dramatic (or disappointing) as they 
are on Windows.  Here's the output of a little shell script which 
runs test3skip.py with various Python interpreters and Berkeley 
DB versions:

Python version: (2, 4, 0, 'alpha', 0)
Berkeley DB version: 4.2.4
hashopen: 0m1.621s
btopen:   0m0.608s

Python version: (2, 3, 3, 'final', 0)
Berkeley DB version: 4.2.0
hashopen: 0m1.359s
btopen:   0m0.450s

Python version: (2, 2, 0, 'final', 0)
Berkeley DB version: ???
hashopen: 0m0.514s
btopen:   0m0.202s

Only real (wall clock) times are displayed.

Marco,

Unfortunately, there doesn't seem to be much we can do at this
end to remedy the situation with hash files.  If you want to use 
shelve but switch to bsddb.btopen as the underlying db file open 
call, try posting to comp.lang.python.  Anything you do will 
probably be a miserable hack, but we can probably figure 
something out.

msg19750 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2004-01-22 20:28
Whoops, sorry about polluting the waters with the btree stuff.  
Dang time lag.

Looking at just the hashopen times between 2.2, 2.3 and 2.4 does 
show that hash file times have gotten worse since the Berkeley 1.85 
days.

Whether or not btree times muddy these particular waters, 
figuring out a way to switch to a different db type and still use the 
shelve module may be Marco's best bet for a short term 
performance improvement.
msg19751 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-01-22 20:36
Greg, I didn't expect you to fix it <wink>, I just didn't want 
the bug report closed based on misunderstanding what it was 
about.

I've unassigned this item, and if nobody volunteers to dig into 
it within a few weeks, it should indeed be closed as "3rd 
Party" and "Won't Fix".

Skip, maybe we should try to force spambayes to use a btree 
mapping too -- then maybe we could get a whole new class 
of intractable corruption errors <wink -- but it might be a lot 
faster>.
msg19752 - (view) Author: Skip Montanaro (skip.montanaro) * (Python triager) Date: 2004-01-22 21:11
If we wanted speed and didn't care about corruption, my vote 
would be bsddb185. ;-)
msg19753 - (view) Author: James Kew (jkew) Date: 2004-01-22 23:53
FWIW, to throw another use case into the pot: I (used to) 
run Roundup (roundup.sf.net) trackers on anydbm/Win2K and 
experienced a significant drop in performance between 2.2.x 
(bsddb185) and 2.3.x (dbhash).

I understand that this is a third-party issue, and that there 
were significant known problems with bsddb 1.85, but it did 
cause me a bit of a double-take after having heard so much 
about Python 2.3 being faster...

I say "used to" because the slowdown prompted me to 
migrate to Roundup's sqlite backend, solving my problem.
msg19754 - (view) Author: Marco Beri (marcoberi) Date: 2004-01-23 00:08
I get the same results as you under a normal cmd: 7.07 seconds vs 
0.46.

[c:\tmp]timer & \python23\python test3skip.py hashopen & timer
Timer 1 on: 19.13.22
Timer 1 off: 19.13.29  Elapsed: 0.00.07,07

[c:\tmp]timer & \python23\python test3skip.py btopen & timer
Timer 1 on: 19.13.45
Timer 1 off: 19.13.45  Elapsed: 0.00.00,46
msg19755 - (view) Author: James Kew (jkew) Date: 2004-01-23 00:16
FWIW2, on skip's "miserable hack" comment below, vis-a-vis 
running shelve on btree: isn't this exactly the sort of thing 
shelve.Shelf is intended for?

import bsddb
import shelve

db = bsddb.btopen("temp.db")
sh = shelve.Shelf(db)
# do stuff with sh
sh.close()
# automatically calls close() on the underlying db

(Not sure why Shelf and friends are documented on 
shelve's "Restrictions" subsection...)
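
For readers on Python 3, where bsddb is no longer in the standard library, the same pattern can be sketched with a different backend. Here `dbm.dumb` is used purely as a stand-in for `bsddb.btopen`, and the file name and stored value are made up for illustration:

```python
import dbm.dumb
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "temp")

db = dbm.dumb.open(path, "c")   # stand-in for bsddb.btopen("temp.db")
sh = shelve.Shelf(db)
sh["answer"] = [1, 2, 3]        # values are pickled, as with shelve.open
sh.close()                      # also closes the underlying db

# Reopen through the same backend to confirm the data persisted.
sh = shelve.Shelf(dbm.dumb.open(path, "c"))
stored = sh["answer"]
sh.close()
```

Any object with a dict-like interface and byte-string keys should work in place of `dbm.dumb` here, which is what makes `shelve.Shelf` a natural fit for swapping the storage layer.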

msg19756 - (view) Author: Marco Beri (marcoberi) Date: 2004-01-23 00:44
jkew,
I also got a bit of a headache. I was pretty sure performance 
would improve with Python 2.3.3, but instead it got incredibly 
worse.
I know this is perhaps a third-party issue, but I am using a Python 
feature (shelve), and I think it would be better to at least either 
remove it or flag this problem in the documentation.
We are talking about a few thousand keys, not billions!

BTW, I didn't post the previous message twice.
msg19757 - (view) Author: Marco Beri (marcoberi) Date: 2004-01-23 10:01
I gave some wrong info: I didn't try it on Linux, so I'm not so sure 
it's a Windows-specific problem.
Besides this, looking at Greg's 2004-01-22 18:32 comment, it 
seems that the linux-alpha system also has this problem.
Would it be better to change the category to "Python library"?

msg19758 - (view) Author: Marco Beri (marcoberi) Date: 2004-01-23 10:03
I mean: I didn't try with python 2.3 on linux (just with python 
2.2)
msg19759 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2004-04-22 01:39
As threatened months ago, closed as 3rd Party, Won't Fix -- 
there's no sign that this will ever make progress.
msg19760 - (view) Author: Marco Beri (marcoberi) Date: 2005-02-17 13:42
FYI, with Python 2.4 speed is OK again.
So the problem is confined to the 2.3 versions (2.3.5 also has the
slow shelve problem).
History
Date User Action Args
2022-04-11 14:56:02 admin set github: 39844
2004-01-21 17:09:32 marcoberi create