This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Faster commonprefix (OS independent)
Type: Stage:
Components: Library (Lib) Versions: Python 2.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: glchapman, nnorwitz, rhettinger, s_keim, tzot
Priority: normal Keywords: patch

Created on 2003-02-06 17:03 by tzot, last changed 2022-04-10 16:06 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
commonprefix.diff tzot, 2003-02-06 17:04 diff -u of ntpath, posixpath
commonprefix2.diff tzot, 2003-02-06 18:02 diff -u
commonprefix_nobuf.diff tzot, 2003-02-07 11:11 diff -u
Messages (8)
msg42702 - (view) Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) * Date: 2003-02-06 17:03
This routine is about 20% faster on a test set of 7 sets 
of strings run 100000 times each (I can provide the test 
if requested).  The longer the common prefix is, the 
faster the routine becomes relative to original 
commonprefix.

My only worry is that it might get rejected if it is 
considered too fancy; therefore I wasn't shy on 
commenting.

I think we should also write a commonpathprefix, that 
will do what commonprefix should do, being in the 
*path.py module.  I'll do that if none other does.

The provided patch is for posixpath.py and ntpath.py, 
but since it's OS neutral it should work as is.  It uses 
itertools for speed, though, so it is not backportable, but 
it can be if requested by substituting map for imap and a 
normal slice for islice.
msg42703 - (view) Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) * Date: 2003-02-06 17:04
Logged In: YES 
user_id=539787

For some reason, my IE never uploads the file on the first 
attempt.
msg42704 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2003-02-06 17:25
Logged In: YES 
user_id=33168

As much as I'd like to blame IE, it's a SF bug AFAIK.  
http://sf.net/tracker/?func=detail&atid=200001&aid=675910&group_id=1
msg42705 - (view) Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) * Date: 2003-02-06 18:02
Logged In: YES 
user_id=539787

Best case: comparing this to the old version with a list: 
['/usr/local/lib/python2.3/posixpath.py']*120, 10000 iterations, 
the speed difference is:
old: 319.58 sec
new: 34.43 sec

Since prefix_len always grows in the "while next_bit:" loop, 
applying commonprefix2.diff to the *patched* version does a 
very minor speedup (comparing smaller buffers in every 
iteration); but it is only a matter of overoptimisation (ie it does 
not hurt, but it's a trivial one, just 0.1%).
msg42706 - (view) Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) * Date: 2003-02-07 11:11
Logged In: YES 
user_id=539787

I did my homework better, and found out that the buffer object 
quite probably will be deprecated.  So I rewrote the routine 
without the buffer object (using str.startswith), which by the 
way got another 10% speedup (relative to the latest version 
using buffer.)
The commonprefix_nobuf.diff patch applies directly to the 
original posixpath.py, ntpath.py.  I will try to delete the other 
patches, but I don't think I am allowed to do it.
msg42707 - (view) Author: Sebastien Keim (s_keim) Date: 2003-03-04 08:01
Logged In: YES 
user_id=498191

I would suggest another possibility. This one use a property
of strings 
ordering: if you have a<=b<=c and c.startswith(a) then
b.startswith(a).

I have tested two implementations :

# a 5 lines function with a really straightforward  code.
# It can degenerate rather badly in the worst case (large
strings 
# with a short common prefix) but is generally quite fast.
def commonprefix1(m):
    if not m: return ''
    prefix, greater = min(m), max(m)
    while not greater.startswith(prefix):
	prefix = prefix[:-1]
    return prefix

# The second use a bissection to avoid the worst case. This make
# the implementation a little more complex but seems to
provide the
# fastest result.
def commonprefix2(m):
    prefix = ''
    if m:
	low, high = min(m), max(m)
	while low:
	    n = len(low)//2 + 1
	    l, h = low[:n], high[:n]
	    if h==l:
		prefix += l
		low, high = low[n:], high[n:]
	    else:
		low, high = l[:-1], h
    return prefix
      
I personally prefer the commonprefix1 implementation: its
the simplest one and it is probably fast enough for the few
commonprefix use-cases (anyway, it is still faster than the
current implementation).      
msg42708 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2003-12-31 22:47
Logged In: YES 
user_id=80475

Applied Terry Reedy's solution to Lib/posixpath.py 1.64.

His solution is at:

http://groups.google.com/groups?
th=fc7b54f11af6b24e&seekm=bss2so$om$5@news.t-
online.com
msg42709 - (view) Author: Greg Chapman (glchapman) Date: 2004-08-15 23:48
Logged In: YES 
user_id=86307

Sorry for commenting on a closed item, but it appears the
patched commonprefix has not made it beyond posixpath.py (at
any rate, it's  not in ntpath.py).  Perhaps the change can
be applied to the other path modules?
History
Date User Action Args
2022-04-10 16:06:35adminsetgithub: 37925
2003-02-06 17:03:54tzotcreate