
classification
Title: faster os.walk
Type:                          Stage:
Components: Library (Lib)      Versions: Python 2.3

process
Status: closed                 Resolution: rejected
Dependencies:                  Superseder:
Assigned To:                   Nosy List: felixlee, jepler, loewis, mcherm
Priority: normal               Keywords: patch

Created on 2004-07-19 20:04 by felixlee, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name    Uploaded                       Description
walk4.py     felixlee, 2004-07-19 20:04     standalone module with a faster walk()
Messages (8)
msg46398 - (view) Author: Felix Lee (felixlee) Date: 2004-07-19 20:04
On Unix-like filesystems you can get a substantial speedup for directory
traversal by looking at a directory's st_nlink count to find out how
many subdirectories it contains.  Once you've seen that many subdirs,
you know the rest of the entries have to be nondirectory files, so you
don't have to stat() them.  This is often a big win because most people
put most of their files in leaf directories.
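A minimal sketch of that trick (illustrative only, not the attached
walk4.py), assuming the traditional Unix rule that a directory's link
count is 2 plus its number of subdirectories:

    import os, stat

    def classify_entries(top):
        """Split the entries of `top` into (dirs, nondirs), using st_nlink
        to stop calling stat() once every subdirectory has been found."""
        dirs, nondirs = [], []
        # "." and the directory's own entry in its parent account for 2
        # links; each subdirectory's ".." entry adds one more.
        remaining = os.lstat(top).st_nlink - 2
        for name in os.listdir(top):
            if remaining > 0:
                st = os.lstat(os.path.join(top, name))
                if stat.S_ISDIR(st.st_mode):
                    dirs.append(name)
                    remaining -= 1
                else:
                    nondirs.append(name)
            else:
                # Link count exhausted: the rest cannot be directories.
                nondirs.append(name)
        return dirs, nondirs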

For my home directory, using the link count reduces the traversal from
100,000 stat calls to 20,000 stat calls, which cuts execution time by
about 70% when all the filesystem data is already in RAM.  I think the
speedup is larger when the data has to be read from disk, but I don't
have an easy way of measuring that.

The GNU "find" program does this optimization, and so does Perl's
File::Find.

The attached file is a proof-of-concept implementation.  If this seems
like a reasonable thing to do, I can rewrite it as a patch with
appropriate documentation.

There are a few problems with it.

1. It needs to be tested on OSes I don't have.  It uses a simple
heuristic to decide whether the link count can be used or not, so it
shouldn't be necessary to add OS-specific tests, but I don't know how
perverse other OSes are.

2. To do the nlink optimization, links to directories have to be
considered files rather than directories.  This is different from the
existing behavior, where links to directories are returned as
directories but never entered.  I don't think the existing behavior is
useful, but it seems like a bad idea to change it, since there may be
things that depend on it.

The simple way of handling this is to make use_nlink=0 the default, so
the speedup and the changed behavior happen only when someone
specifically requests them.  I don't really like this, because I think
it's rare that use_nlink=0 is desirable if the heuristic is reliable.

Another way of handling this is to make a new function instead of
modifying os.walk.  A new function could also have the option of
following links without getting stuck in loops.

3. The faster walk is still noticeably slower than list(popen("find")).
(Perl has this problem too.)  There isn't really much to be done about
that; it's mostly just interpreter overhead.  This speed difference
probably isn't significant in most cases, and cross-platform portability
is a good reason for using walk instead of popen('find').
msg46399 - (view) Author: Michael Chermside (mcherm) (Python triager) Date: 2004-07-21 12:05
Okay, I like the idea (haven't read the patch yet). Your
problem 1 will just take time, and problem 3 isn't really a
problem. But problem 2 is a real issue. I think having a new
function is not worth it (or rather, I'd say, it's worth
having, just not in the standard library). I think having a
hardly-ever-used option "use_nlink" is a non-starter. So
despite liking the idea, I'm stimied^H^H stymied^H^H
stimyed^H^H confounded.
msg46400 - (view) Author: Jeff Epler (jepler) Date: 2004-07-21 15:52
GNU find also uses chdir() to descend into the directory hierarchy,
which means that all its "stat" calls use filenames within the current
directory.  Clearly, Python's os.walk can't do this, because the working
directory is shared among threads.  Disaster would result.  But, anyway,
this is another reason that os.walk is slower than find.
msg46401 - (view) Author: Felix Lee (felixlee) Date: 2004-07-21 21:37
The chdir() optimization seems to be mostly insignificant on modern
systems (I'm doing timings on Linux 2.6).  Perl will do chdir too, and
it's only slightly faster than Python if nlinks gets used, and the
amount of system time used is pretty indistinguishable among all the
versions.

I suspect RAM is cheap enough that large dir->inode caches are easy and
fast.  The remaining difference in speed between GNU find and a Perl or
Python version is in user time, and I think most of that can be
attributed to interpreter overhead.  Crossing a function call boundary
100,000 times is hard to make fast in any interpreted language.

The nlink optimization is significant; I think that's worth doing, but
I don't think any other optimization is.
msg46402 - (view) Author: Felix Lee (felixlee) Date: 2004-07-21 22:35
Thinking about it some more, a separate function is probably a good
idea.  At the moment I'm calling it 'lwalk' (by analogy with
stat/lstat).  It should be usable on systems without links too.  I'll
attach it in a day or two.
msg46403 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2004-07-25 18:03
I dislike the idea.  There are file systems which give "incorrect" link
counts, most prominently Microsoft's "Server for NFS".  They report two
links for each directory, and after subtracting the "." entry and the
parent link, you find that there are never any subdirectories.  This
causes scripts to break in mysterious ways.  Fortunately, GNU find has
an option to suppress this optimization, but I'd like to avoid getting
into the same problems with Python.

Interestingly, Samba/smbfs also violates the link count
rule, reporting 1 link for each directory. Subtracting 2
gives -1 subdirectories, which GNU find, by coincidence,
essentially treats as "infinite number of subdirectories".

It would be much better to use the "file type" feature of
getdents/readdir where available, reporting an "unknown" file type where
it isn't available.  This makes it possible to avoid any stat calls.
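For reference, Python did not expose the readdir file type at the time;
later versions (3.5 and up) surface it through os.scandir(), whose
DirEntry.is_dir() can usually answer without a separate stat() call.  A
rough sketch of that approach:

    import os

    def classify_entries_scandir(top):
        """Split the entries of `top` into (dirs, nondirs) using
        os.scandir(), which reports the file type straight from readdir
        where the OS provides it."""
        dirs, nondirs = [], []
        with os.scandir(top) as it:
            for entry in it:
                # follow_symlinks=False treats symlinks to directories as
                # non-directories, matching the behavior discussed above.
                if entry.is_dir(follow_symlinks=False):
                    dirs.append(entry.name)
                else:
                    nondirs.append(entry.name)
        return dirs, nondirs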
msg46404 - (view) Author: Felix Lee (felixlee) Date: 2004-07-25 18:45
Logged In: YES 
user_id=77867

My heuristic works correctly for the cases you describe.  It never
trusts st_nlink < 2, and before it trusts st_nlink == 2, it checks one
directory to see if st_nlink correctly describes the number of subdirs.
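A rough sketch of that validation step, inferred from the description
above rather than taken from walk4.py:

    import os, stat

    def nlink_looks_trustworthy(probe_dir):
        """Return True if st_nlink on `probe_dir` equals 2 plus the
        number of subdirectories found by stat()ing every entry."""
        st = os.lstat(probe_dir)
        if st.st_nlink < 2:
            return False  # never trust a link count below 2
        subdirs = 0
        for name in os.listdir(probe_dir):
            mode = os.lstat(os.path.join(probe_dir, name)).st_mode
            if stat.S_ISDIR(mode):
                subdirs += 1
        return st.st_nlink == subdirs + 2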

The version in walk4.py will get confused if it crosses a mount point
from a filesystem where st_nlink is accurate to a filesystem where it
isn't.  That isn't hard to deal with either, but the usual way to check
for crossing a filesystem boundary is to look at st_dev, and I don't
have any idea how reliable that is for non-Unix systems.  It seems
plausible to me that any system where st_dev doesn't work will also have
an st_nlink that doesn't work, which will be detected properly, so that
might not be a problem.
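The st_dev check mentioned here amounts to comparing device numbers
between a directory and its parent; a minimal sketch (hypothetical
helper name):

    import os

    def crosses_filesystem(parent, child):
        """Return True if `child` is on a different filesystem than
        `parent`, i.e. the walk has crossed a mount point."""
        return os.lstat(parent).st_dev != os.lstat(child).st_dev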

This makes it more system-dependent than I'd like.  It might be better
to call it something like os.posix.lwalk.

The file type returned by getdirentries isn't exposed by any Python
function.  Exposing it would require adding C code, which may be worth
doing, but it seems to me that's a separate issue.  None of the OSes I
use will supply that information, so I'm probably not the right person
to look at this.
msg46405 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2004-07-25 20:47
Based on the possibility of the heuristic still being wrong, I'm
rejecting this patch.  As it is pure Python, it should be easy enough to
incorporate into applications that really need the extra performance and
know for sure that the heuristic is right in their case.  So you might
consider publishing it in the Python Cookbook, or releasing it as a
separate module somewhere on the net.
History
Date                 User      Action   Args
2022-04-11 14:56:05  admin     set      github: 40598
2004-07-19 20:04:00  felixlee  create