
classification
Title: faster os.walk
Type:                          Stage:
Components: Library (Lib)      Versions: Python 2.3

process
Status: closed                 Resolution: rejected
Dependencies:                  Superseder:
Assigned To:                   Nosy List: felixlee, jepler, loewis, mcherm
Priority: normal               Keywords: patch

Created on 2004-07-19 20:04 by felixlee, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name    Uploaded                       Description
walk4.py     felixlee, 2004-07-19 20:04     standalone module with a faster walk()
Messages (8)
msg46398 - (view) Author: Felix Lee (felixlee) Date: 2004-07-19 20:04
On Unix-like filesystems you can get a substantial speedup for directory
traversal by looking at a directory's st_nlink count to find out how
many subdirectories it contains.  Once you've seen that many subdirs,
you know the rest of the entries have to be nondirectory files, so you
don't have to stat() them.  This is often a big win because most people
put most of their files in leaf directories.
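A minimal sketch of that trick (illustrative only, not the attached
walk4.py), assuming the traditional Unix rule that a directory's link
count is 2 plus its number of subdirectories:

    import os, stat

    def classify_entries(top):
        """Split the entries of `top` into (dirs, nondirs), using st_nlink
        to stop calling stat() once every subdirectory has been found."""
        dirs, nondirs = [], []
        # "." and the directory's own entry in its parent account for 2
        # links; each subdirectory's ".." entry adds one more.
        remaining = os.lstat(top).st_nlink - 2
        for name in os.listdir(top):
            if remaining > 0:
                st = os.lstat(os.path.join(top, name))
                if stat.S_ISDIR(st.st_mode):
                    dirs.append(name)
                    remaining -= 1
                else:
                    nondirs.append(name)
            else:
                # Link count exhausted: the rest cannot be directories.
                nondirs.append(name)
        return dirs, nondirs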

For my home directory, using the link count reduces the traversal from
100,000 stat calls to 20,000 stat calls, which cuts execution time by
about 70% when all the filesystem data is already in RAM.  I think the
speedup is larger when the data has to be read from disk, but I don't
have an easy way of measuring that.

The GNU "find" program does this optimization, and so does Perl's
File::Find.

The attached file is a proof-of-concept implementation.  If this seems
like a reasonable thing to do, I can rewrite it as a patch with
appropriate documentation.

There are a few problems with it.

1. It needs to be tested on OSes I don't have.  It uses a simple
heuristic to decide whether the link count can be used or not, so it
shouldn't be necessary to add OS-specific tests, but I don't know how
perverse other OSes are.

2. To do the nlink optimization, links to directories have to be
considered files rather than directories.  This is different from the
existing behavior, where links to directories are returned as
directories but never entered.  I don't think the existing behavior is
useful, but it seems like a bad idea to change it, since there may be
things that depend on it.

The simple way of handling this is to make use_nlink=0 the default, so
the speedup and the changed behavior happen only when someone
specifically requests them.  I don't really like this, because I think
it's rare that use_nlink=0 is desirable if the heuristic is reliable.

Another way of handling this is to make a new function instead of
modifying os.walk.  A new function could also have the option of
following links without getting stuck in loops.

3. The faster walk is still noticeably slower than list(popen("find")).
(Perl has this problem too.)  There isn't really much to be done about
that; it's mostly just interpreter overhead.  This speed difference
probably isn't significant in most cases, and cross-platform portability
is a good reason for using walk instead of popen('find').
msg46399 - (view) Author: Michael Chermside (mcherm) (Python triager) Date: 2004-07-21 12:05
Okay, I like the idea (haven't read the patch yet). Your
problem 1 will just take time, and problem 3 isn't really a
problem. But problem 2 is a real issue. I think having a new
function is not worth it (or rather, I'd say, it's worth
having, just not in the standard library). I think having a
hardly-ever-used option "use_nlink" is a non-starter. So
despite liking the idea, I'm stimied^H^H stymied^H^H
stimyed^H^H confounded.
msg46400 - (view) Author: Jeff Epler (jepler) Date: 2004-07-21 15:52
GNU find also uses chdir() to descend into the directory hierarchy,
which means that all its "stat" calls use filenames within the current
directory.  Clearly, Python's os.walk can't do this, because the working
directory is shared among threads.  Disaster would result.  But, anyway,
this is another reason that os.walk is slower than find.
msg46401 - (view) Author: Felix Lee (felixlee) Date: 2004-07-21 21:37
The chdir() optimization seems to be mostly insignificant on modern
systems (I'm doing timings on Linux 2.6).  Perl will do chdir too, and
it's only slightly faster than Python if nlinks gets used, and the
amount of system time used is pretty indistinguishable among all the
versions.

I suspect RAM is cheap enough that large dir->inode caches are easy and
fast.  The remaining difference in speed between GNU find and a Perl or
Python version is in user time, and I think most of that can be
attributed to interpreter overhead.  Crossing a function call boundary
100,000 times is hard to make fast in any interpreted language.

The nlink optimization is significant; I think that's worth doing, but
I don't think any other optimization is.
msg46402 - (view) Author: Felix Lee (felixlee) Date: 2004-07-21 22:35
Thinking about it some more, a separate function is probably a good
idea.  At the moment I'm calling it 'lwalk' (by analogy with
stat/lstat).  It should be usable on systems without links too.  I'll
attach it in a day or two.
msg46403 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2004-07-25 18:03
I dislike the idea.  There are file systems which give "incorrect" link
counts, most prominently Microsoft's "Server for NFS".  They report two
links for each directory, and after subtracting the "." entry and the
parent link, you find that there are never any subdirectories.  This
causes scripts to break in mysterious ways.  Fortunately, GNU find has
an option to suppress this optimization, but I'd like to avoid getting
into the same problems with Python.

Interestingly, Samba/smbfs also violates the link count
rule, reporting 1 link for each directory. Subtracting 2
gives -1 subdirectories, which GNU find, by coincidence,
essentially treats as "infinite number of subdirectories".

It would be much better to use the "file type" feature of
getdents/readdir where available, reporting an "unknown" file type where
it isn't available.  This makes it possible to avoid any stat calls.
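For reference, Python did not expose the readdir file type at the time;
later versions (3.5 and up) surface it through os.scandir(), whose
DirEntry.is_dir() can usually answer without a separate stat() call.  A
rough sketch of that approach:

    import os

    def classify_entries_scandir(top):
        """Split the entries of `top` into (dirs, nondirs) using
        os.scandir(), which reports the file type straight from readdir
        where the OS provides it."""
        dirs, nondirs = [], []
        with os.scandir(top) as it:
            for entry in it:
                # follow_symlinks=False treats symlinks to directories as
                # non-directories, matching the behavior discussed above.
                if entry.is_dir(follow_symlinks=False):
                    dirs.append(entry.name)
                else:
                    nondirs.append(entry.name)
        return dirs, nondirs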
msg46404 - (view) Author: Felix Lee (felixlee) Date: 2004-07-25 18:45
Logged In: YES 
user_id=77867

My heuristic works correctly for the cases you describe.  It never
trusts st_nlink < 2, and before it trusts st_nlink == 2, it checks one
directory to see if st_nlink correctly describes the number of subdirs.
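A rough sketch of that validation step, inferred from the description
above rather than taken from walk4.py:

    import os, stat

    def nlink_looks_trustworthy(probe_dir):
        """Return True if st_nlink on `probe_dir` equals 2 plus the
        number of subdirectories found by stat()ing every entry."""
        st = os.lstat(probe_dir)
        if st.st_nlink < 2:
            return False  # never trust a link count below 2
        subdirs = 0
        for name in os.listdir(probe_dir):
            mode = os.lstat(os.path.join(probe_dir, name)).st_mode
            if stat.S_ISDIR(mode):
                subdirs += 1
        return st.st_nlink == subdirs + 2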

The version in walk4.py will get confused if it crosses a mount point
from a filesystem where st_nlink is accurate to a filesystem where it
isn't.  That isn't hard to deal with either, but the usual way to check
for crossing a filesystem boundary is to look at st_dev, and I don't
have any idea how reliable that is for non-Unix systems.  It seems
plausible to me that any system where st_dev doesn't work will also have
an st_nlink that doesn't work, which will be detected properly, so that
might not be a problem.
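The st_dev check mentioned here amounts to comparing device numbers
between a directory and its parent; a minimal sketch (hypothetical
helper name):

    import os

    def crosses_filesystem(parent, child):
        """Return True if `child` is on a different filesystem than
        `parent`, i.e. the walk has crossed a mount point."""
        return os.lstat(parent).st_dev != os.lstat(child).st_dev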

This makes it more system-dependent than I'd like.  It might be better
to call it something like os.posix.lwalk.

The file type returned by getdirentries isn't exposed by any Python
function.  Exposing it would require adding C code, which may be worth
doing, but it seems to me that's a separate issue.  None of the OSes I
use will supply that information, so I'm probably not the right person
to look at this.
msg46405 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2004-07-25 20:47
Based on the possibility of the heuristic still being wrong, I'm
rejecting this patch.  As it is pure Python, it should be easy enough to
incorporate into applications that really need the extra performance and
know for sure that the heuristic is right in their case.  So you might
consider publishing it in the Python Cookbook, or releasing it as a
separate module somewhere on the net.
History
Date                 User      Action   Args
2022-04-11 14:56:05  admin     set      github: 40598
2004-07-19 20:04:00  felixlee  create