Issue 1234674: filecmp.cmp's "shallow" option

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/42172

classification

Title:	filecmp.cmp's "shallow" option
Type:	behavior	Stage:	patch review
Components:	Library (Lib)	Versions:	Python 3.5

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:	fdrake	Nosy List:	Andrew.Kubera, Steven.Barker, chris.jerdonek, fdrake, georg.brandl, goatsofmendez, pvrijlandt, rhettinger, sandro.tosi
Priority:	normal	Keywords:	patch

Created on 2005-07-08 09:01 by goatsofmendez, last changed 2022-04-11 14:56 by admin.

Files
File name	Uploaded	Description	Edit
filecmp_real_shallow.diff	Steven.Barker, 2014-05-06 03:52		review
filecmp_extra_shallow_file_check.diff	Andrew.Kubera, 2014-07-27 04:13	Patch of test_filecmp.py which ensures the content of files with same os.stat() signatures are not compared when the shallow option is True	review
filecmp_test_patch.diff	Steven.Barker, 2014-08-03 23:52	patch to add tests (and to fix tests broken by behavior patch)	review
filecmp_behavior_and_doc_fix.diff	Steven.Barker, 2014-08-03 23:52	Fix for filecmp.cmp behavior and its documentation	review

Messages (12)
msg25751 - (view)	Author: Mendez (goatsofmendez)	Date: 2005-07-08 09:01
The filecmp.cmp function has a shallow option (set as default) to only compare files based on stats rather than doing a bit by bit comparison of the file itself. The relevant bit of the code follows. s1 = _sig(os.stat(f1)) s2 = _sig(os.stat(f2)) if s1[0] != stat.S_IFREG or s2[0] != stat.S_IFREG: return False if shallow and s1 == s2: return True if s1[1] != s2[1]: return False result = _cache.get((f1, f2)) if result and (s1, s2) == result[:2]: return result[2] outcome = _do_cmp(f1, f2) _cache[f1, f2] = s1, s2, outcome return outcome There's a check to see if the shallow mode is enabled and if that's the case and the stats match it returns true but the test for returning false is for only one of the stats attributes meaning that it's possible for a file not to match either test and the function to carry on to the full file check lower down. The check above to see if the stats match with stat.S_IFREG also looks wrong to me, but it could just be I don't understand what he's trying to do :) This code is the same in both 2.3 and 2.4
msg25752 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2005-08-26 08:19
Logged In: YES user_id=1188172 Hm. Looks like if the size is identical, but the mtime is not, the file will be read even in shallow mode. The filecmp docs say "Unless shallow is given and is false, files with identical os.stat() signatures are taken to be equal." The filecmp.cmp docstring says "shallow: Just check stat signature (do not read the files)" Two questions arise: - Should the file contents be compared in shallow mode if the mtimes do not match? - Should the mtimes matter in non-shallow mode?
msg25753 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2005-08-26 08:33
Logged In: YES user_id=80475 Fred, do you have thoughts on this patch?
msg25754 - (view)	Author: Mendez (goatsofmendez)	Date: 2005-08-26 11:38
Logged In: YES user_id=1309441 On my own system I've modified the testing code as follows if s1 != s2: return False if shallow and s1 == s2: return True Which works as I expected it to work If attributes aren't identical - Always fails If the attributes are identical and it's in shallow mode - returns true If the attributes are identical and it's not in shallow mode - goes on to check if the files are byte identical. Whether there should be additional modes for finding byte identical files with different names, attributes etc. is another matter.
msg25755 - (view)	Author: Patrick Vrijlandt (pvrijlandt)	Date: 2005-08-28 20:24
Logged In: YES user_id=1307917 The cache is buggy too; it can be fooled by changing the cwd between calls. Also, if it caches (a, b) it does not cache (b, a). Is this caching feature really useful? Maybe it's up to the caller to record what comparisons have been made. If a caching function is to be provided, it should recognize the (b, a) case and maybe also infer from (a, b) and (b, c) about (a, c). (My programs remember an md5-signature) I propose to eliminate the caching feature. Also I propose to publish cmp_shallow and cmp_bytes which I think will make the unit easier to understand and verify. cmp_bytes is of course _do_cmp, with a check added for length equality. Then, the docs should be updated to whatever implementation for cmp is chosen, because non-shallow comparison is ill-specified. cmp should return a bool. what should cmp do with funny input like a dir? It now returns false. Would an exception be better? BTW: is it really useful in _do_cmp to have a local variable bufsize? is 8k still optimal?
msg25756 - (view)	Author: Patrick Vrijlandt (pvrijlandt)	Date: 2005-08-28 22:16
Logged In: YES user_id=1307917 "cmp should return a bool" refers to the modules docstring. On other places, it does already. The default value for "shallow" should be a bool too.
msg146682 - (view)	Author: Sandro Tosi (sandro.tosi) *	Date: 2011-10-31 10:04
Hello, we recently received a this message on docs@ : http://mail.python.org/pipermail/docs/2011-October/006121.html that's actually related to this issue: how can we move it forward?
msg216937 - (view)	Author: Steven Barker (Steven.Barker) *	Date: 2014-04-21 08:30
A recent Stack Overflow question (http://stackoverflow.com/q/23192359/1405065) relates to this bug. The questioner was surprised that filecmp.cmp is much slower than usual for certain large files, despite the "shallow" parameter being True. It is pretty clear to me that the behavior of filecmp.cmp does not match its docstring with respect to "shallow". Either the docstring should be updated to match the existing behavior, or (more likely?) the behavior should change to match the docs.
msg217965 - (view)	Author: Steven Barker (Steven.Barker) *	Date: 2014-05-06 03:52
Here's a patch against the default branch that fixes filecmp.cmp's behavior when "shallow" is True, including an update to the module's docs (correcting the previous ambiguity discussed in the 2011 python-dev thread mentioned by Sandro Tosi) and a couple of new tests of files where shallow comparisons should be expected get the answer wrong. The changed behavior is very simple. The lines from Lib/filecmp.py in the original code: if shallow and s1 == s2: return True are changed to: if shallow: return s1 == s2 This presumes answers to the questions asked by Georg Brandl back in 2005 of "No, shallow comparisons never read the file contents" and "No, non-shallow comparisons don't care about mtimes (except for avoiding out of date cache entries)". If we only applied the test changes (not the behavior change) from my patch, one of the new tests would fail, as the current filecmp.cmp code never gives false negatives on comparisons in shallow mode (only false positives). We probably don't want to commit the patch exactly as it stands, because the changed behavior causes several of the previously existing tests to fail. Almost all the dircmp tests fail for me, and the "test_matching" filecmp.cmp test does so intermittently (due to my system sometimes being fast enough at creating the files to beat my filesystem's mtime resolution). The test failures are all because of shallow comparisons between files with the same contents, but (slightly) different mtimes. The new shallow comparison behavior is to say those files are unequal while the old code would fall back on testing the file contents despite the shallow parameter, and so it would return True. The failing tests can probably be updated to either explicitly set the file mtimes with sys.utime (as I do with the files used in the new tests), or we could rewrite some of the setup code to use something like shutil.copy2 to make copies of files while preserving their mtimes. I'm not very familiar with the best practices when it comes to writing unit tests, so I pretty much copied the style of the existing tests in Lib/test/test_filecmp.py (though I've split mine up a bit more). I understand from reading other filecmp bug reports that that test style is considered to be pretty poor, so there's probably room to improve my code as well. I thought I'd post this patch before making an effort at fixing the older tests, in order to get some feedback. I'll be happy to incorporate any suggested improvements! This issue has been open for almost 9 years. Hopefully my patch can help get it moving again towards being fixed!
msg224101 - (view)	Author: Andrew Kubera (Andrew.Kubera)	Date: 2014-07-27 04:13
Attached is a couple extra tests which run filecmp on two files with different content but the same length and relevant stat() info. This appears to successfully check if the shallow options works correctly. It uses time.sleep(1) to ensure the files start out with different creation times, checks that filecmp returns false, then sets the creation time with os.utime to ensure filecmp is true. These tests currently run successfully on 3.5, so if there was an issue, it has been resolved.
msg224128 - (view)	Author: Steven Barker (Steven.Barker) *	Date: 2014-07-27 13:22
I think that your test patch misses the confusing/possibly wrong case. That case is when two files have the same contents, but different mtimes. If you attempt a shallow comparison, you'll actually get a deep comparison (reading the whole files) and a result of True rather than the expected (though incorrect) False. Try the test part of my patch (without the behavior change), and you'll see the failure of "test_shallow_false_negative". In your first assertion (that "filecmp.cmp(self.name, self.name_uppercase)" is False), you get the expected result, but for the wrong reason (you get it because the file contents differ, not because they have different mtimes). Now, it might be that the "bad" case is actually working as we want it to (or at least, the behavior is established enough that we don't want to change it, for risk of breaking running code). If so, we should instead change the documentation (and especially the docstring) to explicit state that even if you request a shallow comparison, you might get a deep comparison instead.
msg224668 - (view)	Author: Steven Barker (Steven.Barker) *	Date: 2014-08-03 23:52
I've worked on this filecmp issue some more, and I have some new patches. First up is a patch that only modifies the tests. It has one test that fails without the behavior patch. The test patch also modifies some other tests so that they will work after the behavior patch is applied. Notably, test_keyword had a some tests that would fail due to false negatives on shallow comparisons. The second patch is the behavior and documentation changes required to actually fix this issue. This should be very simple to understand (the behavior change is only two lines of code, quoted in the discussion above). If you apply only this patch, you'll get several test failures, all due to false negative shallow comparisons (where two files have the same contents, but their stat signatures differ). With these new patches, I think this issue is ready for a review, and eventually to be committed. The behavior change is simple and I think, obviously correct (assuming we want to risk breaking backwards compatibility). Perhaps my test code can be improved, but I don't think it's too bad. So, the main question is "will too much outside code break if we make this behavior change?" I don't think filecmp is used very widely, but as was demonstrated by the standard library's only use of filecmp (in Lib/test/test_keyword.py), it's likely that a lot of users perform shallow comparisons (by default) where they really don't want to get false-negatives. If we decide that the changed behavior is too big of a break of backwards compatibility, we just need to document the current behavior better (at a minimum, the docstring for filecmp.cmp must be corrected).

History
Date	User	Action	Args
2022-04-11 14:56:12	admin	set	github: 42172
2014-08-14 05:09:22	berker.peksag	set	stage: test needed -> patch review
2014-08-03 23:52:31	Steven.Barker	set	files: + filecmp_behavior_and_doc_fix.diff
2014-08-03 23:52:17	Steven.Barker	set	files: + filecmp_test_patch.diff messages: + msg224668
2014-07-27 13:22:46	Steven.Barker	set	messages: + msg224128
2014-07-27 04:13:09	Andrew.Kubera	set	files: + filecmp_extra_shallow_file_check.diff versions: - Python 2.7, Python 3.2, Python 3.3, Python 3.4 nosy: + Andrew.Kubera messages: + msg224101
2014-05-06 03:52:33	Steven.Barker	set	files: + filecmp_real_shallow.diff keywords: + patch messages: + msg217965 versions: + Python 3.5
2014-04-21 08:30:17	Steven.Barker	set	versions: + Python 3.4 nosy: + Steven.Barker messages: + msg216937 components: + Library (Lib), - Extension Modules
2012-07-27 18:37:24	chris.jerdonek	set	nosy: + chris.jerdonek
2011-10-31 10:04:54	sandro.tosi	set	nosy: + sandro.tosi messages: + msg146682 versions: + Python 3.3, - Python 3.1
2010-08-21 18:27:35	BreamoreBoy	set	versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6
2009-02-16 00:27:32	ajaksu2	set	stage: test needed type: behavior versions: + Python 2.6, - Python 2.4
2005-07-08 09:01:20	goatsofmendez	create