Issue 645629: SequenceMatcher finds suboptimal sequenc

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/37551

classification

Title:	SequenceMatcher finds suboptimal sequenc
Type:		Stage:
Components:	Library (Lib)	Versions:	Python 2.2

process

Status:	closed	Resolution:	wont fix
Dependencies:		Superseder:
Assigned To:	tim.peters	Nosy List:	rhettinger, tim.peters, yeti-dn
Priority:	low	Keywords:

Created on 2002-11-29 10:54 by yeti-dn, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Messages (5)
msg13487 - (view)	Author: David Necas (yeti-dn)	Date: 2002-11-29 10:54
The algorithm used for approximate string matching doesn't find the optimal edit sequence (it finds longest blocks instead). Example: >>> from difflib import SequenceMatcher >>> sm = SequenceMatcher() >>> sm.set_seqs('axfot', 'aoftax') >>> sm.ratio() 0.36363636363636365 >>> sm.get_matching_blocks() [(0, 4, 2), (5, 6, 0)] >>> sm.get_opcodes() [('insert', 0, 0, 0, 4), ('equal', 0, 2, 4, 6), ('delete', 2, 5, 6, 6)] What's wrong? Levenshtein distance with weight 2 for item replacement is only 5 (the weight 2 corresponds to what ratio() is supposed to compute, the classic Levenshtein distance is 4), so one would expect to get similarity (i.e. ratio()) (11-5)/11 = 6/11 = 0.545454545454..., and not only 4/11. And really, the maximal matching blocks are: [(0, 0, 1), (2, 2, 1), (4, 3, 1)] and the minimal edit sequence is: [('equal', 0, 1, 0, 1), ('replace', 1, 2, 1, 2), ('equal', 2, 3, 2, 3), ('delete', 3, 4, 3, 3), ('equal', 4, 5, 3, 4), ('insert', 5, 5, 4, 6)] The impact of this ``feature'' on diff-like applications may be even positive, beause the edit sequence then consists of smaller number of operations on lager chunks. Thus I'm not sure if this is something which should be fixed. However, it should be at least noted in the documentation the ratio() function gives only a lower bound of the string similarity (so people like me won't be tempted to use it to check results of their own Levenshtein distance/string similarity implementation).
msg13488 - (view)	Author: David Necas (yeti-dn)	Date: 2002-11-29 12:58
Logged In: YES user_id=658986 Sorry, I've changed my mind. This definitely should be fixed. In following strings finding `Observation' effectively inhibits finding the much better match: sm.set_seqs('Observation: What seems as a small glitch at the first sight may have large impact', 'What-seems-as-a-small-glitch-at-the-first-sight-may-have-large-impact (Observation)') Unfortunately this probably means complete rewrite, I can't see how the current algorithm could be changed to work in this case (but I don't understand it 100%, so maybe...).
msg13489 - (view)	Author: Tim Peters (tim.peters) *	Date: 2002-11-29 16:03
Logged In: YES user_id=31435 Please read the docs first. This isn't the Levenshtein algorithm, and it doesn't care about finding minimal edit distance.
msg13490 - (view)	Author: David Necas (yeti-dn)	Date: 2002-11-30 04:15
Logged In: YES user_id=658986 OK, I know it's not Levenshtein because I've read the source, and I agree I should really read the docs first and the source after, it's actually written in the docs. However, the docs say `This does not yield minimal edit sequences, but does tend to yield matches that ``look right'' to people.' This is not true -- see my last posting. Or perhaps you really think matching `Observation' looks better to people than matching the smaller parts from the rest of the string (which I strongly doubt). Then please add a note the matches it finds can be arbitrarily bad, at least (e.g., the ratio between optimal and difflib's best match can be as big as (N-1)^2/4N, where N is sequence length (of both sequecnes)). Thank you.
msg13491 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2003-07-12 01:46
Logged In: YES user_id=80475 I looked at possible wording changes but prefer the docs as they stand. Others have found the docs to be perfectly clear. Any more text about worse case ratios to minimal edit distances would, IMO, be a distractor and make the docs less clear. Marking this one as won't fix and closing. We do appreciate the report. Since this is an area of interest for you, consider designing a plug-in replacement using the Levenshtein algorithm and post it to the Vaults of Parnassus or in ActiveState's cookbook. That would provide some benefit to the occassional seeker of a minimal edit sequence.

History
Date	User	Action	Args
2022-04-10 16:05:57	admin	set	github: 37551
2002-11-29 10:54:30	yeti-dn	create