
difflib.GestaltSequenceMatcher implementation #144132

@dg-pb

Description


Feature or enhancement

Proposal:

Increase the quality of results by making it possible to run Longest Common Subsequence matching with autojunk=False with little to no degradation in performance.

The implemented GestaltSequenceMatcher gives, by default, results equivalent to SequenceMatcher(autojunk=False).

| case | sizes | Myers | difflib(aj=1) | difflib(aj=0) | New(aj=0) | New+balancing=2/3 |
|---|---|---|---|---|---|---|
| [L] diff_key | (15, 15) | 8 | 5 (-3) | 5 (-3) | 5 (-3) | 8 (0) |
| [L] diff_mid1 | (28, 27) | 11 | 11 (0) | 11 (0) | 11 (0) | 11 (0) |
| [L] diff_mid2 | (128, 128) | 73 | 72 (-1) | 72 (-1) | 72 (-1) | 72 (-1) |
| [L] diff_mid3 | (131, 131) | 63 | 50 (-13) | 50 (-13) | 50 (-13) | 57 (-6) |
| [L] difflib.py | (2041, 2334) | 1883 | 1878 (-5) | 1879 (-4) | 1879 (-4) | 1879 (-4) |
| [L] linux_sched.c | (11373, 11832) | 10844 | 10793 (-51) | 10837 (-7) | 10837 (-7) | 10844 (0) |
| [L] react.js | (3555, 3337) | 2631 | 2568 (-63) | 2581 (-50) | 2581 (-50) | 2581 (-50) |
| [L] three.js | (36282, 36793) | 30688 | 25877 (-4811) | 30588 (-100) | 30588 (-100) | 30656 (-32) |
| [L] webpack.js | (2327, 3429) | 929 | 772 (-157) | 899 (-30) | 899 (-30) | 916 (-13) |
| Total | | 47130 | 42026 (-5104) | 46922 (-208) | 46922 (-208) | 47024 (-106) |
| Avg % diff | | 0% | 10.58% | 7.25% | 7.25% | 1.61% |
| Total % diff | | 0% | 10.83% | 0.44% | 0.44% | 0.22% |
| Runtime (s) | | 32.451 | 0.694 | 41.517 | 0.292 | 0.362 |

The table above shows the number of individual match points, which is not a definitive metric. A more definitive metric would be the number of alignment points sacrificed for the sake of quality.

However, it can be safely assumed that autojunk=False / the new matcher produces higher-quality diffs by not resorting to approximations. Conveniently, it also increases the number of match points.
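To illustrate what is being counted, the match points for a pair of sequences can be read off SequenceMatcher directly. A sketch with made-up token sequences, not the benchmark inputs above:

```python
import difflib

a = "private Thread currentThread;".split()
b = "private volatile Thread currentThread;".split()

sm = difflib.SequenceMatcher(None, a, b, autojunk=False)
# Sum of matching-block sizes = total number of matched elements
# (the final block is a zero-length sentinel, so it adds nothing).
match_points = sum(size for _, _, size in sm.get_matching_blocks())
print(match_points)  # 3
```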

A new optional balancing argument can be turned on to reduce the chance of greedily committing to unbalanced matches. It produces more concise diffs with more lines matched, while retaining the block-oriented nature.

For visual inspection, the outputs of the different algorithms for a tailored case can be seen HERE!

Legacy

The O(n) solution is not that complex: roughly 100 lines of Python.

I hacked in an initial implementation and ran context_diff on difflib.py itself (i.e. on the change in question).

```python
import difflib, time

with open('/Users/Edu/lib/src/cx_amb/src/cx/cli/snd/diff/difflib.py') as f:
    s1 = f.read().splitlines(keepends=True)

with open('/Users/Edu/lib/src/cx_amb/src/cx/cli/snd/diff/difflib_new.py') as f:
    s2 = f.read().splitlines(keepends=True)

t0 = time.perf_counter()
list(difflib.context_diff(s1, s2, fromfile='before.py', tofile='after.py'))
print(f"{(time.perf_counter() - t0) * 1000:.1f} ms")
# 2.4 ms (Current)
# 2.9 ms (New)
```

So about 20% slower here (2.9 ms vs 2.4 ms).
Also, this is a fairly rough plug-in - there is a chance that the gap can be bridged.
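For anyone wanting to reproduce a comparable measurement without the local files above, here is a self-contained timing sketch on synthetic input (stdlib only; the file paths and the difflib_new module referenced earlier are the author's local setup and are not used here):

```python
import difflib
import random
import time

# Build two similar line sequences with a block of insertions.
random.seed(0)
lines = [f"line {random.randrange(50)}\n" for _ in range(500)]
s1 = lines
s2 = lines[:200] + ["inserted line\n"] * 10 + lines[250:]

t0 = time.perf_counter()
diff = list(difflib.context_diff(s1, s2, fromfile='before.py', tofile='after.py'))
elapsed = time.perf_counter() - t0

print(f"{len(diff)} diff lines in {elapsed * 1000:.1f} ms")
```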

An O(n) solution should avoid some of the user dissatisfaction.

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

I made a couple of posts in an unrelated thread. Performance tables there show that this is faster even for input sizes of 10.
https://discuss.python.org/t/include-time-and-space-complexity-in-documentation/105683/32

Linked PRs

Metadata


    Labels

    performance: Performance or resource usage
    stdlib: Standard Library Python modules in the Lib/ directory
    type-feature: A feature request or enhancement
