mirror of
https://github.com/git/git.git
synced 2025-12-12 20:36:24 +01:00
xdiff: re-diff shifted change groups when using histogram algorithm
After a diff algorithm has been run, the compaction phase
(xdl_change_compact()) shifts and merges change groups to produce a
cleaner output. However, this shifting could create a new matched group
where both sides now have matching lines. This results in a
wrong-looking diff output which contains redundant lines that are the
same on both files.
Fix this by detecting this situation, and re-diff the texts on each side
to find similar lines, using the fall-back Myer's diff. Only do this for
histogram diff as it's the only algorithm where this is relevant. Below
contains an example, and more details.
For an example, consider two files below:
file1:
A
A
A
A
A
A
A
file2:
A
A
x
A
A
A
A
When using Myer's diff, the algorithm finds that only the "x" has been
changed, and produces a final diff result (these are line diffs, but
using word-diff syntax for ease of presentation):
A A[-A-]{+x+}A AAA
When using histogram diff, the algorithm first discovers the LCS "A
AAA", which it uses as anchor, then produces an intermediate diff:
{+A Ax+}A AAA[- AAA-].
This is a longer diff than Myer's, but it's still self-consistent.
However, the compaction phase attempts to shift the first file's diff
group upwards (note that this shift crosses the anchor that histogram
had used), leading to the final results for histogram diff:
[-A AA-]{+A Ax+}A AAA
This is a technically correct patch but looks clearly redundant to a
human as the first 3 lines should not be in the diff.
The fix would detect that a shift has caused matching to a new group,
and re-diff the "A AA" and "A Ax" parts, which results in "A A"
correctly re-marked as unchanged. This creates the now correct histogram
diff:
A A[-A-]{+x+}A AAA
This issue is not applicable to Myer's diff algorithm as it already
generates a minimal diff, which means a shift cannot result in a smaller
diff output (the default Myer's diff in xdiff is not guaranteed to be
minimal for performance reasons, but it typically does a good enough
job).
It's also not applicable to patience diff, because it uses only unique
lines as anchor for its splits, and falls back to Myer's diff within
each split. Shifting requires both ends having the same lines, and
therefore cannot cross the unique line boundaries established by the
patience algorithm. In contrast histogram diff uses non-unique lines as
anchors, and therefore shifting can cross over them.
This issue is rare in a normal repository. Below is a table of
repositories (`git log --no-merges -p --histogram -1000`), showing how
many times a re-diff was done and how many times it resulted in finding
matching lines (therefore addressing this issue) with the fix. In
general it is fewer than 1% of diff's that exhibit this offending
behavior:
| Repo (1k commits) | Re-diff | Found matching lines |
|--------------------|---------|----------------------|
| llvm-project | 45 | 11 |
| vim | 110 | 9 |
| git | 18 | 2 |
| WebKit | 168 | 1 |
| ripgrep | 22 | 1 |
| cpython | 32 | 0 |
| vscode | 13 | 0 |
Signed-off-by: Yee Cheng Chin <ychin.git@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
This commit is contained in:
committed by
Junio C Hamano
parent
9a2fb147f2
commit
e39a3410c4
@@ -496,6 +496,7 @@ integration_tests = [
|
||||
't4070-diff-pairs.sh',
|
||||
't4071-diff-minimal.sh',
|
||||
't4072-diff-max-depth.sh',
|
||||
't4073-diff-shifted-matched-group.sh',
|
||||
't4100-apply-stat.sh',
|
||||
't4101-apply-nonl.sh',
|
||||
't4102-apply-rename.sh',
|
||||
|
||||
137
t/t4073-diff-shifted-matched-group.sh
Executable file
137
t/t4073-diff-shifted-matched-group.sh
Executable file
@@ -0,0 +1,137 @@
|
||||
#!/bin/sh
|
||||
|
||||
test_description='shifted diff groups re-diffing during histogram diff'
|
||||
|
||||
. ./test-lib.sh
|
||||
|
||||
test_expect_success 'shifted diff group should re-diff to minimize patch' '
|
||||
test_write_lines A x A A A x A A A >file1 &&
|
||||
test_write_lines A x A Z A x A A A >file2 &&
|
||||
|
||||
file1_h=$(git rev-parse --short $(git hash-object file1)) &&
|
||||
file2_h=$(git rev-parse --short $(git hash-object file2)) &&
|
||||
|
||||
cat >expect <<-EOF &&
|
||||
diff --git a/file1 b/file2
|
||||
index $file1_h..$file2_h 100644
|
||||
--- a/file1
|
||||
+++ b/file2
|
||||
@@ -1,7 +1,7 @@
|
||||
A
|
||||
x
|
||||
A
|
||||
-A
|
||||
+Z
|
||||
A
|
||||
x
|
||||
A
|
||||
EOF
|
||||
|
||||
test_expect_code 1 git diff --no-index --histogram file1 file2 >output &&
|
||||
test_cmp expect output
|
||||
'
|
||||
|
||||
test_expect_success 're-diff should preserve diff flags' '
|
||||
test_write_lines a b c a b c >file1 &&
|
||||
test_write_lines x " b" z a b c >file2 &&
|
||||
|
||||
file1_h=$(git rev-parse --short $(git hash-object file1)) &&
|
||||
file2_h=$(git rev-parse --short $(git hash-object file2)) &&
|
||||
|
||||
cat >expect <<-EOF &&
|
||||
diff --git a/file1 b/file2
|
||||
index $file1_h..$file2_h 100644
|
||||
--- a/file1
|
||||
+++ b/file2
|
||||
@@ -1,6 +1,6 @@
|
||||
-a
|
||||
-b
|
||||
-c
|
||||
+x
|
||||
+ b
|
||||
+z
|
||||
a
|
||||
b
|
||||
c
|
||||
EOF
|
||||
|
||||
test_expect_code 1 git diff --no-index --histogram file1 file2 >output &&
|
||||
test_cmp expect output &&
|
||||
|
||||
cat >expect_iwhite <<-EOF &&
|
||||
diff --git a/file1 b/file2
|
||||
index $file1_h..$file2_h 100644
|
||||
--- a/file1
|
||||
+++ b/file2
|
||||
@@ -1,6 +1,6 @@
|
||||
-a
|
||||
+x
|
||||
b
|
||||
-c
|
||||
+z
|
||||
a
|
||||
b
|
||||
c
|
||||
EOF
|
||||
|
||||
test_expect_code 1 git diff --no-index --histogram --ignore-all-space file1 file2 >output_iwhite &&
|
||||
test_cmp expect_iwhite output_iwhite
|
||||
'
|
||||
|
||||
test_expect_success 'shifting on either side should trigger re-diff properly' '
|
||||
test_write_lines a b c a b c a b c >file1 &&
|
||||
test_write_lines a b c a1 a2 a3 b c1 a b c >file2 &&
|
||||
|
||||
file1_h=$(git rev-parse --short $(git hash-object file1)) &&
|
||||
file2_h=$(git rev-parse --short $(git hash-object file2)) &&
|
||||
|
||||
cat >expect1 <<-EOF &&
|
||||
diff --git a/file1 b/file2
|
||||
index $file1_h..$file2_h 100644
|
||||
--- a/file1
|
||||
+++ b/file2
|
||||
@@ -1,9 +1,11 @@
|
||||
a
|
||||
b
|
||||
c
|
||||
-a
|
||||
+a1
|
||||
+a2
|
||||
+a3
|
||||
b
|
||||
-c
|
||||
+c1
|
||||
a
|
||||
b
|
||||
c
|
||||
EOF
|
||||
|
||||
test_expect_code 1 git diff --no-index --histogram file1 file2 >output1 &&
|
||||
test_cmp expect1 output1 &&
|
||||
|
||||
cat >expect2 <<-EOF &&
|
||||
diff --git a/file2 b/file1
|
||||
index $file2_h..$file1_h 100644
|
||||
--- a/file2
|
||||
+++ b/file1
|
||||
@@ -1,11 +1,9 @@
|
||||
a
|
||||
b
|
||||
c
|
||||
-a1
|
||||
-a2
|
||||
-a3
|
||||
+a
|
||||
b
|
||||
-c1
|
||||
+c
|
||||
a
|
||||
b
|
||||
c
|
||||
EOF
|
||||
|
||||
test_expect_code 1 git diff --no-index --histogram file2 file1 >output2 &&
|
||||
test_cmp expect2 output2
|
||||
'
|
||||
|
||||
test_done
|
||||
@@ -793,6 +793,7 @@ static int group_slide_up(xdfile_t *xdf, struct xdlgroup *g)
|
||||
*/
|
||||
int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
|
||||
struct xdlgroup g, go;
|
||||
struct xdlgroup g_orig, go_orig;
|
||||
long earliest_end, end_matching_other;
|
||||
long groupsize;
|
||||
|
||||
@@ -806,6 +807,9 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
|
||||
if (g.end == g.start)
|
||||
goto next;
|
||||
|
||||
g_orig = g;
|
||||
go_orig = go;
|
||||
|
||||
/*
|
||||
* Now shift the change up and then down as far as possible in
|
||||
* each direction. If it bumps into any other changes, merge
|
||||
@@ -915,6 +919,45 @@ int xdl_change_compact(xdfile_t *xdf, xdfile_t *xdfo, long flags) {
|
||||
}
|
||||
}
|
||||
|
||||
/*
|
||||
* If this has a matching group from the other file, it could
|
||||
* either be the original match from the diff algorithm, or
|
||||
* arrived at by shifting and joining groups. When it's the
|
||||
* latter, it's possible for the two newly joined sides to have
|
||||
* matching lines. Re-diff the group to mark these matching
|
||||
* lines as unchanged and remove from the diff output.
|
||||
*
|
||||
* Only do this for histogram diff as its LCS algorithm makes
|
||||
* this scenario possible. In contrast, patience diff finds LCS
|
||||
* of unique lines that groups cannot be shifted across.
|
||||
* Myer's diff (standalone or used as fall-back in patience
|
||||
* diff) already finds minimal edits so it is not possible for
|
||||
* shifted groups to result in a smaller diff. (Without
|
||||
* XDF_NEED_MINIMAL, Myer's isn't technically guaranteed to be
|
||||
* minimal, but it should be so most of the time)
|
||||
*/
|
||||
if (end_matching_other != -1 &&
|
||||
XDF_DIFF_ALG(flags) == XDF_HISTOGRAM_DIFF &&
|
||||
(g.start != g_orig.start ||
|
||||
g.end != g_orig.end ||
|
||||
go.start != go_orig.start ||
|
||||
go.end != go_orig.end)) {
|
||||
xpparam_t xpp;
|
||||
xdfenv_t xe;
|
||||
|
||||
memset(&xpp, 0, sizeof(xpp));
|
||||
xpp.flags = flags & ~XDF_DIFF_ALGORITHM_MASK;
|
||||
|
||||
memcpy(&xe.xdf1, xdf, sizeof(xdfile_t));
|
||||
memcpy(&xe.xdf2, xdfo, sizeof(xdfile_t));
|
||||
|
||||
if (xdl_fall_back_diff(&xe, &xpp,
|
||||
g.start + 1, g.end - g.start,
|
||||
go.start + 1, go.end - go.start)) {
|
||||
return -1;
|
||||
}
|
||||
}
|
||||
|
||||
next:
|
||||
/* Move past the just-processed group: */
|
||||
if (group_next(xdf, &g))
|
||||
|
||||
Reference in New Issue
Block a user