274 Commits

Author SHA1 Message Date
Junio C Hamano 5830a84a14 Merge branch 'mm/diff-U-takes-no-negative-values'
The command line parser for "git diff" learned a few options take
only non-negative integers.

* mm/diff-U-takes-no-negative-values:
  parse-options: clarify what "negated" means for PARSE_OPT_NONEG
  xdiff: guard against negative context lengths
  diff: reject negative values for -U/--unified
  diff: reject negative values for --inter-hunk-context
2026-05-25 09:40:07 +09:00
Junio C Hamano 93f0e872a8 Merge branch 'pw/xdiff-shrink-memory-consumption'
Shrink wasted memory in Myers diff that does not account for common
prefix and suffix removal.

* pw/xdiff-shrink-memory-consumption:
  xdiff: reduce the size of array
  xprepare: simplify error handling
  xdiff: cleanup xdl_clean_mmatch()
  xdiff: reduce size of action arrays
2026-05-21 12:06:48 +09:00
Junio C Hamano 0330568152 Merge branch 'en/xdiff-cleanup-3'
Preparation of the xdiff/ codebase to work with Rust.

* en/xdiff-cleanup-3:
  xdiff/xdl_cleanup_records: make execution of action easier to follow
  xdiff/xdl_cleanup_records: make setting action easier to follow
  xdiff/xdl_cleanup_records: make limits more clear
  xdiff/xdl_cleanup_records: use unambiguous types
  xdiff: use unambiguous types in xdl_bogo_sqrt()
  xdiff/xdl_cleanup_records: delete local recs pointer
2026-05-19 09:57:44 +09:00
Michael Montalbo 4517e4ca10 xdiff: guard against negative context lengths
The xdemitconf_t fields ctxlen and interhunkctxlen are typed as long
(signed), but negative values are not meaningful for context line
counts. Unlike the diff_options fields changed in the previous two
commits, these cannot be converted to unsigned because the xdiff
arithmetic relies on signed subtraction:

    s1 = XDL_MAX(xch->i1 - xecfg->ctxlen, 0);

If ctxlen were unsigned long, the signed operand would be implicitly
converted to unsigned, and the subtraction would wrap to a large
positive value when i1 < ctxlen, defeating the XDL_MAX clamp. The
signed type is required for correct context-window calculations.

The previous two commits reject negative values at the parse layer
for --inter-hunk-context and -U/--unified, so negative values should
no longer reach xdiff in normal use. Add BUG() guards at the top of
xdl_get_hunk() as defense in depth to catch programming errors in
current or future callers that bypass option parsing.

xdl_get_hunk() is called by both xdl_emit_diff() and
xdl_call_hunk_func(), so a single guard covers all xdiff consumers.

Signed-off-by: Michael Montalbo <mmontalbo@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-05-13 10:13:26 +09:00
Phillip Wood dca97e79bb xdiff: reduce the size of array
When the myers algorithm is selected the input files are pre-processed
to remove any common prefix and suffix and any lines that appear
in only one file. This requires a map to be created between the
lines that are processed by the myers algorithm and the lines in
the original file. That map does not include the common lines at the
beginning and end of the files but the array is allocated to be the
size of the whole file. Move the allocation into xdl_cleanup_records()
where the map is populated and we know how big it needs to be.

Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-05-05 16:20:06 +09:00
Phillip Wood c8eb18f586 xprepare: simplify error handling
If either of the two allocations fail we want to take the same action
so use a single if statement. This saves a few lines and makes it
easier for the next commit to add a couple more allocations.

Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-05-05 16:20:05 +09:00
Phillip Wood 53d13887b8 xdiff: cleanup xdl_clean_mmatch()
Remove the "s" parameter as, since the last commit, this function
is always called with s == 0. Also change parameter "e" to expect a
length, rather than the index of the last line to simplify the caller.

Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-05-05 16:20:05 +09:00
Phillip Wood a814112533 xdiff: reduce size of action arrays
When the myers algorithm is selected the input files are pre-processed
to remove any common prefix and suffix. Then any lines that appear
only in one side of the diff are marked as changed and frequently
occurring lines are marked as changed if they are adjacent to a
changed line. This step requires a couple of temporary arrays. As as
the common prefix and suffix have already been removed, the arrays
only need to be big enough to hold the lines between them, not the
whole file. Reduce the size of the arrays and adjust the loops that
use them accordingly while taking care to keep indexing the arrays
in xdfile_t with absolute line numbers.

Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-05-05 16:20:05 +09:00
Junio C Hamano bfd057b2dd Merge branch 'en/xdiff-cleanup-3' into pw/xdiff-shrink-memory-consumption
* en/xdiff-cleanup-3:
  xdiff/xdl_cleanup_records: make execution of action easier to follow
  xdiff/xdl_cleanup_records: make setting action easier to follow
  xdiff/xdl_cleanup_records: make limits more clear
  xdiff/xdl_cleanup_records: use unambiguous types
  xdiff: use unambiguous types in xdl_bogo_sqrt()
  xdiff/xdl_cleanup_records: delete local recs pointer
2026-05-05 16:19:59 +09:00
Ezekiel Newren f87808b701 xdiff/xdl_cleanup_records: make execution of action easier to follow
Helped-by: Phillip Wood
Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-04-30 09:16:51 +09:00
Ezekiel Newren c355f69cda xdiff/xdl_cleanup_records: make setting action easier to follow
Rewrite nested ternaries with a clear if/else ladder for
action1/action2 to improve readability while preserving
behavior.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-04-30 09:16:50 +09:00
Ezekiel Newren f99a023df2 xdiff/xdl_cleanup_records: make limits more clear
Make the handling of per-file limits and the minimal-case clearer.
  * Use explicit per-file limit variables (mlim1, mlim2) and initialize
    them.
  * The additional condition `!need_min` is redudant now, remove it.
Best viewed with --color-words.

Helped-by: Phillip Wood
Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-04-30 09:16:50 +09:00
Ezekiel Newren 216802587e xdiff/xdl_cleanup_records: use unambiguous types
Change the parameters of xdl_clean_mmatch() and the local variables
i, nm, mlim in xdl_cleanup_records() to use unambiguous types. Best
viewed with --color-words.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-04-30 09:16:50 +09:00
Ezekiel Newren c37f8cda05 xdiff: use unambiguous types in xdl_bogo_sqrt()
There is no real square root for a negative number and size_t may not
be large enough for certain applications, replace long with uint64_t.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-04-30 09:16:50 +09:00
Ezekiel Newren b7b8449e5a xdiff/xdl_cleanup_records: delete local recs pointer
Simplify the first 2 for loops by directly indexing the xdfile.recs.
recs is unused in the last 2 for loops, remove it. Best viewed with
--color-words.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-04-30 09:16:50 +09:00
Harald Nordgren 13817db274 stash: add --label-ours, --label-theirs, --label-base for apply
Allow callers of "git stash apply" to pass custom labels for conflict
markers instead of the default "Updated upstream" and "Stashed changes".
Document the new options and add a test.

Signed-off-by: Harald Nordgren <haraldnordgren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-04-29 21:46:02 +09:00
Junio C Hamano 231f8100c4 Merge branch 'yc/histogram-hunk-shift-fix'
The final clean-up phase of the diff output could turn the result of
histogram diff algorithm suboptimal, which has been corrected.

* yc/histogram-hunk-shift-fix:
  xdiff: re-diff shifted change groups when using histogram algorithm
2026-03-24 12:31:34 -07:00
Yee Cheng Chin e417277ae9 xdiff: re-diff shifted change groups when using histogram algorithm
After a diff algorithm has been run, the compaction phase
(xdl_change_compact()) shifts and merges change groups to produce a
cleaner output. However, this shifting could create a new matched group
where both sides now have matching lines. This results in a
wrong-looking diff output which contains redundant lines that are the
same on both files.

Fix this by detecting this situation, and re-diff the texts on each side
to find similar lines, using the fall-back Myer's diff. Only do this for
histogram diff as it's the only algorithm where this is relevant. Below
contains an example, and more details.

For an example, consider two files below:

    file1:
        A

        A
        A
        A

        A
        A
        A

    file2:
        A

        A
        x
        A

        A
        A
        A

When using Myer's diff, the algorithm finds that only the "x" has been
changed, and produces a final diff result (these are line diffs, but
using word-diff syntax for ease of presentation):

        A A[-A-]{+x+}A AAA

When using histogram diff, the algorithm first discovers the LCS "A
AAA", which it uses as anchor, then produces an intermediate diff:

        {+A Ax+}A AAA[- AAA-].

This is a longer diff than Myer's, but it's still self-consistent.
However, the compaction phase attempts to shift the first file's diff
group upwards (note that this shift crosses the anchor that histogram
had used), leading to the final results for histogram diff:

        [-A AA-]{+A Ax+}A AAA

This is a technically correct patch but looks clearly redundant to a
human as the first 3 lines should not be in the diff.

The fix would detect that a shift has caused matching to a new group,
and re-diff the "A AA" and "A Ax" parts, which results in "A A"
correctly re-marked as unchanged. This creates the now correct histogram
diff:

        A A[-A-]{+x+}A AAA

This issue is not applicable to Myer's diff algorithm as it already
generates a minimal diff, which means a shift cannot result in a smaller
diff output (the default Myer's diff in xdiff is not guaranteed to be
minimal for performance reasons, but it typically does a good enough
job).

It's also not applicable to patience diff, because it uses only unique
lines as anchor for its splits, and falls back to Myer's diff within
each split. Shifting requires both ends having the same lines, and
therefore cannot cross the unique line boundaries established by the
patience algorithm. In contrast histogram diff uses non-unique lines as
anchors, and therefore shifting can cross over them.

This issue is rare in a normal repository. Below is a table of
repositories (`git log --no-merges -p --histogram -1000`), showing how
many times a re-diff was done and how many times it resulted in finding
matching lines (therefore addressing this issue) with the fix. In
general it is fewer than 1% of diff's that exhibit this offending
behavior:

| Repo (1k commits)  | Re-diff | Found matching lines |
|--------------------|---------|----------------------|
| llvm-project       |  45     | 11                   |
| vim                | 110     |  9                   |
| git                |  18     |  2                   |
| WebKit             | 168     |  1                   |
| ripgrep            |  22     |  1                   |
| cpython            |  32     |  0                   |
| vscode             |  13     |  0                   |

Signed-off-by: Yee Cheng Chin <ychin.git@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-03-03 08:43:05 -08:00
Junio C Hamano 8ceb69f85d Merge branch 'pw/diff-anchored-optim'
"git diff --anchored=<text>" has been optimized.

* pw/diff-anchored-optim:
  diff --anchored: avoid checking unmatched lines
2026-02-20 11:36:18 -08:00
Junio C Hamano 5465d3683a Merge branch 'pw/xdiff-cleanups'
Small clean-up of xdiff library to remove unnecessary data
duplication.

* pw/xdiff-cleanups:
  xdiff: remove unused data from xdlclass_t
  xdiff: remove "line_hash" field from xrecord_t
2026-02-20 11:36:17 -08:00
Phillip Wood dd2a4c0c7a diff --anchored: avoid checking unmatched lines
For a line to be an anchor it has to appear in each of the files being
diffed exactly once. With that in mind lets delay checking whether
a line is an anchor until we know there is exactly one instance of
the line in each file. As each line is checked at most once, there
is no need to cache the result of is_anchor() and we can drop that
field from the hashmap entries. When diffing 5000 recent commits in
git.git this gives a modest speedup of ~2%. In the (rather extreme)
example below that consists largely of deletions the speedup is ~16%.

    seq 0 10000000 >old
    printf '%s\n' 300000 100000 200000 >new
    git diff --no-index --anchored=300000 old new

Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-02-12 09:28:49 -08:00
Phillip Wood 5086213bd2 xdiff: remove unused data from xdlclass_t
Prior to commit 6d507bd41a (xdiff: delete fields ha, line, size
in xdlclass_t in favor of an xrecord_t, 2025-09-26) xdlclass_t
carried a copy of all the fields in xrecord_t. That commit embedded
xrecord_t in xdlclass_t to make it easier to change the types of
the fields in xrecord_t. However commit 6a26019c81 (xdiff: split
xrecord_t.ha into line_hash and minimal_perfect_hash, 2025-11-18)
added the "minimal_perfect_hash" field to xrecord_t which is not
used by xdlclass_t. To avoid wasting space stop copying the whole
of xrecord_t and just copy the pointer and length that we need to
intern the line. Together with the previous commit this effectively
reverts 6d507bd41a.

Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-01-26 08:38:29 -08:00
Phillip Wood c27afcbfd0 xdiff: remove "line_hash" field from xrecord_t
Prior to commit 6a26019c81 (xdiff: split xrecord_t.ha into line_hash
and minimal_perfect_hash, 2025-11-18) the "ha" field of xrecord_t
initially held the "line_hash" value and once the line had been
interned that field was updated to hold the "minimal_perfect_hash". The
"line_hash" is only used to intern the line so there is no point in
storing it after all the input lines have been interned.

Removing the "line_hash" field from xrecord_t and storing it in
xdlclass_t where it is actually used makes it clearer that it is a
temporary value and it should not be used once we're calculated the
"minimal_perfect_hash". This also reduces the size of xrecord_t by 25%
on 64-bit platforms and 40% on 32-bit platforms. While the struct is
small we create one instance per input line so any saving is welcome.

Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2026-01-26 08:38:29 -08:00
Junio C Hamano 7fc0b33b5d Merge branch 'yc/xdiff-patience-optim'
The way patience diff finds LCS has been optimized.

* yc/xdiff-patience-optim:
  xdiff: optimize patience diff's LCS search
2025-12-09 07:54:55 +09:00
Junio C Hamano 0c6707687f Merge branch 'en/xdiff-cleanup-2'
Code clean-up.

* en/xdiff-cleanup-2:
  xdiff: rename rindex -> reference_index
  xdiff: change rindex from long to size_t in xdfile_t
  xdiff: make xdfile_t.nreff a size_t instead of long
  xdiff: make xdfile_t.nrec a size_t instead of long
  xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
  xdiff: use unambiguous types in xdl_hash_record()
  xdiff: use size_t for xrecord_t.size
  xdiff: make xrecord_t.ptr a uint8_t instead of char
  xdiff: use ptrdiff_t for dstart/dend
  doc: define unambiguous type mappings across C and Rust
2025-12-05 14:49:56 +09:00
Yee Cheng Chin c7e3b8085b xdiff: optimize patience diff's LCS search
The find_longest_common_sequence() function in patience diff is
inefficient as it calls binary_search() for every unique line it
encounters when deciding where to put it in the sequence. From
instrumentation (using xctrace) on popular repositories, binary_search()
takes up 50-60% of the run time within patience_diff() when performing a
diff.

To optimize this, add a boundary condition check before binary_search()
is called to see if the encountered unique line is located after the
entire currently tracked longest subsequence. If so, skip the
unnecessary binary search and simply append the entry to the end of
sequence. Given that most files compared in a diff are usually quite
similar to each other, this condition is very common, and should be hit
much more frequently than the binary search.

Below are some end-to-end performance results by timing `git log
--shortstat --oneline -500 --patience` on different repositories with
the old and new code. Generally speaking this seems to give at least
8-10% speed up. The "binary search hit %" column describes how often the
algorithm enters the binary search path instead of the new faster path.
Even in the WebKit case we can see that it's quite rare (1.46%).

| Repo     | Speed difference | binary search hit % |
|----------|------------------|---------------------|
| vim      | 1.27x            | 0.01%               |
| pytorch  | 1.16x            | 0.02%               |
| cpython  | 1.14x            | 0.06%               |
| ripgrep  | 1.14x            | 0.03%               |
| git      | 1.13x            | 0.12%               |
| vscode   | 1.09x            | 0.10%               |
| WebKit   | 1.08x            | 1.46%               |

The benchmarks were done using hyperfine, on an Apple M1 Max laptop,
with git compiled with `-O3 -flto`.

Signed-off-by: Yee Cheng Chin <ychin.git@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-11-27 19:11:41 -08:00
Ezekiel Newren 22ce0cb639 xdiff: rename rindex -> reference_index
The classic diff adds only the lines that it's going to consider,
during the diff, to an array. A mapping between the compacted
array, and the lines of the file that they reference, is
facilitated by this array.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-11-18 14:53:11 -08:00
Ezekiel Newren 5004a8da14 xdiff: change rindex from long to size_t in xdfile_t
The field rindex describes an index offset for other arrays. Change it
to size_t.

Changing the type of rindex from long to size_t has no cascading
refactor impact because it is only ever used to directly index other
arrays.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-11-18 14:53:11 -08:00
Ezekiel Newren e35877eadb xdiff: make xdfile_t.nreff a size_t instead of long
size_t is used because nreff describes the number of elements in memory
for rindex.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-11-18 14:53:11 -08:00
Ezekiel Newren 016538780e xdiff: make xdfile_t.nrec a size_t instead of long
size_t is used because nrec describes the number of elements for both
recs, and for 'changed' + 2.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-11-18 14:53:10 -08:00
Ezekiel Newren 6a26019c81 xdiff: split xrecord_t.ha into line_hash and minimal_perfect_hash
The ha field is serving two different purposes, which makes the code
harder to read. At first glance, it looks like many places assume
there could never be hash collisions between lines of the two input
files. In reality, line_hash is used together with xdl_recmatch() to
ensure correct comparisons of lines, even when collisions occur.

To make this clearer, the old ha field has been split:
  * line_hash: a straightforward hash of a line, independent of any
    external context. Its type is uint64_t, as it comes from a fixed
    width hash function.
  * minimal_perfect_hash: Not a new concept, but now a separate
    field. It comes from the classifier's general-purpose hash table,
    which assigns each line a unique and minimal hash across the two
    files. A size_t is used here because it's meant to be used to
    index an array. This also avoids ` as usize` casts on the Rust
    side when using it to index a slice.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-11-18 14:53:10 -08:00
Ezekiel Newren b0d4ae30f5 xdiff: use unambiguous types in xdl_hash_record()
Convert the function signature and body to use unambiguous types. char
is changed to uint8_t because this function processes bytes in memory.
unsigned long to uint64_t so that the hash output is consistent across
platforms. `flags` was changed from long to uint64_t to ensure the
high order bits are not dropped on platforms that treat long as 32
bits.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-11-18 14:53:10 -08:00
Ezekiel Newren 9bd193253c xdiff: use size_t for xrecord_t.size
size_t is the appropriate type because size is describing the number of
elements, bytes in this case, in memory.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-11-18 14:53:10 -08:00
Ezekiel Newren 10f97d6aff xdiff: make xrecord_t.ptr a uint8_t instead of char
Make xrecord_t.ptr uint8_t because it's referring to bytes in memory.

In order to avoid a refactor avalanche, many uses of this field were
cast to char* or similar.

Places where casting was unnecessary:
xemit.c:156
xmerge.c:124
xmerge.c:127
xmerge.c:164
xmerge.c:169
xmerge.c:172
xmerge.c:178

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-11-18 14:53:10 -08:00
Ezekiel Newren f007f4f4b4 xdiff: use ptrdiff_t for dstart/dend
ptrdiff_t is appropriate for dstart and dend because they both describe
positive or negative offsets relative to a pointer.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-11-18 14:53:10 -08:00
Antonin Delpeuch 881793c4f7 xdiff: add 'minimal' to XDF_DIFF_ALGORITHM_MASK
The XDF_DIFF_ALGORITHM_MASK bit mask only includes bits for the patience
and histogram diffs, not for the minimal one. This means that when
reseting the diff algorithm to the default one, one needs to separately
clear the bit for the minimal diff. There are places in the code that fail
to do that: merge-ort.c and builtin/merge-file.c.

Add the XDF_NEED_MINIMAL bit to the bit mask, and remove the separate
clearing of this bit in the places where it hasn't been forgotten.

Signed-off-by: Antonin Delpeuch <antonin@delpeuch.eu>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-11-17 09:31:59 -08:00
Junio C Hamano 9ff172d0ee Merge branch 'en/xdiff-cleanup'
A lot of code clean-up of xdiff.
Split out of a larger topic.

* en/xdiff-cleanup:
  xdiff: change type of xdfile_t.changed from char to bool
  xdiff: add macros DISCARD(0), KEEP(1), INVESTIGATE(2) in xprepare.c
  xdiff: rename rchg -> changed in xdfile_t
  xdiff: delete chastore from xdfile_t
  xdiff: delete fields ha, line, size in xdlclass_t in favor of an xrecord_t
  xdiff: delete redundant array xdfile_t.ha
  xdiff: delete struct diffdata_t
  xdiff: delete local variables that alias fields in xrecord_t
  xdiff: delete superfluous function xdl_get_rec() in xemit
  xdiff: delete unnecessary fields from xrecord_t and xdfile_t
  xdiff: delete local variables and initialize/free xdfile_t directly
  xdiff: delete static forward declarations in xprepare
2025-10-14 12:56:09 -07:00
Ezekiel Newren 8b9c5d2e3a xdiff: change type of xdfile_t.changed from char to bool
The only values possible for 'changed' is 1 and 0, which exactly maps
to a bool type. It might not look like this because action1 and action2
(which use to be dis1, and dis2) were also of type char and were
assigned numerical values within a few lines of 'changed' (what used to
be rchg).

Using DISCARD/KEEP/INVESTIGATE for action1[i]/action2[j], and true/false
for changed[k] makes it clear to future readers that these are
logically separate concepts.

Best-viewed-with: --color-words
Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-10-03 10:19:40 -07:00
Ezekiel Newren e385e1b7d2 xdiff: add macros DISCARD(0), KEEP(1), INVESTIGATE(2) in xprepare.c
This commit is refactor-only; no behavior is changed. A future commit
will use bool literals for changed[i].

The functions xdl_clean_mmatch() and xdl_cleanup_records() will be
cleaned up more in a future patch series. The changes to
xdl_cleanup_records(), in this patch, are just to make it clear why
`char rchg` is refactored to `bool changed`.

Rename dis* to action* and replace literal numericals with macros.
The old names came from when dis* (which I think was short for discard)
was treated like a boolean, but over time it grew into a ternary state
machine. The result was confusing because dis* and rchg* both used 0/1
values with different meanings.

The new names and macros make the states explicit. nm is short for
number of matches, and mlim is a heuristic limit:

  nm == 0       -> action[i] = DISCARD     -> changed[i] = true
  0 < nm < mlim -> action[i] = KEEP        -> changed[i] = false
  nm >= mlim    -> action[i] = INVESTIGATE -> changed[i] = xdl_clean_mmatch()

When need_min is true, only DISCARD and KEEP occur because the limit
is effectively infinite.

Best-viewed-with: --color-words
Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-10-03 10:19:40 -07:00
Ezekiel Newren b7de64a6d6 xdiff: rename rchg -> changed in xdfile_t
The field rchg (now 'changed') declares if a line in a file is changed
or not. A later commit will change it's type from 'char' to 'bool'
to make its purpose even more clear.

Best-viewed-with: --color-words
Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-30 14:12:46 -07:00
Ezekiel Newren d43d591252 xdiff: delete chastore from xdfile_t
xdfile_t currently uses chastore_t which is an arena allocator. I
think that xrecord_t used to be a linked list and recs didn't exist
originally. When recs was added I think they forgot to remove
xdfile_t.next, but was overlooked. This dual data structure setup
makes the code somewhat confusing.

Additionally the C type chastore_t isn't FFI friendly, and provides
little to no performance benefit over using realloc to grow an array.

Performance impact of deleting fields from xdfile_t:
Deleting ha is about 5% slower.
Deleting cha is about 5% faster.

Delete ha, but keep cha
  time hyperfine --warmup 3 -L exe build_v2.51.0/git,build_delete_ha/git '{exe} log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null'
  Benchmark 1: build_v2.51.0/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null
    Time (mean ± σ):      1.269 s ±  0.017 s    [User: 1.135 s, System: 0.128 s]
    Range (min … max):    1.249 s …  1.286 s    10 runs

  Benchmark 2: build_delete_ha/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null
    Time (mean ± σ):      1.339 s ±  0.017 s    [User: 1.234 s, System: 0.099 s]
    Range (min … max):    1.320 s …  1.358 s    10 runs

  Summary
    build_v2.51.0/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null ran
      1.06 ± 0.02 times faster than build_delete_ha/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null

Delete cha, but keep ha
  time hyperfine --warmup 3 -L exe build_v2.51.0/git,build_delete_chastore/git '{exe} log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null'
  Benchmark 1: build_v2.51.0/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null
    Time (mean ± σ):      1.290 s ±  0.001 s    [User: 1.154 s, System: 0.130 s]
    Range (min … max):    1.288 s …  1.292 s    10 runs

  Benchmark 2: build_delete_chastore/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null
    Time (mean ± σ):      1.232 s ±  0.017 s    [User: 1.105 s, System: 0.121 s]
    Range (min … max):    1.205 s …  1.249 s    10 runs

  Summary
    build_delete_chastore/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null ran
      1.05 ± 0.01 times faster than build_v2.51.0/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null

Delete ha AND chastore
  time hyperfine --warmup 3 -L exe build_v2.51.0/git,build_delete_ha_and_chastore/git '{exe} log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null'
  Benchmark 1: build_v2.51.0/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null
    Time (mean ± σ):      1.291 s ±  0.002 s    [User: 1.156 s, System: 0.129 s]
    Range (min … max):    1.287 s …  1.295 s    10 runs

  Benchmark 2: build_delete_ha_and_chastore/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null
    Time (mean ± σ):      1.306 s ±  0.001 s    [User: 1.195 s, System: 0.105 s]
    Range (min … max):    1.305 s …  1.308 s    10 runs

  Summary
    build_v2.51.0/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null ran
      1.01 ± 0.00 times faster than build_delete_ha_and_chastore/git log --oneline --shortstat --diff-algorithm=myers -3000 v2.39.1 >/dev/null

Best-viewed-with: --color-words
Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-30 14:12:46 -07:00
Ezekiel Newren 6d507bd41a xdiff: delete fields ha, line, size in xdlclass_t in favor of an xrecord_t
The fields from xdlclass_t are aliases of xrecord_t:
xdlclass_t.line -> xrecord_t.ptr
xdlclass_t.size -> xrecord_t.size
xdlclass_t.ha   -> xrecord_t.ha

xdlclass_t carries a copy of the data in xrecord_t, but instead of
embedding xrecord_t it duplicates the individual fields. A future
commit will change the types used in xrecord_t so embed it in
xdlclass_t first, so we don't have to remember to change the types
here as well.

Best-viewed-with: --color-words
Helped-by: Phillip Wood <phillip.wood123@gmail.com>
Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-30 14:12:46 -07:00
Ezekiel Newren 5c294dceb2 xdiff: delete redundant array xdfile_t.ha
When 0 <= i < xdfile_t.nreff the following is true:
xdfile_t.ha[i] == xdfile_t.recs[xdfile_t.rindex[i]]

This makes the code about 5% slower. The fields rindex and ha are
specific to the classic diff (myers and minimal). I plan on creating a
struct for classic diff, but there's a lot of cleanup that needs to be
done before that can happen and leaving ha in would make those cleanups
harder to follow.

A subsequent commit will delete the chastore cha from xdfile_t. That
later commit will investigate deleting ha and cha independently and
together.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-30 14:12:46 -07:00
Ezekiel Newren f4ea812b2d xdiff: delete struct diffdata_t
Every field in this struct is an alias for a certain field in xdfile_t.

diffdata_t.nrec   -> xdfile_t.nreff
diffdata_t.ha     -> xdfile_t.ha
diffdata_t.rindex -> xdfile_t.rindex
diffdata_t.rchg   -> xdfile_t.rchg

I think this struct existed before xdfile_t, and was kept for backward
compatibility reasons. I think xdiffi should have been refactored to
use the new (xdfile_t) struct, but was easier to alias it instead.

The local variables rchg* and rindex* don't shorten the lines by much,
nor do they really need to be there to make the code more readable.
Delete them.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-30 14:12:46 -07:00
Ezekiel Newren 7c6ce2e47b xdiff: delete local variables that alias fields in xrecord_t
Use the type xrecord_t as the local variable for the functions in the
file xdiff/xemit.c. Most places directly reference the fields inside of
this struct, doing that here makes it more consistent with the rest of
the code.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-30 14:12:46 -07:00
Ezekiel Newren 7bdeb3afad xdiff: delete superfluous function xdl_get_rec() in xemit
When xrecord_t was a linked list, and recs didn't exist, I assume this
function walked the list until it found the right record. Accessing
a contiguous array is so trivial that this function is now superfluous.
Delete it.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-30 14:12:39 -07:00
Ezekiel Newren efaf553b1a xdiff: delete unnecessary fields from xrecord_t and xdfile_t
xrecord_t.next, xdfile_t.hbits, xdfile_t.rhash are initialized,
but never used for anything by the code. Remove them.

Best-viewed-with: --color-words
Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-26 16:08:55 -07:00
Ezekiel Newren d1c028bdf7 xdiff: delete local variables and initialize/free xdfile_t directly
These local variables are essentially a hand-rolled additional
implementation of xdl_free_ctx() inlined into xdl_prepare_ctx(). Modify
the code to use the existing xdl_free_ctx() function so there aren't
two ways to free such variables.

Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-26 16:08:54 -07:00
Ezekiel Newren 43d5f52ac4 xdiff: delete static forward declarations in xprepare
Move xdl_prepare_env() later in the file to avoid the need
for static forward declarations.

Best-viewed-with: --color-moved
Signed-off-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-26 16:08:54 -07:00
Alexander Monakov a4bbe8af0b xdiff: optimize xdl_hash_record_verbatim
xdl_hash_record_verbatim uses modified djb2 hash with XOR instead of ADD
for combining. The ADD-based variant is used as the basis of the modern
("GNU") symbol lookup scheme in ELF. Glibc dynamic loader received an
optimized version of this hash function thanks to Noah Goldstein [1].

Switch xdl_hash_record_verbatim to additive hashing and implement
an optimized loop following the scheme suggested by Noah.

Timing 'git log --oneline --shortstat v2.0.0..v2.5.0' under perf, I got

version | cycles, bn | instructions, bn
---------------------------------------
A         6.38         11.3
B         6.21         10.89
C         5.80          9.95
D         5.83          8.74
---------------------------------------

A: baseline (git master at e4ef0485fd)
B: plus 'xdiff: refactor xdl_hash_record()'
C: and plus this patch
D: with 'xdiff: use xxhash' by Phillip Wood

The resulting speedup for xdl_hash_record_verbatim itself is about 1.5x.

[1] https://inbox.sourceware.org/libc-alpha/20220519221803.57957-6-goldstein.w.n@gmail.com/

Signed-off-by: Alexander Monakov <amonakov@ispras.ru>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-08-18 08:44:49 -07:00