Files
git-mirror/t/t8020-last-modified.sh
Toon Claes 2a04e8c293 last-modified: implement faster algorithm
The current implementation of git-last-modified(1) works by doing a
revision walk, and inspecting the diff at each level of that walk to
annotate entries remaining in the hashmap of paths. In other words, if
the diff at some level touches a path which has not yet been associated
with a commit, then that commit becomes associated with the path.

While a perfectly reasonable implementation, it can perform poorly in
either one of two scenarios:

  1. There are many entries of interest, in which case there is simply
     a lot of work to do.

  2. Or, there are (even a few) entries which have not been updated in a
     long time, and so we must walk through a lot of history in order to
     find a commit that touches that path.

This patch rewrites the last-modified implementation that addresses the
second point. The idea behind the algorithm is to propagate a set of
'active' paths (a path is 'active' if it does not yet belong to a
commit) up to parents and do a truncated revision walk.

The walk is truncated because it does not produce a revision for every
change in the original pathspec, but rather only for active paths.

More specifically, consider a priority queue of commits sorted by
generation number. First, enqueue the set of boundary commits with all
paths in the original spec marked as interesting.

Then, while the queue is not empty, do the following:

  1. Pop an element, say, 'c', off of the queue, making sure that 'c'
     isn't reachable by anything in the '--not' set.

  2. For each parent 'p' (with index 'parent_i') of 'c', do the
     following:

     a. Compute the diff between 'c' and 'p'.
     b. Pass any active paths that are TREESAME from 'c' to 'p'.
     c. If 'p' has any active paths, push it onto the queue.

  3. Any path that remains active on 'c' is associated to that commit.

This ends up being equivalent to doing something like 'git log -1 --
$path' for each path simultaneously. But, it allows us to go much faster
than the original implementation by limiting the number of diffs we
compute, since we can avoid parts of history that would have been
considered by the revision walk in the original implementation, but are
known to be uninteresting to us because we have already marked all paths
in that area to be inactive.

To avoid computing many first-parent diffs, add another trick on top of
this and check if all paths active in 'c' are DEFINITELY NOT in c's
Bloom filter. Since the commit-graph only stores first-parent diffs in
the Bloom filters, we can only apply this trick to first-parent diffs.

Comparing the performance of this new algorithm shows about a 2.5x
improvement on git.git:

    Benchmark 1: master   no bloom
      Time (mean ± σ):      2.868 s ±  0.023 s    [User: 2.811 s, System: 0.051 s]
      Range (min … max):    2.847 s …  2.926 s    10 runs

    Benchmark 2: master with bloom
      Time (mean ± σ):     949.9 ms ±  15.2 ms    [User: 907.6 ms, System: 39.5 ms]
      Range (min … max):   933.3 ms … 971.2 ms    10 runs

    Benchmark 3: HEAD     no bloom
      Time (mean ± σ):     782.0 ms ±   6.3 ms    [User: 740.7 ms, System: 39.2 ms]
      Range (min … max):   776.4 ms … 798.2 ms    10 runs

    Benchmark 4: HEAD   with bloom
      Time (mean ± σ):     307.1 ms ±   1.7 ms    [User: 276.4 ms, System: 29.9 ms]
      Range (min … max):   303.7 ms … 309.5 ms    10 runs

    Summary
      HEAD   with bloom ran
        2.55 ± 0.02 times faster than HEAD     no bloom
        3.09 ± 0.05 times faster than master with bloom
        9.34 ± 0.09 times faster than master   no bloom

In short, the existing implementation is comparably fast *with* Bloom
filters as the new implementation is *without* Bloom filters. So, most
repositories should get a dramatic speed-up by just deploying this (even
without computing Bloom filters), and all repositories should get faster
still when computing Bloom filters.

When comparing a more extreme example of
`git last-modified -- COPYING t`, the difference is even 5 times better:

    Benchmark 1: master
      Time (mean ± σ):      4.372 s ±  0.057 s    [User: 4.286 s, System: 0.062 s]
      Range (min … max):    4.308 s …  4.509 s    10 runs

    Benchmark 2: HEAD
      Time (mean ± σ):     826.3 ms ±  22.3 ms    [User: 784.1 ms, System: 39.2 ms]
      Range (min … max):   810.6 ms … 881.2 ms    10 runs

    Summary
      HEAD ran
        5.29 ± 0.16 times faster than master

As an added benefit, results are more consistent now. For example
implementation in 'master' gives:

    $ git log --max-count=1 --format=%H -- pkt-line.h
    15df15fe07

    $ git last-modified -- pkt-line.h
    15df15fe07	pkt-line.h

    $ git last-modified | grep pkt-line.h
    5b49c1af03	pkt-line.h

With the changes in this patch the results of git-last-modified(1)
always match those of `git log --max-count=1`.

One thing to note though, the results might be outputted in a different
order than before. This is not considerd to be an issue because nowhere
is documented the order is guaranteed.

Based-on-patches-by: Derrick Stolee <stolee@gmail.com>
Based-on-patches-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Toon Claes <toon@iotcl.com>
Acked-by: Taylor Blau <me@ttaylorr.com>
[jc: tweaked use of xcalloc() to unbreak coccicheck]
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-11-03 07:25:41 -08:00

231 lines
4.3 KiB
Bash
Executable File

#!/bin/sh
test_description='last-modified tests'
. ./test-lib.sh
test_expect_success 'setup' '
test_commit 1 file &&
mkdir a &&
test_commit 2 a/file &&
mkdir a/b &&
test_commit 3 a/b/file
'
test_expect_success 'cannot run last-modified on two trees' '
test_must_fail git last-modified HEAD HEAD~1
'
check_last_modified() {
local indir= &&
while test $# != 0
do
case "$1" in
-C)
indir="$2"
shift
;;
*)
break
;;
esac &&
shift
done &&
cat >expect &&
git ${indir:+-C "$indir"} last-modified "$@" >tmp.1 &&
git name-rev --annotate-stdin --name-only --tags \
<tmp.1 >tmp.2 &&
tr '\t' ' ' <tmp.2 >actual &&
test_cmp expect actual
}
test_expect_success 'last-modified non-recursive' '
check_last_modified <<-\EOF
3 a
1 file
EOF
'
test_expect_success 'last-modified recursive' '
check_last_modified -r <<-\EOF
3 a/b/file
2 a/file
1 file
EOF
'
test_expect_success 'last-modified recursive with show-trees' '
check_last_modified -r -t <<-\EOF
3 a/b
3 a/b/file
3 a
2 a/file
1 file
EOF
'
test_expect_success 'last-modified non-recursive with show-trees' '
check_last_modified -t <<-\EOF
3 a
1 file
EOF
'
test_expect_success 'last-modified subdir' '
check_last_modified a <<-\EOF
3 a
EOF
'
test_expect_success 'last-modified subdir recursive' '
check_last_modified -r a <<-\EOF
3 a/b/file
2 a/file
EOF
'
test_expect_success 'last-modified from non-HEAD commit' '
check_last_modified HEAD^ <<-\EOF
2 a
1 file
EOF
'
test_expect_success 'last-modified from subdir defaults to root' '
check_last_modified -C a <<-\EOF
3 a
1 file
EOF
'
test_expect_success 'last-modified from subdir uses relative pathspecs' '
check_last_modified -C a -r b <<-\EOF
3 a/b/file
EOF
'
test_expect_success 'limit last-modified traversal by count' '
check_last_modified -1 <<-\EOF
3 a
^2 file
EOF
'
test_expect_success 'limit last-modified traversal by commit' '
check_last_modified HEAD~2..HEAD <<-\EOF
3 a
^1 file
EOF
'
test_expect_success 'only last-modified files in the current tree' '
git rm -rf a &&
git commit -m "remove a" &&
check_last_modified <<-\EOF
1 file
EOF
'
test_expect_success 'subdirectory modified via merge' '
test_when_finished rm -rf repo &&
git init repo &&
(
cd repo &&
test_commit base &&
git switch --create left &&
mkdir subdir &&
test_commit left subdir/left &&
git switch --create right base &&
mkdir subdir &&
test_commit right subdir/right &&
git switch - &&
test_merge merge right &&
check_last_modified <<-\EOF
merge subdir
base base.t
EOF
)
'
test_expect_success 'cross merge boundaries in blaming' '
git checkout HEAD^0 &&
git rm -rf . &&
test_commit m1 &&
git checkout HEAD^ &&
git rm -rf . &&
test_commit m2 &&
git merge m1 &&
check_last_modified <<-\EOF
m2 m2.t
m1 m1.t
EOF
'
test_expect_success 'last-modified merge for resolved conflicts' '
git checkout HEAD^0 &&
git rm -rf . &&
test_commit c1 conflict &&
git checkout HEAD^ &&
git rm -rf . &&
test_commit c2 conflict &&
test_must_fail git merge c1 &&
test_commit resolved conflict &&
check_last_modified conflict <<-\EOF
resolved conflict
EOF
'
# Consider `file` with this content through history:
#
# A---B---B-------B---B
# \ /
# C---D
test_expect_success 'last-modified merge ignores content from branch' '
git checkout HEAD^0 &&
git rm -rf . &&
test_commit a1 file A &&
test_commit a2 file B &&
test_commit a3 file C &&
test_commit a4 file D &&
git checkout a2 &&
git merge --no-commit --no-ff a4 &&
git checkout a2 -- file &&
git merge --continue &&
check_last_modified <<-\EOF
a2 file
EOF
'
# Consider `file` with this content through history:
#
# A---B---B---C---D---B---B
# \ /
# B-------B
test_expect_success 'last-modified merge undoes changes' '
git checkout HEAD^0 &&
git rm -rf . &&
test_commit b1 file A &&
test_commit b2 file B &&
test_commit b3 file C &&
test_commit b4 file D &&
git checkout b2 &&
test_commit b5 file2 2 &&
git checkout b4 &&
git merge --no-commit --no-ff b5 &&
git checkout b2 -- file &&
git merge --continue &&
check_last_modified <<-\EOF
b5 file2
b2 file
EOF
'
test_expect_success 'last-modified complains about unknown arguments' '
test_must_fail git last-modified --foo 2>err &&
grep "unknown last-modified argument: --foo" err
'
test_done