`git ls-files --modified` and `git ls-files --deleted` have been
optimized to filter with pathspec before calling lstat() when there is
only a single pathspec item, avoiding unnecessary filesystem access
for entries that will not be shown.
* td/ls-files-pathspec-prefilter:
ls-files: filter pathspec before lstat
'git describe' has been taught to pass the 'refs/tags/' prefix down to
the ref iterator when '--all' is not requested, avoiding unnecessary
iteration over non-tag refs.
* td/describe-tag-iteration:
describe: limit default ref iteration to tags
"git index-pack" has been optimized by retaining child bases in the
delta cache instead of immediately freeing them, letting the existing
cache limit policy decide eviction.
* ab/index-pack-retain-child-bases:
index-pack: retain child bases in delta cache
The 'git describe --contains --all' command has been fixed to
properly honor the '--match' and '--exclude' options by passing
them down to 'git name-rev' with the appropriate reference
prefixes.
* jk/describe-contains-all-match-fix:
describe: fix --exclude, --match with --contains and --all
Streaming revision walks have been optimized by using a priority queue
for date-sorting commits, speeding up walks repositories with many
merges.
* kk/streaming-walk-pqueue:
revision: use priority queue for non-limited streaming walks
revision: introduce rev_walk_mode to clarify get_revision_1()
pack-objects: call release_revisions() after cruft traversal
Many core configuration variables have been migrated from global
variables into 'repo_config_values' to tie them to a specific
repository instance, avoiding cross-repository state leakage.
* ob/more-repo-config-values:
environment: move "warn_on_object_refname_ambiguity" into `struct repo_config_values`
environment: move "sparse_expect_files_outside_of_patterns" into `struct repo_config_values`
environment: move "core_sparse_checkout_cone" into `struct repo_config_values`
environment: move "precomposed_unicode" into `struct repo_config_values`
environment: move "pack_compression_level" into `struct repo_config_values`
environment: move `zlib_compression_level` into `struct repo_config_values`
environment: move "check_stat" into `struct repo_config_values`
environment: move "trust_ctime" into `struct repo_config_values`
"git config foo.bar=baz" is not likely to be a request to read the
value of such a variable with '=' in its name; rather it is plausible
that the user meant "git config set foo.bar baz". Give advice when
giving an error message.
* hn/config-typo-advice:
config: improve diagnostic for "set" with missing value
config: add git_config_key_is_valid() for quiet validation
In --deleted and --modified modes, show_files() calls lstat() for each
index entry before show_ce() applies the pathspec. prune_index() avoids
most of these calls for pathspecs with a common directory prefix, but
not for a top-level name or leading wildcard.
Match before lstat() to avoid accessing the worktree for entries that
cannot be shown. Treat this as a prefilter: do not update ps_matched,
and retain the match in show_ce() so --error-unmatch is satisfied only
by entries that the selected modes actually show.
Prefilter only a single pathspec item, bounding the added work for each
index entry. Applying match_pathspec() to multiple arguments can cost
more than the lstat() calls it avoids. In a synthetic repository with
10,000 clean files, passing every path to ls-files --modified increased
runtime from 112.5 ms to 494.1 ms when the prefilter was unconditional.
With $parent and $this exported as paths to binaries built from the
parent and this commit, on a repository with 881,290 index entries:
hyperfine --warmup 0 --runs 3 \
--command-name parent \
'$parent -c core.fsmonitor=false ls-files --deleted -- README.md >/dev/null' \
--command-name this-commit \
'$this -c core.fsmonitor=false ls-files --deleted -- README.md >/dev/null'
reported means of 65.790 seconds for the parent and 4.987 seconds for
this commit.
Link: https://lore.kernel.org/r/xmqqfr2tnfk0.fsf@gitster.g
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Tamir Duberstein <tamird@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The loose object source has been refactored into a proper `struct
odb_source`.
* ps/odb-source-loose:
odb/source-loose: drop pointer to the "files" source
odb/source-loose: stub out remaining callbacks
odb/source-loose: wire up `write_object_stream()` callback
object-file: refactor writing objects to use loose source
odb/source-loose: wire up `write_object()` callback
loose: refactor object map to operate on `struct odb_source_loose`
odb/source-loose: wire up `freshen_object()` callback
odb/source-loose: drop `odb_source_loose_has_object()`
odb/source-loose: wire up `count_objects()` callback
odb/source-loose: wire up `find_abbrev_len()` callback
odb/source-loose: wire up `for_each_object()` callback
odb/source-loose: wire up `read_object_stream()` callback
odb/source-loose: wire up `read_object_info()` callback
odb/source-loose: wire up `close()` callback
odb/source-loose: wire up `reprepare()` callback
odb/source-loose: start converting to a proper `struct odb_source`
odb/source-loose: store pointer to "files" instead of generic source
odb/source-loose: move loose source into "odb/" subsystem
Without --all, git describe ignores refs outside refs/tags/. Commit
8a5a1884e9 (Avoid accessing non-tag refs in git-describe unless --all is
requested, 2008-02-24) moved this check ahead of object lookup. That
avoided loading objects for irrelevant refs, but the backend still has
to yield every ref before get_name() can reject it.
Pass refs/tags/ to the iterator so the backend can avoid visiting those
refs in the first place.
The new perf test creates 10,000 unrelated packed refs. It measures:
git describe --exact-match HEAD
The runtime drops from 0.03(0.01+0.01) to 0.02(0.00+0.00). In a
repository with 120,532 refs but only 330 tags, the same command went
from 171.7 ms to 9.9 ms.
Signed-off-by: Tamir Duberstein <tamird@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Formatting object name in full hexadecimal form has been optimized
by using a new strbuf_add_oid_hex() helper function.
* rs/strbuf-add-oid-hex:
hex: add and use strbuf_add_oid_hex()
Adding a decimal integer with strbuf_addf("%u") appears commonly;
they have been optimized by using a custom formatter.
* rs/strbuf-add-uint:
ls-tree: use strbuf_add_uint()
ls-files: use strbuf_add_uint()
cat-file: use strbuf_add_uint()
strbuf: add strbuf_add_uint()
"git push" learned to take a "remote group" name to push to, which
causes pushes to multiple places, just like "git fetch" would do.
* ua/push-remote-group:
push: support pushing to a remote group
remote: move remote group resolution to remote.c
remote: fix sign-compare warnings in push_cas_option
There are some typos in the documentation, comments, etc.
Fix them via codespell, and then adjust the "dump" files
used by the subversion tests to match the updated contents.
Signed-off-by: Andrew Kreimer <algonell@gmail.com>
[dscho noticed and fixed the problems in svn test]
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
[jc did final assembling of the three patches]
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git stash -p" has been optimized by reusing cached index
entries in its temporary index, avoiding unnecessary lstat()
calls on unchanged files.
* aj/stash-patch-optimize-temporary-index:
stash: reuse cached index entries in --patch temporary index
'git restore --staged' has been optimized to avoid unnecessarily expanding
the sparse index when operating on paths within the sparse checkout
definition, by handling sparse directory entries at the tree level.
* ds/restore-sparse-index:
restore: avoid sparse index expansion
t1092: test 'git restore' with sparse index
The GIT_WORK_TREE variable prepared to invoke the push-to-checkout
hook was leaking into the environment even when there was no hook
used and broke the default push-to-deploy (i.e., let "git checkout"
update the working tree only when the working tree is clean).
* ar/receive-pack-worktree-env:
receive-pack: fix updateInstead with core.worktree
"git config set pull.rebase=false" currently fails with "wrong
number of arguments", and the implicit form "git config
pull.rebase=false" fails with "invalid key". Neither points at
the real problem: the value is missing.
Report that directly, and when the argument has the shape
"<valid-key>=<value>", also suggest the split form:
$ git config set pull.rebase=false
error: missing value to set to the variable 'pull.rebase=false'
hint: did you mean "git config set pull.rebase false"?
When the prefix before "=" is not a valid key, drop the hint:
$ git config set foo=bar
error: missing value to set to a variable with an invalid name 'foo=bar'
Signed-off-by: Harald Nordgren <haraldnordgren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `core.warnAmbiguousRefs` configuration was previously stored in a
global `int` variable, making it shared across repository instances
and risking cross‑repository state leakage.
Store it instead in `repo_config_values`, where eagerly‑parsed
repository configuration lives. This option is parsed eagerly because
ambiguity warnings influence how users interpret object references in
many commands; a lazy parse could cause these warnings to behave
inconsistently or to appear for the wrong repository, confusing users
and hindering libification. This preserves the existing behavior while
tying the value to the repository from which it was read, avoiding
cross‑repository state leakage and continuing the effort to reduce
reliance on global configuration state.
Update all references to use `repo_config_values()`.
Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Usman Akinyemi <usmanakinyemi202@gmail.com>
Signed-off-by: Olamide Caleb Bello <belkid98@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `core.sparseCheckoutCone` configuration was previously stored in an
uninitialized global `int` variable, risking cross‑repository state
leakage.
Move it into `repo_config_values`, where eagerly‑parsed repository
configuration lives. `core.sparseCheckoutCone` is parsed eagerly
because it determines the fundamental sparse‑checkout mode and is
consulted very early during repository setup; a lazy parse could
leave the sparse‑checkout state undefined and complicate
libification. This preserves the existing behavior while tying the
value to the repository from which it was read, avoiding cross‑
repository state leakage and continuing the effort to reduce reliance
on global configuration state.
Update all references to use `repo_config_values()`.
Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Usman Akinyemi <usmanakinyemi202@gmail.com>
Signed-off-by: Olamide Caleb Bello <belkid98@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `pack_compression_level` configuration is currently stored in the
global variable `pack_compression_level`, which makes it shared across
repository instances within a single process.
Store it instead in `repo_config_values`, where eagerly‑parsed
repository configuration lives. `pack_compression_level` is parsed
eagerly because it influences packfile compression, a core operation
where a lazy parse could cause inconsistent behavior and hamper
libification. This preserves the existing eager‑parsing behavior while
tying the value to the repository from which it was read, avoiding
cross‑repository state leakage and continuing the effort to reduce
reliance on global configuration state.
Update all references to use `repo_config_values()`.
Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Usman Akinyemi <usmanakinyemi202@gmail.com>
Signed-off-by: Olamide Caleb Bello <belkid98@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `zlib_compression_level` configuration is currently stored in the
global variable `zlib_compression_level`, which makes it shared across
repository instances within a single process.
Store it instead in `repo_config_values`, where eagerly‑parsed
repository configuration lives. `zlib_compression_level` is parsed
eagerly because it determines compression behaviour for objects and
packs – core operations where a lazy parse could lead to unpredictable
results and hinder libification. This preserves the existing
eager‑parsing behavior while tying the value to the repository it
was read from, avoiding cross‑repository state leakage and continuing
the effort to reduce reliance on global configuration state.
Update all references to use `repo_config_values()`.
Mentored-by: Christian Couder <christian.couder@gmail.com>
Mentored-by: Usman Akinyemi <usmanakinyemi202@gmail.com>
Signed-off-by: Olamide Caleb Bello <belkid98@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The "git pack-objects --path-walk" traversal has been integrated
with several object filters, including blobless and sparse filters.
* ds/path-walk-filters:
path-walk: support `combine` filter
path-walk: support `object:type` filter
path-walk: support `tree:0` filter
t6601: tag otherwise-unreachable trees
pack-objects: support sparse:oid filter with path-walk
path-walk: add pl_sparse_trees to control tree pruning
path-walk: support blob size limit filter
backfill: die on incompatible filter options
path-walk: support blobless filter
path-walk: always emit directly-requested objects
t/perf: add pack-objects filter and path-walk benchmark
pack-objects: pass --objects with --path-walk
t5620: make test work with path-walk var
The "name" argument in git_connect() and related functions has been
converted to a "service" enum to improve type safety and clarify its
purpose.
* jk/connect-service-enum:
transport-helper: fix typo in BUG() message
connect: use "service" enum for "name" argument
git describe --contains acts as a wrapper around git name-rev. When
operating with --contains and --all, the --match and --exclude patterns
are not properly forwarded to name-rev as --exclude and --refs options.
This results in the command silently discarding match and exclude
requests from the user when operating in --all mode.
We could check and die() if the user provides --contains, --all, and
--match/--exclude. However, its also straight forward to just pass the
filters down to git name-rev.
Notice that the documentation for --match and --exclude mention the
--all mode. It explains that they operate on refs with the prefix
refs/tags, and additionally refs/heads and refs/remotes when using
--all.
Fix the describe logic to pass the patterns down with the appropriate
prefixes when --all is provided. This fixes the support to match the
documented behavior.
Add tests to check that this works as expected.
Reported-by: Tuomas Ahola <taahol@utu.fi>
Signed-off-by: Jacob Keller <jacob.keller@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
When resolving a delta whose result has children of its own,
index-pack adds the result to work_head, accounts its data in
base_cache_used, and calls prune_base_data(). It then immediately frees
that same data.
This bypasses the existing delta base cache policy and can force later
descendants to reconstruct the queued base again. Let the existing
delta_base_cache_limit pruning policy decide whether to keep or evict
the data instead.
This does not add a new cache or increase the cache limit. The object
data is already accounted in base_cache_used before prune_base_data()
runs, and the existing pruning and base cleanup paths still release it.
On a quiet Ubuntu 24.04 VM with 16 vCPUs, 32 GiB RAM, and local SSD,
direct index-pack timings on single-pack Linux fixtures improved as
follows:
linux blobless: 69.17s -> 57.98s (16.2% faster), RSS flat
linux full: 280.72s -> 236.32s (15.8% faster), RSS +1.9%
Five-repeat medians on public repositories also improved:
git.git: 12.31s -> 10.70s (13.1% faster)
libgit2: 3.35s -> 2.88s (14.0% faster)
redis: 6.52s -> 5.64s (13.5% faster)
cpython: 33.02s -> 31.44s (4.8% faster)
The standard p5302 perf test on a smaller git.git fixture was neutral:
5302.9 index-pack default threads:
11.21(38.07+1.33) -> 11.16(37.90+1.31), -0.4%
t/t5302-pack-index.sh passed, and GitGitGadget's linux-leaks CI also
exercised that test under SANITIZE=leak.
Signed-off-by: Arijit Banerjee <arijit@effectiveailabs.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The function `odb_source_loose_has_object()` checks whether a specific
object exists as a loose object on disk by using lstat(3p). This
interface is somewhat redundant, as we typically check for object
existence in a generic way via `odb_source_read_object_info()`.
In fact, these two calls are redundant in case the latter is called in a
specific way: when called without an object info request and without the
`OBJECT_INFO_QUICK` flag, then we will end up doing the same call to
lstat(3p) in `read_object_info_from_path()`.
Drop the function and adapt callers to instead use the generic
interface so that its calling conventions align with that of other
sources.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Move `odb_source_loose_count_objects()` and its associated helpers from
"object-file.c" into "odb/source-loose.c" and wire it up as the
`count_objects()` callback of the loose source.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Move `odb_source_loose_for_each_object()` and its associated helpers
from "object-file.c" into "odb/source-loose.c" and wire it up as the
`for_each_object()` callback of the loose source.
Again, as in the preceding commit, we are forced to expose a couple of
functions from "object-file.c" that are now used by both subsystems.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git cat-file --batch" learns an in-line command "mailmap"
that lets the user toggle use of mailmap.
* sa/cat-file-batch-mailmap-switch:
cat-file: add mailmap subcommand to --batch-command
The fsmonitor daemon has been implemented for Linux.
* pt/fsmonitor-linux:
fsmonitor: convert shown khash to strset in do_handle_client
fsmonitor: add tests for Linux
fsmonitor: add timeout to daemon stop command
fsmonitor: close inherited file descriptors and detach in daemon
run-command: add close_fd_above_stderr option
fsmonitor: implement filesystem change listener for Linux
fsmonitor: rename fsm-settings-darwin.c to fsm-settings-unix.c
fsmonitor: rename fsm-ipc-darwin.c to fsm-ipc-unix.c
fsmonitor: use pthread_cond_timedwait for cookie wait
compat/win32: add pthread_cond_timedwait
fsmonitor: fix hashmap memory leak in fsmonitor_run_daemon
fsmonitor: fix khash memory leak in do_handle_client
t9210, t9211: disable GIT_TEST_SPLIT_INDEX for scalar clone tests
"git bisect" now uses the selected terms (e.g., old/new) more
consistently in its output.
* jr/bisect-custom-terms-in-output:
rev-parse: use selected alternate terms to look up refs
bisect: print bisect terms in single quotes
bisect: use selected alternate terms in status output
Replace `free_commit_list` with `commit_list_free`. The former was
deprecated in 9f18d089 (commit: rename `free_commit_list()` to conform
to coding guidelines, 2026-01-15).
This allows us to remove all the deprecated functions in the
next commit:
• `copy_commit_list`
• `reverse_commit_list`
• `free_commit_list`
Acked-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Kristoffer Haugsbakk <code@khaugsbakk.name>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
enumerate_and_traverse_cruft_objects() initializes a rev_info on the
stack but never calls release_revisions() afterwards. This is not
visible on master but becomes a leak once the revision walking
machinery uses dynamically allocated structures.
Add the missing release_revisions() call.
Signed-off-by: Kristofer Karlsson <krka@spotify.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Many uses of the_repository has been updated to use a more
appropriate struct repository instance in setup.c codepath.
* ps/setup-wo-the-repository:
setup: stop using `the_repository` in `init_db()`
setup: stop using `the_repository` in `create_reference_database()`
setup: stop using `the_repository` in `initialize_repository_version()`
setup: stop using `the_repository` in `check_repository_format()`
setup: stop using `the_repository` in `upgrade_repository_format()`
setup: stop using `the_repository` in `setup_git_directory()`
setup: stop using `the_repository` in `setup_git_directory_gently()`
setup: stop using `the_repository` in `setup_git_env()`
setup: stop using `the_repository` in `set_git_work_tree()`
setup: stop using `the_repository` in `setup_work_tree()`
setup: stop using `the_repository` in `enter_repo()`
setup: stop using `the_repository` in `verify_non_filename()`
setup: stop using `the_repository` in `verify_filename()`
setup: stop using `the_repository` in `path_inside_repo()`
setup: stop using `the_repository` in `prefix_path()`
setup: stop using `the_repository` in `is_inside_work_tree()`
setup: stop using `the_repository` in `is_inside_git_dir()`
setup: replace use of `the_repository` in static functions
The repacking code has been refactored and compaction of MIDX layers
have been implemented, and incremental strategy that does not require
all-into-one repacking has been introduced.
* tb/incremental-midx-part-3.3:
repack: allow `--write-midx=incremental` without `--geometric`
repack: introduce `--write-midx=incremental`
repack: implement incremental MIDX repacking
packfile: ensure `close_pack_revindex()` frees in-memory revindex
builtin/repack.c: convert `--write-midx` to an `OPT_CALLBACK`
repack-geometry: prepare for incremental MIDX repacking
repack-midx: extract `repack_fill_midx_stdin_packs()`
repack-midx: factor out `repack_prepare_midx_command()`
midx: expose `midx_layer_contains_pack()`
repack: track the ODB source via existing_packs
midx: support custom `--base` for incremental MIDX writes
midx: introduce `--no-write-chain-file` for incremental MIDX writes
midx: use `strvec` for `keep_hashes`
midx: build `keep_hashes` array in order
midx: use `strset` for retained MIDX files
midx-write: handle noop writes when converting incremental chains
The negotiation tip options in "git fetch" have been reworked to
allow requiring certain refs to be sent as "have" lines, and to
restrict negotiation to a specific set of refs.
* ds/fetch-negotiation-options:
send-pack: pass negotiation config in push
remote: add remote.*.negotiationInclude config
fetch: add --negotiation-include option for negotiation
negotiator: add have_sent() interface
remote: add remote.*.negotiationRestrict config
transport: rename negotiation_tips
fetch: add --negotiation-restrict option
t5516: fix test order flakiness
The logic to determine that branches in an octopus merge are
independent has been optimized.
* kk/merge-octopus-optim:
merge: use repo_in_merge_bases for octopus up-to-date check
In a lazy clone, "git cherry" and "git grep" often fetch necessary
blob objects one by one from promisor remotes. It has been corrected
to collect necessary object names and fetch them in bulk to gain
reasonable performance.
* en/batch-prefetch:
grep: prefetch necessary blobs
builtin/log: prefetch necessary blobs for `git cherry`
patch-ids.h: add missing trailing parenthesis in documentation comment
promisor-remote: document caller filtering contract
Teach update_some() to handle sparse directory entries at the tree
level rather than expanding the entire sparse index. When iterating a
source tree during checkout/restore operations:
- If a directory matches a sparse directory entry with the same OID,
skip it entirely (no change needed).
- If the OID differs and we are in non-overlay mode (e.g., restore
--staged), update the sparse directory entry's OID in place. This
is semantically correct because non-overlay mode removes paths not
in the source tree anyway.
- In overlay mode (e.g., checkout <tree> -- .), fall through to
recursive descent so individual file entries are preserved
correctly.
Also switch from index_name_pos() to index_name_pos_sparse() for
individual file lookups to avoid triggering ensure_full_index() when
the file is already individually tracked in the index.
Update the test expectation in t1092 to assert that 'restore --staged'
no longer expands the sparse index.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Before a8cc594333 (hooks: fix an obscure TOCTOU "did we just run a
hook?" race, 2022-03-07), when receive.denyCurrentBranch is set to
updateInstead, only one of push_to_checkout() or push_to_deploy()
was called. That commit changed to always call push_to_checkout(),
and then to call push_to_deploy() if push_to_checkout() didn't run
anything.
This change didn't take into account that push_to_checkout() had a
side effect of modifying env, and that modified env broke updating
the worktree in push_to_deploy() if core.worktree was configured.
To fix this, only mutate the environment used inside
push_to_commit(), rather than the environment that might later be
passed to push_to_deploy().
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
`git stash -p` prepares the interactive selection by creating a
temporary index at HEAD, switching `GIT_INDEX_FILE` to it, and then
running the `add -p` machinery.
That temporary index was created by running `git read-tree HEAD`. The
resulting index had no useful cached stat data or fsmonitor-valid bits
from the real index. When `run_add_p()` refreshed that temporary index
before showing the first prompt, it could end up lstat(2)-ing every
tracked file, even in a repository where `git diff` and `git restore -p`
can use fsmonitor to avoid that work.
Create the temporary index in-process instead. Use `unpack_trees()` to
reset the real index contents to HEAD while writing the result to the
temporary index path. For paths whose index entries already match HEAD,
`oneway_merge()` reuses the existing cache entries, preserving their
cached stat data and `CE_FSMONITOR_VALID` state.
This makes the refresh performed by `run_add_p()` behave like the one
used by `git restore -p`: unchanged paths can be skipped via fsmonitor
instead of being scanned again.
In a 206k file repository with `core.fsmonitor` enabled and a one-line
change in one file, time to first prompt dropped from 34.774 seconds to
0.659 seconds. The new perf test file demonstrates similar improvements,
with maen times for without- and with-fsmonitor cases dropping from 6.90
and 6.83 seconds to 0.55 and 0.28 seconds, respectively.
Signed-off-by: Adam Johnson <me@adamj.eu>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The --filter=sparse:<oid> option to 'git pack-objects' allows focusing
an object set to a sparse-checkout definition. This reduces the set of
matching blobs while retaining all reachable trees. No server currently
supports fetching with this filter because it is expensive to compute
and reachability bitmaps do not help without a significant effort to
extend the bitmap feature to store bitmaps for each supported sparse-
checkout definition.
Without focusing on serving fetches and clones with these filters, there
are still benefits that could be realized by making this faster. With
the sparse index, it's more realistic now than ever to be able to
operate a local clone that was bootstrapped by a packfile created with
a sparse filter, because the missing trees are not needed to move a
sparse-checkout from one commit to another or to view the history of any
path in scope. Such clones could perhaps be bootstrapped by partial
bundles.
Previously, constructing these sparse packs has been incredibly
computationally inefficient. The revision walk that explores which
objects are in scope spends a lot of time checking each object to see if
it matches the sparse-checkout patterns, causing quadratic behavior
(number of objects times number of sparse-checkout patterns). This
improves somewhat when using cone-mode sparse-checkout patterns that can
use hashtables and prefix matches to determine containment. However, the
check per object is still too expensive for most cases.
This is where the path-walk feature comes in. We can proceed as normal
by placing objects in bins by path and _then_ check a group of objects
all at once. Since sparse:<oid> only restricts blobs, the path-walk must
include all reachable trees while using the cone-mode patterns to skip
blobs at paths outside the sparse scope. This establishes a baseline for
a potential future "treesparse:<oid>" filter that would also restrict
trees, but introducing such a new filter is deferred to a later change.
The implementation here is focused around loading the sparse-checkout
patterns from the provided object ID and checking that the patterns are
indeed cone-mode patterns. We can then load the correct pattern list
into the path walk context and use the logic that already exists from
bff4555767 (backfill: add --sparse option, 2025-02-03), though that
feature loads sparse-checkout patterns from the worktree's local
settings and also restricts tree objects. We use a combination of errors
and warnings to signal problems during this load. The difference is that
errors are likely fatal for the non-path-walk version while the warnings
are probably just implementation details for the path-walk version and
the 'git pack-objects' command can fall back to the revision walk
version.
Now that the SEEN flag is deferred until after pattern checks (from the
previous commit), handle the case where a tree with a shared OID appears
at both an out-of-cone and in-cone path. When trees are not being pruned
(pl_sparse_trees == 0), the path-walk re-walks the tree at the in-cone
path so that in-cone blobs within it are discovered. The new tests in
t5317 and t6601 demonstrate this behavior and would fail without these
changes.
The performance test p5315 shows the impact of this change when using
sparse filters:
Test HEAD~1 HEAD
----------------------------------------------------------------------
5315.10: repack (sparse:oid) 77.98 77.47 -0.7%
5315.11: repack size (sparse:oid) 187.5M 187.4M -0.0%
5315.12: repack (sparse:oid, --path-walk) 77.91 31.41 -59.7%
5315.13: repack size (sparse:oid, --path-walk) 187.5M 161.1M -14.1%
These performance tests were run on the Git repository. The --path-walk
feature shows meaningful space savings (14% smaller for sparse packs)
and dramatic time savings (60% faster) by leveraging the path-walk's
ability to skip blobs outside the sparse scope.
Co-authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Taylor Blaue <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The path-walk API prunes trees and blobs when a sparse-checkout pattern
list is provided, which is the correct behavior for 'git backfill
--sparse' since it only needs to fill in objects at paths within the
sparse cone.
However, a future change will use the path-walk API with a sparse:<oid>
filter that restricts only blobs while retaining all reachable trees.
To support both behaviors, add a 'pl_sparse_trees' flag to
path_walk_info. When set (as in 'git backfill --sparse' and the
--stdin-pl test helper mode), the sparse patterns prune both trees and
blobs. When unset, only blobs are filtered and all trees are walked and
reported.
Additionally, move the SEEN flag assignment in add_tree_entries() to
after the sparse pattern and pathspec checks. Previously, SEEN was set
immediately upon discovering an object, before checking whether its path
matched the sparse patterns. When the same object ID appeared at
multiple paths (e.g. sibling directories with identical contents), the
first path to be visited would mark the object as SEEN. If that path was
outside the sparse cone, the object would be skipped there but also
never discovered at its in-cone path.
By deferring the SEEN flag until after the checks pass, objects that are
skipped due to sparse filtering remain discoverable at other paths where
they may be in scope.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Extend the path-walk API to handle the 'blob:limit=<size>' object
filter natively. This filter omits blobs whose size is equal to or
greater than the given limit, matching the semantics used by the
list-objects-filter machinery.
When revs->filter.choice is LOFC_BLOB_LIMIT, the prepare_filters()
method stores the limit value in info->blob_limit and clears the filter
from revs. If the limit is zero, this degenerates to blob:none (all
blobs excluded), so info->blobs is set to 0 instead.
During walk_path(), blob batches are filtered before being delivered to
the callback: each blob's size is checked via odb_read_object_info(),
and only blobs strictly smaller than the limit are included. Blobs whose
size cannot be determined (e.g. missing in a partial clone) are
conservatively included, matching the existing filter behavior. Empty
batches after filtering are skipped entirely.
The check for inclusion in the path batch looks a little strange at
first glance. We use odb_read_object_info() to read the object's size.
Based on all of the assumptions to this point, this _should_ return
OBJ_BLOB. Since we are focused on the size filter, we use a
short-circuited OR (||) to skip the size check if that method returns a
different object type.
Notice that this inspection of object sizes requires the content to be
present in the repository. The odb_read_object_info() call will download
a missing blob on-demand. This means that the use of the path-walk API
within 'git backfill' would not operate nicely with this filter type.
The intention of that command is to download missing blobs in batches.
Downloading objects one-by-one would go against the point. Update the
validation in 'git backfill' to add its own compatibility check on top
of path_walk_filter_compatible().
Add tests for blob:limit=0 (equivalent to blob:none) and blob:limit=3
(which exercises partial filtering within a batch where some blobs are
kept and others are excluded).
Co-authored-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The 'git backfill' command uses the path-walk API in a critical way: it
uses the objects output from the command to find the batches of missing
objects that should be requested from the server. Unlike 'git
pack-objects', we cannot fall back to another mechanism.
The previous change added the path_walk_filter_compatible() method that
we can reuse here. Use it during argument validation in cmd_backfill().
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>