The wildmatch code had a corner case bug that mistakenly makes
"foo**/bar" match with "foobar", which has been corrected.
* jk/match-pathname-fix:
match_pathname(): give fnmatch one char of prefix context
match_pathname(): reorder prefix-match check
"git sparse-checkout" subcommand learned a new "clean" action to
prune otherwise unused working-tree files that are outside the
areas of interest.
* ds/sparse-checkout-clean:
sparse-index: improve advice message instructions
t: expand tests around sparse merges and clean
sparse-index: point users to new 'clean' action
sparse-checkout: add --verbose option to 'clean'
dir: add generic "walk all files" helper
sparse-checkout: match some 'clean' behavior
sparse-checkout: add basics of 'clean' command
sparse-checkout: remove use of the_repository
In match_pathname(), which we use for matching .gitignore and
.gitattribute patterns, we are comparing paths with fnmatch patterns
(actually our extended wildmatch, which will be important). There's an
extra optimization there: we pre-compute the number of non-wildcard
characters at the beginning of the pattern and do an fspathncmp() on
that prefix.
That lets us avoid fnmatch entirely on patterns without wildcards, and
shrinks the amount of work we hand off to fnmatch. For a pattern like
"foo*.txt" and a path "foobar.txt", we'd cut away the matching "foo"
prefix and just pass "*.txt" and "bar.txt" to fnmatch().
But this misses a subtle corner case. In fnmatch(), we'll think
"bar.txt" is the start of the path, but it's not. This doesn't matter
for the pattern above, but consider the wildmatch pattern "foo**/bar"
and the path "foobar". These two should not match, because there is no
file named "bar", and the "**" applies only to the containing directory
name. But after removing the "foo" prefix, fnmatch will get "**/bar" and
"bar", which it does consider a match, because "**/" can match zero
directories.
We can solve this by giving fnmatch a bit more context. As long as it
has one byte of the matched prefix, then it will know that "bar" is not
the start of the path. In this example it would get "o**/bar" and
"obar", and realize that they cannot match.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
As an optimization, we use fspathncmp() to match a prefix of the pattern
that does not contain any wildcards, and then pass the remainder to
fnmatch(). If it has matched the whole thing, we can return early.
Let's shift this early-return check to before we tweak the pattern and
name strings. That will gives us more flexibility with that tweaking.
It might also save a few instructions, but I couldn't measure any
improvement in doing so (and I wouldn't be surprised if an optimizing
compiler could figure that out itself).
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The varint subsystem currently uses implicit widths for integers. On the
one hand we use `uintmax_t` for the actual value. On the other hand, we
use `int` for the length of the encoded varint.
Both of these have known maximum values, as we only support at most 16
bytes when encoding varints. Thus, we know that we won't ever exceed
`uint64_t` for the actual value and `uint8_t` for the prefix length.
Refactor the code to use explicit widths. Besides making the logic
platform-independent, it also makes our life a bit easier in the next
commit, where we reimplement "varint.c" in Rust.
Suggested-by: Ezekiel Newren <ezekielnewren@gmail.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There is sometimes a need to visit every file within a directory,
recursively. The main example is remove_dir_recursively(), though it has
some extra flags that make it want to iterate over paths in a custom
way. There is also the fill_directory() approach but that involves an
index and a pathspec.
This change adds a new for_each_file_in_dir() method that will be
helpful in the next change.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The within_depth() function is used to check whether pathspecs limited
by a max-depth parameter are acceptable. It takes a path to check, a
maximum depth, and a "base" depth. It counts the components in the
path (by counting slashes), adds them to the base, and compares them to
the maximum.
However, if the base does not have any slashes at all, we always return
`true`. If the base depth is 0, then this is correct; no matter what the
maximum is, we are always within it. However, if the base depth is
greater than 0, then we might return an erroneous result.
This ends up not causing any user-visible bugs in the current code. The
call sites in dir.c always pass a base depth of 0, so are unaffected.
But tree_entry_interesting() uses this function differently: it will
pass the prefix of the current entry, along with a `1` if the entry is a
directory, in essence checking whether items inside the entry would be
of interest. It turns out not to make a difference in behavior, but the
reasoning is complex.
Given a tree like:
file
a/file
a/b/file
walking the tree and calling tree_entry_interesting() will yield the
following results:
(with max_depth=0):
file: yes
a: yes
a/file: no
a/b: no
(with max_depth=1):
file: yes
a: yes
a/file: yes
a/b: no
So we have inconsistent behavior in considering directories interesting.
If they are at the edge of our depth but at the root, we will recurse
into them, but then find all of their entries uninteresting (e.g., in
the first case, we will look at "a" but find "a/*" uninteresting). But
if they are at the edge of our depth and not at the root, then we will
not recurse (in the second example, we do not even bother entering
"a/b").
This turns out not to matter because the only caller which uses
max-depth pathspecs is cmd_grep(), which only cares about blob entries.
From its perspective, it is exactly the same to not recurse into a
subtree, or to recurse and find that it contains no matching entries.
Not recursing is merely an optimization.
It is debatable whether tree_entry_interesting() should consider such an
entry interesting. The only caller does not care if it sees the tree
itself, and can benefit from the optimization. But if we add a
"max-depth" limiter to regular diffs, then a diff with
DIFF_OPT_TREE_IN_RECURSIVE would probably want to show the tree itself,
but not what it contains.
This patch just fixes within_depth(), which means we consider such
entries uninteresting (and makes the current caller happy). If we want
to change that in the future, then this fix is still the correct first
step, as the current behavior is simply inconsistent.
This has the effect the function tree_entry_interesting() now behaves
like following on the first example:
(with max_depth=0):
file: yes
a: no
a/file: no
a/b: no
Meaning we won't step in "a/" no more to realize all "a/*" entries are
uninterested, but we stop at the tree entry itself.
Based-on-patch-by: Jeff King <peff@peff.net>
Signed-off-by: Toon Claes <toon@iotcl.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In 036876a106 (config: hide functions using `the_repository` by
default, 2024-08-13) we have moved around a bunch of functions in the
config subsystem that depend on `the_repository`. Those function have
been converted into mere wrappers around their equivalent function that
takes in a repository as parameter, and the intent was that we'll
eventually remove those wrappers to make the dependency on the global
repository variable explicit at the callsite.
Follow through with that intent and remove `git_config_set_in_file()`.
All callsites are adjusted so that they use
`repo_config_set_in_file(the_repository, ...)` instead. While some
callsites might already have a repository available, this mechanical
conversion is the exact same as the current situation and thus cannot
cause any regression. Those sites should eventually be cleaned up in a
later patch series.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Code clean-up around object access API.
* ps/object-store:
odb: rename `read_object_with_reference()`
odb: rename `pretend_object_file()`
odb: rename `has_object()`
odb: rename `repo_read_object_file()`
odb: rename `oid_object_info()`
odb: trivial refactorings to get rid of `the_repository`
odb: get rid of `the_repository` when handling submodule sources
odb: get rid of `the_repository` when handling the primary source
odb: get rid of `the_repository` in `for_each()` functions
odb: get rid of `the_repository` when handling alternates
odb: get rid of `the_repository` in `odb_mkstemp()`
odb: get rid of `the_repository` in `assert_oid_type()`
odb: get rid of `the_repository` in `find_odb()`
odb: introduce parent pointers
object-store: rename files to "odb.{c,h}"
object-store: rename `object_directory` to `odb_source`
object-store: rename `raw_object_store` to `object_database`
Rename `repo_read_object_file()` to `odb_read_object()` to match other
functions related to the object database and our modern coding
guidelines.
Introduce a compatibility wrapper so that any in-flight topics will
continue to compile.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
A following change will add support for pathspecs to the git diff
--no-index command. This mode of git diff does not load any repository.
Add a new PATHSPEC_NO_REPOSITORY flag indicating that we're parsing
pathspecs without a repository.
Both PATHSPEC_ATTR and PATHSPEC_FROMTOP require a repository to
function. Thus, verify that both of these are set in magic_mask to
ensure they won't be accepted when PATHSPEC_NO_REPOSITORY is set.
Check PATHSPEC_NO_REPOSITORY when warning about paths outside the
directory tree. When the flag is set, do not look for a git repository
when generating the warning message.
Finally, add a BUG in match_pathspec_item if the istate is NULL but the
pathspec has PATHSPEC_ATTR set. Callers which support PATHSPEC_ATTR
should always pass a valid istate, and callers which don't pass a valid
istate should have set PATHSPEC_ATTR in the magic_mask field to disable
support for attribute-based pathspecs.
Signed-off-by: Jacob Keller <jacob.keller@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The do_match_pathspec() function has the DO_MATCH_LEADING_PATHSPEC
option to allow pathspecs to match when matching "src" against a
pathspec like "src/path/...". This support is not exposed by
match_pathspec, and the internal flags to do_match_pathspec are not
exposed outside of dir.c
The upcoming support for pathspecs in git diff --no-index need the
LEADING matching behavior when iterating down through a directory with
readdir.
We could try to expose the match_pathspec_with_flags to the public API.
However, DO_MATCH_EXCLUDES really shouldn't be public, and its a bit
weird to only have a few of the flags become public.
Instead, add match_leading_pathspec() as a function which sets both
DO_MATCH_DIRECTORY and DO_MATCH_LEADING_PATHSPEC when is_dir is true.
This will be used in a following change to support pathspec matching in
git diff --no-index.
Signed-off-by: Jacob Keller <jacob.keller@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
"git add 'f?o'" did not add 'foo' if 'f?o', an unusual pathname,
also existed on the working tree, which has been corrected.
* kj/glob-path-with-special-char:
dir.c: literal match with wildcard in pathspec should still glob
When a path with wildcard characters, e.g. 'f*o', exists in the
working tree, "git add -- 'f*o'" stops after happily finding
that there is 'f*o' and adding it to the index, without
realizing there may be other paths, e.g. 'foooo', that may match
the given pathspec.
This is because dir.c:do_match_pathspec() disables further
matches with pathspec when it finds an exact match.
Reported-by: piotrsiupa <piotrsiupa@gmail.com>
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: K Jayatheerth <jayatheerthkulkarni2005@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We carry declarations for a couple of functions in "object-store.h" that
are not defined in "object-store.c", but in a different subsystem. Move
these declarations to the respective headers whose matching code files
carry the corresponding definition.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The "object-store-ll.h" header has been introduced to keep transitive
header dependendcies and compile times at bay. Now that we have created
a new "object-store.c" file though we can easily move the last remaining
additional bit of "object-store.h", the `odb_path_map`, out of the
header.
Do so. As the "object-store.h" header is now equivalent to its low-level
alternative we drop the latter and inline it into the former.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `safe_create_leading_directories()` function and its relatives are
located in "object-file.c", which is not a good fit as they provide
generic functionality not related to objects at all. Move them into
"path.c", which already hosts `safe_create_dir()` and its relative
`safe_create_dir_in_gitdir()`.
"path.c" is free of `the_repository`, but the moved functions depend on
`the_repository` to read the "core.sharedRepository" config. Adapt the
function signature to accept a repository as argument to fix the issue
and adjust callers accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `null_oid()` function returns the object ID that only consists of
zeroes. Naturally, this ID also depends on the hash algorithm used, as
the number of zeroes is different between SHA1 and SHA256. Consequently,
the function returns the hash-algorithm-specific null object ID.
This is currently done by depending on `the_hash_algo`, which implicitly
makes us depend on `the_repository`. Refactor the function to instead
pass in the hash algorithm for which we want to retrieve the null object
ID. Adapt callsites accordingly by passing in `the_repository`, thus
bubbling up the dependency on that global variable by one layer.
There are a couple of trivial exceptions for subsystems that already got
rid of `the_repository`. These subsystems instead use the repository
that is available via the calling context:
- "builtin/grep.c"
- "grep.c"
- "refs/debug.c"
There are also two non-trivial exceptions:
- "diff-no-index.c": Here we know that we may not have a repository
initialized at all, so we cannot rely on `the_repository`. Instead,
we adapt `diff_no_index()` to get a `struct git_hash_algo` as
parameter. The only caller is located in "builtin/diff.c", where we
know to call `repo_set_hash_algo()` in case we're running outside of
a Git repository. Consequently, it is fine to continue passing
`the_repository->hash_algo` even in this case.
- "builtin/ls-files.c": There is an in-flight patch series that drops
`USE_THE_REPOSITORY_VARIABLE` in this file, which causes a semantic
conflict because we use `null_oid()` in `show_submodule()`. The
value is passed to `repo_submodule_init()`, which may use the object
ID to resolve a tree-ish in the superproject from which we want to
read the submodule config. As such, the object ID should refer to an
object in the superproject, and consequently we need to use its hash
algorithm.
This means that we could in theory just not bother about this edge
case at all and just use `the_repository` in "diff-no-index.c". But
doing so would feel misdesigned.
Remove the `USE_THE_REPOSITORY_VARIABLE` preprocessor define in
"hash.c".
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The path.[ch] API takes an explicit repository parameter passed
throughout the callchain, instead of relying on the_repository
singleton instance.
* ps/path-sans-the-repository:
path: adjust last remaining users of `the_repository`
environment: move access to "core.sharedRepository" into repo settings
environment: move access to "core.hooksPath" into repo settings
repo-settings: introduce function to clear struct
path: drop `git_path()` in favor of `repo_git_path()`
rerere: let `rerere_path()` write paths into a caller-provided buffer
path: drop `git_common_path()` in favor of `repo_common_path()`
worktree: return allocated string from `get_worktree_git_dir()`
path: drop `git_path_buf()` in favor of `repo_git_path_replace()`
path: drop `git_pathdup()` in favor of `repo_git_path()`
path: drop unused `strbuf_git_path()` function
path: refactor `repo_submodule_path()` family of functions
submodule: refactor `submodule_to_gitdir()` to accept a repo
path: refactor `repo_worktree_path()` family of functions
path: refactor `repo_git_path()` family of functions
path: refactor `repo_common_path()` family of functions
Remove `git_pathdup()` in favor of `repo_git_path()`. The latter does
essentially the same, with the only exception that it does not rely on
`the_repository` but takes the repo as separate parameter.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
One way to significantly reduce the cost of a Git clone and later fetches is
to use a blobless partial clone and combine that with a sparse-checkout that
reduces the paths that need to be populated in the working directory. Not
only does this reduce the cost of clones and fetches, the sparse-checkout
reduces the number of objects needed to download from a promisor remote.
However, history investigations can be expensive as computing blob diffs
will trigger promisor remote requests for one object at a time. This can be
avoided by downloading the blobs needed for the given sparse-checkout using
'git backfill' and its new '--sparse' mode, at a time that the user is
willing to pay that extra cost.
Note that this is distinctly different from the '--filter=sparse:<oid>'
option, as this assumes that the partial clone has all reachable trees and
we are using client-side logic to avoid downloading blobs outside of the
sparse-checkout cone. This avoids the server-side cost of walking trees
while also achieving a similar goal. It also downloads in batches based on
similar path names, presenting a resumable download if things are
interrupted.
This augments the path-walk API to have a possibly-NULL 'pl' member that may
point to a 'struct pattern_list'. This could be more general than the
sparse-checkout definition at HEAD, but 'git backfill --sparse' is currently
the only consumer.
Be sure to test this in both cone mode and not cone mode. Cone mode has the
benefit that the path-walk can skip certain paths once they would expand
beyond the sparse-checkout. Non-cone mode can describe the included files
using both positive and negative patterns, which changes the possible return
values of path_matches_pattern_list(). Test both kinds of matches for
increased coverage.
To test this, we can create a blobless sparse clone, expand the
sparse-checkout slightly, and then run 'git backfill --sparse' to see
how much data is downloaded. The general steps are
1. git clone --filter=blob:none --sparse <url>
2. git sparse-checkout set <dir1> ... <dirN>
3. git backfill --sparse
For the Git repository with the 'builtin' directory in the
sparse-checkout, we get these results for various batch sizes:
| Batch Size | Pack Count | Pack Size | Time |
|-----------------|------------|-----------|-------|
| (Initial clone) | 3 | 110 MB | |
| 10K | 12 | 192 MB | 17.2s |
| 15K | 9 | 192 MB | 15.5s |
| 20K | 8 | 192 MB | 15.5s |
| 25K | 7 | 192 MB | 14.7s |
This case matters less because a full clone of the Git repository from
GitHub is currently at 277 MB.
Using a copy of the Linux repository with the 'kernel/' directory in the
sparse-checkout, we get these results:
| Batch Size | Pack Count | Pack Size | Time |
|-----------------|------------|-----------|------|
| (Initial clone) | 2 | 1,876 MB | |
| 10K | 11 | 2,187 MB | 46s |
| 25K | 7 | 2,188 MB | 43s |
| 50K | 5 | 2,194 MB | 44s |
| 100K | 4 | 2,194 MB | 48s |
This case is more meaningful because a full clone of the Linux
repository is currently over 6 GB, so this is a valuable way to download
a fraction of the repository and no longer need network access for all
reachable objects within the sparse-checkout.
Choosing a batch size will depend on a lot of factors, including the
user's network speed or reliability, the repository's file structure,
and how many versions there are of the file within the sparse-checkout
scope. There will not be a one-size-fits-all solution.
Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Mark code units that generate warnings with `-Wsign-compare`. This
allows for a structured approach to get rid of all such warnings over
time in a way that can be easily measured.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We use `repo_config_get_string()` to read "status.showUntrackedFiles"
from the config subsystem. This function allocates the result, but we
never free the result after parsing it.
The value never leaves the scope of the calling function, so refactor it
to instead use `repo_config_get_string_tmp()`, which does not hand over
ownership to the caller.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
There are several cases where we invalidate untracked cache directory
entries where we do not free the underlying data, but reset the number
of entries. This causes us to leak memory because `free_untracked()`
will not iterate over any potential entries which we still had in the
array.
Fix this issue by freeing old entries. The leak is exposed by t7519, but
plugging it alone does not make the whole test suite pass.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In `treat_directory()` we perform some logic to handle ignored and
untracked entries. When populating a directory with entries we first
save the current number of ignored/untracked entries and then populate
new entries at the end of our arrays that keep track of those entries.
When we figure out that all entries have been ignored/are untracked we
then remove this tail of entries from those vectors again. But there is
an off by one error in both paths that causes us to not free the first
ignored and untracked entries, respectively.
Fix these off-by-one errors to plug the resulting leak. While at it,
massage the code a bit to match our modern code style.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The `get_git_work_tree()` function retrieves the path of the work tree
of `the_repository`. Make it accept a `struct repository` such that it
can work on arbitrary repositories and make it part of the repository
subsystem. This reduces our reliance on `the_repository` and clarifies
scope.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
On Windows, the backslash is the directory separator, even if the
forward slash can be used, too, at least since Windows NT.
This means that the paths `a/b` and `a\b` are equivalent, and
`fspathcmp()` needs to be made aware of that fact.
Note that we have to override both `fspathcmp()` and `fspathncmp()`, and
the former cannot be a mere pre-processor constant that transforms calls
to `fspathcmp(a, b)` into `fspathncmp(a, b, (size_t)-1)` because the
function `report_collided_checkout()` in `unpack-trees.c` wants to
assign `list.cmp = fspathcmp`.
Also note that `fspatheq()` does _not_ need to be overridden because it
calls `fspathcmp()` internally.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Use of the `the_repository` variable is deprecated nowadays, and we
slowly but steadily convert the codebase to not use it anymore. Instead,
callers should be passing down the repository to work on via parameters.
It is hard though to prove that a given code unit does not use this
variable anymore. The most trivial case, merely demonstrating that there
is no direct use of `the_repository`, is already a bit of a pain during
code reviews as the reviewer needs to manually verify claims made by the
patch author. The bigger problem though is that we have many interfaces
that implicitly rely on `the_repository`.
Introduce a new `USE_THE_REPOSITORY_VARIABLE` macro that allows code
units to opt into usage of `the_repository`. The intent of this macro is
to demonstrate that a certain code unit does not use this variable
anymore, and to keep it from new dependencies on it in future changes,
be it explicit or implicit
For now, the macro only guards `the_repository` itself as well as
`the_hash_algo`. There are many more known interfaces where we have an
implicit dependency on `the_repository`, but those are not guarded at
the current point in time. Over time though, we should start to add
guards as required (or even better, just remove them).
Define the macro as required in our code units. As expected, most of our
code still relies on the global variable. Nearly all of our builtins
rely on the variable as there is no way yet to pass `the_repository` to
their entry point. For now, declare the macro in "biultin.h" to keep the
required changes at least a little bit more contained.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Both `oidread()` and `oidclr()` use `the_repository` to derive the hash
function that shall be used. Require callers to pass in the hash
algorithm to get rid of this implicit dependency.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Many memory leaks in the sparse-checkout code paths have been
plugged.
* jk/sparse-leakfix:
sparse-checkout: free duplicate hashmap entries
sparse-checkout: free string list after displaying
sparse-checkout: free pattern list in sparse_checkout_list()
sparse-checkout: free sparse_filename after use
sparse-checkout: refactor temporary sparse_checkout_patterns
sparse-checkout: always free "line" strbuf after reading input
sparse-checkout: reuse --stdin buffer when reading patterns
dir.c: always copy input to add_pattern()
dir.c: free removed sparse-pattern hashmap entries
sparse-checkout: clear patterns when init() sees existing sparse file
dir.c: free strings in sparse cone pattern hashmaps
sparse-checkout: pass string literals directly to add_pattern()
sparse-checkout: free string list in write_cone_to_file()
An overly large ".gitignore" files are now rejected silently.
* jk/cap-exclude-file-size:
dir.c: reduce max pattern file size to 100MB
dir.c: skip .gitignore, etc larger than INT_MAX
The add_pattern() function has a subtle and undocumented gotcha: the
pattern string you pass in must remain valid as long as the pattern_list
is in use (and nor do we take ownership of it). This is easy to get
wrong, causing either subtle bugs (because you free or reuse the string
buffer) or leaks (because you copy the string, but don't track ownership
separately).
All of this "pattern" code was originally the "exclude" mechanism. So
this _usually_ works OK because you add entries in one of two ways:
1. From the command-line (e.g., "--exclude"), in which case we're
pointing to an argv entry which remains valid for the lifetime of
the program.
2. From a file (e.g., ".gitignore"), in which case we read the whole
file into a buffer, attach it to the pattern_list's "filebuf"
entry, then parse the buffer in-place (adding NULs). The strings
point into the filebuf, which is cleaned up when the whole
pattern_list goes away.
But other code, like sparse-checkout, reads individual lines from stdin
and passes them one by one to add_pattern(), leaking each. We could fix
this by refactoring it to take in the whole buffer at once, like (2)
above, and stuff it in "filebuf". But given how subtle the interface is,
let's just fix it to always copy the string.
That seems at first like we'd be wasting extra memory, but we can
mitigate that:
a. The path_pattern struct already uses a FLEXPTR, since we sometimes
make a copy (when we see "foo/", we strip off the trailing slash,
requiring a modifiable copy of the string).
Since we'll now always embed the string inside the struct, we can
switch to the regular FLEX_ARRAY pattern, saving us 8 bytes of
pointer. So patterns with a trailing slash and ones under 8 bytes
actually get smaller.
b. Now that we don't need the original string to hang around, we can
get rid of the "filebuf" mechanism entirely, and just free the file
contents after parsing. Since files are the sources we'd expect to
have the largest pattern sets, we should mostly break even on
stuffing the same data into the individual structs.
This patch just adjusts the add_pattern() interface; it doesn't fix any
leaky callers yet.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In a2bc523e1e (dir.c: skip .gitignore, etc larger than INT_MAX,
2024-05-31) we put capped the size of some files whose parsing code and
data structures used ints. Setting the limit to INT_MAX was a natural
spot, since we know the parsing code would misbehave above that.
But it also leaves the possibility of overflow errors when we multiply
that limit to allocate memory. For instance, a file consisting only of
"a\na\n..." could have INT_MAX/2 entries. Allocating an array of
pointers for each would need INT_MAX*4 bytes on a 64-bit system, enough
to overflow a 32-bit int.
So let's give ourselves a bit more safety margin by giving a much
smaller limit. The size 100MB is somewhat arbitrary, but is based on the
similar value for attribute files added by 3c50032ff5 (attr: ignore
overly large gitattributes files, 2022-12-01).
There's no particular reason these have to be the same, but the idea is
that they are in the ballpark of "so huge that nobody would care, but
small enough to avoid malicious overflow". So lacking a better guess, it
makes sense to use the same value. The implementation here doesn't share
the same constant, but we could change that later (or even give it a
runtime config knob, though nobody has complained yet about the
attribute limit).
And likewise, let's add a few tests that exercise the limits, based on
the attr ones. In this case, though, we never read .gitignore from the
index; the blob code is exercised only for sparse filters. So we'll
trigger it that way.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In add_pattern_to_hashsets(), we remove entries from the
recursive_hashmap when adding similar ones to the parent_hashmap. I
won't pretend to understand all of what's going on here, but there's an
obvious leak: whatever we removed from recursive_hashmap is not
referenced anywhere else, and is never free()d.
We can easily fix this by asking the hashmap to return a pointer to the
old entry. This makes t7002 now completely leak-free.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
The pattern_list structs used for cone-mode sparse lookups use a few
extra hashmaps. These store pattern_entry structs, each of which has its
own heap-allocated pattern string. When we clean up the hashmaps, we
free the individual pattern_entry structs, but forget to clean up the
embedded strings, causing memory leaks.
We can fix this by iterating over the hashmaps to free the extra
strings. This reduces the numbers of leaks in t7002 from 22 to 9.
One alternative here would be to make the string a FLEX_ARRAY member of
the pattern_entry. Then there's no extra free() required, and as a bonus
it would be a little more efficient. However, some of the refactoring
gets awkward, as we are often assigning strings allocated by helper
functions. So let's just fix the leak for now, and we can explore bigger
refactoring separately.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
We use add_patterns() to read .gitignore, .git/info/exclude, etc, as
well as other pattern-like files like sparse-checkout. The parser for
these uses an "int" as an index, meaning that files over 2GB will
generally cause signed integer overflow and out-of-bounds access.
This is unlikely to happen in any real files, but we do read .gitignore
files from the tree. A malicious tree could cause an out-of-bounds read
and segfault (we also write NULs over newlines, so in theory it could be
an out-of-bounds write, too, but as we go char-by-char, the first thing
that happens is trying to read a negative 2GB offset).
We could fix the most obvious issue by replacing one "int" with a
"size_t". But there are tons of "int" sprinkled throughout this code for
things like pattern lengths, number of patterns, and so on. Since nobody
would actually want a 2GB .gitignore file, an easy defensive measure is
to just refuse to parse them.
The "int" in question is in add_patterns_from_buffer(), so we could
catch it there. But by putting the checks in its two callers, we can
produce more useful error messages.
Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
In `resolve_gitlink_ref()` we implicitly rely on `the_repository` to
look up the submodule ref store. Now that we can look up submodule ref
stores for arbitrary repositories we can improve this function to
instead accept a repository as parameter for which we want to resolve
the gitlink.
Do so and adjust callers accordingly.
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
* maint-2.44: (41 commits)
Git 2.44.1
Git 2.43.4
Git 2.42.2
Git 2.41.1
Git 2.40.2
Git 2.39.4
fsck: warn about symlink pointing inside a gitdir
core.hooksPath: add some protection while cloning
init.templateDir: consider this config setting protected
clone: prevent hooks from running during a clone
Add a helper function to compare file contents
init: refactor the template directory discovery into its own function
find_hook(): refactor the `STRIP_EXTENSION` logic
clone: when symbolic links collide with directories, keep the latter
entry: report more colliding paths
t5510: verify that D/F confusion cannot lead to an RCE
submodule: require the submodule path to contain directories only
clone_submodule: avoid using `access()` on directories
submodules: submodule paths must not contain symlinks
clone: prevent clashing git dirs when cloning submodule in parallel
...
* maint-2.43: (40 commits)
Git 2.43.4
Git 2.42.2
Git 2.41.1
Git 2.40.2
Git 2.39.4
fsck: warn about symlink pointing inside a gitdir
core.hooksPath: add some protection while cloning
init.templateDir: consider this config setting protected
clone: prevent hooks from running during a clone
Add a helper function to compare file contents
init: refactor the template directory discovery into its own function
find_hook(): refactor the `STRIP_EXTENSION` logic
clone: when symbolic links collide with directories, keep the latter
entry: report more colliding paths
t5510: verify that D/F confusion cannot lead to an RCE
submodule: require the submodule path to contain directories only
clone_submodule: avoid using `access()` on directories
submodules: submodule paths must not contain symlinks
clone: prevent clashing git dirs when cloning submodule in parallel
t7423: add tests for symlinked submodule directories
...
* maint-2.42: (39 commits)
Git 2.42.2
Git 2.41.1
Git 2.40.2
Git 2.39.4
fsck: warn about symlink pointing inside a gitdir
core.hooksPath: add some protection while cloning
init.templateDir: consider this config setting protected
clone: prevent hooks from running during a clone
Add a helper function to compare file contents
init: refactor the template directory discovery into its own function
find_hook(): refactor the `STRIP_EXTENSION` logic
clone: when symbolic links collide with directories, keep the latter
entry: report more colliding paths
t5510: verify that D/F confusion cannot lead to an RCE
submodule: require the submodule path to contain directories only
clone_submodule: avoid using `access()` on directories
submodules: submodule paths must not contain symlinks
clone: prevent clashing git dirs when cloning submodule in parallel
t7423: add tests for symlinked submodule directories
has_dir_name(): do not get confused by characters < '/'
...
* maint-2.41: (38 commits)
Git 2.41.1
Git 2.40.2
Git 2.39.4
fsck: warn about symlink pointing inside a gitdir
core.hooksPath: add some protection while cloning
init.templateDir: consider this config setting protected
clone: prevent hooks from running during a clone
Add a helper function to compare file contents
init: refactor the template directory discovery into its own function
find_hook(): refactor the `STRIP_EXTENSION` logic
clone: when symbolic links collide with directories, keep the latter
entry: report more colliding paths
t5510: verify that D/F confusion cannot lead to an RCE
submodule: require the submodule path to contain directories only
clone_submodule: avoid using `access()` on directories
submodules: submodule paths must not contain symlinks
clone: prevent clashing git dirs when cloning submodule in parallel
t7423: add tests for symlinked submodule directories
has_dir_name(): do not get confused by characters < '/'
docs: document security issues around untrusted .git dirs
...
* maint-2.40: (39 commits)
Git 2.40.2
Git 2.39.4
fsck: warn about symlink pointing inside a gitdir
core.hooksPath: add some protection while cloning
init.templateDir: consider this config setting protected
clone: prevent hooks from running during a clone
Add a helper function to compare file contents
init: refactor the template directory discovery into its own function
find_hook(): refactor the `STRIP_EXTENSION` logic
clone: when symbolic links collide with directories, keep the latter
entry: report more colliding paths
t5510: verify that D/F confusion cannot lead to an RCE
submodule: require the submodule path to contain directories only
clone_submodule: avoid using `access()` on directories
submodules: submodule paths must not contain symlinks
clone: prevent clashing git dirs when cloning submodule in parallel
t7423: add tests for symlinked submodule directories
has_dir_name(): do not get confused by characters < '/'
docs: document security issues around untrusted .git dirs
upload-pack: disable lazy-fetching by default
...
* maint-2.39: (38 commits)
Git 2.39.4
fsck: warn about symlink pointing inside a gitdir
core.hooksPath: add some protection while cloning
init.templateDir: consider this config setting protected
clone: prevent hooks from running during a clone
Add a helper function to compare file contents
init: refactor the template directory discovery into its own function
find_hook(): refactor the `STRIP_EXTENSION` logic
clone: when symbolic links collide with directories, keep the latter
entry: report more colliding paths
t5510: verify that D/F confusion cannot lead to an RCE
submodule: require the submodule path to contain directories only
clone_submodule: avoid using `access()` on directories
submodules: submodule paths must not contain symlinks
clone: prevent clashing git dirs when cloning submodule in parallel
t7423: add tests for symlinked submodule directories
has_dir_name(): do not get confused by characters < '/'
docs: document security issues around untrusted .git dirs
upload-pack: disable lazy-fetching by default
fetch/clone: detect dubious ownership of local repositories
...
In b878579ae7 (clone: report duplicate entries on case-insensitive
filesystems, 2018-08-17) code was added to warn about index entries that
resolve to the same file system entity (usually the cause is a
case-insensitive filesystem).
In Git for Windows, where inodes are not trusted (because of a
performance trade-off, inodes are equal to 0 by default), that check
does not compare inode numbers but the verbatim path.
This logic works well when index entries' paths differ only in case.
However, for file/directory conflicts only the file's path was reported,
leaving the user puzzled with what that path collides.
Let's try ot catch colliding paths even if one path is the prefix of the
other. We do this also in setups where the file system is case-sensitive
because the inode check would not be able to catch those collisions.
While not a complete solution (for example, on macOS, Unicode
normalization could also lead to file/directory conflicts but be missed
by this logic), it is at least another defensive layer on top of what
the previous commits added.
Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
Create a wrapper function for untracked_cache_invalidate_path()
that silently trims a trailing slash, if present, before calling
the wrapped function.
The untracked cache expects to be called with a pathname that
does not contain a trailing slash. This can make it inconvenient
for callers that have a directory path. Lets hide this complexity.
This will be used by a later commit in the FSMonitor code which
may receive directory pathnames from an FSEvent.
Signed-off-by: Jeff Hostetler <jeffhostetler@github.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Remove unused header "#include".
* en/header-cleanup:
treewide: remove unnecessary includes in source files
treewide: add direct includes currently only pulled in transitively
trace2/tr2_tls.h: remove unnecessary include
submodule-config.h: remove unnecessary include
pkt-line.h: remove unnecessary include
line-log.h: remove unnecessary include
http.h: remove unnecessary include
fsmonitor--daemon.h: remove unnecessary includes
blame.h: remove unnecessary includes
archive.h: remove unnecessary include
treewide: remove unnecessary includes in source files
treewide: remove unnecessary includes from header files
Each of these were checked with
gcc -E -I. ${SOURCE_FILE} | grep ${HEADER_FILE}
to ensure that removing the direct inclusion of the header actually
resulted in that header no longer being included at all (i.e. that
no other header pulled it in transitively).
...except for a few cases where we verified that although the header
was brought in transitively, nothing from it was directly used in
that source file. These cases were:
* builtin/credential-cache.c
* builtin/pull.c
* builtin/send-pack.c
Signed-off-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Allow users to limit or exclude files based on file attributes
during git-add and git-stash.
For example, the chromium project would like to use
$ git add . ':(exclude,attr:submodule)'
as submodules are managed by an external tool, forbidding end users
to record changes with "git add". Allowing "git add" to often
records changes that users do not want in their commits.
This commit does not change any attr magic implementation. It is
only adding attr as an allowed pathspec in git-add and git-stash,
which was previously blocked by GUARD_PATHSPEC and a pathspec mask
in parse_pathspec()).
However, we fix a bug in prefix_magic() where attr values were
unintentionally removed. This was triggerable when parse_pathspec()
is called with PATHSPEC_PREFIX_ORIGIN as a flag, which was the case
for git-stash (Bug originally filed here [*])
Furthermore, while other commands hit this code path it did not
result in unexpected behavior because this bug only impacts the
pathspec->items->original field which is NOT used to filter
paths. However, git-stash does use pathspec->items->original when
building args used to call other git commands. (See add_pathspecs()
usage and implementation in stash.c)
It is possible that when the attr pathspec feature was first added
in b0db704652 (pathspec: allow querying for attributes, 2017-03-13),
"PATHSPEC_ATTR" was just unintentionally left out of a few
GUARD_PATHSPEC() invocations.
Later, to get a more user-friendly error message when attr was used
with git-add, PATHSPEC_ATTR was added as a mask to git-add's
invocation of parse_pathspec() 84d938b732 (add: do not accept
pathspec magic 'attr', 2018-09-18). However, this user-friendly
error message was never added for git-stash.
[Reference]
* https://lore.kernel.org/git/CAMmZTi-0QKtj7Q=sbC5qhipGsQxJFOY-Qkk1jfkRYwfF5FcUVg@mail.gmail.com/)
Signed-off-by: Joanna Wang <jojwang@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>