96 Commits

Junio C Hamano
6f23a2b952 Merge branch 'tb/incremental-midx-part-3.2' into seen
Further work on incremental repacking using MIDX/bitmap

* tb/incremental-midx-part-3.2:
  midx: enable reachability bitmaps during MIDX compaction
  midx: implement MIDX compaction
  t/helper/test-read-midx.c: plug memory leak when selecting layer
  midx-write.c: factor fanout layering from `compute_sorted_entries()`
  midx-write.c: enumerate `pack_int_id` values directly
  midx-write.c: extract `fill_pack_from_midx()`
  midx-write.c: introduce `midx_pack_perm()` helper
  git-compat-util.h: introduce `u32_add()`
  midx: do not require packs to be sorted in lexicographic order
  midx-write.c: introduce `struct write_midx_opts`
  midx-write.c: don't use `pack_perm` when assigning `bitmap_pos`
  t/t5319-multi-pack-index.sh: fix copy-and-paste error in t5319.39
  git-multi-pack-index(1): align SYNOPSIS with 'git multi-pack-index -h'
  git-multi-pack-index(1): remove non-existent incompatibility
  builtin/multi-pack-index.c: make '--progress' a common option
  midx: split `get_midx_checksum()` by adding `get_midx_hash()`
  midx: mark `get_midx_checksum()` arguments as const
2025-12-12 22:11:50 +09:00
Junio C Hamano
2f8df0d358 Merge branch 'ps/repack-avoid-noop-midx-rewrite' into jch
Even when there are no changes in the packfiles and no need to
recompute bitmaps, "git repack" recomputed and updated the MIDX
file, which has been corrected.

Comments?

* ps/repack-avoid-noop-midx-rewrite:
  midx-write: skip rewriting MIDX with `--stdin-packs` unless needed
  midx-write: extract function to test whether MIDX needs updating
  midx: fix `BUG()` when getting preferred pack without a reverse index
2025-12-12 15:53:09 +09:00
Patrick Steinhardt
6ce9d558ce midx-write: skip rewriting MIDX with --stdin-packs unless needed
In `write_midx_internal()` we know to skip rewriting the multi-pack
index in case the existing one already covers all packs. This logic does
not know to handle `git multi-pack-index write --stdin-packs` though, so
we end up always rewriting the MIDX in this case even if the MIDX would
not change.

With our default maintenance strategy this isn't really much of a
problem, as git-gc(1) does not use the "--stdin-packs" option. But that
is changing with geometric repacking, where "--stdin-packs" is used to
explicitly select the packfiles that are part of the geometric sequence.

This issue can be demonstrated trivially with a benchmark: executing
`git repack --geometric=2 --write-midx -d` in the Git repository takes
more than 3 seconds only to end up with the same multi-pack index we
already had before.

The logic that decides if we need to rewrite the MIDX only checks
whether the number of packfiles covered will change. That check is of
course too lenient for "--stdin-packs", as it could happen that we want
to cover a different-but-same-size set of packfiles. But there is no
inherent reason why we cannot handle "--stdin-packs".

Improve the logic to not only check for the number of packs, but to also
verify that we are asked to generate a MIDX for the _same_ packs. This
allows us to also skip no-op rewrites for "--stdin-packs".
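
A minimal standalone sketch of that check, assuming both pack-name lists
are sorted with the same comparator (the function and parameter names are
illustrative, not the actual Git helpers):

    /* Illustration only: skip the rewrite iff the existing MIDX covers
     * exactly the requested packs, both by count and by name. */
    #include <string.h>

    static int covers_same_packs(const char **existing, size_t existing_nr,
                                 const char **requested, size_t requested_nr)
    {
        if (existing_nr != requested_nr)
            return 0;
        for (size_t i = 0; i < existing_nr; i++)
            if (strcmp(existing[i], requested[i]))
                return 0;
        return 1;
    }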

Helped-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-12-11 12:09:59 +09:00
Patrick Steinhardt
b3bab9d272 midx-write: extract function to test whether MIDX needs updating
In `write_midx_internal()` we know to skip writing the new multi-pack
index in case it would be the same as the existing one. This logic does
not handle the `--stdin-packs` option yet though, so we end up always
rewriting the MIDX if that option is passed to us.

Extract the logic to decide whether or not to rewrite the MIDX into a
separate function. This will allow us to extend that feature in the next
commit to address the above issue.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-12-11 12:09:58 +09:00
Taylor Blau
1e06a72a0a midx: enable reachability bitmaps during MIDX compaction
Enable callers to generate reachability bitmaps when performing MIDX
layer compaction by combining all existing bitmaps from the compacted
layers.

Note that because of the object/pack ordering described by the
previous commit, the pseudo-pack order for the compacted MIDX is the
same as concatenating the individual pseudo-pack orderings for each
layer in the compaction range.

As a result, the only non-test or documentation change necessary is to
treat all objects as non-preferred during compaction so as not to
disturb the object ordering.

In the future, we may want to adjust which commit(s) receive
reachability bitmaps when compacting multiple .bitmap files into one, or
even generate new bitmaps (e.g., if the references have moved
significantly since the .bitmap was generated). This commit only
implements combining all existing bitmaps in range together in order to
demonstrate and lay the groundwork for more exotic strategies.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-12-07 07:38:11 +09:00
Taylor Blau
98ce832ee6 midx: implement MIDX compaction
When managing a MIDX chain with many layers, it is convenient to combine
a sequence of adjacent layers into a single layer to prevent the chain
from growing too long.

While it is conceptually possible to "compact" a sequence of MIDX layers
together by running "git multi-pack-index write --stdin-packs", there
are a few drawbacks that make this less than desirable:

 - Preserving the MIDX chain is impossible, since there is no way to
   write a MIDX layer that contains objects or packs found in an earlier
   MIDX layer already part of the chain. So callers would have to write
   an entirely new (non-incremental) MIDX containing only the compacted
   layers, discarding all other objects/packs from the MIDX.

 - There is (currently) no way to write a MIDX layer outside of the MIDX
   chain to work around the above, such that the MIDX chain could be
   reassembled substituting the compacted layers with the MIDX that was
   written.

 - The `--stdin-packs` command-line option does not allow us to specify
   the order of packs as they appear in the MIDX. Therefore, even if
   there were workarounds for the previous two challenges, any bitmaps
   belonging to layers which come after the compacted layer(s) would no
   longer be valid.

This commit introduces a way to compact a sequence of adjacent MIDX
layers into a single layer while preserving the MIDX chain, as well as
any bitmap(s) in layers which are newer than the compacted ones.

Implementing MIDX compaction does not require a significant number of
changes to how MIDX layers are written. The main changes are as follows:

 - Instead of calling `fill_packs_from_midx()`, we call a new function
   `fill_packs_from_midx_range()`, which walks backwards along the
   portion of the MIDX chain which we are compacting, and adds packs one
   layer at a time.

   In order to preserve the pseudo-pack order, the concatenated pack
   order is preserved, with the exception of preferred packs which are
   always added first.

 - After adding entries from the set of packs in the compaction range,
   `compute_sorted_entries()` must adjust the `pack_int_id`'s for all
   objects added in each fanout layer to match their original
   `pack_int_id`'s (as opposed to the index at which each pack appears
   in `ctx.info`).

 - When writing out the new 'multi-pack-index-chain' file, discard any
   layers in the compaction range, replacing them with the newly written
   layer, instead of keeping them and placing the new layer at the end
   of the chain.

This ends up being sufficient to implement MIDX compaction in such a way
that preserves bitmaps corresponding to more recent layers in the MIDX
chain.

The tests for MIDX compaction are so far fairly spartan, since the main
interesting behavior here is ensuring that the right packs/objects are
selected from each layer, and that the pack order is preserved despite
whether or not they are sorted in lexicographic order in the original
MIDX chain.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-12-07 07:38:11 +09:00
Taylor Blau
477f9c1b21 midx-write.c: factor fanout layering from compute_sorted_entries()
When computing the set of objects to appear in a MIDX, we use
compute_sorted_entries(), which handles objects from various existing
sources one fanout layer at a time.

The process for computing this set is slightly different during MIDX
compaction, so factor out the existing functionality into its own
routine to prevent `compute_sorted_entries()` from becoming too
difficult to read.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-12-07 07:38:10 +09:00
Taylor Blau
cad0632e13 midx-write.c: enumerate pack_int_id values directly
Our `midx-write.c::fill_packs_from_midx()` function currently enumerates
the range [0, m->num_packs), and then shifts its index variable up by
`m->num_packs_in_base` to produce a valid `pack_int_id`.

Instead, directly enumerate the range:

    [m->num_packs_in_base, m->num_packs_in_base + m->num_packs)

, which are the original pack_int_ids themselves as opposed to the
indexes of those packs relative to the MIDX layer they are contained
within.
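
As a trivial, self-contained sketch of the difference (the helper and
parameter names are illustrative, not taken from the patch):

    #include <stdint.h>

    static void use_pack(uint32_t pack_int_id) { (void)pack_int_id; }

    /* Illustration only: enumerate original pack_int_ids directly instead
     * of shifting a layer-relative index at each use site. */
    static void enumerate_layer_packs(uint32_t num_packs_in_base,
                                      uint32_t num_packs)
    {
        for (uint32_t id = num_packs_in_base;
             id < num_packs_in_base + num_packs; id++)
            use_pack(id);
    }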

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-12-07 07:38:09 +09:00
Taylor Blau
9ced91c83b midx-write.c: extract fill_pack_from_midx()
When filling packs from an existing MIDX, `fill_packs_from_midx()`
handles preparing a MIDX'd pack, and reading out its pack name from the
existing MIDX.

MIDX compaction will want to perform an identical operation, though the
caller will look quite different than `fill_packs_from_midx()`. To
reduce any future code duplication, extract `fill_pack_from_midx()`
from `fill_packs_from_midx()` to prepare to call our new helper function
in a future change.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-12-07 07:38:09 +09:00
Taylor Blau
363485a311 midx-write.c: introduce midx_pack_perm() helper
The `ctx->pack_perm` array can be thought of as a mapping from the
original `pack_int_id` of some given pack to its position in the
`ctx->info` array containing all packs.

Today we can always index into this array with any known `pack_int_id`,
since there is never a `pack_int_id` which is greater than or equal to
the value `ctx->nr`.

That is not necessarily the case with MIDX compaction. For example,
suppose we have a MIDX chain with three layers, each containing three
packs. The base of the MIDX chain will have packs with IDs 0, 1, and 2,
the next layer 3, 4, and 5, and so on. If we are compacting the topmost
two layers, we'll have input `pack_int_id` values in the range [3, 8], but
`ctx->nr` will only be 6.

In that example, if we want to know where the pack whose original
`pack_int_id` value was, say, 7, appears, we would compute
`ctx->pack_perm[7]`, leading to an uninitialized read, since there are
only 6 entries allocated in that array.

To address this, there are a couple of options:

 - We could allocate enough entries in `ctx->pack_perm` to accommodate
   the largest `orig_pack_int_id` value.

 - Or, we could internally shift the input values by the number of packs
   in the base layer of the lower end of the MIDX compaction range.

This patch prepares us to take the latter approach, since it does not
allocate more memory than strictly necessary. (In our above example, the
base of the lower end of the compaction range is the first MIDX layer
(having three packs), so we would end up indexing `ctx->pack_perm[7-3]`,
which is a valid read.)

Note that this patch does not actually implement that approach yet, but
merely performs a behavior-preserving refactoring which will make the
change easier to carry out in the future.
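
As a standalone illustration of the indexing scheme sketched above (the
helper name and parameters are hypothetical, not the actual code):

    /* Hypothetical sketch: translate an original pack_int_id into an index
     * relative to the compaction range by subtracting the number of packs
     * below that range. In the example above, base_packs is 3, so an
     * original id of 7 maps to pack_perm[4], within the 6 allocated
     * entries. */
    #include <stdint.h>

    static uint32_t pack_perm_lookup(const uint32_t *pack_perm,
                                     uint32_t base_packs,
                                     uint32_t orig_pack_int_id)
    {
        return pack_perm[orig_pack_int_id - base_packs];
    }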

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-12-07 07:38:08 +09:00
Taylor Blau
b40e7cbca2 midx: do not require packs to be sorted in lexicographic order
The MIDX file format currently requires that pack files be identified by
the lexicographic ordering of their names (that is, a pack having a
checksum beginning with "abc" would have a numeric pack_int_id which is
smaller than the same value for a pack beginning with "bcd").

As a result, it is impossible to combine adjacent MIDX layers together
without permuting bits from bitmaps that are in more recent layer(s).

To see why, consider the following example:

          | packs       | preferred pack
  --------+-------------+---------------
  MIDX #0 | { X, Y, Z } | Y
  MIDX #1 | { A, B, C } | B
  MIDX #2 | { D, E, F } | D

, where MIDX #2's base MIDX is MIDX #1, and so on. Suppose that we want
to combine MIDX layers #0 and #1, to create a new layer #0' containing
the packs from both layers. With the original three MIDX layers, objects
are laid out in the bitmap in the order they appear in their source
pack, and the packs themselves are arranged according to the pseudo-pack
order. In this case, that ordering is Y, X, Z, B, A, C.

But recall that the pseudo-pack ordering is defined by the order that
packs appear in the MIDX, with the exception of the preferred pack,
which sorts ahead of all other packs regardless of its position within
the MIDX. In the above example, that means that pack 'Y' could be placed
anywhere (so long as it is designated as preferred), however, all other
packs must be placed in the location listed above.

Because that ordering isn't sorted lexicographically, it is impossible
to compact MIDX layers in the above configuration without permuting the
object-to-bit-position mapping. Changing this mapping would affect all
bitmaps belonging to newer layers, rendering the bitmaps associated with
MIDX #2 unreadable.

One of the goals of MIDX compaction is that we are able to shrink the
length of the MIDX chain *without* invalidating bitmaps that belong to
newer layers, and the lexicographic ordering constraint is at odds with
this goal.

However, packs do not *need* to be lexicographically ordered within the
MIDX. As far as I can gather, the only reason they are sorted lexically
is to make it possible to perform a binary search over the pack names in
a MIDX, necessary to make `midx_contains_pack()`'s performance
logarithmic in the number of packs rather than linear.

Relax this constraint by allowing MIDX writes to proceed with packs that
are not arranged in lexicographic order. `midx_contains_pack()` will
lazily instantiate a `pack_names_sorted` array on the MIDX, which will
be used to implement the binary search over pack names.
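
A rough, self-contained sketch of that lazy scheme (this is not the actual
Git code; it only illustrates sorting a copy of the names on first lookup
so the on-disk pack order stays untouched):

    #include <stdlib.h>
    #include <string.h>

    static int cmp_name(const void *a, const void *b)
    {
        return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    /* names[] stays in MIDX order; sorted[] is filled lazily and is only
     * used for the binary search. */
    static int contains_pack(const char **names, const char **sorted,
                             size_t nr, int *sorted_ready, const char *key)
    {
        if (!*sorted_ready) {
            memcpy(sorted, names, nr * sizeof(*names));
            qsort(sorted, nr, sizeof(*sorted), cmp_name);
            *sorted_ready = 1;
        }
        return bsearch(&key, sorted, nr, sizeof(*sorted), cmp_name) != NULL;
    }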

Note that this produces MIDXs which may be incompatible with earlier
versions of Git that have stricter requirements on the layout of packs
within a MIDX. This patch does *not* modify the version number of the
MIDX format, since existing versions of Git already know to gracefully
ignore a MIDX with packs that appear out-of-order.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-12-07 07:38:08 +09:00
Taylor Blau
a7c1d30f29 midx-write.c: introduce struct write_midx_opts
In the MIDX writing code, there are four functions which perform some
sort of MIDX write operation. They are:

 - write_midx_file()
 - write_midx_file_only()
 - expire_midx_packs()
 - midx_repack()

All of these functions are thin wrappers over `write_midx_internal()`,
which implements the bulk of these routines. As a result, the
`write_midx_internal()` function takes six arguments.

Future commits in this series will want to add additional arguments, and
in general this function's signature will be the union of parameters
among *all* possible ways to write a MIDX.

Instead of adding yet more arguments to this function to support MIDX
compaction, introduce a `struct write_midx_opts`, which has the same
struct members as `write_midx_internal()`'s arguments.

Adding additional fields to the `write_midx_opts` struct is preferable
to adding additional arguments to `write_midx_internal()`. This is
because the callers below all zero-initialize the struct, so each time
we add a new piece of information, we do not have to pass the zero value
for it in all other call-sites that do not care about it.

For now, no functional changes are included in this patch.
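
A toy sketch of the pattern (struct and field names approximate what the
commit describes; they are not copied from the patch):

    /* Callers zero-initialize the options struct and set only what they
     * care about, so a field added later defaults to zero everywhere
     * else. */
    struct write_midx_opts_sketch {
        const char *preferred_pack_name;
        unsigned flags;
        int show_progress;
    };

    static int write_midx_sketch(const struct write_midx_opts_sketch *opts)
    {
        (void)opts; /* placeholder for the real write logic */
        return 0;
    }

    static int caller(void)
    {
        struct write_midx_opts_sketch opts = { 0 };
        opts.show_progress = 1;
        return write_midx_sketch(&opts);
    }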

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-12-07 07:38:07 +09:00
Taylor Blau
9df4325f15 midx-write.c: don't use pack_perm when assigning bitmap_pos
In midx_pack_order(), we compute, for each bitmapped pack, the first
bit corresponding to an object in that pack, along with how many bits
were assigned to objects in that pack.

Initially, each bitmap_nr value is set to zero, and each bitmap_pos
value is set to the sentinel BITMAP_POS_UNKNOWN. This is done to ensure
that there are no packs who have an unknown bit position but a somehow
non-zero number of objects (cf. `write_midx_bitmapped_packs()` in
midx-write.c).

Once the pack order is fully determined, midx_pack_order() sets the
bitmap_pos field for any bitmapped packs to zero if they are still
listed as BITMAP_POS_UNKNOWN.

However, we enumerate the bitmapped packs in order of `ctx->pack_perm`.
This is fine for existing cases, since the only time the
`ctx->pack_perm` array holds a value outside of the addressable range of
`ctx->info` is when there are expired packs, which only occurs via 'git
multi-pack-index expire', which does not support writing MIDX bitmaps.
As a result, the range of ctx->pack_perm covers all values in [0,
`ctx->nr`), so enumerating in this order isn't an issue.

A future change necessary for compaction will complicate this further by
introducing a wrapper around the `ctx->pack_perm` array, which turns the
given `pack_int_id` into one that is relative to the lower end of the
compaction range. As a result, indexing into `ctx->pack_perm` through
this helper, say, with "0" will produce a crash when the lower end of
the compaction range has >0 pack(s) in its base layer, since the
subtraction will wrap around the 32-bit unsigned range, resulting in an
uninitialized read.

But the process is completely unnecessary in the first place: we are
enumerating all values of `ctx->info`, and there is no reason to process
them in a different order than they appear in memory. Index `ctx->info`
directly to reflect that.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-12-07 07:38:07 +09:00
Taylor Blau
fcf3d688d2 midx: split get_midx_checksum() by adding get_midx_hash()
When trying to print out, say, the hexadecimal representation of a
MIDX's hash, our code will do something like:

    hash_to_hex_algop(get_midx_checksum(m),
                      m->source->odb->repo->hash_algo);

, which is both cumbersome and repetitive. In fact, all but a handful of
callers to `get_midx_checksum()` do exactly the above. Reduce the
repetitive nature of calling `get_midx_checksum()` by having it return a
pointer into a static buffer containing the above result.

For the handful of callers that do need to compare the raw bytes and
don't want to deal with an encoded copy (e.g., because they are passing
it to hasheq() or similar), introduce `get_midx_hash()` which returns
the raw bytes.
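
A self-contained sketch of the static-buffer approach (the names and hash
length here are illustrative; Git derives the length from the repository's
hash_algo and uses its own hex encoder):

    #include <stdio.h>

    #define HASH_RAWSZ 32 /* illustrative constant */

    /* Hex string in a static buffer for the common "print it" callers. */
    static const char *checksum_hex_sketch(const unsigned char *raw)
    {
        static char hex[2 * HASH_RAWSZ + 1];
        for (int i = 0; i < HASH_RAWSZ; i++)
            snprintf(hex + 2 * i, 3, "%02x", (unsigned)raw[i]);
        return hex;
    }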

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-12-07 07:38:05 +09:00
Junio C Hamano
9d442ce2e2 Merge branch 'ps/object-source-management'
Code refactoring around object database sources.

* ps/object-source-management:
  odb: handle recreation of quarantine directories
  odb: handle changing a repository's commondir
  chdir-notify: add function to unregister listeners
  odb: handle initialization of sources in `odb_new()`
  http-push: stop setting up `the_repository` for each reference
  t/helper: stop setting up `the_repository` repeatedly
  builtin/index-pack: fix deferred fsck outside repos
  oidset: introduce `oidset_equal()`
  odb: move logic to disable ref updates into repo
  odb: refactor `odb_clear()` to `odb_free()`
  odb: adopt logic to close object databases
  setup: convert `set_git_dir()` to have file scope
  path: move `enter_repo()` into "setup.c"
2025-12-05 14:49:58 +09:00
Patrick Steinhardt
9aaba57993 odb: adopt logic to close object databases
The logic to close an object database is currently contained in the
packfile subsystem. That choice is somewhat relatable, as most of the
logic really is to close resources associated with the packfile store
itself. But we also end up handling object sources and commit graphs,
which certainly is not related to packfiles.

Move the function into the object database subsystem and rename it to
`odb_close()`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-11-19 17:41:03 -08:00
Patrick Steinhardt
f898661637 refs: expose peeled object ID via the iterator
Both the "files" and "reftable" backend are able to store peeled values
for tags in the respective formats. This allows for a more efficient
lookup of the target object of such a tag without having to manually
peel via the object database.

The infrastructure to access these peeled object IDs is somewhat funky
though. When iterating through references, we store a pointer to
the current iterator in a global variable. The callbacks invoked by that
iterator are then expected to call `peel_iterated_oid()`, which checks
whether the globally-stored iterator's current reference refers to the
one handed into that function. If so, we ask the iterator to peel the
object, otherwise we manually peel the object via the object database.
Depending on global state like this is somewhat weird and also quite
fragile.

Introduce a new `struct reference::peeled_oid` field that can be
populated by the reference backends. This field can be accessed via a
new function `reference_get_peeled_oid()` that either uses that value,
if set, or alternatively peels via the ODB. With this change we don't
have to rely on global state anymore, but make the peeled object ID
available to the callback functions directly.

Adjust trivial callers that already have a `struct reference` available.
Remaining callers will be adjusted in subsequent commits.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-11-04 07:32:25 -08:00
Patrick Steinhardt
bdbebe5714 refs: introduce wrapper struct for each_ref_fn
The `each_ref_fn` callback function type is used across our code base
for several different functions that iterate through references. There's
a bunch of callbacks implementing this type, which makes any changes to
the callback signature extremely noisy. An example of the required churn
is e8207717f1 (refs: add referent to each_ref_fn, 2024-08-09): adding a
single argument required us to change 48 files.

It was already proposed back then [1] that we might want to introduce a
wrapper structure to alleviate the pain going forward. While this of
course requires the same kind of global refactoring as just introducing
a new parameter, it at least allows us to more easily change the callback type
afterwards by just extending the wrapper structure.

One counterargument to this refactoring is that it makes the structure
more opaque. While it is obvious which callsites need to be fixed up
when we change the function type, it's not obvious anymore once we use
a structure. That being said, we only have a handful of sites that
actually need to populate this wrapper structure: our ref backends,
"refs/iterator.c" as well as very few sites that invoke the iterator
callback functions directly.

Introduce this wrapper structure so that we can adapt the iterator
interfaces more readily.

[1]: <ZmarVcF5JjsZx0dl@tanuki>

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-11-04 07:32:24 -08:00
Junio C Hamano
e18e761bef Merge branch 'ds/midx-write-fixes'
Fixes multiple crashes around midx write-out codepaths.

* ds/midx-write-fixes:
  midx-write: simplify error cases
  midx-write: reenable signed comparison errors
  midx-write: use uint32_t for preferred_pack_idx
  midx-write: use cleanup when incremental midx fails
  midx-write: put failing response value back
  midx-write: only load initialized packs
2025-09-15 08:52:06 -07:00
Junio C Hamano
c31a276f12 Merge branch 'ps/object-store-midx-dedup-info'
Further code clean-up for multi-pack-index code paths.

* ps/object-store-midx-dedup-info:
  midx: compute paths via their source
  midx: stop duplicating info redundant with its owning source
  midx: write multi-pack indices via their source
  midx: load multi-pack indices via their source
  midx: drop redundant `struct repository` parameter
  odb: simplify calling `link_alt_odb_entry()`
  odb: return newly created in-memory sources
  odb: consistently use "dir" to refer to alternate's directory
  odb: allow `odb_find_source()` to fail
  odb: store locality in object database sources
2025-09-12 10:41:18 -07:00
Derrick Stolee
c25651aefd midx-write: simplify error cases
The write_midx_internal() method uses gotos to jump to a cleanup section to
clear memory before returning 'result'. Since these jumps are more common
for error conditions, initialize 'result' to -1 and then only set it to 0
before returning with success. There are a couple of places where we return
with success via a jump.

This has the added benefit that the method now returns -1 on error instead
of an inconsistent 1 or -1.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-05 12:32:01 -07:00
Derrick Stolee
1f2bc6be1d midx-write: reenable signed comparison errors
Remove the remaining signed comparison warnings in midx-write.c so that
they can be enforced as errors in the future. After the previous change,
the remaining errors are due to iterator variables named 'i'.

The strategy here involves defining the variable within the for loop
syntax to make sure we use the appropriate bitness for the loop
sentinel. This matters in at least one method where the variable was
compared to uint32_t in some loops and size_t in others.

While adjusting these loops, some were found whose boundary checked
against a uint32_t value _plus one_. These were replaced with
non-strict comparisons, and the value is also checked not to be
UINT32_MAX. Since the value is the number of incremental multi-pack-
indexes, this is not a meaningful restriction. The new die() is about
defensive programming more than it being realistically possible.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-05 12:32:01 -07:00
Derrick Stolee
68383ac9d4 midx-write: use uint32_t for preferred_pack_idx
midx-write.c has the DISABLE_SIGN_COMPARE_WARNINGS macro defined for a
few reasons, but the biggest one is the use of a signed
preferred_pack_idx member inside the write_midx_context struct. The code
currently uses -1 to indicate an unset preferred pack but pack int ids
are normally handled as uint32_t. There are also a few loops that search
for the preferred pack by name and those iterators will need updates to
uint32_t in the next change.

For now, replace the use of -1 with a 'NO_PREFERRED_PACK' macro and an
equality check. The macro stores the max value of a uint32_t, so we
cannot store a preferred pack that appears last in a list of 2^32 total
packs, but that's expected to be unreasonable already. Furthermore, with
this change we end up extending the range from 2^31 possible packs to
2^32-1.

Some care is needed when initializing the preferred pack index in the
struct and when using that value to search for a preferred pack; that
search was already incorrect but accidentally worked because the index
was initialized to zero.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-05 12:32:01 -07:00
Derrick Stolee
9c2262d65d midx-write: use cleanup when incremental midx fails
The incremental mode of writing a multi-pack-index has a few extra
conditions that could lead to failure, but these are currently
short-circuiting with 'return -1' instead of setting the method's
'result' variable and going to the cleanup tag.

Replace these returns with gotos to avoid memory issues when exiting
early due to error conditions.

Unfortunately, these error conditions are difficult to reproduce with
test cases, which is perhaps one reason why the memory leaks were not
caught by existing test cases in memory-tracking modes.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-05 12:32:00 -07:00
Derrick Stolee
3a45c7beb0 midx-write: put failing response value back
This instance of setting the result to 1 before going to cleanup was
accidentally removed in fcb2205b77 (midx: implement support for writing
incremental MIDX chains, 2024-08-06). Build upon a test that already deletes
a packfile to verify that this error propagates to full command failure.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-05 12:32:00 -07:00
Derrick Stolee
c9388d9012 midx-write: only load initialized packs
The fill_packs_from_midx() method was refactored in fcb2205b77 (midx:
implement support for writing incremental MIDX chains, 2024-08-06) to
allow for preferred packfiles and incremental multi-pack-indexes.
However, this led to some conditions that can cause improperly
initialized memory in the context's list of packfiles.

The conditions involving the preferred pack name or the incremental
flag currently gate whether a packfile is loaded. But the context is
still being populated with pack_info structs based on the packfile array
for the existing multi-pack-index even if prepare_midx_pack() isn't
called.

Add a new test that breaks under --stress when compiled with
SANITIZE=address. The chosen number of 100 packfiles was selected to get
the --stress output to fail about 50% of the time, while 50 packfiles
could not get a failure in most --stress runs.

The test case is marked as EXPENSIVE not only because of the number of
packfiles it creates, but because some CI environments were reporting
errors during the test that I could not reproduce, specifically around
being unable to open the packfiles or their pack-indexes.

When it fails under SANITIZE=address, it provides the following error:

AddressSanitizer:DEADLYSIGNAL
=================================================================
==3263517==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000027
==3263517==The signal is caused by a READ memory access.
==3263517==Hint: address points to the zero page.
    #0 0x562d5d82d1fb in close_pack_windows packfile.c:299
    #1 0x562d5d82d3ab in close_pack packfile.c:354
    #2 0x562d5d7bfdb4 in write_midx_internal midx-write.c:1490
    #3 0x562d5d7c7aec in midx_repack midx-write.c:1795
    #4 0x562d5d46fff6 in cmd_multi_pack_index builtin/multi-pack-index.c:305
    ...

This failure stack trace is disconnected from the real fix because the bad
pointers are accessed later when closing the packfiles from the context.

There are a few different aspects to this fix that are worth noting:

 1. We return to the previous behavior of fill_packs_from_midx to not
    rely on the incremental flag or existence of a preferred pack.

 2. The behavior to scan all layers of an incremental midx is kept, so
    this is not a full revert of the change.

 3. We skip allocating more room in the pack_info array if the pack
    fails prepare_midx_pack().

 4. The method has always returned 0 for success and 1 for failure, but
    the condition checking for error added a check for a negative result
    for failure, so that is now updated.

 5. The call to open_pack_index() is removed from this early spot, but
    its result is still needed later in the case of a preferred pack, so
    the call is moved to immediately before its result is needed (when
    checking the object count).

Signed-off-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-09-05 12:31:59 -07:00
Patrick Steinhardt
13296ac909 midx: compute paths via their source
With the preceding commits we started to always have the object database
source available when we load, write or access multi-pack indices. With
this in place we can change how MIDX paths are computed so that we don't
have to pass in the combination of a hash algorithm and object directory
anymore, but only the object database source.

Refactor the code accordingly.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-08-11 09:22:23 -07:00
Patrick Steinhardt
7744936f37 midx: stop duplicating info redundant with its owning source
Multi-pack indices store some information that is redundant with their
owning source:

  - The locality bit that tracks whether the source is the primary
    object source or an alternate.

  - The object directory path the multi-pack index is located in.

  - The pointer to the owning parent directory.

All of this information is already contained in `struct odb_source`, and
now that we always have that struct available when loading a multi-pack
index, it is readily accessible.

Drop the redundant information and instead store a pointer to the object
source.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-08-11 09:22:23 -07:00
Patrick Steinhardt
c3f5d25146 midx: write multi-pack indices via their source
Similar to the preceding commit, refactor the writing side of multi-pack
indices so that we pass in the object database source where the index
should be written to.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-08-11 09:22:22 -07:00
Patrick Steinhardt
9ff2129615 midx: drop redundant struct repository parameter
There are a couple of functions that take both a `struct repository` and
a `struct multi_pack_index`. This is redundant information without much
benefit, though, given that the multi-pack index already has a
pointer to its owning repository.

Drop the `struct repository` parameter from such functions. While at it,
reorder the list of parameters of `fill_midx_entry()` so that the MIDX
comes first to better align with our coding guidelines.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-08-11 09:22:22 -07:00
Patrick Steinhardt
0d61933b8f odb: allow odb_find_source() to fail
When trying to locate a source for an unknown object directory we will
die right away. In subsequent patches we will add new callsites though
that want to handle this situation gracefully instead.

Refactor the function to return a `NULL` pointer if the source could not
be found, and introduce a new wrapper `odb_find_source_or_die()` that
continues to die in that case; existing callsites that still want to die
are adapted to use it.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-08-11 09:22:21 -07:00
Junio C Hamano
4ce0caa7cc Merge branch 'ps/object-file-wo-the-repository'
Reduce implicit assumption and dependence on the_repository in the
object-file subsystem.

* ps/object-file-wo-the-repository:
  object-file: get rid of `the_repository` in index-related functions
  object-file: get rid of `the_repository` in `force_object_loose()`
  object-file: get rid of `the_repository` in `read_loose_object()`
  object-file: get rid of `the_repository` in loose object iterators
  object-file: remove declaration for `for_each_file_in_obj_subdir()`
  object-file: inline `for_each_loose_file_in_objdir_buf()`
  object-file: get rid of `the_repository` when writing objects
  odb: introduce `odb_write_object()`
  loose: write loose objects map via their source
  object-file: get rid of `the_repository` in `finalize_object_file()`
  object-file: get rid of `the_repository` in `loose_object_info()`
  object-file: get rid of `the_repository` when freshening objects
  object-file: inline `check_and_freshen()` functions
  object-file: get rid of `the_repository` in `has_loose_object()`
  object-file: stop using `the_hash_algo`
  object-file: fix -Wsign-compare warnings
2025-08-05 11:53:55 -07:00
Patrick Steinhardt
cbb388f3e5 object-file: get rid of the_repository in finalize_object_file()
We implicitly depend on `the_repository` when moving an object file into
place in `finalize_object_file()`. Get rid of this global dependency by
passing in a repository.

Note that one might be tempted to inject an object database instead of a
repository. But the function doesn't really care about the ODB at all.
All it does is to move a file into place while checking whether there is
any collision. As such, the functionality it provides is independent of
the object database and only needs the repository as parameter so that
it can adjust permissions of the file we are about to finalize.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-07-16 22:16:14 -07:00
Patrick Steinhardt
736bb725eb packfile: refactor get_multi_pack_index() to work on sources
The function `get_multi_pack_index()` loads multi-pack indices via
`prepare_packed_git()` and then returns the linked list of multi-pack
indices that is stored in `struct object_database`. That list is in the
process of being removed though in favor of storing the MIDX as part of
the object database source it belongs to.

Refactor `get_multi_pack_index()` so that it returns the multi-pack
index for a single object source. Callers are now expected to call this
function for each source they are interested in. This requires them to
iterate through alternates, so we have to prepare alternate object
sources before doing so.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-07-15 12:07:29 -07:00
Patrick Steinhardt
bd52ea343d odb: get rid of the_repository in find_odb()
Get rid of our dependency on `the_repository` in `find_odb()` by passing
in the object database in which we want to search for the source and
adjusting all callers.

Rename the function to `odb_find_source()`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-07-01 14:46:35 -07:00
Phillip Wood
3aa98a61da midx: avoid negative array index
nth_midxed_pack_int_id() returns the index of the pack file, in the multi-
pack index's list of packfiles, that contains the specified object. The
index is returned as a uint32_t. Storing this in an int will make the index
negative if the most significant bit is set. Fix this by using uint32_t
as the rest of the code does. This is unlikely to be a practical problem
as it requires the multi-pack index to reference 2^31 packfiles.

Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-22 14:48:37 -07:00
Phillip Wood
f874c0ed90 midx repack: avoid potential integer overflow on 64 bit systems
On a 64 bit system the calculation

    p->pack_size * pack_info[i].referenced_objects

could overflow. If a pack file contains 2^28 objects with an average
compressed size of 1KB then the pack size will be 2^38B. If all of the
objects are referenced by the multi-pack index the product above will
overflow. Avoid this by using shifted integer arithmetic and changing
the order of the calculation so that the pack size is divided by the
total number of objects in the pack before multiplying by the number of
objects referenced by the multi-pack index. Using a shift of 14 bits
should give reasonable accuracy while avoiding overflow for pack sizes
less than 1PB.
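
A standalone sketch of the reordered, shifted arithmetic (variable names
are illustrative and the exact expression in the patch may differ):

    #include <stdint.h>

    /* Divide the pack size by the object count first, carrying 14 extra
     * bits of precision, then multiply by the referenced-object count and
     * shift back. For pack sizes below 1PB the intermediate value fits in
     * 64 bits. */
    static uint64_t estimate_referenced_size(uint64_t pack_size,
                                             uint32_t total_objects,
                                             uint32_t referenced_objects)
    {
        return (((pack_size << 14) / total_objects) * referenced_objects) >> 14;
    }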

Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-22 14:48:36 -07:00
Phillip Wood
b103881d4f midx repack: avoid integer overflow on 32 bit systems
On a 32 bit system "git multi-pack-index --repack --batch-size=120M"
failed with

    fatal: size_t overflow: 6038786 * 1289

The calculation of the estimated size of the objects in the pack referenced
by the multi-pack-index uses st_mult() to multiply the pack size by the
number of referenced objects before dividing by the total number of
objects in the pack. As size_t is 32 bits on 32-bit systems this
calculation easily overflows. Fix this by using 64-bit arithmetic instead.

Also fix a potential overflow when calculating the total size of the
objects referenced by the multi-pack index with a batch size larger
than SIZE_MAX / 2. In that case

    total_size += estimated_size

can overflow, as both total_size and estimated_size can be greater than
SIZE_MAX / 2. This is addressed by using saturating arithmetic for the
addition. Although estimated_size is of type uint64_t, by the time we
reach this sum it is bounded by the batch size, which is of type size_t,
so casting estimated_size to size_t does not truncate the value.
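
A minimal sketch of a saturating size_t addition of the kind described
(not the exact helper used in Git):

    #include <stddef.h>
    #include <stdint.h>

    /* Clamp instead of wrapping: once the running total would exceed
     * SIZE_MAX it sticks there, so a later comparison against the batch
     * size still triggers. */
    static size_t saturating_add(size_t a, size_t b)
    {
        return (a > SIZE_MAX - b) ? SIZE_MAX : a + b;
    }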

Signed-off-by: Phillip Wood <phillip.wood@dunelm.org.uk>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-05-22 14:48:36 -07:00
Junio C Hamano
36d8035d27 Merge branch 'ps/object-file-cleanup'
Code clean-up.

* ps/object-file-cleanup:
  object-store: merge "object-store-ll.h" and "object-store.h"
  object-store: remove global array of cached objects
  object: split out functions relating to object store subsystem
  object-file: drop `index_blob_stream()`
  object-file: split up concerns of `HASH_*` flags
  object-file: split out functions relating to object store subsystem
  object-file: move `xmmap()` into "wrapper.c"
  object-file: move `git_open_cloexec()` to "compat/open.c"
  object-file: move `safe_create_leading_directories()` into "path.c"
  object-file: move `mkdir_in_gitdir()` into "path.c"
2025-04-24 17:25:33 -07:00
Junio C Hamano
ee847e0034 Merge branch 'ps/object-wo-the-repository'
The object layer has been updated to take an explicit repository
instance as a parameter in more code paths.

* ps/object-wo-the-repository:
  hash: stop depending on `the_repository` in `null_oid()`
  hash: fix "-Wsign-compare" warnings
  object-file: split out logic regarding hash algorithms
  delta-islands: stop depending on `the_repository`
  object-file-convert: stop depending on `the_repository`
  pack-bitmap-write: stop depending on `the_repository`
  pack-revindex: stop depending on `the_repository`
  pack-check: stop depending on `the_repository`
  environment: move access to "core.bigFileThreshold" into repo settings
  pack-write: stop depending on `the_repository` and `the_hash_algo`
  object: stop depending on `the_repository`
  csum-file: stop depending on `the_repository`
2025-04-15 13:50:15 -07:00
Patrick Steinhardt
1a99fe8010 object-file: move safe_create_leading_directories() into "path.c"
The `safe_create_leading_directories()` function and its relatives are
located in "object-file.c", which is not a good fit as they provide
generic functionality not related to objects at all. Move them into
"path.c", which already hosts `safe_create_dir()` and its relative
`safe_create_dir_in_gitdir()`.

"path.c" is free of `the_repository`, but the moved functions depend on
`the_repository` to read the "core.sharedRepository" config. Adapt the
function signature to accept a repository as argument to fix the issue
and adjust callers accordingly.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-04-15 08:24:35 -07:00
Taylor Blau
27afc272c4 midx: implement writing incremental MIDX bitmaps
Now that the pack-bitmap machinery has learned how to read and interact
with an incremental MIDX bitmap, teach the pack-bitmap-write.c machinery
(and relevant callers from within the MIDX machinery) to write such
bitmaps.

The details for doing so are mostly straightforward. The main changes
are as follows:

  - find_object_pos() now makes use of an extra MIDX parameter which is
    used to locate the bit positions of objects which are from previous
    layers (and thus do not exist in the current layer's pack_order
    field).

    (Note also that the pack_order field is moved into struct
    write_midx_context to further simplify the callers for
    write_midx_bitmap()).

  - bitmap_writer_build_type_index() first determines how many objects
    precede the current bitmap layer and offsets the bits it sets in
    each respective type-level bitmap by that amount so they can be OR'd
    together.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
Acked-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-21 04:34:16 -07:00
Patrick Steinhardt
2582846f2f pack-write: stop depending on the_repository and the_hash_algo
There are a couple of functions in "pack-write.c" that implicitly depend
on `the_repository` or `the_hash_algo`. Remove this dependency by
injecting the repository via a parameter and adapt callers accordingly.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-10 13:16:18 -07:00
Patrick Steinhardt
74d414c9f1 object: stop depending on the_repository
There are a couple of functions exposed by "object.c" that implicitly
depend on `the_repository`. Remove this dependency by injecting the
repository via a parameter. Adapt callers accordingly by simply using
`the_repository`, except in cases where the subsystem is already free of
the repository. In that case, we instead pass the repository provided by
the caller's context.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-10 13:16:18 -07:00
Patrick Steinhardt
228457c9d9 csum-file: stop depending on the_repository
There are multiple sites in "csum-file.c" where we use the global
`the_repository` variable, either explicitly or implicitly by using
`the_hash_algo`.

Refactor the code to stop using `the_repository` by adapting functions
to receive required data as parameters. Adapt callsites accordingly by
either using `the_repository->hash_algo`, or by using a context-provided
hash algorithm in case the subsystem already got rid of its dependency
on `the_repository`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-03-10 13:16:18 -07:00
Patrick Steinhardt
028f618658 path: adjust last remaining users of the_repository
With the preceding refactorings we now only have a couple of implicit
users of `the_repository` left in the "path" subsystem, all of which
depend on global state via `calc_shared_perm()`. Make the dependency on
`the_repository` explicit by passing the repo as a parameter instead and
adjust callers accordingly.

Note that this change bubbles up into a couple of subsystems that were
previously declared as free from `the_repository`. Instead of marking
all of them as `the_repository`-dependent again, we instead use the
repository that is available in the calling context. There are three
exceptions though with "copy.c", "pack-write.c" and "tempfile.c".
Adjusting these would require us to adapt callsites all over the place,
so this is left for a future iteration.

Mark "path.c" as free from `the_repository`.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-02-28 13:54:11 -08:00
Junio C Hamano
b83a2f9006 Merge branch 'kn/pack-write-with-reduced-globals'
Code clean-up.

* kn/pack-write-with-reduced-globals:
  pack-write: pass hash_algo to internal functions
  pack-write: pass hash_algo to `write_rev_file()`
  pack-write: pass hash_algo to `write_idx_file()`
  pack-write: pass repository to `index_pack_lockfile()`
  pack-write: pass hash_algo to `fixup_pack_header_footer()`
2025-02-03 10:23:34 -08:00
Karthik Nayak
6b2aa7fd37 pack-write: pass hash_algo to write_rev_file()
The `write_rev_file()` function uses the global `the_hash_algo` variable
to access the repository's hash_algo. To avoid global variable usage,
pass a hash_algo from the layers above. Also modify the child functions
`write_rev_file_order()` and `write_rev_header()` to accept
'the_hash_algo'.

Although the layers above could have access to the hash_algo internally,
simply pass in `the_hash_algo`. This avoids any compatibility issues and
bubbles up global variable usage to upper layers, where it can eventually
be resolved.

However, in `midx-write.c`, since all usage of global variables is
removed, don't reintroduce them and instead use the `repo` available in
the context.

Signed-off-by: Karthik Nayak <karthik.188@gmail.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2025-01-21 12:36:34 -08:00
Junio C Hamano
4156b6a741 Merge branch 'ps/build-sign-compare'
Start working to make the codebase buildable with -Wsign-compare.

* ps/build-sign-compare:
  t/helper: don't depend on implicit wraparound
  scalar: address -Wsign-compare warnings
  builtin/patch-id: fix type of `get_one_patchid()`
  builtin/blame: fix type of `length` variable when emitting object ID
  gpg-interface: address -Wsign-comparison warnings
  daemon: fix type of `max_connections`
  daemon: fix loops that have mismatching integer types
  global: trivial conversions to fix `-Wsign-compare` warnings
  pkt-line: fix -Wsign-compare warning on 32 bit platform
  csum-file: fix -Wsign-compare warning on 32-bit platform
  diff.h: fix index used to loop through unsigned integer
  config.mak.dev: drop `-Wno-sign-compare`
  global: mark code units that generate warnings with `-Wsign-compare`
  compat/win32: fix -Wsign-compare warning in "wWinMain()"
  compat/regex: explicitly ignore "-Wsign-compare" warnings
  git-compat-util: introduce macros to disable "-Wsign-compare" warnings
2024-12-23 09:32:11 -08:00
Patrick Steinhardt
1f7e6478dc progress: stop using the_repository
Stop using `the_repository` in the "progress" subsystem by passing in a
repository when initializing `struct progress`. Furthermore, store a
pointer to the repository in that struct so that we can pass it to the
trace2 API when logging information.

Adjust callers accordingly by using `the_repository`. While there may be
some callers that have a repository available in their context, this
trivial conversion allows for easier verification and bubbles up the use
of `the_repository` by one level.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
2024-12-18 10:44:30 -08:00