Merge branch 'bc/sha1-256-interop-01'

The beginning of SHA1-SHA256 interoperability work. * bc/sha1-256-interop-01: t1010: use BROKEN_OBJECTS prerequisite t: allow specifying compatibility hash fsck: consider gpgsig headers expected in tags rev-parse: allow printing compatibility hash docs: add documentation for loose objects docs: improve ambiguous areas of pack format documentation docs: reflect actual double signature for tags docs: update offset order for pack index v3 docs: update pack index v3 format
2025-12-12 20:36:24 +01:00 · 2025-10-22 11:38:58 -07:00
parent c9ccf81948 db00605c13
commit 98401c10fc
15 changed files with 255 additions and 32 deletions
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -34,6 +34,7 @@ MAN5_TXT += gitformat-bundle.adoc
 MAN5_TXT += gitformat-chunk.adoc
 MAN5_TXT += gitformat-commit-graph.adoc
 MAN5_TXT += gitformat-index.adoc
+MAN5_TXT += gitformat-loose.adoc
 MAN5_TXT += gitformat-pack.adoc
 MAN5_TXT += gitformat-signature.adoc
 MAN5_TXT += githooks.adoc
--- a/Documentation/fsck-msgids.adoc
+++ b/Documentation/fsck-msgids.adoc
@@ -10,6 +10,12 @@
 `badFilemode`::
 	(INFO) A tree contains a bad filemode entry.

+`badGpgsig`::
+	(ERROR) A tag contains a bad (truncated) signature (e.g., `gpgsig`) header.
+
+`badHeaderContinuation`::
+	(ERROR) A continuation header (such as for `gpgsig`) is unexpectedly truncated.
+
 `badName`::
 	(ERROR) An author/committer name is empty.

--- a/Documentation/git-rev-parse.adoc
+++ b/Documentation/git-rev-parse.adoc
@@ -324,11 +324,12 @@ The following options are unaffected by `--path-format`:
 	path of the current directory relative to the top-level
 	directory.

--show-object-format[=(storage|input|output)]::
-	Show the object format (hash algorithm) used for the repository
-	for storage inside the `.git` directory, input, or output. For
-	input, multiple algorithms may be printed, space-separated.
-	If not specified, the default is "storage".
+--show-object-format[=(storage|input|output|compat)]::
+	Show the object format (hash algorithm) used for the repository for storage
+	inside the `.git` directory, input, output, or compatibility. For input,
+	multiple algorithms may be printed, space-separated. If `compat` is
+	requested and no compatibility algorithm is enabled, prints an empty line. If
+	not specified, the default is "storage".

 --show-ref-format::
 	Show the reference storage format used for the repository.
--- a/Documentation/gitformat-loose.adoc
+++ b/Documentation/gitformat-loose.adoc
@@ -0,0 +1,53 @@
+gitformat-loose(5)
+==================
+
+NAME
+----
+gitformat-loose - Git loose object format
+
+
+SYNOPSIS
+--------
+[verse]
+$GIT_DIR/objects/[0-9a-f][0-9a-f]/*
+
+DESCRIPTION
+-----------
+
+Loose objects are how Git stores individual objects, where every object is
+written as a separate file.
+
+Over the lifetime of a repository, objects are usually written as loose objects
+initially.  Eventually, these loose objects will be compacted into packfiles
+via repository maintenance to improve disk space usage and speed up the lookup
+of these objects.
+
+== Loose objects
+
+Each loose object contains a prefix, followed immediately by the data of the
+object.  The prefix contains `<type> <size>\0`.  `<type>` is one of `blob`,
+`tree`, `commit`, or `tag` and `size` is the size of the data (without the
+prefix) as a decimal integer expressed in ASCII.
+
+The entire contents, prefix and data concatenated, is then compressed with zlib
+and the compressed data is stored in the file.  The object ID of the object is
+the SHA-1 or SHA-256 (as appropriate) hash of the uncompressed data.
+
+The file for the loose object is stored under the `objects` directory, with the
+first two hex characters of the object ID being the directory and the remaining
+characters being the file name.  This is done to shard the data and avoid too
+many files being in one directory, since some file systems perform poorly with
+many items in a directory.
+
+As an example, the empty tree contains the data (when uncompressed) `tree 0\0`
+and, in a SHA-256 repository, would have the object ID
+`6ef19b41225c5369f1c104d45d8d85efa9b057b53b14b4b9b939dd74decc5321` and would be
+stored under
+`$GIT_DIR/objects/6e/f19b41225c5369f1c104d45d8d85efa9b057b53b14b4b9b939dd74decc5321`.
+
+Similarly, a blob containing the contents `abc` would have the uncompressed
+data of `blob 3\0abc`.
+
+GIT
+---
+Part of the linkgit:git[1] suite
--- a/Documentation/gitformat-pack.adoc
+++ b/Documentation/gitformat-pack.adoc
@@ -32,6 +32,10 @@ In a repository using the traditional SHA-1, pack checksums, index checksums,
 and object IDs (object names) mentioned below are all computed using SHA-1.
 Similarly, in SHA-256 repositories, these values are computed using SHA-256.

+CRC32 checksums are always computed over the entire packed object, including
+the header (n-byte type and length); the base object name or offset, if any;
+and the entire compressed object.  The CRC32 algorithm used is that of zlib.
+
 == pack-*.pack files have the following format:

   - A header appears at the beginning and consists of the following:
@@ -80,6 +84,16 @@ Valid object types are:

 Type 5 is reserved for future expansion. Type 0 is invalid.

+=== Object encoding
+
+Unlike loose objects, packed objects do not have a prefix containing the type,
+size, and a NUL byte. These are not necessary because they can be determined by
+the n-byte type and length that prefixes the data and so they are omitted from
+the compressed and deltified data.
+
+The computation of the object ID still uses this prefix by reconstructing it
+from the type and length as needed.
+
 === Size encoding

 This document uses the following "size encoding" of non-negative
@@ -92,6 +106,11 @@ values are more significant.
 This size encoding should not be confused with the "offset encoding",
 which is also used in this document.

+When encoding the size of an undeltified object in a pack, the size is that of
+the uncompressed raw object. For deltified objects, it is the size of the
+uncompressed delta.  The base object name or offset is not included in the size
+computation.
+
 === Deltified representation

 Conceptually there are only four object types: commit, tree, tag and
--- a/Documentation/meson.build
+++ b/Documentation/meson.build
@@ -173,6 +173,7 @@ manpages = {
  'gitformat-chunk.adoc' : 5,
  'gitformat-commit-graph.adoc' : 5,
  'gitformat-index.adoc' : 5,
+  'gitformat-loose.adoc' : 5,
  'gitformat-pack.adoc' : 5,
  'gitformat-signature.adoc' : 5,
  'githooks.adoc' : 5,
--- a/Documentation/technical/hash-function-transition.adoc
+++ b/Documentation/technical/hash-function-transition.adoc
@@ -227,9 +227,9 @@ network byte order):
    ** 4-byte length in bytes of shortened object names. This is the
      shortest possible length needed to make names in the shortened
      object name table unambiguous.
-    ** 4-byte integer, recording where tables relating to this format
+    ** 8-byte integer, recording where tables relating to this format
      are stored in this index file, as an offset from the beginning.
-  * 4-byte offset to the trailer from the beginning of this file.
+  * 8-byte offset to the trailer from the beginning of this file.
  * Zero or more additional key/value pairs (4-byte key, 4-byte
    value). Only one key is supported: 'PSRC'. See the "Loose objects
    and unreachable objects" section for supported values and how this
@@ -260,12 +260,10 @@ network byte order):
    compressed data to be copied directly from pack to pack during
    repacking without undetected data corruption.

-  * A table of 4-byte offset values. For an object in the table of
-    sorted shortened object names, the value at the corresponding
-    index in this table indicates where that object can be found in
-    the pack file. These are usually 31-bit pack file offsets, but
-    large offsets are encoded as an index into the next table with the
-    most significant bit set.
+  * A table of 4-byte offset values. The index of this table in pack order
+    indicates where that object can be found in the pack file. These are
+    usually 31-bit pack file offsets, but large offsets are encoded as
+    an index into the next table with the most significant bit set.

  * A table of 8-byte offset entries (empty for pack files less than
    2 GiB). Pack files are organized with heavily used objects toward
@@ -276,10 +274,14 @@ network byte order):
  up to and not including the table of CRC32 values.
 - Zero or more NUL bytes.
 - The trailer consists of the following:
-  * A copy of the 20-byte SHA-256 checksum at the end of the
+  * A copy of the full main hash checksum at the end of the
    corresponding packfile.

-  * 20-byte SHA-256 checksum of all of the above.
+  * Full main hash checksum of all of the above.
+
+The "full main hash" is a full-length hash of the main (not compatibility)
+algorithm in the repository.  Thus, if the main algorithm is SHA-256, this is
+a 32-byte SHA-256 hash and for SHA-1, it's a 20-byte SHA-1 hash.

 Loose object index
 ~~~~~~~~~~~~~~~~~~
@@ -427,17 +429,19 @@ ordinary unsigned commit.

 Signed Tags
 ~~~~~~~~~~~
-We add a new field "gpgsig-sha256" to the tag object format to allow
-signing tags without relying on SHA-1. Its signed payload is the
-SHA-256 content of the tag with its gpgsig-sha256 field and "-----BEGIN PGP
-SIGNATURE-----" delimited in-body signature removed.
+We add new fields "gpgsig" and "gpgsig-sha256" to the tag object format to
+allow signing tags in both formats.  The in-body signature is used for the
+signature in the current hash algorithm and the header is used for the
+signature in the other algorithm.  Thus, a dual-signature tag will contain both
+an in-body signature and a gpgsig-sha256 header for the SHA-1 format of an
+object or both an in-body signature and a gpgsig header for the SHA-256 format
+of and object.

-This means tags can be signed
+The signed payload of the tag is the content of the tag in the current
+algorithm with both its gpgsig and gpgsig-sha256 fields and
+"-----BEGIN PGP SIGNATURE-----" delimited in-body signature removed.

-1. using SHA-1 only, as in existing signed tag objects
-2. using both SHA-1 and SHA-256, by using gpgsig-sha256 and an in-body
-   signature.
-3. using only SHA-256, by only using the gpgsig-sha256 field.
+This means tags can be signed using one or both algorithms.

 Mergetag embedding
 ~~~~~~~~~~~~~~~~~~