Compact ImageMap Format
=======================

A process' address space contains (among other things) the set of
dynamically loaded images that have been mapped into that address
space.  When generating crash logs or symbolicating backtraces, we
need to be able to capture and potentially store the list of images
that have been loaded, as well as some of the attributes of those
images, including each image's

- Path
- Build ID (aka UUID)
- Base address
- End-of-text address

Compact ImageMap Format (CIF) is a binary format for holding this
information.

### General Format

Compact ImageMap Format data is byte aligned and starts with an
information byte:

~~~
   7   6   5   4   3   2   1   0
 ┌───────────────────────┬───────┐
 │ version               │ size  │
 └───────────────────────┴───────┘
~~~

The `version` field identifies the version of CIF that is in use; this
document describes version `0`. The `size` field is encoded as
follows:

| `size` | Machine word size |
| :----: | :---------------- |
|   00   | 16-bit            |
|   01   | 32-bit            |
|   10   | 64-bit            |
|   11   | Reserved          |
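
As a minimal sketch of how a reader might split the information byte, the following Python separates the six `version` bits from the two `size` bits and maps `size` to a word width using the table above.  The names `WORD_BITS` and `parse_info_byte` are illustrative only and are not part of the format.

```python
# Word sizes keyed by the two-bit `size` field; 0b11 is reserved.
WORD_BITS = {0b00: 16, 0b01: 32, 0b10: 64}

def parse_info_byte(info):
    """Split the information byte into (version, word size in bits)."""
    version = info >> 2      # bits 7..2
    size = info & 0b11       # bits 1..0
    if size not in WORD_BITS:
        raise ValueError("reserved size encoding 0b11")
    return version, WORD_BITS[size]

# 0x02 is version 0 with 64-bit machine words.
version, word_bits = parse_info_byte(0x02)
```

Rejecting the reserved `size` value outright is a choice made for this sketch; a real reader might prefer to report it differently.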

This is followed immediately by a field containing the name of the platform
that generated this image map.  This field consists of a single byte length
followed by a UTF-8 string of that length.
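
Reading this field could look roughly like the sketch below.  The function name `read_platform_name` and the sample platform name `macOS` are assumptions made for illustration; this document does not list the platform names that are actually used.

```python
def read_platform_name(data, pos):
    """Read a length-prefixed UTF-8 string; return (name, next position)."""
    length = data[pos]           # single length byte, so 0..255
    start = pos + 1
    end = start + length
    if end > len(data):
        raise ValueError("truncated platform name")
    return data[start:end].decode("utf-8"), end

# Hypothetical input: information byte 0x02, then the 5-byte name "macOS".
blob = bytes([0x02, 5]) + b"macOS"
name, next_pos = read_platform_name(blob, 1)
```

Returning the next read position alongside the value lets the caller continue straight on to the image count.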

After that is a field encoding the number of images in the image map.
This field is encoded as a sequence of bytes, each holding seven bits
of data; the top bit is set on every byte except the final one, where
it is clear.  The most significant seven bits come first.  For example:

| `count` | Encoding    |
| ------: | :---------- |
|       0 | 00          |
|       1 | 01          |
|     127 | 7f          |
|     128 | 81 00       |
|     129 | 81 01       |
|     700 | 85 3c       |
|    1234 | 89 52       |
|   16384 | 81 80 00    |
|   65535 | 83 ff 7f    |
| 2097152 | 81 80 80 00 |
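
The scheme can be sketched in a few lines of Python.  The function names `encode_count` and `decode_count` are illustrative, and the sketch places no upper limit on the value, which a real implementation would probably want to do.

```python
def encode_count(value):
    """Encode a non-negative integer, most significant seven bits first."""
    if value < 0:
        raise ValueError("count must be non-negative")
    groups = [value & 0x7F]              # final byte: top bit clear
    value >>= 7
    while value:
        groups.append(0x80 | (value & 0x7F))   # earlier bytes: top bit set
        value >>= 7
    return bytes(reversed(groups))

def decode_count(data, pos=0):
    """Decode a count starting at data[pos]; return (value, next position)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if (byte & 0x80) == 0:           # top bit clear marks the last byte
            return value, pos

encoded = encode_count(1234)             # the bytes 89 52 from the table
decoded, next_pos = decode_count(encoded)
```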

This in turn is followed by the list of images, stored in order of
increasing base address.  For each image, we start with a header byte:

~~~
   7   6   5   4   3   2   1   0
 ┌───┬───┬───────────┬───────────┐
 │ r │ 0 │ acount    │ ecount    │
 └───┴───┴───────────┴───────────┘
~~~

If `r` is set, then the base address is understood to be relative to
the previously computed base address.

This byte is followed by `acount + 1` bytes of base address, then
`ecount + 1` bytes of offset to the end of text.
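
A reader needs the header byte to know how many bytes to consume next.  The sketch below only extracts the flag and the two lengths; it deliberately does not interpret the address bytes, because their byte order is not stated in this section.  The name `parse_image_header` is illustrative, and ignoring bit 6 (shown as `0` in the diagram) is an assumption of the sketch.

```python
def parse_image_header(header):
    """Return (relative, base address length, end-of-text offset length).

    Both lengths are in bytes.  Bit 6 is ignored here (an assumption).
    """
    relative = bool(header & 0x80)           # r: bit 7
    base_len = ((header >> 3) & 0x07) + 1    # acount: bits 5..3, plus one
    eot_len = (header & 0x07) + 1            # ecount: bits 2..0, plus one
    return relative, base_len, eot_len

# 0xBA = 1 0 111 010: relative, 8 bytes of base address, 3 bytes of offset.
relative, base_len, eot_len = parse_image_header(0xBA)
```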

Following this is the number of bytes in the build ID, encoded using
the same 7-bit scheme as the image count, followed by the build ID
bytes themselves.

Finally, we encode the path string using the scheme below.

### String Encoding

Image paths contain a good deal of redundancy; paths are therefore
encoded using a prefix compression scheme.  The basic idea here is
that while generating or reading the data, we maintain a mapping from
small integers to path prefix segments.

The mapping is initialised with the following fixed list, which never
needs to be stored in CIF data:

| code | Path prefix                         |
| :--: | :---------------------------------- |
|   0  | `/lib`                              |
|   1  | `/usr/lib`                          |
|   2  | `/usr/local/lib`                    |
|   3  | `/opt/lib`                          |
|   4  | `/System/Library/Frameworks`        |
|   5  | `/System/Library/PrivateFrameworks` |
|   6  | `/System/iOSSupport`                |
|   7  | `/Library/Frameworks`               |
|   8  | `/System/Applications`              |
|   9  | `/Applications`                     |
|  10  | `C:\Windows\System32`               |
|  11  | `C:\Program Files`                  |

Codes below 32 are reserved for future expansion of the fixed list.

Strings are encoded as a sequence of bytes, as follows:

|  `opcode`  | Mnemonic  | Meaning                                   |
| :--------: | :-------- | :---------------------------------------- |
| `00000000` | `end`     | Marks the end of the string               |
| `00xxxxxx` | `str`     | Raw string data                           |
| `01xxxxxx` | `framewk` | Names a framework                         |
| `1exxxxxx` | `expand`  | Identifies a prefix in the table          |

#### `end`

##### Encoding

~~~
   7   6   5   4   3   2   1   0
 ┌───────────────────────────────┐
 │ 0   0   0   0   0   0   0   0 │  end
 └───────────────────────────────┘
~~~

##### Meaning

Marks the end of the string.

#### `str`

##### Encoding

~~~
   7   6   5   4   3   2   1   0
 ┌───────┬───────────────────────┐
 │ 0   0 │ count                 │  str
 └───────┴───────────────────────┘
~~~

##### Meaning

The next `count` bytes are included in the string verbatim.
Additionally, all path prefixes of this string data will be added to
the current prefix table.  For instance, if the string data is
`/swift/linux/x86_64/libfoo.so`, then the prefix `/swift` will be
assigned the next available code, `/swift/linux` the code after that,
and `/swift/linux/x86_64` the code following that one.
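
The prefix extraction can be sketched as follows.  The function name `path_prefixes` is illustrative, and the sketch assumes `/` is the only path separator and that the table currently holds nothing beyond the fixed list, so the next available code is 32.  How separators other than `/` or prefixes already present in the table are treated is not covered here.

```python
def path_prefixes(data, sep="/"):
    """Return every proper directory prefix of `data`, shortest first."""
    prefixes = []
    # Start searching at index 1 so that a leading separator does not
    # produce an empty prefix.
    pos = data.find(sep, 1)
    while pos != -1:
        prefixes.append(data[:pos])
        pos = data.find(sep, pos + 1)
    return prefixes

# Assumption: only the fixed list is in the table, so codes start at 32.
next_code = 32
table = {}
for prefix in path_prefixes("/swift/linux/x86_64/libfoo.so"):
    table[next_code] = prefix
    next_code += 1
```

After this runs, `table` maps 32, 33 and 34 to `/swift`, `/swift/linux` and `/swift/linux/x86_64` respectively.  A string with a single component, such as `/libz.1.dylib`, contributes no prefixes.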

#### `framewk`

##### Encoding

~~~
   7   6   5   4   3   2   1   0
 ┌───────┬───────────────────────┐
 │ 0   1 │ count                 │  framewk
 └───────┴───────────────────────┘
~~~

##### Meaning

The next byte is a version character (normally `A`, but some
frameworks use higher characters), after which there are `count + 1`
bytes of name.

This is expanded using the pattern
`/<name>.framework/Versions/<version>/<name>`.  This also marks the
end of the string.
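
As a rough illustration, the expansion might be implemented like this.  The names `expand_framework` and `decode_framewk` are hypothetical, and the sketch assumes the version character occupies a single byte and that the name is UTF-8.

```python
def expand_framework(name, version="A"):
    """Apply the pattern /<name>.framework/Versions/<version>/<name>."""
    return f"/{name}.framework/Versions/{version}/{name}"

def decode_framewk(data, pos):
    """Decode a framewk opcode at data[pos]; return (suffix, next position)."""
    opcode = data[pos]
    if (opcode & 0xC0) != 0x40:
        raise ValueError("not a framewk opcode")
    name_len = (opcode & 0x3F) + 1               # count + 1 bytes of name
    version = chr(data[pos + 1])                 # assumed single-byte character
    name_start = pos + 2
    name_end = name_start + name_len
    name = data[name_start:name_end].decode("utf-8")
    return expand_framework(name, version), name_end

# 0x45 is 01 000101: count = 5, so six bytes of name follow the version.
suffix, next_pos = decode_framewk(bytes([0x45]) + b"CAppKit", 0)
```

Here `suffix` is `/AppKit.framework/Versions/C/AppKit`; whatever has already been appended to the string comes before it.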

#### `expand`

##### Encoding

~~~
   7   6   5   4   3   2   1   0
 ┌───┬───┬───────────────────────┐
 │ 1 │ e │ code                  │  expand
 └───┴───┴───────────────────────┘
~~~

##### Meaning

If `e` is `0`, `code` is the index into the prefix table for the
prefix that should be appended to the string at this point.

If `e` is `1`, this opcode is followed by `code + 1` bytes that give
a value `v` such that `v + 64` is the index into the prefix table for
the prefix that should be appended to the string at this point.
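
The two cases can be sketched as below.  The function name `expand_index` is illustrative.  Note that this section does not state the byte order of the extra bytes used when `e` is `1`; the sketch assumes most significant byte first and exposes that choice as a parameter so the assumption is visible.

```python
def expand_index(data, pos, byteorder="big"):
    """Return (prefix table index, next position) for an expand opcode.

    `byteorder` applies only when e == 1; "big" is an assumption.
    """
    opcode = data[pos]
    if (opcode & 0x80) == 0:
        raise ValueError("not an expand opcode")
    code = opcode & 0x3F
    if (opcode & 0x40) == 0:             # e == 0: code is the index itself
        return code, pos + 1
    extra = code + 1                     # e == 1: code + 1 bytes hold v
    start = pos + 1
    v = int.from_bytes(data[start:start + extra], byteorder)
    return v + 64, start + extra

short_form = expand_index(bytes([0x84]), 0)        # index 4, one byte used
long_form = expand_index(bytes([0xC0, 0x24]), 0)   # v = 36, so index 100
```

With a single extra byte the byte order makes no difference, so `long_form` does not depend on the assumption.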

#### Example

Let's say we wish to encode the following strings:

    /System/Library/Frameworks/AppKit.framework/Versions/C/AppKit
    /System/Library/Frameworks/Photos.framework/Versions/A/Photos
    /usr/lib/libobjc.A.dylib
    /usr/lib/libz.1.dylib
    /usr/lib/swift/libswiftCore.dylib
    /usr/lib/libSystem.B.dylib
    /usr/lib/libc++.1.dylib

We would encode

    <84> <45> CAppKit <00>

We then follow with

    <84> <45> APhotos <00>

Next we have

    <81> <10> /libobjc.A.dylib <00>
    <81> <0d> /libz.1.dylib <00>
    <81> <19> /swift/libswiftCore.dylib <00>

assigning code 32 to `/swift`, then

    <81> <12> /libSystem.B.dylib <00>
    <81> <0f> /libc++.1.dylib <00>

In total the original data would have taken up 256 bytes, counting a
terminating NUL byte for each string.  Instead, we have used 122
bytes, a saving of about 52%.
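
The totals can be re-derived with a short script.  It takes the encoded byte sequences exactly as listed in the example, and it assumes the figure for the original data counts one terminating NUL byte per path, which is the reading that reproduces 256.

```python
paths = [
    "/System/Library/Frameworks/AppKit.framework/Versions/C/AppKit",
    "/System/Library/Frameworks/Photos.framework/Versions/A/Photos",
    "/usr/lib/libobjc.A.dylib",
    "/usr/lib/libz.1.dylib",
    "/usr/lib/swift/libswiftCore.dylib",
    "/usr/lib/libSystem.B.dylib",
    "/usr/lib/libc++.1.dylib",
]

# Byte sequences copied from the example, opcode bytes first.
encoded = [
    b"\x84\x45" + b"CAppKit" + b"\x00",
    b"\x84\x45" + b"APhotos" + b"\x00",
    b"\x81\x10" + b"/libobjc.A.dylib" + b"\x00",
    b"\x81\x0d" + b"/libz.1.dylib" + b"\x00",
    b"\x81\x19" + b"/swift/libswiftCore.dylib" + b"\x00",
    b"\x81\x12" + b"/libSystem.B.dylib" + b"\x00",
    b"\x81\x0f" + b"/libc++.1.dylib" + b"\x00",
]

# Assumption: each raw path is counted with one terminating NUL byte.
raw_size = sum(len(p) + 1 for p in paths)
encoded_size = sum(len(e) for e in encoded)
saving = 1 - encoded_size / raw_size

print(raw_size, encoded_size, f"{saving:.1%}")
```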