Now that the cached character stride in indices always mean the
stride in the full string, we can stop looking at whether a substring
has a character-aligned end index.
This fixes a compatibility issue with potential future UTF-8 encoded
foreign String forms, as well as simplifying the code a bit — we no
longer need to do an availability check on inlinable fast paths.
The isForeignUTF8 bit is never set by any past or current stdlib
version, but it allows us to introduce UTF-8 encoded foreign forms
without breaking inlinable index encoding validation introduced in
Swift 5.7.
- Split forward and backward direction into separate code paths.
This makes the code more readable and paves the way for future
improvements. (E.g., switching to a linear-time algorithm for
breaking backwards.)
- `Substring.index(after:)` now uses the same grapheme breaking paths
as `String.index(after:)`.
- The cached stride value in string indices is now well-defined even
on indices that aren’t character-aligned.
Also, in UTF-16 slices, forward collection methods to the base view
instead of `Slice`, to make behavior a bit easier to understand.
(There is no need to force readers to page in `Slice`
implementations _in addition to_ whatever the base view is doing.)
Also, in UTF-8 slices, forward collection methods to the base view
instead of `Slice`, to make behavior a bit easier to understand.
(There is no need to force readers to page in `Slice`
implementations _in addition to_ whatever the base view is doing.)
Prefer direct stored properties to computed ones — there is no reason
to risk inlining issues, esp. since things like `Slice.base` aren’t
even force-inlined.
Prefer using `_wholeGuts` to spelling out the full incantation.
To prevent unaligned indices from breaking well-defined index distance
and index offset calculations, round every index down to the nearest
whole Character.
For the horrific details, see the forum discussion below.
https://forums.swift.org/t/string-index-unification-vs-bidirectionalcollection-requirements/55946
To avoid rounding from regressing String performance in the regular
case (when indices aren’t being passed across string views), introduce
a new String.Index flag bit that indicates that the index is already
Character aligned.
This used to forward to `Slice.replaceSubrange`, but that’s a generic algorithm that isn’t aware of the pecularities of Unicode extended grapheme clusters, and it can be mislead by unusual cases, like a substring or subrange whose bounds aren’t `Character`-aligned, or a replacement string that starts with a continuation scalar.
There are three flavors, corresponding to i < endIndex, i <= endIndex, and range containment checks.
Additionally, we have separate variants for index validation in substrings.
Assign some previously reserved bits in String.Index and _StringObject to keep track of their associated storage encoding (either UTF-8 or UTF-16).
None of these bits will be reliably set in processes that load binaries compiled with older stdlib releases, but when they do end up getting set, we can use them opportunistically to more reliably detect cases where an index is applied on a string with a mismatching encoding.
As more and more code gets recompiled with 5.7+, the stdlib will gradually become able to detect such issues with complete accuracy.
Code that misuses indices this way was always considered broken; however, String wasn’t able to reliably detect these runtime errors before. Therefore, I expect there is a large amount of broken code out there that keeps using bridged Cocoa String indices (UTF-16) after a mutation turns them into native UTF-8 strings. Therefore, instead of trapping, this commit silently corrects the issue, transcoding the offsets into the correct encoding.
It would probably be a good idea to also emit a runtime warning in addition to recovering from the error. This would generate some noise that would gently nudge folks to fix their code.
rdar://89369680
There is an unescaped reference to type `Substring`. I think a plain lowercased reference would read better. Alternatively we could escape it with backquotes.
Introduce checking of ConcurrentValue conformances:
- For structs, check that each stored property conforms to ConcurrentValue
- For enums, check that each associated value conforms to ConcurrentValue
- For classes, check that each stored property is immutable and conforms
to ConcurrentValue
Because all of the stored properties / associated values need to be
visible for this check to work, limit ConcurrentValue conformances to
be in the same source file as the type definition.
This checking can be disabled by conforming to a new marker protocol,
UnsafeConcurrentValue, that refines ConcurrentValue.
UnsafeConcurrentValue otherwise his no specific meaning. This allows
both "I know what I'm doing" for types that manage concurrent access
themselves as well as enabling retroactive conformance, both of which
are fundamentally unsafe but also quite necessary.
The bulk of this change ended up being to the standard library, because
all conformances of standard library types to the ConcurrentValue
protocol needed to be sunk down into the standard library so they
would benefit from the checking above. There were numerous little
mistakes in the initial pass through the stsandard library types that
have now been corrected.
Inlining those operations does not really give any benefit.
Those operations are complex and most likely, cannot be folded into simpler patterns after inlining, anyway.
Reduces some code size for cases where those functions were inlined (due to a not perfect inlining heuristics).
This is a second pass at the original patch, which broke an OS test.
Due to an oversight it seems that we never added a
withContigousStorageIfAvailable implementation to SubString.UTF8View,
which meant that if you sliced a String you lost the ability to get fast
access to the backing storage. There's no good reason for this
functionality to be missing, so this patch adds it in by delegating to
the Slice implementation.
Resolves SR-11999.
Due to an oversight it seems that we never added a
withContigousStorageIfAvailable implementation to SubString.UTF8View,
which meant that if you sliced a String you lost the ability to get fast
access to the backing storage. There's no good reason for this
functionality to be missing, so this patch adds it in by delegating to
the Slice implementation.
Resolves SR-11999.
Fixes a general category (pun intended) of scalar-alignment bugs
surrounding exchanging non-scalar-aligned indices between views and
for slicing.
SE-0180 unifies the Index type of String and all its views and allows
non-scalar-aligned indices to be used across views. In order to
guarantee behavior, we often have to check and perform scalar
alignment. To speed up these checks, we allocate a bit denoting
known-to-be-aligned, so that the alignment check can skip the
load. The below shows what views need to check for alignment before
they can operate, and whether the indices they produce are aligned.
┌───────────────╥────────────────────┬──────────────────────────┐
│ View ║ Requires Alignment │ Produces Aligned Indices │
╞═══════════════╬════════════════════╪══════════════════════════╡
│ Native UTF8 ║ no │ no │
├───────────────╫────────────────────┼──────────────────────────┤
│ Native UTF16 ║ yes │ no │
╞═══════════════╬════════════════════╪══════════════════════════╡
│ Foreign UTF8 ║ yes │ no │
├───────────────╫────────────────────┼──────────────────────────┤
│ Foreign UTF16 ║ no │ no │
╞═══════════════╬════════════════════╪══════════════════════════╡
│ UnicodeScalar ║ yes │ yes │
├───────────────╫────────────────────┼──────────────────────────┤
│ Character ║ yes │ yes │
└───────────────╨────────────────────┴──────────────────────────┘
The "requires alignment" applies to any operation taking a
String.Index that's not defined entirely in terms of other operations
taking a String.Index. These include:
* index(after:)
* index(before:)
* subscript
* distance(from:to:) (since `to` is compared against directly)
* UTF16View._nativeGetOffset(for:)