In Swift 5.6 and below, (broken) code that acquired indices from a
UTF-16-encoded string bridged from Cocoa and kept using them after a
`makeContiguousUTF8` call (or other mutation) may have appeared to be
working correctly as long as the string was ASCII.
Since https://github.com/apple/swift/pull/41417, the
`String(_:within:)` initializers recognize miscoded indices and reject
them by returning nil. This is technically correct, but it
unfortunately may be a binary compatibility issue, as these used to
return non-nil in previous versions.
Mitigate this issue by accepting UTF-16 indices on a UTF-8 string,
transcoding their offset as needed. (Attempting to use an UTF-8 index
on a UTF-16 string is still rejected — we do not implicitly convert
strings in that direction.)
rdar://89369680
This fixes a compatibility issue with potential future UTF-8 encoded
foreign String forms, as well as simplifying the code a bit — we no
longer need to do an availability check on inlinable fast paths.
The isForeignUTF8 bit is never set by any past or current stdlib
version, but it allows us to introduce UTF-8 encoded foreign forms
without breaking inlinable index encoding validation introduced in
Swift 5.7.
There are three flavors, corresponding to i < endIndex, i <= endIndex, and range containment checks.
Additionally, we have separate variants for index validation in substrings.
`String.UnicodeScalarView` currently lacks proper index validation in its `index(after:)` and `index(before:)` methods, leading to out-of-bounds memory accesses when index navigation methods in this view are given invalid indices.
(Also see https://github.com/apple/swift/pull/41598 and https://github.com/apple/swift/pull/41417)
rdar://89498541
Introduce checking of ConcurrentValue conformances:
- For structs, check that each stored property conforms to ConcurrentValue
- For enums, check that each associated value conforms to ConcurrentValue
- For classes, check that each stored property is immutable and conforms
to ConcurrentValue
Because all of the stored properties / associated values need to be
visible for this check to work, limit ConcurrentValue conformances to
be in the same source file as the type definition.
This checking can be disabled by conforming to a new marker protocol,
UnsafeConcurrentValue, that refines ConcurrentValue.
UnsafeConcurrentValue otherwise his no specific meaning. This allows
both "I know what I'm doing" for types that manage concurrent access
themselves as well as enabling retroactive conformance, both of which
are fundamentally unsafe but also quite necessary.
The bulk of this change ended up being to the standard library, because
all conformances of standard library types to the ConcurrentValue
protocol needed to be sunk down into the standard library so they
would benefit from the checking above. There were numerous little
mistakes in the initial pass through the stsandard library types that
have now been corrected.
Since scalar-alignment is set in inlinable code, switch the alignment
bit to one of the previously-reserved bits rather than a grapheme
cache bit. Setting a grapheme cache bit in inlinable would break
backward deployment, as older versions would interpret it as a cached
value.
Also adjust the name to "scalar-aligned", which is clearer, and
removed assertion (which should be a real precondition).
Fixes a general category (pun intended) of scalar-alignment bugs
surrounding exchanging non-scalar-aligned indices between views and
for slicing.
SE-0180 unifies the Index type of String and all its views and allows
non-scalar-aligned indices to be used across views. In order to
guarantee behavior, we often have to check and perform scalar
alignment. To speed up these checks, we allocate a bit denoting
known-to-be-aligned, so that the alignment check can skip the
load. The below shows what views need to check for alignment before
they can operate, and whether the indices they produce are aligned.
┌───────────────╥────────────────────┬──────────────────────────┐
│ View ║ Requires Alignment │ Produces Aligned Indices │
╞═══════════════╬════════════════════╪══════════════════════════╡
│ Native UTF8 ║ no │ no │
├───────────────╫────────────────────┼──────────────────────────┤
│ Native UTF16 ║ yes │ no │
╞═══════════════╬════════════════════╪══════════════════════════╡
│ Foreign UTF8 ║ yes │ no │
├───────────────╫────────────────────┼──────────────────────────┤
│ Foreign UTF16 ║ no │ no │
╞═══════════════╬════════════════════╪══════════════════════════╡
│ UnicodeScalar ║ yes │ yes │
├───────────────╫────────────────────┼──────────────────────────┤
│ Character ║ yes │ yes │
└───────────────╨────────────────────┴──────────────────────────┘
The "requires alignment" applies to any operation taking a
String.Index that's not defined entirely in terms of other operations
taking a String.Index. These include:
* index(after:)
* index(before:)
* subscript
* distance(from:to:) (since `to` is compared against directly)
* UTF16View._nativeGetOffset(for:)
String.Index has an encodedOffset-based initializer and computed
property that exists for serialization purposes. It was documented as
UTF-16 in the SE proposal introducing it, which was String's
underlying encoding at the time, but the dream of String even then was
to abstract away whatever encoding happend to be used.
Serialization needs an explicit encoding for serialized indices to
make sense: the offsets need to align with the view. With String
utilizing UTF-8 encoding for native contents in Swift 5, serialization
isn't necessarily the most efficient in UTF-16.
Furthermore, the majority of usage of encodedOffset in the wild is
buggy and operates under the assumption that a UTF-16 code unit was a
Swift Character, which isn't even valid if the String is known to be
all-ASCII (because CR-LF).
This change introduces a pair of semantics-preserving alternatives to
encodedOffset that explicitly call out the UTF-16 assumption. These
serve as a gentle off-ramp for current mis-uses of encodedOffset.
Defining a custom iterator for the UTF16View avoid some redundant
computation over the indexing model. This speeds up iteration by
around 40% on non-ASCII strings.
Provide a custom iterator rather than relying a the IndexingIterator,
as an indexing model is less efficient for stateful processing of
strings. Provides around a 30% speedup.
After rebasing on master and incorporating more 32-bit support,
perform a bunch of cleanup, documentation updates, comments, move code
back to String declaration, etc.
Refactor and rename _StringGutsSlice, apply NFC-aware fast paths to a
new buffered iterator.
Also, fix bug in _typeName which used to assume ASCIIness and better
SIL optimizations on StringObject.
Add inlinability annotations to restore performance parity with 4.2 String.
Take advantage of known NFC as a fast-path for comparison, and
overhaul comparison dispatch.
RRC improvements and optmizations.
* Refactor out RRC implementation into dedicated file.
* Change our `_invariantCheck` pattern to generate efficient code in
asserts builds and make the optimizer job's easier.
* Drop a few Bidi shims we no longer need.
* Restore View decls to String, workaround no longer needed
* Cleaner unicode helper facilities
This is a giant squashing of a lot of individual changes prototyping a
switch of String in Swift 5 to be natively encoded as UTF-8. It
includes what's necessary for a functional prototype, dropping some
history, but still leaves plenty of history available for future
commits.
My apologies to anyone trying to do code archeology between this
commit and the one prior. This was the lesser of evils.
_StringGuts shouldn't expose a subscript, implying efficient
access. Switch to the explicit code unit fetch method. Update tests
accordingly, and switch off of deprecated typealiases.
Move the shifts to index creation time rather than index comparison
time. This seems to benefit micro benchmarks and cover up
inefficiencies in our generic index distance calculations.
Drop append-related @inlinable annotations for String, StringGuts,
StringStorage, and the Views. Drop several for larger operations, such
as case conversion. Drop as many as we can from StringGuts for now.