Rule GB11 in Unicode Annex 29 is:
GB11: Extended_Pictographic Extend* ZWJ × Extended_Pictographic
However, our forward grapheme breaking state machine implements it as:
GB11: Extended_Pictographic Extend* ZWJ+ × Extended_Pictographic
We implement the correct rules when going backward, which can cause String values to have different counts whether we’re going forward or back.
The rule as implemented would be fine (Unicode doesn’t care much about the placement of grapheme breaks in invalid sequences), but the directional inconsistency messes with String’s Collection conformance.
rdar://104279671
Equatability allows faster implementations for updating cached grapheme boundary state after a text mutation, because it enables quick detection of before/after state equality, without having to feed the recognizers until they produce a synchronized grapheme break.
The CustomStringConvertible conformance makes it orders of magnitude more pleasant to debug code that uses this.
Sendable is a baseline requirement for value types these days.
`Unicode._CharacterRecognizer` is a newly exported opaque type that
exposes the stdlib’s extended grapheme cluster breaking facility,
independent of `String`.
This essentially makes the underlying simple state machine public,
without exposing any of the (unstable) Unicode details.
The ability to perform grapheme breaking over, say, the scalars stored
in multiple `String` values can be extremely useful while building
custom text processing algorithms and data structures.
Ideally this would eventually become API, but before proposing this
to Swift Evolution, I’d like to prove the shape of the type in actual
use (and we’ll also need to find better names for its operations).
This turns _GraphemeBreakingState into a more proper state
machine, although it is only able to recognize breaks in the
forward direction.
The backward direction requires arbitrarily long lookback,
and it currently remains in _StringGuts.
- Split forward and backward direction into separate code paths.
This makes the code more readable and paves the way for future
improvements. (E.g., switching to a linear-time algorithm for
breaking backwards.)
- `Substring.index(after:)` now uses the same grapheme breaking paths
as `String.index(after:)`.
- The cached stride value in string indices is now well-defined even
on indices that aren’t character-aligned.
To prevent unaligned indices from breaking well-defined index distance
and index offset calculations, round every index down to the nearest
whole Character.
For the horrific details, see the forum discussion below.
https://forums.swift.org/t/string-index-unification-vs-bidirectionalcollection-requirements/55946
To avoid rounding from regressing String performance in the regular
case (when indices aren’t being passed across string views), introduce
a new String.Index flag bit that indicates that the index is already
Character aligned.
* Implement GraphemeWalker that does native grapheme breaking
* Bridged strings use native grapheme breaking for forward strides
* Implement bidirectional native grapheme breaking for native and foreign strings
* Remove ICU's grapheme breaking support
* Use UnicodeScalarView to implement GraphemeWalker
use an Iterator approach
remove Iterator conformance
* Incorporate Michael's feedback
more comments addressed
fix crlf bug
* Try bringing back some old fast paths
* Parameterize nextBoundary and previousBoundary
Parameterize nextBoundary and previousBoundary
* Implement Michael's suggestions
ICU will return different results if we call with an offset into a
code unit buffer vs if we slice the buffer first and provide an offset
of zero. Slicing more closely models the semantics of SE-0180, so use
that.
Test case coming in subsequent commit enforcing index
scalar-alignment.
String.Index has an encodedOffset-based initializer and computed
property that exists for serialization purposes. It was documented as
UTF-16 in the SE proposal introducing it, which was String's
underlying encoding at the time, but the dream of String even then was
to abstract away whatever encoding happend to be used.
Serialization needs an explicit encoding for serialized indices to
make sense: the offsets need to align with the view. With String
utilizing UTF-8 encoding for native contents in Swift 5, serialization
isn't necessarily the most efficient in UTF-16.
Furthermore, the majority of usage of encodedOffset in the wild is
buggy and operates under the assumption that a UTF-16 code unit was a
Swift Character, which isn't even valid if the String is known to be
all-ASCII (because CR-LF).
This change introduces a pair of semantics-preserving alternatives to
encodedOffset that explicitly call out the UTF-16 assumption. These
serve as a gentle off-ramp for current mis-uses of encodedOffset.
After rebasing on master and incorporating more 32-bit support,
perform a bunch of cleanup, documentation updates, comments, move code
back to String declaration, etc.
Refactor and rename _StringGutsSlice, apply NFC-aware fast paths to a
new buffered iterator.
Also, fix bug in _typeName which used to assume ASCIIness and better
SIL optimizations on StringObject.
* Refactor out RRC implementation into dedicated file.
* Change our `_invariantCheck` pattern to generate efficient code in
asserts builds and make the optimizer job's easier.
* Drop a few Bidi shims we no longer need.
* Restore View decls to String, workaround no longer needed
* Cleaner unicode helper facilities
This is a giant squashing of a lot of individual changes prototyping a
switch of String in Swift 5 to be natively encoded as UTF-8. It
includes what's necessary for a functional prototype, dropping some
history, but still leaves plenty of history available for future
commits.
My apologies to anyone trying to do code archeology between this
commit and the one prior. This was the lesser of evils.
Simplify String.Index by sinking transcoded offsets into the .utf8
variant. This is in preparation for a more resilient index type
capable of supporting existential string indices.
Include the initial implementation of _StringGuts, a 2-word
replacement for _LegacyStringCore. 64-bit Darwin supported, 32-bit and
Linux support in subsequent commits.