String.Index has an encodedOffset-based initializer and computed
property that exists for serialization purposes. It was documented as
UTF-16 in the SE proposal introducing it, which was String's
underlying encoding at the time, but the dream of String even then was
to abstract away whatever encoding happend to be used.
Serialization needs an explicit encoding for serialized indices to
make sense: the offsets need to align with the view. With String
utilizing UTF-8 encoding for native contents in Swift 5, serialization
isn't necessarily the most efficient in UTF-16.
Furthermore, the majority of usage of encodedOffset in the wild is
buggy and operates under the assumption that a UTF-16 code unit was a
Swift Character, which isn't even valid if the String is known to be
all-ASCII (because CR-LF).
This change introduces a pair of semantics-preserving alternatives to
encodedOffset that explicitly call out the UTF-16 assumption. These
serve as a gentle off-ramp for current mis-uses of encodedOffset.
Defining a custom iterator for the UTF16View avoid some redundant
computation over the indexing model. This speeds up iteration by
around 40% on non-ASCII strings.
These should be audited since some might not actually need to be
@inlinable, but for now:
- Anything public and @inline(__always) is now also @inlinable
- Anything @usableFromInline and @inline(__always) is now @inlinable
After rebasing on master and incorporating more 32-bit support,
perform a bunch of cleanup, documentation updates, comments, move code
back to String declaration, etc.
Refactor and rename _StringGutsSlice, apply NFC-aware fast paths to a
new buffered iterator.
Also, fix bug in _typeName which used to assume ASCIIness and better
SIL optimizations on StringObject.
* Refactor out RRC implementation into dedicated file.
* Change our `_invariantCheck` pattern to generate efficient code in
asserts builds and make the optimizer job's easier.
* Drop a few Bidi shims we no longer need.
* Restore View decls to String, workaround no longer needed
* Cleaner unicode helper facilities
This is a giant squashing of a lot of individual changes prototyping a
switch of String in Swift 5 to be natively encoded as UTF-8. It
includes what's necessary for a functional prototype, dropping some
history, but still leaves plenty of history available for future
commits.
My apologies to anyone trying to do code archeology between this
commit and the one prior. This was the lesser of evils.
Move the shifts to index creation time rather than index comparison
time. This seems to benefit micro benchmarks and cover up
inefficiencies in our generic index distance calculations.
Simplify String.Index by sinking transcoded offsets into the .utf8
variant. This is in preparation for a more resilient index type
capable of supporting existential string indices.
Drop append-related @inlinable annotations for String, StringGuts,
StringStorage, and the Views. Drop several for larger operations, such
as case conversion. Drop as many as we can from StringGuts for now.
- Rename `Substring.CharacterView` to `Substring._CharacterView`, adding a deprecated typealias for the original name, like we do for `String.CharacterView`.
- Add a non-deprecated `Substring._characters` property, emulating `String.characters`.
- Explicitly deprecate the following members:
* String.withMutableCharacters<R>(_: (inout CharacterView) -> R) -> R
* String.subscript(Range<Index>) -> String.CharacterView
* Substring._CharacterView.subscript(Range<Index>) -> Substring.CharacterView
* Substring.init(_: CharacterView)
* String.init(_: Substring.CharacterView)
Include the initial implementation of _StringGuts, a 2-word
replacement for _LegacyStringCore. 64-bit Darwin supported, 32-bit and
Linux support in subsequent commits.
In grand LLVM tradition, the first step to redesigning _StringCore is
to first rename it to _LegacyStringCore. Subsequent commits will
introduce the replacement, and eventually all uses of the old one will
be moved to the new one.
NFC.
* Eradicate IndexDistance associated type, replacing with Int everywhere
* Consistently use Int for ExistentialCollection’s IndexDistance type.
* Fix test for IndexDistance removal
* Remove a handful of no-longer-needed explicit types
* Add compatibility shims for non-Int index distances
* Test compatibility shim
* Move IndexDistance typealias into the Collection protocol
- Update NSRange -> Range guidance
- Fix example in Optional
- Improve RangeExpression docs
- Fix issue in UnsafeRawBufferPointer.initializeMemory
- Code point -> scalar value most places
- Reposition the dot above the scripty `i'
- Fix ExpressibleByArrayLiteral code sample
CharacterView is now entirely redundant in Swift 4. Deprecate its
use. This also allows us to schedule the unbreaking of
String.CharacterView leakiness without a hard source break.
I failed to merge the upstream changes to swift-corelibs-foundation at the same
time as I merged that #9806, and it broke on linux. Going to get it right this
time.
Removes the legacy grapheme breaking code paths. Simplifies and
clarifies the non-contiguous grapheme breaking code through consistent
naming and handling of absolute positions vs relative offsets.
This adds Unicode 9 grapheme breaking support for non-contiguous
NSStrings. Non-contiguous NSStrings that don't hit our fast paths are
very rare, but should still behave identically to contiguous
strings.
We first copy a fixed number of code units into a fixed size buffer
(currently 16 in size) and try to grapheme break inside of that
buffer. This is sufficient storage for all known non-pathological
graphemes. Any graphemes larger than the buffer are handled by copying
larger portions of the string into an Array.
Test cases added, including pathological "zalgo" text that stresses
extremely long graphemes.
Many strings use non-sub-300 punctuation characters (e.g. unicode
hyphen, CJK quotes, etc). This can cause switching between fast and
slow paths for grapheme breaking. Add in fast-paths for general
punctuation characters and CJK punctuation and symbol characters.
This results in about a 5-8x speedup for heavily (unicode) punctuated
Latiny and CJKy workloads.
This takes care of the standard library portion, but we need a new
BuiltinUTF16ExtendedGraphemeClusterLiteralConvertible protocol in order to
fully recover the performance of character literals.
Note that part of the character_literals.swift test is currently disabled. That
will need to be fixed before we can merge this work.
Add in more grapheme break fast paths for scripts based on Cyrillic,
Arabic, or Hangul. Generates significant performance wins, similar to
those for the unihan fast paths.
While every extra check does slow down the runtime of
_internalExtraCheckGraphemeBreakBetween as currently implemented, I've
not found the performance cost to be relevant for workloads with
occasional mixed emoji contents, nor for workloads that his the
earlier checks. A pure Korean workload (currently the last check) does
pays a rather noticable price for the previous checks, but this is
only because the workload is now so greatly improved. Optimizing this
implementation is interesting future work, but not urgent.