Adds in a special case grapheme break detection between two values
within scalar ranges where we have special knowledge. Any sub-0x300
scalars, except CR-LF, are guaranteed to have grapheme breaks between
them. We're reasonably confident this will not change in future
versions of Unicode. We might add more ranges in the future, but
should do so conservatively, anticipating future Unicode changes.
In these cases we can very quickly break, even for strings that have
mixed latin and emoji characters. In a ASCII string with a single
emoji in it, we traverse the string 2x faster forwards and 3x faster
in reverse. (Reverse is 3x faster as it involves some forwards
traversal inside of the index). For a string that's half Latin half
non-Latin, we're about 1.5x faster forwards and backwards.
By removing the _fastPath inside of grapheme breaking, we avoid
cooling off all the other blocks in the function, restoring
performance parity with before the ASCII special case logic. This does
not meaningfully impact the performance of the special case, so we are
still about 2-3x faster iterating forwards and backwards over ASCII
characters than before.
By doing a fast check for CR-LF, we can speed up forwards and
backwards character iteration on ASCII strings by ~2-3x. This speedup
can be seen in the stringTests suite (previous PR).
It is done as a fast path inside of
_measureExtendedGraphemeClusterForward/Backward, which is never
inlined. Doing the fast path inline would yield even more dramatic
improvements, and might be an area for future exploration, but would
slightly pessimize non-ASCII strings due to code bloat. Additionally,
the current implementation has not been micro-optimized yet. It's
possible that more optimal code would be smaller and have less of an
impact on the general case.
* [stdlib] Fix String.UTF16View index sharing
* [stdlib] Fix String.UnicodeScalarView index sharing
* [stdlib] Fix String.CharacterView index sharing
* [stdlib] Test advancing string indices past their ends
* [stdlib] Simplify CharacterView ranged subscript
We should not make grapheme segmentation algorithm inlineable now (and
possibly ever). The immediate reason is that Unicode 8.0 grapheme
segmentation algorithm has been changed significantly in Unicode 9.0.
Unicode 9.0 requires a stateful algorithm. Current indices and
iterators just don't have the storage for that state, so we can't mark
them as fragile.
This removes the `_core` property from UnicodeScalarView.Index
and moves any remaining index-moving logic from the index to
the view in UnicodeScalarView and CharacterView.
This documentation revision covers a large number of types & protocols:
String, its views and their indices, the Unicode codec types and protocol,
as well as Character, UnicodeScalar, and StaticString, among others.
This also includes a few small changes across the standard library for
consistency.
1_stdlib/Collection.swift test runs without error.
Note: advance and distance implementations for which the default works
were removed from StringCharacterView.swift
-removed fatal stub Collection.next(Index)
-added default Collection.next(Index) where Index is Strideable
-added custom next(Index) on some collections
-added fatal stub next(Index) on some collections
I am using Erik's inliner analysis tool to identify areas where the inliner is
too aggressive. In this commit I disabled the inlining of the method
measureExtendedGraphemeCluster. This change reduced the size of the swift
standard library by about 45k (2% of the .text size). I did not see any
performance regressions on the performance test suite.
This commit is the result of auditing the following files:
* StringBuffer.swift
* StringCharacterView.swift
* StringCore.swift
* StringInterpolation.swift.gyb
* StringLegacy.swift