In grand LLVM tradition, the first step to redesigning _StringCore is
to first rename it to _LegacyStringCore. Subsequent commits will
introduce the replacement, and eventually all uses of the old one will
be moved to the new one.
NFC.
* Eradicate IndexDistance associated type, replacing with Int everywhere
* Consistently use Int for ExistentialCollection’s IndexDistance type.
* Fix test for IndexDistance removal
* Remove a handful of no-longer-needed explicit types
* Add compatibility shims for non-Int index distances
* Test compatibility shim
* Move IndexDistance typealias into the Collection protocol
- Update NSRange -> Range guidance
- Fix example in Optional
- Improve RangeExpression docs
- Fix issue in UnsafeRawBufferPointer.initializeMemory
- Code point -> scalar value most places
- Reposition the dot above the scripty `i'
- Fix ExpressibleByArrayLiteral code sample
CharacterView is now entirely redundant in Swift 4. Deprecate its
use. This also allows us to schedule the unbreaking of
String.CharacterView leakiness without a hard source break.
I failed to merge the upstream changes to swift-corelibs-foundation at the same
time as I merged that #9806, and it broke on linux. Going to get it right this
time.
Removes the legacy grapheme breaking code paths. Simplifies and
clarifies the non-contiguous grapheme breaking code through consistent
naming and handling of absolute positions vs relative offsets.
This adds Unicode 9 grapheme breaking support for non-contiguous
NSStrings. Non-contiguous NSStrings that don't hit our fast paths are
very rare, but should still behave identically to contiguous
strings.
We first copy a fixed number of code units into a fixed size buffer
(currently 16 in size) and try to grapheme break inside of that
buffer. This is sufficient storage for all known non-pathological
graphemes. Any graphemes larger than the buffer are handled by copying
larger portions of the string into an Array.
Test cases added, including pathological "zalgo" text that stresses
extremely long graphemes.
Many strings use non-sub-300 punctuation characters (e.g. unicode
hyphen, CJK quotes, etc). This can cause switching between fast and
slow paths for grapheme breaking. Add in fast-paths for general
punctuation characters and CJK punctuation and symbol characters.
This results in about a 5-8x speedup for heavily (unicode) punctuated
Latiny and CJKy workloads.
This takes care of the standard library portion, but we need a new
BuiltinUTF16ExtendedGraphemeClusterLiteralConvertible protocol in order to
fully recover the performance of character literals.
Note that part of the character_literals.swift test is currently disabled. That
will need to be fixed before we can merge this work.
Add in more grapheme break fast paths for scripts based on Cyrillic,
Arabic, or Hangul. Generates significant performance wins, similar to
those for the unihan fast paths.
While every extra check does slow down the runtime of
_internalExtraCheckGraphemeBreakBetween as currently implemented, I've
not found the performance cost to be relevant for workloads with
occasional mixed emoji contents, nor for workloads that his the
earlier checks. A pure Korean workload (currently the last check) does
pays a rather noticable price for the previous checks, but this is
only because the workload is now so greatly improved. Optimizing this
implementation is interesting future work, but not urgent.
Use UBreakIterators to perform grapheme breaking. This gives Unicode 9
grapheme breaking (e.g. family emoji) and provides a means to upgrade
to future versions. It also serves as a model for how to serve up
other advanced functionality in ICU to users.
This has tricky performance implications. Some things are faster and a
number of cases are slower. But, careful use of ICU can help mitigate
and amortize these costs. In conjunction with more early detection of
fast paths, overall grapheme breaking for the average user should be
much faster than in Swift 3.
NOTE: This is incomplete. It currently falls back on the legacy tries
for some bridged strings. There are many potential directions for a
general solution, but for now we'll be interatively adding support for
more and more special cases.
* Give Sequence a top-level Element, constrain Iterator to match
* Remove many instances of Iterator.
* Fixed various hard-coded tests
* XFAIL a few tests that need further investigation
* Change assoc type for arrayLiteralConvertible
* Mop up remaining "better expressed as a where clause" warnings
* Fix UnicodeDecoders prototype test
* Fix UIntBuffer
* Fix hard-coded Element identifier in CSDiag
* Fix up more tests
* Account for flatMap changes
This adds more fast path checks for grapheme breaks between BMP
scalars. Notably the rather vast range of 0x3400–0xA4CF which includes
unified common Han ideographs as well as the first extension to
unified Han ideographs. It also happens to pick up various Yijin and
Yi symbols/radicals. Additionally, the narrow hiragana/katakana ranges
0x3041-0x3096 and 0x30A1-0x30FA (including pre-composed semi-voiced
characters but excluding the combining semi-voice marks) have fast
paths.
The net effect is that the vast majority of modern Chinese and
Japanese text should be fast-pathed. This is especially important, as
adopting Unicode 9 might otherwise pessimize performance here relative
to the tries.
Use UBreakIterators to perform grapheme breaking. This gives Unicode 9
grapheme breaking (e.g. family emoji) and provides a means to upgrade
to future versions. It also serves as a model for how to serve up
other advanced functionality in ICU to users.
This has tricky performance implications. Some things are faster and a
number of cases are slower. But, careful use of ICU can help mitigate
and amortize these costs. In conjunction with more early detection of
fast paths, overall grapheme breaking for the average user should be
much faster than in Swift 3.
NOTE: This is incomplete. It currently falls back on the legacy tries
for some bridged strings. There are many potential directions for a
general solution, but for now we'll be interatively adding support for
more and more special cases.
UnsafeBufferPoiunter subscript used in the fast path only checks bounds
in Debug mode, therefore extra checks are needed.
Addresses: <rdar://problem/31992473>
This adds a fast path for single-code-unit Character
construction. Rather than use the general purpose String based
initializer (which then repeats grapheme breaking to ensure a trap,
amongst other inefficiencies), just make the Character from the single
unicode scalar value directly.
This also speeds up simple iteration of BMP strings when the optimizer
is unable to eliminate the subscript. Around 2x for ASCII, and around
20% for BMP UTF16.
Many of StringCore private APIs, when the StringCore is itself a
substring, expect relative offsets rather than absolute offsets. This
fixes a bug in the sub-0x300 fast path where we were using absolute
offsets. Test cases added.
Adds in a special case grapheme break detection between two values
within scalar ranges where we have special knowledge. Any sub-0x300
scalars, except CR-LF, are guaranteed to have grapheme breaks between
them. We're reasonably confident this will not change in future
versions of Unicode. We might add more ranges in the future, but
should do so conservatively, anticipating future Unicode changes.
In these cases we can very quickly break, even for strings that have
mixed latin and emoji characters. In a ASCII string with a single
emoji in it, we traverse the string 2x faster forwards and 3x faster
in reverse. (Reverse is 3x faster as it involves some forwards
traversal inside of the index). For a string that's half Latin half
non-Latin, we're about 1.5x faster forwards and backwards.
By removing the _fastPath inside of grapheme breaking, we avoid
cooling off all the other blocks in the function, restoring
performance parity with before the ASCII special case logic. This does
not meaningfully impact the performance of the special case, so we are
still about 2-3x faster iterating forwards and backwards over ASCII
characters than before.
By doing a fast check for CR-LF, we can speed up forwards and
backwards character iteration on ASCII strings by ~2-3x. This speedup
can be seen in the stringTests suite (previous PR).
It is done as a fast path inside of
_measureExtendedGraphemeClusterForward/Backward, which is never
inlined. Doing the fast path inline would yield even more dramatic
improvements, and might be an area for future exploration, but would
slightly pessimize non-ASCII strings due to code bloat. Additionally,
the current implementation has not been micro-optimized yet. It's
possible that more optimal code would be smaller and have less of an
impact on the general case.
* [stdlib] Fix String.UTF16View index sharing
* [stdlib] Fix String.UnicodeScalarView index sharing
* [stdlib] Fix String.CharacterView index sharing
* [stdlib] Test advancing string indices past their ends
* [stdlib] Simplify CharacterView ranged subscript