Commit Graph

41 Commits

Author SHA1 Message Date
Michael Ilseman
e6e4bd6056 UTF8Span (#78531)
Add support for UTF8Span

Also, refactor validation and grapheme breaking
2025-04-11 16:11:11 -06:00
Guillaume Lessard
dfb2e2f12e [stdlib] annotate uses of Range.init(_uncheckedBounds:) 2025-03-05 18:52:11 -08:00
Doug Gregor
22eecacc35 Adopt unsafe annotations throughout the standard library 2025-02-26 14:28:01 -08:00
Alejandro Alonso
1f74aa1634 Update grapheme breaking logic to support Unicode 16 2025-01-15 14:08:18 -08:00
Alexander Cyon
c21b1e68fd [stdlib] Fix typos 2024-07-06 13:09:57 +02:00
David Smith
753a9408c2 ASCII fast paths for grapheme stride (#72064) 2024-03-04 22:03:58 -08:00
David Smith
ea7d07714f Switch grapheme break property searching to Eytzinger binary search (#71668) 2024-02-16 16:06:20 -08:00
Karoy Lorentey
a1dae65528 [stdlib] Adjust availability of _CharacterRecognizer conformances
These never made it to 5.8, so their availability needs to be bumped to 5.9.
2023-04-11 16:43:06 -07:00
Karoy Lorentey
a3e517ed36 [stdlib] String: Fix forward implementation of grapheme breaking rule 11
Rule GB11 in Unicode Annex 29 is:

GB11: Extended_Pictographic Extend* ZWJ × Extended_Pictographic

However, our forward grapheme breaking state machine implements it as:

GB11: Extended_Pictographic Extend* ZWJ+ × Extended_Pictographic

We implement the correct rules when going backward, which can cause String values to have different counts whether we’re going forward or back.

The rule as implemented would be fine (Unicode doesn’t care much about the placement of grapheme breaks in invalid sequences), but the directional inconsistency messes with String’s Collection conformance.

rdar://104279671
2023-01-15 16:12:38 -08:00
Karoy Lorentey
2f1ed631e2 [stdlib] _CharacterRecognizer: Add Sendable, Equatable, CustomStringConvertible conformances
Equatability allows faster implementations for updating cached grapheme boundary state after a text mutation, because it enables quick detection of before/after state equality, without having to feed the recognizers until they produce a synchronized grapheme break.

The CustomStringConvertible conformance makes it orders of magnitude more pleasant to debug code that uses this.

Sendable is a baseline requirement for value types these days.
2023-01-06 14:51:37 -08:00
Karoy Lorentey
fa2f63cae0 [stdlib] _CharacterRecognizer._firstBreak(inUncheckedUnsafeUTF8Buffer:startingAt:) 2023-01-03 21:00:01 -08:00
Karoy Lorentey
87422e5dc4 [stdlib] _CharacterRecognizer: Remove initializer argument 2023-01-03 20:59:24 -08:00
Karoy Lorentey
55583ac13c [stdlib] Add new SPI for grapheme breaking (outside String)
`Unicode._CharacterRecognizer` is a newly exported opaque type that
exposes the stdlib’s extended grapheme cluster breaking facility,
independent of `String`.

This essentially makes the underlying simple state machine public,
without exposing any of the (unstable) Unicode details.

The ability to perform grapheme breaking over, say, the scalars stored
in multiple `String` values can be extremely useful while building
custom text processing algorithms and data structures.

Ideally this would eventually become API, but before proposing this
to Swift Evolution, I’d like to prove the shape of the type in actual
use (and we’ll also need to find better names for its operations).
2022-12-30 16:32:01 -08:00
Karoy Lorentey
ef0e79b70f [stdlib] String: Move shouldBreak into _GraphemeBreakingState
This turns _GraphemeBreakingState into a more proper state
machine, although it is only able to recognize breaks in the
forward direction.

The backward direction requires arbitrarily long lookback,
and it currently remains in _StringGuts.
2022-12-29 18:04:02 -08:00
Karoy Lorentey
73312fedd4 [stdlib] Grapheme breaking: Refactor to simplify logic
- Split forward and backward direction into separate code paths.
  This makes the code more readable and paves the way for future
  improvements. (E.g., switching to a linear-time algorithm for
  breaking backwards.)
- `Substring.index(after:)` now uses the same grapheme breaking paths
  as `String.index(after:)`.
- The cached stride value in string indices is now well-defined even
  on indices that aren’t character-aligned.
2022-04-05 20:47:42 -07:00
Karoy Lorentey
b29d8f4805 [stdlib] Substring: restrict grapheme breaking to the bounds of the substring
(Oops)
2022-03-29 20:00:08 -07:00
Karoy Lorentey
4eab8355ca [stdlib] String: prefer passing ranges to start+end argument pairs 2022-03-29 20:00:08 -07:00
Karoy Lorentey
298899264d [stdlib] String: Add some extra invariant checks 2022-03-24 21:00:00 -07:00
Karoy Lorentey
2464aa681e [stdlib] String: Ensure indices are marked scalar aligned before rounding down to Character 2022-03-24 21:00:00 -07:00
Karoy Lorentey
8ab2379946 [stdlib] Round indices down to nearest Character in String’s index algorithms
To prevent unaligned indices from breaking well-defined index distance
and index offset calculations, round every index down to the nearest
whole Character.

For the horrific details, see the forum discussion below.

https://forums.swift.org/t/string-index-unification-vs-bidirectionalcollection-requirements/55946

To avoid rounding from regressing String performance in the regular
case (when indices aren’t being passed across string views), introduce
a new String.Index flag bit that indicates that the index is already
Character aligned.
2022-03-24 21:00:00 -07:00
Karoy Lorentey
683b9fa021 [stdlib] Adjust/fix String’s indexing operations to deal with the consequences of SE-0180 2022-03-24 20:59:59 -07:00
Alejandro Alonso
657c17fa39 Setup grapheme breaking tests 2022-02-15 17:16:36 -08:00
Alejandro Alonso
c0e1ef01f9 Fix backwards count of Indic graphemes 2022-02-15 15:28:37 -08:00
Alejandro Alonso
4a451829f8 Implement the Indic grapheme breaking rules 2022-01-05 16:18:54 -08:00
Alejandro Alonso
5a0bbb9f89 [stdlib] Implement native grapheme breaking for String (#37864)
* Implement GraphemeWalker that does native grapheme breaking

* Bridged strings use native grapheme breaking for forward strides

* Implement bidirectional native grapheme breaking for native and foreign strings

* Remove ICU's grapheme breaking support

* Use UnicodeScalarView to implement GraphemeWalker

use an Iterator approach

remove Iterator conformance

* Incorporate Michael's feedback

more comments addressed

fix crlf bug

* Try bringing back some old fast paths

* Parameterize nextBoundary and previousBoundary

Parameterize nextBoundary and previousBoundary

* Implement Michael's suggestions
2021-11-01 16:52:28 -07:00
Valeriy Van
78fb0f7774 Removes redundant buffer zeroing 2020-02-28 23:32:05 +01:00
Michael Ilseman
f52a865570 [String] Slice contents before asking ICU
ICU will return different results if we call with an offset into a
code unit buffer vs if we slice the buffer first and provide an offset
of zero. Slicing more closely models the semantics of SE-0180, so use
that.

Test case coming in subsequent commit enforcing index
scalar-alignment.
2019-06-26 09:22:17 -07:00
Michael Ilseman
415cc8fb0c [String.Index] Deprecate encodedOffset var/init
String.Index has an encodedOffset-based initializer and computed
property that exists for serialization purposes. It was documented as
UTF-16 in the SE proposal introducing it, which was String's
underlying encoding at the time, but the dream of String even then was
to abstract away whatever encoding happend to be used.

Serialization needs an explicit encoding for serialized indices to
make sense: the offsets need to align with the view. With String
utilizing UTF-8 encoding for native contents in Swift 5, serialization
isn't necessarily the most efficient in UTF-16.

Furthermore, the majority of usage of encodedOffset in the wild is
buggy and operates under the assumption that a UTF-16 code unit was a
Swift Character, which isn't even valid if the String is known to be
all-ASCII (because CR-LF).

This change introduces a pair of semantics-preserving alternatives to
encodedOffset that explicitly call out the UTF-16 assumption. These
serve as a gentle off-ramp for current mis-uses of encodedOffset.
2019-02-13 18:42:40 -08:00
Michael Ilseman
18e415b4c0 [String] CJK Grapheme breaking fast-paths for fullwidth
Add in grapheme breaking fast-paths for fullwidth forms and
punctuation. Extend non-combining kana fast-paths to include vowel
extender.
2018-11-16 16:27:20 -08:00
Ben Cohen
1673c12d78 [stdlib] Replace "sanityCheck" with "internalInvariant" (#20616)
* Replace "sanityCheck" with "internalInvariant"
2018-11-15 20:50:22 -08:00
Michael Ilseman
948655e850 [String] Cleanups, comments, documentation
After rebasing on master and incorporating more 32-bit support,
perform a bunch of cleanup, documentation updates, comments, move code
back to String declaration, etc.
2018-11-04 10:42:42 -08:00
Michael Ilseman
7aea40680d [String] NFC iterator fast-paths
Refactor and rename _StringGutsSlice, apply NFC-aware fast paths to a
new buffered iterator.

Also, fix bug in _typeName which used to assume ASCIIness and better
SIL optimizations on StringObject.
2018-11-04 10:42:41 -08:00
Michael Ilseman
a0e639eaf5 [String] Grapheme breaking fast-paths
Add in our scalar-based fast-paths for UTF-8 and foreign strings, and
update the grapheme cache.
2018-11-04 10:42:40 -08:00
Michael Ilseman
fe7c3ce2e4 [String] Refactorings and cleanup
* Refactor out RRC implementation into dedicated file.

* Change our `_invariantCheck` pattern to generate efficient code in
  asserts builds and make the optimizer job's easier.

* Drop a few Bidi shims we no longer need.

* Restore View decls to String, workaround no longer needed

* Cleaner unicode helper facilities
2018-11-04 10:42:40 -08:00
Michael Ilseman
9bf2c4d3d3 [String] Use small string at string creation 2018-11-04 10:42:40 -08:00
Michael Ilseman
4ab45dfe20 [String] Drop in initial UTF-8 String prototype
This is a giant squashing of a lot of individual changes prototyping a
switch of String in Swift 5 to be natively encoded as UTF-8. It
includes what's necessary for a functional prototype, dropping some
history, but still leaves plenty of history available for future
commits.

My apologies to anyone trying to do code archeology between this
commit and the one prior. This was the lesser of evils.
2018-11-04 10:42:40 -08:00
Ben Cohen
a4230ab2ad [stdlib] Update stdlib to 4.0 and reorganize compatibility shims (#17580)
* Update stdlib to 4.0 and move all compatibility shims into a dedicated source file
2018-06-29 06:26:52 -07:00
Michael Ilseman
614016fecd [String.Index] Simplify and prepare for more resilience.
Simplify String.Index by sinking transcoded offsets into the .utf8
variant. This is in preparation for a more resilient index type
capable of supporting existential string indices.
2018-05-24 14:47:04 -07:00
Slava Pestov
2e5aef9c8d stdlib: Remove redundant @usableFromInline attributes 2018-04-06 00:02:30 -07:00
Slava Pestov
e1f50b2d36 SE-0193: Rename @_inlineable to @inlinable, @_versioned to @usableFromInline 2018-03-30 21:55:30 -07:00
Michael Ilseman
3be2faf5d3 [String] Initial implementation of 64-bit StringGuts.
Include the initial implementation of _StringGuts, a 2-word
replacement for _LegacyStringCore. 64-bit Darwin supported, 32-bit and
Linux support in subsequent commits.
2018-01-21 12:32:26 -08:00