9 Commits

Author SHA1 Message Date
Karoy Lorentey
3e18a07187 [stdlib] Fix implementation of Unicode text segmentation for word boundaries
Carefully overhaul our word breaking implementation to follow the recommendations of Unicode Annex #29. Start exposing the core primitives (as well as `String`-level interfaces), so that folks can prototype proper API for these concepts.

- Fix `_wordIndex(after:)` to always advance forward. It now requires its input index to be on a word boundary. Remove the `@_spi` attribute, exposing it as a (hidden, but) public entry point.
- The old SPIs `_wordIndex(before:)` and `_nearestWordIndex(atOrBelow:)` were irredeemably broken; follow the Unicode recommendation for implementing random-access text segmentation and replace them both with a new public `_wordIndex(somewhereAtOrBefore:)` entry point.
- Expose handcrafted low-level state machines for detecting word boundaries (`_WordRecognizer`, `_RandomAccessWordRecognizer`), following the design of `_CharacterRecognizer`.
- Add tests to reliably validate that the two state machine flavors always produce consistent results.

rdar://155482680
2025-08-05 20:04:46 -07:00
Doug Gregor
22eecacc35 Adopt unsafe annotations throughout the standard library 2025-02-26 14:28:01 -08:00
Karl
7a57bd8ae4 [stdlib] Refactor Unicode normalization (#73590)
* [stdlib] Refactor Unicode normalization

* Tweak inlining
2024-05-31 08:22:37 -06:00
Alejandro Alonso
f9f640b141 Sendablize the standard library
oops dont add this flag

no more nonisolated
2024-03-05 15:02:09 -08:00
Alejandro Alonso
95da55b182 [stdlib] Implement String.WordView (#42414)
* Implement String.WordView

* Add isWordAligned bit

* Hide WordView for now (also separate Index type)

add bidirectional conformance

Fix tests

* Address comments from Karoy and Michael

* Remove word view, use index methods

* Address Karoy's comments

aaa
2022-06-22 09:10:09 -07:00
Alejandro Alonso
5fe6a7e247 Add caseFolded to scalar properties 2022-04-10 13:03:13 -07:00
Alejandro Alonso
1a3f791779 Underscore the script APIs
forgot underscore part 2

add build rules back in for right now

indents
2022-04-07 16:20:10 -07:00
Alejandro Alonso
79bf63505b Add Unicode script data to scalar properties 2022-04-06 13:37:04 -07:00
Alejandro Alonso
c9115e1cec Create wrapper types with availability NFD and NFC
Swap the bases to unicodeScalarView
2022-04-05 09:06:32 -07:00