Commit Graph

5 Commits

Author SHA1 Message Date
Karoy Lorentey
3e18a07187 [stdlib] Fix implementation of Unicode text segmentation for word boundaries
Carefully overhaul our word breaking implementation to follow the recommendations of Unicode Annex #29. Start exposing the core primitives (as well as `String`-level interfaces), so that folks can prototype proper API for these concepts.

- Fix `_wordIndex(after:)` to always advance forward. It now requires its input index to be on a word boundary. Remove the `@_spi` attribute, exposing it as a (hidden, but) public entry point.
- The old SPIs `_wordIndex(before:)` and `_nearestWordIndex(atOrBelow:)` were irredemably broken; follow the Unicode recommendation for implementing random-access text segmentation and replace them both with a new public `_wordIndex(somewhereAtOrBefore:)` entry pont.
- Expose handcrafted low-level state machines for detecting word boundaries (_WordRecognizer`, `_RandomAccessWordRecognizer`), following the design of `_CharacterRecognizer`.
- Add tests to reliably validate that the two state machine flavors always produce consistent results.

rdar://155482680
2025-08-05 20:04:46 -07:00
Doug Gregor
22eecacc35 Adopt unsafe annotations throughout the standard library 2025-02-26 14:28:01 -08:00
David Smith
f1feba0e42 Eytzingerize word break data (#71731)
Switch word break property searching to Eytzinger binary search
2024-02-20 22:49:34 -08:00
Alejandro Alonso
3deacf8944 Fix backwards iteration of word breaking 2023-11-08 13:26:20 -08:00
Alejandro Alonso
95da55b182 [stdlib] Implement String.WordView (#42414)
* Implement String.WordView

* Add isWordAligned bit

* Hide WordView for now (also separate Index type)

add bidirectional conformance

Fix tests

* Address comments from Karoy and Michael

* Remove word view, use index methods

* Address Karoy's comments

aaa
2022-06-22 09:10:09 -07:00