Commit Graph

3 Commits

Author SHA1 Message Date
Karoy Lorentey
3e18a07187 [stdlib] Fix implementation of Unicode text segmentation for word boundaries
Carefully overhaul our word breaking implementation to follow the recommendations of Unicode Annex #29. Start exposing the core primitives (as well as `String`-level interfaces), so that folks can prototype proper API for these concepts.

- Fix `_wordIndex(after:)` to always advance forward. It now requires its input index to be on a word boundary. Remove the `@_spi` attribute, exposing it as a (hidden, but) public entry point.
- The old SPIs `_wordIndex(before:)` and `_nearestWordIndex(atOrBelow:)` were irredeemably broken; follow the Unicode recommendation for implementing random-access text segmentation and replace them both with a new public `_wordIndex(somewhereAtOrBefore:)` entry point.
- Expose handcrafted low-level state machines for detecting word boundaries (`_WordRecognizer`, `_RandomAccessWordRecognizer`), following the design of `_CharacterRecognizer`.
- Add tests to reliably validate that the two state machine flavors always produce consistent results.

rdar://155482680
2025-08-05 20:04:46 -07:00
David Smith
ea7d07714f Switch grapheme break property searching to Eytzinger binary search (#71668)
2024-02-16 16:06:20 -08:00
Alejandro Alonso
95da55b182 [stdlib] Implement String.WordView (#42414)
* Implement String.WordView

* Add isWordAligned bit

* Hide WordView for now (also separate Index type)

add bidirectional conformance

Fix tests

* Address comments from Karoy and Michael

* Remove word view, use index methods

* Address Karoy's comments

aaa
2022-06-22 09:10:09 -07:00