6 Commits

Author SHA1 Message Date
Karoy Lorentey
3e18a07187 [stdlib] Fix implementation of Unicode text segmentation for word boundaries
Carefully overhaul our word breaking implementation to follow the recommendations of Unicode Annex #29. Start exposing the core primitives (as well as `String`-level interfaces), so that folks can prototype proper API for these concepts.

- Fix `_wordIndex(after:)` to always advance forward. It now requires its input index to be on a word boundary. Remove the `@_spi` attribute, exposing it as a (hidden, but) public entry point.
- The old SPIs `_wordIndex(before:)` and `_nearestWordIndex(atOrBelow:)` were irredeemably broken; follow the Unicode recommendation for implementing random-access text segmentation and replace them both with a new public `_wordIndex(somewhereAtOrBefore:)` entry point.
- Expose handcrafted low-level state machines for detecting word boundaries (`_WordRecognizer`, `_RandomAccessWordRecognizer`), following the design of `_CharacterRecognizer`.
- Add tests to reliably validate that the two state machine flavors always produce consistent results.

rdar://155482680
2025-08-05 20:04:46 -07:00
Guillaume Lessard
dfb2e2f12e [stdlib] annotate uses of Range.init(_uncheckedBounds:)
2025-03-05 18:52:11 -08:00
Karoy Lorentey
e885037068 [stdlib] String: Expose _index(roundingDown:) functions in all String views
These simply expose the preexisting internal
`_StringGuts.validate*Index` functions that indexing operations
use to implicitly round indices down to the nearest valid index. (Or, in the case of the encoding views, the nearest scalar boundary.)

Being able to do this as a standalone, explicit, efficient operation
is crucial when implementing some `String` algorithms that need to
work with arbitrary indices.
2022-12-31 17:42:32 -08:00
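
To illustrate what "rounding an index down" means in the commit above, here is a minimal sketch that uses only long-standing public `String` API. The helper `roundedDownToCharacter(_:in:)` is a hypothetical name invented for this example; it is not the stdlib's `_index(roundingDown:)`, it assumes its input is in bounds and already on a Unicode scalar boundary, and it walks backward linearly rather than using the efficient internal validation functions the commit exposes.

```swift
// Hypothetical helper (not part of the stdlib): rounds a scalar-aligned,
// in-bounds index down to the nearest Character boundary in `s`.
func roundedDownToCharacter(_ i: String.Index, in s: String) -> String.Index {
    var j = i
    // `String.Index(_:within:)` returns nil when `j` does not fall on a
    // Character boundary, so step back one Unicode scalar at a time.
    while j > s.startIndex, String.Index(j, within: s) == nil {
        j = s.unicodeScalars.index(before: j)
    }
    return j
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT forms one Character.
let s = "e\u{301}x"

// This index points at the combining accent: a valid scalar boundary,
// but in the middle of the first Character.
let mid = s.unicodeScalars.index(after: s.startIndex)

let rounded = roundedDownToCharacter(mid, in: s)
let character = s[rounded]   // the whole first Character, "e" + U+0301
```

An algorithm that receives arbitrary indices could call such a helper once up front and then operate on a boundary it can rely on, instead of depending on each indexing operation to round implicitly.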
Alejandro Alonso
95da55b182 [stdlib] Implement String.WordView (#42414)
* Implement String.WordView

* Add isWordAligned bit

* Hide WordView for now (also separate Index type)

add bidirectional conformance

Fix tests

* Address comments from Karoy and Michael

* Remove word view, use index methods

* Address Karoy's comments

aaa
2022-06-22 09:10:09 -07:00
Karoy Lorentey
50c2399a94 [stdlib] Work around binary compatibility issues with String index validation fixes in 5.7
Swift 5.7 added stronger index validation for `String`, so some illegal cases that previously triggered inconsistently diagnosed out-of-bounds accesses now result in reliable runtime errors. Similarly, attempts at applying an index originally vended by a UTF-8 string on a UTF-16 string now result in a reliable runtime error.

As is usually the case, new traps in the stdlib expose code that contains previously undiagnosed / unreliably diagnosed coding issues.

Allow invalid code in binaries built with earlier versions of the stdlib to continue running with the 5.7 library by disabling some of the new traps based on the version of Swift the binary was built with.

In the case of an index encoding mismatch, allow transcoding of string storage regardless of the direction of the mismatch. (Previously we only allowed transcoding a UTF-8 string to UTF-16.)

rdar://93379333
2022-05-17 19:25:10 -07:00
Karoy Lorentey
3c9968945e [stdlib] String: Implement happy paths for index validation
2022-04-10 00:14:43 -07:00