[stdlib] Fix implementation of Unicode text segmentation for word boundaries

Carefully overhaul our word breaking implementation to follow the recommendations of Unicode Annex #29. Start exposing the core primitives (as well as `String`-level interfaces), so that folks can prototype proper API for these concepts.

- Fix `_wordIndex(after:)` to always advance forward. It now requires its input index to be on a word boundary. Remove the `@_spi` attribute, exposing it as a (hidden, but) public entry point.
- The old SPIs `_wordIndex(before:)` and `_nearestWordIndex(atOrBelow:)` were irredeemably broken; follow the Unicode recommendation for implementing random-access text segmentation and replace them both with a new public `_wordIndex(somewhereAtOrBefore:)` entry point.
- Expose handcrafted low-level state machines for detecting word boundaries (`_WordRecognizer`, `_RandomAccessWordRecognizer`), following the design of `_CharacterRecognizer`.
- Add tests to reliably validate that the two state machine flavors always produce consistent results.

rdar://155482680
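As a rough illustration of the forward-iteration contract described above, the sketch below walks a string from `startIndex`, which is always a word boundary, and collects the ranges between successive boundaries. It assumes a toolchain that includes this change and that `_wordIndex(after:)` takes and returns a `String.Index`; the helper name `wordRanges(in:)` is hypothetical, and the underscored entry point is unstable and may carry availability restrictions on some platforms.

```swift
// Hypothetical helper, not part of this change: splits `text` into the
// segments between consecutive word boundaries.
// Assumption: `String._wordIndex(after:)` is visible to the caller and has
// the shape `(String.Index) -> String.Index`.
func wordRanges(in text: String) -> [Range<String.Index>] {
  var result: [Range<String.Index>] = []
  // `startIndex` is always a word boundary, so it satisfies the new
  // precondition that the input index lies on a boundary.
  var start = text.startIndex
  while start < text.endIndex {
    let end = text._wordIndex(after: start)
    // Per the commit message, the result always advances forward.
    precondition(end > start, "word boundary did not advance")
    result.append(start..<end)
    start = end
  }
  return result
}
```

Because each returned index is itself a word boundary, it can be fed straight back in as the next input, so the loop never needs the backward or "nearest boundary" queries that this change replaces.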
2025-12-21 12:14:44 +01:00 · 2025-07-24 17:25:56 -07:00
parent 22b1205cf2
commit 3e18a07187
7 changed files with 1165 additions and 702 deletions
--- a/stdlib/public/core/StringIndexValidation.swift
+++ b/stdlib/public/core/StringIndexValidation.swift
@@ -400,20 +400,3 @@ extension _StringGuts {
      scalarAlign(validateInclusiveSubscalarIndex_5_7(i)))
  }
 }
-
-// Word index validation (String)
-extension _StringGuts {
-  internal func validateWordIndex(
-    _ i: String.Index
-  ) -> String.Index {
-    return roundDownToNearestWord(scalarAlign(validateSubscalarIndex(i)))
-  }
-
-  internal func validateInclusiveWordIndex(
-    _ i: String.Index
-  ) -> String.Index {
-    return roundDownToNearestWord(
-      scalarAlign(validateInclusiveSubscalarIndex(i))
-    )
-  }
-}