swift-mirror

mirror of https://github.com/apple/swift.git synced 2025-12-14 20:36:38 +01:00

Author	SHA1	Message	Date
Michael Ilseman	e6e4bd6056	UTF8Span (#78531) Add support for UTF8Span Also, refactor validation and grapheme breaking	2025-04-11 16:11:11 -06:00
Guillaume Lessard	dfb2e2f12e	[stdlib] annotate uses of `Range.init(_uncheckedBounds:)`	2025-03-05 18:52:11 -08:00
Doug Gregor	22eecacc35	Adopt unsafe annotations throughout the standard library	2025-02-26 14:28:01 -08:00
Alejandro Alonso	1f74aa1634	Update grapheme breaking logic to support Unicode 16	2025-01-15 14:08:18 -08:00
Alexander Cyon	c21b1e68fd	[stdlib] Fix typos	2024-07-06 13:09:57 +02:00
David Smith	753a9408c2	ASCII fast paths for grapheme stride (#72064)	2024-03-04 22:03:58 -08:00
David Smith	ea7d07714f	Switch grapheme break property searching to Eytzinger binary search (#71668)	2024-02-16 16:06:20 -08:00
Karoy Lorentey	a1dae65528	[stdlib] Adjust availability of _CharacterRecognizer conformances These never made it to 5.8, so their availability needs to be bumped to 5.9.	2023-04-11 16:43:06 -07:00
Karoy Lorentey	a3e517ed36	[stdlib] String: Fix forward implementation of grapheme breaking rule 11 Rule GB11 in Unicode Annex 29 is: GB11: Extended_Pictographic Extend* ZWJ × Extended_Pictographic However, our forward grapheme breaking state machine implements it as: GB11: Extended_Pictographic Extend* ZWJ+ × Extended_Pictographic We implement the correct rules when going backward, which can cause String values to have different counts whether we’re going forward or back. The rule as implemented would be fine (Unicode doesn’t care much about the placement of grapheme breaks in invalid sequences), but the directional inconsistency messes with String’s Collection conformance. rdar://104279671	2023-01-15 16:12:38 -08:00
Karoy Lorentey	2f1ed631e2	[stdlib] _CharacterRecognizer: Add Sendable, Equatable, CustomStringConvertible conformances Equatability allows faster implementations for updating cached grapheme boundary state after a text mutation, because it enables quick detection of before/after state equality, without having to feed the recognizers until they produce a synchronized grapheme break. The CustomStringConvertible conformance makes it orders of magnitude more pleasant to debug code that uses this. Sendable is a baseline requirement for value types these days.	2023-01-06 14:51:37 -08:00
Karoy Lorentey	fa2f63cae0	[stdlib] _CharacterRecognizer._firstBreak(inUncheckedUnsafeUTF8Buffer:startingAt:)	2023-01-03 21:00:01 -08:00
Karoy Lorentey	87422e5dc4	[stdlib] _CharacterRecognizer: Remove initializer argument	2023-01-03 20:59:24 -08:00
Karoy Lorentey	55583ac13c	[stdlib] Add new SPI for grapheme breaking (outside String) `Unicode._CharacterRecognizer` is a newly exported opaque type that exposes the stdlib’s extended grapheme cluster breaking facility, independent of `String`. This essentially makes the underlying simple state machine public, without exposing any of the (unstable) Unicode details. The ability to perform grapheme breaking over, say, the scalars stored in multiple `String` values can be extremely useful while building custom text processing algorithms and data structures. Ideally this would eventually become API, but before proposing this to Swift Evolution, I’d like to prove the shape of the type in actual use (and we’ll also need to find better names for its operations).	2022-12-30 16:32:01 -08:00
Karoy Lorentey	ef0e79b70f	[stdlib] String: Move shouldBreak into _GraphemeBreakingState This turns _GraphemeBreakingState into a more proper state machine, although it is only able to recognize breaks in the forward direction. The backward direction requires arbitrarily long lookback, and it currently remains in _StringGuts.	2022-12-29 18:04:02 -08:00
Karoy Lorentey	73312fedd4	[stdlib] Grapheme breaking: Refactor to simplify logic - Split forward and backward direction into separate code paths. This makes the code more readable and paves the way for future improvements. (E.g., switching to a linear-time algorithm for breaking backwards.) - `Substring.index(after:)` now uses the same grapheme breaking paths as `String.index(after:)`. - The cached stride value in string indices is now well-defined even on indices that aren’t character-aligned.	2022-04-05 20:47:42 -07:00
Karoy Lorentey	b29d8f4805	[stdlib] Substring: restrict grapheme breaking to the bounds of the substring (Oops)	2022-03-29 20:00:08 -07:00
Karoy Lorentey	4eab8355ca	[stdlib] String: prefer passing ranges to start+end argument pairs	2022-03-29 20:00:08 -07:00
Karoy Lorentey	298899264d	[stdlib] String: Add some extra invariant checks	2022-03-24 21:00:00 -07:00
Karoy Lorentey	2464aa681e	[stdlib] String: Ensure indices are marked scalar aligned before rounding down to Character	2022-03-24 21:00:00 -07:00
Karoy Lorentey	8ab2379946	[stdlib] Round indices down to nearest Character in String’s index algorithms To prevent unaligned indices from breaking well-defined index distance and index offset calculations, round every index down to the nearest whole Character. For the horrific details, see the forum discussion below. https://forums.swift.org/t/string-index-unification-vs-bidirectionalcollection-requirements/55946 To avoid rounding from regressing String performance in the regular case (when indices aren’t being passed across string views), introduce a new String.Index flag bit that indicates that the index is already Character aligned.	2022-03-24 21:00:00 -07:00
Karoy Lorentey	683b9fa021	[stdlib] Adjust/fix String’s indexing operations to deal with the consequences of SE-0180	2022-03-24 20:59:59 -07:00
Alejandro Alonso	657c17fa39	Setup grapheme breaking tests	2022-02-15 17:16:36 -08:00
Alejandro Alonso	c0e1ef01f9	Fix backwards count of Indic graphemes	2022-02-15 15:28:37 -08:00
Alejandro Alonso	4a451829f8	Implement the Indic grapheme breaking rules	2022-01-05 16:18:54 -08:00
Alejandro Alonso	5a0bbb9f89	[stdlib] Implement native grapheme breaking for String (#37864) * Implement GraphemeWalker that does native grapheme breaking * Bridged strings use native grapheme breaking for forward strides * Implement bidirectional native grapheme breaking for native and foreign strings * Remove ICU's grapheme breaking support * Use UnicodeScalarView to implement GraphemeWalker use an Iterator approach remove Iterator conformance * Incorporate Michael's feedback more comments addressed fix crlf bug * Try bringing back some old fast paths * Parameterize nextBoundary and previousBoundary Parameterize nextBoundary and previousBoundary * Implement Michael's suggestions	2021-11-01 16:52:28 -07:00
Valeriy Van	78fb0f7774	Removes redundant buffer zeroing	2020-02-28 23:32:05 +01:00
Michael Ilseman	f52a865570	[String] Slice contents before asking ICU ICU will return different results if we call with an offset into a code unit buffer vs if we slice the buffer first and provide an offset of zero. Slicing more closely models the semantics of SE-0180, so use that. Test case coming in subsequent commit enforcing index scalar-alignment.	2019-06-26 09:22:17 -07:00
Michael Ilseman	415cc8fb0c	[String.Index] Deprecate encodedOffset var/init String.Index has an encodedOffset-based initializer and computed property that exists for serialization purposes. It was documented as UTF-16 in the SE proposal introducing it, which was String's underlying encoding at the time, but the dream of String even then was to abstract away whatever encoding happened to be used. Serialization needs an explicit encoding for serialized indices to make sense: the offsets need to align with the view. With String utilizing UTF-8 encoding for native contents in Swift 5, serialization isn't necessarily the most efficient in UTF-16. Furthermore, the majority of usage of encodedOffset in the wild is buggy and operates under the assumption that a UTF-16 code unit was a Swift Character, which isn't even valid if the String is known to be all-ASCII (because CR-LF). This change introduces a pair of semantics-preserving alternatives to encodedOffset that explicitly call out the UTF-16 assumption. These serve as a gentle off-ramp for current mis-uses of encodedOffset.	2019-02-13 18:42:40 -08:00
Michael Ilseman	18e415b4c0	[String] CJK Grapheme breaking fast-paths for fullwidth Add in grapheme breaking fast-paths for fullwidth forms and punctuation. Extend non-combining kana fast-paths to include vowel extender.	2018-11-16 16:27:20 -08:00
Ben Cohen	1673c12d78	[stdlib] Replace "sanityCheck" with "internalInvariant" (#20616) * Replace "sanityCheck" with "internalInvariant"	2018-11-15 20:50:22 -08:00
Michael Ilseman	948655e850	[String] Cleanups, comments, documentation After rebasing on master and incorporating more 32-bit support, perform a bunch of cleanup, documentation updates, comments, move code back to String declaration, etc.	2018-11-04 10:42:42 -08:00
Michael Ilseman	7aea40680d	[String] NFC iterator fast-paths Refactor and rename _StringGutsSlice, apply NFC-aware fast paths to a new buffered iterator. Also, fix bug in _typeName which used to assume ASCIIness and better SIL optimizations on StringObject.	2018-11-04 10:42:41 -08:00
Michael Ilseman	a0e639eaf5	[String] Grapheme breaking fast-paths Add in our scalar-based fast-paths for UTF-8 and foreign strings, and update the grapheme cache.	2018-11-04 10:42:40 -08:00
Michael Ilseman	fe7c3ce2e4	[String] Refactorings and cleanup * Refactor out RRC implementation into dedicated file. * Change our `_invariantCheck` pattern to generate efficient code in asserts builds and make the optimizer's job easier. * Drop a few Bidi shims we no longer need. * Restore View decls to String, workaround no longer needed * Cleaner unicode helper facilities	2018-11-04 10:42:40 -08:00
Michael Ilseman	9bf2c4d3d3	[String] Use small string at string creation	2018-11-04 10:42:40 -08:00
Michael Ilseman	4ab45dfe20	[String] Drop in initial UTF-8 String prototype This is a giant squashing of a lot of individual changes prototyping a switch of String in Swift 5 to be natively encoded as UTF-8. It includes what's necessary for a functional prototype, dropping some history, but still leaves plenty of history available for future commits. My apologies to anyone trying to do code archeology between this commit and the one prior. This was the lesser of evils.	2018-11-04 10:42:40 -08:00
Ben Cohen	a4230ab2ad	[stdlib] Update stdlib to 4.0 and reorganize compatibility shims (#17580) * Update stdlib to 4.0 and move all compatibility shims into a dedicated source file	2018-06-29 06:26:52 -07:00
Michael Ilseman	614016fecd	[String.Index] Simplify and prepare for more resilience. Simplify String.Index by sinking transcoded offsets into the .utf8 variant. This is in preparation for a more resilient index type capable of supporting existential string indices.	2018-05-24 14:47:04 -07:00
Slava Pestov	2e5aef9c8d	stdlib: Remove redundant @usableFromInline attributes	2018-04-06 00:02:30 -07:00
Slava Pestov	e1f50b2d36	SE-0193: Rename @_inlineable to @inlinable, @_versioned to @usableFromInline	2018-03-30 21:55:30 -07:00
Michael Ilseman	3be2faf5d3	[String] Initial implementation of 64-bit StringGuts. Include the initial implementation of _StringGuts, a 2-word replacement for _LegacyStringCore. 64-bit Darwin supported, 32-bit and Linux support in subsequent commits.	2018-01-21 12:32:26 -08:00

41 Commits