35 Commits

Author SHA1 Message Date
David Smith
7b78a1d4b4 Avoid StringUTF16View dispatch overhead for some bridged String methods (#83529)
This removes a bunch of overhead on the UTF16 paths in String, as well
as consolidating the complicated bits of the logic in one file.
2025-09-22 17:03:24 -07:00
Doug Gregor
22eecacc35 Adopt unsafe annotations throughout the standard library 2025-02-26 14:28:01 -08:00
Karoy Lorentey
3f5dfea4b1 [stdlib] String: Avoid retain/release operations around use sites of sharedStorage and cocoaObject 2023-02-15 14:21:46 -08:00
DylanPerry5@gmail.com
512b499406 Issue #59903 - Removing _decodeScalar function that isn't being used 2022-07-29 19:51:52 -04:00
Karoy Lorentey
64a57e3ade [stdlib] Switch to using unchecked buffer subscript in low-level Unicode helpers
We expect indices to be already validated by the time these are
called, and UBP’s debug-mode checks are compiled into opaque parts
of the stdlib.

(The big exception is that Swift <5.7 used to not do proper index
validation and allowed out-of-bounds accesses. We emulate this
behavior for code that was compiled with such a release, and it turns
out that these UnsafeBufferPointer checks are interfering with the
intended (undefined) behavior.)

rdar://93707276
2022-07-05 18:33:51 -07:00
Karoy Lorentey
847337efd7 [stdlib][cosmetics] Clean up unused/underused interfaces, update naming
There is little point to having `isUTF16` properties when they simply
return `!isUTF8`; remove them.

Rename `String.Index._copyEncoding(from:)` to
`_copyingEncoding(from:)`.
2022-04-18 21:06:20 -07:00
Karoy Lorentey
bbb004854e [stdlib] Minor enhancements 2022-04-10 16:49:01 -07:00
Karoy Lorentey
dc6990370e [stdlib] StringGuts.scalarAlign: Preserve encoding flags in returned index 2022-03-29 20:00:08 -07:00
Karoy Lorentey
321284e9a9 [stdlib] Review & fix index validation during String index conversions
- Validate that the index has the same encoding as the string
- Validate that the index is within bounds
2022-03-24 21:00:00 -07:00
Karoy Lorentey
6e18955f90 [stdlib] Add bookkeeping to keep track of the encoding of strings and indices
Assign some previously reserved bits in String.Index and _StringObject to keep track of their associated storage encoding (either UTF-8 or UTF-16).

None of these bits will be reliably set in processes that load binaries compiled with older stdlib releases, but when they do end up getting set, we can use them opportunistically to more reliably detect cases where an index is applied on a string with a mismatching encoding.

As more and more code gets recompiled with 5.7+, the stdlib will gradually become able to detect such issues with complete accuracy.

Code that misuses indices this way was always considered broken; however, String wasn’t able to reliably detect these runtime errors before. Therefore, I expect there is a large amount of broken code out there that keeps using bridged Cocoa String indices (UTF-16) after a mutation turns them into native UTF-8 strings. Therefore, instead of trapping, this commit silently corrects the issue, transcoding the offsets into the correct encoding.

It would probably be a good idea to also emit a runtime warning in addition to recovering from the error. This would generate some noise that would gently nudge folks to fix their code.

rdar://89369680
2022-03-24 20:59:59 -07:00
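The mismatch described in 6e18955f90 arises because the same position has different offsets in UTF-16 and UTF-8 storage. A minimal sketch of the idea, using only public String view APIs rather than the stdlib's internal transcoding path (the helper name `utf8Offset(forUTF16Offset:in:)` is hypothetical):

```swift
// Hypothetical helper: converts a UTF-16 code unit offset into the
// equivalent UTF-8 code unit offset using public String view APIs.
// This is an illustration, not the stdlib's internal recovery logic.
func utf8Offset(forUTF16Offset utf16Offset: Int, in string: String) -> Int {
  let index = string.utf16.index(string.utf16.startIndex, offsetBy: utf16Offset)
  return string.utf8.distance(from: string.utf8.startIndex, to: index)
}

let text = "caf\u{E9}!" // "café!": é is 1 UTF-16 unit but 2 UTF-8 units
let bang16 = 4          // UTF-16 offset of "!"
let bang8 = utf8Offset(forUTF16Offset: bang16, in: text) // 5
```

An offset that was valid for a UTF-16 (bridged) string therefore points one code unit too early once the same contents are stored as UTF-8.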
David Smith
9ad3c9a5db Use withUnsafeTemporaryAllocation instead of a temporary Array 2022-03-17 17:08:05 -07:00
Alejandro Alonso
21ee3a5e0f Drop ICU
update freestanding deps
2021-11-30 13:53:08 -08:00
Valeriy Van
2dcbc53949 Removes redundant buffer zeroing in foreignErrorCorrectedGrapheme func 2020-04-17 22:51:28 +02:00
David Smith
a1c1779c9f Guard against passing endIndex to foreignScalarAlign when back-deploying to 5.0 stdlibs 2019-10-18 14:24:23 -07:00
Michael Ilseman
e01a294da6 [stdlib] Introduce _invariantCheck_5_1 for 5.1 and later assertions.
Inlinable and non-inlinable code can cause 5.1 code to intermix with
5.0 code on older OSes. Some (weak) invariants for 5.1 should only be
checked when the OS's code is 5.1 or later, which is the purpose of
_invariantCheck_5_1.

Applied to String.Index._isScalarAligned, which is a new bit
introduced in 5.1 from one of the reserved bits from 5.0. The bit is
set when the index is proven to be scalar aligned, and we want to
assert on this liberally in contexts where we expect it to be
so. However, older OSes might not set this bit when doing scalar
aligning, depending on exactly what got inlined where/when.
2019-07-12 15:58:27 -07:00
Michael Ilseman
63a6794cf9 [String] Switch scalar-aligned bit to a reserved bit.
Since scalar-alignment is set in inlinable code, switch the alignment
bit to one of the previously-reserved bits rather than a grapheme
cache bit. Setting a grapheme cache bit in inlinable would break
backward deployment, as older versions would interpret it as a cached
value.

Also adjust the name to "scalar-aligned", which is clearer, and
removed assertion (which should be a real precondition).
2019-07-02 16:25:04 -07:00
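The idea in 63a6794cf9 of carrying a "scalar-aligned" flag in a spare bit next to the offset can be sketched as below. The type `PackedIndex` and its bit layout are hypothetical and chosen for readability; the real String.Index layout is defined in the stdlib and differs from this.

```swift
// Hypothetical layout, for illustration only.
struct PackedIndex {
  private(set) var rawBits: UInt64

  // Bit 0: set when the position is known to be scalar-aligned.
  private static let scalarAlignedBit: UInt64 = 0b1
  // The upper 48 bits hold the code unit offset; bits 1-15 are unused here.
  private static let offsetShift: UInt64 = 16

  init(offset: Int, isScalarAligned: Bool = false) {
    rawBits = UInt64(offset) << Self.offsetShift
    if isScalarAligned { rawBits |= Self.scalarAlignedBit }
  }

  var offset: Int { Int(rawBits >> Self.offsetShift) }
  var isScalarAligned: Bool { rawBits & Self.scalarAlignedBit != 0 }
}

let aligned = PackedIndex(offset: 3, isScalarAligned: true)
let unknown = PackedIndex(offset: 3) // alignment not yet proven
```

A reader that ignores the flag bit still decodes the same offset, which is the property that makes a previously reserved bit safe to use across versions.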
Michael Ilseman
bd5a40ff1b [gardening] Add underscore to internal member 2019-06-27 11:11:44 -07:00
Michael Ilseman
4cd1e812b7 [String] Scalar-alignment bug fixes.
Fixes a general category (pun intended) of scalar-alignment bugs
surrounding exchanging non-scalar-aligned indices between views and
for slicing.

SE-0180 unifies the Index type of String and all its views and allows
non-scalar-aligned indices to be used across views. In order to
guarantee behavior, we often have to check and perform scalar
alignment. To speed up these checks, we allocate a bit denoting
known-to-be-aligned, so that the alignment check can skip the
load. The below shows what views need to check for alignment before
they can operate, and whether the indices they produce are aligned.

┌───────────────╥────────────────────┬──────────────────────────┐
│ View          ║ Requires Alignment │ Produces Aligned Indices │
╞═══════════════╬════════════════════╪══════════════════════════╡
│ Native UTF8   ║ no                 │ no                       │
├───────────────╫────────────────────┼──────────────────────────┤
│ Native UTF16  ║ yes                │ no                       │
╞═══════════════╬════════════════════╪══════════════════════════╡
│ Foreign UTF8  ║ yes                │ no                       │
├───────────────╫────────────────────┼──────────────────────────┤
│ Foreign UTF16 ║ no                 │ no                       │
╞═══════════════╬════════════════════╪══════════════════════════╡
│ UnicodeScalar ║ yes                │ yes                      │
├───────────────╫────────────────────┼──────────────────────────┤
│ Character     ║ yes                │ yes                      │
└───────────────╨────────────────────┴──────────────────────────┘

The "requires alignment" applies to any operation taking a
String.Index that's not defined entirely in terms of other operations
taking a String.Index. These include:

* index(after:)
* index(before:)
* subscript
* distance(from:to:) (since `to` is compared against directly)
* UTF16View._nativeGetOffset(for:)
2019-06-26 16:42:58 -07:00
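To make the alignment question in 4cd1e812b7 concrete, the sketch below produces a UTF-8 index that falls in the middle of a scalar and checks it against the scalar view. It uses only the public `String.Index.samePosition(in:)` API; the variable names are illustrative.

```swift
let word = "a\u{E9}" // "aé": é (U+00E9) occupies two UTF-8 code units

// UTF-8 offset 1 is the first byte of é; offset 2 is its second byte.
let startOfE = word.utf8.index(word.utf8.startIndex, offsetBy: 1)
let middleOfE = word.utf8.index(word.utf8.startIndex, offsetBy: 2)

// samePosition(in:) returns nil when the index is not on a scalar boundary.
let alignedResult = startOfE.samePosition(in: word.unicodeScalars)
let unalignedResult = middleOfE.samePosition(in: word.unicodeScalars)
```

Here `alignedResult` is non-nil and `unalignedResult` is nil: the UTF-8 view hands out indices that the scalar and character views cannot use directly without an alignment step.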
Michael Ilseman
4967fc08eb [Unicode] Add convenience APIs to Unicode encodings
Add convenience APIs to the stdlib's Unicode encodings:

* Unicode.UTF16
  * isASCII
  * isSurrogate
* Unicode.UTF8
  * isASCII
  * width
* Unicode.UTF32
  * isASCII
* Unicode.ASCII
  * isASCII

Tests added
2019-03-29 15:43:00 -07:00
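The checks named in 4967fc08eb follow directly from the encodings' bit patterns. The free functions below are hypothetical stand-ins written from the UTF-8 and UTF-16 definitions; they are not the stdlib implementations, and their names and signatures are assumptions.

```swift
// Hypothetical stand-ins, derived from the encoding definitions.

// A UTF-8 code unit is ASCII when its high bit is clear.
func isASCII(utf8 unit: UInt8) -> Bool { unit & 0x80 == 0 }

// A UTF-16 code unit is a surrogate when it lies in 0xD800...0xDFFF.
func isSurrogate(utf16 unit: UInt16) -> Bool { unit & 0xF800 == 0xD800 }

// Number of UTF-8 code units needed to encode a scalar.
func utf8Width(of scalar: Unicode.Scalar) -> Int {
  switch scalar.value {
  case 0..<0x80: return 1
  case 0x80..<0x800: return 2
  case 0x800..<0x1_0000: return 3
  default: return 4
  }
}
```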
Michael Ilseman
415cc8fb0c [String.Index] Deprecate encodedOffset var/init
String.Index has an encodedOffset-based initializer and computed
property that exists for serialization purposes. It was documented as
UTF-16 in the SE proposal introducing it, which was String's
underlying encoding at the time, but the dream of String even then was
to abstract away whatever encoding happened to be used.

Serialization needs an explicit encoding for serialized indices to
make sense: the offsets need to align with the view. With String
utilizing UTF-8 encoding for native contents in Swift 5, serialization
isn't necessarily the most efficient in UTF-16.

Furthermore, the majority of usage of encodedOffset in the wild is
buggy and operates under the assumption that a UTF-16 code unit was a
Swift Character, which isn't even valid if the String is known to be
all-ASCII (because CR-LF).

This change introduces a pair of semantics-preserving alternatives to
encodedOffset that explicitly call out the UTF-16 assumption. These
serve as a gentle off-ramp for current mis-uses of encodedOffset.
2019-02-13 18:42:40 -08:00
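The public String.Index API has a UTF-16-explicit pair, `utf16Offset(in:)` and `init(utf16Offset:in:)`. Assuming these are the alternatives that 415cc8fb0c refers to, a short sketch of a round trip, plus the CR-LF case the message mentions:

```swift
let message = "a\u{1F600}b" // "a😀b": the emoji takes two UTF-16 code units
let indexOfB = message.firstIndex(of: "b")!

// The encoding is spelled out at the call site.
let offset = indexOfB.utf16Offset(in: message) // 3
let restored = String.Index(utf16Offset: offset, in: message)

// CR-LF is a single Character but two UTF-16 code units, even though
// the string is all-ASCII.
let crlf = "\r\n"
let characterCount = crlf.count
let utf16Count = crlf.utf16.count
```

The offset of "b" is 3 rather than 2, and `characterCount` differs from `utf16Count`, so code that treats a UTF-16 offset as a character count is wrong in both cases.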
Lance Parker
15aaa1e777 [stdlib]String normalization functions (#21026)
* fast/foreignNormalize functions
2019-01-08 13:55:29 -08:00
Michael Ilseman
c0c530aef8 [String] Speed up constant factors on comparison.
Include some tuning and tweaking to reduce the constant factors
involved in string comparison. This yields considerable improvement on
our micro-benchmarks, and allows us to make less inlinable code and
have a smaller ABI surface area.

Adds more extensive testing of corner cases in our existing
fast-paths.
2018-12-03 15:49:38 -08:00
Michael Ilseman
3a0ac0270d [stdlib] Unchecked subscript on UnsafeBufferPointer
Add and use an unchecked subscript on UnsafeBufferPointer, which skips
debugPrecondition checks (in case we're not inlined) as well as a
force-unwrap check.
2018-11-16 11:12:29 -08:00
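The stdlib's unchecked subscript from 3a0ac0270d is internal, so the sketch below only approximates the idea with public API. The helper `uncheckedElement(at:)` is hypothetical; it reads through the base pointer, which performs no bounds check, and uses `unsafelyUnwrapped`, which skips the nil check in optimized builds.

```swift
// Hypothetical helper, not the stdlib's internal API.
extension UnsafeBufferPointer {
  func uncheckedElement(at i: Int) -> Element {
    // The caller must guarantee that the buffer is non-empty and
    // that 0 <= i < count; nothing here verifies it.
    return baseAddress.unsafelyUnwrapped[i]
  }
}

let bytes: [UInt8] = [0x61, 0xC3, 0xA9]
let second = bytes.withUnsafeBufferPointer { buffer in
  buffer.uncheckedElement(at: 1)
}
```

This is only appropriate where indices have already been validated by the caller, which is the situation the commit message describes.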
Ben Cohen
1673c12d78 [stdlib] Replace "sanityCheck" with "internalInvariant" (#20616)
* Replace "sanityCheck" with "internalInvariant"
2018-11-15 20:50:22 -08:00
Maxim Moiseev
cbf83ac04f [NFC][stdlib] Add FIXME markers to simplify audit 2018-11-14 11:58:42 -08:00
Slava Pestov
f6c2caf64b stdlib: Add @inlinable to @inline(__always) declarations
These should be audited since some might not actually need to be
@inlinable, but for now:

- Anything public and @inline(__always) is now also @inlinable
- Anything @usableFromInline and @inline(__always) is now @inlinable
2018-11-13 15:15:07 -05:00
Michael Ilseman
75943350d2 [String] Give String a custom iterator
Gives us modest wins on complex grapheme strings, but up to 40% on
heavy-ASCII strings.
2018-11-08 18:25:01 -08:00
Michael Ilseman
abe101c5b9 [String] Custom iterator for UnicodeScalarView
Provide a custom iterator rather than relying on the IndexingIterator,
as an indexing model is less efficient for stateful processing of
strings. Provides around a 30% speedup.
2018-11-08 18:00:39 -08:00
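The difference between an indexing model and a stateful iterator, as discussed in abe101c5b9 and 75943350d2, can be sketched with a small scalar iterator over UTF-8 bytes. `ScalarIterator` is a hypothetical type, not the stdlib's iterator; it assumes valid UTF-8 input (which `String.utf8` provides) and keeps only a running position, with a fast path for ASCII.

```swift
// Hypothetical sketch of a stateful scalar iterator over valid UTF-8.
struct ScalarIterator: IteratorProtocol {
  private let bytes: [UInt8]
  private var position = 0

  init(_ string: String) { bytes = Array(string.utf8) }

  mutating func next() -> Unicode.Scalar? {
    guard position < bytes.count else { return nil }
    let first = bytes[position]

    // Fast path: a single ASCII byte is a complete scalar.
    if first < 0x80 {
      position += 1
      return Unicode.Scalar(first)
    }

    // The lead byte determines the sequence length (2, 3, or 4 bytes).
    let length = first < 0xE0 ? 2 : (first < 0xF0 ? 3 : 4)
    // Keep the payload bits of the lead byte, then append 6 bits
    // from each continuation byte.
    var value = UInt32(first) & (0xFF >> (length + 1))
    for offset in 1..<length {
      value = (value << 6) | UInt32(bytes[position + offset] & 0x3F)
    }
    position += length
    return Unicode.Scalar(value)
  }
}

var iterator = ScalarIterator("a\u{E9}\u{20AC}\u{1F600}")
var decoded: [Unicode.Scalar] = []
while let scalar = iterator.next() { decoded.append(scalar) }
```

Each call advances the stored position directly instead of re-deriving it from an index, which is the kind of saving a custom iterator can offer over a generic index-based one.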
Michael Ilseman
948655e850 [String] Cleanups, comments, documentation
After rebasing on master and incorporating more 32-bit support,
perform a bunch of cleanup, documentation updates, comments, move code
back to String declaration, etc.
2018-11-04 10:42:42 -08:00
Michael Ilseman
7aea40680d [String] NFC iterator fast-paths
Refactor and rename _StringGutsSlice, apply NFC-aware fast paths to a
new buffered iterator.

Also, fix bug in _typeName which used to assume ASCIIness and better
SIL optimizations on StringObject.
2018-11-04 10:42:41 -08:00
Michael Ilseman
c51aa5988f [String] Cleanup normalization code.
Clean up some of the code surrounding the normalized code unit
iterator.
2018-11-04 10:42:41 -08:00
Lance Parker
f1a35bd1c9 String comparison iterator for UTF8 strings 2018-11-04 10:42:41 -08:00
Michael Ilseman
a0e639eaf5 [String] Grapheme breaking fast-paths
Add in our scalar-based fast-paths for UTF-8 and foreign strings, and
update the grapheme cache.
2018-11-04 10:42:40 -08:00
Michael Ilseman
fe7c3ce2e4 [String] Refactorings and cleanup
* Refactor out RRC implementation into dedicated file.

* Change our `_invariantCheck` pattern to generate efficient code in
  asserts builds and make the optimizer's job easier.

* Drop a few Bidi shims we no longer need.

* Restore View decls to String, workaround no longer needed

* Cleaner unicode helper facilities
2018-11-04 10:42:40 -08:00
Michael Ilseman
89d18e1a3a [String] Refactor helper code into UnicodeHelpers.swift.
Clean up some of the index assumptions, stick index-aware methods on
_StringGuts, and otherwise migrate code over to UnicodeHelpers.swift.
2018-11-04 10:42:40 -08:00