When called on a string that is not uniquely referenced,
`String.reserveCapacity(_:)` ignores the current capacity, using
the passed-in capacity for the size of its new storage. This can
result in an underallocation and write past the end of the new
buffer.
This fix changes the new size calculation to use the current UTF-8
count as the minimum. Non-native or non-unique strings
now allocate the requested capacity (or space enough for the
current contents, if that's larger than what's requested).
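
The sizing rule described above can be sketched as follows. This is only an illustration: `newStorageCapacity(requested:currentUTF8Count:)` is a hypothetical helper invented for this example, not the stdlib's internal function.

```swift
// Hypothetical helper (an assumption for illustration): choose the size of
// the new storage so that it can always hold the current contents.
func newStorageCapacity(requested: Int, currentUTF8Count: Int) -> Int {
    // Never allocate less than what is needed for the existing code units.
    return max(requested, currentUTF8Count)
}

// A string too long for the small-string form, referenced twice.
let original = "The quick brown fox jumps over the lazy dog"
var copy = original

// Requesting less capacity than the current contents must leave them intact.
copy.reserveCapacity(4)
```

With this rule, a request smaller than the current UTF-8 count falls back to the current count, so the contents always fit in the new buffer.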
rdar://109275875
Fixes #53483
There is little point to having `isUTF16` properties when they simply
return `!isUTF8`; remove them.
Rename `String.Index._copyEncoding(from:)` to
`_copyingEncoding(from:)`.
This fixes a compatibility issue with potential future UTF-8 encoded
foreign String forms, as well as simplifying the code a bit — we no
longer need to do an availability check on inlinable fast paths.
The isForeignUTF8 bit is never set by any past or current stdlib
version, but it allows us to introduce UTF-8 encoded foreign forms
without breaking inlinable index encoding validation introduced in
Swift 5.7.
- Split forward and backward direction into separate code paths.
This makes the code more readable and paves the way for future
improvements. (E.g., switching to a linear-time algorithm for
breaking backwards.)
- `Substring.index(after:)` now uses the same grapheme breaking paths
as `String.index(after:)`.
- The cached stride value in string indices is now well-defined even
on indices that aren’t character-aligned.
If the replacement collection is a fast UTF-8 substring, we can simply
access its backing store directly; we don’t need to use a circuitous
lazy algorithm.
This used to forward to `Slice.replaceSubrange`, but that’s a generic algorithm that isn’t aware of the peculiarities of Unicode extended grapheme clusters, and it can be misled by unusual cases, like a substring or subrange whose bounds aren’t `Character`-aligned, or a replacement string that starts with a continuation scalar.
* Don't allocate breadcrumbs pointer if under threshold
* Increase breadcrumbs threshold
* Linear 16-byte bucketing until 128 bytes, malloc_size after
* Allow cap less than _SmallString.capacity (bridging non-ASCII)
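
A minimal sketch of the linear 16-byte bucketing named in the list above. The function name and the choice to treat the input as the total allocation request in bytes are assumptions made for illustration. Past 128 bytes the real code asks `malloc_size`, which is platform-specific, so this sketch simply returns the request unchanged there.

```swift
// Hypothetical helper (an assumption for illustration): round an allocation
// request up to the next 16-byte bucket while it is at most 128 bytes.
func bucketedAllocationSize(forRequested bytes: Int) -> Int {
    precondition(bytes >= 0, "allocation size must be non-negative")
    let bucket = 16
    let linearLimit = 128
    if bytes <= linearLimit {
        // Round up to the nearest multiple of 16.
        return (bytes + bucket - 1) / bucket * bucket
    }
    // Placeholder: the actual size would come from malloc_size.
    return bytes
}
```

For example, a 17-byte request lands in the 32-byte bucket, and the extra bytes can be claimed as spare capacity.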
This change decreases the amount of heap usage for moderate-length
strings (< 64 UTF-8 code units in length) and increases the amount of
spare code unit capacity available (less growth needed).
Average improvements for moderate-length strings:
* 64-bit: on average, 8 bytes saved and 4 bytes of extra capacity
* 32-bit: on average, 4 bytes saved and 6 bytes of extra capacity
Additionally, on 32-bit, large-length strings also gain an average of
6 bytes of extra spare capacity.
Details:
On 64-bit, half of moderate-length allocations will save 16 bytes
while the other half get an extra 8 bytes of spare capacity.
On 32-bit, a quarter of moderate-length allocations will save 16
bytes, and the rest get an extra 4 bytes of spare
capacity. Additionally, 32-bit string's storage class now claims its
full allocation, which is its birthright. Prior to this change, we'd
have on average 1.5 bytes of spare capacity, and now we have 7.5 bytes
of spare capacity.
Breadcrumbs threshold is increased from the super-conservative 32 to
the pretty-conservative 64. Some speed improvements are incorporated
in this change, but more are in flight. Even without those eventual
improvements, this is a worthwhile change (ASCII is still fast-pathed
and irrelevant to breadcrumbing).
For a complex real-world workload, this amounts to around a 5%
improvement to transient heap usage due to all strings and a 4%
improvement to peak heap usage due to all strings. For moderate-length
strings specifically, this gives around 11% improvement to both.
String.Index has an encodedOffset-based initializer and computed
property that exists for serialization purposes. It was documented as
UTF-16 in the SE proposal introducing it, which was String's
underlying encoding at the time, but the dream of String even then was
to abstract away whatever encoding happened to be used.
Serialization needs an explicit encoding for serialized indices to
make sense: the offsets need to align with the view. With String
utilizing UTF-8 encoding for native contents in Swift 5, serialization
isn't necessarily the most efficient in UTF-16.
Furthermore, the majority of usage of encodedOffset in the wild is
buggy and operates under the assumption that a UTF-16 code unit is a
Swift Character, which isn't even valid if the String is known to be
all-ASCII (because CR-LF).
This change introduces a pair of semantics-preserving alternatives to
encodedOffset that explicitly call out the UTF-16 assumption. These
serve as a gentle off-ramp for current mis-uses of encodedOffset.
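
The message above does not name the alternatives. The sketch below assumes they are the public `String.Index(utf16Offset:in:)` initializer and `utf16Offset(in:)` method, and uses the CR-LF case to show why a UTF-16 offset is not a Character offset.

```swift
// "\r\n" is two UTF-16 code units but a single Character.
let text = "a\r\nb"

// Build an index from an explicit UTF-16 offset: a = 0, \r = 1, \n = 2, b = 3.
let index = String.Index(utf16Offset: 3, in: text)

// Round-trip the offset back out, again explicitly in UTF-16 terms.
let utf16Offset = index.utf16Offset(in: text)

// The Character distance to the same position is smaller.
let characterOffset = text.distance(from: text.startIndex, to: index)
```

Here the UTF-16 offset is 3 while the Character offset is 2, even though the string is all-ASCII.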
Old Swift and new Swift runtimes and overlays need to coexist in the same process. This means there must not be any classes which have the same ObjC runtime name in old and new, because the ObjC runtime doesn't like name collisions.
When possible without breaking source compatibility, classes were renamed in Swift, which results in a different ObjC name.
Public classes were renamed only on the ObjC side using the @_objcRuntimeName attribute.
This is similar to the work done in pull request #19295. That only renamed @objc classes. This renames all of the others, since even pure Swift classes still get an ObjC name.
rdar://problem/46646438
These should be audited since some might not actually need to be
@inlinable, but for now:
- Anything public and @inline(__always) is now also @inlinable
- Anything @usableFromInline and @inline(__always) is now @inlinable
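
As a rough illustration of the two rules above, the declarations below show what the resulting annotations look like. The function names are hypothetical and do not come from the stdlib.

```swift
// Rule 2: an internal helper that was @usableFromInline and
// @inline(__always) is now @inlinable (which also makes it usable from
// other inlinable code). Hypothetical name, for illustration only.
@inlinable @inline(__always)
internal func _isBelow(_ value: UInt8, _ limit: UInt8) -> Bool {
    return value < limit
}

// Rule 1: a public function that was @inline(__always) is now also
// @inlinable, so its body is available to client modules.
@inlinable @inline(__always)
public func isASCIIByte(_ byte: UInt8) -> Bool {
    return _isBelow(byte, 0x80)
}
```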
After rebasing on master and incorporating more 32-bit support,
perform a bunch of cleanup, documentation updates, comments, move code
back to String declaration, etc.
Refactor and rename _StringGutsSlice, apply NFC-aware fast paths to a
new buffered iterator.
Also, fix a bug in _typeName, which used to assume ASCIIness, and
improve SIL optimizations on StringObject.
Add inlinability annotations to restore performance parity with 4.2 String.
Take advantage of known NFC as a fast-path for comparison, and
overhaul comparison dispatch.
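
The idea of an NFC fast path can be sketched as below. This is not the stdlib's implementation: `isKnownNFC` and `lessThan` are hypothetical names, and the sketch only treats all-ASCII strings as known NFC (ASCII text is always in NFC), falling back to the regular `<` operator otherwise.

```swift
// Hypothetical check (an assumption for illustration): all-ASCII contents
// are always in NFC, so they qualify for the fast path.
func isKnownNFC(_ s: String) -> Bool {
    return s.utf8.allSatisfy { $0 < 0x80 }
}

// Compare two strings, using raw code units when both are known NFC.
func lessThan(_ a: String, _ b: String) -> Bool {
    if isKnownNFC(a) && isKnownNFC(b) {
        // Fast path: no normalization needed, compare UTF-8 bytes directly.
        return a.utf8.lexicographicallyPrecedes(b.utf8)
    }
    // Slow path: defer to String's normalization-aware comparison.
    return a < b
}
```

The fast path is sound because two NFC strings are canonically equivalent exactly when their code units match, so byte order can stand in for the normalized ordering.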
RRC improvements and optimizations.