swift-mirror/stdlib/public/Darwin/NaturalLanguage/NLTokenizer.swift
Michael Ilseman 415cc8fb0c [String.Index] Deprecate encodedOffset var/init
String.Index has an encodedOffset-based initializer and computed
property that exists for serialization purposes. It was documented as
UTF-16 in the SE proposal introducing it, which was String's
underlying encoding at the time, but the dream of String even then was
to abstract away whatever encoding happened to be used.

Serialization needs an explicit encoding for serialized indices to
make sense: the offsets need to align with the view. With String
utilizing UTF-8 encoding for native contents in Swift 5, serialization
isn't necessarily the most efficient in UTF-16.

Furthermore, the majority of usage of encodedOffset in the wild is
buggy and operates under the assumption that a UTF-16 code unit is a
Swift Character, which isn't valid even if the String is known to be
all-ASCII (because CR-LF is a single Character).

This change introduces a pair of semantics-preserving alternatives to
encodedOffset that explicitly call out the UTF-16 assumption. These
serve as a gentle off-ramp for current misuses of encodedOffset.
2019-02-13 18:42:40 -08:00
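
The CR-LF point in the commit message can be illustrated with a minimal sketch. The sample string and all variable names here are illustrative assumptions, not part of the commit; the only APIs used are `utf16Offset(in:)` and `String.Index(utf16Offset:in:)`, the explicitly UTF-16 alternatives the message refers to.

```swift
// Illustrative sketch: the sample string and names are assumptions.
let text = "a\r\nb"

// CR-LF is a single Character but two UTF-16 code units,
// so the two counts differ even for an all-ASCII string.
let characterCount = text.count
let utf16Count = text.utf16.count

let indexOfB = text.firstIndex(of: "b")!

// Offset measured in UTF-16 code units from the start of the string.
let utf16OffsetOfB = indexOfB.utf16Offset(in: text)

// Offset measured in Characters; not interchangeable with the value above.
let characterOffsetOfB = text.distance(from: text.startIndex, to: indexOfB)

// Rebuild an index from the UTF-16 offset, naming the encoding explicitly.
let rebuiltIndex = String.Index(utf16Offset: utf16OffsetOfB, in: text)

print(characterCount, utf16Count, utf16OffsetOfB, characterOffsetOfB, text[rebuiltIndex])
```

Treating the UTF-16 offset as a Character offset here would land one position past "b", which is the class of bug the commit message describes.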

//===----------------------------------------------------------------------===//
//
// This source file is part of the Swift.org open source project
//
// Copyright (c) 2014 - 2018 Apple Inc. and the Swift project authors
// Licensed under Apache License v2.0 with Runtime Library Exception
//
// See https://swift.org/LICENSE.txt for license information
// See https://swift.org/CONTRIBUTORS.txt for the list of Swift project authors
//
//===----------------------------------------------------------------------===//
@_exported import NaturalLanguage
import Foundation

@available(macOS 10.14, iOS 12.0, watchOS 5.0, tvOS 12.0, *)
extension NLTokenizer {
  /// Finds the range of the token at `index`. The index is converted to a
  /// UTF-16 offset because the underlying API works in NSRange terms.
  @nonobjc
  public func tokenRange(at index: String.Index) -> Range<String.Index> {
    let str = self.string ?? ""
    let characterIndex = index.utf16Offset(in: str)
    let nsrange = self.__tokenRange(at: characterIndex)
    return Range(nsrange, in: str)!
  }

  /// Calls `block` for each token in `range`. Enumeration stops as soon as
  /// `block` returns false. Does nothing if no string has been set.
  @nonobjc
  public func enumerateTokens(in range: Range<String.Index>, using block: (Range<String.Index>, NLTokenizer.Attributes) -> Bool) {
    guard let str = self.string else { return }
    let nsrange = NSRange(range, in: str)
    self.__enumerateTokens(in: nsrange) { (tokenNSRange, attrs, stop) in
      // Token ranges that cannot be converted back to String indices are skipped.
      if let tokenRange = Range(tokenNSRange, in: str) {
        let keepGoing = block(tokenRange, attrs)
        if !keepGoing {
          stop.pointee = true
        }
      }
    }
  }

  /// Collects the ranges of all tokens in `range` into an array.
  @nonobjc
  public func tokens(for range: Range<String.Index>) -> [Range<String.Index>] {
    var array: [Range<String.Index>] = []
    self.enumerateTokens(in: range) { (tokenRange, _) -> Bool in
      array.append(tokenRange)
      return true
    }
    return array
  }
}
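
The wrappers above rely on Foundation's conversions between `NSRange` (UTF-16 offsets) and `Range<String.Index>`. The following is a minimal sketch of that round trip on its own, without `NLTokenizer`; the sample string and variable names are illustrative assumptions. It also shows that `Range(_:in:)` is failable, which is why `enumerateTokens` unwraps it with `if let`, and why the force unwrap in `tokenRange(at:)` would trap if the conversion failed.

```swift
import Foundation

// Illustrative sketch: the sample string and names are assumptions.
let message = "Hi 👍 ok"

// A Range<String.Index> covering the emoji, which is one Character
// but two UTF-16 code units.
let emojiRange = message.range(of: "👍")!

// Convert to NSRange: location and length are in UTF-16 code units.
let emojiNSRange = NSRange(emojiRange, in: message)

// Convert back; the result is optional because not every NSRange
// maps onto valid indices of the given string.
let roundTripped = Range(emojiNSRange, in: message)

// An NSRange that lies outside the string converts to nil.
let outOfBounds = Range(NSRange(location: 100, length: 1), in: message)

print(emojiNSRange.location, emojiNSRange.length, roundTripped == emojiRange, outOfBounds == nil)
```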