From b169cb876790c324728af660a539b2d6830dc2f6 Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Tue, 6 Feb 2024 17:19:39 -0700 Subject: [PATCH 01/16] Add UTF-8 over unsafe contiguous storage proposal --- proposals/nnnn-utf-8-unsafe-contiguous.md | 559 ++++++++++++++++++++++ 1 file changed, 559 insertions(+) create mode 100644 proposals/nnnn-utf-8-unsafe-contiguous.md diff --git a/proposals/nnnn-utf-8-unsafe-contiguous.md b/proposals/nnnn-utf-8-unsafe-contiguous.md new file mode 100644 index 0000000000..79f1d10935 --- /dev/null +++ b/proposals/nnnn-utf-8-unsafe-contiguous.md @@ -0,0 +1,559 @@ + + +# UTF-8 Processing Over Unsafe Contiguous Bytes + +## Introduction and Motivation + +Native `String`s are stored as validly-encoded UTF-8 bytes in a contiguous memory buffer. The standard library implements `String` functionality on top of this buffer, taking advantage of the validly-encoded invariant and specialized Unicode knowledge. We propose exposing this functionality as API for more advanced libraries and developers. + +This pitch focuses on a portion of the broader API and functionality discussed in [Pitch: Unicode Processing APIs](https://forums.swift.org/t/pitch-unicode-processing-apis/69294). That broader pitch can be divided into 3 kinds of API additions: + +1. Unicode processing API for working with contiguously-stored valid UTF-8 bytes +2. `Element`-based stream processing functionality. E.g., a stream of `UInt8` can be turned into a stream of `Unicode.Scalar` or `Character`s. +3. Stream-of-buffers processing functionality, which provides a lower-level / more efficient implementation for the second area. + +This pitch focuses on the first. + +## Proposed Solution + +We propose `UnsafeValidUTF8BufferPointer` which exposes a similar API surface as `String` for validly-encoded UTF-8 code units in contiguous memory. + + +## Detailed Design + +`UnsafeValidUTF8BufferPointer` consists of a (non-optional) raw pointer and a length, with some flags bit-packed in. 
+ +```swift +/// An unsafe buffer pointer to validly-encoded UTF-8 code units stored in +/// contiguous memory. +/// +/// UTF-8 validity is checked upon creation. +/// +/// `UnsafeValidUTF8BufferPointer` does not manage the memory or guarantee +/// memory safety. Any overlapping writes into the memory can lead to undefined +/// behavior. +/// +@frozen +public struct UnsafeValidUTF8BufferPointer { + @usableFromInline + internal var _baseAddress: UnsafeRawPointer + + // A bit-packed count and flags (such as isASCII) + @usableFromInline + internal var _countAndFlags: UInt64 +} +``` + +It differs from `UnsafeRawBufferPointer` in that its contents, upon construction, are guaranteed to be validly-encoded UTF-8. This guarantee speeds up processing significantly relative to performing validation on every read. It is unsafe because it is an API surface on top of `UnsafeRawPointer`, inheriting all the unsafety therein and developers must manually guarantee invariants such as lifetimes and exclusivity. It is further based on `UnsafeRawPointer` instead of `UnsafePointer` so as not to [bind memory to a type](https://developer.apple.com/documentation/swift/unsaferawpointer#Typed-Memory). + + +### Validation and creation + +`UnsafeValidUTF8BufferPointer` is validated at initialization time, and encoding errors are thrown. + +```swift +extension Unicode.UTF8 { + @frozen + public enum EncodingErrorKind: Error { + case unexpectedContinuationByte + case expectedContinuationByte + case overlongEncoding + case invalidCodePoint + + case invalidStarterByte + + case unexpectedEndOfInput + } +} +``` + +```swift +// All the initializers below are `throw`ing, as they validate the contents +// upon construction. 
+extension UnsafeValidUTF8BufferPointer { + @frozen + public struct DecodingError: Error, Sendable, Hashable, Codable { + public var kind: UTF8.EncodingErrorKind + public var offsets: Range + } + + // ABI traffics in `Result` + @usableFromInline + internal static func _validate( + baseAddress: UnsafeRawPointer, length: Int + ) -> Result + + @_alwaysEmitIntoClient + public init(baseAddress: UnsafeRawPointer, length: Int) throws(DecodingError) + + @_alwaysEmitIntoClient + public init(nulTerminatedCString: UnsafeRawPointer) throws(DecodingError) + + @_alwaysEmitIntoClient + public init(nulTerminatedCString: UnsafePointer) throws(DecodingError) + + @_alwaysEmitIntoClient + public init(_: UnsafeRawBufferPointer) throws(DecodingError) + + @_alwaysEmitIntoClient + public init(_: UnsafeBufferPointer) throws(DecodingError) +} +``` + +#### Unsafety and encoding validity + +Every way to construct a `UnsafeValidUTF8BufferPointer` ensures that its contents are validly-encoded UTF-8. Thus, it has no new source of unsafety beyond the unsafety inherent in unsafe pointer's requirement that lifetime and exclusive access be manually enforced by the programmer. A write into this memory which violates encoding validity would also violate exclusivity. + +If we did not guarantee UTF-8 encoding validity, we'd be open to new security and safety concerns beyond unsafe pointers. + +With invalidly-encoded contents, memory safety would become more nuanced. An ill-formed leading byte can dictate a scalar length that is longer than the memory buffer. The buffer may have bounds associated with it, which differs from the bounds dictated by its contents. 
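To make the bounds concern above concrete, here is a minimal sketch of how an unvalidated leading byte could claim more bytes than the buffer holds. The helper `claimedScalarLength(ofLeadingByte:)` and the sample array are hypothetical names used only for illustration; they are not part of the proposed API, and a plain `[UInt8]` stands in for the raw memory.

```swift
/// Hypothetical helper (not part of the proposal): the scalar length that a
/// UTF-8 leading byte claims, or nil if the byte cannot start a scalar.
func claimedScalarLength(ofLeadingByte byte: UInt8) -> Int? {
  switch byte {
  case 0x00...0x7F: return 1   // ASCII
  case 0xC2...0xDF: return 2   // 0xC0 and 0xC1 could only begin overlong encodings
  case 0xE0...0xEF: return 3
  case 0xF0...0xF4: return 4   // 0xF5...0xFF would exceed U+10FFFF
  default: return nil          // continuation byte or invalid starter
  }
}

// A 3-byte buffer whose final byte claims to start a 4-byte scalar.
let truncated: [UInt8] = [0x61, 0x62, 0xF0]
let lastOffset = truncated.count - 1
let claimed = claimedScalarLength(ofLeadingByte: truncated[lastOffset])!

// A decoder that trusted the leading byte without prior validation would
// read past the end of the buffer here.
let wouldOverrun = lastOffset + claimed > truncated.count
print(claimed, wouldOverrun)  // prints "4 true"
```

Validating once at construction means later reads can rely on the leading byte's claimed length without re-checking it against the buffer bounds.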
+ +Additionally, a particular scalar value in valid UTF-8 has only one encoding, but invalid UTF-8 could have the same value encoded as an [overlong encoding](https://en.wikipedia.org/wiki/UTF-8#Overlong_encodings), which would compromise code that checks for the presence of a scalar value by looking at the encoded bytes (or that does a byte-wise comparison). + +`UnsafeValidUTF8BufferPointer` is unsafe in the all ways that unsafe pointers are unsafe, but not in more ways. + + +### Accessing contents + +Flags and raw contents can be accessed: + +```swift +extension UnsafeValidUTF8BufferPointer { + /// Returns whether the validated contents were all-ASCII. This is checked at + /// initialization time and remembered. + @inlinable + public var isASCII: Bool + + /// Access the underlying raw bytes + @inlinable + public var rawBytes: UnsafeRawBufferPointer +} +``` + +Like `String`, `UnsafeValidUTF8BufferPointer` provides views for accessing `Unicode.Scalar`s, `UTF16.CodeUnit`s, and `Character`s. + +```swift +extension UnsafeValidUTF8BufferPointer { + /// A view of the buffer's contents as a bidirectional collection of `Unicode.Scalar`s. + @frozen + public struct UnicodeScalarView { + public var buffer: UnsafeValidUTF8BufferPointer + + @inlinable + public init(_ buffer: UnsafeValidUTF8BufferPointer) + } + + @inlinable + public var unicodeScalars: UnicodeScalarView + + /// A view of the buffer's contents as a bidirectional collection of `Character`s. + @frozen + public struct CharacterView { + public var buffer: UnsafeValidUTF8BufferPointer + + @inlinable + public init(_ buffer: UnsafeValidUTF8BufferPointer) + } + + @inlinable + public var characters: CharacterView + + /// A view off the buffer's contents as a bidirectional collection of transcoded + /// `UTF16.CodeUnit`s. 
+ @frozen + public struct UTF16View { + public var buffer: UnsafeValidUTF8BufferPointer + + @inlinable + public init(_ buffer: UnsafeValidUTF8BufferPointer) + } + + @inlinable + public var utf16: UTF16View +} +``` + +These are bidirectional collections, as in `String`. Their indices, however, are distinct from each other because they mean different things. For example, a scalar-view index is scalar aligned but not necessarily `Character` aligned, and a transcoded index which points mid-scalar doesn't have a corresponding position in the raw bytes. + +```swift +extension UnsafeValidUTF8BufferPointer.UnicodeScalarView: BidirectionalCollection { + public typealias Element = Unicode.Scalar + + @frozen + public struct Index: Comparable, Hashable { + @usableFromInline + internal var _byteOffset: Int + + @inlinable + public var byteOffset: Int { get } + + @inlinable + public static func < (lhs: Self, rhs: Self) -> Bool + + @inlinable + internal init(_uncheckedByteOffset offset: Int) + } + + @inlinable + public subscript(position: Index) -> Element { _read } + + @inlinable + public func index(after i: Index) -> Index + + @inlinable + public func index(before i: Index) -> Index + + @inlinable + public var startIndex: Index + + @inlinable + public var endIndex: Index +} + + +extension UnsafeValidUTF8BufferPointer.CharacterView: BidirectionalCollection { + public typealias Element = Character + + @frozen + public struct Index: Comparable, Hashable { + @usableFromInline + internal var _byteOffset: Int + + @inlinable + public var byteOffset: Int { get } + + @inlinable + public static func < (lhs: Self, rhs: Self) -> Bool + + @inlinable + internal init(_uncheckedByteOffset offset: Int) + } + + // Custom-defined for performance to avoid double-measuring + // grapheme cluster length + @frozen + public struct Iterator: IteratorProtocol { + @usableFromInline + internal var _buffer: UnsafeValidUTF8BufferPointer + + @usableFromInline + internal var _position: Index + + @inlinable + 
public var buffer: UnsafeValidUTF8BufferPointer { get } + + @inlinable + public var position: Index { get } + + public typealias Element = Character + + public mutating func next() -> Character? + + @inlinable + internal init( + _buffer: UnsafeValidUTF8BufferPointer, _position: Index + ) + } + + @inlinable + public func makeIterator() -> Iterator + + @inlinable + public subscript(position: Index) -> Element { _read } + + @inlinable + public func index(after i: Index) -> Index + + @inlinable + public func index(before i: Index) -> Index + + @inlinable + public var startIndex: Index + + @inlinable + public var endIndex: Index +} + +extension UnsafeValidUTF8BufferPointer.UTF16View: BidirectionalCollection { + public typealias Element = Unicode.Scalar + + @frozen + public struct Index: Comparable, Hashable { + // Bitpacked byte offset and transcoded offset + @usableFromInline + internal var _byteOffsetAndTranscodedOffset: UInt64 + + /// Offset of the first byte of the currently-indexed scalar + @inlinable + public var byteOffset: Int { get } + + /// Offset of the transcoded code unit within the currently-indexed scalar + @inlinable + public var transcodedOffset: Int { get } + + @inlinable + public static func < (lhs: Self, rhs: Self) -> Bool + + @inlinable + internal init( + _uncheckedByteOffset offset: Int, _transcodedOffset: Int + ) + } + + @inlinable + public subscript(position: Index) -> Element { _read } + + @inlinable + public func index(after i: Index) -> Index + + @inlinable + public func index(before i: Index) -> Index + + @inlinable + public var startIndex: Index + + @inlinable + public var endIndex: Index +} +``` + +### Canonical equivalence + +```swift +// Canonical equivalence +extension UnsafeValidUTF8BufferPointer { + /// Whether `self` is equivalent to `other` under Unicode Canonical Equivalance + public func isCanonicallyEquivalent( + to other: UnsafeValidUTF8BufferPointer + ) -> Bool + + /// Whether `self` orders less than `other` (under Unicode 
Canonical Equivalance + /// using normalized code-unit order) + public func isCanonicallyLessThan( + _ other: UnsafeValidUTF8BufferPointer + ) -> Bool +} +``` + + + +## Alternatives Considered + +### Other names + +We're not particularly attached to the name `UnsafeValidUTF8BufferPointer`. Other names could include: + +- `UnsafeValidUTF8CodeUnitBufferPointer` +- `UTF8.UnsafeValidBufferPointer` +- `UTF8.UnsafeValidCodeUnitBufferPointer` +- `UTF8.ValidlyEncodedCodeUnitUnsafeBufferPointer` +- `UnsafeContiguouslyStoredValidUTF8CodeUnitsBuffer` + +etc. + +For `isCanonicallyLessThan`, another name could be `canonicallyPrecedes`, `lexicographicallyPrecedesUnderNFC`, etc. + +### Static methods instead of initializers + +`UnsafeValidUTF8BufferPointer`s could instead be created by static methods on `UTF8`: + +```swift +extension Unicode.UTF8 { + static func validate( + ... + ) throws -> UnsafeValidUTF8BufferPointer +} +``` + +### Hashable and other conformances + +`UnsafeValidUTF8BufferPointer` follows `UnsafeRawBufferPointer` and `UnsafeBufferPointer` in not conforming to `Sendable`, `Hashable`, `Equatable`, `Comparable`, `Codable`, etc. + +### `UTF8.EncodingErrorKind` as a `struct` + +We may want to use the [raw-representable struct pattern](https://github.com/apple/swift-system/blob/9a812b5fef1e7f27f8594fee5463bd88c5b691ec/Sources/System/Errno.swift#L14) for `UTF8.EncodingErrorKind` instead of an exhaustive enum. That is, we may want to define it as: + +```swift +extension Unicode.UTF8 { + @frozen + public struct EncodingErrorKind: Error, Sendable, Hashable, Codable { + public var rawValue: UInt8 + + @inlinable + public init(rawValue: UInt8) { + self.rawValue = rawValue + } + + @inlinable + public static var unexpectedContinuationByte: Self { + .init(rawValue: 0x01) + } + + @inlinable + public static var overlongEncoding: Self { + .init(rawValue: 0x02) + } + + // ... 
+ } +} +``` + +This would allow us to grow the kinds or errors or else add some error-nuance to the future, at the loss of exhaustive switches inside `catch`es. + +For example, an unexpected-end-of-input error, which happens when a scalar is in the process of being decoded but not enough bytes have been read, could be reported in different ways. It could be reported as a distinct kind of error (particularly useful for stream processing which may want to resume with more content) or it could be a `expectedContinuationByte` covering the end-of-input position. As a value, it could have a distinct value or be an alias to the same value. + + + + +## Future Directions + +### A non-escapable `ValidUTF8BufferView` + +Future improvements to Swift enable a non-escapable type (["BufferView"](https://github.com/atrick/swift-evolution/blob/fd63292839808423a5062499f588f557000c5d15/visions/language-support-for-BufferView.md)) to provide safely-unmanaged buffers via dependent lifetimes for use within a limited scope. We should add a corresponding type for validly-encoded UTF-8 contents, following the same API shape. + + +### Shared-ownership buffer + +We could propose a managed or shared-ownership validly-encoded UTF-8 buffer. E.g.: + +```swift +struct ValidlyEncodedUTF8SharedBuffer { + var contents: UnsafeValidlyEncodedUTF8BufferPointer + var owner: AnyObject? +} +``` + +where "shared" denotes that ownership is shared with the `owner` field, as opposed to an allocation exclusively managed by this type (the way `Array` or `String` would). Thus, it could be backed by a native `String`, an instance of `Data` or `Array` (if ensured to be validly encoded), etc., which participate fully in their COW semantics by retaining their storage. + +This would enable us to create shared strings, e.g. 
+ +```swift +extension String { + /// Does not copy the given storage, rather shares it + init(sharing: ValidlyEncodedUTF8SharedBuffer) +} +``` + +Also, this could allow us to present API which repairs invalid contents, since a repair operation would need to create and manage its own allocation. + + +#### Alternative: More general formulation (💥🐮) + +We could add the more general ["deconstructed COW"](https://forums.swift.org/t/idea-bytes-literal/44124/50) + +```swift +/// A buffer of `T`s in contiguous memory +struct SharedContiguousStorage { + var rawContents: UnsafeRawBufferPointer + var owner: AnyObject? +} +``` + +where the choice of `Raw` pointers is necessary to avoid type-binding the memory, but other designs are possible too. + +However, this type alone loses static knowledge of the UTF-8 validity, so we'd still need a separate type for validly encoded UTF-8. + +Instead, we could parameterize over a unsafe-buffer-pointer-like protocol: + +```swift +struct SharedContiguousStorage { + var contents: UnsafeBuffer + var owner: AnyObject? +} + +extension String { + /// Does not copy the given storage, rather shares it + init(sharing: SharedContiguousStorage) +} +``` + +Accessing the stored pointer would still need to be done carefully, as it would have lifetime dependent on `owner`. In current Swift, that would likely need to be done via a closure-taking API. + + +### `protocol ContiguouslyStoredValidUTF8` + +We could define a protocol for validly-encoded UTF-8 bytes in contiguous memory, somewhat analogous to a low-level `StringProtocol`. Both an unsafe and a shared-ownership type could conform to provide the same API. + +However, we'd want to be careful to future-proof such a protocol so that a `ValidUTF8BufferView` could conform as well. In the mean-time, even if we go with adding a shared-ownership type, Unicode processing operations can be performed by accessing the unsafe buffer pointer. 
+ +### Extend to `Element`-based or buffer-based streams + +We could define a segment of validly encoded UTF-8, which is not necessarily aligned across any particular boundary. This would be a significantly different API shape than `String`'s views. Accessing the start of content would require passing in initial state and reaching the end would produce a state to be fed into the next segment. + +It would make an awkward fit directly on top of `Collection`, so this would be a new API shape. For example, it could be akin to a `StatefulCollection` that in addition to having `startIndex/endIndex` would have `startState/endState`. Concerns such as bidirectionality, where exactly `endIndex` points to (the start or end of the partial value at the tail), etc, requires further thought. + +### Regex or regex-like support + +Future API additions would be to support `Regex`es on such buffers. + +Another future direction could be to add many routines corresponding to the underlying operations performed by the regex engine, such as: + +```swift +extension UnsafeValidUTF8BufferPointer.CharacterView { + func matchCharacterClass( + _: CharacterClass, + startingAt: Index, + limitedBy: Index + ) throws -> Index? + + func matchQuantifiedCharacterClass( + _: CharacterClass, + _: QuantificationDescription, + startingAt: Index, + limitedBy: Index + ) throws -> Index? +} +``` + +which would be useful for parser-combinator libraries who wish to expose `String`'s model of Unicode by using the stdlib's accelerated implementation. + +### Transcoded views, normalized views, case-folded views, etc + +We could provide lazily transcoded, normalized, case-folded, etc., views. If we do any of these for `UnsafeValidUTF8BufferPointer`, we should consider adding equivalents on `String`, `Substring`, etc. If we were to make any new protocols or changes to protocols, we'd want to also future-proof for a `ValidUTF8BufferView`. 
+ +For example, transcoded views can be generalized: + +```swift +extension UnsafeValidUTF8BufferPointer { + /// A view off the buffer's contents as a bidirectional collection of transcoded + /// `Encoding.CodeUnit`s. + @frozen + public struct TranscodedView { + public var buffer: UnsafeValidUTF8BufferPointer + + @inlinable + public init(_ buffer: UnsafeValidUTF8BufferPointer) + } +} +``` + +Note that since UTF-16 has such historical significance that even with a fully-generic transcoded view, we'd likely want a dedicated, specialized type for UTF-16. + +We could similarly provide lazily-normalized views of code units or scalars under NFC or NFD (which the stdlib already distributes data tables for), possibly generic via a protocol for 3rd party normal forms. + +Finally, case-folded functionality can be accessed in today's Swift via [scalar properties](https://developer.apple.com/documentation/swift/unicode/scalar/properties-swift.struct), but we could provide convenience collections ourselves as well. + + +### UTF-8 to/from UTF-16 breadcrumbs API + +String's implementation caches distances between UTF-8 and UTF-16 views, as some imported Cocoa APIs use random access to the UTF-16 view. We could formalize and expose API for this. + + +### `NUL`-termination concerns and C bridging + +`UnsafeValidUTF8BufferPointer` is capable of housing interior `NUL` characters, just like `String`. We could add additional flags and initialization options to detect a trailing `NUL` byte beyond the count and treat it as a terminator. In those cases, we could provide a `withCStringIfAvailable` style API. + +### Index rounding operations + +Unlike String, `UnsafeValidUTF8BufferPointer`'s view's `Index` types are distinct, which avoids a [mess of problems](https://forums.swift.org/t/string-index-unification-vs-bidirectionalcollection-requirements/55946). Interesting additions to both `UnsafeValidUTF8BufferPointer` and `String` would be explicit index-rounding for a desired behavior. 
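As a rough illustration of what explicit index rounding could look like at the byte level, the sketch below rounds a byte offset down to the nearest scalar boundary. It assumes the bytes are already validly-encoded UTF-8, which is the invariant the proposed type guarantees. The function name `roundDownToScalarBoundary(_:in:)` is hypothetical, and a plain `[UInt8]` stands in for the buffer.

```swift
/// Hypothetical helper (not part of the proposal): round `offset` down to the
/// start of the scalar that contains it. Assumes `bytes` is valid UTF-8.
func roundDownToScalarBoundary(_ offset: Int, in bytes: [UInt8]) -> Int {
  precondition(offset >= 0 && offset <= bytes.count, "offset out of bounds")
  var i = offset
  // Continuation bytes have the bit pattern 10xxxxxx; step back over them.
  while i > 0 && i < bytes.count && bytes[i] & 0xC0 == 0x80 {
    i -= 1
  }
  return i
}

// "a", U+00E9, U+20AC encode as: 61 | C3 A9 | E2 82 AC
let sample = Array("a\u{E9}\u{20AC}".utf8)

print(roundDownToScalarBoundary(2, in: sample))  // prints "1"
print(roundDownToScalarBoundary(5, in: sample))  // prints "3"
print(roundDownToScalarBoundary(6, in: sample))  // prints "6"
```

Because validity is guaranteed, the loop steps back at most three bytes. Rounding to `Character` boundaries would additionally require grapheme-breaking logic and is not shown here.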
+ + +### Canonical Spaceships + +Should a `ComparisonResult` (or [spaceship](https://forums.swift.org/t/pitch-comparison-reform/5662)) be added to Swift, we could support that operation under canonical equivalence in a single pass rather than subsequent calls to `isCanonicallyEquivalent(to:)` and `isCanonicallyLessThan(_:)`. + + +### Other Unicode functionality + +For the purposes of this pitch, we're not looking to expand the scope of functionality beyond what the stdlib already does in support of `String`'s API. Other functionality can be considered future work. From b9f727aa9fe57959f1b588d447c55370010ed68a Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Tue, 6 Feb 2024 17:21:43 -0700 Subject: [PATCH 02/16] Header --- proposals/nnnn-utf-8-unsafe-contiguous.md | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/proposals/nnnn-utf-8-unsafe-contiguous.md b/proposals/nnnn-utf-8-unsafe-contiguous.md index 79f1d10935..92bf3eafd4 100644 --- a/proposals/nnnn-utf-8-unsafe-contiguous.md +++ b/proposals/nnnn-utf-8-unsafe-contiguous.md @@ -1,7 +1,13 @@ - - # UTF-8 Processing Over Unsafe Contiguous Bytes +* Proposal: [SE-NNNN](nnnn-utf-8-unsafe-contiguous.md) +* Authors: [Michael Ilseman](https://github.com/milseman) +* Review Manager: TBD +* Status: **Awaiting implementation** +* Implementation: (pending) +* Upcoming Feature Flag: (pending) +* Review: ([pitch](https://forums.swift.org/t/pitch-utf-8-processing-over-unsafe-contiguous-bytes/69715)) + ## Introduction and Motivation Native `String`s are stored as validly-encoded UTF-8 bytes in a contiguous memory buffer. The standard library implements `String` functionality on top of this buffer, taking advantage of the validly-encoded invariant and specialized Unicode knowledge. We propose exposing this functionality as API for more advanced libraries and developers. 
From 7542cc5a289d9d134c819316e1702cdbe34b2f31 Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Mon, 6 May 2024 11:44:36 -0600 Subject: [PATCH 03/16] Update to be a span --- proposals/nnnn-utf-8-unsafe-contiguous.md | 565 ---------------- proposals/nnnn-utf8-span.md | 756 ++++++++++++++++++++++ 2 files changed, 756 insertions(+), 565 deletions(-) delete mode 100644 proposals/nnnn-utf-8-unsafe-contiguous.md create mode 100644 proposals/nnnn-utf8-span.md diff --git a/proposals/nnnn-utf-8-unsafe-contiguous.md b/proposals/nnnn-utf-8-unsafe-contiguous.md deleted file mode 100644 index 92bf3eafd4..0000000000 --- a/proposals/nnnn-utf-8-unsafe-contiguous.md +++ /dev/null @@ -1,565 +0,0 @@ -# UTF-8 Processing Over Unsafe Contiguous Bytes - -* Proposal: [SE-NNNN](nnnn-utf-8-unsafe-contiguous.md) -* Authors: [Michael Ilseman](https://github.com/milseman) -* Review Manager: TBD -* Status: **Awaiting implementation** -* Implementation: (pending) -* Upcoming Feature Flag: (pending) -* Review: ([pitch](https://forums.swift.org/t/pitch-utf-8-processing-over-unsafe-contiguous-bytes/69715)) - -## Introduction and Motivation - -Native `String`s are stored as validly-encoded UTF-8 bytes in a contiguous memory buffer. The standard library implements `String` functionality on top of this buffer, taking advantage of the validly-encoded invariant and specialized Unicode knowledge. We propose exposing this functionality as API for more advanced libraries and developers. - -This pitch focuses on a portion of the broader API and functionality discussed in [Pitch: Unicode Processing APIs](https://forums.swift.org/t/pitch-unicode-processing-apis/69294). That broader pitch can be divided into 3 kinds of API additions: - -1. Unicode processing API for working with contiguously-stored valid UTF-8 bytes -2. `Element`-based stream processing functionality. E.g., a stream of `UInt8` can be turned into a stream of `Unicode.Scalar` or `Character`s. -3. 
Stream-of-buffers processing functionality, which provides a lower-level / more efficient implementation for the second area. - -This pitch focuses on the first. - -## Proposed Solution - -We propose `UnsafeValidUTF8BufferPointer` which exposes a similar API surface as `String` for validly-encoded UTF-8 code units in contiguous memory. - - -## Detailed Design - -`UnsafeValidUTF8BufferPointer` consists of a (non-optional) raw pointer and a length, with some flags bit-packed in. - -```swift -/// An unsafe buffer pointer to validly-encoded UTF-8 code units stored in -/// contiguous memory. -/// -/// UTF-8 validity is checked upon creation. -/// -/// `UnsafeValidUTF8BufferPointer` does not manage the memory or guarantee -/// memory safety. Any overlapping writes into the memory can lead to undefined -/// behavior. -/// -@frozen -public struct UnsafeValidUTF8BufferPointer { - @usableFromInline - internal var _baseAddress: UnsafeRawPointer - - // A bit-packed count and flags (such as isASCII) - @usableFromInline - internal var _countAndFlags: UInt64 -} -``` - -It differs from `UnsafeRawBufferPointer` in that its contents, upon construction, are guaranteed to be validly-encoded UTF-8. This guarantee speeds up processing significantly relative to performing validation on every read. It is unsafe because it is an API surface on top of `UnsafeRawPointer`, inheriting all the unsafety therein and developers must manually guarantee invariants such as lifetimes and exclusivity. It is further based on `UnsafeRawPointer` instead of `UnsafePointer` so as not to [bind memory to a type](https://developer.apple.com/documentation/swift/unsaferawpointer#Typed-Memory). - - -### Validation and creation - -`UnsafeValidUTF8BufferPointer` is validated at initialization time, and encoding errors are thrown. 
- -```swift -extension Unicode.UTF8 { - @frozen - public enum EncodingErrorKind: Error { - case unexpectedContinuationByte - case expectedContinuationByte - case overlongEncoding - case invalidCodePoint - - case invalidStarterByte - - case unexpectedEndOfInput - } -} -``` - -```swift -// All the initializers below are `throw`ing, as they validate the contents -// upon construction. -extension UnsafeValidUTF8BufferPointer { - @frozen - public struct DecodingError: Error, Sendable, Hashable, Codable { - public var kind: UTF8.EncodingErrorKind - public var offsets: Range - } - - // ABI traffics in `Result` - @usableFromInline - internal static func _validate( - baseAddress: UnsafeRawPointer, length: Int - ) -> Result - - @_alwaysEmitIntoClient - public init(baseAddress: UnsafeRawPointer, length: Int) throws(DecodingError) - - @_alwaysEmitIntoClient - public init(nulTerminatedCString: UnsafeRawPointer) throws(DecodingError) - - @_alwaysEmitIntoClient - public init(nulTerminatedCString: UnsafePointer) throws(DecodingError) - - @_alwaysEmitIntoClient - public init(_: UnsafeRawBufferPointer) throws(DecodingError) - - @_alwaysEmitIntoClient - public init(_: UnsafeBufferPointer) throws(DecodingError) -} -``` - -#### Unsafety and encoding validity - -Every way to construct a `UnsafeValidUTF8BufferPointer` ensures that its contents are validly-encoded UTF-8. Thus, it has no new source of unsafety beyond the unsafety inherent in unsafe pointer's requirement that lifetime and exclusive access be manually enforced by the programmer. A write into this memory which violates encoding validity would also violate exclusivity. - -If we did not guarantee UTF-8 encoding validity, we'd be open to new security and safety concerns beyond unsafe pointers. - -With invalidly-encoded contents, memory safety would become more nuanced. An ill-formed leading byte can dictate a scalar length that is longer than the memory buffer. 
The buffer may have bounds associated with it, which differs from the bounds dictated by its contents. - -Additionally, a particular scalar value in valid UTF-8 has only one encoding, but invalid UTF-8 could have the same value encoded as an [overlong encoding](https://en.wikipedia.org/wiki/UTF-8#Overlong_encodings), which would compromise code that checks for the presence of a scalar value by looking at the encoded bytes (or that does a byte-wise comparison). - -`UnsafeValidUTF8BufferPointer` is unsafe in the all ways that unsafe pointers are unsafe, but not in more ways. - - -### Accessing contents - -Flags and raw contents can be accessed: - -```swift -extension UnsafeValidUTF8BufferPointer { - /// Returns whether the validated contents were all-ASCII. This is checked at - /// initialization time and remembered. - @inlinable - public var isASCII: Bool - - /// Access the underlying raw bytes - @inlinable - public var rawBytes: UnsafeRawBufferPointer -} -``` - -Like `String`, `UnsafeValidUTF8BufferPointer` provides views for accessing `Unicode.Scalar`s, `UTF16.CodeUnit`s, and `Character`s. - -```swift -extension UnsafeValidUTF8BufferPointer { - /// A view of the buffer's contents as a bidirectional collection of `Unicode.Scalar`s. - @frozen - public struct UnicodeScalarView { - public var buffer: UnsafeValidUTF8BufferPointer - - @inlinable - public init(_ buffer: UnsafeValidUTF8BufferPointer) - } - - @inlinable - public var unicodeScalars: UnicodeScalarView - - /// A view of the buffer's contents as a bidirectional collection of `Character`s. - @frozen - public struct CharacterView { - public var buffer: UnsafeValidUTF8BufferPointer - - @inlinable - public init(_ buffer: UnsafeValidUTF8BufferPointer) - } - - @inlinable - public var characters: CharacterView - - /// A view off the buffer's contents as a bidirectional collection of transcoded - /// `UTF16.CodeUnit`s. 
- @frozen - public struct UTF16View { - public var buffer: UnsafeValidUTF8BufferPointer - - @inlinable - public init(_ buffer: UnsafeValidUTF8BufferPointer) - } - - @inlinable - public var utf16: UTF16View -} -``` - -These are bidirectional collections, as in `String`. Their indices, however, are distinct from each other because they mean different things. For example, a scalar-view index is scalar aligned but not necessarily `Character` aligned, and a transcoded index which points mid-scalar doesn't have a corresponding position in the raw bytes. - -```swift -extension UnsafeValidUTF8BufferPointer.UnicodeScalarView: BidirectionalCollection { - public typealias Element = Unicode.Scalar - - @frozen - public struct Index: Comparable, Hashable { - @usableFromInline - internal var _byteOffset: Int - - @inlinable - public var byteOffset: Int { get } - - @inlinable - public static func < (lhs: Self, rhs: Self) -> Bool - - @inlinable - internal init(_uncheckedByteOffset offset: Int) - } - - @inlinable - public subscript(position: Index) -> Element { _read } - - @inlinable - public func index(after i: Index) -> Index - - @inlinable - public func index(before i: Index) -> Index - - @inlinable - public var startIndex: Index - - @inlinable - public var endIndex: Index -} - - -extension UnsafeValidUTF8BufferPointer.CharacterView: BidirectionalCollection { - public typealias Element = Character - - @frozen - public struct Index: Comparable, Hashable { - @usableFromInline - internal var _byteOffset: Int - - @inlinable - public var byteOffset: Int { get } - - @inlinable - public static func < (lhs: Self, rhs: Self) -> Bool - - @inlinable - internal init(_uncheckedByteOffset offset: Int) - } - - // Custom-defined for performance to avoid double-measuring - // grapheme cluster length - @frozen - public struct Iterator: IteratorProtocol { - @usableFromInline - internal var _buffer: UnsafeValidUTF8BufferPointer - - @usableFromInline - internal var _position: Index - - @inlinable - 
public var buffer: UnsafeValidUTF8BufferPointer { get } - - @inlinable - public var position: Index { get } - - public typealias Element = Character - - public mutating func next() -> Character? - - @inlinable - internal init( - _buffer: UnsafeValidUTF8BufferPointer, _position: Index - ) - } - - @inlinable - public func makeIterator() -> Iterator - - @inlinable - public subscript(position: Index) -> Element { _read } - - @inlinable - public func index(after i: Index) -> Index - - @inlinable - public func index(before i: Index) -> Index - - @inlinable - public var startIndex: Index - - @inlinable - public var endIndex: Index -} - -extension UnsafeValidUTF8BufferPointer.UTF16View: BidirectionalCollection { - public typealias Element = UTF16.CodeUnit - - @frozen - public struct Index: Comparable, Hashable { - // Bitpacked byte offset and transcoded offset - @usableFromInline - internal var _byteOffsetAndTranscodedOffset: UInt64 - - /// Offset of the first byte of the currently-indexed scalar - @inlinable - public var byteOffset: Int { get } - - /// Offset of the transcoded code unit within the currently-indexed scalar - @inlinable - public var transcodedOffset: Int { get } - - @inlinable - public static func < (lhs: Self, rhs: Self) -> Bool - - @inlinable - internal init( - _uncheckedByteOffset offset: Int, _transcodedOffset: Int - ) - } - - @inlinable - public subscript(position: Index) -> Element { _read } - - @inlinable - public func index(after i: Index) -> Index - - @inlinable - public func index(before i: Index) -> Index - - @inlinable - public var startIndex: Index - - @inlinable - public var endIndex: Index -} -``` - -### Canonical equivalence - -```swift -// Canonical equivalence -extension UnsafeValidUTF8BufferPointer { - /// Whether `self` is equivalent to `other` under Unicode Canonical Equivalence - public func isCanonicallyEquivalent( - to other: UnsafeValidUTF8BufferPointer - ) -> Bool - - /// Whether `self` orders less than `other` (under Unicode 
Canonical Equivalence - /// using normalized code-unit order) - public func isCanonicallyLessThan( - _ other: UnsafeValidUTF8BufferPointer - ) -> Bool -} -``` - - - -## Alternatives Considered - -### Other names - -We're not particularly attached to the name `UnsafeValidUTF8BufferPointer`. Other names could include: - -- `UnsafeValidUTF8CodeUnitBufferPointer` -- `UTF8.UnsafeValidBufferPointer` -- `UTF8.UnsafeValidCodeUnitBufferPointer` -- `UTF8.ValidlyEncodedCodeUnitUnsafeBufferPointer` -- `UnsafeContiguouslyStoredValidUTF8CodeUnitsBuffer` - -etc. - -For `isCanonicallyLessThan`, another name could be `canonicallyPrecedes`, `lexicographicallyPrecedesUnderNFC`, etc. - -### Static methods instead of initializers - -`UnsafeValidUTF8BufferPointer`s could instead be created by static methods on `UTF8`: - -```swift -extension Unicode.UTF8 { - static func validate( - ... - ) throws -> UnsafeValidUTF8BufferPointer -} -``` - -### Hashable and other conformances - -`UnsafeValidUTF8BufferPointer` follows `UnsafeRawBufferPointer` and `UnsafeBufferPointer` in not conforming to `Sendable`, `Hashable`, `Equatable`, `Comparable`, `Codable`, etc. - -### `UTF8.EncodingErrorKind` as a `struct` - -We may want to use the [raw-representable struct pattern](https://github.com/apple/swift-system/blob/9a812b5fef1e7f27f8594fee5463bd88c5b691ec/Sources/System/Errno.swift#L14) for `UTF8.EncodingErrorKind` instead of an exhaustive enum. That is, we may want to define it as: - -```swift -extension Unicode.UTF8 { - @frozen - public struct EncodingErrorKind: Error, Sendable, Hashable, Codable { - public var rawValue: UInt8 - - @inlinable - public init(rawValue: UInt8) { - self.rawValue = rawValue - } - - @inlinable - public static var unexpectedContinuationByte: Self { - .init(rawValue: 0x01) - } - - @inlinable - public static var overlongEncoding: Self { - .init(rawValue: 0x02) - } - - // ... 
- } -} -``` - -This would allow us to grow the kinds of errors or else add some error nuance in the future, at the loss of exhaustive switches inside `catch`es. - -For example, an unexpected-end-of-input error, which happens when a scalar is in the process of being decoded but not enough bytes have been read, could be reported in different ways. It could be reported as a distinct kind of error (particularly useful for stream processing which may want to resume with more content) or it could be an `expectedContinuationByte` covering the end-of-input position. As a value, it could have a distinct value or be an alias to the same value. - - - - -## Future Directions - -### A non-escapable `ValidUTF8BufferView` - -Future improvements to Swift enable a non-escapable type (["BufferView"](https://github.com/atrick/swift-evolution/blob/fd63292839808423a5062499f588f557000c5d15/visions/language-support-for-BufferView.md)) to provide safely-unmanaged buffers via dependent lifetimes for use within a limited scope. We should add a corresponding type for validly-encoded UTF-8 contents, following the same API shape. - - -### Shared-ownership buffer - -We could propose a managed or shared-ownership validly-encoded UTF-8 buffer. E.g.: - -```swift -struct ValidlyEncodedUTF8SharedBuffer { - var contents: UnsafeValidUTF8BufferPointer - var owner: AnyObject? -} -``` - -where "shared" denotes that ownership is shared with the `owner` field, as opposed to an allocation exclusively managed by this type (the way `Array` or `String` would). Thus, it could be backed by a native `String`, an instance of `Data` or `Array` (if ensured to be validly encoded), etc., which participate fully in their COW semantics by retaining their storage. - -This would enable us to create shared strings, e.g. 
- -```swift -extension String { - /// Does not copy the given storage, rather shares it - init(sharing: ValidlyEncodedUTF8SharedBuffer) -} -``` - -Also, this could allow us to present API which repairs invalid contents, since a repair operation would need to create and manage its own allocation. - - -#### Alternative: More general formulation (💥🐮) - -We could add the more general ["deconstructed COW"](https://forums.swift.org/t/idea-bytes-literal/44124/50) - -```swift -/// A buffer of `T`s in contiguous memory -struct SharedContiguousStorage { - var rawContents: UnsafeRawBufferPointer - var owner: AnyObject? -} -``` - -where the choice of `Raw` pointers is necessary to avoid type-binding the memory, but other designs are possible too. - -However, this type alone loses static knowledge of the UTF-8 validity, so we'd still need a separate type for validly encoded UTF-8. - -Instead, we could parameterize over an unsafe-buffer-pointer-like protocol: - -```swift -struct SharedContiguousStorage { - var contents: UnsafeBuffer - var owner: AnyObject? -} - -extension String { - /// Does not copy the given storage, rather shares it - init(sharing: SharedContiguousStorage) -} -``` - -Accessing the stored pointer would still need to be done carefully, as it would have lifetime dependent on `owner`. In current Swift, that would likely need to be done via a closure-taking API. - - -### `protocol ContiguouslyStoredValidUTF8` - -We could define a protocol for validly-encoded UTF-8 bytes in contiguous memory, somewhat analogous to a low-level `StringProtocol`. Both an unsafe and a shared-ownership type could conform to provide the same API. - -However, we'd want to be careful to future-proof such a protocol so that a `ValidUTF8BufferView` could conform as well. In the meantime, even if we go with adding a shared-ownership type, Unicode processing operations can be performed by accessing the unsafe buffer pointer. 
- -### Extend to `Element`-based or buffer-based streams - -We could define a segment of validly encoded UTF-8, which is not necessarily aligned across any particular boundary. This would be a significantly different API shape than `String`'s views. Accessing the start of content would require passing in initial state and reaching the end would produce a state to be fed into the next segment. - -It would make an awkward fit directly on top of `Collection`, so this would be a new API shape. For example, it could be akin to a `StatefulCollection` that in addition to having `startIndex/endIndex` would have `startState/endState`. Concerns such as bidirectionality, where exactly `endIndex` points to (the start or end of the partial value at the tail), etc., require further thought. - -### Regex or regex-like support - -Future API additions would be to support `Regex`es on such buffers. - -Another future direction could be to add many routines corresponding to the underlying operations performed by the regex engine, such as: - -```swift -extension UnsafeValidUTF8BufferPointer.CharacterView { - func matchCharacterClass( - _: CharacterClass, - startingAt: Index, - limitedBy: Index - ) throws -> Index? - - func matchQuantifiedCharacterClass( - _: CharacterClass, - _: QuantificationDescription, - startingAt: Index, - limitedBy: Index - ) throws -> Index? -} -``` - -which would be useful for parser-combinator libraries that wish to expose `String`'s model of Unicode by using the stdlib's accelerated implementation. - -### Transcoded views, normalized views, case-folded views, etc - -We could provide lazily transcoded, normalized, case-folded, etc., views. If we do any of these for `UnsafeValidUTF8BufferPointer`, we should consider adding equivalents on `String`, `Substring`, etc. If we were to make any new protocols or changes to protocols, we'd want to also future-proof for a `ValidUTF8BufferView`. 
- -For example, transcoded views can be generalized: - -```swift -extension UnsafeValidUTF8BufferPointer { - /// A view of the buffer's contents as a bidirectional collection of transcoded - /// `Encoding.CodeUnit`s. - @frozen - public struct TranscodedView { - public var buffer: UnsafeValidUTF8BufferPointer - - @inlinable - public init(_ buffer: UnsafeValidUTF8BufferPointer) - } -} -``` - -Note that UTF-16 has such historical significance that, even with a fully-generic transcoded view, we'd likely want a dedicated, specialized type for UTF-16. - -We could similarly provide lazily-normalized views of code units or scalars under NFC or NFD (which the stdlib already distributes data tables for), possibly generic via a protocol for 3rd party normal forms. - -Finally, case-folded functionality can be accessed in today's Swift via [scalar properties](https://developer.apple.com/documentation/swift/unicode/scalar/properties-swift.struct), but we could provide convenience collections ourselves as well. - - -### UTF-8 to/from UTF-16 breadcrumbs API - -String's implementation caches distances between UTF-8 and UTF-16 views, as some imported Cocoa APIs use random access to the UTF-16 view. We could formalize and expose API for this. - - -### `NUL`-termination concerns and C bridging - -`UnsafeValidUTF8BufferPointer` is capable of housing interior `NUL` characters, just like `String`. We could add additional flags and initialization options to detect a trailing `NUL` byte beyond the count and treat it as a terminator. In those cases, we could provide a `withCStringIfAvailable` style API. - -### Index rounding operations - -Unlike `String`, `UnsafeValidUTF8BufferPointer`'s views' `Index` types are distinct, which avoids a [mess of problems](https://forums.swift.org/t/string-index-unification-vs-bidirectionalcollection-requirements/55946). Interesting additions to both `UnsafeValidUTF8BufferPointer` and `String` would be explicit index-rounding for a desired behavior. 
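
As a rough illustration of what explicit index rounding could mean at the byte level, the sketch below rounds a byte offset down to the nearest scalar boundary by stepping back over UTF-8 continuation bytes (those of the form `0b10xxxxxx`). The helper name `roundDownToScalarBoundary` is hypothetical and not part of this pitch, and the sketch assumes the bytes are already known to be valid UTF-8.

```swift
/// Hypothetical helper (not proposed API): rounds `offset` down to the start
/// of the scalar that contains it. Assumes `bytes` holds valid UTF-8.
func roundDownToScalarBoundary(_ offset: Int, in bytes: [UInt8]) -> Int {
    precondition(offset >= 0 && offset <= bytes.count, "offset out of bounds")
    var i = offset
    // Continuation bytes have the bit pattern 10xxxxxx; step back past them.
    while i > 0 && i < bytes.count && (bytes[i] & 0xC0) == 0x80 {
        i -= 1
    }
    return i
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT: one Character, two scalars,
// three bytes (0x65, 0xCC, 0x81).
let accented: [UInt8] = Array("e\u{301}".utf8)

let roundedFromMidScalar = roundDownToScalarBoundary(2, in: accented)
let roundedFromBoundary = roundDownToScalarBoundary(1, in: accented)
let roundedFromEnd = roundDownToScalarBoundary(accented.count, in: accented)
```

Offset 1 is scalar-aligned but not `Character`-aligned here, which is the kind of distinction that separate index types per view make explicit. Rounding to a `Character` boundary would additionally require grapheme-breaking logic, which this sketch does not attempt.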
- - -### Canonical Spaceships - -Should a `ComparisonResult` (or [spaceship](https://forums.swift.org/t/pitch-comparison-reform/5662)) be added to Swift, we could support that operation under canonical equivalence in a single pass rather than subsequent calls to `isCanonicallyEquivalent(to:)` and `isCanonicallyLessThan(_:)`. - - -### Other Unicode functionality - -For the purposes of this pitch, we're not looking to expand the scope of functionality beyond what the stdlib already does in support of `String`'s API. Other functionality can be considered future work. diff --git a/proposals/nnnn-utf8-span.md b/proposals/nnnn-utf8-span.md new file mode 100644 index 0000000000..84e8c5d0fd --- /dev/null +++ b/proposals/nnnn-utf8-span.md @@ -0,0 +1,756 @@ +# Safe Access to Contiguous UTF-8 Storage + +* Proposal: [SE-NNNN](nnnn-utf8-span.md) +* Authors: [Michael Ilseman](https://github.com/milseman), [Guillaume Lessard](https://github.com/glessard) +* Review Manager: TBD +* Status: **Awaiting implementation** +* Bug: rdar://48132971, rdar://96837923 +* Implementation: (pending) +* Upcoming Feature Flag: (pending) +* Review: ([pitch 1](https://forums.swift.org/t/pitch-utf-8-processing-over-unsafe-contiguous-bytes/69715)) + +## Introduction + +We introduce `UTF8Span` for efficient and safe Unicode processing over contiguous storage. + +Native `String`s are stored as validly-encoded UTF-8 bytes in an internal contiguous memory buffer. The standard library implements `String`'s API as internal methods which operate on top of this buffer, taking advantage of the validly-encoded invariant and specialized Unicode knowledge. We propose making this UTF-8 buffer and its methods public as API for more advanced libraries and developers. + +## Motivation + +Currently, if a developer wants to do `String`-like processing over UTF-8 bytes, they have to make an instance of `String`, which allocates a native storage class and copies all the bytes. 
The developer would then need to operate within the new `String`'s views and map between `String.Index` and byte offsets in the original buffer. + +For example, if these bytes were part of a data structure, the developer would need to decide to either cache such a new `String` instance or recreate it on the fly. Caching more than doubles the size and adds caching complexity. Recreating it on the fly adds a linear time factor and class instance allocation/deallocation. + +Furthermore, `String` may not be available on all embedded platforms due to the fact that its conformances to `Comparable` and `Collection` depend on data tables bundled with the stdlib. `UTF8Span` is a more appropriate type for these platforms, and only some explicit API make use of data tables. + + + +### UTF-8 validity and efficiency + +UTF-8 validation is a particularly common concern and the subject of a fair amount of [research](https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/). Once an input is known to be validly encoded UTF-8, subsequent operations such as decoding, grapheme breaking, comparison, etc., can be implemented much more efficiently under this assumption of validity. Swift's `String` type's native storage is guaranteed-valid-UTF8 for this reason. + +Failure to guarantee UTF-8 encoding validity creates security and safety concerns. With invalidly-encoded contents, memory safety would become more nuanced. An ill-formed leading byte can dictate a scalar length that is longer than the memory buffer. The buffer may have bounds associated with it, which differ from the bounds dictated by its contents. + +Additionally, a particular scalar value in valid UTF-8 has only one encoding, but invalid UTF-8 could have the same value encoded as an [overlong encoding](https://en.wikipedia.org/wiki/UTF-8#Overlong_encodings), which would compromise code that checks for the presence of a scalar value by looking at the encoded bytes (or that does a byte-wise comparison). 
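
The overlong-encoding hazard can be made concrete with a small sketch. `naiveDecodeTwoByte` is a hypothetical, deliberately lenient decoder written only for this illustration (it is not how the stdlib decodes). It shows that the bytes `0xC0 0xAF` would decode to `/` (U+002F) if overlong forms were accepted, even though a byte-wise scan for `0x2F` finds nothing. The last lines use the existing `String(decoding:as:)` initializer, which repairs invalid input with U+FFFD rather than accepting it.

```swift
/// Hypothetical lenient decoder for illustration only: combines the payload
/// bits of a 2-byte sequence WITHOUT rejecting overlong forms.
func naiveDecodeTwoByte(_ lead: UInt8, _ continuation: UInt8) -> UInt32 {
    (UInt32(lead & 0x1F) << 6) | UInt32(continuation & 0x3F)
}

// An overlong (and therefore invalid) 2-byte encoding of "/" (U+002F).
let overlongSlash: [UInt8] = [0xC0, 0xAF]

// The lenient decoder yields U+002F...
let leniencyResult = naiveDecodeTwoByte(overlongSlash[0], overlongSlash[1])

// ...but a byte-wise check for the ASCII slash byte does not see it.
let byteScanFindsSlash = overlongSlash.contains(0x2F)

// The stdlib's repairing initializer substitutes U+FFFD instead of
// producing a "/" from the invalid bytes.
let repaired = String(decoding: overlongSlash, as: UTF8.self)
let repairedContainsSlash = repaired.contains("/")
```

With validity established once at construction time, the byte-wise check and the decoded view cannot disagree in this way.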
+ + +## Proposed solution + +We propose a non-escapable `UTF8Span` which exposes a similar API surface as `String` for validly-encoded UTF-8 code units in contiguous memory. + + +## Detailed design + +`UTF8Span` is a borrowed view into contiguous memory containing validly-encoded UTF-8 code units. + +```swift +@frozen +public struct UTF8Span: Copyable, ~Escapable { + @usableFromInline + internal var _start: Index + + /* + A bit-packed count and flags (such as isASCII) + + ┌───────┬──────────┬───────┐ + │ b63 │ b62:56 │ b56:0 │ + ├───────┼──────────┼───────┤ + │ ASCII │ reserved │ count │ + └───────┴──────────┴───────┘ + + Future bits could be used for all <0x300 scalar (aka <0xC0 byte) + flag which denotes the quickest NFC check, a quickCheck NFC + flag (using Unicode data tables), a full-check NFC flag, + single-scalar-grapheme-clusters flag, etc. + + */ + @usableFromInline + internal var _countAndFlags: UInt64 +} +``` + +### Creation and validation + +`UTF8Span` is validated at initialization time, and encoding errors are diagnosed and thrown. + +```swift +extension Unicode.UTF8 { + /// The kind of encoding error encountered during validation + @frozen + public struct EncodingErrorKind: Error, Sendable, Hashable, Codable { + public var rawValue: UInt8 + + @inlinable + public init(rawValue: UInt8) + + @_alwaysEmitIntoClient + public static var unexpectedContinuationByte: Self { get } + + @_alwaysEmitIntoClient + public static var overlongEncoding: Self { get } + + @_alwaysEmitIntoClient + public static var invalidCodePoint: Self { get } + } +} +``` + +**TODO**: Check all the kinds of errors we'd like to diagnose. Since this is a `RawRepresentable` struct, we can still extend it with a (finite) number of error kinds in the future. 
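
To illustrate the trade-off of the raw-representable struct pattern, here is a minimal sketch using a stand-in type. `DemoEncodingErrorKind`, its raw values, and `describe(_:)` are hypothetical names chosen for this example, not the proposed API. Because new kinds can be added later, a `switch` over such a struct cannot be exhaustive and needs a `default` case.

```swift
/// Hypothetical stand-in for a raw-representable error-kind struct.
struct DemoEncodingErrorKind: Error, Hashable, RawRepresentable {
    var rawValue: UInt8
    init(rawValue: UInt8) { self.rawValue = rawValue }

    // Raw values here are arbitrary and chosen only for the example.
    static var unexpectedContinuationByte: Self { .init(rawValue: 0x01) }
    static var overlongEncoding: Self { .init(rawValue: 0x02) }
}

/// Callers match known kinds and must handle unknown ones.
func describe(_ kind: DemoEncodingErrorKind) -> String {
    switch kind {
    case .unexpectedContinuationByte:
        return "unexpected continuation byte"
    case .overlongEncoding:
        return "overlong encoding"
    default:
        // A kind introduced after this code was written lands here.
        return "unknown kind (raw value \(kind.rawValue))"
    }
}

let known = describe(.overlongEncoding)
let future = describe(DemoEncodingErrorKind(rawValue: 0x7F))
```

An exhaustive `enum` would let the compiler check that every case is handled, at the cost of making any new case a source-breaking change for such switches.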
+ +```swift +extension UTF8Span { + /// The kind and location of invalidly-encoded UTF-8 bytes + @frozen + public struct EncodingError: Error, Sendable, Hashable, Codable { + /// The kind of encoding error + public var kind: Unicode.UTF8.EncodingErrorKind + + /// The range of offsets into our input containing the error + public var range: Range + } + + public init( + validating codeUnits: Span + ) throws(EncodingError) -> dependsOn(codeUnits) Self + + public init( + nulTerminatedCString: UnsafeRawPointer, + owner: borrowing Owner + ) throws(EncodingError) -> dependsOn(owner) Self + + public init( + nulTerminatedCString: UnsafePointer, + owner: borrowing Owner + ) throws(EncodingError) -> dependsOn(owner) Self +} +``` + +### Views + +Similarly to `String`, `UTF8Span` exposes different ways to view the UTF-8 contents. + +`UTF8Span.UnicodeScalarView` corresponds to `String.UnicodeScalarView` for read-only purposes; however, it is not `RangeReplaceable`, as `UTF8Span` provides read-only access. Similarly, `UTF8Span.CharacterView` corresponds to `String`'s character view (i.e. its default view), `UTF8Span.UTF16View` to `String.UTF16View`, and `UTF8Span.CodeUnits` to `String.UTF8View`. + +```swift +extension UTF8Span { + public typealias CodeUnits = Span + + @inlinable + public var codeUnits: CodeUnits { get } + + @frozen + public struct UnicodeScalarView: ~Escapable { + public let span: UTF8Span + + @inlinable + public init(_ span: UTF8Span) + } + + @inlinable + public var unicodeScalars: UnicodeScalarView { _read } + + @frozen + public struct CharacterView: ~Escapable { + public let span: UTF8Span + + @inlinable + public init(_ span: UTF8Span) + } + + @inlinable + public var characters: CharacterView { _read } + + @frozen + public struct UTF16View: ~Escapable { + public let span: UTF8Span + + @inlinable + public init(_ span: UTF8Span) + } + + @inlinable + public var utf16: UTF16View { _read } +} +``` + +**TODO**: `_read` vs `get`? 
`@inlinable` vs `@_alwaysEmitIntoClient`? + +##### `Collection`-like API: + +Like `Span`, `UTF8Span` provides index and `Collection`-like API: + + +```swift +extension UTF8Span { + public typealias Index = RawSpan.Index +} + +extension UTF8Span.UnicodeScalarView { + @frozen + public struct Index: Comparable, Hashable { + public var position: UTF8Span.Index + + @inlinable + public init(_ position: UTF8Span.Index) + + @inlinable + public static func < ( + lhs: UTF8Span.UnicodeScalarView.Index, + rhs: UTF8Span.UnicodeScalarView.Index + ) -> Bool + } + + public typealias Element = Unicode.Scalar + + @frozen + public struct Iterator: ~Escapable { + public typealias Element = Unicode.Scalar + + public let span: UTF8Span + + public var position: UTF8Span.Index + + @inlinable + init(_ span: UTF8Span) + + @inlinable + public mutating func next() -> Unicode.Scalar? + } + + @inlinable + public borrowing func makeIterator() -> Iterator + + @inlinable + public var startIndex: Index { get } + + @inlinable + public var endIndex: Index { get } + + @inlinable + public var count: Int { get } + + @inlinable + public var isEmpty: Bool { get } + + @inlinable + public var indices: Range { get } + + @inlinable + public func index(after i: Index) -> Index + + @inlinable + public func index(before i: Index) -> Index + + @inlinable + public func index( + _ i: Index, offsetBy distance: Int, limitedBy limit: Index + ) -> Index? 
+ + @inlinable + public func formIndex(after i: inout Index) + + @inlinable + public func formIndex(before i: inout Index) + + @inlinable + public func index(_ i: Index, offsetBy distance: Int) -> Index + + @inlinable + public func formIndex(_ i: inout Index, offsetBy distance: Int) + + @inlinable + public func formIndex( + _ i: inout Index, offsetBy distance: Int, limitedBy limit: Index + ) -> Bool + + @inlinable + public subscript(position: Index) -> Element { borrowing _read } + + @inlinable + public subscript(unchecked position: Index) -> Element { + borrowing _read + } + + @inlinable + public subscript(bounds: Range) -> Self { get } + + @inlinable + public subscript(unchecked bounds: Range) -> Self { + borrowing get + } + + @_alwaysEmitIntoClient + public subscript(bounds: some RangeExpression) -> Self { + borrowing get + } + + @_alwaysEmitIntoClient + public subscript( + unchecked bounds: some RangeExpression + ) -> Self { + borrowing get + } + + @_alwaysEmitIntoClient + public subscript(x: UnboundedRange) -> Self { + borrowing get + } + + @inlinable + public func distance(from start: Index, to end: Index) -> Int + + @inlinable + public func elementsEqual(_ other: Self) -> Bool + + @inlinable + public func elementsEqual(_ other: some Sequence) -> Bool +} + +extension UTF8Span.CharacterView { + @frozen + public struct Index: Comparable, Hashable { + public var position: UTF8Span.Index + + @inlinable + public init(_ position: UTF8Span.Index) + + @inlinable + public static func < ( + lhs: UTF8Span.CharacterView.Index, + rhs: UTF8Span.CharacterView.Index + ) -> Bool + } + + public typealias Element = Character + + @frozen + public struct Iterator: ~Escapable { + public typealias Element = Character + + public let span: UTF8Span + + public var position: UTF8Span.Index + + @inlinable + init(_ span: UTF8Span) + + @inlinable + public mutating func next() -> Character? 
+ } + + @inlinable + public borrowing func makeIterator() -> Iterator + + @inlinable + public var startIndex: Index { get } + + @inlinable + public var endIndex: Index { get } + + @inlinable + public var count: Int { get } + + @inlinable + public var isEmpty: Bool { get } + + @inlinable + public var indices: Range { get } + + @inlinable + public func index(after i: Index) -> Index + + @inlinable + public func index(before i: Index) -> Index + + @inlinable + public func index( + _ i: Index, offsetBy distance: Int, limitedBy limit: Index + ) -> Index? + + @inlinable + public func formIndex(after i: inout Index) + + @inlinable + public func formIndex(before i: inout Index) + + @inlinable + public func index(_ i: Index, offsetBy distance: Int) -> Index + + @inlinable + public func formIndex(_ i: inout Index, offsetBy distance: Int) + + @inlinable + public func formIndex( + _ i: inout Index, offsetBy distance: Int, limitedBy limit: Index + ) -> Bool + + @inlinable + public subscript(position: Index) -> Element { borrowing _read } + + @inlinable + public subscript(unchecked position: Index) -> Element { + borrowing _read + } + + @inlinable + public subscript(bounds: Range) -> Self { get } + + @inlinable + public subscript(unchecked bounds: Range) -> Self { + borrowing get + } + + @_alwaysEmitIntoClient + public subscript(bounds: some RangeExpression) -> Self { + borrowing get + } + + @_alwaysEmitIntoClient + public subscript( + unchecked bounds: some RangeExpression + ) -> Self { + borrowing get + } + + @_alwaysEmitIntoClient + public subscript(x: UnboundedRange) -> Self { + borrowing get + } + + @inlinable + public func distance(from start: Index, to end: Index) -> Int + + @inlinable + public func elementsEqual(_ other: Self) -> Bool + + @inlinable + public func elementsEqual(_ other: some Sequence) -> Bool +} + +extension UTF8Span.UTF16View { + @frozen + public struct Index: Comparable, Hashable { + @usableFromInline + internal var _rawValue: UInt64 + + @inlinable + 
public var position: UTF8Span.Index { get } + + /// Whether this index is referring to the second code unit of a non-BMP + /// Unicode Scalar value. + @inlinable + public var secondCodeUnit: Bool { get } + + @inlinable + public init(_ position: UTF8Span.Index, secondCodeUnit: Bool) + + @inlinable + public static func < ( + lhs: UTF8Span.UTF16View.Index, + rhs: UTF8Span.UTF16View.Index + ) -> Bool + } + + public typealias Element = UInt16 + + @frozen + public struct Iterator: ~Escapable { + public typealias Element = UInt16 + + public let span: UTF8Span + + public var index: UTF8Span.UTF16View.Index + + @inlinable + init(_ span: UTF8Span) + + @inlinable + public mutating func next() -> UInt16? + } + + @inlinable + public borrowing func makeIterator() -> Iterator + + @inlinable + public var startIndex: Index { get } + + @inlinable + public var endIndex: Index { get } + + @inlinable + public var count: Int { get } + + @inlinable + public var isEmpty: Bool { get } + + @inlinable + public var indices: Range { get } + + @inlinable + public func index(after i: Index) -> Index + + @inlinable + public func index(before i: Index) -> Index + + @inlinable + public func index( + _ i: Index, offsetBy distance: Int, limitedBy limit: Index + ) -> Index? 
+ + @inlinable + public func formIndex(after i: inout Index) + + @inlinable + public func formIndex(before i: inout Index) + + @inlinable + public func index(_ i: Index, offsetBy distance: Int) -> Index + + @inlinable + public func formIndex(_ i: inout Index, offsetBy distance: Int) + + @inlinable + public func formIndex( + _ i: inout Index, offsetBy distance: Int, limitedBy limit: Index + ) -> Bool + + @inlinable + public subscript(position: Index) -> Element { borrowing _read } + + @inlinable + public subscript(unchecked position: Index) -> Element { + borrowing _read + } + + @inlinable + public subscript(bounds: Range) -> Self { get } + + @inlinable + public subscript(unchecked bounds: Range) -> Self { + borrowing get + } + + @_alwaysEmitIntoClient + public subscript(bounds: some RangeExpression) -> Self { + borrowing get + } + + @_alwaysEmitIntoClient + public subscript( + unchecked bounds: some RangeExpression + ) -> Self { + borrowing get + } + + @_alwaysEmitIntoClient + public subscript(x: UnboundedRange) -> Self { + borrowing get + } + + @inlinable + public func distance(from start: Index, to end: Index) -> Int + + @inlinable + public func elementsEqual(_ other: Self) -> Bool + + @inlinable + public func elementsEqual(_ other: some Sequence) -> Bool +} +``` + +### Queries + +```swift +extension UTF8Span { + /// Returns whether the validated contents were all-ASCII. This is checked at + /// initialization time and remembered. + @inlinable + public var isASCII: Bool { get } + + /// Whether `i` is on a boundary between Unicode scalar values + @inlinable + public func isScalarAligned(_ i: UTF8Span.Index) -> Bool + + /// Whether `i` is on a boundary between `Character`s, i.e. extended grapheme clusters. 
+ @inlinable + public func isCharacterAligned(_ i: UTF8Span.Index) -> Bool + + /// Whether `self` is equivalent to `other` under Unicode Canonical Equivalence + public func isCanonicallyEquivalent(to other: UTF8Span) -> Bool + + /// Whether `self` orders less than `other` under Unicode Canonical Equivalence + /// using normalized code-unit order + public func isCanonicallyLessThan(_ other: UTF8Span) -> Bool +} +``` + +### Additions to `String` and `RawSpan` + +We extend `String` with the ability to access its backing `UTF8Span`: + +```swift +extension String { + // TODO: note that a copy may happen if `String` is not native... + public var utf8Span: UTF8Span { + // TODO: how to do this well, considering we also have small + // strings + } +} +extension Substring { + // TODO: needs scalar alignment (check Substring's invariants) + // TODO: note that a copy may happen if `String` is not native... + public var utf8Span: UTF8Span { + // TODO: how to do this well, considering we also have small + // strings + } +} +``` + +Additionally, we extend `RawSpan`'s byte parsing support with helpers for parsing validly-encoded UTF-8. + +```swift +extension RawSpan { + public func parseUTF8( + _ position: inout Index, length: Int + ) throws -> UTF8Span + + public func parseNullTerminatedUTF8( + _ position: inout Index + ) throws -> UTF8Span +} + +extension RawSpan.Cursor { + public mutating func parseUTF8(length: Int) throws -> UTF8Span + + public mutating func parseNullTerminatedUTF8() throws -> UTF8Span +} +``` + +## Source compatibility + +This proposal is additive and source-compatible with existing code. + +## ABI compatibility + +This proposal is additive and ABI-compatible with existing code. + +## Implications on adoption + +The additions described in this proposal require a new version of the standard library and runtime. 
+ +## Future directions + + +### More alignments + +Future API could include whether an index is "word aligned" (either [simple](https://www.unicode.org/reports/tr18/#Simple_Word_Boundaries) or [default](https://www.unicode.org/reports/tr18/#Default_Word_Boundaries)), "line aligned", etc. + +### Normalization + +Future API could include checks for whether the content is in a normal form. These could take the form of thorough checks, quick checks, and even mutating check-and-update-flag checks. + +### Transcoded views, normalized views, case-folded views, etc + +We could provide lazily transcoded, normalized, case-folded, etc., views. If we do any of these for `UTF8Span`, we should consider adding equivalents on `String`, `Substring`, etc. + +For example, transcoded views can be generalized: + +```swift +extension UTF8Span { + /// A view of the span's contents as a bidirectional collection of + /// transcoded `Encoding.CodeUnit`s. + @frozen + public struct TranscodedView { + public var span: UTF8Span + + @inlinable + public init(_ span: UTF8Span) + + ... + } +} +``` + +Note: UTF-16 has such historical significance that, even with a fully-generic transcoded view, we'd still want a dedicated, specialized type for UTF-16. + +We could similarly provide lazily-normalized views of code units or scalars under NFC or NFD (which the stdlib already distributes data tables for), possibly generic via a protocol for 3rd party normal forms. + +Finally, case-folded functionality can be accessed in today's Swift via [scalar properties](https://developer.apple.com/documentation/swift/unicode/scalar/properties-swift.struct), but we could provide convenience collections ourselves as well. + + +### Regex or regex-like support + +Future API additions would be to support `Regex`es on such spans. 
+ +Another future direction could be to add many routines corresponding to the underlying operations performed by the regex engine, such as: + +```swift +extension UTF8Span.CharacterView { + func matchCharacterClass( + _: CharacterClass, + startingAt: Index, + limitedBy: Index + ) throws -> Index? + + func matchQuantifiedCharacterClass( + _: CharacterClass, + _: QuantificationDescription, + startingAt: Index, + limitedBy: Index + ) throws -> Index? +} +``` + +which would be useful for parser-combinator libraries that wish to expose `String`'s model of Unicode by using the stdlib's accelerated implementation. + + +### Index rounding operations + +Unlike `String`, `UTF8Span`'s views' `Index` types are distinct, which avoids a [mess of problems](https://forums.swift.org/t/string-index-unification-vs-bidirectionalcollection-requirements/55946). Interesting additions to both `UTF8Span` and `String` would be explicit index-rounding for a desired behavior. + +### Canonical Spaceships + +Should a `ComparisonResult` (or [spaceship](https://forums.swift.org/t/pitch-comparison-reform/5662)) be added to Swift, we could support that operation under canonical equivalence in a single pass rather than subsequent calls to `isCanonicallyEquivalent(to:)` and `isCanonicallyLessThan(_:)`. + + +### Other Unicode functionality + +For the purposes of this pitch, we're not looking to expand the scope of functionality beyond what the stdlib already does in support of `String`'s API. Other functionality can be considered future work. + + +## Alternatives considered + + + +### Use the same Index type across views + + + + +### Deprecate `String.withUTF8` + +... mutating... + +### Alternate places or representations for UTF-8 `EncodingError`s + +**TODO**: Should `EncodingError.range` be a range of span indices instead, and we only have a span-based init? Should it be generic over the index type? Should it be inside of `Unicode.UTF8` instead? 
+ + + +- put it on `UTF8.EncodingError` +- make it generic over index type + - (but doesn't necessarily make more sense for null-terminated UTF-8 pointer) + + + + +### An unsafe UTF8 Buffer Pointer type + +An [earlier pitch](https://forums.swift.org/t/pitch-utf-8-processing-over-unsafe-contiguous-bytes/69715) proposed an unsafe version of `UTF8Span`. + +... + +## Acknowledgments + +Karoy Lorentey, Karl, Geordie_J, and fclout, contributed to this proposal with their clarifying questions and discussions. + From 41cec56cae9cae1c8a2263bd1e46e4ed19d4c366 Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Mon, 24 Jun 2024 17:28:42 -0600 Subject: [PATCH 04/16] Update to be a span --- proposals/nnnn-utf8-span.md | 1236 +++++++++++++++++++++-------------- 1 file changed, 759 insertions(+), 477 deletions(-) diff --git a/proposals/nnnn-utf8-span.md b/proposals/nnnn-utf8-span.md index 84e8c5d0fd..bdfad02e39 100644 --- a/proposals/nnnn-utf8-span.md +++ b/proposals/nnnn-utf8-span.md @@ -9,20 +9,22 @@ * Upcoming Feature Flag: (pending) * Review: ([pitch 1](https://forums.swift.org/t/pitch-utf-8-processing-over-unsafe-contiguous-bytes/69715)) + ## Introduction -We introduce `UTF8Span` for efficient and safe Unicode processing over contiguous storage. +We introduce `UTF8Span` for efficient and safe Unicode processing over contiguous storage. `UTF8Span` is a memory safe non-escapable type similar to `Span` (**TODO**: link span proposal). Native `String`s are stored as validly-encoded UTF-8 bytes in an internal contiguous memory buffer. The standard library implements `String`'s API as internal methods which operate on top of this buffer, taking advantage of the validly-encoded invariant and specialized Unicode knowledge. We propose making this UTF-8 buffer and its methods public as API for more advanced libraries and developers. 
## Motivation -Currently, if a developer wants to do `String`-like processing over UTF-8 bytes, they have to make an instance of `String`, which allocates a native storage class and copies all the bytes. The developer would then need to operate within the new `String`'s views and map between `String.Index` and byte offsets in the original buffer. +Currently, if a developer wants to do `String`-like processing over UTF-8 bytes, they have to make an instance of `String`, which allocates a native storage class, copies all the bytes, and is reference counted. The developer would then need to operate within the new `String`'s views and map between `String.Index` and byte offsets in the original buffer. -For example, if these bytes were part of a data structure, the developer would need to decide to either cache such a new `String` instance or recreate it on the fly. Caching more than doubles the size and adds caching complexity. Recreating it on the fly adds a linear time factor and class instance allocation/deallocation. +For example, if these bytes were part of a data structure, the developer would need to decide to either cache such a new `String` instance or recreate it on the fly. Caching more than doubles the size and adds caching complexity. Recreating it on the fly adds a linear time factor and class instance allocation/deallocation and potentially reference counting. Furthermore, `String` may not be available on all embedded platforms because its conformances to `Comparable` and `Collection` depend on data tables bundled with the stdlib. `UTF8Span` is a more appropriate type for these platforms, and only some explicit API make use of data tables. 
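The `String` round trip described in this section can be sketched with today's standard library. This is only an illustration: the `bytes` array and the character being searched for are made up for the example, and a real data structure would hold its bytes in its own storage.

```swift
// Illustrative input: the UTF-8 bytes of "café!" living in some other data
// structure (a plain array here, purely for the example).
let bytes: [UInt8] = [0x63, 0x61, 0x66, 0xC3, 0xA9, 0x21]

// Today's approach: allocate native `String` storage and copy every byte.
// Note that `String(decoding:as:)` repairs invalid input rather than
// reporting where an encoding error occurred.
let string = String(decoding: bytes, as: UTF8.self)

// Work in terms of `String.Index`, then map the result back to a byte
// offset that is meaningful for the original `bytes`.
var byteOffset: Int? = nil
if let index = string.firstIndex(of: "\u{E9}") {
  byteOffset = string.utf8.distance(from: string.utf8.startIndex, to: index)
}
```

The copy and the index-to-offset mapping are the costs that a span over the original bytes is meant to avoid.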
+**TODO** annotate those API as unavailable on embedded ### UTF-8 validity and efficiency @@ -36,8 +38,7 @@ Additionally, a particular scalar value in valid UTF-8 has only one encoding, bu ## Proposed solution -We propose a non-escapable `UTF8Span` which exposes a similar API surface as `String` for validly-encoded UTF-8 code units in contiguous memory. - +We propose a non-escapable `UTF8Span` which exposes a similar API surface as `String` for validly-encoded UTF-8 code units in contiguous memory. We also propose rich API describing the kind and location of encoding errors. ## Detailed design @@ -46,585 +47,830 @@ We propose a non-escapable `UTF8Span` which exposes a similar API surface as `St ```swift @frozen public struct UTF8Span: Copyable, ~Escapable { - @usableFromInline - internal var _start: Index + public var unsafeBaseAddress: UnsafeRawPointer /* A bit-packed count and flags (such as isASCII) - ┌───────┬──────────┬───────┐ - │ b63 │ b62:56 │ b56:0 │ - ├───────┼──────────┼───────┤ - │ ASCII │ reserved │ count │ - └───────┴──────────┴───────┘ - - Future bits could be used for all <0x300 scalar (aka <0xC0 byte) - flag which denotes the quickest NFC check, a quickCheck NFC - flag (using Unicode data tables), a full-check NFC flag, - single-scalar-grapheme-clusters flag, etc. + ╔═══════╦═════╦═════╦══════════╦═══════╗ + ║ b63 ║ b62 ║ b61 ║ b60:56 ║ b56:0 ║ + ╠═══════╬═════╬═════╬══════════╬═══════╣ + ║ ASCII ║ NFC ║ SSC ║ reserved ║ count ║ + ╚═══════╩═════╩═════╩══════════╩═══════╝ + ASCII means the contents are all-ASCII (<=0x7F). + NFC means contents are in normal form C for fast comparisons. + SSC means single-scalar Characters (i.e. grapheme clusters): every + `Character` holds only a single `Unicode.Scalar`. 
*/ @usableFromInline internal var _countAndFlags: UInt64 + + @inlinable @inline(__always) + init( + _unsafeAssumingValidUTF8 start: UnsafeRawPointer, + _countAndFlags: UInt64, + owner: borrowing Owner + ) -> dependsOn(owner) Self { } } + ``` +**TODO**: dependsOn(owner) or omit? + +**TODO**: Should we have null-termination support? A null-terminated UTF8Span has a NUL byte after its contents and contains no interior NULs. How would we ensure the NUL byte is exclusively borrowed by us? + +**TODO**: Should we track contains-newlines or only-newline-terminated? That would speed up Regex `.*` matching considerably. + ### Creation and validation `UTF8Span` is validated at initialization time, and encoding errors are diagnosed and thrown. ```swift extension Unicode.UTF8 { + /** + + The kind and location of a UTF-8 encoding error. + + Valid UTF-8 is represented by this table: + + ╔════════════════════╦════════╦════════╦════════╦════════╗ + ║ Scalar value ║ Byte 0 ║ Byte 1 ║ Byte 2 ║ Byte 3 ║ + ╠════════════════════╬════════╬════════╬════════╬════════╣ + ║ U+0000..U+007F ║ 00..7F ║ ║ ║ ║ + ║ U+0080..U+07FF ║ C2..DF ║ 80..BF ║ ║ ║ + ║ U+0800..U+0FFF ║ E0 ║ A0..BF ║ 80..BF ║ ║ + ║ U+1000..U+CFFF ║ E1..EC ║ 80..BF ║ 80..BF ║ ║ + ║ U+D000..U+D7FF ║ ED ║ 80..9F ║ 80..BF ║ ║ + ║ U+E000..U+FFFF ║ EE..EF ║ 80..BF ║ 80..BF ║ ║ + ║ U+10000..U+3FFFF ║ F0 ║ 90..BF ║ 80..BF ║ 80..BF ║ + ║ U+40000..U+FFFFF ║ F1..F3 ║ 80..BF ║ 80..BF ║ 80..BF ║ + ║ U+100000..U+10FFFF ║ F4 ║ 80..8F ║ 80..BF ║ 80..BF ║ + ╚════════════════════╩════════╩════════╩════════╩════════╝ + + ### Classifying errors + + An *unexpected continuation* is when a continuation byte (`10xxxxxx`) occurs + in a position that should be the start of a new scalar value. Unexpected + continuations can often occur when the input contains arbitrary data + instead of textual content. 
An unexpected continuation at the start of + input might mean that the input was not correctly sliced along scalar + boundaries or that it does not contain UTF-8. + + A *truncated scalar* is a multi-byte sequence that is the start of a valid + multi-byte scalar but is cut off before ending correctly. A truncated + scalar at the end of the input might mean that only part of the entire + input was received. + + A *surrogate code point* (`U+D800..U+DFFF`) is invalid UTF-8. Surrogate + code points are used by UTF-16 to encode scalars in the supplementary + planes. Their presence may mean the input was encoded in a different 8-bit + encoding, such as CESU-8, WTF-8, or Java's Modified UTF-8. + + An *invalid non-surrogate code point* is any code point higher than + `U+10FFFF`. This can often occur when the input is arbitrary data instead + of textual content. + + An *overlong encoding* occurs when a scalar value that could have been + encoded using fewer bytes is encoded in a longer byte sequence. Overlong + encodings are invalid UTF-8 and can lead to security issues if not + correctly detected: + + - https://nvd.nist.gov/vuln/detail/CVE-2008-2938 + - https://nvd.nist.gov/vuln/detail/CVE-2000-0884 + + An overlong encoding of `NUL`, `0xC0 0x80`, is used in Java's Modified + UTF-8 but is invalid UTF-8. Overlong encoding errors often catch attempts + to bypass security measures. + + ### Reporting the range of the error + + The range of the error reported follows the *Maximal subpart of an + ill-formed subsequence* algorithm in which each error is either one byte + long or ends before the first byte that is disallowed. See "U+FFFD + Substitution of Maximal Subparts" in the Unicode Standard. Unicode has + recommended this algorithm since version 6, and the W3C has adopted it. 
+ + The maximal subpart algorithm will produce a single multi-byte range for a + truncated scalar (a multi-byte sequence that is the start of a valid + multi-byte scalar but is cut off before ending correctly). For all other + errors (including overlong encodings, surrogates, and invalid code + points), it will produce an error per byte. + + Since overlong encodings, surrogates, and invalid code points are erroneous + by the second byte (at the latest), the above definition produces the same + ranges as defining such a sequence as a truncated scalar error followed by + unexpected continuation byte errors. The more semantically-rich + classification is reported. + + For example, a surrogate code point sequence `ED A0 80` will be reported + as three `.surrogateCodePointByte` errors rather than a `.truncatedScalar` + followed by two `.unexpectedContinuationByte` errors. + + Other commonly reported error ranges can be constructed from this result. + For example, PEP 383's error-per-byte can be constructed by mapping over + the reported range. Similarly, a single error for the longest + invalid byte range can be constructed by joining adjacent error ranges. 
+ + ╔═════════════════╦══════╦═════╦═════╦═════╦═════╦═════╦═════╦══════╗ + ║ ║ 61 ║ F1 ║ 80 ║ 80 ║ E1 ║ 80 ║ C2 ║ 62 ║ + ╠═════════════════╬══════╬═════╬═════╬═════╬═════╬═════╬═════╬══════╣ + ║ Longest range ║ U+61 ║ err ║ ║ ║ ║ ║ ║ U+62 ║ + ║ Maximal subpart ║ U+61 ║ err ║ ║ ║ err ║ ║ err ║ U+62 ║ + ║ Error per byte ║ U+61 ║ err ║ err ║ err ║ err ║ err ║ err ║ U+62 ║ + ╚═════════════════╩══════╩═════╩═════╩═════╩═════╩═════╩═════╩══════╝ + + */ + @frozen + public struct EncodingError: Error, Sendable, Hashable, Codable { + /// The kind of encoding error + public var kind: Unicode.UTF8.EncodingError.Kind + + /// The range of offsets into our input containing the error + public var range: Range + + @_alwaysEmitIntoClient + public init( + _ kind: Unicode.UTF8.EncodingError.Kind, + _ range: some RangeExpression + ) + + @_alwaysEmitIntoClient + public init(_ kind: Unicode.UTF8.EncodingError.Kind, at: Int) + } +} + +extension UTF8.EncodingError { /// The kind of encoding error encountered during validation @frozen - public struct EncodingErrorKind: Error, Sendable, Hashable, Codable { + public struct Kind: Error, Sendable, Hashable, Codable, RawRepresentable { public var rawValue: UInt8 @inlinable public init(rawValue: UInt8) + /// A continuation byte (`10xxxxxx`) outside of a multi-byte sequence + @_alwaysEmitIntoClient + public static var unexpectedContinuationByte: Self + + /// A byte in a surrogate code point (`U+D800..U+DFFF`) sequence + @_alwaysEmitIntoClient + public static var surrogateCodePointByte: Self + + /// A byte in an invalid, non-surrogate code point (`>U+10FFFF`) sequence @_alwaysEmitIntoClient - public static var unexpectedContinuationByte: Self { get } + public static var invalidNonSurrogateCodePointByte: Self + /// A byte in an overlong encoding sequence @_alwaysEmitIntoClient - public static var overlongEncoding: Self { get } + public static var overlongEncodingByte: Self + /// A multi-byte sequence that is the start of a valid multi-byte scalar 
+ /// but is cut off before ending correctly @_alwaysEmitIntoClient - public static var invalidCodePoint: Self { get } + public static var truncatedScalar: Self } } -``` - -**TODO**: Check all the kinds of errors we'd like to diagnose. Since this is a `RawRepresentable` struct, we can still extend it with a (finite) number of error kinds in the future. -```swift -extension UTF8Span { - /// The kind and location of invalidly-encoded UTF-8 bytes - @frozen - public struct EncodingError: Error, Sendable, Hashable, Codable { - /// The kind of encoding error - public var kind: Unicode.UTF8.EncodingErrorKind +extension UTF8.EncodingError.Kind: CustomStringConvertible { + public var description: String { get } +} - /// The range of offsets into our input containing the error - public var range: Range - } +extension UTF8.EncodingError: CustomStringConvertible { + public var description: String { get } +} +extension UTF8Span { public init( validating codeUnits: Span ) throws(EncodingError) -> dependsOn(codeUnits) Self - - public init( - nulTerminatedCString: UnsafeRawPointer, - owner: borrowing Owner - ) throws(EncodingError) -> dependsOn(owner) Self - - public init( - nulTerminatedCString: UnsafePointer, - owner: borrowing Owner - ) throws(EncodingError) -> dependsOn(owner) Self } ``` -### Views +**TODO**: null-terminated strings where we borrow and remember the terminator (and ensure there's no interior nulls)? -Similarly to `String`, `UTF8Span` exposes different ways to view the UTF-8 contents. +### Basic operations -`UTF8Span.UnicodeScalarView` corresponds to `String.UnicodeScalarView` for read-only purposes, however it is not `RangeReplaceable` as `UTF8Span` provides read-only access. Similarly, `UTF8Span.CharacterView` corresponds to `String`'s character view (i.e. its default view), `UTF8Span.UTF16View` to `String.UTF16View`, and `UTF8Span.CodeUnits` to `String.UTF8View`. 
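As a rough illustration of what "scalar-aligned" means at the byte level, the sketch below works on a plain `[UInt8]` that is assumed to already hold valid UTF-8. The helper names `isScalarAlignedSketch` and `scalarAlignBackwardsSketch` are hypothetical and are not part of the proposed API. Treating the end position as aligned is also an assumption of this sketch.

```swift
/// Hypothetical helper, not proposed API: in valid UTF-8, a position is on a
/// scalar boundary when it is the end of the buffer (an assumption made by
/// this sketch) or when the byte stored there is not a continuation byte
/// (`10xxxxxx`).
func isScalarAlignedSketch(_ bytes: [UInt8], _ i: Int) -> Bool {
  precondition(i >= 0 && i <= bytes.count, "position out of bounds")
  if i == bytes.count { return true }
  return (bytes[i] & 0xC0) != 0x80
}

/// Hypothetical helper, not proposed API: walk backwards until a scalar
/// boundary is found. In valid UTF-8 this takes at most three steps.
func scalarAlignBackwardsSketch(_ bytes: [UInt8], _ i: Int) -> Int {
  var position = i
  while position > 0 && !isScalarAlignedSketch(bytes, position) {
    position -= 1
  }
  return position
}

// The UTF-8 bytes of "café!": the scalar U+00E9 occupies offsets 3 and 4.
let sample: [UInt8] = [0x63, 0x61, 0x66, 0xC3, 0xA9, 0x21]
let middleOfScalarIsAligned = isScalarAlignedSketch(sample, 4)
let roundedDown = scalarAlignBackwardsSketch(sample, 4)
```

Because validity is established once at construction, a check like this only needs to inspect a single byte rather than re-decode the surrounding sequence.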
+#### Core Scalar API ```swift extension UTF8Span { - public typealias CodeUnits = Span - - @inlinable - public var codeUnits: CodeUnits { get } - - @frozen - public struct UnicodeScalarView: ~Escapable { - public let span: UTF8Span - - @inlinable - public init(_ span: UTF8Span) - } - - @inlinable - public var unicodeScalars: UnicodeScalarView { _read } - - @frozen - public struct CharacterView: ~Escapable { - public let span: UTF8Span - - @inlinable - public init(_ span: UTF8Span) - } + /// Whether `i` is on a boundary between Unicode scalar values. + @_alwaysEmitIntoClient + public func isScalarAligned(_ i: Int) -> Bool - @inlinable - public var characters: CharacterView { _read } + /// Whether `i` is on a boundary between Unicode scalar values. + /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. + @_alwaysEmitIntoClient + public func isScalarAligned(unchecked i: Int) -> Bool - @frozen - public struct UTF16View: ~Escapable { - public let span: UTF8Span + /// Whether `range`'s bounds are aligned to `Unicode.Scalar` boundaries. + @_alwaysEmitIntoClient + public func isScalarAligned(_ range: Range) -> Bool - @inlinable - public init(_ span: UTF8Span) - } + /// Whether `range`'s bounds are aligned to `Unicode.Scalar` boundaries. + /// + /// This function does not validate that `range` is within the span's bounds; + /// this is an unsafe operation. + @_alwaysEmitIntoClient + public func isScalarAligned(unchecked range: Range) -> Bool - @inlinable - public var utf16: UTF16View { _read } + /// Returns the start of the next `Unicode.Scalar` after the one starting at + /// `i`, or the end of the span if `i` denotes the final scalar. + /// + /// `i` must be scalar-aligned. + @_alwaysEmitIntoClient + public func nextScalarStart(_ i: Int) -> Int + + /// Returns the start of the next `Unicode.Scalar` after the one starting at + /// `i`, or the end of the span if `i` denotes the final scalar. 
+ /// + /// `i` must be scalar-aligned. + /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. + @_alwaysEmitIntoClient + public func nextScalarStart(unchecked i: Int) -> Int + + /// Returns the start of the next `Unicode.Scalar` after the one starting at + /// `i`, or the end of the span if `i` denotes the final scalar. + /// + /// `i` must be scalar-aligned. + /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. + /// + /// This function does not validate that `i` is scalar-aligned; this is an + /// unsafe operation if `i` isn't. + @_alwaysEmitIntoClient + public func nextScalarStart( + uncheckedAssumingAligned i: Int + ) -> Int + + /// Returns the start of the `Unicode.Scalar` ending at `i`, i.e. the scalar + /// before the one starting at `i` or the last scalar if `i` is the end of + /// the span. + /// + /// `i` must be scalar-aligned. + @_alwaysEmitIntoClient + public func previousScalarStart(_ i: Int) -> Int + + /// Returns the start of the `Unicode.Scalar` ending at `i`, i.e. the scalar + /// before the one starting at `i` or the last scalar if `i` is the end of + /// the span. + /// + /// `i` must be scalar-aligned. + /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. + @_alwaysEmitIntoClient + public func previousScalarStart(unchecked i: Int) -> Int + + /// Returns the start of the `Unicode.Scalar` ending at `i`, i.e. the scalar + /// before the one starting at `i` or the last scalar if `i` is the end of + /// the span. + /// + /// `i` must be scalar-aligned. + /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. + /// + /// + /// This function does not validate that `i` is scalar-aligned; this is an + /// unsafe operation if `i` isn't. 
+ @_alwaysEmitIntoClient + public func previousScalarStart( + uncheckedAssumingAligned i: Int + ) -> Int + + /// Decode the `Unicode.Scalar` starting at `i`. Return it and the start of + /// the next scalar. + /// + /// `i` must be scalar-aligned. + @_alwaysEmitIntoClient + public func decodeNextScalar( + _ i: Int + ) -> (Unicode.Scalar, nextScalarStart: Int) + + /// Decode the `Unicode.Scalar` starting at `i`. Return it and the start of + /// the next scalar. + /// + /// `i` must be scalar-aligned. + /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. + @_alwaysEmitIntoClient + public func decodeNextScalar( + unchecked i: Int + ) -> (Unicode.Scalar, nextScalarStart: Int) + + /// Decode the `Unicode.Scalar` starting at `i`. Return it and the start of + /// the next scalar. + /// + /// `i` must be scalar-aligned. + /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. + /// + /// + /// This function does not validate that `i` is scalar-aligned; this is an + /// unsafe operation if `i` isn't. + @_alwaysEmitIntoClient + public func decodeNextScalar( + uncheckedAssumingAligned i: Int + ) -> (Unicode.Scalar, nextScalarStart: Int) + + /// Decode the `Unicode.Scalar` ending at `i`, i.e. the previous scalar. + /// Return it and the start of that scalar. + /// + /// `i` must be scalar-aligned. + @_alwaysEmitIntoClient + public func decodePreviousScalar( + _ i: Int + ) -> (Unicode.Scalar, previousScalarStart: Int) + + /// Decode the `Unicode.Scalar` ending at `i`, i.e. the previous scalar. + /// Return it and the start of that scalar. + /// + /// `i` must be scalar-aligned. + /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. 
+ @_alwaysEmitIntoClient + public func decodePreviousScalar( + unchecked i: Int + ) -> (Unicode.Scalar, previousScalarStart: Int) + + /// Decode the `Unicode.Scalar` ending at `i`, i.e. the previous scalar. + /// Return it and the start of that scalar. + /// + /// `i` must be scalar-aligned. + /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. + /// + /// + /// This function does not validate that `i` is scalar-aligned; this is an + /// unsafe operation if `i` isn't. + @_alwaysEmitIntoClient + public func decodePreviousScalar( + uncheckedAssumingAligned i: Int + ) -> (Unicode.Scalar, previousScalarStart: Int) } -``` - -**TOOD**: `_read` vs `get`? `@inlinable` vs `@_alwaysEmitIntoClient`? -##### `Collection`-like API: - -Like `Span`, `UTF8Span` provides index and `Collection`-like API: +``` +#### Core Character API ```swift extension UTF8Span { - public typealias Index = RawSpan.Index -} - -extension UTF8Span.UnicodeScalarView { - @frozen - public struct Index: Comparable, Hashable { - public var position: UTF8Span.Index - - @inlinable - public init(_ position: UTF8Span.Index) - - @inlinable - public static func < ( - lhs: UTF8Span.UnicodeScalarView.Index, - rhs: UTF8Span.UnicodeScalarView.Index - ) -> Bool - } - - public typealias Element = Unicode.Scalar - - @frozen - public struct Iterator: ~Escapable { - public typealias Element = Unicode.Scalar - - public let span: UTF8Span - - public var position: UTF8Span.Index - - @inlinable - init(_ span: UTF8Span) - - @inlinable - public mutating func next() -> Unicode.Scalar? 
- } - - @inlinable - public borrowing func makeIterator() -> Iterator - - @inlinable - public var startIndex: Index { get } - - @inlinable - public var endIndex: Index { get } - - @inlinable - public var count: Int { get } - - @inlinable - public var isEmpty: Bool { get } - - @inlinable - public var indices: Range { get } - - @inlinable - public func index(after i: Index) -> Index - - @inlinable - public func index(before i: Index) -> Index - - @inlinable - public func index( - _ i: Index, offsetBy distance: Int, limitedBy limit: Index - ) -> Index? - - @inlinable - public func formIndex(after i: inout Index) - - @inlinable - public func formIndex(before i: inout Index) - - @inlinable - public func index(_ i: Index, offsetBy distance: Int) -> Index - - @inlinable - public func formIndex(_ i: inout Index, offsetBy distance: Int) - - @inlinable - public func formIndex( - _ i: inout Index, offsetBy distance: Int, limitedBy limit: Index - ) -> Bool - - @inlinable - public subscript(position: Index) -> Element { borrowing _read } - - @inlinable - public subscript(unchecked position: Index) -> Element { - borrowing _read - } - - @inlinable - public subscript(bounds: Range) -> Self { get } - - @inlinable - public subscript(unchecked bounds: Range) -> Self { - borrowing get - } - + /// Whether `i` is on a boundary between `Character`s (i.e. grapheme + /// clusters). @_alwaysEmitIntoClient - public subscript(bounds: some RangeExpression) -> Self { - borrowing get - } + public func isCharacterAligned(_ i: Int) -> Bool + /// Whether `i` is on a boundary between `Character`s (i.e. grapheme + /// clusters). + /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. @_alwaysEmitIntoClient - public subscript( - unchecked bounds: some RangeExpression - ) -> Self { - borrowing get - } + public func isCharacterAligned(unchecked i: Int) -> Bool + /// Returns the start of the next `Character` (i.e. 
grapheme cluster) after + /// the one starting at `i`, or the end of the span if `i` denotes the final + /// `Character`. + /// + /// `i` must be `Character`-aligned. @_alwaysEmitIntoClient - public subscript(x: UnboundedRange) -> Self { - borrowing get - } - - @inlinable - public func distance(from start: Index, to end: Index) -> Int - - @inlinable - public func elementsEqual(_ other: Self) -> Bool + public func nextCharacterStart(_ i: Int) -> Int + + /// Returns the start of the next `Character` (i.e. grapheme cluster) after + /// the one starting at `i`, or the end of the span if `i` denotes the final + /// `Character`. + /// + /// `i` must be `Character`-aligned. + /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. + @_alwaysEmitIntoClient + public func nextCharacterStart(unchecked i: Int) -> Int + + /// Returns the start of the next `Character` (i.e. grapheme cluster) after + /// the one starting at `i`, or the end of the span if `i` denotes the final + /// `Character`. + /// + /// `i` must be `Character`-aligned. + /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. + /// + /// This function does not validate that `i` is `Character`-aligned; this is + /// an unsafe operation if `i` isn't. + @_alwaysEmitIntoClient + public func nextCharacterStart( + uncheckedAssumingAligned i: Int + ) -> Int + + /// Returns the start of the `Character` (i.e. grapheme cluster) ending at + /// `i`, i.e. the `Character` before the one starting at `i` or the last + /// `Character` if `i` is the end of the span. + /// + /// `i` must be `Character`-aligned. + @_alwaysEmitIntoClient + public func previousCharacterStart(_ i: Int) -> Int + + /// Returns the start of the `Character` (i.e. grapheme cluster) ending at + /// `i`, i.e. the `Character` before the one starting at `i` or the last + /// `Character` if `i` is the end of the span. 
+ /// + /// `i` must be `Character`-aligned. + /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. + @_alwaysEmitIntoClient + public func previousCharacterStart(unchecked i: Int) -> Int + + /// Returns the start of the `Character` (i.e. grapheme cluster) ending at + /// `i`, i.e. the `Character` before the one starting at `i` or the last + /// `Character` if `i` is the end of the span. + /// + /// `i` must be `Character`-aligned. + /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. + /// + /// This function does not validate that `i` is `Character`-aligned; this is + /// an unsafe operation if `i` isn't. + @_alwaysEmitIntoClient + public func previousCharacterStart( + uncheckedAssumingAligned i: Int + ) -> Int + + /// Decode the `Character` starting at `i`. Return it and the start of the + /// next `Character`. + /// + /// `i` must be `Character`-aligned. + @_alwaysEmitIntoClient + public func decodeNextCharacter( + _ i: Int + ) -> (Character, nextCharacterStart: Int) + + /// Decode the `Character` starting at `i`. Return it and the start of the + /// next `Character`. + /// + /// `i` must be `Character`-aligned. + /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. + @_alwaysEmitIntoClient + public func decodeNextCharacter( + unchecked i: Int + ) -> (Character, nextCharacterStart: Int) + + /// Decode the `Character` starting at `i`. Return it and the start of the + /// next `Character`. + /// + /// `i` must be `Character`-aligned. + /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. + /// + /// This function does not validate that `i` is `Character`-aligned; this is + /// an unsafe operation if `i` isn't. 
+ @_alwaysEmitIntoClient + public func decodeNextCharacter( + uncheckedAssumingAligned i: Int + ) -> (Character, nextCharacterStart: Int) + + /// Decode the `Character` (i.e. grapheme cluster) ending at `i`, i.e. the + /// previous `Character`. Return it and the start of that `Character`. + /// + /// `i` must be `Character`-aligned. + @_alwaysEmitIntoClient + public func decodePreviousCharacter(_ i: Int) -> (Character, Int) + + /// Decode the `Character` (i.e. grapheme cluster) ending at `i`, i.e. the + /// previous `Character`. Return it and the start of that `Character`. + /// + /// `i` must be `Character`-aligned. + /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. + @_alwaysEmitIntoClient + public func decodePreviousCharacter( + unchecked i: Int + ) -> (Character, Int) + + /// Decode the `Character` (i.e. grapheme cluster) ending at `i`, i.e. the + /// previous `Character`. Return it and the start of that `Character`. + /// + /// `i` must be `Character`-aligned. + /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. + /// + /// This function does not validate that `i` is `Character`-aligned; this is + /// an unsafe operation if `i` isn't. + @_alwaysEmitIntoClient + public func decodePreviousCharacter( + uncheckedAssumingAligned i: Int + ) -> (Character, Int) - @inlinable - public func elementsEqual(_ other: some Sequence) -> Bool } -extension UTF8Span.CharacterView { - @frozen - public struct Index: Comparable, Hashable { - public var position: UTF8Span.Index - - @inlinable - public init(_ position: UTF8Span.Index) +``` - @inlinable - public static func < ( - lhs: UTF8Span.CharacterView.Index, - rhs: UTF8Span.CharacterView.Index - ) -> Bool - } +#### Derived Scalar operations - public typealias Element = Character +```swift +extension UTF8Span { + /// Find the nearest scalar-aligned position `<= i`. 
+ @_alwaysEmitIntoClient + public func scalarAlignBackwards(_ i: Int) -> Int - @frozen - public struct Iterator: ~Escapable { - public typealias Element = Character + /// Find the nearest scalar-aligned position `<= i`. + /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. + @_alwaysEmitIntoClient + public func scalarAlignBackwards(unchecked i: Int) -> Int - public let span: UTF8Span + /// Find the nearest scalar-aligned position `>= i`. + @_alwaysEmitIntoClient + public func scalarAlignForwards(_ i: Int) -> Int - public var position: UTF8Span.Index + /// Find the nearest scalar-aligned position `>= i`. + /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. + @_alwaysEmitIntoClient + public func scalarAlignForwards(unchecked i: Int) -> Int +} +``` - @inlinable - init(_ span: UTF8Span) +#### Derived Character operations - @inlinable - public mutating func next() -> Character? - } +```swift +extension UTF8Span { + /// Find the nearest `Character` (i.e. grapheme cluster)-aligned position + /// that is `<= i`. + @_alwaysEmitIntoClient + public func characterAlignBackwards(_ i: Int) -> Int - @inlinable - public borrowing func makeIterator() -> Iterator + /// Find the nearest `Character` (i.e. grapheme cluster)-aligned position + /// that is `<= i`. + /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. + @_alwaysEmitIntoClient + public func characterAlignBackwards(unchecked i: Int) -> Int - @inlinable - public var startIndex: Index { get } + /// Find the nearest `Character` (i.e. grapheme cluster)-aligned position + /// that is `>= i`. + @_alwaysEmitIntoClient + public func characterAlignForwards(_ i: Int) -> Int - @inlinable - public var endIndex: Index { get } + /// Find the nearest `Character` (i.e. grapheme cluster)-aligned position + /// that is `>= i`. 
+ /// + /// This function does not validate that `i` is within the span's bounds; + /// this is an unsafe operation. + @_alwaysEmitIntoClient + public func characterAlignForwards(unchecked i: Int) -> Int +} +``` - @inlinable - public var count: Int { get } +### Collection-like API - @inlinable - public var isEmpty: Bool { get } +#### Comparisons - @inlinable - public var indices: Range { get } +```swift +extension UTF8Span { + /// Whether this span has the same bytes as `other`. + @_alwaysEmitIntoClient + public func bytesEqual(to other: UTF8Span) -> Bool - @inlinable - public func index(after i: Index) -> Index + /// Whether this span has the same bytes as `other`. + @_alwaysEmitIntoClient + public func bytesEqual(to other: some Sequence) -> Bool - @inlinable - public func index(before i: Index) -> Index + /// Whether this span has the same `Unicode.Scalar`s as `other`. + @_alwaysEmitIntoClient + public func scalarsEqual( + to other: some Sequence + ) -> Bool - @inlinable - public func index( - _ i: Index, offsetBy distance: Int, limitedBy limit: Index - ) -> Index? + /// Whether this span has the same `Character`s as `other`. + @_alwaysEmitIntoClient + public func charactersEqual( + to other: some Sequence + ) -> Bool - @inlinable - public func formIndex(after i: inout Index) +} +``` - @inlinable - public func formIndex(before i: inout Index) +**TODO**: lexicographically less than? `std::mismatch`? others? - @inlinable - public func index(_ i: Index, offsetBy distance: Int) -> Index +#### Canonical equivalence and ordering - @inlinable - public func formIndex(_ i: inout Index, offsetBy distance: Int) +`UTF8Span` can perform Unicode canonical equivalence checks (i.e. the semantics of `String.==` and `Character.==`). - @inlinable - public func formIndex( - _ i: inout Index, offsetBy distance: Int, limitedBy limit: Index +```swift +extension UTF8Span { + /// Whether `self` is equivalent to `other` under Unicode Canonical + /// Equivalence. 
+ public func isCanonicallyEquivalent( + to other: UTF8Span ) -> Bool - @inlinable - public subscript(position: Index) -> Element { borrowing _read } - - @inlinable - public subscript(unchecked position: Index) -> Element { - borrowing _read - } + /// Whether `self` orders less than `other` under Unicode Canonical + /// Equivalance using normalized code-unit order (in NFC). + public func isCanonicallyLessThan( + _ other: UTF8Span + ) -> Bool +} +``` - @inlinable - public subscript(bounds: Range) -> Self { get } +#### Extracting sub-spans - @inlinable - public subscript(unchecked bounds: Range) -> Self { - borrowing get - } +Similarly to `Span`, we support subscripting and extracting sub-spans. Since a `UTF8Span` is always validly-encoded UTF-8, extracting must happen along Unicode scalar boundaries. +```swift +extension UTF8Span { + /// Constructs a new `UTF8Span` span over the bytes within the supplied + /// range of positions within this span. + /// + /// `bounds` must be scalar aligned. + /// + /// The returned span's first item is always at offset 0; unlike buffer + /// slices, extracted spans do not generally share their indices with the + /// span from which they are extracted. + /// + /// - Parameter bounds: A valid range of positions. Every position in + /// this range must be within the bounds of this `Span`. + /// + /// - Returns: A `UTF8Span` over the bytes within `bounds`. @_alwaysEmitIntoClient - public subscript(bounds: some RangeExpression) -> Self { - borrowing get - } - + public func extracting(_ bounds: some RangeExpression) -> Self + + /// Constructs a new `UTF8Span` span over the bytes within the supplied + /// range of positions within this span. + /// + /// `bounds` must be scalar aligned. + /// + /// This function does not validate that `bounds` is within the span's + /// bounds; this is an unsafe operation. 
+ /// + /// The returned span's first item is always at offset 0; unlike buffer + /// slices, extracted spans do not generally share their indices with the + /// span from which they are extracted. + /// + /// - Parameter bounds: A valid range of positions. Every position in + /// this range must be within the bounds of this `Span`. + /// + /// - Returns: A `UTF8Span` over the bytes within `bounds`. @_alwaysEmitIntoClient - public subscript( - unchecked bounds: some RangeExpression - ) -> Self { - borrowing get - } - + public func extracting( + unchecked bounds: some RangeExpression + ) -> Self + + /// Constructs a new `UTF8Span` span over the bytes within the supplied + /// range of positions within this span. + /// + /// This function does not validate that `bounds` is within the span's + /// bounds; this is an unsafe operation. + /// + /// This function does not validate that `bounds` is scalar aligned; + /// this is an unsafe operation. + /// + /// The returned span's first item is always at offset 0; unlike buffer + /// slices, extracted spans do not generally share their indices with the + /// span from which they are extracted. + /// + /// - Parameter bounds: A valid range of positions. Every position in + /// this range must be within the bounds of this `Span`. + /// + /// - Returns: A `UTF8Span` over the bytes within `bounds`.
@_alwaysEmitIntoClient - public subscript(x: UnboundedRange) -> Self { - borrowing get - } - - @inlinable - public func distance(from start: Index, to end: Index) -> Int - - @inlinable - public func elementsEqual(_ other: Self) -> Bool - - @inlinable - public func elementsEqual(_ other: some Sequence) -> Bool + public func extracting( + uncheckedAssumingAligned bounds: some RangeExpression + ) -> Self } -extension UTF8Span.UTF16View { - @frozen - public struct Index: Comparable, Hashable { - @usableFromInline - internal var _rawValue: UInt64 - - @inlinable - public var position: UTF8Span.Index { get } - - /// Whether this index is referring to the second code unit of a non-BMP - /// Unicode Scalar value. - @inlinable - public var secondCodeUnit: Bool { get } - - @inlinable - public init(_ position: UTF8Span.Index, secondCodeUnit: Bool) - - @inlinable - public static func < ( - lhs: UTF8Span.UTF16View.Index, - rhs: UTF8Span.UTF16View.Index - ) -> Bool - } - - public typealias Element = UInt16 - - @frozen - public struct Iterator: ~Escapable { - public typealias Element = UInt16 - - public let span: UTF8Span - - public var index: UTF8Span.UTF16View.Index - - @inlinable - init(_ span: UTF8Span) - - @inlinable - public mutating func next() -> UInt16? - } - - @inlinable - public borrowing func makeIterator() -> Iterator - - @inlinable - public var startIndex: Index { get } - - @inlinable - public var endIndex: Index { get } +``` - @inlinable - public var count: Int { get } +#### Misc. - @inlinable +```swift +extension UTF8Span { + @_alwaysEmitIntoClient public var isEmpty: Bool { get } - @inlinable - public var indices: Range { get } - - @inlinable - public func index(after i: Index) -> Index - - @inlinable - public func index(before i: Index) -> Index - - @inlinable - public func index( - _ i: Index, offsetBy distance: Int, limitedBy limit: Index - ) -> Index? 
- - @inlinable - public func formIndex(after i: inout Index) - - @inlinable - public func formIndex(before i: inout Index) - - @inlinable - public func index(_ i: Index, offsetBy distance: Int) -> Index - - @inlinable - public func formIndex(_ i: inout Index, offsetBy distance: Int) - - @inlinable - public func formIndex( - _ i: inout Index, offsetBy distance: Int, limitedBy limit: Index - ) -> Bool - - @inlinable - public subscript(position: Index) -> Element { borrowing _read } - - @inlinable - public subscript(unchecked position: Index) -> Element { - borrowing _read - } - - @inlinable - public subscript(bounds: Range) -> Self { get } - - @inlinable - public subscript(unchecked bounds: Range) -> Self { - borrowing get - } - @_alwaysEmitIntoClient - public subscript(bounds: some RangeExpression) -> Self { - borrowing get - } + public var storage: Span { get } + /// Whether `i` is in bounds @_alwaysEmitIntoClient - public subscript( - unchecked bounds: some RangeExpression - ) -> Self { - borrowing get + public func boundsCheck(_ i: Int) -> Bool { + i >= 0 && i < count } + /// Whether `bounds` is in bounds @_alwaysEmitIntoClient - public subscript(x: UnboundedRange) -> Self { - borrowing get + public func boundsCheck(_ bounds: Range) -> Bool + + /// Calls a closure with a pointer to the viewed contiguous storage. + /// + /// The buffer pointer passed as an argument to `body` is valid only + /// during the execution of `withUnsafeBufferPointer(_:)`. + /// Do not store or return the pointer for later use. + /// + /// - Parameter body: A closure with an `UnsafeBufferPointer` parameter + /// that points to the viewed contiguous storage. If `body` has + /// a return value, that value is also used as the return value + /// for the `withUnsafeBufferPointer(_:)` method. The closure's + /// parameter is valid only for the duration of its execution. + /// - Returns: The return value of the `body` closure parameter. 
+ @_alwaysEmitIntoClient + borrowing public func withUnsafeBufferPointer< + E: Error, Result: ~Copyable & ~Escapable + >( + _ body: (_ buffer: borrowing UnsafeBufferPointer) throws(E) -> Result + ) throws(E) -> dependsOn(self) Result { + try body(unsafeBaseAddress._ubp(0.. Int - - @inlinable - public func elementsEqual(_ other: Self) -> Bool - - @inlinable - public func elementsEqual(_ other: some Sequence) -> Bool } ``` ### Queries +`UTF8Span` checks at construction time and remembers whether its contents are all ASCII. Additional checks can be requested and remembered. + ```swift extension UTF8Span { /// Returns whether the validated contents were all-ASCII. This is checked at /// initialization time and remembered. - @inlinable + @inlinable @inline(__always) public var isASCII: Bool { get } - /// Whether `i` is on a boundary between Unicode scalar values - @inlinable - public func isScalarAligned(_ i: UTF8Span.Index) -> Bool - - /// Whether `i` is on a boundary between `Character`s, i.e. extended grapheme clusters. - @inlinable - public func isCharacterAligned(_ i: UTF8Span.Index) -> Bool - - /// Whether `self` is equivalent to `other` under Unicode Canonical Equivalance - public func isCanonicallyEquivalent(to other: UTF8Span) -> Bool + /// Returns whether the contents are known to be NFC. This is not + /// always checked at initialization time and is set by `checkForNFC`. + @inlinable @inline(__always) + public var isKnownNFC: Bool { get } + + /// Do a scan checking for whether the contents are in Normal Form C. + /// When the contents are in NFC, canonical equivalence checks are much + /// faster. + /// + /// `quickCheck` will check for a subset of NFC contents using the + /// NFCQuickCheck algorithm, which is faster than the full normalization + /// algorithm. However, it cannot detect all NFC contents. + /// + /// Updates the `isKnownNFC` bit. 
+ public mutating func checkForNFC( + quickCheck: Bool + ) -> Bool - /// Whether `self` orders less than `other` under Unicode Canonical Equivalance - /// using normalized code-unit order - public func isCanonicallyLessThan(_ other: UTF8Span) -> Bool + /// Returns whether every `Character` (i.e. grapheme cluster) + /// is known to be comprised of a single `Unicode.Scalar`. + /// + /// This is not always checked at initialization time. It is set by + /// `checkForSingleScalarCharacters`. + @inlinable @inline(__always) + public var isKnownSingleScalarCharacters: Bool { get } + + /// Do a scan, checking whether every `Character` (i.e. grapheme cluster) + /// is comprised of only a single `Unicode.Scalar`. When a span contains + /// only single-scalar characters, character operations are much faster. + /// + /// `quickCheck` will check for a subset of single-scalar character contents + /// using a faster algorithm than the full grapheme breaking algorithm. + /// However, it cannot detect all single-scalar `Character` contents. + /// + /// Updates the `isKnownSingleScalarCharacters` bit. + public mutating func checkForSingleScalarCharacters( + quickCheck: Bool + ) -> Bool } ``` -### Additions to `String` and `RawSpan` - -We extend `String` with the ability to access its backing `UTF8Span`: +### Spans from strings ```swift extension String { - // TODO: note that a copy may happen if `String` is not native... - public var utf8Span: UTF8Span { - // TODO: how to do this well, considering we also have small - // strings - } + /// ... note that a copy may happen if `String` is not native... + public var utf8Span: UTF8Span { _read } } extension Substring { - // TODO: needs scalar alignment (check Substring's invariants) - // TODO: note that a copy may happen if `String` is not native... - public var utf8Span: UTF8Span { - // TODO: how to do this well, considering we also have small - // strings - } + // ... note that a copy may happen if `Substring` is not native... 
+ public var utf8Span: UTF8Span { _read } } ``` -Additionally, we extend `RawSpan`'s byte parsing support with helpers for parsing validly-encoded UTF-8. -```swift -extension RawSpan { - public func parseUTF8( - _ position: inout Index, length: Int - ) throws -> UTF8Span - - public func parseNullTermiantedUTF8( - _ position: inout Index - ) throws -> UTF8Span -} - -extension RawSpan.Cursor { - public mutating func parseUTF8(length: Int) throws -> UTF8Span - - public mutating func parseNullTermiantedUTF8() throws -> UTF8Span -} -``` ## Source compatibility @@ -640,14 +886,32 @@ The additions described in this proposal require a new version of the standard l ## Future directions - ### More alignments Future API could include whether an index is "word aligned" (either [simple](https://www.unicode.org/reports/tr18/#Simple_Word_Boundaries) or [default](https://www.unicode.org/reports/tr18/#Default_Word_Boundaries)), "line aligned", etc. ### Normalization -Future API could include checks for whether the content is in a normal form. These could take the form of thorough checks, quick checks, and even mutating check-and-update-flag checks. +Future API could include checks for whether the content is in a particular normal form (not just NFC). + +### UnicodeScalarView and CharacterView + +Like `Span`, we are deferring adding any collection-like types to non-escapable `UTF8Span`. Future work includes adding view types and corresponding iterators. + +For an example implementation of those see **TODO**: link to test in repo + +### Returning all the encoding errors + +Future work includes returning all the encoding errors found in a given input. 
+ +```swift +extension UTF8 { + public static func checkAllErrors( + _ s: some Sequence + ) -> some Sequence +``` + +See **TODO**: link to example implementation ### Transcoded views, normalized views, case-folded views, etc @@ -657,7 +921,7 @@ For example, transcoded views can be generalized: ```swift extension UTF8Span { - /// A view off the span's contents as a bidirectional collection of + /// A view of the span's contents as a bidirectional collection of /// transcoded `Encoding.CodeUnit`s. @frozen public struct TranscodedView { @@ -671,8 +935,6 @@ extension UTF8Span { } ``` -Note: UTF-16 has such historical significance that, even with a fully-generic transcoded view, we'd still want a dedicated, specialized type for UTF-16. - We could similarly provide lazily-normalized views of code units or scalars under NFC or NFD (which the stdlib already distributes data tables for), possibly generic via a protocol for 3rd party normal forms. Finally, case-folded functionality can be accessed in today's Swift via [scalar properties](https://developer.apple.com/documentation/swift/unicode/scalar/properties-swift.struct), but we could provide convenience collections ourselves as well. @@ -680,7 +942,7 @@ Finally, case-folded functionality can be accessed in today's Swift via [scalar ### Regex or regex-like support -Future API additions would be to support `Regex`es on such spans. +Future API additions would be to support `Regex`es on `UTF8Span`. We'd expose grapheme-level semantics, scalar-level semantics, and introduce byte-level semantics. Another future direction could be to add many routines corresponding to the underlying operations performed by the regex engine, such as: @@ -704,10 +966,6 @@ extension UTF8Span.CharacterView { which would be useful for parser-combinator libraries who wish to expose `String`'s model of Unicode by using the stdlib's accelerated implementation. 
-### Index rounding operations - -Unlike String, `UTF8Span`'s view's `Index` types are distinct, which avoids a [mess of problems](https://forums.swift.org/t/string-index-unification-vs-bidirectionalcollection-requirements/55946). Interesting additions to both `UTF8Span` and `String` would be explicit index-rounding for a desired behavior. - ### Canonical Spaceships Should a `ComparisonResult` (or [spaceship](https://forums.swift.org/t/pitch-comparison-reform/5662)) be added to Swift, we could support that operation under canonical equivalence in a single pass rather than subsequent calls to `isCanonicallyEquivalent(to:)` and `isCanonicallyLessThan(_:)`. @@ -718,39 +976,63 @@ Should a `ComparisonResult` (or [spaceship](https://forums.swift.org/t/pitch-com For the purposes of this pitch, we're not looking to expand the scope of functionality beyond what the stdlib already does in support of `String`'s API. Other functionality can be considered future work. -## Alternatives considered +### Exposing `String`'s storage class + +String's internal storage class is null-terminated valid UTF-8 (by substituting replacement characters) and implements range-replaceable operations along scalar boundaries. We could consider exposing the storage class itself, which might be useful for embedded platforms that don't have `String`. +### Yield UTF8Spans in byte parsers +Span's proposal mentions a future direction of byte parsing helpers on a `Cursor` or `Iterator` type (**TODO**: link to span proposal section). We could extend these types (or analogous types on `Span`) with UTF-8 parsing code: -### Use the same Index type across views +```swift +extension RawSpan.Cursor { + public mutating func parseUTF8(length: Int) throws -> UTF8Span + public mutating func parseNullTermiantedUTF8() throws -> UTF8Span +} +``` -### Deprecate `String.withUTF8` +## Alternatives considered -... mutating... 
+### Invalid start / end of input UTF-8 encoding errors -### Alternate places or representations for UTF-8 `EncodingError`s +Earlier prototypes had `.invalidStartOfInput` and `.invalidEndOfInput` UTF-8 validation errors to communicate that the input was perhaps incomplete or not sliced along scalar boundaries. In this scenario, `.invalidStartOfInput` is equivalent to `.unexpectedContinuation` with the range's lower bound equal to 0 and `.invalidEndOfInput` is equivalent to `.truncatedScalar` with the range's upper bound equal to `count`. -**TODO**: Should `EncodingError.range` be a range of span indices instead, and we only have a span-based init? Should it be generic over the index type? Should it be inside of `Unicode.UTF8` instead? +This was rejected so as to not have two ways to encode the same error. There is no loss of information and `.unexpectedContinuation`/`.truncatedScalar` with ranges are more semantically precise. +### An unsafe UTF8 Buffer Pointer type +An [earlier pitch](https://forums.swift.org/t/pitch-utf-8-processing-over-unsafe-contiguous-bytes/69715) proposed an unsafe version of `UTF8Span`. Now that we have `~Escapable`, a memory-safe `UTF8Span` is better. -- put it on `UTF8.EncodingError` -- make it generic over index type - - (but doesn't necessarily make more sense for null-terminated UTF-8 pointer) +### Other names for basic operations +Alternative names for `nextScalarStart(_:)` and `previousScalarStart(_:)` could be something like `scalarEnd(startingAt:)` and `scalarStart(endingAt:)`. Similarly, `decodeNextScalar(_:)` and `decodePreviousScalar(_:)` could be `decodeScalar(startingAt:)` and `decodeScalar(endingAt:)`. These names are similar to `index(after:)` and `index(before:)`. +However, in practice this buries the direction deeper into the argument label and is more confusing than the `index(before/after:)` analogues. This is especially true when the argument label contains `unchecked` or `uncheckedAssumingAligned`.
+That being said, these names are definitely bikesheddable and we'd like suggestions from the community. -### An unsafe UTF8 Buffer Pointer type -An [earlier pitch](https://forums.swift.org/t/pitch-utf-8-processing-over-unsafe-contiguous-bytes/69715) proposed an unsafe version of `UTF8Span`. +### Other bounds or alignment checked formulations + +For many operations that take an index that needs to be appropriately aligned, we propose `foo(_:)`, `foo(unchecked:)`, and `foo(uncheckedAssumingAligned:)`. + +`foo(_:)` and `foo(unchecked:)` have analogues in `Span` and `foo(uncheckedAssumingAligned:)` is the lowest level interface that a type such as `Iterator` would call (since it maintains index validity and alignment as an invariant). + +We could additionally have a `foo(assumingAligned:)` overload that does bounds checking, but it's unclear what the use case would be. + +Another alternative is to only have a variant that skips both bounds and alignment checks and call it `foo(unchecked:)`. However, this use of `unchecked:` is far more nuanced than `Span`'s and it's not the case that any `i` in `0.. 
Date: Mon, 24 Jun 2024 18:21:31 -0600 Subject: [PATCH 05/16] Link to impl --- proposals/nnnn-utf8-span.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proposals/nnnn-utf8-span.md b/proposals/nnnn-utf8-span.md index bdfad02e39..df897c9b0b 100644 --- a/proposals/nnnn-utf8-span.md +++ b/proposals/nnnn-utf8-span.md @@ -5,7 +5,7 @@ * Review Manager: TBD * Status: **Awaiting implementation** * Bug: rdar://48132971, rdar://96837923 -* Implementation: (pending) +* Implementation: [Prototype](https://github.com/apple/swift-collections/pull/394) * Upcoming Feature Flag: (pending) * Review: ([pitch 1](https://forums.swift.org/t/pitch-utf-8-processing-over-unsafe-contiguous-bytes/69715)) From 7e2657ab85d46e9ea6dd81cbdf5bd97f6407f375 Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Tue, 25 Jun 2024 09:05:30 -0600 Subject: [PATCH 06/16] Clean up todos --- proposals/nnnn-utf8-span.md | 67 ++++++++++++++++++++++--------------- 1 file changed, 40 insertions(+), 27 deletions(-) diff --git a/proposals/nnnn-utf8-span.md b/proposals/nnnn-utf8-span.md index df897c9b0b..e0f4f46d0b 100644 --- a/proposals/nnnn-utf8-span.md +++ b/proposals/nnnn-utf8-span.md @@ -12,7 +12,7 @@ ## Introduction -We introduce `UTF8Span` for efficient and safe Unicode processing over contiguous storage. `UTF8Span` is a memory safe non-escapable type similar to `Span` (**TODO**: link span proposal). +We introduce `UTF8Span` for efficient and safe Unicode processing over contiguous storage. `UTF8Span` is a memory safe non-escapable type [similar to `Span`](https://github.com/swiftlang/swift-evolution/pull/2307). Native `String`s are stored as validly-encoded UTF-8 bytes in an internal contiguous memory buffer. The standard library implements `String`'s API as internal methods which operate on top of this buffer, taking advantage of the validly-encoded invariant and specialized Unicode knowledge. 
We propose making this UTF-8 buffer and its methods public as API for more advanced libraries and developers. @@ -24,9 +24,6 @@ For example, if these bytes were part of a data structure, the developer would n Furthermore, `String` may not be available on all embedded platforms due to the fact that its conformances to `Comparable` and `Collection` depend on data tables bundled with the stdlib. `UTF8Span` is a more appropriate type for these platforms, and only some explicit API make use of data tables. -**TODO** annotate those API as unavailable on embedded - - ### UTF-8 validity and efficiency UTF-8 validation is a particularly common concern and the subject of a fair amount of [research](https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/). Once an input is known to be validly encoded UTF-8, subsequent operations such as decoding, grapheme breaking, comparison, etc., can be implemented much more efficiently under this assumption of validity. Swift's `String` type's native storage is guaranteed-valid-UTF8 for this reason. @@ -58,7 +55,7 @@ public struct UTF8Span: Copyable, ~Escapable { ║ ASCII ║ NFC ║ SSC ║ reserved ║ count ║ ╚═══════╩═════╩═════╩══════════╩═══════╝ - ASCII means the contents are all-ASCII (<= 0x7F). + ASCII means the contents are all-ASCII (<= 0x7F). NFC means contents are in normal form C for fast comparisons. SSC means single-scalar Characters (i.e. grapheme clusters): every `Character` holds only a single `Unicode.Scalar`. @@ -76,12 +73,6 @@ public struct UTF8Span: Copyable, ~Escapable { ``` -**TODO**: dependsOn(owner) or omit? - -**TODO**: Should we have null-termination support? A null-terminated UTF8Span has a NUL byte after its contents and contains no interior NULs. How would we ensure the NUL byte is exclusively borrowed by us? - -**TODO**: Should we track contains-newlines or only-newline-terminated? That would speed up Regex `.*` matching considerably.
- ### Creation and validation `UTF8Span` is validated at initialization time, and encoding errors are diagnosed and thrown. @@ -232,10 +223,12 @@ extension UTF8.EncodingError { } } +@_unavailableInEmbedded extension UTF8.EncodingError.Kind: CustomStringConvertible { public var description: String { get } } +@_unavailableInEmbedded extension UTF8.EncodingError: CustomStringConvertible { public var description: String { get } } @@ -247,8 +240,6 @@ extension UTF8Span { } ``` -**TODO**: null-terminated strings where we borrow and remember the terminator (and ensure there's no interior nulls)? - ### Basic operations #### Core Scalar API @@ -354,7 +345,7 @@ extension UTF8Span { _ i: Int ) -> (Unicode.Scalar, nextScalarStart: Int) - /// Decode the `Unicode.Scalar` starting at `i`. Return it and the start of + /// Decode the `Unicode.Scalar` starting at `i`. Return it and the start of /// the next scalar. /// /// `i` must be scalar-aligned. @@ -425,6 +416,7 @@ extension UTF8Span { #### Core Character API ```swift +@_unavailableInEmbedded extension UTF8Span { /// Whether `i` is on a boundary between `Character`s (i.e. grapheme /// clusters). @@ -614,6 +606,7 @@ extension UTF8Span { #### Derived Character operations ```swift +@_unavailableInEmbedded extension UTF8Span { /// Find the nearest `Character` (i.e. grapheme cluster)-aligned position /// that is `<= i`. @@ -664,6 +657,7 @@ extension UTF8Span { ) -> Bool /// Whether this span has the same `Character`s as `other`. + @_unavailableInEmbedded @_alwaysEmitIntoClient public func charactersEqual( to other: some Sequence @@ -672,8 +666,6 @@ extension UTF8Span { } ``` -**TODO**: lexicographically less than? `std::mismatch`? others? - #### Canonical equivalence and ordering `UTF8Span` can perform Unicode canonical equivalence checks (i.e. the semantics of `String.==` and `Character.==`). 
@@ -682,12 +674,14 @@ extension UTF8Span { extension UTF8Span { /// Whether `self` is equivalent to `other` under Unicode Canonical /// Equivalance. + @_unavailableInEmbedded public func isCanonicallyEquivalent( to other: UTF8Span ) -> Bool - /// Whether `self` orders less than `other` under Unicode Canonical + /// Whether `self` orders less than `other` under Unicode Canonical /// Equivalance using normalized code-unit order (in NFC). + @_unavailableInEmbedded public func isCanonicallyLessThan( _ other: UTF8Span ) -> Bool @@ -819,17 +813,19 @@ extension UTF8Span { /// Returns whether the contents are known to be NFC. This is not /// always checked at initialization time and is set by `checkForNFC`. @inlinable @inline(__always) + @_unavailableInEmbedded public var isKnownNFC: Bool { get } /// Do a scan checking for whether the contents are in Normal Form C. /// When the contents are in NFC, canonical equivalence checks are much /// faster. /// - /// `quickCheck` will check for a subset of NFC contents using the + /// `quickCheck` will check for a subset of NFC contents using the /// NFCQuickCheck algorithm, which is faster than the full normalization /// algorithm. However, it cannot detect all NFC contents. /// /// Updates the `isKnownNFC` bit. + @_unavailableInEmbedded public mutating func checkForNFC( quickCheck: Bool ) -> Bool @@ -839,6 +835,7 @@ extension UTF8Span { /// /// This is not always checked at initialization time. It is set by /// `checkForSingleScalarCharacters`. + @_unavailableInEmbedded @inlinable @inline(__always) public var isKnownSingleScalarCharacters: Bool { get } @@ -851,6 +848,7 @@ extension UTF8Span { /// However, it cannot detect all single-scalar `Character` contents. /// /// Updates the `isKnownSingleScalarCharacters` bit. 
+ @_unavailableInEmbedded public mutating func checkForSingleScalarCharacters( quickCheck: Bool ) -> Bool @@ -860,10 +858,13 @@ extension UTF8Span { ### Spans from strings ```swift +@_unavailableInEmbedded extension String { /// ... note that a copy may happen if `String` is not native... public var utf8Span: UTF8Span { _read } } + +@_unavailableInEmbedded extension Substring { // ... note that a copy may happen if `Substring` is not native... public var utf8Span: UTF8Span { _read } @@ -896,11 +897,19 @@ Future API could include checks for whether the content is in a particular norma ### UnicodeScalarView and CharacterView -Like `Span`, we are deferring adding any collection-like types to non-escapable `UTF8Span`. Future work includes adding view types and corresponding iterators. +Like `Span`, we are deferring adding any collection-like types to non-escapable `UTF8Span`. Future work includes adding view types and corresponding iterators. + +For an example implementation of those see [the `UTFSpanViews.swift` test file](https://github.com/apple/swift-collections/pull/394). + +### More Collectiony algorithms + +We propose equality checks (e.g. `scalarsEqual`), as those are incredibly common and useful operations. We have (tentatively) deferred other algorithms until non-escapable collections are figured out. -For an example implementation of those see **TODO**: link to test in repo +However, we can add select high-value algorithms if motivated by the community. We'd want to -### Returning all the encoding errors + + +### More validation API Future work includes returning all the encoding errors found in a given input. @@ -911,7 +920,7 @@ extension UTF8 { ) -> some Sequence ``` -See **TODO**: link to example implementation +See [`_checkAllErrors` in `UTF8EncodingError.swift`](https://github.com/apple/swift-collections/pull/394). 
### Transcoded views, normalized views, case-folded views, etc @@ -921,7 +930,7 @@ For example, transcoded views can be generalized: ```swift extension UTF8Span { - /// A view of the span's contents as a bidirectional collection of + /// A view of the span's contents as a bidirectional collection of /// transcoded `Encoding.CodeUnit`s. @frozen public struct TranscodedView { @@ -951,14 +960,14 @@ extension UTF8Span.CharacterView { func matchCharacterClass( _: CharacterClass, startingAt: Index, - limitedBy: Index + limitedBy: Index ) throws -> Index? func matchQuantifiedCharacterClass( _: CharacterClass, _: QuantificationDescription, startingAt: Index, - limitedBy: Index + limitedBy: Index ) throws -> Index? } ``` @@ -982,7 +991,7 @@ String's internal storage class is null-terminated valid UTF-8 (by substituting ### Yield UTF8Spans in byte parsers -Span's proposal mentions a future direction of byte parsing helpers on a `Cursor` or `Iterator` type (**TODO**: link to span proposal section). We could extend these types (or analogous types on `Span`) with UTF-8 parsing code: +Span's proposal mentions a future direction of byte parsing helpers on a `Cursor` or `Iterator` type on `RawSpan`. We could extend these types (or analogous types on `Span`) with UTF-8 parsing code: ```swift extension RawSpan.Cursor { @@ -992,6 +1001,9 @@ extension RawSpan.Cursor { } ``` +### Track other bits + +Future work includes tracking whether the contents are NULL-terminated (useful for C bridging), whether the contents contain any newlines or only a single newline at the end (useful for accelerating Regex `.`), etc. ## Alternatives considered @@ -1017,7 +1029,7 @@ That being said, these names are definitely bikesheddable and we'd like suggesti ### Other bounds or alignment checked formulations -For many operations that take an index that needs to be appropriately aligned, we propose `foo(_:)`, `foo(unchecked:)`, and `foo(uncheckedAssumingAligned:)`.
+For many operations that take an index that needs to be appropriately aligned, we propose `foo(_:)`, `foo(unchecked:)`, and `foo(uncheckedAssumingAligned:)`. `foo(_:)` and `foo(unchecked:)` have analogues in `Span` and `foo(uncheckedAssumingAligned:)` is the lowest level interface that a type such as `Iterator` would call (since it maintains index validity and alignment as an invariant). @@ -1029,6 +1041,7 @@ We could also only offer `foo(_:)` and `foo(uncheckedAssumingAligned:)`. Unalign + ## Acknowledgments Karoy Lorentey, Karl, Geordie_J, and fclout, contributed to this proposal with their clarifying questions and discussions. From 229732f2d044ba3c59bc31f2059983bbf1195248 Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Tue, 25 Jun 2024 11:12:47 -0600 Subject: [PATCH 07/16] title --- proposals/nnnn-utf8-span.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proposals/nnnn-utf8-span.md b/proposals/nnnn-utf8-span.md index e0f4f46d0b..573acd915f 100644 --- a/proposals/nnnn-utf8-span.md +++ b/proposals/nnnn-utf8-span.md @@ -1,4 +1,4 @@ -# Safe Access to Contiguous UTF-8 Storage +# Safe UTF-8 Processing Over Contiguous Bytes * Proposal: [SE-NNNN](nnnn-utf8-span.md) * Authors: [Michael Ilseman](https://github.com/milseman), [Guillaume Lessard](https://github.com/glessard) From 3c3a4b6d41ce52c9f07bac98260af5a83cb1f929 Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Wed, 26 Jun 2024 14:32:06 -0600 Subject: [PATCH 08/16] Update future directions --- proposals/nnnn-utf8-span.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/proposals/nnnn-utf8-span.md b/proposals/nnnn-utf8-span.md index 573acd915f..fd1fe1bad0 100644 --- a/proposals/nnnn-utf8-span.md +++ b/proposals/nnnn-utf8-span.md @@ -905,9 +905,7 @@ For an example implementation of those see [the `UTFSpanViews.swift` test file]( We propose equality checks (e.g. `scalarsEqual`), as those are incredibly common and useful operations. 
We have (tentatively) deferred other algorithms until non-escapable collections are figured out. -However, we can add select high-value algorithms if motivated by the community. We'd want to - - +However, we can add select high-value algorithms if motivated by the community. ### More validation API @@ -1005,6 +1003,9 @@ extension RawSpan.Cursor { Future work include tracking whether the contents are NULL-terminated (useful for C bridging), whether the contents contain any newlines or only a single newline at the end (useful for accelerating Regex `.`), etc. +### Putting more API on String + +`String` would also benefit from the query API, such as `isKnownNFC` and corresponding scan methods. Because a string may be a lazily-bridged instance of `NSString`, we don't always have the bits available to query or set, but this may become via pending future improvements in bridging. ## Alternatives considered From d66a1ec3af75460ed8f09695e403dcb32c294748 Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Wed, 26 Jun 2024 14:33:41 -0600 Subject: [PATCH 09/16] typo --- proposals/nnnn-utf8-span.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proposals/nnnn-utf8-span.md b/proposals/nnnn-utf8-span.md index fd1fe1bad0..373cd323d6 100644 --- a/proposals/nnnn-utf8-span.md +++ b/proposals/nnnn-utf8-span.md @@ -1005,7 +1005,7 @@ Future work include tracking whether the contents are NULL-terminated (useful fo ### Putting more API on String -`String` would also benefit from the query API, such as `isKnownNFC` and corresponding scan methods. Because a string may be a lazily-bridged instance of `NSString`, we don't always have the bits available to query or set, but this may become via pending future improvements in bridging. +`String` would also benefit from the query API, such as `isKnownNFC` and corresponding scan methods. 
Because a string may be a lazily-bridged instance of `NSString`, we don't always have the bits available to query or set, but this may become viable pending future improvements in bridging. ## Alternatives considered From 3cd5b2850e9a87853222e33a72f6635d27188bfa Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Wed, 26 Jun 2024 14:37:56 -0600 Subject: [PATCH 10/16] Future direction of printing and logging facilities --- proposals/nnnn-utf8-span.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/proposals/nnnn-utf8-span.md b/proposals/nnnn-utf8-span.md index 373cd323d6..7e519d102d 100644 --- a/proposals/nnnn-utf8-span.md +++ b/proposals/nnnn-utf8-span.md @@ -1007,6 +1007,10 @@ Future work include tracking whether the contents are NULL-terminated (useful fo `String` would also benefit from the query API, such as `isKnownNFC` and corresponding scan methods. Because a string may be a lazily-bridged instance of `NSString`, we don't always have the bits available to query or set, but this may become viable pending future improvements in bridging. +### Generalize printing and logging facilities + +Many printing and logging protocols and facilities operate in terms of `String`. They could be generalized to work in terms of UTF-8 bytes instead, which is important for embedded. 
+ ## Alternatives considered ### Invalid start / end of input UTF-8 encoding errors From 2101841b2bf7f5116dc60ea401c6dda8301f782d Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Mon, 1 Jul 2024 16:04:26 -0600 Subject: [PATCH 11/16] Update proposals/nnnn-utf8-span.md Co-authored-by: Ben Rimmington --- proposals/nnnn-utf8-span.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proposals/nnnn-utf8-span.md b/proposals/nnnn-utf8-span.md index 7e519d102d..a6ca412f63 100644 --- a/proposals/nnnn-utf8-span.md +++ b/proposals/nnnn-utf8-span.md @@ -7,7 +7,7 @@ * Bug: rdar://48132971, rdar://96837923 * Implementation: [Prototype](https://github.com/apple/swift-collections/pull/394) * Upcoming Feature Flag: (pending) -* Review: ([pitch 1](https://forums.swift.org/t/pitch-utf-8-processing-over-unsafe-contiguous-bytes/69715)) +* Review: ([pitch 1](https://forums.swift.org/t/pitch-utf-8-processing-over-unsafe-contiguous-bytes/69715)) ([pitch 2](https://forums.swift.org/t/pitch-safe-utf-8-processing-over-contiguous-bytes/72742)) ## Introduction From 6d4516e5dea7d23c3f5a1de6e9b91f9f572ac4f7 Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Mon, 1 Jul 2024 16:04:34 -0600 Subject: [PATCH 12/16] Update proposals/nnnn-utf8-span.md Co-authored-by: Ben Rimmington --- proposals/nnnn-utf8-span.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/proposals/nnnn-utf8-span.md b/proposals/nnnn-utf8-span.md index a6ca412f63..2a0084f5f7 100644 --- a/proposals/nnnn-utf8-span.md +++ b/proposals/nnnn-utf8-span.md @@ -768,9 +768,7 @@ extension UTF8Span { /// Whether `i` is in bounds @_alwaysEmitIntoClient - public func boundsCheck(_ i: Int) -> Bool { - i >= 0 && i < count - } + public func boundsCheck(_ i: Int) -> Bool /// Whether `bounds` is in bounds @_alwaysEmitIntoClient From c6f01f3554572d0e6152814cd255daca743c06e2 Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Mon, 1 Jul 2024 16:04:46 -0600 Subject: [PATCH 13/16] Update 
proposals/nnnn-utf8-span.md Co-authored-by: Ben Rimmington --- proposals/nnnn-utf8-span.md | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/proposals/nnnn-utf8-span.md b/proposals/nnnn-utf8-span.md index 2a0084f5f7..2554a6b4f9 100644 --- a/proposals/nnnn-utf8-span.md +++ b/proposals/nnnn-utf8-span.md @@ -791,9 +791,7 @@ extension UTF8Span { E: Error, Result: ~Copyable & ~Escapable >( _ body: (_ buffer: borrowing UnsafeBufferPointer<UInt8>) throws(E) -> Result - ) throws(E) -> dependsOn(self) Result { - try body(unsafeBaseAddress._ubp(0..<count)) - } + ) throws(E) -> dependsOn(self) Result } ``` From 5d9543926d1a3b857a57c5c60db6c297cd60eebb Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Mon, 1 Jul 2024 16:04:52 -0600 Subject: [PATCH 14/16] Update proposals/nnnn-utf8-span.md Co-authored-by: Ben Rimmington --- proposals/nnnn-utf8-span.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proposals/nnnn-utf8-span.md b/proposals/nnnn-utf8-span.md index 2554a6b4f9..81cfb9c3bc 100644 --- a/proposals/nnnn-utf8-span.md +++ b/proposals/nnnn-utf8-span.md @@ -673,7 +673,7 @@ extension UTF8Span { ```swift extension UTF8Span { /// Whether `self` is equivalent to `other` under Unicode Canonical - /// Equivalance. + /// Equivalence. @_unavailableInEmbedded public func isCanonicallyEquivalent( to other: UTF8Span From 3c01eb63113c22db9205a3a7d6f40298591c97eb Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Mon, 1 Jul 2024 16:05:06 -0600 Subject: [PATCH 15/16] Update proposals/nnnn-utf8-span.md Co-authored-by: Ben Rimmington --- proposals/nnnn-utf8-span.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proposals/nnnn-utf8-span.md b/proposals/nnnn-utf8-span.md index 81cfb9c3bc..ea81787393 100644 --- a/proposals/nnnn-utf8-span.md +++ b/proposals/nnnn-utf8-span.md @@ -680,7 +680,7 @@ extension UTF8Span { ) -> Bool /// Whether `self` orders less than `other` under Unicode Canonical - /// Equivalance using normalized code-unit order (in NFC). 
+ /// Equivalence using normalized code-unit order (in NFC). @_unavailableInEmbedded public func isCanonicallyLessThan( _ other: UTF8Span From ff456c6a7822ef7240c242451fdb5d6dba00e942 Mon Sep 17 00:00:00 2001 From: Michael Ilseman Date: Mon, 1 Jul 2024 16:05:25 -0600 Subject: [PATCH 16/16] Update proposals/nnnn-utf8-span.md Co-authored-by: Ben Rimmington --- proposals/nnnn-utf8-span.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proposals/nnnn-utf8-span.md b/proposals/nnnn-utf8-span.md index ea81787393..1c31567f7c 100644 --- a/proposals/nnnn-utf8-span.md +++ b/proposals/nnnn-utf8-span.md @@ -991,7 +991,7 @@ Span's proposal mentions a future direction of byte parsing helpers on a `Cursor extension RawSpan.Cursor { public mutating func parseUTF8(length: Int) throws -> UTF8Span - public mutating func parseNullTermiantedUTF8() throws -> UTF8Span + public mutating func parseNullTerminatedUTF8() throws -> UTF8Span } ```