Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement std::convert traits for char #35755

Merged
merged 2 commits into from
Sep 1, 2016
Merged

Conversation

SimonSapin
Copy link
Contributor

This is motivated by avoiding the as operator, which sometimes silently truncates, and instead use conversions that are explicitly lossless and infallible.

I’m less certain that From<u8> for char should be implemented: while it matches an existing behavior of as, it’s not necessarily the right thing to use for non-ASCII bytes. It effectively decodes bytes as ISO/IEC 8859-1 (since Unicode designed its first 256 code points to be compatible with that encoding), but that is not apparent in the API name.

@rust-highfive
Copy link
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @aturon (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@sfackler sfackler added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Aug 17, 2016
///
/// Surrogates are used in the UTF-16 encoding, and therefore are not characters.
SurrogateCodePoint,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're never going to need to add any extra cases to this enum, right? Should we stick a __ForExtensibility variant just in case?

@sfackler
Copy link
Member

I personally feel okay about the u8 -> char impl as it seems like what everyone would expect to happen, but am interested to see what other people think.

cc @rust-lang/libs

@alexcrichton
Copy link
Member

Seems reasonable to me, but I'd prefer to use an opaque struct with optional method accessors rather than an enum for the error type in TryFrom

@SimonSapin
Copy link
Contributor Author

@sfackler Re extensibility, I don’t expect this to ever be needed. The range of Unicode Scalar Values changed exactly once in the history of Unicode. At first it was “16 bits ought to be enough for everybody” 0x0000...0xFFFF. When that turned out not to be enough and a lot of systems were already using u16 as a code unit, UTF-16 was introduced. The current range, 0x0000...0xD7FF | 0xE000...0x10FFFF, is exactly what can be encoded by UTF-16. (UTF-8’s original design can support up to 0x7FFF_FFFF with 6-bytes sequences. It is now artificially restricted to match.)

In Unicode 9.0, 76% of the million and some code points are unassigned, so they’re not expected to run out. And breaking compatibility with UTF-16 is such a breaking change that I imagine it’s not even considered.

And this concern disappears with…

@alexcrichton Yeah, I also considered an opaque struct. Even without accessor method since I can’t think of a use case for it. (Code like a WTF-8 implementation that wants to deal with surrogate code points will likely do its own code point arithmetic anyway.) And a method cal always be added later. I’ve updated the PR.

@alexcrichton
Copy link
Member

Discussed during @rust-lang/libs triage today, conclusion was to merge. Thanks for the update @SimonSapin!

@bors: r+

@bors
Copy link
Contributor

bors commented Aug 23, 2016

📌 Commit 82678c5 has been approved by alexcrichton

eddyb added a commit to eddyb/rust that referenced this pull request Aug 23, 2016
…hton

Implement std::convert traits for char

This is motivated by avoiding the `as` operator, which sometimes silently truncates, and instead use conversions that are explicitly lossless and infallible.

I’m less certain that `From<u8> for char` should be implemented: while it matches an existing behavior of `as`, it’s not necessarily the right thing to use for non-ASCII bytes. It effectively decodes bytes as ISO/IEC 8859-1 (since Unicode designed its first 256 code points to be compatible with that encoding), but that is not apparent in the API name.
@bors
Copy link
Contributor

bors commented Aug 23, 2016

☔ The latest upstream changes (presumably #35656) made this pull request unmergeable. Please resolve the merge conflicts.

@@ -176,6 +172,41 @@ pub unsafe fn from_u32_unchecked(i: u32) -> char {
transmute(i)
}

#[stable(feature = "char_convert", since = "1.12.0")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these not be "1.13.0"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at the time I first opened this PR, but now yes. Fixed.

@bors
Copy link
Contributor

bors commented Aug 29, 2016

🔒 Merge conflict

These fit with other From implementations between integer types.

This helps the coding style of avoiding the 'as' operator that sometimes
silently truncates, and signals that these specific conversions are
lossless and infaillible.
For symmetry with From<char> for u32.
@alexcrichton
Copy link
Member

@bors: r+ f040208

@bors
Copy link
Contributor

bors commented Sep 1, 2016

⌛ Testing commit f040208 with merge b2799a5...

bors added a commit that referenced this pull request Sep 1, 2016
Implement std::convert traits for char

This is motivated by avoiding the `as` operator, which sometimes silently truncates, and instead use conversions that are explicitly lossless and infallible.

I’m less certain that `From<u8> for char` should be implemented: while it matches an existing behavior of `as`, it’s not necessarily the right thing to use for non-ASCII bytes. It effectively decodes bytes as ISO/IEC 8859-1 (since Unicode designed its first 256 code points to be compatible with that encoding), but that is not apparent in the API name.
@bors bors merged commit f040208 into rust-lang:master Sep 1, 2016
@bluss bluss added the relnotes Marks issues that should be documented in the release notes of the next release. label Sep 8, 2016
@bluss
Copy link
Member

bluss commented Sep 8, 2016

Mini-reminder: let's tag more user-visible stuff with relnotes

@SimonSapin SimonSapin deleted the char_convert branch September 15, 2016 07:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
relnotes Marks issues that should be documented in the release notes of the next release. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants