Escape fewer Unicode codepoints in `Debug` impl of `str` #34485

tbu- · 2016-06-26T13:40:52Z

Use the same procedure as Python to determine whether a character is
printable, described in PEP 3138 (https://www.python.org/dev/peps/pep-3138/).
In particular, this means that the following character classes are escaped:

- Cc (Other, Control)
- Cf (Other, Format)
- Cs (Other, Surrogate), even though they can't appear in Rust strings
- Co (Other, Private Use)
- Cn (Other, Not Assigned)
- Zl (Separator, Line)
- Zp (Separator, Paragraph)
- Zs (Separator, Space), except for the ASCII space ' ' 0x20

This allows for user-friendly inspection of strings that are not
English (e.g. compare "\u{e9}\u{e8}\u{ea}" to "éèê").

Fixes #34318.
CC #34422.
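
The effect described above can be checked with a minimal sketch, assuming a toolchain that already includes this change. The helper names (`debug_repr`, `ascii_repr`, `char_debug_escape`) are made up for illustration and are not part of the PR.

```rust
/// Returns the `Debug` rendering of a string, surrounding quotes included.
fn debug_repr(s: &str) -> String {
    format!("{:?}", s)
}

/// Returns the ASCII-only rendering produced by `str::escape_default`.
fn ascii_repr(s: &str) -> String {
    s.escape_default().to_string()
}

/// Returns the `char::escape_debug` rendering of a single character.
/// Characters in the escaped classes come back as `\u{...}` sequences.
fn char_debug_escape(c: char) -> String {
    c.escape_debug().to_string()
}
```

With these helpers, `debug_repr("éèê")` keeps the accented letters readable, `ascii_repr("éèê")` gives the older all-escaped form, and `char_debug_escape('\u{a0}')` (a Zs character other than the ASCII space) is still escaped.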

rust-highfive · 2016-06-26T13:41:04Z

r? @brson

(rust_highfive has picked a reviewer for you, use r? to override)

sfackler · 2016-06-26T17:46:14Z

Looks like run-pass/ifmt.rs is failing on travis.

ollie27 · 2016-06-26T21:46:40Z

Is this changing char::escape_default? That would be a breaking change.

tbu- · 2016-06-26T21:50:06Z

It is changing that function. Why is it a breaking change?

ollie27 · 2016-06-26T22:04:47Z

It's a stable function and this will break people's code which relies on the current behaviour.

steveklabnik · 2016-06-26T22:11:03Z

/cc @rust-lang/libs and @rust-lang/lang

ranma42 · 2016-06-27T06:58:53Z

The documentation explicitly states that any character that is not in the printable ASCII range 0x20 .. 0x7e will be escaped (some characters with ad-hoc sequences, the other ones with hexadecimal Unicode escapes), so this change would at the very least need to change the documentation of the function to match the new behaviour.
This is probably undesirable as the API is stable and the current specification guarantees that the output is only composed of ASCII characters, while the new implementation has much looser and more complex guarantees (for example, it would be hard to provide an accurate specification without listing the Unicode character classes that are not escaped).
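
The contract described here (only printable ASCII passes through, everything else is escaped) can be illustrated with a short sketch. The function names are hypothetical; the only library calls are `str::escape_default` and `char::escape_default`.

```rust
/// Checks the documented guarantee: the escaped form of `s`
/// consists of ASCII characters only.
fn escape_default_is_ascii(s: &str) -> bool {
    s.escape_default().all(|c| c.is_ascii())
}

/// Returns the `char::escape_default` rendering of one character:
/// an ad-hoc sequence such as `\t`, the character itself if it is
/// printable ASCII, or a `\u{...}` escape otherwise.
fn escape_char(c: char) -> String {
    c.escape_default().to_string()
}
```

For instance, `escape_char('\t')` yields the two-character sequence `\t`, `escape_char('~')` returns `~` unchanged, and `escape_char('\u{7f}')` falls outside 0x20 .. 0x7e and is escaped.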

BurntSushi · 2016-06-27T11:07:49Z

@ranma42 Agreed. I don't think we can change this. It's very clearly changing the contract of the function.

tbu- · 2016-06-27T11:18:34Z

@ranma42 OK, assume for now that we don't change that function, but only the Debug implementation of str.

eddyb · 2016-06-27T11:56:12Z

Didn't these exact conversations happen before? Was a previous attempt abandoned?

tbu- · 2016-06-27T11:58:36Z

I don't know, I don't remember any.

eddyb · 2016-06-27T12:02:57Z

I can't find it. Might have been char instead. Don't mind me.

ranma42 · 2016-06-27T12:09:41Z

@tbu- , yes, I think that should work. We might also want to expose it as a function on char (escape_nonprintable?), but I am unsure if this should be done directly or if we should go for an RFC to discuss the details (which Debug impls should escape non printable chars and which ones should escape everything? should we expose the API on char? under what name? are there other conventions besides the one established by Python that might be worth evaluating?).

brson · 2016-06-27T16:46:44Z

Not sure what I think of this.

The medium that Debug writes to is a byte sequence with no guarantees that it supports Unicode. Certainly most places the output would end up would speak UTF-8 but definitely not all. I don't recall the motivation for escaping Debug so aggressively, but it seems like the possibility of terminals not understanding it must be one reason.

tbu- · 2016-06-28T09:25:38Z

Actually, the output of Debug is statically guaranteed to be UTF-8.

I would guess that the reason for this is that the people implementing it didn't need non-ASCII characters, and I mean if you don't need them they're just a nuisance. But if you're implementing a non-English program, then it basically makes the Debug implementation useless (see the linked issue in which @liigo shows the unreadable "file not found" message in Chinese).

If you write to a device that doesn't support UTF-8, you should just escape these characters later, when writing to said device -- like the ascii function in Python. The other way around doesn't work.

ranma42 · 2016-06-28T10:06:02Z

> But if you're implementing a non-English program, then it basically makes the Debug implementation useless

A possible objection is that

> Debug should format the output in a programmer-facing, debugging context.

which seems to imply that it should not be exposed to the users, but rather to tools or developers.
For example, one could say that formatting errors using Debug is not correct if they are going to be shown to the user.

tbu- · 2016-06-28T18:05:10Z

@ranma42 The linked example involves a programmer, @liigo.

ranma42 · 2016-06-28T20:40:30Z

@tbu- not really, in that case he is the user of the rustc program.

tbu- · 2016-06-29T00:03:27Z

@ranma42 It's a runtime error provided by the operating system, encountered while programming.

EDIT: Also, you could probably look into the PEP, they also give a longer motivation in there.

liigo · 2016-06-29T07:07:14Z

> Debug should format the output in a programmer-facing, debugging context.

I'd rather they didn't do this (mostly useless) formatting/escaping for me (a programmer). They ~~(self-righteous people)~~ thought this formatting might help programmers in a debugging context, but that's not true. Sometimes non-English programmers get unreadable formatted output, worse than the Display output. I don't use very old output devices that might have problems displaying Unicode characters and really need escaping.

@tbu- Maybe we can't change escape_default for compatibility reasons, but changing impl Debug for str is just an implementation detail IMHO.

ranma42 · 2016-06-29T07:21:55Z

@tbu- It is a runtime error provided by the operating system, encountered by rustc and dumped on the console in a bad way. What I am saying is that it is not obvious to me that the correct way to display it is using Debug instead of Display.
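
The Debug-versus-Display distinction under discussion can be sketched with a custom `io::Error`. This is a stand-in, not the OS-provided error from #34318: the message text ("file not found" in Chinese) and the helper names are invented for illustration.

```rust
use std::io;

/// Builds a stand-in error carrying a non-English message.
fn sample_error() -> io::Error {
    io::Error::new(io::ErrorKind::NotFound, "文件未找到")
}

/// Returns the user-facing (`Display`) and programmer-facing (`Debug`)
/// renderings of the same error.
fn display_and_debug(err: &io::Error) -> (String, String) {
    (format!("{}", err), format!("{:?}", err))
}
```

The Display rendering is the message itself. On a toolchain that includes this PR, the Debug rendering also contains the message unescaped, alongside the error kind; before the change it would have been a run of `\u{...}` escapes.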

liigo · 2016-06-29T07:22:33Z

@brson: it seems like the possibility of terminals not understanding it must be one reason.

If this was the reason, we should also implement println! to escape Unicode characters.

liigo · 2016-06-29T07:25:40Z

@ranma42: to display it is using Debug instead of Display

That'd be a big breaking change. Not all Debug types implement Display.

ranma42 · 2016-06-29T07:38:25Z

@liigo Yes, that would be a major breaking change (it would change the constraints on the Result type so that E: fmt::Display).

ranma42 · 2016-06-29T08:05:13Z

To me, the major advantage of the current implementation of Debug for str (always escaping non-ASCII) is that it allows to distinguish between different strings that have the same visual appearance. This applies in the same way to whitespace ('\t' vs spaces), Unicode combining characters and control characters... and maybe more?
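
For the whitespace and control-character cases mentioned here, a small sketch (assuming a toolchain that includes this PR; `render` is a hypothetical helper) suggests that the relaxed Debug output still tells such look-alikes apart. Combining characters are a separate question that this sketch does not cover.

```rust
/// Returns the `Debug` rendering of a string, quotes included.
fn render(s: &str) -> String {
    format!("{:?}", s)
}
```

With this helper, `render("a b")` shows a plain space, while `render("a\tb")`, `render("a\u{a0}b")` (no-break space, Zs) and `render("a\u{200b}b")` (zero-width space, Cf) each show an explicit escape in place of the invisible or ambiguous character.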

Of course this does not mean that it should be used for everything. Specifically, I would only use Debug when the representation of the data is relevant, not when it just needs to be shown to the user. Basically this reasoning is my understanding of the documentation of Debug and Display (programmer-facing vs user-facing). This can of course be incorrect, but in that case I would also ask for an improvement in the documentation ;)

#34318 shows an example where using Debug looks inappropriate (an OS-provided localised error). A partial solution that would have limited impact on other things would be to avoid escaping only the OS localised errors here. I would not even call this a breaking change, given that the output is already not explicitly stabilised (it obviously depends on the current locale and most likely on the operating system + version).

Even though Rust does not (yet) have its own localised error messages, it would not be hard to imagine the same issue affecting other types of output, so it might be a good idea to think of a more general solution to ensure a way forward in this direction.

tbu- · 2016-06-29T09:22:57Z

@ranma42 If you want to see the exact code points, why only make an exception for English? That's very English-centric. :) EDIT: Imagine the Debug implementation showed an escape sequence for every Latin character instead of the actual character. That would be quite a pain to program with, right? That's what it's like if the program deals with, say, Chinese strings.

We should probably provide a function that does the same as Debug today, like Python has: ascii. This basically solves the problem. If you need a Debug string like today, you can just put the Debug output into said function and you receive what you wanted.
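
One way such a function could look, as a rough sketch only: the name `ascii` mirrors Python's builtin and is hypothetical here, not an existing or proposed standard-library API. It takes the Debug output and replaces every non-ASCII character with its `\u{...}` escape.

```rust
use std::fmt::Debug;

/// Hypothetical helper in the spirit of Python's `ascii()`:
/// formats `value` with `Debug`, then escapes whatever is not ASCII.
fn ascii<T: Debug + ?Sized>(value: &T) -> String {
    let mut out = String::new();
    for c in format!("{:?}", value).chars() {
        if c.is_ascii() {
            // Already ASCII (including escapes Debug produced, like `\t`).
            out.push(c);
        } else {
            // `char::escape_unicode` yields the `\u{NNNN}` form.
            out.extend(c.escape_unicode());
        }
    }
    out
}
```

Because the escaping happens after formatting, it works for any Debug type, so `ascii(&vec!["中"])` escapes the string nested inside the vector as well.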

liigo · 2016-06-29T09:31:37Z

> To me, the major advantage of the current implementation of Debug for str (always escaping non-ASCII) is that it allows to distinguish between different strings that have the same visual appearance.

This is not an advantage for non-ASCII text. It just makes unreadable noise (\u{xxxx}...). If someone does think this is a major advantage, please let println! have this advantage too.

tbu- · 2016-07-27T19:22:36Z

@brson Could you @bors r-? There's another test failure in the WTF-8 code.

alexcrichton · 2016-07-27T20:21:23Z

@bors: r-

tbu- · 2016-07-28T00:50:07Z

@alexcrichton It should be fixed now.

alexcrichton · 2016-07-28T16:05:21Z

@bors: r+ 3d09b4a

bors · 2016-07-28T18:20:33Z

⌛ Testing commit 3d09b4a with merge d1df3fe...

bors · 2016-07-28T21:17:52Z

SimonSapin · 2017-02-03T00:57:57Z

I’m very late to say this, but this adds 2102 bytes of static data to libcore, whereas previously all large Unicode tables were in the std_unicode crate that #![no_std] programs could opt not to use. 2 KB may not seem like much, but it’s significant when programming a micro-controller that has 16 KB of flash memory.

aturon · 2017-02-03T17:09:21Z

@SimonSapin I opened #39492 for this issue, and to propose a general policy.

ariasuni · 2017-06-23T02:05:03Z

I’d like to know what the code does precisely (what are SINGLETONS0U, SINGLETONS0L, NORMAL0 and check()) so I could try to optimize it, at least in size. Also, this code should belong to libstd_unicode, right?

SimonSapin · 2017-06-23T06:11:09Z

@ariasuni The commit message and PR message give a list of Unicode categories of characters that are escaped, but yes, if it’s not already there, this list should also be in some doc-comment in the code.

I agree that I’d prefer to have these tables in libstd_unicode, but libcore still needs some impl of Debug for str. There can only be one such impl in a program, I don’t know if it’s possible to make it do something different based on whether libstd_unicode is available.

tbu- · 2017-06-23T09:34:57Z

@ariasuni The high-level view is that you have to store 0x110000 booleans, one for each Unicode code point, which specify whether the given code point should be printed or not (according to the description in the first post of this PR).
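
To make the high-level view concrete: a flat bitset with one bit per code point would take 0x110000 / 8 = 139,264 bytes, so some compressed encoding is needed. The sketch below uses a sorted range table with binary search. It is not the encoding used in this PR (SINGLETONS0U, NORMAL0 and friends), and the table is a small hand-picked excerpt of escaped ranges, not the full data.

```rust
use std::cmp::Ordering;

/// Size of a naive bitset covering every Unicode code point.
const FLAT_BITSET_BYTES: usize = 0x110000 / 8;

/// Illustrative excerpt of escaped ranges (inclusive), sorted by start.
/// The real table covers all of Cc, Cf, Cs, Co, Cn, Zl, Zp and Zs.
const ESCAPED_EXCERPT: &[(u32, u32)] = &[
    (0x0000, 0x001f), // Cc: C0 controls
    (0x007f, 0x009f), // Cc: DEL and C1 controls
    (0x00a0, 0x00a0), // Zs: no-break space
    (0x00ad, 0x00ad), // Cf: soft hyphen
    (0x2000, 0x200a), // Zs: assorted spaces
    (0x200b, 0x200f), // Cf: zero-width and directional marks
    (0x2028, 0x2029), // Zl, Zp: line and paragraph separators
    (0xd800, 0xdfff), // Cs: surrogates
    (0xe000, 0xf8ff), // Co: private use area
];

/// Returns true if `cp` falls inside one of the excerpted ranges.
fn in_escaped_excerpt(cp: u32) -> bool {
    ESCAPED_EXCERPT
        .binary_search_by(|&(lo, hi)| {
            if hi < cp {
                Ordering::Less
            } else if lo > cp {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}
```

A lookup such as `in_escaped_excerpt('\u{a0}' as u32)` costs a handful of comparisons, and the table grows with the number of ranges rather than with the number of code points.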

The low-level view of this particular implementation seems to have changed since I implemented it; you can find some notes on the new one in 44bcd26.

ariasuni · 2017-06-25T00:46:11Z

@SimonSapin We can split libstd_unicode in unicode-casing and unicode-categories, and use only the latter to implement is_printable(). What do you think about it?

SimonSapin · 2017-06-25T06:58:04Z

@ariasuni even if we did that, that doesn’t solve the situation that a #[no_std] program needs a Debug for str impl and ideally should also be able to opt out of embedding large Unicode tables. If we move is_printable to a separate crate but make libcore depend on that crate, we’ve moved things around without actually changing anything for Rust users. If we add a Cargo feature to disable including these tables (#39492), it doesn’t change much whether it’s a separate crate or not.

ariasuni · 2017-06-25T16:45:50Z

Sorry, I probably overthought it. The idea is to put the code for Unicode categories elsewhere so that we can use it in is_printable() even when using #[no_std] (right now without waiting for the Cargo feature), and this would reduce the code size for those who don’t. Not sure about the stability implications of moving code around though.

SimonSapin · 2017-06-25T17:33:11Z

I don’t understand how this would reduce code size, I may be missing something.

ariasuni · 2017-06-25T17:46:43Z

If I understand correctly, is_printable() is using Unicode categories data in char_private.rs, instead of the data in libstd_unicode used by is_control(), is_alphabetic(), is_numeric(), etc.

rust-highfive assigned brson Jun 26, 2016

brson added the T-libs-api and I-nominated labels Jun 27, 2016

Rename char::escape to char::escape_debug and add tracking issue

3d09b4a

tbu- force-pushed the pr_unicode_debug_str branch from f9bf85d to 3d09b4a July 28, 2016 00:21

alexcrichton added the relnotes label Jul 28, 2016

bors merged commit 3d09b4a into rust-lang:master Jul 28, 2016

This was referenced Jul 28, 2016

Escape the unmatched surrogates with lower-case hexadecimal numbers #35084

Merged

Methods Fn(Mut,Once)::call(mut,once) are gated with two feature gates, remove one of them #34802

Merged

SimonSapin added a commit to SimonSapin/rust-std-candidates that referenced this pull request Aug 2, 2016

Update tests for rust-lang/rust#34485

e10bfeb

bluss mentioned this pull request Nov 15, 2016

fmt::Debug should not escape printable characters #24588

Closed

liigo mentioned this pull request Nov 17, 2016

Unreadable io::Error debug strings #34318

Closed

radix pushed a commit to radix/string-wrapper that referenced this pull request Jan 13, 2017

Update tests for rust-lang/rust#34485

dcc28f0

aturon mentioned this pull request Feb 3, 2017

Escaping char in libcore adds 2k of static data for no_std cases #39492

Open

tbu- mentioned this pull request Jun 23, 2017

Tracking issue for the functions for debug escaping char_escape_debug #35068

Closed

SimonSapin mentioned this pull request Mar 23, 2018

Escape combining characters in char::Debug #49283

Merged
