fix: escape invalid UTF-8 bytes in debug output for Match #1203

Merged: 3 commits, Jun 9, 2024

Contributor

@notJoon notJoon commented Jun 8, 2024

Description

  1. The Debug implementation for Match has been updated to use DebugHaystack. This provides a way to handle the formatting of &[u8] for debug output.
     • Valid UTF-8 characters are output as is.
     • Invalid UTF-8 bytes are output as hex escape sequences (\xHH).
     • ASCII escape characters (e.g., \t, \n) are properly escaped.
  2. Additional test cases have been added.
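
The formatting rules listed above can be sketched with the standard library alone. This is a minimal illustration, not the code of DebugHaystack or of this PR: the names `escape_invalid_utf8` and `push_escaped` are hypothetical, and only a few ASCII control characters are handled.

```rust
use std::fmt::Write;

/// Hypothetical helper (not the regex crate's implementation): valid UTF-8
/// passes through unchanged, each invalid byte is rendered as `\xHH`.
fn escape_invalid_utf8(mut bytes: &[u8]) -> String {
    let mut out = String::new();
    while !bytes.is_empty() {
        match std::str::from_utf8(bytes) {
            // The remaining input is entirely valid UTF-8.
            Ok(valid) => {
                push_escaped(&mut out, valid);
                break;
            }
            Err(err) => {
                // Everything before `valid_up_to()` is known to be valid.
                let (valid, rest) = bytes.split_at(err.valid_up_to());
                push_escaped(&mut out, std::str::from_utf8(valid).unwrap());
                // `error_len()` is None when the input ends mid-sequence;
                // in that case treat all remaining bytes as invalid.
                let bad_len = err.error_len().unwrap_or(rest.len());
                for &b in &rest[..bad_len] {
                    write!(out, "\\x{:02X}", b).unwrap();
                }
                bytes = &rest[bad_len..];
            }
        }
    }
    out
}

/// Escapes a few ASCII control characters; everything else is copied as is.
/// (Assumption: a real formatter would cover more cases.)
fn push_escaped(out: &mut String, s: &str) {
    for ch in s.chars() {
        match ch {
            '\t' => out.push_str("\\t"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            _ => out.push(ch),
        }
    }
}
```

With this sketch, the bytes `b"\xFFworld"` come out as `\xFFworld`, while non-ASCII text such as `héllo` is left readable.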

fmt.field("bytes", &s);

let bytes = self.as_bytes();
let formatted = bytes_to_string_with_invalid_utf8_escaped(bytes);
Member

Can you use regex_automata::util::escape::DebugHaystack instead? It will basically do what you have here, but will only escape invalid UTF-8. What you've implemented here will escape not only invalid UTF-8, but all UTF-8 that isn't ASCII. (I think that would be a cure worse than the disease.)

Contributor Author

Modified to use DebugHaystack. I thought there would be such a feature but couldn't find it. Thanks for your suggestion. 88112b3

debug_str,
r#"Match { start: 7, end: 13, bytes: "\\xFFworld" }"#
);
}
Member

Please add some tests with non-ASCII UTF-8.

Contributor Author

Added along with other tests.
d18841e

fn bytes_to_string_with_invalid_utf8_escaped(bytes: &[u8]) -> String {
let mut result = String::new();
for &byte in bytes {
if byte.is_ascii() {
Member

> outputs valid UTF-8 characters as is

This is why what you said isn't accurate here. This only outputs ASCII characters as-is. Everything else, including valid UTF-8 that isn't ASCII, is emitted as escape byte sequences.
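
To make the point concrete, here is a small sketch of a per-byte ASCII check. The function name `escape_non_ascii_bytes` is hypothetical and its body is an assumption, since only the first lines of the reviewed function are visible above. Because the check runs on individual bytes, every byte of a multi-byte UTF-8 character fails `is_ascii()` and gets escaped, even when the sequence as a whole is valid.

```rust
use std::fmt::Write;

/// Hypothetical reconstruction of a per-byte approach (body assumed):
/// ASCII bytes are copied, every other byte becomes `\xHH`.
fn escape_non_ascii_bytes(bytes: &[u8]) -> String {
    let mut result = String::new();
    for &byte in bytes {
        if byte.is_ascii() {
            // ASCII bytes map directly to the same `char`.
            result.push(byte as char);
        } else {
            // This branch also catches the bytes of valid non-ASCII
            // characters, which is the behavior the review objects to.
            write!(result, "\\x{:02X}", byte).unwrap();
        }
    }
    result
}
```

For example, `é` is the valid two-byte sequence C3 A9, yet this function renders it as `\xC3\xA9` instead of `é`.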

@notJoon notJoon requested a review from BurntSushi June 9, 2024 01:11
Member

@BurntSushi BurntSushi left a comment

Thanks!

@BurntSushi BurntSushi merged commit 1f9f9cc into rust-lang:master Jun 9, 2024
16 checks passed
@BurntSushi
Member

This PR is on crates.io in regex 1.10.5.

@notJoon notJoon deleted the debug-output-invalid-utf8 branch June 10, 2024 00:52