Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libsyntax: be more accepting of whitespace in lexer #29734

Merged
merged 3 commits into from
Mar 8, 2016

Conversation

Ryman
Copy link
Contributor

@Ryman Ryman commented Nov 10, 2015

Fixes #29590.

Perhaps this may need more thorough testing?

r? @Aatch

/// this function is Unicode-ignorant; fortunately, the careful design of
/// UTF-8 mitigates this ignorance. In particular, this function only collapses
/// sequences of \n, \r, ' ', and \t, but it should otherwise tolerate Unicode
/// chars. Unsurprisingly, it doesn't do NKF-normalization(?).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably should leave the rest of this comment in. The function is more unicode-aware now, but presumably still doesn't do any normalisation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// This function is relatively Unicode-ignorant; fortunately, the careful design
/// of UTF-8 mitigates this ignorance. It doesn't do NKF-normalization(?).

Would be ok for you? Or just the normalization sentence?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks fine to me.

@Aatch
Copy link
Contributor

Aatch commented Nov 10, 2015

Test looks fine to me. r=me with the nits addressed.

@Aatch
Copy link
Contributor

Aatch commented Nov 10, 2015

@bors r+ de89895

@eefriedman
Copy link
Contributor

This is technically a new feature... probably doesn't need a full RFC, but shouldn't it at least be discussed a bit?

@Ryman
Copy link
Contributor Author

Ryman commented Nov 11, 2015

Should we r- bors if there are concerns?

cc @nikomatsakis (who suggested accepting all whitespace in #29590)

@eefriedman
Copy link
Contributor

I guess I don't really have any specific concerns; it's not like we could interpret a character of class whitespace as anything other than whitespace or an error. We could end up with a sort of confusing parse with ZWNJs and other characters like that, but I don't see any reason not to allow users to inflict that on themselves.

@eefriedman
Copy link
Contributor

Err, actually, correction, I do have a specific objection: UAX #31 specifically recommends not doing this. We should be using the Pattern_White_Space property, not White_Space.

@Aatch
Copy link
Contributor

Aatch commented Nov 11, 2015

@bors r-

@Aatch
Copy link
Contributor

Aatch commented Nov 11, 2015

Sorry, I guess I got a little too zealous there.

@nikomatsakis
Copy link
Contributor

My bad. I should have known better than to think anything about unicode was a black-and-white decision!

@Ryman
Copy link
Contributor Author

Ryman commented Nov 12, 2015

I've updated the branch to accept only PATTERN_WHITE_SPACE, I've assumed exposing more from librustc_unicode is fine due to it being unstable.

The github diff doesn't seem to display the testfile too well, it shows as the below in my local git which shows the test more clearly:

-    assert_eq!(4 +  7 * 2
-
+assert_eq!(4^L+

-/ 3 * 2 , 4 + 7 * 2 / 3 * 2);
+7   * 2<U+0085>/<U+200E>3<U+200F>*<U+2028>2<U+2029>, 4 + 7 * 2 / 3 * 2);
 }```

@bors
Copy link
Contributor

bors commented Dec 6, 2015

☔ The latest upstream changes (presumably #30187) made this pull request unmergeable. Please resolve the merge conflicts.

@steveklabnik
Copy link
Member

What's the status of this PR?

@Ryman
Copy link
Contributor Author

Ryman commented Jan 1, 2016

@steveklabnik Needs some feedback on if the changes are now ok, perhaps this change still needs more discussion though? Can rebase after that!

}
}

// check if a has *only* trailing whitespace
a_iter.all(|c| is_pattern_whitespace(c))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the closure needed? Wouldn't a_iter.all(is_pattern_whitespace) work just fine?

ranma42 added a commit to ranma42/rust that referenced this pull request Jan 4, 2016
As mentioned in rust-lang#29734, the range comparison closure can be improved.

The LLVM IR and the assembly from the new version are much simpler and
unfortunately we cannot rely on the compiler to optimise this much, as
it would need to know that `lo <= hi`.

Besides from simpler code, there might also be a performance
advantage, although it is unlikely to appear on benchmarks, as we are
doing a binary search, which should always involve few comparisons.

The code is available on the playpen for ease of comparison:
http://is.gd/4raMmH
@nikomatsakis
Copy link
Contributor

@Ryman so it now conforms with UAX #31 (as cited by @eefriedman)?

@Ryman
Copy link
Contributor Author

Ryman commented Jan 7, 2016

@nikomatsakis I think it partially does, from the guideline:

Pattern_White_Space and Pattern_Syntax Characters: To meet this requirement, an implementation shall use Pattern_White_Space characters as all and only those characters interpreted as whitespace in parsing, and shall use Pattern_Syntax characters as all and only those characters with syntactic use. ...

Note: When meeting this requirement, all characters except those that have the Pattern_White_Space or Pattern_Syntax properties are available for use as identifiers or literals.

This PR doesn't change anything about the acceptance of Pattern_Syntax, so I don't think the final note is true, but it does bring us into alignment with the whitespace requirements of the guideline. (unless I've mis-implemented!)

@nikomatsakis
Copy link
Contributor

@Ryman great!

bors added a commit that referenced this pull request Jan 12, 2016
and the associated update of tables.rs

The last commit is related to my comment to #29734.
This aligns with unicode recommendations and should be stable for all future
unicode releases. See http://unicode.org/reports/tr31/#R3.

This renames `libsyntax::lexer::is_whitespace` to `is_pattern_whitespace`
so potentially breaks users of libsyntax.
@Ryman
Copy link
Contributor Author

Ryman commented Jan 16, 2016

Rebased, I think I got all of your suggestions accounted for @ranma42 :)

@ranma42
Copy link
Contributor

ranma42 commented Jan 16, 2016

Yes, thank you! :)

@steveklabnik
Copy link
Member

Triage: it seems this PR has been updated with review and is still siting here almost a month later, anyone willing to r+?

@alexcrichton
Copy link
Member

r? @nikomatsakis

was this intended to be r+'d? just confirming...

@alexcrichton alexcrichton assigned nikomatsakis and unassigned Aatch Mar 8, 2016
@Aatch
Copy link
Contributor

Aatch commented Mar 8, 2016

@bors r+ 24578e0

@Aatch
Copy link
Contributor

Aatch commented Mar 8, 2016

Woops, looks like this managed to slip by.

@bors
Copy link
Contributor

bors commented Mar 8, 2016

⌛ Testing commit 24578e0 with merge 8b7c3f2...

bors added a commit that referenced this pull request Mar 8, 2016
libsyntax: be more accepting of whitespace in lexer

Fixes #29590.

Perhaps this may need more thorough testing?

r? @Aatch
@bors bors merged commit 24578e0 into rust-lang:master Mar 8, 2016
durka added a commit to durka/rust that referenced this pull request Jun 15, 2016
steveklabnik added a commit to steveklabnik/rust that referenced this pull request Jun 27, 2016
steveklabnik added a commit to steveklabnik/rust that referenced this pull request Jun 27, 2016
GuillaumeGomez added a commit to GuillaumeGomez/rust that referenced this pull request Jun 28, 2016
dlrobertson pushed a commit to dlrobertson/rust that referenced this pull request Nov 29, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants