Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support nested character classes and intersection with && #346

Merged

Conversation

robinst
Copy link
Contributor

@robinst robinst commented Feb 22, 2017

This implements parts of UTS#18 RL1.3, namely:

  • Nested character classes, e.g.: [a[b-c]]
  • Intersections in classes, e.g.: [\w&&\p{Greek}]

They can be combined to do things like [\w&&[^a]] to get all word
characters except a.

Fixes #341

Copy link
Contributor Author

@robinst robinst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments in the code with remarks/questions. Please don't hold back with comments of any kind, I'm still quite new to Rust :).

I can add the user documentation once we have a better understanding of all the rules.

/// Calculate the intersection of two canonical character classes.
///
/// The returned intersection is canonical.
fn intersection(&self, other: &CharClass) -> CharClass {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this should be pub or not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say hold off for now. Thanks.

} else {
// No more ranges to check, done.
break;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is the most idiomatic way to do this :). Maybe using old-fashioned indexing would work better in this case.

Copy link

@ncm ncm Feb 24, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe

    match iter.next() {
          Some(v) => *item = v,
          _ => break  // no more ranges to check, done
    }

?

}

Expr::ClassBytes(byte_class)
}))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the same as before, just moved.

fn class_nested_class_brackets_hyphen() {
// This is really confusing, but `]` is allowed if first character within a class
// It parses as a nested class with the `]` and `-` characters
assert_eq!(p(r"[[]-]]"), Expr::Class(class(&[('-', '-'), (']', ']')])));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the decision to allow ] unescaped as the first character in a class final? Thinking about this and how it interacts with && was one of the most confusing parts about this change.

I'd wish the rule for ] was that it needs to be escaped inside a character class, no matter where it is. The same for -. Is it too late to change this? It would simplify the parsing a bit, and would make it easier to explain how things work.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe if it is too late to break compatibility with existing regexes we can at least ban ] and possibly - as first characters of a nested character class. As that test demonstrates, it can get really confusing when nested character classes are involved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe. The downside of this is that it would break the useful rule of "everything that can be a top-level character class can also be a nested character class".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I admit that this is exasperated by the presence of nested character classes, but AFAIK, this is pretty standard. Changing this would, no doubt in my mind, require a semver version bump. I really really don't like breaking changes in the regex syntax, so I would rather live with the implementation complexity over the breaking changes and inconsistencies with other regex engines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Just tested with Java's Pattern, even its implementation allows it. So yeah, makes sense to keep it. (It's just one of those things that seem like they were originally done "not because it's a good idea, but because we can", and now everyone has to support it.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(It's just one of those things that seem like they were originally done "not because it's a good idea, but because we can", and now everyone has to support it.)

I have no doubt that that is indeed the case. :-)

In fact, it's probably true for a lot of things in the regex syntax. :-)

Copy link

@trishume trishume left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks excellent to me. I reviewed everything and left a couple comments suggesting minor changes and asking questions. Looks well tested, did you check the branch coverage like @BurntSushi mentioned should stay perfect?

'&' => {
// intersection with `&&`
self.bump();
self.bump();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this throw a syntax error (or just treat it as a literal & in the character class, not sure what spec says) if the second character you bump over isn't also an &?

Edit: noticed later that you only exit parse_class_set if you look ahead and see &&. Maybe add a comment on the second bump saying this was verified to be & in parse_class_set.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, needs a comment, adding it.


Expr::ClassBytes(byte_class)
}))
Build::Expr(Expr::ClassBytes(class2)) => {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are all these extra cases for? I don't see how they relate to the rest of the changes in this PR. Do they add support for more types of escapes in character classes? Which tests test these new branches?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to see in the diff, but this is unchanged from before, just extracted into its own parse_class_escape method.

fn class_nested_class_brackets_hyphen() {
// This is really confusing, but `]` is allowed if first character within a class
// It parses as a nested class with the `]` and `-` characters
assert_eq!(p(r"[[]-]]"), Expr::Class(class(&[('-', '-'), (']', ']')])));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe if it is too late to break compatibility with existing regexes we can at least ban ] and possibly - as first characters of a nested character class. As that test demonstrates, it can get really confusing when nested character classes are involved.

'\\' => match try!(self.parse_escape()) {
Build::Expr(Expr::Class(class2)) => {
// Nested set, e.g. `[c-d]` in `[a-b[c-d]]`
let class2 = try!(self.parse_class_as_chars());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't had chance to do a thorough review yet (although, from what I see, this looks really really good), but this is a potential problem. In particular, this turns a parser with predictable stack growth into a parser with unpredictable stack growth. This is bad because if a program accepts regexes as user input and they do [[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[... or some such, then this recursion will cause a stack overflow.

With that said, there are various parts of this crate that use recursion over the abstract syntax, but not before checking the nesting limit. It's not quite a guarantee (since who knows how big the stack really is), but it does afford the caller some control over what happens if a very bad regex is provided.

This basically means you would need to maintain an explicit stack in the parser state of nested character classes, and this is what the Build enum is for. Right now, the Build enum only cares about groups and alternates, but maybe it's easy to add a variant for nested classes?

I'm sorry I didn't bring this point up earlier. I should have seen it coming, but I completely forgot about it until the code was in front of me.

How willing are you to make this change? If not, I might be able to finish it (I just don't know when I will).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point. I remember wondering about that problem while implementing it, but then forgot later.

I'll look into using Build, yeah. Maybe it could also be a separate stack. Have to wrap my head around it first though :).

@robinst robinst force-pushed the issue-341-char-class-nesting-and-intersection branch from c436bfd to 93f1b69 Compare February 25, 2017 03:00
@robinst
Copy link
Contributor Author

robinst commented Feb 25, 2017

Rewrote it to use a stack instead of recursion. Also added a test for deeply nested character classes that didn't pass before but now it does :).

I decided to not use Build but instead have a separate stack as the two things are completely separate.

@trishume
Copy link

@robinst Awesome. I looked over it briefly but I'm too tired to fully understand it at the moment, so I haven't done anything close to a careful review.

One thing I'm unsure about is if the important thing is just to not stack overflow, or to have limited memory usage. At the moment this should use heap proportional to at most the length of the regex as far as I can tell. But @BurntSushi said something about the nesting limit, and I'm not sure if checking it is warranted in this case, or only when using recursion.

@BurntSushi
Copy link
Member

But @BurntSushi said something about the nesting limit, and I'm not sure if checking it is warranted in this case, or only when using recursion.

Hmm, yes, the primary purpose of the nesting limit is to mitigate the risk of stack overflow. However, in the nested character case, all nestings are resolved (I think) to a single flattened set of Unicode codepoints.

Heap space may indeed be unbounded, but:

  1. This is already true today (I think) with regexes like (((((((((((((((((((((((((((((...
  2. The heap space is roughly proportional to the actual size of the regular expression pattern string, so some simple mitigation measures are possible by users.

@robinst
Copy link
Contributor Author

robinst commented Mar 2, 2017

all nestings are resolved (I think) to a single flattened set of Unicode codepoints.

Yes, they are.

This is already true today (I think) with regexes like (((((((((((((((((((((((((((((...

Yes, the stack just keeps growing and there's no check before pushing expressions onto it.

Did you have a chance to look at the change yet? I was looking into a quickcheck test for intersection as well, I can add it if you want.

@robinst robinst force-pushed the issue-341-char-class-nesting-and-intersection branch from 93f1b69 to 2f5c967 Compare March 13, 2017 01:13
@robinst
Copy link
Contributor Author

robinst commented Mar 13, 2017

I was looking into a quickcheck test for intersection as well, I can add it if you want.

Added the quickcheck test as well now.

@BurntSushi
Copy link
Member

OK, I've finally had a chance to review this in more depth and I'm pretty much speechless. This is phenomenal work. I looked hard, but I don't see anything to complain about!

I'm not sure if #354 will cause conflicts, so you might wind up needing to rebase, but let's see what happens. (Bors seems to be stuck at the moment...)

@bors r+

@bors
Copy link
Contributor

bors commented May 20, 2017

📌 Commit 2f872b4 has been approved by BurntSushi

@bors
Copy link
Contributor

bors commented May 20, 2017

🔒 Merge conflict

@bors
Copy link
Contributor

bors commented May 20, 2017

☔ The latest upstream changes (presumably #354) made this pull request unmergeable. Please resolve the merge conflicts.

@BurntSushi
Copy link
Member

@robinst Yup, this will need a rebase!

@robinst robinst force-pushed the issue-341-char-class-nesting-and-intersection branch from 2f872b4 to d3b5d32 Compare May 21, 2017 11:29
This implements parts of UTS#18 RL1.3, namely:

* Nested character classes, e.g.: `[a[b-c]]`
* Intersections in classes, e.g.: `[\w&&\p{Greek}]`

They can be combined to do things like `[\w&&[^a]]` to get all word
characters except `a`.

Fixes rust-lang#341
@robinst robinst force-pushed the issue-341-char-class-nesting-and-intersection branch from d3b5d32 to bb233ec Compare May 21, 2017 11:31
@robinst
Copy link
Contributor Author

robinst commented May 21, 2017

OK, I've finally had a chance to review this in more depth and I'm pretty much speechless. This is phenomenal work. I looked hard, but I don't see anything to complain about!

Wow, thank you :)! It was surprisingly simple to do this change, the codebase is very readable!

Rebased it now!

@BurntSushi
Copy link
Member

@bors retry

@BurntSushi
Copy link
Member

@bors r+

@bors
Copy link
Contributor

bors commented May 21, 2017

📌 Commit bb233ec has been approved by BurntSushi

@bors
Copy link
Contributor

bors commented May 21, 2017

⌛ Testing commit bb233ec with merge 548cb19...

bors added a commit that referenced this pull request May 21, 2017
…ction, r=BurntSushi

Support nested character classes and intersection with `&&`

This implements parts of UTS#18 RL1.3, namely:

* Nested character classes, e.g.: `[a[b-c]]`
* Intersections in classes, e.g.: `[\w&&\p{Greek}]`

They can be combined to do things like `[\w&&[^a]]` to get all word
characters except `a`.

Fixes #341
@bors
Copy link
Contributor

bors commented May 21, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: BurntSushi
Pushing 548cb19 to master...

@bors bors merged commit bb233ec into rust-lang:master May 21, 2017
@robinst robinst deleted the issue-341-char-class-nesting-and-intersection branch May 22, 2017 08:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants