Support nested character classes and intersection with `&&` #346

robinst · 2017-02-22T10:20:06Z

This implements parts of UTS#18 RL1.3, namely:

* Nested character classes, e.g.: `[a[b-c]]`
* Intersections in classes, e.g.: `[\w&&\p{Greek}]`

They can be combined to do things like `[\w&&[^a]]` to get all word
characters except `a`.

Fixes #341

robinst

Added some comments in the code with remarks/questions. Please don't hold back with comments of any kind, I'm still quite new to Rust :).

I can add the user documentation once we have a better understanding of all the rules.

robinst · 2017-02-22T10:20:44Z

regex-syntax/src/lib.rs

+    /// Calculate the intersection of two canonical character classes.
+    ///
+    /// The returned intersection is canonical.
+    fn intersection(&self, other: &CharClass) -> CharClass {


Not sure if this should be pub or not.

I'd say hold off for now. Thanks.
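
For readers following along, here is a minimal sketch of how two canonical range lists can be intersected with a two-pointer sweep. It is not the code from this PR: the `Range` alias and the `intersect_ranges` name are hypothetical, and "canonical" is assumed here to mean sorted and non-overlapping.

```rust
/// Inclusive character range; a hypothetical stand-in for the crate's range type.
type Range = (char, char);

/// Intersects two range lists that are each sorted and non-overlapping.
/// The result is produced in sorted order and is non-overlapping as well.
fn intersect_ranges(a: &[Range], b: &[Range]) -> Vec<Range> {
    let mut out = Vec::new();
    let (mut i, mut j) = (0, 0);
    while i < a.len() && j < b.len() {
        // The overlap of the two current ranges, if any.
        let lo = a[i].0.max(b[j].0);
        let hi = a[i].1.min(b[j].1);
        if lo <= hi {
            out.push((lo, hi));
        }
        // Advance whichever range ends first; the other one may still
        // overlap with the next range on the opposite side.
        if a[i].1 < b[j].1 {
            i += 1;
        } else {
            j += 1;
        }
    }
    out
}
```

Each input range is visited once, so the sweep is linear in the combined number of ranges.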

robinst · 2017-02-22T10:22:24Z

regex-syntax/src/lib.rs

+            } else {
+                // No more ranges to check, done.
+                break;
+            }


Not sure if this is the most idiomatic way to do this :). Maybe using old-fashioned indexing would work better in this case.

Maybe

    match iter.next() {
        Some(v) => *item = v,
        _ => break, // no more ranges to check, done
    }

?

robinst · 2017-02-22T10:25:06Z

regex-syntax/src/parser.rs

+            }
+
+            Expr::ClassBytes(byte_class)
+        }))


This is the same as before, just moved.

robinst · 2017-02-22T10:32:17Z

regex-syntax/src/parser.rs

+    fn class_nested_class_brackets_hyphen() {
+        // This is really confusing, but `]` is allowed if first character within a class
+        // It parses as a nested class with the `]` and `-` characters
+        assert_eq!(p(r"[[]-]]"), Expr::Class(class(&[('-', '-'), (']', ']')])));


Is the decision to allow `]` unescaped as the first character in a class final? Thinking about this and how it interacts with `&&` was one of the most confusing parts of this change.

I wish the rule for `]` were that it needs to be escaped inside a character class, no matter where it is. The same goes for `-`. Is it too late to change this? It would simplify the parsing a bit, and would make it easier to explain how things work.

Maybe if it is too late to break compatibility with existing regexes we can at least ban ] and possibly - as first characters of a nested character class. As that test demonstrates, it can get really confusing when nested character classes are involved.

Maybe. The downside of this is that it would break the useful rule of "everything that can be a top-level character class can also be a nested character class".

I admit that this is exacerbated by the presence of nested character classes, but AFAIK, this is pretty standard. Changing this would, no doubt in my mind, require a semver version bump. I really really don't like breaking changes in the regex syntax, so I would rather live with the implementation complexity over the breaking changes and inconsistencies with other regex engines.

Ok. Just tested with Java's Pattern, even its implementation allows it. So yeah, makes sense to keep it. (It's just one of those things that seem like they were originally done "not because it's a good idea, but because we can", and now everyone has to support it.)

> (It's just one of those things that seem like they were originally done "not because it's a good idea, but because we can", and now everyone has to support it.)

I have no doubt that that is indeed the case. :-)

In fact, it's probably true for a lot of things in the regex syntax. :-)
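
The rule discussed in this thread, that `]` is a literal when it is the first character of a class, can be illustrated with a small sketch. It is deliberately simplified and is not the parser from this PR: the `class_members` name is hypothetical, and negation, ranges, escapes, nesting, and trailing input are all ignored.

```rust
/// Returns the literal members of a flat class such as `[ab]` or `[]-]`,
/// or `None` if the input does not start with `[` or the class is unclosed.
fn class_members(pattern: &str) -> Option<Vec<char>> {
    let mut chars = pattern.chars().peekable();
    if chars.next() != Some('[') {
        return None;
    }
    let mut members = Vec::new();
    // A `]` immediately after the opening `[` is a literal, not the terminator.
    if chars.peek() == Some(&']') {
        members.push(']');
        chars.next();
    }
    for c in chars {
        if c == ']' {
            // The first `]` after the leading position closes the class.
            return Some(members);
        }
        members.push(c);
    }
    None // ran out of input before the closing `]`
}
```

Under this rule `[]-]` contains `]` and `-`, which is the inner class of the `[[]-]]` test above, while `[]` on its own is an unclosed class.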

trishume

Mostly looks excellent to me. I reviewed everything and left a couple of comments suggesting minor changes and asking questions. It looks well tested. Did you check that the branch coverage stays perfect, as @BurntSushi mentioned it should?

trishume · 2017-02-22T15:29:56Z

regex-syntax/src/parser.rs

+                '&' => {
+                    // intersection with `&&`
+                    self.bump();
+                    self.bump();


Shouldn't this throw a syntax error (or just treat it as a literal & in the character class, not sure what spec says) if the second character you bump over isn't also an &?

Edit: noticed later that you only exit parse_class_set if you look ahead and see &&. Maybe add a comment on the second bump saying this was verified to be & in parse_class_set.

Yeah, needs a comment, adding it.
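
To make the lookahead explicit, here is a rough sketch of the idea: only a verified `&&` pair is consumed as an operator, which is why skipping two characters is safe. The `Token` type and `tokenize_class_body` function are hypothetical, and treating a lone `&` as a literal is an assumption of this sketch rather than something settled in this thread.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Literal(char),
    Intersection,
}

/// Splits the inside of a character class into literals and `&&` operators.
fn tokenize_class_body(body: &str) -> Vec<Token> {
    let chars: Vec<char> = body.chars().collect();
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < chars.len() {
        if chars[pos] == '&' && chars.get(pos + 1) == Some(&'&') {
            // Both characters were checked by the lookahead above,
            // so stepping over two of them cannot skip anything else.
            tokens.push(Token::Intersection);
            pos += 2;
        } else {
            // Anything else, including a single `&`, is kept as a literal.
            tokens.push(Token::Literal(chars[pos]));
            pos += 1;
        }
    }
    tokens
}
```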

trishume · 2017-02-22T15:45:42Z

regex-syntax/src/parser.rs

-
-            Expr::ClassBytes(byte_class)
-        }))
+            Build::Expr(Expr::ClassBytes(class2)) => {


What are all these extra cases for? I don't see how they relate to the rest of the changes in this PR. Do they add support for more types of escapes in character classes? Which tests test these new branches?

It's hard to see in the diff, but this is unchanged from before, just extracted into its own parse_class_escape method.

trishume · 2017-02-22T15:51:44Z

regex-syntax/src/parser.rs

+    fn class_nested_class_brackets_hyphen() {
+        // This is really confusing, but `]` is allowed if first character within a class
+        // It parses as a nested class with the `]` and `-` characters
+        assert_eq!(p(r"[[]-]]"), Expr::Class(class(&[('-', '-'), (']', ']')])));


Maybe if it is too late to break compatibility with existing regexes we can at least ban ] and possibly - as first characters of a nested character class. As that test demonstrates, it can get really confusing when nested character classes are involved.

BurntSushi · 2017-02-23T00:14:00Z

regex-syntax/src/parser.rs

-                '\\' => match try!(self.parse_escape()) {
-                    Build::Expr(Expr::Class(class2)) => {
+                        // Nested set, e.g. `[c-d]` in `[a-b[c-d]]`
+                        let class2 = try!(self.parse_class_as_chars());


I haven't had a chance to do a thorough review yet (although, from what I see, this looks really really good), but this is a potential problem. In particular, this turns a parser with predictable stack growth into a parser with unpredictable stack growth. This is bad because if a program accepts regexes as user input and someone supplies [[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[... or some such, then this recursion will cause a stack overflow.

With that said, there are various parts of this crate that use recursion over the abstract syntax, but not before checking the nesting limit. It's not quite a guarantee (since who knows how big the stack really is), but it does afford the caller some control over what happens if a very bad regex is provided.

This basically means you would need to maintain an explicit stack of nested character classes in the parser state, and this is what the Build enum is for. Right now, the Build enum only cares about groups and alternates, but maybe it's easy to add a variant for nested classes?

I'm sorry I didn't bring this point up earlier. I should have seen it coming, but I completely forgot about it until the code was in front of me.

How willing are you to make this change? If not, I might be able to finish it (I just don't know when I will).

Very good point. I remember wondering about that problem while implementing it, but then forgot later.

I'll look into using Build, yeah. Maybe it could also be a separate stack. Have to wrap my head around it first though :).

robinst · 2017-02-25T03:07:43Z

Rewrote it to use a stack instead of recursion. Also added a test for deeply nested character classes that didn't pass before but now it does :).

I decided to not use Build but instead have a separate stack as the two things are completely separate.
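
As an illustration of the general technique (not the code in this PR), the sketch below parses nested classes with an explicit heap-allocated stack instead of recursion, so nesting depth no longer affects call-stack depth. The grammar is heavily simplified: only literals and nested brackets are handled, a leading `]` is not treated as a literal, and the `parse_nested_class` name is hypothetical.

```rust
use std::collections::BTreeSet;

/// Parses a simplified nested class like `[a[b[c]]]` into one flattened set.
fn parse_nested_class(pattern: &str) -> Result<BTreeSet<char>, String> {
    let total = pattern.chars().count();
    // One entry per class that is currently open; innermost class on top.
    let mut stack: Vec<BTreeSet<char>> = Vec::new();
    for (idx, c) in pattern.chars().enumerate() {
        match c {
            '[' => stack.push(BTreeSet::new()),
            ']' => {
                let done = stack
                    .pop()
                    .ok_or_else(|| format!("unmatched `]` at {}", idx))?;
                match stack.last_mut() {
                    // A nested class closed: merge it into its parent.
                    Some(parent) => parent.extend(done),
                    // The outermost class closed: it must end the input.
                    None => {
                        if idx + 1 != total {
                            return Err(format!("trailing input at {}", idx + 1));
                        }
                        return Ok(done);
                    }
                }
            }
            _ => match stack.last_mut() {
                Some(top) => {
                    top.insert(c);
                }
                None => return Err(format!("literal outside class at {}", idx)),
            },
        }
    }
    Err("unclosed character class".to_string())
}
```

Memory use here grows with the length of the pattern, on the heap, rather than with the depth of recursion on the call stack.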

trishume · 2017-02-25T03:21:52Z

@robinst Awesome. I looked over it briefly but I'm too tired to fully understand it at the moment, so I haven't done anything close to a careful review.

One thing I'm unsure about is if the important thing is just to not stack overflow, or to have limited memory usage. At the moment this should use heap proportional to at most the length of the regex as far as I can tell. But @BurntSushi said something about the nesting limit, and I'm not sure if checking it is warranted in this case, or only when using recursion.

BurntSushi · 2017-02-25T03:41:10Z

> But @BurntSushi said something about the nesting limit, and I'm not sure if checking it is warranted in this case, or only when using recursion.

Hmm, yes, the primary purpose of the nesting limit is to mitigate the risk of stack overflow. However, in the nested character class case, all nestings are resolved (I think) to a single flattened set of Unicode codepoints.

Heap space may indeed be unbounded, but:

* This is already true today (I think) with regexes like (((((((((((((((((((((((((((((...
* The heap space is roughly proportional to the actual size of the regular expression pattern string, so some simple mitigation measures are possible by users.

robinst · 2017-03-02T05:47:41Z

> all nestings are resolved (I think) to a single flattened set of Unicode codepoints.

Yes, they are.

> This is already true today (I think) with regexes like (((((((((((((((((((((((((((((...

Yes, the stack just keeps growing and there's no check before pushing expressions onto it.

Did you have a chance to look at the change yet? I was looking into a quickcheck test for intersection as well, I can add it if you want.

robinst · 2017-03-13T01:14:36Z

> I was looking into a quickcheck test for intersection as well, I can add it if you want.

Added the quickcheck test as well now.

BurntSushi · 2017-05-20T14:37:01Z

OK, I've finally had a chance to review this in more depth and I'm pretty much speechless. This is phenomenal work. I looked hard, but I don't see anything to complain about!

I'm not sure if #354 will cause conflicts, so you might wind up needing to rebase, but let's see what happens. (Bors seems to be stuck at the moment...)

@bors r+

bors · 2017-05-20T14:37:02Z

📌 Commit 2f872b4 has been approved by BurntSushi

bors · 2017-05-20T15:18:54Z

🔒 Merge conflict

bors · 2017-05-20T15:18:59Z

☔ The latest upstream changes (presumably #354) made this pull request unmergeable. Please resolve the merge conflicts.

BurntSushi · 2017-05-20T15:27:05Z

@robinst Yup, this will need a rebase!

robinst · 2017-05-21T12:40:37Z

> OK, I've finally had a chance to review this in more depth and I'm pretty much speechless. This is phenomenal work. I looked hard, but I don't see anything to complain about!

Wow, thank you :)! It was surprisingly simple to do this change, the codebase is very readable!

Rebased it now!

BurntSushi · 2017-05-21T13:03:23Z

@bors retry

BurntSushi · 2017-05-21T13:03:34Z

@bors r+

bors · 2017-05-21T13:03:34Z

📌 Commit bb233ec has been approved by BurntSushi

bors · 2017-05-21T13:03:41Z

⌛ Testing commit bb233ec with merge 548cb19...

bors · 2017-05-21T13:21:39Z

☀️ Test successful - status-appveyor, status-travis
Approved by: BurntSushi
Pushing 548cb19 to master...

robinst mentioned this pull request Feb 22, 2017

Implement (at least part of) UTS#18 RL1.3 - Operators in character sets #341

Closed

trishume approved these changes Feb 22, 2017

BurntSushi reviewed Feb 23, 2017

robinst force-pushed the issue-341-char-class-nesting-and-intersection branch from c436bfd to 93f1b69 Compare February 25, 2017 03:00

robinst force-pushed the issue-341-char-class-nesting-and-intersection branch from 93f1b69 to 2f5c967 Compare March 13, 2017 01:13

robinst mentioned this pull request May 2, 2017

Handle fancy escapes in character classes google/fancy-regex#12

Merged

robinst force-pushed the issue-341-char-class-nesting-and-intersection branch from 2f5c967 to 2f872b4 Compare May 5, 2017 03:25

robinst force-pushed the issue-341-char-class-nesting-and-intersection branch from 2f872b4 to d3b5d32 Compare May 21, 2017 11:29

robinst force-pushed the issue-341-char-class-nesting-and-intersection branch from d3b5d32 to bb233ec Compare May 21, 2017 11:31

bors merged commit bb233ec into rust-lang:master May 21, 2017

robinst deleted the issue-341-char-class-nesting-and-intersection branch May 22, 2017 08:16
