compiler: fix lazy DFA false quits on ASCII text #768

BurntSushi · 2021-05-01T11:27:44Z

One of the things the lazy DFA can't handle is Unicode word boundaries,
since they require multi-byte look-around. However, it turns out that on
pure ASCII text, Unicode word boundaries are equivalent to ASCII word
boundaries. So the DFA has a heuristic: it treats Unicode word
boundaries as ASCII boundaries until it sees a non-ASCII byte. When it
does, it quits, and some other (slower) regex engine needs to take over.
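
A minimal sketch of that heuristic, assuming a plain byte scan rather than a real DFA. The `Scan` type and the function names are hypothetical and are not taken from the regex crate; the point is only that ASCII word-boundary rules are applied until the first non-ASCII byte, at which point the scan gives up and reports where:

```rust
/// Outcome of a scan. Hypothetical type used only in this sketch.
#[derive(Debug, PartialEq, Eq)]
enum Scan {
    /// The whole haystack was ASCII; holds the number of word boundaries seen.
    Done(usize),
    /// A non-ASCII byte was found at this offset, so a slower engine
    /// would have to take over from here.
    Quit(usize),
}

/// ASCII notion of a word byte: [0-9A-Za-z_].
fn is_ascii_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Counts word boundaries using ASCII rules, quitting on the first
/// non-ASCII byte instead of guessing.
fn count_word_boundaries(haystack: &[u8]) -> Scan {
    let mut count = 0;
    let mut prev_is_word = false;
    for (i, &b) in haystack.iter().enumerate() {
        if !b.is_ascii() {
            // ASCII rules are no longer known to be correct past this point.
            return Scan::Quit(i);
        }
        let cur_is_word = is_ascii_word_byte(b);
        if cur_is_word != prev_is_word {
            count += 1;
        }
        prev_is_word = cur_is_word;
    }
    // A trailing word byte is followed by a boundary at end of input.
    if prev_is_word {
        count += 1;
    }
    Scan::Done(count)
}
```

Under these assumptions, an all-ASCII haystack such as `b"fn main() {}"` runs to completion, while `"café".as_bytes()` quits at offset 3, the first byte of the encoded `é`.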

In a bug report against ripgrep[1], it was discovered that the lazy DFA
was quitting and falling back to a slower engine even though the
haystack was pure ASCII.

It turned out that our equivalence byte class optimization was at fault.
Namely, a '{' (which appears very frequently in the input) was being
grouped into the same class as non-ASCII bytes. So whenever the DFA saw
it, it treated it as a non-ASCII byte and thus quit.

The fix for this is simple: when we see a Unicode word boundary in the
compiler, we set a boundary on our byte classes such that ASCII bytes
are guaranteed to be in a different class from non-ASCII bytes. And
indeed, this fixes the performance problem reported in [1].
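
The problem and the fix can be illustrated with a small sketch. It assumes byte classes are derived from a set of "a class ends here" markers, one per byte value; the `ByteClassBoundaries` type, its methods, and the `ASCII_WORD_RANGES` constant are hypothetical names for this example, not the crate's actual code:

```rust
/// Byte ranges an ASCII word-boundary check needs to tell apart:
/// digits, upper case, underscore, lower case.
const ASCII_WORD_RANGES: [(u8, u8); 4] =
    [(b'0', b'9'), (b'A', b'Z'), (b'_', b'_'), (b'a', b'z')];

/// `true` at index `b` means "an equivalence class ends at byte `b`".
struct ByteClassBoundaries([bool; 256]);

impl ByteClassBoundaries {
    fn new() -> Self {
        ByteClassBoundaries([false; 256])
    }

    /// Records that bytes in `start..=end` must be distinguishable
    /// from the bytes on either side of the range.
    fn set_range(&mut self, start: u8, end: u8) {
        assert!(start <= end);
        if start > 0 {
            self.0[start as usize - 1] = true;
        }
        self.0[end as usize] = true;
    }

    /// Maps every byte to its class number. Consecutive bytes share a
    /// class unless a boundary marker separates them.
    fn classes(&self) -> [u8; 256] {
        let mut out = [0u8; 256];
        let mut class = 0u8;
        for b in 0..256usize {
            out[b] = class;
            if b < 255 && self.0[b] {
                class += 1;
            }
        }
        out
    }
}

/// Builds the classes for the word ranges, optionally adding the
/// ASCII/non-ASCII split described above.
fn word_boundary_classes(split_ascii: bool) -> [u8; 256] {
    let mut set = ByteClassBoundaries::new();
    for &(start, end) in ASCII_WORD_RANGES.iter() {
        set.set_range(start, end);
    }
    if split_ascii {
        // The fix: force a class boundary between 0x7F and 0x80.
        set.set_range(0x00, 0x7F);
    }
    set.classes()
}
```

In this sketch, `word_boundary_classes(false)` puts `{` (0x7B) and 0x80 in the same class, because nothing after `z` forces a split. With `word_boundary_classes(true)`, no ASCII byte shares a class with any non-ASCII byte.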

[1] - BurntSushi/ripgrep#1860

BurntSushi merged commit 036ce80 into master May 1, 2021

BurntSushi added a commit to BurntSushi/ripgrep that referenced this pull request May 1, 2021

deps: update to regex 1.5.2

3f4c418

This brings in a performance bug fix, merged in rust-lang/regex#768. Fixes #1860.

BurntSushi deleted the ag/fix-dfa-word-boundary-false-starts branch May 1, 2021 11:46

BurntSushi mentioned this pull request May 1, 2021

Slower search with combination word boundary and multiple regex on certain files BurntSushi/ripgrep#1860

Closed

BurntSushi mentioned this pull request Apr 8, 2022

partition ASCII and non-ASCII byte classes when a Unicode word boundary is used #652

Closed
