remove regex plugin + rollup + chores #436

Merged
BurntSushi merged 15 commits into master from ag/misc-fixes
Dec 30, 2017

Conversation

BurntSushi
Member

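This PR:

* Removes the regex compiler plugin. It's been broken for quite some time and nobody has seemed to notice. It's time for it to go. See commit cc7b00c for details.
* Sets up a Cargo workspace for this repo.
* Updates deps in various places. This includes updating simd to `0.2.1`, which fixes a build failure on Rust nightly.
* Names the frequency analysis based memchr search "freqy packed."
* Rolls up the other open PRs #401, #410 and #433.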

These are dev dependencies, so we don't need to worry about the minimum
Rust version supported.

The latest update to rand requires a newer version of Rust. Since it's
a dev dependency, we shouldn't need to do a semver bump when updating
rand. However, CI needs to be told not to run tests. Instead, we merely
check that we can build the crate and produce documentation.

The regex_macros crate hasn't been maintained in quite some time, and has
been broken. Nobody has complained. Given the fact that there are no
immediate plans to improve the situation, and the fact that it is slower
than the runtime engine, we simply remove it.

The 0.2.1 release of simd includes a fix so that it can compile on the
latest nightly.

We needn't worry about semver here because simd is a nightly-only
dependency.

BurntSushi and others added 10 commits December 30, 2017 14:50
There are a few sub-crates in this repository, so sharing a target
directory makes sense.

This updates dependencies and makes sure everything compiles and runs.
This also simplifies the build script.

Principally, this updates docopt to 0.8, which replaces rustc-serialize
with serde.

This commit tweaks the heuristic employed to determine whether to use TBM (Tuned Boyer-Moore)
or not. For the most part, the heuristic was tweaked by combining the
actual benchmark results with a bit of hand waving. In particular, the
primary change here is that the frequency rank cutoff is no longer a
constant, but rather, a function of the pattern length. That is, we guess
that TBM will do well with longer patterns, even if they contain somewhat
infrequent bytes. We do put a constant cap on this heuristic. That is,
regardless of the length of the pattern, if a "very rare" byte is found
in the pattern, then we won't use TBM.
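
To make the shape of this heuristic concrete, here is a minimal Rust sketch. It is not the code from this commit: the function name `should_use_tbm`, the constant values, and the toy `toy_freq_rank` table are assumptions chosen for illustration (only the name `LEN_CUTOFF_PROPORTION` appears in this thread). A higher rank means a more common byte.

```rust
use std::cmp;

/// Toy stand-in for a byte frequency rank table (assumption, not real data).
/// Higher values mean the byte is assumed to be more common.
fn toy_freq_rank(b: u8) -> usize {
    match b {
        b'a'..=b'z' | b' ' => 250,
        b'0'..=b'9' => 200,
        b'A'..=b'Z' => 180,
        _ => 0, // treated as "very rare"
    }
}

/// Hypothetical decision function: should a TBM-style searcher be used?
fn should_use_tbm(pattern: &[u8]) -> bool {
    // All four constants are illustrative guesses.
    const MIN_LEN: usize = 9; // patterns this short or shorter never use TBM
    const MIN_CUTOFF: usize = 150; // constant cap: any rarer byte rules TBM out
    const MAX_CUTOFF: usize = 255; // the highest possible rank
    const LEN_CUTOFF_PROPORTION: usize = 4; // how fast the cutoff relaxes with length

    // The cutoff is a function of the pattern length: longer patterns
    // tolerate somewhat less frequent bytes, but never below MIN_CUTOFF.
    let scaled = pattern.len().saturating_mul(LEN_CUTOFF_PROPORTION);
    let cutoff = cmp::max(MIN_CUTOFF, MAX_CUTOFF - cmp::min(MAX_CUTOFF, scaled));

    pattern.len() > MIN_LEN && pattern.iter().all(|&b| toy_freq_rank(b) >= cutoff)
}
```

With these toy values, a 10-digit pattern is rejected (the cutoff is 215 and digits rank 200), while a 20-digit pattern is accepted (the cutoff drops to 175). A pattern containing a byte ranked below 150 is rejected no matter how long it is.
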
As far as I can tell, nobody has actually described a substring search
algorithm that used both frequency analysis and vector instructions.
So I'm naming it.
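
As a rough illustration of the idea (frequency analysis driving a fast byte scan), here is a simplified Rust sketch. It is an assumption-laden toy, not the implementation named here: `find_by_rarest_byte` and `toy_rank` are hypothetical names, the rank values are made up, and the plain `position` scan stands in for the step where a vectorized byte search would be used.

```rust
/// Toy frequency rank (assumption): higher means more common.
fn toy_rank(b: u8) -> u8 {
    match b {
        b' ' => 255,
        b'a'..=b'z' => 200,
        b'0'..=b'9' => 150,
        b'A'..=b'Z' => 100,
        _ => 10,
    }
}

/// Find `needle` in `haystack` by scanning for the needle's rarest byte
/// and verifying the full needle around each hit.
fn find_by_rarest_byte(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    if needle.len() > haystack.len() {
        return None;
    }
    // Frequency analysis: pick the byte we expect to see least often.
    let (rare_off, &rare) = needle
        .iter()
        .enumerate()
        .min_by_key(|&(_, &b)| toy_rank(b))?;

    let last_start = haystack.len() - needle.len();
    let scan_end = last_start + rare_off; // last index where the rare byte can sit
    let mut pos = rare_off; // first index where the rare byte can sit
    while pos <= scan_end {
        // Scalar stand-in for a vectorized single-byte search.
        let hit = pos + haystack[pos..=scan_end].iter().position(|&b| b == rare)?;
        let start = hit - rare_off;
        if &haystack[start..start + needle.len()] == needle {
            return Some(start);
        }
        pos = hit + 1;
    }
    None
}
```

For example, `find_by_rarest_byte(b"hello Zebra world", b"Zebra")` scans for `Z` (the lowest-ranked byte in the needle under the toy table) and returns `Some(6)`.
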
4fab6c added the current bench runner script as `benches/run`, and
removed the old `run-bench` script. It was later renamed to `bench/run`
when `benches` was renamed to `bench` in b217bf. This patch fixes a few
references to the old benchmark runner in the hacking guide as well
as a few references to the old directory structure. The cargo plugin
syntax in the example is also updated.
The DFA can't produce captures, but is still faster than the Pike VM
NFA, so the normal approach to finding capture groups is to look for
the entire match with the DFA and then run the NFA on the substring
of the input that matched. In cases where the regex is anchored, the
match always starts at the beginning of the input, so there is never
any point to trying the DFA first.

The DFA can still be useful for rejecting inputs which are not in the
language of the regular expression, but anchored regexes with capture
groups are most commonly used in a parsing context, so it seems like a
fair trade-off.

Fixes #348
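
The control flow described in this commit message can be sketched as follows. This is a hypothetical model, not the crate's internals: `find_captures` is an invented name, and the two engines are passed in as closures so the example stays self-contained. `dfa_find` is assumed to return the overall match bounds, and `nfa_caps` is assumed to return capture group spans relative to the slice it is given.

```rust
/// Hypothetical capture search that skips the DFA for start-anchored regexes.
fn find_captures<D, N>(
    input: &str,
    anchored_start: bool,
    dfa_find: D,
    nfa_caps: N,
) -> Option<Vec<(usize, usize)>>
where
    D: Fn(&str) -> Option<(usize, usize)>,
    N: Fn(&str) -> Option<Vec<(usize, usize)>>,
{
    if anchored_start {
        // The match can only start at offset 0, so the DFA cannot narrow
        // down where the match begins. Go straight to the NFA.
        return nfa_caps(input);
    }
    // Normal path: the DFA finds the overall match, then the NFA
    // recovers the capture groups on just the matched substring.
    let (start, end) = dfa_find(input)?;
    let caps = nfa_caps(&input[start..end])?;
    // Translate spans from substring offsets back to input offsets.
    Some(caps.into_iter().map(|(s, e)| (s + start, e + start)).collect())
}
```

The trade-off mentioned above is visible here: in the anchored branch, an input that does not match is rejected by the slower NFA rather than by the DFA.
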
@BurntSushi
Member Author

cc @ethanpailes Note that in commit 392b3d6 I tweaked the TBM heuristic a little bit.

@BurntSushi
Member Author

@bors r+

@bors
Contributor

bors commented Dec 30, 2017

📌 Commit 4152e18 has been approved by BurntSushi

@bors
Contributor

bors commented Dec 30, 2017

⌛ Testing commit 4152e18 with merge f3425da...

bors added a commit that referenced this pull request Dec 30, 2017
remove regex plugin + rollup + chores

This PR:

* Removes the regex compiler plugin. It's been broken for quite some time and nobody has seemed to notice. It's time for it to go. See commit cc7b00c for details.
* Sets up a Cargo workspace for this repo.
* Updates deps in various places. This includes updating simd to `0.2.1`, which fixes a build failure on Rust nightly.
* Names the frequency analysis based memchr search "freqy packed."
* Rolls up the other open PRs #401, #410 and #433.
@BurntSushi
Member Author

@bors r+

@bors
Contributor

bors commented Dec 30, 2017

📌 Commit 5ea594e has been approved by BurntSushi

@bors
Contributor

bors commented Dec 30, 2017

⌛ Testing commit 5ea594e with merge 5ca056d...

bors added a commit that referenced this pull request Dec 30, 2017
remove regex plugin + rollup + chores

This PR:

* Removes the regex compiler plugin. It's been broken for quite some time and nobody has seemed to notice. It's time for it to go. See commit cc7b00c for details.
* Sets up a Cargo workspace for this repo.
* Updates deps in various places. This includes updating simd to `0.2.1`, which fixes a build failure on Rust nightly.
* Names the frequency analysis based memchr search "freqy packed."
* Rolls up the other open PRs #401, #410 and #433.
@BurntSushi
Member Author

@bors r-

@bors
Contributor

bors commented Dec 30, 2017

💔 Test failed - status-travis

@BurntSushi BurntSushi merged commit 2b4ac35 into master Dec 30, 2017
@BurntSushi BurntSushi deleted the ag/misc-fixes branch December 30, 2017 20:37
@ethanpailes
Contributor

@BurntSushi, I know I'm a bit late to the party, but that new heuristic looks great. Thanks for doing the LEN_CUTOFF_PROPORTION stuff! I had been planning on trying to find a good value for something like that, but now it is already done!

5 participants