Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Support is addef for lazy builtin matchers (with a separately compiled file), as well as loading json or yaml files using lazy matchers.
Lazy matchers are very much a tradeoff: they improve import speed, but slow down run speed, possibly dramatically.
Use them by default for the re2 parser, but not the basic parser: experimentally, on Python 3.11
the eager matchers have a significant overhead, however running the bench on the sample file, they cause a runtime increase of 700~800ms on the basic parser bench, as that ends up instantiating every regex (likely due to match failures). Relatively this is not huge (~2.5%), but the tradeoff doesn't seem great, especially since the parser itself is initialized lazily.
The re2 parser does much better, only losing 20~30ms (~1%), this is likely because it only needs to compile a fraction of the regexes (156 out of 1162 as of regexes.yaml version 0.18), and possibly because it gets to avoid some of the most expensive to compile ones.
TODO:
turns out the eagerly compiled regex likely consume a bunch of memory,
the literal strings are likely shared but the compiled regex definitely are not, could have a shared cache but the use case of loading multiple builtin sets in actual production seems unlikely