-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(stdlib): add parse_etld
function
#669
Conversation
This adds `parse_etld` function to stdlib, based on `psl` crate Fixes: vectordotdev#651
This is currently using psl crate for performance, but there is also publicsuffix crate, recommended by the same author, for cases when dynamic list is needed. I could switch it to that crate if it would make more sense. Also, I was also considering adding this to |
Very happy to see this feature. I suspect it will have a better chance of approval if it also had documentation in the main docs repository as part of the PR. |
Thank you @esensar, this looks good!
Let's switch to
👍 |
Alright, makes sense, although that might complicate things a bit. With that crate we need to download the list in runtime, so I need to figure out when to do it and how and when to cache it. I will see if I can update the PR tomorrow, but let me know if you have ideas on how I should approach that. |
Actually, this is a good point. This introduces complexity. Let me take a closer look soon. |
@esensar on a second thought, this is preferable to the alternative 👍 |
}, | ||
Example { | ||
title: "parse etld with plus parts", | ||
source: r#"parse_etld!("vector.dev", plus_parts: 1)"#, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if by mistake we had plus_parts: 10
here? Can you add a test for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is actually a test for that, named short_plus
, but it uses plus_parts: 3
when only 2 are available. I will change it to 10 for better visibility and clarity, but this makes me think if behavior in that case is right. It will just silently return whatever is available (so "vector.dev" in the above case). Would it be better to somehow let it be known in the results that it was not able to get all of these parts, since this way it would require some additional checks on the callers end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but this makes me think if behavior in that case is right.
Some options to consider:
- return the max available parts
- return an error
I am leaning towards 1 which I believe is the current behavior but let me know if you have further thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with case 1. That should cover the general case and keep things simple.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @esensar!
Could we add these new functions to the benchmarks in |
Missed in #669 Signed-off-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>
Missed in #669 Signed-off-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>
There is a PR open for that update and if I understood correctly, it will be merged after new VRL version is released, to be up to date with the latest available version. |
I didn't understand the method by which those pages are created, and it seems that you've already done this work - sorry for the confusion. |
Missed in vectordotdev#669 This also adds `known_suffix` to type definition of `parse_etld` result, that was missed in original PR.
Missed in vectordotdev#669 This also adds `known_suffix` to type definition of `parse_etld` result, that was missed in original PR.
Missed in #669 This also adds `known_suffix` to type definition of `parse_etld` result, that was missed in original PR.
* docs(vrl): add documentation for `parse_etld` function Related: vectordotdev/vrl#669 * Allow etld in spellchecker * Fix formatting of `parse_etld.cue` --------- Co-authored-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>
Sorry for bringing this topic back, but turns out custom PSL is needed. I was thinking about how to implement it and since these are not really big, it would probably be okay to keep them in memory. I thought of maybe implementing it in a similar way to how We can also have a separate function for that "dynamic" approach, so we don't have to remove this one from |
@pront not sure if you saw this. What do you think about this? |
Some context here: the PSL data set doesn't quite summarize things as desired, and is not helping in our goal of reducing/aggregating data to the degree that we want to be fed out of edge nodes. I suspect almost all users of the PSL list have different interest levels of depth into the DNS tree that they wish to aggregate, based on different zone cuts. Many of the domains (voluntarily submitted to the PSL) are much "deeper" in the domain set than normally one would want to go for summarization. For example, if I'm trying to find out counters for every domain that is a registry-listed domain (or at least a close approximation.) So in the simplest case, the desire is to count how many events are inside of "amazonaws.com" as an example aggregation where we look at ETLD+1 (.com plus one zone cut to the left.) However, the PSL list shows that things like "execute-api.ap-east-1.amazonaws.com" are listed, which means that we see every zone cut underneath that zone. While they are not "ETLD+1", they act as if they are if the list is consumed without editing. Needless to say, what we wanted to have as a fairly compact set of namespace has turned out to be an exploding cardinality issue, with probably 50% of the elements in the namespace seen as un-necessarily collected based on our experience so far. To solve this problem, we'd like to edit the PSL on our own, so that Vector uses a customized version. This doesn't seem too difficult - we're just hoping to have a URL or filepath that allows the PSL data to specified on launch. This changes very infrequently, so it does not need to be dynamic, but it does need to allow a custom file to be identified. @pront I know this is an edge case, but it is fairly close to the heart of the concept of having the ETLD function and making it useful. Customization of data files like this I hope is consistent with Vector's focus on giving operators highly-editable functionality at the edge of the network. In our case, this turns into billions (hundreds of billions?) of events per day for us that we could condense, so it does have a significant result if implemented. |
We're kind of blocked here, and we hope that the path we're going to take isn't going to create problems with getting the patches accepted. |
👋 This thread fell through the cracks, can you summarize what is the proposed change here? Do you want add custom PSL support using enrichment tables? |
Thanks for the bump @johnhtodd . I agree with @pront that the proposal isn't entirely clear in that I'm not sure what UX you are expecting. I could imagine something like the feature to load aliases from a file that was added to
Is that something like what you were expecting? It seems like a reasonable enhancement to me, if so. I'd suggest we limit it to local files for now rather than opening the door to loading from, for example, URLs as suggested in #669 (comment) since Vector/VRL doesn't have a precedent for that. |
I think we should still keep the default psl if a custom one is not provided. |
Yeah, I guess file is okay for most cases, the URL option doesn't really offer much more, since the file can be downloaded before configuring anyways. I agree with the proposed API, but I think it would be good to move the actual file to configuration and refer to PSL by name, because that way we can parse it once on startup and then just use it, but that means that the function would have to move to I am not sure if that is alright with you, since we would have to remove it from
Agreed on that one. We can just use the library with pre-generated PSL for that I guess. |
You can actually do this while leaving it as an argument by requiring the argument to be static and so it can be loaded only once at start-time. You can see an example of this with #194. |
Oh, that makes sense, it is probably the best solution to do that then. I will then make a PR to |
This adds an optional argument `psl` to `parse_etld`function, which should be a file path to a custom PSL list (default one can be found on https://publicsuffix.org/list/public_suffix_list.dat). If none is provided, the default is used. Related: vectordotdev#669
* feat(stdlib): add optional `psl` argument to `parse_etld` This adds an optional argument `psl` to `parse_etld`function, which should be a file path to a custom PSL list (default one can be found on https://publicsuffix.org/list/public_suffix_list.dat). If none is provided, the default is used. Related: #669 * Add changelog entry * Remove needless variables from `parse_etld` VRL tests * Fix vrl tests
This adds
parse_etld
function to stdlib, based onpsl
crateFixes: #651