Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pkg/ottl] Support for extracting UserAgent string #32434

Closed
michalpristas opened this issue Apr 16, 2024 · 13 comments
Closed

[pkg/ottl] Support for extracting UserAgent string #32434

michalpristas opened this issue Apr 16, 2024 · 13 comments
Assignees
Labels
enhancement New feature or request pkg/ottl

Comments

@michalpristas
Copy link
Contributor

michalpristas commented Apr 16, 2024

Component(s)

pkg/ottl

Is your feature request related to a problem? Please describe.

The intended converter extracts details from the user agent string a browser sends with its web requests into User Agent SemConv attributes.

Describe the solution you'd like

Example:

Input:

{
  "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}

Result

{
  "user_agent": {
      "name": "Chrome",
      "original": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
      "version": "51.0.2704.103",
  }
}

Describe alternatives you've considered

No response

Additional context

No response

@michalpristas michalpristas added enhancement New feature or request needs triage New item requiring triage labels Apr 16, 2024
Copy link
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@TylerHelmuth
Copy link
Member

@michalpristas I suspect this is doable via regular expression, and any converter would be doing regex internally. Can this be accomplished with ExtractPatterns?

@TylerHelmuth TylerHelmuth added discussion needed Community discussion needed and removed enhancement New feature or request needs triage New item requiring triage labels Apr 17, 2024
@michalpristas
Copy link
Contributor Author

it can, the reason i'm bringing this separately is that in our pipelines this is a processor very commonly used and having to play with regular expressions may be overwhelming a bit.
also we extract a bit more informations out of user agent string which do not have SemConv counterparts just yet. but i believe it will be on par over time. information like user_agent.device.name or user_agent.os.name

@felixbarny
Copy link
Contributor

I don't think it would be feasible to do user agent string extraction just with the ExtractPatterns converter. Elasticsearch's user agent processor maintains thousands of lines of regexes that sometimes (but rarely) need updates. The upstream of that file is in https://github.com/ua-parser/uap-core. Also, caching is essential to ensure decent performance. It avoids having to parse the same user agent string over and over again for different events by caching 1000 user agent strings in an LRU cache, by default.

There's a go library that we could potentially re-use for this: https://github.com/ua-parser/uap-go.

As user agent parsing yields multiple values, I'm not sure whether OTTL or a separate processor is the right place for user agent string parsing. I think it would be neat if it's possible to build a log parsing pipeline purely in OTTL, including UA parsing and other things that may yield multiple values but I'm not sure what the guideline and the scope of OTTL is.

To me, user agent string parsing feels an essential building block that should be available to users out of the box, one way or another.

@pchila
Copy link
Contributor

pchila commented Jul 3, 2024

Hello, I am interested in working on this, if you are looking for a volunteer (it would be my first contribution to OTel 😄 )

@rogercoll
Copy link
Contributor

+1 to adding this function to OTTL.
My use case: I was building an Otel collector configuration to parse the logs from an NGINX Ingress Controller (log format) using the filelog receiver + the transform processor. As the log has a standard format, the transform processor uses the ExtractPatterns function to get them (http.request.status_code, remote.address, etc.). The issue is that the UserAgent field comes as unstructured plain text, and due to its vast amount of formats (e.g. operating system names) it is very complex to correctly extract every possible subfield using regular expressions.

@andrzej-stencel
Copy link
Member

Assigned the issue to you @pchila 👍

@evan-bradley
Copy link
Contributor

Thanks for volunteering to work on this @pchila. I'd suggest waiting until the discussion needed label is removed before continuing work on this to prevent unnecessary work from any design changes that come out of additional discussion.

I'm okay with adding a function like this given we have an implementation that parses a standard format user agent strings (I think RFC 9910 is the official source right now) into a standard map-like structure, such as the attributes provided by semconv. I think following something close to what we have for the URL Converter that was recently added makes sense.

@felixbarny That's a good note that caching would be helpful here to improve performance. I think we should save that in a follow-up after this function is implemented to keep the each PR small and to ensure we get caching right.

@pchila
Copy link
Contributor

pchila commented Jul 9, 2024

Thank you @evan-bradley , I will have a look at what is done for URL Converter as for the output structure...
As a first PR I would target having a simple implementation without caching and maybe using https://github.com/ua-parser/uap-go as mentioned by @felixbarny (we can always switch the implementation in the future if we want) along with unit tests and some simple go benchmarks (so we can evaluate better performance impacts of caching and future changes in the implementation).

I will comment here sketching out the input/output of the function before starting implementation.

@pchila
Copy link
Contributor

pchila commented Jul 10, 2024

So, I had a look at what's been implemented for the URL Converter and a User-Agent parser is very similar, so I would implement the first iteration using https://github.com/ua-parser/uap-go (to keep the PR small while still providing some functionality) and then map the output to the relevant semconv attributes .
I think this should be enough as a first PR introducing the function, we can iterate over that with caching/swapping out internals/mapping additional attributes in follow-up PRs...

@TylerHelmuth would such first implementation would be ok in your opinion or we still need more details/clarification?

@TylerHelmuth
Copy link
Member

When we first started discussing this function we didn't have https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/ottl/ottlfuncs/README.md#adding-new-editorsconverters in place. I think this function does meet the acceptance guidelines since it is a significant user experience improvement and potentially a performance improvement.

I am worried about the resulting semantic convention attributes we'd be producing not being stable. When they stabilize the function could break. We'll want to clearly document the current semantic convention version it is following. Maybe the semantic convention version to generate should be an optional param.

@TylerHelmuth TylerHelmuth added enhancement New feature or request and removed discussion needed Community discussion needed labels Jul 18, 2024
andrzej-stencel pushed a commit that referenced this issue Aug 21, 2024
**Description:** <Describe what has changed.>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
Added a new ottl converter `UserAgent`: it parses an input string and
matches against a [set of known UA
regexes](https://github.com/ua-parser/uap-core/blob/master/regexes.yaml)
to correctly identify user agent and its version
**Link to tracking Issue:** #32434 

**Testing:** Unit tests, E2E tests

**Documentation:** <Describe the documentation added.> Added UserAgent
description in `pkg/ottl/ottlfuncs/README.md`

---------

Co-authored-by: Tyler Helmuth <12352919+TylerHelmuth@users.noreply.github.com>
@ycombinator
Copy link
Contributor

Hi @pchila @TylerHelmuth, are we good to close this issue now that #34172 has been merged or is there some more work to be done?

@felixbarny
Copy link
Contributor

Are there any follow up enhancements that we're planning? For example, adding caching or supporting more attributes, like the ones the Elasticsearch user_agent supports?

f7o pushed a commit to f7o/opentelemetry-collector-contrib that referenced this issue Sep 12, 2024
**Description:** <Describe what has changed.>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->
Added a new ottl converter `UserAgent`: it parses an input string and
matches against a [set of known UA
regexes](https://github.com/ua-parser/uap-core/blob/master/regexes.yaml)
to correctly identify user agent and its version
**Link to tracking Issue:** open-telemetry#32434 

**Testing:** Unit tests, E2E tests

**Documentation:** <Describe the documentation added.> Added UserAgent
description in `pkg/ottl/ottlfuncs/README.md`

---------

Co-authored-by: Tyler Helmuth <12352919+TylerHelmuth@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request pkg/ottl
Projects
None yet
Development

No branches or pull requests

8 participants