[pkg/ottl] Support for extracting UserAgent string #32434

michalpristas · 2024-04-16T12:01:25Z

Component(s)

pkg/ottl

Is your feature request related to a problem? Please describe.

The intended converter extracts details from the user agent string a browser sends with its web requests into User Agent SemConv attributes.

Describe the solution you'd like

Example:

Input:

{
  "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}

Result:

{
  "user_agent": {
    "name": "Chrome",
    "original": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
    "version": "51.0.2704.103"
  }
}

Describe alternatives you've considered

No response

Additional context

No response

github-actions · 2024-04-16T12:01:42Z

Pinging code owners:

pkg/ottl: @TylerHelmuth @kentquirk @bogdandrutu @evan-bradley

See Adding Labels via Comments if you do not have permissions to add labels yourself.

TylerHelmuth · 2024-04-17T03:48:14Z

@michalpristas I suspect this is doable via regular expression, and any converter would be doing regex internally. Can this be accomplished with ExtractPatterns?

michalpristas · 2024-04-17T06:23:52Z

It can. The reason I'm bringing this up separately is that this processor is very commonly used in our pipelines, and having to work with regular expressions may be a bit overwhelming.
We also extract a bit more information out of the user agent string that does not have SemConv counterparts just yet, such as user_agent.device.name or user_agent.os.name, but I believe it will be on par over time.
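
To illustrate the regular-expression approach discussed above, here is a minimal Go sketch that pulls a name and version out of a user agent string with named capture groups. It is not the ExtractPatterns implementation; the pattern `chromeRE` and the helper `extractNamed` are hypothetical names chosen for this example, and the pattern only recognizes Chrome-style tokens.

```go
package main

import (
	"fmt"
	"regexp"
)

// chromeRE is an illustrative pattern (an assumption for this sketch).
// It only recognizes "Chrome/<version>" tokens.
var chromeRE = regexp.MustCompile(`(?P<name>Chrome)/(?P<version>\d+(?:\.\d+)*)`)

// extractNamed returns the named capture groups of the first match,
// or nil when the pattern does not match at all.
func extractNamed(re *regexp.Regexp, s string) map[string]string {
	m := re.FindStringSubmatch(s)
	if m == nil {
		return nil
	}
	out := make(map[string]string)
	for i, name := range re.SubexpNames() {
		if name != "" { // skip the whole-match entry and unnamed groups
			out[name] = m[i]
		}
	}
	return out
}

func main() {
	ua := "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
	fmt.Println(extractNamed(chromeRE, ua))
	fmt.Println(extractNamed(chromeRE, "curl/7.81.0"))
}
```

A single pattern like this is easy to write but covers one browser family only. Other Chromium-based browsers also carry a `Chrome/...` token in their user agent strings, so telling them apart requires additional, ordered patterns, which is where the maintenance burden comes from.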

felixbarny · 2024-04-22T14:18:37Z

I don't think it would be feasible to do user agent string extraction just with the ExtractPatterns converter. Elasticsearch's user agent processor maintains thousands of lines of regexes that sometimes (but rarely) need updates. The upstream of that file is in https://github.com/ua-parser/uap-core. Also, caching is essential to ensure decent performance. It avoids having to parse the same user agent string over and over again for different events by caching 1000 user agent strings in an LRU cache, by default.

There's a go library that we could potentially re-use for this: https://github.com/ua-parser/uap-go.

As user agent parsing yields multiple values, I'm not sure whether OTTL or a separate processor is the right place for user agent string parsing. I think it would be neat if it's possible to build a log parsing pipeline purely in OTTL, including UA parsing and other things that may yield multiple values but I'm not sure what the guideline and the scope of OTTL is.

To me, user agent string parsing feels like an essential building block that should be available to users out of the box, one way or another.
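
The caching idea mentioned above can be sketched in Go as a small LRU cache placed in front of an expensive parse function. This is only an illustration under stated assumptions: `parsedUA`, `uaCache`, and `newUACache` are hypothetical names, the parser is a stand-in, and the sketch is not safe for concurrent use without an added mutex.

```go
package main

import (
	"container/list"
	"fmt"
)

// parsedUA is a hypothetical result type for this sketch.
type parsedUA struct {
	Name, Version string
}

// entry is what each list element stores, so eviction can find the map key.
type entry struct {
	key string
	val parsedUA
}

// uaCache is a minimal LRU cache keyed by the raw user agent string.
type uaCache struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // user agent string -> list element
	parse    func(string) parsedUA    // the expensive parser being cached
}

func newUACache(capacity int, parse func(string) parsedUA) *uaCache {
	return &uaCache{
		capacity: capacity,
		order:    list.New(),
		items:    make(map[string]*list.Element),
		parse:    parse,
	}
}

// Get returns the cached result for ua, parsing and storing it on a miss.
func (c *uaCache) Get(ua string) parsedUA {
	if el, ok := c.items[ua]; ok {
		c.order.MoveToFront(el) // mark as recently used
		return el.Value.(*entry).val
	}
	val := c.parse(ua)
	c.items[ua] = c.order.PushFront(&entry{key: ua, val: val})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back() // least recently used
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
	return val
}

func main() {
	calls := 0
	// Stand-in parser that only counts how often it is invoked.
	cache := newUACache(1000, func(ua string) parsedUA {
		calls++
		return parsedUA{Name: ua}
	})
	cache.Get("curl/7.81.0")
	cache.Get("curl/7.81.0")
	fmt.Println("parser calls:", calls)
}
```

The capacity of 1000 in `main` mirrors the default mentioned in the comment above; the right size for a collector pipeline would depend on how many distinct user agent strings it sees.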

pchila · 2024-07-03T13:01:27Z

Hello, I am interested in working on this, if you are looking for a volunteer (it would be my first contribution to OTel 😄 )

rogercoll · 2024-07-08T07:17:40Z

+1 to adding this function to OTTL.
My use case: I was building an OTel Collector configuration to parse the logs from an NGINX Ingress Controller (log format) using the filelog receiver plus the transform processor. As the log has a standard format, the transform processor uses the ExtractPatterns function to extract the fields (http.request.status_code, remote.address, etc.). The issue is that the UserAgent field comes as unstructured plain text, and because of the vast number of formats it can take (e.g. operating system names), it is very complex to correctly extract every possible subfield using regular expressions.

andrzej-stencel · 2024-07-08T13:48:18Z

Assigned the issue to you @pchila 👍

evan-bradley · 2024-07-08T14:21:01Z

Thanks for volunteering to work on this @pchila. I'd suggest waiting until the discussion needed label is removed before continuing work on this to prevent unnecessary work from any design changes that come out of additional discussion.

I'm okay with adding a function like this, given we have an implementation that parses standard-format user agent strings (I think RFC 9110 is the official source right now) into a standard map-like structure, such as the attributes provided by semconv. I think following something close to what we have for the URL Converter that was recently added makes sense.

@felixbarny That's a good note that caching would be helpful here to improve performance. I think we should save that for a follow-up after this function is implemented, to keep each PR small and to ensure we get caching right.

pchila · 2024-07-09T05:12:02Z

Thank you @evan-bradley, I will have a look at what is done for the URL Converter regarding the output structure.
As a first PR I would target a simple implementation without caching, possibly using https://github.com/ua-parser/uap-go as mentioned by @felixbarny (we can always switch the implementation in the future if we want), along with unit tests and some simple Go benchmarks (so we can better evaluate the performance impact of caching and future changes in the implementation).

I will comment here sketching out the input/output of the function before starting implementation.

pchila · 2024-07-10T12:29:40Z

So, I had a look at what's been implemented for the URL Converter, and a User-Agent parser is very similar. I would implement the first iteration using https://github.com/ua-parser/uap-go (to keep the PR small while still providing some functionality) and then map the output to the relevant semconv attributes.
I think this should be enough for a first PR introducing the function; we can iterate on it with caching, swapping out internals, or mapping additional attributes in follow-up PRs.

@TylerHelmuth would such a first implementation be OK in your opinion, or do we still need more details/clarification?
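
The mapping step described above could look roughly like the following Go sketch. It is not the code from the eventual PR: `parseResult` is a hypothetical stand-in for whatever the parsing library returns, `toSemConv` is a name invented here, and leaving out empty values is an assumption of this example rather than a documented behavior.

```go
package main

import "fmt"

// parseResult is a hypothetical stand-in for the parsing library's output.
type parseResult struct {
	Family  string // e.g. "Chrome"
	Version string // e.g. "51.0.2704.103"
}

// toSemConv maps a parse result to user_agent.* attribute keys.
// Empty fields are omitted (an assumption made for this sketch).
func toSemConv(original string, r parseResult) map[string]string {
	attrs := map[string]string{"user_agent.original": original}
	if r.Family != "" {
		attrs["user_agent.name"] = r.Family
	}
	if r.Version != "" {
		attrs["user_agent.version"] = r.Version
	}
	return attrs
}

func main() {
	attrs := toSemConv("curl/7.81.0", parseResult{Family: "curl", Version: "7.81.0"})
	fmt.Println(attrs)
}
```

Keeping the mapping in one small function like this would make it easier to add further attributes in follow-up PRs without touching the parsing internals.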

TylerHelmuth · 2024-07-18T22:53:21Z

When we first started discussing this function we didn't have https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/pkg/ottl/ottlfuncs/README.md#adding-new-editorsconverters in place. I think this function does meet the acceptance guidelines since it is a significant user experience improvement and potentially a performance improvement.

I am worried about the resulting semantic convention attributes we'd be producing not being stable. When they stabilize the function could break. We'll want to clearly document the current semantic convention version it is following. Maybe the semantic convention version to generate should be an optional param.

**Description:** Added a new OTTL converter `UserAgent`: it parses an input string and matches it against a [set of known UA regexes](https://github.com/ua-parser/uap-core/blob/master/regexes.yaml) to correctly identify the user agent and its version.

**Link to tracking Issue:** #32434

**Testing:** Unit tests, E2E tests

**Documentation:** Added UserAgent description in `pkg/ottl/ottlfuncs/README.md`

Co-authored-by: Tyler Helmuth <12352919+TylerHelmuth@users.noreply.github.com>

ycombinator · 2024-08-28T00:08:01Z

Hi @pchila @TylerHelmuth, are we good to close this issue now that #34172 has been merged or is there some more work to be done?

felixbarny · 2024-08-28T06:50:43Z

Are there any follow-up enhancements that we're planning? For example, adding caching or supporting more attributes, like the ones the Elasticsearch user_agent processor supports?

michalpristas added enhancement New feature or request needs triage New item requiring triage labels Apr 16, 2024

github-actions bot added the pkg/ottl label Apr 16, 2024

TylerHelmuth added discussion needed Community discussion needed and removed enhancement New feature or request needs triage New item requiring triage labels Apr 17, 2024

michalpristas mentioned this issue Apr 22, 2024

OTTL additional converters #31930

Open

26 tasks

ycombinator assigned andrzej-stencel and unassigned andrzej-stencel Jun 21, 2024

andrzej-stencel assigned pchila Jul 8, 2024

TylerHelmuth added enhancement New feature or request and removed discussion needed Community discussion needed labels Jul 18, 2024

pchila mentioned this issue Jul 19, 2024

[pkg/ottl] add User Agent parsing #34172

Merged

TylerHelmuth closed this as completed Aug 28, 2024
