-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix default hostname/port not being used when mixing http and uds configuration + improve warning message #3037
Fix default hostname/port not being used when mixing http and uds configuration + improve warning message #3037
Conversation
…figuration **What does this PR do?**: This PR fixes a corner case bug introduced in #2806 when we added support for configuration unix domain socket (UDS) via the `DD_TRACE_AGENT_URL` environment variable. Specifically, when a mix of http and uds configuration was specified, we printed a warning and used the http configuration instead. But, if only the hostname or only the port + uds configuration was provided, we did not correctly use the default hostname and default port as a replacement. That is: * `DD_AGENT_HOST=10.128.186.61 DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket` resulted in `hostname="10.128.186.61", port=nil` * `DD_TRACE_AGENT_PORT=1234 DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket` resulted in `hostname=nil, port=1234` This led to a failure to communicate with the Datadog agent in these situations. **Motivation**: Fix issue when communicating with the agent. **Additional Notes**: We caught this issue while testing internal apps. On the plus side, we did not get any customer reports for this issue, so hopefully nobody was bitten by it. **How to test the change?**: Change includes test coverage. You can also compare the output of the `AgentSettingsResolver` before/after the issue was fixed: ``` # Before $ DD_AGENT_HOST=10.128.186.61 DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket bundle exec ruby -e "require 'ddtrace'; pp(Datadog::Core::Configuration::AgentSettingsResolver.call(Datadog.configuration))" #<struct Datadog::Core::Configuration::AgentSettingsResolver::AgentSettings adapter=:net_http, ssl=false, hostname="10.128.186.61", port=nil, uds_path=nil, timeout_seconds=nil, deprecated_for_removal_transport_configuration_proc=nil> $ DD_TRACE_AGENT_PORT=1234 DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket bundle exec ruby -e "require 'ddtrace'; pp(Datadog::Core::Configuration::AgentSettingsResolver.call(Datadog.configuration))" #<struct Datadog::Core::Configuration::AgentSettingsResolver::AgentSettings adapter=:net_http, ssl=false, hostname=nil, port=1234, uds_path=nil, timeout_seconds=nil, deprecated_for_removal_transport_configuration_proc=nil> # After $ DD_AGENT_HOST=10.128.186.61 DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket bundle exec ruby -e "require 'ddtrace'; pp(Datadog::Core::Configuration::AgentSettingsResolver.call(Datadog.configuration))" #<struct Datadog::Core::Configuration::AgentSettingsResolver::AgentSettings adapter=:net_http, ssl=false, hostname="10.128.186.61", port=8126, uds_path=nil, timeout_seconds=nil, deprecated_for_removal_transport_configuration_proc=nil> $ DD_TRACE_AGENT_PORT=1234 DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket bundle exec ruby -e "require 'ddtrace'; pp(Datadog::Core::Configuration::AgentSettingsResolver.call(Datadog.configuration))" #<struct Datadog::Core::Configuration::AgentSettingsResolver::AgentSettings adapter=:net_http, ssl=false, hostname="127.0.0.1", port=1234, uds_path=nil, timeout_seconds=nil, deprecated_for_removal_transport_configuration_proc=nil> ```
The `configured_hostname` and `configured_port` don't include the defaults, so the resulting message was a bit confusing when defaults were picked for them. By using `hostname` and `port` instead, we make sure the warning message shows the values that will be picked. (I did notice a bit of a weird recursion where calling `hostname` and `port` does call back into `mixed_http_and_uds?` via `should_use_uds?`, but it's fine because we already set the `@mixed_http_and_uds` before doing so. It is a bit awkward but... I feel in a way this is reflecting the actual complexity of all these priorities and overrides and whatnot, so I'll leave it to someone else to try to untangle a bit better in the future.)
`uds_path` vs `parsed_url` is similar to `hostname` vs `configured_hostname`: The `uds_path` is the actual final value that's going to be included in the `AgentSettingsResolver` output, so it will be `nil` when `mixed_http_and_uds?` is true. In contrast, `parsed_url` contains the configured value which caused the configuration mismatch, and so that's the correct one to print.
**What does this PR do?**: This PR tweaks the `HttpTransport` error logging to include the configuration being used. For instance, it will now look like: ``` ERROR (lib/datadog/profiling/http_transport.rb:58:in `export') Failed to report profiling data ({:agent=>"unix:///fake-apm-socket.socket"}): failed ddog_prof_Exporter_send: error trying to connect: No such file or directory (os error 2): No such file or directory (os error 2) ``` ...I did notice the libdatadog error message seems to be repeated and isn't particularly great; that's a separate issue for a future fix (on the libdatadog side). **Motivation**: While investigating #3037 I noticed this information was not being logged when the profiler failed to report, and I thought it would be come in handy when debugging future issues. **Additional Notes**: N/A **How to test the change?**: The change includes test coverage. You can also manually trigger an issue by running dd-trace-rb with an invalid configuration, e.g. the above error message was triggered by running with ```bash $ DD_PROFILING_ENABLED=true DD_TRACE_AGENT_URL=unix:///fake-apm-socket.socket \ bundle exec ddtracerb exec ruby -e "sleep 1" ```
From technical point of view, this PR improves greatly the internal consistent of the agent resolver. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this!
Replying to the Specifically, it's only in this bizarre corner case where conflicting configuration gets added (supplying both http and uds configuration) that there's a divergence. The divergence right now is a bit more subtle, because it's not about Thus, setting Usually, so far, we've been giving priority to any settings via code. So for this case, we'd either deprioritize settings-via-code vs a uds Since we've talked about this a bit via slack, and Tony approved this, and this PR is preventing us from seeing data on some of the test apps, I'm going to go ahead and press the merge button. If we do decide to go in a separate direction, I'm perfectly happy with a revert -- my intention is getting it to work, so any way we can do it is fine by me :) |
What does this PR do?:
This PR fixes a corner case bug introduced in #2806 when we added support for configuration unix domain socket (UDS) via the
DD_TRACE_AGENT_URL
environment variable.Specifically, when a mix of http and uds configuration was specified, we printed a warning and used the http configuration instead.
But, if only the hostname or only the port + uds configuration was provided, we did not correctly use the default hostname and default port as a replacement.
That is:
DD_AGENT_HOST=10.128.186.61 DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket
resulted inhostname="10.128.186.61", port=nil
DD_TRACE_AGENT_PORT=1234 DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket
resulted inhostname=nil, port=1234
This led to a failure to communicate with the Datadog agent in these situations.
Motivation:
Fix issue when communicating with the agent.
Additional Notes:
We caught this issue while testing internal apps. On the plus side, we did not get any customer reports for this issue, so hopefully nobody was bitten by it.
Also included are a few tweaks to the warning message generated + tests for it. See individual commits for detail.
I've tested the fix in the KTG and can confirm data is showing up again for "candidate" releases.
How to test the change?:
Change includes test coverage. You can also compare the output of the
AgentSettingsResolver
before/after the issue was fixed: