Distributed tracing for Sidekiq #2513
spec/datadog/tracing/contrib/sidekiq/distributed_tracing_spec.rb
Is there any way I could help get this PR merged? @TonyCTHsu
👋 @sled, thanks for offering to help. I hesitated to continue with this because it is difficult to make sense of distributed tracing for asynchronous processes in the UI. Distributed tracing works through context propagation, but an asynchronous process leaves a huge gap between the time a job is pushed onto a queue and the time a worker picks it up (a typical HTTP request/response cycle takes milliseconds, while this gap can last seconds or longer, depending on how long the job sits waiting in the queue). This gap takes up too much of the trace graph and makes the spans (the actual code execution) tiny and hard to read. Furthermore, what is the entire trace expected to look like when a job fails, is pushed back onto the queue, and is retried several times before it is considered dead? For asynchronous processes, the duration of a trace could easily extend to days or weeks. If you are interested in this feature, perhaps we could release it as an opt-in configuration. How does that sound?
@TonyCTHsu if this is just a display issue, maybe the Datadog UI team could have a look at it? The big gaps between enqueueing and execution of the job could be squished, for example. I think distributed tracing is an excellent fit for asynchronous systems like background jobs or event processing because it allows you to stitch together the whole picture starting from the initiator. Retries should also continue the original trace. A common scenario is a faulty application (v1) that enqueued jobs with wrong parameters. This gets fixed in v2, but you might still see job retries. With distributed tracing, you can easily trace those retried jobs back to the faulty version (v1) that originally enqueued them. How is this solved in other languages or frameworks, e.g. Java/Spring, Kafka, etc.?
👋 @sled, thanks for sharing! Tracers for other languages use synthetic spans to fill the gap, which does not fully address the UI issue, and Datadog is working on an alternative solution for asynchronous tracing instead of implementing distributed tracing. Since I don't know when the alternative solution will be available and your graph looks fine, I believe we should move forward, if this makes sense to you!
Hi, I wanted to ask if there's a timeline for this PR getting merged. Is there something missing? I tried it and at least for me it's looking good!
What does this PR do?
#2295
Implement distributed tracing for Sidekiq
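
For readers unfamiliar with the context propagation discussed in this thread, here is a minimal plain-Ruby sketch of the general idea. It is not the code in this PR. The payload keys (`sketch_trace_id`, `sketch_parent_id`), the `Span` struct, and both middleware classes are hypothetical names made up for illustration. Only the `call` argument shapes follow Sidekiq's client and server middleware conventions, and the span names are used purely as labels.

```ruby
require 'json'
require 'securerandom'

# Hypothetical payload keys, chosen for this sketch only.
TRACE_ID_KEY  = 'sketch_trace_id'
PARENT_ID_KEY = 'sketch_parent_id'

# Minimal stand-in for a tracer span (not the Datadog span class).
Span = Struct.new(:name, :trace_id, :span_id, :parent_id)

module Ids
  def self.generate
    SecureRandom.random_number(2**62) + 1
  end
end

# Client side: creates a "push" span under the active span and writes
# its identifiers into the job hash before the job is enqueued.
class InjectingClientMiddleware
  attr_reader :last_push

  def initialize(active_span_provider)
    @active_span_provider = active_span_provider
  end

  def call(_worker_class, job, _queue, _redis_pool)
    parent = @active_span_provider.call
    trace_id = parent ? parent.trace_id : Ids.generate
    @last_push = Span.new('sidekiq.push', trace_id, Ids.generate, parent&.span_id)
    job[TRACE_ID_KEY]  = @last_push.trace_id.to_s
    job[PARENT_ID_KEY] = @last_push.span_id.to_s
    yield
  end
end

# Worker side: reads the identifiers back from the job hash and starts
# the job span as a child of the push span. Without them, it starts a
# fresh trace.
class ExtractingServerMiddleware
  attr_reader :spans

  def initialize
    @spans = []
  end

  def call(_worker, job, _queue)
    trace_id  = job[TRACE_ID_KEY]
    parent_id = job[PARENT_ID_KEY]
    span =
      if trace_id && parent_id
        Span.new('sidekiq.job', Integer(trace_id), Ids.generate, Integer(parent_id))
      else
        Span.new('sidekiq.job', Ids.generate, Ids.generate, nil)
      end
    @spans << span
    yield
  end
end

# Simulated flow: a web request enqueues a job, the job hash makes a
# JSON round trip (as it would through Redis), and a worker runs it twice.
request_span = Span.new('rack.request', Ids.generate, Ids.generate, nil)
client = InjectingClientMiddleware.new(-> { request_span })
server = ExtractingServerMiddleware.new

job = { 'class' => 'HardJob', 'args' => [1] }
client.call('HardJob', job, 'default', nil) {}
job = JSON.parse(JSON.generate(job))

server.call(nil, job, 'default') {} # first execution
server.call(nil, job, 'default') {} # simulated retry with the same job hash
first_run, retry_run = server.spans
```

In this sketch, both executions resolve to the trace that started with the web request, because the identifiers travel inside the job hash. That is the property that would let a retried job be traced back to the code that originally enqueued it. It also makes the queue wait time part of the trace, which is the source of the gap described earlier in the conversation.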