Memory leak and memory bloat in resque processes when profiling is enabled (suspected to impact at least 0.54.2 and above) #2045
Thanks for the report. Is there anything you can share to help us replicate this? (Config, Gemfile etc?)
@delner 👍 I’ll prepare something I can share here.
I found something interesting when trying to come up with a minimal setup to share: what causes the leak is enabling profiling. So I can reproduce with this setup:

```ruby
Datadog.configure do |config|
  config.service = "bc3"
  config.env = Rails.env
  config.tracing.instrument :rails
  config.tracing.instrument :resque
  config.profiling.enabled = true
  config.tracing.enabled = true
end
```

But the problem goes away if I remove this line:

```ruby
config.profiling.enabled = true
```

As mentioned, we don't have the same problem with the previous version of the gem. I can't share our full setup, but hopefully this is enough to reproduce.
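One possible mitigation while this is investigated (a sketch, not an official recommendation; the `DD_PROFILING_OPT_OUT` environment variable name is made up for this example) is to gate `profiling.enabled` per process, so resque workers can opt out while other processes keep profiling:

```ruby
# Sketch only: DD_PROFILING_OPT_OUT is a hypothetical variable name chosen
# for this example; the Datadog.configure keys match the config shown above.
Datadog.configure do |config|
  config.service = "bc3"
  config.env = Rails.env
  config.tracing.instrument :rails
  config.tracing.instrument :resque
  # Disable profiling in processes that export DD_PROFILING_OPT_OUT=1
  config.profiling.enabled = ENV['DD_PROFILING_OPT_OUT'] != '1'
  config.tracing.enabled = true
end
```

Resque workers would then export `DD_PROFILING_OPT_OUT=1` in their launch environment, leaving web processes untouched.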
Hey 👋. I work on the profiler bits, so it seemed relevant to step in. Between 0.54.2 and 1.0.0 there were very few changes to the profiler, so perhaps that will make it easier to track down. 😅
There's one profiler setting worth toggling: it will disable a new feature we added in 1.0.0 (#1813), and we can check whether that's the culprit.
Hey @ivoanjo, it totally looks like that. After changing that setting, the problem went away.
Confirmed it's that feature, @ivoanjo.
Cool, glad we could narrow this down so quickly. This will help a lot in getting out a fix. Thanks again for the report and your help @jorgemanrubia!
+1, thanks @jorgemanrubia for the fast feedback on the experiment! It's good that we seem to have found the culprit, and hopefully this unblocks you to run dd-trace-rb 1.0.0 (and 1.1.0, just released!). I'd like to investigate this further. Could I ask you to open a ticket with support via https://docs.datadoghq.com/help/ linking to this issue, and include as much detail as possible?
Yep - production-affecting incident for us too.
@ivoanjo I just opened it (#821036) |
@rgrey thanks for the report. Are you seeing this issue in your resque workers as well, or is it unrelated to resque?
Currently in the process of checking whether this workaround works. We don't use resque. Our rake tasks are failing, and those with more experience of our codebase are looking further to try to understand/isolate.
Thanks for the extra info @rgrey. I'm investigating and attempting to reproduce the issue, but have not had luck so far. Any information you can share is very helpful. I would also like to know whether, for you, the issue also showed up on a 0.54 to 1.0/1.1 migration. If possible, I'll also ask that you open a ticket with support and mention this GitHub issue, so we can look into your profiles in more detail.
We’ve reproduced the memory issue on a staging console running an internal rake task with ddtrace 1.0.0. Memory usage slowly increased and the process died after half an hour or so. It still failed after adding the suggested profiler setting.

We also noticed that ddtrace 1.0.0 adds a dependency on the libddwaf gem. Its changelog describes all releases since 2020 as “unstable”, which doesn’t really inspire confidence.

A new rabbit hole on 1.1.0: I can’t see any documentation about this; it just tells you to contact support. Memory usage is growing again as well, so it looks like the issue isn’t fixed. Removing the profiler doesn't help — still leaking. However, having removed the rake integration from our configuration, memory stays flat. This does seem to fix the memory issue but, of course, isn't the solution!
Great, having a way to reproduce the issue really helps. To clarify:
Ok, this is quite interesting! This makes me think that your issue @rgrey may not be the same one that @jorgemanrubia is seeing, since disabling profiling seemed to help his setup.
Interesting! This will not contribute to the memory leak, but is annoying since it blocks you from using the profiler. I've created a separate issue to track and investigate this: #2068
I believe still having libddwaf marked as unstable is an oversight. I'll discuss this with my colleagues and I'll get back to you on this. Note that this dependency is lazy loaded and thus not even required unless the security-related features of ddtrace are turned on.
Thanks @ivoanjo - will confirm details with an engineer tomorrow re the tasks, but as noted we don't use resque. We will be looking for the ASM component, as we are heavily dependent on Sqreen and need to maintain continuity of protection by migrating to ASM while keeping Sqreen in play until 31 Mar 2023.
@rgrey Looking at the above, I seriously suspect the Rake instrumentation. If a long-running Rake task is instrumented (especially one that starts a worker process), it can incorrectly create one trace for the entire application. This is because traces are thread-bound: there's up to one per thread, and they release memory only after the original span finishes. If a Rake span opens on a Rake task that runs indefinitely, and the main thread of that Rake task continuously enters/exits instrumented code, there's a high likelihood the trace will grow until the max span limit is reached (100K spans), leaking memory along the way.

The best workaround is to disable Rake instrumentation on any long-running Rake tasks (I see you had something like this in your configuration). We do have some mechanisms that could be employed to manage trace context or trace individual tasks, but I wouldn't recommend them until I understand whether they would address your issue.

@rgrey Is there a simple scenario I can run on my side to reproduce this? It will help give me an idea of what to recommend or how we could change behavior. Thanks!
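The mechanism described above can be illustrated with a toy model (this is not dd-trace-rb's internal code; all names are invented for the sketch): spans accumulate in a thread-bound trace and are only released when the root span finishes, so a never-finishing root grows until the hard cap.

```ruby
# Toy model of a thread-bound trace. Spans accumulate until the root span
# finishes; a never-ending Rake task never finishes its root span, so the
# buffer grows toward the (illustrative) 100K hard cap, leaking memory.
class ToyTrace
  MAX_SPANS = 100_000

  def initialize
    @spans = []
    @root_finished = false
  end

  # Called every time instrumented code runs on this thread.
  def record_span(name)
    @spans << name unless @root_finished || @spans.size >= MAX_SPANS
  end

  # Only here is the accumulated memory released.
  def finish_root!
    @root_finished = true
    @spans.clear
  end

  def open_span_count
    @spans.size
  end
end
```

A long-running task that never calls `finish_root!` keeps every recorded span alive, which matches the slow, unbounded growth reported in this thread.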
libddwaf's "unstable" marker here means the API may change and have breaking changes on minor versions, while its codebase and resulting binaries are production-ready.

To fully clarify, libddwaf is an heir to libsqreen, the latter of which was semantically versioned using 0.x. To mark the break between libsqreen and libddwaf (which involved a lot of renaming and changes), we decided to bump the major version, but we nonetheless needed some time to stabilise the public API. In essence, libddwaf's versioning reflects that transition rather than production-readiness.

Since libddwaf should not be used directly and is wrapped by bindings such as libddwaf-rb, any such low-level C API change is handled by Datadog internally and isolated by the higher-level binding code, which aims to provide a much more stable, Ruby-oriented API. In any case, the libddwaf-rb dependency is directly consumed by the ddtrace gem, and should there be a breaking change in libddwaf-rb's API we would handle it as gracefully as technically possible at the ddtrace level, using ddtrace's gemspec dependency version constraints so that it picks only compatible versions of libddwaf-rb. For additional clarity, any pre-release-grade version of libddwaf-rb would be marked using pre-release gem version components.

I will create a PR to propose some changes to libddwaf's README to clarify the versioning scheme and API stability expectations. I hope that makes things clearer!
I've been running some more tests and found some interesting things:
Sorry, I initially associated the problem with upgrading to version 1.0 of the gem. I have updated the title to remove that. I think I observed inconsistent results because the problem is associated with exercising jobs in the queue. This time I used a script to keep enqueuing jobs in all my tests.
In production, with 1.1.0 without profiling, we have registered these long-running traces. Even if there is a technical explanation, I think Datadog should prevent the problem in those long-running processes. Or, if not-too-long-lived processes are required, that should be documented clearly. Thanks for the help with this one.
I would like to confirm: is the metric showing on the graph you shared the total memory usage across all resque processes/workers, or just for a single one?

Since you were able to confirm that the issue is not a regression, but something that is triggered on 0.54.2 as well, the next step would be to understand what exactly is getting leaked. Unfortunately dd-trace-rb can not (yet!) help there, since our profiler only supports CPU and wall-clock data and does not have any heap profiling capabilities (yet!).

Would you be able to share the analysis results of a Ruby heap dump of one such process? There are also some newly-released tools that can help with analyzing heap dumps.

I'm also available to jump on a live call so we can debug this together, if you're up for it. Note that we don't recommend using either of these approaches in a production environment.
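For reference, a minimal sketch of taking and summarizing a Ruby heap dump using the stdlib `objspace` extension (the helper names here are mine; the thread doesn't prescribe a specific tool):

```ruby
require 'objspace'
require 'json'

# Write a heap dump (one JSON object per line) to the given path.
def dump_heap(path)
  GC.start # collect garbage first so the dump reflects live objects
  File.open(path, 'w') do |io|
    ObjectSpace.dump_all(output: io)
  end
  path
end

# Summarize the dump: count objects per internal type to spot what is
# accumulating (e.g. lots of STRING or OBJECT entries).
def summarize_heap(path)
  counts = Hash.new(0)
  File.foreach(path) do |line|
    counts[JSON.parse(line)['type']] += 1
  end
  counts.sort_by { |_, n| -n }.first(10)
end
```

Dumps taken at two points in time can be diffed to see which object types grew; note that `ObjectSpace.dump_all` pauses the process while it writes, so use it with care on live workers.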
@jorgemanrubia This detail and comparison is helpful, thank you. We do have hard caps on trace & buffer size (100K spans) which are supposed to mitigate the impact of large traces. However, they may be set too high, or the criteria for triggering the cap may be too coarse-grained. We may want to re-evaluate this behavior. As an experiment, you could tweak the default cap via the tracer settings and see whether a lower limit bounds the memory growth.
Hey folks, I'll get more info about it this week. It's on my radar; I just need to allocate time for it.
👋 @jorgemanrubia and @rgrey, thank you for your patience here! I was trying to find any possible points where the tracer could accumulate memory, and this leads me back to your situation: the issue is likely a long-running Rake task. To be clear, any never-ending trace started by the Rake instrumentation will keep growing, since traces only release their spans when the root span finishes.

I've talked to our team, and we'll add selective Rake instrumentation by default: you'll have to explicitly declare which tasks need to be monitored, instead of all tasks being instrumented by default. Because instrumenting a never-ending Rake task can cause critical issues, we are not comfortable shipping instrumentation that, by default, can cause your application to run out of memory.

Until we have that in place, I recommend disabling the Rake instrumentation, if you haven't already. You can still trace individual tasks manually:

```ruby
# Ensure tracer is configured for Rake tasks, if not configured elsewhere
require 'ddtrace'
Datadog.configure { ... }

task :my_task do |t|
  # Using `rake.invoke` and the task name as the `resource` matches the current instrumentation.
  # Your monitors and alerts will work unmodified.
  Datadog::Tracing.trace('rake.invoke', resource: t.name) do
    # Task work goes here
  end
end
```

I'll also look into reducing the hard limit of 100K spans: it's practically useless to have such a large trace today. The amount of data in each Datadog App page is overwhelming and the flamegraph becomes illegible. I'll try to find a realistic hard limit based on our current client usage.

Overall, I'd like us to find an approach that proactively prevents never-ending traces without affecting otherwise correctly instrumented long traces.
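As a sketch of what selective Rake instrumentation could look like (hedged: the `tasks:` allowlist option follows dd-trace-rb's documented Rake integration style, but check the docs for your version; the task names here are made up):

```ruby
require 'ddtrace'

Datadog.configure do |c|
  # Only instrument the explicitly named tasks. Long-running tasks
  # (e.g. ones that start worker processes) are deliberately left out
  # of the allowlist so they never open a never-ending trace.
  c.tracing.instrument :rake, tasks: ['db:migrate', 'reports:nightly']
end
```

With an allowlist like this, a forgotten long-running task is simply untraced rather than silently leaking memory.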
This same thing happens for our Sidekiq workers. A single long-running worker that did a lot of queries and iterated over a lot of database objects caused this behaviour. After disabling profiling, the memory consumption curve stays flat. The memory graph illustrates what happens when the worker is started, restarted, then restarted with profiling disabled. I spent a lot of time today trying to pin down this issue, and it seems that somehow, with this setting enabled, Datadog holds on to a bunch of ActiveRecord objects (snippet from a memory dump using https://github.com/SamSaffron/memory_profiler). Running the suggested workaround made no difference.
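A small stdlib-only sketch (the helper name is mine) for double-checking a retention claim like the one above, i.e. whether instances of a suspect class survive garbage collection:

```ruby
# Count live instances of a class after forcing a GC pass.
# If this number keeps growing between checks, something is retaining them.
def live_instances(klass)
  GC.start
  ObjectSpace.each_object(klass).count
end
```

Calling this periodically from the worker (e.g. for `ActiveRecord::Base`) gives a cheap trend line without the cost of a full heap dump.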
Hey @erkie, sorry that you're running into problems with memory usage. May I ask you to open your report as a separate issue, so that we can investigate there? Unfortunately, since GitHub has no threading, it becomes harder to discuss potentially-different problems that may have the same outcome (memory usage). So that we can investigate, do you mind also including as much detail about your setup as possible?
I'm experiencing what looks like a similar problem, but with sidekiq. Opened #2930 for better threading of the conversation, but am linking here in case some other traveler finds it useful. |
We started experiencing similar issues after upgrading from ddtrace 0.54.2 to 1.0.0. It was primarily affecting a long-running data synchronisation process (implemented as a Rake task) that could allocate up to 8 GB (the maximum that we allowed) in 6-8 hours. There were also OOM errors in our Que workers that started happening, but I can't prove that they were related. After we upgraded to ddtrace 1.12.1 the issue was gone 🎉 I find it really funny that I spent the last week using Datadog to troubleshoot memory issues that were caused by Datadog 😆
Happy to hear that you're not impacted anymore 🎉 Please do open a ticket if you ever run into any issue -- we're glad to look into it :)
Thanks a lot for linking the issue here. This helped me as well.
Hey! We were reviewing old tickets and bumped into this one. There's been quite a number of releases/fixes to ddtrace since this was discussed, and for a few folks the problem appears to be resolved. @jorgemanrubia and anyone else who runs into any issue -- please do let us know by opening new issues if you see anything suspicious -- we definitely want to look into them 🙏 😄
When we upgraded to version 1.0, our resque processes started to leak memory and, sporadically, we registered some sudden multi-GB memory bloats. I disabled the `resque` and `active_job` instrumentation integrations, but it was still leaking, so I'm not really sure it's related to those specifically. The processes leaking memory were the resque workers for sure. Once reverted to the previous gem, things went back to normal. This was in a beta server that had barely any activity.