Memory leak and memory bloat in resque processes when profiling is enabled (suspected to impact at least 0.54.2 and above) #2045
Thanks for the report. Is there anything you can share to help us replicate this? (Config, Gemfile etc?)
@delner 👍 I’ll prepare something I can share here.
I found something interesting when trying to come up with a minimal setup to share: what causes the leak is enabling profiling. So I can reproduce with this setup:

```ruby
Datadog.configure do |config|
  config.service = "bc3"
  config.env = Rails.env
  config.tracing.instrument :rails
  config.tracing.instrument :resque
  config.profiling.enabled = true
  config.tracing.enabled = true
end
```

But the problem goes away if I remove this line:

```ruby
config.profiling.enabled = true
```

As mentioned, we don't have the same problem with the previous version of the gem. I can't share our full setup, but hopefully this is enough to reproduce.
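One possible mitigation while this is investigated (a sketch, not an official recommendation; the `DD_PROFILING_OPT_OUT` environment variable name is made up for this example) is to gate `profiling.enabled` per process, so resque workers can opt out while other processes keep profiling:

```ruby
# Sketch only: DD_PROFILING_OPT_OUT is a hypothetical variable name chosen
# for this example; the Datadog.configure keys match the config shown above.
Datadog.configure do |config|
  config.service = "bc3"
  config.env = Rails.env
  config.tracing.instrument :rails
  config.tracing.instrument :resque
  # Disable profiling in processes that export DD_PROFILING_OPT_OUT=1
  config.profiling.enabled = ENV['DD_PROFILING_OPT_OUT'] != '1'
  config.tracing.enabled = true
end
```

Resque workers would then export `DD_PROFILING_OPT_OUT=1` in their launch environment, leaving web processes untouched.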
Hey 👋. I work on the profiler bits, so it seemed relevant to step in. Between 0.54.2 and 1.0.0 there were very few changes to the profiler, so perhaps that will make it easier to track down. 😅
There's one profiler setting worth toggling: it will disable a new feature we added in 1.0.0 (#1813), and we can check whether that's the culprit.
Hey @ivoanjo, it totally looks like that. After changing that setting, the problem went away.
Confirmed it's that feature, @ivoanjo.
Cool, glad we could narrow this down so quickly. This will help a lot in getting out a fix. Thanks again for the report and your help @jorgemanrubia!
+1, thanks @jorgemanrubia for the fast feedback on the experiment! It's good that we seem to have found the culprit, and hopefully this unblocks you to run dd-trace-rb 1.0.0 (and 1.1.0, just released!). I'd like to investigate this further. Could I ask you to open a ticket with support via https://docs.datadoghq.com/help/ linking to this issue, and include as much detail as possible?
Yep - production-affecting incident for us too.
@ivoanjo I just opened it (#821036) |
@rgrey thanks for the report. Are you seeing this issue in your resque workers as well, or is it unrelated to resque?
Currently in the process of checking whether this workaround works. We don't use resque. Our rake tasks are failing, and those with more experience of our codebase are looking further to try to understand/isolate.
Thanks for the extra info @rgrey. I'm investigating and attempting to reproduce the issue, but have not had luck so far. Any information you can share is very helpful. I would also like to know whether, for you, the issue also showed up on a 0.54 to 1.0/1.1 migration. If possible, I'll also ask that you open a ticket with support and mention this GitHub issue, so we can look into your profiles in more detail.
We’ve reproduced the memory issue on a staging console running an internal rake task with ddtrace 1.0.0. Memory usage slowly increased and the process died after half an hour or so. It still failed after adding the suggested profiler setting.

We also noticed that ddtrace 1.0.0 adds a dependency on the libddwaf gem. Its changelog describes all releases since 2020 as “unstable”, which doesn’t really inspire confidence.

A new rabbit hole on 1.1.0: I can’t see any documentation about this; it just tells you to contact support. Memory usage is growing again as well, so it looks like the issue isn’t fixed. Removing the profiler doesn't help — still leaking. However, having removed the rake integration from our configuration, memory stays flat. This does seem to fix the memory issue but, of course, isn't the solution!
Great, having a way to reproduce the issue really helps. To clarify:
Ok, this is quite interesting! This makes me think that your issue @rgrey may not be the same one that @jorgemanrubia is seeing, since disabling profiling seemed to help his setup.
Interesting! This will not contribute to the memory leak, but is annoying since it blocks you from using the profiler. I've created a separate issue to track and investigate this: #2068
I believe still having libddwaf marked as unstable is an oversight. I'll discuss this with my colleagues and I'll get back to you on this. Note that this dependency is lazy loaded and thus not even required unless the security-related features of ddtrace are turned on.
Thanks @ivoanjo - will confirm details with an engineer tomorrow re the tasks, but as noted we don't use resque. We will be looking for the ASM component, as we are heavily dependent on Sqreen and need to maintain continuity of protection by migrating to ASM while keeping Sqreen in play until 31 Mar 2023.
@rgrey Looking at the above, I seriously suspect the Rake instrumentation. If a long-running Rake task is instrumented (especially one that starts a worker process), it can incorrectly create one trace for the entire application. This is because traces are thread-bound: there's up to one per thread, and they release memory only after the original span finishes. If a Rake span opens on a Rake task that runs indefinitely, and the main thread of that Rake task continuously enters/exits instrumented code, there's a high likelihood the trace will grow until the max span limit is reached (100K spans), leaking memory along the way.

The best workaround is to disable Rake instrumentation on any long-running Rake tasks (I see you had something like this in your configuration). We do have some mechanisms that could be employed to manage trace context or trace individual tasks, but I wouldn't recommend them until I understand whether they would address your issue.

@rgrey Is there a simple scenario I can run on my side to reproduce this? It will help give me an idea of what to recommend or how we could change behavior. Thanks!
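The mechanism described above can be illustrated with a toy model (this is not dd-trace-rb's internal code; all names are invented for the sketch): spans accumulate in a thread-bound trace and are only released when the root span finishes, so a never-finishing root grows until the hard cap.

```ruby
# Toy model of a thread-bound trace. Spans accumulate until the root span
# finishes; a never-ending Rake task never finishes its root span, so the
# buffer grows toward the (illustrative) 100K hard cap, leaking memory.
class ToyTrace
  MAX_SPANS = 100_000

  def initialize
    @spans = []
    @root_finished = false
  end

  # Called every time instrumented code runs on this thread.
  def record_span(name)
    @spans << name unless @root_finished || @spans.size >= MAX_SPANS
  end

  # Only here is the accumulated memory released.
  def finish_root!
    @root_finished = true
    @spans.clear
  end

  def open_span_count
    @spans.size
  end
end
```

A long-running task that never calls `finish_root!` keeps every recorded span alive, which matches the slow, unbounded growth reported in this thread.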
libddwaf's "unstable" marker here means the API may change and have breaking changes on minor versions, while its codebase and resulting binaries are production-ready.

To fully clarify, libddwaf is an heir to libsqreen, the latter of which was semantically versioned using 0.x. To mark the break between libsqreen and libddwaf (which involved a lot of renaming and changes), we decided to bump the major version, but we nonetheless needed some time to stabilise the public API. In essence, libddwaf's versioning reflects that transition rather than production-readiness.

Since libddwaf should not be used directly and is wrapped by bindings such as libddwaf-rb, any such low-level C API change is handled by Datadog internally and isolated by the higher-level binding code, which aims to provide a much more stable, Ruby-oriented API. In any case, the libddwaf-rb dependency is directly consumed by the ddtrace gem, and should there be a breaking change in libddwaf-rb's API we would handle it as gracefully as technically possible at the ddtrace level, using ddtrace's gemspec dependency version constraints so that it picks only compatible versions of libddwaf-rb. For additional clarity, any pre-release-grade version of libddwaf-rb would be marked using pre-release gem version components.

I will create a PR to propose some changes to libddwaf's README to clarify the versioning scheme and API stability expectations. I hope that makes things clearer!
I've been running some more tests and found some interesting things:
Sorry, I initially associated the problem with upgrading to version 1.0 of the gem. I have updated the title to remove that. I think I observed inconsistent results because the problem is associated with exercising jobs in the queue. This time I used a script to keep enqueuing jobs in all my tests.
In production, with 1.1.0 without profiling, we have registered these long-running traces. Even if there is a technical explanation, I think Datadog should prevent the problem in those long-running processes. Or, if not-too-long-lived processes are required, that should be documented clearly. Thanks for the help with this one.
I would like to confirm: is the metric showing on the graph you shared the total memory usage across all resque processes/workers, or just for a single one?

Since you were able to confirm that the issue is not a regression, but something that is triggered on 0.54.2 as well, the next step would be to understand what exactly is getting leaked. Unfortunately dd-trace-rb can not (yet!) help there, since our profiler only supports CPU and wall-clock data and does not have any heap profiling capabilities (yet!).

Would you be able to share the analysis results of a Ruby heap dump of one such process? There are also some newly-released tools that can help with analyzing heap dumps.

I'm also available to jump on a live call so we can debug this together, if you're up for it. Note that we don't recommend using either of these approaches in a production environment.
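For reference, a minimal sketch of taking and summarizing a Ruby heap dump using the stdlib `objspace` extension (the helper names here are mine; the thread doesn't prescribe a specific tool):

```ruby
require 'objspace'
require 'json'

# Write a heap dump (one JSON object per line) to the given path.
def dump_heap(path)
  GC.start # collect garbage first so the dump reflects live objects
  File.open(path, 'w') do |io|
    ObjectSpace.dump_all(output: io)
  end
  path
end

# Summarize the dump: count objects per internal type to spot what is
# accumulating (e.g. lots of STRING or OBJECT entries).
def summarize_heap(path)
  counts = Hash.new(0)
  File.foreach(path) do |line|
    counts[JSON.parse(line)['type']] += 1
  end
  counts.sort_by { |_, n| -n }.first(10)
end
```

Dumps taken at two points in time can be diffed to see which object types grew; note that `ObjectSpace.dump_all` pauses the process while it writes, so use it with care on live workers.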
@jorgemanrubia This detail and comparison is helpful, thank you. We do have hard caps on trace & buffer size (100K spans) which are supposed to mitigate the impact of large traces. However, they may be set too high, or the criteria for triggering the cap may be too coarse-grained. We may want to re-evaluate this behavior. As an experiment, you could tweak the default cap via the tracer settings and see whether a lower limit bounds the memory growth.
Hey folks, I'll get more info about it this week. It's on my radar; I just need to allocate time for it.
👋 @jorgemanrubia and @rgrey, thank you for your patience here! I was trying to find any possible points where the tracer could accumulate memory, and this leads me back to your situation: the issue is likely a long-running Rake task. To be clear, any never-ending trace started by the Rake instrumentation will keep growing, since traces only release their spans when the root span finishes.

I've talked to our team, and we'll add selective Rake instrumentation by default: you'll have to explicitly declare which tasks need to be monitored, instead of all tasks being instrumented by default. Because instrumenting a never-ending Rake task can cause critical issues, we are not comfortable shipping instrumentation that, by default, can cause your application to run out of memory.

Until we have that in place, I recommend disabling the Rake instrumentation, if you haven't already. You can still trace individual tasks manually:

```ruby
# Ensure tracer is configured for Rake tasks, if not configured elsewhere
require 'ddtrace'
Datadog.configure { ... }

task :my_task do |t|
  # Using `rake.invoke` and the task name as the `resource` matches the current instrumentation.
  # Your monitors and alerts will work unmodified.
  Datadog::Tracing.trace('rake.invoke', resource: t.name) do
    # Task work goes here
  end
end
```

I'll also look into reducing the hard limit of 100K spans: it's practically useless to have such a large trace today. The amount of data in each Datadog App page is overwhelming and the flamegraph becomes illegible. I'll try to find a realistic hard limit based on our current client usage.

Overall, I'd like us to find an approach that proactively prevents never-ending traces without affecting otherwise correctly instrumented long traces.
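As a sketch of what selective Rake instrumentation could look like (hedged: the `tasks:` allowlist option follows dd-trace-rb's documented Rake integration style, but check the docs for your version; the task names here are made up):

```ruby
require 'ddtrace'

Datadog.configure do |c|
  # Only instrument the explicitly named tasks. Long-running tasks
  # (e.g. ones that start worker processes) are deliberately left out
  # of the allowlist so they never open a never-ending trace.
  c.tracing.instrument :rake, tasks: ['db:migrate', 'reports:nightly']
end
```

With an allowlist like this, a forgotten long-running task is simply untraced rather than silently leaking memory.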
This same thing happens for our Sidekiq workers. A single long-running worker that did a lot of queries and iterated over a lot of database objects caused this behaviour. After disabling profiling, the memory consumption curve stays flat. The memory graph illustrates what happens when the worker is started, restarted, then restarted with profiling disabled. I spent a lot of time today trying to pin down this issue, and it seems that somehow, with this setting enabled, Datadog holds on to a bunch of ActiveRecord objects (snippet from a memory dump using https://github.com/SamSaffron/memory_profiler). Running the suggested workaround made no difference.
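A small stdlib-only sketch (the helper name is mine) for double-checking a retention claim like the one above, i.e. whether instances of a suspect class survive garbage collection:

```ruby
# Count live instances of a class after forcing a GC pass.
# If this number keeps growing between checks, something is retaining them.
def live_instances(klass)
  GC.start
  ObjectSpace.each_object(klass).count
end
```

Calling this periodically from the worker (e.g. for `ActiveRecord::Base`) gives a cheap trend line without the cost of a full heap dump.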
Hey @erkie, sorry that you're running into problems with memory usage. May I ask you to open your report as a separate issue, so that we can investigate there? Unfortunately, since GitHub has no threading, it becomes harder to discuss potentially-different problems that may have the same outcome (memory usage). So that we can investigate, do you mind also including as much detail about your setup as possible?
I'm experiencing what looks like a similar problem, but with sidekiq. Opened #2930 for better threading of the conversation, but am linking here in case some other traveler finds it useful. |
We started experiencing similar issues after upgrading from ddtrace 0.54.2 to 1.0.0. It was primarily affecting a long-running data synchronisation process (implemented as a Rake task) that could allocate up to 8 GB (the maximum that we allowed) in 6-8 hours. There were also OOM errors in our Que workers that started happening, but I can't prove that they were related. After we upgraded to ddtrace 1.12.1 the issue was gone 🎉 I find it really funny that I spent the last week using Datadog to troubleshoot memory issues that were caused by Datadog 😆
Happy to hear that you're not impacted anymore 🎉 Please do open a ticket if you ever run into any issue -- we're glad to look into it :)
Thanks a lot for linking the issue here. This helped me as well.
Hey! We were reviewing old tickets and bumped into this one. There's been quite a number of releases/fixes to ddtrace since this was discussed, and for a few folks the problem appears to be resolved. @jorgemanrubia and anyone else who runs into any issue -- please do let us know by opening new issues if you see anything suspicious -- we definitely want to look into them 🙏 😄
When we upgraded to version 1.0, our resque processes started to leak memory and, sporadically, we registered some sudden multi-GB memory bloats. I disabled the `resque` and `active_job` instrumentation integrations, but it was still leaking, so I'm not really sure it's related to those specifically. The processes leaking memory were the resque workers for sure. Once reverted to the previous gem, things went back to normal. This was in a beta server that had barely any activity.