Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

One semaphore to exclusively load them all #550

Merged
merged 1 commit into from
Feb 5, 2019

Conversation

jrafanie
Copy link
Member

@jrafanie jrafanie commented Jan 28, 2019

https://bugzilla.redhat.com/show_bug.cgi?id=1671458
Fixes #544

From MartinH's findings in #544, we had one thread at the
cspec[:klass].constantize line while another thread was trying to run
klass.columns_hash, causing a
Circular dependency detected while autoloading... error.

I then tried a bunch of things until I found a way to reliably recreate
this error:

require_relative 'config/environment'

threads = []
4.times do
  threads << Thread.new { Api::Environment.time_attributes }
end
threads.collect(&:join)

I could then run the above script several times in my shell and get the
circular dependency error most of the time.

for x in `seq 1 10`; do; bundle exec ruby test.rb; done

With this test in place, I then tried a few solutions:

  1. Move the klass.columns_hash block into the permit_concurrent_loads
    block:
diff --git a/lib/api/environment.rb b/lib/api/environment.rb
index 87b34f99..f4b6554a 100644
--- a/lib/api/environment.rb
+++ b/lib/api/environment.rb
@@ -22,9 +22,10 @@ module Api
         # Temporary measure to avoid thread race condition which could lead to a deadlock
         ActiveSupport::Dependencies.interlock.permit_concurrent_loads do
           klass = cspec[:klass].constantize
-        end
-        klass.columns_hash.each do |name, typeobj|
-          result << name if %w(date datetime).include?(typeobj.type.to_s)
+
+          klass.columns_hash.each do |name, typeobj|
+            result << name if %w(date datetime).include?(typeobj.type.to_s)
+          end
         end
       end
     end

This did not fix the circular dependency error. Perhaps
permit_concurrent_loads doesn't handle arbitrarily deep nested autoloads
cross threads?

  1. I tried Mutex#synchronize and this worked, but I'd rather we work
    with the interlock provided by rails.
diff --git a/lib/api/environment.rb b/lib/api/environment.rb
index 87b34f99..f51de73f 100644
--- a/lib/api/environment.rb
+++ b/lib/api/environment.rb
@@ -1,5 +1,6 @@
 module Api
   class Environment
+    ONE_AUTOLOADER_LOCK = Mutex.new
     def self.url_attributes
       @url_attributes ||= Set.new(%w(href))
     end
@@ -19,12 +20,13 @@ module Api
         next if cspec[:klass].blank?
         klass = nil

-        # Temporary measure to avoid thread race condition which could lead to a deadlock
-        ActiveSupport::Dependencies.interlock.permit_concurrent_loads do
+        # Ensure we're the only thread trying to autoload classes and their columns
+        ONE_AUTOLOADER_LOCK.synchronize do
           klass = cspec[:klass].constantize
-        end
-        klass.columns_hash.each do |name, typeobj|
-          result << name if %w(date datetime).include?(typeobj.type.to_s)
+
+          klass.columns_hash.each do |name, typeobj|
+            result << name if %w(date datetime).include?(typeobj.type.to_s)
+          end
         end
       end
     end

  1. I tried Sync.new with SH and EX locks instead of Mutex and this
    failed with Thread killed errors.

  2. I changed the permit_concurrent_loads on the interlock to loading
    and this worked, but Yuri found this caused a deadlock.

  3. Use loading from 4) and move the columns_hash call into
    the loading block. This solves the circular dependency and avoids the
    deadlock encountered in 4).

I noticed, permit_concurrent_loads calls yield_shares with compatible: [:load])
and that method has this in the source code comment:

https://github.com/rails/rails/blob/bb22fe9d4a6102d2a28cb1adfd6fe9d38fc9bb22/activesupport/lib/active_support/concurrency/share_lock.rb#L166-L168

  Temporarily give up all held Share locks while executing the
  supplied block, allowing any +compatible+ exclusive lock request
  to proceed.

Perhaps, since we're loading code, permit_concurrent_loads is too
permissive of other exclusive lock requests and we really need to ensure
nothing else is trying to load.

@jrafanie
Copy link
Member Author

Note, this whole problem goes away if we autoload this memoized method once at boot before puma worker threads are spawned.

@miq-bot miq-bot added the wip label Jan 28, 2019
Copy link
Member

@skateman skateman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Seal of Approval
Tested, looks good 👍

@jrafanie jrafanie closed this Jan 29, 2019
@jrafanie jrafanie reopened this Jan 29, 2019
@jrafanie
Copy link
Member Author

@skateman @himdel Updated to use the interlock but to request an exclusive lock while we're autoloading. Was there a BZ associated with this bug or just #544?

@jrafanie jrafanie changed the title [WIP] One semaphore to load them all [WIP] One semaphore to exclusively load them all Jan 30, 2019
@skateman
Copy link
Member

@jrafanie no BZ, but we might need one .. the bug can appear in older versions as well.

@jrafanie jrafanie changed the title [WIP] One semaphore to exclusively load them all One semaphore to exclusively load them all Jan 31, 2019
@jrafanie
Copy link
Member Author

@miq-bot rm_label wip
@miq-bot add_label gaprindashvili/yes
@miq-bot add_label hammer/yes
@miq-bot add_label bug

@jrafanie jrafanie changed the title One semaphore to exclusively load them all [WIP] One semaphore to exclusively load them all Jan 31, 2019
@jrafanie
Copy link
Member Author

Hold off on merging, @yrudman found that this seems to cause the first click after login to spin indefinitely, possibly a deadlock...

@miq-bot add_label wip

@miq-bot miq-bot added the wip label Jan 31, 2019
@miq-bot
Copy link
Member

miq-bot commented Jan 31, 2019

Checked commit jrafanie@67a74c5 with ruby 2.3.3, rubocop 0.52.1, haml-lint 0.20.0, and yamllint 1.10.0
1 file checked, 0 offenses detected
Everything looks fine. 🏆

result << name if %w(date datetime).include?(typeobj.type.to_s)
klass.columns_hash.each do |name, typeobj|
result << name if %w(date datetime).include?(typeobj.type.to_s)
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@skateman
Copy link
Member

I feel bad that I don't understand the whole rails code reloading process well enough to fix something like this 😞

# Temporary measure to avoid thread race condition which could lead to a deadlock
ActiveSupport::Dependencies.interlock.permit_concurrent_loads do
# Ensure we're the only thread trying to autoload classes and their columns
ActiveSupport::Dependencies.interlock.loading do
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, this also works:

        Rails.application.executor.wrap do
...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fryguy do you prefer the executor wrapper? I can research more tomorrow if that makes sense over the interlock.loading.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the executor wrapper seemed like a better approach.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha, executor.wrap deadlocks if I hit compute -> cloud after login 🤣 ... back to .loading

@jrafanie
Copy link
Member Author

@miq-bot rm_label wip

Ok, we fixed the issue @yrudman found. The call to columns_hash hits our ar_virtual modification to load_schema! here, which calls define_virtual_delegate which loads the class here, so for now, we need to protect the columns_hash call too.

@miq-bot miq-bot changed the title [WIP] One semaphore to exclusively load them all One semaphore to exclusively load them all Jan 31, 2019
@miq-bot miq-bot removed the wip label Jan 31, 2019
@jrafanie
Copy link
Member Author

Ok, we fixed the issue @yrudman found. The call to columns_hash hits our ar_virtual modification to load_schema! here, which calls define_virtual_delegate which loads the class here, so for now, we need to protect the columns_hash call too.

I need to circle around to @kbrock and @NickLaMuro regarding the load_schema! monkey patch to see if there's another way to avoid the constantizing dynamically in load_schema!

@Fryguy
Copy link
Member

Fryguy commented Feb 4, 2019

@kbrock @NickLaMuro When @jrafanie went through this, we came to the conclusion that vanilla Rails' built-in columns_hash (and thus load_schema) never goes through the autoload process, whereas after ar_virtual is introduced, load_schema is changed to autoload some things. This is sort of the fundamental problem.

Since load_schema happens during the code require process, the const_missing ends up causing a cascading require. In a single-thread this is fine for some reason, but in multi-thread, you end up with a collision which manifests as a circular dep (I don't think it's actually circular but instead a shared "did I see this yet" list shows a class as already loaded - but this all gets really confusing to follow)

@kbrock
Copy link
Member

kbrock commented Feb 5, 2019

Not sure if it was documented elsewhere but...

The issue is with virtual attribute's delegate_attribute.
A delegate defines an attribute based upon an attribute in another model.
So when you are loading one schema, it calls another model's schema_load.
Thus entering race condition.

It is the type_for_attribute [here] that calls load_schema! on another class

@kbrock
Copy link
Member

kbrock commented Feb 5, 2019

load_schema does not happen during the class load. It happens because we class_hash.
The issue is the class_hash in one model is causing a class_hash in another model.

A work around could be for us to explicitly state the type of the attribute delegated. That way loading one class is not dependent upon other classes.

One interesting note, has_one does not load the referencing class.

@jrafanie
Copy link
Member Author

jrafanie commented Feb 5, 2019

Still testing this...

@jrafanie jrafanie changed the title One semaphore to exclusively load them all [WIP] One semaphore to exclusively load them all Feb 5, 2019
https://bugzilla.redhat.com/show_bug.cgi?id=1671458
Fixes ManageIQ#544

From MartinH's findings in ManageIQ#544, we had one thread at the
`cspec[:klass].constantize` line while another thread was trying to run
`klass.columns_hash`, causing a
`Circular dependency detected while autoloading...` error.

I then tried a bunch of things until I found a way to reliably recreate
this error:

```ruby
require_relative 'config/environment'

threads = []
4.times do
  threads << Thread.new { Api::Environment.time_attributes }
end
threads.collect(&:join)
```

I could then run the above script several times in my shell and get the
`circular dependency` error most of the time.

```
for x in `seq 1 10`; do; bundle exec ruby test.rb; done
```

With this test in place, I then tried a few solutions:

1) Move the `klass.columns_hash` block into the permit_concurrent_loads
block:

```
diff --git a/lib/api/environment.rb b/lib/api/environment.rb
index 87b34f99..f4b6554a 100644
--- a/lib/api/environment.rb
+++ b/lib/api/environment.rb
@@ -22,9 +22,10 @@ module Api
         # Temporary measure to avoid thread race condition which could lead to a deadlock
         ActiveSupport::Dependencies.interlock.permit_concurrent_loads do
           klass = cspec[:klass].constantize
-        end
-        klass.columns_hash.each do |name, typeobj|
-          result << name if %w(date datetime).include?(typeobj.type.to_s)
+
+          klass.columns_hash.each do |name, typeobj|
+            result << name if %w(date datetime).include?(typeobj.type.to_s)
+          end
         end
       end
     end
```

This did not fix the `circular dependency` error. Perhaps
`permit_concurrent_loads` doesn't handle arbitrarily deep nested autoloads
cross threads?

2) I tried Mutex#synchronize and this worked, but I'd rather we work
with the interlock provided by rails.

```
diff --git a/lib/api/environment.rb b/lib/api/environment.rb
index 87b34f99..f51de73f 100644
--- a/lib/api/environment.rb
+++ b/lib/api/environment.rb
@@ -1,5 +1,6 @@
 module Api
   class Environment
+    ONE_AUTOLOADER_LOCK = Mutex.new
     def self.url_attributes
       @url_attributes ||= Set.new(%w(href))
     end
@@ -19,12 +20,13 @@ module Api
         next if cspec[:klass].blank?
         klass = nil

-        # Temporary measure to avoid thread race condition which could lead to a deadlock
-        ActiveSupport::Dependencies.interlock.permit_concurrent_loads do
+        # Ensure we're the only thread trying to autoload classes and their columns
+        ONE_AUTOLOADER_LOCK.synchronize do
           klass = cspec[:klass].constantize
-        end
-        klass.columns_hash.each do |name, typeobj|
-          result << name if %w(date datetime).include?(typeobj.type.to_s)
+
+          klass.columns_hash.each do |name, typeobj|
+            result << name if %w(date datetime).include?(typeobj.type.to_s)
+          end
         end
       end
     end

```

3) I tried Sync.new with SH and EX locks instead of Mutex and this
failed with Thread killed errors.

4) I changed the `permit_concurrent_loads` on the interlock to `loading`
and this worked, but Yuri found this caused a deadlock.

5) Use `AS::Dep.interlock.loading` from 4) and move the `columns_hash` call into
the `loading` block.  This solves the circular dependency and avoids the
deadlock encountered in 4).

I noticed, `permit_concurrent_loads` calls `yield_shares` with `compatible: [:load])`
and that method has this in the source code comment:

https://github.com/rails/rails/blob/bb22fe9d4a6102d2a28cb1adfd6fe9d38fc9bb22/activesupport/lib/active_support/concurrency/share_lock.rb#L166-L168

```
  Temporarily give up all held Share locks while executing the
  supplied block, allowing any +compatible+ exclusive lock request
  to proceed.
```

Perhaps, since we're loading code, `permit_concurrent_loads` is too
permissive of other exclusive lock requests and we really need to ensure
nothing else is trying to load.
@jrafanie jrafanie changed the title [WIP] One semaphore to exclusively load them all One semaphore to exclusively load them all Feb 5, 2019
@jrafanie
Copy link
Member Author

jrafanie commented Feb 5, 2019

It turns out Rails.application.executor.wrap do still deadlocks. Oh well, 🤷‍♂️

@Fryguy Fryguy merged commit e75c9e0 into ManageIQ:master Feb 5, 2019
@Fryguy Fryguy added this to the Sprint 105 Ending Feb 18, 2019 milestone Feb 5, 2019
@Fryguy Fryguy self-assigned this Feb 5, 2019
simaishi pushed a commit that referenced this pull request Feb 6, 2019
@simaishi
Copy link
Contributor

simaishi commented Feb 6, 2019

Hammer backport details:

$ git log -1
commit b96fac3780976f91b26f286f14cfc308660b0632
Author: Jason Frey <fryguy9@gmail.com>
Date:   Tue Feb 5 15:33:17 2019 -0500

    Merge pull request #550 from jrafanie/ensure_one_autoloader_thread
    
    One semaphore to exclusively load them all
    
    (cherry picked from commit e75c9e08e7ccfe9dbd5c2b687303b6d981d4fd03)
    
    https://bugzilla.redhat.com/show_bug.cgi?id=1673039

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Very first request sometimes causes Circular dependency error on ConversionHost
7 participants