
in_tail: Prevent IOHandler.on_notify() being called concurrently #1876

Merged

Conversation

fujimotos
Member

This patch is a fix for a multi-threading bug in Fluentd core. Apparently, this bug
has been one of the major causes of the long-standing test failures on Travis CI.

Problem

Essentially, there is a concurrency bug in the implementation of in_tail.

Although IOHandler (in_tail#L673) is not designed to be thread-safe, its method
on_notify can be called by multiple threads concurrently.

Here is how it occurs:

  1. TailInput calls on_notify on detach/shutdown (main thread)
  2. At the same time, StatWatcher triggers on_notify on filesystem events
    (eventloop thread)

In this situation, one thread can see data that has been retrieved (and is still being
processed) by another thread, which results in duplicated data records.

Solution

This patch fixes the issue by introducing a mutex, which prevents two
threads from entering the critical section simultaneously.
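Roughly, the fix has the shape sketched below. This is only an illustration: the class body is abbreviated, and names such as @notify_mutex and handle_notify are placeholders rather than the identifiers used in the actual patch.

# Illustrative sketch only -- @notify_mutex and handle_notify are
# placeholder names, not necessarily what the merged patch uses.
class IOHandler
  def initialize(watcher, receive_lines)
    @watcher = watcher
    @receive_lines = receive_lines
    @notify_mutex = Mutex.new
  end

  # Both the detach/shutdown path (main thread) and StatWatcher
  # (eventloop thread) end up here; the mutex ensures that only one
  # of them reads and clears the buffered lines at a time.
  def on_notify
    @notify_mutex.synchronize { handle_notify }
  end

  private

  def handle_notify
    # original on_notify body: read new data from the file, batch it
    # into @lines, call @receive_lines, and advance the position entry
  end
end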

@fujimotos fujimotos force-pushed the sf/fix-concurrent-processing-in-tail branch 2 times, most recently from 2867f0a to 0f458fd Compare March 2, 2018 06:57
@fujimotos fujimotos force-pushed the sf/fix-concurrent-processing-in-tail branch from 0f458fd to af84ba5 Compare March 2, 2018 08:07
@fujimotos
Member Author

fujimotos commented Mar 2, 2018

A note on the relation to the Travis CI failures. For example, see the following test failure:

/home/travis/build/fluent/fluentd/test/plugin/test_in_tail.rb:161:in `test_emit`
     158:       assert_equal({"message" => "test4"}, events[1][2])
     159:       assert(events[0][1].is_a?(Fluent::EventTime))
     160:       assert(events[1][1].is_a?(Fluent::EventTime))
  => 161:       assert_equal(1, d.emit_count)
     162:     end
     163: 
     164:     def test_emit_with_emit_unmatched_lines_true
<1> expected but was
<2>
Failure: test_emit[flat](TailInputTest::singleline)

(Similar failure cases: #4160.2, #4138.4, #4129.5, #4128.4, #4127.5)

This failure occurs because TailInput emits the same event repeatedly due to the
concurrency bug described above (so emit_count becomes "2" instead of "1").

This class of test failures will go away once this patch is merged.

Also, you can reproduce this test failure locally simply by inserting
sleep 0.5 into in_tail#708-709 as follows:

unless @lines.empty?
  if @receive_lines.call(@lines)
    @watcher.pe.update_pos(io.pos - @fifo.bytesize)
    sleep 0.5
    @lines.clear
  else
    read_more = false
  end
end
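For completeness, the same race can also be seen in isolation with a small toy script. This is not the actual in_tail code, just the same shape of the bug with made-up names:

require "thread"

# Toy stand-in for IOHandler#on_notify: grab the pending lines,
# "process" them for a while, then clear the buffer.
pending = ["test3", "test4"]
emitted = []

on_notify = lambda do
  unless pending.empty?
    lines = pending.dup   # both threads can reach this point...
    sleep 0.1             # ...before either of them clears the buffer
    emitted.concat(lines)
    pending.clear
  end
end

# One call from the "main thread" (shutdown), one from the "eventloop
# thread" (StatWatcher) -- without a mutex the records come out twice.
[Thread.new(&on_notify), Thread.new(&on_notify)].each(&:join)
p emitted   # likely => ["test3", "test4", "test3", "test4"]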

@repeatedly repeatedly merged commit 88ae430 into fluent:master Mar 5, 2018
@repeatedly
Member

Good catch! Thanks.

@fujimotos fujimotos deleted the sf/fix-concurrent-processing-in-tail branch March 5, 2018 06:14