Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fluentd crashes with "Too many open files" #1696

Closed
reimarstier opened this issue Sep 18, 2017 · 8 comments
Closed

fluentd crashes with "Too many open files" #1696

reimarstier opened this issue Sep 18, 2017 · 8 comments

Comments

@reimarstier
Copy link

  • fluentd-0.14.21
  • raspbian, log files reside on a cifs (samba) mount
  • configuration reads lines with in_tail plugin and writes to kafka
  • fluentd is reading fast rolling logs and sending them to a kafka cluster

After some time fluentd crashes with an error "Too many open files". This problem only arises in connection with fast rolling log files (sometimes a file only exists for about 10 seconds - hence the reason why fluentd was necessary to trace those logs). Other applications without the fast rolling log behavior do not have this problem. Increasing the hard/soft limits of the number of open files in /etc/security/limits.conf simply prolongs the first incident of this problem.

It seems like fluentd is not able to close those files in time to keep reading at that pace. Maybe this is not an issue as long there are not that many files to read.

Fluentd process quits with following stacktrace:

2017-09-14 16:00:08 +0200 [info]: #0 following tail of /srv/cifs/Application.txt
2017-09-14 16:00:08 +0200 [error]: #0 bignum too big to convert into `long long'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:691:in `seek'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:691:in `open'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:707:in `with_io'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:645:in `on_notify'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:477:in `detach'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:306:in `detach_watcher'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:316:in `block in detach_watcher_after_rotate_wait'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin_helper/timer.rb:77:in `on_timer'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/cool.io-1.5.1/lib/cool.io/loop.rb:88:in `run_once'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/cool.io-1.5.1/lib/cool.io/loop.rb:88:in `run'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin_helper/event_loop.rb:84:in `block in start'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2017-09-14 16:00:08 +0200 [error]: #0 bignum too big to convert into `long long'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:691:in `seek'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:691:in `open'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:707:in `with_io'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:645:in `on_notify'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:497:in `on_notify'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:470:in `attach'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:240:in `setup_watcher'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:261:in `block in start_watchers'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:248:in `each'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:248:in `start_watchers'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:234:in `refresh_watchers'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin_helper/timer.rb:77:in `on_timer'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/cool.io-1.5.1/lib/cool.io/loop.rb:88:in `run_once'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/cool.io-1.5.1/lib/cool.io/loop.rb:88:in `run'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin_helper/event_loop.rb:84:in `block in start'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2017-09-14 16:00:08 +0200 [info]: #0 following tail of /srv/cifs/ApplicationServerLog.txt.1
2017-09-14 16:00:08 +0200 [error]: #0 Too many open files @ rb_sysopen - /srv/cifs/main.log
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:690:in `initialize'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:690:in `open'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:690:in `open'
  2017-09-14 16:00:08 +0200 [error]: #0 /var/lib/gems/2.3.0/gems/fluentd-0.14.21/lib/fluent/plugin/in_tail.rb:707:in `with_io'
2017-09-14 16:00:08 +0200 [error]: #0 bignum too big to convert into `long long'
@repeatedly
Copy link
Member

Hmm... The root cause seems bignum too big to convert into long long'`.
We can fix fd leak problem by closing opened fd here:


But it means in_tail skips target file.
I don't have a raspbian, so I'm not sure why this error happens on raspbian.
Maybe, one file is larger than 2^32?

@reimarstier
Copy link
Author

reimarstier commented Oct 4, 2017

How can i trace the origin of the bignum conversion error? The stacktrace above is only about the in_tail plugin being not able to open that file.

The files are definitely not 4 GB in size. Rather 500kb to a few megabyte at most. Could it be that the file handle descriptor number is too big?

When is the file closed during normal operation? What happens if the file disappears (because of rotation) when fluentd opened the file but hasn't read it entirely?

I am using kafka output.
Maybe this bug hits me too:

@repeatedly
Copy link
Member

repeatedly commented Nov 2, 2017

Applying following patch shows what value is passed to `seek:

diff --git a/lib/fluent/plugin/in_tail.rb b/lib/fluent/plugin/in_tail.rb
index eeaadb1..4796315 100644
--- a/lib/fluent/plugin/in_tail.rb
+++ b/lib/fluent/plugin/in_tail.rb
@@ -698,6 +698,9 @@ module Fluent::Plugin
           io = Fluent::FileWrapper.open(@watcher.path)
           io.seek(@watcher.pe.read_pos + @fifo.bytesize)
           io
+        rescue RangeError => e
+          @watcher.log.warn "in_tail seek error: path = #{@watcher.path}, pos = #{@watcher.pe.read_pos + @fifo.bytesize}"
+          raise e
         rescue Errno::ENOENT
           nil
         end

If this pos shows 0xffffffffffffffff value, pe has closed file content.

@repeatedly
Copy link
Member

BTW, your log has 2017-09-14 16:00:08 +0200 [info]: #0 following tail of /srv/cifs/ApplicationServerLog.txt.1.
Does your in_tail configuration use * in path parameter and logrotation happens in its directory?

@repeatedly
Copy link
Member

repeatedly commented Nov 13, 2017

Merge #1742.
It will show which value cause the seek error and avoid "too many open files" error.

@anthonyhaussman
Copy link

Have the same error on fluentd 1.0.2 but it seems to be cause by ForwardInput:

2018-01-11 14:05:11 +0000 [error]: #0 unexpected error error_class=Errno::EMFILE error="Too many open files - accept(2)"
  2018-01-11 14:05:11 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/2.4.0/socket.rb:1298:in `__accept_nonblock'
  2018-01-11 14:05:11 +0000 [error]: #0 /opt/td-agent/embedded/lib/ruby/2.4.0/socket.rb:1298:in `accept_nonblock'
  2018-01-11 14:05:11 +0000 [error]: #0 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/cool.io-1.5.3/lib/cool.io/listener.rb:46:in `on_readable'
  2018-01-11 14:05:11 +0000 [error]: #0 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/cool.io-1.5.3/lib/cool.io/loop.rb:88:in `run_once'
  2018-01-11 14:05:11 +0000 [error]: #0 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/cool.io-1.5.3/lib/cool.io/loop.rb:88:in `run'
  2018-01-11 14:05:11 +0000 [error]: #0 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.0.2/lib/fluent/plugin_helper/event_loop.rb:84:in `block in start'
  2018-01-11 14:05:11 +0000 [error]: #0 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.0.2/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create'
2018-01-11 14:05:11 +0000 [error]: #0 unexpected error error_class=Errno::EMFILE error="Too many open files - accept(2)"
  2018-01-11 14:05:11 +0000 [error]: #0 suppressed same stacktrace```

My configuration:
@type forward port 24224 skip_invalid_event true @type monitor_agent bind 0.0.0.0 port 24220

<match fluent.**>
@type null

<match access_log.**>
@type elasticsearch

# Buffering
slow_flush_log_threshold 90
<buffer>
  @type file
  path /data/fluent/access_log.*.buffer
  #compress gzip
  chunk_limit_size 128m  # 32MB compressed events
  queue_limit_length 2048
  total_limit_size 250g
  flush_interval 60s
  flush_at_shutdown true
  flush_thread_count 4
  retry_max_times 10
  retry_wait 1s
  retry_forever 
  overflow_action throw_exception
</buffer>

# ElasticSearch
hosts elastic-01.adm.ew1:9200,elastic-02.adm.ew1:9200,elastic-03.adm.ew1:9200,elastic-04.adm.ew1:9200,elastic-05.adm.ew1:9200,elastic-06.adm.ew1:9200,elastic-04.adm.ew1:9200,elastic-05.adm.ew1:9200,elastic-06.adm.ew1:9200
ssl_verify false
logstash_format true
logstash_prefix access
logstash_dateformat %Y.%m.%d
request_timeout 15s
reload_connections true
reload_on_failure true # defaults to false, can be useful to quickly remove a dead node from the list of addresses
resurrect_after 30 # defaults to 60s
reconnect_on_error true
```

@anthonyhaussman
Copy link

I have upgrade to the last 1.2.1 version and still have some issue of "Too many open files".

2018-06-08 14:45:57 +0000 [warn]: #1 thread exited by unexpected error plugin=Fluent::Plugin::ForwardInput title=:event_loop error_class=Errno::EMFILE error="Too many open files - accept(2)"
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/serverengine-2.0.6/lib/serverengine/socket_manager_unix.rb:37:in `recv_io'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/serverengine-2.0.6/lib/serverengine/socket_manager_unix.rb:37:in `recv'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/serverengine-2.0.6/lib/serverengine/socket_manager.rb:47:in `listen'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/serverengine-2.0.6/lib/serverengine/socket_manager.rb:59:in `listen_udp'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.1/lib/fluent/plugin_helper/server.rb:363:in `server_create_udp_socket'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.1/lib/fluent/plugin_helper/server.rb:174:in `server_create'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.1/lib/fluent/plugin/in_forward.rb:168:in `start'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.1/lib/fluent/root_agent.rb:165:in `block in start'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.1/lib/fluent/root_agent.rb:154:in `block (2 levels) in lifecycle'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.1/lib/fluent/root_agent.rb:153:in `each'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.1/lib/fluent/root_agent.rb:153:in `block in lifecycle'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.1/lib/fluent/root_agent.rb:140:in `each'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.1/lib/fluent/root_agent.rb:140:in `lifecycle'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.1/lib/fluent/root_agent.rb:164:in `start'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.1/lib/fluent/engine.rb:274:in `start'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.1/lib/fluent/engine.rb:219:in `run'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.1/lib/fluent/supervisor.rb:799:in `run_engine'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.1/lib/fluent/supervisor.rb:549:in `block in run_worker'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.1/lib/fluent/supervisor.rb:724:in `main_process'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.1/lib/fluent/supervisor.rb:544:in `run_worker'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.1/lib/fluent/command/fluentd.rb:316:in `<top (required)>'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/td-agent/embedded/lib/ruby/site_ruby/2.4.0/rubygems/core_ext/kernel_require.rb:55:in `require'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/td-agent/embedded/lib/ruby/site_ruby/2.4.0/rubygems/core_ext/kernel_require.rb:55:in `require'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/lib/ruby/gems/2.4.0/gems/fluentd-1.2.1/bin/fluentd:8:in `<top (required)>'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/bin/fluentd:22:in `load'
  2018-06-08 14:45:57 +0000 [error]: #1 /opt/fluentd/embedded/bin/fluentd:22:in `<main>'
2018-06-08 14:45:57 +0000 [error]: #1 unexpected error error_class=SocketError error="file descriptor was not passed (msg_controllen=0 smaller than sizeof(struct cmsghdr)=16)"
  2018-06-08 14:45:57 +0000 [error]: #1 suppressed same stacktrace
2018-06-08 14:45:56 +0000 [debug]: #0 flush_thread actually running
2018-06-08 14:45:57 +0000 [warn]: #0 thread exited by unexpected error plugin=Fluent::Plugin::ForwardInput title=:event_loop error_class=Errno::EMFILE error="Too many open files - accept(2)"

I have update my configuration in consequence and setup the debug mode:

<system>
  workers 4
  root_dir /etc/fluentd/root
</system>

<source>
  @type forward
  port 24224
</source>

<source>
  @type monitor_agent
  bind 0.0.0.0
  port 24220
</source>

<match fluent.**>
  @type null
</match>

<match kubernetes.**>
  @type rewrite_tag_filter
  <rule>
    key     log
    pattern (\[access\_log\])
    tag access_log.${tag}
  </rule>
  <rule>
    key     log
    pattern (\[json\])
    tag json_log.${tag}
  </rule>
  <rule>
    key     log
    pattern (\[perf\])
    tag perf_log.${tag}
  </rule>
  <rule>
    key     log
    pattern (.+)
    tag default_log.${tag}
  </rule>
</match>

<match access_log.**>
  @type elasticsearch
  @log_level debug

  # Buffering
  slow_flush_log_threshold 90
  <buffer>
    @type file
    path /data/fluent/access_log
    chunk_limit_size 256m
    total_limit_size 250g
    flush_interval 60s
    flush_at_shutdown true
    flush_thread_count 8
    retry_max_times 10
    retry_wait 1s
    retry_forever
    overflow_action block
  </buffer>

  # ElasticSearch
  hosts gcp-es-01:9200,gcp-es-02:9200,gcp-es-03:9200,gcp-es-04:9200,gcp-es-05:9200,gcp-es-06:9200
  ssl_verify false
  logstash_format true
  logstash_prefix access
  logstash_dateformat %Y.%m.%d
  request_timeout 15s
  reload_connections true
  reload_on_failure true
  resurrect_after 30
  reconnect_on_error true
</match>

<match json_log.**>
  @type elasticsearch
  @log_level debug

  # Buffering
  slow_flush_log_threshold 90
  <buffer>
    @type file
    path /data/fluent/json_log
    chunk_limit_size 256m
    total_limit_size 250g
    flush_interval 60s
    flush_at_shutdown true
    flush_thread_count 8
    retry_max_times 10
    retry_wait 1s
    retry_forever
    overflow_action block
  </buffer>

  # ElasticSearch
  hosts gcp-es-01:9200,gcp-es-02:9200,gcp-es-03:9200,gcp-es-04:9200,gcp-es-05:9200,gcp-es-06:9200
  ssl_verify false
  logstash_format true
  logstash_prefix json
  logstash_dateformat %Y.%m.%d
  request_timeout 15s
  reload_connections true
  reload_on_failure true
  resurrect_after 30
  reconnect_on_error true
</match>

<match perf_log.**>
  @type elasticsearch
  @log_level debug

  # Buffering
  slow_flush_log_threshold 90
  <buffer>
    @type file
    path /data/fluent/perf_log
    chunk_limit_size 256m
    total_limit_size 250g
    flush_interval 60s
    flush_at_shutdown true
    flush_thread_count 8
    retry_max_times 10
    retry_wait 1s
    retry_forever
    overflow_action block
  </buffer>

  # ElasticSearch
  hosts gcp-es-01:9200,gcp-es-02:9200,gcp-es-03:9200,gcp-es-04:9200,gcp-es-05:9200,gcp-es-06:9200
  ssl_verify false
  logstash_format true
  logstash_prefix perf
  logstash_dateformat %Y.%m.%d
  request_timeout 15s
  reload_connections true
  reload_on_failure true
  resurrect_after 30 
  reconnect_on_error true
</match>
<match default_log.**>
  @type elasticsearch
  @log_level debug

  # Buffering
  slow_flush_log_threshold 90
  <buffer>
    @type file
    path /data/fluent/queue_log
    chunk_limit_size 256m 
    total_limit_size 250g
    flush_interval 60s
    flush_at_shutdown true
    flush_thread_count 8
    retry_max_times 10
    retry_wait 1s
    retry_forever
    overflow_action block
  </buffer>

  # ElasticSearch
  hosts gcp-es-01:9200,gcp-es-02:9200,gcp-es-03:9200,gcp-es-04:9200,gcp-es-05:9200,gcp-es-06:9200
  ssl_verify false
  logstash_format true
  logstash_prefix logstash
  logstash_dateformat %Y.%m.%d
  request_timeout 15s
  reload_connections true
  reload_on_failure true
  resurrect_after 30 
  reconnect_on_error true
</match>

If you need more debug, I'm ready to help.

@repeatedly
Copy link
Member

You need to check your fd limitations and the number of files/socket/thread.
BTW, I close this issue because I merged the patch and no response from original author.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants