[in_tail][follow_inodes enabled] Race condition causes a newly created watcher to be unwatched unexpectedly, so logs get stuck #4190

Closed
garyzjq opened this issue May 31, 2023 · 4 comments · Fixed by #4191 or #4208
Labels
bug Something isn't working

Comments

@garyzjq
Contributor

garyzjq commented May 31, 2023

Describe the bug

Env & Setting
We are using the fluentd in_tail plugin to parse logs from pods running on Kubernetes (k8s).

k8s log rotation policy: the latest log is always written to a file with the same name, say 'container.log'. When the file reaches a certain size and needs to be rotated, k8s renames the original file to 'container.backup.log' as a backup (the inode stays the same) and creates a new file still named 'container.log' (with a different inode).
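
For clarity, here is a minimal Ruby sketch of that rotation scheme (the paths are illustrative and the snippet assumes 'container.log' already exists):

    require 'fileutils'

    log    = '/var/log/containers/container.log'
    backup = '/var/log/containers/container.backup.log'

    File.rename(log, backup)      # the renamed backup keeps the old inode (inode_old)
    FileUtils.touch(log)          # the recreated file gets a new inode (inode_new)

    puts File.stat(backup).ino    # => inode_old
    puts File.stat(log).ino       # => inode_new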

In our environment, follow_inodes is enabled.

Issue Behavior
When the log file is rotated, fluentd in_tail sometimes no longer tracks changes to the new file: the offset in the pos file stops advancing, although a new entry is created with no inode number.

Root Cause Analysis
After updating in_tail.rb to print more troubleshooting logs, we located the likely root cause, which is a race condition.

When a file is rotated, only the old watcher for the rotated file should be detached. In some cases, however, both the old watcher and the newly created one are detached.

Here is the race sequence (a simplified sketch of the relevant code path follows the list):

  1. File rotated: the previous log file 'container.log' is renamed to 'container.backup.log' (inode_old), and a new 'container.log' is created (inode_new).
  2. refresh_watchers is triggered on its regular interval, marks the old watcher for inode_old to be detached, and creates a new watcher for inode_new (@tails[path] now points to the new watcher).
  3. The old watcher's on_rotate triggers update_watcher, since watcher_needs_update is true.
  4. In update_watcher, "rotated_tw = @tails[path]" returns the new watcher as the one to be detached, because the tw for that path was already replaced in step 2.
  5. After rotate_wait, both the old and the new watchers end up detached (via steps 2 and 4), so the new file is no longer tailed.
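
A simplified Ruby illustration of the hash state across these steps (the names are stand-ins for readability, not the actual in_tail.rb implementation):

    # @tails is keyed by path, so the lookup in step 4 cannot distinguish the
    # two generations of 'container.log'.
    tails = {}                                 # stands in for @tails
    tails['container.log'] = :old_watcher      # watching inode_old

    # Step 2: refresh_watchers replaces the entry with a watcher for inode_new.
    tails['container.log'] = :new_watcher

    # Step 4: update_watcher, triggered by the old watcher's on_rotate, looks up by path.
    rotated_tw = tails['container.log']
    puts rotated_tw                            # => new_watcher (should have been old_watcher)
    # rotated_tw is later detached after rotate_wait, so the new watcher dies too.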

The main problem is in step 4: when trying to detach a watcher, we need to determine whether rotated_tw still refers to the old inode (in which case it can be detached) or has already been refreshed to the new inode (in which case the detach must be skipped, because the old watcher is detached by another event).
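
Here is a rough sketch of such a guard (hypothetical names and a simplified shape, not the actual change made in #4191/#4208): before detaching, compare the inode held by the watcher currently registered under the path with the inode that was rotated away.

    # Sketch only: decide whether the watcher found under this path may be detached.
    Watcher = Struct.new(:path, :ino)

    def decide_detach(tails, path, rotated_inode)
      rotated_tw = tails[path]
      return :nothing_registered if rotated_tw.nil?
      if rotated_tw.ino != rotated_inode
        # The entry was already replaced by refresh_watchers (step 2); detaching
        # it here would kill the watcher for the new file and stall tailing.
        :skip_detach
      else
        :detach_after_rotate_wait
      end
    end

    tails = { 'container.log' => Watcher.new('container.log', 200) }  # new watcher, inode_new = 200
    puts decide_detach(tails, 'container.log', 100)                   # => skip_detach (the race case above)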

To Reproduce

Here is the detailed trace log that reproduces the race:
2023-05-31 03:09:08 +0000 [info]: #0 stop_watchers path="<log_path>" ino=<old_ino>
2023-05-31 03:09:08 +0000 [info]: #0 detach_watcher_after_rotate_wait path="<log_path>" ino=<old_ino>
2023-05-31 03:09:08 +0000 [info]: #0 construct_watcher path="<log_path>"
2023-05-31 03:09:08 +0000 [info]: #0 construct_watcher2 path="<log_path>" ino=<new_ino>
2023-05-31 03:09:08 +0000 [info]: #0 setup_watcher path="<log_path>" ino=<new_ino> pe_inode=<new_ino>
2023-05-31 03:09:08 +0000 [info]: #0 TailWatcher initialize path="<log_path>" ino=<new_ino>
2023-05-31 03:09:08 +0000 [info]: #0 on_rotate path="<log_path>" ino=<new_ino>
2023-05-31 03:09:08 +0000 [info]: #0 following tail of <log_path>
2023-05-31 03:09:08 +0000 [info]: #0 on_rotate path="<log_path>" ino=<old_ino>
2023-05-31 03:09:08 +0000 [info]: #0 watcher_needs_update path="<log_path>" ino=<old_ino>
2023-05-31 03:09:08 +0000 [info]: #0 detected rotation of <log_path> ino <old_ino> -> <new_ino>; waiting 5.0 seconds
2023-05-31 03:09:08 +0000 [info]: #0 rotated_tw <log_path> <new_ino> //unexpected
2023-05-31 03:09:08 +0000 [info]: #0 new_position_entry ino=<new_ino>
2023-05-31 03:09:08 +0000 [info]: #0 detach_watcher_after_rotate_wait path="<log_path>" ino=<old_ino>
2023-05-31 03:09:13 +0000 [info]: #0 detach_watcher path="<log_path>" tw_ino=<old_ino> expect_ino=<old_ino>
2023-05-31 03:09:13 +0000 [info]: #0 detaching a watcher path="<log_path>" ino=<old_ino>
2023-05-31 03:09:13 +0000 [info]: #0 detach_watcher path="<log_path>" tw_ino=<new_ino> expect_ino=<old_ino>
2023-05-31 03:09:13 +0000 [warn]: #0 detaching a watcher path="<log_path>" ino=<new_ino> //unexpected
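
To increase the chance of hitting the window, one can rotate the file repeatedly while fluentd is tailing it. A rough, hypothetical helper (illustrative paths and timings, not part of the original report; the race is timing dependent, so it may take many iterations to trigger):

    require 'fileutils'

    log = '/tmp/in_tail_race/container.log'       # point in_tail's `path` at this file
    FileUtils.mkdir_p(File.dirname(log))
    FileUtils.touch(log)

    1000.times do |i|
      File.write(log, "line #{i}\n", mode: 'a')   # append some content
      File.rename(log, "#{log}.#{i}.backup")      # the old inode lives on under a new name
      FileUtils.touch(log)                        # recreate the path with a new inode
      sleep 0.2                                   # rotate faster than refresh_interval 2
    end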

Here is the detailed trace log of a normal rotation case for comparison:

2023-05-31 03:26:01 +0000 [info]: #0 on_rotate path="<log_path>" ino=<old_ino>
2023-05-31 03:26:01 +0000 [info]: #0 watcher_needs_update path="<log_path>" ino=<old_ino>
2023-05-31 03:26:01 +0000 [info]: #0 detected rotation of <log_path> ino <old_ino> -> <new_ino>; waiting 5.0 seconds
2023-05-31 03:26:01 +0000 [info]: #0 rotated_tw <log_path> <old_ino> //expected
2023-05-31 03:26:01 +0000 [info]: #0 new_position_entry ino=0
2023-05-31 03:26:01 +0000 [info]: #0 setup_watcher path="<log_path>" ino=<new_ino> pe_inode=0
2023-05-31 03:26:01 +0000 [info]: #0 TailWatcher initialize path="<log_path>" ino=<new_ino>
2023-05-31 03:26:01 +0000 [info]: #0 on_rotate path="<log_path>" ino=<new_ino>
2023-05-31 03:26:01 +0000 [info]: #0 following tail of <log_path>
2023-05-31 03:26:01 +0000 [info]: #0 detach_watcher_after_rotate_wait path="<log_path>" ino=<old_ino>
2023-05-31 03:26:06 +0000 [info]: #0 detach_watcher path="<log_path>" tw_ino=<old_ino> expect_ino=<old_ino>
2023-05-31 03:26:06 +0000 [info]: #0 detaching a watcher path="<log_path>" ino=<old_ino> //expected

Expected behavior

No log gets stuck; the new watcher keeps monitoring changes to the new file.

Your Environment

- Fluentd version: 1.15
- TD Agent version: 4.4.2
- Operating system:
- Kernel version:

Your Configuration

<source>
    @type tail
    path /var/log/containers/*.log
    pos_file /var/fluentd-containers.log.pos
    read_from_head true
    refresh_interval 2
    rotate_wait 5
    follow_inodes true
</source>
Your Error Log

No error is emitted; tailing just gets stuck.

Additional context

No response

@ashie
Member

ashie commented Jun 1, 2023

Thank you for describing the cause in detail, it's very clear to me!
It's indeed one of the causes of in_tail's stall issue we are discussing in #3614.
It should be fixed.

@garyzjq
Contributor Author

garyzjq commented Jun 2, 2023

Thanks @ashie! I'm also testing that fix under a heavy log load to see whether any other stuck issue remains.
From a code perspective, it is the ideal approach (when follow_inodes is enabled, tails are tracked by inode number rather than by path).
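
For illustration only, the idea looks roughly like this (a toy sketch, not the actual patch):

    # Toy sketch: when follow_inodes is enabled, key the watcher table by inode
    # instead of by path, so the old and new generations of 'container.log'
    # can never collide in a lookup.
    follow_inodes = true

    tails   = {}
    key_for = ->(path, ino) { follow_inodes ? ino : path }

    tails[key_for.call('container.log', 100)] = :old_watcher   # inode_old = 100
    tails[key_for.call('container.log', 200)] = :new_watcher   # inode_new = 200

    # update_watcher for the rotated generation now finds the right entry:
    puts tails[key_for.call('container.log', 100)]              # => old_watcher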

A minor concern is that it also adds three more "if @follow_inodes ... else" branches; it feels like we will eventually need a refactoring to avoid code full of this condition :)

@ashie
Member

ashie commented Jun 27, 2023

We believe this issue is fixed by #4208.
Thanks for your effort in clarifying the cause of this issue!

@ashie ashie closed this as completed Jun 27, 2023
@garyzjq
Contributor Author

garyzjq commented Jun 28, 2023

Thanks @ashie, looking forward to the next release :)
