timing issues with multiple tasks #295
Here is some output at DEBUG logging level. It shows that the tasks were at least being started correctly.
Seeing the same issue when I modify the task to just echo a message: it took approximately 2 minutes to see the first INFO-level echo message (the DEBUG start/end messages happen like clockwork).
Err, one thing to mention is that I built off the head of master. It looks like command.RunWithTimeout is undergoing some changes, so maybe this is expected not to work fully. I'll rebuild from the 2.6.0 tag and see what happens. Edit: rebuilt from the 2.6.0 tag and still seeing the issue:
You can see only one heartbeat had an INFO log with the echo message. There are many more heartbeats in the logs without any echo message being printed.
Thanks for providing the debug messages, that helps a lot. I suspect that you're seeing the same thing that @sourcec0de is seeing in #294. I've got two "top-level" hypotheses at this point. One is that we're failing to spawn the process somehow; this would be very bad. The other (which I think is more likely) is that we're spawning the process but there's a race in how we're capturing stdout/stderr, so we're dropping the logs on the floor. The way to verify this would be to update your hook to do something persistent in the environment, like dropping a file on disk for each invocation. (Alternately you can trace execution with dtrace or strace, but let's assume for the moment that we've got a container with limited tooling.) Something like this should do it:

```sh
NOW=$(date -u "+%Y-%m-%dT%H:%M:%SZ")
echo $NOW
echo $NOW > $NOW
```

If we see all the files existing, then we know the invocations are happening but we're dropping logs. If we don't see all the files existing, then we know we're failing to spawn for some reason (and failing to log that fact as well!).
Yeah, that's important to note! Master is not production-suitable at the moment. But since we're seeing it in 2.6+, it looks like it's an existing bug.
I put a 3s sleep after the log statement in my tasks, and that makes them all show up in the logs with the correct timings. So it looks to fit your race-condition hypothesis: something is racing between scanning the output and cleaning up the command subprocess. That also aligns with my observation that the plain echo task had far more dropped logs, presumably because it exits so quickly.
Thanks, that's very helpful. At first my suspicion was that this problem was revealed in 2.5.1, where we had a PR to handle a null-pointer-exception bug; that might have changed the timing enough for the underlying race in how we set up the log handlers to appear. But this problem has probably been lying around latent ever since we created this code path. For each command we set up the logging, and then we defer closing the log writer. I'll try to create a minimal program that reproduces this behavior to verify this hypothesis. I'm not exactly sure what we can do to fix it yet, though.
@tgross I can confirm the same behavior as @riot-jayb. This fits the issue I reported in #294: my co-process writes new nginx configs that come in over a Google Cloud Pub/Sub queue. I can see those being written, but there's no log output for some time. Eventually, the logs do appear.
Ok, below is a minimal program that reproduces the behavior (mirrored in a gist here too). If you run this with a quickly-exiting command as its arguments (for example `echo hello`), you'll typically see fewer than five lines of output:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"os/exec"
	"time"
)

// NewLogWriter returns the write end of a pipe whose read end is
// drained line-by-line by a scanner goroutine.
func NewLogWriter() io.WriteCloser {
	r, w := io.Pipe()
	go func() {
		scanner := bufio.NewScanner(r)
		for scanner.Scan() {
			fmt.Println(scanner.Text())
		}
	}()
	return w
}

func echo(args []string) {
	logger := NewLogWriter()
	go func() {
		// Closing the logger tears down the pipe. For a command that
		// exits quickly, this races with the scanner goroutine and the
		// output gets dropped.
		defer logger.Close()
		cmd := exec.Command(args[0], args[1:]...)
		cmd.Stdout = logger
		cmd.Stderr = logger
		cmd.Start()
		// Process.Wait returns when the process exits; unlike
		// cmd.Wait, it does not wait for the stdout/stderr copying
		// goroutines to finish.
		cmd.Process.Wait()
	}()
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("need args")
		os.Exit(1)
	}
	echo(os.Args[1:])
	echo(os.Args[1:])
	echo(os.Args[1:])
	echo(os.Args[1:])
	echo(os.Args[1:])
	time.Sleep(1 * time.Second)
}
```
I just pushed a commit to the big v3 PR here which will minimize this race but doesn't eliminate it entirely, and in v2 we don't even have that context cancellation to rely on. The implementation of the underlying pipe (ref) is unbuffered and blocking, but designed to allow safe parallel access to read/write/close. Reads and writes are matched one-to-one, but Close has to be allowed to interleave, so the race is simply unavoidable so long as we're using these primitives. I think the approach we should probably try to take here is to avoid creating a new pipe for every invocation of the command.
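To make those pipe semantics concrete, here is a minimal sketch (not containerpilot code) of how a Close can interleave with a pending Write on an io.Pipe, so the write fails instead of being delivered:

```go
package main

import (
	"fmt"
	"io"
	"time"
)

func main() {
	r, w := io.Pipe()
	// Nothing is reading, so the Write below blocks until a matching
	// Read arrives or the pipe is closed.
	go func() {
		time.Sleep(10 * time.Millisecond)
		r.Close() // the reader goes away before the write is matched
	}()
	_, err := w.Write([]byte("this line is never delivered\n"))
	fmt.Println(err) // io: read/write on closed pipe
}
```

This is the same shape as the repro above: the reader side of the pipe gets torn down while a write of the command's output is still in flight.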
This commit 3338d0a should implement the fix we'll need. Because the stdlib won't let us reuse the exec.Cmd, we create a new one for each run, but we attach it to the same long-lived log writer rather than creating a new pipe each time. (We'll still potentially drop logs if that higher-level writer is closed early, but that window is much smaller.) The linked commit is for v3, but it should be adaptable as a backport to v2. I'll try to have that patch for v2 in the next couple of days.
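As an illustration of that approach, here is a sketch (hypothetical names, not the actual containerpilot implementation): one pipe and one scanner goroutine live for the lifetime of the task, each run gets a fresh exec.Cmd attached to the shared writer, and the writer is closed exactly once at shutdown:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os/exec"
)

// taskLogger owns one long-lived pipe and the goroutine that drains it.
type taskLogger struct {
	w    io.WriteCloser
	done chan struct{}
}

func newTaskLogger() *taskLogger {
	r, w := io.Pipe()
	t := &taskLogger{w: w, done: make(chan struct{})}
	go func() {
		defer close(t.done)
		scanner := bufio.NewScanner(r)
		for scanner.Scan() {
			fmt.Println(scanner.Text())
		}
	}()
	return t
}

// run builds a fresh exec.Cmd each time (the stdlib won't let us reuse
// one after it has run) but attaches it to the shared writer.
func (t *taskLogger) run(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout = t.w
	cmd.Stderr = t.w
	// Run, unlike Process.Wait, also waits for the stdout/stderr
	// copies to complete; the shared writer stays open afterward.
	return cmd.Run()
}

// close shuts the writer once, at the end of the task's lifetime, and
// waits for the scanner goroutine to drain.
func (t *taskLogger) close() {
	t.w.Close()
	<-t.done
}

func main() {
	logger := newTaskLogger()
	defer logger.close()
	for i := 0; i < 5; i++ {
		logger.run("echo", "hello", fmt.Sprint(i))
	}
}
```

Because nothing closes the pipe between runs, a command that exits immediately can no longer race the scanner's teardown.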
Awesome work dude :) |
It took me a bit longer to land #290 than I expected, but I'll try to pick this up today or tomorrow for v2. |
Released in https://github.com/joyent/containerpilot/releases/tag/2.7.1. I'm going to mark this as fixed and close this issue, but feel free to reopen if it looks like we missed something.
Hi,
I'm seeing a strange issue when using multiple tasks with short frequencies: either the tasks are sometimes not executed, or they're just not logged. I have a "heartbeat" task with a 15s frequency and a "backup" task with a 20s frequency (these are just stubs for testing the config at the moment), and the logs give very inconsistent results.
Here is my containerpilot config:
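A minimal sketch matching that description, assuming the v2 tasks syntax (the names and command values here are placeholders):

```json
{
  "tasks": [
    {
      "name": "heartbeat",
      "command": "/bin/manager heartbeat",
      "frequency": "15s",
      "timeout": "10s"
    },
    {
      "name": "backup",
      "command": "/bin/manager backup",
      "frequency": "20s",
      "timeout": "15s"
    }
  ]
}
```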
And here is the output log during container run time:
As you can see, it appears that only one or the other actually logs anything. I'm not sure whether the tasks are actually not running or just not logging, because they are stub methods in the manager tool right now.
Has anyone seen this type of issue before?