timing issues with multiple tasks #295

Closed · riot-jayb opened this issue Mar 13, 2017 · 13 comments

@riot-jayb commented Mar 13, 2017

Hi,

I'm seeing a strange issue when using multiple tasks with short frequencies. Either the tasks are sometimes not executed, or they are just not logged. I have a "heartbeat" task with a 15s frequency and a "backup" task with a 20s frequency (these are just stubs for testing the config at the moment), and the logs give very inconsistent results.

Here is my containerpilot config:

{
  "postStop" : "percona-mngr leave",
  "tasks": [
    {
      "name": "heartbeat",
      "command": "percona-mngr heartbeat",
      "frequency": "15s"
    },
    {
      "name": "backup",
      "command": "percona-mngr backup",
      "frequency": "20s"
    }
  ]
}

And here is the output log during container run time:

2017/03/13 05:07:11 percona-mngr backup (stub)
2017/03/13 05:07:16 percona-mngr heartbeat (stub)
2017/03/13 05:07:31 percona-mngr heartbeat (stub)
2017/03/13 05:07:46 percona-mngr heartbeat (stub)
2017/03/13 05:07:51 percona-mngr backup (stub)
2017/03/13 05:08:16 percona-mngr heartbeat (stub)
2017/03/13 05:08:46 percona-mngr heartbeat (stub)
2017/03/13 05:08:51 percona-mngr backup (stub)
2017/03/13 05:09:16 percona-mngr heartbeat (stub)
2017/03/13 05:09:46 percona-mngr heartbeat (stub)
2017/03/13 05:09:51 percona-mngr backup (stub)
2017/03/13 05:10:11 percona-mngr backup (stub)
2017/03/13 05:10:16 percona-mngr heartbeat (stub)
2017/03/13 05:10:31 percona-mngr heartbeat (stub)
2017/03/13 05:10:46 percona-mngr heartbeat (stub)
2017/03/13 05:10:51 percona-mngr backup (stub)
2017/03/13 05:11:01 percona-mngr heartbeat (stub)
2017/03/13 05:11:11 percona-mngr backup (stub)
2017/03/13 05:12:16 percona-mngr heartbeat (stub)
2017/03/13 05:13:16 percona-mngr heartbeat (stub)
2017/03/13 05:13:31 percona-mngr heartbeat (stub)
2017/03/13 05:13:46 percona-mngr heartbeat (stub)
2017/03/13 05:13:51 percona-mngr backup (stub)
2017/03/13 05:14:11 percona-mngr backup (stub)
2017/03/13 05:14:16 percona-mngr heartbeat (stub)
2017/03/13 05:14:31 percona-mngr backup (stub)
2017/03/13 05:14:31 percona-mngr heartbeat (stub)
2017/03/13 05:14:46 percona-mngr heartbeat (stub)
2017/03/13 05:14:51 percona-mngr backup (stub)
2017/03/13 05:15:01 percona-mngr heartbeat (stub)
2017/03/13 05:15:46 percona-mngr heartbeat (stub)
2017/03/13 05:16:31 percona-mngr heartbeat (stub)
2017/03/13 05:16:46 percona-mngr heartbeat (stub)
2017/03/13 05:17:16 percona-mngr heartbeat (stub)
2017/03/13 05:17:31 percona-mngr backup (stub)
2017/03/13 05:17:31 percona-mngr heartbeat (stub)
2017/03/13 05:17:46 percona-mngr heartbeat (stub)
2017/03/13 05:18:01 percona-mngr heartbeat (stub)
2017/03/13 05:18:11 percona-mngr backup (stub)
2017/03/13 05:18:16 percona-mngr heartbeat (stub)
2017/03/13 05:18:46 percona-mngr heartbeat (stub)
2017/03/13 05:18:51 percona-mngr backup (stub)
2017/03/13 05:19:11 percona-mngr backup (stub)
2017/03/13 05:20:16 percona-mngr heartbeat (stub)
2017/03/13 05:21:01 percona-mngr heartbeat (stub)
2017/03/13 05:21:11 percona-mngr backup (stub)
2017/03/13 05:21:31 percona-mngr backup (stub)
2017/03/13 05:21:51 percona-mngr backup (stub)
2017/03/13 05:22:01 percona-mngr heartbeat (stub)

As you can see, it appears that only one task or the other actually logs anything. I'm not sure whether they are actually not running or just not logging, because they are stub methods in the manager tool right now.

Has anyone seen this type of issue before?

@riot-jayb (Author)

Here is some output at the DEBUG logging level. It shows that the tasks were at least being started correctly. Each task command in percona-mngr is just a stub method that prints a log message, so there isn't much to go wrong there. It appears the log messages are somehow being dropped, or else the command is not actually executed and is failing somehow without logging any errors.

time="2017-03-13T05:38:06Z" level=debug msg="task[heartbeat].Cmd start"
time="2017-03-13T05:38:06Z" level=debug msg="task[heartbeat].RunWithTimeout end"
time="2017-03-13T05:38:11Z" level=debug msg="task[backup].RunWithTimeout start"
time="2017-03-13T05:38:11Z" level=debug msg="task[backup].Cmd start"
time="2017-03-13T05:38:11Z" level=info msg="percona-mngr backup (stub)" process=task task=backup
time="2017-03-13T05:38:11Z" level=debug msg="task[backup].RunWithTimeout end"
time="2017-03-13T05:38:21Z" level=debug msg="task[heartbeat].RunWithTimeout start"
time="2017-03-13T05:38:21Z" level=debug msg="task[heartbeat].Cmd start"
time="2017-03-13T05:38:21Z" level=info msg="percona-mngr heartbeat (stub)" process=task task=heartbeat
time="2017-03-13T05:38:21Z" level=debug msg="task[heartbeat].RunWithTimeout end"
time="2017-03-13T05:38:31Z" level=debug msg="task[backup].RunWithTimeout start"
time="2017-03-13T05:38:31Z" level=debug msg="task[backup].Cmd start"
time="2017-03-13T05:38:31Z" level=debug msg="task[backup].RunWithTimeout end"
time="2017-03-13T05:38:36Z" level=debug msg="task[heartbeat].RunWithTimeout start"
time="2017-03-13T05:38:36Z" level=debug msg="task[heartbeat].Cmd start"
time="2017-03-13T05:38:36Z" level=info msg="percona-mngr heartbeat (stub)" process=task task=heartbeat
time="2017-03-13T05:38:36Z" level=debug msg="task[heartbeat].RunWithTimeout end"
time="2017-03-13T05:38:51Z" level=debug msg="task[backup].RunWithTimeout start"
time="2017-03-13T05:38:51Z" level=debug msg="task[backup].Cmd start"
time="2017-03-13T05:38:51Z" level=debug msg="task[heartbeat].RunWithTimeout start"
time="2017-03-13T05:38:51Z" level=debug msg="task[heartbeat].Cmd start"
time="2017-03-13T05:38:51Z" level=debug msg="task[backup].RunWithTimeout end"
time="2017-03-13T05:38:51Z" level=info msg="percona-mngr heartbeat (stub)" process=task task=heartbeat
time="2017-03-13T05:38:51Z" level=debug msg="task[heartbeat].RunWithTimeout end"
time="2017-03-13T05:39:06Z" level=debug msg="task[heartbeat].RunWithTimeout start"
time="2017-03-13T05:39:06Z" level=debug msg="task[heartbeat].Cmd start"
time="2017-03-13T05:39:06Z" level=debug msg="task[heartbeat].RunWithTimeout end"
time="2017-03-13T05:39:11Z" level=debug msg="task[backup].RunWithTimeout start"
time="2017-03-13T05:39:11Z" level=debug msg="task[backup].Cmd start"
time="2017-03-13T05:39:11Z" level=info msg="percona-mngr backup (stub)" process=task task=backup
time="2017-03-13T05:39:11Z" level=debug msg="task[backup].RunWithTimeout end"
time="2017-03-13T05:39:21Z" level=debug msg="task[heartbeat].RunWithTimeout start"
time="2017-03-13T05:39:21Z" level=debug msg="task[heartbeat].Cmd start"
time="2017-03-13T05:39:21Z" level=debug msg="task[heartbeat].RunWithTimeout end"
time="2017-03-13T05:39:31Z" level=debug msg="task[backup].RunWithTimeout start"
time="2017-03-13T05:39:31Z" level=debug msg="task[backup].Cmd start"
time="2017-03-13T05:39:31Z" level=debug msg="task[backup].RunWithTimeout end"
time="2017-03-13T05:39:36Z" level=debug msg="task[heartbeat].RunWithTimeout start"
time="2017-03-13T05:39:36Z" level=debug msg="task[heartbeat].Cmd start"
time="2017-03-13T05:39:36Z" level=info msg="percona-mngr heartbeat (stub)" process=task task=heartbeat
time="2017-03-13T05:39:36Z" level=debug msg="task[heartbeat].RunWithTimeout end"
time="2017-03-13T05:39:51Z" level=debug msg="task[backup].RunWithTimeout start"
time="2017-03-13T05:39:51Z" level=debug msg="task[backup].Cmd start"
time="2017-03-13T05:39:51Z" level=debug msg="task[backup].RunWithTimeout end"
time="2017-03-13T05:39:51Z" level=debug msg="task[heartbeat].RunWithTimeout start"
time="2017-03-13T05:39:51Z" level=debug msg="task[heartbeat].Cmd start"
time="2017-03-13T05:39:51Z" level=debug msg="task[heartbeat].RunWithTimeout end"
time="2017-03-13T05:40:06Z" level=debug msg="task[heartbeat].RunWithTimeout start"
time="2017-03-13T05:40:06Z" level=debug msg="task[heartbeat].Cmd start"
time="2017-03-13T05:40:06Z" level=debug msg="task[heartbeat].RunWithTimeout end"
time="2017-03-13T05:40:11Z" level=debug msg="task[backup].RunWithTimeout start"
time="2017-03-13T05:40:11Z" level=debug msg="task[backup].Cmd start"
time="2017-03-13T05:40:11Z" level=info msg="percona-mngr backup (stub)" process=task task=backup
time="2017-03-13T05:40:11Z" level=debug msg="task[backup].RunWithTimeout end"

@riot-jayb (Author)

I'm seeing the same issue when I modify the task commands to just echo a message:

  "logging": {
    "level": "DEBUG",
    "format": "text"
  },
  "postStop" : "percona-mngr leave",
  "tasks": [
    {
      "name": "heartbeat",
      "command": "echo 'heartbeat'",
      "frequency": "15s"
    },
    {
      "name": "backup",
      "command": "echo 'backup'",
      "frequency": "20s"
    }
  ]
}

It took approximately 2 minutes to see the first info-level echo message (the debug start/end messages happen like clockwork).

@riot-jayb (Author) commented Mar 13, 2017

err, one thing to mention is that I built off the head of master. It looks like command.RunWithTimeout is undergoing some changes, so maybe this is expected not to be working fully yet.

I'll rebuild from 2.6.0 tag and see what happens.

edit: rebuilt from 2.6.0 tag and still seeing the issue:

time="2017-03-13T06:22:17Z" level=debug msg="task[backup].RunWithTimeout start" 
time="2017-03-13T06:22:17Z" level=debug msg="task[backup].Cmd.Start" 
time="2017-03-13T06:22:17Z" level=debug msg="task[backup].run waiting for PID 119: " 
time="2017-03-13T06:22:17Z" level=debug msg="task[backup].run complete" 
time="2017-03-13T06:22:17Z" level=debug msg="task[backup].run#gofunc received quit" 
time="2017-03-13T06:22:17Z" level=debug msg="task[backup].RunWithTimeout end" 
time="2017-03-13T06:22:17Z" level=debug msg="task[heartbeat].RunWithTimeout start" 
time="2017-03-13T06:22:17Z" level=debug msg="task[heartbeat].Cmd.Start" 
time="2017-03-13T06:22:17Z" level=debug msg="task[heartbeat].run waiting for PID 120: " 
time="2017-03-13T06:22:17Z" level=info msg="'heartbeat'" process=task task=heartbeat 
time="2017-03-13T06:22:17Z" level=debug msg="task[heartbeat].run complete" 
time="2017-03-13T06:22:17Z" level=debug msg="task[heartbeat].RunWithTimeout end" 
time="2017-03-13T06:22:17Z" level=debug msg="task[heartbeat].run#gofunc received quit"

You can see that only one heartbeat had an info log with the echo message. There are many more runs in the logs without any echo message being printed.

@tgross (Contributor) commented Mar 13, 2017

Thanks for providing the debug messages, that helps a lot. I suspect that you're seeing the same thing that @sourcec0de is seeing in #294.

I've got two "top-level" hypotheses at this point. One is that we're failing to spawn the process somehow. This would be very bad. The other (which I think is more likely) is that we're spawning the process but that there's a race in how we're capturing the stdout/stderr and so we're dropping the logs on the floor.

The way to verify this would be to update your hook to do something persistent in the environment, like dropping a file on disk for each invocation. (Alternatively, you can trace execution with dtrace or strace, but let's assume for the moment that we've got a container with limited tooling.) Something like this should do it:

# record each invocation by dropping a file named with the timestamp
NOW=$(date -u "+%Y-%m-%dT%H:%M:%SZ")
echo "$NOW"
echo "$NOW" > "$NOW"

If we see all files existing then we know that the invocations are happening but that we're dropping logs. If we don't see all the files existing then we know we're failing to spawn for some reason (and failing to log that fact as well!).

err, one thing to mention is that I built off the head of master.

Yeah, that's important to note! Master is not production-suitable at the moment. But we're seeing it in 2.6+ so it looks like it's an existing bug.

@riot-jayb (Author) commented Mar 13, 2017

I put a 3s sleep after the log statement in my tasks, and that makes them all show up in the logs with the correct timings. So it fits your race condition hypothesis: something is racing between scanning the output and cleaning up the command subprocess.

That also aligns with my observation that the echo commands had far more dropped logs, presumably because echo exits very quickly.
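
For reference, with the echo variant the workaround amounts to padding the command so the child process outlives the log scan, something along these lines (just a sketch; it assumes /bin/sh is available in the container, and the sleep exists only to keep the process alive long enough for its output to be scanned):

{
  "name": "heartbeat",
  "command": ["/bin/sh", "-c", "echo 'heartbeat' && sleep 3"],
  "frequency": "15s"
}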

@tgross (Contributor) commented Mar 14, 2017

Thanks, that's very helpful. At first my suspicion was that this problem was revealed in 2.5.1, where we had this PR to handle a null-pointer exception bug. That might have changed the timing enough for the underlying race in setting up the log handlers to appear. But this problem has probably been latent ever since we created tasks and started logging them separately from stdout/stderr.

For each command we set up the logging in commands/Command.setUpCmd, which calls utils/NewLogWriter. That in turn sets up an async handler so that we can stream logs as they're received and scan them into a new log line for each line of output.

And then here we defer closing the io.WriteCloser that we created for the logger. We have to do this (somewhere) or we leak resources. But that means that if we're scanning the log asynchronously but close the pipe before we've scanned it, the call to scanner.Scan() returns false, that goroutine exits, and we drop the log line.

I'll try to create a minimal program to reproduce this behavior to verify this hypothesis. I'm not exactly sure what we can do to fix it yet though.

@sourcec0de

@tgross I can confirm the same as @riot-jayb.

This fits the issue I reported in #294: my co-process writes new nginx configs that come in over a Google Cloud Pub/Sub queue. I can see those being written, but there's no log output for some time. Eventually, the logs do appear.

@tgross (Contributor) commented Mar 14, 2017

Ok, below is a minimal program that reproduces the behavior (mirrored in a gist here too).

If you run this as go run logbug.go echo hi then you'll see somewhere between 3 and 5 "hi" lines. It's a race, so obviously it varies. The time.Sleep ensures we're not exiting the program before we have a chance to catch the log, but it doesn't interfere with closing the io.WriteCloser in the defer clause in echo().


package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"os/exec"
	"time"
)

func NewLogWriter() io.WriteCloser {
	r, w := io.Pipe()
	go func() {
		scanner := bufio.NewScanner(r)
		for scanner.Scan() {
			fmt.Println(scanner.Text())
		}
	}()
	return w
}

func echo(args []string) {
	logger := NewLogWriter()
	go func() {
		// this deferred Close races against the scanner goroutine in NewLogWriter
		defer logger.Close()
		cmd := exec.Command(args[0], args[1:]...)
		cmd.Stdout = logger
		cmd.Stderr = logger
		cmd.Start()
		cmd.Process.Wait()
	}()
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("need args")
		return
	}
	echo(os.Args[1:])
	echo(os.Args[1:])
	echo(os.Args[1:])
	echo(os.Args[1:])
	echo(os.Args[1:])
	time.Sleep(1 * time.Second)
}

@tgross (Contributor) commented Mar 14, 2017

I just pushed a commit to the big v3 PR here, which minimizes this race but doesn't eliminate it entirely. And in v2 we don't even have that context cancellation to rely on.

The implementation of the underlying pipe (ref) is unbuffered and blocking, but designed to allow safe parallel access to read/write/close. Reads and Writes are matched one-to-one, but Close has to be allowed to interleave, so the race is simply unavoidable as long as we're using these primitives.

I think the approach we should probably take here is to avoid creating a new io.WriteCloser for each run if we can. That way we never have to close them (until ContainerPilot exits or reloads its config). This will require a bit of reworking of how we pass fields into the log entries and set log levels for them.
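
As a rough sketch (not the actual ContainerPilot code), applying that idea to the minimal program above looks something like this: create the logger once, share it across invocations, and only close it at shutdown:

package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"os/exec"
	"time"
)

// NewLogWriter is the same asynchronous scanner as in the repro above.
func NewLogWriter() io.WriteCloser {
	r, w := io.Pipe()
	go func() {
		scanner := bufio.NewScanner(r)
		for scanner.Scan() {
			fmt.Println(scanner.Text())
		}
	}()
	return w
}

// echo shares a long-lived logger instead of creating and closing one per
// invocation, so there's no Close for the scanner goroutine to race against.
func echo(logger io.Writer, args []string) {
	go func() {
		cmd := exec.Command(args[0], args[1:]...)
		cmd.Stdout = logger
		cmd.Stderr = logger
		cmd.Run() // Run waits for the output to be copied, unlike Process.Wait
	}()
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("need args")
		return
	}
	logger := NewLogWriter() // created once; closed only at shutdown
	defer logger.Close()
	for i := 0; i < 5; i++ {
		echo(logger, os.Args[1:])
	}
	time.Sleep(1 * time.Second)
}

With the logger outliving every run, none of the output gets dropped.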

@tgross (Contributor) commented Mar 21, 2017

This commit 3338d0a should implement the fix we'll need.

Because the stdlib won't let us reuse an exec.Cmd after it's been run, we recreate it each time we call commands.Command.Run(). Previously this included creating an io.WriteCloser to pipe out the logs from the stdout/stderr of the process we fork. In this patch, we instead create the io.WriteCloser once and attach it to the reusable Command object. The io.WriteCloser won't go out of scope during an execution of our process, so we won't drop logs because they have nowhere to go anymore.

(We'll still potentially drop logs if that higher-level Command is explicitly closed, but that shouldn't happen unless we're exiting ContainerPilot outright or we've just killed ContainerPilot and all its child processes. That's not really avoidable.)
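
Schematically, and building on the NewLogWriter helper from the minimal program above, the shape of the change is something like this (the field and method names here are illustrative, not the real ContainerPilot types):

// Illustrative only; this is not the actual ContainerPilot code.
type Command struct {
	Exec   string
	Args   []string
	logger io.WriteCloser // created once, lives as long as the Command
}

func NewCommand(executable string, args []string) *Command {
	return &Command{Exec: executable, Args: args, logger: NewLogWriter()}
}

// Run recreates the exec.Cmd each time (the stdlib requires that), but
// reuses the long-lived logger instead of a per-run io.WriteCloser.
func (c *Command) Run() error {
	cmd := exec.Command(c.Exec, c.Args...)
	cmd.Stdout = c.logger
	cmd.Stderr = c.logger
	return cmd.Run()
}

// Close is only called when ContainerPilot exits or reloads its config.
func (c *Command) Close() error { return c.logger.Close() }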

The linked commit is for v3, but it should be adaptable as a backport to v2. I'll try to have that patch for v2 in the next couple days.

@riot-jayb (Author)

Awesome work dude :)

@tgross (Contributor) commented Mar 23, 2017

It took me a bit longer to land #290 than I expected, but I'll try to pick this up today or tomorrow for v2.

@tgross (Contributor) commented Mar 24, 2017

Released in https://github.com/joyent/containerpilot/releases/tag/2.7.1. I'm going to mark this as fixed and close this issue, but feel free to reopen if it looks like we missed something.
