timing issues with multiple tasks #295

Closed · riot-jayb opened this issue Mar 13, 2017 · 13 comments

@riot-jayb commented Mar 13, 2017

Hi,

I'm seeing a strange issue when using multiple tasks with short frequencies. Either the tasks are sometimes not executed, or they are just not logged. I have a "heartbeat" task with a 15s frequency and a "backup" task with a 20s frequency (these are just stubs for testing the config at the moment), and the logs give very inconsistent results.

Here is my containerpilot config:

{
  "postStop" : "percona-mngr leave",
  "tasks": [
    {
      "name": "heartbeat",
      "command": "percona-mngr heartbeat",
      "frequency": "15s"
    },
    {
      "name": "backup",
      "command": "percona-mngr backup",
      "frequency": "20s"
    }
  ]
}

And here is the output log during container run time:

2017/03/13 05:07:11 percona-mngr backup (stub)
2017/03/13 05:07:16 percona-mngr heartbeat (stub)
2017/03/13 05:07:31 percona-mngr heartbeat (stub)
2017/03/13 05:07:46 percona-mngr heartbeat (stub)
2017/03/13 05:07:51 percona-mngr backup (stub)
2017/03/13 05:08:16 percona-mngr heartbeat (stub)
2017/03/13 05:08:46 percona-mngr heartbeat (stub)
2017/03/13 05:08:51 percona-mngr backup (stub)
2017/03/13 05:09:16 percona-mngr heartbeat (stub)
2017/03/13 05:09:46 percona-mngr heartbeat (stub)
2017/03/13 05:09:51 percona-mngr backup (stub)
2017/03/13 05:10:11 percona-mngr backup (stub)
2017/03/13 05:10:16 percona-mngr heartbeat (stub)
2017/03/13 05:10:31 percona-mngr heartbeat (stub)
2017/03/13 05:10:46 percona-mngr heartbeat (stub)
2017/03/13 05:10:51 percona-mngr backup (stub)
2017/03/13 05:11:01 percona-mngr heartbeat (stub)
2017/03/13 05:11:11 percona-mngr backup (stub)
2017/03/13 05:12:16 percona-mngr heartbeat (stub)
2017/03/13 05:13:16 percona-mngr heartbeat (stub)
2017/03/13 05:13:31 percona-mngr heartbeat (stub)
2017/03/13 05:13:46 percona-mngr heartbeat (stub)
2017/03/13 05:13:51 percona-mngr backup (stub)
2017/03/13 05:14:11 percona-mngr backup (stub)
2017/03/13 05:14:16 percona-mngr heartbeat (stub)
2017/03/13 05:14:31 percona-mngr backup (stub)
2017/03/13 05:14:31 percona-mngr heartbeat (stub)
2017/03/13 05:14:46 percona-mngr heartbeat (stub)
2017/03/13 05:14:51 percona-mngr backup (stub)
2017/03/13 05:15:01 percona-mngr heartbeat (stub)
2017/03/13 05:15:46 percona-mngr heartbeat (stub)
2017/03/13 05:16:31 percona-mngr heartbeat (stub)
2017/03/13 05:16:46 percona-mngr heartbeat (stub)
2017/03/13 05:17:16 percona-mngr heartbeat (stub)
2017/03/13 05:17:31 percona-mngr backup (stub)
2017/03/13 05:17:31 percona-mngr heartbeat (stub)
2017/03/13 05:17:46 percona-mngr heartbeat (stub)
2017/03/13 05:18:01 percona-mngr heartbeat (stub)
2017/03/13 05:18:11 percona-mngr backup (stub)
2017/03/13 05:18:16 percona-mngr heartbeat (stub)
2017/03/13 05:18:46 percona-mngr heartbeat (stub)
2017/03/13 05:18:51 percona-mngr backup (stub)
2017/03/13 05:19:11 percona-mngr backup (stub)
2017/03/13 05:20:16 percona-mngr heartbeat (stub)
2017/03/13 05:21:01 percona-mngr heartbeat (stub)
2017/03/13 05:21:11 percona-mngr backup (stub)
2017/03/13 05:21:31 percona-mngr backup (stub)
2017/03/13 05:21:51 percona-mngr backup (stub)
2017/03/13 05:22:01 percona-mngr heartbeat (stub)

As you can see, it appears that only one task or the other actually logs anything. I'm not sure whether they are actually not running or just not logging, because they are stub methods in the manager tool right now.

Has anyone seen this type of issue before?

@riot-jayb (Author)

Here is some output at the DEBUG logging level. It shows that the tasks were at least being started correctly. Each task command in percona-mngr is just a stub method that prints a log message, so there isn't much to go wrong there. It appears the log messages are somehow being dropped, or else the command is not actually executed and is failing somehow without logging any errors.

time="2017-03-13T05:38:06Z" level=debug msg="task[heartbeat].Cmd start"
time="2017-03-13T05:38:06Z" level=debug msg="task[heartbeat].RunWithTimeout end"
time="2017-03-13T05:38:11Z" level=debug msg="task[backup].RunWithTimeout start"
time="2017-03-13T05:38:11Z" level=debug msg="task[backup].Cmd start"
time="2017-03-13T05:38:11Z" level=info msg="percona-mngr backup (stub)" process=task task=backup
time="2017-03-13T05:38:11Z" level=debug msg="task[backup].RunWithTimeout end"
time="2017-03-13T05:38:21Z" level=debug msg="task[heartbeat].RunWithTimeout start"
time="2017-03-13T05:38:21Z" level=debug msg="task[heartbeat].Cmd start"
time="2017-03-13T05:38:21Z" level=info msg="percona-mngr heartbeat (stub)" process=task task=heartbeat
time="2017-03-13T05:38:21Z" level=debug msg="task[heartbeat].RunWithTimeout end"
time="2017-03-13T05:38:31Z" level=debug msg="task[backup].RunWithTimeout start"
time="2017-03-13T05:38:31Z" level=debug msg="task[backup].Cmd start"
time="2017-03-13T05:38:31Z" level=debug msg="task[backup].RunWithTimeout end"
time="2017-03-13T05:38:36Z" level=debug msg="task[heartbeat].RunWithTimeout start"
time="2017-03-13T05:38:36Z" level=debug msg="task[heartbeat].Cmd start"
time="2017-03-13T05:38:36Z" level=info msg="percona-mngr heartbeat (stub)" process=task task=heartbeat
time="2017-03-13T05:38:36Z" level=debug msg="task[heartbeat].RunWithTimeout end"
time="2017-03-13T05:38:51Z" level=debug msg="task[backup].RunWithTimeout start"
time="2017-03-13T05:38:51Z" level=debug msg="task[backup].Cmd start"
time="2017-03-13T05:38:51Z" level=debug msg="task[heartbeat].RunWithTimeout start"
time="2017-03-13T05:38:51Z" level=debug msg="task[heartbeat].Cmd start"
time="2017-03-13T05:38:51Z" level=debug msg="task[backup].RunWithTimeout end"
time="2017-03-13T05:38:51Z" level=info msg="percona-mngr heartbeat (stub)" process=task task=heartbeat
time="2017-03-13T05:38:51Z" level=debug msg="task[heartbeat].RunWithTimeout end"
time="2017-03-13T05:39:06Z" level=debug msg="task[heartbeat].RunWithTimeout start"
time="2017-03-13T05:39:06Z" level=debug msg="task[heartbeat].Cmd start"
time="2017-03-13T05:39:06Z" level=debug msg="task[heartbeat].RunWithTimeout end"
time="2017-03-13T05:39:11Z" level=debug msg="task[backup].RunWithTimeout start"
time="2017-03-13T05:39:11Z" level=debug msg="task[backup].Cmd start"
time="2017-03-13T05:39:11Z" level=info msg="percona-mngr backup (stub)" process=task task=backup
time="2017-03-13T05:39:11Z" level=debug msg="task[backup].RunWithTimeout end"
time="2017-03-13T05:39:21Z" level=debug msg="task[heartbeat].RunWithTimeout start"
time="2017-03-13T05:39:21Z" level=debug msg="task[heartbeat].Cmd start"
time="2017-03-13T05:39:21Z" level=debug msg="task[heartbeat].RunWithTimeout end"
time="2017-03-13T05:39:31Z" level=debug msg="task[backup].RunWithTimeout start"
time="2017-03-13T05:39:31Z" level=debug msg="task[backup].Cmd start"
time="2017-03-13T05:39:31Z" level=debug msg="task[backup].RunWithTimeout end"
time="2017-03-13T05:39:36Z" level=debug msg="task[heartbeat].RunWithTimeout start"
time="2017-03-13T05:39:36Z" level=debug msg="task[heartbeat].Cmd start"
time="2017-03-13T05:39:36Z" level=info msg="percona-mngr heartbeat (stub)" process=task task=heartbeat
time="2017-03-13T05:39:36Z" level=debug msg="task[heartbeat].RunWithTimeout end"
time="2017-03-13T05:39:51Z" level=debug msg="task[backup].RunWithTimeout start"
time="2017-03-13T05:39:51Z" level=debug msg="task[backup].Cmd start"
time="2017-03-13T05:39:51Z" level=debug msg="task[backup].RunWithTimeout end"
time="2017-03-13T05:39:51Z" level=debug msg="task[heartbeat].RunWithTimeout start"
time="2017-03-13T05:39:51Z" level=debug msg="task[heartbeat].Cmd start"
time="2017-03-13T05:39:51Z" level=debug msg="task[heartbeat].RunWithTimeout end"
time="2017-03-13T05:40:06Z" level=debug msg="task[heartbeat].RunWithTimeout start"
time="2017-03-13T05:40:06Z" level=debug msg="task[heartbeat].Cmd start"
time="2017-03-13T05:40:06Z" level=debug msg="task[heartbeat].RunWithTimeout end"
time="2017-03-13T05:40:11Z" level=debug msg="task[backup].RunWithTimeout start"
time="2017-03-13T05:40:11Z" level=debug msg="task[backup].Cmd start"
time="2017-03-13T05:40:11Z" level=info msg="percona-mngr backup (stub)" process=task task=backup
time="2017-03-13T05:40:11Z" level=debug msg="task[backup].RunWithTimeout end"

@riot-jayb (Author)

I'm seeing the same issue when I modify the task commands to just echo a message:

  "logging": {
    "level": "DEBUG",
    "format": "text"
  },
  "postStop" : "percona-mngr leave",
  "tasks": [
    {
      "name": "heartbeat",
      "command": "echo 'heartbeat'",
      "frequency": "15s"
    },
    {
      "name": "backup",
      "command": "echo 'backup'",
      "frequency": "20s"
    }
  ]
}

It took approximately 2 minutes to see the first info-level echo message (the debug start/end messages happen like clockwork).

@riot-jayb (Author) commented Mar 13, 2017

err, one thing to mention is that I built off the head of master. It looks like command.RunWithTimeout is undergoing some changes, so maybe this is expected not to be working fully yet.

I'll rebuild from 2.6.0 tag and see what happens.

edit: rebuilt from 2.6.0 tag and still seeing the issue:

time="2017-03-13T06:22:17Z" level=debug msg="task[backup].RunWithTimeout start" 
time="2017-03-13T06:22:17Z" level=debug msg="task[backup].Cmd.Start" 
time="2017-03-13T06:22:17Z" level=debug msg="task[backup].run waiting for PID 119: " 
time="2017-03-13T06:22:17Z" level=debug msg="task[backup].run complete" 
time="2017-03-13T06:22:17Z" level=debug msg="task[backup].run#gofunc received quit" 
time="2017-03-13T06:22:17Z" level=debug msg="task[backup].RunWithTimeout end" 
time="2017-03-13T06:22:17Z" level=debug msg="task[heartbeat].RunWithTimeout start" 
time="2017-03-13T06:22:17Z" level=debug msg="task[heartbeat].Cmd.Start" 
time="2017-03-13T06:22:17Z" level=debug msg="task[heartbeat].run waiting for PID 120: " 
time="2017-03-13T06:22:17Z" level=info msg="'heartbeat'" process=task task=heartbeat 
time="2017-03-13T06:22:17Z" level=debug msg="task[heartbeat].run complete" 
time="2017-03-13T06:22:17Z" level=debug msg="task[heartbeat].RunWithTimeout end" 
time="2017-03-13T06:22:17Z" level=debug msg="task[heartbeat].run#gofunc received quit"

You can see that only one heartbeat had an info log with the echo message. There are many more runs in the logs without any echo message being printed.

@tgross (Contributor) commented Mar 13, 2017

Thanks for providing the debug messages, that helps a lot. I suspect that you're seeing the same thing that @sourcec0de is seeing in #294.

I've got two "top-level" hypotheses at this point. One is that we're failing to spawn the process somehow. This would be very bad. The other (which I think is more likely) is that we're spawning the process but that there's a race in how we're capturing the stdout/stderr and so we're dropping the logs on the floor.

The way to verify this would be to update your hook to do something persistent in the environment, like dropping a file on disk for each invocation. (Alternatively, you can trace execution with dtrace or strace, but let's assume for the moment that we've got a container with limited tooling.) Something like this should do it:

# record each invocation by dropping a file named with the timestamp
NOW=$(date -u "+%Y-%m-%dT%H:%M:%SZ")
echo "$NOW"
echo "$NOW" > "$NOW"

If we see all files existing then we know that the invocations are happening but that we're dropping logs. If we don't see all the files existing then we know we're failing to spawn for some reason (and failing to log that fact as well!).

err, one thing to mention is that I built off the head of master.

Yeah, that's important to note! Master is not production-suitable at the moment. But we're seeing it in 2.6+ so it looks like it's an existing bug.

@riot-jayb (Author) commented Mar 13, 2017

I put a 3s sleep after the log statement in my tasks, and that makes them all show up in the logs with the correct timings. So it fits your race condition hypothesis: something is racing between scanning the output and cleaning up the command subprocess.

That also aligns with my observation that the echo commands had far more dropped logs, presumably because echo exits very quickly.
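
For reference, with the echo variant the workaround amounts to padding the command so the child process outlives the log scan, something along these lines (just a sketch; it assumes /bin/sh is available in the container, and the sleep exists only to keep the process alive long enough for its output to be scanned):

{
  "name": "heartbeat",
  "command": ["/bin/sh", "-c", "echo 'heartbeat' && sleep 3"],
  "frequency": "15s"
}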

@tgross (Contributor) commented Mar 14, 2017

Thanks, that's very helpful. At first my suspicion was that this problem was revealed in 2.5.1, where we had this PR to handle a null-pointer exception bug. That might have changed the timing enough for the underlying race in setting up the log handlers to appear. But this problem has probably been latent ever since we created tasks and started logging them separately from stdout/stderr.

For each command we set up the logging in commands/Command.setUpCmd, which calls utils/NewLogWriter. That in turn sets up an async handler so that we can stream logs as they're received and scan them into a new log line for each line of output.

And then here we defer closing the io.WriteCloser that we created for the logger. We have to do this (somewhere) or we leak resources. But that means that if we're scanning the log asynchronously but close the pipe before we've scanned it, the call to scanner.Scan() returns false, that goroutine exits, and we drop the log line.

I'll try to create a minimal program to reproduce this behavior to verify this hypothesis. I'm not exactly sure what we can do to fix it yet though.

@sourcec0de

@tgross I can confirm the same as @riot-jayb.

This fits the issue I reported in #294: my co-process writes new nginx configs that come in over a Google Cloud Pub/Sub queue. I can see those being written, but there's no log output for some time. Eventually, the logs do appear.

@tgross (Contributor) commented Mar 14, 2017

Ok, below is a minimal program that reproduces the behavior (mirrored in a gist here too).

If you run this as go run logbug.go echo hi then you'll see somewhere between 3 and 5 "hi" lines. It's a race, so obviously it varies. The time.Sleep ensures we're not exiting the program before we have a chance to catch the log, but it doesn't interfere with closing the io.WriteCloser in the defer clause in echo().


package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"os/exec"
	"time"
)

func NewLogWriter() io.WriteCloser {
	r, w := io.Pipe()
	go func() {
		scanner := bufio.NewScanner(r)
		for scanner.Scan() {
			fmt.Println(scanner.Text())
		}
	}()
	return w
}

func echo(args []string) {
	logger := NewLogWriter()
	go func() {
		// this deferred Close races against the scanner goroutine in NewLogWriter
		defer logger.Close()
		cmd := exec.Command(args[0], args[1:]...)
		cmd.Stdout = logger
		cmd.Stderr = logger
		cmd.Start()
		cmd.Process.Wait()
	}()
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("need args")
		return
	}
	echo(os.Args[1:])
	echo(os.Args[1:])
	echo(os.Args[1:])
	echo(os.Args[1:])
	echo(os.Args[1:])
	time.Sleep(1 * time.Second)
}

@tgross (Contributor) commented Mar 14, 2017

I just pushed a commit to the big v3 PR here, which minimizes this race but doesn't eliminate it entirely. And in v2 we don't even have that context cancellation to rely on.

The implementation of the underlying pipe (ref) is unbuffered and blocking, but designed to allow safe parallel access to read/write/close. Reads and Writes are matched one-to-one, but Close has to be allowed to interleave, so the race is simply unavoidable as long as we're using these primitives.

I think the approach we should probably take here is to avoid creating a new io.WriteCloser for each run if we can. That way we never have to close them (until ContainerPilot exits or reloads its config). This will require a bit of reworking of how we pass fields into the log entries and set log levels for them.
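
As a rough sketch (not the actual ContainerPilot code), applying that idea to the minimal program above looks something like this: create the logger once, share it across invocations, and only close it at shutdown:

package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"os/exec"
	"time"
)

// NewLogWriter is the same asynchronous scanner as in the repro above.
func NewLogWriter() io.WriteCloser {
	r, w := io.Pipe()
	go func() {
		scanner := bufio.NewScanner(r)
		for scanner.Scan() {
			fmt.Println(scanner.Text())
		}
	}()
	return w
}

// echo shares a long-lived logger instead of creating and closing one per
// invocation, so there's no Close for the scanner goroutine to race against.
func echo(logger io.Writer, args []string) {
	go func() {
		cmd := exec.Command(args[0], args[1:]...)
		cmd.Stdout = logger
		cmd.Stderr = logger
		cmd.Run() // Run waits for the output to be copied, unlike Process.Wait
	}()
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("need args")
		return
	}
	logger := NewLogWriter() // created once; closed only at shutdown
	defer logger.Close()
	for i := 0; i < 5; i++ {
		echo(logger, os.Args[1:])
	}
	time.Sleep(1 * time.Second)
}

With the logger outliving every run, none of the output gets dropped.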

@tgross (Contributor) commented Mar 21, 2017

This commit 3338d0a should implement the fix we'll need.

Because the stdlib won't let us reuse an exec.Cmd after it's been run, we recreate it each time we call commands.Command.Run(). Previously this included creating an io.WriteCloser to pipe out the logs from the stdout/stderr of the process we fork. In this patch, we instead create the io.WriteCloser once and attach it to the reusable Command object. The io.WriteCloser won't go out of scope during an execution of our process, so we won't drop logs because they have nowhere to go anymore.

(We'll still potentially drop logs if that higher-level Command is explicitly closed, but that shouldn't happen unless we're exiting ContainerPilot outright or we've just killed ContainerPilot and all its child processes. That's not really avoidable.)
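
Schematically, and building on the NewLogWriter helper from the minimal program above, the shape of the change is something like this (the field and method names here are illustrative, not the real ContainerPilot types):

// Illustrative only; this is not the actual ContainerPilot code.
type Command struct {
	Exec   string
	Args   []string
	logger io.WriteCloser // created once, lives as long as the Command
}

func NewCommand(executable string, args []string) *Command {
	return &Command{Exec: executable, Args: args, logger: NewLogWriter()}
}

// Run recreates the exec.Cmd each time (the stdlib requires that), but
// reuses the long-lived logger instead of a per-run io.WriteCloser.
func (c *Command) Run() error {
	cmd := exec.Command(c.Exec, c.Args...)
	cmd.Stdout = c.logger
	cmd.Stderr = c.logger
	return cmd.Run()
}

// Close is only called when ContainerPilot exits or reloads its config.
func (c *Command) Close() error { return c.logger.Close() }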

The linked commit is for v3, but it should be adaptable as a backport to v2. I'll try to have that patch for v2 in the next couple days.

@riot-jayb (Author)

Awesome work dude :)

@tgross (Contributor) commented Mar 23, 2017

It took me a bit longer to land #290 than I expected, but I'll try to pick this up today or tomorrow for v2.

@tgross (Contributor) commented Mar 24, 2017

Released in https://github.com/joyent/containerpilot/releases/tag/2.7.1. I'm going to mark this as fixed and close this issue, but feel free to reopen if it looks like we missed something.
