Add a timeout cmd line option and an option to report state info upon timeout to assist with debugging Jenkins tests #1724

Merged: 1 commit into open-mpi:master from topic/timeout, May 28, 2016

Conversation

@rhc54 (Contributor) commented May 27, 2016

No description provided.

@mike-dubman (Member):

Cool! But will this state dump include stack traces? It sounds like it will print only the ORTE internal state.
Could there be an option for stack traces as well?

@jsquyres (Member) commented May 27, 2016

If mpirun times out, there's not much point in printing out mpirun's stack trace (because that means mpirun is working properly -- its libevent timer tripped the timeout).

I assume that adding a mechanism to go get the stack traces from all the MPI processes would be a much bigger job.
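For context, the timeout on the mpirun side is just a timer event in its libevent loop: when the timer fires, the loop is by definition still making progress. Below is a minimal standalone sketch of that pattern -- illustrative only, not the actual mpirun code; the callback name and the 2-second value are made up.

    #include <sys/time.h>
    #include <event2/event.h>
    #include <stdio.h>

    /* Fires once the user-specified time limit elapses.  The event loop
     * itself is healthy, which is why a stack trace of this process is
     * not very interesting. */
    static void timeout_cb(evutil_socket_t fd, short what, void *arg)
    {
        (void) fd; (void) what;
        fprintf(stderr, "time limit reached -- aborting the job\n");
        event_base_loopexit((struct event_base *) arg, NULL);
    }

    int main(void)
    {
        struct event_base *base = event_base_new();
        struct timeval tv = { 2, 0 };           /* e.g., a 2-second limit */

        struct event *ev = evtimer_new(base, timeout_cb, base);
        evtimer_add(ev, &tv);                   /* arm the one-shot timer */

        event_base_dispatch(base);              /* returns after loopexit */
        event_free(ev);
        event_base_free(base);
        return 0;
    }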

@rhc54 (Contributor, Author) commented May 27, 2016

Yeah, it sure would -- I'd have to think about if/when it could be done.

@mike-dubman (Member):

Great! BTW -- maybe when ORTE detects and reports status after a timeout, it could run system("gstack vpid")?

@jsquyres (Member):

@miked-mellanox Not sure what you're suggesting... are you looking for the stack traces from mpirun or from the MPI/OSHMEM processes? The former wouldn't be too useful (per above); the latter is a much bigger job.

@rhc54 (Contributor, Author) commented May 27, 2016

I could have mpirun signal the orteds and get them to issue a "gstack pid" for every child -- the IO would be trapped by the orted and brought back. That wouldn't be too awful.
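Sketched out, the per-child collection could look roughly like this -- illustrative only, not the ORTE implementation; collect_stack_trace is a hypothetical helper:

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical helper: run "gstack <pid>" for one child and copy its
     * output into buf (up to len bytes); the daemon would then relay the
     * captured text back to mpirun. */
    int collect_stack_trace(pid_t child, char *buf, size_t len)
    {
        char cmd[64];
        snprintf(cmd, sizeof(cmd), "gstack %d", (int) child);

        FILE *fp = popen(cmd, "r");      /* trap gstack's stdout */
        if (NULL == fp) {
            return -1;                   /* could not start gstack at all */
        }

        size_t used = 0, n;
        while (used < len - 1 &&
               (n = fread(buf + used, 1, len - 1 - used, fp)) > 0) {
            used += n;
        }
        buf[used] = '\0';

        return pclose(fp);               /* nonzero if gstack itself failed */
    }

    int main(void)
    {
        /* Demo: grab a trace of this process itself. */
        char buf[65536];
        if (0 == collect_stack_trace(getpid(), buf, sizeof(buf))) {
            fputs(buf, stderr);
        }
        return 0;
    }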

@mike-dubman (Member):

From both. ORTE keeps the PIDs of all launched processes and can run system("gstack pid") for each of them. (Jenkins runs on a single node.)

@rhc54 (Contributor, Author) commented May 27, 2016

I realize Jenkins runs on one node, but I think we want to keep this a bit more general -- e.g., MTT might find it useful as well.

@mike-dubman (Member):

@rhc54 -- yep, your suggestion is even better!

@rhc54 (Contributor, Author) commented May 28, 2016

Okay, here is what the output looks like now (obviously, for lots of procs this is going to be pretty long). We can clearly tinker with it some more, perhaps adding an option to get the stack trace of a specific rank, or only from N ranks, etc.

$ mpirun -n 4 --map-by node -timeout 2 --report-state-on-timeout --get-stacktraces ./mpi_spin
--------------------------------------------------------------------------
The user-provided time limit for job execution has been
reached:

  MPIEXEC_TIMEOUT: 2 seconds

The job will now be aborted. Please check your code and/or
adjust/remove the job execution time limit (as specified
by MPIEXEC_TIMEOUT in your environment or -timeout on the cmd line).
--------------------------------------------------------------------------
DATA FOR JOB: [64147,0]
    Num apps: 1 Num procs: 2    JobState: ALL DAEMONS REPORTED  Abort: False
    Num launched: 0 Num reported: 2 Num terminated: 0

    Procs:
        Rank: 0 Node: bend001   PID: 30669  State: RUNNING  ExitCode 0
        Rank: 1 Node: bend003   PID: 30674  State: RUNNING  ExitCode 0

DATA FOR JOB: [64147,1]
    Num apps: 1 Num procs: 4    JobState: SYNC REGISTERED   Abort: False
    Num launched: 4 Num reported: 4 Num terminated: 0

    Procs:
        Rank: 0 Node: bend001   PID: 30675  State: SYNC REGISTERED  ExitCode 0
        Rank: 1 Node: bend003   PID: 9726   State: SYNC REGISTERED  ExitCode 0
        Rank: 2 Node: bend001   PID: 30676  State: SYNC REGISTERED  ExitCode 0
        Rank: 3 Node: bend003   PID: 9727   State: SYNC REGISTERED  ExitCode 0

STACKTRACE FOR PROC [[64147,1],0]
    Thread 3 (Thread 0x7fb99456d700 (LWP 30678)):
    #0  0x00007fb995d43c3d in poll () from /usr/lib64/libc.so.6
    #1  0x00007fb996bf7cdb in poll_dispatch (base=0x120a090, tv=<optimized out>) at poll.c:165
    #2  0x00007fb996bee6fe in opal_libevent2022_event_base_loop (base=0x120a090, flags=1) at event.c:1630
    #3  0x00007fb996b93f7e in progress_engine (obj=0x12098e8) at runtime/opal_progress_threads.c:105
    #4  0x00007fb996020dc5 in start_thread () from /usr/lib64/libpthread.so.0
    #5  0x00007fb995d4e28d in clone () from /usr/lib64/libc.so.6
    Thread 2 (Thread 0x7fb9938e5700 (LWP 30680)):
    #0  0x00007fb995d4e863 in epoll_wait () from /usr/lib64/libc.so.6
    #1  0x00007fb996beb36b in epoll_dispatch (base=0x122e780, tv=<optimized out>) at epoll.c:407
    #2  0x00007fb996bee6fe in opal_libevent2022_event_base_loop (base=0x122e780, flags=1) at event.c:1630
    #3  0x00007fb99391952a in progress_engine (obj=0x122e780) at src/util/progress_threads.c:52
    #4  0x00007fb996020dc5 in start_thread () from /usr/lib64/libpthread.so.0
    #5  0x00007fb995d4e28d in clone () from /usr/lib64/libc.so.6
    Thread 1 (Thread 0x7fb997748800 (LWP 30675)):
    #0  0x000000000040075a in main (argc=1, argv=0x7ffd3099eb18) at mpi_spin.c:23

STACKTRACE FOR PROC [[64147,1],2]
    Thread 3 (Thread 0x7fa8290f4700 (LWP 30677)):
    #0  0x00007fa82a8cac3d in poll () from /usr/lib64/libc.so.6
    #1  0x00007fa82b77ecdb in poll_dispatch (base=0x1a12090, tv=<optimized out>) at poll.c:165
    #2  0x00007fa82b7756fe in opal_libevent2022_event_base_loop (base=0x1a12090, flags=1) at event.c:1630
    #3  0x00007fa82b71af7e in progress_engine (obj=0x1a118e8) at runtime/opal_progress_threads.c:105
    #4  0x00007fa82aba7dc5 in start_thread () from /usr/lib64/libpthread.so.0
    #5  0x00007fa82a8d528d in clone () from /usr/lib64/libc.so.6
    Thread 2 (Thread 0x7fa823fff700 (LWP 30679)):
    #0  0x00007fa82a8d5863 in epoll_wait () from /usr/lib64/libc.so.6
    #1  0x00007fa82b77236b in epoll_dispatch (base=0x1a36780, tv=<optimized out>) at epoll.c:407
    #2  0x00007fa82b7756fe in opal_libevent2022_event_base_loop (base=0x1a36780, flags=1) at event.c:1630
    #3  0x00007fa8284a052a in progress_engine (obj=0x1a36780) at src/util/progress_threads.c:52
    #4  0x00007fa82aba7dc5 in start_thread () from /usr/lib64/libpthread.so.0
    #5  0x00007fa82a8d528d in clone () from /usr/lib64/libc.so.6
    Thread 1 (Thread 0x7fa82c2cf800 (LWP 30676)):
    #0  0x000000000040075a in main (argc=1, argv=0x7ffcc94b05e8) at mpi_spin.c:23

STACKTRACE FOR PROC [[64147,1],1]
    Thread 3 (Thread 0x7fd56526a700 (LWP 9732)):
    #0  0x00007fd567246bcd in poll () from /usr/lib64/libc.so.6
    #1  0x00007fd567ee8cdb in poll_dispatch (base=0x19450c0, tv=<optimized out>) at poll.c:165
    #2  0x00007fd567edf6fe in opal_libevent2022_event_base_loop (base=0x19450c0, flags=1) at event.c:1630
    #3  0x00007fd567e84f7e in progress_engine (obj=0x1944918) at runtime/opal_progress_threads.c:105
    #4  0x00007fd567523dc5 in start_thread () from /usr/lib64/libpthread.so.0
    #5  0x00007fd56725121d in clone () from /usr/lib64/libc.so.6
    Thread 2 (Thread 0x7fd55ffff700 (LWP 9734)):
    #0  0x00007fd5672517f3 in epoll_wait () from /usr/lib64/libc.so.6
    #1  0x00007fd567edc36b in epoll_dispatch (base=0x19697d0, tv=<optimized out>) at epoll.c:407
    #2  0x00007fd567edf6fe in opal_libevent2022_event_base_loop (base=0x19697d0, flags=1) at event.c:1630
    #3  0x00007fd56461652a in progress_engine (obj=0x19697d0) at src/util/progress_threads.c:52
    #4  0x00007fd567523dc5 in start_thread () from /usr/lib64/libpthread.so.0
    #5  0x00007fd56725121d in clone () from /usr/lib64/libc.so.6
    Thread 1 (Thread 0x7fd568a22740 (LWP 9726)):
    #0  0x000000000040074e in main (argc=1, argv=0x7ffc641ce338) at mpi_spin.c:23

STACKTRACE FOR PROC [[64147,1],3]
    Thread 3 (Thread 0x7f59ee4a4700 (LWP 9731)):
    #0  0x00007f59f0480bcd in poll () from /usr/lib64/libc.so.6
    #1  0x00007f59f1122cdb in poll_dispatch (base=0x15710c0, tv=<optimized out>) at poll.c:165
    #2  0x00007f59f11196fe in opal_libevent2022_event_base_loop (base=0x15710c0, flags=1) at event.c:1630
    #3  0x00007f59f10bef7e in progress_engine (obj=0x1570918) at runtime/opal_progress_threads.c:105
    #4  0x00007f59f075ddc5 in start_thread () from /usr/lib64/libpthread.so.0
    #5  0x00007f59f048b21d in clone () from /usr/lib64/libc.so.6
    Thread 2 (Thread 0x7f59ed81c700 (LWP 9733)):
    #0  0x00007f59f048b7f3 in epoll_wait () from /usr/lib64/libc.so.6
    #1  0x00007f59f111636b in epoll_dispatch (base=0x15957d0, tv=<optimized out>) at epoll.c:407
    #2  0x00007f59f11196fe in opal_libevent2022_event_base_loop (base=0x15957d0, flags=1) at event.c:1630
    #3  0x00007f59ed85052a in progress_engine (obj=0x15957d0) at src/util/progress_threads.c:52
    #4  0x00007f59f075ddc5 in start_thread () from /usr/lib64/libpthread.so.0
    #5  0x00007f59f048b21d in clone () from /usr/lib64/libc.so.6
    Thread 1 (Thread 0x7f59f1c5c740 (LWP 9727)):
    #0  main (argc=1, argv=0x7ffc4c296d38) at mpi_spin.c:24

@mike-dubman (Member):

Wow, that's great! Thanks a lot!

@jsquyres (Member):

That is totally sweet. We should definitely put this in v2.1.

@jsquyres added this to the v2.1.0 milestone May 28, 2016

@rhc54 (Contributor, Author) commented May 28, 2016

Jeff provided a few minor changes -- please note that he changed the -get-stacktraces option to the more grammatically correct -get-stack-traces.

In terms of usage, I would suggest setting the mpirun --timeout value to something like 10-60 seconds less than the Jenkins/MTT timeout. This gives plenty of time for the stack traces to be collected (note that I have hardcoded a time limit of 30 seconds for that to occur -- we can always add an option to set that time as well) and for mpirun to terminate the job. You could then hit mpirun with a segv or whatever if you want the mpirun stack trace.

So this would clearly differentiate between the scenario where the app has hung and the one where mpirun and its daemons have hung, and provide the relevant stack traces. For example, see the sketch below.
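As a concrete illustration (the timeout values and test name here are purely made up): with a Jenkins/MTT job timeout of 360 seconds, the test could be launched as

$ mpirun -n 4 --timeout 300 --report-state-on-timeout --get-stack-traces ./my_test

leaving a 60-second margin for the 30-second stack trace collection and the job teardown before the external timeout fires.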

Add a timeout cmd line option and an option to report state info upon timeout to assist with debugging Jenkins tests

If requested, obtain stacktraces for each application process and report them to stderr upon timeout

stack traces: minor improvements

- Also include the hostname and PID of each process for which
  we're sending the stack traces (vs. just including the ORTE process
  name)
- Send a specific error message if we couldn't find "gstack" in the
  $PATH (e.g., on OS X)
- Send a specific error message if gstack fails to run
- Print a message that obtaining the stack traces may take a few
  seconds so that users don't wonder what's happening

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

help-orterun.txt: minor tweaks

Trivial update: show "--timeout" (instead of "-timeout") in the help
message, just to encourage the use of double-dash options.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>

trivial: stacktrace -> stack trace

Trivial word smithing.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
@rhc54 merged commit a93c01d into open-mpi:master May 28, 2016
@rhc54 deleted the topic/timeout branch May 28, 2016 15:36
@mike-dubman (Member):

@rhc54 -- thanks a lot!
Do you plan to integrate this PR into v1.x and v2.x?

@rhc54 (Contributor, Author) commented May 29, 2016

@jsquyres indicated a desire to have it in 2.1 -- let me look and see how hard it would be to bring to the 1.10 series. I'm not sure how big a change it would require.

@jjhursey (Member) commented Jun 6, 2016

@rhc54 I would be interested to see this in v2.x. Did you have a chance to look into how difficult it might be to port it over?

@rhc54 (Contributor, Author) commented Jun 6, 2016

It's no trouble at all for 2.x -- 1.10 is kinda messy. I've just been holding off because @jsquyres indicated it would wait until 2.1, so there's no point in generating the PR yet.
