UCT/SM/CUDA: Fix common intra-node keepalive protocol #7780
src/uct/base/uct_iface.c
Outdated
UCT_EP_KEEPALIVE_CHECK_PARAM(flags, comp);

if (*ka_p == NULL) {
status = uct_ep_keepalive_create(pid, ka_p);
status = uct_ep_keepalive_create(pid, ka_p);
do we really need cross-if alignment by = ?
we have this a lot throughout the code, but i can remove it
src/uct/base/uct_iface.h
Outdated
@@ -295,7 +295,7 @@ typedef struct uct_failed_iface {
* Keepalive info used by EP
*/
typedef struct uct_keepalive_info {
struct timespec start_time; /* Process start time */
unsigned long start_time; /* Process start time */
pls, remove extra whitespaces here and below
ok, but typically we have more than one space for struct fields
test/gtest/uct/test_peer_failure.cc
Outdated
ASSERT_TRUE(has_mm());
uct_mm_ep_t *ep = ucs_derived_of(m_entity->ep(0), uct_mm_ep_t);
m_ka = &ep->keepalive;
minor: i would add a getter function to return ka pointer instead of caching it in a member variable
these 3 lines could be the body of that getter function
Is this PR ready to merge?
What
Implement the intra-node keepalive protocol based on the starttime value from /proc/<pid>/stat.
Why?
The current implementation relies on the info returned by stat() on the /proc/<pid> directory. This is incorrect, because the files in /proc have no existence of their own, and the info returned by stat() is not persistent.
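
For illustration only, here is a minimal sketch of how a process start time could be read from /proc/<pid>/stat. It is not the code from this PR: the helper names parse_stat_starttime and read_proc_starttime are hypothetical, and the sketch assumes the Linux procfs layout in which starttime is field 22 and the command name (field 2) is wrapped in parentheses.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper (not from the PR): extract field 22 (starttime)
 * from the contents of a /proc/<pid>/stat file. Returns 0 on success. */
static int parse_stat_starttime(const char *line,
                                unsigned long long *starttime_p)
{
    const char *p;
    char *end;
    int field;

    assert(line != NULL);
    assert(starttime_p != NULL);

    /* Field 2 (comm) is parenthesized and may itself contain spaces or
     * ')' characters, so resume parsing after the LAST ')'. */
    p = strrchr(line, ')');
    if (p == NULL) {
        return -1;
    }
    p++;

    /* Skip fields 3..21; each is a single space-separated token. */
    for (field = 3; field < 22; field++) {
        while (*p == ' ') {
            p++;
        }
        if (*p == '\0') {
            return -1; /* line is shorter than expected */
        }
        while ((*p != ' ') && (*p != '\0')) {
            p++;
        }
    }

    while (*p == ' ') {
        p++;
    }
    if ((*p < '0') || (*p > '9')) {
        return -1; /* starttime must be an unsigned decimal number */
    }

    errno        = 0;
    *starttime_p = strtoull(p, &end, 10);
    return ((errno != 0) || (end == p)) ? -1 : 0;
}

/* Hypothetical helper (not from the PR): read /proc/<pid>/stat and
 * return the start time of the process, in the units the kernel
 * reports. Returns 0 on success, -1 if the process is gone or the
 * file cannot be parsed. */
static int read_proc_starttime(pid_t pid, unsigned long long *starttime_p)
{
    char path[64];
    char buf[1024];
    size_t n;
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%ld/stat", (long)pid);
    f = fopen(path, "r");
    if (f == NULL) {
        return -1;
    }

    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';

    return parse_stat_starttime(buf, starttime_p);
}
```

A keepalive check along these lines would store the value once when the endpoint is created and later compare it with a fresh read: a read failure or a different value suggests that the peer exited or that its pid was reused by another process.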