Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UCT/SM/CUDA: Fix common intra-node keepalive protocol #7780

Merged
merged 1 commit into from
Jan 3, 2022

Conversation

brminich
Copy link
Contributor

@brminich brminich commented Dec 8, 2021

What

Implement intra-node keep-alive protocol based on startime value from /proc/pid/stat

Why ?

The current implementation relies on stat() info performed on /proc/pid dir. This is incorrect, because the files in /proc have no existence of their own, and the info returned by stat() is not persistent.

src/ucs/sys/sys.c Show resolved Hide resolved
src/uct/base/uct_iface.c Outdated Show resolved Hide resolved
src/uct/base/uct_iface.c Outdated Show resolved Hide resolved
src/ucs/sys/sys.c Outdated Show resolved Hide resolved
src/ucs/sys/sys.h Show resolved Hide resolved

UCT_EP_KEEPALIVE_CHECK_PARAM(flags, comp);

if (*ka_p == NULL) {
status = uct_ep_keepalive_create(pid, ka_p);
status = uct_ep_keepalive_create(pid, ka_p);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need has cross-if alignment by =?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have them a lot thru the code, but i can remove

@@ -295,7 +295,7 @@ typedef struct uct_failed_iface {
* Keepalive info used by EP
*/
typedef struct uct_keepalive_info {
struct timespec start_time; /* Process start time */
unsigned long start_time; /* Process start time */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls, remove extra whitespaces here and below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, but typically we have more than one space for struct fields

@brminich brminich force-pushed the uct/intra_node_ka_fix branch 2 times, most recently from 782cbc6 to 5dc7710 Compare December 10, 2021 13:15
@brminich brminich marked this pull request as draft December 10, 2021 13:16
src/ucs/sys/sys.c Outdated Show resolved Hide resolved
src/ucs/sys/sys.c Outdated Show resolved Hide resolved
src/uct/base/uct_iface.c Outdated Show resolved Hide resolved
src/uct/cuda/cuda_ipc/cuda_ipc_ep.c Outdated Show resolved Hide resolved
@brminich brminich marked this pull request as ready for review December 14, 2021 14:26
Comment on lines 474 to 476
ASSERT_TRUE(has_mm());
uct_mm_ep_t *ep = ucs_derived_of(m_entity->ep(0), uct_mm_ep_t);
m_ka = &ep->keepalive;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: i would add a getter function to return ka pointer instead of caching it in a member variable
these 3 lines could be the body of that getter function

@eric-haibin-lin
Copy link

Is this PR ready to merge?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants