UCT/UGNI: Start using spinlocks to protect critical structures #1494

Merged
merged 4 commits into from
May 19, 2017

Conversation

MattBBaker
Contributor

This PR is deceptively big. A lot of it is shuffling a few big functions and rearranging some data structures to allow a shared thread-safe code path. I couldn't find a useful way of splitting it up more.

It adds a spinlock to the device, which is what will be used for locking in UGNI. This lock is then used when attaching the CDM to the device. The CDM device fields are then merged together into a shared struct.

@mellanox-github
Contributor

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/1607/ for details.

@mellanox-github
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/3528/ for details (Mellanox internal link).

@MattBBaker
Contributor Author

@shamisp Could you review?

@shamisp
Contributor

shamisp commented May 10, 2017

v1.3, right?

@MattBBaker
Contributor Author

@shamisp Yes, this is a bit too much of a change for 1.2.

@MattBBaker MattBBaker added this to the v1.3 milestone May 11, 2017
@shamisp
Contributor

shamisp commented May 11, 2017

@MattBBaker It is actually up to you. I think these changes can be considered a thread-safety bugfix. If you think it is important for ORNL to have it in v1.2, we can push it.
I don't think @yosefe will have a major objection.

@MattBBaker
Contributor Author

@shamisp This is part of a larger set of patches coming down the pipe, with some more rearrangements and potential performance impacts. I think that calling it 1.3 is safer, though we should also talk about the release time frame for 1.3.

@MattBBaker
Contributor Author

@shamisp Also, don't merge yet, I just found a bug.

@MattBBaker MattBBaker force-pushed the topic/uct-ugni-thread-safe-cdm branch from 55f8146 to e88b6c6 Compare May 11, 2017 15:32
@MattBBaker
Contributor Author

@shamisp Fixed now. I guess gcc doesn't check the parameter list for empty macros.

@shamisp
Contributor

shamisp commented May 11, 2017 via email

@MattBBaker
Contributor Author

Yes.

@mellanox-github
Contributor

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/1634/ for details.

@mellanox-github
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/3553/ for details (Mellanox internal link).

@@ -53,7 +53,7 @@ static ucs_status_t uct_ugni_smsg_mbox_reg(uct_ugni_smsg_iface_t *iface, uct_ugn
     }
     pthread_mutex_lock(&uct_ugni_global_lock);
 
-    ugni_rc = GNI_MemRegister(iface->super.nic_handle, (uint64_t)address,
+    ugni_rc = GNI_MemRegister(uct_ugni_iface_nic_handle(&iface->super), (uint64_t)address,
Contributor

I'm curious: what is the motivation behind uct_ugni_iface_nic_handle()? It's more to type :)

Contributor Author

It's most useful in UDT, where changes to the structure (like when I was deciding where that field should land in the final structure layout) required changes in a lot of places. This way, I change it once and everything compiles fine. I can condense the name. :)

Contributor

okay... you are the one that will be typing this

@@ -479,8 +466,6 @@ static UCS_CLASS_INIT_FUNC(uct_ugni_udt_iface_t, uct_md_h md, uct_worker_h worke
ucs_warn("GNI_EpDestroy failed, Error status: %s %d",
gni_err_str[ugni_rc], ugni_rc);
}
clean_iface_activate:
ugni_deactivate_iface(&self->super);
Contributor

Where did the cleanup go?

Contributor Author

In a different commit that apparently I failed to include. Oops, will rectify that.

Contributor

Overall, I noticed that the cleanup flow disappeared from a few places.

Contributor Author

The story is that there are multiple related commits. Together they are massive, so I broke them up. The cleanup path was fixed in the next PR, but I need to fix it for this one.

static inline int uct_ugni_check_device_type(uct_ugni_iface_t *iface, gni_nic_device_t type)
{
uct_ugni_device_t *dev = uct_ugni_iface_device(iface);
return dev->type == type;
Contributor

bool instead of int?
Also, it can be a single line.

GNI_CQ_NOBLOCK,
NULL, NULL, &self->local_cq);
if (GNI_RC_SUCCESS != ugni_rc) {
ucs_error("GNI_CqCreate failed, Error status: %s %d",
Contributor

Somebody has to clean up the CDM?

@mellanox-github
Contributor

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/1638/ for details.

@mellanox-github
Contributor

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/1640/ for details.

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/3557/ for details (Mellanox internal link).

@@ -101,4 +101,8 @@ static inline int uct_ugni_udt_ep_any_post(uct_ugni_udt_iface_t *iface)
return UCS_OK;
}

static inline gni_nic_handle_t uct_ugni_udt_iface_nic_handle(uct_ugni_udt_iface_t *iface)
Contributor

Looks like a macro to me.

Contributor Author

Looks like an inline function. The one just pushed looks like a macro.

@shamisp
Contributor

shamisp commented May 11, 2017

@hppritcha if you have cycles, please take a look.

@mellanox-github
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/3559/ for details (Mellanox internal link).

@MattBBaker MattBBaker force-pushed the topic/uct-ugni-thread-safe-cdm branch from 38e5ea0 to 6a82d5c Compare May 12, 2017 14:25
@mellanox-github
Contributor

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/1651/ for details.

@MattBBaker
Contributor Author

The ORNL Jenkins failure is an already-known mlx failure. It is not relevant to this PR.

@mellanox-github
Contributor

Test PASSed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/3569/ for details (Mellanox internal link).

@MattBBaker
Contributor Author

bot:ornl:retest

if (GNI_RC_SUCCESS != ugni_rc) {
ucs_error("GNI_CdmAttach failed (domain id %d, %d), Error status: %s %d",
cdm->domain_id, ugni_domain_counter, gni_err_str[ugni_rc], ugni_rc);
GNI_CdmDestroy(cdm->cdm_handle);
Contributor

No status check for the return code
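
One way to address this, sketched with stub functions in place of the real GNI calls so that it compiles standalone: check the return code of the destroy call in the error path and log it, while still reporting the original failure to the caller. `fake_attach`, `fake_destroy`, `create_domain`, and the `rc_t` codes are all hypothetical.

```c
#include <assert.h>
#include <stdio.h>

typedef enum { RC_SUCCESS = 0, RC_ERROR = 1 } rc_t;

/* Stubs standing in for the attach and destroy calls. */
static rc_t fake_attach(int should_fail)  { return should_fail ? RC_ERROR : RC_SUCCESS; }
static rc_t fake_destroy(int should_fail) { return should_fail ? RC_ERROR : RC_SUCCESS; }

/* Returns 0 on success and -1 on failure.
 * *cleanup_errors counts cleanup calls that themselves failed. */
static int create_domain(int attach_fails, int destroy_fails, int *cleanup_errors)
{
    rc_t rc;

    assert(cleanup_errors != NULL);

    rc = fake_attach(attach_fails);
    if (rc != RC_SUCCESS) {
        fprintf(stderr, "attach failed, status %d\n", (int)rc);
        /* The cleanup call can fail too: check and log it. */
        rc = fake_destroy(destroy_fails);
        if (rc != RC_SUCCESS) {
            fprintf(stderr, "destroy failed during cleanup, status %d\n", (int)rc);
            ++*cleanup_errors;
        }
        return -1;
    }
    return 0;
}
```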

@MattBBaker
Contributor Author

@shamisp Updated. Should be good to merge now, yes?

Contributor

@shamisp shamisp left a comment


Please review the rest of the code for similar issues

}
return status;
clean_cq:
GNI_CqDestroy(self->local_cq);
Contributor

Log error code

ucs_mpool_cleanup(&self->free_desc, 1);
ucs_mpool_cleanup(&self->free_mbox, 1);
GNI_CqDestroy(self->remote_cq);
Contributor

please log the error

@mellanox-github
Contributor

Test PASSed.
See http://bgate.mellanox.com/jenkins/job/gh-ucx-pr/1693/ for details.

{
ucs_arbiter_cleanup(&iface->arbiter);
ucs_mpool_cleanup(&iface->flush_pool, 1);
GNI_CqDestroy(iface->local_cq);
Contributor

error log

Contributor Author

@shamisp Thing is, the next PR pulls out all of the GNI_Cq* calls and has common spin locked code paths with error logging. I'd like to fix these problems in the next PR instead.

@@ -180,7 +180,7 @@ ucs_status_t uct_ugni_smsg_ep_connect_to_ep(uct_ep_h tl_ep,
}

ep_hash = (uint32_t)iface_addr->ep_hash;
-    gni_rc = GNI_EpSetEventData(ep->super.ep, iface->domain_id, ep_hash);
+    gni_rc = GNI_EpSetEventData(ep->super.ep, iface->cdm.domain_id, ep_hash);
Contributor

It seems that in some places we use gni_rc and in others ugni_rc. IMHO this should be consistent (separate PR).

Contributor Author

Sounds like a plan.

@mellanox-github
Contributor

Test FAILed.
See http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/3623/ for details (Mellanox internal link).

@shamisp
Contributor

shamisp commented May 19, 2017

Opened Issue #1529

@shamisp shamisp merged commit 6b29f42 into openucx:master May 19, 2017
@MattBBaker MattBBaker deleted the topic/uct-ugni-thread-safe-cdm branch May 24, 2017 19:24