Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/rxm : Corrected rxm to return an existing MR key in its MR Map. #10371

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 6 additions & 5 deletions prov/rxm/src/rxm_domain.c
Original file line number Diff line number Diff line change
Expand Up @@ -372,9 +372,10 @@ static int rxm_mr_add_map_entry(struct util_domain *domain,
ofi_genlock_lock(&domain->lock);
ret = ofi_mr_map_insert(&domain->mr_map, msg_attr, &temp_key, rxm_mr, flags);
if (OFI_UNLIKELY(ret)) {
FI_WARN(&rxm_prov, FI_LOG_DOMAIN,
"MR map insert for atomic verification failed %d\n",
ret);
if (ret != -FI_ENOKEY)
FI_WARN(&rxm_prov, FI_LOG_DOMAIN,
"MR map insert for atomic verification failed %d\n",
ret);
Comment on lines +375 to +378
Copy link
Contributor

@j-xiong j-xiong Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of FI_ENOKEY, the key in the mr_map is still associated with an old rxm_mr, which apparently is still valid. We are now in a situation that a key is associated with multiple rxm_mr's, but only one is present in the mr_map.

In rxm_handle_atomic_req, rxm_mr_get_map_entry is called to retrieve the rxm_mr associated with the key. In this case, the old rxm_mr is going to be referenced. That may or may not work as expected. If one close the old rxm_mr. The key will be removed from the mr_map. The atomic req processing would fail even though it shouldn't.

I don't have a good solution for this. It's basically a behavior confliction of cached MR at verbs level and non-cached behavior of the mr_map at rxm level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to do a lookup before inserting the rxm_mr and only insert if lookup fails?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't make any difference. The mr_map is mainly used to find rxm_mr with a key. There can be only one association in mr_map, but we have multiple rxm_mr's.

} else {
assert(rxm_mr->mr_fid.key == temp_key);
}
Expand Down Expand Up @@ -622,7 +623,7 @@ static int rxm_mr_regattr(struct fid *fid, const struct fi_mr_attr *attr,
if (rxm_domain->util_domain.info_domain_caps & FI_ATOMIC) {
ret = rxm_mr_add_map_entry(&rxm_domain->util_domain,
&msg_attr, rxm_mr, flags);
if (ret)
if (ret && ret != -FI_ENOKEY)
goto map_err;
}

Expand Down Expand Up @@ -674,7 +675,7 @@ static int rxm_mr_regv(struct fid *fid, const struct iovec *iov, size_t count,
if (rxm_domain->util_domain.info_domain_caps & FI_ATOMIC) {
ret = rxm_mr_add_map_entry(&rxm_domain->util_domain,
&msg_attr, rxm_mr, flags);
if (ret)
if (ret && ret != -FI_ENOKEY)
goto map_err;
}

Expand Down
Loading