-
Notifications
You must be signed in to change notification settings - Fork 376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prov/lnx: LINKx (lnx) provider #10437
base: main
Are you sure you want to change the base?
Conversation
Add the FI_PEER capability bit to the CXI provider fi_info Signed-off-by: Amir Shehata <shehataa@ornl.gov>
On cq_open, check the FI_PEER_IMPORT, if set, set all internal cq operation to be enosys, with the exception to the read callback. The read callback is overloaded to operate as a progress callback function. Invoking the read callback will progress the enpoints linked to this CQ. Keep track of the fid_peer_cq structure passed in. If the FI_PEER_IMPORT flag is set, then set the callbacks in cxip_cq structure which handle writing to the peer_cq, otherwise set them to the ones which write to the util_cq. A provider needs to call a different set of functions to insert completion events into an imported CQ vs an internal CQ. These set of callback definition standardize a way to assign a different function to a CQ object, which can then be called to insert into the CQ. For example: struct prov_cq { struct util_cq *util_cq; struct fid_peer_cq *peer_cq; ofi_peer_cq_cb cq_cb; }; When a provider opens a CQ it can: if (attr->flags & FI_PEER_IMPORT) { prov_cq->cq_cb.cq_comp = prov_peer_cq_comp; } else { prov_cq->cq_cb.cq_comp = prov_cq_comp; } Collect the peer CQ callbacks in one structure for use in CXI. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Restructure the code to allow for posting on the owner provider's shared receive queues. Do not do a reverse lookup on the AV table to get the fi_addr_t, instead register an address matching callback with the owner. The owner can then call the address matching callback to match an fi_addr_t to the source address in the message received. This is more efficient as the peer lookup can be an O(1) operation; AV[fi_addr_t]. The peer's CXI address can be compared with the CXI address in the message received. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Handle the case where a message from a peer arrives before the peer is inserted. Implement the callflow to support this scenario. Signed-off-by: Amir Shehata <shehataa@ornl.gov>
The LINKx (lnx) provider offers a framework by which multiple providers can be linked together and presented as one provider to the application. This abstracts away the details of the traffic providers from the application. This iteration of the provider allows linking only two providers, shm and another provider, ex; CXI or RXM. The composite providers which are linked together need to support the peer infrastructure. In order to use the lnx provider the user needs to: export FI_LNX_PROV_LINKS="shm+<inter-node provider>" ex: export FI_LNX_PROV_LINKS="shm+cxi" or export FI_LNX_PROV_LINKS="shm+tcp;ofi_rxm" Signed-off-by: Amir Shehata <shehataa@ornl.gov>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't looked through the link patch yet, just took a look at the first few and had some questions. I'll do the link patch in a separate review
@@ -138,7 +138,7 @@ | |||
FI_REMOTE_COMM | FI_RMA_EVENT | FI_MULTI_RECV | FI_FENCE | FI_TRIGGER) | |||
#define CXIP_EP_CAPS (CXIP_EP_PRI_CAPS | CXIP_EP_SEC_CAPS) | |||
#define CXIP_DOM_CAPS (FI_LOCAL_COMM | FI_REMOTE_COMM | FI_AV_USER_ID) | |||
#define CXIP_CAPS (CXIP_DOM_CAPS | CXIP_EP_CAPS) | |||
#define CXIP_CAPS (CXIP_DOM_CAPS | CXIP_EP_CAPS | FI_PEER) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FI_PEER
is a domain capability, move into CXIP_DOM_CAPS
Also move this patch after the patches were you actually add support for peer CQ, cntr, and srx
struct cxip_peer_cq_cb { | ||
int (*cq_comp)(struct util_cq *cq, void *context, | ||
uint64_t flags, size_t len, void *buf, uint64_t data, | ||
uint64_t tag); | ||
int (*cq_comp_src)(struct util_cq *cq, void *context, | ||
uint64_t flags, size_t len, void *buf, uint64_t data, | ||
uint64_t tag, fi_addr_t addr); | ||
int (*cq_err)(struct util_cq *cq, | ||
const struct fi_cq_err_entry *err_entry); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this optimization is handled already by the util_cq code. In cxi you should just always be able to go through ofi_peer_cq_write
and it should work for both the peer and owner case. In other words, I think you should be able to drop most of this patch and just keep the replacement of
ofi_cq_write -> ofi_peer_cq_write(...FI_ADDR_NOTAVAIL)
ofi_cq_write_src -> ofi_peer_cq_write(...src)
ofi_cq_write_error -> ofi_peer_cq_write_error
prov/cxi/src/cxip_dom.c
Outdated
.start_tag = cxip_unexp_start, | ||
.discard_msg = cxip_discard, | ||
.discard_tag = cxip_discard, | ||
.addr_match = cxip_addr_match, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we decided to drop this change? Is it still needed?
@@ -918,6 +918,9 @@ int cxip_ep_bind(struct fid *fid, struct fid *bfid, uint64_t flags) | |||
|
|||
break; | |||
|
|||
case FI_CLASS_SRX_CTX: | |||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing indent here
Is there really nothing to be done here? Can we make sure that the caller/owner is using the correct API flow? Ref counts? Something?
This implementation always grabs the srx from the domain and pretty much ignores the bind call/association with the EP. Do we want to allow that flexibility or enforce the bind?
dom->owner_srx->peer_ops = &cxip_srx_peer_ops; | ||
return 0; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To match the definition of peer resources, we've defined it such that the peer needs to return a peer-owned fid to the owner for resource management.
@@ -402,6 +402,11 @@ struct cxip_rxc *cxip_rxc_calloc(struct cxip_ep_obj *ep_obj, void *context) | |||
{ | |||
struct cxip_rxc *rxc = NULL; | |||
|
|||
/* update the rx_match_mode in case it has changed since |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain the need for this change more and describe the reason the mode might have changed?
struct fi_peer_rx_entry *entry = calloc(sizeof(*entry), 1); | ||
|
||
if (!entry) | ||
return -FI_ENOMEM; | ||
|
||
ux_mb.raw = ux_send->put_ev.tgt_long.match_bits; | ||
entry->peer_context = ux_send; | ||
if (ux_mb.tagged) | ||
owner_srx->owner_ops->queue_tag(entry); | ||
else | ||
owner_srx->owner_ops->queue_msg(entry); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't looked to closely at whatever path would trigger this but this looks wrong. queue_msg should never get called with an entry that was not allocated by the owner. It should only get called with the entry returned from the get_msg call. One big reason is that the entry has an srx field to get back to the owner resources and you haven't initialized that field in the entry so it would probably segfault in this path
struct dlist_entry *tmp; | ||
int ret; | ||
|
||
dlist_foreach_container_safe(&rxc->sw_pending_ux_list, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see anything ever getting added to this list in the peer path and that add_ux call looks problematic anyway. What's the intent of this list?
if (rxc->base.state == RXC_PENDING_PTLTE_SOFTWARE_MANAGED) | ||
cxip_post_ux_onload_sw(rxc); | ||
else | ||
cxip_post_ux_onload_fc(rxc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it ok to drop these calls? I don't see this done anywhere else
@@ -177,7 +177,7 @@ | |||
#define CXIP_MINOR_VERSION 1 | |||
#define CXIP_PROV_VERSION FI_VERSION(CXIP_MAJOR_VERSION, \ | |||
CXIP_MINOR_VERSION) | |||
#define CXIP_FI_VERSION FI_VERSION(1, 21) | |||
#define CXIP_FI_VERSION FI_VERSION(2, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a separate change
This PR introduces the LINKx (lnx) provider.
It currently supports linking only two providers, first of which is the shm provider.
It has been tested with SHM+CXI provider and SHM+RXM provider.
The data structures are designed to allow for support of a multi-rail feature, where
multiple providers and multiple endpoints can be linked together.
Future work items, which I plan to tackle after the initial PR lands: