Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prov/lnx: LINKx (lnx) provider #10437

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

amirshehataornl
Copy link
Contributor

This PR introduces the LINKx (lnx) provider.

It currently supports linking only two providers, first of which is the shm provider.
It has been tested with SHM+CXI provider and SHM+RXM provider.
The data structures are designed to allow for support of a multi-rail feature, where
multiple providers and multiple endpoints can be linked together.

Future work items, which I plan to tackle after the initial PR lands:

  1. Support for all the libfabric APIs. Currently it supports the tagged APIs and partial support for rma and atomics.
  2. General support for address caching in lnx
  3. Enabling HW offload support.
  4. Multiple provider linkage
  5. Multi-rail feature

Add the FI_PEER capability bit to the CXI provider fi_info

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
On cq_open, check the FI_PEER_IMPORT, if set, set all internal cq operation
to be enosys, with the exception to the read callback.

The read callback is overloaded to operate as a progress callback
function. Invoking the read callback will progress the enpoints linked to
this CQ.

Keep track of the fid_peer_cq structure passed in.

If the FI_PEER_IMPORT flag is set, then set the callbacks in cxip_cq structure
which handle writing to the peer_cq, otherwise set them to the ones which
write to the util_cq.

A provider needs to call a different set of functions to insert
completion events into an imported CQ vs an internal CQ.

These set of callback definition standardize a way to assign a different
function to a CQ object, which can then be called to insert into the CQ.

For example:

	struct prov_cq {
		struct util_cq *util_cq;
		struct fid_peer_cq *peer_cq;
		ofi_peer_cq_cb cq_cb;
	};

When a provider opens a CQ it can:

	if (attr->flags & FI_PEER_IMPORT) {
		prov_cq->cq_cb.cq_comp = prov_peer_cq_comp;
	} else {
		prov_cq->cq_cb.cq_comp = prov_cq_comp;
	}

Collect the peer CQ callbacks in one structure for use in CXI.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Restructure the code to allow for posting on the owner provider's shared
receive queues.

Do not do a reverse lookup on the AV table to get the fi_addr_t, instead
register an address matching callback with the owner. The owner can then
call the address matching callback to match an fi_addr_t to the source
address in the message received.

This is more efficient as the peer lookup can be an O(1) operation;
AV[fi_addr_t]. The peer's CXI address can be compared with the CXI address
in the message received.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Handle the case where a message from a peer arrives before the peer is
inserted.

Implement the callflow to support this scenario.

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
The LINKx (lnx) provider offers a framework by which multiple providers
can be linked together and presented as one provider to the application.
This abstracts away the details of the traffic providers from the
application. This iteration of the provider allows linking only two
providers, shm and another provider, ex; CXI or RXM. The composite
providers which are linked together need to support the peer
infrastructure.

In order to use the lnx provider the user needs to:

export FI_LNX_PROV_LINKS="shm+<inter-node provider>"

ex:

export FI_LNX_PROV_LINKS="shm+cxi"
  or
export FI_LNX_PROV_LINKS="shm+tcp;ofi_rxm"

Signed-off-by: Amir Shehata <shehataa@ornl.gov>
Copy link
Contributor

@aingerson aingerson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't looked through the link patch yet, just took a look at the first few and had some questions. I'll do the link patch in a separate review

@@ -138,7 +138,7 @@
FI_REMOTE_COMM | FI_RMA_EVENT | FI_MULTI_RECV | FI_FENCE | FI_TRIGGER)
#define CXIP_EP_CAPS (CXIP_EP_PRI_CAPS | CXIP_EP_SEC_CAPS)
#define CXIP_DOM_CAPS (FI_LOCAL_COMM | FI_REMOTE_COMM | FI_AV_USER_ID)
#define CXIP_CAPS (CXIP_DOM_CAPS | CXIP_EP_CAPS)
#define CXIP_CAPS (CXIP_DOM_CAPS | CXIP_EP_CAPS | FI_PEER)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FI_PEER is a domain capability, move into CXIP_DOM_CAPS
Also move this patch after the patches were you actually add support for peer CQ, cntr, and srx

Comment on lines +1390 to +1399
struct cxip_peer_cq_cb {
int (*cq_comp)(struct util_cq *cq, void *context,
uint64_t flags, size_t len, void *buf, uint64_t data,
uint64_t tag);
int (*cq_comp_src)(struct util_cq *cq, void *context,
uint64_t flags, size_t len, void *buf, uint64_t data,
uint64_t tag, fi_addr_t addr);
int (*cq_err)(struct util_cq *cq,
const struct fi_cq_err_entry *err_entry);
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this optimization is handled already by the util_cq code. In cxi you should just always be able to go through ofi_peer_cq_write and it should work for both the peer and owner case. In other words, I think you should be able to drop most of this patch and just keep the replacement of
ofi_cq_write -> ofi_peer_cq_write(...FI_ADDR_NOTAVAIL)
ofi_cq_write_src -> ofi_peer_cq_write(...src)
ofi_cq_write_error -> ofi_peer_cq_write_error

.start_tag = cxip_unexp_start,
.discard_msg = cxip_discard,
.discard_tag = cxip_discard,
.addr_match = cxip_addr_match,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we decided to drop this change? Is it still needed?

@@ -918,6 +918,9 @@ int cxip_ep_bind(struct fid *fid, struct fid *bfid, uint64_t flags)

break;

case FI_CLASS_SRX_CTX:
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing indent here
Is there really nothing to be done here? Can we make sure that the caller/owner is using the correct API flow? Ref counts? Something?
This implementation always grabs the srx from the domain and pretty much ignores the bind call/association with the EP. Do we want to allow that flexibility or enforce the bind?

dom->owner_srx->peer_ops = &cxip_srx_peer_ops;
return 0;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To match the definition of peer resources, we've defined it such that the peer needs to return a peer-owned fid to the owner for resource management.

@@ -402,6 +402,11 @@ struct cxip_rxc *cxip_rxc_calloc(struct cxip_ep_obj *ep_obj, void *context)
{
struct cxip_rxc *rxc = NULL;

/* update the rx_match_mode in case it has changed since
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the need for this change more and describe the reason the mode might have changed?

Comment on lines +2037 to +2047
struct fi_peer_rx_entry *entry = calloc(sizeof(*entry), 1);

if (!entry)
return -FI_ENOMEM;

ux_mb.raw = ux_send->put_ev.tgt_long.match_bits;
entry->peer_context = ux_send;
if (ux_mb.tagged)
owner_srx->owner_ops->queue_tag(entry);
else
owner_srx->owner_ops->queue_msg(entry);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked to closely at whatever path would trigger this but this looks wrong. queue_msg should never get called with an entry that was not allocated by the owner. It should only get called with the entry returned from the get_msg call. One big reason is that the entry has an srx field to get back to the owner resources and you haven't initialized that field in the entry so it would probably segfault in this path

struct dlist_entry *tmp;
int ret;

dlist_foreach_container_safe(&rxc->sw_pending_ux_list,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see anything ever getting added to this list in the peer path and that add_ux call looks problematic anyway. What's the intent of this list?

Comment on lines -2064 to -2067
if (rxc->base.state == RXC_PENDING_PTLTE_SOFTWARE_MANAGED)
cxip_post_ux_onload_sw(rxc);
else
cxip_post_ux_onload_fc(rxc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok to drop these calls? I don't see this done anywhere else

@@ -177,7 +177,7 @@
#define CXIP_MINOR_VERSION 1
#define CXIP_PROV_VERSION FI_VERSION(CXIP_MAJOR_VERSION, \
CXIP_MINOR_VERSION)
#define CXIP_FI_VERSION FI_VERSION(1, 21)
#define CXIP_FI_VERSION FI_VERSION(2, 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a separate change

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants