prov/lnx: LINKx (lnx) provider #10437

amirshehataornl · 2024-10-04T21:17:23Z

This PR introduces the LINKx (lnx) provider.

It currently supports linking only two providers, first of which is the shm provider.
It has been tested with SHM+CXI provider and SHM+RXM provider.
The data structures are designed to allow for support of a multi-rail feature, where
multiple providers and multiple endpoints can be linked together.

Future work items, which I plan to tackle after the initial PR lands:

Support for all the libfabric APIs. Currently it supports the tagged APIs and partial support for rma and atomics.
General support for address caching in lnx
Enabling HW offload support.
Multiple provider linkage
Multi-rail feature

Add the FI_PEER capability bit to the CXI provider fi_info Signed-off-by: Amir Shehata <shehataa@ornl.gov>

On cq_open, check the FI_PEER_IMPORT, if set, set all internal cq operation to be enosys, with the exception to the read callback. The read callback is overloaded to operate as a progress callback function. Invoking the read callback will progress the enpoints linked to this CQ. Keep track of the fid_peer_cq structure passed in. If the FI_PEER_IMPORT flag is set, then set the callbacks in cxip_cq structure which handle writing to the peer_cq, otherwise set them to the ones which write to the util_cq. A provider needs to call a different set of functions to insert completion events into an imported CQ vs an internal CQ. These set of callback definition standardize a way to assign a different function to a CQ object, which can then be called to insert into the CQ. For example: struct prov_cq { struct util_cq *util_cq; struct fid_peer_cq *peer_cq; ofi_peer_cq_cb cq_cb; }; When a provider opens a CQ it can: if (attr->flags & FI_PEER_IMPORT) { prov_cq->cq_cb.cq_comp = prov_peer_cq_comp; } else { prov_cq->cq_cb.cq_comp = prov_cq_comp; } Collect the peer CQ callbacks in one structure for use in CXI. Signed-off-by: Amir Shehata <shehataa@ornl.gov>

Restructure the code to allow for posting on the owner provider's shared receive queues. Do not do a reverse lookup on the AV table to get the fi_addr_t, instead register an address matching callback with the owner. The owner can then call the address matching callback to match an fi_addr_t to the source address in the message received. This is more efficient as the peer lookup can be an O(1) operation; AV[fi_addr_t]. The peer's CXI address can be compared with the CXI address in the message received. Signed-off-by: Amir Shehata <shehataa@ornl.gov>

Handle the case where a message from a peer arrives before the peer is inserted. Implement the callflow to support this scenario. Signed-off-by: Amir Shehata <shehataa@ornl.gov>

The LINKx (lnx) provider offers a framework by which multiple providers can be linked together and presented as one provider to the application. This abstracts away the details of the traffic providers from the application. This iteration of the provider allows linking only two providers, shm and another provider, ex; CXI or RXM. The composite providers which are linked together need to support the peer infrastructure. In order to use the lnx provider the user needs to: export FI_LNX_PROV_LINKS="shm+<inter-node provider>" ex: export FI_LNX_PROV_LINKS="shm+cxi" or export FI_LNX_PROV_LINKS="shm+tcp;ofi_rxm" Signed-off-by: Amir Shehata <shehataa@ornl.gov>

aingerson

Haven't looked through the link patch yet, just took a look at the first few and had some questions. I'll do the link patch in a separate review

aingerson · 2024-10-09T22:42:10Z

prov/cxi/include/cxip.h

@@ -138,7 +138,7 @@
 	 FI_REMOTE_COMM | FI_RMA_EVENT | FI_MULTI_RECV | FI_FENCE | FI_TRIGGER)
 #define CXIP_EP_CAPS (CXIP_EP_PRI_CAPS | CXIP_EP_SEC_CAPS)
 #define CXIP_DOM_CAPS (FI_LOCAL_COMM | FI_REMOTE_COMM | FI_AV_USER_ID)
-#define CXIP_CAPS (CXIP_DOM_CAPS | CXIP_EP_CAPS)
+#define CXIP_CAPS (CXIP_DOM_CAPS | CXIP_EP_CAPS | FI_PEER)


FI_PEER is a domain capability, move into CXIP_DOM_CAPS
Also move this patch after the patches were you actually add support for peer CQ, cntr, and srx

aingerson · 2024-10-09T22:52:00Z

prov/cxi/include/cxip.h

+struct cxip_peer_cq_cb {
+	int (*cq_comp)(struct util_cq *cq, void *context,
+		       uint64_t flags, size_t len, void *buf, uint64_t data,
+		       uint64_t tag);
+	int (*cq_comp_src)(struct util_cq *cq, void *context,
+			   uint64_t flags, size_t len, void *buf, uint64_t data,
+			   uint64_t tag, fi_addr_t addr);
+	int (*cq_err)(struct util_cq *cq,
+		      const struct fi_cq_err_entry *err_entry);
+};


I think this optimization is handled already by the util_cq code. In cxi you should just always be able to go through ofi_peer_cq_write and it should work for both the peer and owner case. In other words, I think you should be able to drop most of this patch and just keep the replacement of
ofi_cq_write -> ofi_peer_cq_write(...FI_ADDR_NOTAVAIL)
ofi_cq_write_src -> ofi_peer_cq_write(...src)
ofi_cq_write_error -> ofi_peer_cq_write_error

aingerson · 2024-10-09T22:54:12Z

prov/cxi/src/cxip_dom.c

+	.start_tag = cxip_unexp_start,
+	.discard_msg = cxip_discard,
+	.discard_tag = cxip_discard,
+	.addr_match = cxip_addr_match,


I thought we decided to drop this change? Is it still needed?

aingerson · 2024-10-09T23:02:40Z

prov/cxi/src/cxip_ep.c

@@ -918,6 +918,9 @@ int cxip_ep_bind(struct fid *fid, struct fid *bfid, uint64_t flags)

 		break;

+	case FI_CLASS_SRX_CTX:
+	break;


Missing indent here
Is there really nothing to be done here? Can we make sure that the caller/owner is using the correct API flow? Ref counts? Something?
This implementation always grabs the srx from the domain and pretty much ignores the bind call/association with the EP. Do we want to allow that flexibility or enforce the bind?

aingerson · 2024-10-09T23:06:42Z

prov/cxi/src/cxip_dom.c

+		dom->owner_srx->peer_ops = &cxip_srx_peer_ops;
+		return 0;
+	}
+


To match the definition of peer resources, we've defined it such that the peer needs to return a peer-owned fid to the owner for resource management.

aingerson · 2024-10-09T23:33:00Z

prov/cxi/src/cxip_rxc.c

@@ -402,6 +402,11 @@ struct cxip_rxc *cxip_rxc_calloc(struct cxip_ep_obj *ep_obj, void *context)
 {
 	struct cxip_rxc *rxc = NULL;

+	/* update the rx_match_mode in case it has changed since


Can you explain the need for this change more and describe the reason the mode might have changed?

aingerson · 2024-10-09T23:39:53Z

prov/cxi/src/cxip_msg_hpc.c

+	struct fi_peer_rx_entry *entry = calloc(sizeof(*entry), 1);
+
+	if (!entry)
+		return -FI_ENOMEM;
+
+	ux_mb.raw = ux_send->put_ev.tgt_long.match_bits;
+	entry->peer_context = ux_send;
+	if (ux_mb.tagged)
+		owner_srx->owner_ops->queue_tag(entry);
+	else
+		owner_srx->owner_ops->queue_msg(entry);


I haven't looked to closely at whatever path would trigger this but this looks wrong. queue_msg should never get called with an entry that was not allocated by the owner. It should only get called with the entry returned from the get_msg call. One big reason is that the entry has an srx field to get back to the owner resources and you haven't initialized that field in the entry so it would probably segfault in this path

aingerson · 2024-10-09T23:46:27Z

prov/cxi/src/cxip_msg_hpc.c

+		struct dlist_entry *tmp;
+		int ret;
+
+		dlist_foreach_container_safe(&rxc->sw_pending_ux_list,


I don't see anything ever getting added to this list in the peer path and that add_ux call looks problematic anyway. What's the intent of this list?

aingerson · 2024-10-09T23:47:48Z

prov/cxi/src/cxip_msg_hpc.c

-	if (rxc->base.state == RXC_PENDING_PTLTE_SOFTWARE_MANAGED)
-		cxip_post_ux_onload_sw(rxc);
-	else
-		cxip_post_ux_onload_fc(rxc);


Is it ok to drop these calls? I don't see this done anywhere else

aingerson · 2024-10-10T14:12:29Z

prov/cxi/include/cxip.h

@@ -177,7 +177,7 @@
 #define CXIP_MINOR_VERSION		1
 #define CXIP_PROV_VERSION		FI_VERSION(CXIP_MAJOR_VERSION, \
 						   CXIP_MINOR_VERSION)
-#define CXIP_FI_VERSION			FI_VERSION(1, 21)
+#define CXIP_FI_VERSION			FI_VERSION(2, 0)


This seems like a separate change

amirshehataornl added 5 commits October 4, 2024 17:05

prov/cxi: Add FI_PEER capability bit

6ddb318

Add the FI_PEER capability bit to the CXI provider fi_info Signed-off-by: Amir Shehata <shehataa@ornl.gov>

prov/cxi: handle late peer insertion

e9db7c5

Handle the case where a message from a peer arrives before the peer is inserted. Implement the callflow to support this scenario. Signed-off-by: Amir Shehata <shehataa@ornl.gov>

aingerson reviewed Oct 10, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

prov/lnx: LINKx (lnx) provider #10437

prov/lnx: LINKx (lnx) provider #10437

amirshehataornl commented Oct 4, 2024

aingerson left a comment

aingerson Oct 9, 2024

aingerson Oct 9, 2024

aingerson Oct 9, 2024

aingerson Oct 9, 2024

aingerson Oct 9, 2024

aingerson Oct 9, 2024

aingerson Oct 9, 2024

aingerson Oct 9, 2024

aingerson Oct 9, 2024

aingerson Oct 10, 2024

prov/lnx: LINKx (lnx) provider #10437

Are you sure you want to change the base?

prov/lnx: LINKx (lnx) provider #10437

Conversation

amirshehataornl commented Oct 4, 2024

aingerson left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment