Issue establishing connection between RDM endpoints using TCP provider #9052

shefty · 2023-06-16T16:29:34Z

Discussed in #9051

Originally posted by mason1504 June 16, 2023
I have an issue with Libfabric v1.18.0 using an FI_EP_RDM endpoint type with the tcp provider.

When I start two instances of my application, each side creates an RDM endpoint using the tcp provider and adds the other side's address to its address vector.

Both applications then start sending and receiving data on these endpoints, each calling fi_send in a loop. Sometimes I hit what looks like a race condition: Libfabric appears to attempt to establish a tcp connection from both hosts at the same time.

With Libfabric debug logging enabled, I see the following repeated, and a connection is never established between the hosts:

Host 1:
libfabric:31984:1686915992::tcp:ep_ctrl:xnet_handle_event_list():519 event FI_CONNREQ
libfabric:31984:1686915992::tcp:ep_ctrl:xnet_process_connreq():422 connreq for 0000024171D3AAC0
libfabric:31984:1686915992::tcp:ep_ctrl:xnet_process_connreq():463 simultaneous, reject peer

Host 2:
libfabric:15804:1686915992::tcp:ep_ctrl:xnet_handle_cm_msg():112 Connection refused from remote
libfabric:15804:1686915992::tcp:ep_ctrl:xnet_req_done():196 Failed to receive connect response
libfabric:15804:1686915992::tcp:ep_ctrl:xnet_handle_event_list():519 event FI_SHUTDOWN

Note that fi_send returns -FI_EAGAIN and keeps returning it on every subsequent call.

Note also that this is a race condition: sometimes the connections are established without issue and data is transferred between the hosts fine.

Has anyone else experienced this? I have some ideas on how to resolve it, but any advice on how best to fix it would be appreciated.

Thanks.

shefty · 2023-07-07T19:54:30Z

The connection issue was the result of mixing msg and rdm endpoints on the same domain, which the provider does not support (a locking restriction). An update has been merged upstream that returns a failure earlier, at endpoint creation, if the endpoint type doesn't match that of the domain.
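
To make the restriction above concrete, here is a toy model in Python. It is not libfabric code and does not reflect how the tcp provider is implemented: `Domain`, `EpType`, `EndpointTypeMismatch`, and `open_endpoints` are hypothetical names invented for illustration. It assumes a domain is bound to a single endpoint type, rejects a mismatched endpoint at creation time, and shows one possible application-side arrangement (a separate domain per endpoint type).

```python
from enum import Enum


class EpType(Enum):
    # Names mirror the libfabric endpoint types mentioned in the thread.
    MSG = "FI_EP_MSG"
    RDM = "FI_EP_RDM"


class EndpointTypeMismatch(Exception):
    """Raised when an endpoint's type differs from its domain's type."""


class Domain:
    """Hypothetical stand-in for a domain bound to one endpoint type."""

    def __init__(self, ep_type):
        self.ep_type = ep_type
        self.endpoints = []

    def create_endpoint(self, ep_type):
        # Fail at creation time instead of later, during connection setup.
        if ep_type is not self.ep_type:
            raise EndpointTypeMismatch(
                f"domain is {self.ep_type.value}, endpoint is {ep_type.value}"
            )
        ep = {"type": ep_type, "domain": self}
        self.endpoints.append(ep)
        return ep


def open_endpoints(requested):
    """Open one toy domain per endpoint type so types never share a domain."""
    domains = {}
    eps = []
    for ep_type in requested:
        if ep_type not in domains:
            domains[ep_type] = Domain(ep_type)
        eps.append(domains[ep_type].create_endpoint(ep_type))
    return eps
```

In this model, asking an RDM domain for a msg endpoint raises `EndpointTypeMismatch` immediately, while `open_endpoints([EpType.RDM, EpType.MSG])` places the two endpoints on different domains.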

shefty closed this as completed Jul 7, 2023
