The connection issue was a result of mixing msg and rdm endpoints on the same domain, which isn't supported by the provider (locking restriction). An update has been merged upstream to return a failure earlier, at endpoint creation, if the endpoint type doesn't match that of the domain.
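Until a build with that upstream check is in use, an application could guard against the mix-up itself. The following is only a minimal sketch of such a guard, not the provider's code: `domain_guard`, `domain_guard_check`, and the `SKETCH_EP_*` values are hypothetical names standing in for whatever the application keeps next to each opened domain, and returning `-EINVAL` on a mismatch is an assumption, not the error the upstream change returns.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Local stand-ins for libfabric's FI_EP_MSG / FI_EP_RDM so this sketch
 * builds without libfabric headers. */
enum sketch_ep_type {
    SKETCH_EP_UNSPEC = 0,
    SKETCH_EP_MSG,
    SKETCH_EP_RDM
};

/* Hypothetical application-side record kept alongside each opened domain. */
struct domain_guard {
    /* Type of the first endpoint created on the domain, or UNSPEC if none. */
    enum sketch_ep_type ep_type;
};

/* Call before creating an endpoint on the domain.
 * Returns 0 if the requested type is consistent with earlier endpoints,
 * or -EINVAL (an assumed error code) if it would mix types. */
int domain_guard_check(struct domain_guard *guard,
                       enum sketch_ep_type requested)
{
    assert(guard != NULL);

    if (requested == SKETCH_EP_UNSPEC)
        return -EINVAL;

    if (guard->ep_type == SKETCH_EP_UNSPEC) {
        /* First endpoint on this domain: remember its type. */
        guard->ep_type = requested;
        return 0;
    }

    return guard->ep_type == requested ? 0 : -EINVAL;
}
```

With one guard per domain, the first endpoint fixes the type and any later attempt to create an endpoint of the other type is rejected before it reaches the provider.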
Discussed in #9051
Originally posted by mason1504 June 16, 2023
I have an issue with Libfabric v1.18.0 using an FI_EP_RDM endpoint type with the tcp provider.
The issue I see is that when I run up two instances of my application, both sides create an RDM endpoint using the tcp provider and add the corresponding address to the address vector.
Both applications then start sending and receiving data on these endpoints, with fi_send being called in a loop in both. Sometimes I see what looks like a race condition, as Libfabric seems to be attempting to establish a tcp connection from both hosts at the same time.
With Libfabric debug logging enabled, I see the following repeated, and a connection is never actually established between the hosts:
Host 1:
libfabric:31984:1686915992::tcp:ep_ctrl:xnet_handle_event_list():519 event FI_CONNREQ
libfabric:31984:1686915992::tcp:ep_ctrl:xnet_process_connreq():422 connreq for 0000024171D3AAC0
libfabric:31984:1686915992::tcp:ep_ctrl:xnet_process_connreq():463 simultaneous, reject peer
Host 2:
libfabric:15804:1686915992::tcp:ep_ctrl:xnet_handle_cm_msg():112 Connection refused from remote
libfabric:15804:1686915992::tcp:ep_ctrl:xnet_req_done():196 Failed to receive connect response
libfabric:15804:1686915992::tcp:ep_ctrl:xnet_handle_event_list():519 event FI_SHUTDOWN
Note that the call to fi_send returns -FI_EAGAIN and remains in that state.
Note also that this is a race condition: sometimes the connections establish without issue and data is transferred between hosts fine.
I'm just wondering if anyone else has experienced this. I have some ideas on how to resolve it, but any advice on how best to fix this would be appreciated.
Thanks.
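For the fi_send loop described above, which stays at -FI_EAGAIN indefinitely, one way to make the stall visible is to bound the retries and drive progress between attempts. This is just a rough sketch under assumptions, not code from the application in question: `send_with_retry`, the callback types, and the `fake_*` helpers are hypothetical, `SKETCH_EAGAIN` stands in for -FI_EAGAIN so it builds without libfabric headers, and returning `-ETIMEDOUT` when the budget runs out is an arbitrary choice. In a real application `try_send` would wrap fi_send and `make_progress` would wrap a completion queue read such as fi_cq_read.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback types: try_send would wrap fi_send(), and
 * make_progress would wrap a completion-queue read such as fi_cq_read(). */
typedef int (*try_send_fn)(void *ctx);
typedef void (*progress_fn)(void *ctx);

/* Stand-in for -FI_EAGAIN so the sketch builds without libfabric headers. */
#define SKETCH_EAGAIN (-EAGAIN)
/* Assumed error returned when the retry budget is exhausted. */
#define SKETCH_ETIMEDOUT (-ETIMEDOUT)

/* Retry try_send while it reports "try again", calling make_progress
 * between attempts, and give up after max_attempts tries. */
int send_with_retry(try_send_fn try_send, progress_fn make_progress,
                    void *ctx, unsigned max_attempts)
{
    assert(try_send != NULL);

    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        int ret = try_send(ctx);
        if (ret != SKETCH_EAGAIN)
            return ret;              /* success or a hard error */
        if (make_progress != NULL)
            make_progress(ctx);      /* let the provider advance */
    }
    return SKETCH_ETIMEDOUT;         /* still "try again" after the budget */
}

/* Fake peer used only to exercise the helper without a network. */
struct fake_peer {
    int busy_calls;      /* how many more sends will report "try again" */
    int send_calls;      /* number of send attempts observed */
    int progress_calls;  /* number of progress calls observed */
};

static int fake_send(void *ctx)
{
    struct fake_peer *peer = ctx;
    peer->send_calls++;
    if (peer->busy_calls > 0) {
        peer->busy_calls--;
        return SKETCH_EAGAIN;
    }
    return 0;
}

static void fake_progress(void *ctx)
{
    struct fake_peer *peer = ctx;
    peer->progress_calls++;
}
```

A bounded loop like this doesn't fix the connection problem, but it turns a silent spin into an error the application can log and act on.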