Issue establishing connection between RDM endpoints using TCP provider #9052

Closed
shefty opened this issue Jun 16, 2023 Discussed in #9051 · 1 comment

Comments

@shefty
Member

shefty commented Jun 16, 2023

Discussed in #9051

Originally posted by mason1504 June 16, 2023
I have an issue with Libfabric v1.18.0 using an FI_EP_RDM endpoint type with the tcp provider.

The issue I see occurs when I start two instances of my application. Both sides create an RDM endpoint using the tcp provider and add the corresponding address to the address vector.

Both applications then start sending and receiving data on these endpoints, with fi_send called in a loop on each side. Sometimes I hit what looks like a race condition: Libfabric appears to attempt to establish a tcp connection from both hosts at the same time.

With Libfabric debug logging enabled, I see the following repeated, and a connection is never established between the hosts:

Host 1:
libfabric:31984:1686915992::tcp:ep_ctrl:xnet_handle_event_list():519 event FI_CONNREQ
libfabric:31984:1686915992::tcp:ep_ctrl:xnet_process_connreq():422 connreq for 0000024171D3AAC0
libfabric:31984:1686915992::tcp:ep_ctrl:xnet_process_connreq():463 simultaneous, reject peer

Host 2:
libfabric:15804:1686915992::tcp:ep_ctrl:xnet_handle_cm_msg():112 Connection refused from remote
libfabric:15804:1686915992::tcp:ep_ctrl:xnet_req_done():196 Failed to receive connect response
libfabric:15804:1686915992::tcp:ep_ctrl:xnet_handle_event_list():519 event FI_SHUTDOWN

Note that the call to fi_send returns -FI_EAGAIN and remains in that state.
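
To make the stuck state easier to spot, the send loop can be bounded so that it reports a failure instead of spinning forever. The following is a minimal sketch, not libfabric code: `send_fn`, `progress_fn`, `send_with_retry_limit` and `stub_send` are hypothetical names, the callbacks stand in for the real `fi_send` call and for whatever the application does to drive progress, and it assumes the "try again" result is reported as `-EAGAIN`.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a non-blocking send such as fi_send():
 * returns 0 on success or a negative errno-style value on failure. */
typedef int (*send_fn)(void *ctx);

/* Hypothetical stand-in for driving progress between attempts
 * (for example, polling a completion queue). May be NULL. */
typedef void (*progress_fn)(void *ctx);

/* Retry the send while it reports -EAGAIN, but give up after
 * max_attempts so a connection that never comes up is surfaced
 * to the caller as -ETIMEDOUT instead of an endless loop. */
int send_with_retry_limit(send_fn try_send, progress_fn progress,
                          void *ctx, int max_attempts)
{
    for (int i = 0; i < max_attempts; i++) {
        int ret = try_send(ctx);
        if (ret != -EAGAIN)
            return ret;          /* success or a hard error */
        if (progress)
            progress(ctx);       /* let the stack make progress */
    }
    return -ETIMEDOUT;
}

/* Demo stub: reports -EAGAIN until the counter in ctx reaches zero. */
int stub_send(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return -EAGAIN;
    }
    return 0;
}
```

Calling `send_with_retry_limit(stub_send, NULL, &n, 10)` with `n = 3` succeeds on the fourth attempt, while a large `n` exhausts the budget and returns `-ETIMEDOUT`. This only helps detect the condition; it does not address the cause.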

Note also that this is a race condition: sometimes the connections are established without issue and data is transferred between the hosts fine.

I'm wondering whether anyone else has experienced this. I have some ideas on how to resolve it, but any advice on how best to fix this would be appreciated.

Thanks.

@shefty
Member Author

shefty commented Jul 7, 2023

The connection issue was the result of mixing msg and rdm endpoints on the same domain, which the provider doesn't support (a locking restriction). An update has been merged upstream so that endpoint creation fails early if the endpoint type doesn't match the one for the domain.
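
As a rough illustration of that kind of early check, here is a self-contained model. It is not the provider's actual code: `model_ep_type`, `model_domain` and `model_ep_create` are hypothetical names, and the choice of `-EINVAL` as the error value is an assumption.

```c
#include <errno.h>

/* Hypothetical model of the two endpoint types discussed above. */
enum model_ep_type {
    MODEL_EP_MSG,
    MODEL_EP_RDM
};

/* Hypothetical domain that is tied to a single endpoint type. */
struct model_domain {
    enum model_ep_type ep_type;
};

/* Refuse to create an endpoint whose type differs from the domain's,
 * so the mismatch shows up at creation time rather than later as a
 * connection that never completes. Returns 0 on success, or -EINVAL
 * (an assumed error code) on a type mismatch. */
int model_ep_create(const struct model_domain *dom,
                    enum model_ep_type requested)
{
    if (dom->ep_type != requested)
        return -EINVAL;
    return 0;
}
```

With a domain set up for `MODEL_EP_RDM`, requesting an RDM endpoint returns 0 and requesting a msg endpoint returns `-EINVAL`. On the application side, the practical takeaway is to use a separate domain for each endpoint type.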

@shefty shefty closed this as completed Jul 7, 2023