fabtests: Add extra configure options for efa and lpp tests #10419

Merged 2 commits, Oct 2, 2024

Conversation

zachdworkin
Contributor

Add --enable-efa=yes/no and --enable-lpp=yes/no configure options. Both default to yes, so these providers' tests can be turned off when they do not need to be built.
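
For readers less familiar with Autoconf, switches like these are usually declared with AC_ARG_ENABLE. The fragment below is a minimal sketch of that pattern and is not the code merged in this PR; the conditional names FABTESTS_ENABLE_EFA and FABTESTS_ENABLE_LPP are hypothetical.

```m4
dnl Sketch only; not the merged fabtests configure.ac.
dnl FABTESTS_ENABLE_EFA / FABTESTS_ENABLE_LPP are hypothetical names.
AC_ARG_ENABLE([efa],
    [AS_HELP_STRING([--enable-efa],
        [Build the EFA tests @<:@default=yes@:>@])],
    [],
    [enable_efa=yes])
AC_ARG_ENABLE([lpp],
    [AS_HELP_STRING([--enable-lpp],
        [Build the LPP tests @<:@default=yes@:>@])],
    [],
    [enable_lpp=yes])

dnl Anything other than an explicit "no" keeps the tests enabled.
AM_CONDITIONAL([FABTESTS_ENABLE_EFA], [test "x$enable_efa" != xno])
AM_CONDITIONAL([FABTESTS_ENABLE_LPP], [test "x$enable_lpp" != xno])
```

Under this sketch, running configure with --enable-efa=no --enable-lpp=no (or the equivalent --disable-efa --disable-lpp, which Autoconf accepts automatically) would skip both sets of tests.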

@shijin-aws
Contributor

shijin-aws commented Sep 27, 2024

@zachdworkin @j-xiong Sigh, I think 6901162 should be a better fix for this issue for the EFA provider. Sorry that I didn't push in that direction earlier.

@shijin-aws
Contributor

I believe you may have trouble with -lefa?

@zachdworkin
Contributor Author

#10411 broke Intel CI because our nodes don't have an updated rdma-core package with the EFADV_DEVICE_ATTR_CAPS_RDMA_WRITE flag. I need some way to completely disable efa in fabtests. Can we have both options? One would conditionally disable it if the header is missing, and the other would force-disable it.

@shijin-aws
Contributor

@zachdworkin sure we can do both

@aingerson
Contributor

@shijin-aws I think having an option to disable it is fine, but I also think the efa config should fully check for that support by default and be smarter, as your fix does, so that our CI shouldn't need the disabling option. I tried your proposed fix and it does not fix the issue for me.

@shijin-aws
Contributor

@aingerson it needs one more diff, let me open a PR and let you try

@zachdworkin
Contributor Author

@aingerson does your efadv.h file have the added flag? Even if the header is detected it might fail to compile because the package is too old

@shijin-aws
Contributor

@zachdworkin I can add AC_CHECK_DECL for EFADV_DEVICE_ATTR_CAPS_RDMA_WRITE
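
A check of this kind could look roughly like the following. This is a sketch under stated assumptions, not the content of the follow-up PR: the include path infiniband/efadv.h and the shell variable have_efadv_rdma_write are assumptions made for illustration.

```m4
dnl Sketch only. have_efadv_rdma_write is a hypothetical variable name,
dnl and the header path is assumed to be <infiniband/efadv.h>.
dnl The test program fails to compile both when the header is missing
dnl and when it is too old to declare the flag, so one check covers both.
AC_CHECK_DECL([EFADV_DEVICE_ATTR_CAPS_RDMA_WRITE],
    [have_efadv_rdma_write=yes],
    [have_efadv_rdma_write=no],
    [[#include <infiniband/efadv.h>]])
```

Checking for the declaration rather than only for the header addresses the case raised above, where the header is detected but the installed package is too old to compile against.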

@shijin-aws
Contributor

@zachdworkin @aingerson can you test this #10420 ?

Contributor

@aingerson left a comment

I don't necessarily have a problem with including these changes but I don't think they are really solving an issue and I don't think we should disable the efa and lpp builds on our CI. Our CI should probably try building as many things as possible to catch all build issues we can catch.
The thing I would change about these patches is the name of the config define (EFA and LPP) to make them more descriptive and avoid any potential duplicate defines

@zachdworkin
Contributor Author

> I don't necessarily have a problem with including these changes but I don't think they are really solving an issue and I don't think we should disable the efa and lpp builds on our CI. Our CI should probably try building as many things as possible to catch all build issues we can catch. The thing I would change about these patches is the name of the config define (EFA and LPP) to make them more descriptive and avoid any potential duplicate defines

I agree that the CI should build as many things as possible. I can change the names, but we need this option until we do our cluster upgrade next quarter. We can remove the command-line options --enable-efa=no and --enable-lpp=no when that upgrade is complete.

@aingerson
Contributor

Why do we need this option? Doesn't Shi's configure change fix it?

@zachdworkin
Contributor Author

I guess we don't need it but it could be nice to have for specific types of builds

@j-xiong
Contributor

j-xiong commented Sep 27, 2024

In libfabric we have per-provider configure scripts that check whether a provider can be built, plus global options to enable/disable providers. We can have the same options in fabtests, so I am fine with both this and #10420 going in.
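
The two mechanisms compose naturally: the tests are built only when the user has not switched them off and the detection check passed. The fragment below is a hedged sketch of that combination, not code from either PR. It assumes enable_efa (the user switch) and have_efadv_rdma_write (the detection result) were set earlier in configure.ac; those names, build_efa_tests, and BUILD_EFA_TESTS are all hypothetical.

```m4
dnl Sketch only; every variable and conditional name here is hypothetical.
dnl enable_efa:            set by an earlier AC_ARG_ENABLE([efa], ...)
dnl have_efadv_rdma_write: set by an earlier declaration/header check
AS_IF([test "x$enable_efa" != xno && test "x$have_efadv_rdma_write" = xyes],
    [build_efa_tests=yes],
    [build_efa_tests=no])

dnl Expose the final decision to the Makefiles and report it to the user.
AM_CONDITIONAL([BUILD_EFA_TESTS], [test "x$build_efa_tests" = xyes])
AC_MSG_NOTICE([EFA tests will be built: $build_efa_tests])
```

Keeping the user switch and the capability check as separate inputs means the forced disable keeps working even on systems where detection would succeed.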

@aingerson
Contributor

Yeah, I'm totally fine with it being included - just wanted to clarify that it wasn't actually needed

@zachdworkin
Contributor Author

@shijin-aws what's the AWS failure?

@darrylabbate
Member

bot:aws:retest

@zachdworkin
Contributor Author

@shijin-aws @darrylabbate what's the AWS failure? A rerun doesn't seem to let it pass.

@shijin-aws
Contributor

@zachdworkin you need to rebase on the latest main branch; there was a bug in the efa provider that was fixed by #10421. It's not blocking your PR though.

Add --enable-efa argument to fabtests to disable efa building.
Default is enabled. This is to disable efa for Intel CI.

Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>

Add --enable-lpp option to configure. Default on.
This is to turn it off for CI that doesn't need to build this
provider.

Signed-off-by: Zach Dworkin <zachary.dworkin@intel.com>

@zachdworkin
Contributor Author

@shijin-aws can you share the AWS CI failure?

@shijin-aws
Contributor

It's the sockets provider failure again:


server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.41.41 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10419-debug/install/fabtests/bin/fi_rdm_tagged_peek -p sockets -s 172.31.41.41'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.41.41 'timeout 360 /bin/bash --login -c '"'"'FI_LOG_LEVEL=warn /home/ec2-user/PortaFiducia/build/libraries/libfabric/pr10419-debug/install/fabtests/bin/fi_rdm_tagged_peek -p sockets -s 172.31.41.41 172.31.41.41'"'"''
client_stdout:
timeout: the monitored command dumped core

client returncode: 255
server_stdout:
libfabric:130006:1727803785::sockets:ep_data:sock_pe_progress_tx_entry():1959<warn> Peer disconnected: removing fd from pollset: fi_sockaddr_in://172.31.41.41:41875
libfabric:130006:1727803785::sockets:ep_data:sock_pe_progress_tx_entry():1959<warn> Peer disconnected: removing fd from pollset: fi_sockaddr_in://172.31.41.41:41875
libfabric:130006:1727803785::sockets:ep_data:sock_pe_progress_tx_entry():1959<warn> Peer disconnected: removing fd from pollset: fi_sockaddr_in://172.31.41.41:41875
libfabric:130006:1727803785::sockets:ep_data:sock_pe_progress_tx_entry():1959<warn> Peer disconnected: removing fd from pollset: fi_sockaddr_in://172.31.41.41:41875
libfabric:130006:1727803785::sockets:ep_data:sock_pe_progress_tx_entry():1959<warn> Peer disconnected: removing fd from pollset: fi_sockaddr_in://172.31.41.41:41875
fi_cq_sread/fi_cq_read(): functional/rdm_tagged_peek.c:56, ret=-259 (Error available)
libfabric:130006:1727803785::sockets:ep_data:sock_pe_progress_tx_entry():1959<warn> Peer disconnected: removing fd from pollset: fi_sockaddr_in://172.31.41.41:41875
libfabric:130006:1727803785::sockets:ep_data:sock_pe_progress_tx_entry():1959<warn> Peer disconnected: removing fd from pollset: fi_sockaddr_in://172.31.41.41:41875
libfabric:130006:1727803785::sockets:ep_data:sock_pe_progress_rx_pe_entry():2037<warn> Peer disconnected: removing fd from pollset: fi_sockaddr_in://172.31.41.41:41875
Sending 10 tagged messages
Waiting for messages to complete

server returncode: 1

@shijin-aws
Contributor

bot:aws:retest

@j-xiong merged commit c410477 into ofiwg:main on Oct 2, 2024
13 checks passed

5 participants