UCS/ARCH: Add SVE memcpy #5954
Can one of the admins verify this patch?
ok to test
Using
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
we had an idea to have a separate UCS wrapper for
Makes sense to have a stripped down version of memcpy for shared memory copy.
I asked some external folks to review it.
As per @shamisp's review request (I'm a developer of the compiler and MPI library for A64FX):
Though I'm not familiar with UCX code, the added code looks good to me.
@kawashima-fj thanks for the review! 👍
@dmitrygx are we good to go?
What
Add an ARM Scalable Vector Extension (SVE) version of the `memcpy` function to UCS.

Why?
We are seeing very low intra-node bandwidth on our A64FX machine with the latest UCX. The `memcpy` function shipped in glibc 2.28 (CentOS 8.1) isn't optimized, and it will probably take a long time, if ever, for users to have access to SVE-optimized glibc in production.

How?
Add an SVE-optimized `memcpy` written with intrinsics, enabled automatically when the compiler supports it. The SVE version is faster than what glibc provides for all array sizes in my tests.

Compiling UCX using GCC 10.2, with `-march=armv8.2-a+sve` in CFLAGS, running `taskset -c 1 ucx_info -s` on A64FX FX700.

Without patch:
With patch:
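
For readers unfamiliar with SVE intrinsics, the approach described under "How?" can be sketched roughly as follows. This is an illustration only, not the code added by this PR: the function name `sketch_sve_memcpy` is hypothetical, and the loop is a plain predicated byte copy without unrolling or alignment handling. It assumes the compiler defines `__ARM_FEATURE_SVE` when SVE is enabled (for example via `-march=armv8.2-a+sve`) and falls back to the standard `memcpy` otherwise.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical name, used for illustration; not the function from this PR. */
static inline void *sketch_sve_memcpy(void *dst, const void *src, size_t len)
{
#if defined(__ARM_FEATURE_SVE)
    uint8_t *d       = (uint8_t *)dst;
    const uint8_t *s = (const uint8_t *)src;
    uint64_t i       = 0;

    /* Predicate with one active lane per remaining byte, capped at the
     * hardware vector length. */
    svbool_t pg = svwhilelt_b8_u64(i, (uint64_t)len);

    /* Loop while at least one lane is active; the final partial vector is
     * handled by the predicate, so no scalar tail loop is needed. */
    while (svptest_any(svptrue_b8(), pg)) {
        svst1_u8(pg, d + i, svld1_u8(pg, s + i));
        i += svcntb(); /* bytes per SVE vector on this CPU */
        pg = svwhilelt_b8_u64(i, (uint64_t)len);
    }
    return dst;
#else
    /* Compiler was not asked for SVE: keep using the libc routine. */
    return memcpy(dst, src, len);
#endif
}
```

Because the vector length is queried at run time with `svcntb()`, the same binary should work on SVE implementations of any width, including the 512-bit vectors of A64FX.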
However, this patch only applies to the places that call `ucs_memcpy_relaxed`, so `ucx_perftest` results and actual applications are not improved, because UCT still uses the standard `memcpy` in many places. For example, in `ucs/sm/base/sm_ep.c: uct_sm_ep_put_short`.
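
To make the limitation concrete, here is a minimal sketch of why only some call sites benefit. All names are hypothetical (`sketch_use_builtin_memcpy`, `sketch_builtin_copy`, `sketch_memcpy_relaxed`, `sketch_put_short`) and do not reproduce UCX code; the "optimized" routine is a plain byte loop standing in for an SVE implementation. The point is only that a transport function picks up the optimized copy when it goes through the wrapper rather than calling `memcpy` directly.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical switch standing in for a configuration option. */
static int sketch_use_builtin_memcpy = 0;

/* Hypothetical optimized copy; a byte loop is used here as a placeholder. */
static void *sketch_builtin_copy(void *dst, const void *src, size_t len)
{
    unsigned char *d       = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    while (len-- > 0) {
        *d++ = *s++;
    }
    return dst;
}

/* Wrapper: every caller that uses it follows the switch above. */
static void *sketch_memcpy_relaxed(void *dst, const void *src, size_t len)
{
    return sketch_use_builtin_memcpy ? sketch_builtin_copy(dst, src, len)
                                     : memcpy(dst, src, len);
}

/* A transport-style call site. Had it called memcpy() directly, the
 * optimized routine would never be reached from here. */
static void sketch_put_short(void *remote, const void *payload, size_t len)
{
    sketch_memcpy_relaxed(remote, payload, len);
}
```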
Running `ucx_perftest` to test intra-node bandwidth inside the same A64FX CMG (NUMA node), using 64MB messages:

Client:
ucx_perftest -b ~/profile -f -c 2 fj-125
Server:
ucx_perftest -b ~/profile -f -c 1
Results:
These numbers are a lot lower than what's achievable on A64FX's HBM2. I would like to know whether the UCX team is planning to let the UCT shared-memory transports use the built-in optimized `memcpy` functions, for example the one introduced in #4760.

Also, maybe this PR should be extended to incorporate the changes in PR #3724, so that UCX only uses the built-in `memcpy` when it is enabled in the configuration.

Any suggestions are welcome! Thank you.