
UCS/ARCH: Add SVE memcpy #5954

Merged: 1 commit, Dec 1, 2020
Conversation

@lyu (Contributor) commented Nov 25, 2020

What

Add ARM Scalable Vector Extension (SVE) version of the memcpy function to UCS.

Why?

We are seeing very low intra-node bandwidth on our A64FX machine with the latest UCX. The memcpy function shipped in glibc 2.28 (CentOS 8.1) isn't optimized, and it will probably take a long time, if ever, for users to have access to SVE-optimized glibc in production.

How?

Add SVE-optimized memcpy written with intrinsics, enabled automatically when the compiler supports it. The SVE version is faster than what glibc provides for all array sizes in my tests.
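The core of an SVE memcpy is a predicated loop over whole vectors: `svwhilelt` builds a predicate covering only the bytes that remain, so the tail is handled without a scalar cleanup loop. Below is a minimal sketch of that technique (not the exact code in this patch; a plain `memcpy` fallback is added so the sketch builds without SVE support):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>

/* Predicated SVE copy loop: inactive lanes are neither loaded nor
 * stored, so a partial final vector is handled by the predicate. */
static void sve_memcpy(void *dst, const void *src, size_t len)
{
    uint8_t       *d  = (uint8_t *)dst;
    const uint8_t *s  = (const uint8_t *)src;
    uint64_t       i  = 0;
    svbool_t       pg = svwhilelt_b8_u64(i, (uint64_t)len);

    while (svptest_first(svptrue_b8(), pg)) {
        svst1_u8(pg, d + i, svld1_u8(pg, s + i));
        i += svcntb();                  /* bytes per SVE vector (VL-agnostic) */
        pg = svwhilelt_b8_u64(i, (uint64_t)len);
    }
}
#else
/* Fallback so the sketch compiles when the compiler lacks SVE support */
static void sve_memcpy(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}
#endif
```

Because the loop advances by `svcntb()`, the same binary adapts to any hardware vector length (512 bits on A64FX) without recompilation.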

Benchmark setup: UCX compiled with GCC 10.2 and -march=armv8.2-a+sve in CFLAGS, running taskset -c 1 ucx_info -s on an A64FX (FX700).

Without patch:

# Timer frequency: 100.000 MHz
# CPU vendor: Fujitsu ARM
# CPU model: ARM 64-bit
# Memcpy bandwidth:
#           4096 bytes: 9596.430 MB/s
#           8192 bytes: 10226.157 MB/s
#          16384 bytes: 10588.992 MB/s
#          32768 bytes: 9458.691 MB/s
#          65536 bytes: 8615.322 MB/s
#         131072 bytes: 8657.415 MB/s
#         262144 bytes: 8674.676 MB/s
#         524288 bytes: 8653.066 MB/s
#        1048576 bytes: 8664.883 MB/s
#        2097152 bytes: 8668.302 MB/s
#        4194304 bytes: 7902.334 MB/s
#        8388608 bytes: 8057.701 MB/s
#       16777216 bytes: 8056.261 MB/s
#       33554432 bytes: 8030.777 MB/s
#       67108864 bytes: 7762.344 MB/s
#      134217728 bytes: 7765.089 MB/s
#      268435456 bytes: 7764.248 MB/s

With patch:

# Timer frequency: 100.000 MHz
# CPU vendor: Fujitsu ARM
# CPU model: ARM 64-bit
# Memcpy bandwidth:
#           4096 bytes: 28501.541 MB/s
#           8192 bytes: 37439.737 MB/s
#          16384 bytes: 44341.234 MB/s
#          32768 bytes: 39440.581 MB/s
#          65536 bytes: 22662.431 MB/s
#         131072 bytes: 23066.897 MB/s
#         262144 bytes: 23276.764 MB/s
#         524288 bytes: 23383.291 MB/s
#        1048576 bytes: 23424.930 MB/s
#        2097152 bytes: 23439.418 MB/s
#        4194304 bytes: 18326.620 MB/s
#        8388608 bytes: 20706.165 MB/s
#       16777216 bytes: 20703.716 MB/s
#       33554432 bytes: 20633.525 MB/s
#       67108864 bytes: 18873.960 MB/s
#      134217728 bytes: 18875.474 MB/s
#      268435456 bytes: 18873.447 MB/s

However, this patch only affects call sites that use ucs_memcpy_relaxed, so ucx_perftest results and real applications do not improve: UCT still uses the standard memcpy in many places, for example uct_sm_ep_put_short in uct/sm/base/sm_ep.c.
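For illustration, the short-put path copies with plain libc memcpy today; routing it through the UCS wrapper would look roughly like the sketch below (signatures reduced to plain pointers, and ucs_memcpy_relaxed replaced by a stand-in macro so the sketch is self-contained — this is not the actual UCT code):

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for the UCS wrapper, which dispatches to an arch-optimized
 * copy (e.g. the SVE version) when one is available. */
#define ucs_memcpy_relaxed memcpy

/* Today: the short-put path copies with plain libc memcpy. */
static void sm_put_short_current(void *remote, const void *buffer, size_t len)
{
    memcpy(remote, buffer, len);
}

/* Possible change: go through the UCS copy so shared-memory transports
 * also pick up the arch-optimized implementation. */
static void sm_put_short_relaxed(void *remote, const void *buffer, size_t len)
{
    ucs_memcpy_relaxed(remote, buffer, len);
}
```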

Running ucx_perftest to test intra-node bandwidth inside the same A64FX CMG (NUMA node), using 64MB messages:

Client: ucx_perftest -b ~/profile -f -c 2 fj-125
Server: ucx_perftest -b ~/profile -f -c 1
Results:

+--------------+--------------+-----------------------------+---------------------+-----------------------+
|              |              |      overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | typical | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
+UCT_put_cma_zcopy-------+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000 11137.115  9134.142     5746.55    7006.68         100         110
+UCT_get_cma_zcopy-------+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000 10109.901  9126.714     6330.43    7012.38         110         110
+UCT_put_knem_zcopy------+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000 11351.496  9104.581     5638.02    7029.43          99         110
+UCT_get_knem_zcopy------+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000 10223.389  9102.322     6260.16    7031.17         110         110
+UCT_put_posix_short-----+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000  8577.506  8243.300     7461.38    7763.88         121         121
+UCT_put_posix_bcopy-----+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000  8576.701  8241.764     7462.08    7765.33         121         121
+UCT_get_posix_bcopy-----+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000  8268.692  8268.751     7740.04    7739.98         125         121
+UCT_put_sysv_short------+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000  8577.377  8243.505     7461.49    7763.69         121         121
+UCT_put_sysv_bcopy------+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000  8578.211  8243.329     7460.76    7763.85         121         121
+UCT_get_sysv_bcopy------+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000  8269.690  8274.670     7739.11    7734.45         125         121
+UCP_put-------+---------+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000  8238.832  8238.558     7768.09    7768.35         121         121
+UCP_get-------+---------+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000  8237.541  8236.292     7769.31    7770.49         121         121

These numbers are much lower than what is achievable on the A64FX's HBM2. Is the UCX team planning to let the UCT shared-memory transports use the built-in optimized memcpy functions, for example the one introduced in #4760?

Also, maybe this PR should be extended to incorporate the changes from PR #3724, so that UCX enables the built-in memcpy only when it is enabled in the configuration.

Any suggestions are welcome! Thank you.

@swx-jenkins3 (Collaborator)

Can one of the admins verify this patch?

@shamisp (Contributor) commented Nov 25, 2020

ok to test

@shamisp (Contributor) commented Nov 25, 2020

Using ucs_memcpy_relaxed for SM makes sense. @yosefe, is there a particular reason it is not used? I would guess it would benefit x86 as well.

Two review threads on src/ucs/arch/aarch64/cpu.h (outdated, resolved)
@yosefe (Contributor) commented Nov 27, 2020

> Using ucs_memcpy_relaxed for SM makes sense. @yosefe, is there a particular reason it is not used? I would guess it would benefit x86 as well.

AFAIR it showed inferior results on some architectures/sizes, @hoopoepg @dmitrygx do you remember?

@yosefe (Contributor) commented Nov 27, 2020

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@dmitrygx (Member)

> AFAIR it showed inferior results on some architectures/sizes, @hoopoepg @dmitrygx do you remember?

yes, there is a ~2x degradation when using rep movsb on AMD for copy-in/copy-out to/from shared regions

@dmitrygx (Member)

we had an idea to add a separate UCS wrapper for memcpy() for shared regions so it can be tuned per platform, e.g. disabled on AMD for now
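Such a wrapper might look like the sketch below. This is illustrative only: ucs_shm_memcpy and cpu_has_fast_rep_movsb are hypothetical names, not actual UCS API, and the platform check is a placeholder (a real implementation would consult CPUID/vendor information):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical platform check; a real implementation would detect
 * ERMSB/vendor via CPUID (e.g. report false on AMD for now). */
static int cpu_has_fast_rep_movsb(void)
{
    return 0;   /* placeholder: conservatively disabled */
}

/* Hypothetical shared-region copy wrapper: use 'rep movsb' only where
 * it is known to be fast, otherwise fall back to libc memcpy. */
static void ucs_shm_memcpy(void *dst, const void *src, size_t len)
{
#if defined(__x86_64__)
    if (cpu_has_fast_rep_movsb()) {
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(len)
                         :
                         : "memory");
        return;
    }
#endif
    memcpy(dst, src, len);
}
```

A single dispatch point like this would let shared-memory transports pick the best copy per platform without touching each call site.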

@shamisp (Contributor) commented Nov 27, 2020

> we had an idea to add a separate UCS wrapper for memcpy() for shared regions so it can be tuned per platform, e.g. disabled on AMD for now

Makes sense to have a stripped-down version of memcpy for shared-memory copies.

@shamisp (Contributor) commented Nov 28, 2020

I asked some external folks to review it.

@kawashima-fj left a comment

As per @shamisp's review request (I'm a developer of the compiler and MPI library for A64FX):
Though I'm not familiar with UCX code, the added code looks good to me.

@shamisp (Contributor) commented Dec 1, 2020

@kawashima-fj thanks for the review! 👍

@shamisp (Contributor) commented Dec 1, 2020

@dmitrygx are we good to go ?

@shamisp shamisp merged commit a2769c4 into openucx:master Dec 1, 2020
@lyu lyu deleted the add_sve_memcpy branch December 1, 2020 16:23