
UCS/ARCH: Add SVE memcpy #5954

Merged: 1 commit, Dec 1, 2020
Conversation

@lyu (Contributor) commented Nov 25, 2020

What

Add ARM Scalable Vector Extension (SVE) version of the memcpy function to UCS.

Why?

We are seeing very low intra-node bandwidth on our A64FX machine with the latest UCX. The memcpy function shipped in glibc 2.28 (CentOS 8.1) isn't optimized, and it will probably take a long time, if ever, for users to have access to SVE-optimized glibc in production.

How?

Add SVE-optimized memcpy written with intrinsics, enabled automatically when the compiler supports it. The SVE version is faster than what glibc provides for all array sizes in my tests.
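The core of an SVE memcpy is a predicated loop over whole vectors: `svwhilelt` builds a predicate covering only the bytes that remain, so the tail is handled without a scalar cleanup loop. Below is a minimal sketch of that technique (not the exact code in this patch; a plain `memcpy` fallback is added so the sketch builds without SVE support):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>

/* Predicated SVE copy loop: inactive lanes are neither loaded nor
 * stored, so a partial final vector is handled by the predicate. */
static void sve_memcpy(void *dst, const void *src, size_t len)
{
    uint8_t       *d  = (uint8_t *)dst;
    const uint8_t *s  = (const uint8_t *)src;
    uint64_t       i  = 0;
    svbool_t       pg = svwhilelt_b8_u64(i, (uint64_t)len);

    while (svptest_first(svptrue_b8(), pg)) {
        svst1_u8(pg, d + i, svld1_u8(pg, s + i));
        i += svcntb();                  /* bytes per SVE vector (VL-agnostic) */
        pg = svwhilelt_b8_u64(i, (uint64_t)len);
    }
}
#else
/* Fallback so the sketch compiles when the compiler lacks SVE support */
static void sve_memcpy(void *dst, const void *src, size_t len)
{
    memcpy(dst, src, len);
}
#endif
```

Because the loop advances by `svcntb()`, the same binary adapts to any hardware vector length (512 bits on A64FX) without recompilation.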

Benchmark setup: UCX compiled with GCC 10.2 and -march=armv8.2-a+sve in CFLAGS, running taskset -c 1 ucx_info -s on an A64FX (FX700).

Without patch:

# Timer frequency: 100.000 MHz
# CPU vendor: Fujitsu ARM
# CPU model: ARM 64-bit
# Memcpy bandwidth:
#           4096 bytes: 9596.430 MB/s
#           8192 bytes: 10226.157 MB/s
#          16384 bytes: 10588.992 MB/s
#          32768 bytes: 9458.691 MB/s
#          65536 bytes: 8615.322 MB/s
#         131072 bytes: 8657.415 MB/s
#         262144 bytes: 8674.676 MB/s
#         524288 bytes: 8653.066 MB/s
#        1048576 bytes: 8664.883 MB/s
#        2097152 bytes: 8668.302 MB/s
#        4194304 bytes: 7902.334 MB/s
#        8388608 bytes: 8057.701 MB/s
#       16777216 bytes: 8056.261 MB/s
#       33554432 bytes: 8030.777 MB/s
#       67108864 bytes: 7762.344 MB/s
#      134217728 bytes: 7765.089 MB/s
#      268435456 bytes: 7764.248 MB/s

With patch:

# Timer frequency: 100.000 MHz
# CPU vendor: Fujitsu ARM
# CPU model: ARM 64-bit
# Memcpy bandwidth:
#           4096 bytes: 28501.541 MB/s
#           8192 bytes: 37439.737 MB/s
#          16384 bytes: 44341.234 MB/s
#          32768 bytes: 39440.581 MB/s
#          65536 bytes: 22662.431 MB/s
#         131072 bytes: 23066.897 MB/s
#         262144 bytes: 23276.764 MB/s
#         524288 bytes: 23383.291 MB/s
#        1048576 bytes: 23424.930 MB/s
#        2097152 bytes: 23439.418 MB/s
#        4194304 bytes: 18326.620 MB/s
#        8388608 bytes: 20706.165 MB/s
#       16777216 bytes: 20703.716 MB/s
#       33554432 bytes: 20633.525 MB/s
#       67108864 bytes: 18873.960 MB/s
#      134217728 bytes: 18875.474 MB/s
#      268435456 bytes: 18873.447 MB/s

However, this patch only affects call sites that use ucs_memcpy_relaxed, so ucx_perftest results and real applications do not improve: UCT still uses the standard memcpy in many places, for example uct_sm_ep_put_short in uct/sm/base/sm_ep.c.
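For illustration, the short-put path copies with plain libc memcpy today; routing it through the UCS wrapper would look roughly like the sketch below (signatures reduced to plain pointers, and ucs_memcpy_relaxed replaced by a stand-in macro so the sketch is self-contained — this is not the actual UCT code):

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for the UCS wrapper, which dispatches to an arch-optimized
 * copy (e.g. the SVE version) when one is available. */
#define ucs_memcpy_relaxed memcpy

/* Today: the short-put path copies with plain libc memcpy. */
static void sm_put_short_current(void *remote, const void *buffer, size_t len)
{
    memcpy(remote, buffer, len);
}

/* Possible change: go through the UCS copy so shared-memory transports
 * also pick up the arch-optimized implementation. */
static void sm_put_short_relaxed(void *remote, const void *buffer, size_t len)
{
    ucs_memcpy_relaxed(remote, buffer, len);
}
```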

Running ucx_perftest to test intra-node bandwidth inside the same A64FX CMG (NUMA node), using 64MB messages:

Client: ucx_perftest -b ~/profile -f -c 2 fj-125
Server: ucx_perftest -b ~/profile -f -c 1
Results:

+--------------+--------------+-----------------------------+---------------------+-----------------------+
|              |              |      overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
|    Stage     | # iterations | typical | average | overall |  average |  overall |  average  |  overall  |
+--------------+--------------+---------+---------+---------+----------+----------+-----------+-----------+
+UCT_put_cma_zcopy-------+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000 11137.115  9134.142     5746.55    7006.68         100         110
+UCT_get_cma_zcopy-------+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000 10109.901  9126.714     6330.43    7012.38         110         110
+UCT_put_knem_zcopy------+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000 11351.496  9104.581     5638.02    7029.43          99         110
+UCT_get_knem_zcopy------+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000 10223.389  9102.322     6260.16    7031.17         110         110
+UCT_put_posix_short-----+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000  8577.506  8243.300     7461.38    7763.88         121         121
+UCT_put_posix_bcopy-----+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000  8576.701  8241.764     7462.08    7765.33         121         121
+UCT_get_posix_bcopy-----+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000  8268.692  8268.751     7740.04    7739.98         125         121
+UCT_put_sysv_short------+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000  8577.377  8243.505     7461.49    7763.69         121         121
+UCT_put_sysv_bcopy------+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000  8578.211  8243.329     7460.76    7763.85         121         121
+UCT_get_sysv_bcopy------+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000  8269.690  8274.670     7739.11    7734.45         125         121
+UCP_put-------+---------+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000  8238.832  8238.558     7768.09    7768.35         121         121
+UCP_get-------+---------+---------+---------+----------+----------+-----------+-----------+
Final:                  1000     0.000  8237.541  8236.292     7769.31    7770.49         121         121

These numbers are much lower than what is achievable on the A64FX's HBM2. Is the UCX team planning to let the UCT shared-memory transports use the built-in optimized memcpy functions, for example the one introduced in #4760?

Also, maybe this PR should be extended to incorporate the changes from PR #3724, so that UCX enables the built-in memcpy only when it is enabled in the configuration.

Any suggestions are welcome! Thank you.

@swx-jenkins3 (Collaborator)

Can one of the admins verify this patch?

@shamisp (Contributor) commented Nov 25, 2020

ok to test

@shamisp (Contributor) commented Nov 25, 2020

Using ucs_memcpy_relaxed for SM makes sense. @yosefe, is there a particular reason it is not used? I would guess it would benefit x86 as well.

Two review threads on src/ucs/arch/aarch64/cpu.h (outdated, resolved)
@yosefe (Contributor) commented Nov 27, 2020

> Using ucs_memcpy_relaxed for SM makes sense. @yosefe, is there a particular reason it is not used? I would guess it would benefit x86 as well.

AFAIR it showed inferior results on some architectures/sizes, @hoopoepg @dmitrygx do you remember?

@yosefe (Contributor) commented Nov 27, 2020

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@dmitrygx (Member)

> AFAIR it showed inferior results on some architectures/sizes, @hoopoepg @dmitrygx do you remember?

yes, there is a ~2x degradation when using rep movsb on AMD for copy-in/copy-out to/from shared regions

@dmitrygx (Member)

we had an idea to add a separate UCS wrapper for memcpy() for shared regions so it can be tuned per platform, e.g. disabled on AMD for now
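Such a wrapper might look like the sketch below. This is illustrative only: ucs_shm_memcpy and cpu_has_fast_rep_movsb are hypothetical names, not actual UCS API, and the platform check is a placeholder (a real implementation would consult CPUID/vendor information):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical platform check; a real implementation would detect
 * ERMSB/vendor via CPUID (e.g. report false on AMD for now). */
static int cpu_has_fast_rep_movsb(void)
{
    return 0;   /* placeholder: conservatively disabled */
}

/* Hypothetical shared-region copy wrapper: use 'rep movsb' only where
 * it is known to be fast, otherwise fall back to libc memcpy. */
static void ucs_shm_memcpy(void *dst, const void *src, size_t len)
{
#if defined(__x86_64__)
    if (cpu_has_fast_rep_movsb()) {
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(len)
                         :
                         : "memory");
        return;
    }
#endif
    memcpy(dst, src, len);
}
```

A single dispatch point like this would let shared-memory transports pick the best copy per platform without touching each call site.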

@shamisp (Contributor) commented Nov 27, 2020

> we had an idea to add a separate UCS wrapper for memcpy() for shared regions so it can be tuned per platform, e.g. disabled on AMD for now

Makes sense to have a stripped-down version of memcpy for shared-memory copies.

@shamisp (Contributor) commented Nov 28, 2020

I asked some external folks to review it.

@kawashima-fj left a comment

As per @shamisp's review request (I'm a developer of the compiler and MPI library for A64FX):
Though I'm not familiar with UCX code, the added code looks good to me.

@shamisp (Contributor) commented Dec 1, 2020

@kawashima-fj thanks for the review! 👍

@shamisp (Contributor) commented Dec 1, 2020

@dmitrygx are we good to go ?

@shamisp shamisp merged commit a2769c4 into openucx:master Dec 1, 2020
@lyu lyu deleted the add_sve_memcpy branch December 1, 2020 16:23