
TOOLS/PERF: fix hang and enhance the multi-thread performance #4890

Closed

zhuyj wants to merge 1 commit

Conversation


@zhuyj zhuyj commented Mar 16, 2020

Signed-off-by: Zhu Yanjun yanjunz@mellanox.com

Signed-off-by: root <root@c-141-67-100-103.mtl.labs.mlnx>
@mellanox-github
Contributor

Mellanox CI: PASSED on 25 workers

Note: the logs will be deleted after 23-Mar-2020

Agent/Stage Status
_main ✔️ SUCCESS
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@alinask
Contributor

alinask commented Mar 16, 2020

Hello @lyu ,

A while ago you contributed a fix to the UCX project in PR #3350.
It fixed ucx_perftest when running with several threads.
However, we have recently noticed that the commit in that PR introduced a performance degradation in multi-threaded mode.
The code in this PR, submitted by @zhuyj, seems to resolve it: it adds a #pragma omp master before the call to the new function from your PR, ucx_perf_test_prepare_new_run().
Do you know why that helps? Why should only the main worker thread call this function?
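In isolation, the pattern being described looks roughly like this (a minimal sketch, assuming the reset call sits inside the OpenMP parallel region that ucx_perf_thread_spawn() opens; the stub function body and the loop comments are illustrative, not the actual diff):

/* Sketch only - the real code lives in src/tools/perf/lib/libperf.c.
   Compile with: gcc -fopenmp sketch.c */
#include <omp.h>
#include <stdio.h>

/* stand-in for the real UCX routine that resets counters and timers */
static void ucx_perf_test_prepare_new_run(void)
{
    printf("counters reset by thread %d\n", omp_get_thread_num());
}

int main(void)
{
#pragma omp parallel num_threads(2)
    {
        /* ... warm-up iterations ... */

#pragma omp master                    /* the guard this PR adds */
        {
            ucx_perf_test_prepare_new_run();
        }
#pragma omp barrier

        /* ... measured iterations ... */
    }
    return 0;
}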

Thanks!

@lyu
Contributor

lyu commented Mar 16, 2020

@alinask @zhuyj
During multi-threaded runs, every thread goes through the macro UCX_PERF_TEST_FOREACH, which calls ucx_perf_context_done() to halt the benchmark once perf->current.iters >= perf->max_iters.

Therefore, if only the master thread calls ucx_perf_test_prepare_new_run() to reset its counters, all the other threads will have their counters stuck in the past (i.e. at the end of the warm-up run), so they all fail the check in UCX_PERF_TEST_FOREACH and skip the actual run immediately. You can check the comments in my original PR for more details.

With only one thread actually running the benchmark, you see "higher" performance. This can be verified by checking the other threads' CPU utilization during the benchmark.
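A toy reproduction of this effect (per-thread counters stand in for UCX's per-thread perf contexts; the names, loop bound and structure are illustrative, not the real UCX_PERF_TEST_FOREACH definition):

/* Each thread keeps its own iteration counter; if only the master resets it
   after warm-up, the other threads' counters stay "finished" and their
   measured loop exits immediately. Compile with: gcc -fopenmp toy.c */
#include <omp.h>
#include <stdio.h>

#define MAX_ITERS 1000L

int main(void)
{
#pragma omp parallel num_threads(4)
    {
        long iters = 0;                  /* per-thread, like perf->current.iters */

        while (iters < MAX_ITERS) {      /* warm-up run */
            iters++;
        }

#pragma omp master                       /* only thread 0 resets its counter */
        iters = 0;
#pragma omp barrier

        long done = 0;
        while (iters < MAX_ITERS) {      /* "measured" run */
            iters++;
            done++;
        }
        printf("thread %d performed %ld measured iterations\n",
               omp_get_thread_num(), done);
    }
    return 0;
}

Only thread 0 performs the 1000 measured iterations; the other threads report 0, which matches the behaviour described above.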

The title of this PR mentions "hang", which shouldn't happen. Could you please give more details about the hang? If "hanging" means extremely low performance on the aarch64 platform, then it is probably a known issue; see #3569.

@zhuyj
Author

zhuyj commented Mar 16, 2020

Thanks, lyu. The following are my steps to reproduce the hang:

  1. I ran the tests on two cloud VMs;
  2. ./autogen.sh && ./contrib/configure-devel && make

Server:
UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=all ./src/tools/perf/ucx_perftest -T 2
Client:
UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=all ./src/tools/perf/ucx_perftest 3.3.3.5 -t stream_bw -n 5000000 -s 4096 -T 2

  3. The hang does not occur every time. Run the above commands over and over again; with luck, the hang will occur fairly soon.

@zhuyj
Author

zhuyj commented Mar 16, 2020

Just now I hit a hang of ucx_perftest. The following is the scenario; I hope this helps.

UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=all /workspace/yanjunz/ucx/src/tools/perf/ucx_perftest 3.3.3.5 -t stream_bw -n 5000000 -s 4096 -T 2
[1584365076.316668] [c-141-98-1-005:16020:0] perftest.c:1423 UCX WARN CPU affinity is not set (bound to 10 cpus). Performance may be impacted.
Waiting for connection...
+------------------------------------------------------------------------------------------+
| API: protocol layer |
| Test: stream bandwidth |
| Data layout: (automatic) |
| Send memory: host |
| Recv memory: host |
| Message size: 4096 |
+------------------------------------------------------------------------------------------+
[1584365079.503294] [c-141-98-1-006:6685 :0] perftest.c:1423 UCX WARN CPU affinity is not set (bound to 10 cpus). Performance may be impacted.
+--------------+-----------------------------+---------------------+-----------------------+
| | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall | average | overall | average | overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
168943 5.000 5.919 5.919 659.93 659.93 168943 168943
340194 5.000 5.839 5.879 668.95 664.44 171251 170097
511562 5.000 5.835 5.864 669.40 666.10 171367 170520
682027 5.000 5.866 5.865 665.88 666.04 170465 170506
852364 5.000 5.871 5.866 665.38 665.91 170336 170472
1056617 3.000 4.896 5.679 797.86 687.90 204253 176103
1291055 3.000 4.266 5.422 915.77 720.45 234438 184436
1522163 3.000 4.327 5.256 902.76 743.24 231108 190270
1741625 3.000 4.557 5.168 857.26 755.91 219459 193513
1907622 3.000 6.024 5.242 648.42 745.16 165997 190762
2095740 3.000 5.316 5.249 734.82 744.22 188114 190521
2276336 3.000 5.537 5.272 705.45 740.99 180595 189694
2459543 3.000 5.458 5.286 715.65 739.04 183207 189195
2644756 3.000 5.399 5.294 723.49 737.93 185213 188910
2808534 3.000 6.106 5.341 639.75 731.39 163777 187235
2954822 3.000 6.836 5.415 571.43 721.39 146285 184675
3103995 3.000 6.704 5.477 582.70 713.23 149171 182587
3279725 3.000 5.691 5.488 686.45 711.74 175730 182206
3433016 3.000 6.524 5.535 598.79 705.80 153291 180684
3596495 3.000 6.117 5.561 638.59 702.44 163479 179824
3763094 3.000 6.002 5.581 650.78 699.98 166599 179194
3935264 3.000 5.808 5.591 672.53 698.73 172167 178875
4099253 3.000 6.098 5.611 640.58 696.20 163988 178227
4260863 3.000 6.188 5.633 631.28 693.50 161607 177535
4424481 3.000 6.112 5.650 639.10 691.32 163611 176978
4594598 3.000 5.878 5.659 664.52 690.29 170117 176714
4792294 3.000 5.058 5.634 772.25 693.32 197696 177491
4983404 3.000 5.233 5.619 746.51 695.22 191108 177977
^CKilled by signal 2.

@lyu
Contributor

lyu commented Mar 16, 2020

Unfortunately I could not reproduce this hanging issue on our cluster after many runs.

My UCX configuration:
./contrib/configure-devel --prefix="$PWD/instl" --enable-mt --without-java --without-valgrind

My stream_bw test results using two threads:

+--------------+-----------------------------+---------------------+-----------------------+
|              |      overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall |  average |  overall |   average |   overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
        704005     0.522     1.424     1.424    2743.19    2743.19      702258      702258
       1677695     0.523     1.029     1.195    3794.71    3268.90      971445      836839
       2651587     0.522     1.029     1.134    3795.49    3444.42      971646      881772
       3624946     0.523     1.030     1.106    3793.41    3531.66      971112      904106
       4598977     0.523     1.029     1.090    3796.04    3584.54      971785      917641
       5000000     0.522     1.054     1.087    3706.65    3594.03      948902      920073

The bandwidth and overhead vary a lot between runs, but the test always runs to completion.

However, I do notice that the last line of the results (5000000 iterations) takes longer than it should to appear, and your test seems to be stuck right before that line shows up.

Could you please provide the backtrace of the program when it's stuck?

This is how I get it:

gdb $EXE --pid=$PID --nx --quiet --batch -ex 'thread apply all bt'

Thank you

@alinask
Contributor

alinask commented Mar 16, 2020

@lyu Thank you for the detailed response!

It seems that the reporting of the results isn't correct when more than one thread is used - https://github.com/openucx/ucx/blob/master/src/tools/perf/lib/libperf.c#L1724
For bandwidth, the results should be summed across the threads, but here only the main thread reports its result.
Also, after applying the fix in this PR, the main thread reports the full bandwidth, which doesn't really make sense since the other threads are running and consuming bandwidth as well.
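A minimal sketch of that kind of aggregation (an OpenMP reduction over per-thread bandwidth figures; the variable names and the stand-in per-thread value are illustrative, not UCX's actual reporting code):

/* Illustrative only: sum each thread's bandwidth before reporting, instead
   of printing only the master thread's share.
   Compile with: gcc -fopenmp aggregate.c */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    double total_bw = 0.0;    /* MB/s, aggregated over all threads */

#pragma omp parallel num_threads(4) reduction(+:total_bw)
    {
        /* stand-in for the per-thread result ucx_perftest already computes */
        double thread_bw = 660.0;
        total_bw += thread_bw;
    }

    /* a single aggregated report, rather than one thread's numbers */
    printf("aggregate bandwidth: %.2f MB/s\n", total_bw);
    return 0;
}

In ucx_perftest itself the per-thread numbers would come from each thread's perf context rather than a constant, but the reduce-then-report-once structure is the idea.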

@lyu
Contributor

lyu commented Mar 16, 2020

@alinask You are right, the numbers from the different threads are not aggregated yet. I thought about implementing this but couldn't decide on the best way to report the numbers.

Applying this PR would cause all the other threads to do no work at all, which defeats the purpose of a multi-threaded benchmark. So IMHO the real issue here is the hanging problem discovered by @zhuyj.

If it turns out that the benchmark is stuck in ucp_worker_progress() then I'm afraid that we will need someone else's help, because I actually don't know how to debug from there.

@zhuyj
Author

zhuyj commented Mar 17, 2020

Sorry for the delay; it took me some time to reproduce this hang.
It is still hung right now, and I will keep this test scenario alive. If you need more information, please let me know.

[root@c-141-98-1-005 ~]# gdb lt-ucx_perftest --pid=16920 --nx --quiet --batch -ex 'thread apply all bt'
lt-ucx_perftest: No such file or directory.
[New LWP 16944]
[New LWP 16941]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f9c26675585 in pthread_spin_lock () from /lib64/libpthread.so.0

Thread 3 (Thread 0x7f9c1b596700 (LWP 16941)):
#0 0x00007f9c25d67ea3 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f9c26fe1403 in ucs_event_set_wait (event_set=0x1c260a0, num_events=num_events@entry=0x7f9c1b595e40, timeout_ms=100, event_set_handler=event_set_handler@entry=0x7f9c26fc4ff0 <ucs_async_thread_ev_handler>, arg=arg@entry=0x7f9c1b595e50) at sys/event_set.c:196
#2 0x00007f9c26fc58cc in ucs_async_thread_func (arg=0x1c26050) at async/thread.c:127
#3 0x00007f9c26670ea5 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f9c25d678cd in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f9c178b0700 (LWP 16944)):
#0 0x00007f9c26675573 in pthread_spin_lock () from /lib64/libpthread.so.0
#1 0x00007f9c2746fb78 in ucs_spin_lock (lock=0x7f9c27806018) at /workspace/yanjunz/ucx/src/ucs/type/spinlock.h:105
#2 ucs_recursive_spin_lock (lock=0x7f9c27806018) at /workspace/yanjunz/ucx/src/ucs/type/spinlock.h:117
#3 ucp_worker_progress (worker=0x7f9c27806010) at core/ucp_worker.c:1957
#4 0x00000000004690c1 in progress_responder (this=<optimized out>) at ucp_tests.cc:89
#5 wait_stream_recv (request=<optimized out>, this=<optimized out>) at ucp_tests.cc:123
#6 recv_stream (datatype=<optimized out>, length=<optimized out>, buf=<optimized out>, ep=<optimized out>, this=<optimized out>) at ucp_tests.cc:464
#7 recv (sn=<optimized out>, datatype=<optimized out>, length=<optimized out>, buffer=<optimized out>, ep=<optimized out>, worker=<optimized out>, this=<optimized out>) at ucp_tests.cc:275
#8 ucp_perf_test_runner<(ucx_perf_cmd_t)9, (ucx_perf_test_type_t)1, 0u>::run_stream_uni (this=this@entry=0x7f9c178afdb0) at ucp_tests.cc:393
#9 0x000000000045bc7c in run (this=0x7f9c178afdb0) at ucp_tests.cc:421
#10 ucp_perf_test_dispatch (perf=<optimized out>) at ucp_tests.cc:549
#11 0x0000000000409286 in ucx_perf_thread_run_test (arg=0x1cbf348) at libperf.c:1713
#12 ucx_perf_thread_spawn._omp_fn.0 () at libperf.c:1765
#13 0x00007f9c2604d405 in ?? () from /lib64/libgomp.so.1
#14 0x00007f9c26670ea5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f9c25d678cd in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f9c278fa7c0 (LWP 16920)):
#0 0x00007f9c26675585 in pthread_spin_lock () from /lib64/libpthread.so.0
#1 0x00007f9c2746fb78 in ucs_spin_lock (lock=0x7f9c27806018) at /workspace/yanjunz/ucx/src/ucs/type/spinlock.h:105
#2 ucs_recursive_spin_lock (lock=0x7f9c27806018) at /workspace/yanjunz/ucx/src/ucs/type/spinlock.h:117
#3 ucp_worker_progress (worker=0x7f9c27806010) at core/ucp_worker.c:1957
#4 0x00000000004690c1 in progress_responder (this=<optimized out>) at ucp_tests.cc:89
#5 wait_stream_recv (request=<optimized out>, this=<optimized out>) at ucp_tests.cc:123
#6 recv_stream (datatype=<optimized out>, length=<optimized out>, buf=<optimized out>, ep=<optimized out>, this=<optimized out>) at ucp_tests.cc:464
#7 recv (sn=<optimized out>, datatype=<optimized out>, length=<optimized out>, buffer=<optimized out>, ep=<optimized out>, worker=<optimized out>, this=<optimized out>) at ucp_tests.cc:275
#8 ucp_perf_test_runner<(ucx_perf_cmd_t)9, (ucx_perf_test_type_t)1, 0u>::run_stream_uni (this=this@entry=0x7ffffe7d6040) at ucp_tests.cc:393
#9 0x000000000045bc7c in run (this=0x7ffffe7d6040) at ucp_tests.cc:421
#10 ucp_perf_test_dispatch (perf=<optimized out>) at ucp_tests.cc:549
#11 0x0000000000409286 in ucx_perf_thread_run_test (arg=0x1cbb080) at libperf.c:1713
#12 ucx_perf_thread_spawn._omp_fn.0 () at libperf.c:1765
#13 0x000000000040b80a in ucx_perf_thread_spawn (result=0x7ffffe7d63d0, perf=0x1a33f70) at libperf.c:1752
#14 ucx_perf_run (params=params@entry=0x7ffffe7d63d0, result=result@entry=0x7ffffe7d61d0) at libperf.c:1665
#15 0x0000000000406e22 in run_test_recurs (ctx=ctx@entry=0x7ffffe7d63d0, parent_params=parent_params@entry=0x7ffffe7d63d0, depth=depth@entry=0) at perftest.c:1471
#16 0x0000000000404f1d in run_test (ctx=0x7ffffe7d63d0) at perftest.c:1520
#17 main (argc=<optimized out>, argv=<optimized out>) at perftest.c:1578

@zhuyj
Author

zhuyj commented Mar 17, 2020

@alinask @lyu
When I comment out this line - https://github.com/openucx/ucx/blob/master/src/tools/perf/lib/libperf.c#L1724 -
"
1721 #pragma omp master
1722     {
1723         /* Assuming all threads are fairly treated, reporting only tid==0
1724            TODO: aggregate reports */
1725         // ucx_perf_calc_result(perf, result);
1726         rte_call(perf, report, result, perf->params.report_arg, 1);
1727     }
"
the test result is as follows:
"
745449 1.583 1.345 1.345 2904.83 2904.83 743637 743637
1374289 1.575 1.594 1.459 2450.43 2677.63 627311 685474
2003389 1.575 1.593 1.501 2451.45 2602.24 627572 666173
2632338 1.562 1.594 1.523 2450.86 2564.39 627421 656485
3261263 1.578 1.594 1.537 2450.77 2541.67 627397 650667
3889844 1.572 1.595 1.546 2449.42 2526.29 627052 646731
4518782 1.585 1.594 1.553 2450.82 2515.51 627410 643971
0 0.000 0.000 0.000 0.00 0.00 0 0
"
It seems that the block at https://github.com/openucx/ucx/blob/master/src/tools/perf/lib/libperf.c#L1724 only prints the last line.
"
1657     /* Run test */
1658     status = ucx_perf_funcs[params->api].run(perf);
1659     ucx_perf_funcs[params->api].barrier(perf);
1660     if (status == UCS_OK) {
1661         ucx_perf_calc_result(perf, result);
1662         rte_call(perf, report, result, perf->params.report_arg, 1);
1663     }
"
The block above prints the other lines in the test output.

Zhu Yanjun

@lyu
Contributor

lyu commented Mar 17, 2020

@zhuyj Thanks for the backtrace.

Just as I was afraid, it is stuck in ucp_worker_progress(). I am not a UCX expert, and every time our application gets stuck in that function I end up having to find the root cause by other means. We probably need help from one of the UCX developers - @shamisp.

Also, could you please try using mutexes instead of spinlocks by setting UCX_USE_MT_MUTEX=y? It's a long shot, but why not.

Maybe we should close this PR and open an issue to discuss this, since we still don't know the root cause of this problem.

@zhuyj
Author

zhuyj commented Mar 17, 2020

@lyu
The following are the test commands:
"
UCX_NET_DEVICES=mlx5_5:1 UCX_TLS=all UCX_USE_MT_MUTEX=y ./src/tools/perf/ucx_perftest -T 6

UCX_NET_DEVICES=mlx5_5:1 UCX_TLS=all UCX_USE_MT_MUTEX=y ./src/tools/perf/ucx_perftest 1.1.1.4 -t stream_bw -n 5000000 -s 4096 -T 6
"
Sometimes this causes ucx_perftest to hang.

Without "UCX_USE_MT_MUTEX=y", there is no hang.

@lyu
Contributor

lyu commented Mar 17, 2020

@zhuyj Spinlocks are used by default, which is equivalent to not setting UCX_USE_MT_MUTEX=y, so maybe using mutexes increases the probability of hanging? This sounds like a race condition. I will try to reproduce it and wait to see whether others have any ideas.

@zhuyj
Author

zhuyj commented Mar 17, 2020

@lyu
Before the following commit, perftest's multi-threaded mode worked. After this commit, the multi-threaded performance is poor and hangs can occur.
I do not know why.

Sorry - handling this multi-threaded performance problem and the hang is a difficult task.

"
From: lyu lyu@users.noreply.github.com
Date: Sat, 9 Mar 2019 11:12:06 -0500
Subject: [PATCH 1/1] TOOLS/PERF: Fix ucx_perftest's multi-threaded mode

  • Reset perf counters and timers correctly after a multi-threaded warm-up run
  • Make sock/mpi/ext_rte_barrier more robust against non-master threads that arrive late
  • Remove extra ucp_perf_update as its functionality has been fulfilled by a preceding ucx_perf_get_time, and it causes NaNs in ucx_perf_calc_result
  • Disable non-master threads' reports from within ucx_perf_update
"

@lyu
Contributor

lyu commented Mar 17, 2020

@zhuyj Sorry if I didn't make this clear, but #3350 enabled truly multi-threaded benchmarks, which is the right thing to do. The performance is "bad" as a result of thread contention, which is expected.

If you run a multi-threaded benchmark with a large number of iterations with this PR applied, I can assure you that after the initial warm-up iterations all the other threads will be sitting on their hands, doing nothing. As @alinask has pointed out, it makes no sense for a 6-threaded run to report the same bandwidth as a single-threaded run, unless the other threads are not consuming any bandwidth.

This PR would "solve" the problem we are seeing right now only because there is no race condition when a single thread does all the work.

Anyway, I was able to reproduce the issue with UCX_USE_MT_MUTEX=y; let's see what I can do.

@zhuyj
Author

zhuyj commented Mar 17, 2020

@lyu
From your comments, do you mean that multi-threaded mode did not work well before #3350?

@zhuyj zhuyj closed this Mar 17, 2020
@zhuyj zhuyj deleted the topic/perftest_fix branch March 17, 2020 07:35
@lyu
Contributor

lyu commented Mar 17, 2020

@zhuyj No - multi-threaded benchmark runs didn't work at all before #3350.

Now that I have also confirmed the stream_bw hang but couldn't find its cause, you may want to open an issue about this.
