
TOOLS/PERF: fix hang and enhance the multi-thread performance #4890

Closed

zhuyj wants to merge 1 commit

Conversation


@zhuyj zhuyj commented Mar 16, 2020

Signed-off-by: Zhu Yanjun yanjunz@mellanox.com

Signed-off-by: root <root@c-141-67-100-103.mtl.labs.mlnx>
@mellanox-github
Contributor

Mellanox CI: PASSED on 25 workers

Note: the logs will be deleted after 23-Mar-2020

Agent/Stage Status
_main ✔️ SUCCESS
hpc-arm-cavium-jenkins_W0 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W1 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W2 ✔️ SUCCESS
hpc-arm-cavium-jenkins_W3 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W0 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W1 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W2 ✔️ SUCCESS
hpc-arm-hwi-jenkins_W3 ✔️ SUCCESS
hpc-test-node-gpu_W0 ✔️ SUCCESS
hpc-test-node-gpu_W1 ✔️ SUCCESS
hpc-test-node-gpu_W2 ✔️ SUCCESS
hpc-test-node-gpu_W3 ✔️ SUCCESS
hpc-test-node-legacy_W0 ✔️ SUCCESS
hpc-test-node-legacy_W1 ✔️ SUCCESS
hpc-test-node-legacy_W2 ✔️ SUCCESS
hpc-test-node-legacy_W3 ✔️ SUCCESS
hpc-test-node-new_W0 ✔️ SUCCESS
hpc-test-node-new_W1 ✔️ SUCCESS
hpc-test-node-new_W2 ✔️ SUCCESS
hpc-test-node-new_W3 ✔️ SUCCESS
r-vmb-ppc-jenkins_W0 ✔️ SUCCESS
r-vmb-ppc-jenkins_W1 ✔️ SUCCESS
r-vmb-ppc-jenkins_W2 ✔️ SUCCESS
r-vmb-ppc-jenkins_W3 ✔️ SUCCESS

@alinask
Contributor

alinask commented Mar 16, 2020

Hello @lyu ,

A while ago you contributed a fix to the UCX project in PR #3350.
It fixed ucx_perftest when running with several threads.
However, we have recently noticed that the commit in that PR introduced a performance degradation in multi-threaded mode.
The code in this PR, submitted by @zhuyj, seems to resolve it: it adds a #pragma omp master before the call to the new function from your PR, ucx_perf_test_prepare_new_run().
Do you know why that helps? Why should only the main worker thread call this function?
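In isolation, the pattern being described looks roughly like this (a minimal sketch, assuming the reset call sits inside the OpenMP parallel region that ucx_perf_thread_spawn() opens; the stub function body and the loop comments are illustrative, not the actual diff):

/* Sketch only - the real code lives in src/tools/perf/lib/libperf.c.
   Compile with: gcc -fopenmp sketch.c */
#include <omp.h>
#include <stdio.h>

/* stand-in for the real UCX routine that resets counters and timers */
static void ucx_perf_test_prepare_new_run(void)
{
    printf("counters reset by thread %d\n", omp_get_thread_num());
}

int main(void)
{
#pragma omp parallel num_threads(2)
    {
        /* ... warm-up iterations ... */

#pragma omp master                    /* the guard this PR adds */
        {
            ucx_perf_test_prepare_new_run();
        }
#pragma omp barrier

        /* ... measured iterations ... */
    }
    return 0;
}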

Thanks!

@lyu
Contributor

lyu commented Mar 16, 2020

@alinask @zhuyj
During multi-threaded runs, every thread goes through the macro UCX_PERF_TEST_FOREACH, which calls ucx_perf_context_done() to halt the benchmark once perf->current.iters >= perf->max_iters.

Therefore, if only the master thread calls ucx_perf_test_prepare_new_run() to reset its counters, all the other threads will have their counters stuck in the past (i.e. at the end of the warm-up run), so they all fail the check in UCX_PERF_TEST_FOREACH and skip the actual run immediately. You can check the comments in my original PR for more details.

With only one thread actually running the benchmark, you see "higher" performance. This can be verified by checking the other threads' CPU utilization during the benchmark.
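A toy reproduction of this effect (per-thread counters stand in for UCX's per-thread perf contexts; the names, loop bound and structure are illustrative, not the real UCX_PERF_TEST_FOREACH definition):

/* Each thread keeps its own iteration counter; if only the master resets it
   after warm-up, the other threads' counters stay "finished" and their
   measured loop exits immediately. Compile with: gcc -fopenmp toy.c */
#include <omp.h>
#include <stdio.h>

#define MAX_ITERS 1000L

int main(void)
{
#pragma omp parallel num_threads(4)
    {
        long iters = 0;                  /* per-thread, like perf->current.iters */

        while (iters < MAX_ITERS) {      /* warm-up run */
            iters++;
        }

#pragma omp master                       /* only thread 0 resets its counter */
        iters = 0;
#pragma omp barrier

        long done = 0;
        while (iters < MAX_ITERS) {      /* "measured" run */
            iters++;
            done++;
        }
        printf("thread %d performed %ld measured iterations\n",
               omp_get_thread_num(), done);
    }
    return 0;
}

Only thread 0 performs the 1000 measured iterations; the other threads report 0, which matches the behaviour described above.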

The title of this PR mentions "hang", which shouldn't happen. Could you please give more details about the hang? If "hanging" means extremely low performance on the aarch64 platform, then it is probably a known issue; see #3569.

@zhuyj
Author

zhuyj commented Mar 16, 2020

Thanks, lyu. The following are my steps to reproduce the hang:

  1. I ran the tests on two cloud VMs;
  2. ./autogen.sh && ./contrib/configure-devel && make

Server:
UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=all ./src/tools/perf/ucx_perftest -T 2
Client:
UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=all ./src/tools/perf/ucx_perftest 3.3.3.5 -t stream_bw -n 5000000 -s 4096 -T 2

  3. The hang does not occur every time. Run the above commands over and over again; with luck, the hang will occur fairly soon.

@zhuyj
Author

zhuyj commented Mar 16, 2020

Just now I hit a hang of ucx_perftest. The following is the scenario; I hope this helps.

UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=all /workspace/yanjunz/ucx/src/tools/perf/ucx_perftest 3.3.3.5 -t stream_bw -n 5000000 -s 4096 -T 2
[1584365076.316668] [c-141-98-1-005:16020:0] perftest.c:1423 UCX WARN CPU affinity is not set (bound to 10 cpus). Performance may be impacted.
Waiting for connection...
+------------------------------------------------------------------------------------------+
| API: protocol layer |
| Test: stream bandwidth |
| Data layout: (automatic) |
| Send memory: host |
| Recv memory: host |
| Message size: 4096 |
+------------------------------------------------------------------------------------------+
[1584365079.503294] [c-141-98-1-006:6685 :0] perftest.c:1423 UCX WARN CPU affinity is not set (bound to 10 cpus). Performance may be impacted.
+--------------+-----------------------------+---------------------+-----------------------+
| | overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall | average | overall | average | overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
168943 5.000 5.919 5.919 659.93 659.93 168943 168943
340194 5.000 5.839 5.879 668.95 664.44 171251 170097
511562 5.000 5.835 5.864 669.40 666.10 171367 170520
682027 5.000 5.866 5.865 665.88 666.04 170465 170506
852364 5.000 5.871 5.866 665.38 665.91 170336 170472
1056617 3.000 4.896 5.679 797.86 687.90 204253 176103
1291055 3.000 4.266 5.422 915.77 720.45 234438 184436
1522163 3.000 4.327 5.256 902.76 743.24 231108 190270
1741625 3.000 4.557 5.168 857.26 755.91 219459 193513
1907622 3.000 6.024 5.242 648.42 745.16 165997 190762
2095740 3.000 5.316 5.249 734.82 744.22 188114 190521
2276336 3.000 5.537 5.272 705.45 740.99 180595 189694
2459543 3.000 5.458 5.286 715.65 739.04 183207 189195
2644756 3.000 5.399 5.294 723.49 737.93 185213 188910
2808534 3.000 6.106 5.341 639.75 731.39 163777 187235
2954822 3.000 6.836 5.415 571.43 721.39 146285 184675
3103995 3.000 6.704 5.477 582.70 713.23 149171 182587
3279725 3.000 5.691 5.488 686.45 711.74 175730 182206
3433016 3.000 6.524 5.535 598.79 705.80 153291 180684
3596495 3.000 6.117 5.561 638.59 702.44 163479 179824
3763094 3.000 6.002 5.581 650.78 699.98 166599 179194
3935264 3.000 5.808 5.591 672.53 698.73 172167 178875
4099253 3.000 6.098 5.611 640.58 696.20 163988 178227
4260863 3.000 6.188 5.633 631.28 693.50 161607 177535
4424481 3.000 6.112 5.650 639.10 691.32 163611 176978
4594598 3.000 5.878 5.659 664.52 690.29 170117 176714
4792294 3.000 5.058 5.634 772.25 693.32 197696 177491
4983404 3.000 5.233 5.619 746.51 695.22 191108 177977
^CKilled by signal 2.

@lyu
Contributor

lyu commented Mar 16, 2020

Unfortunately I could not reproduce this hanging issue on our cluster after many runs.

My UCX configuration:
./contrib/configure-devel --prefix="$PWD/instl" --enable-mt --without-java --without-valgrind

My stream_bw test results using two threads:

+--------------+-----------------------------+---------------------+-----------------------+
|              |      overhead (usec)        |   bandwidth (MB/s)  |  message rate (msg/s) |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
| # iterations | typical | average | overall |  average |  overall |   average |   overall |
+--------------+---------+---------+---------+----------+----------+-----------+-----------+
        704005     0.522     1.424     1.424    2743.19    2743.19      702258      702258
       1677695     0.523     1.029     1.195    3794.71    3268.90      971445      836839
       2651587     0.522     1.029     1.134    3795.49    3444.42      971646      881772
       3624946     0.523     1.030     1.106    3793.41    3531.66      971112      904106
       4598977     0.523     1.029     1.090    3796.04    3584.54      971785      917641
       5000000     0.522     1.054     1.087    3706.65    3594.03      948902      920073

The bandwidth and overhead vary a lot between runs, but the test always runs to completion.

However, I do notice that the last line of the results (5000000 iterations) takes longer than it should to appear, and your test seems to be stuck right before that line shows up.

Could you please provide the backtrace of the program when it's stuck?

This is how I get it:

gdb $EXE --pid=$PID --nx --quiet --batch -ex 'thread apply all bt'

Thank you

@alinask
Contributor

alinask commented Mar 16, 2020

@lyu Thank you for the detailed response!

It seems that the reporting of the results isn't correct when more than one thread is used - https://github.com/openucx/ucx/blob/master/src/tools/perf/lib/libperf.c#L1724
For bandwidth, the results should be summed across the threads, but here only the main thread reports its result.
Also, after applying the fix in this PR, the main thread reports the full bandwidth, which doesn't really make sense since the other threads are running and consuming bandwidth as well.
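A minimal sketch of that kind of aggregation (an OpenMP reduction over per-thread bandwidth figures; the variable names and the stand-in per-thread value are illustrative, not UCX's actual reporting code):

/* Illustrative only: sum each thread's bandwidth before reporting, instead
   of printing only the master thread's share.
   Compile with: gcc -fopenmp aggregate.c */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    double total_bw = 0.0;    /* MB/s, aggregated over all threads */

#pragma omp parallel num_threads(4) reduction(+:total_bw)
    {
        /* stand-in for the per-thread result ucx_perftest already computes */
        double thread_bw = 660.0;
        total_bw += thread_bw;
    }

    /* a single aggregated report, rather than one thread's numbers */
    printf("aggregate bandwidth: %.2f MB/s\n", total_bw);
    return 0;
}

In ucx_perftest itself the per-thread numbers would come from each thread's perf context rather than a constant, but the reduce-then-report-once structure is the idea.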

@lyu
Contributor

lyu commented Mar 16, 2020

@alinask You are right, the numbers from the different threads are not aggregated yet. I thought about implementing this but couldn't decide on the best way to report the numbers.

Applying this PR would cause all the other threads to do no work at all, which defeats the purpose of a multi-threaded benchmark. So IMHO the real issue here is the hanging problem discovered by @zhuyj.

If it turns out that the benchmark is stuck in ucp_worker_progress() then I'm afraid that we will need someone else's help, because I actually don't know how to debug from there.

@zhuyj
Author

zhuyj commented Mar 17, 2020

Sorry for the delay; it took me some time to reproduce this hang.
It is still hung right now, and I will keep this test scenario alive. If you need more information, please let me know.

[root@c-141-98-1-005 ~]# gdb lt-ucx_perftest --pid=16920 --nx --quiet --batch -ex 'thread apply all bt'
lt-ucx_perftest: No such file or directory.
[New LWP 16944]
[New LWP 16941]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f9c26675585 in pthread_spin_lock () from /lib64/libpthread.so.0

Thread 3 (Thread 0x7f9c1b596700 (LWP 16941)):
#0 0x00007f9c25d67ea3 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f9c26fe1403 in ucs_event_set_wait (event_set=0x1c260a0, num_events=num_events@entry=0x7f9c1b595e40, timeout_ms=100, event_set_handler=event_set_handler@entry=0x7f9c26fc4ff0 <ucs_async_thread_ev_handler>, arg=arg@entry=0x7f9c1b595e50) at sys/event_set.c:196
#2 0x00007f9c26fc58cc in ucs_async_thread_func (arg=0x1c26050) at async/thread.c:127
#3 0x00007f9c26670ea5 in start_thread () from /lib64/libpthread.so.0
#4 0x00007f9c25d678cd in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7f9c178b0700 (LWP 16944)):
#0 0x00007f9c26675573 in pthread_spin_lock () from /lib64/libpthread.so.0
#1 0x00007f9c2746fb78 in ucs_spin_lock (lock=0x7f9c27806018) at /workspace/yanjunz/ucx/src/ucs/type/spinlock.h:105
#2 ucs_recursive_spin_lock (lock=0x7f9c27806018) at /workspace/yanjunz/ucx/src/ucs/type/spinlock.h:117
#3 ucp_worker_progress (worker=0x7f9c27806010) at core/ucp_worker.c:1957
#4 0x00000000004690c1 in progress_responder (this=<optimized out>) at ucp_tests.cc:89
#5 wait_stream_recv (request=<optimized out>, this=<optimized out>) at ucp_tests.cc:123
#6 recv_stream (datatype=<optimized out>, length=<optimized out>, buf=<optimized out>, ep=<optimized out>, this=<optimized out>) at ucp_tests.cc:464
#7 recv (sn=<optimized out>, datatype=<optimized out>, length=<optimized out>, buffer=<optimized out>, ep=<optimized out>, worker=<optimized out>, this=<optimized out>) at ucp_tests.cc:275
#8 ucp_perf_test_runner<(ucx_perf_cmd_t)9, (ucx_perf_test_type_t)1, 0u>::run_stream_uni (this=this@entry=0x7f9c178afdb0) at ucp_tests.cc:393
#9 0x000000000045bc7c in run (this=0x7f9c178afdb0) at ucp_tests.cc:421
#10 ucp_perf_test_dispatch (perf=<optimized out>) at ucp_tests.cc:549
#11 0x0000000000409286 in ucx_perf_thread_run_test (arg=0x1cbf348) at libperf.c:1713
#12 ucx_perf_thread_spawn._omp_fn.0 () at libperf.c:1765
#13 0x00007f9c2604d405 in ?? () from /lib64/libgomp.so.1
#14 0x00007f9c26670ea5 in start_thread () from /lib64/libpthread.so.0
#15 0x00007f9c25d678cd in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f9c278fa7c0 (LWP 16920)):
#0 0x00007f9c26675585 in pthread_spin_lock () from /lib64/libpthread.so.0
#1 0x00007f9c2746fb78 in ucs_spin_lock (lock=0x7f9c27806018) at /workspace/yanjunz/ucx/src/ucs/type/spinlock.h:105
#2 ucs_recursive_spin_lock (lock=0x7f9c27806018) at /workspace/yanjunz/ucx/src/ucs/type/spinlock.h:117
#3 ucp_worker_progress (worker=0x7f9c27806010) at core/ucp_worker.c:1957
#4 0x00000000004690c1 in progress_responder (this=<optimized out>) at ucp_tests.cc:89
#5 wait_stream_recv (request=<optimized out>, this=<optimized out>) at ucp_tests.cc:123
#6 recv_stream (datatype=<optimized out>, length=<optimized out>, buf=<optimized out>, ep=<optimized out>, this=<optimized out>) at ucp_tests.cc:464
#7 recv (sn=<optimized out>, datatype=<optimized out>, length=<optimized out>, buffer=<optimized out>, ep=<optimized out>, worker=<optimized out>, this=<optimized out>) at ucp_tests.cc:275
#8 ucp_perf_test_runner<(ucx_perf_cmd_t)9, (ucx_perf_test_type_t)1, 0u>::run_stream_uni (this=this@entry=0x7ffffe7d6040) at ucp_tests.cc:393
#9 0x000000000045bc7c in run (this=0x7ffffe7d6040) at ucp_tests.cc:421
#10 ucp_perf_test_dispatch (perf=<optimized out>) at ucp_tests.cc:549
#11 0x0000000000409286 in ucx_perf_thread_run_test (arg=0x1cbb080) at libperf.c:1713
#12 ucx_perf_thread_spawn._omp_fn.0 () at libperf.c:1765
#13 0x000000000040b80a in ucx_perf_thread_spawn (result=0x7ffffe7d63d0, perf=0x1a33f70) at libperf.c:1752
#14 ucx_perf_run (params=params@entry=0x7ffffe7d63d0, result=result@entry=0x7ffffe7d61d0) at libperf.c:1665
#15 0x0000000000406e22 in run_test_recurs (ctx=ctx@entry=0x7ffffe7d63d0, parent_params=parent_params@entry=0x7ffffe7d63d0, depth=depth@entry=0) at perftest.c:1471
#16 0x0000000000404f1d in run_test (ctx=0x7ffffe7d63d0) at perftest.c:1520
#17 main (argc=<optimized out>, argv=<optimized out>) at perftest.c:1578

@zhuyj
Author

zhuyj commented Mar 17, 2020

@alinask @lyu
When I comment out this line - https://github.com/openucx/ucx/blob/master/src/tools/perf/lib/libperf.c#L1724 -
"
1721 #pragma omp master
1722     {
1723         /* Assuming all threads are fairly treated, reporting only tid==0
1724            TODO: aggregate reports */
1725         // ucx_perf_calc_result(perf, result);
1726         rte_call(perf, report, result, perf->params.report_arg, 1);
1727     }
"
the test result is as follows:
"
745449 1.583 1.345 1.345 2904.83 2904.83 743637 743637
1374289 1.575 1.594 1.459 2450.43 2677.63 627311 685474
2003389 1.575 1.593 1.501 2451.45 2602.24 627572 666173
2632338 1.562 1.594 1.523 2450.86 2564.39 627421 656485
3261263 1.578 1.594 1.537 2450.77 2541.67 627397 650667
3889844 1.572 1.595 1.546 2449.42 2526.29 627052 646731
4518782 1.585 1.594 1.553 2450.82 2515.51 627410 643971
0 0.000 0.000 0.000 0.00 0.00 0 0
"
It seems that the block at https://github.com/openucx/ucx/blob/master/src/tools/perf/lib/libperf.c#L1724 only prints the last line.
"
1657     /* Run test */
1658     status = ucx_perf_funcs[params->api].run(perf);
1659     ucx_perf_funcs[params->api].barrier(perf);
1660     if (status == UCS_OK) {
1661         ucx_perf_calc_result(perf, result);
1662         rte_call(perf, report, result, perf->params.report_arg, 1);
1663     }
"
The block above prints the other lines in the test output.

Zhu Yanjun

@lyu
Contributor

lyu commented Mar 17, 2020

@zhuyj Thanks for the backtrace.

Just as I was afraid, it is stuck in ucp_worker_progress(). I am not a UCX expert, and every time our application gets stuck in that function I end up having to find the root cause by other means. We probably need help from one of the UCX developers - @shamisp.

Also, could you please try using mutexes instead of spinlocks by setting UCX_USE_MT_MUTEX=y? It's a long shot, but why not.

Maybe we should close this PR and open an issue to discuss this, since we still don't know the root cause of this problem.

@zhuyj
Author

zhuyj commented Mar 17, 2020

@lyu
The following are the test commands:
"
UCX_NET_DEVICES=mlx5_5:1 UCX_TLS=all UCX_USE_MT_MUTEX=y ./src/tools/perf/ucx_perftest -T 6

UCX_NET_DEVICES=mlx5_5:1 UCX_TLS=all UCX_USE_MT_MUTEX=y ./src/tools/perf/ucx_perftest 1.1.1.4 -t stream_bw -n 5000000 -s 4096 -T 6
"
Sometimes this causes ucx_perftest to hang.

Without "UCX_USE_MT_MUTEX=y", there is no hang.

@lyu
Contributor

lyu commented Mar 17, 2020

@zhuyj Spinlocks are used by default, which is equivalent to not setting UCX_USE_MT_MUTEX=y, so maybe using mutexes increases the probability of hanging? This sounds like a race condition. I will try to reproduce it and wait to see whether others have any ideas.

@zhuyj
Author

zhuyj commented Mar 17, 2020

@lyu
Before the following commit, perftest's multi-threaded mode worked. After this commit, the multi-threaded performance is poor and hangs can occur.
I do not know why.

Sorry - handling this multi-threaded performance problem and the hang is a difficult task.

"
From: lyu lyu@users.noreply.github.com
Date: Sat, 9 Mar 2019 11:12:06 -0500
Subject: [PATCH 1/1] TOOLS/PERF: Fix ucx_perftest's multi-threaded mode

  • Reset perf counters and timers correctly after a multi-threaded warm-up run
  • Make sock/mpi/ext_rte_barrier more robust against non-master threads that arrive late
  • Remove extra ucp_perf_update as its functionality has been fulfilled by a preceding ucx_perf_get_time, and it causes NaNs in ucx_perf_calc_result
  • Disable non-master threads' reports from within ucx_perf_update
"

@lyu
Contributor

lyu commented Mar 17, 2020

@zhuyj Sorry if I didn't make this clear, but #3350 enabled truly multi-threaded benchmarks, which is the right thing to do. The performance is "bad" as a result of thread contention, which is expected.

If you run a multi-threaded benchmark with a large number of iterations with this PR applied, I can assure you that after the initial warm-up iterations all the other threads will be sitting on their hands, doing nothing. As @alinask has pointed out, it makes no sense for a 6-threaded run to report the same bandwidth as a single-threaded run, unless the other threads are not consuming any bandwidth.

This PR would "solve" the problem we are seeing right now only because there is no race condition when a single thread does all the work.

Anyway, I was able to reproduce the issue with UCX_USE_MT_MUTEX=y; let's see what I can do.

@zhuyj
Author

zhuyj commented Mar 17, 2020

@lyu
From your comments, do you mean that multi-threaded mode did not work well before #3350?

@zhuyj zhuyj closed this Mar 17, 2020
@zhuyj zhuyj deleted the topic/perftest_fix branch March 17, 2020 07:35
@lyu
Contributor

lyu commented Mar 17, 2020

@zhuyj No - multi-threaded benchmark runs didn't work at all before #3350.

Now that I have also confirmed the stream_bw hang but couldn't find its cause, you may want to open an issue about this.
