TOOLS/PERF: fix hang and enhance the multi-thread performance #4890
Conversation
Signed-off-by: root <root@c-141-67-100-103.mtl.labs.mlnx>
Mellanox CI: PASSED on 25 workers.
Hello @lyu, a while ago you committed a fix to the UCX project with this PR: #3350. Thanks!
@alinask @zhuyj Therefore, if only the master thread calls into the benchmark loop, only one thread is actually running the benchmark, and you see "higher" performance. This can be verified by checking the other threads' CPU utilization during the benchmark. The title of this PR mentions "hang", which shouldn't happen. Could you please give me more details about the hang? If "hanging" means extremely low performance on the aarch64 platform, then it is probably a known issue; see #3569.
Thanks, lyu. The following are my steps to reproduce the hang.
Server:
Just now I hit a hang of ucx_perftest. This is the scenario; I hope it helps you:

```
UCX_NET_DEVICES=mlx5_0:1 UCX_TLS=all /workspace/yanjunz/ucx/src/tools/perf/ucx_perftest 3.3.3.5 -t stream_bw -n 5000000 -s 4096 -T 2
```
Unfortunately I could not reproduce this hanging issue on our cluster after many runs with my UCX configuration.
The bandwidth and overhead vary a lot between different runs, but it always runs to completion. However, I do notice that the last line of the result (the 5000000-iteration line) takes longer than it should to appear, and your test seems to be stuck right before that line shows up. Could you please provide a backtrace of the program when it's stuck? This is how I get it:

```
gdb lt-ucx_perftest --pid=<pid> --nx --quiet --batch -ex 'thread apply all bt'
```

Thank you!
@lyu Thank you for the detailed response! It seems that the reporting of the results isn't correct when there is more than one thread: https://github.com/openucx/ucx/blob/master/src/tools/perf/lib/libperf.c#L1724
@alinask You are right, the numbers from different threads are not aggregated yet. I thought about implementing this but couldn't decide on the best way to report these numbers. Applying this PR will cause all the other threads to not do any work at all, which defeats the point of multi-threaded benchmarks. So IMHO the real issue here is the hanging problem discovered by @zhuyj. If it turns out that the benchmark is stuck somewhere inside the library, the backtrace should show where.
Sorry for the late reply; it took me some time to reproduce this hang.

```
[root@c-141-98-1-005 ~]# gdb lt-ucx_perftest --pid=16920 --nx --quiet --batch -ex 'thread apply all bt'
Thread 3 (Thread 0x7f9c1b596700 (LWP 16941)):
Thread 2 (Thread 0x7f9c178b0700 (LWP 16944)):
Thread 1 (Thread 0x7f9c278fa7c0 (LWP 16920)):
```
@alinask @lyu
Zhu Yanjun
@zhuyj Thanks for the backtrace. Just as I was afraid, it is stuck inside the library. Also, could you please try to use mutexes instead of spinlocks, by setting UCX_USE_MT_MUTEX=y? Maybe we should close this PR and open an issue to discuss this, since we still don't know the root cause of this problem.
@lyu

```
UCX_NET_DEVICES=mlx5_5:1 UCX_TLS=all UCX_USE_MT_MUTEX=y ./src/tools/perf/ucx_perftest 1.1.1.4 -t stream_bw -n 5000000 -s 4096 -T 6
```

Without "UCX_USE_MT_MUTEX=y", there is no hang.
@zhuyj By default spinlocks are used, which is equivalent to not setting UCX_USE_MT_MUTEX.
@lyu Sorry. It is a difficult task to handle this multi-threaded bad performance and hang.
@zhuyj Sorry if I didn't make this clear, but #3350 enabled truly multi-threaded benchmarks, which is the right thing to do. The performance is "bad" as a result of thread contention, which is expected. You can try to run a multi-threaded benchmark with a large number of iterations with this PR applied; I can assure you that after the initial warm-up iterations, all the other threads will be sitting on their hands, doing nothing. As @alinask has pointed out, it makes no sense for a 6-threaded run to report the same bandwidth as a single-threaded run, unless all the other threads are not consuming any bandwidth. This PR will "solve" the problem we are seeing right now only because there is no race condition when a single thread does all the work. Anyway, I was able to reproduce the issue with UCX_USE_MT_MUTEX=y.
Signed-off-by: Zhu Yanjun yanjunz@mellanox.com