(title changed) OpenMPI bogus warnings "UCX is unable to handle VM_UNMAP" #3686

Closed
zerothi opened this issue Jun 11, 2019 · 26 comments
@zerothi

zerothi commented Jun 11, 2019

I am running:

$> ucx_info -v
# UCT version=1.5.1 revision 7e67a4b
# configured with: --enable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --with-rc --with-ud --with-dc --with-cm --with-mlx5-dv --with-ib-hw-tm --with-dm --with-mcpu --with-march --prefix=/opt/gnu/9.1.0/ucx/1.5.1

and OpenMPI 4.0.1 and GCC 9.1.0.

I am running on a local machine, where most of these options do not apply, but configure just disables the unavailable features.

However, when running the simplest MPI program (only MPI_Init and MPI_Finalize) I get the following error:

[1560244589.726560] [nicpa-dtu:6527 :0]            sys.c:619  UCX  ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
[1560244589.726561] [nicpa-dtu:6528 :0]            sys.c:619  UCX  ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
$> ipcs -l

------ Messages Limits --------
max queues system wide = 32000
max size of message (bytes) = 8192
default max size of queue (bytes) = 16384

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18014398442373116
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 32000
max semaphores per array = 32000
max semaphores system wide = 1024000000
max ops per semop call = 500
semaphore max value = 32767

It seems unrelated to #3023 and #3668 since this is a local machine.
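For reference, the minimal test program described above can be sketched as follows. The exact contents of ./test are not shown in this report, so this is an assumption based on the description (only MPI_Init and MPI_Finalize); the helper name write_mpi_test and the temporary directory are illustrative.

```shell
#!/bin/sh
# Hypothetical helper (not from this report): writes a minimal MPI program
# that only starts and stops the MPI runtime into the given directory.
write_mpi_test() {
    cat > "$1/test.c" <<'EOF'
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);   /* start the MPI runtime */
    MPI_Finalize();           /* shut it down again */
    return 0;
}
EOF
}

workdir=$(mktemp -d)
write_mpi_test "$workdir"

# Print the build and run commands instead of executing them, so the
# script also works on machines without an MPI installation.
echo "mpicc -o $workdir/test $workdir/test.c"
echo "mpirun -np 2 $workdir/test"
```

Running the two printed commands should be enough to trigger the messages, since they appear before any communication takes place.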

@yosefe
Contributor

yosefe commented Jun 11, 2019

actually, it seems like #3023.
what is the output of capsh --print|grep ipc?
can you pls check if UCX master branch works?
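Some background on why the capabilities matter: the flags value in the shmget error can be decoded with the Linux constants IPC_CREAT (01000), IPC_EXCL (02000) and SHM_HUGETLB (04000). According to shmget(2), SHM_HUGETLB fails with EPERM when the caller has neither CAP_IPC_LOCK nor membership in the hugetlb shm group. Whether that is the actual cause here is an inference from the error text. The helper name decode_shm_flags is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: decode a shmget() flags value using the Linux
# constants IPC_CREAT=01000, IPC_EXCL=02000, SHM_HUGETLB=04000 (octal).
decode_shm_flags() {
    flags=$(( $1 ))
    names=""
    if [ $(( flags & 01000 )) -ne 0 ]; then names="$names IPC_CREAT"; fi
    if [ $(( flags & 02000 )) -ne 0 ]; then names="$names IPC_EXCL"; fi
    if [ $(( flags & 04000 )) -ne 0 ]; then names="$names SHM_HUGETLB"; fi
    # The low nine bits are the permission mode, printed in octal.
    printf '%s mode=%o\n' "${names# }" $(( flags & 0777 ))
}

decode_shm_flags 0xfb0
# prints: IPC_CREAT IPC_EXCL SHM_HUGETLB mode=660
```

So the failing call asked for a huge-page segment, which would explain an "Operation not permitted" even though the limits printed by ipcs -l are generous.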

@zerothi
Author

zerothi commented Jun 11, 2019

$> capsh --print|grep ipc
Bounding set =cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read

I'll check with master

@zerothi
Author

zerothi commented Jun 11, 2019

@yosefe but #3023 was fixed before 1.5.1 was released? Ok, i'll try master ;)

@yosefe
Contributor

yosefe commented Jun 11, 2019

@zerothi v1.5.x was branched before the fix (branched in Nov'18, fix was in Jan'19)

@zerothi
Author

zerothi commented Jun 11, 2019

ah, ok. :)

@shamisp
Contributor

shamisp commented Jun 11, 2019

@yosefe - do we want to backport this to v1.5.2?

@yosefe
Contributor

yosefe commented Jun 17, 2019

@shamisp as we discussed on the dev call, we will not port it to v1.5.2

@zerothi, did v1.6.x/master solve the problem?

@zerothi
Author

zerothi commented Jun 17, 2019

@yosefe thanks for following up!

I just tried with 1.6.0-rc2. It seems to be resolved. However, now I get:

[nicpa-dtu:24568] ../../../../../opal/mca/common/ucx/common_ucx.c:146  Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.
[nicpa-dtu:24569] ../../../../../opal/mca/common/ucx/common_ucx.c:146  Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.

which could be OMPI, but it seems related to UCX?

Even after adding the proposed flag, I still get the warning.

@yosefe
Contributor

yosefe commented Jun 17, 2019

@zerothi it's probably a real issue, which UCX detects as of v1.6.x.
Can you try running with "-mca btl ^uct"?
see also #3581

@zerothi
Author

zerothi commented Jun 17, 2019

I did this:

mpirun --mca opal_common_ucx_opal_mem_hooks 1 --mca btl ^uct -np 2 ./test

and got the same:

[nicpa-dtu:29789] ../../../../../opal/mca/common/ucx/common_ucx.c:146  Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. 
[nicpa-dtu:29790] ../../../../../opal/mca/common/ucx/common_ucx.c:146  Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption.

@zerothi
Author

zerothi commented Jun 17, 2019

@yosefe should I move this to the mentioned ticket?

@yosefe
Contributor

yosefe commented Jun 17, 2019

@zerothi can you try just this:
mpirun --mca btl ^uct -np 2 ./test ?
(edit: removed --mca opal_common_ucx_opal_mem_hooks 1)

There is also another issue (fixed in UCX master; @hoopoepg will port the fix to v1.6.x as well): when you pass '--mca opal_common_ucx_opal_mem_hooks 1' as recommended by the warning message, the warning still shows.

@zerothi
Author

zerothi commented Jun 17, 2019

I get the same output:

$> mpirun --mca btl ^uct -np 2 ./test
[nicpa-dtu:20776] ../../../../../opal/mca/common/ucx/common_ucx.c:146  Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.
[nicpa-dtu:20777] ../../../../../opal/mca/common/ucx/common_ucx.c:146  Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.

@zerothi
Author

zerothi commented Jun 17, 2019

You can see my config.log here:
https://gist.github.com/zerothi/a2a7822417e5c312787d8a9f0012a8dd

if it helps?

@yosefe
Contributor

yosefe commented Jun 17, 2019

probably some other OpenMPI component is initializing the memory patcher framework, which overrides the UCX hooks. Can you pls try:
$ mpirun --mca btl self -np 2 ./test ?

Also, is it possible to provide config.log for OpenMPI?

@zerothi
Author

zerothi commented Jun 17, 2019

@yosefe same thing :(

$> mpirun --mca btl self -np 2 ./test
[nicpa-dtu:21644] ../../../../../opal/mca/common/ucx/common_ucx.c:146  Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.
[nicpa-dtu:21645] ../../../../../opal/mca/common/ucx/common_ucx.c:146  Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.

I have amended the gist with ompi-config.log as well.

@yosefe
Contributor

yosefe commented Jun 17, 2019

@zerothi unfortunately, i've tried to reproduce the issue with same versions and configuration on my system, and no luck..

Can you pls rebuild UCX with debug like this (--disable-logging replaced by --enable-logging)?
../configure --enable-optimizations --enable-logging --disable-debug --disable-assertions --disable-params-check --with-rc --with-ud --with-dc --with-dm --with-mcpu --with-march ...

And then run like this:
mpirun -x UCX_MEM_LOG_LEVEL=debug -mca btl self -n 2 ./a.out

This should produce some logging output to help identify the problem

@zerothi
Author

zerothi commented Jun 17, 2019

I got this:

[1560801008.017671] [nicpa-dtu:585]          install.c:124  UCX  DEBUG mmap test: got 0x0 out of 0x2007f
[1560801008.017669] [nicpa-dtu:586]          install.c:124  UCX  DEBUG mmap test: got 0x0 out of 0x2007f
[nicpa-dtu:00585] ../../../../../opal/mca/common/ucx/common_ucx.c:146  Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.
[nicpa-dtu:00586] ../../../../../opal/mca/common/ucx/common_ucx.c:146  Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.

EDIT: I had hidden this output because I had only added --enable-logging; I am showing it again since I got the same result:

$> mpirun -x UCX_MEM_LOG_LEVEL=debug -mca btl self -n 2 ./test

[1560802687.976514] [nicpa-dtu:1701]          install.c:124  UCX  DEBUG mmap test: got 0x0 out of 0x2007f
[1560802687.976560] [nicpa-dtu:1700]          install.c:124  UCX  DEBUG mmap test: got 0x0 out of 0x2007f
[nicpa-dtu:01701] ../../../../../opal/mca/common/ucx/common_ucx.c:146  Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.
[nicpa-dtu:01700] ../../../../../opal/mca/common/ucx/common_ucx.c:146  Warning: UCX is unable to handle VM_UNMAP event. This may cause performance degradation or data corruption. Pls try adding --mca opal_common_ucx_opal_mem_hooks 1 to mpirun/oshrun command line to resolve this issue.

Here is the header of my config.log:

$ /home/nicpa/installation/bash-build/.compile/ucx-1.6.0/contrib/../configure --enable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-logging --enable-optimizations --disable-debug --disable--assertions --disable-param-check --with-rc --with-ud --with-dc --with-dm --with-mcpu --with-march --prefix=/opt/gnu/9.1.0/ucx/1.6.0
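Note that the configure line above passes both --disable-logging and --enable-logging. With autoconf-generated configure scripts the option given last normally takes effect, but contradictory pairs are easy to miss in a long command line. The sketch below lists features that appear in both forms; find_toggle_conflicts is a hypothetical helper name, and it only looks at plain --enable-X / --disable-X spellings.

```shell
#!/bin/sh
# Hypothetical helper: read a configure command line on stdin and print
# every feature that is passed both as --enable-X and as --disable-X.
find_toggle_conflicts() {
    awk '
    {
        for (i = 1; i <= NF; i++) {
            # "--enable-" is 9 characters long, "--disable-" is 10.
            if ($i ~ /^--enable-[a-z]/)  { en[substr($i, 10)] = 1 }
            if ($i ~ /^--disable-[a-z]/) { dis[substr($i, 11)] = 1 }
        }
    }
    END { for (n in en) if (n in dis) print n }' | sort
}

echo "../configure --enable-optimizations --disable-logging --disable-debug --enable-logging --prefix=/opt/ucx" |
    find_toggle_conflicts
# prints: logging
```

A malformed spelling such as a doubled dash after --disable is skipped by the pattern, so it would need a separate check.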

yosefe added the Bug label Jun 18, 2019

@zerothi
Author

zerothi commented Jun 18, 2019

@yosefe If you need anything more, please do not hesitate to contact me :)

@yosefe
Contributor

yosefe commented Jun 18, 2019

@zerothi i guess you don't have any IB/RDMA device, or knem driver, right?
It seems the issue is a bogus warning that appears when memhooks are not installed because the rcache was not created. I'm preparing a fix for it and will update when I have a PR.
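A quick way to answer the device question on a Linux machine is sketched below. It assumes the conventional locations /sys/class/infiniband (one entry per RDMA device) and /dev/knem (the knem character device); both paths and the helper name list_rdma_knem are assumptions and may differ on a given system. The optional argument is an alternative root directory, which makes the function easy to try against a fake tree.

```shell
#!/bin/sh
# Hypothetical helper: report RDMA devices and the knem device, if any.
# $1 is an optional root prefix (empty means the real filesystem).
list_rdma_knem() {
    root=${1:-}
    found=0
    if [ -d "$root/sys/class/infiniband" ]; then
        for dev in "$root"/sys/class/infiniband/*; do
            [ -e "$dev" ] || continue      # glob matched nothing
            echo "ib: ${dev##*/}"
            found=1
        done
    fi
    if [ -e "$root/dev/knem" ]; then
        echo "knem: present"
        found=1
    fi
    if [ "$found" -eq 0 ]; then echo "none"; fi
}

list_rdma_knem
```

On a plain laptop or desktop this is expected to print "none", which matches the setup in which the warning was reported.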

@zerothi
Author

zerothi commented Jun 18, 2019

Correct, I don't have any such device or driver on my machine. :)
I am building a "software stack" that I first check on my own laptop/desktop, so I don't have the same setup with advanced devices there. :)

@zerothi
Author

zerothi commented Jun 18, 2019

My guess is that you could actually run CI tests on travis/azure/... for such a non-use case? (just an idea)

yosefe changed the title from: UCX ERROR shmget Operation not permitted, on local machine; to: (title changed) OpenMPI bogus warnings "UCX is unable to handle VM_UNMAP"; on Jun 18, 2019

@yosefe
Contributor

yosefe commented Jun 18, 2019

@zerothi the warnings should be fixed by #3716

@yosefe
Contributor

yosefe commented Jun 20, 2019

@zerothi FYI, ucx v1.6.0-rc3 contains a fix for this issue. any chance you can give it a try?

@zerothi
Author

zerothi commented Jun 20, 2019

I am running the installation now and will report back! ;) Thanks

@zerothi
Author

zerothi commented Jun 20, 2019

Now I get:

$> mpirun -np 2 ./test
<nothing>

$> mpirun -x UCX_MEM_LOG_LEVEL=debug -mca btl self -n 2 ./test
[1561016808.820435] [nicpa-dtu:8984]          install.c:192  UCX  DEBUG mmap test: got 0x0 out of 0x0
[1561016808.820435] [nicpa-dtu:8985]          install.c:192  UCX  DEBUG mmap test: got 0x0 out of 0x0

I guess this means it can be closed!
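The "mmap test" debug lines from the different builds can be compared mechanically. The sketch below assumes the first hex value is the set of memory events that were actually received and the second the set that was expected; that reading is an assumption and is not confirmed anywhere in this thread. summarize_mmap_test is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read UCX debug output on stdin and print one
# summary line per "mmap test" message, noting whether both values agree.
summarize_mmap_test() {
    sed -n 's/.*\[[^]:]*:\([0-9]*\)\].*mmap test: got \(0x[0-9a-fA-F]*\) out of \(0x[0-9a-fA-F]*\).*/\1 \2 \3/p' |
    while read -r pid got want; do
        if [ "$got" = "$want" ]; then status=match; else status=mismatch; fi
        echo "pid=$pid got=$got expected=$want $status"
    done
}

summarize_mmap_test <<'EOF'
[1560801008.017671] [nicpa-dtu:585]          install.c:124  UCX  DEBUG mmap test: got 0x0 out of 0x2007f
[1561016808.820435] [nicpa-dtu:8984]          install.c:192  UCX  DEBUG mmap test: got 0x0 out of 0x0
EOF
# prints:
# pid=585 got=0x0 expected=0x2007f mismatch
# pid=8984 got=0x0 expected=0x0 match
```

In real use the input would come from piping the mpirun output (with UCX_MEM_LOG_LEVEL=debug) into the function.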

yosefe closed this as completed Jun 20, 2019