OSMEM direct modex problems #1006

Closed
artpol84 opened this issue Sep 23, 2016 · 3 comments

@artpol84
Contributor

OMPI version open-mpi/ompi@917d96b (compiled without debug)
UCX version 69545a1 (default configuration)

Running OSHMEM on a ConnectX-4 adapter with the following command line:

/ompi2/msg/bin/shmemrun -np 896 --mca coll '^hcoll' --mca pml ucx --mca spml ucx --mca mtl '^r2' --mca btl self --mca mpi_add_procs_cutoff 0 --mca pmix_base_async_modex true -x UCX_TLS=dc_x -x SHMEM_SYMMETRIC_HEAP_SIZE=2470M --map-by node hello_oshmem

I'm getting this backtrace:

==== backtrace ====
0 0x000000000000f100 _L_unlock_13()  funlockfile.c:0
1 0x000000000014a1a7 __memcpy_ssse3_back()  :0
2 0x0000000000010f5a ucp_eager_handler()  /openucx-ucx-69545a1/src/ucp/tag/match.h:86
3 0x0000000000010f5a ucp_eager_only_handler()  /openucx-ucx-69545a1/src/ucp/tag/eager_rcv.c:89
4 0x0000000000033535 uct_iface_invoke_am()  /openucx-ucx-69545a1/src/uct/base/uct_iface.h:468
5 0x0000000000033535 uct_rc_mlx5_iface_common_poll_rx()  /openucx-ucx-69545a1/src/uct/ib/rc/accel/rc_mlx5_common.h:154
6 0x0000000000033535 uct_dc_mlx5_iface_progress()  /openucx-ucx-69545a1/src/uct/ib/dc/accel/dc_mlx5.c:513
7 0x00000000000162ae ucs_callbackq_dispatch()  /openucx-ucx-69545a1/src/ucs/datastruct/callbackq.h:263
8 0x00000000000162ae uct_worker_progress()  /openucx-ucx-69545a1/src/uct/base/uct_md.c:229
9 0x000000000000dfc0 ucp_worker_progress()  /openucx-ucx-69545a1/src/ucp/core/ucp_worker.c:546
10 0x000000000000dfc0 ucs_async_check_miss()  /openucx-ucx-69545a1/src/ucs/async/async.h:135
11 0x000000000000dfc0 ucp_worker_progress()  /openucx-ucx-69545a1/src/ucp/core/ucp_worker.c:547
12 0x0000000000002c61 mca_pml_ucx_progress()  ??:0
13 0x000000000002809c opal_progress()  ??:0
14 0x000000000004c7e5 ompi_request_default_wait_all()  ??:0
15 0x0000000000098a1c ompi_coll_base_sendrecv_nonzero_actual()  ??:0
16 0x000000000009834b ompi_coll_base_allgatherv_intra_neighborexchange()  ??:0
17 0x000000000005f56c PMPI_Allgatherv()  ??:0
18 0x0000000000029c91 oshmem_shmem_allgatherv()  ??:0
19 0x000000000000218b mca_spml_ucx_add_procs()  ??:0
20 0x000000000002952b oshmem_shmem_init()  ??:0
21 0x000000000002be04 pshmem_init()  ??:0
22 0x0000000000400af9 main()  hello_oshmem_c.c:39
23 0x0000000000021b15 __libc_start_main()  ??:0
24 0x00000000004009c9 _start()  ??:0
===================

@yosefe
Contributor

yosefe commented Sep 24, 2016

Looks like allgatherv() is passing an invalid buffer pointer/length to UCX.
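
To illustrate the kind of mismatch suspected here: MPI_Allgatherv takes per-rank receive counts and displacements in units of datatype elements, and a block that extends past the receive buffer would make the eager receive path copy out of bounds, which would be consistent with a crash in memcpy. The sketch below is only a sanity check one could run on the arguments before the call. The helper name `allgatherv_layout_ok` and its parameters are hypothetical and are not part of Open MPI, OSHMEM, or UCX. It does not confirm the root cause of this issue.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not an Open MPI or UCX API): returns 1 if every
 * per-rank block described by counts[] and displs[] fits inside a receive
 * buffer that holds buf_elems elements, and 0 otherwise.
 * counts[] and displs[] are in elements, as in MPI_Allgatherv.
 */
int allgatherv_layout_ok(const int *counts, const int *displs,
                         int nprocs, size_t buf_elems)
{
    if (counts == NULL || displs == NULL || nprocs <= 0) {
        return 0;                   /* missing arrays or empty group */
    }
    for (int i = 0; i < nprocs; i++) {
        if (counts[i] < 0 || displs[i] < 0) {
            return 0;               /* negative count or offset */
        }
        size_t end = (size_t)displs[i] + (size_t)counts[i];
        if (end > buf_elems) {
            return 0;               /* block i would run past the buffer */
        }
    }
    return 1;
}
```

For example, with counts {4, 4} and displacements {0, 4}, a buffer of 8 elements passes the check, while a buffer of 7 elements fails it.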

@artpol84
Contributor Author

@yosefe
I don't think this issue is still relevant. Can be closed.

@brminich
Contributor

brminich commented May 25, 2017

I could not reproduce it over many iterations, but I had to specify an additional env var, -x UCX_NET_DEVICES=mlx5_3:1, to get rid of the #1534 symptoms. We could try to reproduce it with the original command line once #1534 is fixed.

yosefe closed this as completed May 25, 2017