send completion with error with dc_verbs #1569

Closed
alinask opened this issue Jun 5, 2017 · 1 comment

alinask commented Jun 5, 2017

An MTT run hit a 'completion with error', but it did not reproduce later, even after running 220 iterations over three hours.

4 hosts, ppn=16.

Command line to reproduce:

env OMPI_MCA_btl_openib_warn_default_gid_prefix=0 \
    OMPI_MCA_sshmem_verbs_hca_name=mlx5_0:1 \
    OMPI_MCA_btl_openib_if_include=mlx5_0:1 \
    MXM_RDMA_PORTS=mlx5_0:1 \
    UCX_NET_DEVICES=mlx5_0:1 \
    OMPI_MCA_sshmem=verbs \
    OMPI_MCA_sshmem_verbs_shared_mr=2 \
    'OMPI_MCA_coll=^hcoll' \
    OMPI_MCA_coll_hcoll_enable=0 \
    OMPI_MCA_spml=ucx \
    OMPI_MCA_pml=ucx \
    UCX_TLS=dc \
    SHMEM_SYMMETRIC_HEAP_SIZE=1299M \
    srun --cpu_bind=core -m cyclic --mpi=pmix_v1 -n 64 --nodes=4 -p pvegas \
    /hpc/mtr_scrap/users/mtt/scratch/shmem/20170601_205447_1666_122649_vegas06/installs/uvIS/tests/mpp_24/hpc_tests.git/mpp_bench_v1.0/bin/pingpong.shmem 0 4 16

The output is:
[1496344648.548260] [vegas06:14879:0]       dc_verbs.c:614  UCX  ERROR Send completion with error on qp 0x6214: remote access error syndrome 0x88
[1496344648.548334] [vegas06:14879:0]      uct_iface.c:330  UCX  ERROR Error Endpoint timeout was not handled for ep 0x942030
mlx5: vegas06: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008813 08006214 0000f6d2
[1496344648.549828] [vegas06:14872:0]       dc_verbs.c:614  UCX  ERROR Send completion with error on qp 0x6183: remote access error syndrome 0x88
[1496344648.549885] [vegas06:14872:0]      uct_iface.c:330  UCX  ERROR Error Endpoint timeout was not handled for ep 0x944200
mlx5: vegas06: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008813 08006183 000814d2
[1496344648.550088] [vegas06:14871:0]       dc_verbs.c:614  UCX  ERROR Send completion with error on qp 0x61f8: remote access error syndrome 0x88
[1496344648.550123] [vegas06:14871:0]      uct_iface.c:330  UCX  ERROR Error Endpoint timeout was not handled for ep 0x941e30
mlx5: vegas06: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008813 080061f8 000041d2
[1496344648.550273] [vegas06:14873:0]       dc_verbs.c:614  UCX  ERROR Send completion with error on qp 0x61ef: remote access error syndrome 0x88
[1496344648.550311] [vegas06:14873:0]      uct_iface.c:330  UCX  ERROR Error Endpoint timeout was not handled for ep 0x943e80
...
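
The raw "got completion with error" dumps carry the same information as the UCX error lines. Below is a minimal Python sketch for decoding one dump, assuming the 64 bytes follow the layout of libmlx5's `struct mlx5_err_cqe` (vendor error syndrome and syndrome in the low two bytes of word 13, send opcode and QP number in word 14, `op_own` in the low byte of word 15). The function name and the dictionary keys are illustrative only; they are not part of UCX or libmlx5.

```python
import re


def decode_err_cqe(dump: str) -> dict:
    """Decode a 16-word mlx5 error CQE dump (assumed mlx5_err_cqe layout)."""
    # Each dump is four rows of four 32-bit words, printed as 8 hex digits.
    words = [int(w, 16) for w in re.findall(r"\b[0-9a-fA-F]{8}\b", dump)]
    if len(words) != 16:
        raise ValueError(f"expected 16 words, got {len(words)}")
    w13, w14, w15 = words[13], words[14], words[15]
    return {
        "vendor_err_synd": (w13 >> 8) & 0xFF,  # assumed: byte 54 of the CQE
        "syndrome": w13 & 0xFF,                # assumed: byte 55 of the CQE
        "wqe_opcode": w14 >> 24,               # opcode of the failed send WQE
        "qpn": w14 & 0xFFFFFF,                 # QP number, low 24 bits
        "wqe_counter": w15 >> 16,
        "cqe_opcode": (w15 >> 4) & 0xF,        # upper nibble of op_own
    }


if __name__ == "__main__":
    first_dump = """
    00000000 00000000 00000000 00000000
    00000000 00000000 00000000 00000000
    00000000 00000000 00000000 00000000
    00000000 00008813 08006214 0000f6d2
    """
    for key, value in decode_err_cqe(first_dump).items():
        print(f"{key}: {value:#x}")
```

Under that assumed layout, the first dump decodes to QP 0x6214 and vendor syndrome 0x88, which matches the values UCX prints for process 14879. The syndrome byte 0x13 would then be the remote access error code, and send opcode 0x08 would suggest the failing operation was an RDMA write.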

http://hpcweb.lab.mtl.com/hpc/mtr_scrap/users/mtt/scratch/shmem/20170601_205447_1666_122649_vegas06/html/test_stdout_ZmXLTo.txt
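
For triaging the full stdout, or the output of repeated reproduction attempts, a small Python sketch such as the following could pull the process, QP, and syndrome out of each `dc_verbs.c` error line. The regular expression is written against the log lines quoted above and is an assumption about the format, not a UCX-provided parser. It scans the whole text rather than line by line, so it also matches error lines that were fused onto the end of a CQE dump row.

```python
import re

# Pattern modelled on the UCX error lines quoted in this issue (assumed format).
SEND_ERR_RE = re.compile(
    r"\[(?P<ts>\d+\.\d+)\]\s+\[(?P<host>[^:\]]+):(?P<pid>\d+):\d+\]\s+"
    r"(?P<src>\S+):(?P<line>\d+)\s+UCX\s+ERROR\s+"
    r"Send completion with error on qp (?P<qp>0x[0-9a-f]+): "
    r"(?P<reason>.+?) syndrome (?P<synd>0x[0-9a-f]+)"
)


def find_send_errors(text: str) -> list:
    """Return one dict per 'Send completion with error' line found in text."""
    return [m.groupdict() for m in SEND_ERR_RE.finditer(text)]


if __name__ == "__main__":
    sample = (
        "[1496344648.548260] [vegas06:14879:0]       dc_verbs.c:614  UCX  ERROR "
        "Send completion with error on qp 0x6214: remote access error syndrome 0x88\n"
        "00000000 00008813 08006214 0000f6d2"
        "[1496344648.549828] [vegas06:14872:0]       dc_verbs.c:614  UCX  ERROR "
        "Send completion with error on qp 0x6183: remote access error syndrome 0x88\n"
    )
    for err in find_send_errors(sample):
        print(err["host"], err["pid"], err["qp"], err["reason"], err["synd"])
```

An empty result for a given run would mean that this particular error signature did not appear in its output.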

MLNX_OFED_LINUX-3.4-2.1.9.0

alinask added the Bug label Jun 5, 2017
yosefe modified the milestone: v1.3 Jul 1, 2017

alinask commented Aug 28, 2017

Closing as this doesn't reproduce.

alinask closed this as completed Aug 28, 2017