MTT reported a 'completion with error' failure, but it did not reproduce afterwards, even after running 220 iterations over three hours.
4 hosts, ppn=16.
Command line to reproduce:
env OMPI_MCA_btl_openib_warn_default_gid_prefix=0 OMPI_MCA_sshmem_verbs_hca_name=mlx5_0:1 OMPI_MCA_btl_openib_if_include=mlx5_0:1 MXM_RDMA_PORTS=mlx5_0:1 UCX_NET_DEVICES=mlx5_0:1 OMPI_MCA_sshmem=verbs OMPI_MCA_sshmem_verbs_shared_mr=2 'OMPI_MCA_coll=^hcoll' OMPI_MCA_coll_hcoll_enable=0 OMPI_MCA_spml=ucx OMPI_MCA_pml=ucx UCX_TLS=dc SHMEM_SYMMETRIC_HEAP_SIZE=1299M srun --cpu_bind=core -m cyclic --mpi=pmix_v1 -n 64 --nodes=4 -p pvegas /hpc/mtr_scrap/users/mtt/scratch/shmem/20170601_205447_1666_122649_vegas06/installs/uvIS/tests/mpp_24/hpc_tests.git/mpp_bench_v1.0/bin/pingpong.shmem 0 4 16
The output is:
[1496344648.548260] [vegas06:14879:0] dc_verbs.c:614 UCX ERROR Send completion with error on qp 0x6214: remote access error syndrome 0x88
[1496344648.548334] [vegas06:14879:0] uct_iface.c:330 UCX ERROR Error Endpoint timeout was not handled for ep 0x942030
mlx5: vegas06: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008813 08006214 0000f6d2
[1496344648.549828] [vegas06:14872:0] dc_verbs.c:614 UCX ERROR Send completion with error on qp 0x6183: remote access error syndrome 0x88
[1496344648.549885] [vegas06:14872:0] uct_iface.c:330 UCX ERROR Error Endpoint timeout was not handled for ep 0x944200
mlx5: vegas06: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008813 08006183 000814d2
[1496344648.550088] [vegas06:14871:0] dc_verbs.c:614 UCX ERROR Send completion with error on qp 0x61f8: remote access error syndrome 0x88
[1496344648.550123] [vegas06:14871:0] uct_iface.c:330 UCX ERROR Error Endpoint timeout was not handled for ep 0x941e30
mlx5: vegas06: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008813 080061f8 000041d2
[1496344648.550273] [vegas06:14873:0] dc_verbs.c:614 UCX ERROR Send completion with error on qp 0x61ef: remote access error syndrome 0x88
[1496344648.550311] [vegas06:14873:0] uct_iface.c:330 UCX ERROR Error Endpoint timeout was not handled for ep 0x943e80
...
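Since the failure is intermittent, it may help to tally the errors from a saved MTT stdout file and compare runs. The following is a minimal sketch, not part of UCX or MTT: the helper names `parse_completion_errors` and `summarize` are hypothetical, and the regular expression assumes the log lines keep exactly the format shown above.

```python
import re
from collections import Counter

# Assumption: matches the UCX "Send completion with error" lines in the
# format seen in this report; other UCX versions may word them differently.
LINE_RE = re.compile(
    r"\[(?P<ts>\d+\.\d+)\]\s+\[(?P<host>[^:\]]+):(?P<pid>\d+):\d+\]\s+"
    r"\S+\s+UCX\s+ERROR\s+Send completion with error on qp "
    r"(?P<qp>0x[0-9a-fA-F]+): (?P<reason>.+?) syndrome (?P<syndrome>0x[0-9a-fA-F]+)"
)


def parse_completion_errors(text):
    """Return one dict per matching line: ts, host, pid, qp, reason, syndrome."""
    return [m.groupdict() for m in LINE_RE.finditer(text)]


def summarize(errors):
    """Count errors per (host, reason, syndrome) triple."""
    return Counter((e["host"], e["reason"], e["syndrome"]) for e in errors)
```

For example, `summarize(parse_completion_errors(open(path).read()))` on the stdout file of a failing run (with `path` pointing to a local copy) shows whether all errors come from the same host and carry the same syndrome.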
http://hpcweb.lab.mtl.com/hpc/mtr_scrap/users/mtt/scratch/shmem/20170601_205447_1666_122649_vegas06/html/test_stdout_ZmXLTo.txt
MLNX_OFED_LINUX-3.4-2.1.9.0
Closing as this doesn't reproduce.