transport retry count exceeded in many-to-one tests #1920

Closed
amaslenn opened this issue Oct 17, 2017 · 8 comments

@amaslenn
Contributor

Configuration:

OMPI: 4.0.0a1
Hercules x25 (clx-hercules-[036,054-055,057,059-063,069,073,076,080-082,084-085,087,089-092,097-098,112])

MTT: http://e2e-gw.mellanox.com:4080//mnt/lustre/users/mtt/scratch/shmem/20171017_051617_17042_17161_clx-hercules-054/html/test_stdout_l1nyGu.txt

All devices are up:

$pdsh -w "clx-hercules-[036,054-055,057,059-063,069,073,076,080-082,084-085,087,089-092,097-098,112]" 'ibv_devinfo -d mlx5_0 | grep state' | dshbak -c
----------------
clx-hercules-[036,054-055,057,059-063,069,073,076,080-082,084-085,087,089-092,097-098,112]
----------------
                        state:                  PORT_ACTIVE (4)
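The port-state check above can also be scripted when the `pdsh ... | dshbak -c` output is captured to a file. The following is a minimal sketch, assuming the `state:  PORT_ACTIVE (4)` line format shown in this report; the function name `all_ports_active` is illustrative and not part of any tool used here.

```python
import re

# Matches lines such as "    state:    PORT_ACTIVE (4)" from ibv_devinfo output.
STATE_LINE = re.compile(r"^\s*state:\s+(\S+)\s+\((\d+)\)", re.MULTILINE)


def all_ports_active(devinfo_text):
    """Return True only if at least one state line exists and all are PORT_ACTIVE."""
    states = [m.group(1) for m in STATE_LINE.finditer(devinfo_text)]
    return bool(states) and all(s == "PORT_ACTIVE" for s in states)


if __name__ == "__main__":
    captured = (
        "----------------\n"
        "clx-hercules-[036,054-055]\n"
        "----------------\n"
        "                        state:                  PORT_ACTIVE (4)\n"
    )
    print(all_ports_active(captured))
```

Because `dshbak -c` collapses identical output, a single state line may stand for many nodes; any node group reporting a different state shows up as an additional state line and makes the check fail.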

Cmd:
env OMPI_MCA_btl_openib_warn_default_gid_prefix=0 OMPI_MCA_sshmem_verbs_hca_name=mlx5_0:1 OMPI_MCA_btl_openib_if_include=mlx5_0:1 MXM_RDMA_PORTS=mlx5_0:1 UCX_NET_DEVICES=mlx5_0:1 OMPI_MCA_osc=ucx OMPI_MCA_sshmem=mmap OMPI_MCA_spml_ucx_heap_reg_nb=0 'OMPI_MCA_coll=^hcoll' OMPI_MCA_coll_hcoll_enable=0 OMPI_MCA_spml=ucx OMPI_MCA_pml=ucx UCX_TLS=dc_x SHMEM_SYMMETRIC_HEAP_SIZE=128M srun --cpu_bind=core -m block --mpi=pmi2 -n 25 --nodes=25 -p hercules /mnt/lustre/users/mtt/scratch/shmem/20171017_051617_17042_17161_clx-hercules-054/installs/h5Tu/tests/verifier/tests-mellanox.git/verifier/install/bin/oshmem_test exec --no-colour --task=analysis:tc2 --task=analysis:tc3 --task=analysis:tc4 --task=analysis:tc5 --duration 10

Output:

[clx-hercules-061:15303] OPAL ERROR: Not found in file btl_openib_component.c at line 2441
...
[clx-hercules-061:15303] OPAL ERROR: Not found in file btl_openib_component.c at line 2441
OpenSHMEM Verification Tool ver."1.2.60"
**********************************
* Log file: none
* Host: clx-hercules-036
* Output level: 4
* Log level: 0
**********************************

[1508227003.991871] [clx-hercules-084:26649:0]    ib_mlx5_log.c:109  UCX  ERROR Error on QP 0xe87a wqe[35]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1508227003.991911] [clx-hercules-084:26649:0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0x2366cf0 - dc_mlx5/mlx5_0:1
@amaslenn
Contributor Author

amaslenn commented Oct 17, 2017

One more: http://e2e-gw.mellanox.com:4080/mnt/lustre/users/mtt/scratch/shmem/20171016_201633_7562_17116_clx-hercules-001/html/test_stdout_XUrCql.txt

Cmd:
env OMPI_MCA_btl_openib_warn_default_gid_prefix=0 OMPI_MCA_sshmem_verbs_hca_name=mlx5_0:1 OMPI_MCA_btl_openib_if_include=mlx5_0:1 MXM_RDMA_PORTS=mlx5_0:1 UCX_NET_DEVICES=mlx5_0:1 OMPI_MCA_osc=ucx OMPI_MCA_sshmem=mmap OMPI_MCA_spml_ucx_heap_reg_nb=0 'OMPI_MCA_coll=^hcoll' OMPI_MCA_coll_hcoll_enable=0 OMPI_MCA_spml=ucx OMPI_MCA_pml=ucx UCX_TLS=dc SHMEM_SYMMETRIC_HEAP_SIZE=128M srun --cpu_bind=core -m cyclic --mpi=pmi2 -n 512 --nodes=16 -p hercules /mnt/lustre/users/mtt/scratch/shmem/20171016_201633_7562_17116_clx-hercules-001/installs/Xz6K/tests/misc/hpc_tests.git/miscellaneous/mx_bug 1

Output:

[1508178126.562913] [clx-hercules-116:10453:0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0xdcc9a0 - dc/mlx5_0:1
[1508178126.562872] [clx-hercules-116:10501:0]       dc_verbs.c:633  UCX  ERROR Send completion with error on qp 0x1fc71: transport retry counter exceeded syndrome 0x81
[1508178126.562913] [clx-hercules-116:10501:0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0xdd34b0 - dc/mlx5_0:1
[1508178126.563058] [clx-hercules-116:10497:0]       dc_verbs.c:633  UCX  ERROR Send completion with error on qp 0x1fc47: transport retry counter exceeded syndrome 0x81
[1508178126.563092] [clx-hercules-116:10497:0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0xdd34b0 - dc/mlx5_0:1
[1508178126.563132] [clx-hercules-116:10449:0]       dc_verbs.c:633  UCX  ERROR Send completion with error on qp 0x1fca2: transport retry counter exceeded syndrome 0x81
[1508178126.563138] [clx-hercules-116:10489:0]       dc_verbs.c:633  UCX  ERROR Send completion with error on qp 0x1fc75: transport retry counter exceeded syndrome 0x81
[1508178126.563185] [clx-hercules-116:10449:0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0xdcc9f0 - dc/mlx5_0:1
[1508178126.563185] [clx-hercules-116:10489:0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0xdd34b0 - dc/mlx5_0:1
[1508178126.563257] [clx-hercules-116:10493:0]       dc_verbs.c:633  UCX  ERROR Send completion with error on qp 0x1fc2b: transport retry counter exceeded syndrome 0x81
[1508178126.563290] [clx-hercules-116:10493:0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0xdd34b0 - dc/mlx5_0:1
[clx-hercules-116:10453] pml_ucx.c:714 Error: ucx send failed: Endpoint timeout

@alinask alinask added Bug MTT MTT Error labels Oct 17, 2017
@amaslenn
Contributor Author

amaslenn commented Oct 19, 2017

One more: http://e2e-gw.mellanox.com:4080//mnt/lustre/users/mtt/scratch/shmem/20171019_011142_14786_17674_clx-hercules-018/html/test_stdout_IDrqye.txt

Cmd:
env OMPI_MCA_btl_openib_warn_default_gid_prefix=0 OMPI_MCA_sshmem_verbs_hca_name=mlx5_0:1 OMPI_MCA_btl_openib_if_include=mlx5_0:1 MXM_RDMA_PORTS=mlx5_0:1 UCX_NET_DEVICES=mlx5_0:1 OMPI_MCA_sshmem=verbs 'OMPI_MCA_coll=^hcoll' OMPI_MCA_coll_hcoll_enable=0 OMPI_MCA_spml=ucx OMPI_MCA_pml=ucx UCX_TLS=dc_x SHMEM_SYMMETRIC_HEAP_SIZE=128M srun --cpu_bind=core -m cyclic --mpi=pmi2 -n 1600 --nodes=50 -p hercules /mnt/lustre/users/mtt/scratch/shmem/20171019_011142_14786_17674_clx-hercules-018/installs/7Ps7/tests/osu_micro_benchmark/osu-micro-benchmarks-5.0/openshmem/osu_oshm_atomics heap

Output:

# OSU OpenSHMEM Atomic Operation Rate Test v5.0
# Operation              Million ops/s      Latency (us)
[1508365233.896658] [clx-hercules-047:4362 :0]    ib_mlx5_log.c:109  UCX  ERROR Error on QP 0xc1f wqe[10]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1508365233.896703] [clx-hercules-047:4362 :0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0xaa6cf0 - dc_mlx5/mlx5_0:1
[1508365233.896666] [clx-hercules-047:4370 :0]    ib_mlx5_log.c:109  UCX  ERROR Error on QP 0xc07 wqe[12]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1508365233.896703] [clx-hercules-047:4370 :0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0x10b28c0 - dc_mlx5/mlx5_0:1
[1508365233.896795] [clx-hercules-047:4367 :0]    ib_mlx5_log.c:109  UCX  ERROR Error on QP 0xbf1 wqe[5]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1508365233.896839] [clx-hercules-047:4367 :0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0xb19ac0 - dc_mlx5/mlx5_0:1
[1508365233.896890] [clx-hercules-047:4375 :0]    ib_mlx5_log.c:109  UCX  ERROR Error on QP 0xc10 wqe[813]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1508365233.896929] [clx-hercules-047:4375 :0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0x112a900 - dc_mlx5/mlx5_0:1
[1508365233.898073] [clx-hercules-047:4384 :0]    ib_mlx5_log.c:109  UCX  ERROR Error on QP 0xbeb wqe[14]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1508365233.898110] [clx-hercules-047:4384 :0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0x10b39c0 - dc_mlx5/mlx5_0:1
[1508365233.898120] [clx-hercules-047:4383 :0]    ib_mlx5_log.c:109  UCX  ERROR Error on QP 0xbd9 wqe[436]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1508365233.898126] [clx-hercules-047:4372 :0]    ib_mlx5_log.c:109  UCX  ERROR Error on QP 0xbd0 wqe[55]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1508365233.898159] [clx-hercules-047:4372 :0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0x10b2b50 - dc_mlx5/mlx5_0:1
[1508365233.898154] [clx-hercules-047:4383 :0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0x112b2b0 - dc_mlx5/mlx5_0:1
[1508365233.898670] [clx-hercules-047:4376 :0]    ib_mlx5_log.c:109  UCX  ERROR Error on QP 0xba8 wqe[394]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1508365233.898705] [clx-hercules-047:4376 :0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0x10b2f80 - dc_mlx5/mlx5_0:1
[1508365233.899170] [clx-hercules-047:4380 :0]    ib_mlx5_log.c:109  UCX  ERROR Error on QP 0xba7 wqe[393]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1508365233.899206] [clx-hercules-047:4380 :0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0x10b34a0 - dc_mlx5/mlx5_0:1
[1508365233.904077] [clx-hercules-047:4375 :0]    ib_mlx5_log.c:109  UCX  ERROR Error on QP 0xb70 wqe[18]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1508365233.904084] [clx-hercules-047:4375 :0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1125730 - dc_mlx5/mlx5_0:1
[1508365233.904166] [clx-hercules-047:4367 :0]    ib_mlx5_log.c:109  UCX  ERROR Error on QP 0xc12 wqe[3]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1508365233.904175] [clx-hercules-047:4367 :0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1124ac0 - dc_mlx5/mlx5_0:1
[1508365233.904227] [clx-hercules-047:4372 :0]    ib_mlx5_log.c:109  UCX  ERROR Error on QP 0xbad wqe[400]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1508365233.904232] [clx-hercules-047:4372 :0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0x10b2840 - dc_mlx5/mlx5_0:1
[1508365233.904688] [clx-hercules-047:4376 :0]    ib_mlx5_log.c:109  UCX  ERROR Error on QP 0xb5f wqe[394]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1508365233.904695] [clx-hercules-047:4376 :0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0x10b2c70 - dc_mlx5/mlx5_0:1
srun: forcing job termination
srun: Job step aborted: Waiting up to 12 seconds for job step to finish.
slurmstepd: error: *** STEP 17674.32 ON clx-hercules-018 CANCELLED AT 2017-10-19T04:59:53 ***
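With this many interleaved error lines, it can help to count the retry-exceeded errors per host to see whether the failures cluster on a few nodes. Below is a minimal sketch, assuming the UCX log record layout seen in the outputs above (`[timestamp] [host:pid:tid] file.c:line  UCX  LEVEL message`). The names `RECORD`, `RETRY` and `retry_errors_per_host` are illustrative, not part of UCX.

```python
import re
from collections import Counter

# One UCX log record. The message runs up to the next record header or the
# end of the line, so records fused onto one line are still separated.
RECORD = re.compile(
    r"\[(?P<ts>\d+\.\d+)\] \[(?P<host>[\w.-]+):(?P<pid>\d+)\s*:(?P<tid>\d+)\]"
    r"\s+(?P<src>[\w.]+):(?P<line>\d+)\s+UCX\s+(?P<level>[A-Z]+)\s+"
    r"(?P<msg>.*?)(?=\[\d+\.\d+\] \[|$)",
    re.MULTILINE,
)

# Covers both wordings seen in the logs: "retry count" and "retry counter".
RETRY = re.compile(r"transport retry count(er)? exceeded", re.IGNORECASE)


def retry_errors_per_host(log_text):
    """Count 'transport retry count exceeded' records per host name."""
    counts = Counter()
    for m in RECORD.finditer(log_text):
        if RETRY.search(m.group("msg")):
            counts[m.group("host")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_retry.py < test_stdout.txt
    for host, n in retry_errors_per_host(sys.stdin.read()).most_common():
        print(host, n)
```

The sketch only counts records; it does not say which peer the failing QP was talking to, since the log lines above do not carry that information.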

@evgeny-leksikov evgeny-leksikov self-assigned this Dec 25, 2017
@yosefe yosefe added this to the v1.3.0 milestone Jan 21, 2018
@alex--m
Contributor

alex--m commented Jan 23, 2018

I think I'm getting the same with RC, on 120 orion nodes.

Commandline: salloc -p orion -N 120 mpirun --bind-to core --report-bindings $HPCX_MPI_TESTS_DIR/osu-micro-benchmarks-5.3.2/osu_gather 1> osu_gather-n120-ppn28.txt 2>>err &

Output:

$cat osu_gather-n120-ppn28.txt

# OSU MPI Gather Latency Test v5.3.2
# Size       Avg Latency(us)
1                       0.91
2                       0.87
4                       0.88
8                       0.92
16                      1.02
32                      1.20
64                      1.29
128                     6.72
256                     7.09
512                    10.13
1024                   12.58
2048                   17.51
4096                   27.58
[1516700569.377563] [clx-orion-059:5831 :0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x6179 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.378516] [clx-orion-059:5831 :0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1bdfee0 - rc_mlx5/mlx5_2:1
[1516700569.390309] [clx-orion-114:26149:0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x16495 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.391567] [clx-orion-114:26149:0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1a8f610 - rc_mlx5/mlx5_2:1
[1516700569.489246] [clx-orion-009:4360 :0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x827 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.491537] [clx-orion-009:4360 :0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1d3d980 - rc_mlx5/mlx5_2:1
[1516700569.491547] [clx-orion-009:4360 :0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x822 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.492589] [clx-orion-009:4360 :0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1d19fa0 - rc_mlx5/mlx5_2:1
[1516700569.492597] [clx-orion-009:4360 :0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x825 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.493696] [clx-orion-009:4360 :0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1d34c70 - rc_mlx5/mlx5_2:1
[1516700569.493703] [clx-orion-009:4360 :0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x823 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.494538] [clx-orion-009:4360 :0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1d22a50 - rc_mlx5/mlx5_2:1
[1516700569.494545] [clx-orion-009:4360 :0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x820 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.495325] [clx-orion-009:4360 :0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1d07cd0 - rc_mlx5/mlx5_2:1
[1516700569.495332] [clx-orion-009:4360 :0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x824 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.496029] [clx-orion-009:4360 :0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1d225d0 - rc_mlx5/mlx5_2:1
[1516700569.572416] [clx-orion-011:25238:0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x6211 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.585598] [clx-orion-011:25238:0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1d44e80 - rc_mlx5/mlx5_2:1
[1516700569.585610] [clx-orion-011:25238:0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x620d wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.586469] [clx-orion-011:25238:0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1d2af90 - rc_mlx5/mlx5_2:1
[1516700569.586478] [clx-orion-011:25238:0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x620c wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.587419] [clx-orion-011:25238:0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1d22850 - rc_mlx5/mlx5_2:1
[1516700569.587429] [clx-orion-011:25238:0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x6209 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.588335] [clx-orion-011:25238:0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1cf0b20 - rc_mlx5/mlx5_2:1
[1516700569.588344] [clx-orion-011:25238:0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x620f wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.589068] [clx-orion-011:25238:0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1d44a20 - rc_mlx5/mlx5_2:1
[1516700569.589077] [clx-orion-011:25238:0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x6210 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.594107] [clx-orion-050:11718:0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x6176 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.589815] [clx-orion-011:25238:0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1d45640 - rc_mlx5/mlx5_2:1
[1516700569.598593] [clx-orion-010:31714:0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x61f9 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.595076] [clx-orion-050:11718:0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1c18030 - rc_mlx5/mlx5_2:1
[1516700569.600180] [clx-orion-010:31714:0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1d5e7a0 - rc_mlx5/mlx5_2:1
[1516700569.600192] [clx-orion-010:31714:0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x61f5 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.601083] [clx-orion-010:31714:0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1d33870 - rc_mlx5/mlx5_2:1
[1516700569.601092] [clx-orion-010:31714:0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x61f3 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.602001] [clx-orion-010:31714:0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1cf8f80 - rc_mlx5/mlx5_2:1
[1516700569.602008] [clx-orion-010:31714:0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x61f6 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.602731] [clx-orion-010:31714:0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1d2ab50 - rc_mlx5/mlx5_2:1
[1516700569.602738] [clx-orion-010:31714:0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x61f7 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.603648] [clx-orion-010:31714:0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1d33010 - rc_mlx5/mlx5_2:1
[1516700569.603656] [clx-orion-010:31714:0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x61f4 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.604455] [clx-orion-010:31714:0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1d33400 - rc_mlx5/mlx5_2:1
[1516700569.717505] [clx-orion-060:11291:0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x5104 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.719088] [clx-orion-060:11291:0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1bdf210 - rc_mlx5/mlx5_2:1
[1516700569.759661] [clx-orion-064:1359 :0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x8328 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.760459] [clx-orion-064:1359 :0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1ba8bd0 - rc_mlx5/mlx5_2:1
[1516700569.850845] [clx-orion-014:9165 :0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x609a wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.851900] [clx-orion-014:9165 :0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1d44300 - rc_mlx5/mlx5_2:1
[1516700569.851911] [clx-orion-014:9165 :0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x6098 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.852642] [clx-orion-014:9165 :0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1d54bc0 - rc_mlx5/mlx5_2:1
[1516700569.852651] [clx-orion-014:9165 :0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x6095 wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700569.853268] [clx-orion-014:9165 :0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1d213d0 - rc_mlx5/mlx5_2:1
[1516700582.936080] [clx-orion-039:7128 :0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x614a wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700582.936929] [clx-orion-016:9617 :0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x605a wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700582.936518] [clx-orion-012:12602:0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x603e wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700582.940829] [clx-orion-039:7128 :0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x16c4770 - rc_mlx5/mlx5_2:1
[1516700582.937968] [clx-orion-016:9617 :0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1c07930 - rc_mlx5/mlx5_2:1
[1516700582.937981] [clx-orion-016:9617 :0]    ib_mlx5_log.c:113  UCX  ERROR Error on QP 0x605f wqe[0]: Transport retry count exceeded (synd 0x15 vend 0x81) opcode SEND
[1516700582.938739] [clx-orion-016:9617 :0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x1c21a10 - rc_mlx5/mlx5_2:1
[1516700582.941628] [clx-orion-012:12602:0]     ucp_worker.c:437  UCX  ERROR Error Endpoint timeout was not handled for ep 0x16c5250 - rc_mlx5/mlx5_2:1
8192                 7488.32
16384                8727.08
32768               12060.72
65536               22148.32
131072              40553.41
<hang?>
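The latency table above jumps sharply between the 4096-byte and 8192-byte rows, right where the retry errors appear. The following minimal sketch finds the largest step-to-step ratio in such a table, assuming the two-column `size latency` layout printed by `osu_gather`; the helper names are illustrative.

```python
def parse_latency_table(text):
    """Extract (message_size, avg_latency_us) rows, skipping headers and log lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        try:
            rows.append((int(parts[0]), float(parts[1])))
        except ValueError:
            continue
    return rows


def largest_jump(rows):
    """Return (size_before, size_after, ratio) for the biggest latency increase."""
    best = None
    for (s0, l0), (s1, l1) in zip(rows, rows[1:]):
        ratio = l1 / l0
        if best is None or ratio > best[2]:
            best = (s0, s1, ratio)
    return best


if __name__ == "__main__":
    table = """\
# Size       Avg Latency(us)
2048                   17.51
4096                   27.58
8192                 7488.32
16384                8727.08
"""
    print(largest_jump(parse_latency_table(table)))
```

On the rows reported here, the largest ratio falls between 4096 and 8192 bytes, where the average latency goes from 27.58 us to 7488.32 us, an increase of more than 250x.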

@evgeny-leksikov
Contributor

@alex--m it seems your case was environment- or hcoll-related rather than related to the original issue. I can't reproduce it today, though it was stable during the last 3 days. Could you confirm whether it works for you now as well?

@evgeny-leksikov
Contributor

@amaslenn oshmem_test and mx_bug work fine during a 12h loop; please check the latest MTT logs to see whether it's still reproducible.

@yosefe yosefe modified the milestone: v1.3.0 Jan 28, 2018
@amaslenn
Contributor Author

Don't see these anymore.

@alinask alinask reopened this Jan 29, 2018
@alinask
Contributor

alinask commented Jan 29, 2018

Let's wait for @alex--m's response before closing.

@yosefe yosefe changed the title from "Error transport retry count exceeded" to "transport retry count exceeded in many-to-one tests" Feb 5, 2018
@alex--m
Contributor

alex--m commented Feb 7, 2018

Checked yesterday, looks good now. Let's close it.

@alinask alinask closed this as completed Feb 7, 2018