[mtt] Timeout in mpi_test_suite with HW TM #1926

Closed
amaslenn opened this issue Oct 18, 2017 · 6 comments

@amaslenn
Contributor

Configuration:

MOFED: MLNX_OFED_LINUX-4.1-4.1.1.0
OMPI: 4.0.0a1
Orion x36 (clx-orion-[022,025,028-032,037-038,045-047,049,053,063-064,066-071,081-090,092-095])

MTT: http://e2e-gw.mellanox.com:4080//hpc/scrap/users/mtt/scratch/hcol/20171016_215200_22084_17122_clx-orion-071/html/test_stdout_zAiLea.txt

Cmd:
mpirun -np 1008 --debug-daemons --display-map --bind-to core --map-by node -mca pml ucx -mca btl_openib_warn_default_gid_prefix 0 -mca btl_openib_if_include mlx5_0:1 --timestamp-output -x HCOLL_IB_IF_INCLUDE=mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x HCOLL_ENABLE_MCAST_ALL=0 -x HCOLL_MCAST_NP=5 -x HCOLL_CONTEXT_CACHE_ENABLE=0 -x UCX_SHM_DEVICES=all -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_ACC_DEVICES=all -x HCOLL_ENABLE_SHARP=0 -x HCOLL_ENABLE_TOPOLOGY=0 -x HCOLL_BCOL_P2P_MCAST_ALLREDUCE_ALG=1 /hpc/scrap/users/mtt/scratch/hcol/20171016_215200_22084_17122_clx-orion-071/installs/gzmC/tests/mpi-test-suite/ompi-tests/mpi_test_suite/mpi_test_suite -x relaxed -t 'Alltoall' -d 'MPI_CONTIGUOUS_INT' -n 300

Output:

Wed Oct 18 16:09:18 2017<stdout>:(Rank:0) tst_test_array[0]:Alltoall
Wed Oct 18 16:09:18 2017<stdout>:P2P tests Alltoall (23/1), comm MPI_COMM_WORLD (1/14), type MPI_CONTIGUOUS_INT (21/1)
Wed Oct 18 16:09:18 2017<stdout>:P2P tests Alltoall (23/1), comm Duplicated MPI_COMM_WORLD (4/14), type MPI_CONTIGUOUS_INT (21/1)
Wed Oct 18 16:09:18 2017<stdout>:[1508332158.841121] [clx-orion-046:7098 :0]    ib_mlx5_log.c:109  UCX  ERROR Error on QP 0x3b83f wqe[598]: Local length (synd 0x1 vend 0x68) opcode SEND
Wed Oct 18 16:09:18 2017<stdout>:[1508332158.841198] [clx-orion-046:7098 :0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0x189eae0 - dc_mlx5/mlx5_0:1
Wed Oct 18 16:09:18 2017<stdout>:[1508332158.875755] [clx-orion-046:7098 :0]        offload.c:476  UCX  ERROR Failed to cancel tag rndv op Endpoint timeout
Wed Oct 18 16:09:18 2017<stdout>:P2P tests Alltoall (23/1), comm Reversed MPI_COMM_WORLD (5/14), type MPI_CONTIGUOUS_INT (21/1)
Wed Oct 18 16:09:18 2017<stddiag>:[clx-orion-046:07098] pml_ucx.c:714 Error: ucx send failed: Endpoint timeout
[clx-orion-046:07098] *** An error occurred in MPI_Isend
[clx-orion-046:07098] *** reported by process [3194683393,140733193388366]
[clx-orion-046:07098] *** on communicator MPI COMMUNICATOR 4 CREATE FROM 0
[clx-orion-046:07098] *** MPI_ERR_OTHER: known error not in list
[clx-orion-046:07098] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[clx-orion-046:07098] ***    and potentially your MPI job)

MXM works:

$mpirun -np 1000 --bind-to core --map-by node -mca pml yalla -mca btl_openib_warn_default_gid_prefix 0 -mca btl_openib_if_include mlx5_0:1 --timestamp-output -x HCOLL_IB_IF_INCLUDE=mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x HCOLL_ENABLE_MCAST_ALL=0 -x HCOLL_MCAST_NP=5 -x HCOLL_CONTEXT_CACHE_ENABLE=0 -x UCX_SHM_DEVICES=all -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_ACC_DEVICES=all -x HCOLL_ENABLE_SHARP=0 -x HCOLL_ENABLE_TOPOLOGY=0 -x HCOLL_BCOL_P2P_MCAST_ALLREDUCE_ALG=1 /hpc/scrap/users/mtt/scratch/hcol/20171016_215200_22084_17122_clx-orion-071/installs/gzmC/tests/mpi-test-suite/ompi-tests/mpi_test_suite/mpi_test_suite -x relaxed -t 'Alltoall' -d 'MPI_CONTIGUOUS_INT' -n 300
Wed Oct 18 16:09:40 2017<stdout>:(Rank:0) tst_test_array[0]:Alltoall
Wed Oct 18 16:09:40 2017<stdout>:P2P tests Alltoall (23/1), comm MPI_COMM_WORLD (1/14), type MPI_CONTIGUOUS_INT (21/1)
Wed Oct 18 16:09:40 2017<stdout>:P2P tests Alltoall (23/1), comm Duplicated MPI_COMM_WORLD (4/14), type MPI_CONTIGUOUS_INT (21/1)
Wed Oct 18 16:09:41 2017<stdout>:P2P tests Alltoall (23/1), comm Reversed MPI_COMM_WORLD (5/14), type MPI_CONTIGUOUS_INT (21/1)
Wed Oct 18 16:09:41 2017<stdout>:P2P tests Alltoall (23/1), comm Halved MPI_COMM_WORLD (6/14), type MPI_CONTIGUOUS_INT (21/1)
Wed Oct 18 16:09:41 2017<stdout>:P2P tests Alltoall (23/1), comm Odd/Even split MPI_COMM_WORLD (7/14), type MPI_CONTIGUOUS_INT (21/1)
Wed Oct 18 16:09:41 2017<stdout>:P2P tests Alltoall (23/1), comm Zero-and-Rest Intercommunicator (8/14), type MPI_CONTIGUOUS_INT (21/1)
Wed Oct 18 16:09:41 2017<stdout>:P2P tests Alltoall (23/1), comm Halved Intercommunicator (12/14), type MPI_CONTIGUOUS_INT (21/1)
Wed Oct 18 16:09:41 2017<stdout>:P2P tests Alltoall (23/1), comm Intracomm merged of the Halved Intercomm (13/14), type MPI_CONTIGUOUS_INT (21/1)
Wed Oct 18 16:09:41 2017<stdout>:P2P tests Alltoall (23/1), comm MPI_COMM_TYPE_SHARED comm (14/14), type MPI_CONTIGUOUS_INT (21/1)
Wed Oct 18 16:09:41 2017<stdout>:Number of failed tests:0

Also reproduced without HCOLL:

$mpirun -np 1000 --bind-to core --map-by node -mca pml ucx -mca coll ^hcoll -mca btl_openib_warn_default_gid_prefix 0 -mca btl_openib_if_include mlx5_0:1 --timestamp-output -x HCOLL_IB_IF_INCLUDE=mlx5_0:1 -x MXM_RDMA_PORTS=mlx5_0:1 -x HCOLL_ENABLE_MCAST_ALL=0 -x HCOLL_MCAST_NP=5 -x HCOLL_CONTEXT_CACHE_ENABLE=0 -x UCX_SHM_DEVICES=all -x UCX_NET_DEVICES=mlx5_0:1 -x UCX_ACC_DEVICES=all -x HCOLL_ENABLE_SHARP=0 -x HCOLL_ENABLE_TOPOLOGY=0 -x HCOLL_BCOL_P2P_MCAST_ALLREDUCE_ALG=1 /hpc/scrap/users/mtt/scratch/hcol/20171016_215200_22084_17122_clx-orion-071/installs/gzmC/tests/mpi-test-suite/ompi-tests/mpi_test_suite/mpi_test_suite -x relaxed -t 'Alltoall' -d 'MPI_CONTIGUOUS_INT' -n 300
Wed Oct 18 16:10:04 2017<stdout>:(Rank:0) tst_test_array[0]:Alltoall
Wed Oct 18 16:10:04 2017<stdout>:P2P tests Alltoall (23/1), comm MPI_COMM_WORLD (1/14), type MPI_CONTIGUOUS_INT (21/1)
Wed Oct 18 16:10:04 2017<stdout>:P2P tests Alltoall (23/1), comm Duplicated MPI_COMM_WORLD (4/14), type MPI_CONTIGUOUS_INT (21/1)
Wed Oct 18 16:10:04 2017<stdout>:P2P tests Alltoall (23/1), comm Reversed MPI_COMM_WORLD (5/14), type MPI_CONTIGUOUS_INT (21/1)
Wed Oct 18 16:10:04 2017<stdout>:P2P tests Alltoall (23/1), comm Halved MPI_COMM_WORLD (6/14), type MPI_CONTIGUOUS_INT (21/1)
Wed Oct 18 16:10:04 2017<stdout>:P2P tests Alltoall (23/1), comm Odd/Even split MPI_COMM_WORLD (7/14), type MPI_CONTIGUOUS_INT (21/1)
Wed Oct 18 16:10:04 2017<stdout>:P2P tests Alltoall (23/1), comm Zero-and-Rest Intercommunicator (8/14), type MPI_CONTIGUOUS_INT (21/1)
Wed Oct 18 16:10:04 2017<stdout>:P2P tests Alltoall (23/1), comm Halved Intercommunicator (12/14), type MPI_CONTIGUOUS_INT (21/1)
Wed Oct 18 16:10:04 2017<stdout>:P2P tests Alltoall (23/1), comm Intracomm merged of the Halved Intercomm (13/14), type MPI_CONTIGUOUS_INT (21/1)
Wed Oct 18 16:10:04 2017<stdout>:[1508332204.790096] [clx-orion-082:5228 :0]    ib_mlx5_log.c:109  UCX  ERROR Error on QP 0xfbe4 wqe[1516]: Local QP operation (synd 0x2 vend 0x68) opcode SEND
Wed Oct 18 16:10:04 2017<stdout>:[1508332204.790188] [clx-orion-082:5228 :0]     ucp_worker.c:380  UCX  ERROR Error Endpoint timeout was not handled for ep 0x141d510 - dc_mlx5/mlx5_0:1
Wed Oct 18 16:10:04 2017<stdout>:[1508332204.800257] [clx-orion-082:5228 :0]        offload.c:476  UCX  ERROR Failed to cancel tag rndv op Endpoint timeout
Wed Oct 18 16:10:04 2017<stdout>:P2P tests Alltoall (23/1), comm MPI_COMM_TYPE_SHARED comm (14/14), type MPI_CONTIGUOUS_INT (21/1)
Wed Oct 18 16:10:04 2017<stdout>:Number of failed tests:0
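To see how far each run got, the communicator index in the progress lines can be compared between the failing UCX run and the passing MXM run. The following is only a rough sketch: the regular expression and the helper name `comms_reached` are illustrative assumptions, not part of mpi_test_suite, and because progress lines come from stdout while the abort message comes from stddiag, the relative ordering between the two streams may not be exact.

```python
import re

# Matches the "comm <name> (<index>/<total>), type" part of a progress line.
PROGRESS = re.compile(r"comm (?P<name>.+?) \((?P<idx>\d+)/(?P<total>\d+)\), type")

def comms_reached(log_text):
    """Return the communicator indices found in the progress lines, in order."""
    return [int(m.group("idx")) for m in PROGRESS.finditer(log_text)]

# Progress lines from the first (failing, pml ucx) run, timestamps stripped.
ucx_run = """\
P2P tests Alltoall (23/1), comm MPI_COMM_WORLD (1/14), type MPI_CONTIGUOUS_INT (21/1)
P2P tests Alltoall (23/1), comm Duplicated MPI_COMM_WORLD (4/14), type MPI_CONTIGUOUS_INT (21/1)
P2P tests Alltoall (23/1), comm Reversed MPI_COMM_WORLD (5/14), type MPI_CONTIGUOUS_INT (21/1)
"""

# Progress lines from the passing pml yalla (MXM) run, timestamps stripped.
mxm_run = ucx_run + """\
P2P tests Alltoall (23/1), comm Halved MPI_COMM_WORLD (6/14), type MPI_CONTIGUOUS_INT (21/1)
P2P tests Alltoall (23/1), comm Odd/Even split MPI_COMM_WORLD (7/14), type MPI_CONTIGUOUS_INT (21/1)
P2P tests Alltoall (23/1), comm Zero-and-Rest Intercommunicator (8/14), type MPI_CONTIGUOUS_INT (21/1)
P2P tests Alltoall (23/1), comm Halved Intercommunicator (12/14), type MPI_CONTIGUOUS_INT (21/1)
P2P tests Alltoall (23/1), comm Intracomm merged of the Halved Intercomm (13/14), type MPI_CONTIGUOUS_INT (21/1)
P2P tests Alltoall (23/1), comm MPI_COMM_TYPE_SHARED comm (14/14), type MPI_CONTIGUOUS_INT (21/1)
"""

# Communicators the MXM run reported but the failing UCX run never reached.
missing = sorted(set(comms_reached(mxm_run)) - set(comms_reached(ucx_run)))
print("UCX run reached:", comms_reached(ucx_run))
print("not reached by UCX run:", missing)
```

Note that the run without HCOLL printed every progress line and "Number of failed tests:0" even though the UCX errors appear in its output, so the progress lines alone do not show whether a run was clean.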
@alinask alinask added Bug MTT MTT Error labels Oct 18, 2017
@yosefe yosefe modified the milestone: v1.3.0 Jan 21, 2018
@brminich
Contributor

brminich commented Feb 5, 2018

Reproduced with the following command:
mpirun -np 64 -mca btl_openib_warn_default_gid_prefix 0 --bind-to core -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 -mca btl_openib_if_include mlx5_0:1 -mca coll '^hcoll' -x UCX_TLS='dc,dc_x' -x UCX_DC_VERBS_TM_ENABLE=y -x UCX_TM_OFFLOAD=y -x LD_PRELOAD="/labhome/mikhailb/ucx/install/lib/libucp.so:/labhome/mikhailb/ucx/install/lib/libuct.so" --map-by node -mca opal_pmix_base_async_modex 0 -mca mpi_add_procs_cutoff 9999 -mca pmix_base_collect_data 0 -x UCX_ZCOPY_THRESH=5000000 -x UCX_SEG_SIZE=250000 -x UCX_RNDV_THRESH=250000 -x UCX_MAX_BCOPY=250000 -x UCX_DC_MLX5_FC_ENABLE=n -x UCX_DC_VERBS_FC_ENABLE=n -x UCX_LOG_FILE=./log-%h-%p.txt -x UCX_LOG_LEVEL_TRIGGER=error ~/tmp/mpi_test_suite -x relaxed -t 'P2P' -c 'All' -d 'All' -n 600

Output:

[clx-orion-002:12385:0:12385]   eager_rcv.c:194  ERROR: Unexpected sync ack received: tag 250002e00003 uuid 5445a00b10db876b
[clx-orion-002:12380:0:12380]   eager_rcv.c:194  ERROR: Unexpected sync ack received: tag 250001000003 uuid 21330e00e51cc509
[clx-orion-002:12384:0:12384] ib_mlx5_log.c:113  FATAL: Error on QP 0x547c wqe[598]: Local length (synd 0x1 vend 0x68) opcode SEND

@yosefe
Contributor

yosefe commented Feb 5, 2018

But this is not the same symptom as the original issue. Also, the original issue used an older UCX, which didn't have TM in dc_x.

@brminich
Contributor

brminich commented Feb 5, 2018

  • "synd 0x1 vend 0x68" is present in both the original problem and my reproducer
  • TM for dc_x is not used in my reproducer either
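The match between the two error signatures can also be checked mechanically. Below is a minimal sketch that pulls the syndrome and vendor codes out of the `ib_mlx5_log.c` lines quoted in this issue. The regular expression and the helper name `parse_mlx5_error` are assumptions made for illustration, not a UCX API, and the pattern is only known to fit the three lines shown here.

```python
import re

# Pattern for the mlx5 completion-error lines quoted in this issue (assumed format).
MLX5_ERR = re.compile(
    r"Error on QP (?P<qp>0x[0-9a-f]+) wqe\[(?P<wqe>\d+)\]: "
    r"(?P<desc>.+?) \(synd (?P<synd>0x[0-9a-f]+) vend (?P<vend>0x[0-9a-f]+)\) "
    r"opcode (?P<opcode>\w+)"
)

def parse_mlx5_error(line):
    """Return the named fields of an mlx5 error line, or None if it does not match."""
    m = MLX5_ERR.search(line)
    return m.groupdict() if m else None

# Lines copied from the logs in this issue.
original = ("[1508332158.841121] [clx-orion-046:7098 :0]    ib_mlx5_log.c:109  UCX  ERROR "
            "Error on QP 0x3b83f wqe[598]: Local length (synd 0x1 vend 0x68) opcode SEND")
no_hcoll = ("[1508332204.790096] [clx-orion-082:5228 :0]    ib_mlx5_log.c:109  UCX  ERROR "
            "Error on QP 0xfbe4 wqe[1516]: Local QP operation (synd 0x2 vend 0x68) opcode SEND")
reproducer = ("[clx-orion-002:12384:0:12384] ib_mlx5_log.c:113  FATAL: "
              "Error on QP 0x547c wqe[598]: Local length (synd 0x1 vend 0x68) opcode SEND")

a = parse_mlx5_error(original)
b = parse_mlx5_error(reproducer)
c = parse_mlx5_error(no_hcoll)

# The original failure and the reproducer share description, syndrome and vendor code.
same_signature = (a["desc"], a["synd"], a["vend"]) == (b["desc"], b["synd"], b["vend"])
print("original vs reproducer match:", same_signature)
print("run without HCOLL:", c["desc"], c["synd"], c["vend"])
```

The run without HCOLL in the original report shows a different syndrome ("Local QP operation", synd 0x2) but the same vendor code 0x68.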

@yosefe
Contributor

yosefe commented Feb 5, 2018

Right, I missed the 0x68.
So you mean that the old one was using UCX_TLS=all, which is roughly equivalent to dc,dc_x today when dc_x/TM is disabled?

@brminich
Contributor

brminich commented Feb 5, 2018

Yes, because we had TM enabled for DC verbs for a while.

@yosefe yosefe changed the title from "[mtt] Error: ucx send failed: Endpoint timeout" to "[mtt] Timeout in mpi_test_suite with HW TM" Feb 5, 2018
@brminich
Contributor

brminich commented Feb 7, 2018

RM issue N1293994


5 participants