From 810ae0a812a02d477166629ecde222c7c9ac454f Mon Sep 17 00:00:00 2001
From: "Pavel Shamis (Pasha)"
Date: Mon, 2 Mar 2020 17:44:27 -0600
Subject: [PATCH 1/3] NEWS: Backport of NEWS from master

Signed-off-by: Pavel Shamis (Pasha)
---
 NEWS | 58 ++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 54 insertions(+), 4 deletions(-)

diff --git a/NEWS b/NEWS
index 379e14ad4fc..dad9159dfec 100644
--- a/NEWS
+++ b/NEWS
@@ -1,7 +1,7 @@
 #
-## Copyright (C) Mellanox Technologies Ltd. 2001-2015. ALL RIGHTS RESERVED.
-## Copyright (C) UT-Battelle, LLC. 2014-2015. ALL RIGHTS RESERVED.
-## Copyright (C) ARM Ltd. 2017-2018. ALL RIGHTS RESERVED.
+## Copyright (C) Mellanox Technologies Ltd. 2001-2020. ALL RIGHTS RESERVED.
+## Copyright (C) UT-Battelle, LLC. 2014-2019. ALL RIGHTS RESERVED.
+## Copyright (C) ARM Ltd. 2017-2020. ALL RIGHTS RESERVED.
 ##
 ## See file LICENSE for terms.
 ##
@@ -11,6 +11,56 @@
 Features:
 - TBD
 
+## 1.7.0 (January 19, 2020)
+Features:
+- Added support for multiple listening transports
+- Added UCT socket-based connection manager transport
+- Updated API for UCT component management
+- Added API to retrieve the listening port
+- Added UCP active message API
+- Removed deprecated API for querying UCT memory domains
+- Refactored server/client examples
+- Added support for dlopen interception in UCM
+- Added support for PCIe atomics
+- Updated Java API: added support for most of UCP layer operations
+- Updated support for Mellanox DevX API
+- Added multiple UCT/TCP transport performance optimizations
+- Optimized memcpy() for Intel platforms
+- Added protection from non-UCX socket based app connections
+- Improved search time for PKEY object
+- Enable gtest over IPv6 interfaces
+- Updated Mellanox and Bull device IDs
+- Added support for CUDA_VISIBLE_DEVICES
+- Increased limits for CUDA IPC registration
+
+Bugfixes:
+- Multiple fixes in UCP, UCT, UCM libraries
+- Multiple fixes for BSD and Mac OS systems
+- Fixes for Clang compiler
+- Fixes for CUDA IPC
+- Fix CPU optimization configuration options
+- Fix JUCX build on GPU nodes
+- Fix in Azure release pipeline flow
+- Fix in CUDA memory hooks management
+- Fix in GPU memory peer direct gtest
+- Fix in TCP connection establishment flow
+- Fix in GPU IPC check
+- Fix in CUDA Jenkins test flow
+- Multiple fixes in CUDA IPC flow
+- Fix adding missing header files
+- Fix to prevent failures in presence VPN enabled Ethernet interfaces
+
+## 1.6.1 (September 23, 2019)
+Features:
+- Added Bull Atos HCA device IDs
+- Added Azure Pipelines testing
+
+Bugfixes:
+- Multiple static checker fixes
+- Remove pkg.m4 dependency
+- Multiple clang static checker fixes
+- Fix mem type support with generic datatype
+
 ## 1.6.0 (July 17, 2019)
 Features:
 - Modular architecture for UCT transports
@@ -90,7 +140,7 @@ Tested configurations:
 Features:
 - Improved support for installation with latest ROCm
 - Improved support for latest rdma-core
-- Adding support for CUDA IPC for intra-node GPU
+- Added support for CUDA IPC for intra-node GPU
 - Added support for CUDA memory allocation cache for mem-type detection
 - Added support for latest Mellanox devices
 - Added support for Nvidia GPU managed memory

From fdaca9f3e27d630305d8a1f72e61522a5ca59966 Mon Sep 17 00:00:00 2001
From: "Pavel Shamis (Pasha)"
Date: Mon, 2 Mar 2020 18:15:45 -0600
Subject: [PATCH 2/3] NEWS: Adding first cut of v1.8.0-rc1 news

Signed-off-by: Pavel Shamis (Pasha)
---
 NEWS | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/NEWS b/NEWS
index dad9159dfec..cbcde540adc 100644
--- a/NEWS
+++ b/NEWS
@@ -8,8 +8,26 @@
 #
 
 ## Current
+## 1.8.0-rc1 (TBD)
 Features:
-- TBD
+- Improved detection for DEVX support
+- Improved TCP scalability
+- Added ROCM perftest
+- Added optimized memcpy for ROCM devices
+- Protection from TCP connection from non-UCX applications
+- Added hardware tag-matching for CUDA buffers
+- Added support for CUDA and ROCM managed memories
+
+Bugfixes:
+- Multiple fixes in JUCX
+- Fixes in UCP thread safety
+- Fixes for most recent versions GCC, PGI, and ICC
+- Fixes for CPU affinity on Azure instances
+- Fixes in XPMEM support on PPC64
+- Performance fixes in CUDA IPC
+- Fixes in RDMA CM flows
+- Multiple fixes in TCP TL
+- Multiple fixes in documentation
 
 ## 1.7.0 (January 19, 2020)
 Features:

From 88d60f62dac38258219f38ef74eaffc1b0c50d26 Mon Sep 17 00:00:00 2001
From: "Pavel Shamis (Pasha)"
Date: Fri, 6 Mar 2020 22:28:55 -0600
Subject: [PATCH 3/3] NEWS: Refactoring

- Addressing reviewers comments
- Making sure that we are consistent across the document

Signed-off-by: Pavel Shamis (Pasha)
---
 NEWS | 225 ++++++++++++++++++++++++++++++-----------------------------
 1 file changed, 113 insertions(+), 112 deletions(-)

diff --git a/NEWS b/NEWS
index cbcde540adc..9e69db5150f 100644
--- a/NEWS
+++ b/NEWS
@@ -7,18 +7,26 @@
 ##
 #
 
-## Current
 ## 1.8.0-rc1 (TBD)
-Features:
+### Features:
+#### UCX Core
 - Improved detection for DEVX support
 - Improved TCP scalability
-- Added ROCM perftest
+- Added support for ROCM to perftest
+- Added support for different source and target memory types to perftest
 - Added optimized memcpy for ROCM devices
-- Protection from TCP connection from non-UCX applications
 - Added hardware tag-matching for CUDA buffers
 - Added support for CUDA and ROCM managed memories
-
-Bugfixes:
+- Added support for client/server disconnect protocol over rdma connection manager
+- Added support for striding receive queue for hardware tag-matching
+- Added XPMEM-based rendezvous protocol for shared memory
+- Added support for shared memory communication between containers on the same machine
+- Added support for multi-threaded RDMA memory registration for large regions
+#### UCX Java (API Preview)
+- Added APIs for stream send/recv, tag probe, and connect request handle
+- Added Java package (automatically published) to Maven central
+
+### Bugfixes:
 - Multiple fixes in JUCX
 - Fixes in UCP thread safety
 - Fixes for most recent versions GCC, PGI, and ICC
@@ -26,11 +34,11 @@ Bugfixes:
 - Fixes in XPMEM support on PPC64
 - Performance fixes in CUDA IPC
 - Fixes in RDMA CM flows
-- Multiple fixes in TCP TL
+- Multiple fixes in TCP transport
 - Multiple fixes in documentation
 
 ## 1.7.0 (January 19, 2020)
-Features:
+### Features:
 - Added support for multiple listening transports
 - Added UCT socket-based connection manager transport
 - Updated API for UCT component management
@@ -51,7 +59,7 @@ Features:
 - Added support for CUDA_VISIBLE_DEVICES
 - Increased limits for CUDA IPC registration
 
-Bugfixes:
+### Bugfixes:
 - Multiple fixes in UCP, UCT, UCM libraries
 - Multiple fixes for BSD and Mac OS systems
 - Fixes for Clang compiler
@@ -66,21 +74,21 @@ Bugfixes:
 - Fix in CUDA Jenkins test flow
 - Multiple fixes in CUDA IPC flow
 - Fix adding missing header files
-- Fix to prevent failures in presence VPN enabled Ethernet interfaces
+- Fix to prevent failures in presence of VPN enabled Ethernet interfaces
 
 ## 1.6.1 (September 23, 2019)
-Features:
+### Features:
 - Added Bull Atos HCA device IDs
 - Added Azure Pipelines testing
 
-Bugfixes:
+### Bugfixes:
 - Multiple static checker fixes
 - Remove pkg.m4 dependency
 - Multiple clang static checker fixes
 - Fix mem type support with generic datatype
 
 ## 1.6.0 (July 17, 2019)
-Features:
+### Features:
 - Modular architecture for UCT transports
 - ROCm transport re-design: support for managed memory, direct copy, ROCm GDR
 - Random scheduling policy for DC transport
@@ -89,7 +97,7 @@ Features:
 - Support for PCI atomics with IB transports
 - Reduced UCP address size for homogeneous environments
 
-Bugfixes:
+### Bugfixes:
 - Multiple stability and performance improvements in TCP transport
 - Multiple stability fixes in Verbs and MLX5 transports
 - Multiple stability fixes in UCM memory hooks
@@ -114,20 +122,20 @@ Bugfixes:
 - Fix race condition updating fired_events from multiple threads
 - Fix madvise() hook
 
-Tested configurations:
+### Tested configurations:
 - RDMA: MLNX_OFED 4.5, distribution inbox drivers, rdma-core 22.1
 - CUDA: gdrcopy 1.3.2, cuda 9.2, ROCm 2.2
 - XPMEM: 2.6.2
 - KNEM: 1.1.3
 
 ## 1.5.1 (April 1, 2019)
-Bugfixes:
+### Bugfixes:
 - Fix dc_mlx5 transport support check for inbox libmlx5 drivers - issue #3301
 - Fix compilation warnings with gcc9 and clang
 - ROCm - reduce log level of device-not-found message
 
 ## 1.5.0 (February 14, 2019)
-Features:
+### Features:
 - New emulation mode enabling full UCX functionality (Atomic, Put, Get) over TCP and
   RDMA-CORE interconnects that don't implement full RDMA semantics
 - Non-blocking API for all one-sided operations. All blocking communication APIs marked
@@ -139,7 +147,7 @@ Features:
 - Statistics for UCT tag API
 - GPU-to-Infiniband HCA affinity support based on locality/distance (PCIe)
 
-Bugfixes:
+### Bugfixes:
 - Fix overflow in RC/DC flush operations
 - Update description in SPEC file and README
 - Fix RoCE source port for dc_mlx5 flow control
@@ -147,15 +155,14 @@ Bugfixes:
 - Fix segfault in UCP, due to int truncation in count_one_bits()
 - Multiple other bugfixes (full list on github)
 
-Tested configurations:
+### Tested configurations:
 - InfiniBand: MLNX_OFED 4.4-4.5, distribution inbox drivers, rdma-core
 - CUDA: gdrcopy 1.2, cuda 9.1.85
 - XPMEM: 2.6.2
 - KNEM: 1.1.2
 
 ## 1.4.0-rc2 (October 23, 2018)
-
-Features:
+### Features:
 - Improved support for installation with latest ROCm
 - Improved support for latest rdma-core
 - Added support for CUDA IPC for intra-node GPU
@@ -167,7 +174,7 @@ Features:
   and INADDR_ANY
 - Added support for bitwise atomics operations
 
-Bugfixes:
+### Bugfixes:
 - Performance fixes for rendezvous protocol
 - Memory hook fixes
 - Clang support fixes
@@ -178,37 +185,36 @@ Bugfixes:
 - Segfault fix for a code generated by armclang compiler
 - UCP memory-domain index fix for zero-copy active messages
 
-Tested configurations:
+### Tested configurations:
 - InfiniBand: MLNX_OFED 4.2-4.4, distribution inbox drivers, rdma-core
 - CUDA: gdrcopy 1.2, cuda 9.1.85
 - XPMEM: 2.6.2
 - KNEM: 1.1.2
 - Multiple bugfixes (full list on github)
 
-Known issues:
- #2919 - Segfault in CUDA support when KNEM not present and CMA is active
- intra-node RMA transport. As a workaround user can disable CMA support at
- compile time: --disable-cma. Alternatively user can remove CMA from UCX_TLS
- list, for example: UCX_TLS=mm,rc,cuda_copy,cuda_ipc,gdr_copy.
+### Known issues:
+#2919 - Segfault in CUDA support when KNEM not present and CMA is active
+intra-node RMA transport. As a workaround user can disable CMA support at
+compile time: --disable-cma. Alternatively user can remove CMA from UCX_TLS
+list, for example: UCX_TLS=mm,rc,cuda_copy,cuda_ipc,gdr_copy.
 
 ## 1.3.1 (August 20, 2018)
-
-Bugfixes:
+### Bugfixes:
 - Prevent potential out-of-order sending in shared memory active messages
 - CUDA: Include cudamem.h in source tarball, pass cudaFree memory size
 - Registration cache: fix large range lookup, handle shmat(REMAP)/mmap(FIXED)
 - Limit IB CQE size for specific ARM boards
 - RPM: explicitly set gcc-c++ as requirement
 - Multiple bugfixes (full list on github)
-Tested configurations:
+
+### Tested configurations:
 - InfiniBand: MLNX_OFED 4.2, inbox OFED drivers.
 - CUDA: gdrcopy 1.2, cuda 9.1.85
 - XPMEM: 2.6.2
 - KNEM: 1.1.2
 
 ## 1.3.0 (February 15, 2018)
-
-Features:
+### Features:
 - Added stream-based communication API to UCP
 - Added support for GPU platforms: Nvidia CUDA and AMD ROCm software stacks
 - Added API for client/server based connection establishment
@@ -227,30 +233,31 @@ Features:
 - Add support for external epoll fd and edge-triggered events
 - Added registration cache for knem
 - Initial support for Java bindings
-Bugfixes:
+
+### Bugfixes:
 - Multiple bugfixes (full list on github)
-Tested configurations:
+
+### Tested configurations:
 - InfiniBand: MLNX_OFED 4.2, inbox OFED drivers.
 - CUDA: gdrcopy 1.2, cuda 9.1.85
 - XPMEM: 2.6.2
 - KNEM: 1.1.2
 
-Known issues:
- #2047 - UCP: ucp_do_am_bcopy_multi drops data on UCS_ERROR_NO_RESOURCE
- #2047 - failure in ud/uct_flush_test.am_zcopy_flush_ep_nb/1
- #1977 - failure in shm/test_ucp_rma.blocking_small/0
- #1926 - Timeout in mpi_test_suite with HW TM
- #1920 - transport retry count exceeded in many-to-one tests
- #1689 - Segmentation fault on memory hooks test in jenkins
+### Known issues:
+#2047 - UCP: ucp_do_am_bcopy_multi drops data on UCS_ERROR_NO_RESOURCE
+#2047 - failure in ud/uct_flush_test.am_zcopy_flush_ep_nb/1
+#1977 - failure in shm/test_ucp_rma.blocking_small/0
+#1926 - Timeout in mpi_test_suite with HW TM
+#1920 - transport retry count exceeded in many-to-one tests
+#1689 - Segmentation fault on memory hooks test in jenkins
 
 ## 1.2.2 (January 4, 2018)
-
-Main:
+### Main:
 - Support including UCX API headers from C++ code
 - UD transport to handle unicast flood on RoCE fabric
 - Compilation fixes for gcc 7.1.1, clang 3.6, clang 5
 
-Details:
+### Details:
 - When UD transport is used with RoCE, packets intended for other peers may
   arrive on different adapters (as a result of unicast flooding).
 - This change adds packet filtering based on destination GIDs. Now the packet
@@ -263,79 +270,73 @@ Details:
 - [cleanup] Fixup license headers
 
 ## 1.2.1 (August 28, 2017)
-
+### Bugfixes:
 - Compilation fixes for gcc 7.1
 - Spec file cleanups
 - Versioning cleanups
 
 ## 1.2.0 (June 15, 2017)
-
-Supported platforms
- - Shared memory: KNEM, CMA, XPMEM, SYSV, Posix
- - VERBs over InfiniBand and RoCE.
-   VERBS over other RDMA interconnects (iWarp, OmniPath, etc.) is available
-   for community evaluation and has not been tested in context of this release
- - Cray Gemini and Aries
- - Architectures: x86_64, ARMv8 (64bit), Power64
-Features:
- - Added support for InfiniBand DC and UD transports, including accelerated verbs for Mellanox devices
- - Full support for PGAS/SHMEM interfaces, blocking and non-blocking APIs
- - Support for MPI tag matching, both in software and offload mode
- - Zero copy protocols and rendezvous, registration cache
- - Handling transport errors
- - Flow control for DC/RC
- - Dataypes support: contiguous, IOV, generic
- - Multi-threading support
- - Support for ARMv8 64bit architecture
- - A new API for efficient memory polling
- - Support for malloc-hooks and memory registration caching
-Bugfixes:
- - Multiple bugfixes improving overall stability of the library
-Known issues:
- #1604 - Failure in ud/test_ud_slow_timer.retransmit1/1 with valgrind bug
- #1588 - Fix reading cpuinfo timebase for ppc bug portability training
- #1579 - Ud/test_ud.ca_md test takes too long too complete bug
- #1576 - Failure in ud/test_ud_slow_timer.retransmit1/0 with valgrind bug
- #1569 - Send completion with error with dc_verbs bug
- #1566 - Segfault in malloc_hook.fork on arm bug
- #1565 - Hang in udrc/test_ucp_rma.nonblocking_stream_get_nbi_flush_worker bug
- #1534 - Wireup.c:473 Fatal: endpoint reconfiguration not supported yet bug
- #1533 - Stack overflow under Valgrind 'rc_mlx5/uct_p2p_err_test.local_access_error/0' bug
- #1513 - Hang in MPI_Finalize with UCX_TLS=rc[_x],sm on the bsend2 test bug
- #1504 - Failure in cm/uct_p2p_am_test.am_bcopy/1 bug
- #1492 - Hang when using polling fd bug
- #1489 - Hang on the osu_fop_latency test with RoCE bug
- #1005 - ROcE problem with OMPI direct modex - UD assertion
+### Supported platforms
+- Shared memory: KNEM, CMA, XPMEM, SYSV, Posix
+- VERBs over InfiniBand and RoCE.
+  VERBS over other RDMA interconnects (iWarp, OmniPath, etc.) is available
+  for community evaluation and has not been tested in context of this release
+- Cray Gemini and Aries
+- Architectures: x86_64, ARMv8 (64bit), Power64
+
+### Features:
+- Added support for InfiniBand DC and UD transports, including accelerated verbs for Mellanox devices
+- Full support for PGAS/SHMEM interfaces, blocking and non-blocking APIs
+- Support for MPI tag matching, both in software and offload mode
+- Zero copy protocols and rendezvous, registration cache
+- Handling transport errors
+- Flow control for DC/RC
+- Datatypes support: contiguous, IOV, generic
+- Multi-threading support
+- Support for ARMv8 64bit architecture
+- A new API for efficient memory polling
+- Support for malloc-hooks and memory registration caching
+
+### Bugfixes:
+ - Multiple bugfixes improving overall stability of the library
+
+### Known issues:
+#1604 - Failure in ud/test_ud_slow_timer.retransmit1/1 with valgrind bug
+#1588 - Fix reading cpuinfo timebase for ppc bug portability training
+#1579 - Ud/test_ud.ca_md test takes too long too complete bug
+#1576 - Failure in ud/test_ud_slow_timer.retransmit1/0 with valgrind bug
+#1569 - Send completion with error with dc_verbs bug
+#1566 - Segfault in malloc_hook.fork on arm bug
+#1565 - Hang in udrc/test_ucp_rma.nonblocking_stream_get_nbi_flush_worker bug
+#1534 - Wireup.c:473 Fatal: endpoint reconfiguration not supported yet bug
+#1533 - Stack overflow under Valgrind 'rc_mlx5/uct_p2p_err_test.local_access_error/0' bug
+#1513 - Hang in MPI_Finalize with UCX_TLS=rc[_x],sm on the bsend2 test bug
+#1504 - Failure in cm/uct_p2p_am_test.am_bcopy/1 bug
+#1492 - Hang when using polling fd bug
+#1489 - Hang on the osu_fop_latency test with RoCE bug
+#1005 - ROcE problem with OMPI direct modex - UD assertion
 
 ## 1.1.0 (September 1, 2015)
-
-Workarounds:
-Features:
- - Added support for AM based on FIFO in `mm` shared memory transport
- - Added support for UCT `knem` shared memory transport (http://knem.gforge.inria.fr)
- - Added support for UCT `mm/xpmem` shared memory transport (https://github.com/hjelmn/xpmem)
-
-
-Bugfixes:
-Known issues:
-
+### Workarounds:
+### Features:
+- Added support for AM based on FIFO in `mm` shared memory transport
+- Added support for UCT `knem` shared memory transport (http://knem.gforge.inria.fr)
+- Added support for UCT `mm/xpmem` shared memory transport (https://github.com/hjelmn/xpmem)
 
 ## 1.0.0 (July 22, 2015)
-
-Features:
-
- - Added support for UCT `cma` shared memory transport (Cross-Memory Attatch)
- - Added support for UCT `mm` shared memory transport with mmap/sysv APIs
- - Added support for UCT `rc` transport based on Infiniband/RC with verbs
- - Added support for UCT `mlx5_rc` transport based on Infiniband/RC with accelerated verbs
- - Added support for UCT `cm` transport based on Infiniband/SIDR (Service ID Resolution)
- - Added support for UCT `ugni` transport based on Cray/UGNI
- - Added support for Doxygen based documentation generation
- - Added support for UCP basic protocol layer to fit PGAS paradigm (RMA, AMO)
- - Added ucx_perftest utility to exercise major UCX flows and provide performance metrics
- - Added test script for jenkins (contrib/test_jenkins.sh)
- - Added packaging for RPM/DEB based linux distributions (see contrib/buildrpm.sh)
- - Added Unit-tests infractucture for UCX functionality based on Google Test framework (see test/gtest/)
- - Added initial integration for OpenMPI with UCX for PGAS/SHMEM API
-   (see: https://github.com/openucx/ompi-mirror/pull/1)
- - Added end-to-end testing infrastructure based on MTT (see contrib/mtt/README_MTT)
+### Features:
+- Added support for UCT `cma` shared memory transport (Cross-Memory Attach)
+- Added support for UCT `mm` shared memory transport with mmap/sysv APIs
+- Added support for UCT `rc` transport based on Infiniband/RC with verbs
+- Added support for UCT `mlx5_rc` transport based on Infiniband/RC with accelerated verbs
+- Added support for UCT `cm` transport based on Infiniband/SIDR (Service ID Resolution)
+- Added support for UCT `ugni` transport based on Cray/UGNI
+- Added support for Doxygen based documentation generation
+- Added support for UCP basic protocol layer to fit PGAS paradigm (RMA, AMO)
+- Added ucx_perftest utility to exercise major UCX flows and provide performance metrics
+- Added test script for jenkins (contrib/test_jenkins.sh)
+- Added packaging for RPM/DEB based linux distributions (see contrib/buildrpm.sh)
+- Added Unit-tests infrastructure for UCX functionality based on Google Test framework (see test/gtest/)
+- Added initial integration for OpenMPI with UCX for PGAS/SHMEM API
+  (see: https://github.com/openucx/ompi-mirror/pull/1)
+- Added end-to-end testing infrastructure based on MTT (see contrib/mtt/README_MTT)