Update release notes for 1.7.4-aws release

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
aws · Dec 4, 2023 · 87c445a
1 parent d66225b
commit 87c445a
Showing 1 changed file with 45 additions and 0 deletions.
diff --git a/RELEASENOTES.md b/RELEASENOTES.md
@@ -14,6 +14,51 @@ release) was only available in one of the two variants, we note that
 in the release notes.
 
 
+# v1.7.4-aws release notes
+This release requires [Libfabric v1.11.0](https://github.com/ofiwg/libfabric/releases/tag/v1.11.0)
+or later and supports [NCCL v2.19.3-1](https://github.com/NVIDIA/nccl/releases/tag/v2.19.3-1) while
+maintaining backward compatibility with older NCCL versions ([NCCL v2.4.8](https://github.com/NVIDIA/nccl/releases/tag/v2.4.8-1) and later).
+It was tested with Libfabric versions up to
+[Libfabric v1.19.0](https://github.com/ofiwg/libfabric/releases/tag/v1.19.0).
+
+With NCCL 2.18.5 or later and v1.7.3-aws or later of the plugin,
+[NVLink SHARP](https://developer.nvidia.com/blog/upgrading-multi-gpu-interconnectivity-with-the-third-generation-nvidia-nvswitch/)
+is enabled for the first time on AWS platforms.  NVLink SHARP offloads
+the computation part of Allreduce collectives to the NVLink fabric,
+and involves a different set of algorithms for multi-node parallelism
+than previously used.  We have seen NVLink SHARP both help and hurt
+application performance.  While NVLink SHARP is enabled by default
+if NCCL 2.18.5 or later is used, users may wish to disable it by
+setting `NCCL_NVLS_ENABLE=0` in the environment of their job.
+
+New Features:
+* Hard fail if GPUDirect RDMA initialization fails on an EC2 instance
+  that should support GPUDirect RDMA (such as p4d.24xlarge or
+  p5.48xlarge), rather than fall back to host copy buffers at
+  significantly reduced performance.  Setting the environment variable
+  `OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK=1` will disable this behavior.
+* Change the threshold at which the rdma transport switches from round
+  robin to striping from 8 KiB to 256 KiB, improving the efficiency of
+  large message transfers.
+
+Bug Fixes:
+* Fix debugging output in some initialization failure cases.
+* Request `FI_LOCAL_COMM` feature from Libfabric, as flush and eager
+  copies are both implemented via local communication.
+* Fix initialization when using the Libfabric TCP provider.
+* Improve documentation on using the plugin with AWS's Elastic Fabric
+  Adapter (EFA).
+* Improve handling of Neuron device detection when the plugin is used
+  with Trainium instances.
+* Fix segfault in error case of freelist memory growth.
+* The test programs that only support 2 ranks now fail with a useful
+  error message if run with another number of ranks.
+
+Testing:
+The plugin has been tested with the following Libfabric providers using the unit tests
+bundled in the source code and the [nccl-tests](https://github.com/NVIDIA/nccl-tests) test suite:
+* efa
+
 # v1.7.3-aws release notes
 This release requires [Libfabric v1.11.0](https://github.com/ofiwg/libfabric/releases/tag/v1.11.0)
 or later and supports [NCCL v2.18.5-1](https://github.com/NVIDIA/nccl/releases/tag/v2.18.3-1) while