Update release notes for 1.7.4-aws release
Signed-off-by: Brian Barrett <bbarrett@amazon.com>
bwbarrett committed Dec 4, 2023
1 parent d66225b commit 87c445a
Showing 1 changed file with 45 additions and 0 deletions.
45 changes: 45 additions & 0 deletions RELEASENOTES.md
@@ -14,6 +14,51 @@ release) was only available in one of the two variants, we note that
in the release notes.


# v1.7.4-aws release notes
This release requires [Libfabric v1.11.0](https://github.com/ofiwg/libfabric/releases/tag/v1.11.0)
or later and supports [NCCL v2.19.3-1](https://github.com/NVIDIA/nccl/releases/tag/v2.19.3-1) while
maintaining backward compatibility with older NCCL versions ([NCCL v2.4.8](https://github.com/NVIDIA/nccl/releases/tag/v2.4.8-1) and later).
It was tested with Libfabric versions up to
[Libfabric v1.19.0](https://github.com/ofiwg/libfabric/releases/tag/v1.19.0).

With NCCL 2.18.5 or later and v1.7.3-aws or later of the plugin,
[NVLink SHARP](https://developer.nvidia.com/blog/upgrading-multi-gpu-interconnectivity-with-the-third-generation-nvidia-nvswitch/)
is enabled for the first time on AWS platforms. NVLink SHARP offloads
the computation part of Allreduce collectives to the NVLink fabric,
and involves a different set of algorithms for multi-node parallelism
than previously used. We have seen NVLink SHARP both help and hurt
application performance. While NVLink SHARP is enabled by default
if NCCL 2.18.5 or later is used, users may wish to disable it by
setting `NCCL_NVLS_ENABLE=0` in the environment of their job.
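
As a minimal sketch of one way to apply this setting from a Python job
launcher, the helper below builds a copy of the environment with
`NCCL_NVLS_ENABLE=0` added. The function name `job_environment` and the
`disable_nvls` parameter are hypothetical and not part of NCCL or the
plugin; only the variable name and the value `0` come from the note above.

```python
import os


def job_environment(base_env=None, disable_nvls=True):
    """Return a copy of the environment for a job (hypothetical helper).

    When disable_nvls is True, NCCL_NVLS_ENABLE is set to "0" as described
    in the release note. Otherwise the variable is left untouched so that
    NCCL applies its own default.
    """
    env = dict(os.environ if base_env is None else base_env)
    if disable_nvls:
        env["NCCL_NVLS_ENABLE"] = "0"
    return env


if __name__ == "__main__":
    # Show the value a child process would inherit.
    print(job_environment()["NCCL_NVLS_ENABLE"])
```

The returned dictionary can be passed as the `env` argument of
`subprocess.run` when starting the job. Whether the setting reaches every
rank depends on how the launcher forwards environment variables.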

New Features:
* Hard fail if GPUDirect RDMA initialization fails on an EC2 instance
that should support GPUDirect RDMA (such as P4d.24xlarge or
P5.48xlarge), rather than fall back to host copy buffers at
significantly reduced performance. Setting the environment variable
`OFI_NCCL_DISABLE_GDR_REQUIRED_CHECK=1` will disable this behavior.
* Change the threshold at which the rdma transport switches from round
robin to striping from 8 KiB to 256 KiB, improving the efficiency of
large message transfers.
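
To illustrate the difference between round robin and striping, here is a
rough sketch and not the plugin's implementation. It assumes that a message
below the threshold is sent whole on the next rail in rotation, that a
message at or above the threshold is split into near-equal segments across
all rails, and that the comparison is `size < threshold`. The names
`plan_transfer` and `next_rail` and the segment layout are hypothetical;
only the 256 KiB value comes from the note above.

```python
STRIPING_THRESHOLD = 256 * 1024  # 256 KiB, the new threshold in this release


def plan_transfer(size, num_rails, next_rail, threshold=STRIPING_THRESHOLD):
    """Return a list of (rail, offset, length) segments for one message.

    Hypothetical model: small messages use a single rail chosen in
    rotation, large messages are striped across every rail.
    """
    if size < threshold:
        # Round robin: the whole message goes to one rail.
        return [(next_rail % num_rails, 0, size)]

    # Striping: split as evenly as possible, giving the first `extra`
    # rails one additional byte each.
    base, extra = divmod(size, num_rails)
    segments = []
    offset = 0
    for rail in range(num_rails):
        length = base + (1 if rail < extra else 0)
        segments.append((rail, offset, length))
        offset += length
    return segments


if __name__ == "__main__":
    print(plan_transfer(8 * 1024, 4, next_rail=5))
    print(plan_transfer(256 * 1024, 4, next_rail=0))
```

Under this model, an 8 KiB message that would have been striped with the
old threshold now travels on a single rail, which avoids issuing several
small per-rail operations for a message of that size.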

Bug Fixes:
* Fix debugging output in some initialization failure cases.
* Request `FI_LOCAL_COMM` feature from Libfabric, as flush and eager
copies are both implemented via local communication.
* Fix initialization when using the Libfabric TCP provider.
* Improve documentation on using the plugin with AWS's Elastic Fabric
Adapter (EFA).
* Improve handling of Neuron device detection when the plugin is used
with Trainium instances.
* Fix segfault in error case of freelist memory growth.
* The test programs that only support 2 ranks now fail with a useful
error message if run with another number of ranks.

Testing:
The plugin has been tested with the following Libfabric providers using the unit tests
bundled in the source code and the [nccl-tests](https://github.com/NVIDIA/nccl-tests) test suite:
* efa

# v1.7.3-aws release notes
This release requires [Libfabric v1.11.0](https://github.com/ofiwg/libfabric/releases/tag/v1.11.0)
or later and supports [NCCL v2.18.5-1](https://github.com/NVIDIA/nccl/releases/tag/v2.18.3-1) while