Releases: aws/sagemaker-training-toolkit

v4.8.1

09 Sep 16:55

Bug Fixes and Other Changes

  • Added p5 as a supported NCCL instance

v4.8.0

14 Aug 19:02

Features

  • Add support for py39 and py310

Bug Fixes and Other Changes

  • fix typo in the run unit tests command
  • run unit tests sequentially in the release process as well, to prevent coverage conflicts
  • chore: removing unnecessary logging information

v4.7.4

31 Oct 18:03

Bug Fixes and Other Changes

  • update the boto deps to use latest boto

v4.7.3

23 Oct 16:46

Bug Fixes and Other Changes

  • bypass DNS check for studio local exec

v4.7.2

19 Oct 16:46

Bug Fixes and Other Changes

  • use smddprun only if it is installed
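
A guard like this could be written by checking whether the executable is on PATH before adding it to the launch command. The sketch below is an illustration only, not the toolkit's actual code: `build_launch_prefix` is a hypothetical helper, and it assumes `smddprun` is discoverable on PATH when installed.

```python
import shutil


def build_launch_prefix(executable="smddprun"):
    """Return [path] when the executable is found on PATH, else an empty list.

    Hypothetical helper: the caller can prepend the result to a command
    without branching on whether the wrapper is installed.
    """
    path = shutil.which(executable)  # None when the executable is not found
    return [path] if path else []


# Usage: the wrapper is included only when it is actually available.
command = build_launch_prefix() + ["python", "train.py"]
```

Returning an empty list rather than raising keeps the fallback path (running without the wrapper) identical to the normal path.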

v4.7.1

17 Oct 16:46

Bug Fixes and Other Changes

  • Add NCCL_PROTO=simple environment variable to handle the out-of-order data delivery from EFA
  • fix toolkit build failure
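
The NCCL_PROTO change above amounts to setting one environment variable for the training processes. A minimal sketch of how that might be done follows; `with_nccl_proto` is a hypothetical helper, and the choice to leave a user-supplied value untouched is an assumption, not something the release note states.

```python
import os


def with_nccl_proto(env=None, proto="simple"):
    """Return a copy of the environment with NCCL_PROTO set.

    Assumption: an existing NCCL_PROTO value is respected rather than
    overwritten, so users can still opt into another protocol.
    """
    merged = dict(os.environ if env is None else env)
    merged.setdefault("NCCL_PROTO", proto)
    return merged


# Usage: pass the result as env= to subprocess.Popen or subprocess.run.
training_env = with_nccl_proto({"PATH": "/usr/bin"})
```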

v4.7.0

08 Aug 16:46

Features

  • support codeartifact for installing requirements.txt packages

v4.6.1

19 Jun 16:46

Bug Fixes and Other Changes

  • removed unused import statement
  • forgot to run black on torch_distributed.py after updating my comments from last commit
  • Modified my comment on lines 98-103 in torch_distributed.py to comply with formatting standard.
  • Revert "Ran black on entire sagemaker-training-toolkit directory"
  • Ran black on entire sagemaker-training-toolkit directory
  • Ran Black (python formatter) on the files with my code updates (torch_distributed.py and test_torch_distributed.py)
  • Added test for neuron_parallel_compile in test_torch_distributed.py
  • Updated comment syntax based on feedback in pull request as well as added full example of the neuron_parallel_compile command as it would appear in the command line
  • added unit test for neuron_parallel_compile code change
  • Updated torch_distributed.py

v4.6.0

15 Jun 16:46

Features

  • add smddp exception classes in mpi distribution

v4.5.0

26 Apr 16:47

Features

  • add NCCL_PROTO, NCCL_ALGO environments for modelparallel jobs