Use nvmlDeviceGetCount_v2() first for CUDA check #9170

qkoziol · 2023-07-25T14:33:23Z

Check for CUDA devices with nvmlDeviceGetCount_v2() first, to avoid the more expensive call to cudaGetDeviceCount() when possible.
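
The idea can be illustrated with a minimal sketch. It is not the code from this PR: the two probe functions below are hypothetical stand-ins for nvmlDeviceGetCount_v2() and cudaGetDeviceCount() so that the example builds without NVML or the CUDA runtime, and the counter only exists to show that the costly probe is skipped. Treating a failed cheap probe the same as "no devices" is an assumption of the sketch.

```c
#include <stdbool.h>

/* Hypothetical test knobs, not part of NVML, CUDA, or libfabric. */
static unsigned int fake_device_count = 0; /* what both probes report */
static int fake_cheap_status = 0;          /* 0 means the cheap probe succeeds */
static int expensive_calls = 0;            /* how often the costly probe ran */

/* Plays the role of nvmlDeviceGetCount_v2(): cheap, returns 0 on success. */
static int cheap_device_count(unsigned int *count)
{
	*count = fake_device_count;
	return fake_cheap_status;
}

/* Plays the role of cudaGetDeviceCount(): costly, returns 0 on success. */
static int expensive_device_count(int *count)
{
	expensive_calls++;
	*count = (int) fake_device_count;
	return 0;
}

/* Returns true only when both probes agree that a device is present. */
static bool cuda_devices_present(void)
{
	unsigned int cheap_count = 0;
	int cuda_count = 0;

	/* Assumption: a failed or empty cheap probe means "no devices". */
	if (cheap_device_count(&cheap_count) != 0 || cheap_count == 0)
		return false;

	/* Only now pay for the expensive probe. */
	if (expensive_device_count(&cuda_count) != 0)
		return false;

	return cuda_count > 0;
}
```

On a node without NVIDIA devices, cuda_devices_present() returns before expensive_device_count() is ever called, which is the saving the description refers to.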

wenduwan · 2023-07-25T15:19:21Z

CI failure looks unrelated...

C:\projects\libfabric\include\ofi_atom.h(420,1): warning C4047: 'function': 'long' differs in levels of indirection from 'int32_t *' (compiling source file prov\hook\src\hook.c) [C:\projects\libfabric\libfabric.vcxproj]
C:\projects\libfabric\include\ofi_atom.h(420,1): warning C4024: '_InterlockedCompareExchange': different types for formal and actual parameter 3 (compiling source file prov\hook\src\hook.c) [C:\projects\libfabric\libfabric.vcxproj]
C:\projects\libfabric\include\ofi_atom.h(420,1): warning C4047: 'function': 'long' differs in levels of indirection from 'int32_t *' (compiling source file prov\hook\perf\src\hook_perf.c) [C:\projects\libfabric\libfabric.vcxproj]
C:\projects\libfabric\include\ofi_atom.h(421,1): warning C4133: 'function': incompatible types - from 'volatile ofi_atomic_int_64_t *' to 'volatile long *' (compiling source file prov\hook\src\hook.c) [C:\projects\libfabric\libfabric.vcxproj]
  Category: Warning
  Code: C4244
  File: C:\projects\libfabric\include\ofi_signal.h
  Line: 107
  Column: 64
  Project name: libfabric
  Project file name: C:\projects\libfabric\libfabric.vcxproj

shefty · 2023-07-25T15:17:43Z

src/hmem_cuda.c

-		break;
+	/* Verify NVIDIA devices are present on the host. */
+	nvml_ret = ofi_nvmlDeviceGetCount_v2(&nvml_device_count);
+	if (NVML_SUCCESS == nvml_ret) {


Just return an error here, rather than indenting the entire function within the if statement. Also, use forward logic for comparisons.

OK, will update

shefty · 2023-07-25T15:18:45Z

src/hmem_cuda.c

-	case cudaErrorNoDevice:
-		return -FI_ENOSYS;
+		if (nvml_device_count > 0) {
+			cudaError_t cuda_ret;


Declare variables at the top of the function. It makes them easier to find.

OK, will update
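
The two review comments above (return early instead of nesting, put the variable before the constant in comparisons, and declare locals at the top) might be applied roughly as follows. This is a sketch of the requested shape only, not the final patch: the stub_* functions and STUB_* constants are hypothetical stand-ins for the NVML and CUDA calls, and -ENOSYS stands in for the -FI_ENOSYS seen in the diff.

```c
#include <errno.h>

/* Hypothetical stubs so the sketch builds without NVML or CUDA. */
enum { STUB_SUCCESS = 0, STUB_FAILURE = 1 };
static int stub_nvml_status = STUB_SUCCESS;
static unsigned int stub_nvml_count = 0;
static int stub_cuda_count = 0;

static int stub_nvml_get_count(unsigned int *count)
{
	*count = stub_nvml_count;
	return stub_nvml_status;
}

static int stub_cuda_get_count(int *count)
{
	*count = stub_cuda_count;
	return STUB_SUCCESS;
}

static int check_cuda_devices(void)
{
	/* All locals are declared at the top of the function. */
	unsigned int nvml_device_count = 0;
	int cuda_device_count = 0;
	int nvml_ret;
	int cuda_ret;

	/* Verify NVIDIA devices are present on the host. */
	nvml_ret = stub_nvml_get_count(&nvml_device_count);
	if (nvml_ret != STUB_SUCCESS)
		return -ENOSYS;	/* early return, no nested block */

	if (nvml_device_count == 0)
		return -ENOSYS;

	/* The rest of the function stays at one indentation level. */
	cuda_ret = stub_cuda_get_count(&cuda_device_count);
	if (cuda_ret != STUB_SUCCESS || cuda_device_count <= 0)
		return -ENOSYS;

	return 0;
}
```

Compared with wrapping the body in `if (NVML_SUCCESS == nvml_ret) { ... }`, the failure paths sit next to the calls that produce them and the main path is not indented.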

shijin-aws · 2023-07-26T15:35:39Z

Please look at the AWS CI failure

qkoziol · 2023-07-26T23:14:54Z

bot:aws:retest

qkoziol · 2023-07-27T14:43:10Z

I'll close this PR and address all the formatting and commit-related issues by submitting another PR with a single commit covering the combined set of changes.

shijin-aws · 2023-07-27T14:57:57Z

> I'll close this PR and address all the formatting and commit-related issues by submitting another PR with a single commit covering the combined set of changes.

Just squash locally and force push it. No need for a new PR.

shefty · 2023-07-27T15:01:46Z

Please don't close the PR. That loses the comments. Just update the original patch and force push.

qkoziol · 2023-07-27T15:43:55Z

Squashed, updated the commit message, and force-pushed

shijin-aws · 2023-07-27T20:46:14Z

We are evaluating the performance of the updated version before merging it.

qkoziol · 2023-07-27T23:01:18Z

Performance is still good, merging now.

qkoziol · 2023-07-27T23:02:04Z

@shefty - Should this be backported to any release branches?

shefty · 2023-07-27T23:11:09Z

I'll let AWS decide that. IMO, it doesn't seem critical enough to backport. I doubt many apps would notice this outside of some benchmarks.

wenduwan requested review from a team and shefty July 25, 2023 14:53

shefty reviewed Jul 25, 2023

shijin-aws reviewed Jul 25, 2023

qkoziol requested review from shefty and shijin-aws July 25, 2023 18:46

shefty reviewed Jul 26, 2023

Check for CUDA devices with nvmlDeviceGetCount_v2() first

fad962b

Checking w/lightweight nvmlDeviceGetCount_v2() call first allows us to avoid the more expensive call to cudaGetDeviceCount() when there's no NVIDIA devices on the node.

Signed-off-by: Quincey Koziol <qkoziol@amazon.com>

qkoziol requested a review from shefty July 27, 2023 15:44

shefty approved these changes Jul 27, 2023

shijin-aws approved these changes Jul 27, 2023

qkoziol merged commit 086b741 into ofiwg:main Jul 27, 2023
8 checks passed

qkoziol deleted the nvml_get_device_count branch July 28, 2023 19:57

qkoziol mentioned this pull request Aug 8, 2023

Issues with CUDA accelerator component initialization open-mpi/ompi#11831

Closed
