Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support nvidia and nvidia-frontend names when getting device major #330

Merged
merged 2 commits into from
Feb 12, 2024

Conversation

tariq1890
Copy link
Contributor

@tariq1890 tariq1890 commented Feb 2, 2024

This PR fixes #324

It checks the NVIDIA Devices struct constructed from /proc/devices for both nvidia and nvidia-frontend to retrieve the major number.

This enables support for newer driver versions ( >= 550.40 drivers) where the module was renamed while still allowing this to detect the kernel module not being loaded.

Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @tariq1890.

If we were to do the check for nvidia or nvidia-frontend in the devices.Get method as below:

diff --git a/internal/info/proc/devices/devices.go b/internal/info/proc/devices/devices.go
index 95430428..65f3108f 100644
--- a/internal/info/proc/devices/devices.go
+++ b/internal/info/proc/devices/devices.go
@@ -32,8 +32,12 @@ const (
 	NVIDIACTLMinor      = 255
 	NVIDIAModesetMinor  = 254
 
+	// NVIDIADeviceMajor is hardcoded as NV_MAJOR_DEVICE_NUMBER in nvidia-modprobe:
+	// https://github.com/NVIDIA/nvidia-modprobe/blob/d6bce304f30b6661c9ab6a993f49340eafca7a7e/modprobe-utils/nvidia-modprobe-utils.c#L85
+	NVIDIADeviceMajor = 195
+
 	NVIDIAFrontend = Name("nvidia-frontend")
-	NVIDIAGPU      = NVIDIAFrontend
+	NVIDIAGPU      = Name("nvidia")
 	NVIDIACaps     = Name("nvidia-caps")
 	NVIDIAUVM      = Name("nvidia-uvm")
 
@@ -61,12 +65,15 @@ var _ Devices = devices(nil)
 
 // Exists checks if a Device with a given name exists or not
 func (d devices) Exists(name Name) bool {
-	_, exists := d[name]
+	_, exists := d.Get(name)
 	return exists
 }
 
 // Get a Device from Devices
 func (d devices) Get(name Name) (Major, bool) {
+	if name == NVIDIAGPU || name == NVIDIAFrontend {
+		return NVIDIADeviceMajor, true
+	}
 	device, exists := d[name]
 	return device, exists
 }

The change set seems a little easier to reason about. This also doesn't change the interface to the lower level package at all.

The general cleanups that we can make can be made as additional commits on top.

Is there a way that we can verify the failure of the existing code and the fix in a unit test?

@klueska
Copy link
Contributor

klueska commented Feb 2, 2024

Thanks for this @tariq1890.

If we were to do the check for nvidia or nvidia-frontend in the devices.Get method as below:

diff --git a/internal/info/proc/devices/devices.go b/internal/info/proc/devices/devices.go
index 95430428..65f3108f 100644
--- a/internal/info/proc/devices/devices.go
+++ b/internal/info/proc/devices/devices.go
@@ -32,8 +32,12 @@ const (
 	NVIDIACTLMinor      = 255
 	NVIDIAModesetMinor  = 254
 
+	// NVIDIADeviceMajor is hardcoded as NV_MAJOR_DEVICE_NUMBER in nvidia-modprobe:
+	// https://github.com/NVIDIA/nvidia-modprobe/blob/d6bce304f30b6661c9ab6a993f49340eafca7a7e/modprobe-utils/nvidia-modprobe-utils.c#L85
+	NVIDIADeviceMajor = 195
+
 	NVIDIAFrontend = Name("nvidia-frontend")
-	NVIDIAGPU      = NVIDIAFrontend
+	NVIDIAGPU      = Name("nvidia")
 	NVIDIACaps     = Name("nvidia-caps")
 	NVIDIAUVM      = Name("nvidia-uvm")
 
@@ -61,12 +65,15 @@ var _ Devices = devices(nil)
 
 // Exists checks if a Device with a given name exists or not
 func (d devices) Exists(name Name) bool {
-	_, exists := d[name]
+	_, exists := d.Get(name)
 	return exists
 }
 
 // Get a Device from Devices
 func (d devices) Get(name Name) (Major, bool) {
+	if name == NVIDIAGPU || name == NVIDIAFrontend {
+		return NVIDIADeviceMajor, true
+	}
 	device, exists := d[name]
 	return device, exists
 }

The change set seems a little easier to reason about. This also doesn't change the interface to the lower level package at all.

The general cleanups that we can make can be made as additional commits on top.

Is there a way that we can verify the failure of the existing code and the fix in a unit test?

This seems cleaner and more consistent given that the rest of the devices are all looked up using their major device file.

@elezar
Copy link
Member

elezar commented Feb 2, 2024

Another option would be to inject the hardcoded name -> major mappings when we read the /proc/devices file. Something like:

diff --git a/internal/info/proc/devices/devices.go b/internal/info/proc/devices/devices.go
index 95430428..7379f032 100644
--- a/internal/info/proc/devices/devices.go
+++ b/internal/info/proc/devices/devices.go
@@ -89,7 +89,14 @@ func nvidiaDevices(devicesPath string) (Devices, error) {
 	}
 	defer devicesFile.Close()
 
-	return nvidiaDeviceFrom(devicesFile)
+	procDevices, err := nvidiaDeviceFrom(devicesFile)
+	if err != nil {
+		return nil, err
+	}
+	procDevices["nvida"] = 195
+	procDevices["nvidia-frontend"] = 195
+
+	return procDevices, nil
 }
 
 var errNoNvidiaDevices = errors.New("no NVIDIA devices found")

This would prevent us from forgetting to use the .Get or .Exists methods and allow direct element access to work as expected.

@tariq1890
Copy link
Contributor Author

Thanks for the reviews @klueska @elezar

Agreed that this PR is more complex than it needs to be. My only concern is that we are returning Major number of the nvidia device at all times when passing in nvidia/nvidia-frontend as input to the Get function. Shouldn't we address the scenario where there many not be an entry for nvidia gpu device in /proc/devices ?

@tariq1890
Copy link
Contributor Author

Another option would be to inject the hardcoded name -> major mappings when we read the /proc/devices file. Something like:

diff --git a/internal/info/proc/devices/devices.go b/internal/info/proc/devices/devices.go
index 95430428..7379f032 100644
--- a/internal/info/proc/devices/devices.go
+++ b/internal/info/proc/devices/devices.go
@@ -89,7 +89,14 @@ func nvidiaDevices(devicesPath string) (Devices, error) {
 	}
 	defer devicesFile.Close()
 
-	return nvidiaDeviceFrom(devicesFile)
+	procDevices, err := nvidiaDeviceFrom(devicesFile)
+	if err != nil {
+		return nil, err
+	}
+	procDevices["nvida"] = 195
+	procDevices["nvidia-frontend"] = 195
+
+	return procDevices, nil
 }

This would not be an accurate representation of the /proc/devices. Ideally we shouldn't have to mutate the current state

@tariq1890
Copy link
Contributor Author

Is there a way that we can verify the failure of the existing code and the fix in a unit test?

@elezar I have created a PR for this in the meantime #334

@elezar
Copy link
Member

elezar commented Feb 5, 2024

Thanks for the reviews @klueska @elezar

Agreed that this PR is more complex than it needs to be. My only concern is that we are returning Major number of the nvidia device at all times when passing in nvidia/nvidia-frontend as input to the Get function. Shouldn't we address the scenario where there many not be an entry for nvidia gpu device in /proc/devices ?

The fact that we're not maintaining behaviour for when the /proc/devices file does not contain any nvidia-* entries is a valid concern. If we modify the .Get method as follows:

diff --git a/internal/info/proc/devices/devices.go b/internal/info/proc/devices/devices.go
index 95430428..976417ea 100644
--- a/internal/info/proc/devices/devices.go
+++ b/internal/info/proc/devices/devices.go
@@ -33,7 +33,7 @@ const (
 	NVIDIAModesetMinor  = 254
 
 	NVIDIAFrontend = Name("nvidia-frontend")
-	NVIDIAGPU      = NVIDIAFrontend
+	NVIDIAGPU      = Name("nvidia")
 	NVIDIACaps     = Name("nvidia-caps")
 	NVIDIAUVM      = Name("nvidia-uvm")
 
@@ -61,14 +61,28 @@ var _ Devices = devices(nil)
 
 // Exists checks if a Device with a given name exists or not
 func (d devices) Exists(name Name) bool {
-	_, exists := d[name]
+	_, exists := d.Get(name)
 	return exists
 }
 
 // Get a Device from Devices
 func (d devices) Get(name Name) (Major, bool) {
-	device, exists := d[name]
-	return device, exists
+	for _, n := range name.getWithFallback() {
+		device, exists := d[n]
+		if exists {
+			return device, true
+		}
+	}
+	return 0, false
+}
+
+// getWithFallback returns a prioritised list of device names for a specific name.
+// This allows multiple names to be associated with a single name to support various driver versions.
+func (n Name) getWithFallback() []Name {
+	if n == NVIDIAGPU || n == NVIDIAFrontend {
+		return []Name{NVIDIAGPU, NVIDIAFrontend}
+	}
+	return []Name{n}
 }
 
 // GetNVIDIADevices returns the set of NVIDIA Devices on the machine

where we add a getWithFallback() method to Name that allows multiple names to be specified and query each of these. That means for: nvidia or nvidia-frontend we will query []string{nvidia, nvidia-frontend} but still fail if none of these exist in the file.

Thanks for creating #334. Feel free to integrate those changes as part of this PR so that it does not require additional changes once this is merged.

@tariq1890 tariq1890 force-pushed the nvidia-dev-maj-num-lookup branch 3 times, most recently from ad7277b to 1024dae Compare February 6, 2024 06:39
@tariq1890 tariq1890 changed the title lookup the nvidia/nvidia-frontend device by device major number add fallback logic when retrieving major number of the nvidia control device Feb 6, 2024
@tariq1890
Copy link
Contributor Author

Thanks for your patience with this @elezar . I've made the changes and it hopefully addresses all of your concerns.

… device

Signed-off-by: Tariq Ibrahim <tibrahim@nvidia.com>
Signed-off-by: Evan Lezar <elezar@nvidia.com>
Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tariq1890.

I have added a commit on top with some refactoring that should simplify testing this.

@@ -31,11 +31,25 @@ func TestCreateControlDevices(t *testing.T) {

nvidiaDevices := &devices.DevicesMock{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the use of a mock here means that the change to the Get function in proc.Devices are not tested here. I have added a simple commit on top of this that adds a devices.New() function that allows a devices struct to be constructed explicitly. This can then be used in tests to construct two sets of "proc files" and ideally all tests are repeated for each of these.

I think the various functions in proc/devices/devices.go would have to be cleaned up a bit to use the contructor correctly, but this can be considered out of scope for this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed

@@ -46,6 +60,7 @@ func TestCreateControlDevices(t *testing.T) {
root string
devices devices.Devices
mknodeError error
hasError bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally require.ErrorIs should be used. If we're wrapping errors correctly, then things should work as expected.

@@ -126,9 +160,12 @@ func TestCreateControlDevices(t *testing.T) {
d.mknoder = mknode

err := d.CreateNVIDIAControlDevices()
require.ErrorIs(t, err, tc.expectedError)
if tc.hasError {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

require.ErrorIs should hanldle the case where tc.expectedError is nil. The need to do this may have been affected by us not wrapping an error correctly somewhere in the callchain.

Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@elezar elezar changed the title add fallback logic when retrieving major number of the nvidia control device Support nvidia and nvidia-frontend names when getting device major Feb 12, 2024
@elezar elezar merged commit a2a1a78 into NVIDIA:main Feb 12, 2024
8 checks passed
elezar added a commit to elezar/nvidia-container-toolkit that referenced this pull request Feb 12, 2024
Signed-off-by: Evan Lezar <elezar@nvidia.com>
elezar added a commit that referenced this pull request Feb 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for >= 550.40.x drivers to the toolkit container
3 participants