Hello,

On my HGX GPU cluster, the following error appears as soon as training jobs start running. It is affecting the reliability of our AI training runs.

Have you ever seen this error? Do you have any ideas about the cause?
Thanks for your help.
Best
[25033.266922] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.280589] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.300821] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.320988] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.342081] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.360507] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.380740] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.400553] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.420777] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.440911] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.461063] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.481198] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.501350] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
modinfo nvidia_peermem
filename: /lib/modules/5.19.0-45-generic/kernel/drivers/video/nvidia-peermem.ko
version: 550.90.07
license: Linux-OpenIB
description: NVIDIA GPU memory plug-in
author: Yishai Hadas
srcversion: 4F8B460B3801C5451579324
depends: nvidia,ib_core
retpoline: Y
name: nvidia_peermem
vermagic: 5.19.0-45-generic SMP preempt mod_unload modversions
parm: peerdirect_support:Set level of support for Peer-direct, 0 [default] or 1 [legacy, for example MLNX_OFED 4.9 LTS] (int)
nvidia-smi
Thu Oct 10 11:39:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:04:00.0 Off |                    0 |
| N/A   26C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:23:00.0 Off |                    0 |
| N/A   22C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:43:00.0 Off |                    0 |
| N/A   24C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:64:00.0 Off |                    0 |
| N/A   23C    P0             67W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:84:00.0 Off |                    0 |
| N/A   23C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:A3:00.0 Off |                    0 |
| N/A   23C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:C3:00.0 Off |                    0 |
| N/A   24C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:E4:00.0 Off |                    0 |
| N/A   23C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.2 LTS"

ofed_info -s
MLNX_OFED_LINUX-5.8-5.1.1.2:

uname -r
5.19.0-45-generic
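Since `nv_get_p2p_free_callback: invalid context` messages often accompany a mismatch between the loaded nvidia-peermem module, the NVIDIA driver, and the MLNX_OFED stack, a first diagnostic step is to gather all of those versions in one place. The script below is a sketch using standard tools (`lsmod`, `modinfo`, `ofed_info`, `dmesg`); the version-mismatch heuristic is an assumption about a likely cause, not a confirmed diagnosis, and every command degrades gracefully on machines where the component is absent.

```shell
#!/usr/bin/env bash
# Sketch: collect version info relevant to nvidia-peermem / GPUDirect RDMA issues.
# Assumption: a module/driver/OFED version mismatch is one plausible cause of the
# "invalid context" callbacks; this script only gathers evidence, it changes nothing.
set -u

echo "== loaded peer-memory related modules =="
lsmod | grep -E '^(nvidia_peermem|nvidia|ib_core) ' || echo "nvidia_peermem not loaded"

echo "== nvidia-peermem module version vs. running driver version =="
modinfo -F version nvidia_peermem 2>/dev/null || echo "modinfo: nvidia_peermem not found"
cat /sys/module/nvidia/version 2>/dev/null || echo "nvidia driver not loaded"

echo "== MLNX_OFED stack version =="
ofed_info -s 2>/dev/null || echo "ofed_info not available"

echo "== count of recent peermem callback errors in the kernel log =="
dmesg 2>/dev/null | grep -c 'nv_get_p2p_free_callback' || true
```

If the module version printed by `modinfo` differs from `/sys/module/nvidia/version`, or the OFED release predates the driver, reinstalling the matching nvidia-peermem build (or rebuilding MLNX_OFED against the current kernel and driver) would be the next thing to try.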