Hello,

On my HGX GPU cluster, the following error appears as soon as training jobs start running. It is affecting the reliability of our AI training runs.

Have you ever seen this error? Do you have any ideas about the cause?
Thanks for your help.
Best
[25033.266922] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.280589] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.300821] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.320988] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.342081] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.360507] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.380740] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.400553] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.420777] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.440911] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.461063] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.481198] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
[25033.501350] nvidia-peermem nv_get_p2p_free_callback:127 ERROR detected invalid context, skipping further processing
modinfo nvidia_peermem
filename: /lib/modules/5.19.0-45-generic/kernel/drivers/video/nvidia-peermem.ko
version: 550.90.07
license: Linux-OpenIB
description: NVIDIA GPU memory plug-in
author: Yishai Hadas
srcversion: 4F8B460B3801C5451579324
depends: nvidia,ib_core
retpoline: Y
name: nvidia_peermem
vermagic: 5.19.0-45-generic SMP preempt mod_unload modversions
parm: peerdirect_support:Set level of support for Peer-direct, 0 [default] or 1 [legacy, for example MLNX_OFED 4.9 LTS] (int)
nvidia-smi
Thu Oct 10 11:39:22 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:04:00.0 Off |                    0 |
| N/A   26C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:23:00.0 Off |                    0 |
| N/A   22C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:43:00.0 Off |                    0 |
| N/A   24C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:64:00.0 Off |                    0 |
| N/A   23C    P0             67W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:84:00.0 Off |                    0 |
| N/A   23C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:A3:00.0 Off |                    0 |
| N/A   23C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:C3:00.0 Off |                    0 |
| N/A   24C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:E4:00.0 Off |                    0 |
| N/A   23C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.2 LTS"

ofed_info -s
MLNX_OFED_LINUX-5.8-5.1.1.2:

uname -r
5.19.0-45-generic
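Since `nv_get_p2p_free_callback: invalid context` messages often accompany a mismatch between the loaded nvidia-peermem module, the NVIDIA driver, and the MLNX_OFED stack, a first diagnostic step is to gather all of those versions in one place. The script below is a sketch using standard tools (`lsmod`, `modinfo`, `ofed_info`, `dmesg`); the version-mismatch heuristic is an assumption about a likely cause, not a confirmed diagnosis, and every command degrades gracefully on machines where the component is absent.

```shell
#!/usr/bin/env bash
# Sketch: collect version info relevant to nvidia-peermem / GPUDirect RDMA issues.
# Assumption: a module/driver/OFED version mismatch is one plausible cause of the
# "invalid context" callbacks; this script only gathers evidence, it changes nothing.
set -u

echo "== loaded peer-memory related modules =="
lsmod | grep -E '^(nvidia_peermem|nvidia|ib_core) ' || echo "nvidia_peermem not loaded"

echo "== nvidia-peermem module version vs. running driver version =="
modinfo -F version nvidia_peermem 2>/dev/null || echo "modinfo: nvidia_peermem not found"
cat /sys/module/nvidia/version 2>/dev/null || echo "nvidia driver not loaded"

echo "== MLNX_OFED stack version =="
ofed_info -s 2>/dev/null || echo "ofed_info not available"

echo "== count of recent peermem callback errors in the kernel log =="
dmesg 2>/dev/null | grep -c 'nv_get_p2p_free_callback' || true
```

If the module version printed by `modinfo` differs from `/sys/module/nvidia/version`, or the OFED release predates the driver, reinstalling the matching nvidia-peermem build (or rebuilding MLNX_OFED against the current kernel and driver) would be the next thing to try.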