
Getting nvidia-device-plugin container CrashLoopBackOff | version v0.14.0 | container runtime : containerd #406

Open
@DineshwarSingh

Description


Getting an nvidia-device-plugin container CrashLoopBackOff error, using k8s-device-plugin version v0.14.0 with containerd as the container runtime. The same setup works fine with dockerd as the runtime.

Pod ErrorLog:

I0524 08:28:03.907585       1 main.go:256] Retreiving plugins.
W0524 08:28:03.908010       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0524 08:28:03.908084       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0524 08:28:03.908113       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0524 08:28:03.908121       1 factory.go:115] Incompatible platform detected
E0524 08:28:03.908130       1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0524 08:28:03.908136       1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0524 08:28:03.908142       1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0524 08:28:03.908149       1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0524 08:28:03.915664       1 main.go:123] error starting plugins: error creating plugin manager: unable to create plugin manager: platform detection failed
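The log shows the plugin cannot load libnvidia-ml.so.1 inside its container; with containerd this usually means the NVIDIA runtime from the NVIDIA Container Toolkit is not set as containerd's default runtime, so the driver libraries are never injected. A sketch of the relevant /etc/containerd/config.toml section, assuming the toolkit is installed at its default path:

```toml
# /etc/containerd/config.toml (containerd CRI plugin, config version 2)
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  # Make the NVIDIA runtime the default so the device-plugin pod
  # gets the driver libraries (libnvidia-ml.so.1 etc.) mounted in
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
```

On recent toolkit versions, `sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default` generates an equivalent config; restart containerd afterwards (`sudo systemctl restart containerd`) and redeploy the plugin pod.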

nvidia-smi output:

sh-4.2$ nvidia-smi
Wed May 24 08:57:00 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                        On | 00000000:00:1E.0 Off |                    0 |
| N/A   25C    P8                9W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Labels

lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
needs-triage: Issue or PR has not been assigned a priority-px label.
