When systemd is used to manage a container's cgroups and is triggered to reload any Unit files that reference NVIDIA GPUs (for example, systemctl daemon-reload), containerized GPU workloads may suddenly lose access to their GPUs.
In GPUStack, the GPUs may disappear from the Resources menu, and running nvidia-smi inside the GPUStack container may fail with the error: Failed to initialize NVML: Unknown Error
To prevent this issue, disable systemd cgroup management in Docker: set "exec-opts": ["native.cgroupdriver=cgroupfs"] in /etc/docker/daemon.json and restart Docker, for example:
vim /etc/docker/daemon.json

{
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  },
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}

systemctl daemon-reload && systemctl restart docker
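After Docker restarts, it is worth confirming that the cgroupfs driver is actually in use and that a fresh container can still initialize NVML. A minimal check, assuming any CUDA base image is available (the tag below is only an example):

# Confirm Docker now reports "Cgroup Driver: cgroupfs"
docker info | grep -i "cgroup driver"

# Launch a throwaway GPU container and make sure nvidia-smi works
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi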
Containers losing access to GPUs with error: "Failed to initialize NVML: Unknown Error"
When using the NVIDIA Container Runtime Hook (i.e. the Docker --gpus flag or the NVIDIA Container Runtime in legacy mode) to inject requested GPUs and driver libraries into a container, the hook makes modifications, including setting up cgroup access, to the container without the low-level runtime (e.g. runc) being aware of these changes. The result is that updates to the container may remove access to the requested GPUs.
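In other words, a systemctl daemon-reload issued while such a container is running can silently drop the device cgroup rules the hook set up. A rough reproduction sketch, assuming the NVIDIA Container Toolkit is configured in legacy mode and Docker uses the systemd cgroup driver (the image tag and container name are placeholders):

# Start a long-running container with GPUs injected by the hook
docker run -d --name gpu-test --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 sleep infinity

# GPU access works initially
docker exec gpu-test nvidia-smi

# Ask systemd to rewrite cgroup configuration for all units
sudo systemctl daemon-reload

# Check GPU access again inside the still-running container
docker exec gpu-test nvidia-smi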
When the container loses access to the GPU, you will see the following error message from the console output:
Failed to initialize NVML: Unknown Error
The message may differ depending on the type of application that is running in the container.
The container needs to be deleted once the issue occurs. When it is restarted, either manually or automatically (depending on whether you are using a container orchestration platform), it will regain access to the GPU.
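With plain Docker that means removing and recreating the container; on Kubernetes, deleting the Pod so its controller reschedules it. A hedged sketch (the container name, image tag, and Pod name are placeholders):

# Docker: remove the affected container and start it again
docker rm -f gpu-test
docker run -d --name gpu-test --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 sleep infinity

# Kubernetes: delete the Pod and let its controller recreate it
kubectl delete pod <gpu-pod-name>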
Related links
- Online Installation - GPUStack (https://docs.gpustack.ai/latest/installation/nvidia-cuda/online-installation/#prerequisites_1)
- NOTICE: Containers losing access to GPUs with error: "Failed to initialize NVML: Unknown Error" · Issue #48 · NVIDIA/nvidia-container-toolkit (https://github.com/NVIDIA/nvidia-container-toolkit/issues/48)
- NOTICE: Containers losing access to GPUs with error: "Failed to initialize NVML: Unknown Error" · NVIDIA/nvidia-container-toolkit · Discussion #1133 (https://github.com/NVIDIA/nvidia-container-toolkit/discussions/1133)
- Troubleshooting — NVIDIA Container Toolkit (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/troubleshooting.html#containers-losing-access-to-gpus-with-error-failed-to-initialize-nvml-unknown-error)