Troubleshooting
GPU Memory Limit Not Enforced​
If a container exceeds its nvidia.com/gpumem limit, check the following causes:
-
CUDA_DISABLE_CONTROL=trueis set - disables HAMi-core enforcement entirely. Remove it from production workloads. -
Docker-in-Docker (DinD) - inner containers do not inherit the
/etc/ld.so.preloadhostPath mount. HAMi enforcement does not apply inside DinD. -
Direct driver API usage - workloads calling NVML or the CUDA Driver API directly bypass
libvgpu.so. -
nvidia-container-runtimenot set as default - verify with:containerd config dump | grep default_runtime_nameThe output must show
nvidia. If not, follow the Prerequisites guide. -
If you don’t explicitly request vGPUs when using the device plugin with NVIDIA images, all GPUs on the host may be exposed to your container.
-
Currently, A100 MIG can be supported in only "none" and "mixed" modes.
-
Tasks with the "nodeName" field cannot be scheduled at the moment; please use "nodeSelector" instead.
-
Only computing tasks are currently supported; video codec processing is not supported.
-
Since v2.3.10, HAMi has changed the
device-pluginenvironment variable name fromNodeNametoNODE_NAME. If you are using an image version earlier than v2.3.10, thedevice-pluginmay fail to start.To resolve this issue, you have two options:
-
Manually edit the DaemonSet using
kubectl edit daemonsetand update the environment variable fromNodeNametoNODE_NAME. -
Upgrade the
device-pluginimage to the latest version using Helm:helm upgrade hami hami/hami -n kube-systemThis will apply the fix automatically.
-