Skip to main content
Version: latest

Troubleshooting

GPU Memory Limit Not Enforced​

If a container exceeds its nvidia.com/gpumem limit, check the following causes:

  • CUDA_DISABLE_CONTROL=true is set - disables HAMi-core enforcement entirely. Remove it from production workloads.

  • Docker-in-Docker (DinD) - inner containers do not inherit the /etc/ld.so.preload hostPath mount. HAMi enforcement does not apply inside DinD.

  • Direct driver API usage - workloads calling NVML or the CUDA Driver API directly bypass libvgpu.so.

  • nvidia-container-runtime not set as default - verify with:

    containerd config dump | grep default_runtime_name

    The output must show nvidia. If not, follow the Prerequisites guide.

  • If you don’t explicitly request vGPUs when using the device plugin with NVIDIA images, all GPUs on the host may be exposed to your container.

  • Currently, A100 MIG can be supported in only "none" and "mixed" modes.

  • Tasks with the "nodeName" field cannot be scheduled at the moment; please use "nodeSelector" instead.

  • Only computing tasks are currently supported; video codec processing is not supported.

  • Since v2.3.10, HAMi has changed the device-plugin environment variable name from NodeName to NODE_NAME. If you are using an image version earlier than v2.3.10, the device-plugin may fail to start.

    To resolve this issue, you have two options:

    • Manually edit the DaemonSet using kubectl edit daemonset and update the environment variable from NodeName to NODE_NAME.

    • Upgrade the device-plugin image to the latest version using Helm:

      helm upgrade hami hami/hami -n kube-system

      This will apply the fix automatically.

CNCFHAMi is a CNCF Sandbox project