[Discussion] Do you experience GPU “shutdown” on servers with multiple GPUs?

Edit: Succinct question: Do people experience their GPUs disappearing from nvidia-smi and not being recoverable except by a system reboot?

We have several servers with four RTX 2080 Ti cards each, used for research. We are noticing that there is a tendency for one GPU to unexpectedly “die” from time to time. The affected GPU disappears from the list returned by nvidia-smi.

This happens anywhere from several times a week to several times a month, and the GPU only comes back online after a server restart.

The GPU load varies, but there is no clear correlation between periods (days) of high load and GPUs dying.

  • Do other groups experience this?
  • Is this expected behaviour?
  • And has anyone found a way to avoid this?

Edit: The symptom is most clearly noticed by the frozen GPU disappearing from the list returned by nvidia-smi. If GPU number 2 dies, the list returns values for GPUs 0, 1 and 3.
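
For anyone who wants to check for the same symptom, here is a rough sketch of how it could be detected automatically. It assumes four GPUs per server and that the remaining GPUs keep their original indices, as in the example above. The function names and the `EXPECTED_GPUS` constant are made up for illustration; the only real interface used is `nvidia-smi --query-gpu=index --format=csv,noheader`.

```python
import subprocess

# Assumption: four cards per server, as in the setup described in this post.
EXPECTED_GPUS = 4


def parse_gpu_indices(smi_output):
    """Collect the GPU indices printed one per line by nvidia-smi."""
    indices = set()
    for line in smi_output.splitlines():
        line = line.strip()
        # Skip blank lines and any error text that is not a bare index.
        if line.isdigit():
            indices.add(int(line))
    return indices


def missing_gpus(visible, expected=EXPECTED_GPUS):
    """Return the sorted indices in 0..expected-1 that are not visible."""
    return sorted(set(range(expected)) - set(visible))


def check_once():
    """Query nvidia-smi once and return the list of missing GPU indices."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,  # nvidia-smi may exit non-zero when a GPU is in a bad state
    )
    return missing_gpus(parse_gpu_indices(result.stdout))
```

Calling `check_once()` from a cron job and logging a timestamp whenever the returned list is non-empty would give a record of when each GPU dropped out, which could then be compared against the load at that time.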

By “dying” I mean that the GPU becomes unresponsive to the system, and is not “noticed” by nvidia-smi.

The servers each have 128 GB of RAM and 28 CPU cores. Some users run CPU-intensive simulations while others are running jobs on the GPUs, which could be contributing to the crashes.

