[Discussion] Do you experience GPU “shutdown” on servers with multiple GPUs?
Edit: Succinct question: Do people experience their GPUs disappearing from nvidia-smi and being unrecoverable except by a system reboot?
We have several servers with four RTX 2080 Ti cards, used for research. We are noticing that one GPU tends to unexpectedly “die” from time to time: it disappears from the list returned by nvidia-smi.
This happens anywhere from several times a week to several times a month, and the server has to be restarted to bring the GPU back online.
The GPU load varies, but there is no clear correlation between periods (days) of high load and GPUs dying.
- Do other groups experience this?
- Is this expected behaviour?
- And has anyone found a way to avoid this?
Edit: The symptom is most clearly noticed by the frozen GPU disappearing from the list returned by nvidia-smi. If GPU number 2 dies, the list returns values for GPUs 0, 1 and 3.
By “dying” I mean that the GPU becomes unresponsive to the system, and is not “noticed” by nvidia-smi.
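For anyone who wants to check their own servers for the same symptom, here is a rough sketch of how the check could be automated. It assumes the remaining GPUs keep their original indices after one drops out (which matches what we see), and that four GPUs are expected per server. The function names are made up for illustration; the only external call is nvidia-smi with its --query-gpu=index and --format=csv,noheader options.

```python
import subprocess


def parse_gpu_indices(csv_text):
    """Collect GPU indices from `nvidia-smi --query-gpu=index --format=csv,noheader` output."""
    indices = set()
    for line in csv_text.splitlines():
        line = line.strip()
        # Skip blank lines and any error text that nvidia-smi may print instead of an index.
        if line.isdigit():
            indices.add(int(line))
    return indices


def missing_gpus(csv_text, expected_count=4):
    """Return the sorted indices in 0..expected_count-1 that are absent from the output.

    expected_count=4 is an assumption matching the servers described above.
    """
    return sorted(set(range(expected_count)) - parse_gpu_indices(csv_text))


def query_nvidia_smi():
    """Run nvidia-smi and return its stdout, or an empty string if the tool is not installed."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return ""
    return result.stdout
```

Calling missing_gpus(query_nvidia_smi()) from a cron job and logging a timestamp whenever the result is non-empty would at least show exactly when a card drops out, which might help in correlating the failures with load.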
The servers each have 128 GB of RAM and 28 CPU cores. Some users run CPU-intensive simulations while others are running jobs on the GPUs, which could be causing the crashes.