
[Discussion] Do you experience GPU “shutdown” on servers with multiple GPUs?

Edit: Succinct question: Do people experience GPUs disappearing from nvidia-smi and not being recoverable except by a system reboot?

We have several servers with four RTX 2080 Ti cards each, used for research. We are noticing a tendency for one GPU to unexpectedly “die” from time to time. It disappears from the list returned by nvidia-smi.

This happens anywhere from several times a week to several times a month, and the GPU only comes back online after a server restart.

The GPU load varies, but there is no clear correlation between periods (days) of high load and GPUs dying.

  • Do other groups experience this?
  • Is this expected behaviour?
  • And has anyone found a way to avoid this?

Edit: The symptom is most clearly noticed by the frozen GPU disappearing from the list returned by nvidia-smi. If GPU number 2 dies, the list returns values for GPUs 0, 1 and 3.

By “dying” I mean that the GPU becomes unresponsive to the system, and is not “noticed” by nvidia-smi.
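
For anyone who wants to catch this automatically, here is a rough sketch of how the symptom could be detected from a script. It assumes the behaviour described above, where the dead GPU's index is simply absent from the nvidia-smi list (0, 1 and 3 when GPU 2 dies). The function names and the expected count of four are my own placeholders, not part of any NVIDIA tool. On a setup where the remaining GPUs get renumbered, comparing `pci.bus_id` values instead of indices would probably be needed.

```python
import subprocess
from typing import List, Set


def parse_gpu_indices(smi_output: str) -> Set[int]:
    """Parse the text printed by
    `nvidia-smi --query-gpu=index --format=csv,noheader`
    (one GPU index per line) into a set of integers."""
    indices = set()
    for line in smi_output.splitlines():
        line = line.strip()
        if line.isdigit():  # skip blank lines and any error text
            indices.add(int(line))
    return indices


def missing_gpus(smi_output: str, expected_count: int) -> List[int]:
    """Return the indices in range(expected_count) that nvidia-smi did not list.

    Assumption: indices are not renumbered when a GPU drops out,
    as described in the post.
    """
    present = parse_gpu_indices(smi_output)
    return sorted(set(range(expected_count)) - present)


def query_nvidia_smi() -> str:
    """Run nvidia-smi and return its raw stdout (needs the NVIDIA driver installed)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,  # a wedged GPU can make nvidia-smi exit non-zero
    )
    return result.stdout
```

Calling `missing_gpus(query_nvidia_smi(), 4)` from a cron job or monitoring script would then give the list of GPUs that have dropped out, which at least makes it possible to log when each failure happens and compare that against what was running at the time.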

The servers each have 128 GB of RAM and 28 CPU cores. Some users run CPU-intensive simulations at the same time as others run jobs on the GPUs, which could be causing the crash.

submitted by /u/wingtales