[Discussion] Do you experience GPU “shutdown” on servers with multiple GPUs?

Edit: Succinct question: Do people experience their GPUs disappearing from nvidia-smi and find that they are not recoverable except by a system reboot?

We have several research servers, each with four RTX 2080 Ti cards. From time to time, one GPU unexpectedly “dies” and disappears from the list returned by nvidia-smi.

This happens anywhere from several times a month to several times a week, and the only way to bring the GPU back online is to restart the server.

The GPU load varies, but there is no clear correlation between periods (days) of high load and GPUs dying.

  • Do other groups experience this?
  • Is this expected behaviour?
  • And has anyone found a way to avoid this?

Edit: The symptom is most clearly noticed when the frozen GPU disappears from the list returned by nvidia-smi. If GPU number 2 dies, the list returns values for GPUs 0, 1 and 3.

By “dying” I mean that the GPU becomes unresponsive to the system and is no longer “noticed” by nvidia-smi.
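For reference, here is a minimal watchdog sketch of how the symptom can be detected automatically: it polls nvidia-smi and reports any GPU index that has gone missing from the device list. This is illustrative only, assuming nvidia-smi is on the PATH and that each server should show indices 0–3; EXPECTED and POLL_SECONDS are placeholders to adjust for your own setup.

```python
#!/usr/bin/env python3
"""Watchdog sketch: poll nvidia-smi and report GPU indices that have
disappeared from its device list. EXPECTED and POLL_SECONDS are
placeholders; adjust them for your own topology."""

import subprocess
import time

EXPECTED = {0, 1, 2, 3}   # four cards per server, as described above
POLL_SECONDS = 60

def visible_gpus():
    """Return the set of GPU indices nvidia-smi can currently see."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index", "--format=csv,noheader"],
        capture_output=True, text=True, check=True, timeout=30,
    ).stdout
    return {int(line) for line in out.splitlines() if line.strip()}

if __name__ == "__main__":
    while True:
        try:
            missing = EXPECTED - visible_gpus()
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            # nvidia-smi itself can fail or hang once a GPU has fallen
            # off the bus; treat that as "all GPUs unknown".
            missing = EXPECTED
        if missing:
            print(f"{time.ctime()}: GPU(s) {sorted(missing)} missing from nvidia-smi")
        time.sleep(POLL_SECONDS)
```

Correlating the timestamps this logs against job schedules could also help test whether the failures track the mixed CPU/GPU load mentioned below.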

The servers each have 128 GB of RAM and 28 CPU cores. Some users run CPU-intensive simulations at the same time that others are running jobs on the GPUs, which could be what is causing the crashes.

submitted by /u/wingtales