[Discussion] Do you experience GPU “shutdown” on servers with multiple GPUs?

Written by torontoai on May 26, 2019. Posted in Reddit MachineLearning.

Edit: Succinct question: Do other people see GPUs disappear from the nvidia-smi output and stay unrecoverable until the system is rebooted?

We have several servers with four RTX 2080 Ti cards each, used for research. We have noticed that one GPU tends to unexpectedly “die” from time to time: it disappears from the list returned by nvidia-smi.

This happens anywhere from several times a week to several times a month, and the GPU only comes back online after a server restart.

The GPU load varies, but there is no clear correlation between periods (days) of high load and GPUs dying.

Do other groups experience this?
Is this expected behaviour?
And has anyone found a way to avoid this?

Edit: The symptom is most clearly noticed by the frozen GPU disappearing from the list returned by nvidia-smi. If GPU number 2 dies, the list returns values for GPUs 0, 1 and 3.

By “dying” I mean that the GPU becomes unresponsive to the system, and is not “noticed” by nvidia-smi.
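
For anyone who wants to catch this as soon as it happens rather than when a job fails, here is a rough sketch of a check that could run from cron. It is not something we run in production. It assumes four GPUs per server and that the surviving GPUs keep their indices, as in the 0, 1, 3 example above. The function names (parse_gpu_indices, missing_gpus, query_nvidia_smi, main) are made up for illustration; the nvidia-smi flags --query-gpu and --format=csv,noheader are the standard query options.

```python
import subprocess


def parse_gpu_indices(csv_text):
    """Return the set of GPU indices found in nvidia-smi CSV output.

    Expects lines such as "0, <gpu name>", as produced by
    nvidia-smi --query-gpu=index,name --format=csv,noheader
    """
    indices = set()
    for line in csv_text.splitlines():
        first_field = line.split(",")[0].strip()
        # Skip blank lines and any warning text that is not a GPU row.
        if first_field.isdigit():
            indices.add(int(first_field))
    return indices


def missing_gpus(csv_text, expected_count=4):
    """Return the sorted list of expected GPU indices that are not listed.

    expected_count=4 is an assumption matching the servers described here.
    """
    expected = set(range(expected_count))
    return sorted(expected - parse_gpu_indices(csv_text))


def query_nvidia_smi(timeout_seconds=30):
    """Run nvidia-smi and return its CSV output as text."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        timeout=timeout_seconds,
    )
    return result.stdout


def main():
    """Hypothetical entry point: report which GPUs have disappeared."""
    try:
        output = query_nvidia_smi()
    except (FileNotFoundError, subprocess.TimeoutExpired) as error:
        print(f"Could not query nvidia-smi: {error}")
        return
    missing = missing_gpus(output)
    if missing:
        print(f"GPUs missing from nvidia-smi: {missing}")
    else:
        print("All expected GPUs are visible.")
```

Calling main() every few minutes and logging the result with a timestamp would at least show when a GPU dropped out, which can then be compared against what was running at that moment. If the driver renumbers the remaining GPUs instead of leaving a gap, the check still reports a missing index, just not necessarily the right one.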

The servers each have 128 GB of RAM and 28 CPU cores. Some users run CPU-intensive simulations while others are running jobs on the GPUs, which could be causing the crash.

submitted by /u/wingtales