Tesla V100s have I/O pins for at most six 25 GB/s NVLink links, so systems with more than six GPUs cannot connect every GPU directly to every other GPU over NVLink. The resulting I/O bottlenecks significantly diminish the returns of scaling beyond six GPUs.
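
To make the link budget concrete, here is a rough sketch of the arithmetic. It assumes a plain point-to-point full mesh with no switches, and it takes the 6-link and 25 GB/s figures from the numbers above. The function and constant names are made up for illustration and are not from any NVIDIA tool or API.

```python
# Back-of-the-envelope model, assuming a direct full mesh with no switches.
LINKS_PER_GPU = 6    # NVLink links per V100 (figure quoted in the post)
LINK_GB_PER_S = 25   # bandwidth per link in GB/s (figure quoted in the post)


def full_mesh_links_per_pair(num_gpus: int, links_per_gpu: int = LINKS_PER_GPU) -> int:
    """Links each GPU can dedicate to each peer in a full mesh; 0 means no full mesh."""
    if num_gpus < 2:
        raise ValueError("need at least two GPUs")
    peers = num_gpus - 1           # every GPU must reach every other GPU
    return links_per_gpu // peers  # split the link budget evenly across peers


def pair_bandwidth_gb_per_s(num_gpus: int) -> int:
    """Direct NVLink bandwidth per GPU pair under the same assumptions."""
    return full_mesh_links_per_pair(num_gpus) * LINK_GB_PER_S


for n in (2, 4, 8, 16):
    print(n, full_mesh_links_per_pair(n), pair_bandwidth_gb_per_s(n))
```

Under these assumptions, 4 GPUs get 2 links (50 GB/s) per pair, while 8 or 16 GPUs get 0, meaning at least some pairs have no direct NVLink path and have to fall back to something slower, such as PCIe or a multi-hop route.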

This article provides an overview of an architecture that bypasses this limitation using additional high-bandwidth links. In the benchmarks, multi-GPU performance scales almost linearly from 1 GPU to 16 GPUs.

I’m one of the engineers who worked on this project. Happy to answer any questions!

submitted by /u/mippie_moe