[P] SpeedTorch. 4x faster pinned CPU -> GPU data transfer than Pytorch pinned CPU tensors, and 110x faster GPU -> CPU transfer. Augment parameter size by hosting embeddings on CPU. Use non-sparse optimizers (Adadelta, Adamax, RMSprop, Rprop, etc.) for sparse training (word2vec, node2vec, GloVe, NCF, etc.).
This is a library I made for Pytorch for fast transfer between pinned CPU tensors and GPU Pytorch variables. The inspiration came from needing to train a large number of embeddings which don't all fit in GPU RAM at the desired embedding size, so I needed a faster CPU <-> GPU transfer method. This also allows using any optimizer for sparse training, since every embedding contained in the Pytorch embedding variable receives an update; previously, only Pytorch's SGD, Adagrad, and SparseAdam were suitable for such training.
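To make the idea concrete, here is a minimal sketch of the general pattern in plain Pytorch (this is not SpeedTorch's actual API; all names and sizes below are hypothetical): keep the full embedding table in (optionally pinned) CPU memory, copy only the rows needed for the current batch to the GPU, update them with any dense optimizer, and write them back.

```python
import torch

# Hypothetical sizes for illustration; the full table lives in CPU memory.
vocab_size, dim, batch = 10_000, 64, 32
device = "cuda" if torch.cuda.is_available() else "cpu"

# Full embedding table on CPU; pin it only when CUDA is present.
cpu_table = torch.randn(vocab_size, dim)
if torch.cuda.is_available():
    cpu_table = cpu_table.pin_memory()

# A small on-device working set holds just the rows needed this step.
idx = torch.randint(0, vocab_size, (batch,))
active = cpu_table[idx].to(device, non_blocking=True)
active.requires_grad_(True)

# Any dense optimizer (e.g. Adamax) works here, because every row in
# the working set receives a gradient -- there are no "absent" rows.
opt = torch.optim.Adamax([active], lr=0.01)
loss = active.pow(2).sum()  # stand-in for a real training loss
loss.backward()
opt.step()

# Write the updated rows back into the CPU-hosted table.
cpu_table[idx] = active.detach().to("cpu")
```

SpeedTorch's role in this pattern is to make the two copy steps (rows out to GPU, updated rows back to CPU) much faster than the stock Pytorch transfer shown here.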
In addition to augmenting parameter sizes, you can use it to increase the speed at which data on your CPU is transferred to Pytorch CUDA variables.
Also, SpeedTorch's GPU tensors are overall faster than Pytorch CUDA tensors when taking into account transfers both to and from (overall 2.6x faster). For just transferring to a Pytorch CUDA variable, Pytorch is still faster, but it is significantly slower when transferring from a Pytorch CUDA variable.
I have personally used this to nearly double the embedding size in two other projects by holding half the parameters on CPU. The training speed is decent thanks to the fast CPU <-> GPU exchange.
There's a bit of a learning curve the first time you get started with it, so as soon as you run into any sort of friction, feel free to ask a question on the project Gitter and I'll answer it.