[D] Implementation of “Stand-Alone Self-Attention in Vision Models”
I’m implementing https://arxiv.org/pdf/1906.05909.pdf in this repo (https://github.com/MerHS/SASA-pytorch), but the current implementation consumes far too much GPU memory: it has about half the parameters of ResNet-50, yet roughly 10x the memory consumption.
I suspect some `matmul` or `view` call is causing the blow-up, so I’m working on replacing the `matmul`s with `einsum` (I’m also not sure I implemented the paper correctly). See the sketch below.
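For reference, here is a minimal sketch of the einsum route I’m trying (single head, relative position embeddings omitted; `local_attention` is just an illustrative helper I made up for this post, not the repo’s actual code):

```python
import torch
import torch.nn.functional as F

def local_attention(q, k, v, kernel_size=7):
    """Single-head local self-attention over a k x k neighborhood.

    q, k, v: (B, C, H, W). kernel_size is assumed odd so padding
    keeps the spatial size. Relative position terms are omitted.
    """
    B, C, H, W = q.shape
    pad = kernel_size // 2

    # Gather each pixel's k*k neighborhood of keys/values:
    # unfold gives (B, C * k*k, H*W), reshaped to (B, C, k*k, H, W)
    k_win = F.unfold(k, kernel_size, padding=pad).view(B, C, kernel_size ** 2, H, W)
    v_win = F.unfold(v, kernel_size, padding=pad).view(B, C, kernel_size ** 2, H, W)

    # Per-pixel logits: query dot each of its k*k neighbor keys.
    # einsum avoids the extra permute/contiguous copies a matmul needs.
    # (The 1/sqrt(C) scaling is my choice, not necessarily the paper's.)
    logits = torch.einsum('bchw,bcnhw->bnhw', q, k_win) / C ** 0.5
    attn = logits.softmax(dim=1)

    # Weighted sum over the neighborhood values: back to (B, C, H, W)
    return torch.einsum('bnhw,bcnhw->bchw', attn, v_win)
```

Even so, `unfold` still materializes k² overlapping copies of the key/value maps, so I suspect einsum alone won’t be enough and the real savings have to come from never materializing those windows.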
Could anyone guess how the authors optimized this network?