[D] multi-head attention regularization
There are papers on making multi-head attention in transformers attend to different parts of the input to reduce redundancy between heads. Has anyone tried any of these methods? If so, what works well?
[1] https://arxiv.org/abs/1910.04500
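For concreteness, one common family of such methods adds a penalty on the pairwise similarity of the heads' attention distributions, pushing heads to focus on different positions. Below is a minimal sketch of that idea (a cosine-similarity disagreement penalty); the function name and the `(heads, queries, keys)` layout are my own assumptions, not taken from the linked paper.

```python
import numpy as np

def head_disagreement_loss(attn):
    """Average pairwise cosine similarity between attention heads.

    attn: array of shape (heads, query_len, key_len) holding each
    head's attention distribution. Adding this value to the training
    loss penalizes heads that attend to the same positions.
    """
    h = attn.shape[0]
    flat = attn.reshape(h, -1)
    unit = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = unit @ unit.T          # (h, h) cosine similarities
    off_diag = sim.sum() - np.trace(sim)  # drop self-similarity
    return off_diag / (h * (h - 1))       # mean over head pairs
```

Identical heads give a penalty of 1, heads attending to disjoint positions give 0, so minimizing it drives heads apart.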
submitted by /u/taylorchu