[R] On the adequacy of untuned warmup for adaptive optimization
FAIR paper that claims to “obviate RAdam”. It states that RAdam(W) is functionally equivalent to Adam(W) with “linear warmup over 2 / (1 – beta2) training iterations”.
Legit? If so, this makes either MSR or FAIR look pretty bad.
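
For concreteness, here is a rough sketch of what the quoted rule would look like as a learning-rate multiplier. This is my reading of the quote, not code from the paper: the function name `untuned_linear_warmup_factor` is made up, and I am assuming iterations are counted from 1 and that the factor simply scales the base learning rate linearly until it reaches 1.

```python
def untuned_linear_warmup_factor(step: int, beta2: float = 0.999) -> float:
    """Learning-rate multiplier at 1-indexed iteration `step`.

    Assumption: "linear warmup over 2 / (1 - beta2) training iterations"
    means the multiplier grows linearly with the step count and is capped
    at 1 once that many iterations have passed.
    """
    if not 0.0 <= beta2 < 1.0:
        raise ValueError("beta2 must be in [0, 1)")
    # Warmup length implied by the quoted rule; about 2000 for beta2 = 0.999.
    warmup_steps = 2.0 / (1.0 - beta2)
    return min(1.0, step / warmup_steps)


if __name__ == "__main__":
    # Print the multiplier at a few iterations for the default beta2.
    for step in (1, 500, 1000, 2000, 5000):
        print(step, untuned_linear_warmup_factor(step))
```

With the usual beta2 = 0.999 this is a warmup of roughly 2000 iterations, so there is nothing to tune beyond what Adam already has. In PyTorch it could presumably be wired in through `torch.optim.lr_scheduler.LambdaLR`, shifting the scheduler's 0-based counter by one.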