[D] Self-training with Noisy Student improves ImageNet classification (STNS)
A few questions about this paper:
- If you train an ensemble of SOTA architectures on ImageNet and average their predictions, do you beat STNS?
- Why not fine-tune the teacher? Why involve a student at all? Why not have the teacher fine-tune on its own noisy pseudo-labels and drop the student entirely?
- The noise applied to the student seems odd to me. Why would this work, other than that adding noise effectively anneals the solution? Why not add noise to the gradients instead, or do what I suggest in the second question?
I see Quoc Le has already investigated noisy gradients: https://arxiv.org/pdf/1511.06807.pdf
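For reference, the linked paper (Neelakantan et al., arXiv:1511.06807) adds time-annealed Gaussian noise to the gradients, with variance sigma_t^2 = eta / (1 + t)^gamma (they report gamma = 0.55 and eta in {0.01, 0.3, 1.0}). A minimal sketch of that scheme on a toy objective; the function names and the toy SGD loop are my own, not from either paper:

```python
import numpy as np

def noisy_gradient(grad, step, eta=0.3, gamma=0.55, rng=None):
    """Add annealed Gaussian noise to a gradient (Neelakantan et al.,
    arXiv:1511.06807): noise std = sqrt(eta / (1 + t)^gamma), decaying over time."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.sqrt(eta / (1.0 + step) ** gamma)
    return grad + rng.normal(0.0, sigma, size=grad.shape)

# Toy SGD loop minimizing f(w) = ||w||^2 with noisy gradients.
rng = np.random.default_rng(0)
w = np.array([5.0, -3.0])
for t in range(200):
    g = 2.0 * w  # exact gradient of ||w||^2
    w -= 0.05 * noisy_gradient(g, t, rng=rng)
```

Note this is noise in the optimizer, whereas STNS injects noise into the student's inputs and model (RandAugment, dropout, stochastic depth), so the two are not obviously interchangeable.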