[D] How the Transformers broke NLP leaderboards
I came across this interesting article about whether larger models + more data = progress in ML research.
The focus of this post is yet another problem with the leaderboards that is relatively recent. Its cause is simple: fundamentally, a model may be better than its competitors by building better representations from the available data – or it may simply use more data, and/or throw a deeper network at it. When we have a paper presenting a new model that also uses more data/compute than its competitors, credit attribution becomes hard.
The most popular NLP leaderboards are currently dominated by Transformer-based models. BERT received the best paper award at NAACL 2019 after months of holding SOTA on many leaderboards. Now the hot topic is XLNet that is said to overtake BERT on GLUE and some other benchmarks. Other Transformers include GPT-2, ERNIE, and the list is growing.
The problem we’re starting to face is that these models are HUGE. While the source code is available, in reality it is beyond the means of an average lab to reproduce these results, or to produce anything comparable. For instance, XLNet is trained on 32B tokens, and the price of using 500 TPUs for 2 days is over $250,000. Even fine-tuning this model is getting expensive.
Wait, this was supposed to happen!
On the one hand, this trend looks predictable, even inevitable: people with more resources will use more resources to get better performance. One could even argue that a huge model proves its scalability and fulfils the inherent promise of deep learning, i.e. being able to learn more complex patterns from more information. Nobody knows how much data we actually need to solve a given NLP task, but more should be better, and limiting data seems counter-productive.
On that view – well, from now on top-tier NLP research is going to be something possible only for industry. Academics will have to somehow up their game, either by getting more grants or by collaborating with high-performance computing centers. They are also welcome to switch to analysis, building something on top of the industry-provided huge models, or making datasets.
However, in terms of overall progress in NLP that might not be the best thing to do. The chief problem with the huge models is simply this:
“More data & compute = SOTA” is NOT research news.
If leaderboards are to highlight the actual progress, we need to incentivize new architectures rather than teams outspending each other. Obviously, huge pretrained models are valuable, but unless the authors show that their system consistently behaves differently from its competition with comparable data & compute, it is not clear whether they are presenting a model or a resource.
Furthermore, much of this research is not reproducible: nobody is going to spend $250,000 just to repeat XLNet training. Given the fact that its ablation study showed only 1-2% gain over BERT in 3 datasets out of 4, we don’t actually know for sure that its masking strategy is more successful than BERT’s.
At the same time, the development of leaner models is dis-incentivized, as their task is fundamentally harder and the leaderboard-oriented community only rewards the SOTA. That, in its turn, prices out of competitions academic teams, which will not result in students becoming better engineers when they graduate.