[D] Are small transformers better than small LSTMs?
Transformers currently hold the state of the art on a range of NLP tasks.
Some examples are:
- Machine translation: Transformer Big + BT
- Named entity recognition: BERT large
- Natural language inference: RoBERTa
Something I noticed is that in all of these papers the models are massive, with something like 20+ layers and hundreds of millions of parameters.
Of course, ever-larger models are a general trend in NLP, but it raises the question of whether small transformers are any good. I recently had to train a sequence-to-sequence model from scratch and could not get better results with a transformer than with LSTMs.
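To make the setup concrete, here is a minimal sketch (PyTorch) of the kind of small-model comparison I mean: a small transformer seq2seq vs. a small LSTM seq2seq at a roughly similar parameter budget. The vocabulary size and hyperparameters below are just illustrative, not my exact setup.

```python
import torch
import torch.nn as nn

VOCAB = 8000  # illustrative vocabulary size
PAD = 0

class SmallTransformerSeq2Seq(nn.Module):
    def __init__(self, d_model=256, nhead=4, layers=2, ff=512):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model, padding_idx=PAD)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=layers, num_decoder_layers=layers,
            dim_feedforward=ff, batch_first=True)
        self.out = nn.Linear(d_model, VOCAB)

    def forward(self, src, tgt):
        # causal mask so the decoder cannot attend to future tokens
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.transformer(self.embed(src), self.embed(tgt), tgt_mask=mask)
        return self.out(h)

class SmallLSTMSeq2Seq(nn.Module):
    def __init__(self, d_model=256, hidden=512, layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model, padding_idx=PAD)
        self.encoder = nn.LSTM(d_model, hidden, layers, batch_first=True)
        self.decoder = nn.LSTM(d_model, hidden, layers, batch_first=True)
        self.out = nn.Linear(hidden, VOCAB)

    def forward(self, src, tgt):
        # initialize the decoder from the final encoder state
        _, state = self.encoder(self.embed(src))
        h, _ = self.decoder(self.embed(tgt), state)
        return self.out(h)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

if __name__ == "__main__":
    tf, lstm = SmallTransformerSeq2Seq(), SmallLSTMSeq2Seq()
    print(f"transformer params: {n_params(tf):,}")
    print(f"lstm params:        {n_params(lstm):,}")
    src = torch.randint(1, VOCAB, (2, 20))
    tgt = torch.randint(1, VOCAB, (2, 18))
    print(tf(src, tgt).shape, lstm(src, tgt).shape)
```

With both models in the few-million-parameter range like this, the LSTM was the easier one to get working well for me.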
I am wondering if someone here has had similar experiences or knows of any papers on this topic.