Are small transformers better than small LSTMs?

Transformers currently hold the state of the art on a range of NLP tasks.

Some examples are:

  • Machine translation: Transformer Big + BT
  • Named entity recognition: BERT large
  • Natural language inference: RoBERTa

One thing I noticed is that in all of these papers the models are massive, with maybe 20 layers and hundreds of millions of parameters.

Of course, using larger models is a general trend in NLP, but it raises the question of whether small transformers are any good. I recently had to train a sequence-to-sequence model from scratch, and I was unable to get better results with a transformer than with LSTMs.
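
As a rough illustration of what "small" could mean when comparing the two, the sketch below estimates the parameter count of one transformer encoder layer and one LSTM layer. The function names and the sizes in the usage lines are assumptions chosen for illustration, not the configuration from my experiment. The formulas assume a standard encoder layer (Q, K, V and output projections with biases, a two-layer feed-forward block, two layer norms) and an LSTM with a single bias vector per gate; embeddings and output layers are ignored.

```python
def transformer_encoder_layer_params(d_model: int, d_ff: int) -> int:
    """Approximate parameter count of one standard transformer encoder layer."""
    # Self-attention: Q, K, V and output projections, each d_model x d_model plus bias.
    attention = 4 * (d_model * d_model + d_model)
    # Position-wise feed-forward block: d_model -> d_ff -> d_model, with biases.
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms, each with a scale and a shift vector of size d_model.
    layer_norms = 2 * (2 * d_model)
    return attention + feed_forward + layer_norms


def lstm_layer_params(input_size: int, hidden_size: int) -> int:
    """Approximate parameter count of one LSTM layer (one bias vector per gate)."""
    # Four gates, each with input weights, recurrent weights, and a bias.
    per_gate = hidden_size * (input_size + hidden_size) + hidden_size
    return 4 * per_gate


if __name__ == "__main__":
    # Illustrative sizes only (assumed, not taken from any paper).
    print("transformer layer:", transformer_encoder_layer_params(256, 1024))
    print("lstm layer:       ", lstm_layer_params(256, 256))
```

Matching the two models on a count like this, rather than on layer count or hidden size, is one way to keep the comparison fair. Note that some libraries store two bias vectors per LSTM gate, so their reported totals can be slightly higher.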

I am wondering whether anyone here has had similar experiences or knows of any papers on this topic.

