We are the authors of XLNet. We conducted a fair comparison of XLNet and BERT using large models. In this study, we ensured that almost every hyperparameter was the same in the training recipes for BERT and XLNet, and that both were trained on the same data.

Among other findings, we made the following observations:

- Trained on the same data with an almost identical training recipe, XLNet outperforms BERT by a sizable margin on all the datasets.
- The gains from training on 10x more data are smaller than the gains from switching from BERT to XLNet on 8 out of 11 benchmarks.

submitted by /u/kimiyoung