[R] BERT and XLNET for Malay and Indonesian languages.
I have released BERT and XLNET models for the Malay language, trained on around 1.2GB of data (public news, Twitter, Instagram, Wikipedia, and parliament text), and ran some comparisons between them. The models perform well in both social media and formal contexts, and I believe they should also work well for Bahasa Indonesia, since the two languages share a lot of vocabulary and context, as Wikipedia shows. The official BERT Multilingual model, at around 714MB, is great but too heavy for some low-cost deployments.
BERT-Bahasa: you can read more at https://github.com/huseinzol05/Malaya/tree/master/bert. Two models released for BERT-Bahasa:
- Vocab size 40k, case sensitive, trained on a 1.21GB dataset, BASE size (467MB).
- Vocab size 40k, case sensitive, trained on a 1.21GB dataset, SMALL size (184MB).
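To illustrate why a Malay-only model can be so much lighter than the multilingual checkpoint, here is a rough back-of-the-envelope parameter count for a BERT-Base-shaped model. This assumes the standard BASE hyperparameters (12 layers, hidden size 768, intermediate size 3072); the actual released configs may differ. Most of the size gap comes from the token embedding table: a 40k vocab versus roughly 119k for Multilingual BERT.

```python
def bert_base_params(vocab_size, hidden=768, layers=12, intermediate=3072, max_pos=512):
    """Rough parameter count for a BERT-Base-shaped model (assumed config)."""
    # Embeddings: token + position + segment tables, plus one LayerNorm (gamma, beta).
    embeddings = (vocab_size + max_pos + 2) * hidden + 2 * hidden
    per_layer = (
        4 * (hidden * hidden + hidden)          # Q, K, V and attention output projections
        + 2 * (2 * hidden)                      # two LayerNorms per layer
        + hidden * intermediate + intermediate  # feed-forward up-projection
        + intermediate * hidden + hidden        # feed-forward down-projection
    )
    return embeddings + layers * per_layer

def fp32_megabytes(params):
    # 4 bytes per float32 weight, reported in decimal MB.
    return params * 4 / 1e6

# 40k Malay vocab vs ~119k multilingual vocab (the multilingual figure is approximate)
print(round(fp32_megabytes(bert_base_params(40_000))))
print(round(fp32_megabytes(bert_base_params(119_547))))
```

Under these assumptions the estimates land near the quoted checkpoint sizes (roughly 465MB for the 40k-vocab BASE model and roughly 709MB for the multilingual one), so almost all of the extra 250MB in the multilingual checkpoint is vocabulary embeddings rather than transformer capacity.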
XLNET-Bahasa: you can read more at https://github.com/huseinzol05/Malaya/tree/master/xlnet. One model released for XLNET-Bahasa:
- Vocab size 32k, case sensitive, trained on a 1.21GB dataset, BASE size (878MB).
All comparison studies are in the two README pages linked above. Comparisons for abstractive summarization and neural machine translation are on the way, and XLNET-Bahasa SMALL is currently training.