
[R] BERT and XLNET for Malay and Indonesian languages.

I released BERT and XLNET models for the Malay language, trained on around 1.2GB of data (public news, Twitter, Instagram, Wikipedia, and parliament text), and ran some comparisons between them. The models work well in both social-media and formal contexts. I believe they should also work well for Bahasa Indonesia: judging from Wikipedia, the two languages share a lot of context and assimilation. And while the multilingual BERT model that Google released (around 714MB) is great, it is too heavy for some low-cost deployments.

BERT-Bahasa (you can read more at https://github.com/huseinzol05/Malaya/tree/master/bert). Two models for BERT-Bahasa:

  1. Vocab size 40k, case-sensitive, trained on a 1.21GB dataset, BASE size (467MB).
  2. Vocab size 40k, case-sensitive, trained on a 1.21GB dataset, SMALL size (184MB).

XLNET-Bahasa (you can read more at https://github.com/huseinzol05/Malaya/tree/master/xlnet). One model for XLNET-Bahasa:

  1. Vocab size 32k, case-sensitive, trained on a 1.21GB dataset, BASE size (878MB).
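As a rough illustration of the size trade-off discussed above, here is a small Python helper (hypothetical, not part of Malaya) that picks the largest released checkpoint fitting a given memory budget, using the approximate disk sizes listed in this post:

```python
# Approximate checkpoint sizes in MB, taken from the release notes above
# (multilingual BERT's ~714MB is included for comparison).
CHECKPOINTS = {
    "bert-bahasa-base": 467,
    "bert-bahasa-small": 184,
    "xlnet-bahasa-base": 878,
    "bert-multilingual": 714,
}

def pick_checkpoint(budget_mb):
    """Return the largest checkpoint that fits within budget_mb, or None."""
    fitting = {name: size for name, size in CHECKPOINTS.items()
               if size <= budget_mb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)

print(pick_checkpoint(500))  # bert-bahasa-base
print(pick_checkpoint(200))  # bert-bahasa-small
```

On a machine with only a few hundred MB to spare, the SMALL BERT-Bahasa model is the only option, which is exactly the low-cost-deployment gap the release is aiming at.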

All comparison studies are in both README pages; comparisons for abstractive summarization and neural machine translation are on the way, and XLNET-Bahasa SMALL is currently training.

submitted by /u/huseinzol05