[News] Megatron-LM: NVIDIA trains 8.3B GPT-2 using model and data parallelism on 512 GPUs. SOTA in language modelling and SQUAD. Details awaited.
Unlike Open-AI, they have released the complete code for data processing, training, and evaluation.
Detailed writeup: https://nv-adlr.github.io/MegatronLM
Megatron is a large, powerful transformer. This repo is for ongoing research on training large, powerful transformer language models at scale. Currently, we support model-parallel, multinode training of GPT2 and BERT in mixed precision.Our codebase is capable of efficiently training a 72-layer, 8.3 Billion Parameter GPT2 Language model with 8-way model and 64-way data parallelism across 512 GPUs. We find that bigger language models are able to surpass current GPT2-1.5B wikitext perplexities in as little as 5 epochs of training.For BERT training our repository trains BERT Large on 64 V100 GPUs in 3 days. We achieved a final language modeling perplexity of 3.15 and SQuAD F1-score of 90.7.
Their submission is not in the leaderboard of SQuAD, but this exceeds the previous best single model performance (RoBERTa 89.8).
For language modelling they get zero-shot wikitext perplexity of 17.4 (8.3B model) better than 18.3 of transformer-xl (257M). However they claim it as SOTA when GPT-2 itself has 17.48 ppl, and another model has 16.4 (https://paperswithcode.com/sota/language-modelling-on-wikitext-103)
Sadly they haven’t mentioned anything about release of the model weights.