Test a Distilled GPT-2’s generative capabilities
At Hugging Face, we recently started distilling models, beginning with DistilBERT, a distilled version of BERT. We have now distilled the small version of GPT-2, which has the following characteristics:
- 81.9M parameters vs. 124M for GPT-2/small (66% of the parameters);
- 336MB on disk vs. 523MB for GPT-2/small (64% of the disk size);
- on both CPU and GPU, the average forward pass of DistilGPT-2 takes 51% of the time of GPT-2/small, i.e. it is roughly twice as fast (see the timing sketch after this list);
- the absolute increase in perplexity on WikiText-103 is 3.5 points (from 15.0 to 18.5).
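To reproduce a speed comparison like this on your own hardware, a minimal timing sketch with the transformers library might look like the following. The model identifiers `gpt2` and `distilgpt2` are the names on the Hugging Face model hub; the test sentence, run count, and helper name `mean_forward_ms` are illustrative choices, and exact timings will vary by machine:

```python
import time

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def mean_forward_ms(model_name, text, n_runs=50):
    """Average forward-pass time in milliseconds for a single input."""
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    model.eval()
    input_ids = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        model(input_ids)  # warm-up pass so one-time initialization is not timed
        start = time.perf_counter()
        for _ in range(n_runs):
            model(input_ids)
    return (time.perf_counter() - start) / n_runs * 1000

sentence = "Distillation makes transformer models smaller and faster."
for name in ("gpt2", "distilgpt2"):  # hub identifiers for the two models
    print(f"{name}: {mean_forward_ms(name, sentence):.1f} ms per forward pass")
```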
We have added it to our Write With Transformer app, as well as to our two repos, transformers (along with a tutorial on how to distill transformers and example scripts!) and swift-coreml-transformers. We have successfully run it on an iPhone 7, and on an iPhone X with the Neural Engine it is 38% faster than GPT-2.
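To try the distilled model’s generative capabilities directly, here is a minimal sampling sketch, again with the transformers library. The prompt and the decoding parameters (such as `top_k=50` and `max_length=40`) are illustrative choices, not tuned values from our experiments:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
model = GPT2LMHeadModel.from_pretrained("distilgpt2")
model.eval()

prompt = "Machine learning models are getting"  # illustrative prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_length=40,   # total length, prompt included
        do_sample=True,  # sample rather than decode greedily
        top_k=50,        # limit sampling to the 50 most likely tokens
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Swapping "distilgpt2" for "gpt2" lets you compare the two models’ generations side by side.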