[D] Distilling BERT — How to achieve BERT performance using Logistic Regression
A few days ago, in this post, I asked about ways to make BERT smaller. I got some interesting results and found some relevant papers. The basic idea is, given a relatively small labelled dataset and another, much bigger unlabelled set:
- Train BERT on the labelled set
- Predict labels for the unlabelled set
- Train a much smaller model on the now-labelled big set
I tried it with Logistic Regression and got some interesting results here:
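The three steps above can be sketched with scikit-learn. This is a minimal, hypothetical toy: synthetic data stands in for the real corpus, and a gradient-boosting classifier stands in for fine-tuned BERT as the teacher; only the overall pseudo-labelling pattern matches the post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical setup: a small labelled set plus a large unlabelled pool.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_small, X_pool, y_small, _ = train_test_split(X, y, train_size=500,
                                               random_state=0)

# 1. Train the "teacher" on the small labelled set
#    (a stand-in here for fine-tuning BERT).
teacher = GradientBoostingClassifier(random_state=0).fit(X_small, y_small)

# 2. Use the teacher to pseudo-label the unlabelled pool.
pseudo_labels = teacher.predict(X_pool)

# 3. Train the much smaller "student" (logistic regression)
#    on the teacher-labelled big set.
student = LogisticRegression(max_iter=1000).fit(X_pool, pseudo_labels)

agreement = (student.predict(X_pool) == pseudo_labels).mean()
print(f"student agrees with teacher on {agreement:.1%} of the pool")
```

In the real setting the student would be evaluated on a held-out labelled test set, not on agreement with the teacher.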
submitted by /u/sudo_su_