[D] Help with fine-tuning for text classification task
Testing out fine-tuning BERT and ULMFiT for text classification. I’ve followed various tutorials using fastai and PyTorch, but haven’t gotten good results at all – would love some input on whether my approach to this problem is reasonable.
My problem: take a short snippet of text – anywhere from 10 to 200 characters – and predict one of 2,510 categories, each representing a word (i.e. a one-hot target). I’m using CrossEntropyLoss, and neither BERT nor ULMFiT seems to do well after fine-tuning – I can’t even get BERT to start making non-naive predictions; it just ends up predicting the most common class. I suspect this is due to the large number of classes, but I’d thought there would be enough signal in the text for BERT to make a reasonable guess after seeing some examples.
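For concreteness, here’s a minimal sketch of the kind of setup I mean (the encoder is omitted and all names/dimensions are placeholders – just a classification head over a BERT-style 768-dim pooled output). One thing worth noting: PyTorch’s CrossEntropyLoss actually wants integer class indices, not one-hot vectors, since it applies log-softmax internally.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 2510  # number of word categories

# Hypothetical classification head on top of a 768-dim encoder output
# (e.g. BERT's pooled [CLS] representation); the encoder itself is omitted.
head = nn.Linear(768, NUM_CLASSES)
loss_fn = nn.CrossEntropyLoss()

# Fake batch: 4 pooled embeddings and their integer class labels.
# CrossEntropyLoss takes class *indices*, not one-hot vectors.
pooled = torch.randn(4, 768)
labels = torch.tensor([0, 17, 2509, 42])

logits = head(pooled)           # shape (4, 2510)
loss = loss_fn(logits, labels)  # scalar loss
```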
I have ~10MM samples in total, and I’ve been testing on ~500K examples in the hope of seeing some basic results first. Is that reasonable, or should I just throw all of the data at it and see what happens?
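For the subset, I’ve been drawing it stratified by label so the rarer of the 2,510 classes still show up at roughly their original frequency. Something like this sketch (toy data standing in for my real texts/labels, and assuming sklearn is available – stratification does require every class to appear at least a couple of times):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real dataset: texts plus integer labels.
texts = np.array([f"snippet {i}" for i in range(1000)])
labels = np.random.randint(0, 10, size=1000)  # stand-in for 2,510 classes

# Carve out a 500-example dev subset, stratified by label so class
# frequencies in the subset roughly match the full data.
_, dev_texts, _, dev_labels = train_test_split(
    texts, labels, test_size=500, stratify=labels, random_state=0
)
```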
My data is fairly messy as well, since it’s raw text – should I be cleaning it much? I haven’t found many recommendations on this front yet, and all of the SOTA numbers appear to be on clean text AFAIK. On a related note, I’m wondering whether misspellings are an issue.
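Right now the only cleaning I’m doing is deliberately light, on the assumption that BERT’s subword tokenizer copes with most of the mess and heavier cleaning could destroy signal – roughly this (function name and choices are just mine):

```python
import re
import unicodedata

def clean_text(s: str) -> str:
    """Light normalization for raw user text: Unicode-normalize,
    lowercase, and collapse whitespace. Deliberately conservative,
    leaving spelling and punctuation to the subword tokenizer."""
    s = unicodedata.normalize("NFKC", s)  # unify Unicode variants
    s = s.lower()
    s = re.sub(r"\s+", " ", s).strip()    # collapse runs of whitespace
    return s

print(clean_text("  Héllo\tWORLD\n"))  # -> "héllo world"
```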
Any comments/thoughts on the approach? I can post the code if that would help (can’t post the data, unfortunately), but it’s mostly copied from tutorials with a few modifications, so I don’t know how useful it would be.
Thanks in advance for the help.