[D] Over/Under/SMOTE sampling for EXTREMELY imbalanced data without getting more data?
I am working on a case study where I was given 3 text, 2 categorical, and 1 numerical feature to classify observations into 6 classes.
However, the data is very imbalanced. It splits like this:
Case_1: 5215/5899 = 88.4%
Case_2: 631/5899 = 10.7%
Case_3: 23/5899 = 0.39%
Case_4: 16/5899 = 0.27%
Case_5: 2/5899 = 0.03%
Case_6: 12/5899 = 0.2%
Case_5 is also left with only 1 observation in the training set after the train/test split.
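To put numbers on the imbalance, here is a minimal sketch in plain Python that recomputes each class's share from the counts above and derives inverse-frequency class weights. It assumes the "balanced" heuristic n_samples / (n_classes * class_count), which is what scikit-learn applies for class_weight="balanced". The variable names are illustrative, and the counts are for the full dataset, not the training split.

```python
# Class counts from the list above (full dataset, before any train/test split).
counts = {
    "Case_1": 5215,
    "Case_2": 631,
    "Case_3": 23,
    "Case_4": 16,
    "Case_5": 2,
    "Case_6": 12,
}

n_samples = sum(counts.values())
n_classes = len(counts)

# Share of the dataset taken up by each class.
shares = {c: n / n_samples for c, n in counts.items()}

# Assumed "balanced" heuristic: weight_c = n_samples / (n_classes * count_c).
class_weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}

for c in counts:
    print(f"{c}: share {shares[c]:.2%}, weight {class_weights[c]:.2f}")
```

Under this heuristic the Case_5 weight is about 2,600 times the Case_1 weight (5215 / 2), which shows how much a single Case_5 example would count for in a weighted loss.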
To me, it seems like oversampling the minority classes might result in serious overfitting, while undersampling the 5215-observation majority class might result in serious data loss. I don’t know what to do. I did try adding class weights to a logistic regression, but only got decent results:
Normalized confusion matrix (true positive percentage per class), with each rate compared against that class's share of the data:
Case_1: 96%, which is 1.08 times better
Case_2: 86%, which is 8.03 times better
Case_3: 100%, which is 256 times better
Case_4: 80%, which is 296 times better
Case_5: 0%, since there was only 1 example in the test data
Case_6: 75%, which is 375 times better
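The "times better" figures appear to be each class's true positive rate divided by its share of the data. That is an assumption, not something stated above, but it reproduces the listed numbers once the ratios are truncated. A minimal sketch under that assumption, using the rounded percentages from the class split (the integer keys 1 to 6 are just illustrative class indices):

```python
# Share of the data per class (%), rounded as in the class split above.
share_pct = {1: 88.4, 2: 10.7, 3: 0.39, 4: 0.27, 5: 0.03, 6: 0.2}

# True positive rate per class (%), from the normalized confusion matrix.
tpr_pct = {1: 96, 2: 86, 3: 100, 4: 80, 5: 0, 6: 75}

# Assumed definition of "times better": true positive rate / class share.
lift = {k: tpr_pct[k] / share_pct[k] for k in share_pct}

for k, v in lift.items():
    print(f"class {k}: {v:.2f}x its share of the data")
```

Using unrounded shares instead (for example 12/5899 and not 0.2%) shifts the ratios slightly, to roughly 369 for the last class.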