[D] Over/Under/SMOTE sampling for EXTREMELY imbalanced data without getting data?

Written by torontoai on November 18, 2019. Posted in Reddit MachineLearning.

I am working on a case study where I was given 3 text, 2 categorical, and 1 numerical feature to classify observations into 6 classes.

However, the data is very imbalanced. It splits like this:

Case_1: 5215/5899 = 88.4%

Case_2: 631/5899 = 10.7%

Case_3: 23/5899 = 0.39%

Case_4: 16/5899 = 0.27%

Case_5: 2/5899 = 0.03%

Case_6: 12/5899 = 0.2%

Case_5 is left with only 1 observation in the training set after the train/test split.
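The shares listed above can be reproduced from the raw counts. The following is a minimal sketch, not part of the original post; the names `counts` and `class_summary` are illustrative, and the imbalance ratio against the largest class is added only to make the scale of the problem concrete.

```python
# Class counts exactly as given in the post.
counts = {
    "Case_1": 5215,
    "Case_2": 631,
    "Case_3": 23,
    "Case_4": 16,
    "Case_5": 2,
    "Case_6": 12,
}


def class_summary(counts):
    """Return {class: (share in percent, largest class count / class count)}."""
    total = sum(counts.values())
    largest = max(counts.values())
    return {
        name: (100.0 * n / total, largest / n)
        for name, n in counts.items()
    }


summary = class_summary(counts)
for name, (share, ratio) in summary.items():
    # ratio = how many majority-class samples exist per sample of this class
    print(f"{name}: {share:6.2f}% of the data, imbalance ratio {ratio:8.1f}")
```

With only 2 observations in Case_5, any split leaves at most one of them on each side, which is why that class cannot be both learned and evaluated meaningfully.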

To me, it seems like oversampling the minority classes might result in serious overfitting, while undersampling the 5215-sample majority class might result in serious data loss. I don’t know what to do. I did apply class weights to logistic regression, but only got decent results:
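The weighting step mentioned above is assumed here to mean per-class weights in the logistic regression loss; the post does not say which scheme was used. One common heuristic, and the one scikit-learn documents for `class_weight="balanced"`, is `n_samples / (n_classes * class_count)`. A minimal sketch of that arithmetic on the posted counts (the function name `balanced_class_weights` is hypothetical):

```python
# Class counts exactly as given in the post.
counts = {
    "Case_1": 5215,
    "Case_2": 631,
    "Case_3": 23,
    "Case_4": 16,
    "Case_5": 2,
    "Case_6": 12,
}


def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count).

    Assumption: this mirrors the "balanced" heuristic; the original
    poster may have used different weights.
    """
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {name: n_samples / (n_classes * n) for name, n in counts.items()}


weights = balanced_class_weights(counts)
for name, w in weights.items():
    print(f"{name}: weight {w:.3f}")
```

Under this scheme every class contributes the same total weight to the loss, so a single Case_5 example counts as much as several hundred Case_1 examples. That avoids duplicating or discarding rows, but it does not add information about the rarest classes.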

Normalized confusion matrix (true positive rate per class):

Category_1: 96% which is 1.08 times better

Category_2: 86% which is 8.03 times better

Category_3: 100% which is 256 times better

Category_4: 80% which is 296 times better

Category_5: 0%, since there was only 1 example in the test data

Category_6: 75% which is 375 times better
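The post does not define "times better". The figures are consistent with dividing each class's true positive rate by that class's share of the data, i.e. the improvement over guessing in proportion to the class frequencies. The sketch below checks that reading; it assumes Category_n corresponds to Case_n and uses the rounded percentages from the post.

```python
# Class shares in percent, rounded as in the post (assumed Case_n == Category_n).
share_pct = {
    "Category_1": 88.4,
    "Category_2": 10.7,
    "Category_3": 0.39,
    "Category_4": 0.27,
    "Category_5": 0.03,
    "Category_6": 0.2,
}

# Per-class true positive rate in percent, as reported in the post.
recall_pct = {
    "Category_1": 96,
    "Category_2": 86,
    "Category_3": 100,
    "Category_4": 80,
    "Category_5": 0,
    "Category_6": 75,
}

# Lift = true positive rate / class share (interpretation, not stated in the post).
lift = {name: recall_pct[name] / share_pct[name] for name in share_pct}

for name, value in lift.items():
    print(f"{name}: recall {recall_pct[name]}% -> {value:.2f}x the class share")
```

Read this way, the large multipliers for the rare classes mostly reflect how small their shares are, and each rests on only a handful of test examples.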

submitted by /u/dattud