[D] Over/Under/SMOTE sampling for EXTREMELY imbalanced data without getting more data?

I am working on a case study where I was given 3 text, 2 categorical, and 1 numerical feature to classify observations into 6 classes.

However, the data is very imbalanced. It splits like this:

Case_1: 5215/5899 = 88.4%

Case_2: 631/5899 = 10.7%

Case_3: 23/5899 = 0.39%

Case_4: 16/5899 = 0.27%

Case_5: 2/5899 = 0.03%

Case_6: 12/5899 = 0.2%

Case_5 is left with only 1 observation in the training set after the train/test split.
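To make the scale of the problem concrete, here is a minimal sketch that recomputes the class shares above and checks which classes even have enough rows for SMOTE. The counts are from the post; the `smote_feasible` helper and the choice of `k = 5` neighbours are my assumptions (5 is the default in imbalanced-learn's `SMOTE`, as far as I know), and the check uses the full counts, so the training split can only be worse.

```python
# Class counts copied from the post (full data set, before any split).
counts = {
    "Case_1": 5215, "Case_2": 631, "Case_3": 23,
    "Case_4": 16, "Case_5": 2, "Case_6": 12,
}
total = sum(counts.values())

K_NEIGHBORS = 5  # assumed SMOTE setting, not something stated in the post


def smote_feasible(n_samples, k=K_NEIGHBORS):
    """Hypothetical helper: SMOTE interpolates between a sample and one of
    its k nearest same-class neighbours, so a class needs at least k + 1 rows."""
    return n_samples >= k + 1


for name, n in counts.items():
    share = 100 * n / total
    print(f"{name}: {n:>4} rows ({share:.2f}%)  "
          f"SMOTE with k={K_NEIGHBORS} possible: {smote_feasible(n)}")
```

Under these assumptions Case_5 cannot be synthesized from at all; with a single training row, any oversampler can only duplicate that row.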

To me, it seems like oversampling the minority classes might result in serious overfitting, while undersampling the 5215-observation majority class might result in serious data loss. I don’t know what to do. I did add class weights to a logistic regression, but only got decent results:
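For reference, one common way to set class weights is the "balanced" heuristic, `n_samples / (n_classes * class_count)`. The sketch below applies it to the counts from the post; whether this is the weighting actually used here is an assumption on my part, since the post does not say.

```python
# Class counts copied from the post.
counts = {
    "Case_1": 5215, "Case_2": 631, "Case_3": 23,
    "Case_4": 16, "Case_5": 2, "Case_6": 12,
}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: each class contributes the same total weight,
# so rare classes get proportionally larger per-sample weights.
weights = {name: n_samples / (n_classes * n) for name, n in counts.items()}

for name, w in weights.items():
    print(f"{name}: weight {w:.3f}")
```

If scikit-learn is the library in use, a dict like this can be passed as `class_weight=weights` to `LogisticRegression`, and `class_weight="balanced"` applies the same formula to the training labels. Note how large the Case_5 weight gets: a single row ends up counting as much as hundreds of Case_1 rows, which is its own overfitting risk.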

Normalized confusion matrix (per-class true positive rates, with the improvement over each class's share of the data):

Category_1: 96% which is 1.08 times better

Category_2: 86% which is 8.03 times better

Category_3: 100% which is 256 times better

Category_4: 80% which is 296 times better

Category_5: 0% since there was only 1 example in the test data

Category_6: 75% which is 375 times better
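The "times better" figures appear to be each class's true positive rate divided by that class's share of the data. A small sketch of that arithmetic, assuming Category_N in the results corresponds to Case_N in the counts (the post does not say so explicitly):

```python
# Counts and true positive rates copied from the post.
counts = {
    "Case_1": 5215, "Case_2": 631, "Case_3": 23,
    "Case_4": 16, "Case_5": 2, "Case_6": 12,
}
recall = {
    "Case_1": 0.96, "Case_2": 0.86, "Case_3": 1.00,
    "Case_4": 0.80, "Case_5": 0.00, "Case_6": 0.75,
}

total = sum(counts.values())

# Lift = recall / base rate, i.e. how much better than guessing the class
# with probability equal to its share of the data.
lift = {name: recall[name] / (counts[name] / total) for name in counts}

for name, value in lift.items():
    print(f"{name}: recall {recall[name]:.0%}, lift {value:.2f}x")
```

With unrounded shares the results come out slightly different from the numbers quoted above (for example about 369 rather than 375 for Category_6), which suggests the post divided by the rounded percentages. Keep in mind that with 12 to 23 rows per class in total, each of these rates rests on only a handful of test examples.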

submitted by /u/dattud


Toronto AI is a social and collaborative hub to unite AI innovators of Toronto and surrounding areas. We explore AI technologies in digital art and music, healthcare, marketing, fintech, VR, robotics and more. Toronto AI was founded by Dave MacDonald and Patrick O'Mara.