[D] Over/Under/SMOTE sampling for EXTREMELY imbalanced data without getting more data?

I am working on a case study where I was given 3 text, 2 categorical, and 1 numerical feature to classify observations into 6 classes.

However, the data is very imbalanced. It splits like this:

Case_1: 5215/5899 = 88.4%

Case_2: 631/5899 = 10.7%

Case_3: 23/5899 = 0.39%

Case_4: 16/5899 = 0.27%

Case_5: 2/5899 = 0.03%

Case_6: 12/5899 = 0.2%

Case_5 is left with only 1 observation in the training set after the train/test split.
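To put the imbalance in numbers, here is a minimal sketch that turns the counts above into inverse-frequency class weights. It uses the formula `n_samples / (n_classes * class_count)`, which is the same heuristic scikit-learn applies for `class_weight="balanced"`. The function name `balanced_class_weights` is just an illustrative helper, not something from the case study.

```python
# Class counts copied from the post (full dataset, before any split).
counts = {
    "Case_1": 5215,
    "Case_2": 631,
    "Case_3": 23,
    "Case_4": 16,
    "Case_5": 2,
    "Case_6": 12,
}


def balanced_class_weights(counts):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


weights = balanced_class_weights(counts)
total = sum(counts.values())
for label, w in weights.items():
    print(f"{label}: share={counts[label] / total:.2%}, weight={w:.3f}")
```

With these counts, a single Case_5 row ends up weighted about 2600 times more heavily than a single Case_1 row (5215 / 2), which is one way to see why a model can latch onto the one or two rare examples it has.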

To me, it seems like oversampling the minority classes might result in serious overfitting, while undersampling the 5215 majority-class rows might result in serious data loss. I don't know what to do. I did add class weights to a logistic regression, but only got decent results:
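For reference, the two resampling options weighed above can be sketched in plain Python. This is only a rough illustration: the thresholds `min_count=50` and `max_count=1000` are made-up values, `resample_indices` is a hypothetical helper, and the oversampling here duplicates existing rows rather than generating synthetic points the way SMOTE does.

```python
import random
from collections import Counter, defaultdict


def resample_indices(labels, min_count, max_count, seed=0):
    """Return row indices with each class size clipped to [min_count, max_count].

    Classes larger than max_count are randomly undersampled (no replacement).
    Classes smaller than min_count are randomly oversampled with replacement,
    i.e. exact duplicates of existing rows.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    chosen = []
    for y, idx in by_class.items():
        if len(idx) > max_count:
            idx = rng.sample(idx, max_count)
        elif len(idx) < min_count:
            idx = idx + rng.choices(idx, k=min_count - len(idx))
        chosen.extend(idx)
    rng.shuffle(chosen)
    return chosen


# Toy labels with the post's overall class counts. In practice this would be
# applied to the training split only, so the test set keeps its real distribution.
labels = (["Case_1"] * 5215 + ["Case_2"] * 631 + ["Case_3"] * 23
          + ["Case_4"] * 16 + ["Case_5"] * 2 + ["Case_6"] * 12)

idx = resample_indices(labels, min_count=50, max_count=1000)
print(Counter(labels[i] for i in idx))
```

Note that the 50 Case_5 rows produced this way are still copies of the same one or two originals, so the overfitting worry stands. SMOTE does not get around this either, since it interpolates between a sample and its same-class neighbours and so needs more than one sample per class to work with.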

Normalized confusion matrix (true positive rate per class):

Category_1: 96% which is 1.08 times better

Category_2: 86% which is 8.03 times better

Category_3: 100% which is 256 times better

Category_4: 80% which is 296 times better

Category_5: 0% since there was only 1 example in the test data

Category_6: 75% which is 375 times better
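The "times better" figures appear to be each class's true positive rate divided by that class's share of the data. A small sketch of that calculation, assuming Category_k corresponds to Case_k and using the unrounded shares from the counts given earlier:

```python
# Counts per class (Case_1..Case_6) and reported true positive rates
# (Category_1..Category_6), both copied from the post.
counts = [5215, 631, 23, 16, 2, 12]
recall = [0.96, 0.86, 1.00, 0.80, 0.00, 0.75]

total = sum(counts)
lifts = []
for k, (n, r) in enumerate(zip(counts, recall), start=1):
    prior = n / total   # share of the data belonging to this class
    lift = r / prior    # how many times the true positive rate exceeds that share
    lifts.append(lift)
    print(f"Category_{k}: recall={r:.0%}, prior={prior:.2%}, lift={lift:.2f}x")
```

Because the percentages listed in the post are rounded, the ratios computed from raw counts come out slightly different for the rarest classes (for example roughly 295 rather than 296 for Category_4, and roughly 369 rather than 375 for Category_6).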

submitted by /u/dattud