[D] Handling noisy labels in large datasets with slight imbalance
Hey all, I have a large binary classification dataset with the following class counts: 0: 200k, 1: 500k.
Now, the labels are not ground truth. I know the fraction of correct labels in each label group: 80% of the 0s are correct and 85% of the 1s are correct, but I don't know which individual examples are the mislabeled ones.
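To put rough numbers on that, here is a quick back-of-the-envelope sketch. It assumes "80% of 0 are correct" means that 80% of the rows *labeled* 0 are truly 0 (and likewise for 1); the variable names are just for illustration.

```python
# Observed label counts and per-group label accuracy (in percent), from the post.
observed = {0: 200_000, 1: 500_000}
pct_correct = {0: 80, 1: 85}

# Assumption: the percentages describe rows carrying that observed label.
# Integer arithmetic keeps the counts exact.
wrong = {k: observed[k] * (100 - pct_correct[k]) // 100 for k in observed}

# In a binary problem, a wrongly labeled 0 is really a 1, and vice versa.
est_true = {
    0: observed[0] - wrong[0] + wrong[1],
    1: observed[1] - wrong[1] + wrong[0],
}

# Share of all rows whose label is expected to be wrong.
overall_noise = sum(wrong.values()) / sum(observed.values())

print(wrong)          # {0: 40000, 1: 75000}
print(est_true)       # {0: 235000, 1: 465000}
print(f"{overall_noise:.1%}")  # 16.4%
```

Under that reading, roughly 115k of the 700k labels are wrong, and the true class split is probably closer to 235k vs 465k than the observed 200k vs 500k.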
Here is what I have tried so far:
- Random forest with class weights: it massively overfits. I played around with the max_depth parameter to reduce the overfitting, but I am still unable to get good results.
- Oversampling methods such as SMOTE: they take a large amount of time on a dataset this size.
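For a sense of how much reweighting the class weights actually do here, this is a small sketch of the common "balanced" heuristic, weight = n_samples / (n_classes * class_count). That this is the exact weighting used in the runs above is an assumption; it is the formula scikit-learn documents for `class_weight="balanced"`.

```python
# Observed class counts from the post.
counts = {0: 200_000, 1: 500_000}
n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic (assumed): weight_c = n_samples / (n_classes * count_c).
class_weight = {c: n_samples / (n_classes * n) for c, n in counts.items()}

print(class_weight)  # {0: 1.75, 1: 0.7}
```

So each class-0 row counts 2.5 times as much as a class-1 row. Note that the weights are computed from the observed (noisy) counts, so mislabeled rows in the minority class get upweighted as well.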
Do you have any suggestions on how to deal with the imbalance and the noisy labels?