[D] Best approaches for semi-supervised learning and learning with low-quality labels.
I am starting work in an area where data is abundant but labelling is almost impossible. The only viable labelling approach I can employ is to use manually defined thresholds to generate labels algorithmically. I think I will have to look into techniques that try to achieve superhuman accuracy, since my model should outperform the labelling algorithm. Does anyone have any idea of what to do?
I have looked into pseudo-labelling, but I am not sure how useful it will be in this context. It deals with the case of a small amount of well-labelled data and a large amount of unlabelled data. My case is a large amount of badly labelled data.
EDIT: Some extra information
The number of classes is up to me to decide. It is a regression problem where the target is discretized according to manually defined thresholds (ordinal classification). By algorithmically labelling I mean that I can apply some mathematical manipulation to the data and generate labels. These labels won’t be very accurate, mainly because the math formulas take into account neither all the features nor the correlations among features. By badly labelled I mean that some of the labels are simply wrong.
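To make the labelling setup concrete, here is a minimal sketch of what I mean by threshold-based labelling. Everything in it is made up for illustration: the function name `threshold_labels`, the heuristic score, its weights, and the threshold values are all placeholders, not my real formulas.

```python
import numpy as np

def threshold_labels(scores, thresholds):
    """Map continuous scores to ordinal class indices 0..len(thresholds).

    Class k is assigned when thresholds[k-1] <= score < thresholds[k].
    """
    thresholds = np.asarray(thresholds, dtype=float)
    if np.any(np.diff(thresholds) <= 0):
        raise ValueError("thresholds must be strictly increasing")
    return np.digitize(scores, thresholds)

# Hypothetical data: 6 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))

# Hypothetical heuristic score: it uses only two of the three features and
# ignores their correlations, which is why the resulting labels are noisy.
score = 0.7 * X[:, 0] + 0.3 * X[:, 1]

# Two thresholds give three ordinal classes (0, 1, 2).
labels = threshold_labels(score, thresholds=[-0.5, 0.5])
```

Since the thresholds are mine to choose, the number of classes is just `len(thresholds) + 1`.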