[D] Techniques to sample unbalanced multi-label datasets?
I have a dataset with multi-label classification outputs. These are (fortunately) binary, so a typical output is basically a binary vector like [1, 0, 0, 1, 1]. The length of this vector is fixed.
The dataset is heavily biased towards all 0’s. Because this is a multi-label output (not a one-hot encoding), I’m not sure what the best way is to undersample or oversample the dataset for my training epochs: traditional stratification assumes each sample belongs to exactly one class, so it cannot be applied directly here.
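To make the setup concrete, here is a minimal sketch of one naive oversampling baseline: weight each sample by the rarity of its rarest positive label, then draw each epoch's indices with those probabilities. The function name `rarity_sampling_probs`, the smoothing term `eps`, and the toy label matrix are all hypothetical and only for illustration; this is one possible starting point under those assumptions, not an established recipe.

```python
import numpy as np

def rarity_sampling_probs(Y, eps=1.0):
    """Hypothetical helper: per-sample sampling probabilities for a
    binary label matrix Y of shape (n_samples, n_labels)."""
    Y = np.asarray(Y, dtype=float)
    n_samples = Y.shape[0]
    # Smoothed fraction of samples in which each label is active.
    pos_freq = (Y.sum(axis=0) + eps) / (n_samples + eps)
    # Rare labels get a large weight; every weight is >= 1.
    label_weight = 1.0 / pos_freq
    # A sample is as important as its rarest active label.
    sample_weight = (Y * label_weight).max(axis=1)
    # All-zero rows have no active label, so give them the baseline weight 1.
    sample_weight = np.maximum(sample_weight, 1.0)
    return sample_weight / sample_weight.sum()

# Toy data (assumed): mostly all-zero label vectors of fixed length 5.
Y = np.array([
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
])

probs = rarity_sampling_probs(Y)

# Draw one epoch's worth of indices, with replacement, using the weights.
rng = np.random.default_rng(0)
epoch_indices = rng.choice(len(Y), size=len(Y), replace=True, p=probs)
```

The `max` over active labels is a design choice: a sum would instead favour samples with many active labels, which may or may not be what you want. The same probabilities could also be passed as weights to a framework's weighted sampler instead of `rng.choice`.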
Edit: All kinds of suggestions are welcome, be it simple beginner methods or ICLR papers 😉
submitted by /u/parekhnish