[D] Why softmax+CE over sigmoid+BCE?
Most popular neural network language models are trained with softmax + cross-entropy loss, which assumes that exactly one token (the target) is correct and every other token is wrong. But since many different tokens can plausibly follow a given context, isn't language modeling really a multi-label classification task? Why isn't sigmoid + BCE used more often?
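To make the contrast concrete, here's a minimal PyTorch sketch of the two objectives (the vocab size, logits, and targets are toy values, not from any real model):

```python
import torch
import torch.nn.functional as F

# Toy setup: 2 positions in a batch, vocabulary of 5 tokens.
logits = torch.randn(2, 5)
targets = torch.tensor([1, 3])  # exactly one "true" next token per position

# Softmax + cross-entropy: classes compete for probability mass;
# the loss assumes a single correct class per position.
ce_loss = F.cross_entropy(logits, targets)

# Sigmoid + BCE: each vocabulary entry is an independent yes/no decision,
# so in principle multiple next tokens could be labeled "true" at once.
multi_hot = F.one_hot(targets, num_classes=5).float()
bce_loss = F.binary_cross_entropy_with_logits(logits, multi_hot)

print(ce_loss.item(), bce_loss.item())
```

With standard one-hot targets the two look similar, but the BCE version is the one that would naturally extend to marking several plausible continuations as positive.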
submitted by /u/DeMorrr