[D] Is there any reason besides theory not to use binary cross-entropy for each class in a multi-class classification problem?
I know it’s technically wrong, but I’m interested in whether anyone has seriously investigated how the outcomes differ between the two.
Categorical cross-entropy is the correct loss for multinomial problems because it is exactly the negative log-likelihood of the multinomial (categorical) distribution. Minimizing it is equivalent to minimizing the Kullback-Leibler divergence between the true and predicted distributions over the classes.
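To spell out the KL connection, here is the standard decomposition. The notation is my own: p is the true distribution, q the predicted one, K the number of classes, and c the true class.

```latex
H(p, q) = -\sum_{k=1}^{K} p_k \log q_k = H(p) + D_{\mathrm{KL}}(p \,\|\, q)
% With a one-hot target (p_c = 1, all other p_k = 0), H(p) = 0, so
H(p, q) = D_{\mathrm{KL}}(p \,\|\, q) = -\log q_c
```

Since H(p) does not depend on the model, minimizing the cross-entropy and minimizing the KL divergence give the same optimum.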
But something irks me. During backpropagation, the gradient of the loss with respect to the predicted probabilities is nonzero at only one terminal node, the one for the true class. This seems suboptimal to me, especially as the number of classes grows.
Consider instead putting a sigmoid on each terminal node and computing the binary cross-entropy against the one-hot target at each one. Then every terminal node receives gradient information, which might give a more solid update signal. Plus, it’s not exactly a stretch to consider each class label as independent a priori.
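To make the comparison concrete, here is a minimal pure-Python sketch of the gradient each loss sends back to the logits. The logits are made up and the function names are mine, not from any library. One thing worth noticing: with respect to the logits (rather than the output probabilities), the softmax cross-entropy gradient is softmax(z) - y, so it reaches every node too; the per-class version gives sigmoid(z) - y instead.

```python
import math

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def grad_softmax_ce(z, target):
    """Gradient of -log softmax(z)[target] with respect to the logits z."""
    p = softmax(z)
    return [p_k - (1.0 if k == target else 0.0) for k, p_k in enumerate(p)]

def grad_sigmoid_bce(z, target):
    """Gradient of the summed per-class BCE (one-hot target) w.r.t. the logits z."""
    return [sigmoid(v) - (1.0 if k == target else 0.0) for k, v in enumerate(z)]

# Hypothetical 4-class example; the values are made up for illustration.
logits = [2.0, 0.5, -1.0, 0.0]
true_class = 1

g_ce = grad_softmax_ce(logits, true_class)
g_bce = grad_sigmoid_bce(logits, true_class)

print("softmax CE :", [round(g, 3) for g in g_ce])
print("sigmoid BCE:", [round(g, 3) for g in g_bce])
```

In both cases the true class gets a negative gradient and every other class a positive one. The difference is that the softmax gradients are coupled (they sum to zero), while each sigmoid gradient depends only on its own logit.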
I liken this to the difference between multinomial and one-vs-rest (OVR) logistic regression. In OVR logistic regression, the predicted label is simply the arg max over the per-class predicted probabilities. The same procedure can be applied to neural networks almost trivially.
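As a rough sketch of that prediction step for a network with one sigmoid per class (pure Python; `predict_ovr` and `predict_softmax` are hypothetical helper names, not library functions):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def predict_ovr(logits):
    """Index of the class whose independent sigmoid probability is largest."""
    probs = [sigmoid(v) for v in logits]
    return max(range(len(probs)), key=probs.__getitem__)

def predict_softmax(logits):
    """Index of the largest softmax probability (same as the largest logit)."""
    return max(range(len(logits)), key=logits.__getitem__)

# Made-up logits for a 4-class example.
logits = [2.0, 0.5, -1.0, 0.0]
print(predict_ovr(logits), predict_softmax(logits))
```

Because both sigmoid and softmax are monotonic in the logits, the two rules pick the same class for any fixed set of logits. Any difference in accuracy would have to come from how the two losses shape the logits during training, not from the prediction rule itself.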
For those unfamiliar, there is an example in the scikit-learn documentation that shows the difference between the two graphically. On the synthetic dataset in that example, OVR logistic regression performs worse. But is that often the case? I’m not sure.