[D] Per-channel or per-sample loss calculation and averaging in a batch?

Let’s say we have an N-class semantic segmentation problem. On each iteration (for each batch) we can calculate the Dice loss in two ways: (1) average the loss over classes for each sample in the batch, then average over the batch, or (2) calculate the average loss per class across the batch, then average over the classes present in the batch. Which one is better, and why? Or is there no difference at all? Can it affect how the model learns to segment small or large objects? Any related articles?
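To make the two options concrete, here is a minimal NumPy sketch of how I picture them. The function names, the `(B, C, H, W)` layout, the one-hot targets, and the `eps` smoothing term are my own assumptions for illustration. In particular, option (2) is read here as pooling the intersection and sums over every pixel in the batch before forming the per-class Dice score, which is only one possible interpretation.

```python
import numpy as np


def dice_loss_per_sample(probs, target, eps=1e-6):
    """Option (1): Dice per sample and class, mean over classes, then over batch.

    probs, target: arrays of shape (B, C, H, W); target is assumed one-hot.
    """
    inter = (probs * target).sum(axis=(2, 3))                  # (B, C)
    denom = probs.sum(axis=(2, 3)) + target.sum(axis=(2, 3))   # (B, C)
    dice = (2.0 * inter + eps) / (denom + eps)                 # (B, C)
    # Average over classes within each sample, then over the batch.
    return float((1.0 - dice).mean(axis=1).mean())


def dice_loss_per_class(probs, target, eps=1e-6):
    """Option (2), assumed reading: Dice per class pooled over the whole batch,
    then mean over the classes that occur in the batch targets."""
    inter = (probs * target).sum(axis=(0, 2, 3))               # (C,)
    target_sum = target.sum(axis=(0, 2, 3))                    # (C,)
    denom = probs.sum(axis=(0, 2, 3)) + target_sum             # (C,)
    dice = (2.0 * inter + eps) / (denom + eps)                 # (C,)
    present = target_sum > 0                                   # classes in this batch
    if not present.any():
        return 0.0
    return float((1.0 - dice[present]).mean())
```

In this sketch, a tiny object that is missed entirely in one image costs that image almost a full unit of loss for its class under option (1), while under option (2) its pixels are pooled with larger objects of the same class from other images in the batch, so the miss barely moves the loss.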

