[D] Are reported NLL scores in papers an average across datapoints?
Papers commonly report NLL scores, such as a value of around 3 for PixelCNN. I believe this is in bits per dimension,
but is it:
a) an average across all the datapoints in the test set, or
b) a sum across all datapoints, or
c) the best score on an individual datapoint?
Or maybe my question makes no sense.
Explaining further: in the case of PixelCNN, "datapoint" = image, so I believe the log-likelihood of a trained model can be evaluated by summing the logs of the conditional probabilities of each pixel (conditioned on the neighborhood in the PixelCNN scheme), plus the log of the marginal probability for the first pixel. This gives the overall LL for a single image from the test set (the NLL is just its negative), but what about the other images?
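
To make the question concrete, here is a minimal sketch of the computation I have in mind, with the three candidate aggregations side by side. Everything in it is an assumption for illustration: the toy "test set" is made up, each image is represented only by the list of conditional probabilities a model assigned to its true pixel values, and the function names are hypothetical rather than taken from any PixelCNN code.

```python
import math

def image_nll_bits(cond_probs):
    """NLL of one image in bits: negative sum of log2 of each pixel's
    conditional probability (the first entry plays the role of the marginal)."""
    return -sum(math.log2(p) for p in cond_probs)

def bits_per_dim(cond_probs):
    """Per-image NLL divided by the number of dimensions (pixels/subpixels)."""
    return image_nll_bits(cond_probs) / len(cond_probs)

# Hypothetical toy test set: 2 "images" with 4 dimensions each.
# Each number is the probability the model gave to the true value.
test_set = [
    [0.5, 0.25, 0.5, 0.25],
    [0.125, 0.125, 0.25, 0.25],
]

per_image = [bits_per_dim(img) for img in test_set]

option_a = sum(per_image) / len(per_image)  # (a) average across the test set
option_b = sum(per_image)                   # (b) sum across the test set
option_c = min(per_image)                   # (c) best single image

print(per_image, option_a, option_b, option_c)
```

In this toy case (a) stays on the same scale as the per-image values, while (b) grows with the size of the test set and (c) depends on a single image, which is why I suspect (a) but would like confirmation.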