[D] Do you report the best test accuracy, or the last test accuracy?
Granted on most modern datasets like CIFAR10/100/ImageNet we are cheating no matter what because we use the test set as a validation set. However it’s important to compare apple to apple. I’ve always reported last test accuracy, but I’m seeing more and more papers report the best one, which gives them a non-negligible boost.
Best: In practice if we don’t have a test set we would typically deploy the model that has the best validation accuracy. So here we report “test accuracy” but we mean “validation accuracy of the model we would deploy”.
Last: Makes it harder to overfit the validation/test set, and is arguably closer to the real generalization accuracy you would get.
Is there a consensus on best practice?