[D] Is finetuning on part of the evaluation dataset acceptable for publishing machine learning papers?
I have been trying to reproduce the results of a SOTA object detection paper. I reimplemented their method based on the paper and trained on the same dataset, but no matter what I tried, I could not match their reported results on the datasets they use for evaluation.
Then I also studied the papers they reference and realised that many of them use a train-test split strategy for evaluating their models: they finetune their already-trained model on part of the evaluation dataset and then evaluate it on the remaining test part of that same dataset. Those papers state this explicitly. I suspect the same thing happened in the paper I tried to reproduce, but they don't mention it.
My question for discussion: what do you think about this strategy? Is finetuning on part of the evaluation dataset acceptable? And what about generalisation to totally unseen data? In my opinion it is fine if explicitly mentioned, but totally uncool otherwise.
EDIT: Just a clarification so we're on the same page. By train, test and validation sets I mean one big dataset split into those three subsets.
By evaluation dataset I mean a benchmark dataset that researchers use to report their results on a specific task. So finetuning on part of the evaluation dataset means retraining on one part of the benchmark dataset and then reporting results on the rest of it, which was not seen during finetuning.
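For concreteness, here is a minimal sketch of the protocol I'm describing (plain Python, hypothetical function names; the actual finetuning step is out of scope):

```python
import random

def split_benchmark(n_samples, finetune_frac=0.5, seed=0):
    """Split benchmark sample indices into a finetuning part and a
    held-out test part, which is the protocol described above."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    cut = int(n_samples * finetune_frac)
    return indices[:cut], indices[cut:]

# The already-trained model is finetuned only on `finetune_idx`;
# results are then reported only on `test_idx`, which the model
# never sees during finetuning.
finetune_idx, test_idx = split_benchmark(1000, finetune_frac=0.5)
assert set(finetune_idx).isdisjoint(test_idx)
```

The two parts are disjoint, so the reported numbers are technically on unseen samples, but the model has still adapted to that benchmark's distribution, which is exactly what makes the comparison with non-finetuned results unfair if undisclosed.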