[D] How to perform significance testing on experiments with multiple random seeds?
This is a question that isn't really touched upon in most deep learning research, so I thought I would reach out to the community for advice.
With neural networks, it is sensible practice to run the same experiment multiple times with different seeds, reporting the mean and standard deviation. This helps factor out the effects that random initialization has on the model.
However, how do you then perform significance testing on the results? For example: you have systems A and B, you run each of them with 10 different seeds over a dataset of size N, calculate a metric for each run (e.g., accuracy/F-score), and report the mean and std over the runs. You now want to determine whether system A is significantly better than system B or not.
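For concreteness, here's a minimal sketch of the setup I'm describing, with made-up per-seed scores. One naive option is to treat each seed's metric as a single sample and run Welch's t-test over the 10 scores per system (the sample size is then the number of seeds, not N):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-seed accuracies for systems A and B (10 seeds each).
scores_a = rng.normal(loc=0.85, scale=0.01, size=10)
scores_b = rng.normal(loc=0.84, scale=0.01, size=10)

print(f"A: mean={scores_a.mean():.4f}, std={scores_a.std(ddof=1):.4f}")
print(f"B: mean={scores_b.mean():.4f}, std={scores_b.std(ddof=1):.4f}")

# Welch's t-test over the per-seed metrics: n = 10 per system here,
# regardless of the size N of the underlying test set.
t_stat, p_value = stats.ttest_ind(scores_a, scores_b, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

But note this only captures variance across seeds, not variance across test examples, which is part of what makes the question tricky.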
Picking a single seed seems random and arbitrary. Pooling the predictions from all seeds makes the significance test treat the test set as 10 times bigger than it actually is, which also doesn't sound right. Pairing outputs only between runs that share a seed seems just as arbitrary, since the seeds are supposed to be interchangeable and a seed means nothing across two different systems.
Do you have any suggestions on how to handle this? Or have you seen any papers attempting to solve this problem? Thanks!