[D] How should I statistically compare the performance of deep reinforcement models?
The environment I used is a card game that has some randomness when the cards are drawn.
I trained with 10 different seeds, so I have 10 trained models for my algorithm, 10 for the DQN baseline, etc.
For this study, I’m measuring performance based on the rewards from running the trained models in the environment.
- How should I test the difference between the average reward of my algorithm and that of the baselines?
For example, should I run each DQN model for 10,000 episodes, then compute the mean and SD over the pooled 100,000 episodes (10,000 episodes from each of the 10 DQN models)? After doing this for every model group (my own algorithm, the other baselines, etc.), should I compare the rewards with Welch's t-test?
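To make this concrete, here is a sketch of one version of what I mean, using made-up reward numbers and SciPy's Welch variant of the t-test. Here each trained model is summarized by one mean score (so the test's sample size is the 10 training seeds), which is an alternative to pooling all 100,000 episodes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder per-episode rewards: 10 trained models x 10,000 eval episodes each.
# rewards_mine[i, j] = reward of episode j under the i-th seed's model.
rewards_mine = rng.normal(loc=1.2, scale=0.5, size=(10, 10_000))
rewards_dqn = rng.normal(loc=1.0, scale=0.5, size=(10, 10_000))

# One summary score per trained model (i.e. per training seed).
scores_mine = rewards_mine.mean(axis=1)  # shape (10,)
scores_dqn = rewards_dqn.mean(axis=1)

# equal_var=False makes ttest_ind perform Welch's t-test.
t_stat, p_value = stats.ttest_ind(scores_mine, scores_dqn, equal_var=False)
print(f"mine: {scores_mine.mean():.3f} +/- {scores_mine.std(ddof=1):.3f}")
print(f"DQN:  {scores_dqn.mean():.3f} +/- {scores_dqn.std(ddof=1):.3f}")
print(f"Welch t = {t_stat:.3f}, p = {p_value:.4f}")
```

(The numbers and the per-seed summary are just illustrative; my actual question is which aggregation is the right one.)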
- Also, should I use the same random seed (different from the training seeds, obviously) for the environment when testing the 10 different runs, or a different one each time? For example, testing all 10 trained DQN models with seed 1 for all 100,000 episodes vs. using seeds 1 to 10 for 10,000 episodes each.
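Here is a toy illustration of the two seeding options (`deal_cards` is a made-up stand-in for my environment's shuffle, not real code from my setup):

```python
import numpy as np

def deal_cards(env_seed: int) -> list:
    """Stand-in for the environment's shuffle: the order cards get drawn in."""
    rng = np.random.default_rng(env_seed)
    return list(rng.permutation(52))

# Option 1: the same evaluation seed for every model -> every model faces the
# exact same card sequences, so differences in score come from the policies,
# not from deal luck (a paired comparison).
assert deal_cards(1) == deal_cards(1)

# Option 2: a different seed per run -> independent deals, closer to how the
# game is actually played, but with extra variance between runs.
assert deal_cards(1) != deal_cards(2)
```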
Any advice would be helpful. Thanks for reading, and I apologize if any part of this sounds confusing. English is not my first language.