[D] Should I over-sample rare episodes with successful exploration?
I am using deep RL (mostly policy gradients) in a simulated, discrete Sokoban-style environment.
Alex-the-agent is rewarded for finding the shortest possible solution, and trains on progressively harder, more intricate maps. After a while, exploration becomes very difficult: it takes millions of attempts before an episode completes with a slightly better score. To be clear, this is not a performance plateau; it just takes excessively long exploration to find each improvement.
Should I be “over-sampling” these increasingly rare successful episodes when updating the policy?
I use policy gradients since they work for me, but I am open to trying value-based techniques.
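To make the question concrete, here is roughly what I mean by "over-sampling": a toy buffer that keeps the top-K highest-return episodes and replays them alongside fresh rollouts (similar in spirit to self-imitation learning). All names here are made up for illustration, not from any library:

```python
import random

class SuccessReplay:
    """Toy buffer keeping the top-K successful episodes, so rare
    high-return trajectories can be replayed in later PG batches.
    Illustrative sketch only; names are hypothetical."""

    def __init__(self, capacity=100):
        self.capacity = capacity
        self.episodes = []  # list of (episode_return, trajectory) pairs

    def add(self, trajectory, episode_return):
        self.episodes.append((episode_return, trajectory))
        # keep only the highest-return episodes, up to capacity
        self.episodes.sort(key=lambda e: e[0], reverse=True)
        del self.episodes[self.capacity:]

    def sample(self, k):
        # over-sample: draw k stored successes uniformly, with replacement
        if not self.episodes:
            return []
        return [random.choice(self.episodes)[1] for _ in range(k)]

# usage: mix a few replayed successes into each policy-gradient batch
buf = SuccessReplay(capacity=50)
buf.add(trajectory=["up", "up", "left"], episode_return=12.0)
buf.add(trajectory=["up", "left"], episode_return=15.0)
batch = buf.sample(4)  # 4 replayed successful trajectories
```

My worry is that replayed trajectories are off-policy for vanilla PG, so naively over-sampling them biases the gradient; presumably you'd need importance weighting or a separate imitation-style loss on the replayed data.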