[D] MCTS on raw network not trained with MCTS
In the AlphaGo Zero paper figure 6b shows the performance of a raw network which directly takes the action with the highest q-value(?) versus an MCTS approach which gets 5 seconds of thinking time. The MCTS approach has a large performance gain over the raw network approach.
Now I have trained a network with a policy and value head that uses the first approach and does not have a tree structure with accompanying data (such as times visited per node). I’m wondering if I can skip training using MCTS but just use the network to build a tree in the simulation phase and if there’s any precedent for this technique.
The problem is a deterministic RL problem with only one goal state and no other rewards. The same state can be reached twice and this often happens when I use the raw network approach. The agent then gets stuck in a loop. In a previous post, someone suggested taking the next best option once a certain state is reached more than once. This worked like a charm. But for real-world application, I would like to keep the number of actions taken as low as possible. This is why I think MCTS mightbe an improvement.