[D] Avoiding revisiting paths in MCTS, while maintaining a good policy?
I’ve been stumped on this problem for a week or so.
I realized that with MCTS, if you don’t keep track of which subtrees have been fully saturated (i.e. you’ve expanded it completely, down to every leaf), you can end up revisiting the same paths over and over again, when you really should be using your time exploring other areas of the tree.
However, at the same time, if you naively implement behavior that says “During the selection phase, ignore any fully-saturated nodes”, then the metric “number of visits” is no longer a good indicator of which child node is the best.
E.x., you have two children nodes, one leads to victory 80% of the time, the other leads to victory 40% of the time. But the one that wins 80% of the time only has 200 nodes in its entire subtree, for a maximum of 200 unique visits. The one that wins only 40% has a huge subtree, so its “visits” count becomes very large.
At the end of MCTS, you’ll see the “wins/plays” for the smaller tree will be better, but since in general MCTS will select “visits” over “wins”, you will end up selecting the wrong node.
I’ve been trying to crack this myself for some time as a matter of pride, but I’m deciding to reach out to see if the community already has a good strategy for this.
Some of my ideas include: for each node, also store a count of “all descendants” (i.e. how many nodes are in this node’s subtree), as well as “all leafs” (i.e. how many nodes in this subtree are leafs), plus “win leafs” and “lose leafs”, and then you can do
1 / all_descendants as a factor to smooth out the difference between smaller subtrees and larger ones…