[R] Multiple-action policy (RL)
In the following work, the authors propose a simple trick to sample multiple actions per step in linear time and train the model with policy gradient. It is proposed “by the way”, but to me it is a very important contribution to RL itself. See pages 4 and 5 of https://arxiv.org/abs/1905.12916 (Chen et al., Effective Medical Test Suggestions Using Deep Reinforcement Learning, 2019).
The main trick is to output sigmoid values (between 0 and 1), one per action, instead of a softmax probability distribution, and to sample each action from the corresponding Bernoulli distribution. The authors then show how to turn this into a proper probability distribution over the exponentially large action space and how to do the policy gradient on it (both are very simple).
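To make the idea concrete, here is a minimal sketch in plain Python. It is my own reading of the trick, not the paper's code: I assume the probability of a sampled action subset is the product of independent Bernoulli terms, and the names (`sample_actions`, `log_prob`, `reinforce_grad`) and the example logits are made up for illustration.

```python
import itertools
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_actions(logits, rng):
    """Sample one binary decision per action; cost is linear in len(logits)."""
    probs = [sigmoid(z) for z in logits]
    actions = [1 if rng.random() < p else 0 for p in probs]
    return actions, probs


def log_prob(actions, probs):
    """Log-probability of the sampled subset, assuming independent Bernoullis."""
    return sum(math.log(p) if a else math.log(1.0 - p)
               for a, p in zip(actions, probs))


def reinforce_grad(actions, probs, ret):
    """Gradient of ret * log_prob with respect to the logits.

    For a sigmoid output, d/dz [a*log(p) + (1-a)*log(1-p)] = a - p.
    """
    return [ret * (a - p) for a, p in zip(actions, probs)]


rng = random.Random(0)
logits = [2.0, -1.0, 0.5]          # hypothetical network outputs for 3 actions
actions, probs = sample_actions(logits, rng)
lp = log_prob(actions, probs)
grad = reinforce_grad(actions, probs, ret=1.0)

# Sanity check: the probabilities of all 2^n subsets sum to 1,
# so this is a proper distribution over the exponential action space.
total = sum(math.exp(log_prob(subset, probs))
            for subset in itertools.product([0, 1], repeat=len(logits)))
```

Sampling and the log-probability both cost O(n) in the number of actions, even though the space of possible subsets has 2^n elements. The enumeration at the end is only there as a check and is not needed during training.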