[P] DQN Loss Function Question
I am attempting to implement my own DQN without looking at existing source code, and I am torn between two possible approaches for computing the loss.
The basic approach would be to compute the loss only on the Q-value of the selected action, so that the gradient with respect to every other action-value output is 0. My worry is that weight updates that improve the Q-value of the selected action will incidentally change the outputs for the other actions, making those estimates less accurate. One way to address this would be to set the labels for the non-chosen actions to the current output of the Q-network, so that the network is incentivized not to change the output values of these actions. However, I have not seen this approach much on the forums I’ve checked, so I’m assuming it is bad for some reason.
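To make the two options concrete, here is a minimal sketch in plain Python of the gradient each one produces with respect to the network outputs. It assumes a summed squared-error loss and assumes that the labels in the second approach are treated as constants (no gradient flows through them). The function names, `q_values`, and `target` are hypothetical and chosen only for illustration.

```python
def masked_loss_grad(q_values, action, target):
    """Approach 1: squared error on the selected action only.

    Returns d(loss)/d(q_values); every non-selected entry is 0.
    """
    grad = [0.0] * len(q_values)
    grad[action] = 2.0 * (q_values[action] - target)
    return grad


def full_label_loss_grad(q_values, action, target):
    """Approach 2: labels for non-chosen actions are the current outputs.

    The labels are a copy of the outputs, held constant, with the
    selected action's label replaced by the TD target.
    """
    labels = list(q_values)   # copy of current outputs, treated as constants
    labels[action] = target   # only the chosen action gets a real target
    return [2.0 * (q - y) for q, y in zip(q_values, labels)]


# Hypothetical numbers: three actions, action 1 was taken.
q_values = [1.0, 2.5, -0.5]
action = 1
target = 3.0  # stands in for r + gamma * max over a' of Q_target(s', a')

g1 = masked_loss_grad(q_values, action, target)
g2 = full_label_loss_grad(q_values, action, target)
print(g1, g2)
```

Under these assumptions the error on each non-chosen action in the second approach is exactly zero at the moment of the update, so it contributes no gradient and both functions return the same vector. If a mean rather than a sum over actions is used, the second gradient is additionally scaled by one over the number of actions.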
Can anyone shed some light on which approach is better?