[D] (RL) Questions about log std and clipping outputs
I’m trying to implement A2C in TF2.0 without using additional libraries like baselines, but I’m having problems with the standard deviation (std) and with clipping.
First of all, I made a variable log_std to avoid zeros in calculations (I exponentiate it whenever I use it, i.e. tf.exp(log_std)). I saw this trick in CS294-112 homework 2, so I’m using it. But when I gather the trainable variables for training, I only include log_std — is that okay? Since the variable I created is the un-exponentiated one, I think it’s the only thing I can update (not the exponentiated form), but I’m worried the gradient values will be different when the NN back-propagates.
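For what it’s worth, here is a minimal numpy sketch (not your code, just the underlying math) of the log_std parameterization: training log_std directly is fine, because the chain rule routes the gradient through exp automatically, and sigma = exp(log_std) stays positive for any real log_std. A finite-difference check confirms the gradient that backprop would compute:

```python
import numpy as np

def gaussian_log_prob(action, mean, log_std):
    """log N(action | mean, exp(log_std)^2), written in terms of log_std."""
    std = np.exp(log_std)  # sigma > 0 for any real log_std
    return -0.5 * ((action - mean) / std) ** 2 - log_std - 0.5 * np.log(2 * np.pi)

def grad_log_prob_wrt_log_std(action, mean, log_std):
    """Analytic d(log_prob)/d(log_std) via the chain rule through exp:
    with z = (a - mu) / exp(log_std), the gradient is z^2 - 1."""
    z = (action - mean) / np.exp(log_std)
    return z ** 2 - 1.0

# Central finite-difference check that the analytic gradient matches,
# i.e. differentiating w.r.t. log_std (not sigma) is perfectly valid.
a, mu, ls, eps = 0.7, 0.2, -0.5, 1e-6
num = (gaussian_log_prob(a, mu, ls + eps)
       - gaussian_log_prob(a, mu, ls - eps)) / (2 * eps)
print(abs(num - grad_log_prob_wrt_log_std(a, mu, ls)) < 1e-5)  # True
```

In TF2 the same thing happens automatically: as long as log_std is a tf.Variable and the loss is computed through tf.exp(log_std) inside a tf.GradientTape, the tape differentiates through the exp for you.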
Second, I’m clipping actions with tf.clip_by_value(ac, env.action_space.low, env.action_space.high). But I’m not sure how to clip the NN’s output (the NN outputs the mean of the Gaussian distribution). As far as I know, the NN should always produce actions between env.action_space.low and env.action_space.high, but since I’m using a Gaussian distribution, which has unbounded support, that constraint can’t hold exactly. So what is the usual way to bound the NN’s output when using a Gaussian (or another distribution)?
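To make the trade-off concrete, here is a hedged numpy sketch of two common approaches (the scalar `low`/`high` stand in for env.action_space.low/high; this is an illustration, not a full policy):

```python
import numpy as np

rng = np.random.default_rng(0)
low, high = -2.0, 2.0   # stand-ins for env.action_space.low / high
mean, std = 0.5, 1.5    # hypothetical Gaussian policy parameters

# Sample from the unbounded Gaussian.
a = mean + std * rng.standard_normal()

# Option 1: clip the *sampled action* (what the post already does).
# Simple, but the bounds are invisible to the gradient, and probability
# mass piles up at the edges of the action range.
a_clipped = np.clip(a, low, high)

# Option 2: squash with tanh and rescale into [low, high] (the SAC-style
# trick). The network may output any real-valued mean; tanh bounds the
# final action smoothly, so no hard clipping of the NN output is needed.
a_squashed = low + 0.5 * (np.tanh(a) + 1.0) * (high - low)

print(low <= a_clipped <= high, low <= a_squashed <= high)  # True True
```

With option 2 you leave the Gaussian itself unbounded and only transform the sample; if you need exact log-probabilities after squashing (as SAC does), you also apply the tanh change-of-variables correction to the log-prob.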
Final question: do you think I should just use RL libraries like tensorforce or baselines instead?