[D] Having trouble with Deep Q-learning on the OpenAI Gym Lunar Lander.
A few months ago I spent some time trying to learn deep reinforcement learning, and became obsessed with the OpenAI Gym Lunar Lander environment. I ended up doing KNN on memory (as in, “memory replay”), and I got some intelligent behavior out of the lander, but it was far from perfect (and yes, I know KNN is not “deep learning”, but I used what I understood). Recently I took another shot at it using deep Q-learning with neural networks, but I’m having even less success than before.
I wanted to ask about what I believe to be a major contributor to my difficulties.
In deep Q-learning you have a neural network that approximates Q(s, a): the value of action a in state s, or rather, the return you can expect to obtain in the long run after taking action a in state s. We can also say that the value of a state is V(s) = max_a Q(s, a), meaning the value of a state is the value of choosing the optimal action in that state.
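To make the notation concrete, here's a minimal sketch in Python/NumPy (the fixed vector below is just a stand-in for a real network's output for one state):

```python
import numpy as np

# Hypothetical Q-network output for a single state: one value per action
# (Lunar Lander has 4 discrete actions).
q_values = np.array([-32.5629, -32.8037, -32.6016, -32.5938], dtype=np.float32)

# The greedy action is the one with the largest Q-value...
best_action = int(np.argmax(q_values))   # index 0 here

# ...and the state value is the Q-value of that action: V(s) = max_a Q(s, a).
state_value = float(q_values.max())
```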
As far as I understand, deep Q-learning revolves around the Bellman equation Q(s, a) = r + γ·V(s'), meaning the long-term expected value of a state-action pair is simply the immediate reward plus the value of the next state (usually discounted by a factor γ). In deep Q-learning you basically turn this equation into training data and train your neural network on it.
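Turning that equation into regression targets for the network looks roughly like this. This is a sketch under my own naming (`compute_targets` is a hypothetical helper, not from any library), assuming NumPy arrays for a batch of transitions:

```python
import numpy as np

def compute_targets(rewards, next_q_values, dones, gamma=0.99):
    """Bellman targets for a batch of transitions.

    rewards:       (batch,)            immediate rewards r
    next_q_values: (batch, n_actions)  Q(s', a') for every action a'
    dones:         (batch,)            1.0 if s' is terminal, else 0.0
    Returns y = r + gamma * max_a' Q(s', a'), with the bootstrap
    term zeroed out on terminal transitions.
    """
    next_v = next_q_values.max(axis=1)          # V(s') = max_a' Q(s', a')
    return rewards + gamma * (1.0 - dones) * next_v

# Tiny example batch of 2 transitions:
rewards = np.array([1.0, -0.5])
next_q = np.array([[0.2, 0.4, 0.1, 0.0],
                   [1.0, 2.0, 0.5, 0.3]])
dones = np.array([0.0, 1.0])                    # second transition ends the episode
targets = compute_targets(rewards, next_q, dones)
# targets: [1.0 + 0.99*0.4, -0.5] — the network is then trained so that
# Q(s, a) for the action actually taken regresses toward these values.
```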
My problem is that Q predicts action values like this: `Float32[-32.5629, -32.8037, -32.6016, -32.5938]` (there are 4 possible actions at each step in the lunar lander environment).
Q seems to understand, correctly, that it really doesn't matter much what I do for a single step in the lander environment. A single action barely changes the situation at all, so the expected values of all actions are extremely close. I don't think my neural network is able to make such a subtle distinction between which action is best, because any single action has such a small effect. This is especially troublesome because, again, Q is right: there really isn't much difference between the actions at a single step, so I can't just "fix" Q. It's already working as intended, and it will never be 100% accurate anyway, since it's just a function approximator.
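To put numbers on how subtle that distinction is: subtracting the best Q-value from the others gives the per-action advantages, A(s, a) = Q(s, a) - max_a Q(s, a). With the output above, the gap the network has to resolve is tiny relative to the magnitude it has to predict (sketch, using the values from my run):

```python
import numpy as np

q_values = np.array([-32.5629, -32.8037, -32.6016, -32.5938])

# Advantage of each action relative to the best one.
# Here the entire spread is only about 0.24.
advantages = q_values - q_values.max()

# The distinction the network must learn, relative to the scale it must output:
# roughly 0.24 / 32.6, i.e. under 1%.
relative_gap = (q_values.max() - q_values.min()) / abs(q_values.max())
```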
So what do I do? Do I throw more parameters at it? Do I train it another way? Any suggestions?