[D] Having trouble with Deep Q-learning on the OpenAI Gym Lunar Lander.
A few months ago I spent some time trying to learn deep reinforcement learning, and became obsessed with the OpenAI Gym Lunar Lander environment. I ended up doing KNN on memory (as in, “memory replay”), and I got some intelligent behavior out of the lander, but it was far from perfect (and yes, I know KNN is not “deep learning”, but I used what I understood). Recently I took another shot at it using deep Q-learning with neural networks, but I’m having even less success than before.
I wanted to ask about what I believe to be a major contributor to my difficulties.
In deep Q-learning you have a neural network that is used to approximate Q(s, a), which is the value of action a in state s, or rather, the value you can expect to obtain in the long run after taking action a in state s. We can also say that the value of a state is V(s) = Q(s, maximizing_a), meaning the value of a state is the value of choosing the optimal action in that state.
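To make the V(s) definition concrete, here is a minimal sketch in plain Python. The function names `state_value` and `greedy_action` are just illustrative, and the list stands in for whatever the network outputs for one state (the numbers are the ones quoted later in this post).

```python
# Q-values for a single state, one entry per action.
# In practice these would come from the network's forward pass.
q_values = [-32.5629, -32.8037, -32.6016, -32.5938]

def state_value(q_values):
    """V(s) = Q(s, maximizing_a): the value of the best action."""
    return max(q_values)

def greedy_action(q_values):
    """Index of the action whose Q-value equals V(s)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])

print(state_value(q_values))    # -32.5629
print(greedy_action(q_values))  # 0
```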
As far as I understand, deep Q-learning revolves around the Q(s, a) = r + V(s') equation, meaning the long-term expected value of a state and action is simply the immediate reward plus the expected value of the next state. In deep Q-learning you basically turn this equation into training data and train your neural network on it.
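As a rough sketch of how that equation becomes a training target, assuming each transition is stored as (s, a, r, s', done): the helper name `td_target` and the `gamma` parameter are my own additions for illustration. Most formulations multiply V(s') by a discount factor gamma; setting gamma to 1.0 gives exactly the equation as written above.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """Regression target for Q(s, a): r + gamma * V(s').

    gamma is an assumed discount factor; gamma=1.0 reproduces
    the undiscounted equation Q(s, a) = r + V(s').
    If the episode ended on this step there is no next state,
    so the target is just the immediate reward.
    """
    if done:
        return reward
    # V(s') is the value of the best action in the next state.
    return reward + gamma * max(next_q_values)

# One hypothetical transition; next_q_values would normally be
# the network's output for the next state s'.
target = td_target(
    reward=-1.0,
    next_q_values=[-32.0, -33.0, -32.5, -32.2],
    done=False,
    gamma=1.0,
)
print(target)  # -33.0
```

The network's output for the action actually taken would then be regressed toward `target`, typically with a squared-error style loss, while the outputs for the other actions are left alone.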
My problem is that Q predicts action values like this:
Float32[-32.5629, -32.8037, -32.6016, -32.5938]
(There are 4 possible actions at each step in the lunar lander environment.)
Q seems to understand (correctly) that it really doesn’t matter much what I do for a single step in the lander environment. A single action barely changes the situation at all, so the expected values of all actions are very close. I don’t think my neural network is able to make such a subtle distinction between which action is best, because any single action has a very small effect. This is especially troublesome because, again, Q is correct: there really isn’t much difference between the actions at a single step. So I can’t just “fix” Q, because it’s already working, but it will never be 100% correct since it’s just a function approximator.
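To put numbers on how small the differences are, here is a quick sketch that measures the gap between each action and the best one, using the values quoted above. The name `action_gaps` is just something I made up for illustration. The best and second-best actions differ by about 0.03 on values near -32.6, which is roughly 0.1% of the magnitude, so the approximation error only has to be that small to flip the ranking.

```python
q_values = [-32.5629, -32.8037, -32.6016, -32.5938]

def action_gaps(q_values):
    """How far each action's value falls below the best action's value."""
    best = max(q_values)
    return [best - q for q in q_values]

gaps = action_gaps(q_values)

# Largest gap between any action and the best one, as a fraction
# of the magnitude of the best value.
relative_gap = max(gaps) / abs(max(q_values))

print([round(g, 4) for g in gaps])
print(round(relative_gap, 4))
```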
So what do I do? Do I throw more parameters at it? Do I train it another way? Any suggestions?
submitted by /u/Buttons840