[D] Having trouble with Deep Q-learning on the OpenAI Gym Lunar Lander.

A few months ago I spent some time trying to learn deep reinforcement learning, and became obsessed with the OpenAI Gym Lunar Lander environment. I ended up doing KNN on memory (as in, “memory replay”), and I got some intelligent behavior out of the lander, but it was far from perfect (and yes, I know KNN is not “deep learning”, but I used what I understood). Recently I took another shot at it using deep Q-learning with neural networks, but I’m having even less success than before.

I wanted to ask about what I believe is a major contributor to my difficulties.

In deep Q-learning you have a neural network that approximates Q(s, a), which is the value of action a in state s, or rather, the value you can expect to obtain in the long run after taking action a in state s. We can also say that the value of a state is V(s) = max_a Q(s, a), meaning the value of a state is the value of choosing the optimal action in that state.

As far as I understand, deep Q-learning revolves around the Q(s, a) = r + V(s') equation (or r + γV(s') if you use a discount factor γ), meaning the long-term expected value of a state and action is simply the immediate reward plus the expected value of the next state. In deep Q-learning you basically turn this equation into training data and train your neural network on it.
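To make the "turn this equation into training data" step concrete, here is a minimal sketch in Python with NumPy. It is not taken from the post: the function name `td_targets`, the `dones` mask for terminal states, and the `gamma` parameter are assumptions added for illustration. With `gamma=1.0` it reduces to the r + V(s') form described above.

```python
import numpy as np

def td_targets(rewards, next_q_values, dones, gamma=1.0):
    """Build regression targets r + gamma * V(s') for a batch of transitions.

    rewards:       shape (batch,), immediate reward r of each transition
    next_q_values: shape (batch, n_actions), the network's predictions Q(s', .)
    dones:         shape (batch,), 1.0 where s' is terminal (assumed convention)
    gamma:         discount factor (assumption; 1.0 gives the undiscounted form)
    """
    rewards = np.asarray(rewards, dtype=np.float32)
    next_q_values = np.asarray(next_q_values, dtype=np.float32)
    dones = np.asarray(dones, dtype=np.float32)

    # V(s') is the value of the best action available in the next state.
    next_v = next_q_values.max(axis=1)

    # A terminal next state contributes no future value, only the reward.
    return rewards + gamma * (1.0 - dones) * next_v
```

The network would then be trained so that its prediction for the action actually taken in each state moves toward the corresponding target. How the targets are kept stable (for example with a separate, slowly updated copy of the network) is a design choice this sketch does not cover.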

My problem is that Q predicts action values like this: Float32[-32.5629, -32.8037, -32.6016, -32.5938] (There are 4 possible actions at each step in the lunar lander environment.)

Q seems to understand (correctly) that it really doesn’t matter much what I do for a single step in the lander environment. A single action barely changes the situation at all, so the expected values of all actions are very, very close. I don’t think my neural network is able to make such a subtle distinction about which action is best, because any single action has a very small effect. This is especially troublesome because, again, Q is correct: there really isn’t much difference between the actions at a single step. So I can’t just “fix” Q, because it’s already working, but it will never be 100% correct since it’s just a function approximator.
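The size of the problem can be read straight off the predictions quoted above. The following is a small Python/NumPy sketch, not part of the original experiment; the helper name `action_gap` is hypothetical. It measures how far the best action is from the runner-up and expresses every action relative to the best one.

```python
import numpy as np

def action_gap(q_values):
    """Return (best action index, gap to the runner-up, values relative to the best)."""
    q = np.asarray(q_values, dtype=np.float64)
    order = np.argsort(q)[::-1]           # action indices sorted from best to worst
    best, runner_up = order[0], order[1]
    gap = q[best] - q[runner_up]          # how much better the best action looks
    relative = q - q[best]                # 0 for the best action, negative otherwise
    return int(best), float(gap), relative

# The four action values quoted in the post.
q = [-32.5629, -32.8037, -32.6016, -32.5938]
best, gap, relative = action_gap(q)
print(best, gap, relative)
```

For these four values the best action beats the runner-up by roughly 0.03, on predictions whose magnitude is about 32.5. That is a difference of about a tenth of a percent, so even a small approximation error in the network is enough to reorder the actions.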

So what do I do? Do I throw more parameters at it? Do I train it another way? Any suggestions?

submitted by /u/Buttons840