
[D] Having trouble with Deep Q-learning on the OpenAI Gym Lunar Lander.

A few months ago I spent some time trying to learn deep reinforcement learning, and became obsessed with the OpenAI Gym Lunar Lander environment. I ended up doing KNN on memory (as in, “memory replay”), and I got some intelligent behavior out of the lander, but it was far from perfect (and yes, I know KNN is not “deep learning”, but I used what I understood). Recently I took another shot at it using deep Q-learning with neural networks, but I’m having even less success than before.

I wanted to ask about what I believe to be a major contributor to my difficulties.

In deep Q-learning you have a neural network that is used to approximate Q(s, a), which is the value of action a in state s, or rather, the value you can expect to obtain in the long run after taking action a in state s. We can also say that the value of a state is V(s) = max_a Q(s, a), meaning the value of a state is the value of choosing the optimal action in that state.
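For concreteness, here's a minimal sketch of that setup, assuming PyTorch and an arbitrary two-layer MLP (the library and layer sizes are illustrative, not details from the post; LunarLander has an 8-dimensional state and 4 discrete actions):

```python
# Minimal sketch: a Q-network for LunarLander (8-dim state, 4 discrete actions).
# PyTorch and the hidden-layer sizes are assumptions, not details from the post.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state to one Q-value per action: output[a] approximates Q(s, a)."""
    def __init__(self, state_dim=8, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork()
state = torch.randn(1, 8)             # placeholder state
q_values = q_net(state)               # shape (1, 4): Q(s, a) for each action
v_state = q_values.max(dim=1).values  # V(s) = max_a Q(s, a)
```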

As far as I understand, deep Q-learning revolves around the equation Q(s, a) = r + γ·V(s'), where γ is the discount factor, meaning the long-term expected value of a state-action pair is the immediate reward plus the discounted value of the next state. In deep Q-learning you basically turn this equation into training data and train your neural network on it.
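A minimal sketch of one such training step, assuming the q_net above, a replay batch of (s, a, r, s', done) tensors, and γ = 0.99 (all of these are assumptions for illustration):

```python
# Minimal sketch of turning the Bellman relation into a regression target.
# Assumes: states (B, 8) float, actions (B,) int64, rewards/dones (B,) float.
import torch
import torch.nn.functional as F

gamma = 0.99
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def train_step(states, actions, rewards, next_states, dones):
    # Current estimate of Q(s, a) for the actions that were actually taken.
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bellman target: r + gamma * V(s'), with V(s') = max_a' Q(s', a')
    # and V(s') = 0 when s' is terminal.
    with torch.no_grad():
        v_next = q_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * v_next

    loss = F.mse_loss(q_sa, targets)  # regress the network toward the targets
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Standard DQN implementations usually evaluate the target with a separate, slowly updated copy of the network rather than q_net itself; that detail is left out here to stay close to the equation as written.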

My problem is that Q predicts action values like this: Float32[-32.5629, -32.8037, -32.6016, -32.5938] (There are 4 possible actions at each step in the lunar lander environment.)

Q seems to understand [correctly] that it really doesn't matter much what I do on any single step in the lander environment. A single action barely changes the situation at all, so the expected values of all actions are very close. I don't think my neural network is able to make such a subtle distinction between which action is best, because any single action has such a small effect. This is especially troublesome because, again, Q is correct: there really isn't much difference between the actions at a single step. So I can't just "fix" Q, because it's already working, yet it will never be 100% accurate since it's just a function approximator.
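To put numbers on that, the greedy action choice hinges on a spread of roughly 0.24 sitting on top of values around -32.6, well under 1% of the prediction magnitude. A quick check with the values quoted above (NumPy assumed):

```python
import numpy as np

# The Q-values quoted above; the argmax only sees the tiny differences between them.
q_values = np.array([-32.5629, -32.8037, -32.6016, -32.5938], dtype=np.float32)

greedy_action = int(q_values.argmax())                # 0: the least-negative value
spread = float(q_values.max() - q_values.min())       # about 0.24
relative_gap = spread / abs(float(q_values.mean()))   # under 1% of the magnitude
print(greedy_action, spread, relative_gap)
```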

So what do I do? Do I throw more parameters at it? Do I train it another way? Any suggestions?

submitted by /u/Buttons840
