# [D] Having trouble with Deep Q-learning on the OpenAI Gym Lunar Lander.

A few months ago I spent some time trying to learn deep reinforcement learning, and became obsessed with the OpenAI Gym Lunar Lander environment. I ended up doing KNN on memory (as in, “memory replay”), and I got some intelligent behavior out of the lander, but it was far from perfect (and yes, I know KNN is not “deep learning”, but I used what I understood). Recently I took another shot at it using deep Q-learning with neural networks, but I’m having even less success than before.

I want to ask about what I believe is a major contributor to my difficulties.

In deep Q-learning, a neural network approximates `Q(s, a)`, which is the value of action `a` in state `s`, or rather, the value you can expect to obtain in the long run after taking action `a` in state `s`. We can also say that the value of a state is `V(s) = Q(s, maximizing_a)`, meaning the value of a state is the value of choosing the optimal action in that state.
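To make the two definitions concrete, here is a minimal sketch in plain Python. The `q_values` list stands in for what a Q-network might output for one state; the numbers and the helper names `state_value` and `greedy_action` are made up for illustration and are not from my actual code.

```python
from typing import Sequence


def state_value(q_values: Sequence[float]) -> float:
    """V(s): the value of the best action available in state s."""
    return max(q_values)


def greedy_action(q_values: Sequence[float]) -> int:
    """Index of the action that attains V(s) (the first one on ties)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q(s, a) estimates for the 4 actions in a single state.
q_values = [1.0, 2.5, -0.5, 2.0]
print(state_value(q_values))    # 2.5
print(greedy_action(q_values))  # 1
```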

As far as I understand, deep Q-learning revolves around the equation `Q(s, a) = r + V(s')`, meaning the long-term expected value of a state-action pair is simply the immediate reward plus the expected value of the next state. In deep Q-learning you basically turn this equation into training data and train your neural network on it.
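Turning that equation into training data might look roughly like the sketch below. It is an illustration under stated assumptions, not a full training loop: `Transition`, `td_targets`, and `q_fn` are hypothetical names, and the discount factor `gamma` is an addition that does not appear in the equation above (setting `gamma=1.0` recovers it exactly). Terminal transitions get a target of just `r`, since there is no next state to value.

```python
from typing import Callable, List, NamedTuple, Sequence


class Transition(NamedTuple):
    state: Sequence[float]
    action: int
    reward: float
    next_state: Sequence[float]
    done: bool  # True if next_state ended the episode


def td_targets(
    batch: Sequence[Transition],
    q_fn: Callable[[Sequence[float]], Sequence[float]],
    gamma: float = 0.99,  # assumed discount factor; 1.0 matches the post
) -> List[float]:
    """Regression targets r + gamma * V(s') for each transition."""
    targets = []
    for t in batch:
        # A terminal next state has no future value.
        next_value = 0.0 if t.done else max(q_fn(t.next_state))
        targets.append(t.reward + gamma * next_value)
    return targets


# Stand-in for a Q-network: returns the same two action values for any state.
def dummy_q(state: Sequence[float]) -> Sequence[float]:
    return [1.0, 3.0]


batch = [
    Transition([0.0], 0, 1.0, [1.0], False),
    Transition([1.0], 1, -100.0, [2.0], True),
]
print(td_targets(batch, dummy_q, gamma=1.0))  # [4.0, -100.0]
```

The network is then trained so that its prediction for the action actually taken in each `state` moves toward the corresponding target.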

My problem is that `Q` predicts action values like this: `Float32[-32.5629, -32.8037, -32.6016, -32.5938]`. (There are 4 possible actions at each step in the Lunar Lander environment.)

`Q` seems to understand, correctly, that what I do for a single step in the lander environment doesn’t matter much. A single action barely changes the situation, so the expected values of all actions are very close. I don’t think my neural network can reliably make such a subtle distinction between actions when any single action has such a small effect. This is especially troublesome because, again, `Q` is correct: there really isn’t much difference between the actions at a single step. I can’t just “fix” `Q`, because it’s already working, yet as a function approximator it will never be 100% correct.
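One way to put a number on how close the actions are is to compare each action's value to the best one. The sketch below does that for the Q-values quoted above; `action_gaps` is a hypothetical helper name, and this only measures the problem, it does not fix it.

```python
from typing import List, Sequence


def action_gaps(q_values: Sequence[float]) -> List[float]:
    """How much worse each action is than the best one: V(s) - Q(s, a)."""
    best = max(q_values)
    return [best - q for q in q_values]


# The Q-values quoted in the post.
q_values = [-32.5629, -32.8037, -32.6016, -32.5938]
gaps = action_gaps(q_values)

# Largest gap relative to the overall magnitude of the values.
relative = max(gaps) / abs(max(q_values))

print([round(g, 4) for g in gaps])
print(round(relative, 4))
```

The largest gap is about 0.24 on values of magnitude around 32.6, so the differences that decide which action is chosen are under 1% of the numbers the network has to predict.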

So what do I do? Do I throw more parameters at it? Do I train it another way? Any suggestions?

submitted by /u/Buttons840