[D] RL Line Follower

Written by torontoai on October 14, 2019. Posted in Reddit MachineLearning.

Hi everyone,

I’m trying to train a line follower agent using Deep RL. In the simplest case, the environments look like in the attached figure. More complicated environments can be generated by varying the line thickness along the path, allowing the line to have tangency points with itself, or having other lines intersecting/touching the line of interest. The agent starts at one end and the goal is to reach the other end while staying as centered as possible and following the topologically correct path (e.g. in case of overlaps, it shouldn’t take the “wrong path”). The terminal state does not have to be signaled by the agent.When the agent is located in the middle of an intersection, the correct path to take cannot be determined unless a history of previous positions is stored (either by concatenating the latest n observations, or by using a recurrent layer such as an LSTM in the Q network/Policy network), thus effectively handling the environment as a POMDP. At each step, the observation is a retina-like representation around the current position, as in this paper: https://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf

The rewards are dense (i.e. received after each timestep), and I defined them as:

-1 if the agent goes outside the line or moves in the wrong direction (e.g. moves in the opposite direction than it should, or is taking a wrong path when located inside an intersection). In this case the episode ends.
otherwise the reward is a positive between 0 and 1, depending on the distance from the center of the line (1 being the maximum, when located exactly on the center).

The actions correspond to the 8 discrete neighbors of the current position in which the agent can move, with a fixed stepsize (so the agent effectively walks on a grid).

I’ve tried using both simple DQN by concatenating the previous observations and DRQN (https://arxiv.org/abs/1507.06527), but the results are not great. I’m starting to think that Q learning is not suited for the task because the length of the line is varying from one environment to another, so the return can vary a lot, hence being hard to learn (especially because the agent does not observe the full environment). Because of this, I’ve tried reducing the discount factor, still without improvements. I cannot find a systematic reason for the failures (for example the agent failing always inside an intersection).

I’ve also tried PG and Recurrent PG but never really managed to make it work, although I am starting to think that PG is more suited for this task.

The big question: is there anything fundamentally wrong that I am doing, or a fundamental reason for which Q-learning/PG will not work for this? Any tips or tricks, suggestions?

https://i.redd.it/o62wjg7nrns31.png

submitted by /u/hemiwoyi
[link] [comments]

Blog

Learn About Our Meetup

5000+ Members

MEETUPS

JOB POSTINGS

CONTACT

[D] RL Line Follower