[D] RL Line Follower
I’m trying to train a line follower agent using Deep RL. In the simplest case, the environments look like in the attached figure. More complicated environments can be generated by varying the line thickness along the path, allowing the line to have tangency points with itself, or having other lines intersecting/touching the line of interest. The agent starts at one end and the goal is to reach the other end while staying as centered as possible and following the topologically correct path (e.g. in case of overlaps, it shouldn’t take the “wrong path”). The terminal state does not have to be signaled by the agent.When the agent is located in the middle of an intersection, the correct path to take cannot be determined unless a history of previous positions is stored (either by concatenating the latest n observations, or by using a recurrent layer such as an LSTM in the Q network/Policy network), thus effectively handling the environment as a POMDP. At each step, the observation is a retina-like representation around the current position, as in this paper: https://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf
The rewards are dense (i.e. received after each timestep), and I defined them as:
The actions correspond to the 8 discrete neighbors of the current position in which the agent can move, with a fixed stepsize (so the agent effectively walks on a grid).
I’ve tried using both simple DQN by concatenating the previous observations and DRQN (https://arxiv.org/abs/1507.06527), but the results are not great. I’m starting to think that Q learning is not suited for the task because the length of the line is varying from one environment to another, so the return can vary a lot, hence being hard to learn (especially because the agent does not observe the full environment). Because of this, I’ve tried reducing the discount factor, still without improvements. I cannot find a systematic reason for the failures (for example the agent failing always inside an intersection).
I’ve also tried PG and Recurrent PG but never really managed to make it work, although I am starting to think that PG is more suited for this task.
The big question: is there anything fundamentally wrong that I am doing, or a fundamental reason for which Q-learning/PG will not work for this? Any tips or tricks, suggestions?