[D] Large-scale imitation learning/apprenticeship learning for self-driving cars
Quick summary: Imitation learning for self-driving cars is hampered by compounding errors (the DAgger problem), but this problem is in principle soluble by scaling up training data, as AlphaStar has demonstrated. Another proposed solution is to allow the agent to sample a human demonstration whenever it makes an error it can’t recover from. Tesla appears to be trying both solutions right now with ~450,000 drivers.
Ever since DeepMind showed with AlphaStar that you can get to human-level performance on StarCraft with imitation learning alone, I’ve been obsessed with the idea of applying imitation learning to self-driving cars on a similar scale.
Waymo has experimented with imitation learning on a very small scale (just ~1,400 hours of driving). (blog post | paper) Waymo’s experiment with their imitation network, ChauffeurNet, felt like a Rorschach test. Some deep learning/autonomous vehicle people on Twitter interpreted it as showing that imitation learning doesn’t work. Others reacted the opposite way, seeing it as a promising direction for future R&D.
Large-scale imitation learning is more exciting to me because AlphaStar is such a compelling proof of concept. Alex Irpan, a reinforcement learning researcher, has a great explanation on his blog:
One of the problems with imitation learning is the way errors can compound over time. I’m not sure if there’s a formal name for this. I’ve always called it the DAgger problem, because that’s the paper that everyone cites when talking about this problem (Ross et al, AISTATS 2011).
… This problem means mistakes in imitation learning often aren’t recoverable, and the temporal nature of the problem means that the longer your episode is, the more likely it is that you enter this negative feedback loop, and the worse you’ll be if you do. …
Due to growing quadratically in T, we expect long-horizon tasks to be harder for imitation learning. A StarCraft game is long enough that I didn’t expect imitation learning to work at all. And yet, imitation learning was good enough to reach the level of a Gold player.
… If you have a very large dataset, from a wide variety of experts of varying skill levels (like, say, a corpus of StarCraft games from anyone who’s ever played the game), then it’s possible that your data already has enough variation to let your agent learn how to recover from several of the incorrect decisions it could make.
So, AlphaStar has shown us that one potential solution to the compounding errors that arise with supervised imitation learning/behavioural cloning (the DAgger problem) is to collect a massive and highly varied dataset that includes a lot of errors, and a lot of recovering from errors. Counterintuitively, humans are teaching the AI by doing it wrong!
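To make the "quadratic in T" point concrete, here is a toy sketch in Python. It is my own simplified model, not the analysis from the DAgger paper: the agent slips off the expert's state distribution with a small per-step probability `eps`, and gets back on with probability `recover_prob`. Both parameter names and the model itself are assumptions for illustration. With no recovery, the expected number of off-distribution steps grows roughly quadratically with the horizon; with recovery (which is what error-rich data is supposed to teach), it grows roughly linearly.

```python
def expected_off_steps(horizon, eps, recover_prob):
    """Expected number of steps spent off the expert's state distribution.

    Toy two-state Markov chain (an assumption, not a result from any paper):
    - on-distribution  -> off-distribution with probability eps
    - off-distribution -> on-distribution  with probability recover_prob
    """
    p_off = 0.0   # probability of being off-distribution at the current step
    total = 0.0   # running sum of p_off, i.e. expected off-distribution steps
    for _ in range(horizon):
        p_off = p_off * (1.0 - recover_prob) + (1.0 - p_off) * eps
        total += p_off
    return total


def growth_when_horizon_doubles(horizon, eps, recover_prob):
    """Ratio of expected off-distribution steps at 2*horizon versus horizon."""
    long_run = expected_off_steps(2 * horizon, eps, recover_prob)
    short_run = expected_off_steps(horizon, eps, recover_prob)
    return long_run / short_run


if __name__ == "__main__":
    # No recovery: doubling the horizon roughly quadruples the cost.
    print(growth_when_horizon_doubles(100, eps=0.001, recover_prob=0.0))
    # With recovery: doubling the horizon roughly doubles the cost.
    print(growth_when_horizon_doubles(100, eps=0.001, recover_prob=0.5))
```

The only design choice here is to track probabilities directly instead of sampling rollouts, so the output is deterministic.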
It was recently reported in The Information that Tesla is taking a behavioural cloning approach to self-driving. Tesla has around 450,000 drivers with the latest generation of sensor hardware, which includes eight cameras covering 360 degrees around the car. Here’s what The Information said:
Tesla’s cars collect so much camera and other sensor data as they drive around, even when Autopilot isn’t turned on, that the Autopilot team can examine what traditional human driving looks like in various driving scenarios and mimic it, said the person familiar with the system. It uses this information as an additional factor to plan how a car will drive in specific situations—for example, how to steer a curve on a road or avoid an object.
Such an approach has its limits, of course: behavior cloning, as the method is sometimes called, cannot teach an automated driving system to handle dangerous scenarios that cannot be easily anticipated. That’s why some autonomous vehicle programs are wary of relying on the technique.
But Tesla’s engineers believe that by putting enough data from good human driving through a neural network, that network can learn how to directly predict the correct steering, braking and acceleration in most situations. “You don’t need anything else” to teach the system how to drive autonomously, said a person who has been involved with the team. They envision a future in which humans won’t need to write code to tell the car what to do when it encounters a particular scenario; it will know what to do on its own.
Another potential solution is to give the imitation agent access to an expert/human demonstrator when it makes an error and doesn’t know how to recover. If a vehicle drives up onto a sidewalk, and there were never any sidewalk state-action pairs in its training dataset, then you can get a human to demonstrate what to do in that situation. The problem is that this is obviously very labour-intensive. You need a lot of demonstrators ready to take over when an error occurs.
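Here is a minimal sketch of that idea as a DAgger-style loop, under heavy assumptions: the "car" is just an integer lateral offset from the lane centre, the "expert" is a hypothetical rule that steers one step back toward the centre, and "training" is a lookup table rather than a neural network. It also skips the expert/learner mixing used in the actual DAgger algorithm. The point is only the data flow: roll out the learner, find the states it reached, and ask the expert to label them.

```python
import random


def expert_action(offset):
    """Hypothetical expert: steer one step back toward the lane centre (offset 0)."""
    return (offset < 0) - (offset > 0)  # +1 if left of centre, -1 if right, else 0


def rollout(policy, steps, rng):
    """Run the learner's policy and return the states it actually visits."""
    offset, visited = 0, []
    for _ in range(steps):
        visited.append(offset)
        action = policy.get(offset, 0)  # unseen state: the learner does nothing
        drift = rng.choice([-1, 0, 1])  # random disturbance pushing the car around
        offset = max(-5, min(5, offset + action + drift))
    return visited


def dagger(iterations=5, steps=200, seed=0):
    """Aggregate expert labels on states reached by the learner's own rollouts."""
    rng = random.Random(seed)
    # Initial demonstrations only cover perfect, centred driving.
    dataset = {0: expert_action(0)}
    for _ in range(iterations):
        policy = dict(dataset)  # "training" is a table lookup in this toy
        for state in rollout(policy, steps, rng):
            # Query the expert for the state the learner got itself into.
            dataset[state] = expert_action(state)
    return dataset


if __name__ == "__main__":
    learned = dagger()
    print(sorted(learned.items()))
```

After the first rollout the table already contains recovery actions for off-centre states that the original demonstrations never covered, which is exactly what plain behavioural cloning is missing.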
Strikingly, this seems to be exactly what Tesla is doing. Elon Musk recently described something that sounds like this solution to the DAgger problem:
Well, there’s a lot of things that are learnt. There are certainly edge cases where say somebody’s on Autopilot and they take over. And then, okay, that’s a trigger that goes into our system that says, okay, did they take over for convenience, or did they take over because the Autopilot wasn’t working properly.
There’s also like, let’s say we’re trying to figure out what is the optimal spline for traversing an intersection. Then, the ones where there are no interventions are the right ones. So you then say okay, when it looks like this, do the following. And then you get the optimal spline for navigating a complex intersection.
Elon later said on Twitter:
Your interventions do train the NN [neural network]
This sounds like the neural network is sampling human demonstrations when it makes an error. In theory, it could be reinforcement learning rather than imitation/apprenticeship learning. Any thoughts on whether it would make sense to use RL instead of IL here?
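For what it's worth, here is how I picture the imitation-learning reading of "interventions train the NN", as a toy Python sketch. This is pure guesswork on my part and not anything known about Tesla's pipeline: `LogStep`, its fields, and `training_pairs` are all hypothetical names. Steps where the driver took over are labelled with the human's action; steps with no intervention keep the system's own action as an implicitly approved label.

```python
from typing import List, NamedTuple, Optional, Tuple


class LogStep(NamedTuple):
    """One hypothetical logged timestep from a drive."""
    state: str                   # stand-in for camera/sensor features
    autopilot_action: str        # what the system chose to do
    human_action: Optional[str]  # None means the driver did not intervene


def training_pairs(log: List[LogStep]) -> List[Tuple[str, str]]:
    """Turn a drive log into (state, target action) pairs for supervised learning."""
    pairs = []
    for step in log:
        if step.human_action is not None:
            # Intervention: treat the human's action as the demonstration.
            pairs.append((step.state, step.human_action))
        else:
            # No intervention: treat the system's action as implicitly approved.
            pairs.append((step.state, step.autopilot_action))
    return pairs


if __name__ == "__main__":
    log = [
        LogStep("approaching intersection", "slow down", None),
        LogStep("drifting toward kerb", "hold course", "steer left"),
    ]
    print(training_pairs(log))
```

A real system would also have to separate takeovers for convenience from takeovers caused by an actual error, as the quote above points out. This sketch ignores that.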
A totally different approach is to use a GAN and do generative adversarial imitation learning (GAIL). In one paper, GAIL did worse than behavioural cloning on short time scales (~2 seconds) but better over long time scales. There’s also inverse reinforcement learning. So, there are a bunch of different ideas to explore in this area.
So, to summarize:
AlphaStar showed behavioural cloning can solve a complex, tactical, multi-agent task with an astronomically large, continuous action space — like driving! The solution is a massive and highly varied dataset with a lot of human errors.
When behavioural cloning falls short, another potential solution is to allow the neural network to ask a human for a demonstration when it makes an error.
Tesla appears to be collecting a massive and highly varied dataset with a lot of human errors for behavioural cloning of the driving task.
Tesla also appears to be allowing its NN to sample human demonstrations when the NN makes an error, unless this is actually reinforcement learning.
This is so exciting to me. The only big difference I can think of between StarCraft and driving is the obvious one: AlphaStar just plugged into the game’s API, whereas to deploy a self-driving car you need to solve computer vision. Besides that, I can’t think of anything. Can y’all?
One way in which driving is actually easier than StarCraft is the time horizon. Driving is a sequence of short time horizon tasks. For example, the time horizon for navigating an intersection is short. Once a car is through the intersection, its actions don’t depend on its past actions or previously observed states.
Before AlphaStar, imitation learning felt a lot more dubious. Now it feels like a proven solution. We might be within spitting distance of honest-to-God self-driving cars.
Either I’m way too optimistic about this, or a lot of people are missing something big. So, which is it? Am I overlooking important differences between StarCraft and driving? Is it wrong to assume the difference between ChauffeurNet and AlphaStar is just scale?