[R] Few-shot learning of talking heads
I’d like to tell you about our recent paper. We’ve tackled the problem of few-shot generation of talking heads: given a few images (or even a single one), train a model that can synthesize new images of that particular person with a new pose (viewpoint and expression).
Our model was trained on a publicly available dataset of YouTube videos (VoxCeleb2, 224p) and avoided mode collapse, even though the image quality in this dataset is quite diverse. Hence, the model generalizes well to identities unseen during training (we can even run it on paintings and get reasonable results).
The key ingredients are adversarial meta-learning, adversarial fine-tuning, and adaptive instance normalization. For more details, please refer to the paper; a short description of our method, as well as the results, can be found in the video below.
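For readers unfamiliar with adaptive instance normalization (AdaIN), here is a minimal NumPy sketch of the generic operation: each channel of a feature map is normalized to zero mean and unit variance, then rescaled and shifted by per-channel parameters (in this setting, predicted from the source identity's embedding). This is a standard AdaIN sketch, not the authors' actual implementation; the function name and shapes are illustrative.

```python
import numpy as np

def adain(x, gamma, beta, eps=1e-5):
    """Adaptive instance normalization (sketch).

    Normalizes each channel of a feature map to zero mean and unit
    variance, then applies a per-channel scale (gamma) and shift (beta).

    x: feature map of shape (C, H, W); gamma, beta: arrays of shape (C,).
    """
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    x_norm = (x - mean) / (std + eps)
    return gamma[:, None, None] * x_norm + beta[:, None, None]

# Toy example: restyle a 3-channel 4x4 feature map with predicted stats.
feat = np.random.randn(3, 4, 4)
gamma = np.array([2.0, 0.5, 1.0])   # hypothetical predicted scales
beta = np.array([0.1, -0.3, 0.0])   # hypothetical predicted shifts
out = adain(feat, gamma, beta)
```

After the call, each output channel has mean approximately `beta[c]` and standard deviation approximately `gamma[c]`, which is how the identity-specific statistics are injected into the generator's activations.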