[R] OmniNet is all you need! ;)
Paper url: https://arxiv.org/abs/1907.07804
OmniNet is the first-ever truly universal architecture for multi-modal multi-task learning. A single OmniNet architecture can encode multiple inputs from almost any real-life domain (txt, image, video) and is capable of asynchronous multi-task learning across a wide range of tasks. The OmniNet architecture consists of multiple sub-networks called the neural peripherals, used to encode domain specific inputs as spatio-temporal representations, connected to a common central neural network called the Central Neural Processor (CNP). The CNP implements a Transformer based universal spatio-temporal encoder and a multi-task decoder. In the paper a single instance of OmniNet is jointly trained to perform the tasks of part-of-speech tagging, image captioning, visual question answering and video activity recognition. Due to the shared multi-modal representation learning architecture of the Central Neural Processor, OmniNet can also be used for zero-shot prediction for tasks it was never trained on. For example, the multi-model architecture can also be used for video captioning and video question answering even though the model was never trained on those tasks.