Learning Cross-Modal Temporal Representations from Unlabeled Videos
While people can easily recognize what activities are taking place in videos and anticipate what events may happen next, it is much more difficult for machines. Yet, increasingly, it is important for machines to understand the contents and dynamics of videos for applications, such as temporal localization, action detection and navigation for self-driving cars. In order to train neural networks to perform such tasks, it is common to use supervised training, in which the training data consists of videos that have been meticulously labeled by people on a frame-by-frame basis. Such annotations are hard to acquire at scale. Consequently, there is much interest in self-supervised learning, in which models are trained on various proxy tasks, and the supervision of those tasks naturally resides in the data itself.
In “VideoBERT: A Joint Model for Video and Language Representation Learning” (VideoBERT) and “Contrastive Bidirectional Transformer for Temporal Representation Learning” (CBT), we propose to learn temporal representations from unlabeled videos. The goal is to discover high-level semantic features that correspond to actions and events that unfold over longer time scales. To accomplish this, we exploit the key insight that human language has evolved words to describe high-level objects and events. In videos, speech tends to be temporally aligned with the visual signals, and can be extracted by using off-the-shelf automatic speech recognition (ASR) systems, and thus provides a natural source of self-supervision. Our model is an example of cross-modal learning, as it jointly utilizes the signals from visual and audio (speech) modalities during training.
A BERT Model for Videos
The first step of representation learning is to define a proxy task that leads the model to learn temporal dynamics and cross-modal semantic correspondence from long, unlabeled videos. To this end, we generalize the Bidirectional Encoder Representations from Transformers (BERT) model. The BERT model has shown state-of-the-art performance on various natural language processing tasks, by applying the Transformer architecture to encode long sequences, and pretraining on a corpus containing a large amount of text. BERT uses the cloze test as its proxy task, in which the BERT model is forced to predict missing words from context bidirectionally, instead of just predicting the next word in a sequence.
To do this, we generalize the BERT training objective, using image frames combined with the ASR sentence output at the same locations to compose cross-modal “sentences”. The image frames are converted into visual tokens with durations of 1.5 seconds, based on visual feature similarities. They are then concatenated with the ASR word tokens. We train the VideoBERT model to fill out the missing tokens from the visual-text sentences. Our hypothesis, which our experiments support, is that by pretraining on this proxy task, the model learns to reason about longer-range temporal dynamics (visual cloze) and high-level semantics (visual-text cloze).
|Illustration of VideoBERT in the context of a video and text masked token prediction, or cloze, task. Bottom: visual and text (ASR) tokens from the same locations of videos are concatenated to form the inputs to VideoBERT. Some visual and text tokens are masked out. Middle: VideoBERT applies the Transformer architecture to jointly encode bidirectional visual-text context. Yellow and pink boxes correspond to the input and output embeddings, respectively. Top: the training objective is to recover the correct tokens for the masked locations.|
Inspecting the VideoBERT Model
We trained VideoBERT on over one million instructional videos, such as cooking, gardening and vehicle repair. Once trained, one can inspect what the VideoBERT model learns on a number of tasks to verify that the output accurately reflects the video content. For example, text-to-video prediction can be used to automatically generate a set of instructions (such as a recipe) from video, yielding video segments (tokens) that reflect what is described at each step. In addition, video-to-video prediction can be used to visualize possible future content based on an initial video token.
To verify if VideoBERT learns semantic correspondences between videos and text, we tested its “zero-shot” classification accuracy on a cooking video dataset in which neither the videos nor annotations were used during pre-training. To perform classification, the video tokens were concatenated with a template sentence “now let me show you how to [MASK] the [MASK]” and the predicted verb and noun tokens were extracted. The VideoBERT model matched the top-5 accuracy of a fully-supervised baseline, indicating that the model is able to perform competitively in this “zero-shot” setting.
Transfer Learning with Contrastive Bidirectional Transformers
While VideoBERT showed impressive results in learning how to automatically label and predict video content, we noticed that the visual tokens used by VideoBERT can lose fine-grained visual information, such as smaller objects and subtle motions. To explore this, we propose the Contrastive Bidirectional Transformers (CBT) model which removes this tokenization step, and further evaluated the quality of learned representations by transfer learning on downstream tasks. CBT applies a different loss function, the contrastive loss, in order to maximize the mutual information between the masked positions and the rest of cross-modal sentences. We evaluated the learned representations for a diverse set of tasks (e.g., action segmentation, action anticipation and video captioning) and on various video datasets. The CBT approach outperforms previous state-of-the-art by significant margins on most benchmarks. We observe that: (1) the cross-modal objective is important for transfer learning performance; (2) a bigger and more diverse pre-training set leads to better representations; (3) compared with baseline methods such as average pooling or LSTMs, the CBT model is much better at utilizing long temporal context.
|Action anticipation accuracy with the CBT approach from untrimmed videos with 200 activity classes. We compare with AvgPool and LSTM, and report performance when the observation time is 15, 30, 45 and 72 seconds.|
Conclusion & future work
Our results demonstrate the power of the BERT model for learning visual-linguistic and visual representations from unlabeled videos. We find that our models are not only useful for zero-shot action classification and recipe generation, but the learned temporal representations also transfer well to various downstream tasks, such as action anticipation. Future work includes learning low-level visual features jointly with long-term temporal representations, which enables better adaptation to the video context. Furthermore, we plan to expand the number of pre-training videos to be larger and more diverse.
The core team includes Chen Sun, Fabien Baradel, Austin Myers, Carl Vondrick, Kevin Murphy and Cordelia Schmid. We would like to thank Jack Hessel, Bo Pang, Radu Soricut, Baris Sumengen, Zhenhai Zhu, and the BERT team for sharing amazing tools that greatly facilitated our experiments. We also thank Justin Gilmer, Abhishek Kumar, Ben Poole, David Ross, and Rahul Sukthankar for helpful discussions.