[R] https://arxiv.org/abs/1811.07519 Higher-order Neural Networks for Action Recognition
I am delighted to announce that I have submitted our recent work to arXiv. Any feedback would be highly appreciated. https://arxiv.org/abs/1811.07519
In this paper, we propose a new architectural building block: the higher-order operation. The term “higher-order” comes from higher-order functions. A higher-order function is a function that takes a function as an argument or returns a function. Similarly, the output of the higher-order operation is not a set of feature maps but a bank of filters. The network then applies these filters to extract features.
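To make the analogy concrete, here is a minimal pure-Python sketch of a higher-order function that returns a filter instead of features. It is only an illustration of the idea, not the implementation from the paper: the name `make_filter`, the 1-D signals, and the rule that turns a context into a two-tap kernel are all assumptions made up for this example.

```python
from typing import Callable, List

Signal = List[float]


def make_filter(context: Signal) -> Callable[[Signal], Signal]:
    """Higher-order function: returns a filtering function derived from `context`.

    The rule below (a difference kernel scaled by the context mean) is a
    hypothetical stand-in for whatever a trained network would produce.
    """
    scale = sum(context) / len(context)
    kernel = [-scale, scale]

    def apply(signal: Signal) -> Signal:
        # Valid 1-D cross-correlation of `signal` with the generated kernel.
        k = len(kernel)
        return [
            sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)
        ]

    return apply


# Step 1: the "higher-order" step outputs a filter, not features.
edge = make_filter([1.0, 3.0])          # context mean 2.0 -> kernel [-2.0, 2.0]
# Step 2: the generated filter is then used to extract features.
features = edge([0.0, 1.0, 3.0, 3.0])   # [2.0, 4.0, 0.0]
```

The two-step structure is the point: the first call produces something that can itself be applied to data, which mirrors an operation whose output is a bank of filters.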
The intuition comes from the complexity of action recognition. It is much harder to recognize an action in a video than objects in still images. An effective architecture should not only recognize the appearance of target objects associated with the action, but also understand how they relate to other objects in the scene, in both space and time.
In the figure, we have four categories of actions: a) pull something from left to right, b) push something from right to left, c) push something from left to right, and d) pull something from right to left. Appearance alone is not enough, since all four actions show only the hand and the object “something”. Temporal information alone is not enough either: figure b looks like figure a (“pull something from left to right”) played in reverse, yet its label is not simply the opposite, “pull something from right to left”; it is “push something from right to left”. To classify the actions, the model has to understand the object-in-context pattern.
As scenes become more complicated and the number of objects whose relations need to be tracked increases, the complexity of the learning task faced by the architecture increases rapidly. Vanilla convolutions use fixed filters, so they need a separate filter for every object-in-context pattern required to recognize a category of action, potentially leading to a blow-up in the number of parameters required for effective recognition of the actions.
In such settings, we do not want a huge number of filters to cover all possible object-in-context patterns. It is better if the model can propose (derive) a filter for a given context: different filters for different contexts. The model does not need to store all the filters; it only needs to learn how to propose one. The model can then capture motion in context better with a reasonable number of parameters.
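A toy sketch of "learn how to propose a filter" rather than "store every filter" might look like the following. This is not the architecture from the paper: the sizes `C` and `K`, the random matrix `W` standing in for learned weights, the `tanh` generator, and the one-hot context vectors are all assumptions chosen to keep the example small.

```python
import math
import random
from typing import List

random.seed(0)

C, K = 4, 3  # hypothetical context dimension and filter length

# The only stored parameters: a K x C matrix (random here, learned in practice).
W = [[random.gauss(0.0, 1.0) for _ in range(C)] for _ in range(K)]


def propose_filter(context: List[float]) -> List[float]:
    """Derive a K-tap filter from a context vector (stand-in generator)."""
    return [math.tanh(sum(w * c for w, c in zip(row, context))) for row in W]


def extract(signal: List[float], context: List[float]) -> List[float]:
    """Apply the context-specific filter with a valid 1-D cross-correlation."""
    kernel = propose_filter(context)
    return [
        sum(kernel[j] * signal[i + j] for j in range(K))
        for i in range(len(signal) - K + 1)
    ]


signal = [random.gauss(0.0, 1.0) for _ in range(10)]
ctx_a = [1.0, 0.0, 0.0, 0.0]
ctx_b = [0.0, 1.0, 0.0, 0.0]

feat_a = extract(signal, ctx_a)  # same signal, filter derived from context a
feat_b = extract(signal, ctx_b)  # same signal, a different derived filter
n_params = sum(len(row) for row in W)  # K * C, independent of how many contexts exist
```

In this sketch the stored parameter count is `K * C` no matter how many distinct contexts appear at test time, whereas a bank of fixed filters would need `K` weights per context it has to cover.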