[D] State of the art in video action recognition using transfer learning?
I am working on a project that requires action recognition in videos, specifically short (~10 second) YouTube clips. Ideally I want to start from a pre-trained network that can be fine-tuned, to avoid the cost of training from scratch.
AFAIK DeepMind’s I3D is widely accepted as the SOTA, and pre-trained checkpoints are available for it. Are there any interesting papers that challenge this approach, specifically ones whose models can also be used pre-trained and then fine-tuned?