[D] Is there an accepted state-of-the-art for Video Action Localisation/Region Suggestion?
I am working on a project which involves video action recognition, and am using the fantastic I3D approach.
However, I am now interested in localising the region of the video which contains the action being classified (i.e. with a bounding box). For example, if my I3D network classifies a segment of video as containing the action of a human “Eating an Apple”, I now want to draw a bounding box over the person eating the apple in each frame of that segment. Note that I am not interested in drawing a region around “Apple” or “Human” as objects, but instead around the region in which the action itself is being performed.
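To make the target output concrete, here is a minimal sketch of what I would like to get back for a classified clip: one action label plus one box per frame. The names `ActionTube`, `boxes` and `frame_span` are my own placeholders (not part of I3D or any library), and the coordinates are made up purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# (x_min, y_min, x_max, y_max) in pixel coordinates; an assumed convention.
Box = Tuple[float, float, float, float]


@dataclass
class ActionTube:
    """Hypothetical container: one action label and one box per frame."""
    label: str
    boxes: Dict[int, Box]  # frame index -> box around where the action occurs

    def frame_span(self) -> Tuple[int, int]:
        """Return the first and last frame index covered by this tube."""
        frames = sorted(self.boxes)
        return frames[0], frames[-1]


# Illustrative values only: a three-frame clip labelled by the classifier.
tube = ActionTube(
    label="Eating an Apple",
    boxes={
        0: (120.0, 40.0, 260.0, 300.0),
        1: (122.0, 41.0, 262.0, 301.0),
        2: (125.0, 43.0, 265.0, 303.0),
    },
)
```

So the classifier would supply `label`, and the localisation method I am looking for would supply `boxes`.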
I am familiar with similar approaches for object detection in images (e.g. YOLO), but am having trouble finding equivalent work for actions in the video domain. Can anyone point me to some good papers covering this, if any exist?