[D] How do object detection algorithms and feature extractor networks work together for action detection?

I’m talking about feature extractor architectures such as AlexNet and Inception, and object detectors like YOLO and SSD. I’ve read a bit online and I’m really confused about how they work together.
Let’s say I want to detect a specific object/person in a video and put a box around them with a label describing the state of that object/person. How would that work? What steps would the object detector and the feature extractor each take? A workflow for this would be really helpful.

