Learning to Generate Human–Object Interactions – Stanford AI Lab
Tremendous progress has been made in the last few years in developing advanced virtual reality (VR) and robotics platforms. As the examples above show, these platforms now allow us to experience more immersive virtual worlds, or allow robots to perform challenging locomotion tasks like walking in snow. So, can we soon expect to have robots that can set the dinner table or do our dishes?
Unfortunately, we are not there yet.
To understand why, consider the diversity of interactions in daily human life. We spend almost all of our waking hours performing activities, from simple actions like picking up a piece of fruit to more complex ones like cooking a meal. These physical interactions, called human–object interactions, are multi-step and governed by physics as well as human goals, customs, and biomechanics. To develop more dynamic virtual worlds and smarter robots, we need to teach machines to capture, understand, and replicate these interactions. The information we need to learn these interactions is already widely available in the form of large video collections (e.g., YouTube, Netflix, Facebook).
In this post, I will describe some first steps we have taken towards learning multi-step human–object interactions from videos. I will discuss two applications of our method: (1) generating plausible and novel human–object interaction animations suitable for VR/AR, and (2) enabling robots to react smartly to user behavior and interactions.