New video-based approach to 3D motion capture makes virtual avatars more realistic than ever
With Video Inference for Body Pose and Shape Estimation (VIBE), scientists at the Max Planck Institute for Intelligent Systems have developed a neural network that makes video-based 3D motion capture more accurate, faster, and less expensive.
Tübingen. June 17, 2020 – A team of scientists at the Max Planck Institute for Intelligent Systems (MPI-IS) in Germany has developed VIBE, an algorithmic model that enables more detailed and accurate estimates of 3D human motion from video than was previously possible. They describe the model in the recently published paper, “VIBE: Video Inference for Body Pose and Shape Estimation”, which they are presenting today at the Conference on Computer Vision and Pattern Recognition (CVPR). One of the most competitive conferences in the field, CVPR 2020 kicked off on June 14 and is being held online until June 18.
“Previous frameworks do a good job of estimating 3D human pose and shape from a single image. But video-based models have not been able to mimic human motion realistically because of limited training data,” says Muhammed Kocabas, a Ph.D. student in the Perceiving Systems Department at the MPI-IS and a co-author of the paper. “With VIBE, we have successfully addressed this challenge.”
VIBE is a learning-based framework that draws on AMASS, a large-scale motion capture dataset developed at MPI-IS that can be used for animation, visualization, and generating training data for deep learning. The scientists trained the VIBE algorithm on an NVIDIA GPU not only to estimate 3D human motion, but also to distinguish between real and implausible movements, with AMASS serving as the source of real human motion. Given a single video of a person moving, the model first extracts image features from each frame using a convolutional neural network (CNN), a type of neural network widely used in machine learning to recognize and classify images. These features are then processed by a recurrent neural network (RNN), a type of network designed to handle temporal sequences and thus able to capture the sequential nature of human motion. The result is a smooth, realistic prediction of human pose, shape, and motion.
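For readers who want to see the two-stage idea in concrete form, the following is a minimal illustrative sketch in Python. It is not the VIBE implementation. The function names, the dimensions, the chunk-averaging "feature extractor" that stands in for a CNN, and the untrained random weights are all assumptions made purely for illustration. The sketch only shows how per-frame features can be passed through a simple recurrent network so that each output depends on the frames that came before it.

```python
import math
import random


def extract_features(frame, feat_dim=4):
    """Hypothetical stand-in for a CNN: average a flat list of pixel values
    in equal chunks to get a small per-frame feature vector."""
    chunk = len(frame) // feat_dim
    return [sum(frame[i * chunk:(i + 1) * chunk]) / chunk for i in range(feat_dim)]


def make_matrix(rows, cols, rng, scale=0.1):
    """Small random weight matrix (untrained, for illustration only)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def run_temporal_model(features, hidden_dim=8, out_dim=3, seed=0):
    """Simple (Elman-style) recurrent pass over a sequence of feature vectors.

    The hidden state carries information from earlier frames forward, so the
    output for a frame depends on the motion that preceded it.
    """
    rng = random.Random(seed)
    feat_dim = len(features[0])
    w_in = make_matrix(hidden_dim, feat_dim, rng)
    w_rec = make_matrix(hidden_dim, hidden_dim, rng)
    w_out = make_matrix(out_dim, hidden_dim, rng)

    hidden = [0.0] * hidden_dim
    outputs = []
    for feat in features:
        pre = [a + b for a, b in zip(matvec(w_in, feat), matvec(w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]
        # In a real system this vector would hold pose and shape parameters.
        outputs.append(matvec(w_out, hidden))
    return outputs


# Usage: three toy "frames" of eight pixel values each.
video = [[0.1] * 8, [0.4] * 8, [0.7] * 8]
per_frame_features = [extract_features(frame) for frame in video]
predictions = run_temporal_model(per_frame_features)
print(len(predictions), "output vectors, one per frame")
```

In a trained system, the weights would be learned from data, and a separate network trained on real motion capture sequences could score whether the predicted motion looks plausible. That part is omitted here.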
“What sets VIBE apart is its ability to detect a human subject’s entire range of action and motion in detail, including the way limbs and extremities move,” says Nikos Athanasiou, who is also a Ph.D. student in the Perceiving Systems Department and co-author of the paper. “From a single video, VIBE can produce realistic human motion very quickly, without any additional effort.”
VIBE could have a decisive impact on 3D animation. While high-quality virtual movement has long been a fixture of animated film and video games, producing realistic human shapes and poses generally involves a great deal of handcrafting: annotating a few seconds of video takes graphic artists and technicians several hours and requires an elaborate set-up of sensors and cameras. With VIBE, 3D motion capture can be easier, faster, and much less expensive.
“Understanding human behavior – how people move about in a scene, for example – is a fundamental task in the field of computer vision,” says Michael J. Black, Director at the Max Planck Institute for Intelligent Systems in Tübingen and head of the Perceiving Systems Department. “The VIBE model helps to improve this understanding, and it shows promise for applications in a broad range of fields, from augmented reality to autonomous driving, robotics, and medical applications. More accurate 3D predictions of human motion will pave the way for computers to work more collaboratively with humans.”
Max Planck Institute for Intelligent Systems
Phone: +49 7071 601 1832
Mobile: +49 151 1560 4276
At the Max Planck Institute for Intelligent Systems we aim to understand the principles of Perception, Action and Learning in Intelligent Systems.
The Max Planck Institute for Intelligent Systems is located in two cities: Stuttgart and Tübingen. Research at the Stuttgart site covers small-scale robotics, self-organization, haptic perception, bio-inspired systems, medical robotics, and physical intelligence. The Tübingen site focuses on machine learning, computer vision, robotics, control, and the theory of intelligence.
The Perceiving Systems Department combines computer vision, machine learning, and computer graphics to train computers to understand humans and their behavior in images and video. The team’s unique approach begins with learning compact parametric models of 3D human shape and motion, which it then uses to extract and analyze human behavior in the context of 3D scenes. The department has approximately 45 staff and students and additional affiliated researchers. It operates unique 4D scanning facilities that produce highly accurate and detailed 3D meshes of the body, face, hands, and feet at 60 frames per second. The department also employs wearable motion capture suits, flying robots, and camera-based systems to record human movement.
The MPI-IS is one of the 86 Max Planck Institutes and research institutions that are part of the Max Planck Society. The Max Planck Society is Germany’s most successful research organization. Since its establishment in 1948, no fewer than 18 Nobel laureates have emerged from the ranks of its scientists, putting it on par with the best and most prestigious research institutions worldwide. All Institutes conduct basic research in the service of the general public in the natural sciences, life sciences, social sciences, and the humanities. Max Planck Institutes focus on research fields that are particularly innovative, or that are especially demanding in terms of funding or time requirements. Their research spectrum is continually evolving: new institutes are established to find answers to seminal, forward-looking scientific questions, while others are closed when, for example, their research field has been widely established at universities. This continuous renewal preserves the scope the Max Planck Society needs to react quickly to pioneering scientific developments.