The Future of Kinect
"On the Whiteboard" host Pamela Woon prepares to get a 3D model of her face, as part of a process that gathers data on voices, body gestures and facial expressions to continuously improve the Kinect experience.
By Athima Chansanchai | June 05, 2014

Zombies don’t have to be scary – especially when kids can create them in their own image. Using the Kinect for Windows v2 sensor and an app called YAKiT, children can step into the role of the undead and watch their creations come to life through performance-based animation. Like so many who use the Kinect sensor, kids don’t need a laundry list of instructions. They just step in front of it, creep like zombies and instantly, their animated figures move like them, sparking a cacophony of giggles.

While the latest version of Kinect has been available since the launch of Xbox One, preorders for the Kinect for Windows version open to all developers today. Both sensors are built on a set of shared technologies.

Companies such as Freak’n Genius, the Seattle-based company behind YAKiT, have already had the chance to try the Kinect for Windows v2 sensor through Microsoft’s Developer Preview Program. “It’s so magical, honestly,” says Kyle Kesterson, Freak’n Genius founder. “We put people in front of it, and they light up without even having to do anything.”

But behind that magic is the culmination of years of machine learning. It’s all part of a complex 24-7 process that involves a legion of people and resources that gather data on voices, body gestures and facial expressions, then test the information and analyze it before the software makes its way to your living room.

“Bringing Kinect to Xbox One to deliver vision and hearing capability was just the start of a long journey and evolution of Natural User Interaction (NUI),” says Scott Evans, partner group program manager for Kinect. NUI breaks down the barriers between human and machine, so that interacting is as natural as talking to another person or conveying intention through non-verbal nuances.

Kinect has to work for everybody, in their natural environment. Luckily, the device is a quick study, thanks to people working every day to make it better. At any given time, more than 300 Xbox developer kits are testing up to 2 million frames of video gathered from thousands of home visits, motion-capture sessions and in-house experiments.

Machine learning: Teaching software how to behave

At Microsoft, there’s a whole group of people in the NUI group focused on taking requests from different teams and gathering information about how people move and express themselves.

“We start with designing the hardware, getting the best eyes and ears into the living room. Then we go through the process of building the software for it – the brain that takes that raw signal and takes it into an understanding of the room and the people in it,” says Evans.

When it was released as part of Xbox One, Kinect was already programmed to recognize certain movements and objects as a baseline. But in order to improve that software, first Microsoft needs to document real people using it in their natural environments, then manually compare what Kinect sees with reality (“ground truth”). That data is then fed into a system, which runs algorithms to find where its software recognition doesn’t match the ground truth – and that’s where it knows to improve.
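For the technically inclined, the comparison loop described above can be sketched roughly like this. It is a conceptual illustration only, not Microsoft’s actual pipeline; the joint names, the skeleton format and the error threshold are all invented for the example:

```python
# Conceptual sketch of comparing software recognition against hand-tagged
# "ground truth" -- an illustration, NOT Microsoft's actual pipeline.
# Joint names, frame format and the 5 cm threshold are assumptions.
import math

def joint_error(predicted, truth):
    """Euclidean distance between a predicted 3D joint and its tagged position."""
    return math.dist(predicted, truth)

def frames_needing_improvement(predictions, ground_truth, threshold=0.05):
    """Return indices of frames where any joint misses ground truth by more
    than `threshold` meters -- the cases the software still gets wrong."""
    bad_frames = []
    for i, (pred, truth) in enumerate(zip(predictions, ground_truth)):
        if any(joint_error(pred[j], truth[j]) > threshold for j in truth):
            bad_frames.append(i)
    return bad_frames

# One frame: a skeleton as a mapping from joint name to (x, y, z) in meters.
truth = [{"head": (0.0, 1.7, 2.0), "left_hand": (-0.4, 1.1, 2.0)}]
preds = [{"head": (0.0, 1.7, 2.0), "left_hand": (-0.4, 1.3, 2.0)}]  # hand off by 0.2 m
print(frames_needing_improvement(preds, truth))  # [0]
```

Frames flagged this way are exactly where the machine-learning process knows to focus its next round of training.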

Collecting data for Kinect means bringing volunteers to labs on the Microsoft campus, suiting up for motion capture sessions and visiting Microsoft employees’ homes – a diverse group that spans age, gender, languages and ethnicity – to record video clips of bodies in natural motion.

Kinect’s infrared camera records pre-requested movements for later processing.

Three times a day, the Interactive Research Services team visits employees who live within 25 miles of the main Redmond campus. These visits began in October 2012, and they’ve now gone to more than 1,000 homes at the request of teams working on facial recognition, color calibration, expressions, controllers, gestures, speech, audio and identity, among others.

Darrell Mitchell and Brandon Broady are the current two-man team that conducts home visits. Mitchell records the video clips through the Kinect sensor for Xbox One the team brings along, while Broady, like a fitness instructor, gives participants instructions to follow. These actions are then recorded with infrared cameras, which can map in 3D in near darkness.

Microsoft employee Seng Teung gets ready for a session in “The Holodeck,” which uses DSLR cameras to create 3D models.

Back on the Microsoft campus, in a room called “The Holodeck,” Senior Program Manager Rainer Schiller takes approximately 20 still images to begin modeling a 3D face. This helps train the Kinect to recognize different types of faces and create avatars such as those found in “Kinect Sports Rivals.”

In another building, User Research Lead Anatole Chen works with a suited up Alexander Clark to record thousands of different movements and gestures – like baseball and golf swings – using 24 four-megapixel infrared cameras. This is the basis for synthetic data that can be manipulated later to make Kinect recognize its users more accurately. This information establishes the baseline against which home visit data can be compared later.

Senior Tagger Alexander Clark suits up in a mo-cap outfit while colleague Anatole Chen analyzes his movements.

The Ground Truth

All that data then goes to taggers who establish “ground truth.” It’s a tedious but necessary job: skeleton tracking, in which 25 joints on the human body are tagged electronically, frame by frame. This is how movement is documented in 3D space and fed into machine learning. About 20 in-house taggers have to define where the head, shoulders, hands and feet are – as well as other areas on the body.
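To give a sense of what “25 joints, frame by frame” means in practice, here is a minimal sketch of one tagged frame. The 25 joint names follow the Kinect for Windows v2 JointType enumeration; the record layout itself is an illustrative assumption, not the real tagging format:

```python
# A minimal sketch of a frame-by-frame skeleton tag. The joint names
# follow the Kinect v2 JointType enumeration; the record layout is an
# illustrative assumption, not Microsoft's actual tagging format.
KINECT_V2_JOINTS = [
    "SpineBase", "SpineMid", "Neck", "Head",
    "ShoulderLeft", "ElbowLeft", "WristLeft", "HandLeft",
    "ShoulderRight", "ElbowRight", "WristRight", "HandRight",
    "HipLeft", "KneeLeft", "AnkleLeft", "FootLeft",
    "HipRight", "KneeRight", "AnkleRight", "FootRight",
    "SpineShoulder", "HandTipLeft", "ThumbLeft", "HandTipRight", "ThumbRight",
]

def tag_frame(frame_index, positions):
    """Build one ground-truth record: every one of the 25 joints must be
    placed before the frame counts as fully tagged."""
    missing = [j for j in KINECT_V2_JOINTS if j not in positions]
    if missing:
        raise ValueError(f"frame {frame_index} is missing joints: {missing}")
    return {"frame": frame_index, "joints": positions}

print(len(KINECT_V2_JOINTS))  # 25
```

Multiply 25 joints by a million frames and the scale of the taggers’ job becomes clear.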

Jon Clark, Lead QA for Tagging Evolution and Production, reviews a tagged clip to make sure it meets tagging standards.

There are a lot of obstacles in their way – couches, slouchy posture, pets, crying babies, to name just a few things – and it’s their job to tell the computer where the human who’s using the device begins and ends.

More than 1 million frames of images were hand annotated before the Xbox One launch. Work like this hasn’t gone unnoticed by those developing apps that use the Kinect.

“Kinect is a great example of getting incredibly sophisticated technology to create simple solutions,” says Spencer Hutchins, co-founder and CEO of the San Diego-based Reflexion Health. Hutchins’ company has used Kinect to make Vera, an app intended to motivate patients to do their physical therapy exercises at home using the Kinect for Windows v2 sensor. When the Kinect device is hooked up to a computer, it provides companies and developers with the foundation they need to create interactive applications that respond to natural movements, gestures and voice commands.

Hutchins adds, “The ability of the Kinect system to track and record individuals performing their exercises opens up enormous opportunities for physical therapists to understand what's happening with their patients, and enlist Vera's help in coaching their patients to perform exercises with the proper form.”

Passing the Gauntlet

Vince Ortado’s team at Microsoft processes up to 180,000 video clips an hour, running machine learning algorithms that improve Kinect’s software. More than 300 Xbox developer kits operate 24-7, divided into groups testing anything from hand gestures to identity.

It’s important that these millions of frames of video move through as fast as possible, because the teams working on Kinect can act only after they’ve received the results – and they’re on a brisk schedule, with monthly software releases that give users a continuously improving experience.

“These machines are a gauntlet. You get through the gauntlet or not. You have to pass to give developers and the senior leadership team the confidence, the information that this build is a good enough build for our audience,” says Ortado.

A bay of Xbox One developer kits processes millions of frames of Kinect video every week.

Looking Ahead

Startups such as Freak’n Genius and Reflexion Health and games such as “Kinect Sports Rivals” show what’s already possible, and what could be on the horizon.

“It’s amazing the power that comes in such an affordable and easily distributable price point,” Hutchins says. “Motion tracking has been in medicine for decades, but it’s always proprietary, research oriented, and completely immobile. Kinect allows us to bring the power of that tracking to the real world – and layer on an immersive, engaging interface experience."

“We’re trying to create visual content that makes people say, ‘Wow, that’s cool,’ that’s high quality and polished,” Kesterson says. “We’re constantly bombarded by highly produced content all the time, so the more we and Kinect can do the hard work, it makes the quality better for the creator and the creator’s audience.”

Right now, people can experience Kinect through Xbox One: playing games, choosing movies and using Skype. Or they might be out and about and interact with a Kinect for Windows sensor as part of a retail experience, or in other spaces such as museums, hotels or corporate offices. Or they may happen upon interactive animation experiences such as those Freak’n Genius has staged, which put people on stage dancing as a company mascot. The availability of preorders on Thursday will allow even more Kinect for Windows v2 sensors to get into the hands of developers and enable a wider variety of user scenarios.

As for the teams of people who continue working to improve Kinect, Kinect’s Evans says, “It’s all about making Kinect work whether or not you have a puffy couch or a ficus in your living room that might look like a person. Being able to always get it right and understand who you are in your natural environment, in every living room with every person. That’s the investment we make in doing the machine learning. It’s to get it right for everybody.”