Virtual Stage is a background matting experiment to create high-quality videos from anywhere, improving current virtual green screening techniques with the Azure Kinect depth camera.
As we prepared for the first-ever online-only delivery of Microsoft Build, we set out to create a series of short videos, the Innovation Bites. Like many marketers in the current situation, we found ourselves struggling with the gap between what each speaker could record from home and our vision for what we wanted to create.
We took inspiration from the background matting approaches from Adobe Research and the University of Washington (UW). We wanted to build on these ideas with the Azure Kinect sensor to give everyone an easy way to use green screening when they record from home.
We have developed a simple front-end experience that lets speakers record themselves with up to 2 Azure Kinect sensors at the same time, plus their slides, and a back-end that takes the approaches from UW and Adobe Research and adds the Azure Kinect sensor's depth capabilities to improve the precision of the background matting. The back-end also automates the upload of the files to streamline content production.
A Virtual Stage for Build 2020
At Build we delivered a video series using our virtual stage for the first time. This video, part of the series, explains the technical details of our approach and provides a high-level overview of the research from the UW.
Technical details for Virtual Stage
This project consists of 2 different elements:
1. Front-end app
The presenter uses the app to record themselves. It can record 3 video files from a single session: up to two videos from the Azure Kinect sensors and a separate file with the presenter’s slides and audio, which makes post-production easier.
The app allows the presenter to review multiple recordings and keep the best ones before sending to the back-end process to generate the virtual green screen.
The front-end app also includes an easy wizard that guides the user to record the background without anyone in frame; this clean background plate is what the matting process uses to solve for the foreground and alpha. It makes it easier for the post-production team to ensure the best possible results in a remote collaboration scenario.
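To see why the clean background plate helps, consider that a pixel which differs strongly from the plate is very likely part of the presenter. Below is a minimal sketch of that idea; the function name and the threshold are illustrative, not taken from the project's code, and the real pipeline solves for a continuous alpha rather than a hard mask.

```python
import numpy as np

def rough_alpha_from_plate(frame, background, threshold=30):
    """Rough foreground estimate: pixels that differ strongly from the
    clean background plate are likely the presenter.

    `frame` and `background` are uint8 H x W x 3 arrays; the name and
    threshold are illustrative, not from the project's code.
    """
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    # Max channel difference per pixel; a large difference -> foreground.
    alpha = (diff.max(axis=2) > threshold).astype(np.float32)
    return alpha  # 1.0 where the presenter likely is, 0.0 elsewhere

# Toy example: a flat grey background plate with a bright "presenter" patch.
bg = np.full((4, 4, 3), 100, dtype=np.uint8)
frame = bg.copy()
frame[1:3, 1:3] = 200  # the subject differs strongly from the plate
alpha = rough_alpha_from_plate(frame, bg)
```

A hard threshold like this fails near soft edges (hair, motion blur), which is exactly where the learned matting model earns its keep.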
2. Back-end console process
Once the user provides the video files, we need to create the virtual green screen so that we can edit in the software of our choice. The output of this process is the video with just the presenter (the rest being a virtual green screen).
This process leverages the approach by the University of Washington team; we added improvements so we can utilize the depth recording from the Azure Kinect sensor to remove the background more precisely.
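The depth stream gives a per-pixel distance in millimeters, so a simple way to exploit it is to keep only pixels within the band of distances where the presenter stands. The exact banding and cleanup the project uses is not published; this is a hedged sketch of depth-based background removal, with the near/far limits as assumed parameters.

```python
import numpy as np

def depth_silhouette(depth_mm, near_mm=500, far_mm=2000):
    """Binary presenter silhouette from a depth frame (values in mm).

    Assumes the presenter stands between `near_mm` and `far_mm` from
    the sensor; zero depth values are invalid readings.
    """
    valid = depth_mm > 0                                   # drop invalid pixels
    in_band = (depth_mm >= near_mm) & (depth_mm <= far_mm) # presenter's range
    return (valid & in_band).astype(np.uint8)

# Toy depth frame: presenter at ~1.2 m, wall at 3 m, one invalid pixel.
depth = np.array([[3000, 1200, 1200],
                  [3000, 1200,    0]], dtype=np.uint16)
mask = depth_silhouette(depth)
```

In practice such a mask would still be cleaned up (e.g. with morphological filtering) before being fed to the matting model as a silhouette prior.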
One of the key challenges we tried to solve in our implementation is that the dataset used for training (Adobe Composition-1k dataset) contained only upper body images -- and we wanted to capture our presenters in long shots. This limitation not only means that legs are not properly processed by the model due to lack of training data, but also that the neural net output is a square bounding box – perfect for just upper-body images but not ideal for a full body.
To solve the bounding box issue, we split the image in two. Yet, that still doesn’t solve for the lack of precision when it comes to recognizing the legs. This is where the capabilities of Azure Kinect have helped.
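Splitting the frame so a square-input model can cover a full body can be sketched as follows. The stand-in "model" and the exact split (no overlap, no seam blending) are assumptions for illustration; the project's actual split and stitching details are not published.

```python
import numpy as np

def matte_full_body(frame, matte_square):
    """Run a square-input matting model over a tall full-body frame by
    splitting it into top and bottom halves and stitching the results.

    `matte_square` stands in for the network; this minimal sketch uses
    a plain split with no overlap or seam blending.
    """
    h = frame.shape[0]
    top, bottom = frame[: h // 2], frame[h // 2 :]
    return np.concatenate([matte_square(top), matte_square(bottom)], axis=0)

# Stand-in "model": keep pixels brighter than 128 (placeholder logic).
fake_model = lambda img: (img > 128).astype(np.float32)

frame = np.zeros((8, 4), dtype=np.uint8)
frame[2:6, 1:3] = 255  # a tall "presenter" spanning both halves
alpha = matte_full_body(frame, fake_model)
```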
The UW approach proposes two steps: first, extract the background with supervised learning; second, refine the output in an unsupervised way through a GAN. The first step is done by a deep network that estimates the foreground and alpha from input comprised of the original image, the background photo, and an automatically computed soft segmentation of the person in frame.
By combining this approach with the Azure Kinect API, we can replace the automatically computed soft segmentation of the person in frame with the more precise silhouette captured by our sensor. The input to this first model is therefore the sensor information (the IR stream and silhouette, plus the unprocessed video) and the background without the speaker (captured by the user in the front-end app), which gives us a more precise foreground and alpha estimation as output. We then refine through the unsupervised GAN to improve the results even more.
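Once the foreground and alpha are estimated, producing the virtual green screen output is the standard compositing equation, out = alpha * F + (1 - alpha) * G with G a flat chroma-key green. A minimal sketch, using the original frame as a stand-in for the recovered foreground F:

```python
import numpy as np

def composite_green_screen(frame, alpha):
    """Replace everything but the presenter with a virtual green screen
    via the compositing equation out = alpha*F + (1-alpha)*G.

    The original frame stands in for the foreground F here; a refined
    matte would recover F separately near soft edges.
    """
    green = np.array([0, 255, 0], dtype=np.float32)  # chroma-key green G
    a = alpha[..., None].astype(np.float32)          # broadcast over channels
    out = a * frame.astype(np.float32) + (1.0 - a) * green
    return out.astype(np.uint8)

# Toy 1x2 image: left pixel is presenter (alpha 1), right is background.
frame = np.array([[[200, 50, 50], [10, 10, 10]]], dtype=np.uint8)
alpha = np.array([[1.0, 0.0]])
out = composite_green_screen(frame, alpha)
```

Because alpha is continuous rather than binary, semi-transparent regions such as hair blend smoothly into the green, which is what lets the editing software key it out cleanly later.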
How can you use this project for your own events/videos?
- From the speaker/content creator side, you will need 1 or 2 Azure Kinect sensors (plus applicable hardware/software requirements as listed).
- Go to GitHub where you will find the source code for both the front-end app and the server side, as well as the user manual for the app.
- We will keep updating this project to provide ARM templates to easily deploy the back-end on Azure, and to improve the comments and documentation.