The need

As we were preparing for the first ever online-only delivery of Microsoft Build, we set out to create a series of short videos, the Innovation Bites. Like many marketers in the current situation, we found ourselves caught between the limitations of what each speaker could create from home and our vision for what we wanted to produce.

The idea

We took inspiration from the background matting approaches from Adobe Research and the University of Washington (UW). We wanted to build on these ideas with our Azure Kinect sensor to create an easy way for everyone to use green screening when recording from home.

The solution

We developed a simple front-end experience that lets speakers record themselves with up to two Azure Kinect sensors at the same time, plus their slides, and a back-end that takes the UW and Adobe Research approach and adds the Azure Kinect sensor's depth capabilities to improve the precision of the background matting. The back-end also automates the upload of the files to streamline content production.

Technical details for Virtual Stage

This project consists of two elements:

1. Front-end app

The presenter uses the app to record themselves. A single recording can produce three video files: up to two videos from the Azure Kinect sensors and a separate file for the presenter’s slides with audio, to make post-production easier.

The app lets the presenter review multiple recordings and keep the best ones before sending them to the back-end process that generates the virtual green screen.

The front-end app also includes a simple wizard that guides the user through recording the background without the presenter in frame, which is needed to solve for the foreground and alpha values in the matting process. This makes it easier for the post-production team to ensure the best possible results in a remote-collaboration scenario.
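One simple way to turn the wizard's short presenter-free recording into a stable clean plate is a per-pixel temporal median over the captured frames, which suppresses sensor noise. This is a minimal NumPy sketch of that idea; the function name and the median approach are illustrative assumptions, not the project's actual implementation.

```python
import numpy as np

def clean_plate(frames):
    """Build a clean background plate from a short recording without the
    presenter, using a per-pixel temporal median to suppress sensor noise.

    frames: list of HxWx3 uint8 images captured by the background wizard.
    """
    # Stack to (N, H, W, 3) and take the median across time (axis 0).
    return np.median(np.stack(frames), axis=0).astype(np.uint8)

# Three slightly noisy captures of the same static background.
frames = [np.full((2, 2, 3), v, dtype=np.uint8) for v in (98, 100, 102)]
plate = clean_plate(frames)
```

The median is robust to transient artifacts (a brief shadow or flicker in one frame), which a plain average would smear into the plate.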

2. Back-end console process

Once the user provides the video files, we need to create the virtual green screen so that we can edit the footage in the software of our choice. The output of this process is a video containing just the presenter, with the rest replaced by a virtual green screen.
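The final compositing step can be sketched in a few lines of NumPy: given an estimated alpha matte and foreground, blend the presenter over a solid green background. This is a hypothetical minimal sketch of the output stage, not the project's actual back-end code.

```python
import numpy as np

def to_green_screen(foreground, alpha, green=(0, 255, 0)):
    """Composite an estimated foreground over a solid green background.

    foreground: HxWx3 uint8 image of the presenter
    alpha:      HxW float matte in [0, 1], 1 = presenter, 0 = background
    """
    a = alpha[..., None]              # broadcast matte over color channels
    bg = np.zeros_like(foreground)
    bg[...] = green                   # fill the plate with pure green
    out = a * foreground.astype(np.float32) + (1.0 - a) * bg.astype(np.float32)
    return out.astype(np.uint8)

# Tiny example: left column is presenter (alpha 1), right is background (alpha 0).
fg = np.full((2, 2, 3), 200, dtype=np.uint8)
alpha = np.array([[1.0, 0.0], [1.0, 0.0]])
frame = to_green_screen(fg, alpha)
```

Any editing software can then key out the green exactly as it would with a physical green screen.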

This process leverages the approach of the University of Washington team; we added improvements that use the depth recording from the Azure Kinect sensor to remove the background more precisely.
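The core intuition behind using depth is simple: the presenter sits in a known distance band, so the depth frame alone already yields a coarse presenter mask. The sketch below shows the idea with NumPy; the thresholds and function name are illustrative assumptions, and the real pipeline combines this signal with the matting network rather than using it directly.

```python
import numpy as np

def depth_mask(depth_mm, near_mm=500, far_mm=2500):
    """Coarse presenter mask from an Azure Kinect depth frame (millimeters).

    Pixels inside [near_mm, far_mm] are treated as presenter; zeros are
    invalid depth readings and excluded. Thresholds are illustrative only.
    """
    valid = depth_mm > 0
    return valid & (depth_mm >= near_mm) & (depth_mm <= far_mm)

# Example: presenter at ~1.5 m, wall at ~4 m, and one invalid (zero) pixel.
depth = np.array([[1500, 4000], [0, 1600]])
mask = depth_mask(depth)
```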

One of the key challenges in our implementation is that the dataset used for training (the Adobe Composition-1k dataset) contains only upper-body images, while we wanted to capture our presenters in long shots. This limitation means not only that legs are not processed properly by the model, due to the lack of training data, but also that the neural net outputs a square bounding box, which is perfect for upper-body images but not ideal for a full body.

To solve the bounding box issue, we split the image in two. Yet that still doesn’t address the lack of precision when it comes to recognizing the legs. This is where the capabilities of the Azure Kinect sensor helped.
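The split-and-stitch workaround can be sketched as follows: cut the tall frame into top and bottom halves, run the square-input matting model on each, and stack the two mattes back together. The `run_model` callable is a placeholder for the matting network; the stand-in model below exists only to make the sketch runnable.

```python
import numpy as np

def matte_full_body(frame, run_model):
    """Work around a square-input matting model by splitting a tall frame
    into top and bottom halves, matting each half, and stitching the results.

    run_model: placeholder for the per-crop matting network; it must map an
    HxWx3 crop to an HxW alpha matte.
    """
    h = frame.shape[0] // 2
    top, bottom = frame[:h], frame[h:]
    return np.vstack([run_model(top), run_model(bottom)])

# Stand-in "model": mean intensity as a fake alpha, for illustration only.
fake_model = lambda crop: crop.mean(axis=2) / 255.0
frame = np.zeros((4, 2, 3), dtype=np.uint8)
alpha = matte_full_body(frame, fake_model)
```

In practice the halves would overlap slightly and the seam would be blended, but the principle is the same.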

The UW approach proposes two steps: first, extract the foreground and alpha matte with supervised learning; second, refine the output in an unsupervised way through a GAN. The first step is done by a deep network that estimates the foreground and alpha from an input comprised of the original image, the background photo, and an automatically computed soft segmentation of the person in frame.

By combining this approach with the Azure Kinect API, we can replace the automatically computed soft segmentation of the person in frame with the more precise silhouette captured by our sensor. The input to this first model is therefore the sensor information (both the IR and silhouette streams, as well as the unprocessed video) and the background without the speaker (captured by the user in the front-end app), which gives us a more precise foreground and alpha estimation as output. We then refine the result through the unsupervised GAN to improve it even more.
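Assembling the network input can be pictured as channel stacking: the raw frame, the clean background plate, and the Kinect silhouette (in place of the computed soft segmentation) concatenated along the channel axis. This is a hedged sketch; the function name, channel order, and 7-channel layout are assumptions for illustration, not the actual model interface.

```python
import numpy as np

def build_matting_input(frame, background, silhouette):
    """Stack the inputs fed to the supervised matting network: the raw
    frame, the clean background plate, and the Kinect silhouette standing
    in for the automatically computed soft segmentation.

    frame, background: HxWx3 uint8; silhouette: HxW bool from the sensor.
    Returns an HxWx7 float array in [0, 1] (3 + 3 + 1 channels).
    """
    seg = silhouette.astype(np.float32)[..., None]   # HxWx1 mask channel
    return np.concatenate(
        [frame.astype(np.float32) / 255.0,
         background.astype(np.float32) / 255.0,
         seg],
        axis=-1,
    )

frame = np.zeros((2, 2, 3), dtype=np.uint8)
bg = np.full((2, 2, 3), 255, dtype=np.uint8)
sil = np.array([[True, False], [False, True]])
x = build_matting_input(frame, bg, sil)
```

Because the silhouette comes from the depth sensor rather than a segmentation network, the mask channel is sharper around the legs, which is exactly where the model's training data was weakest.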

How can you use this project for your own events/videos?

  • From the speaker/content creator side, you will need one or two Azure Kinect sensors (plus the applicable hardware/software requirements as listed).
  • Go to GitHub, where you will find the source code for both the front-end app and the server side, as well as the user manual for the app.
  • We will keep updating this project to provide ARM templates for easily deploying the back-end on Azure, as well as improving the comments and documentation.

