{"id":641841,"date":"2020-03-17T08:56:42","date_gmt":"2020-03-17T15:56:42","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=641841"},"modified":"2020-03-17T09:10:52","modified_gmt":"2020-03-17T16:10:52","slug":"training-deep-control-policies-for-the-real-world","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/training-deep-control-policies-for-the-real-world\/","title":{"rendered":"Training deep control policies for the real world"},"content":{"rendered":"<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-642639 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA_FC_5.00_01_21_18.Still004.jpg\" alt=\"drone image\" width=\"1400\" height=\"780\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA_FC_5.00_01_21_18.Still004.jpg 1400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA_FC_5.00_01_21_18.Still004-300x167.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA_FC_5.00_01_21_18.Still004-1024x571.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA_FC_5.00_01_21_18.Still004-768x428.jpg 768w\" sizes=\"auto, (max-width: 1400px) 100vw, 1400px\" \/><\/p>\n<p>Humans subconsciously use perception-action loops to do just about everything, from walking down a crowded sidewalk to scoring a goal in a community soccer league. Perception-action loops\u2014using sensory input to decide on appropriate action in a continuous real-time loop\u2014are at the heart of autonomous systems. Although this technology has advanced dramatically in its ability to use sensors and cameras to reason about control actions, the current generation of autonomous systems is still nowhere near human skill at making those decisions directly from visual data. Here, we share how we have built machine learning systems that reason about the correct actions to take directly from camera images. The system is trained via simulation and learns to independently navigate challenging environments and conditions in the real world, including previously unseen situations.<\/p>\n<p style=\"text-align: center;\"><a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/learning-visuomotor-policies-for-aerial-navigation-using-cross-modal-representations\/\">Read the Paper<\/a>\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/microsoft\/AirSim-Drone-Racing-VAE-Imitation\">Download the Code<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.youtube.com\/watch?v=AxE7qGKJWaw&feature=youtu.be\"> Watch the Video<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/p>\n<p>We wanted to push current technology to get closer to a human\u2019s ability to interpret environmental cues, adapt to difficult conditions, and operate autonomously.
For example, in First Person View (FPV) drone racing, expert pilots can plan and control a quadrotor with high agility using a noisy monocular camera feed, without compromising safety. We were interested in exploring the question of what it would take to build autonomous systems that achieve similar performance levels. We trained deep neural networks on simulated data and deployed the learned models in real-world environments. Our framework explicitly separates the perception components (making sense of what you see) from the control policy (deciding what to do based on what you see). This two-stage approach helps researchers interpret and debug the deep neural models, which is hard to do with full end-to-end learning.<\/p>\n<p>The ability to efficiently solve such perception-action loops with deep neural networks can have a significant impact on real-world systems. Examples include our collaboration with researchers at Carnegie Mellon University and Oregon State University, collectively named Team Explorer, on the DARPA Subterranean (SubT) Challenge. The DARPA challenge centers on assisting first responders and those who lead search and rescue missions, especially in hazardous physical environments, to more quickly identify people in need of help.<\/p>\n<p><iframe loading=\"lazy\" title=\"Inside look at the DARPA Subterranean Urban Circuit Challenge 2020\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/PhF3cbQRp2s?feature=oembed&rel=0\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe><\/p>\n<p>The video above shows the DARPA Subterranean Challenge, one of the ways Microsoft is advancing the state of the art in autonomous systems by supporting research focused on solving real-world challenges. Learn more about <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/aka.ms\/as\">Microsoft Autonomous Systems.<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/p>\n<p>Team Explorer has participated in the first two circuits of the challenge, taking second place in the February 2020 Urban Circuit and first place in the <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/helping-first-responders-achieve-more-with-autonomous-systems-and-airsim\/\">September 2019 Tunnel Circuit<\/a>. In the Tunnel Circuit, the robots navigated underground tunnels for an hour at a time to successfully locate hidden items. In the Urban Circuit, they navigated two courses designed to represent complex urban underground infrastructure, including stairs and elevation changes. Reasoning about correct control actions based on perception sensors is a critical component of mission success. The methods currently used by Team Explorer include carefully engineered modules, such as localization, mapping, and planning, which are then orchestrated to carry out the mission.
Here, we share how an approach that learns to map perception data directly to correct control actions can simplify the system further.<\/p>\n<div id=\"attachment_641844\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-641844\" class=\"wp-image-641844 size-large\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-1-1024x524.png\" alt=\"sim photo \" width=\"1024\" height=\"524\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-1-1024x524.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-1-300x153.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-1-768x393.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-1-1536x785.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-1-2048x1047.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><p id=\"caption-attachment-641844\" class=\"wp-caption-text\"><em>Figure 1. Our framework uses simulation to learn a low-dimensional state representation using multiple data modalities. This latent vector is used to learn a control policy that directly transfers to real-world environments. We successfully deploy the system under various track shapes and weather conditions, ranging from sunny days to strong snow and wind.<\/em><\/p><\/div>\n<h3>The Task<\/h3>\n<p>In first person view (FPV) drone racing, expert pilots can plan and control a quadrotor with high agility using a noisy monocular camera feed, without compromising safety. We attempted to mimic this ability with our framework and tested it with an autonomous drone on a racing task.<\/p>\n<p>We used a small, agile quadrotor with a front-facing camera, and our goal was to train a neural network policy to navigate through a previously unknown racing course. The network policy used only images from the RGB camera.<\/p>\n<p>While autonomous drone racing is an active research area, most previous work has focused on engineering a system augmented with extra sensors and software with the sole aim of speed. Instead, we aimed to create a computational fabric, inspired by the function of a human brain, to map visual information directly to correct control actions. We achieved this by first converting the high-dimensional sequence of video frames to a low-dimensional representation that summarizes the state of the world.<\/p>\n<div id=\"attachment_641847\" style=\"width: 618px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-641847\" class=\"wp-image-641847 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-2.png\" alt=\"tech\" width=\"608\" height=\"456\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-2.png 608w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-2-300x225.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-2-80x60.png 80w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-2-240x180.png 240w\" sizes=\"auto, (max-width: 608px) 100vw, 608px\" \/><p id=\"caption-attachment-641847\" class=\"wp-caption-text\"><em>Figure 2: Quadrotor used for the experiments.
Images from the front-facing camera are processed on the onboard computer.<\/em><\/p><\/div>\n<h3>Our Approach<\/h3>\n<p>Our approach was to learn a visuomotor policy by decomposing the problem into the tasks of (1) building useful representations of the world and (2) taking a control action based on those representations. We used <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/Microsoft\/AirSim\">AirSim<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, a high-fidelity simulator, in the training phase and then deployed the learned policy in the real world without any modification. Figure 1 depicts the overall concept, showing a single perception module shared for simulated and real autonomous navigation.<\/p>\n<p>A key challenge here is that the models have to be robust to the differences (e.g., illumination, texture) between simulation and the real world. To this end, we used the Cross-Modal Variational Autoencoder (CM-VAE) framework for generating representations that closely bridge the simulation-reality gap, avoiding overfitting to the eccentricities of synthetic data.<\/p>\n<p>The first data modality considered the raw unlabeled sensor input (FPV images), while the second characterized state information directly relevant for the task at hand. In the case of drone racing, the second modality corresponded to the relative pose of the next gate defined in the drone\u2019s coordinate frame. We learned a low-dimensional latent environment representation by extending the CM-VAE framework. The framework uses an encoder-decoder pair for each data modality, while constraining all inputs and outputs to and from a single latent space (see Fig. 3b).<\/p>\n<p>The system naturally incorporated both labeled and unlabeled data modalities into the training process of the latent variable. Imitation learning was then used to train a deep control policy that mapped latent variables into velocity commands for the quadrotor (Fig. 3a).<\/p>\n<div id=\"attachment_641931\" style=\"width: 1107px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-641931\" class=\"wp-image-641931 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/Darpa-3.jpg\" alt=\"diagram\" width=\"1097\" height=\"423\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/Darpa-3.jpg 1097w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/Darpa-3-300x116.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/Darpa-3-1024x395.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/Darpa-3-768x296.jpg 768w\" sizes=\"auto, (max-width: 1097px) 100vw, 1097px\" \/><p id=\"caption-attachment-641931\" class=\"wp-caption-text\"><em>Figure 3. (a) Control system architecture. The input image from the drone\u2019s video is encoded into a latent representation of the environment. A control policy acts on the lower-dimensional embedding to output the desired robot control commands. (b) Cross-modal VAE architecture. Each data sample is encoded into a single latent space that can be decoded back into images, or transformed into another data modality such as the poses of gates relative to the unmanned aerial vehicle (UAV).<\/em><\/p><\/div>
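\n<p>To make the two-stage structure concrete, below is a minimal sketch of such a cross-modal VAE written in PyTorch. It is illustrative only and is not the exact architecture from the paper or the released code: the 10-dimensional latent vector and the 128 x 72 input resolution follow the description in this post, while the layer sizes and the four-value gate-pose output are assumptions made for the example.<\/p>\n<pre><code># A minimal cross-modal VAE sketch (illustrative, not the authors' exact model):\n# two decoders, one for images and one for gate poses, share a single\n# 10-dimensional latent space produced by an image encoder.\nimport torch\nimport torch.nn as nn\n\nclass CrossModalVAE(nn.Module):\n    def __init__(self, latent_dim=10):\n        super().__init__()\n        # Image encoder: a 3x72x128 RGB frame becomes a latent mean and log-variance.\n        self.encoder = nn.Sequential(\n            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),\n            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),\n            nn.Flatten(),\n            nn.Linear(64 * 18 * 32, 256), nn.ReLU(),\n        )\n        self.fc_mu = nn.Linear(256, latent_dim)\n        self.fc_logvar = nn.Linear(256, latent_dim)\n        # Decoder 1: reconstruct the image from the latent vector.\n        self.image_decoder = nn.Sequential(\n            nn.Linear(latent_dim, 64 * 18 * 32), nn.ReLU(),\n            nn.Unflatten(1, (64, 18, 32)),\n            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),\n            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),\n        )\n        # Decoder 2: regress the relative pose of the next gate\n        # (four values here, an assumption for illustration).\n        self.pose_decoder = nn.Sequential(\n            nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 4),\n        )\n\n    def forward(self, image):\n        h = self.encoder(image)\n        mu, logvar = self.fc_mu(h), self.fc_logvar(h)\n        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization\n        return self.image_decoder(z), self.pose_decoder(z), mu, logvar\n<\/code><\/pre>\n<p>The control policy described above would then be a separate, much smaller network trained with imitation learning to map the 10-dimensional latent vector to velocity commands, so the policy itself never operates on raw pixels.<\/p>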
\n<h3>Learning to understand the world<\/h3>\n<p>The role of our perception module was to compress the incoming input images into a low-dimensional representation. For example, the encoder compressed images of 128 x 72 pixels (width x height), or 27,648 values across the three RGB color channels, down to the 10 most essential variables that describe the scene.<\/p>\n<p>We interpreted the robot\u2019s understanding of the world by visualizing the latent space of our cross-modal representations (see Figure 4). Despite using only 10 variables to encode images, the decoded images provided a rich description of what the drone could see ahead, including all possible gate sizes and locations, and different background information.<\/p>\n<div id=\"attachment_644007\" style=\"width: 646px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-644007\" class=\"wp-image-644007 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/Fig-4-DARPA.jpg\" alt=\"chart\" width=\"636\" height=\"428\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/Fig-4-DARPA.jpg 636w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/Fig-4-DARPA-300x202.jpg 300w\" sizes=\"auto, (max-width: 636px) 100vw, 636px\" \/><p id=\"caption-attachment-644007\" class=\"wp-caption-text\"><em>Figure 4. Visualization of imaginary images generated from our cross-modal representation. The decoded image directly captures both the relative gate pose and the background information.<\/em><\/p><\/div>\n<p>We also showed that this dimensionality compression technique is smooth and continuous. Figure 5 displays a smooth imaginary path between two images taken in real life. Given the cross-modal nature of the representation, we can see both decoded images and gate poses for the intermediate values.<\/p>\n<div id=\"attachment_641952\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-641952\" class=\"wp-image-641952 size-large\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-7-1024x261.png\" alt=\"diagram\" width=\"1024\" height=\"261\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-7-1024x261.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-7-300x76.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-7-768x196.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-7-1536x391.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-7.png 1951w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><p id=\"caption-attachment-641952\" class=\"wp-caption-text\"><em>Figure 5: Visualization of smooth latent space interpolation between two real-world images. The ground-truth and predicted distances between camera and gate for images A and B were (2.0, 6.0) and (2.5, 5.8) meters, respectively.<\/em><\/p><\/div>
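\n<p>The interpolation in Figure 5 amounts to walking along a straight line between the two encoded latent vectors and decoding each intermediate point. A rough sketch of that procedure, reusing the hypothetical CrossModalVAE class sketched above, might look like the following.<\/p>\n<pre><code># Illustrative latent-space interpolation, assuming the CrossModalVAE sketch above.\nimport torch\n\n@torch.no_grad()\ndef interpolate(model, image_a, image_b, steps=8):\n    # Encode both images to their latent means (no sampling needed here).\n    mu_a = model.fc_mu(model.encoder(image_a.unsqueeze(0)))\n    mu_b = model.fc_mu(model.encoder(image_b.unsqueeze(0)))\n    results = []\n    for t in torch.linspace(0.0, 1.0, steps):\n        z = (1.0 - t) * mu_a + t * mu_b  # straight line in latent space\n        # The latent space is cross-modal, so every intermediate point can be\n        # decoded into both an imaginary image and a relative gate pose.\n        results.append((model.image_decoder(z), model.pose_decoder(z)))\n    return results\n<\/code><\/pre>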
\n<h3>Results<\/h3>\n<p>To show the capabilities of our approach on a physical platform, we tested the system on a 45-meter-long S-shaped track with eight gates and on a 40-meter-long circular track with eight gates, as shown in Figure 6. Our policy using a cross-modal representation significantly outperformed end-to-end control policies and networks that directly encoded the position of the next gates, without reasoning over multiple data modalities.<\/p>\n<div id=\"attachment_641961\" style=\"width: 874px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-641961\" class=\"wp-image-641961 size-full\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-7.jpg\" alt=\"images\" width=\"864\" height=\"513\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-7.jpg 864w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-7-300x178.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-7-768x456.jpg 768w\" sizes=\"auto, (max-width: 864px) 100vw, 864px\" \/><p id=\"caption-attachment-641961\" class=\"wp-caption-text\"><em>Figure 6: Side and top view of the test tracks: a) Circuit track and b) S-shaped track.<\/em><\/p><\/div>\n<p>The performance of standard architectures dropped significantly when deployed in the real world after training in simulation.
Our cross-modal VAE, on the other hand, could still decode reasonable values for the gate distances despite being trained purely in simulation. For example, Figure 7 displays the accumulated gate poses decoded from direct image-to-pose regression and from our framework during three seconds of a real flight test. Direct regression results in noisy estimated gate positions, which are farther from the gate\u2019s true location.<\/p>\n<div id=\"attachment_641964\" style=\"width: 1034px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-641964\" class=\"wp-image-641964 size-large\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-9-1024x424.jpg\" alt=\"diagram\" width=\"1024\" height=\"424\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-9-1024x424.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-9-300x124.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-9-768x318.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA-9.jpg 1353w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><p id=\"caption-attachment-641964\" class=\"wp-caption-text\"><em>Figure 7. Analysis of a three-second flight segment. a) Input images and their corresponding images decoded by the CM-VAE; b) Time history of gate center poses decoded from the CM-VAE (red) and regression (blue). The regression representation has significantly higher offset and noise from the true gate pose, which explains its poor flight performance.<\/em><\/p><\/div>\n<p>We took our perception-control framework to its limits by testing it in visual conditions never seen during the training phase in simulation. Figure 8 shows examples of successful test cases under extremely challenging visual conditions: a) indoors, with a blue floor containing red stripes in the same red tone as the gates, and b-c) during heavy snow. Despite the intense visual distractions from background conditions, the drone was still able to complete the courses by employing our cross-modal perception module.<\/p>\n<p><iframe loading=\"lazy\" title=\"Learning Visuomotor Policies for Aerial Navigation Using Cross-Modal Representations\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/AxE7qGKJWaw?feature=oembed&rel=0\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe><\/p>\n<h3>Challenges and Future Work<\/h3>\n<p>By separating the perception-action loop into two modules and incorporating multiple data modalities into the perception training phase, we can avoid overfitting our networks to irrelevant characteristics of the incoming data. For example, even though the sizes of the square gates were the same in simulation and physical experiments, their width, color, and even the intrinsic camera parameters were not an exact match. The multiple streams of information that are fed into the cross-modal VAE aid in implicit regularization of the learned model, which leads to better generalization over appearance changes.<\/p>
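\n<p>One way to picture this multi-modal regularization is through the training objective itself: the image reconstruction loss, the gate-pose regression loss, and the usual KL term all pull on the same latent vector. The sketch below continues the hypothetical CrossModalVAE example from earlier; the loss weights and the use of simple mean-squared errors are assumptions for illustration, not the exact objective from the paper.<\/p>\n<pre><code># Hypothetical training step combining both modalities on one latent vector.\nimport torch\nimport torch.nn.functional as F\n\ndef training_step(model, optimizer, image, gate_pose, pose_weight=1.0, kl_weight=0.1):\n    recon, pose_pred, mu, logvar = model(image)\n    recon_loss = F.mse_loss(recon, image)\n    # Gate-pose labels are available for simulated frames; unlabeled frames\n    # could be trained with the reconstruction and KL terms alone.\n    pose_loss = F.mse_loss(pose_pred, gate_pose)\n    kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())\n    loss = recon_loss + pose_weight * pose_loss + kl_weight * kl_loss\n    optimizer.zero_grad()\n    loss.backward()\n    optimizer.step()\n    return loss.item()\n<\/code><\/pre>\n<p>Because the pose decoder and the image decoder share the latent space, the pose labels act as an extra supervisory signal that keeps the 10 latent variables focused on task-relevant structure rather than on the visual quirks of the simulator.<\/p>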
\n<p>We believe our results show great potential for helping real-world applications. For example, if an autonomous search and rescue robot is better able to recognize humans in spite of differences in age, size, gender, ethnicity, and other factors, that robot has a better chance of identifying and retrieving people in need of help.<\/p>\n<p>An unexpected result we came across during our experiments is that combining unlabeled real-world data with the labeled simulated data for training the representation models did not increase overall performance; using simulation-only data worked better. We suspect that this drop in performance occurs because only simulated data was used in the control learning phase with imitation learning. One interesting direction for future work we are investigating is the use of adversarial techniques for lowering the distance in latent space between similar scenes encoded from simulated and real images. This would lower the difference between data distributions during the training and testing phases.<\/p>\n<p>We envision extending the approach of using unlabeled data for policy learning. For example, besides images, can we combine distinct data modalities such as laser measurements and even sound for learning representations of the environment? Our success with aerial vehicles also suggests the potential to apply this approach to other real-world robotics tasks. For instance, we plan to extend our approach to robotic manipulation, which also requires a similar ability to interpret inputs in real time and make decisions while ensuring safe operations.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Humans subconsciously use perception-action loops to do just about everything, from walking down a crowded sidewalk to scoring a goal in a community soccer league. Perception-action loops\u2014using sensory input to decide on appropriate action in a continuous real time loop \u2014are at the heart of autonomous systems.
Although this tech has advanced dramatically in the [&hellip;]<\/p>\n","protected":false},"author":38838,"featured_media":642639,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_hide_image_in_river":0,"footnotes":""},"categories":[194455,1],"tags":[],"research-area":[13552,13547],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-641841","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-machine-learning","category-research-blog","msr-research-area-hardware-devices","msr-research-area-systems-and-networking","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[607743],"related-events":[],"related-researchers":[{"type":"guest","value":"rogerio-bonatti","user_id":"641973","display_name":"Rogerio  Bonatti","author_link":"<a href=\"https:\/\/www.cs.cmu.edu\/~rbonatti\/\" aria-label=\"Visit the profile page for Rogerio  Bonatti\">Rogerio  Bonatti<\/a>","is_active":true,"last_first":"Bonatti, Rogerio ","people_section":0,"alias":"rogerio-bonatti"},{"type":"guest","value":"ratnesh-madaan","user_id":"624729","display_name":"Ratnesh Madaan","author_link":"<a href=\"https:\/\/www.linkedin.com\/in\/ratneshmadaan\/\" aria-label=\"Visit the profile page for Ratnesh Madaan\">Ratnesh Madaan<\/a>","is_active":true,"last_first":"Madaan, Ratnesh","people_section":0,"alias":"ratnesh-madaan"},{"type":"guest","value":"vibhav-vineet","user_id":"641979","display_name":"Vibhav Vineet","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/vivineet\/\" aria-label=\"Visit the profile page for Vibhav Vineet\">Vibhav Vineet<\/a>","is_active":true,"last_first":"Vineet, Vibhav","people_section":0,"alias":"vibhav-vineet"},{"type":"guest","value":"sebastian-scherer","user_id":"641982","display_name":"Sebastian  Scherer","author_link":"<a href=\"http:\/\/theairlab.org\/team\/sebastian\/\" aria-label=\"Visit the profile page for Sebastian  Scherer\">Sebastian  Scherer<\/a>","is_active":true,"last_first":"Scherer, Sebastian ","people_section":0,"alias":"sebastian-scherer"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA_FC_5.00_01_21_18.Still004-960x540.jpg\" class=\"img-object-cover\" alt=\"\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA_FC_5.00_01_21_18.Still004-960x540.jpg 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA_FC_5.00_01_21_18.Still004-1066x600.jpg 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA_FC_5.00_01_21_18.Still004-655x368.jpg 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA_FC_5.00_01_21_18.Still004-343x193.jpg 343w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA_FC_5.00_01_21_18.Still004-640x360.jpg 640w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/03\/DARPA_FC_5.00_01_21_18.Still004-1280x720.jpg 1280w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"","formattedDate":"March 17, 2020","formattedExcerpt":"Humans subconsciously use perception-action loops to do just about everything, from walking down a crowded sidewalk to scoring a goal in a community soccer league. Perception-action loops\u2014using sensory input to decide on appropriate action in a continuous real time loop \u2014are at the heart of&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/641841","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/38838"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=641841"}],"version-history":[{"count":29,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/641841\/revisions"}],"predecessor-version":[{"id":644037,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/641841\/revisions\/644037"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/642639"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=641841"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=641841"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=641841"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=641841"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=641841"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=641841"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=641841"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=641841"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=641841"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=641841"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=641841"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}