{"id":869103,"date":"2022-08-25T09:00:00","date_gmt":"2022-08-25T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=869103"},"modified":"2023-07-19T10:04:11","modified_gmt":"2023-07-19T17:04:11","slug":"mocapact-training-humanoid-robots-to-move-like-jagger","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/mocapact-training-humanoid-robots-to-move-like-jagger\/","title":{"rendered":"MoCapAct: Training humanoid robots to \u201cMove Like Jagger\u201d"},"content":{"rendered":"\n<div class=\"annotations \" data-bi-aN=\"margin-callout\">\n\t<article class=\"annotations__list card depth-16 bg-body p-4 annotations__list--left\">\n\t\t<div class=\"annotations__list-item\">\n\t\t\t\t\t\t<span class=\"annotations__type d-block text-uppercase font-weight-semibold text-neutral-300 small\">Group<\/span>\n\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/group\/robot-learning-group\/\" data-bi-cN=\"Robot Learning Group\" data-external-link=\"false\" data-bi-aN=\"margin-callout\" data-bi-type=\"annotated-link\" class=\"annotations__link font-weight-semibold text-decoration-none\"><span>Robot Learning Group<\/span>&nbsp;<span class=\"glyph-in-link glyph-append glyph-append-chevron-right\" aria-hidden=\"true\"><\/span><\/a>\t\t\t\t\t<\/div>\n\t<\/article>\n<\/div>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1400\" height=\"788\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/1400x788_Humanoid_Hero_no_logo.gif\" alt=\"A montage of four animated figures completing humanoid actions: standing up, walking, running, and jumping.\" class=\"wp-image-869799\"\/><\/figure>\n\n\n\n<p>What would it take to get humanoid, bipedal robots to dance like Mick Jagger? Indeed, for something more mundane, what does it take to get them to simply stand still? Sit down? Walk? Move in myriads of other ways many people take for granted? 
Bipedalism provides unparalleled versatility in an environment designed for and by humans. By mixing and matching a wide range of basic motor skills, from walking to jumping to balancing on one foot, people routinely dance, play soccer, carry heavy objects, and perform other complex high-level motions. If robots are ever to reach their full potential as an assistive technology, mastery of diverse bipedal motion is a requirement, not a luxury. However, even the simplest of these skills can require a fine orchestration of dozens of joints. Sophisticated engineering can rein in some of this complexity, but endowing bipedal robots with the generality to cope with our messy, weakly structured world, or a metaverse that takes after it, requires <em>learning<\/em>. Training AI agents with humanoid morphology to match human performance across the entire diversity of human motion is one of the biggest challenges of artificial physical intelligence. Due to the vagaries of experimentation on physical robots, research in this direction is currently done mostly in simulation.&nbsp;<\/p>\n\n\n\n<p>Unfortunately, it involves computationally intensive methods, effectively restricting participation to research institutions with large compute budgets. In an effort to level the playing field and make this critical research area more inclusive, <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/group\/robot-learning-group\/\" target=\"_blank\" rel=\"noreferrer noopener\">Microsoft Research&#8217;s Robot Learning group<\/a> is releasing <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/mocapact-a-multi-task-dataset-for-simulated-humanoid-control\/\" target=\"_blank\" rel=\"noreferrer noopener\">MoCapAct<\/a>, a large library of pre-trained humanoid control models along with enriched data for training new ones. 
This will enable advanced research on artificial humanoid control at a fraction of the compute resources currently required.\u00a0<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Recording of MoCap clip\" width=\"500\" height=\"375\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/OcVXGFH4bhw?feature=oembed&rel=0\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><figcaption class=\"wp-element-caption\">Video source: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"http:\/\/mocap.cs.cmu.edu\/\" target=\"_blank\" rel=\"noopener noreferrer\">Carnegie Mellon University &#8211; CMU Graphics Lab &#8211; motion capture library<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/figcaption><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Playback of MoCap clip in simulation\" width=\"500\" height=\"375\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/y8orEuSXAxM?feature=oembed&rel=0\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" 
allowfullscreen><\/iframe>\n<\/div><\/figure>\n<\/div>\n<\/div>\n\n\n\n<p>The reason why humanoid control research has been so computationally demanding is subtle and, at first glance, paradoxical. The prominent avenue for <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.youtube.com\/watch?v=vppFvq2quQ0%5C\" target=\"_blank\" rel=\"noopener noreferrer\">learning locomotive skills<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> is based on using <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/en.wikipedia.org\/wiki\/Motion_capture\" target=\"_blank\" rel=\"noopener noreferrer\">motion capture<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (MoCap) data. MoCap is an animation technique that has been widely used in the entertainment industry for decades. It involves recording the motion of several keypoints on a human actor\u2019s body, such as their elbows, shoulders, and knees, while the actor is performing a task of interest, such as jogging. Thus, a MoCap clip can be thought of as a very concise and precise summary of an activity\u2019s video clip. Thanks to this, useful information can be extracted from MoCap clips with much less computation than from the much more high-dimensional, ambiguous training data in other major areas of machine learning, which comes in the form of videos, images, and text. On top of this, MoCap data is widely available. Repositories such as the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"http:\/\/mocap.cs.cmu.edu\/\" target=\"_blank\" rel=\"noopener noreferrer\">CMU Motion Capture Dataset<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> contain hours of clips for just about any common motion of a human body, with visualizations of several examples shown below. 
Why, then, is it so hard to make physical and simulated humanoid robots mimic a person\u2019s movements?&nbsp;<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Walking\" width=\"500\" height=\"375\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/DJJsceshCUk?feature=oembed&rel=0\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Moving arms while stationary\" width=\"500\" height=\"375\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/V5AsH8PDCDs?feature=oembed&rel=0\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Running in circles\" width=\"500\" height=\"375\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/N4ywiANaa4I?feature=oembed&rel=0\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; 
picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Cartwheel\" width=\"500\" height=\"375\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/huU-VBgxREY?feature=oembed&rel=0\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n<\/div>\n<\/div>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Salsa dance\" width=\"500\" height=\"375\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/uHcN_vurj7M?feature=oembed&rel=0\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<p>The caveat is that MoCap clips don\u2019t contain <em>all <\/em>the information necessary to imitate the demonstrated motions on a physical robot or in a simulation that models physical forces. They only show us what a motion skill <em>looks like<\/em>, not the underlying muscular movements that caused the actor\u2019s muscles to yield that motion. Even if MoCap systems recorded these signals, it wouldn\u2019t be of much help: simulated humanoids and real robots typically use motors instead of muscles, which is a dramatically different form of articulation. Nonetheless, actuation in artificial humanoids is also driven by a type of control signal. 
MoCap clips are a valuable aid in computing these control signals, if combined with additional learning and optimization methods that use MoCap data as guidance. The computational bottleneck that our MoCapAct release aims to remove is created exactly by these methods, collectively known as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/en.wikipedia.org\/wiki\/Reinforcement_learning\" target=\"_blank\" rel=\"noopener noreferrer\"><em>reinforcement learning<\/em><span class=\"sr-only\"> (opens in new tab)<\/span><\/a><em> (RL)<\/em>. In simulation, where much of AI locomotion research is currently focused, RL can recover the sequence of control inputs that takes a humanoid agent through the sequence of poses from a given MoCap clip. What results is a locomotion behavior that is indistinguishable from the clip\u2019s. The availability of control policies for individual basic behaviors learned from separate MoCap clips can open the doors for fascinating locomotion research, e.g., in methods for combining these behaviors into a single \u201cmulti-skilled\u201d neural network and training higher-level locomotion capabilities by switching among them. However, with thousands of basic locomotion skills to learn, RL\u2019s expensive trial-and-error approach creates a massive barrier to entry on this research path. It is this scalability issue that our dataset release aims to address.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"400\" height=\"361\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/MoCapAct-Dataset.png\" alt=\"A flowchart showing motion capture clips producing clip-tracking agents via reinforcement learning. The agents then generate data using the simulated humanoid. The MoCapAct dataset consists of the agents and corresponding data. 
\" class=\"wp-image-869175\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/MoCapAct-Dataset.png 400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/MoCapAct-Dataset-300x271.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/MoCapAct-Dataset-199x180.png 199w\" sizes=\"auto, (max-width: 400px) 100vw, 400px\" \/><figcaption class=\"wp-element-caption\">Figure 1: The MoCapAct dataset consists of policies that track individual MoCap clips and data from these agents. <\/figcaption><\/figure>\n\n\n\n<p>Our MoCapAct dataset, designed to be compatible with the highly popular <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/deepmind\/dm_control\" target=\"_blank\" rel=\"noopener noreferrer\">dm_control<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> humanoid simulation environment and the extensive <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"http:\/\/mocap.cs.cmu.edu\/\" target=\"_blank\" rel=\"noopener noreferrer\">CMU Motion Capture Dataset<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, serves the research community in two ways:&nbsp;<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>For each of over 2500 MoCap clip snippets from the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"http:\/\/mocap.cs.cmu.edu\/\" target=\"_blank\" rel=\"noopener noreferrer\">CMU Motion Capture Dataset<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, it provides an RL-trained \u201cexpert\u201d control policy (represented as a PyTorch model) that enables <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/deepmind\/dm_control\" target=\"_blank\" rel=\"noopener noreferrer\">dm_control<span class=\"sr-only\"> (opens in new 
tab)<\/span><\/a>\u2019s simulated humanoid to faithfully recreate the skill depicted in that clip snippet, as shown in these videos of the experts\u2019 behaviors:&nbsp;<\/li>\n<\/ol>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Walking\" width=\"500\" height=\"375\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/ih4y6GFe-kc?feature=oembed&rel=0\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Moving arms while stationary\" width=\"500\" height=\"375\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/XDBrQR3ynXA?feature=oembed&rel=0\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Running in circles\" width=\"500\" height=\"375\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/zW8C08rhOm4?feature=oembed&rel=0\" frameborder=\"0\" 
allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Cartwheel\" width=\"500\" height=\"375\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/HutS2k2ya6k?feature=oembed&rel=0\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n<\/div>\n<\/div>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Salsa dance\" width=\"500\" height=\"375\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/AFVgOuOasEo?feature=oembed&rel=0\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<p>     Training this model zoo has taken the equivalent of 50 years over many GPU-equipped <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/virtual-machines\/ncv2-series\" target=\"_blank\" rel=\"noopener noreferrer\">Azure NC6v2<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> virtual machines (excluding hyperparameter tuning and other required experiments) \u2013 a testament to the computational hurdle MoCapAct removes for other researchers.&nbsp;<\/p>\n\n\n\n<ol class=\"wp-block-list\" start=\"2\">\n<li>For each of the trained 
skill policies above, MoCapAct supplies a set of recorded trajectories generated by executing that skill\u2019s control policy on dm_control\u2019s humanoid agent. These trajectories can be thought of as MoCap clips of the trained experts but, in a crucial difference from the original MoCap data, they contain both low-level sensory measurements (e.g., touch measurements) and control signals for the humanoid agent. Unlike typical MoCap data, these trajectories are suitable for learning to match and improve on skill experts via direct imitation \u2013 a much more efficient class of techniques than RL.&nbsp;<\/li>\n<\/ol>\n\n\n\n<p>We give two examples of how we used the MoCapAct dataset.&nbsp;<\/p>\n\n\n\n<p>First, we train a <em>hierarchical<\/em> policy based on <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/1811.11711\" target=\"_blank\" rel=\"noopener noreferrer\">neural probabilistic motor primitives<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. To achieve this, we combine thousands of MoCapAct\u2019s clip-specialized policies into a single policy that is capable of executing many different skills. This agent has a high-level component that takes MoCap frames as input and outputs a <em>learned skill<\/em>. The low-level component takes the learned skill and sensory measurements from the humanoid as input and outputs the motor action.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"575\" height=\"346\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/MoCap-frames.png\" alt=\"Two graphics of the hierarchical policy. 
The first graphic shows a MoCap clip of walking being fed into a high-level policy, which outputs a prediction of \u201cwalk forward.\u201d This prediction and the humanoid observation are fed into the low-level policy, which then predicts the motor actions to execute the walking motion. The second graphic is similar to the first, with the only difference being that the MoCap clip shows a \u201crun and jump\u201d motion, and the predicted skill is \u201crun and jump.\u201d \" class=\"wp-image-869178\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/MoCap-frames.png 575w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/MoCap-frames-300x181.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/MoCap-frames-240x144.png 240w\" sizes=\"auto, (max-width: 575px) 100vw, 575px\" \/><figcaption class=\"wp-element-caption\">Figure 2: The hierarchical policy consists of a high-level policy and low-level policy. The high-level policy maps the given MoCap frames to a learned skill. The low-level policy takes the skill and the humanoid observation and outputs an action that best realizes the skill.&nbsp;<\/figcaption><\/figure>\n\n\n\n<p>This hierarchical structure offers an appealing benefit. If we keep the low-level component, we can instead control the humanoid by inputting different skills to the low-level policy (e.g., \u201cwalk\u201d instead of the corresponding motor actions). Therefore, we can re-use the low-level policy to efficiently learn new tasks.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"488\" height=\"135\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/humanoid-observation.png\" alt=\"Graphic of a task policy feeding into a low-level policy. 
The task policy takes an observation from the humanoid as input, and outputs a \u201cskill.\u201d The skill and humanoid observation are fed into a low-level policy, which outputs the motor action. \" class=\"wp-image-869184\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/humanoid-observation.png 488w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/humanoid-observation-300x83.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/humanoid-observation-240x66.png 240w\" sizes=\"auto, (max-width: 488px) 100vw, 488px\" \/><figcaption class=\"wp-element-caption\">Figure 3: We can replace the high-level policy with a task policy that is trained to output skills required to achieve some new task, such as running to a target.&nbsp;<\/figcaption><\/figure>\n\n\n\n<p>In light of that, we replace the high-level policy with a task policy that is then trained to steer the low-level policy towards achieving some task. As an example, we train a task policy to have the humanoid reach a target. 
Notice that the humanoid uses many low-level skills, like running, turning, and side-stepping.&nbsp;<\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Go-to-Target\" width=\"500\" height=\"375\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/0b9aLxnZvtk?feature=oembed&rel=0\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"446\" height=\"131\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/GPT.png\" alt=\"Graphic of the GPT policy. A sequence of humanoid observations is fed into the GPT module, which outputs the motor action.\" class=\"wp-image-869289\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/GPT.png 446w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/GPT-300x88.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/GPT-240x70.png 240w\" sizes=\"auto, (max-width: 446px) 100vw, 446px\" \/><figcaption class=\"wp-element-caption\">Figure 4: Our GPT model takes in a sequence of observations from the humanoid (called the \u201ccontext\u201d) and outputs an action that it thinks best continues the observed motion.&nbsp;<\/figcaption><\/figure>\n\n\n\n<p>Our second example centers on <em>motion completion<\/em>, which is inspired by the task of <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/transformer.huggingface.co\/doc\/gpt2-large\" target=\"_blank\" rel=\"noopener noreferrer\">sentence completion<span 
class=\"sr-only\"> (opens in new tab)<\/span><\/a>. Here, we use the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/jalammar.github.io\/illustrated-gpt2\/\" target=\"_blank\" rel=\"noopener noreferrer\">GPT architecture<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, which accepts a sequence of sensory measurements (the \u201ccontext\u201d) and outputs a motor action. We train a control policy to take one second of sensory measurements from the dataset and output the corresponding motor actions from the specialized expert. Then, before executing the policy on our humanoid, we first generate a \u201cprompt\u201d (red humanoid in the videos) by executing a specialized expert for one second. Afterwards, we let the policy control the humanoid (bronze humanoid in the videos), at each time step, where it constantly takes the previous second of sensory measurements and predicts the motor actions. We find that this policy can reliably repeat the underlying motion of the clip, which is demonstrated in the first two videos. 
On other MoCap clips, we find that the policy can deviate from the underlying clip in a plausible way, such as in the third video, where the humanoid transitions from side-stepping to walking backwards.<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Walking forward\" width=\"500\" height=\"375\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/mzxoeIgNSWI?feature=oembed&rel=0\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Running in circles\" width=\"500\" height=\"375\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/LuP2QB8fIF8?feature=oembed&rel=0\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n<\/div>\n<\/div>\n\n\n\n<figure class=\"wp-block-embed is-type-video is-provider-youtube wp-block-embed-youtube wp-embed-aspect-4-3 wp-has-aspect-ratio\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"Side-stepping, then walking backwards\" width=\"500\" height=\"375\" 
src=\"https:\/\/www.youtube-nocookie.com\/embed\/P82hccRgV-M?feature=oembed&rel=0\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<p>On top of the dataset, we also release the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/microsoft\/MoCapAct\" target=\"_blank\" rel=\"noopener noreferrer\">code<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> used to generate the policies and results. We hope the community can build off of our dataset and work to do incredible research in the control of humanoid robots.&nbsp;<\/p>\n\n\n\n<p>Our paper is available <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/mocapact-a-multi-task-dataset-for-simulated-humanoid-control\/\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>. You can read more at <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/microsoft.github.io\/MoCapAct\/\" target=\"_blank\" rel=\"noopener noreferrer\">our website<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.\u00a0<\/p>\n\n\n\n<p class=\"has-text-align-center\"><em>The data used in this project was obtained from mocap.cs.cmu.edu.<\/em><br><em>The database was created with funding from NSF EIA-0196217.<\/em>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>What would it take to get humanoid, bipedal robots to dance like Mick Jagger? Indeed, for something more mundane, what does it take to get them to simply stand still? Sit down? Walk? Move in myriads of other ways many people take for granted? 
Bipedalism provides unparalleled versatility in an environment designed for and by [&hellip;]<\/p>\n","protected":false},"author":37583,"featured_media":869139,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[243984],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-869103","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-blog-homepage-featured"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[],"related-researchers":[{"type":"guest","value":"nolan-wagener","user_id":"793964","display_name":"Nolan Wagener","author_link":"<a href=\"https:\/\/scholar.google.com\/citations?user=SgGIYH0AAAAJ&hl=en\" aria-label=\"Visit the profile page for Nolan Wagener\">Nolan Wagener<\/a>","is_active":true,"last_first":"Wagener, Nolan","people_section":0,"alias":"nolan-wagener"},{"type":"user_nicename","value":"Andrey Kolobov","user_id":30910,"display_name":"Andrey Kolobov","author_link":"<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/akolobov\/\" aria-label=\"Visit the profile page for Andrey Kolobov\">Andrey Kolobov<\/a>","is_active":false,"last_first":"Kolobov, 
Andrey","people_section":0,"alias":"akolobov"},{"type":"guest","value":"matthew-hausknecht-2","user_id":"388709","display_name":"Matthew Hausknecht","author_link":"<a href=\"https:\/\/mhauskn.github.io\/\" aria-label=\"Visit the profile page for Matthew Hausknecht\">Matthew Hausknecht<\/a>","is_active":true,"last_first":"Hausknecht, Matthew","people_section":0,"alias":"matthew-hausknecht-2"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/1400x788_Humanoid_Image_blog-960x540.png\" class=\"img-object-cover\" alt=\"A montage of four animated figures completing humanoid actions: standing up, walking, running, and jumping.\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/1400x788_Humanoid_Image_blog-960x540.png 960w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/1400x788_Humanoid_Image_blog-300x169.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/1400x788_Humanoid_Image_blog-1024x576.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/1400x788_Humanoid_Image_blog-768x432.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/1400x788_Humanoid_Image_blog-1066x600.png 1066w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/1400x788_Humanoid_Image_blog-655x368.png 655w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/1400x788_Humanoid_Image_blog-343x193.png 343w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/1400x788_Humanoid_Image_blog-240x135.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/1400x788_Humanoid_Image_blog-640x360.png 640w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/1400x788_Humanoid_Image_blog-1280x720.png 1280w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/08\/1400x788_Humanoid_Image_blog.png 1400w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/>","byline":"","formattedDate":"August 25, 2022","formattedExcerpt":"What would it take to get humanoid, bipedal robots to dance like Mick Jagger? Indeed, for something more mundane, what does it take to get them to simply stand still? Sit down? Walk? Move in myriads of other ways many people take for granted? Bipedalism&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/869103","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/37583"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=869103"}],"version-history":[{"count":42,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/869103\/revisions"}],"predecessor-version":[{"id":956220,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/869103\/revisions\/956220"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/869139"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=869103"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=869103"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.micro
soft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=869103"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=869103"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=869103"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=869103"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=869103"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=869103"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=869103"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=869103"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=869103"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}