{"id":922338,"date":"2023-02-27T17:11:05","date_gmt":"2023-02-28T01:11:05","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&#038;p=922338"},"modified":"2023-02-27T18:03:13","modified_gmt":"2023-02-28T02:03:13","slug":"smart-a-generalized-pretraining-framework-for-control-tasks","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/smart-a-generalized-pretraining-framework-for-control-tasks\/","title":{"rendered":"SMART \u2013 A Generalized Pretraining Framework for Control Tasks"},"content":{"rendered":"\n<p class=\"has-gray-color has-text-color\" style=\"font-size:1.6em\"> We are announcing SMART, a generalized pretraining framework for a wide variety of control tasks.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/hero_4.gif\" alt=\"The hero figure of SMART\" class=\"wp-image-923073\" width=\"915\" height=\"393\" \/><\/figure>\n\n\n\n<div class=\"wp-block-group is-layout-constrained wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\">\n<div class=\"wp-block-group is-layout-constrained wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-group is-layout-constrained wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-buttons is-content-justification-center is-content-justification-center is-layout-flex wp-container-core-buttons-is-layout-5e6b19ed wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button aligncenter has-custom-width wp-block-button__width-50 is-style-fill-download is-style-fill\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" 
href=\"https:\/\/arxiv.org\/abs\/2301.09816\" target=\"_blank\" rel=\"noreferrer noopener\">Paper<\/a><\/div>\n\n\n\n<div class=\"wp-block-button is-style-fill-github\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/github.com\/microsoft\/smart\" target=\"_blank\" rel=\"noreferrer noopener\">SMART code<\/a><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<p>Self-supervised pretraining of large neural networks (<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1810.04805\">BERT<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2005.14165\">GPT<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1911.05722\">MoCo<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/2103.00020.pdf\">CLIP<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>) has been shown to be successful in a wide range of language and vision problems. These works demonstrate that one single pretrained model can be easily finetuned to perform many downstream tasks, resulting in a simple, effective, and data-efficient paradigm. When it comes to control tasks, however, it is not clear yet whether the successes of pretraining approaches can be easily replicated. 
<em>So, we ask: can we enable a similar pretraining paradigm for efficient decision-making across various control tasks?<\/em>  <\/p>\n<\/div>\n\n\n\n<p>In &#8220;<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/openreview.net\/pdf?id=9piH3Hg8QEf\">SMART: Self-supervised Multi-task pretrAining with contRol Transformers<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&#8221;, to be published at\u00a0<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/openreview.net\/forum?id=9piH3Hg8QEf\">ICLR2023<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (as notable-top-25%), we study how to pretrain a versatile, generalizable, and resilient model for a wide variety of control tasks. We demonstrate that SMART can significantly improve learning efficiency and facilitate rapid transfer to novel tasks across learning scenarios including Imitation Learning (IL) and Reinforcement Learning (RL). Benefiting from the proposed control-centric objective, SMART is resilient to distribution shift between pretraining and finetuning, and even works well with low-quality datasets that are randomly collected.  
<\/p>\n\n\n\n<p>We now discuss the challenges and introduce our key design concepts and technical details.<\/p>\n\n\n\n<figure class=\"wp-block-embed is-provider-youtube wp-block-embed-youtube\"><div class=\"wp-block-embed__wrapper\">\n<iframe loading=\"lazy\" title=\"SMART: SELF-SUPERVISED MULTI-TASK PRETRAINING WITH CONTROL TRANSFORMERS (ICLR'23)\" width=\"500\" height=\"281\" src=\"https:\/\/www.youtube-nocookie.com\/embed\/0-0hCZtpWUc?feature=oembed&rel=0\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share\" referrerpolicy=\"strict-origin-when-cross-origin\" allowfullscreen><\/iframe>\n<\/div><\/figure>\n\n\n\n<h3 id=\"challenges-unique-to-control-tasks\">Challenges unique to control tasks<\/h3>\n\n\n\n<p>There are research efforts that investigate the application of pretrained vision models to facilitate control tasks. However, there are challenges unique to sequential decision making that lie beyond the considerations of existing vision and language pretraining. We highlight these challenges below:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Data distribution shift<\/strong>: Training data for decision making tasks is usually composed of trajectories generated under specific behavior policies. As a result, data distributions during pretraining, downstream finetuning, and deployment can be drastically different, resulting in suboptimal performance.<\/li>\n\n\n\n<li><strong>Large discrepancy between tasks<\/strong>: In contrast to language and vision, where the underlying semantic information is often shared across tasks, decision making tasks span a large variety of task-specific configurations, transition functions, rewards, and state-action spaces. 
Consequently, it is hard to obtain a generic representation for multiple decision-making tasks.<\/li>\n\n\n\n<li><strong>Long-term reward maximization<\/strong>: A good representation for downstream policy learning should capture information relevant for both immediate and long-term planning, which is usually hard in tasks with long horizons, partial observability, and continuous control.<\/li>\n\n\n\n<li><strong>Lack of supervision and high-quality data<\/strong>: Success in representation learning often depends on the availability of high-quality expert demonstrations and ground-truth rewards. However, for most sequential decision-making tasks, high-quality data and\/or supervisory signals are either non-existent or prohibitively expensive to obtain.<\/li>\n<\/ul>\n\n\n\n<h3 id=\"unlocking-generalized-pretraining-finetuning-pipeline-for-sequential-decision-making\">Unlocking a generalized pretraining-finetuning pipeline for sequential decision-making<\/h3>\n\n\n\n<p>In this work, we follow ideas established in the vision and language communities to explicitly define our pretraining and finetuning pipeline. Specifically, during the pretraining phase, we train representations with a large offline dataset collected from a set of training tasks. Then, given a specific downstream task, which may or may not be among the pretraining tasks, we attach a simple policy head on top of the pretrained representation and train it with Imitation Learning (IL) or Reinforcement Learning (RL). The central tenet of pretraining is to learn generic representations that allow downstream task finetuning to be simple, effective, and efficient, even under low-data regimes. 
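The pretrain-then-finetune pipeline just described can be sketched in a few lines; this is a minimal illustration of the data flow only, and all class and method names here are hypothetical stand-ins rather than the SMART codebase API:

```python
# Minimal sketch of the pretraining-finetuning pipeline (illustrative names only).

class ControlTransformerEncoder:
    """Stands in for a Control Transformer pretrained on multi-task offline data."""
    def __init__(self, embed_dim):
        self.embed_dim = embed_dim

    def encode(self, observation, action):
        # A real encoder would run transformer layers over the (o, a) sequence;
        # here we just package the inputs to show the data flow.
        return ("embedding", observation, action)


class PolicyHead:
    """A lightweight task-specific head attached on top of the pretrained encoder."""
    def __init__(self, encoder, method):
        assert method in ("IL", "RL")  # finetune with imitation or reinforcement learning
        self.encoder = encoder
        self.method = method

    def act(self, observation, last_action):
        z = self.encoder.encode(observation, last_action)
        # A real head would map z to an action (distribution); we return a stub.
        return {"embedding": z, "method": self.method}


# One pretrained encoder is reused across different downstream learning methods.
encoder = ControlTransformerEncoder(embed_dim=128)
il_policy = PolicyHead(encoder, method="IL")
rl_policy = PolicyHead(encoder, method="RL")
```

The key point of the design is that only the small head is task-specific; the same pretrained representation is shared across IL and RL finetuning.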
The pretrained model is expected to be:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Versatile<\/strong> so as to handle a wide variety of downstream control tasks and different downstream learning methods such as IL and RL,<\/li>\n\n\n\n<li><strong>Generalizable<\/strong> to unseen tasks and domains spanning multiple rewards and agent dynamics, and<\/li>\n\n\n\n<li><strong>Resilient<\/strong> to varying-quality pretraining data without supervision.<\/li>\n<\/ul>\n\n\n\n<h3 id=\"smart-architecture-and-framework\">SMART architecture and framework<\/h3>\n\n\n\n<h4 id=\"a-unified-model-architecture-to-fit-different-learning-methods\">A unified model architecture to fit different learning methods<\/h4>\n\n\n\n<p>Inspired by the recent success of transformer models in sequential modeling, we propose a Control Transformer (CT). The input to the model is a control sequence composed of observations and actions, and the outputs of CT correspond to token embeddings representing each observation and action, respectively. The figure below depicts the CT architecture. Different from the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2106.01345\">Decision Transformer (DT)<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, which directly learns reward-based policies, CT is designed to learn reward-agnostic representations, which enables it to serve as a <strong><em>unified model<\/em><\/strong> to fit different learning methods (e.g. 
Imitation Learning (IL) and Reinforcement Learning (RL)) and various tasks.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"334\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/model-arch-1024x334.png\" alt=\"The network architecture of SMART.\" class=\"wp-image-922389\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/model-arch-1024x334.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/model-arch-300x98.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/model-arch-768x251.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/model-arch-1536x501.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/model-arch-2048x668.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/model-arch-240x78.png 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 1: Architecture of Control Transformer. In the pretraining phase, we use the control-centric objective to train representation over multiple tasks; in the finetuning phase where a specific task is given, we learn a policy based on the pretrained representation (pretrained weights are shown in grey blocks). The construction of the policy head can vary for different downstream datasets or learning methods.<\/figcaption><\/figure>\n\n\n\n<h4 id=\"control-centric-pretraining-objectives-to-learn-generic-representations\">Control-centric pretraining objectives to learn generic representations <\/h4>\n\n\n\n<p>Built upon CT, we propose a control-centric pretraining objective that consists of three terms: forward dynamics prediction, inverse dynamics prediction and random masked hindsight control. The figure below illustrates each objective. 
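To make the three terms concrete, here is a toy sketch of how training targets for each term could be constructed from a single trajectory. The function names and data layout are our own illustration under the usual convention that action a_t drives the transition from o_t to o_{t+1}; this is not the paper's implementation:

```python
import random

# Toy construction of training targets for the three pretraining terms.
# Trajectory: observations o_0..o_T and actions a_0..a_{T-1}, where a_t
# drives o_t -> o_{t+1}. Names are illustrative, not from the SMART codebase.

def forward_dynamics_targets(observations, actions):
    # Forward dynamics: predict the next (latent) observation from (o_t, a_t).
    return [((observations[t], actions[t]), observations[t + 1])
            for t in range(len(actions))]

def inverse_dynamics_targets(observations, actions):
    # Inverse dynamics: recover a_t from the consecutive pair (o_t, o_{t+1}).
    return [((observations[t], observations[t + 1]), actions[t])
            for t in range(len(actions))]

def masked_hindsight_targets(actions, mask_ratio=0.5, seed=0):
    # Random masked hindsight control: hide a portion of the actions; the model
    # must recover them from the remaining (non-causally attended) sequence.
    rng = random.Random(seed)
    mask = [rng.random() < mask_ratio for _ in actions]
    visible = [None if m else a for a, m in zip(actions, mask)]
    targets = [(t, a) for t, (a, m) in enumerate(zip(actions, mask)) if m]
    return visible, targets
```

In training, each set of targets would feed a prediction loss on top of the CT embeddings; the masked-hindsight term is the one that requires dropping the causal attention mask.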
These terms focus on policy-independent transition probabilities and encourage CT to capture dynamics information of both short-term and long-term temporal granularities. <\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"433\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig2-new-1024x433.png\" alt=\"The training objectives of SMART.\" class=\"wp-image-922653\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig2-new-1024x433.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig2-new-300x127.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig2-new-768x325.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig2-new-1536x650.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig2-new-665x280.png 665w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig2-new-240x102.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig2-new.png 1912w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 2: The three terms of our proposed pretraining objective. The red shaded areas denote the attention span, while the grey regions are masked.<\/figcaption><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Forward Dynamics Prediction<\/strong>: For each observation-action pair in a control sequence, we aim to predict the next immediate latent state. This forward prediction captures the local transition information in the embedding space.<\/li>\n\n\n\n<li><strong>Inverse Dynamics Prediction<\/strong>: For each consecutive observation pair, we learn to recover the action that leads to the transition between the observation pair. 
<\/li>\n\n\n\n<li><strong>Random Masked Hindsight Control<\/strong>: Given a control sequence, we randomly mask a portion of the actions and observations, and recover the masked actions from the remaining incomplete sequence. This objective is akin to asking the question \u201cwhat actions should I take to generate such a trajectory?\u201d Therefore, we replace the causal attention mask with a non-causal one, temporarily allowing the model to \u201csee the future\u201d. As a result, we encourage the model to learn controllable representations and global temporal relations, and to attend to the most essential representations for multi-step control.<\/li>\n<\/ul>\n\n\n\n<h3 id=\"experimental-results-highlights\">Experimental results highlights<\/h3>\n\n\n\n<h4 id=\"the-multi-task-dmc-benchmark\">The multi-task DMC benchmark<\/h4>\n\n\n\n<p>We evaluate SMART on the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/1801.00690.pdf\">DeepMind Control (DMC) suite<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, which contains a series of continuous control tasks with RGB image observations. There are multiple <em><strong>domains<\/strong><\/em> (physical models with different state and action spaces) and multiple <strong><em>tasks<\/em><\/strong> (each associated with a particular MDP) within each domain, which creates diverse scenarios for evaluating pretrained representations. Our experiments use 10 different tasks spanning 6 domains. In pretraining, we use an offline dataset collected over 5 tasks, while the other 5 tasks (with 2 unseen domains) are held out to test the generalizability of SMART. The graphical relations of all tasks and domains involved are shown in the figure below. 
<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized is-style-default\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/tasks.jpg\" alt=\"The graphical relation of tasks used for training.\" class=\"wp-image-922407\" width=\"472\" height=\"283\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/tasks.jpg 840w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/tasks-300x180.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/tasks-768x462.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/tasks-240x144.jpg 240w\" sizes=\"auto, (max-width: 472px) 100vw, 472px\" \/><\/figure>\n\n\n\n<h4 id=\"versatility\">Versatility<\/h4>\n\n\n\n<p>To evaluate the versatility of SMART, we design experiments to answer the following questions:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Whether a single pretrained model can be finetuned with different downstream learning methods (i.e. 
<em>Return-To-Go conditioned <\/em>(<em>RTG)<\/em> and <em>Behavior Cloning<\/em> (<em>BC<\/em>));<\/li>\n\n\n\n<li>Whether the pretrained model can adapt towards various downstream tasks.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"360\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig3-1024x360.png\" alt=\"Line graph .\" class=\"wp-image-922398\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig3-1024x360.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig3-300x105.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig3-768x270.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig3-1536x539.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig3-2048x719.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig3-240x84.png 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 3: Downstream learning rewards of SMART (red) compared with pretraining CT with single-task data (blue) and training from scratch (gray). Results are averaged over 3 random seeds. Scratch trains a policy with randomly initialized CT representation weights. CT-single is a variant of SMART, which pretrains CT with a single-task dataset containing trajectories from the downstream environment.<\/figcaption><\/figure>\n\n\n\n<p>In the figure above, we compare the reward curve of SMART with <em>Scratch<\/em><strong> <\/strong>and <em>CT-Single<\/em>, where models are pretrained with<em> Exploratory<\/em> dataset. It can be seen that pretrained CT from both single-task dataset (CT-single) and multi-task dataset (SMART) can achieve much better results than training from scratch. 
In general, under both <em>RTG<\/em> and <em>BC<\/em> finetuning, pretrained models have a warm start, a faster convergence rate, and better asymptotic performance in a variety of downstream tasks. In most cases, pretraining CT on the multi-task dataset (SMART) yields better results than pretraining on only in-task data (CT-single), even though accommodating multiple different tasks within the same model capacity is harder. This suggests that <strong><em>SMART can extract common knowledge from diverse tasks<\/em><\/strong>.<\/p>\n\n\n\n<h4 id=\"generalizability\">Generalizability<\/h4>\n\n\n\n<p>The figure below shows the performance of SMART pretrained on the <em>Exploratory<\/em> dataset, compared to <em>Scratch<\/em> and <em>CT-single<\/em>, on 5 unseen tasks. We can see that SMART is able to generalize to unseen tasks and even unseen domains, whose distributions diverge further from the pretraining dataset. Surprisingly, SMART achieves better performance than CT-single in most tasks, even though CT-single has already seen the downstream environments. 
This suggests that <strong><em>good generalization ability can be obtained from learning underlying information which might be shared among multiple tasks and domains, spanning a diverse set of distributions.<\/em><\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"353\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig4-1024x353.png\" alt=\"chart, map\" class=\"wp-image-922401\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig4-1024x353.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig4-300x104.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig4-768x265.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig4-1536x530.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig4-2048x707.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig4-240x83.png 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 4: Downstream learning rewards in unseen tasks and domains of SMART (red) compared with pretraining CT with single-task data (blue) and training from scratch (gray). Results are averaged over 3 seeds.<\/figcaption><\/figure>\n\n\n\n<p>To further investigate the generalizability of SMART, we evaluate the performance of SMART in other more challenging domains and tasks that have larger discrepancy with pretraining domains\/tasks. These additional domain-tasks are: ball-in-cup-catch, finger-turn-hard, fish-swim, swimmer-swimmer6 and swimmer-swimmer15. 
Note that these agents have significantly different appearances and movement patterns compared to the pretraining tasks, as visualized in the figure below.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/novel-task-fig-63f9082d710c1-1024x476.png\" alt=\"Visualization of tasks.\" class=\"wp-image-922608\" width=\"372\" height=\"173\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/novel-task-fig-63f9082d710c1-1024x476.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/novel-task-fig-63f9082d710c1-300x139.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/novel-task-fig-63f9082d710c1-768x357.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/novel-task-fig-63f9082d710c1-240x111.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/novel-task-fig-63f9082d710c1.png 1053w\" sizes=\"auto, (max-width: 372px) 100vw, 372px\" \/><figcaption class=\"wp-element-caption\">Figure 5: Discrepancy between pretraining domains and selected downstream domains: (left) Walker domain. (right) Swimmer domain (6 and 15 links)<\/figcaption><\/figure>\n\n\n\n<p>The results are shown in the figures below, where we can see that the pretrained model still works in most cases, even under such a large task discrepancy. Note that CT-Single is pretrained with data from exactly the downstream task, whereas SMART has never seen a sample from the downstream tasks and is pretrained on significantly different domains. Therefore, it is unsurprising that CT-Single is generally better than SMART in this setting. 
However, it is interesting to see that SMART is comparable with or even better than CT-Single in some tasks,<strong><em> suggesting the strong generalizability of SMART.<\/em><\/strong> On the other hand, one can imagine that it is unavoidable that the performance of a pretrained model will decrease as the discrepancy between pretraining tasks and downstream tasks increases. <strong><em>Therefore, we stress the importance of using diverse multi-task data for pretraining in practice.<\/em><\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"361\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig13-1024x361.png\" alt=\"Line graph.\" class=\"wp-image-922587\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig13-1024x361.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig13-300x106.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig13-768x271.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig13-1536x542.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig13-2048x723.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig13-240x85.png 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 6. Downstream learning rewards of SMART (red) in challenging tasks that have larger discrepancy with pretraining tasks, using the Exploratory pretraining dataset. 
Results are from 1 random seed.<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"358\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig12-1024x358.png\" alt=\"chart\" class=\"wp-image-922584\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig12-1024x358.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig12-300x105.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig12-768x269.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig12-1536x537.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig12-2048x716.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig12-240x84.png 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 7: Downstream learning rewards of SMART (red) in challenging tasks that have larger discrepancy with pretraining tasks, using the Random pretraining dataset. Results are from 1 random seed.<\/figcaption><\/figure>\n\n\n\n<h4 id=\"resilience\">Resilience<\/h4>\n\n\n\n<p>We aggregate the results in all tasks by averaging the normalized reward (dividing raw scores by expert scores) in both <em>RTG<\/em> and <em>BC<\/em> settings. 
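The aggregation just described (dividing each task's raw return by the expert return, then averaging) can be sketched as a small helper; the task names and scores below are hypothetical examples, not results from the paper:

```python
def aggregate_normalized_reward(raw_scores, expert_scores):
    """Average expert-normalized rewards across tasks.

    raw_scores / expert_scores: dicts mapping task name -> return.
    (Illustrative helper, not from the SMART codebase.)
    """
    normalized = [raw_scores[task] / expert_scores[task] for task in raw_scores]
    return sum(normalized) / len(normalized)

# Hypothetical example: two tasks, scored against made-up expert returns.
score = aggregate_normalized_reward(
    {"cheetah-run": 400.0, "walker-walk": 600.0},
    {"cheetah-run": 800.0, "walker-walk": 1000.0},
)
# 400/800 = 0.5 and 600/1000 = 0.6, so the aggregate is 0.55
```

Normalizing by expert score puts tasks with very different reward scales on a comparable footing before averaging.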
When using the Exploratory dataset for pretraining, SMART outperforms <em><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/proceedings.mlr.press\/v139\/yang21h\/yang21h.pdf\">ACL<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/em>, and is comparable to <em><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/pdf\/2106.01345.pdf\">DT<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/em>, which has access to extra reward information. When pretrained with the Random dataset, SMART is significantly better than <em>DT<\/em> and <em>ACL<\/em>, while <em>ACL<\/em> fails to outperform training from scratch. <strong><em>These results show that SMART is more robust to low-quality data than other baseline methods<\/em><\/strong>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"246\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig5-1024x246.png\" alt=\"chart, bar chart\" class=\"wp-image-922404\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig5-1024x246.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig5-300x72.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig5-768x184.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig5-1536x369.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig5-2048x492.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/fig5-240x58.png 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 8: Downstream learning rewards 
(normalized by expert score) of all methods using the Exploratory and Random datasets. The gap between each pair of green and red bars corresponds to the resilience of each method to pretraining data quality, and SMART shows the best resilience among all baselines.<\/figcaption><\/figure>\n\n\n\n<h4 id=\"analysis\">Analysis<\/h4>\n\n\n\n<p>In large-scale training problems, performance usually benefits from larger model capacity. We investigate whether this also applies to sequential decision making tasks by varying the embedding size (width) and the number of layers (depth) in CT. The per-task comparisons are shown in the figure below. From the comparison, we can see that increasing the model depth generally leads to better performance. However, when the embedding size gets too large, performance drops, as an overly large representation space may admit irrelevant information. In addition, the choice of model capacity should also be considered together with the scale and diversity of the training dataset. 
<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"644\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/capacity-1024x644.png\" alt=\"chart, bar chart\" class=\"wp-image-922611\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/capacity-1024x644.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/capacity-300x189.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/capacity-768x483.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/capacity-1536x965.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/capacity-240x151.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/02\/capacity.png 1960w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 9: Comparison of varying model capacities (embedding size and layer number) in different tasks in terms of relative improvement wrt training from scratch.<\/figcaption><\/figure>\n\n\n\n<h3 id=\"towards-foundation-models-for-perception-and-control\">Towards Foundation Models for Perception and Control<\/h3>\n\n\n\n<p>We are thrilled to announce the release of SMART, a technique designed to bring foundation models for decision-making within reach of a wider audience. Our goal with SMART is to make it easy for anyone to use pretrained foundation models without requiring specialized knowledge of model architecture or pretraining approaches. By leveraging the latest advances in spatio-temporal data analysis, SMART is at the forefront of addressing the challenges of perception and control jointly. 
Our team is excited to see what the future holds for this powerful new technique.<\/p>\n\n\n\n<p><em>This work is being undertaken by members of the <\/em><a href=\"https:\/\/www.microsoft.com\/en-us\/ai\/autonomous-systems\" target=\"_blank\" rel=\"noreferrer noopener\"><em>Microsoft Autonomous Systems and Robotics Research Group<\/em><\/a> and <em><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/furong-huang.com\/students-colleagues\/\">University of Maryland<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/em>. <em>The researchers included in this project are: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/ycsun2017.github.io\/\">Yanchao Sun<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.shuangma.me\/\">Shuang Ma<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/ratneshmadaan.github.io\/\">Ratnesh Madaan<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/rogeriobonatti.com\/\">Rogerio Bonatti<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/furong-huang.com\/\">Furong Huang<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and <a 
href=\"https:\/\/www.microsoft.com\/en-us\/research\/people\/akapoor\/videos\/?lang=fr_ca\">Ashish Kapoor<\/a>. <\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>We are announcing SMART, a generalized pretraining framework for a wide variety of control tasks. Self-supervised pretraining of large neural networks (BERT (opens in new tab), GPT (opens in new tab), MoCo (opens in new tab), and CLIP (opens in new tab)) has been shown to be successful in a wide range of language and [&hellip;]<\/p>\n","protected":false},"author":40411,"featured_media":923152,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":867219,"msr_hide_image_in_river":0,"footnotes":""},"research-area":[],"msr-locale":[268875],"msr-post-option":[],"class_list":["post-922338","msr-blog-post","type-msr-blog-post","status-publish","has-post-thumbnail","hentry","msr-locale-en_us"],"msr_assoc_parent":{"id":867219,"type":"group"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/922338","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/40411"}],"version-history":[{"count":39,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/922338\/revisions"}],"predecessor-version":[{"id":923156,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/922338\/revisions\/923156"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/923152"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us
\/research\/wp-json\/wp\/v2\/media?parent=922338"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=922338"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=922338"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=922338"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}