{"id":1052988,"date":"2024-07-10T18:31:31","date_gmt":"2024-07-11T01:31:31","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&#038;p=1052988"},"modified":"2025-03-19T02:21:03","modified_gmt":"2025-03-19T09:21:03","slug":"ar-videos","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/ar-videos\/","title":{"rendered":"Autoregressive Video Models"},"content":{"rendered":"<section class=\"mb-3 moray-highlight\">\n\t<div class=\"card-img-overlay mx-lg-0\">\n\t\t<div class=\"card-background  has-background-racing-green card-background--full-bleed\">\n\t\t\t\t\t<\/div>\n\t\t<!-- Foreground -->\n\t\t<div class=\"card-foreground d-flex mt-md-n5 my-lg-5 px-g px-lg-0\">\n\t\t\t<!-- Container -->\n\t\t\t<div class=\"container d-flex mt-md-n5 my-lg-5 \">\n\t\t\t\t<!-- Card wrapper -->\n\t\t\t\t<div class=\"w-100 w-lg-col-5\">\n\t\t\t\t\t<!-- Card -->\n\t\t\t\t\t<div class=\"card material-md-card py-5 px-md-5\">\n\t\t\t\t\t\t<div class=\"card-body \">\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n<h1 class=\"wp-block-heading has-text-align-left\" id=\"autoregressive-video-models\">Autoregressive Video Models<\/h1>\n\n\n\n<p>Driving large video models with next token prediction<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n<h2 class=\"wp-block-heading has-text-align-center\" id=\"video-in-context-learning-autoregressive-transformers-are-zero-shot-video-imitators\">Video In-context Learning: Autoregressive Transformers are Zero-Shot Video Imitators<\/h2>\n\n\n\n<div class=\"wp-block-buttons is-content-justification-left is-content-justification-left is-layout-flex wp-container-core-buttons-is-layout-16165478 wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/arxiv.org\/abs\/2407.07356\">Paper<\/a><\/div>\n<\/div>\n\n\n\n<p>In-context learning for vision data has been underexplored compared with that in natural language. Previous works studied image in-context learning, urging models to generate a single image guided by demonstrations. In this project, we propose and study video in-context learning, where the model starts from an existing video clip and generates diverse potential future sequences, each semantically guided by the prompted video demonstrations. To achieve this, we provide a clear definition of the task, and train an autoregressive Transformer on video datasets. We thoroughly analyze the effect of different datasets and represent frames as discrete tokens, and then model them by next token predictions.<\/p>\n\n\n\n<p>As a result, the obtained vision Transformer model is able to generate a subsequent video sequence of a query video clip, which is semantically aligned with the demonstration video. Demonstration videos are highly versatile and capable of conveying a wide range of information, such as examples for various tasks including moving or grabbing objects, or movements of the camera in an ego-centric video. This allows video in-context learning to address multiple downstream tasks, such as embodied planning and simulating, by letting a query robot imitate the actions demonstrated by other robots, as shown below.  <strong><em>Video in-context learning serves as a new and crucial interface for models to interact with the real world<\/em><\/strong>, as videos are good at describing low-level details (where language may fall short) and temporal dynamics (where images are insufficient).<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"592\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/figure2-1024x592.png\" alt=\"Model Architecture\" class=\"wp-image-1052997\" style=\"width:908px;height:auto\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/figure2-1024x592.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/figure2-300x174.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/figure2-768x444.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/figure2-1536x888.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/figure2-2048x1185.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/figure2-240x139.png 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">The pipeline of Vid-ICL. Left: Training of Vid-ICL. The data used for training are continuous video clips and the Transformer is trained by next token prediction. Right: In-context Inference of Vid-ICL. The model is conditioned on demonstration videos and generates future frames.<\/figcaption><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"generated-samples-on-something-something-v2\">Generated Samples on Something-Something v2<\/h4>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Demonstration<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">DeLVM<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Vid-ICL (700M, pt)<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Vid-ICL (1.1B, pt)<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Vid-ICL (700M, ft)<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Vid-ICL (1.1B, ft)<\/p>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\">\n<div class=\"wp-block-group is-layout-constrained wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\">\n<figure class=\"wp-block-gallery has-nested-images columns-6 is-cropped wp-block-gallery-1 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053009\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/4-66839f45d9812.gif\" alt=\"gif placeholder\" class=\"wp-image-1053009\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053015\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/0-66839f9fc2758.gif\" alt=\"gif placeholder\" class=\"wp-image-1053015\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053018\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/11.gif\" alt=\"gif placeholder\" class=\"wp-image-1053018\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053021\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/11-6683a03123635.gif\" alt=\"gif placeholder\" class=\"wp-image-1053021\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053024\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/11-6683a043bc758.gif\" alt=\"gif placeholder\" class=\"wp-image-1053024\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053027\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/4-6683a056ae654.gif\" alt=\"gif placeholder\" class=\"wp-image-1053027\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053033\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/5.gif\" alt=\"gif placeholder\" class=\"wp-image-1053033\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053036\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/1.gif\" alt=\"gif placeholder\" class=\"wp-image-1053036\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053039\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/15.gif\" alt=\"gif placeholder\" class=\"wp-image-1053039\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053042\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/15-6683a2f531380.gif\" alt=\"gif placeholder\" class=\"wp-image-1053042\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053045\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/15-6683a3155d78d.gif\" alt=\"gif placeholder\" class=\"wp-image-1053045\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053048\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/5-6683a321daf6d.gif\" alt=\"gif placeholder\" class=\"wp-image-1053048\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053054\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/6.gif\" alt=\"gif placeholder\" class=\"wp-image-1053054\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053057\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/2.gif\" alt=\"gif placeholder\" class=\"wp-image-1053057\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053060\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/17.gif\" alt=\"gif placeholder\" class=\"wp-image-1053060\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053063\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/17-6683a3904eae6.gif\" alt=\"gif placeholder\" class=\"wp-image-1053063\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053069\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/17-6683a3e431613.gif\" alt=\"gif placeholder\" class=\"wp-image-1053069\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053072\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/6-6683a3ef3af43.gif\" alt=\"gif placeholder\" class=\"wp-image-1053072\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053075\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/7.gif\" alt=\"gif placeholder\" class=\"wp-image-1053075\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053078\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/3.gif\" alt=\"gif placeholder\" class=\"wp-image-1053078\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053081\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/19.gif\" alt=\"gif placeholder\" class=\"wp-image-1053081\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053084\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/19-6683a433d902c.gif\" alt=\"gif placeholder\" class=\"wp-image-1053084\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053087\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/19-6683a440dabd5.gif\" alt=\"gif placeholder\" class=\"wp-image-1053087\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053090\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/7-6683a44d84e2a.gif\" alt=\"gif placeholder\" class=\"wp-image-1053090\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053093\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/15-6683a467c30cb.gif\" alt=\"gif placeholder\" class=\"wp-image-1053093\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053096\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/12.gif\" alt=\"gif placeholder\" class=\"wp-image-1053096\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053099\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/110.gif\" alt=\"gif placeholder\" class=\"wp-image-1053099\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053102\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/110-6683a491efa62.gif\" alt=\"gif placeholder\" class=\"wp-image-1053102\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053105\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/110-6683a4a0e32aa.gif\" alt=\"gif placeholder\" class=\"wp-image-1053105\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053120\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/15-6683a54d15e55.gif\" alt=\"gif placeholder\" class=\"wp-image-1053120\" \/><\/figure>\n<\/figure>\n\n\n\n<p>Qualitative results on the generated samples by Vid-ICL on Something-Something v2 dataset, a dataset of real world human activities. . The first column shows the demonstration video. The second column shows the generated samples by DeLVM. The third to sixth columns show the generated samples by Vid-ICL with different pretraining and finetuning strategies. It shows that Vid-ICL generates more diverse and plausible samples than DeLVM, and the quality of the generated samples is improved by scaling model sizes and in-domain fine-tuning.<\/p>\n<\/div>\n<\/div>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"generated-samples-on-robotics-transformer-dataset\">Generated Samples on Robotics Transformer Dataset<\/h4>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Demonstration<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Vid-ICL (700M, pt)<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Vid-ICL (1.1B, pt)<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Vid-ICL (300M, ft)<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Vid-ICL (700M, ft)<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Vid-ICL (1.1B, ft)<\/p>\n<\/div>\n<\/div>\n\n\n\n<figure class=\"wp-block-gallery has-nested-images columns-6 is-cropped wp-block-gallery-2 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053135\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/0-6683a808d3bc0.gif\" alt=\"gif placeholder\" class=\"wp-image-1053135\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053138\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/0-6683a8180f7db.gif\" alt=\"gif placeholder\" class=\"wp-image-1053138\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053141\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/0-6683a825241a2.gif\" alt=\"gif placeholder\" class=\"wp-image-1053141\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053144\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/0-6683a83259a44.gif\" alt=\"gif placeholder\" class=\"wp-image-1053144\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053147\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/0-6683a83db69b1.gif\" alt=\"gif placeholder\" class=\"wp-image-1053147\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053150\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/0-6683a84d1869a.gif\" alt=\"gif placeholder\" class=\"wp-image-1053150\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053153\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/8.gif\" alt=\"gif placeholder\" class=\"wp-image-1053153\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053156\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/8-6683a89c68359.gif\" alt=\"gif placeholder\" class=\"wp-image-1053156\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053159\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/8-6683a8a6d6e8d.gif\" alt=\"gif placeholder\" class=\"wp-image-1053159\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053162\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/8-6683a8b1989ee.gif\" alt=\"gif placeholder\" class=\"wp-image-1053162\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053165\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/8-6683a8bc41ee6.gif\" alt=\"gif placeholder\" class=\"wp-image-1053165\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053168\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/8-6683a8c7620c6.gif\" alt=\"gif placeholder\" class=\"wp-image-1053168\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053171\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/24.gif\" alt=\"gif placeholder\" class=\"wp-image-1053171\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053174\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/24-6683a8e7e2880.gif\" alt=\"gif placeholder\" class=\"wp-image-1053174\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053177\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/24-6683a8f0d2429.gif\" alt=\"gif placeholder\" class=\"wp-image-1053177\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053180\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/24-6683a8fcd9a37.gif\" alt=\"gif placeholder\" class=\"wp-image-1053180\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053183\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/24-6683a907f3c8c.gif\" alt=\"gif placeholder\" class=\"wp-image-1053183\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053186\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/24-6683a9143829a.gif\" alt=\"gif placeholder\" class=\"wp-image-1053186\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053189\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/62.gif\" alt=\"gif placeholder\" class=\"wp-image-1053189\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053192\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/62-6683a93226921.gif\" alt=\"gif placeholder\" class=\"wp-image-1053192\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053195\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/62-6683a93ba233b.gif\" alt=\"gif placeholder\" class=\"wp-image-1053195\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053198\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/62-6683a946879bf.gif\" alt=\"gif placeholder\" class=\"wp-image-1053198\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053201\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/62-6683a9504596c.gif\" alt=\"gif placeholder\" class=\"wp-image-1053201\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053204\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/62-6683a95a4f32a.gif\" alt=\"gif placeholder\" class=\"wp-image-1053204\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053207\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/51.gif\" alt=\"gif placeholder\" class=\"wp-image-1053207\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053210\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/51-6683a9819d178.gif\" alt=\"gif placeholder\" class=\"wp-image-1053210\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053213\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/51-6683a98d8882b.gif\" alt=\"gif placeholder\" class=\"wp-image-1053213\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053216\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/51-6683a9971c096.gif\" alt=\"gif placeholder\" class=\"wp-image-1053216\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053219\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/51-6683a9a069983.gif\" alt=\"gif placeholder\" class=\"wp-image-1053219\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053222\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/51-6683a9a9ba409.gif\" alt=\"gif placeholder\" class=\"wp-image-1053222\" \/><\/figure>\n<\/figure>\n\n\n\n<p>Qualitative results on the generated samples by Vid-ICL on Robotics Transformer(RT-1) dataset, a dataset of robotic manipulation tasks. Vid-ICL offers more concise control over various robotic tasks with the demonstration video.<\/p>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"contrastive-demonstration\">Contrastive Demonstration<\/h4>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Demonstration<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Generation<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Contrastive Demonstration<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Contrastive Generation<\/p>\n<\/div>\n<\/div>\n\n\n\n<figure class=\"wp-block-gallery aligncenter has-nested-images columns-4 is-cropped wp-block-gallery-3 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053234\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/1-6683abe085d38.gif\" alt=\"gif placeholder\" class=\"wp-image-1053234\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053237\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/1-6683abec3dd85.gif\" alt=\"gif placeholder\" class=\"wp-image-1053237\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053240\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/1-6683abfbb2bd7.gif\" alt=\"gif placeholder\" class=\"wp-image-1053240\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053243\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/1-6683ac07891a0.gif\" alt=\"gif placeholder\" class=\"wp-image-1053243\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053246\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/15-6683ac1b630fd.gif\" alt=\"gif placeholder\" class=\"wp-image-1053246\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053249\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/15-6683ac267f930.gif\" alt=\"gif placeholder\" class=\"wp-image-1053249\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053252\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/15-6683ac308fafb.gif\" alt=\"gif placeholder\" class=\"wp-image-1053252\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053255\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/15-6683ac3d265b2.gif\" alt=\"gif placeholder\" class=\"wp-image-1053255\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053258\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/2-6683ac5209723.gif\" alt=\"gif placeholder\" class=\"wp-image-1053258\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053261\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/2-6683ac5f2bbee.gif\" alt=\"gif placeholder\" class=\"wp-image-1053261\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053264\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/2-6683ac6aa2947.gif\" alt=\"gif placeholder\" class=\"wp-image-1053264\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053267\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/2-6683ac761ae3b.gif\" alt=\"gif placeholder\" class=\"wp-image-1053267\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053270\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/4-6683ac832df72.gif\" alt=\"gif placeholder\" class=\"wp-image-1053270\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053273\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/4-6683ac90605e5.gif\" alt=\"gif placeholder\" class=\"wp-image-1053273\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053276\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/4-6683ac9aef5fe.gif\" alt=\"gif placeholder\" class=\"wp-image-1053276\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053279\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/4-6683aca67f7cf.gif\" alt=\"gif placeholder\" class=\"wp-image-1053279\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053282\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/11-6683acb98695e.gif\" alt=\"gif placeholder\" class=\"wp-image-1053282\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053285\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/11-6683acc4ea7fd.gif\" alt=\"gif placeholder\" class=\"wp-image-1053285\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053288\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/11-6683acd0ebc70.gif\" alt=\"gif placeholder\" class=\"wp-image-1053288\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053291\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/11-6683acde3c986.gif\" alt=\"gif placeholder\" class=\"wp-image-1053291\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053294\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/13.gif\" alt=\"gif placeholder\" class=\"wp-image-1053294\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053297\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/13-6683ad0093b3c.gif\" alt=\"gif placeholder\" class=\"wp-image-1053297\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053300\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/13-6683ad1314c06.gif\" alt=\"gif placeholder\" class=\"wp-image-1053300\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053303\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/13-6683ad2d660bb.gif\" alt=\"gif placeholder\" class=\"wp-image-1053303\" \/><\/figure>\n<\/figure>\n\n\n\n<p>In this section, we give shared prompts to Vid-ICL in the second and fourth columns of each row, and a contrastive prompt in the first and third columns. Results shows an important feature of Vid-ICL that when conditioned on demonstration with contrastive semantics, Vid-ICL generates contrary samples given the same demonstration, which demonstrates Vid-ICL&#8217;s ability to&nbsp;<strong>precisely understand the demonstration dynamics and generate diverse future sequences.<\/strong><\/p>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"text-conditioned-generation\">Text Conditioned Generation<\/h4>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Text<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Demonstration<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Generation w\/o Text<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Generation w\/ Text<\/p>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:380px\">\n<p class=\"has-text-align-center\">Push [something] from left to right<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-gallery aligncenter has-nested-images columns-default is-cropped wp-block-gallery-4 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053321\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/8-6683b22db315f.gif\" alt=\"gif placeholder\" class=\"wp-image-1053321\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053324\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/20.gif\" alt=\"gif placeholder\" class=\"wp-image-1053324\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053327\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/20-6683b2471482d.gif\" alt=\"gif placeholder\" class=\"wp-image-1053327\" \/><\/figure>\n<\/figure>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:380px\">\n<p class=\"has-text-align-center\">Turning the camera left while filming [something]<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-gallery aligncenter has-nested-images columns-default is-cropped wp-block-gallery-5 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053342\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/0-6683b32f3b405.gif\" alt=\"gif placeholder\" class=\"wp-image-1053342\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053345\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/0-6683b33bed1be.gif\" alt=\"gif placeholder\" class=\"wp-image-1053345\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053348\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/0-6683b3451d71f.gif\" alt=\"gif placeholder\" class=\"wp-image-1053348\" \/><\/figure>\n<\/figure>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:380px\">\n<p class=\"has-text-align-center\">Moving [something] closer to [something]<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-gallery aligncenter has-nested-images columns-default is-cropped wp-block-gallery-6 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053351\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/76.gif\" alt=\"gif placeholder\" class=\"wp-image-1053351\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053354\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/76-6683b381b4f29.gif\" alt=\"gif placeholder\" class=\"wp-image-1053354\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"256\" height=\"256\" data-id=\"1053357\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/76-6683b389a3f7d.gif\" alt=\"gif placeholder\" class=\"wp-image-1053357\" \/><\/figure>\n<\/figure>\n<\/div>\n<\/div>\n\n\n\n<p>We show that text can be used as an additional in-context condition signals to augment the generation of Vid-ICL. After aligning Vid-ICL to the text, the generated samples are more consistent with the text instructions and demonstration videos.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"on-reinforcement-learning-tasks\">On Reinforcement Learning Tasks<\/h4>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"335\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/robodesk-6683b44fc2704-1024x335.png\" alt=\"gif placeholder\" class=\"wp-image-1053369\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/robodesk-6683b44fc2704-1024x335.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/robodesk-6683b44fc2704-300x98.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/robodesk-6683b44fc2704-768x251.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/robodesk-6683b44fc2704-1536x503.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/robodesk-6683b44fc2704-2048x670.png 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/robodesk-6683b44fc2704-240x79.png 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>We demonstrate that Vid-ICL can also function as a simulator in reinforcement learning tasks by evaluating it on the RoboDesk, a test benchmark containing several reinforcement learning tasks. Vid-ICL generates future frames with videos that accomplishing the same task as the demonstration, where the generated frames inversely reflect the corresponding actions that correctly interact with the environment. Evaluating on the <em><strong>Push_red<\/strong><\/em> task, we find that Vid-ICL provides more precise control over the environment interaction.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"ethics-statement\">Ethics Statement<\/h4>\n\n\n\n<p>Vid-ICL is exclusively a research initiative with no current plans for product integration or public access. We are committed to adhering to Microsoft AI principles during the ongoing development of our models. The datasets utilized in this study are publicly available and have been thoroughly reviewed to ensure they do not include personally identifiable information or offensive content. Nonetheless, as these datasets are sourced from the Internet, there may still be inherent biases. To address this, we have implemented a rigorous filtering process on the training data to minimize the potential for the model to generate inappropriate content.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<h4 class=\"wp-block-heading\" id=\"citation\">Citation<\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code>@article{zhang2024video,\n  title={Video In-Context Learning},\n  author={Zhang, Wentao and Guo, Junliang and He, Tianyu and Zhao, Li and Xu, Linli and Bian, Jiang},\n  journal={arXiv preprint arXiv:2407.07356},\n  year={2024}\n}<\/code><\/pre>\n\n\n\n\n\n<h2 class=\"wp-block-heading has-text-align-center\" id=\"fast-autoregressive-video-generation-with-diagonal-decoding\">Fast Autoregressive Video Generation with Diagonal Decoding<\/h2>\n\n\n\n<div class=\"wp-block-buttons is-layout-flex wp-block-buttons-is-layout-flex\">\n<div class=\"wp-block-button\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\" href=\"https:\/\/arxiv.org\/pdf\/2503.14070\">Paper<\/a><\/div>\n\n\n\n<div class=\"wp-block-button\"><a data-bi-type=\"button\" class=\"wp-block-button__link wp-element-button\">Code (Coming Soon)<\/a><\/div>\n<\/div>\n\n\n\n<p>Autoregressive Transformer models have demonstrated impressive performance in video generation, but their sequential token-by-token decoding process poses a major bottleneck, particularly for long videos represented by tens of thousands of tokens. Here, we propose <strong>Diagonal Decoding (DiagD)<\/strong>, a training-free inference acceleration algorithm for autoregressively pre-trained models that exploits spatial and temporal correlations in videos. Our method generates tokens along diagonal paths in the spatial-temporal token grid, enabling parallel decoding within each frame as well as partially overlapping across consecutive frames. The proposed algorithm is versatile and adaptive to various generative models and tasks, while providing flexible control over the trade-off between inference speed and visual quality. Experiments on multiple autoregressive video generation models and datasets demonstrate that DiagD achieves up to <strong>10x speedup compared to naive sequential decoding<\/strong>, while maintaining comparable visual fidelity.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"865\" height=\"502\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/front-1.png\" alt=\"Comparisons between naive Next-Token Prediction (NTP) and Diagonal Decoding (DiagD) on Cosmos autoregressive models. \" class=\"wp-image-1134475\" style=\"width:596px;height:auto\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/front-1.png 865w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/front-1-300x174.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/front-1-768x446.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/front-1-480x280.png 480w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/front-1-240x139.png 240w\" sizes=\"auto, (max-width: 865px) 100vw, 865px\" \/><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading has-text-align-center\" id=\"method\">Method<\/h4>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"472\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/method-1024x472.png\" alt=\"Illustration of Diagonal Decoding\" class=\"wp-image-1134476\" style=\"width:779px;height:auto\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/method-1024x472.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/method-300x138.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/method-768x354.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/method-1536x708.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/method-240x111.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/method.png 1629w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>The motivation of our method arises from intuitive observations on consecutive frames in a video, which can be summarized in two key insights. As shown above, the first insight is that patches exhibit stronger correlations with their spatial neighbors than with sequential ones. Secondly, due to the temporal redundancy of videos, patches from consecutive frames that occupy similar relative positions are highly likely to be similar to each other. <\/p>\n\n\n\n<p>As a result, we propose Diagonal Decoding, an iterative algorithm that generates tokens along diagonal paths in the spatial-temporal token grid. Spatially, within each frame, tokens along the same diagonal are generated in parallel, leveraging the strong local dependencies between neighboring patches. And temporally, by stacking frames together, our method generates the top-left tokens of the next frame before completing the current frame, as these tokens are less likely to depend on the bottom-right tokens that have not yet been generated.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading has-text-align-center\" id=\"case-cosmos-12b-autoregressive-model\">Case: Cosmos 12B Autoregressive Model<\/h4>\n\n\n\n<div class=\"wp-block-columns are-vertically-aligned-center is-not-stacked-on-mobile is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Next Token Prediction<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">DiagD (k=1)<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">DiagD (k=2)<\/p>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-not-stacked-on-mobile is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"500\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/buYzC9In_lM_415.gif\" alt=\"NTP Case 1\" class=\"wp-image-1134477\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"500\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/buYzC9In_lM_415-1.gif\" alt=\"Diagd k=1 Case 1\" class=\"wp-image-1134478\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"500\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/buYzC9In_lM_415-2.gif\" alt=\"Diagd k=2 Case 1\" class=\"wp-image-1134479\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-not-stacked-on-mobile is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"500\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/TFYdFphVyZw_1341-4.gif\" alt=\"NTP Case 2\" class=\"wp-image-1134485\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"500\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/TFYdFphVyZw_1341-3.gif\" alt=\"Diagd k=1 Case 2\" class=\"wp-image-1134484\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"500\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/TFYdFphVyZw_1341-2.gif\" alt=\"Diagd k=2 Case 2\" class=\"wp-image-1134483\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-not-stacked-on-mobile is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"500\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/twV1rn5uaI_5745.gif\" alt=\"NTP Case 3\" class=\"wp-image-1134486\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"500\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/twV1rn5uaI_5745-2.gif\" alt=\"Diagd k=1 Case 3\" class=\"wp-image-1134488\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"500\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/twV1rn5uaI_5745-1.gif\" alt=\"Diagd k=2 Case 3\" class=\"wp-image-1134487\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading has-text-align-center\" id=\"case-wham-1-6b-model\">Case: WHAM 1.6B Model<\/h4>\n\n\n\n<div class=\"wp-block-columns are-vertically-aligned-center is-not-stacked-on-mobile is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Next Token Prediction<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">DiagD (k=1)<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">DiagD (k=2)<\/p>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-not-stacked-on-mobile is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"300\" height=\"180\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/FBD3991.gif\" alt=\"NTP Case 1\" class=\"wp-image-1134493\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"300\" height=\"180\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/FBD39991B0AE4A7AB3474ED102EEDCC5779D665B4619_250Movie.pb_step-1750_tick-14162_video-ezgif.com-video-to-gif-converter.gif\" alt=\"Diagd k=1 Case 1\" class=\"wp-image-1134489\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"300\" height=\"180\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/FBD39991B0AE4A7AB3474ED102EEDCC5779D665B4619_250Movie.pb_step-1750_tick-14162_video-ezgif.com-video-to-gif-converter-1.gif\" alt=\"Diagd k=2 Case 1\" class=\"wp-image-1134491\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"300\" height=\"180\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/FBE4D21.gif\" alt=\"NTP Case 2\" class=\"wp-image-1134494\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"300\" height=\"180\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/FBE4D26593B2E7DF33F742A95AC06D4CA3AF7A7AFC3B_71Movie.pb_step-16_tick-4591_video-ezgif.com-video-to-gif-converter.gif\" alt=\"Diagd k=1 Case 2\" class=\"wp-image-1134490\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"300\" height=\"180\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/FBE4D26593B2E7DF33F742A95AC06D4CA3AF7A7AFC3B_71Movie.pb_step-16_tick-4591_video-ezgif.com-video-to-gif-converter-1.gif\" alt=\"Diagd k=2 Case 2\" class=\"wp-image-1134492\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading has-text-align-center\" id=\"case-autoregressive-model-on-minecraft\">Case: Autoregressive Model on Minecraft<\/h4>\n\n\n\n<div class=\"wp-block-columns are-vertically-aligned-center is-not-stacked-on-mobile is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">Next Token Prediction<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">DiagD w\/o Finetune<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-vertically-aligned-center is-layout-flow wp-block-column-is-layout-flow\">\n<p class=\"has-text-align-center\">DiagD w\/ Finetune<\/p>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-not-stacked-on-mobile is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"384\" height=\"224\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/clip_58-ezgif.com-video-to-gif-converter-3.gif\" alt=\"a screenshot of a video game\" class=\"wp-image-1134501\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"384\" height=\"224\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/clip_58-ezgif.com-video-to-gif-converter.gif\" alt=\"a screenshot of a video game\" class=\"wp-image-1134495\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"384\" height=\"224\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/clip_58-ezgif.com-video-to-gif-converter-2.gif\" alt=\"a screenshot of a video game\" class=\"wp-image-1134499\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-not-stacked-on-mobile is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"384\" height=\"224\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/clip_198-ezgif.com-video-to-gif-converter-3.gif\" alt=\"a screenshot of a video game\" class=\"wp-image-1134502\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"384\" height=\"224\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/clip_198-ezgif.com-video-to-gif-converter.gif\" alt=\"a screenshot of a video game\" class=\"wp-image-1134496\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"384\" height=\"224\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/07\/clip_198-ezgif.com-video-to-gif-converter-2.gif\" alt=\"a screenshot of a video game\" class=\"wp-image-1134500\" style=\"width:501px;height:auto\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h4 class=\"wp-block-heading has-text-align-center\" id=\"bibtex\">BibTex<\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code>@article{ye2025fast,\n      title={Fast Autoregressive Video Generation with Diagonal Decoding}, \n      author={Ye, Yang and Guo, Junliang and Wu, Haoyu and He, Tianyu and Pearce, Tim and Rashid, Tabish and Hofmann, Katja and Bian, Jiang},\n      journal={arXiv preprint arXiv:2503.14070},\n      year={2025},\n}<\/code><\/pre>\n\n\n","protected":false},"excerpt":{"rendered":"<p>Driving large video models with next token prediction In-context learning for vision data has been underexplored compared with that in natural language. Previous works studied image in-context learning, urging models to generate a single image guided by demonstrations. In this project, we propose and study video in-context learning, where the model starts from an existing [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13556,13562],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-1052988","msr-project","type-msr-project","status-publish","hentry","msr-research-area-artificial-intelligence","msr-research-area-computer-vision","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"","related-publications":[1056966],"related-downloads":[],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[],"slides":[],"related-researchers":[{"type":"user_nicename","display_name":"Li Zhao","user_id":36152,"people_section":"Related people","alias":"lizo"}],"msr_research_lab":[199560],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/1052988","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":50,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/1052988\/revisions"}],"predecessor-version":[{"id":1140261,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/1052988\/revisions\/1140261"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1052988"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1052988"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1052988"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1052988"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=1052988"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}