{"id":1132587,"date":"2025-02-26T21:49:41","date_gmt":"2025-02-27T05:49:41","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&#038;p=1132587"},"modified":"2025-02-26T21:49:43","modified_gmt":"2025-02-27T05:49:43","slug":"efficiently-generating-long-high-quality-and-dynamic-videos-using-text-prompts","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/efficiently-generating-long-high-quality-and-dynamic-videos-using-text-prompts\/","title":{"rendered":"Efficiently generating long, high-quality, and dynamic videos using text prompts"},"content":{"rendered":"\n<p>The rapid development of AI has steadily advanced the field of text-to-video (T2V) generation, offering a rich and convenient video content creation experience and unlocking new possibilities in entertainment, education, and multimedia communication. Traditional T2V methods, however, are limited due to a lack of data and computational resources, making it difficult to generate long videos (longer than 30 seconds) that contain dynamic content and temporal consistency. Achieving coherence and preserving the dynamics when generating long videos while also improving efficiency has become a key focus in this field.<\/p>\n\n\n\n<p>To address this, a research team at Microsoft Research Asia has developed the <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/arlon-boosting-diffusion-transformers-with-autoregressive-models-for-long-video-generation\/\">ARLON framework<\/a>, which combines autoregressive (AR) models with diffusion transformer (DiT) technology. By using vector quantized variational autoencoder (VQ-VAE) technology, ARLON effectively compresses and quantizes high-dimensional input features in T2V tasks, reducing learning complexity without compromising information density. 
With text prompts, ARLON synthesizes high-quality videos that retain both rich dynamics and temporal coherence.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"875\" height=\"229\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/02\/image-67bff5607b89f.png\" alt=\"diagram\" class=\"wp-image-1132644\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/02\/image-67bff5607b89f.png 875w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/02\/image-67bff5607b89f-300x79.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/02\/image-67bff5607b89f-768x201.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/02\/image-67bff5607b89f-240x63.png 240w\" sizes=\"auto, (max-width: 875px) 100vw, 875px\" \/><figcaption class=\"wp-element-caption\">Figure 1. ARLON&#8217;s inference framework<\/figcaption><\/figure>\n\n\n\n<p>They optimized the ARLON framework by introducing an adaptive semantic injection module and an uncertainty sampling strategy, enhancing the model&#8217;s robustness to noise and improving the efficiency of video generation. The adaptive semantic injection module uses a gated adaptive normalization mechanism to inject coarse semantic information into the video generation process. Meanwhile, an uncertainty sampling strategy simulates errors in AR predictions by sampling noise from the distribution of the original coarse latent features, improving the model&#8217;s adaptability to different input conditions.<\/p>\n\n\n\n<p>Evaluation demonstrates that ARLON can significantly outperform earlier video generation models in robustness, naturalness, and dynamic consistency. Even when handling highly complex or repetitive scenes, it can consistently synthesize high-quality videos. 
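<\/p>\n\n\n\n<p>For readers who want a concrete picture of the VQ-VAE quantization mentioned above, the nearest-neighbor codebook lookup at its core can be sketched in a few lines of numpy. This is a minimal illustration with toy sizes and names of our own choosing, not ARLON's actual code:<\/p>\n\n\n\n

```python
import numpy as np

def quantize(v, codebook):
    # v: (n, h) continuous latent vectors; codebook: (K, h) entries.
    # Squared Euclidean distance from every vector to every codebook entry.
    d = ((v[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)        # index of the nearest codebook entry
    return idx, codebook[idx]     # discrete indices and quantized vectors

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # toy codebook: K=8 entries of dimension 4
v = rng.normal(size=(5, 4))          # five continuous latent vectors
idx, q = quantize(v, codebook)       # idx selects entries; q replaces v
```

\n\n\n\n<p>Decoding then amounts to looking the stored indices back up in the codebook and passing the retrieved entries to a decoder, which is what makes the representation both compact and discrete.<\/p>\n\n\n\n<p>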
Using the VBench video generation benchmark, ARLON surpassed existing baseline models and delivered substantial improvements across multiple evaluation metrics. The success of the ARLON framework demonstrates the potential of combining the strengths of different models to solve complex problems and offers new directions for advancing video generation technology.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"how-arlon-enhances-the-efficiency-and-quality-of-long-video-generation\">How ARLON enhances the efficiency and quality of long video generation<\/h3>\n\n\n\n<p>The ARLON framework is composed of three primary components: latent VQ-VAE compression, AR modeling, and semantic-aware condition generation. Given a text prompt, the AR model predicts coarse visual latent tokens, constructed from a 3D VAE encoder followed by a latent VQ-VAE encoder. These predicted visual latent tokens encapsulate both coarse spatial information and consistent semantic information. Based on these tokens, a latent VQ-VAE decoder generates continuous latent features, which serve as semantic conditions to guide the DiT model with a semantic injection module.<\/p>\n\n\n\n<p>These components are described in detail below:<\/p>\n\n\n\n<p><strong>Latent VQ-VAE compression<\/strong> is a crucial step that maps high-dimensional input features into a compact and discrete latent space. 
The process is achieved through the following expression:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"400\" height=\"57\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-2.png\" alt=\"formula\" class=\"wp-image-1121529\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-2.png 400w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-2-300x43.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-2-240x34.png 240w\" sizes=\"auto, (max-width: 400px) 100vw, 400px\" \/><\/figure>\n\n\n\n<p>where X\u2208R^(T\u00d7H\u00d7W\u00d7C) represents the input features, E_latent is the encoder composed of 3D convolutional neural network blocks and residual attention blocks, and V\u2208R^(T\/r\u00d7H\/o\u00d7W\/o\u00d7h) is the encoded latent embedding. Each embedding vector v\u2208R^h is quantized to the nearest entry c\u2208R^m in the codebook C\u2208R^(K\u00d7m), forming the discrete latent embedding (Q):<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"347\" height=\"84\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-3-678e1d78162ed.png\" alt=\"formula\" class=\"wp-image-1122666\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-3-678e1d78162ed.png 347w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-3-678e1d78162ed-300x73.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-3-678e1d78162ed-240x58.png 240w\" sizes=\"auto, (max-width: 347px) 100vw, 347px\" \/><\/figure>\n\n\n\n<p>The decoding process involves retrieving the corresponding entries (c) from the codebook (C) given the indices of the video tokens and then using the 
latent VQ-VAE decoder to reconstruct the video embeddings (F):<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"588\" height=\"67\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-4.png\" alt=\"formula\" class=\"wp-image-1121535\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-4.png 588w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-4-300x34.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-4-240x27.png 240w\" sizes=\"auto, (max-width: 588px) 100vw, 588px\" \/><\/figure>\n\n\n\n<p><strong>AR modeling<\/strong> uses a causal transformer decoder as a language model, combining the text condition Y and the indices of visual tokens Q as the input to the model to generate video content in an AR manner. This process can be described by the following probabilistic model:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"624\" height=\"126\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-5.png\" alt=\"formula\" class=\"wp-image-1121538\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-5.png 624w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-5-300x61.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-5-240x48.png 240w\" sizes=\"auto, (max-width: 624px) 100vw, 624px\" \/><\/figure>\n\n\n\n<p>where Q_AR=[q_1,q_2,\u2026,q_N] is the sequence of visual token indices, and N is the sequence length. \u0398_AR represents the model parameters. The objective of the model is to maximize the probability of the visual token index sequence Q_AR given the text condition Y.<\/p>\n\n\n\n<p>In the <strong>semantic-aware condition generation<\/strong> phase, the ARLON framework uses a video VAE and a latent VQ-VAE to compress the video into a coarse latent space. It uses the tokens predicted by the AR model as semantic conditions for training the diffusion model. This process can be represented by:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"541\" height=\"81\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-6.png\" alt=\"formula\" class=\"wp-image-1121541\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-6.png 541w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-6-300x45.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-6-240x36.png 240w\" sizes=\"auto, (max-width: 541px) 100vw, 541px\" \/><\/figure>\n\n\n\n<p>where x is the input video, E_video is the video encoder, E_latent is the latent VQ-VAE encoder, D_latent is the latent VQ-VAE decoder, and F is the reconstructed latent feature used as the semantic condition.<\/p>\n\n\n\n<p>Semantic injection guides the diffusion process by injecting coarse semantic information into the video generation pipeline. 
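<\/p>\n\n\n\n<p>One plausible reading of this gated adaptive-normalization mechanism is sketched below in numpy. The normalization, the single linear layer standing in for the MLP, and the additive fusion are our illustrative assumptions, not ARLON's actual implementation:<\/p>\n\n\n\n

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each latent vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def semantic_inject(x, cond, w, b):
    # A linear layer (standing in for the MLP) maps the semantic condition
    # to scale (alpha), shift (beta), and gate (gamma) parameters.
    alpha, beta, gamma = np.split(cond @ w + b, 3, axis=-1)
    # Gated fusion: modulate the normalized latent and add it back, so a
    # zero gate leaves the diffusion latent untouched.
    return x + gamma * (alpha * layer_norm(x) + beta)

rng = np.random.default_rng(1)
d = 4
x = rng.normal(size=(2, d))              # DiT latent X_i
cond = rng.normal(size=(2, d))           # coarse semantic condition
w = rng.normal(size=(d, 3 * d)) * 0.1    # toy MLP weights
b = np.zeros(3 * d)
out = semantic_inject(x, cond, w, b)
```

\n\n\n\n<p>The gating term is what makes the injection adaptive: where the condition is uninformative the gate can close, and the diffusion latent passes through unchanged.<\/p>\n\n\n\n<p>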
Formally, the injection involves the following steps:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"896\" height=\"118\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-7.png\" alt=\"formula\" class=\"wp-image-1121544\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-7.png 896w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-7-300x40.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-7-768x101.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-7-240x32.png 240w\" sizes=\"auto, (max-width: 896px) 100vw, 896px\" \/><\/figure>\n\n\n\n<p>where X_i is the input latent variable, F\u0302_i is the condition latent variable processed by uncertainty sampling, \u03b1_i, \u03b2_i, and \u03b3_i are the scale, shift, and gating parameters generated by the multi-layer perceptron (MLP) network, and the &#8220;Fusion&#8221; function injects the condition information into the original latent variable.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"904\" height=\"533\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-8.png\" alt=\"Overview of ARLON's overall framework\" class=\"wp-image-1121547\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-8.png 904w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-8-300x177.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-8-768x453.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-8-240x142.png 240w\" sizes=\"auto, (max-width: 904px) 100vw, 904px\" \/><figcaption class=\"wp-element-caption\">Figure 2.\u00a0Overview of ARLON\u2019s overall 
framework<\/figcaption><\/figure>\n\n\n\n<p>To mitigate the inevitable noise introduced during AR inference, the team adopted the following two strategies during the training phase:<\/p>\n\n\n\n<p><strong>Coarse visual latent tokens<\/strong>: Using different latent VQ-VAE compression ratios for training and inference enhances the diffusion process&#8217;s tolerance to noisy AR predictions.<\/p>\n\n\n\n<p><strong>Uncertainty sampling<\/strong>: To simulate the variance of AR predictions, an uncertainty sampling module was introduced. It generates noise from the distribution of the original coarse latent features F_i rather than strictly relying on those features:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter\"><img loading=\"lazy\" decoding=\"async\" width=\"785\" height=\"60\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-9.png\" alt=\"formula\" class=\"wp-image-1121574\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-9.png 785w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-9-300x23.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-9-768x59.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-9-240x18.png 240w\" sizes=\"auto, (max-width: 785px) 100vw, 785px\" \/><\/figure>\n\n\n\n<p>where \u03bc_i and \u03c3_i are the mean and standard deviation of the coarse latent features, respectively, and F\u203e_i=(F_i-\u03bc_i)\/\u03c3_i is the normalized feature. 
\u03c3\u0302_i and \u03bc\u0302_i are noise vectors sampled from the distributions of the target features\u2019 mean and variance.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"evaluation-results\">Evaluation results<\/h2>\n\n\n\n<p>The team assessed ARLON against other open-source text-to-long-video generation models using VBench metrics, such as dynamic degree, aesthetic quality, imaging quality, subject consistency, overall consistency, background consistency, and motion smoothness. ARLON achieved state-of-the-art performance in long video generation, with significant improvements in both inference efficiency and generation quality. The results, shown in Figure 3, demonstrate that ARLON excels across multiple evaluation metrics, particularly in dynamic degree and aesthetic quality.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"221\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-10-1024x221.png\" alt=\"Comparison of ARLON's performance with other long video generation methods\" class=\"wp-image-1121550\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-10-1024x221.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-10-300x65.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-10-768x166.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-10-240x52.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-10.png 1320w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 3. 
Comparison of ARLON\u2019s performance against other long video generation methods.<\/figcaption><\/figure>\n\n\n\n<p>Qualitative results further highlight ARLON\u2019s ability to maintain both dynamism and consistency in generated videos. Unlike models that generate static or nearly motionless videos, ARLON achieves a better balance among dynamic motion, temporal consistency, and natural smoothness. Its videos retain a high level of subject consistency while exhibiting fluid and lifelike motion.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"550\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-11-1024x550.png\" alt=\"Qualitative comparison between ARLON and other long video methods\" class=\"wp-image-1121553\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-11-1024x550.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-11-300x161.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-11-768x413.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-11-710x380.png 710w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-11-240x129.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-11.png 1323w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 4. 
Qualitative comparison between ARLON and other long video generation methods.<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"512\" height=\"512\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-12.gif\" alt=\"A short video (2s) generated by ARLON from the prompt \u201cMisty mountains at sunrise, with the sun casting a warm glow\u2026\u201d\" class=\"wp-image-1121556\"\/><figcaption class=\"wp-element-caption\">Figure 5. A short, two-second video generated by ARLON based on the description, &#8220;Misty mountains at sunrise, with the sun casting a warm glow.&#8221;<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"161\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-13.gif\" alt=\"Comparison of videos generated by multiple models from the prompt \u201cIn a mesmerizing underwater world, schools of tropical fish, including angelfish, clownfish, and tangs, dart gracefully through the water\u2026\u201d\" class=\"wp-image-1121562\"\/><figcaption class=\"wp-element-caption\">Figure 6.\u00a0Comparison of thirty-second videos generated by multiple models based on the description, &#8220;In a mesmerizing underwater world, schools of tropical fish, including angelfish, clownfish, and tangs, dart\u00a0gracefully through the water.&#8221;<\/figcaption><\/figure>\n\n\n\n<p>ARLON significantly accelerates the DiT model\u2019s denoising process by using AR-predicted latent features as an effective initialization. 
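<\/p>\n\n\n\n<p>The intuition can be sketched with a toy deterministic sampler: instead of starting the reverse process from pure noise, start from a coarse prediction noised to an intermediate level and run only the remaining steps. The schedule, the stand-in denoiser, and all names below are our illustrative assumptions, not ARLON's actual sampler:<\/p>\n\n\n\n

```python
import numpy as np

def noise_to(x0, alpha_bar, eps):
    # Forward diffusion: a noised version of x0 at noise level alpha_bar
    # (alpha_bar near 0 means mostly noise, near 1 means mostly signal).
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

def denoise(x_t, schedule, predict_x0):
    # Toy DDIM-style deterministic reverse loop over increasing alpha_bar.
    for a_cur, a_next in zip(schedule, schedule[1:]):
        x0_hat = predict_x0(x_t, a_cur)                        # denoiser's guess
        eps_hat = (x_t - np.sqrt(a_cur) * x0_hat) / np.sqrt(1.0 - a_cur)
        x_t = np.sqrt(a_next) * x0_hat + np.sqrt(1.0 - a_next) * eps_hat
    return x_t

rng = np.random.default_rng(2)
target = rng.normal(size=(3,))
oracle = lambda x_t, a: target            # stand-in for a trained denoiser
full = [0.02, 0.1, 0.3, 0.6, 0.9, 1.0]    # five steps from near-pure noise
short = [0.6, 0.9, 1.0]                   # two steps from a coarse start
out_full = denoise(noise_to(target, 0.02, rng.normal(size=(3,))), full, oracle)
out_short = denoise(noise_to(target, 0.6, rng.normal(size=(3,))), short, oracle)
```

\n\n\n\n<p>With a good initialization, the shorter schedule reaches the same result in a fraction of the steps, which mirrors the reported reduction in denoising steps.<\/p>\n\n\n\n<p>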
While the baseline model requires 30 steps for denoising, ARLON achieves similar performance in just 5 to 10 steps.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"589\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/Picture1-1024x589.jpg\" alt=\"Comparison of Video Quality at Different Denoising Steps\" class=\"wp-image-1132659\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/Picture1-1024x589.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/Picture1-300x172.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/Picture1-768x441.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/Picture1-1536x883.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/Picture1-2048x1177.jpg 2048w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/Picture1-240x138.jpg 240w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 7. 
Comparison of video quality at different denoising steps.<\/figcaption><\/figure>\n\n\n\n<p>Additionally, ARLON supports long video generation through progressive text prompts, enabling the model to generate videos based on a series of gradually changing text prompts while preserving the coherence of the video content during prompt transitions.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"546\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-15-1024x546.png\" alt=\"Comparison of ARLON with other models in progressive-text long video generation\" class=\"wp-image-1121565\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-15-1024x546.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-15-300x160.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-15-768x409.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-15-710x380.png 710w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-15-240x128.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-15.png 1319w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 8.\u00a0Comparison of ARLON with other models in progressive, text-based, long video generation.<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img decoding=\"async\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/01\/arlon-16.gif\" alt=\"Video generated by ARLON from the progressive prompts \u201cA majestic dormant volcano rises in the center\u2026\u201d and \u201cAn erupting volcano dominates the scene\u2026\u201d\" class=\"wp-image-1121568\"\/><figcaption class=\"wp-element-caption\">Figure 9.\u00a0Video generated by ARLON based on the progressive multi-text prompts, &#8220;A majestic dormant volcano rises in the center,&#8221; and &#8220;An erupting volcano dominates the scene.&#8221;<\/figcaption><\/figure>\n\n\n\n<p><em>Note: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/aka.ms\/arlon\">ARLON<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> is a research project. While it can synthesize long videos with dynamic scenes, their realism and naturalness depend on factors such as the length, quality, and context of the video prompts. The model carries potential risks of misuse, including forging video content or impersonating specific scenes. In video generation research, applying the model to new, real-world scenarios requires agreements with relevant stakeholders for the use of video content and the integration of synthetic video detection models. If you suspect that ARLON is being misused, used illegally, or infringing on your rights or the rights of others, report it through the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/msrc.microsoft.com\/report\/\">Microsoft abuse reporting portal<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/em><\/p>\n\n\n\n<p><em>The rapid development of AI has made trustworthy AI systems an urgent issue. Microsoft has taken proactive measures to anticipate and mitigate risks associated with AI technologies and is committed to promoting the development of AI in accordance with human-centered ethical principles. 
In 2018, Microsoft introduced six <a href=\"https:\/\/www.microsoft.com\/en-us\/ai\/responsible-ai\/\">Responsible AI Principles<\/a>: fairness, inclusiveness, reliability and safety, transparency, privacy and security, and accountability. These principles were later formalized through the Responsible AI Standards, supported by a governance framework to ensure that Microsoft teams integrate them into their daily workflows. Microsoft is continuing to collaborate with researchers and academic institutions worldwide to advance responsible AI practices and technologies.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The rapid development of AI has steadily advanced the field of text-to-video (T2V) generation, offering a rich and convenient video content creation experience and unlocking new possibilities in entertainment, education, and multimedia communication. Traditional T2V methods, however, are limited due to a lack of data and computational resources, making it difficult to generate long videos 
[&hellip;]<\/p>\n","protected":false},"author":34512,"featured_media":1121580,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":199560,"msr_hide_image_in_river":null,"footnotes":""},"research-area":[13556],"msr-locale":[268875],"msr-post-option":[],"class_list":["post-1132587","msr-blog-post","type-msr-blog-post","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_assoc_parent":{"id":199560,"type":"lab"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1132587","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/34512"}],"version-history":[{"count":7,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1132587\/revisions"}],"predecessor-version":[{"id":1132686,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1132587\/revisions\/1132686"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1121580"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1132587"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1132587"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1132587"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.
com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1132587"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}