{"id":1155826,"date":"2025-11-16T20:21:04","date_gmt":"2025-11-17T04:21:04","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&#038;p=1155826"},"modified":"2025-11-16T20:21:06","modified_gmt":"2025-11-17T04:21:06","slug":"ui-evol-compute-use-agents-act-on-knowledge","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/ui-evol-compute-use-agents-act-on-knowledge\/","title":{"rendered":"UI-Evol: Compute-use Agents Act on Knowledge"},"content":{"rendered":"\n<p>Computer-use agents are AI systems that autonomously navigate and interact with software applications through graphical user interfaces (GUIs), and they are emerging as a new capability in artificial intelligence. By navigating and manipulating the same visual interfaces that people use, they can perform complex tasks on behalf of users, from filling out forms to managing workflows.<\/p>\n\n\n\n<p>Yet despite their promise, these agents perform poorly in practice. They typically draw on external knowledge\u2014information retrieved from the web that describes how to navigate the interfaces in question\u2014and use it to interpret what\u2019s on the screen and adapt to different environments. However, these agents often fail to translate this knowledge into successful action\u2014a problem researchers call the \u201cknowledge\u2013action gap.\u201d<\/p>\n\n\n\n<p>A recent study shows that even when the instructions are 90% correct, agents perform tasks successfully only 41% of the time. 
This disconnect between having the needed information and effectively applying it, illustrated at the top of Figure 1, can lead to a frustrating user experience.<\/p>\n\n\n\n<p>To address this, researchers at Microsoft Research Asia developed <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/ui-evol-automatic-knowledge-evolving-for-computer-use-agents\/\">UI-Evol<\/a>, a ready-to-use component that integrates into an agent\u2019s workflow and relies on the actual user interface for guidance. UI-Evol continuously updates its interface knowledge, helping make agents more accurate and reliable when completing tasks, as shown in the bottom of Figure 1.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"452\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ui-evol-1-1024x452.png\" alt=\"graphical user interface, text, application\" class=\"wp-image-1143912\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ui-evol-1-1024x452.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ui-evol-1-300x133.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ui-evol-1-768x339.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ui-evol-1-1536x679.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ui-evol-1-240x106.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ui-evol-1.png 1874w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 1: The top shows how correct external knowledge still fails to work in real-world settings. 
The bottom shows how UI-Evol narrows this gap by aligning knowledge with the software environment, enabling more reliable performance.<\/figcaption><\/figure>\n\n\n\n<p>This work has been recognized by the research community, with the team\u2019s findings accepted at the ICML 2025 <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/icml.cc\/virtual\/2025\/workshop\/39960\">Workshop on Computer Use Agents<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"how-ui-evol-works\">How UI-Evol works<\/h2>\n\n\n\n<p>UI-Evol addresses the knowledge-action gap through a two-stage process. The first stage, called retrace, records the exact steps an agent takes to finish a task. In this way, the system captures the specific clicks, keystrokes, and other actions that led to the result.<\/p>\n\n\n\n<p>The second stage, critique, reviews those actions against instructions drawn from outside the application. If it finds mismatches, it adjusts the knowledge so that the steps reflect what actually works in practice. Together, these two stages turn external instructions into tested, reliable guidance for agents. 
This process is illustrated in Figure 2.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"544\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ui-evol-2-1024x544.png\" alt=\"graphical user interface, diagram\" class=\"wp-image-1143913\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ui-evol-2-1024x544.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ui-evol-2-300x159.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ui-evol-2-768x408.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ui-evol-2-1536x817.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ui-evol-2-240x128.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ui-evol-2.png 1710w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 2: UI-Evol\u2019s two stages refine outside instructions with the agent\u2019s real actions, producing guidance that works in practice.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"assessing-ui-evol-s-effect-on-performance-reliability\">Assessing UI-Evol\u2019s effect on performance, reliability<\/h2>\n\n\n\n<p>The research team tested UI-Evol on Agent S2, a state-of-the-art computer-use agent. They used the OSWorld benchmark, designed to evaluate multimodal agents on open-ended computer tasks involving real software and workflows. They found that UI-Evol not only improved performance but also made the agent\u2019s behavior more dependable.<\/p>\n\n\n\n<p>Computer-use agents have long shown what researchers call \u201chigh behavioral standard deviation.\u201d In plain terms, the same agent, given the same task, may act differently each time it tries to carry it out. 
This unpredictability has not been a central focus of earlier work, yet it is precisely what limits agents\u2019 usefulness in real-world applications.<\/p>\n\n\n\n<p>With UI-Evol, that pattern shifted. Experiments with agents based on leading LLMs, such as GPT-4o and OpenAI o3, showed not only higher success rates (Table 1) but also greater consistency.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"349\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ui-evol-3-1024x349.png\" alt=\"table\" class=\"wp-image-1143914\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ui-evol-3-1024x349.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ui-evol-3-300x102.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ui-evol-3-768x262.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ui-evol-3-1536x523.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ui-evol-3-240x82.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/07\/ui-evol-3.png 1629w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Table 1: Experiment results on OSWorld. \u201cSR\u201d denotes success rate. The results show that computer-use agents often behaved unpredictably; with UI-Evol, performance improved and behavior became more consistent.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"what-this-means-for-practical-ai\">What this means for practical AI<\/h2>\n\n\n\n<p>UI-Evol tackles a problem that has challenged computer-use agents since their inception: the gap between what they know and what they can reliably do. 
As these agents move from research labs to real-world settings such as office automation, virtual assistants, and robotic process automation, consistency matters as much as capability.<\/p>\n\n\n\n<p>UI-Evol&#8217;s approach\u2014learning from actual agent behavior rather than relying on external knowledge alone\u2014offers a path forward. It&#8217;s not only about making agents smarter; it&#8217;s about making them dependable enough to trust with real work.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Computer-use agents are AI systems that autonomously navigate and interact with software applications through graphical user interfaces (GUIs), and they are emerging as a new capability in artificial intelligence. By navigating and manipulating the same visual interfaces that people use, they can perform complex tasks on behalf of users, from filling out forms to managing [&hellip;]<\/p>\n","protected":false},"author":34512,"featured_media":1143915,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":199560,"msr_hide_image_in_river":null,"footnotes":""},"research-area":[13556],"msr-locale":[268875],"msr-post-option":[],"class_list":["post-1155826","msr-blog-post","type-msr-blog-post","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_assoc_parent":{"id":199560,"type":"lab"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1155826","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/34512"}],"version-hi
story":[{"count":3,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1155826\/revisions"}],"predecessor-version":[{"id":1155829,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1155826\/revisions\/1155829"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1143915"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1155826"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1155826"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1155826"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1155826"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}