{"id":1005141,"date":"2024-02-07T07:16:41","date_gmt":"2024-02-07T15:16:41","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&#038;p=1005141"},"modified":"2024-05-21T17:22:48","modified_gmt":"2024-05-22T00:22:48","slug":"interactive-agent-foundation-model","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/interactive-agent-foundation-model\/","title":{"rendered":"An Interactive Agent Foundation Model"},"content":{"rendered":"<p>The development of artificial intelligence systems is transitioning from creating static, task-specific models to dynamic, agent-based systems capable of performing well in a wide range of applications. We propose an <strong>Agent Foundation Model<\/strong> that uses a novel multi-task agent training paradigm for training AI agents across a wide range of domains, datasets, and tasks. Our training paradigm unifies diverse pretraining strategies, including visual masked auto-encoders, language modeling, and next-action prediction, enabling a versatile and adaptable AI framework. 
We demonstrate the performance of our framework across three separate domains: Robotics, Gaming AI, and Healthcare. Our model demonstrates its ability to generate meaningful and contextually relevant outputs in each area. The strength of our approach lies in its generality, leveraging a variety of data sources such as robotics sequences, gameplay data, large-scale video datasets, and textual information for effective multimodal and multi-task learning. Our approach provides a promising avenue for developing generalist, action-taking, multimodal systems.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The development of artificial intelligence systems is transitioning from creating static, task-specific models to dynamic, agent-based systems capable of performing well in a wide range of applications. We propose an Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents across a wide range of domains, datasets, and tasks. 
Our training [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_publishername":"arXiv","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"","msr_editors":"","msr_how_published":"","msr_isbn":"","msr_issue":"","msr_journal":"","msr_number":"","msr_organization":"","msr_pages_string":"","msr_page_range_start":"","msr_page_range_end":"","msr_series":"","msr_volume":"","msr_copyright":"","msr_conference_name":"","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"","msr_pubmed_id":"","msr_other_authors":"","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":"","msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2024-2-1","msr_highlight_text":"","msr_notes":"","msr_longbiography":"","msr_publicationurl":"","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":0,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":true,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[],"msr_hide_image_in_river":0,"footnotes":""},"msr-research-highlight":[],"research-area":[13556,13545,13554,13553],"msr-publication-type":[193716],"msr-publisher":[],"msr-focus-area":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[268140,246658,268332,268266,247039,249835],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-1005141","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-artificial-intelligence","msr
-research-area-human-language-technologies","msr-research-area-human-computer-interaction","msr-research-area-medical-health-genomics","msr-locale-en_us","msr-field-of-study-agent-ai","msr-field-of-study-deep-learning","msr-field-of-study-embodied-ai","msr-field-of-study-gaming","msr-field-of-study-health-care","msr-field-of-study-robotics"],"msr_publishername":"arXiv","msr_edition":"","msr_affiliation":"","msr_published_date":"2024-2-1","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/Agent-fondation-model.pdf","id":"1005228","title":"agent-fondation-model","label_id":"243109","label":0},{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/arxiv.org\/pdf\/2402.05929.pdf","label_id":"243109","label":0}],"msr_related_uploader":[{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/Agent-fondation-model.pdf","id":"1005228","title":"agent-fondation-model","label_id":"243118","label":0},{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/arxiv.org\/pdf\/2402.05929.pdf","label_id":"243118","label":0}],"msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attac
hments":[{"id":1005228,"url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/02\/Agent-fondation-model.pdf"}],"msr-author-ordering":[{"type":"text","value":"Zane Durante","user_id":0,"rest_url":false},{"type":"text","value":"Bidipta Sarkar","user_id":0,"rest_url":false},{"type":"text","value":"Ran Gong","user_id":0,"rest_url":false},{"type":"text","value":"Rohan Taori","user_id":0,"rest_url":false},{"type":"guest","value":"yusuke-noda","user_id":969939,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=yusuke-noda"},{"type":"text","value":"Paul Tang","user_id":0,"rest_url":false},{"type":"text","value":"Ehsan Adeli","user_id":0,"rest_url":false},{"type":"text","value":"Shrinidhi Kowshika Lakshmikanth","user_id":0,"rest_url":false},{"type":"text","value":"Kevin Schulman","user_id":0,"rest_url":false},{"type":"text","value":"Arnold Milstein","user_id":0,"rest_url":false},{"type":"guest","value":"demetri-terzopoulos-2","user_id":981291,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=demetri-terzopoulos-2"},{"type":"user_nicename","value":"Ade Famoti","user_id":43005,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Ade Famoti"},{"type":"text","value":"Noboru Kuno","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Ashley Llorens","user_id":39964,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Ashley Llorens"},{"type":"guest","value":"hoi-vo-3","user_id":981312,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=hoi-vo-3"},{"type":"user_nicename","value":"Katsushi Ikeuchi","user_id":32500,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Katsushi 
Ikeuchi"},{"type":"guest","value":"fei-fei-li","user_id":969957,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=fei-fei-li"},{"type":"user_nicename","value":"Jianfeng Gao","user_id":32246,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Jianfeng Gao"},{"type":"user_nicename","value":"Naoki Wake","user_id":39916,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Naoki Wake"},{"type":"user_nicename","value":"Qiuyuan Huang","user_id":36356,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Qiuyuan Huang"}],"msr_impact_theme":[],"msr_research_lab":[199565],"msr_event":[],"msr_group":[144931,668253],"msr_project":[788159],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"inproceedings","related_content":{"projects":[{"ID":788159,"post_title":"Agent AI","post_name":"agent-ai","post_type":"msr-project","post_date":"2023-09-25 21:53:00","post_modified":"2024-02-28 07:03:22","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/agent-ai\/","post_excerpt":"Agent-based multimodal AI systems are becoming a ubiquitous presence in our everyday lives. A promising direction for making these systems more interactive is to embody them as agents within specific environments. The grounding of large foundation models to act as agents within specific environments can provide a way of incorporating visual and contextual information into an embodied system. 
For example, a system that can perceive user actions, human behavior, environment objects, audio expressions, and the&hellip;","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/788159"}]}}]},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1005141","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":9,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1005141\/revisions"}],"predecessor-version":[{"id":1038630,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1005141\/revisions\/1038630"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1005141"}],"wp:term":[{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=1005141"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1005141"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=1005141"},{"taxonomy":"msr-publisher","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publisher?post=1005141"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=1005141"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1005141"},{"taxonomy":"msr-post-option","embeddable":true,"href":"
https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1005141"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=1005141"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=1005141"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=1005141"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1005141"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=1005141"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}