{"id":973743,"date":"2023-10-06T11:19:49","date_gmt":"2023-10-06T18:19:49","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&#038;p=973743"},"modified":"2024-06-10T09:40:18","modified_gmt":"2024-06-10T16:40:18","slug":"llava-large-language-and-vision-assistant","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/llava-large-language-and-vision-assistant\/","title":{"rendered":"LLaVA: Large Language and Vision Assistant"},"content":{"rendered":"\n<div class=\"wp-block-group is-layout-constrained wp-block-group-is-layout-constrained\">\n<div class=\"wp-block-group is-layout-constrained wp-block-group-is-layout-constrained\"><section class=\"mb-3 moray-highlight\">\n\t<div class=\"card-img-overlay mx-lg-0\">\n\t\t<div class=\"card-background bg-gray-200 has-background- card-background--full-bleed\">\n\t\t\t\t\t<\/div>\n\t\t<!-- Foreground -->\n\t\t<div class=\"card-foreground d-flex mt-md-n5 my-lg-5 px-g px-lg-0\">\n\t\t\t<!-- Container -->\n\t\t\t<div class=\"container d-flex mt-md-n5 my-lg-5 \">\n\t\t\t\t<!-- Card wrapper -->\n\t\t\t\t<div class=\"w-100 \">\n\t\t\t\t\t<!-- Card -->\n\t\t\t\t\t<div class=\"card material-md-card py-5 px-md-5\">\n\t\t\t\t\t\t<div class=\"card-body \">\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n<h3 class=\"wp-block-heading\" id=\"building-next-gen-multimodal-foundation-models-for-general-purpose-assistants\">Building Next-Gen Multimodal Foundation Models for General-Purpose Assistants<\/h3>\n\n\n\n<p>LLaVA is an open-source project, collaborating with research community to advance the state-of-the-art in AI. LLaVA represents the first end-to-end trained large multimodal model (LMM) that achieves impressive chat capabilities mimicking spirits of the multimodal GPT-4. The LLaVA family continues growing to support more modalities, capabilities, applications and beyond.<\/p>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n<\/div>\n<\/div>\n\n\n\n\n\n<p>LLaVA represents a cost-efficient approach to building general-purpose multimodal assistant. 
It is a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding, achieving impressive chat capabilities mimicking the spirit of the multimodal GPT-4 and setting a new state-of-the-art accuracy on Science QA.<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\">\n<div class=\"wp-block-group is-horizontal is-content-justification-center is-nowrap is-layout-flex wp-container-core-group-is-layout-c1dfda1e wp-block-group-is-layout-flex\">\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\"><\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\">\n<figure class=\"wp-block-image size-full is-resized is-style-spectrum\"><img loading=\"lazy\" decoding=\"async\" width=\"697\" height=\"252\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/10\/mfm_evolution.jpeg\" alt=\"evolution\" class=\"wp-image-973812\" style=\"width:669px\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/10\/mfm_evolution.jpeg 697w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/10\/mfm_evolution-300x108.jpeg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2023\/10\/mfm_evolution-240x87.jpeg 240w\" sizes=\"auto, (max-width: 697px) 100vw, 697px\" \/><\/figure>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"recent-development\">Recent development<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>LLaVA<\/strong>: The first open-source alternative to GPT-4V. 
[<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/llava-vl.github.io\/\" target=\"_blank\" rel=\"noopener noreferrer\">Project<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>] [<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2304.08485\" target=\"_blank\" rel=\"noopener noreferrer\">Paper<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>] [<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/haotian-liu\/LLaVA\" target=\"_blank\" rel=\"noopener noreferrer\">Github<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>] [<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/llava.hliu.cc\/\" target=\"_blank\" rel=\"noopener noreferrer\">Demo<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>] [<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/datasets\/liuhaotian\/LLaVA-Instruct-150K\" target=\"_blank\" rel=\"noopener noreferrer\">Data<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>] [<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/huggingface.co\/liuhaotian\/LLaVA-13b-delta-v0\" target=\"_blank\" rel=\"noopener noreferrer\">Model<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>] [<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2309.09958\" target=\"_blank\" rel=\"noopener noreferrer\">Scaling Note<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>]<\/li>\n\n\n\n<li><strong>LLaVA-Med<\/strong>: The first multimodal assistant in the healthcare domain [<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/aka.ms\/llava-med\">Github<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>] [<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/2306.00890\">Paper<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>]<\/li>\n\n\n\n<li><strong>LLaVA-Interactive<\/strong>: An all-in-one demo to demonstrate the visual interaction\/generation capabilities beyond language interaction alone, supported by <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/haotian-liu\/LLaVA\">LLaVA<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/UX-Decoder\/Segment-Everything-Everywhere-All-At-Once\">SEEM<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/gligen\/GLIGEN\">GLIGEN<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/li>\n\n\n\n<li><strong>Multimodal Foundation Models<\/strong>: A 118-page survey on the evolution, trends and our position of multimodal foundation models:&nbsp;<a class=\"msr-external-link 
glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2309.10020\" target=\"_blank\" rel=\"noopener noreferrer\">&#8220;Multimodal Foundation Models: From Specialists to General-Purpose Assistants&#8221;<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. This is based on the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/vlp-tutorial.github.io\/2023\" target=\"_blank\" rel=\"noopener noreferrer\">CVPR 2023 Tutorial<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. [<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2306.14895\" target=\"_blank\" rel=\"noopener noreferrer\">Note on Large Multimodal Models<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>] [<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/datarelease.blob.core.windows.net\/tutorial\/vision_foundation_models_2023\/slides\/Chunyuan_cvpr2023_tutorial_lmm.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">Slides<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>] [<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/youtu.be\/mkI7EPD1vp8\" target=\"_blank\" rel=\"noopener noreferrer\">YouTube<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>] [<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.bilibili.com\/video\/BV1Ng4y1T7v3\/\" target=\"_blank\" rel=\"noopener noreferrer\">Bilibili<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>]<\/li>\n\n\n\n<li><strong>Instruction Tuning with GPT-4<\/strong>: The &#8220;first attempt&#8221; to use GPT-4 data for LLM self-instruct tuning. [<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/instruction-tuning-with-gpt-4.github.io\/\">Project<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>] [<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/arxiv.org\/abs\/2304.03277\" target=\"_blank\" rel=\"noopener noreferrer\">Paper<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>] [<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/Instruction-Tuning-with-GPT-4\/GPT-4-LLM\" target=\"_blank\" rel=\"noopener noreferrer\">Github<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>] [<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.linkedin.com\/feed\/update\/urn:li:activity:7049992414995902464\/\" target=\"_blank\" rel=\"noopener noreferrer\">My Learnings<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>]<\/li>\n<\/ul>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-9d6595d7 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:100%\"><\/div>\n<\/div>\n\n\n","protected":false},"excerpt":{"rendered":"<p>LLaVA is an open-source project that collaborates with the research community to advance the state of the art in AI. LLaVA represents the first end-to-end trained large multimodal model (LMM) that achieves impressive chat capabilities mimicking the spirit of the multimodal GPT-4. 
The LLaVA family continues to grow, supporting more modalities, capabilities, applications, and beyond. LLaVA represents a cost-efficient approach to [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13556,13562,13545],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-973743","msr-project","type-msr-project","status-publish","hentry","msr-research-area-artificial-intelligence","msr-research-area-computer-vision","msr-research-area-human-language-technologies","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"","related-publications":[952167,973821,973851,973965],"related-downloads":[],"related-videos":[1003980],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[],"slides":[],"related-researchers":[{"type":"user_nicename","display_name":"Hao Cheng","user_id":39922,"people_section":"Related people","alias":"chehao"},{"type":"user_nicename","display_name":"Michel Galley","user_id":32887,"people_section":"Related people","alias":"mgalley"},{"type":"user_nicename","display_name":"Jianfeng Gao","user_id":32246,"people_section":"Related people","alias":"jfgao"},{"type":"guest","display_name":"Yong Jae Lee","user_id":973803,"people_section":"Related people","alias":""},{"type":"user_nicename","display_name":"Lars Liden","user_id":32612,"people_section":"Related people","alias":"laliden"},{"type":"guest","display_name":"Haotian Liu","user_id":973785,"people_section":"Related people","alias":""},{"type":"user_nicename","display_name":"Xiaodong Liu","user_id":34877,"people_section":"Related people","alias":"xiaodl"},{"type":"guest","display_name":"Yadong Lu","user_id":973830,"people_section":"Related people","alias":""},{"type":"user_nicename","display_name":"Tristan Naumann","user_id":37929,"people_section":"Related people","alias":"tristan"},{"type":"user_nicename","display_name":"Hoifung Poon","user_id":32016,"people_section":"Related people","alias":"hoifung"},{"type":"guest","display_name":"Yelong Shen","user_id":973833,"people_section":"Related people","alias":""},{"type":"user_nicename","display_name":"Swadheen Shukla","user_id":38248,"people_section":"Related people","alias":"swads"},{"type":"user_nicename","display_name":"Andrea Tupini","user_id":40339,"people_section":"Related people","alias":"andreatupini"},{"type":"user_nicename","display_name":"Naoto Usuyama","user_id":38670,"people_section":"Related people","alias":"naotous"},{"type":"user_nicename","display_name":"Yu Wang","user_id":40783,"people_section":"Related people","alias":"yuwan"},{"type":"user_nicename","display_name":"Cliff Wong","user_id":38508,"people_section":"Related people","alias":"clwon"},{"type":"user_nicename","display_name":"Sheng Zhang","user_id":39087,"people_section":"Related 
people","alias":"shezhan"}],"msr_research_lab":[],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/973743","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":13,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/973743\/revisions"}],"predecessor-version":[{"id":1045062,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/973743\/revisions\/1045062"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=973743"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=973743"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=973743"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=973743"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=973743"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}