{"id":860352,"date":"2022-07-08T09:00:01","date_gmt":"2022-07-08T16:00:01","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/"},"modified":"2024-03-25T04:15:38","modified_gmt":"2024-03-25T11:15:38","slug":"ink-intensive-neural-knowledge-aligned-image-text-retrieval","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/ink-intensive-neural-knowledge-aligned-image-text-retrieval\/","title":{"rendered":"INK: Intensive Neural Knowledge"},"content":{"rendered":"<p>Knowledge-based vision language systems are increasingly ubiquitous in our everyday lives. However, despite the introduction of numerous benchmarks, the community has siloed models of different types of knowledge rather than building general knowledge-intensive models that encompass both commonsense and factoid knowledge. We introduce INK \u2013 Intensive Neural Knowledge \u2013 a new task that involves extracting the necessary knowledge to accurately perform image and text retrieval. In particular, INK leverages existing resources to require understanding of factoid, object-commonsense, or social-consciousness knowledge to successfully perform retrieval. Finally, we provide a set of competitive baseline models whose weak performance motivates the need to develop new knowledge understanding models and systems.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Knowledge-based vision language systems are increasingly ubiquitous in our everyday lives. 
However, despite the introduction of numerous benchmarks, the community has siloed models of different types of knowledge rather than building general knowledge-intensive models that encompass both commonsense and factoid knowledge. We introduce INK \u2013 Intensive Neural Knowledge \u2013 a new task that involves extracting [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_publishername":"","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"","msr_editors":"","msr_how_published":"","msr_isbn":"","msr_issue":"","msr_journal":"","msr_number":"MSR-TR-2022-38","msr_organization":"Microsoft MSR-TR-2022-38","msr_pages_string":"","msr_page_range_start":"","msr_page_range_end":"","msr_series":"","msr_volume":"","msr_copyright":"","msr_conference_name":"","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"","msr_pubmed_id":"","msr_other_authors":"","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":"","msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2022-7-1","msr_highlight_text":"","msr_notes":"","msr_longbiography":"","msr_publicationurl":"","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":0,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":true,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[],"msr_hide_image_in_river":0,"footnotes":""},"msr-research-highlight":[],"research-area":[13556,13545],"msr-publication-type":[193718],"msr-publisher":[],"msr-focus-area"
:[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-860352","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-artificial-intelligence","msr-research-area-human-language-technologies","msr-locale-en_us"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2022-7-1","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"MSR-TR-2022-38","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"Microsoft MSR-TR-2022-38","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/07\/INK.pdf","id":"893961","title":"ink-2","label_id":"243109","label":0}],"msr_related_uploader":"","msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attachments":[{"id":893961,"url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2022\/10\/INK.pdf"}],"msr-author-ordering":[{"type":"text","value":"James Park","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Qiuyuan Huang","user_id":36356,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Qiuyuan 
Huang"},{"type":"guest","value":"yonatan-bisk","user_id":788162,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=yonatan-bisk"},{"type":"text","value":"Jianwei Yang","user_id":0,"rest_url":false},{"type":"guest","value":"subhojit-som","user_id":795560,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=subhojit-som"},{"type":"guest","value":"ali-farhadi","user_id":785470,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=ali-farhadi"},{"type":"guest","value":"yejin-choi","user_id":474327,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=yejin-choi"},{"type":"text","value":"Jianfeng Gao","user_id":0,"rest_url":false}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[],"msr_group":[144931],"msr_project":[788159],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"techreport","related_content":{"projects":[{"ID":788159,"post_title":"Agent AI","post_name":"agent-ai","post_type":"msr-project","post_date":"2023-09-25 21:53:00","post_modified":"2024-02-28 07:03:22","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/agent-ai\/","post_excerpt":"Agent-based multimodal AI systems are becoming a ubiquitous presence in our everyday lives. A promising direction for making these systems more interactive is to embody them as agents within specific environments. The grounding of large foundation models to act as agents within specific environments can provide a way of incorporating visual and contextual information into an embodied system. 
For example, a system that can perceive user actions, human behavior, environment objects, audio expressions, and the&hellip;","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/788159"}]}}]},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/860352","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":8,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/860352\/revisions"}],"predecessor-version":[{"id":1015590,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/860352\/revisions\/1015590"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=860352"}],"wp:term":[{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=860352"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=860352"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=860352"},{"taxonomy":"msr-publisher","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publisher?post=860352"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=860352"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=860352"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/
www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=860352"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=860352"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=860352"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=860352"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=860352"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=860352"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}