{"id":171092,"date":"2013-02-09T02:53:21","date_gmt":"2013-02-09T02:53:21","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/project\/structured-data-search\/"},"modified":"2019-08-19T18:23:22","modified_gmt":"2019-08-20T01:23:22","slug":"structured-data-search","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/structured-data-search\/","title":{"rendered":"Web Data Extraction and Search"},"content":{"rendered":"<p>The goal of this project is to extract structured data on the web (like html tables, lists, spreadsheets, etc.) and make it accessible\/searchable on\u00a0Bing and Office 365.<\/p>\n<p>Some of the technical challenges:<\/p>\n<ul>\n<li><strong>Table classification and understanding<\/strong>: The vast majority of html tables are used for formatting\/layout purposes; they do not contain any useful content. How do we automatically filter out such tables? Furthermore, there are various types of tables, like relational tables (each row corresponds to a different entity and each column corresponds to a different attribute) and attribute-value tables (each row corresponds to a different attribute, e.g., tables on dpreview.com). How do we automatically distinguish these tables from each other?\u00a0For relational tables, there is typically a column (or a set of columns) that contains the subject entities. How do we identify this column (or columns)? For attribute-value tables, how do we identify the subject entity?<\/li>\n<li><strong>Query classification<\/strong>: For the Bing table answer feature, we want to show a table only if the intent of the query is a table (or part of a table), not simply because a table with a great match is available. How do we identify such queries?<\/li>\n<li><strong>Table matching and ranking<\/strong>: For Bing table answer, how do we identify\u00a0the best table or part of a table (if one exists) for a query with table intent? 
In Excel table search, how do we rank the tables in response to a keyword search query?<\/li>\n<li><strong>New modes of search<\/strong>: Keyword search may not be the only way to search for structured information. In a spreadsheet setting, other modes of search are possible, such as the entity augmentation and attribute discovery proposed in the InfoGather\/InfoGather+ papers.<\/li>\n<\/ul>\n<h2>Impact<\/h2>\n<p>Our web data research has had a tremendous impact on several Microsoft products and services over the years:<\/p>\n<ul>\n<li>We worked closely with the SQL Server and Excel groups and integrated this technology into Excel in 2013 (as the &#8220;Public Data Search&#8221; feature in <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/office.microsoft.com\/en-us\/excel\/download-data-explorer-for-excel-FX104018616.aspx\">Power Query<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>). It has been\u00a0available as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/office.microsoft.com\/en-us\/excel\/download-data-explorer-for-excel-FX104018616.aspx\">an Excel Add-In<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>\u00a0since February 2013. 
We expect this feature to be part of core Excel in the future.<\/li>\n<li>We worked closely with Bing&#8217;s whole page relevance team to ship algorithmically generated table captions in 2015.\u00a0Try a search like <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.bing.com\/search?q=highest%20mountains%20in%20usa&qs=n&form=QBRE&sp=-1&pq=highest%20mountains%20in%20usa&sc=8-24&sk=&cvid=1AD10D5FB4564723B0797C6587F95435\">highest mountains in usa<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,\u00a0<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.bing.com\/search?q=list%20of%20futurama%20characters&qs=n&form=QBRE&pq=list%20of%20futurama%20characters&sc=1-27&sp=-1&sk=&cvid=6b0f218a24c84b26a9f331e15f7afadd\">list of futurama characters<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.bing.com\/search?q=airports%20in%20florida&qs=n&form=QBRE&sp=-1&pq=airports%20in%20florida&sc=8-18&sk=&cvid=DC2E03FFD529486981FB8E8BBAE82F44\">airports in florida<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> or <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.bing.com\/search?q=breaking+bad+episodes&qs=n&form=QBLH&sp=-1&pq=breaking+bad+episodes&sc=8-21&sk=&cvid=9B7942BA67AC4830A01C6D6AD564B6C5\">breaking bad episodes<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> on Bing and see table captions! 
The table caption is\u00a0shown as part of the snippet of the top (sometimes the second or third) algorithmic result and typically complements the information in the vertical answer (e.g., the carousel) shown at the top.<\/li>\n<li>We worked closely with Bing&#8217;s question-answering team to ship algorithmically generated table answers in 2016. Bing now shows a table as an answer to a query with list or superlative intent. Try a search like\u00a0<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.bing.com\/search?q=drugs%20for%20high%20cholesterol&qs=n&form=QBRE&sp=-1&pq=drugs%20for%20high%20cholesterol&sc=8-26&sk=&cvid=EB765FCCA8E042F59345CA7F810FCDB6\">drugs for high cholesterol<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,\u00a0<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.bing.com\/search?q=largest%20cities%20in%20the%20world&qs=n&form=QBRE&sp=-1&pq=largest%20cities%20in%20the%20world&sc=8-26&sk=&cvid=1EDD976EE4FC4945ABAA62EF9AD0D529\">largest cities in the world<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,\u00a0<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.bing.com\/search?q=longest%20life%20expectancy%20countries&qs=n&form=QBRE&pq=longest%20life%20expectancy%20countries&sc=2-33&sp=-1&sk=&cvid=D0990FC5B3B0464EB5FB72B4B866771D&toHttps=1&redig=70881B6524B7405CAAA8DB56C31DDA5B\">longest life expectancy countries<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,\u00a0<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" 
href=\"https:\/\/www.bing.com\/search?q=top%20computer%20science%20schools&qs=n&form=QBRE&sp=-1&pq=top%20computer%20science%20schools&sc=8-27&sk=&cvid=02396CF2E57648B488B6BF31DEBE5445\">top computer science schools<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,\u00a0<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.bing.com\/search?q=richest+county+in+usa&FORM=EDGNNC\">richest county in usa<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>,\u00a0<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.bing.com\/search?q=renaissance%20painters%20from%20italy&qs=n&form=QBRE&pq=renaissance%20painters%20from%20italy&sc=1-31&sp=-1&sk=&cvid=A9CD27CC64C849299EAEBDD6F5966399&toHttps=1&redig=B3239F972C0648BEA6DE6844939DF500\">renaissance painters from italy<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>\u00a0or\u00a0<a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.bing.com\/search?q=mlb%20stadiums&qs=n&form=QBRE&sp=-1&pq=mlb%20stadiums&sc=8-12&sk=&cvid=6C355041364342FD91A3F62BE9265AAC\">mlb stadiums<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>\u00a0on Bing and see table answers! We are working on a V2 to dramatically increase the coverage!<\/li>\n<\/ul>\n<p>Past interns: Mohamed Yakout, Chi Wang, Meihui Zhang, Mohan Yang<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The goal of this project is to extract structured data on the web (like html tables, lists, spreadsheets etc.) and make it accessible\/searchable on\u00a0Bing and Office 365. 
Some of the technical challenges: Table classification and understanding: The vast majority of html tables are used for formatting\/layout purposes; they do not contain any useful content. [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13563,13555],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-171092","msr-project","type-msr-project","status-publish","hentry","msr-research-area-data-platform-analytics","msr-research-area-search-information-retrieval","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"2013-02-09","related-publications":[162425,164287,167035,357899],"related-downloads":[],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[],"slides":[],"related-researchers":[{"type":"user_nicename","display_name":"Surajit Chaudhuri","user_id":33764,"people_section":"Group 
1","alias":"surajitc"}],"msr_research_lab":[199565],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/171092","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":3,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/171092\/revisions"}],"predecessor-version":[{"id":392360,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/171092\/revisions\/392360"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=171092"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=171092"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=171092"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=171092"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=171092"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}