Web Data Extraction and Search

Established: February 9, 2013

The goal of this project is to extract structured data from the web (such as HTML tables, lists, and spreadsheets) and make it accessible and searchable on Bing and Office 365.

Some of the technical challenges:

Table classification and understanding: The vast majority of HTML tables are used for formatting/layout purposes; they do not contain any useful content. How do we automatically filter out such tables? Furthermore, there are various types of tables, such as relational tables (each row corresponds to a different entity and each column corresponds to a different attribute) and attribute-value tables (each row corresponds to a different attribute, e.g., tables on dpreview.com). How do we automatically distinguish these types from each other? For relational tables, there is typically a column (or a set of columns) that contains the subject entities. How do we identify this column (or columns)? For attribute-value tables, how do we identify the subject entity?
Query classification: For the Bing table answer feature, we want to show a table only if the intent of the query is a table (or part of a table), not simply because a well-matching table is available. How do we identify such queries?
Table matching and ranking: For the Bing table answer feature, how do we identify the best table, or part of a table (if one exists), for a query with table intent? In Excel table search, how do we rank tables in response to a keyword search query?
New modes of search: Keyword search may not be the only way to search for structured information. In a spreadsheet setting, other modes of search are possible, such as the entity augmentation and attribute discovery operations proposed in the InfoGather/InfoGather+ papers.
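
To make the table classification questions above more concrete, here is a minimal Python sketch. It is not the project's method: the function names (classify_table, subject_column), the rules, and the thresholds are all assumptions chosen for illustration, and the sample tables contain made-up placeholder data. It treats ragged or single-column tables as layout, two-column tables with text labels on the left as attribute-value, and everything else as relational. It then guesses the subject column as the leftmost column whose values are mostly distinct and non-numeric.

```python
from typing import List, Optional

Table = List[List[str]]  # a table as rows of cell text


def is_numeric(cell: str) -> bool:
    """True if the cell parses as a number (commas allowed as separators)."""
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False


def classify_table(rows: Table) -> str:
    """Toy heuristic: return 'layout', 'attribute-value', or 'relational'."""
    # Ignore rows whose cells are all empty.
    rows = [r for r in rows if any(c.strip() for c in r)]
    widths = {len(r) for r in rows}
    # Assumption: tiny, ragged, or single-column tables are layout scaffolding.
    if len(rows) < 2 or len(widths) != 1 or widths == {1}:
        return "layout"
    # Assumption: two columns with text labels on the left look like
    # attribute-value pairs.
    if widths == {2} and not any(is_numeric(r[0]) for r in rows):
        return "attribute-value"
    return "relational"


def subject_column(rows: Table) -> Optional[int]:
    """Guess the subject column of a relational table (row 0 is the header)."""
    if len(rows) < 2:
        return None
    body = rows[1:]
    for col in range(len(rows[0])):
        values = [r[col].strip() for r in body]
        distinct = len(set(values)) / len(values)
        numeric = sum(is_numeric(v) for v in values) / len(values)
        # Assumed thresholds: mostly unique values, mostly non-numeric.
        if distinct >= 0.9 and numeric <= 0.2:
            return col
    return None


# Placeholder tables with invented values, for illustration only.
cities = [
    ["Rank", "City", "Population"],
    ["1", "Alpha", "9,000,000"],
    ["2", "Beta", "7,500,000"],
    ["3", "Gamma", "6,200,000"],
]
camera = [["Sensor", "24 MP"], ["Weight", "500 g"], ["Lens mount", "Example mount"]]
page_layout = [["", "navigation"], ["main content"]]

for name, table in [("cities", cities), ("camera", camera), ("page_layout", page_layout)]:
    print(name, classify_table(table), subject_column(table))
```

Rules this simple misfire easily; for example, a two-column relational table of countries and capitals would be labeled attribute-value. A realistic classifier would need richer features and labeled data, which is why this remains a research challenge.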

Impact

Our web data research has had a tremendous impact on several Microsoft products and services over the years:

We worked closely with the SQL Server and Excel groups and integrated this technology into Excel in 2013 (as the “Public Data Search” feature in Power Query). It is available as an Excel Add-In as of February 2013. We expect this feature to be part of core Excel in the future.
We worked closely with Bing’s whole page relevance team to ship algorithmically generated table captions in 2015. Try a search like “highest mountains in usa”, “list of futurama characters”, “airports in florida” or “breaking bad episodes” on Bing and see table captions! The table caption is shown as part of the snippet of the top (sometimes second or third) algorithmic result and typically complements the information in the vertical answer (e.g., carousel) shown at the top of the page.
We worked closely with Bing’s question-answering team to ship algorithmically generated table answers in 2016. Bing now shows a table as an answer to a query with list or superlative intent. Try a search like “drugs for high cholesterol”, “largest cities in the world”, “longest life expectancy countries”, “top computer science schools”, “richest county in usa”, “renaissance painters from italy” or “mlb stadiums” on Bing and see table answers! We are working on a V2 to dramatically increase the coverage!

Past interns: Mohamed Yakout, Chi Wang, Meihui Zhang, Mohan Yang

People

Surajit Chaudhuri

Technical Fellow, Data Platforms and Analytics
