Web Data Extraction and Search

Established: February 9, 2013

The goal of this project is to extract structured data from the web (such as HTML tables, lists, and spreadsheets) and make it accessible and searchable on Bing and Office 365.

Some of the technical challenges:

  • Table classification and understanding: The vast majority of HTML tables are used for formatting/layout purposes; they do not contain any useful content. How do we automatically filter out such tables? Furthermore, there are various types of tables, such as relational tables (each row corresponds to a different entity and each column corresponds to a different attribute) and attribute-value tables (each row corresponds to a different attribute, e.g., tables on dpreview.com). How do we automatically distinguish these table types from each other? For relational tables, there is typically a column (or a set of columns) that contains the subject entities. How do we identify this column or set of columns? For attribute-value tables, how do we identify the subject entity?
  • Query classification: For the Bing table answer feature, we want to show a table only if the intent of the query is a table (or part of a table), not simply because a well-matching table is available. How do we identify such queries?
  • Table matching and ranking: For the Bing table answer, how do we identify the best table or part of a table (if one exists) for a query with table intent? In Excel table search, how do we rank the tables in response to a keyword search query?
  • New modes of search: Keyword search may not be the only way to search for structured information. In a spreadsheet setting, other modes of search are possible, such as entity augmentation and attribute discovery, as proposed in the InfoGather/InfoGather+ papers.
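
To make the table classification challenge more concrete, here is a minimal Python sketch of the kind of hand-written heuristics one might start from. It is not the project's method: the function names (looks_like_layout, classify, subject_column), the thresholds, and the sample tables are all hypothetical, and a real system would most likely use learned models over far richer features. In particular, the two-column rule below would mislabel a two-column relational table, which illustrates why the problem is hard.

```python
from typing import List, Optional

Table = List[List[str]]  # rows of cell strings; row 0 is treated as a header for relational tables


def is_numeric(cell: str) -> bool:
    """Return True if the cell parses as a number after dropping commas, %, $ and spaces."""
    try:
        float(cell.replace(",", "").strip("%$ "))
        return True
    except ValueError:
        return False


def looks_like_layout(table: Table) -> bool:
    """Assumed filter: single-row, ragged, or single-column tables are treated as layout-only."""
    if len(table) < 2:
        return True
    widths = {len(row) for row in table}
    return len(widths) > 1 or max(widths) < 2


def classify(table: Table) -> str:
    """Label a table as 'layout', 'attribute-value', or 'relational' (crude heuristic)."""
    if looks_like_layout(table):
        return "layout"
    if len(table[0]) == 2:
        # Two columns whose first column holds distinct, non-numeric labels
        # are assumed to be attribute/value pairs describing a single entity.
        labels = [row[0] for row in table]
        if len(set(labels)) == len(labels) and not any(is_numeric(x) for x in labels):
            return "attribute-value"
    return "relational"


def subject_column(table: Table) -> Optional[int]:
    """Return the leftmost column whose body values are all distinct and non-numeric."""
    body = table[1:]  # skip the header row
    for col in range(len(table[0])):
        values = [row[col] for row in body]
        if len(set(values)) == len(values) and not any(is_numeric(v) for v in values):
            return col
    return None


# Made-up sample tables, for illustration only.
cities = [
    ["Rank", "City", "Country"],
    ["1", "Tokyo", "Japan"],
    ["2", "Osaka", "Japan"],
    ["3", "Paris", "France"],
]
spec = [
    ["Sensor size", "Full frame"],
    ["Max resolution", "6000 x 4000"],
    ["Weight", "650 g"],
]
layout = [["Home | About | Contact"]]

if __name__ == "__main__":
    for name, table in [("cities", cities), ("spec", spec), ("layout", layout)]:
        print(name, classify(table))
    print("subject column of cities:", subject_column(cities))
```

In this sketch the "Rank" column is skipped as a subject column because its values are numeric, and "Country" would be rejected because "Japan" repeats, which leaves "City" as the candidate.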

Impact

Our web data research has had a tremendous impact on several Microsoft products and services over the years.

Past interns: Mohamed Yakout, Chi Wang, Meihui Zhang, Mohan Yang

People

Publications