In this digital age, people spend a significant portion of their lives online, which has led to an explosion of personal data about users and their activities. Typically, this data is private, and nobody except the user is allowed to look at it. This poses interesting and complex challenges from the point of view of scalable information extraction (IE): information must be extracted under privacy-aware constraints, where there is little data to learn from, yet highly accurate models are needed to run on large amounts of data across different users. Anonymization is typically used to convert private data into publicly accessible data. However, this may not always be feasible, and it may require complex differential privacy guarantees to be safe from potential negative consequences. Other techniques involve building models on a small amount of seen (eyes-on) data and a large amount of unseen (eyes-off) data. In this tutorial, we use emails as representative private data to explain the concepts of scalable IE under privacy-aware constraints.
Around 270 billion emails are sent and received per day, and more than 60% of them are business-to-consumer (B2C) emails. At Microsoft, we have developed information extraction systems that extract relevant information from these emails for a large number of scenarios (e.g., flights, hotels, appointments), for thousands of sender domains (e.g., Amazon, Hilton, British Airways) and templates (HTML DOM structures), in order to power a number of AI applications (e.g., flight reminders, package tracking). Building on the constraints described above, these are the challenges that we need to overcome to develop information extraction systems for emails:
Privacy: For legal and trust reasons, an email and its derivatives should be accessible only to the person for whom it is intended. Thus, we cannot directly apply the web IE techniques used to extract information from webpages.
Efficiency: Because we need to process billions of emails every day, and these emails differ from user to user, extraction models need to be very efficient.
Scalability: There are a large number of variations in the way information is presented in emails. For example, a flight itinerary is represented in different ways by different providers.
Multi-lingual: Users are located across geographies, and hence, the information extraction systems need to work across multiple languages.
To extract information from B2C emails, one needs to classify the emails, cluster them into possible templates, build models to extract information from them, and monitor the models to maintain high precision and recall. How do IE techniques for private eyes-off data differ from those for eyes-on HTML data? How can labeled data be obtained in a privacy-preserving manner? What are the different techniques for generating semi-labeled data and learning from it? How can scalable extraction models be built across many sender domains that represent information in different ways? How can these models be monitored with minimal human intervention? In this tutorial, we address all these questions from perspectives ranging from research to production.
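As an illustration of the template-clustering step, the following is a minimal Python sketch, not the system described in this tutorial. It assumes that a template can be approximated by the sequence of HTML tags and their nesting depths, so that emails can be grouped without reading any of their text, which fits the eyes-off setting. The names `TagShapeCollector`, `template_signature`, and `cluster_by_template` are hypothetical and chosen only for this example.

```python
from collections import defaultdict
from hashlib import sha1
from html.parser import HTMLParser
from typing import Dict, List


class TagShapeCollector(HTMLParser):
    """Records each opening tag with its nesting depth and ignores all text."""

    # Elements that never have a closing tag, so they must not change the depth.
    VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0
        self.shape: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.shape.append(f"{self.depth}:{tag}")
        if tag not in self.VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in self.VOID_TAGS and self.depth > 0:
            self.depth -= 1


def template_signature(html: str) -> str:
    """Hashes the tag structure only; the user's text never enters the signature."""
    collector = TagShapeCollector()
    collector.feed(html)
    collector.close()
    return sha1("|".join(collector.shape).encode("utf-8")).hexdigest()[:12]


def cluster_by_template(emails: Dict[str, str]) -> Dict[str, List[str]]:
    """Groups email ids whose HTML bodies share the same structural signature."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for email_id, html in emails.items():
        clusters[template_signature(html)].append(email_id)
    return dict(clusters)


# Made-up example bodies: "a" and "b" share a layout, "c" does not.
emails = {
    "a": "<table><tr><td>Flight BA117</td></tr></table>",
    "b": "<table><tr><td>Flight UA90</td></tr></table>",
    "c": "<div><p>Your package has shipped</p></div>",
}
clusters = cluster_by_template(emails)
```

Here emails "a" and "b" fall into one cluster and "c" into another. Exact matching of the signature is a deliberate simplification: real templates often contain a variable number of repeated rows (for example, one row per flight leg), so a practical system would likely need a similarity measure over structures rather than strict equality.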