Large-Scale Information Extraction under Privacy-Aware Constraints

  1. Differences between information extraction from web pages and from emails
  2. Handling privacy issues  
    1. Anonymization of emails while keeping them useful for model building
    2. Templatization of emails
  3. Semi-supervised techniques to generate labeled data
    1. Concepts of semi-supervision; using structural similarity
    2. Active learning
    3. Transfer learning
    4. Knowledge distillation, Teacher-student architecture
    5. Weak labeling, data programming
  4. Text classification across languages with limited data
    1. Generating labeled data for English 
    2. Using English data to create classifiers for other languages (Spanish, Portuguese, etc.)
  5. Handling scalability issues in model building
    1. Adapting web IE techniques by writing wrappers: their limitations, conjunctive and disjunctive template (DOM) based models, etc.
    2. Scalability issues due to number of scenarios, sender domains, and templates
    3. Rule induction: Programming by examples
    4. Machine learning approaches: LR, CRF, LSTM, etc.
    5. Combining rule induction and machine learning to get high precision and recall with high coverage 
      1. Ensemble approach: automated clustering, ML models to identify individual fields, and XPath-based rules generated for each cluster.
      2. Iterative approach: use ML to build models that work well for seen templates and, to some extent, for unseen templates; feed the probabilistically labeled data to the rule-induction module; apply a semi-supervised step to resolve discrepancies between the ML and rule-induction outputs; iterate several times to improve performance and coverage.
  6. Efficient monitoring to maintain high precision and recall 
    1. Sampling to identify precision and recall gaps
    2. Anomaly detection algorithms
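
As an illustration of item 6.1, the sketch below shows one possible way to estimate extraction precision from a small, manually reviewed random sample. It is not taken from the tutorial itself: the function names (`wilson_interval`, `estimate_precision`), the `is_correct` callback standing in for a human reviewer, and the choice of a Wilson score interval are all assumptions made for the example.

```python
import math
import random


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def estimate_precision(extractions, is_correct, sample_size, seed=0):
    """Estimate precision from a uniform random sample of extractions.

    `is_correct` is a hypothetical callback that stands in for manual
    review of one extraction; it returns True when the extraction is right.
    """
    rng = random.Random(seed)
    sample = rng.sample(extractions, min(sample_size, len(extractions)))
    correct = sum(1 for item in sample if is_correct(item))
    low, high = wilson_interval(correct, len(sample))
    return correct / len(sample), low, high


if __name__ == "__main__":
    # Toy data: every tenth extraction is wrong, so true precision is 0.9.
    extractions = list(range(1000))
    point, low, high = estimate_precision(
        extractions, lambda x: x % 10 != 0, sample_size=200
    )
    print(f"precision ~ {point:.3f} (95% interval {low:.3f} to {high:.3f})")
```

A lower bound that falls below the target precision for a given sender domain or template would flag a gap worth investigating. Recall needs a separate sample drawn from emails where nothing was extracted, which this sketch does not cover.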