Text classification across languages with limited data
Generating labeled data for English
Using English data to create classifiers for other languages (Spanish, Portuguese, etc.)
Handling scalability issues in model building
Adapting web information extraction (IE) techniques by writing wrappers: their limitations, conjunctive and disjunctive template (DOM)-based models, etc.
Scalability issues due to number of scenarios, sender domains, and templates
Rule induction: Programming by examples
Machine learning approaches: LR, CRF, LSTM, etc.
Combining rule induction and machine learning to get high precision and recall with high coverage
Ensemble approach: Automated clustering, ML models to identify individual fields, generating XPath-based rules for each cluster.
Iterative approach: Use ML to create models that work well for seen templates and, to some extent, for unseen templates; feed the data with probabilistic labels to the rule-induction module; use a semi-supervised approach to remove discrepancies between the ML and rule-induction outputs; iterate multiple times to improve performance and coverage.
Efficient monitoring to maintain high precision and recall
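
As a rough illustration of the "rule induction: programming by examples" and "XPath-based rules for each cluster" items above, the following is a minimal sketch, not the method used in this work. It assumes every document in a cluster is well-formed markup produced from one template, and that each example pairs a document with the exact field value to extract. The function names (paths_to_value, induce_rule, apply_rule) and the sample markup are hypothetical. Only the Python standard library is used.

```python
import xml.etree.ElementTree as ET


def paths_to_value(root, value):
    """Return the positional paths of all elements whose text equals `value`."""
    found = set()

    def walk(node, path):
        if (node.text or "").strip() == value:
            found.add(path)
        counts = {}
        for child in node:
            # Index each child among its same-tag siblings (1-based, as in XPath).
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, ".")
    return found


def induce_rule(examples):
    """Keep only the paths that locate the labeled value in every example."""
    candidates = None
    for markup, value in examples:
        paths = paths_to_value(ET.fromstring(markup), value)
        candidates = paths if candidates is None else candidates & paths
    # Return one surviving path deterministically, or None if no rule fits.
    return min(candidates) if candidates else None


def apply_rule(rule, markup):
    """Apply an induced path to a new document from the same cluster."""
    node = ET.fromstring(markup).find(rule)
    return (node.text or "").strip() if node is not None else None


# Hypothetical labeled examples from one template cluster (field: total).
EXAMPLES = [
    ("<html><body><table><tr><td>Order</td><td>A-17</td></tr>"
     "<tr><td>Total</td><td>$12.00</td></tr></table></body></html>", "$12.00"),
    ("<html><body><table><tr><td>Order</td><td>B-42</td></tr>"
     "<tr><td>Total</td><td>$8.50</td></tr></table></body></html>", "$8.50"),
]

UNSEEN = ("<html><body><table><tr><td>Order</td><td>C-03</td></tr>"
          "<tr><td>Total</td><td>$3.25</td></tr></table></body></html>")

rule = induce_rule(EXAMPLES)
extracted = apply_rule(rule, UNSEEN)
```

Intersecting candidate paths across examples is what discards accidental matches: a path survives only if it finds the labeled value in every example. In the iterative approach outlined above, the example values would come from ML predictions with probabilistic labels rather than from hand labeling, so a realistic version would need to tolerate some wrong labels instead of requiring a path to match every example.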