Information Extraction Crossing Language, Robustness and Domain Barriers


November 2, 2012


Imed Zitouni


Microsoft - Bing


Modern communication technologies have made massive amounts of real-time news in several languages readily available. This has created a need for news-monitoring systems that allow users to monitor multilingual news media in near real time and search over stored content. One example of such a system is the Translingual Automatic Language Exploration System, codenamed TALES. In this talk I will briefly describe the architecture of TALES and focus on its information extraction component. Information extraction is a crucial step toward understanding a text, as it identifies the important conceptual objects in a discourse and the relations between them. I will address the portability of the approach to different languages and show a method for propagating information from resource-rich languages into low-resource ones. In contrast to other approaches that focus on clean text, I will also show the robustness of our technique to less well-formed input. For example, information extraction in a multilingual broadcast processing system has to deal with inaccurate automatic transcription and translation. The resulting presence of non-target-language text yields many false alarms, which raises the research problem of making information extraction robust to such noisy input. If time permits, I will also discuss the application and adaptation of these techniques to the health-care domain.


Imed Zitouni

Imed Zitouni recently joined Microsoft Bing. He had been a research member of the IBM Multilingual NLP group since 2004. Before joining IBM, he was a scientist at the startup company DIALOCA in 1998-1999, and then a research staff member at Bell Laboratories from 1999 to 2004. He received his M.Sc. and Ph.D. with highest honors from the University of Nancy 1, France, in 1996 and 2000, respectively. In 1995, he obtained an MEng degree in computer science from ENSI, a prestigious national computer institute in Tunisia. His research interests include natural language processing, information retrieval, machine translation, spoken dialogue systems, speech recognition and machine learning. He is a senior member of IEEE, a member of the IEEE Speech and Language Processing Technical Committee (99-11), the Information Officer of the ACL SIG on Semitic Languages, and a member of ISCA and ACL. He served as team lead on several NLP projects at IBM and as chair and reviewing committee member for several conferences and journals. He has also authored or co-authored more than 75 papers in international conferences and journals. Imed’s recent book is “Multilingual Natural Language Processing Applications: From Theory to Practice”, published by Prentice Hall.