The successes and challenges of making low-data languages available in online automatic translation portals and software


August 17, 2010


Jeff Allen


SAP and Advisory Boards to Multilingual Magazine and Linguist List


Most development and deployment of machine translation (MT) technologies over the past several decades has been for international languages. Only a few projects for low-data/low-density/low-resource/sparse-data/less-prevalent/less commonly taught/minority languages have led to successful prototypes and products. A number of technical, logistical, social, educational and other factors influence the potential success of implementing systems for such languages. This talk will cover many of the lessons learned from previous projects, and some of the pitfalls to avoid. It will also demonstrate how the recent efforts to make Haitian Creole available for Haiti Disaster Relief achieved a certain level of success in record time because they could build upon previous work. Yet there were also obstacles which have been problematic and remain a concern for this language and for other less-prevalent languages. Lastly, the discussion will mention some ways to enable proactive, forward-thinking projects, using some bootstrapping methods, to reduce the risk of situations which can result from working in a primarily reactive mode.

This will be an interactive dialogue with the audience, allowing for questions throughout the session, and an additional question/answer time.


Jeff Allen

Jeff Allen has two decades of experience in the translation services and translation tools industry. With a bachelor’s degree in French linguistics and master’s and doctoral degrees in Creole linguistics, he started his career as a professional translator and language/linguistics professor at several universities and SIL. He then went on to hold roles in several key machine translation (MT) projects and tool vendors, including the Caterpillar Inc MT implementation, the Center for Machine Translation/Language Technologies Institute at Carnegie Mellon University (CMU), Softissimo/Reverso, the European/Evaluation Language resources Distribution Agency (ELDA), SYSTRAN and Software/TransPerfect Translations. He is currently engineering R&D tools/process expert at SAP. He has worked on several different types (RBMT, KBMT, EBMT, SBMT, MEMT) and brands of MT systems, and is known for his work in controlled language writing for translation, MT dictionary optimization, MT postediting, website and software localization, translation management systems (GMS/TMS), terminology management, and speech technologies. He is a long-time campaigner for MT technologies, including for minority languages, with a few dozen articles on such languages. At the outset of the earthquake in Haiti in January 2010, he quickly spearheaded and led the Haitian Creole language data recovery actions of his decade-old CMU DIPLOMAT project, which led to the immediate release of Haitian Creole within several online MT portals and applications, including the Microsoft Bing Translator and MSN Windows Messenger Tbot. Over the past six months, he has continued to focus on translation technology opportunities for Haitian Creole and on obtaining permission to use additional Haitian Creole language resource data for natural language processing needs.