Resource-Based NLP (by Hitoshi Isahara) and Construction of Large Japanese-Chinese Aligned Parallel Treebank Corpus (by Kiyotaka Uchimoto) – National Institute of Information and Communications Technology (NICT)


December 7, 2007


Hitoshi Isahara and Kiyotaka Uchimoto


National Institute of Information and Communications Technology (NICT)


Resource-Based NLP (by Hitoshi Isahara)
Abstract: At NICT, we develop widely applicable, high-performance NLP technologies and linguistic resources, which we make available to the public to facilitate the development of multilingual information processing technologies. We have already published several large linguistic resources and aim to become one of the world's largest NLP resource centers. We are developing linguistic resources (corpora, dictionaries, and other tools) as a basis for resource-based NLP. These resources include the NICT Multilingual Corpus, the NICT Japanese Learner English (JLE) corpus, Japanese-English News Article Alignment Data, 2 million parallel sentences from Japanese and English patent documents, the Corpus of Spontaneous Japanese (CSJ), and the EDR Electronic Dictionary. We also conduct R&D on NLP using language resources. This work involves developing fundamental NLP technologies for use in machine translation systems, as well as technologies that support the creation of large-scale corpora. The tools that have been developed and published include parallel text alignment software and NLP software using the maximum entropy method. We are also working on the automatic acquisition of lexical knowledge from large-scale corpora, and on applying the results to a web retrieval support system that shows users lists of related words.

Construction of Large Japanese-Chinese Aligned Parallel Treebank Corpus (by Kiyotaka Uchimoto)
Abstract: In 2006, we launched a five-year project to develop a practical Japanese-Chinese translation system. We are taking a corpus-based approach to machine translation, such as example-based machine translation (EBMT) and statistical machine translation (SMT). Because this approach requires a large parallel corpus, we are compiling one in this project, combining manual translation with automatic gathering. For manual translation, we selected Japanese texts and have started translating them manually into Chinese. We plan to finish translating Japanese texts (mainly scientific papers) into Chinese within five years, creating a large parallel corpus of over 1 million sentences. For automatic gathering, we collect parallel corpora from the web and also extract parallel text from non-parallel corpora using NLP techniques. Furthermore, we will annotate the parallel corpus with morphological, syntactic, and alignment information in a semi-automatic way. This talk will describe the plan for, and the progress of, the construction of the parallel corpus in our project.


Hitoshi Isahara and Kiyotaka Uchimoto

Hitoshi Isahara is the Leader of the Computational Linguistics Group and the Director of the Thai Computational Linguistics Laboratory (TCL) at the National Institute of Information and Communications Technology (NICT), Japan. He is also a professor at Kobe University Graduate School of Science and Technology, Japan, and a part-time lecturer at Kyoto University Graduate School of Human and Environmental Studies and at Doshisha University. He is also the president-elect of the International Association for Machine Translation (IAMT) and the president of the Asia-Pacific Association for Machine Translation (AAMT). He received the B.E., M.E., and Ph.D. degrees in electrical engineering from Kyoto University, Kyoto, Japan, in 1978, 1980, and 1995, respectively. His research interests include natural language processing and lexical semantics.

Kiyotaka Uchimoto, Ph.D., is a Senior Researcher at the National Institute of Information and Communications Technology, Japan. He received the B.E. and M.E. in Electrical Engineering, and the Ph.D.