Resource-Based NLP (by Hitoshi Isahara)
Abstract: At NICT, we develop widely applicable, high-performance NLP technologies and linguistic resources, which we make available to the public to facilitate the development of multilingual information processing technologies. We have already published several large linguistic resources and aim to become one of the world's largest NLP resource centers. We are working on the development of linguistic resources (corpora, dictionaries, and other tools) as a basis for resource-based NLP. These resources include the NICT Multilingual Corpus, the NICT Japanese Learner English (JLE) Corpus, the Japanese-English News Article Alignment Data, 2 million parallel sentences from Japanese and English patent documents, the Corpus of Spontaneous Japanese (CSJ), and the EDR Electronic Dictionary. We also conduct R&D on NLP using language resources. This work involves the development of fundamental NLP technologies to be used in machine translation systems, as well as technologies to support the creation of large-scale corpora. The tools that have been developed and published include parallel text alignment software and NLP software based on the maximum entropy method. We are also working on the automatic acquisition of lexical knowledge from large-scale corpora, applying the results to a support system for web retrieval that shows users lists of related words.
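The abstract does not specify how the published parallel text alignment software works; a common length-based approach in the spirit of Gale and Church can be sketched as follows. The bead types, mismatch cost, and penalty values here are illustrative assumptions, not the actual parameters of the NICT tool.

```python
def align_sentences(src_lens, tgt_lens, mismatch=10.0):
    """Length-based sentence alignment by dynamic programming (a simplified
    Gale-Church-style sketch).

    src_lens, tgt_lens: character lengths of the source/target sentences.
    Returns a list of beads (src_indices, tgt_indices) covering both sides.
    """
    # Allowed bead types: (source sentences consumed, target sentences consumed).
    beads = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    n, m = len(src_lens), len(tgt_lens)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for di, dj in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                ls = sum(src_lens[i:ni])
                lt = sum(tgt_lens[j:nj])
                # Penalize length mismatch; non-1:1 beads pay a fixed extra cost.
                c = abs(ls - lt) / max(ls + lt, 1) * mismatch
                if (di, dj) != (1, 1):
                    c += mismatch / 2
                if cost[i][j] + c < cost[ni][nj]:
                    cost[ni][nj] = cost[i][j] + c
                    back[ni][nj] = (i, j)
    # Trace the lowest-cost path back from the end of both texts.
    path = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        path.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return list(reversed(path))
```

For example, three sentences of similar lengths on each side align one-to-one, while two short source sentences matching one long target sentence are merged into a 2:1 bead.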
Construction of Large Japanese-Chinese Aligned Parallel Treebank Corpus (by Kiyotaka Uchimoto)
Abstract: In 2006, we launched a five-year project to develop a practical Japanese-Chinese translation system. We are taking corpus-based approaches to machine translation, such as EBMT and SMT. Since these approaches require a large parallel corpus, we are compiling one in this project, combining manual translation with automatic gathering. For manual translation, we selected Japanese texts to be translated and began translating them into Chinese. Within five years, we will finish translating Japanese texts (mainly scientific papers) into Chinese to create a large parallel corpus of over 1 million sentence pairs. For automatic gathering, we collect parallel corpora from the web and also extract parallel text from non-parallel corpora using NLP techniques. Furthermore, we will annotate the parallel corpus with morphological, syntactic, and alignment information in a semi-automatic way. This talk describes the plan for, and the progress of, the construction of the parallel corpus in our project.
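The abstract does not detail the techniques used to extract parallel text from non-parallel corpora; one common baseline scores candidate sentence pairs by bilingual-dictionary word overlap. The sketch below uses a hypothetical toy dictionary, a greedy one-best match, and an arbitrary threshold, all of which are illustrative assumptions rather than the project's actual method.

```python
def overlap_score(src_tokens, tgt_tokens, dictionary):
    """Score a candidate sentence pair by bilingual-dictionary overlap.

    dictionary maps a source word to a set of its possible translations.
    Returns the fraction of source tokens with a translation in the target.
    """
    if not src_tokens:
        return 0.0
    tgt = set(tgt_tokens)
    hits = sum(1 for w in src_tokens if dictionary.get(w, set()) & tgt)
    return hits / len(src_tokens)

def extract_parallel(src_sents, tgt_sents, dictionary, threshold=0.5):
    """Greedily pair each source sentence with its best-scoring target
    sentence, keeping only pairs above the threshold.

    src_sents, tgt_sents: lists of tokenized sentences (tgt_sents non-empty).
    """
    pairs = []
    for s in src_sents:
        best = max(tgt_sents, key=lambda t: overlap_score(s, t, dictionary))
        if overlap_score(s, best, dictionary) >= threshold:
            pairs.append((s, best))
    return pairs
```

In practice such a baseline would use a real bilingual lexicon and stricter filtering (length ratios, document-level matching) before pairs are admitted to the corpus.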