Developing Annotated Korean Learner Corpus and Automatic Analysis of Learner Language

Date

June 25, 2009

Speaker

Sun-Hee Lee, Seok Bae Jang, and Markus Dickinson

Affiliation

Wellesley College, Brigham Young University, and Indiana University

Overview

In this talk, we present our ongoing research on developing an annotated Korean learner corpus and on the automatic analysis of learner errors. Our long-term goal is to develop an intelligent computer-assisted language learning (ICALL) system. Annotated learner corpora are known to provide valuable information about language learners in different teaching environments and about learning, language acquisition, and language variation. There is also a practical, technological need for them: annotated learner corpora can provide more training and evaluation data for processing learner language in a wider range of languages, such as Korean.
Taking spelling and particle errors as our starting points, we pursue the goal of providing a taxonomy and annotation scheme for different kinds of Korean learner errors. Such a scheme is useful both for automatic error diagnosis and for feedback instruction in a Korean ICALL system. As with any language being learned, learners of Korean make many different kinds of errors, some of which, such as spelling errors, are computationally problematic and some of which persist through years of study. For instance, Korean uses postpositional particles to indicate grammatical functions (cf. case-marking) as well as uses that roughly correspond to those of English prepositions. As with English prepositions (cf. Tetreault and Chodorow 2008; de Felice and Pulman 2008), learners of Korean persistently misuse particles throughout different stages of the learning process (Lee et al. 2009; Ko et al. 2004). Furthermore, processing Korean particles, especially in learner language, proves to be quite challenging, since a variety of syntactic, semantic, and pragmatic factors go into the selection of a correct particle (see, e.g., Dickinson and Lee 2009).
In a different vein, spelling errors reflect different aspects of linguistic knowledge (see Rimrott & Heift 2008, and references therein). Learners of Korean frequently mismatch syllables and characters, due to the interaction of the complex linguistic processes they are learning, including phonological rules and morphological boundary detection. Because dealing with misspellings is a crucial first step for most systems, and because a corpus that contains spelling corrections allows NLP practitioners to focus on other aspects of research, we focus on spelling error annotation in our corpus development.
We will show various issues related to processing particle errors and spelling errors in an agglutinative language like Korean. Linguistic awareness plays a crucial role not only in language learning but also in error processing. By building an annotated learner corpus, we provide a platform for research into different realms of second language acquisition, linguistics, and the automatic analysis of learner language.

Speakers

Sun-Hee Lee is an assistant professor in the Department of East Asian Languages and Literatures at Wellesley College. Her areas of expertise include Korean syntax, semantics, corpus linguistics, and Korean language teaching. She has been working on lexical analysis of long-distance dependency, pronoun/anaphor resolution, and corpus annotation. Her current research interests include learner corpus annotation and the automatic analysis of learner language for developing an intelligent computer-assisted language learning (ICALL) system for Korean.

Seok Bae Jang is a visiting lecturer of Korean in the Department of Asian and Near Eastern Languages at Brigham Young University. He is interested in corpus annotation and linguistic processing, especially Korean language processing, temporal and event processing, summarization evaluation, and the automatic analysis of learner language for intelligent computer-assisted language learning applications.

Markus Dickinson is an assistant professor in the Department of Linguistics at Indiana University, specializing in computational linguistics. His research interests include the intersection of corpus annotation and linguistic processing, especially the detection of annotation errors, and the automatic analysis of learner language, especially for intelligent computer-assisted language learning applications.