In this talk, we present our ongoing research on developing an annotated Korean learner corpus and on the automatic analysis of learner errors. Our long-term goal is to develop an intelligent computer-assisted language learning (ICALL) system. Annotated learner corpora are known to provide valuable information about language learners in different teaching environments and about learning, language acquisition, and language variation. There is also a practical, technological need for them: annotated learner corpora can provide more training and evaluation data for processing learner language in a wider range of languages, such as Korean.
Taking spelling and particle errors as our starting points, we pursue the goal of providing a taxonomy and annotation scheme for different kinds of Korean learner errors. Such a scheme is useful both for automatic error diagnosis and for feedback instruction in a Korean ICALL system. As with any language being learned, learners of Korean frequently make a variety of errors, some of which, such as spelling errors, are computationally problematic and some of which persist through years of study. For instance, Korean uses postpositional particles to indicate grammatical functions (cf. case-marking) as well as uses which roughly correspond to those of English prepositions. As with English prepositions (cf. Tetreault and Chodorow 2008; de Felice and Pulman 2008), learners of Korean persistently misuse particles throughout different stages of the learning process (Lee et al. 2009; Ko et al. 2004). Furthermore, processing Korean particles, especially in learner language, proves to be quite challenging, since a variety of syntactic, semantic, and pragmatic factors go into the selection of a correct particle (see, e.g., Dickinson and Lee 2009). In a different vein, spelling errors reflect different aspects of linguistic knowledge (see Rimrott and Heift 2008, and references therein). Learners of Korean frequently produce mismatches between a syllable and a character, due to the interaction of the complex linguistic processes they are learning, including phonological rules and morphological boundary detection. Because dealing with misspellings is a crucial first step for most systems, and because providing a corpus which contains spelling corrections allows NLP practitioners to focus on other aspects of research, we focus on spelling error annotation in our corpus development.
We will show various issues related to processing particle errors and spelling errors in an agglutinative language like Korean. Linguistic awareness plays a crucial role not only in language learning but also in error processing. By building up an annotated learner corpus, we provide a platform for research into different realms of second language acquisition, linguistics, and the automatic analysis of learner language.