Unsupervised Morphological Segmentation with Log-Linear Models
- Kristina Toutanova
- Colin Cherry
Proceedings of NAACL-HLT
Published by Association for Computational Linguistics
Best paper award winner
Morphological segmentation breaks words into morphemes (the basic semantic units). It is a key component of natural language processing systems. Unsupervised morphological segmentation is attractive because every language has a virtually unlimited supply of text but very few labeled resources. However, most existing model-based systems for unsupervised morphological segmentation use directed generative models, which makes it difficult to leverage arbitrary overlapping features that are potentially helpful to learning. In this paper, we present the first log-linear model for unsupervised morphological segmentation. Our model uses overlapping features such as morphemes and their contexts, and incorporates exponential priors inspired by the minimum description length (MDL) principle. We present efficient algorithms for learning and inference by combining contrastive estimation with sampling. Our system, based on monolingual features only, outperforms a state-of-the-art system by a large margin, even when the latter uses bilingual information such as phrasal alignment and phonetic correspondence. On the Arabic Penn Treebank, our system reduces F1 error by 11% compared to Morfessor.
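To make the modeling idea concrete, the following is a minimal sketch of an unnormalized log-linear score over a segmented corpus. It is not the paper's implementation. The feature templates (one indicator per morpheme, plus the characters immediately to its left and right as its context), the function names, and the penalty weights `alpha` and `beta` are assumptions chosen for illustration. The two penalty terms stand in for MDL-inspired exponential priors: one on the total length of the distinct morphemes (the lexicon) and one on the number of morpheme tokens. Normalization, contrastive estimation, and sampling are omitted.

```python
from collections import Counter


def extract_features(morphemes):
    """Overlapping features for one segmented word (hypothetical templates).

    Each morpheme fires a ("morph", m) feature and a ("ctx", left, right)
    feature built from the characters adjacent to it; "#" marks a word edge.
    """
    feats = Counter()
    padded = "#" + "".join(morphemes) + "#"
    pos = 1  # index in `padded` where the current morpheme starts
    for m in morphemes:
        feats[("morph", m)] += 1
        left = padded[pos - 1]
        right = padded[pos + len(m)]
        feats[("ctx", left, right)] += 1
        pos += len(m)
    return feats


def corpus_log_score(segmented_corpus, weights, alpha=1.0, beta=0.5):
    """Unnormalized log score of a whole segmented corpus.

    segmented_corpus: list of words, each a list of morpheme strings.
    weights: dict mapping feature tuples to real-valued weights.
    alpha, beta: assumed strengths of the lexicon and corpus penalties.
    """
    linear = 0.0
    lexicon = set()
    num_tokens = 0
    for morphemes in segmented_corpus:
        for feat, count in extract_features(morphemes).items():
            linear += weights.get(feat, 0.0) * count
        lexicon.update(morphemes)
        num_tokens += len(morphemes)
    # Penalize a large lexicon (total characters over distinct morphemes)
    # and a large number of morpheme tokens; in log space an exponential
    # prior becomes a linear penalty.
    lexicon_cost = sum(len(m) for m in lexicon)
    return linear - alpha * lexicon_cost - beta * num_tokens


segmented = [["walk", "ed"], ["talk", "ed"], ["walk", "s"]]
unsegmented = [["walked"], ["talked"], ["walks"]]
print(corpus_log_score(segmented, weights={}))
print(corpus_log_score(unsegmented, weights={}))
```

With all feature weights at zero, the segmented corpus scores higher than the unsegmented one under these assumed penalties, because reusing `walk` and `ed` shrinks the lexicon by more than the extra morpheme tokens cost. Learned feature weights would then shift this balance word by word.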