Sentence Mover’s Similarity:  Automatic Evaluation for  Multi-Sentence Texts

Elizabeth Clark; Asli Celikyilmaz; Noah Smith

Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts

Elizabeth Clark ,
Asli Celikyilmaz ,
Noah Smith

ACL 2019 | July 2019

Download BibTex

For evaluating machine-generated texts, automatic methods hold the promise of avoiding collection of human judgments, which can be expensive and time-consuming. The most common automatic metrics, like BLEU and ROUGE, depend on exact word match-ing, an inflexible approach for measuring se-mantic similarity. We introduce methods based on sentence mover’s similarity; our automatic metrics evaluate text in a continuous space using word and sentence embeddings. We find that sentence-based metrics correlate with human judgments significantly better than ROUGE, both on machine-generated summaries (average length of 3.4 sentences)and human-authored essays (average length of 7.5). We also show that sentence mover’s similarity can be used as a reward when learning a generation model via reinforcement learning; we present both automatic and human evaluations of summaries learned in this way, finding that our approach outperforms ROUGE.