Inferring Information Status for Reference Generation in Open Domains


March 18, 2005


Advaith Siddharthan


Columbia University


Multi-document summarization involves heavy information compression. Important events need to be summarized, while at the same time, the protagonists need to be described in sufficient detail that the reader can relate them to the story. This talk formalizes the trade-off between describing entities and events within the space constraints of a short summary. I present experiments on automatically acquiring the information status of entities and using this to direct the reference generation process for multi-document summarization.

Information status broadly consists of three notions: whether an entity is hearer-old or hearer-new, whether a reference to it is discourse-old or discourse-new, and whether the entity is a major or minor character in the text. The discourse-old/discourse-new distinction is well understood, and results in longer, descriptive initial references and shorter subsequent references. I show how features extracted from the input documents to a typical summarization engine can be used to determine whether entities are assumed to be hearer-old or hearer-new, and whether a character is central to the summary (major) or not (minor). This information is used to decide whether to refer to a character by name or generically, and to set the level of detail required in the initial reference. The learned information status successfully models human decisions on reference generation, producing short references (e.g. “President Bush”) for entities the reader can be expected to be familiar with, and more descriptive references (e.g. “Kurdish Rebel Abdullah Ocalan, who has been sought for years by Turkey”) for less familiar entities.
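The way the three notions combine into a choice of reference form can be illustrated with a minimal Python sketch. It is not the learned model described in the talk: the `Entity` fields, the `refer` function, and the fixed rule ordering are hypothetical, and the hearer-old and major/minor flags are set by hand here rather than inferred from input documents.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical record for a person mentioned in the input documents."""
    name: str         # name used in the initial reference
    surname: str      # short form used for subsequent references
    role: str         # short premodifier, e.g. a title
    description: str  # longer postmodifier for unfamiliar entities
    generic: str      # generic noun phrase for minor characters
    hearer_old: bool  # assumed familiar to the reader (hand-set here)
    major: bool       # central to the summary (hand-set here)


def refer(entity: Entity, discourse_old: bool) -> str:
    """Pick a reference form from the three information-status notions."""
    if not entity.major:
        # Minor characters are referred to generically (assumed rule).
        return entity.generic
    if discourse_old:
        # Subsequent references are short.
        return entity.surname
    if entity.hearer_old:
        # Familiar entities get a short initial reference.
        return f"{entity.role} {entity.name}"
    # Unfamiliar entities get a descriptive initial reference.
    return f"{entity.role} {entity.name}, {entity.description}"


bush = Entity("Bush", "Bush", "President", "", "the president",
              hearer_old=True, major=True)
ocalan = Entity("Abdullah Ocalan", "Ocalan", "Kurdish Rebel",
                "who has been sought for years by Turkey",
                "a rebel leader", hearer_old=False, major=True)

print(refer(bush, discourse_old=False))
print(refer(ocalan, discourse_old=False))
print(refer(ocalan, discourse_old=True))
```

The sketch treats the decision as a fixed cascade only for readability; in the work described, the hearer-old/hearer-new and major/minor values are learned from features of the input documents.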

(This talk describes joint work with Ani Nenkova and Kathleen McKeown.)


Advaith Siddharthan

Currently a postdoc at Columbia University. BA, MA, and PhD in Computer Science from the University of Cambridge; BSc in Physics from the University of Delhi. My research to date has focused on the automatic rewriting of text, particularly news reports, to make information more accessible to a wide audience. My doctoral research focused on automatically reducing the grammatical complexity of English news reports by splitting long multi-clause sentences into multiple shorter ones. My post-doctoral research has been in the field of multi-document news summarisation. My particular interests relate to issues of coherence in computer-generated text, particularly the use of referring expressions.