Multi-document summarization involves heavy information compression. Important events need to be summarized, while at the same time, the protagonists need to be described in sufficient detail that the reader can relate them to the story. This talk formalizes the trade-off between describing entities and events within the space constraints of a short summary. I present experiments on automatically acquiring the information status of entities and using this to direct the reference generation process for multi-document summarization.
Information status broadly comprises three notions: whether an entity is hearer-old or hearer-new, whether a reference to it is discourse-old or discourse-new, and whether the entity is a major or minor character in the text. The discourse-old/new distinction is well understood and results in longer, descriptive initial references and shorter subsequent references. I show how, based on features extracted from the input documents to a typical summarization engine, it can be determined whether entities should be assumed hearer-old or hearer-new, and whether a character is central to the summary (major/minor). This information is used to decide whether to refer to a character by name or generically, and to decide the level of detail required in the initial reference. The learned information status successfully models human decisions on reference generation, producing short references (e.g., "President Bush") for entities the reader can be expected to be familiar with, and more descriptive references (e.g., "Kurdish rebel Abdullah Ocalan, who has been sought for years by Turkey") for less familiar entities.
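To make the decision procedure concrete, the mapping from information-status labels to reference forms can be sketched as a small set of rules. The sketch below is illustrative only, not the system presented in the talk; the entity fields and the particular rules are assumptions for the example.

```python
# Illustrative sketch: choosing a reference form from information-status
# labels (hearer-old/new, major/minor, first vs. subsequent mention).
# The Entity fields and rules here are hypothetical, not the talk's system.

from dataclasses import dataclass

@dataclass
class Entity:
    name: str         # e.g. "Abdullah Ocalan"
    title: str        # premodifier, e.g. "Kurdish rebel" or "President"
    extra: str        # optional descriptive clause for unfamiliar entities
    hearer_old: bool  # reader assumed to already know this entity
    major: bool       # central character in the summary

def initial_reference(e: Entity) -> str:
    """First (discourse-new) mention: brief for hearer-old entities,
    more descriptive for hearer-new ones."""
    ref = f"{e.title} {e.name}".strip()
    if not e.hearer_old and e.extra:
        ref += f", {e.extra}"  # add descriptive clause for unfamiliar entities
    return ref

def subsequent_reference(e: Entity) -> str:
    """Discourse-old mention: a short form (last name) for major
    characters, a generic description for minor ones."""
    if not e.major:
        return e.title or e.name
    return e.name.split()[-1]
```

For a hearer-old entity such as `Entity("Bush", "President", "", True, True)` this yields the short initial reference "President Bush", while a hearer-new entity with a descriptive clause yields the longer form quoted above.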
(This talk describes joint work with Ani Nenkova and Kathleen McKeown.)