Composing Text and Structured Databases
- Rakesh Agrawal
- Ariel Fuxman
- Anitha Kannan
- Qi Lu
- John Shafer
MSR-TR-2012-22
We postulate a universe of objects in which each object is described by a set of characteristics. Two objects may share some characteristics, but any two distinct objects differ in at least one. Thus, every object can be thought of as having an implicit unique identity, even though that identity is not manifest. We have two data sources. The first contains text documents written in a natural language, each discussing some aspects of a small number of objects. No markup or structure in a document explicitly identifies the objects it is about; each text is simply a sequence of words. The second source is a collection of structured records, each containing representative but partial characteristics of some object.
Under this setting, we present a framework for composing texts about the objects with structured information about them. The framework is centered around the concept of a “trait”: a set of characteristics that can serve as a proxy for the identity of an object. Traits might sound similar to database keys, but traits are instance-based rather than schema-based. We present techniques for computing traits and for mapping structured records and text documents to traits, thereby joining information about the same object from the two repositories. Our extensive experiments using synthetic data demonstrate the effectiveness of our approach under a wide range of operating parameters. Experiments using empirical data validate the results of the synthetic experiments.
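To make the notion of an instance-based trait concrete, the following is a minimal sketch and not the report's algorithm. It assumes each structured record is a set of single-word, lowercase characteristics, treats a trait as a smallest subset of a record's characteristics that no other record in the collection contains, and maps a document to a record when every word of one of that record's traits occurs in the text. The names `compute_traits`, `match_document`, and `max_size`, as well as the sample records, are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def compute_traits(records, max_size=2):
    """For each record, find its smallest distinguishing characteristic sets.

    Brute-force illustration: a candidate set is kept only if no OTHER
    record contains all of it, so traits depend on the data instance,
    not on a schema.
    """
    traits = {}
    for i, rec in enumerate(records):
        others = [r for j, r in enumerate(records) if j != i]
        found = []
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(rec), size):
                candidate = frozenset(combo)
                if not any(candidate <= other for other in others):
                    found.append(candidate)
            if found:
                break  # keep only the smallest distinguishing sets
        traits[i] = found
    return traits


def match_document(text, traits):
    """Return indices of records having at least one trait fully in the text.

    Assumption: whitespace tokenization and lowercasing are good enough
    for this toy example.
    """
    words = set(text.lower().split())
    return sorted(i for i, ts in traits.items() if any(t <= words for t in ts))


# Hypothetical structured records (partial descriptions of three objects).
records = [
    {"canon", "eos", "black"},
    {"canon", "powershot", "black"},
    {"nikon", "coolpix", "black"},
]
traits = compute_traits(records)
matched = match_document("I love my new Canon EOS camera", traits)  # [0]
```

In this toy collection "black" and "canon" are shared, so neither can act as a trait, while "eos" alone identifies the first record. Adding or removing records can change which sets qualify, which is the sense in which traits are tied to the instance rather than to a schema.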