Conceptualization

Established: July 5, 2010

The Conceptualization model maps textual entities to semantic concept categories, each with a probability that may depend on the context in which the entity appears. For example, “Microsoft” can be automatically mapped to “Software Company”, “Fortune 500 company”, and so on, each with its own probability. This gives computers a common-sense computing capability and makes machines “aware” of the mental world of human beings, so that they can better understand human communication in text.

In detail, conceptualization maps instances or short texts into a large, automatically learned concept space, which is a vector space, with human-level concept reasoning. The result can be treated as a text embedding that both humans and machines can understand. It therefore supports text concept tagging, short text semantic similarity computation, and other text understanding tasks. It can benefit various text processing applications, including search engines, automatic question answering, online advertising, recommendation systems, and artificial intelligence systems. For more information, please refer to our Microsoft Concept Graph release page and our ACL 2016 tutorial “Understanding Short Texts”.

1. Single instance conceptualization

Single instance conceptualization returns a ranked list of automatically learned concept/category names for any input entity mention/instance. Each concept carries a probability that the input entity belongs to it. As a result, the input entity is represented as a numerical vector that gives its distribution over the concept vector space.
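
As a rough illustration of the vector view described above, the following sketch represents each entity as a sparse concept distribution, ranks its concepts, and compares two entities by cosine similarity. The probabilities and the names `CONCEPTS`, `ranked_concepts`, and `cosine` are assumptions made up for this example; they are not taken from the Microsoft Concept Graph or from the service.

```python
import math

# Toy distributions P(concept | entity). The numbers are invented for
# illustration only and do not come from the Microsoft Concept Graph.
CONCEPTS = {
    "microsoft": {"company": 0.5, "software company": 0.4, "brand": 0.1},
    "oracle": {"company": 0.4, "software company": 0.5, "database vendor": 0.1},
    "mcdonald's": {"company": 0.5, "fast food chain": 0.4, "brand": 0.1},
}


def ranked_concepts(entity, top_k=3):
    """Return the top_k (concept, probability) pairs, most probable first."""
    dist = CONCEPTS[entity]
    return sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


def cosine(u, v):
    """Cosine similarity between two sparse concept vectors (dicts)."""
    dot = sum(p * v.get(c, 0.0) for c, p in u.items())
    norm_u = math.sqrt(sum(p * p for p in u.values()))
    norm_v = math.sqrt(sum(p * p for p in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    print(ranked_concepts("microsoft"))
    print(cosine(CONCEPTS["microsoft"], CONCEPTS["oracle"]))
    print(cosine(CONCEPTS["microsoft"], CONCEPTS["mcdonald's"]))
```

With these toy numbers, “microsoft” comes out closer to “oracle” than to “mcdonald's”, because the first pair shares probability mass on “software company” while the second shares only general concepts.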

For human beings, given a single instance, this concept distribution often forms automatically and subconsciously. More importantly, the categories at the appropriate level of abstraction rank higher. Psychologists and linguists call this Basic-level Categorization (BLC).

As an example, consider the term Microsoft, which can be categorized into a large number of concepts, ranging from extremely general to extremely specific, such as company, software company, and largest OS vendor. If we go through company, we find objects such as McDonald’s and BMW, which have little in common with Microsoft. If we go through largest OS vendor, we may not find any reasonable object other than Microsoft. If we go through software company, on the other hand, we find Oracle, Adobe, and IBM, which are much more similar to Microsoft. Thus, software company is a more appropriate basic-level concept for Microsoft. In other words, properties associated with software company apply more readily to Microsoft, which is also why software company leads us to many objects that are similar to Microsoft.

In this release, we provide the concept distribution of the input text using basic-level conceptualization. We also provide several common conceptualization measures, including MI, PMI, PMIk, and Typicality.
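
To make the trade-off in the Microsoft example concrete, here is a minimal sketch that compares three scores on toy co-occurrence counts: typicality taken as P(c|e), PMI, and the product P(c|e) · P(e|c) as a simple stand-in for a basic-level score. The counts, the function names, and the choice of that product are assumptions for illustration; the exact formulas used by the service are not stated here, and MI and PMIk are omitted.

```python
import math

# Hypothetical co-occurrence counts n(entity, concept), invented for this sketch.
COUNTS = {
    ("microsoft", "company"): 45,
    ("microsoft", "software company"): 40,
    ("microsoft", "largest os vendor"): 1,
    ("oracle", "company"): 40,
    ("oracle", "software company"): 25,
    ("adobe", "software company"): 20,
    ("mcdonald's", "company"): 80,
    ("bmw", "company"): 70,
}


def _total(entity=None, concept=None):
    """Sum counts, optionally restricted to one entity and/or one concept."""
    return sum(
        n for (e, c), n in COUNTS.items()
        if (entity is None or e == entity) and (concept is None or c == concept)
    )


def p_concept_given_entity(e, c):
    """Typicality of concept c for entity e, estimated as P(c | e)."""
    return COUNTS.get((e, c), 0) / _total(entity=e)


def p_entity_given_concept(e, c):
    """Typicality of entity e within concept c, estimated as P(e | c)."""
    return COUNTS.get((e, c), 0) / _total(concept=c)


def pmi(e, c):
    """Pointwise mutual information log(P(e, c) / (P(e) P(c)))."""
    n = _total()
    joint = COUNTS.get((e, c), 0) / n
    if joint == 0:
        return float("-inf")
    return math.log(joint / ((_total(entity=e) / n) * (_total(concept=c) / n)))


def basic_level_score(e, c):
    """Assumed stand-in for a basic-level score: P(c | e) * P(e | c)."""
    return p_concept_given_entity(e, c) * p_entity_given_concept(e, c)


def best_concept(e, measure):
    """Return the concept of e that maximizes the given measure."""
    concepts = {c for (ent, c) in COUNTS if ent == e}
    return max(concepts, key=lambda c: measure(e, c))


if __name__ == "__main__":
    for measure in (p_concept_given_entity, pmi, basic_level_score):
        print(measure.__name__, "->", best_concept("microsoft", measure))
```

On these toy counts, P(c|e) favors the general concept company, PMI favors the overly specific largest os vendor, and the product of the two conditional probabilities selects software company, which mirrors the intuition in the example.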

A snapshot of the demo: given a single instance “python”, the demo returns concept distributions under different measures (including the BLC measure).

You can simply integrate this single instance conceptualization service into your own applications.

2. Single instance conceptualization with context

Given “apple” and “pie”, our API maps “apple” to fruit-related senses.
Given “apple” and “ipad”, our API maps “apple” to company-related senses.
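
One simple way to picture this behavior is to reweight the context-free concept distribution of “apple” by how strongly each concept is associated with the context word, then renormalize. The sketch below does exactly that. The prior, the `AFFINITY` table, and the function name are hypothetical and only illustrate the idea; the API's actual disambiguation method is not described here.

```python
# Hypothetical context-free distribution P(concept | "apple").
APPLE_PRIOR = {"fruit": 0.5, "company": 0.4, "tree": 0.1}

# Hypothetical affinity between a concept and a context word, in [0, 1].
AFFINITY = {
    ("fruit", "pie"): 0.8,
    ("company", "pie"): 0.05,
    ("tree", "pie"): 0.1,
    ("fruit", "ipad"): 0.02,
    ("company", "ipad"): 0.9,
    ("tree", "ipad"): 0.01,
}


def conceptualize_with_context(prior, context, smoothing=0.01):
    """Reweight a concept distribution by affinity to one context word.

    The smoothing term keeps concepts with unknown affinity from dropping
    to exactly zero. The result is renormalized to sum to 1.
    """
    scores = {
        concept: p * (AFFINITY.get((concept, context), 0.0) + smoothing)
        for concept, p in prior.items()
    }
    total = sum(scores.values())
    return {concept: s / total for concept, s in scores.items()}


if __name__ == "__main__":
    for context in ("pie", "ipad"):
        dist = conceptualize_with_context(APPLE_PRIOR, context)
        print(context, "->", max(dist, key=dist.get))
```

With these toy values, the context “pie” shifts most of the mass to fruit and the context “ipad” shifts it to company.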

3. Short text conceptualization

Given a short text such as “the engineer is eating the apple”, our API performs segmentation, concept mapping, and sense disambiguation.
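
The segmentation step can be sketched as a greedy longest-match over a vocabulary of known terms, so that multi-word terms are kept whole before any concept mapping. This is a minimal sketch under assumed names (`VOCAB`, `STOPWORDS`, `segment`) and a toy vocabulary; it is a generic technique and not necessarily how the service segments text.

```python
# Toy vocabulary of known terms and a tiny stopword list (both hypothetical).
VOCAB = {"engineer", "software engineer", "apple", "apple pie"}
STOPWORDS = {"the", "is", "a", "an"}


def segment(text, max_len=3):
    """Greedy longest-match segmentation into terms.

    At each position, try the longest span (up to max_len tokens) that is a
    known term. Unknown single tokens are kept unless they are stopwords.
    """
    tokens = text.lower().split()
    terms = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in VOCAB:
                terms.append(candidate)
                i += n
                break
        else:
            # No known term starts here: keep the token if it carries content.
            if tokens[i] not in STOPWORDS:
                terms.append(tokens[i])
            i += 1
    return terms


if __name__ == "__main__":
    print(segment("the engineer is eating the apple"))
    print(segment("the software engineer baked an apple pie"))
```

Each resulting term could then be mapped to a concept distribution, with neighboring terms such as “eating” serving as context for disambiguating “apple”.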

References

  1. Zhongyuan Wang and Haixun Wang, Understanding Short Texts, in the Association for Computational Linguistics (ACL) (Tutorial), August 2016.
  2. Zhongyuan Wang, Haixun Wang, Ji-Rong Wen, and Yanghua Xiao, An Inference Approach to Basic Level of Categorization, in ACM International Conference on Information and Knowledge Management (CIKM), October 2015.
  3. Zhongyuan Wang, Kejun Zhao, Haixun Wang, Xiaofeng Meng, and Ji-Rong Wen, Query Understanding through Knowledge-Based Conceptualization, in IJCAI, July 2015.
  4. Wen Hua, Zhongyuan Wang, Haixun Wang, Kai Zheng, and Xiaofang Zhou, Short Text Understanding Through Lexical-Semantic Analysis, in International Conference on Data Engineering (ICDE), April 2015. (Best Paper Award)
  5. Zhongyuan Wang, Haixun Wang, and Zhirui Hu, Head, Modifier, and Constraint Detection in Short Texts, in International Conference on Data Engineering (ICDE), 2014.
  6. Yangqiu Song, Haixun Wang, Zhongyuan Wang, Hongsong Li, and Weizhu Chen, Short Text Conceptualization using a Probabilistic Knowledgebase, in IJCAI, 2011.

Contacts


Zhongyuan Wang

Dawei Zhang

Jun Yan

Wei-Ying Ma

Group

  Data Mining and Enterprise Intelligence Group, MSRA