{"id":170584,"date":"2010-10-29T03:13:04","date_gmt":"2010-10-29T03:13:04","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/project\/probase\/"},"modified":"2017-06-05T10:40:21","modified_gmt":"2017-06-05T17:40:21","slug":"probase","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/probase\/","title":{"rendered":"Probase"},"content":{"rendered":"<p><img decoding=\"async\" id=\"Imagec97b486e-e8f8-438b-adde-6bf9a444a7c5\" class=\"aligncenter\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/probase-probaselogo.gif\" alt=\"\" \/><span id=\"ImageCaptionc97b486e-e8f8-438b-adde-6bf9a444a7c5\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/p>\n<p>The goal of<strong> Probase<\/strong> is to make machines \u201caware\u201d of the mental world of human beings, so that\u00a0machines can better understand human communication. We do this by\u00a0giving certain <b><i>general knowledge<\/i><\/b> or certain <b><i>common sense<\/i><\/b> to machines.<\/p>\n\t<div data-wp-context='{\"items\":[]}' data-wp-interactive=\"msr\/accordion\">\n\t\t\t\t\t<div class=\"clearfix\">\n\t\t\t\t<div\n\t\t\t\t\tclass=\"btn-group align-items-center mb-g float-sm-right\"\n\t\t\t\t\tdata-bi-aN=\"accordion-collapse-controls\"\n\t\t\t\t>\n\t\t\t\t\t<button\n\t\t\t\t\t\tclass=\"btn btn-link m-0\"\n\t\t\t\t\t\tdata-bi-cN=\"Expand all\"\n\t\t\t\t\t\tdata-wp-bind--aria-controls=\"state.ariaControls\"\n\t\t\t\t\t\tdata-wp-bind--aria-expanded=\"state.ariaExpanded\"\n\t\t\t\t\t\tdata-wp-bind--disabled=\"state.isAllExpanded\"\n\t\t\t\t\t\tdata-wp-class--inactive=\"state.isAllExpanded\"\n\t\t\t\t\t\tdata-wp-on--click=\"actions.onExpandAll\"\n\t\t\t\t\t\ttype=\"button\"\n\t\t\t\t\t>\n\t\t\t\t\t\tExpand all\t\t\t\t\t<\/button>\n\t\t\t\t\t<span aria-hidden=\"true\"> | <\/span>\n\t\t\t\t\t<button\n\t\t\t\t\t\tclass=\"btn btn-link m-0\"\n\t\t\t\t\t\tdata-bi-cN=\"Collapse 
all\"\n\t\t\t\t\t\tdata-wp-bind--aria-controls=\"state.ariaControls\"\n\t\t\t\t\t\tdata-wp-bind--aria-expanded=\"state.ariaExpanded\"\n\t\t\t\t\t\tdata-wp-bind--disabled=\"state.isAllCollapsed\"\n\t\t\t\t\t\tdata-wp-class--inactive=\"state.isAllCollapsed\"\n\t\t\t\t\t\tdata-wp-on--click=\"actions.onCollapseAll\"\n\t\t\t\t\t\ttype=\"button\"\n\t\t\t\t\t>\n\t\t\t\t\t\tCollapse all\t\t\t\t\t<\/button>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t\t\t<ul class=\"msr-accordion\">\n\t\t\t\t\t\t\t<h2>News<\/h2><p><span style=\"color: #ff6600;\">[Sept.<strong> 2016<\/strong>] Please visit our <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/concept.research.microsoft.com\/\">Microsoft Concept Graph <span class=\"sr-only\"> (opens in new tab)<\/span><\/a> release for up-to-date information on this project!<\/span><\/p><h2>Applications<\/h2><p class=\"page-content\" style=\"text-align: center;\">The goal of Probase is to enable machines to better understand human communication. For example, in natural language processing and speech analysis, knowledgebases can help reduce the ambiguities in language. 
As Probase has a knowledgebase as large as the concept space (of worldly facts) in a human mind, it has unique advantages in these applications. In addition, with the probabilistic knowledge provided by Probase, we have built several interesting applications, such as topic search, web table search and document\u00a0understanding,\u00a0shown in Figure 3.<span id=\"186f2c32-3da1-49e5-8ab6-562eeb443c3a\" class=\"ImageBlock fn\"><img decoding=\"async\" id=\"Image186f2c32-3da1-49e5-8ab6-562eeb443c3a\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/probase-probase_overview.png\" alt=\"\" \/><\/span><\/p><p class=\"page-content\" style=\"text-align: center;\"><span id=\"186f2c32-3da1-49e5-8ab6-562eeb443c3a\" class=\"ImageBlock fn\"><span id=\"ImageCaption186f2c32-3da1-49e5-8ab6-562eeb443c3a\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><b>Figure 3: Overview of\u00a0Probase and Its Applications<\/b><\/p><h2>Project Contact<\/h2><ul><li><strong><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/wangzhongyuan.com\/en\/\">Zhongyuan Wang<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/strong> (wzhy\u00a0(AT) outlook. 
com)<\/li><li>Haixun Wang<\/li><li>Wei-Ying Ma<\/li><\/ul>\t<li class=\"m-0\" data-wp-context='{\"id\":\"accordion-content-2\"}' data-wp-init=\"callbacks.init\">\n\t\t<div class=\"accordion-header\">\n\t\t\t<button\n\t\t\t\taria-controls=\"accordion-content-2\"\n\t\t\t\tclass=\"btn btn-collapse\"\n\t\t\t\tdata-wp-bind--aria-expanded=\"state.isExpanded\"\n\t\t\t\tdata-wp-on--click=\"actions.onClick\"\n\t\t\t\tid=\"accordion-button-1\"\n\t\t\t\ttype=\"button\"\n\t\t\t>\n\t\t\t\tData Statistics\t\t\t<\/button>\n\t\t<\/div>\n\t\t<div\n\t\t\taria-labelledby=\"accordion-button-1\"\n\t\t\tclass=\"msr-accordion__content\"\n\t\t\tdata-wp-bind--inert=\"!state.isExpanded\"\n\t\t\tdata-wp-run=\"callbacks.run\"\n\t\t\tid=\"accordion-content-2\"\n\t\t>\n\t\t\t<div class=\"msr-accordion__body\">\n\t\t\t\t<p>This page presents the data statistics of\u00a0the taxonomy we built. In addition, proofs of Theorems 1 and 2 in Section 3.6, which are related to the taxonomy construction framework, can be found <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/haixun.olidu.com\/probase\/files\/proofs.pdf\">here<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>.<\/p>\n<h2>General Information<\/h2>\n<p align=\"center\"><b>Table 1: Scale comparison of several open domain taxonomies<\/b><\/p>\n<table class=\" borderColumns borderRows tableBorder\" width=\"400\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td><b><i>name<\/i><\/b><\/td>\n<td><b><i># of concepts<\/i><\/b><\/td>\n<td><b><i># of isA 
pairs<\/i><\/b><\/td>\n<\/tr>\n<tr>\n<td>Freebase<\/td>\n<td>1,450<\/td>\n<td>24,483,434<\/td>\n<\/tr>\n<tr>\n<td>WordNet<\/td>\n<td>25,229<\/td>\n<td>283,070<\/td>\n<\/tr>\n<tr>\n<td>WikiTaxonomy<\/td>\n<td>111,654<\/td>\n<td>105,418<\/td>\n<\/tr>\n<tr>\n<td>YAGO<\/td>\n<td>352,297<\/td>\n<td>8,277,227<\/td>\n<\/tr>\n<tr>\n<td>DBPedia<\/td>\n<td>259<\/td>\n<td>1,900,000<\/td>\n<\/tr>\n<tr>\n<td>ResearchCyc<\/td>\n<td>\u2248 120,000<\/td>\n<td>&lt; 5,000,000<\/td>\n<\/tr>\n<tr>\n<td>KnowItAll<\/td>\n<td>N\/A<\/td>\n<td>&lt; 54,753<\/td>\n<\/tr>\n<tr>\n<td>TextRunner<\/td>\n<td>N\/A<\/td>\n<td>&lt; 11,000,000<\/td>\n<\/tr>\n<tr>\n<td>OMCS<\/td>\n<td>173,398<\/td>\n<td>1,030,619<\/td>\n<\/tr>\n<tr>\n<td>NELL<\/td>\n<td>123<\/td>\n<td>&lt; 242,453<\/td>\n<\/tr>\n<tr>\n<td><b>Probase<\/b><\/td>\n<td><b>2,653,872<\/b><\/td>\n<td><b>20,757,545<\/b><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>We extract <b>326,110,911<\/b> sentences from a corpus containing <b>1,679,189,480<\/b> web pages, after sentence deduplication. To the best of our knowledge, the scale of our corpus is one order of magnitude larger than the previously known largest corpus. We then extract <b>143,328,997<\/b> <i>isA<\/i> pairs from the sentences, with <b>9,171,015<\/b> distinct super-concept labels and <b>11,256,733 <\/b>distinct sub-concept labels.<\/p>\n<p>The inferred taxonomy contains <b>2,653,872<\/b> distinct concepts (down from 9.17 million after the extraction phase), <b>16,218,369<\/b> distinct concept-instance pairs, and <b>4,539,176<\/b> distinct concept-subconcept pairs (<b>20,757,545<\/b> pairs in total). The number of concept labels decreases because we converted all labels to lowercase and flattened the concepts with only one instance (referring to them as instances).<\/p>\n<p>For comparison, <b>Table 1<\/b> shows statistics of several well-known open-domain taxonomies alongside Probase. 
For <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/wordnet.princeton.edu\/\">WordNet<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, we only count the sub-taxonomy related to <b>nouns<\/b> in WordNet, and we have converted synsets in WordNet to their lexical form. For <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.freebase.com\/\">Freebase<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, the statistics are obtained from a version downloaded in early March 2010. More than 3,000 topics in this data source are incorrect and cannot be found on the official Freebase website, and are therefore ignored in our analysis. For <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/research.cyc.com\/\">ResearchCyc<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, the number of <i>isA<\/i> pairs shown is in fact the number of <b><i>all<\/i><\/b> the relationships, since the exact numbers are not reported. For <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.mpi-inf.mpg.de\/yago-naga\/yago\/downloads.html\">YAGO<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, the statistics are obtained from its latest version (Dec. 
2010), and the number of <i>isA<\/i> pairs is inferred by summing up the numbers of <b>SubConceptOf<\/b> and <b>Type<\/b> relations reported.<\/p>\n<p>For completeness, in <b>Table 1<\/b>, we have also included statistics for KnowItAll, TextRunner, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/openmind.media.mit.edu\/\">OMCS<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/rtw.ml.cmu.edu\/rtw\/\">NELL<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. However, these frameworks are <b><i>not<\/i><\/b> intended to build a taxonomy of the kind we desire, but to <i>extract<\/i> general <b><i>facts<\/i><\/b> that may indicate various relationships between concepts or entities. Therefore, it is usually hard to tell concepts from entities, and also hard to tell how many <i>isA<\/i> pairs are among all the pairs when this number is not reported.<\/p>\n<h2>Concept Space<\/h2>\n<p>Given that Probase has many more concepts than any other taxonomy, a reasonable question to ask is how many of these concepts are <i>relevant<\/i>. This question is akin to the <i>precision<\/i> measure in <i>information retrieval<\/i> (IR). Here, for the purpose of comparison, we define a concept to be relevant if it appears <i>at least once<\/i> in web queries. 
We analyzed <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.bing.com\/\">Bing<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>&#8217;s query log from a two-year period, sorted the queries in decreasing order of their <i>frequency<\/i> (i.e., the number of times they are issued through Bing), and computed the number of relevant concepts in Probase and four other general-purpose open-domain taxonomies, namely <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/wordnet.princeton.edu\/\">WordNet<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.h-its.org\/english\/research\/nlp\/download\/wikitaxonomy.php\">WikiTaxonomy<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.mpi-inf.mpg.de\/yago-naga\/yago\/\">YAGO<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/www.freebase.com\/\">Freebase<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, with respect to the top <b>50 million<\/b> queries. 
<b>Figure 1<\/b> shows the result.<\/p>\n<p align=\"center\"><b><span id=\"c9e78fed-24f6-499d-960b-83c71a812f8e\" class=\"ImageBlock fn\"><img decoding=\"async\" id=\"Imagec9e78fed-24f6-499d-960b-83c71a812f8e\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/probase-conceptprecision.png\" alt=\"\" \/><span id=\"ImageCaptionc9e78fed-24f6-499d-960b-83c71a812f8e\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/b><\/p>\n<p align=\"center\"><b>Figure 1: Number of relevant concepts in taxonomies<\/b><\/p>\n<p>In total, <b>664,775<\/b> concepts are considered relevant in Probase, compared to <b>70,656<\/b> in YAGO. This reflects the well-known long-tail phenomenon of user queries. While a small number of basic concepts (e.g., <i>company<\/i>, <i>city<\/i>, <i>country<\/i>) representing common sense knowledge appear very frequently in user queries, Web users also mention other <i>less well-known<\/i> concepts. Probase does a better job at capturing these concepts in the long tail and hence has a better chance of understanding these user queries.<\/p>\n<p>We next measure the <i>taxonomy coverage<\/i> of queries by Probase, which is akin to the <i>recall<\/i> measure in IR. A query is said to be <i>covered<\/i> by a taxonomy if the query contains <i>at least one concept or instance<\/i> within the taxonomy. <b>Figure 2<\/b> compares the coverage of queries by the Probase taxonomy against the other four aforementioned taxonomies. Probase outperforms all the other taxonomies on the coverage of the top <b>10 million<\/b> to the top <b>50 million<\/b> queries. 
In all, Probase covers <b>40,517,506<\/b> (or <b>81.04%<\/b>) of the top 50 million queries.<\/p>\n<p align=\"center\"><b><span id=\"476cbd0d-4600-4fb4-9696-df7f6a30dd55\" class=\"ImageBlock fn\"><img decoding=\"async\" id=\"Image476cbd0d-4600-4fb4-9696-df7f6a30dd55\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/probase-taxonomycoverage.png\" alt=\"\" \/><span id=\"ImageCaption476cbd0d-4600-4fb4-9696-df7f6a30dd55\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/b><\/p>\n<p align=\"center\"><b>Figure 2: Taxonomy coverage of the top 50 million queries<\/b><\/p>\n<p align=\"center\"><b><span id=\"2044b559-46bb-4753-b42f-b1478164ff2e\" class=\"ImageBlock fn\"><img decoding=\"async\" id=\"Image2044b559-46bb-4753-b42f-b1478164ff2e\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/probase-conceptcoverage.png\" alt=\"\" \/><span id=\"ImageCaption2044b559-46bb-4753-b42f-b1478164ff2e\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/b><\/p>\n<p align=\"center\"><b>Figure 3: Concept coverage of the top 50 million queries<\/b><\/p>\n<p>We further measure <i>concept coverage<\/i>, which is the number of queries containing <i>at least one concept<\/i> in the taxonomy. <b>Figure 3<\/b> compares the concept coverage by Probase against the other four taxonomies. Again, Probase outperforms all the others. Note that, although Freebase presents taxonomy coverage comparable to Probase in <b>Figure 2<\/b>, its concept coverage is much smaller.<\/p>\n<h2>isA Relationship Space<\/h2>\n<p>There are two kinds of <i>isA<\/i> relationships in Probase: the <i>concept-subconcept<\/i> relationship, represented by the edges connecting internal nodes in the hierarchy, and the <i>concept-instance<\/i> relationship, represented by the edges connecting to a leaf node.<\/p>\n<p><b>Table 2<\/b> compares the concept-subconcept relationship space of Probase with the other taxonomies. 
The <i>level<\/i> of a concept is defined to be one plus the length of the <i>longest<\/i> path from it to a <i>leaf<\/i> concept (i.e., concept without any subconcepts\/children). All leaf concepts thus receive a level of 1. <b>Table 2<\/b> shows that even with <b><i>an order of magnitude larger<\/i><\/b> number of concepts, Probase still has a <i>similar<\/i> hierarchical complexity to the other taxonomies. The exception is Freebase, which exhibits trivial values on these measured metrics because it has no isA relationship among its concepts at all.<\/p>\n<p align=\"center\"><b>Table 2: The concept-subconcept relationship space<\/b><\/p>\n<table class=\" borderColumns borderRows tableBorder\" width=\"500\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td><\/td>\n<td><b><i>\u00a0\u00a0 # of isA pairs<\/i><\/b><\/td>\n<td><b><i>Avg # of children<\/i><\/b><\/td>\n<td><b><i>Avg # of parents<\/i><\/b><\/td>\n<td><b><i>Avg level<\/i><\/b><\/td>\n<\/tr>\n<tr>\n<td>WordNet<\/td>\n<td>\u00a0\u00a0 283,070<\/td>\n<td>11.0<\/td>\n<td>2.4<\/td>\n<td>1.265<\/td>\n<\/tr>\n<tr>\n<td>WikiTaxonomy<\/td>\n<td>\u00a0\u00a0 90,739<\/td>\n<td>3.7<\/td>\n<td>1.4<\/td>\n<td>1.483<\/td>\n<\/tr>\n<tr>\n<td>YAGO<\/td>\n<td>\u00a0\u00a0 366,450<\/td>\n<td>23.8<\/td>\n<td>1.04<\/td>\n<td>1.063<\/td>\n<\/tr>\n<tr>\n<td>Freebase<\/td>\n<td>\u00a0\u00a0 0<\/td>\n<td>0<\/td>\n<td>0<\/td>\n<td>1<\/td>\n<\/tr>\n<tr>\n<td>Probase<\/td>\n<td>\u00a0\u00a0 4,539,176<\/td>\n<td>7.53<\/td>\n<td>2.33<\/td>\n<td>1.086<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p align=\"center\"><b><span id=\"456f1bfb-cf3b-4f3d-b1bb-1da6d71c0877\" class=\"ImageBlock fn\"><img decoding=\"async\" id=\"Image456f1bfb-cf3b-4f3d-b1bb-1da6d71c0877\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/probase-conceptsizedistribution.png\" alt=\"\" \/><span id=\"ImageCaption456f1bfb-cf3b-4f3d-b1bb-1da6d71c0877\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/b><\/p>\n<p 
align=\"center\"><b>Figure 4: Concept size distributions in Probase and Freebase<\/b><\/p>\n<p>We also compare Probase and Freebase on the concept-instance relationships. We choose Freebase since it is the only existing taxonomy with a comparable scale in instance space (<b>24,483,434<\/b> concept-instance pairs, see <b>Table 1<\/b>). We define <i>concept size<\/i> to be the number of instances directly under a concept node. <b>Figure 4<\/b> (<i>logarithmic scale on the Y-axis<\/i>) compares distributions of concept sizes in Probase and Freebase. While Freebase focuses on a few very popular concepts like <i>track<\/i> and <i>book<\/i> which include over two million instances, Probase has many more medium- to small-sized concepts. In fact, the top 10 concepts in Freebase contain <b>17,174,891<\/b> concept-instance pairs, or <b>70%<\/b> of all the pairs it has. In contrast, the top 10 concepts in Probase contain only <b>727,136<\/b> pairs, or <b>4.5%<\/b> of its total. Therefore, Probase provides a much <i>broader<\/i> coverage on <i>diverse<\/i> topics, while Freebase is more informative on specific topics. 
On the other hand, the instances of large concepts in Freebase like <i>book<\/i> are mostly from specific websites like Amazon, which could be easily merged into Probase using the integration framework we proposed.<\/p>\n<p align=\"center\"><b><span id=\"f8970a4f-b1a8-45f1-a6f6-3be8c2c20850\" class=\"ImageBlock fn\"><img decoding=\"async\" id=\"Imagef8970a4f-b1a8-45f1-a6f6-3be8c2c20850\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/probase-comparetofreebase.png\" alt=\"\" \/><span id=\"ImageCaptionf8970a4f-b1a8-45f1-a6f6-3be8c2c20850\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/b><\/p>\n<p align=\"center\"><b>Figure 5: Relative size of sample concepts to Freebase<\/b><\/p>\n<p>Moreover, due to the transitive nature of the <i>isA<\/i> relationship, we can assume that all instances in a subconcept conceptually also belong to its superconcept. Freebase, however, lacks such concept-subconcept information and the pairs it contains are merely concept-instance pairs. So if we propagate all instances up through the taxonomy, the number of instances in each concept becomes much larger. We take all the distinct classes in Probase that also exist in Freebase and divide them into 7 groups by their Freebase sizes (the number of instances in each Freebase class). <b>Figure 5<\/b> depicts the relative sizes of Probase classes in these 7 groups, which indicates that Probase clearly contains more instances in medium to smaller classes, but slightly fewer instances for very large and popular classes.<\/p>\n<p>To estimate the correctness of the <i>isA<\/i> pairs within Probase, we create a <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/haixun.olidu.com\/probase\/benchmark.htm\">benchmark<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> dataset containing <b>40<\/b> concepts in various domains. 
The concept size varies from <b>21<\/b> instances (for <b><i>aircraft model<\/i><\/b>) to <b>85,391<\/b> (for <b><i>company<\/i><\/b>), with a median of <b>917<\/b>. Benchmarks with a similar number of concepts and domain coverage have also been reported in previous information extraction research. For each concept, we randomly pick up to 50 instances\/subconcepts and ask human judges to evaluate their correctness, and hence the precision of the extraction algorithm. <b>Figure 6<\/b> shows the result. The average precision of all pairs in the benchmark is <b>92.8%<\/b>, which exceeds the precision of frameworks like KnowItAll (<b>64%<\/b> on average), NELL (<b>74%<\/b>) and TextRunner (<b>80%<\/b> on average). It is not fair to directly compare our results with Wikipedia-based frameworks like WikiTaxonomy (<b>86%<\/b> precision) and YAGO (<b>95%<\/b> precision), whose data sources are much cleaner. Nevertheless, only YAGO has a better overall precision than Probase.<\/p>\n<p align=\"center\"><b><span id=\"be18f997-4a6e-447b-88ed-4b1f22a86187\" class=\"ImageBlock fn\"><img decoding=\"async\" id=\"Imagebe18f997-4a6e-447b-88ed-4b1f22a86187\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/probase-precision.png\" alt=\"\" \/><span id=\"ImageCaptionbe18f997-4a6e-447b-88ed-4b1f22a86187\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/b><\/p>\n<p align=\"center\"><b>Figure 6: Precision of extracted pairs<\/b><\/p>\n<p align=\"center\"><b>Table 3: Precisions of Probase and KnowItAll<\/b><\/p>\n<table class=\" borderColumns borderRows tableBorder\" width=\"300\" cellspacing=\"0\" 
cellpadding=\"0\">\n<tbody>\n<tr>\n<td><b><i>concept<\/i><\/b><\/td>\n<td><b><i>Probase<\/i><\/b><\/td>\n<td><b><i>KnowItAll<\/i><\/b><\/td>\n<\/tr>\n<tr>\n<td>actor<\/td>\n<td>0.96<\/td>\n<td>0.69<\/td>\n<\/tr>\n<tr>\n<td>city<\/td>\n<td>0.84<\/td>\n<td>0.71<\/td>\n<\/tr>\n<tr>\n<td>film<\/td>\n<td>0.88<\/td>\n<td>0.49<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>As a detailed case study, since KnowItAll also used Hearst&#8217;s patterns to extract <i>isA<\/i> relationships, we compare our precision with that of KnowItAll on <i>actor<\/i>, <i>city<\/i> and <i>film<\/i>, three concepts that are common to both systems. <b>Table 3<\/b> shows that Probase has a notable advantage in <i>isA<\/i> extraction precision over KnowItAll.<\/p>\n<h2>Taxonomy Scoring<\/h2>\n<h4>Consensus Score<\/h4>\n<p align=\"center\"><b><span id=\"a3f33455-c444-45cb-b2ca-efa8c742ef34\" class=\"ImageBlock fn\"><img decoding=\"async\" id=\"Imagea3f33455-c444-45cb-b2ca-efa8c742ef34\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/probase-consensus.png\" alt=\"\" \/><span id=\"ImageCaptiona3f33455-c444-45cb-b2ca-efa8c742ef34\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/b><\/p>\n<p align=\"center\"><b>Figure 7: Consensus score vs. the actual percentage of true claims<\/b><\/p>\n<p>We experimented with the simple model in Equation (1) for computing the <b><i>consensus<\/i><\/b> of a claim using the benchmark concepts. We expect the consensus scores to be approximately equal to the actual percentage of true claims as the amount of evidence grows. This is verified in <b>Figure 7<\/b>. The average consensus scores match the actual percentage of true claims (checked by human judges) quite well, except when there is only one piece of evidence. 
<b>Figure 7<\/b> has an uneven scale on the x-axis because the frequency distribution of claims in Probase has a long tail.<\/p>\n<h4>Typicality Score<\/h4>\n<p align=\"center\"><b><span id=\"cd47e771-5ec3-4c30-9d96-b4637811e59c\" class=\"ImageBlock fn\"><img decoding=\"async\" id=\"Imagecd47e771-5ec3-4c30-9d96-b4637811e59c\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/probase-typicality.png\" alt=\"\" \/><span id=\"ImageCaptioncd47e771-5ec3-4c30-9d96-b4637811e59c\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/b><\/p>\n<p align=\"center\"><b>Figure 8: Typicality evaluation results<\/b><\/p>\n<p>Please click the following links to view representative instances\/child concepts for classes within the benchmark.<\/p>\n<ol>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/haixun.olidu.com\/probase\/instanceImportance.htm\">Top 50 instances for selected concepts in the benchmark, ordered by decreasing typicality scores.<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<li><a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/haixun.olidu.com\/probase\/childImportance.htm\">Top 20 subconcepts for selected concepts within the benchmark, ordered by decreasing typicality scores.<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<\/ol>\n<p align=\"left\">The typicality score a subconcept <b><i>x<\/i><\/b> of concept <b><i>c<\/i><\/b> receives is computed with the formula:<\/p>\n<p align=\"center\"><span id=\"ec3990a8-2dc8-4b52-873e-3bc2c39467fe\" class=\"ImageBlock fn\"><img decoding=\"async\" id=\"Imageec3990a8-2dc8-4b52-873e-3bc2c39467fe\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/probase-formula.jpg\" alt=\"\" \/><span 
id=\"ImageCaptionec3990a8-2dc8-4b52-873e-3bc2c39467fe\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/p>\n<p align=\"left\">Since subconcepts are actually groups of instances of their parent concept, intuitively, if a subconcept contains more representative instances of the parent concept, then its typicality score should be boosted.<\/p>\n<p>We further conduct a user study for this relatively subjective measure. First, we pick 10 concepts, and the top 50 instances for each concept, according to their typicality scores. Then, we invite 4 users to manually score the typicality of the instances (with order shuffled) in their respective concepts, as 3 (very representative), 2 (correct but not very representative), 1 (unknown), and 0 (incorrect).<\/p>\n<p align=\"center\"><b><span id=\"3e32d0dc-4614-4860-a085-ef36ab1c5b28\" class=\"ImageBlock fn\"><img decoding=\"async\" id=\"Image3e32d0dc-4614-4860-a085-ef36ab1c5b28\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/probase-similarity.png\" alt=\"\" \/><span id=\"ImageCaption3e32d0dc-4614-4860-a085-ef36ab1c5b28\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/b><\/p>\n<p align=\"center\"><b>Figure 9: Similarity evaluation results<\/b><\/p>\n<p>We divide the 50 instances of each concept into 5 groups by their typicality score ranks (<i>i.e.<\/i> top 10 instances from each concept go to Group 1, second 10 instances from each concept go to Group 2, and so on), and then compute the average judge scores assigned to instances within each group. <b>Figure 8<\/b> shows that the typicality of the instances in their classes, as perceived by human judges, decreases with computed typicality scores, which means our definition of the typicality score is sensible.<\/p>\n<h4>Similarity Score<\/h4>\n<p>For each concept (which we call host concept) in the benchmark, we pick the five most similar concepts and rank them according to the similarity scores. 
We thereby form 5 groups of similar concepts in decreasing order of the scores. We then ask the judges to manually rate their proximity to the host concepts on a normalized scale from 0 (least similar) to 1 (most similar). We average these judge scores in each of the 5 groups and present the results in <b>Figure 9<\/b>, which indicates that the similarity scores match the judges&#8217; perceptions.<\/p>\n<h2>ProBase Browser<\/h2>\n<p>We have developed a browser that could be used to inspect the taxonomy we built. This page gives some snapshots of it. Example paths within the taxonomy include:<\/p>\n<ul>\n<li>animals \u2192 livestock \u2192 sheep &#8230;<\/li>\n<li>animals \u2192 reptiles \u2192 snakes &#8230;<\/li>\n<li>plants \u2192 trees \u2192 maples &#8230;<\/li>\n<li>artists \u2192 musicians \u2192 guitarists &#8230;<\/li>\n<li>places \u2192 restaurants \u2192 fast foods &#8230;<\/li>\n<li>foods \u2192 vegetables \u2192 cabbages &#8230;<\/li>\n<li>machines \u2192 computers \u2192 laptops &#8230;<\/li>\n<li>novels \u2192 mysteries \u2192 ghosts &#8230;<\/li>\n<li>sports \u2192 footballs &#8230;<\/li>\n<li>celebrities \u2192 athletes \u2192 gymnasts &#8230;<\/li>\n<\/ul>\n<h2>Snapshots<\/h2>\n<p align=\"left\"><strong>Concept search:<\/strong><\/p>\n<p align=\"left\"><span id=\"d256f951-afc8-4102-9497-6f5edc454c84\" class=\"ImageBlock fn\"><img decoding=\"async\" id=\"Imaged256f951-afc8-4102-9497-6f5edc454c84\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/probase-viewer_concept.png\" alt=\"\" \/><span id=\"ImageCaptiond256f951-afc8-4102-9497-6f5edc454c84\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/p>\n<p align=\"left\"><strong>Instance search:<\/strong><\/p>\n<p align=\"left\"><span id=\"66d6e820-ccb1-45ef-a1b5-328152fa1405\" class=\"ImageBlock fn\"><img decoding=\"async\" id=\"Image66d6e820-ccb1-45ef-a1b5-328152fa1405\" 
src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/probase-viewer_instance.png\" alt=\"\" \/><span id=\"ImageCaption66d6e820-ccb1-45ef-a1b5-328152fa1405\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/p>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/li>\n\t\t<li class=\"m-0\" data-wp-context='{\"id\":\"accordion-content-4\"}' data-wp-init=\"callbacks.init\">\n\t\t<div class=\"accordion-header\">\n\t\t\t<button\n\t\t\t\taria-controls=\"accordion-content-4\"\n\t\t\t\tclass=\"btn btn-collapse\"\n\t\t\t\tdata-wp-bind--aria-expanded=\"state.isExpanded\"\n\t\t\t\tdata-wp-on--click=\"actions.onClick\"\n\t\t\t\tid=\"accordion-button-3\"\n\t\t\t\ttype=\"button\"\n\t\t\t>\n\t\t\t\tApplications\t\t\t<\/button>\n\t\t<\/div>\n\t\t<div\n\t\t\taria-labelledby=\"accordion-button-3\"\n\t\t\tclass=\"msr-accordion__content\"\n\t\t\tdata-wp-bind--inert=\"!state.isExpanded\"\n\t\t\tdata-wp-run=\"callbacks.run\"\n\t\t\tid=\"accordion-content-4\"\n\t\t>\n\t\t\t<div class=\"msr-accordion__body\">\n\t\t\t\t<h3>ProBase Browser<\/h3>\n<p><span id=\"99e25e2e-d549-4e19-8b84-9a4bc0efb4bb\" class=\"ImageBlock fl\"><img decoding=\"async\" id=\"Image99e25e2e-d549-4e19-8b84-9a4bc0efb4bb\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/probase-viewer_small.png\" alt=\"\" \/><span id=\"ImageCaption99e25e2e-d549-4e19-8b84-9a4bc0efb4bb\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/p>\n<p>Probase Viewer shows the backbone of our taxonomy, such as class-subclass, class-instance and class-attribute relationships. 
It also shows the results of incorporating external knowledge bases, such as Freebase.<\/p>\n<h3>Topic Search<\/h3>\n<p><span id=\"afb023ff-01d6-449d-b2a9-1073611c76bd\" class=\"ImageBlock fl\"><img decoding=\"async\" id=\"Imageafb023ff-01d6-449d-b2a9-1073611c76bd\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/probase-topic_search_small.png\" alt=\"\" \/><span id=\"ImageCaptionafb023ff-01d6-449d-b2a9-1073611c76bd\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/p>\n<p>We present a framework that improves the web search experience through the use of\u00a0Probase. The framework classifies web queries into different patterns according to the concepts and entities, in addition to the keywords, contained in these queries.<\/p>\n<h3><\/h3>\n<h3>Web Tables<\/h3>\n<p><span id=\"5293e03d-e143-4210-ba0c-ef0d61655223\" class=\"ImageBlock fl\"><img decoding=\"async\" id=\"Image5293e03d-e143-4210-ba0c-ef0d61655223\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/probase-table_search_small.png\" alt=\"\" \/><span id=\"ImageCaption5293e03d-e143-4210-ba0c-ef0d61655223\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/p>\n<p>We use\u00a0Probase to help us interpret and understand tables, which\u00a0unlocks the wealth of information hidden in web tables. 
We then build a semantic search engine over tables to demonstrate how much valuable information can be found in web tables, and how structured data can empower information retrieval on the Web.<\/p>\n<h3><\/h3>\n<h3>Conceptualization<\/h3>\n<p><span id=\"ca13fbbe-a906-43b6-ac4d-aa0081d4bffe\" class=\"ImageBlock fl\"><img decoding=\"async\" id=\"Imageca13fbbe-a906-43b6-ac4d-aa0081d4bffe\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/probase-abstraction_small.png\" alt=\"\" \/><span id=\"ImageCaptionca13fbbe-a906-43b6-ac4d-aa0081d4bffe\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/p>\n<p>Probase enables machines to conceptualize from a set of words or short text\u00a0by performing Bayesian analysis based on typicality and other probabilities. We have developed several applications to show that, with Probase, machines can \u201cthink\u201d like human beings.<\/p>\n<h3><\/h3>\n<h3>Probase Simple Web Interface<\/h3>\n<p><span id=\"a5989aaa-341c-47de-9c19-a167e0b1a586\" class=\"ImageBlock fl\"><img decoding=\"async\" id=\"Imagea5989aaa-341c-47de-9c19-a167e0b1a586\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/02\/probase-interface_small.png\" alt=\"\" \/><span id=\"ImageCaptiona5989aaa-341c-47de-9c19-a167e0b1a586\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/p>\n<p>For users who want to try our release data, this interface integrates basic data access functions and conceptualization functions based on the Probase release\u00a0package. 
Users can easily query our knowledge base and get our results without any programming.<span id=\"c97b486e-e8f8-438b-adde-6bf9a444a7c5\" class=\"ImageBlock fn\"><span id=\"ImageCaptionc97b486e-e8f8-438b-adde-6bf9a444a7c5\" class=\"ImageCaptionCoreCss ImageCaption\"><\/span><\/span><\/p>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/li>\n\t\t\t\t\t\t<\/ul>\n\t<\/div>\n\t\n","protected":false},"excerpt":{"rendered":"<p>The goal of Probase is to make machines \u201caware\u201d of the mental world of human beings, so that\u00a0machines can better understand human communication. We do this by\u00a0giving certain general knowledge or certain common sense to machines.<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13555],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-170584","msr-project","type-msr-project","status-publish","hentry","msr-research-area-search-information-retrieval","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"2010-10-29","related-publications":[164660,295037,238196,168676,168675,168351,167617,167616,165731,165269,161544,164576,164575,164524,163960,162965,162634,162337,162336,161604],"related-downloads":[],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[],"related-researchers":[],"msr_research_lab":[199560],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/170584","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":6,"href":"https:\/\/www.
microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/170584\/revisions"}],"predecessor-version":[{"id":388625,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/170584\/revisions\/388625"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=170584"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=170584"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=170584"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=170584"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=170584"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}