{"id":564801,"date":"2019-02-08T08:00:53","date_gmt":"2019-02-08T16:00:53","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&#038;p=564801"},"modified":"2019-03-14T20:45:48","modified_gmt":"2019-03-15T03:45:48","slug":"cost-of-tracking-research-trends-and-impacts-with-microsoft-academic-graph","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/cost-of-tracking-research-trends-and-impacts-with-microsoft-academic-graph\/","title":{"rendered":"Cost of tracking research trends and impacts with Microsoft Academic Graph"},"content":{"rendered":"<p>Microsoft Academic Graph (MAG) explores how we can harness the power of modern machine cognition capabilities to capture the accumulated knowledge as recorded in the academic publications and patent filings of the last two hundred years. Since MAG launched in late 2015, <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/academic.microsoft.com\/#\/search?iq=And(Ty%3D'0'%2CRId%3D1932742904)&q=papers%20citing%20an%20overview%20of%20microsoft%20academic%20service%20mas%20and%20applications&filters=&from=0&sort=0\" target=\"_blank\" rel=\"noopener noreferrer\">more than 170 research projects and systems<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> have been able to use this rich knowledge base to try out new ideas and release new products. There are two major ways of getting a copy of MAG. 
First, MAG is a major component of the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/www.openacademic.ai\/oag\/\" target=\"_blank\" rel=\"noopener noreferrer\">Open Academic Graph<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> developed by Tsinghua University, where advanced entity conflation algorithms have been applied to merge the knowledge in MAG and Tsinghua\u2019s <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/aminer.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">AMiner<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> graph. This latest version of the Open Academic Graph uses a snapshot of MAG taken in late November 2018, and we\u2019re thrilled to make it available as a free download.<\/p>\n<p>Research never stops, however. More than 1 million publications are being added to MAG each month (more on this later). If your project needs to access the most current data and cannot wait for the annual updates of Open Academic Graph, we are also making a more frequently updated MAG available through Azure cloud replication, as described in the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/docs.microsoft.com\/en-us\/academic-services\/graph\/\" target=\"_blank\" rel=\"noopener noreferrer\">MAG online document<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>. This online version enables users to conduct their research directly in the cloud and leverage the big data analytics Azure offers. We have included a link to the Azure price calculator in the document so that you can estimate the storage and networking costs that a project will incur on Azure.<\/p>\n<p>While the Azure prices on storage, computation, and read\/write operations are clearly listed, sometimes it\u2019s not a trivial task to estimate charges. 
The bill depends on how much cloud resource a typical analytical project consumes and on the degree of parallelism you specify to execute the task. We hope the following two examples give you a better reference point for estimating the Azure charges you may encounter.<\/p>\n<p>First, let\u2019s consider a very rudimentary \u201cHello World\u201d type of task that analyzes the growth in scholarly communication activities across the years covered by MAG. The task is rather straightforward: we just have to go over all of the publications and tabulate them based on their year of publication. Using the built-in Azure Data Lake Analytics language called <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/data-lake-analytics\/data-lake-analytics-u-sql-get-started\" target=\"_blank\" rel=\"noopener noreferrer\">U-SQL<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, the core of the analytic script amounts to two statements:<\/p>\n<pre>@paperCitationCounts =\r\n     SELECT\r\n          Year,\r\n          COUNT(*) AS PaperCount,\r\n          SUM(EstimatedCitation) AS CitationCount\r\n     FROM @magPapers\r\n     GROUP BY Year;\r\n\r\nOUTPUT @paperCitationCounts\r\n     TO @outStream\r\n     ORDER BY Year ASC\r\n     USING Outputters.Csv(quoting : false);\r\n<\/pre>\n<p>The first statement uses a variable <code>@magPapers<\/code> to refer to the Papers.Txt data stream in MAG that contains all the information for each publication. 
If you have first run the <code>CreateFunction.usql<\/code> script that comes with your MAG distribution on each snapshot, you can easily specify the stream by using the single statement:<\/p>\n<pre>@magPapers = Papers(@uriPrefix);<\/pre>\n<p>with <code>@uriPrefix<\/code> being &#8220;wasb:\/\/&#8221; + <code>@dataVersion<\/code> + &#8220;<code>@<\/code>&#8221; + <code>@blobAccount<\/code> + &#8220;\/&#8221;, where <code>@dataVersion<\/code> is the name of the MAG snapshot you\u2019d like to use (for example, the November 29, 2018 snapshot of MAG, also used in the Open Academic Graph, is &#8220;mag-2018-11-29&#8221;), and <code>@blobAccount<\/code> is the Azure Blob Storage account to which you have asked us to send the MAG data. Using the default five degrees of parallelism (or in Azure\u2019s parlance, using 5 AUs), the script takes 4 minutes and 17 seconds to finish, for a cost of U.S. $.38:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-564831\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/01\/status-succeeded.png\" alt=\"\" width=\"327\" height=\"278\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/01\/status-succeeded.png 327w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/01\/status-succeeded-300x255.png 300w\" sizes=\"auto, (max-width: 327px) 100vw, 327px\" \/><\/p>\n<p>The detailed cloud resource consumption as reported in Azure portal is:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-564858\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/01\/detailed-cloud-resource-consumption.png\" alt=\"\" width=\"327\" height=\"631\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/01\/detailed-cloud-resource-consumption.png 568w, 
https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/01\/detailed-cloud-resource-consumption-155x300.png 155w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/01\/detailed-cloud-resource-consumption-531x1024.png 531w\" sizes=\"auto, (max-width: 327px) 100vw, 327px\" \/><\/p>\n<p>Obviously, you can use fewer AUs and wait longer to bring the cost below 38 cents, but we think 38 cents is quite a bargain, because the data output from the script lets you generate a chart like the following one:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-564861\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/01\/publication-year.png\" alt=\"\" width=\"1175\" height=\"741\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/01\/publication-year.png 1175w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/01\/publication-year-300x189.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/01\/publication-year-768x484.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/01\/publication-year-1024x646.png 1024w\" sizes=\"auto, (max-width: 1175px) 100vw, 1175px\" \/><\/p>\n<p>Here, we can see that the academic community has been on an exponential growth path (note that the vertical axis is on a logarithmic scale!), both in terms of publications and citations, for the past two hundred years. Neither the pace nor the rate of growth shows any sign of slowing. The only periods in which we see dips are the two world wars, and publication growth appears flat during the Great Depression in the 1930s. On average, recent papers receive more citations than older ones, judging by the widening separation between the two curves. 
The total citation count begins to decline after 2005, suggesting that an average paper takes more than a decade for its due citations to be properly observed and included in MAG.<\/p>\n<p>Tabulating the publication and citation counts over the years is indeed really simple. How about another common task that our users frequently ask about, where they would like to track a list of authors by their impact in the research community and compute their <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/en.wikipedia.org\/wiki\/H-index\" target=\"_blank\" rel=\"noopener noreferrer\">h-index<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>? This task is slightly more complex than the previous one in that we first have to look up the authors from the list in MAG, then retrieve all the publications each author has in order to compute their impact metric. The script involves a few <code>INNER JOIN<\/code> commands in U-SQL for looking up data from MAG, and to compute the h-index, we have to create a citation histogram for each author. 
The core logic can be realized as follows:<\/p>\n<pre>@targetAuthors =\r\n     EXTRACT\r\n          AuthorName string\r\n     FROM @targetList\r\n     USING Extractors.Tsv();\r\n\r\n\/\/ Lookup Author IDs for each name\r\n@targetAuthorsWithIDs =\r\n     SELECT\r\n          A.AuthorId,\r\n          T.AuthorName\r\n     FROM @targetAuthors AS T\r\n     INNER JOIN @authors AS A\r\n          ON T.AuthorName == A.DisplayName; \/\/ Can alternatively use C# string match functions\r\n\r\n@affiliationAuthorPapers =\r\n     SELECT DISTINCT \/\/ Condense author with multiple affiliations into a single row\r\n          A.AuthorId,\r\n          A.AuthorName,\r\n          T.PaperId\r\n     FROM @targetAuthorsWithIDs AS A\r\n     INNER JOIN @paperAuthorAffiliation AS T\r\n          ON T.AuthorId == A.AuthorId;\r\n\r\n\/\/ Get citation count from Papers table\r\n@authorPaperCitation =\r\n     SELECT\r\n          A.AuthorName,\r\n          A.PaperId,\r\n          P.EstimatedCitation,\r\n          P.Rank\r\n     FROM @affiliationAuthorPapers AS A\r\n     INNER JOIN @papers AS P\r\n          ON A.PaperId == P.PaperId;\r\n\r\n\/\/ Compute Paper Rank using citation count\r\n@authorPaperRankByCitation =\r\n     SELECT\r\n          PaperId,\r\n          AuthorName,\r\n          EstimatedCitation,\r\n          Rank,\r\n          ROW_NUMBER() OVER(PARTITION BY AuthorName ORDER BY EstimatedCitation DESC) AS PaperRank\r\n     FROM @authorPaperCitation;\r\n\r\n\/\/ Compute h-index and total citation count\r\n@authorHIndex =\r\n     SELECT\r\n          AuthorName,\r\n          COUNT(*) AS PaperCount,\r\n          SUM(EstimatedCitation) AS CitationCount,\r\n          MAX((EstimatedCitation >= PaperRank) ? 
PaperRank : 0) AS Hindex,\r\n          SUM(Math.Exp(-1.0*Rank\/1000)) AS Saliency\r\n     FROM @authorPaperRankByCitation\r\n     GROUP BY AuthorName;\r\n\r\nOUTPUT @authorHIndex\r\n     TO @outAuthorHIndex\r\n     ORDER BY Saliency DESC, AuthorName ASC, Hindex DESC\r\n     USING Outputters.Tsv(quoting : false);<\/pre>\n<p>Again, the three MAG streams are specified with:<\/p>\n<pre>@papers = Papers(@uriPrefix);\r\n@authors = Authors(@uriPrefix);\r\n@paperAuthorAffiliation = PaperAuthorAffiliations(@uriPrefix);<\/pre>\n<p>Here, the script assumes that the target list has first been uploaded to your Azure Data Lake and that it contains the full names (that is, \u201c<code>DisplayName<\/code>\u201d in MAG) of the authors to be tracked. If some authors have a common name that is easily confused with others, the target list can optionally include the AuthorId, which makes the second statement (the lookup of IDs by name) unnecessary. The script also demonstrates how to compute a measure called saliency, with which Microsoft Academic ranks the search results by default. Saliency is a measure developed to avoid many known pitfalls of the h-index. For example, if an author has published only ten papers, they can, by design, never have an h-index greater than 10 regardless of how much impact and how many citations these ten papers have received, making the record look like the work of a novice researcher. This problem is deftly avoided in saliency, and the results from the above script can be used to study this issue. A research paper describing saliency has been submitted for review. Please watch for it if you are interested in the details.<\/p>\n<p>With these additional lookups and computation, the task can still be finished in 11 minutes and 14 seconds with a charge of U.S. 
$1.34:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-564864\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/01\/89.png\" alt=\"\" width=\"329\" height=\"272\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/01\/89.png 329w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/01\/89-300x248.png 300w\" sizes=\"auto, (max-width: 329px) 100vw, 329px\" \/><\/p>\n<p>Interestingly, this charge is not that different from what we would have paid had we just computed the h-index for everyone in MAG. This can be done by removing the first three statements, which look up a specific group of authors and restrict the count to their papers. Not surprisingly, the task completes faster (9 minutes 40 seconds) and costs less, at U.S. $1.15:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-564867\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/01\/76.png\" alt=\"\" width=\"326\" height=\"273\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/01\/76.png 326w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/01\/76-300x251.png 300w\" sizes=\"auto, (max-width: 326px) 100vw, 326px\" \/><\/p>\n<p>In other words, the cloud is rather efficient in dealing with massive amounts of data, and the costs of computing various indicators from MAG are often much lower than the price of a cup of coffee.<\/p>\n<p>Happy researching!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Microsoft Academic Graph (MAG) explores how we can harness the power of modern machine cognition capabilities to capture the accumulated knowledge as recorded in the academic publications and patent filings of the last two hundred years. 
Since MAG launched in late 2015, more than 170 research projects and systems have been able to use this rich knowledge base to try out new ideas and release new products.<\/p>\n","protected":false},"author":39507,"featured_media":564876,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":170262,"msr_hide_image_in_river":0,"footnotes":""},"research-area":[],"msr-locale":[268875],"msr-post-option":[],"class_list":["post-564801","msr-blog-post","type-msr-blog-post","status-publish","has-post-thumbnail","hentry","msr-locale-en_us"],"msr_assoc_parent":{"id":170262,"type":"project"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/564801","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/39507"}],"version-history":[{"count":10,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/564801\/revisions"}],"predecessor-version":[{"id":573528,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/564801\/revisions\/573528"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/564876"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=564801"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=564801"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr
-locale?post=564801"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=564801"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}