GOTO ranking made affordable with MAG
We live in a modern society where performance is quantified to inform our actions and decisions. This is true in business, where companies are measured by their revenues, earnings, and growth. It is also true in our daily lives, where fans obsess over the stats of star athletes and sports teams, and we check online reviews and “likes” before trying a new restaurant or making a purchase. Naturally, the academic world is also focused on measurement. Some academic metrics are relatively low-stakes, such as those used by university librarians to decide which journals and magazines to subscribe to. Others, however, are consequential, such as those used to decide whether a faculty member should earn tenure or which research areas should see a budget increase (or cut!).

Scientists have treated the question of designing and applying proper metrics to characterize scientific activities with the same rigor they would bring to any other topic. Unfortunately, many commercial products and services do not employ as rigorous a standard as the scientific community when they try to measure academic effectiveness. For example, the faculty of Rutgers University raised serious disputes with their administration, which had spent $500,000 on a data mining firm that produced faulty reports on their research outputs, and the Computing Research Association (CRA) recently issued a strongly worded rebuke of the university rankings put forth by U.S. News and World Report. One recurring theme in these incidents is that commercial entities often underestimate the effort it takes to gather the high-quality data on which their analyses rest. Moreover, the methods and algorithms they use to derive analytics from their proprietary datasets are often shrouded in secrecy, making it difficult to understand, and to have confidence in, the results, especially when users can spot obvious errors.
It is against this backdrop that a task force at the last CRA meeting proposed a principle that should be the bedrock of any academic assessment. It is playfully called GOTO, which stands for Good and Open data with Transparent and Objective processes and methodologies. A viewpoint article on this subject was published in the July 2019 issue of Communications of the ACM, and two proof-of-concept websites implementing the GOTO principle to assess universities based on their computer science programs have been created (gotorankings.org).
For Microsoft, the four tenets of the GOTO principle are second nature, as reflected in the Microsoft Academic project we started five years ago. To ensure we have good data, we teamed up with our partners in Bing to apply best-in-class machine reading technologies to process the entire web index and produce the Microsoft Academic Graph (MAG). MAG provides fresh, accurate, and comprehensive coverage of scholarly communications. Peer-reviewed studies published in Scientometrics and the Journal of Informetrics suggest we are on the right track: although curated from the web by machines, MAG compares favorably to datasets created directly from publishers. To promote open collaboration, we distribute MAG under the Open Data Commons Attribution license, or ODC-BY, which permits mining, redistributing, and improving the dataset, including building commercial products and services on top of it. To promote transparent and objective analytics, we have published our source code (some in U-SQL for Azure Data Lake, some in Python for Spark) on GitHub. Links to these resources are included in the MAG documentation. We hope that, by publishing these scripts, the results shown on the Microsoft Academic website can be precisely understood, reproduced, and even adapted to other purposes.
To drive the point home, let us take a look at the issue of university rankings that prompted CRA’s blistering critique. The script shown in last week’s blog can be slightly modified to evaluate institutions rather than researchers. Since most commercial reports rank universities by high-level fields of study (for example, “best computer science/business schools”), we also analyze institutions based on the 18 top-level fields of study that MAG contains. The U-SQL code is as follows:
@affiliationPaperCitation =
    SELECT DISTINCT // authors may share the same affiliation
           (long) A.AffiliationId AS AffiliationId,
           A.PaperId,
           Q.FieldOfStudyId,
           R.DisplayName AS FosName,
           P.EstimatedCitation,
           P.Rank
    FROM @paperAuthorAffiliation AS A
         INNER JOIN @papers AS P ON A.PaperId == P.PaperId
         INNER JOIN @paperFos AS Q ON A.PaperId == Q.PaperId
         INNER JOIN @fos AS R ON Q.FieldOfStudyId == R.FieldOfStudyId
    WHERE A.AffiliationId != null // only consider known affiliations
      AND P.Year > 2008           // consider only past 10 years of publications
      AND R.Level == 0;           // consider only top level fields

//
// Compute Paper Rank using EstimatedCitation
//
@affiliationPaperRankByCitation =
    SELECT PaperId,
           AffiliationId,
           FieldOfStudyId,
           FosName,
           EstimatedCitation,
           Rank,
           ROW_NUMBER() OVER(PARTITION BY AffiliationId, FieldOfStudyId
                             ORDER BY EstimatedCitation DESC) AS PaperRank
    FROM @affiliationPaperCitation;

//
// Compute HIndex, Saliency and total citation count
//
@affiliationHIndex =
    SELECT AffiliationId,
           FieldOfStudyId,
           ANY_VALUE(FosName) AS FosName,
           COUNT(*) AS PaperCount,
           SUM(EstimatedCitation) AS CitationCount,
           MAX((EstimatedCitation >= PaperRank) ? PaperRank : 0) AS Hindex,
           SUM(Math.Exp(-1.0 * Rank / 1000)) AS Saliency
    FROM @affiliationPaperRankByCitation
    GROUP BY AffiliationId, FieldOfStudyId;

//
// Look up the Affiliation Name
//
@affiliationStats =
    SELECT A.DisplayName AS AffiliationName,
           FosName,
           H.PaperCount,
           H.Saliency,
           H.CitationCount,
           H.Hindex
    FROM @affiliationHIndex AS H
         INNER JOIN @affiliations AS A ON H.AffiliationId == A.AffiliationId;

OUTPUT @affiliationStats
TO @outStream
ORDER BY FosName ASC, Saliency DESC, AffiliationName ASC, Hindex DESC
USING Outputters.Tsv(quoting : false);
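For readers who prefer Python, the core aggregation above can be expressed in a few lines of plain Python. The following is a minimal sketch, not our published Spark code: the `field_stats` function and its tuple layout are hypothetical stand-ins for the joined MAG rows, but the h-index and saliency formulas mirror the U-SQL exactly.

```python
import math
from collections import defaultdict

def field_stats(papers):
    """Compute per-(affiliation, field) paper count, citation total,
    h-index, and saliency, mirroring the U-SQL aggregation.

    `papers` is an iterable of (affiliation, field, estimated_citation, rank)
    tuples -- a toy stand-in for the joined MAG rows."""
    groups = defaultdict(list)
    for aff, fos, citation, rank in papers:
        groups[(aff, fos)].append((citation, rank))
    stats = {}
    for key, rows in groups.items():
        # ROW_NUMBER() OVER (... ORDER BY EstimatedCitation DESC)
        rows.sort(key=lambda r: r[0], reverse=True)
        h = 0
        for i, (citation, _) in enumerate(rows, start=1):
            if citation >= i:
                h = i  # largest i such that i papers have >= i citations
        stats[key] = {
            "PaperCount": len(rows),
            "CitationCount": sum(c for c, _ in rows),
            "HIndex": h,
            # saliency decays exponentially with MAG's Rank attribute
            "Saliency": sum(math.exp(-1.0 * r / 1000) for _, r in rows),
        }
    return stats
```

The grouping step plays the role of `GROUP BY AffiliationId, FieldOfStudyId`, and the in-group sort replaces the `ROW_NUMBER()` window function.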
Again, as described in last week’s post, we first executed CreateFunctions.usql to simplify access to MAG:
@paperFos = PaperFieldsOfStudy(@uriPrefix);
@fos = FieldsOfStudy(@uriPrefix);
@papers = Papers(@uriPrefix);
@paperAuthorAffiliation = PaperAuthorAffiliations(@uriPrefix);
@affiliations = Affiliations(@uriPrefix);
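In Python, the equivalent of these view functions is simply loading each MAG table once and reusing it. A minimal sketch with the standard csv module follows; note that MAG tables ship as headerless tab-separated files, and the `load_table` helper and the two-column subset shown here are illustrative assumptions, not MAG's full schema.

```python
import csv
import io

def load_table(tsv_text, columns):
    """Parse a headerless MAG-style TSV stream into a list of dicts.

    MAG distributes its tables as tab-separated files without header
    rows, so the caller supplies the column names. `columns` here is a
    hypothetical subset for illustration."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(zip(columns, row)) for row in reader]

# Toy stand-in for a slice of the Affiliations stream.
affiliations = load_table("1\tMIT\n2\tStanford\n",
                          ["AffiliationId", "DisplayName"])
```

In practice one would read from the Azure storage files directly (or, in Spark, register each table as a DataFrame), but the mapping from stream to table is the same.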
The script generates the results shown on the Microsoft Academic Institution Analytics page, where institutional rankings can be viewed with various metrics and over several time periods. Note that the above script zooms in on academic impact over the past 10 years only. Again, like the examples shown previously, the script, which ranks all of the more than 25,000 institutions in MAG across its 18 top-level fields, is fast (12 minutes and 39 seconds) and quite affordable (U.S. $1.35).
However, if you explore the analytics pages further, you may notice that institution rankings are much more nuanced and fluid than most commercial rankings would lead you to believe. Specifically, if you drill into the fields of study hierarchy, you can see that where an institution stands in the rankings can vary dramatically. Often this reflects that many institutions have strategic focus areas; just because institution A ranks lower than institution B in one field does not mean A will rank lower than B in all of its subfields. In other words, without carefully accounting for academic specialization in fields of study, university rankings are more likely to be meaningless and misleading than helpful. MAG can help avoid this issue by allowing you to compute the rankings on all 18 fields and their 660,000 subfields. The script above can be modified to do so by relaxing the conditional clause “AND R.Level == 0” in the first statement. Obviously, the amount of computation grows tremendously, but the Azure bill you will incur, at $3.80, is still less than an average visit to Starbucks.
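To make the inversion concrete, here is a toy illustration with made-up saliency numbers (the institutions, field names, and values are all hypothetical): institution A leads B in the top-level field yet trails B in one of its subfields, so a single top-level ranking would hide A's and B's different specializations.

```python
# Hypothetical saliency scores per (institution, field) pair.
saliency = {
    ("A", "computer science"): 120.0,
    ("B", "computer science"): 95.0,
    ("A", "databases"): 10.0,  # a subfield of computer science
    ("B", "databases"): 30.0,
}

def rank_in_field(field):
    """Return institutions ordered by descending saliency within one field."""
    rows = [(inst, s) for (inst, f), s in saliency.items() if f == field]
    return [inst for inst, _ in sorted(rows, key=lambda r: r[1], reverse=True)]
```

Here `rank_in_field("computer science")` puts A first, while `rank_in_field("databases")` puts B first, which is exactly the kind of field-level nuance a single aggregate ranking erases.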
With MAG making GOTO analytics so affordable, there is really no good reason not to do the right thing.