How Reliable are University Rankings?
Ali Dasdan, Eric Van Lare, and Bosko Zivaljevic
KD Consulting, Saratoga, CA, USA
[email protected]
April 21, 2020
Abstract
University or college rankings have almost become an industry of their own, published by US News & World Report (USNWR) and similar organizations. Most of the rankings use a similar scheme: Rank universities in decreasing score order, where each score is computed using a set of attributes and their weights; the attributes can be objective or subjective while the weights are always subjective. This scheme is general enough to be applied to ranking objects other than universities. As shown in the related work, these rankings have important implications and also many issues. In this paper, we take a fresh look at this ranking scheme using the public College dataset; we both formally and experimentally show in multiple ways that this ranking scheme is not reliable and cannot be trusted as authoritative because it is too sensitive to weight changes and can easily be gamed. For example, we show how to derive reasonable weights programmatically to move multiple universities in our dataset to the top rank; moreover, this task takes a few seconds for over 600 universities on a personal laptop. Our mathematical formulation, methods, and results are applicable to ranking objects other than universities too. We conclude by making the case that all the data and methods used for rankings should be made open for validation and repeatability.
Rankings of higher education institutions (universities for short) have almost become an industry of their own [33]. Most of the rankings use a similar methodology: Select a set of numeric attributes and a numeric weight for each attribute, then compute a final numeric score as the sum of the products of each attribute with its weight. The weight of an attribute determines the amount of contribution the attribute makes to the final score. For the final ranking, the universities are ranked in their decreasing score order.

This generic ranking methodology is simple enough that it has been applied to ranking all kinds of objects, from universities to hospitals to cities to countries [41]; e.g., see [40, 59, 60] for the ranking methodologies for objects other than universities: places to live, hospitals, and countries. As a result, even though our focus in this study is universities, the findings are applicable to other areas where this generic ranking methodology is used.

There are many questions that have been explored about this ranking methodology in general and university rankings in particular. For example: why a given attribute is selected; whether a given attribute has the correct or most up-to-date value; why an attribute is weighted more than another one; what impact the opinion-based attributes have on the final score; whether or not the impact of these rankings on student choices is warranted; whether or not a university should incentivize its admins to improve its rank in a given ranking; whether or not a university games these rankings when it shares its data; etc. There is a rich body of related work exploring many of these issues, as covered in our related work section. There is also an international effort that has provided a set of principles and requirements (called the Berlin Principles [30]) to improve rankings and the practical implementations of the generic ranking methodology.

We start our study by illustrating a typical ranking process and reviewing the four well-known rankings, and we then survey the related work in § 4.
We lean towards providing references to survey or overview papers as well as entry points to subfields so that interested readers can go deeper if they so desire. We follow this by introducing, in § 5, the mathematical formulation that underlies the generic ranking methodology. The formulation uses concepts from linear algebra and integer linear programming (ILP).

The main part of this paper is presented in § 6.
We formulate and solve six problems in this section. The main vehicle we use in our formulation of these problems is integer linear programming. The ILP programs for these problems take a few seconds to both generate and run for over 600 universities on a personal laptop.

Across these six problems, our main thesis is that university rankings as commonly done today, often presented as unique and with much fanfare, are actually not reliable and can even be easily gamed. We believe our results via these six problems provide strong support for this thesis. The thesis applies especially to the universities at the top rank. We show in multiple ways that it is relatively easy to move multiple universities to the top rank in a given ranking. As we mentioned above, our findings are applicable to areas where objects other than universities are ranked.
In § 7, we provide a discussion of these points together with a few recommendations.

One unfortunate aspect of the rankings from the four well-known rankings organizations is that their datasets and the software code used for their rankings are not in the public domain for repeatability. We wanted to change this, so we have posted our datasets and the related software code in a public code repository [16]. We hope the well-known rankings organizations too will soon share their latest datasets and related software code in the public domain.
Rankings of higher education institutions were first started by the U.S. Bureau of Education in 1870 and have been done by multiple other organizations since; however, it can be argued that the current rankings industry was ignited by the U.S. News and World Report's first "America's Best Colleges" ranking in 1983 [33]. Today, there are many rankings around the world, some of which are well-known worldwide rankings while others are country specific [63]. The four well-known rankings, by USNWR, QS, THE, and SC, are covered in a later section.

In general, there are three basic sources of the input data for university ranking [56]:

a) surveys of the opinions of various stakeholders such as university or high school administrators;
b) independent third parties such as government agencies; and
c) university sources.

The last source is now standardized under the initiative called the "Common Data Set Initiative" [67]. A simple web search with a university name followed by the query "Common Data Set" will return many links to these data sets from many universities. Note that among these three data sources, the survey data is inherently subjective while the other two are supposed to be objective, although unfortunately some intentional alteration of data by university sources has been observed [46].

Our data source fits into category b above: The College dataset [52] is part of the StatLib datasets archive, hosted at Carnegie Mellon University; it contains data about many (1,329 to be exact) but not all American higher education institutions. Its collection in 1995 was facilitated by the American Statistical Association. The two data sources are the Association of American University Professors (AAUP) and US News & World Report (USNWR), which contribute 17 and 35 attributes, respectively, per university. There are many attributes with missing values for multiple universities. See [52] for the meaning of each attribute. All the attributes in this dataset are objective.

With the attributes and weights we selected, as detailed below, we generated the top 20 universities as shown in Fig. 1.

Figure 1: Our top 20 ranking generated from the College dataset. This ranking aligns well with the well-known rankings (partly by construction). Here the labels $a_{ij}$, $w_j$, and $s_i$ refer to the attribute $a_{ij}$, weight $w_j$, and score $s_i$ of the university at rank $i$, for $j$ in 1 through 11. Also, here and in the sequel, we use green color for highlighting attribute values.

We hope the reader can appreciate that this ranking is a reasonable one to the extent that it aligns well with the well-known rankings, to be discussed below.

One question that may come to the reader's mind could be the reason for selecting this dataset and the relevance of this fairly old dataset to the present. For the reason question, we wanted to make sure that we chose a standard dataset that is available to all who want to replicate our results; moreover, we do not have access to the latest datasets used by the well-known rankings organizations. For the relevance question, we ask the reader to review our mathematical and problem formulations and convince themselves that our results are applicable to any dataset containing a set of objects to rank using their numerical attributes.

We have n universities in some ranking, each of which has the same m attributes (also called variables or indicators in the Economics literature) with potentially different values.
We will use $i$ to index universities and $j$ to index attributes. Each attribute $a_{ij}$ of the $i$th university is associated with the same real-number weight $w_j$. The score $s_i$ of the $i$th university is a function of the attributes and weights of the university as

$s_i = g\big(\sum_{j=1}^{m} w_j f(a_{ij})\big)$, (1)

where the functions $g(\cdot)$ and $f(\cdot)$ usually reduce to the identity function, resulting in the following sum-of-products form:

$s_i = \sum_{j=1}^{m} w_j a_{ij}$, (2)

where the $j$th weight determines the contribution of the $j$th attribute to the final score.

A ranking of $n$ universities is a sorting of scores in decreasing order such that the "top" ranked or the "best" university is the university with the highest score, or the one at rank 1.
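To make this concrete, the following minimal Python sketch (our own illustration; the universities, attribute values, and weights are made up, not taken from the College dataset) computes the scores of Eq. (2) and sorts universities into a ranking:

# A minimal sketch of the sum-of-products scoring in Eq. (2) and the
# resulting ranking; all values below are hypothetical.

def score(attrs, weights):
    """Compute s_i = sum_j w_j * a_ij for one university."""
    return sum(w * a for w, a in zip(weights, attrs))

# Three hypothetical universities, each with m = 3 normalized attributes.
universities = {
    "U1": [0.9, 0.8, 0.7],
    "U2": [0.6, 0.9, 0.8],
    "U3": [0.7, 0.7, 0.9],
}
weights = [0.5, 0.3, 0.2]  # subjective weights, summing to 1

# Rank in decreasing score order; rank 1 is the "best" university.
ranking = sorted(universities,
                 key=lambda u: score(universities[u], weights),
                 reverse=True)
print(ranking)  # ['U1', 'U3', 'U2'] with these particular weights

Every ranking we discuss below is some variation of these few lines; the contested part is where the weights come from.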
Since we revisit this formulation in detail in § 5, we will keep this section short. Also see the weighting and aggregation section (Step 6 below).

Figure 2: The final 11 attributes derived for our ranking. Here the labels $a_{ij}$, $w_j$, and $s_i$ refer to the attribute $a_{ij}$, weight $w_j$, and score $s_i$ of the university at rank $i$, for $j$ in 1 through 11.

In general, the attributes for ranking, or as indicators of quality in higher education, can be grouped into the following four categories [24]:

a) beginning characteristics, which cover the characteristics of the incoming students such as test scores or high school ranking;
b) learning inputs, which cover factors that help affect the learning experiences of students such as the financial resources of the university;
c) learning outputs, which cover the skill sets or any other attributes of graduates; and
d) final outcomes, which cover the outcomes of the educational system and what the students achieve in the end such as employment, income, and job satisfaction.

It may be argued that category d is what matters most, but most of the current rankings focus on categories a and b because category d, and to some extent category c, are difficult to measure continuously. Our attributes, due to what we can find in our dataset, also fall into categories a and b.

We prepared our dataset in two steps: 1) We joined the two source datasets from AAUP and USNWR using the FICE code, a unique id per university assigned by the American federal government. This generated 1,133 universities. 2) We selected 20 attributes out of the 52 total, including the name and the state of the university. We then eliminated any university with a missing value for any of the selected attributes. This resulted in 603 universities.

The selected 20 attributes are: university, state, instructional expenditure per student (1), in-state tuition (1), room and board costs (1), room costs (1), board costs (1), additional fees (1), estimated book costs (1), estimated personal spending (1), number of applications received (2), number of applicants accepted (2), number of new students enrolled (3), percent of new students from the top 10% of their high school class (4), percent of new students from the top 25% of their high school class (5), percent of faculty with terminal degree (6), percent of faculty with PhDs (7), student/faculty ratio (8), graduation rate (9), percent of alumni who donate (10), number of full professors (11), and number of faculty in all ranks (11). The number j in parentheses indicates that the corresponding attribute is used to derive the jth attribute in Fig. 2.

Why did we choose only 20 attributes out of the 52 total? Two reasons: We wanted to use as many attributes as possible for each university; we also wanted to make sure that the final list of attributes used in ranking are comparable in the following senses:

1. Every final attribute is either a percentage or a ratio. This ensures that they are comparable in magnitude.
2. For every final attribute, a university with a higher value should be regarded by a prospective student as "better" than another university with a lower value.

Taking these two into account, we could not select more than 20 initial attributes from the 52 total. We then converted these 20 attributes into a final list of 11 attributes
(see Fig. 2), not counting the name and the state of the university, using the following thought process that we think a reasonable student would potentially go through:

"I, the student, want to go to a university $i$

$a_{i1}$: that spends on me far more than what it costs me in total (so that I get back more than what I put in);
$a_{i2}$, $a_{i3}$: that is desired highly by far more students than it can accommodate (so that I get a chance to study with top students);
$a_{i4}$, $a_{i5}$: that attracts the top students in their graduating class (so that I get a chance to study with top students);
$a_{i6}$, $a_{i7}$: that has more faculty with PhDs or other terminal degrees in their fields (so that I get taught by top researchers or teachers);
$a_{i8}$: that has a smaller student to faculty ratio (so that I can get more attention from professors);
$a_{i9}$: that has a higher graduation rate (so that I can graduate more easily);
$a_{i10}$: that has more of its alumni donating to the university (so that more financial resources are available to spend on students); and
$a_{i11}$: that has more of its classes taught by full professors (so that I get taught by top researchers or teachers)."

Here the spend above indicates the total instructional expenditure by the university per student, and the cost to a student above covers the tuition, room and board, fees, books, and personal expenses.

We hope the arguments made above look reasonable, and we expect them to be at least directionally correct. For example, a top researcher may not be a top teacher, but it feels reasonable to us to assume that with a solid research experience there is some correlation towards better qualifications to teach a particular subject.

The reader may or may not agree with this attribute selection process, but that is exactly one of the points of this paper: There is so much subjectivity creeping in during multiple steps of the ranking process. Later we will show how to remove some of this bias.

We then realized that we could easily repair some missing values: a) if the value of the attribute "room and board costs" is missing, it can easily be calculated as the sum of the values of the attributes "room costs" and "board costs" if both exist; b) University of California campuses have very similar tuitions and fees, so we substituted any missing value with the average of the remaining values; c) for Stanford University, the values of the attributes "additional fees" and "percent of faculty with PhDs" were missing, so we replaced them with the values we were able to find or calculate after some web searching. After these few repairs, we were able to increase the number of universities very modestly, from 603 to 609 to be exact, in our input dataset. We note that these repairs were not necessary to reach our conclusions, but if not done, well-regarded universities like Stanford University and most of the University of California campuses would have been missing from the final ranking.

Step 4: Multivariate Analysis
Using techniques such as principal component analysis or factor analysis, this step looks at the underlying structure of the attribute space via its independent components or factors. We will skip this step for our dataset as it is not highly relevant to the theme of this paper.
Normalization ensures that each attribute falls into the same interval, usually $[0, 1]$, via

$a_{\mathrm{new}} = \dfrac{a_{\mathrm{old}} - \min_a}{\max_a - \min_a}$, (3)

where $\max_a$ and $\min_a$ are the maximum and minimum values for the attribute in question among the selected universities.

As mentioned earlier, each university has the same $m$ attributes but with potentially different values. Each attribute $a_{ij}$ is paired with a weight $w_j$, which is the same across all universities. These attribute and weight pairs are manipulated in one of the following ways to generate a score so that universities can be ranked by their scores.

The weighted arithmetic mean formula.
The score $s_i$ of the $i$th university is equal to

$s_i = \dfrac{\sum_{j=1}^{m} w_j a_{ij}}{\sum_{j=1}^{m} w_j} = \sum_{j=1}^{m} w_j a_{ij}$, (4)

where, due to normalization or by construction, the sum of the weights in the denominator is equal to 1, i.e., $\sum_{j=1}^{m} w_j = 1$.

The weighted geometric mean formula.
The score $s_i$ of the $i$th university is equal to

$s_i = \exp\Big(\dfrac{\sum_{j=1}^{m} w_j \ln(a_{ij})}{\sum_{j=1}^{m} w_j}\Big) = \exp\Big(\sum_{j=1}^{m} w_j \ln(a_{ij})\Big)$, (5)

where, due to normalization or by construction, the sum of the weights in the denominator is equal to 1, i.e., $\sum_{j=1}^{m} w_j = 1$. Here $\exp(\cdot)$ and $\ln(\cdot)$ are the exponential and natural logarithm functions, respectively.

Figure 3: Rankings for the universities in the College dataset with respect to different weighting schemes: Arithmetic vs. geometric mean formulas, and for each, uniform (identical) vs. non-uniform (different) weights. Note the presence of some significant changes in ranks, especially for universities at higher ranks. Also, here and in the sequel, we use yellow color for highlighting ranks and scores.

Between these, the former formula is more common although the latter is proven to be more robust [55]. The former formula is a sum-of-products formula referred to by many other names, especially in the Economics literature, such as "the composite index" (the most common), "the weight-and-sum method", "the composite indicator", "the attribute-and-aggregate method", "the simple additive weighting", or "the weighted linear combination"; e.g., see [56]. When the sum of the weights is unity, the formula is also equivalent to the dot product of the attribute and weight vectors.

Now we know how to compute a score per university, but how do we select the weights? There are at least three ways of selecting weights [25, 41]: a) data driven, such as using principal component analysis; b) normative, such as public or expert opinion, equal weighting, or arbitrary weighting; and c) hybrid weighting. We will use the normative weighting scheme in the following three ways (a short code sketch of all three schemes follows the list):

1. Non-uniform weighting.
We use the student persona introduced earlier to justify a non-uniform weight assignment.

2. Uniform weighting.
We assign the same weight of 1 to every attribute. This removes any potential bias due to weight differences among attributes, but it has its own critiques; e.g., see [25].
3. Random weighting.
We assign a uniformly random weight to each attribute subject to the constraint that the sum of the weights is equal to 1. Random weighting and its consequences are explored in § 6. We will cover this in the experimental results.
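A minimal sketch of Eqs. (3)-(5) and the weighting schemes above follows; the helper names are ours, and the random-weight generator (normalized iid Exp(1) draws, which samples the weight simplex uniformly) stands in for the algorithm the paper adapts from [2]:

# A sketch of min-max normalization (Eq. 3), the weighted arithmetic and
# geometric means (Eqs. 4 and 5), and uniform vs. random weights.
import math
import random

def min_max_normalize(values):
    """Eq. (3): map an attribute's values across universities into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def arithmetic_score(attrs, weights):
    """Eq. (4): weighted arithmetic mean, assuming the weights sum to 1."""
    return sum(w * a for w, a in zip(weights, attrs))

def geometric_score(attrs, weights):
    """Eq. (5): weighted geometric mean; attributes must be positive."""
    return math.exp(sum(w * math.log(a) for w, a in zip(weights, attrs)))

def uniform_weights(m):
    return [1.0 / m] * m

def random_weights(m):
    """Normalized Exp(1) draws: non-negative weights summing to 1."""
    draws = [random.expovariate(1.0) for _ in range(m)]
    total = sum(draws)
    return [d / total for d in draws]

attrs = [0.8, 0.6, 0.9]  # toy normalized attributes of one university
for w in (uniform_weights(3), random_weights(3)):
    print(arithmetic_score(attrs, w), geometric_score(attrs, w))

Note that the geometric mean requires positive attribute values, which is one reason normalized values are often shifted slightly away from zero before it is applied.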
Our top 20 ranking from the College dataset is shown in Fig. 1. Our dataset and this top 20 ranking both include liberal arts colleges; the four rankings organizations usually have a separate ranking for liberal arts colleges. In this figure, we show the attribute values and the final unnormalized score. The last row shows the weights chosen for this ranking. We hope the readers can convince themselves that the resulting rankings, in both the non-uniform and uniform cases, seem reasonable and are in close alignment with the four well-known rankings (also see the argument in the last paragraph of the section for Step 6).

All in all, we hope we have provided a reasonable illustration of a ranking process in this section. As we repeatedly mention, our mathematical formulation and conclusions are not specific to this dataset.
We now briefly discuss the four well-known rankings in this section. For each ranking, we provide a very brief history, its ranking methodology with the list of the latest attributes and their weights, and the top 10 universities. This section covers the rankings of universities in the aggregate as well as in Computer Science.

Figure 4: The attributes and weights used by US News and World Report in 2020 for the US national ranking.
U.S. News & World Report (USNWR) rankings have been active since 1985 [66]. The attributes and weights of the latest ranking methodology are shown in Fig. 4 for the US national ranking and in Fig. 5 for the global ranking.

The national ranking has six categories of attributes, which are 13 in total, and the weights range from 1% to 22%. About 20% of the total weight, i.e., the "peer assessment" attribute, is opinion based. The global ranking has three categories of attributes, which are also 13 in total, and the weights range from 2.5% to 12.5%. About 25% of the total weight, i.e., the research reputation attributes, is opinion based.

More details about the methodology are available at [38] and [39] for the national and global rankings, respectively. The latest rankings using this methodology are available at [58] and [57] for the national and global rankings, respectively.

Figure 5: The attributes and weights used by US News and World Report in 2020 for the global ranking.

Figure 6: The attributes and weights used by Quacquarelli Symonds (QS) in 2020.

USNWR has rankings for objects other than universities, such as hospitals [59] and countries [60], using the same weights-based ranking methodology.
The Quacquarelli Symonds (QS) rankings have been active since 2004 [64]. Between 2004 and 2010, these rankings were done in partnership with Times Higher Education (THE). Since 2010, QS rankings have been produced independently. The attributes and weights of the latest ranking methodology are shown in Fig. 6; it has five categories of attributes, which are seven in total. The weights range from 5% to 20%. At least 50% of the total weight is opinion based under the "academic reputation" and "employer reputation" categories. More details about the methodology are available at [45]. The latest rankings using this methodology are available at [44].

Figure 7: The attributes and weights used by Times Higher Education (THE) in 2019.
The Times Higher Education (THE) rankings have been active since 2004 [65]. Between 2004 and 2010, these rankings were done in partnership with QS. Since 2010, THE rankings have been produced independently. The attributes and weights of the latest ranking methodology are shown in Fig. 7; it has five categories of attributes, which are 13 in total. The weights range from 2.25% to 30%. At least 33% of the total weight is opinion based under the "reputation survey" attributes. More details about the methodology are available at [54]. The latest rankings using this methodology are available at [53].

Figure 8: The attributes and weights used by ShanghaiRanking Consultancy in 2018.
The ShanghaiRanking Consultancy (SC) rankings have been active since 2003 [62, 35]. Between 2003 and 2008, these rankings were done by Shanghai Jiao Tong University. Since 2009, these rankings have been produced independently. The attributes and weights of the latest ranking methodology are shown in Fig. 8; it has four categories of attributes, which are six in total. The weights range from 10% to 20%. No part of the total weight is directly opinion based. More details about the methodology are available at [50]. The latest rankings using this methodology are available at [49].

Figure 9: An illustration of the differences among the four well-known rankings, with the top 10 national ranking of USNWR taken as the reference. Also note the ranking by the Kemeny rule and the average ranking computed out of the first five columns. The last row of numbers shows the similarity score between the Kemeny rule ranking and each of the other rankings, as measured by Spearman's footrule distance.
To illustrate the differences between rankings, Fig. 9 shows the top 10 overall rankings of the US universities. In this figure, the first column is the reference for the table: the USNWR national ranking. The next four rankings are the USNWR global ranking and the other three well-known rankings. In the column "Kemeny", we present a ranking (called the Kemeny ranking) using the Kemeny rule, which minimizes the disagreements between the first four rankings and the final ranking [20]. Finally, in the column "Average", we present a ranking using the average of the ranks over the first five columns.

Fig. 9 already illustrates the wide variation between these rankings: a) there is no university that has the same rank across all these rankings; b) there is not even an agreement on the top university; c) some highly regarded universities, e.g., UC Berkeley, are not even in the top 10 in these rankings; and d) the two USNWR rankings of the same universities do not agree. The last row shows the difference between each ranking and the Kemeny ranking, where the difference is computed using Spearman's footrule [15, 51], which is nothing more than the sum of the absolute differences between pairwise ranks. The distances show that the rankings, in order of increasing distance from the Kemeny ranking, are: Average, USNWR national, THE, QS, SC, and USNWR global.

It is instructive to see the top ranked university in each ranking: Princeton University in the USNWR national ranking; Harvard University in the USNWR global ranking, the SC ranking, and the average ranking; Massachusetts Institute of Technology in the QS ranking and the Kemeny ranking; and Stanford University in the THE ranking. The top ranked university in these rankings may also change from year to year. These disagreements, even for the top ranked university, should convince the readers of the futility of paying attention to the announcements of the top ranked university from any rankings organization.

Figure 10: The top ten universities in the world in computer science ranking per each rankings organization. The first two rankings are by CSRankings and CSMetrics, two computer science focused rankings developed and maintained by academicians. The first column is the reference ranking.
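For concreteness, here is a small Python sketch of the two notions used in Fig. 9: Spearman's footrule distance and a brute-force Kemeny aggregation. The three-item voter rankings are toy examples of ours; exhaustive search over permutations is only practical for a handful of items, since Kemeny aggregation is NP-hard in general.

# Spearman's footrule and a brute-force Kemeny rule over toy rankings.
# Each ranking is a dict mapping an item to its rank (1 = best).
from itertools import permutations

def footrule(r1, r2):
    """Sum of absolute rank differences over the common items."""
    return sum(abs(r1[u] - r2[u]) for u in r1)

def kendall_disagreements(ranking, voter):
    """Number of item pairs ordered differently by `ranking` and `voter`."""
    items = list(ranking)
    return sum(
        1
        for i in range(len(items))
        for j in range(i + 1, len(items))
        if (ranking[items[i]] - ranking[items[j]])
           * (voter[items[i]] - voter[items[j]]) < 0
    )

def kemeny(voters):
    """Exhaustive Kemeny aggregation: the permutation with fewest
    pairwise disagreements against all voters."""
    items = sorted(voters[0])
    best, best_cost = None, float("inf")
    for perm in permutations(items):
        cand = {u: r for r, u in enumerate(perm, start=1)}
        cost = sum(kendall_disagreements(cand, v) for v in voters)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best

voters = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 1, "B": 3, "C": 2},
    {"B": 1, "A": 2, "C": 3},
]
agg = kemeny(voters)
print(agg, footrule(agg, voters[0]))  # {'A': 1, 'B': 2, 'C': 3} 0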
To further illustrate the differences between rankings, Fig. 10 gives the top 10 university rankings for computer science. Three words of caution here: a) these rankings organizations use different titles for their computer science rankings (USNWR: "Computer and Information Sciences", QS: "Computer Science and Information Systems", THE: "Computer Science", SC: "Computer Science and Engineering"); b) due to these different titles, these rankings possibly cover more than computer science; and c) it is not clear what changes these organizations made in their generic ranking methodologies for computer science and, for that matter, for other subjects or areas.

The first two columns in Fig. 10 are by CSRankings [5] and CSMetrics [23], respectively, two computer science focused rankings developed and maintained by academicians in computer science [22]. These two rankings also align well with our own experiences as computer scientists. For the sake of simplicity, we will refer to them as the "academic" rankings.

These academic rankings are mainly based on citations in almost all the venues that matter to computer science. The central premise of these rankings is "to improve rankings by utilizing more objective data and meaningful metrics" [22]. These rankings intend to follow the best practices set by the Computing Research Association (CRA): CRA believes that evaluation methodologies must be data-driven and meet at least the following criteria: a) Good data: has been cleaned and curated; b) Open: data is available, regarding attributes measured, at least for verification; c) Transparent: process and methodologies are entirely transparent; and d) Objective: based on measurable attributes. These best practices are the reason these sites declare themselves GOTO-ranking compliant, where GOTO stands for these four criteria. For computer science rankings, a call to ignore the computer science ranking by USNWR was made by the CRA due to multiple problems found with the ranking [3].

Note in Fig. 10 the significant differences among the rankings. As mentioned above, we agree with the academic rankings. However, it is difficult to agree with the ranks assigned to some universities in the other rankings. For example, it is difficult for us to agree with Carnegie Mellon University having rank 25 in USNWR and University of California, Berkeley having rank 118 in THE. Any educated computer scientist would agree that these two universities are definitely among the best in computer science. These two examples alone show the unreliability of the "non-academic" rankings, at least for computer science.
There is a huge literature on rankings, especially in the Economics literature for rankings of countries on various well-being measures. As a result, we cannot be exhaustive here; we will instead refer to a fairly comprehensive set of key papers that are mainly overview or survey papers or papers that are directly relevant to our work.

Recall the following acronyms that we defined above for the four well-known rankings organizations: US News & World Report (USNWR), Quacquarelli Symonds (QS), Times Higher Education (THE), and ShanghaiRanking Consultancy (SC).

[33] gives a history of rankings. [41] is the de facto bible of all things related to composite indices. Although its focus is on well-being indices for populations and countries, the techniques are readily applicable to university rankings, as we also briefly demonstrated in this paper. [56] surveys 18 rankings worldwide. It acknowledges that there is no single definition of quality, as seen by the different sets of attributes and weights used across these rankings. It recommends quality assurance to enable better data collection and reporting for improved inter-institutional comparisons.

[29] provides an insider view of USNWR rankings. [35] is a related paper but on SC rankings. [48], though focusing on SC and THE only, follows a general framework that can be used to compare any two university rankings. It finds that SC is only good for identifying top performers and only in research performance, and that THE is undeniably biased towards British institutions and inconsistent in the relation between subjective and objective attributes. [8] proposes a critical analysis of SC, identifies many problems with it, and concludes that SC does not qualify as a useful guide to either academic institutions or parents and students.

[21] presents and criticizes the arbitrariness in university rankings. [61] focuses on the technical and methodological problems behind university rankings. By revealing almost zero correlation between expert opinions and bibliometric outcomes, this paper casts a strong doubt on the reliability of expert-based attributes and rankings. This paper also argues that a league of outstanding universities in the world may not exceed 200 members, i.e., any ranks beyond 200 are potentially arbitrary. [10] presents a good discussion of the technical pitfalls of university ranking methodologies.

[13] provides guidelines on how to choose attributes. [25] reviews the most commonly used methods for weighting and aggregating, including their benefits and drawbacks. It proposes a process-oriented approach for choosing appropriate weighting and aggregation methods depending on expected research outcomes. [19] categorizes the weighting approaches into data-driven, normative, and hybrid and then discusses a total of eight weighting approaches along these categories. It compares their advantages and drawbacks.

[4] uses Kemeny rule based ranking to avoid the weight imprecision problem. [47] provides a comparative study of how to provide rankings without explicit and subjective weights. These rankings work in a way similar to Kemeny rule based ranking.

[27] provides a synopsis of the choices available for constructing composite indices in light of recent advances.
[42] provides a literature review and history of research rankings and proposes the use of bibliometrics to strengthen new university research rankings.

[34] provides an example of how to game the rankings system, with multiple quotes from USNWR and some university presidents on how the system works. [32] presents a way to optimize the attribute values to maximize a given university's rank in a published ranking.

[17, 18] construct a model to clarify the incentives of the ranker, e.g., USNWR, and the student. They find the prestige effect pushing a ranker towards a ranking that is away from student-optimal, i.e., not to the advantage of the student. They discuss why a ranker chooses the attribute weights in a certain way and why they change them over time. They also present a student-optimal ranking methodology. [37] exposes the games business schools play in deciding whether or not to reveal their rankings.

[36] provides the causal impact of rankings on application decisions, i.e., how a rank boost or decline of a university affects the number of applications the university gets in the following year.

University rankings are an instance of multi-objective optimization. [12] provides a survey of such systems spanning many different domains, including university rankings. [28] presents applications to ranking in databases.

In summary, rankings, like many things in life, have their own pros and cons [26]. The pros are that they in part rely on publicly available information [1]; that they bring attention to measuring performance [1]; that they have provided a wake-up call, e.g., in Europe [1], for paying attention to the quality of universities and providing enough funding for them due to the strong correlation between funding and high rank; that they provide some guidance to students and parents in making university choices; and that they use easily understandable attributes and weights and a simple score-based ranking.

The cons are unfortunately more numerous than the pros. The cons are that data sources can be subjective [61], can be and have been gamed [46], and can be incomplete; that attributes and weights sometimes seem arbitrary [1, 43]; that weights accord too little importance to social sciences and humanities [1]; that many operations on attributes and weights affect the final rankings [27]; that many attributes can be highly correlated [43]; that there is no clear definition of quality [56]; that rankings encourage rivalry among universities and strengthen the idea of the academic elite [61]; that rankings lead to a "rich getting richer" phenomenon due to highly ranked universities getting more funding, higher salaries for their admin staff, more demand from students, and a more favorable view of quality in expert opinions [61]; that assigning credit, such as where an award was given vs. where the work was done, is unclear [61]; that expert opinions are shown to be statistically unreliable and yet some well-known rankings organizations still use them [61]; that even objective bibliometric analysis has its own issues [61]; that some rankings such as THE have an undeniable bias towards British universities [48]; and that the current rankings cannot be trusted [8, 21, 22].

The current well-known rankings have created their own industry, and they have strong financial and other incentives to continue their way of presenting their own rankings [18].
There are initiatives to address many drawbacks of the current rankings, such as the "Common Data Set Initiative" to provide publicly available data directly from universities [9]; Computer Science rankings created by people who know Computer Science, namely academicians in Computer Science [22]; a set of principles and requirements (called the Berlin Principles) that a ranking needs to satisfy to continuously improve [30]; an international institution to improve the rankings methodology [31]; and extra validation steps and prompt action by the current rankings organizations against gaming [29, 46]. In short, there is hope, but it will take time to reach a state where many of the cons have been eliminated.
We have $n$ objects to rank. Objects are things like universities, schools, and hospitals. A ranking is presented to people to help them select one of these objects. Each object $i$ has $m$ numerical attributes from $a_{i1}$ to $a_{im}$, indexed with $j$. We can use the matrix $A$ to represent the objects and their attributes, one row for each object and one column for each attribute:

$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1m} \\ a_{21} & a_{22} & \cdots & a_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nm} \end{bmatrix}$ (6)

We have $m$ unknown numerical coefficients from $w_1$ to $w_m$ in the vector $w$:

$w = \begin{bmatrix} w_1 & w_2 & \cdots & w_m \end{bmatrix}$, (7)

where each $w_j$ represents a weight for the attribute $a_{ij}$ for some $i$. Each weight is non-negative. The same weights are used for every row of attributes. The sum of the weights is set to 1 due to normalization.

For each object $i$, we compute a score $s_i$ as

$s_i = \sum_{j=1}^{m} w_j a_{ij}$ (8)

or in the matrix form

$s = Aw$, (9)

where $A$ is the matrix of the known attributes and $w$ is the vector of unknown weights, both as defined above. Note that $Aw$ is a matrix-vector multiplication. Also note that setting each weight in $w$ to 1 gives the uniform weight case.

Fig. 1 illustrates the attributes, weights, scores, and ranks for the top 20 universities from the College dataset. The attributes across all 20 rows and 11 columns represent the matrix $A$ for the top 20 universities only; the actual matrix $A$ has over 600 rows. The last row in this table represents the vector $w$ of 11 weights we assigned for this ranking. The last two columns represent the vector $s$ of scores and the ranks for each university.
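In code, Eq. (9) and the subsequent ranking are one matrix-vector product plus a sort; the matrix and weights below are made-up stand-ins for the dataset:

# Eq. (9) in matrix form: scores for all n universities at once, then a
# ranking by decreasing score (toy values, not from the College dataset).
import numpy as np

A = np.array([[0.9, 0.8, 0.7],
              [0.6, 0.9, 0.8],
              [0.7, 0.7, 0.9]])       # n x m attribute matrix
w = np.array([0.5, 0.3, 0.2])         # m non-negative weights summing to 1

s = A @ w                             # Eq. (9): one score per university
order = np.argsort(-s)                # university indices, best first
ranks = np.empty_like(order)
ranks[order] = np.arange(1, len(s) + 1)  # rank 1 goes to the highest score
print(s, ranks)                       # [0.83 0.73 0.74] [1 3 2]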
Ordering.

The top n ranking, like the top 20 in Fig. 1, is an ordering of these universities in decreasing score, i.e.,

$s_1 > s_2 > \cdots > s_n$, (10)

which we will refer to as the score ordering constraint. Another form of these inequalities is

$s_i \geq s_{i+1} + \epsilon, \quad \forall i \in [1, n-1]$, (11)

where $\epsilon$ is a small constant. This form will be useful in linear programming formulations.
Domination.

Given two attribute vectors $a_x$ and $a_y$ of length $m$, we say $a_x$ strictly dominates $a_y$ if and only if for all $j$, $a_x[j] \geq a_y[j]$; we say $a_x$ partially dominates $a_y$ if and only if $a_y$ does not strictly dominate $a_x$. Note that if $a_x$ does not strictly dominate $a_y$, then it is necessarily true that $a_x$ and $a_y$ partially dominate each other for different sets of their attributes.

The domination idea is due to its impact on the score ordering. Given two attribute vectors $a_x$ and $a_y$, it is easy to see that if $a_x$ partially dominates $a_y$, we can always find a set of weights to make $s_x > s_y$, calculated as in Eq. (8).
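These two definitions translate directly into code; a small sketch of ours, with toy vectors:

# Strict and partial domination between two attribute vectors, as
# defined above.

def strictly_dominates(ax, ay):
    """True iff every attribute of ax is at least as large as ay's."""
    return all(x >= y for x, y in zip(ax, ay))

def partially_dominates(ax, ay):
    """True iff ay does not strictly dominate ax."""
    return not strictly_dominates(ay, ax)

# If a_x partially dominates a_y, some attribute of a_x exceeds the
# corresponding attribute of a_y, so putting enough weight on that
# attribute makes s_x > s_y under Eq. (8).
print(strictly_dominates([3, 2], [1, 2]))   # True
print(partially_dominates([1, 3], [2, 2]))  # True: 3 > 2 in the second slot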
(Integer) Linear Programming (LP) Formulation.

In the next section we will define a set of problems. For each problem we will formulate a linear program or an integer linear program and solve it using one of the existing open source LP packages. For our experiments, the LP package we used was lp_solve [6]. Details are in the following problems and solutions section. For the constraints of these programs, strict domination is used extensively. Note that although ILP is NP-hard, it takes a few seconds on a personal laptop to generate our ILP programs using the Python programming language and run them using our LP package for over 600 universities. As a result, we do not see any reason (other than, potentially, intellectual curiosity) to develop specialized algorithms for the problems we study in this paper.
We will now show that there are multiple valid choices in assigning ranks to universities. For each choice, we will pose a problem and then provide a solution to it. Below, there is a section dedicated to each problem.

In Problem 1, we explore the existence of different rankings using Monte Carlo simulation. We do so in two ways: a) how many universities can be moved to rank 1, and b) whether or not we can find weights to keep a given top k ranking of universities. The search space here is the space of weight vectors, called the weight space.

Our solution to Problem 1 may have two issues due to the use of simulation: The weight space may not be searched exhaustively, and the search may not be efficient. Using linear programming, Problem 2 ensures that the weight space exploration is both efficient and optimal.

In the first two problems, the existence of different rankings reduces to the existence of different weights under ordering constraints. As long as such weights exist, we are concerned about how they may appeal to a human judge. In Problem 3, we rectify this situation in that we derive weights that would appeal to a human judge as reasonable or realistic, as if assigned by a human.

These three problems show that there are many universities that can attain the top rank, but not every university can achieve it. A natural next question to ask is how to find the best rank that each university can attain. Problem 4 is about solving this problem.

The first four problems always involve attribute weights. In Problem 5, we explore the problem of finding rankings without using weights at all. The solution involves a technique of aggregating rankings per attribute, called the Kemeny rule.

In Problem 6, we explore the problem of how much improvement in the ranking of a given university is possible by improving attribute values in a weight-based scoring methodology. This problem should provide some guidance to universities in terms of what to focus on first to improve their ranks.

Figure 11: Ranking of the top 20 universities in decreasing average score order using the arithmetic mean formula and using uniformly random weights in all 10,000 runs. "Avr", "Std", and "CV" in Columns 2-7 stand for the average, the standard deviation, and the coefficient of variation, respectively. "Prob in Top 20" in Column 8 is the probability of falling in the top 20 universities when they are ranked in decreasing score order. Each column is explained in the text.

Assigning weights subjectively has its own issues, so what if we do not assign weights manually at all? In this section, we explore the weight space automatically to discover different rankings and find out what extremes are possible.

Since we use randomization, we need repetition to get meaningful outcomes on average. Let N be the number of runs. In our experiments, we set N to 10,000. Each run derives m weights with three constraints: a) each weight is independently and identically drawn (iid); b) each weight is uniformly random; c) the sum of the weights is equal to 1. Among these constraints, care is needed for constraint c, for which we adapted an algorithm suggested in [2].

Over these N runs, we collect the following results for each university:

1. University name,
2. Average (Avr) of scores,
3. Standard deviation (Std) of scores,
4. Coefficient of variation (CV) of scores, computed as the ratio of the standard deviation to the average,
5. Average of ranks,
6. Standard deviation of ranks,
7. Coefficient of variation of ranks,
8. Probability of falling into the Top 20,
9. Top group id, explained later,
10. Maximum rank attained,
11. Minimum rank attained,
12. Maximum count, the number of times the maximum rank has been attained,
13. Minimum count, the number of times the minimum rank has been attained,
14. Product, explained later.

Figure 12: Ranking of the top 20 universities in decreasing average score order using the geometric mean formula and using uniformly random weights in all 10,000 runs. "Avr", "Std", and "CV" in Columns 2-7 stand for the average, the standard deviation, and the coefficient of variation, respectively. "Prob in Top 20" in Column 8 is the probability of falling in the top 20 universities when they are ranked in decreasing score order. Each column is explained in the text.

Some observations from these runs:

- The average score and average rank rankings are probabilistically identical to those by the uniform weight. This is because constraint c above ensures that the expected value of each weight is equal to the uniform weight (which is easy to prove using the linearity of expectation together with constraints a-c).

- The maximum rank values show that there were runs in which every university was not in the top 20, but the small maximum rank count values together with the small average rank values show that these max rank values were an extreme minority. More specifically, in the arithmetic case, Harvard University was the best as it did not drop below rank 16 in all N runs, while in the geometric case, Princeton University was the best as it did not drop below rank 25 in all N runs. At the same time, with respect to the average score ranking, neither of these universities was at rank 1. Moreover, in the arithmetic case, Harvard University had the smallest average rank, while in the geometric case Princeton University did not have the smallest average rank.

- The minimum rank values show that about half of the universities never reached the top rank over all N runs. From the opposite angle, this also means that about half of the universities were at the top position in some runs.

- Column 13 tells us how many times a university was at its minimum rank. It seems some universities do reach the top position, but it is rare. On the other hand, for the top two universities, the top position is very frequent. More specifically, in the arithmetic case, California Institute of Technology is at rank 1 in about 55% of the runs whereas Harvard University is at rank 1 in about 42%. In the geometric case, again these two universities have the highest chance of hitting the top position, at about 49% and 35%, respectively.

These observations conclusively show that a single ranking with subjective or random weights is insufficient to assert that a particular university is the top university or at a certain rank. Moreover, it is unclear which metric is the definitive one to rank these universities; we could as well rank these universities per average score, average rank, maximum rank reached, minimum rank reached, or probability of hitting a certain rank, each yielding a different ranking. We will return to these possibilities in later problems.

A large number of runs in Problem 1 may explore the weight space quite exhaustively, but the exploration using simulation still cannot be guaranteed to be fully exhaustive. Moreover, the search itself may not be efficient due to its direct dependence on the number of runs. In this section, we use LP to guarantee optimality and efficiency.

We have two cases: The special case is to enforce the top 1 rank for a single university, whereas the general case is to enforce the top k ranks, from 1 to k, for a given top k universities in order. The input is our ranking of all the universities in our dataset using the geometric mean formula with uniform weights.
Enforcing for top 1 rank.
This case asks whether or not a set $w$ of weights exists to ensure that a given university strictly dominates every other university. This is done by moving the given university to rank 1 and checking if $s_1$ is greater than every other score. In this problem, we are not interested in finding $w$, although LP will return it; rather, we are interested in its existence.

Let us see how we can transform this special case into a linear program. The special case requires that $s_1 > s_i$ for any $i \geq 2$, or equivalently, $s_1 - s_i > 0$ for any $i \geq 2$. Since both $w$ and $s$ are unknown, a linear program cannot be created directly. However, if we subtract each row of attributes (component-wise) from the first row, we convert $s = Aw$ into $Dw > 0$, where

$\mathrm{row}_{i-1}(D) = \mathrm{row}_1(A) - \mathrm{row}_i(A)$ (12)

for $i \geq 2$, and $\mathrm{row}_i(\cdot)$ represents the $i$th row of its argument matrix. This converts the $n$ rows in $A$ into $n - 1$ rows in $D$. The resulting program in summary is

$w \geq 0, \quad \sum_{j=1}^{m} w_j = 1, \quad Dw > 0$, (13)

where the equality constraint on the sum of the weights is enforced to avoid the trivial solution $w = 0$. This linear program can be rewritten more explicitly as

minimize 1
subject to
(a) $w_j \geq 0 \quad (\forall j \in [1, m])$,
(b) $\sum_{j=1}^{m} w_j = 1$,
(c) $\sum_{j=1}^{m} (a_{1j} - a_{ij}) w_j \geq \epsilon \quad (\forall i \in [2, n])$, (14)

where the lower bound $\epsilon$ is set to zero or a small nonzero constant, 0.05% in our experiments. The zero lower bound case allows ties in ranking whereas the nonzero lower bound case enforces strict domination. Note that this linear program has a constant as an objective function, which indicates that a feasibility check rather than an optimization is to be performed by the LP package we are using.

Now if this linear program is feasible, then there exists a set of weights $w$ that satisfies all these constraints, or equivalently, our special case has a solution.

The results of this experiment for the special case are as follows. We generated and solved the linear program for each of the 609 universities in our dataset. How many universities could be moved to the top rank? For the zero and nonzero lower bound cases, the numbers are 45 and 28, respectively.

It is probably expected that multiple of the top ranked universities could be moved to the top rank. For example, for the zero and nonzero lower bound cases, any of the top 13 and top 4, respectively, could achieve the top rank. What was surprising to find out was that some universities at high ranks could also be moved to the top rank; for the zero and nonzero lower bound cases, the highest (i.e., numerically largest) ranks were 553 and 536, respectively.

One note on the difference in the findings between Monte Carlo simulation vs. LP: Our Monte Carlo simulation was able to find about 12 universities that could be moved to the top rank, whereas LP was able to find more, as given above, in a fraction of the time; moreover, the 12 universities found by Monte Carlo simulation were subsumed by the ones found by LP. Although this is expected due to LP's optimality guarantee, it is worth mentioning to emphasize the importance of running an exhaustive but efficient search like LP.
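As an illustration of the special case, the following sketch builds and solves the feasibility program of Eq. (14) with SciPy's linprog rather than the lp_solve package used in the paper; the attribute matrix is a toy stand-in for our dataset:

# A feasibility sketch of the linear program in Eq. (14).
import numpy as np
from scipy.optimize import linprog

def can_be_top(A, cand, eps=0.0005):
    """True iff some weight vector makes university `cand` rank 1."""
    n, m = A.shape
    # Rows of D per Eq. (12): candidate's attributes minus every rival's.
    D = A[cand] - np.delete(A, cand, axis=0)
    res = linprog(
        c=np.zeros(m),                        # constant objective: feasibility only
        A_ub=-D, b_ub=np.full(n - 1, -eps),   # D w >= eps, i.e., -D w <= -eps
        A_eq=np.ones((1, m)), b_eq=[1.0],     # weights sum to 1
        bounds=[(0, None)] * m,               # non-negative weights
        method="highs",
    )
    return res.status == 0

A = np.array([[0.9, 0.2], [0.5, 0.5], [0.1, 0.9]])
print([can_be_top(A, i) for i in range(3)])   # [True, False, True]

Setting eps to zero recovers the zero lower bound variant that allows ties.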
For the general case, we need to enforce moreconstraints. We require strict domination in succession, i.e., s i − s i +1 > i = 1 to k −
1, to enforce the rank order for the top k universities. The resultinglinear program isminimize 1subject to( a ) w j ≥ ∀ j ∈ [1 , m ]) , ( b ) m (cid:80) j =1 w j = 1 , ( c ) ( a ij − a i +1 ,j ) w j ≥ (cid:15) ( ∀ i ∈ [1 , k −
1] and ∀ j ∈ [1 , m ]) , ( d ) ( a kj − a ij ) w j ≥ (cid:15) ( ∀ i ∈ [ k + 1 , n ] and ∀ j ∈ [1 , m ]) , (15)where k is a given constant less than n .Taking the ranking in Fig. 1 as input, we wanted to find out the maximumk such that we can find a set of weights to enforce the top k ranking. For thezero bound case, we could find such weights for each k from 1 to 20. For thenonzero bound case, we could find such weights for each k from 1 to 18.We could increase these k even further by figuring out which universitiesneed to move up or down in the ranking. To find them out, we used a tricksuggested in [7]. This trick involves adding a slack variable to each inequality inEq. 15 and also adding their sum with large numeric coefficients in the objectivefunction. The goal becomes discovering the minimum number of nonzero slackvariables. For each nonzero slack variable, the implication is that the orderingneeds to be reversed. The LP formulation is below for reference:minimize k − (cid:80) i =1 M d subject to( a ) w j ≥ ∀ j ∈ [1 , m ]) , ( b ) m (cid:80) j =1 w j = 1 , ( c ) ( a ij − a i +1 ,j ) w j + d i ≥ (cid:15) ( ∀ i ∈ [1 , k −
1] and ∀ j ∈ [1 , m ]) , ( d ) ( a kj − a ij ) w j + d i ≥ (cid:15) ( ∀ i ∈ [ k + 1 , n ] and ∀ j ∈ [1 , m ]) , (16)26here M is a large integer constant like 1,000 and d are the slack variables.Since the point about top k for a reasonable k is already made, we will notreport the results of these experiments.The significance of these experiments is that a desired ranking of top k formany values of k can be enforced with a suitable selection of attributes andweights. This is another evidence for our central thesis that university rankingscan be unreliable. When we explored in the problems above the existence of weights to enforce adesired ranking for top k, we did not pay attention to how these weights lookto a human judge. It is possible that a human judge may think she or he wouldnever assign such odd looking weights, e.g., very uneven distribution of weightvalues or weights that are too large or too small. Although such an objectionmay not be fair in all cases, it is a good idea to propose a new way of derivingweights that are expected to be far more appealing to human judges. In thissection, we will explore this possibility.Our starting point is the claim that uniform weighting removes most or all ofthe subjectivity with weight selection. This claim has its own issues as discussedin the literature but we feel it is a reasonable claim to take advantage of. Wecan use this claim in two ways: a) create rankings using uniform weights, b)approximate uniform weights. The latter is done with the hope that it cangenerate rankings with larger k.The results with uniform weights are given in Fig. 3. In this section we focuson approximating uniform weights. Our approximation works by minimizing thedifference d between the maximum derived weight and the minimum derivedweight so that the weights are closer to uniform as the difference gets closer tozero as shown below: d = m max j =1 w j − m min j =1 w j , (17)where we will refer to the numerator and denominator in this equation as max w and min w , respectively, so that we can use them as parameters in our linearprogram.Using the most basic properties of the maximum and minimum functions asin max w ≥ w j and w j ≥ min w (18)27or each j from 1 to m , our linear program isminimize d = max w − min w subject to( a ) w j ≤ max w ( ∀ j ∈ [1 , m ]) , ( b ) w j ≥ min w ( ∀ j ∈ [1 , m ]) , ( c ) min w ≥ , ( d ) m (cid:80) j =1 w j = 1 , ( e ) Dw ≥ (cid:15), (19)where Dw is to be defined below for each case. Enforcing for top 1 rank.
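The program in Eq. (19) is a small extension of the previous sketch: two extra variables, max_w and min_w, bracket the weights, and the objective minimizes their gap. A hedged sketch of ours (again with SciPy in place of lp_solve; D is built as in the previous sketch for whichever case is being enforced):

# Eq. (19): weights as close to uniform as the domination constraints
# D w >= eps allow. Variable layout: x = [w_1 .. w_m, max_w, min_w].
import numpy as np
from scipy.optimize import linprog

def near_uniform_weights(D, eps=0.0005):
    m = D.shape[1]
    c = np.zeros(m + 2)
    c[m], c[m + 1] = 1.0, -1.0                  # objective: max_w - min_w
    rows, rhs = [], []
    eye = np.eye(m)
    for j in range(m):
        rows.append(np.r_[eye[j], -1.0, 0.0])   # (a) w_j <= max_w
        rhs.append(0.0)
        rows.append(np.r_[-eye[j], 0.0, 1.0])   # (b) w_j >= min_w
        rhs.append(0.0)
    for d_row in D:
        rows.append(np.r_[-d_row, 0.0, 0.0])    # (e) d_row . w >= eps
        rhs.append(-eps)
    A_eq = np.r_[np.ones(m), 0.0, 0.0].reshape(1, -1)   # (d) sum w = 1
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs),
                  A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (m + 2),  # includes (c): min_w >= 0
                  method="highs")
    return res.x[:m] if res.status == 0 else None  # None marks infeasibility

For the top 1 case, D is the candidate's attribute row minus every other row, exactly as in the previous sketch; a returned weight vector certifies the top rank with weights as close to uniform as the constraints allow.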
The results of this experiment for feasibility isthe same as in the special case of Problem 1 (with D defined as in Eq. 12), eventhough we changed the linear program slightly. For the weights derived by thelinear program, refer to Fig. 13.Figure 13: Weights (in percentages) to guarantee the top 1 for each university inour top 20 ranking. Here the label w i refers to w i . The rows marked “infeasible”mean no weights could be found by LP. Also here and in the sequel we use redcolor for highlighting weights.In this figure, each row gives the set of weights that will guarantee the rank 1position for the university at the same row. We claim that the derived weightswould look reasonable to a human judge but we encourage the reader to usetheir own judgment in comparison with the weights used by the four rankingsorganizations presented earlier.Another way of looking at these derived weights is to see which attributes gethigher weights. It is perhaps reasonable to argue that such attributes providestrengths of the university that they belong. Following this line of thinking, we28ay say that the top ranked California Institute of Technology is strong acrossall its attributes whereas the lowest ranked Emory University in our originaltop 20 ranking has the “Percent of faculty with PhD degrees” and “Faculty tostudent ratio” as its strongest attributes. For a prospective student, this line ofthinking can provide two viewpoints: a) the best university is the one that hasmost of its weights closer to uniform, or b) the best university is the one thathas its highest weights for the attributes that the student is interested in. Inour opinion both viewpoints seem valid.Note that four rows have “infeasible”, meaning that no weights exist to makethe corresponding universities top ranked. These can also be confirmed to someextent via the simulations as shown in Fig. 11 and Fig. 12. In those figures,the minimum ranks these universities could reach in 10,000 simulations werenever the top rank. Here we say “to some extent” because the coverage of LPis exhaustive while that of random simulation is not. Enforcing top k ranks.
The results of this experiment for feasibility isthe same as in the general case of Problem 1 (with D defined as in Eq. 12), eventhough we changed the linear program slightly. For the weights derived by thelinear program, refer to Fig. 14.Figure 14: Weights (in percentages) to guarantee the top k for each k from 1to 20. Here the label w i refers to w i . The rows marked “infeasible” mean noweights could be found by LP.In this figure, each row gives the weights to guarantee the ranking of top k,where k is the value in the first column of the related row. That is, we derivethe set of weights to enforce the top k ranking as given in our top 20 rankings,for k from 1 to 20. It is not a coincidence that top 1 weights match the firstrow in Fig. 13.As in the top 1 case, we have some “infeasible” rows, namely, the last two29ows. This means we could find weights to enforce the ranking up to top 18only. For those last two rows, there exist no weights to enforce their ranks tothe 19th and 20th, respectively, unless we reorder the universities.Figure 15: 27 universities that reach the optimal rank of 1. In the first three problems, we have seen that not every university can attainthe top rank or the top score. In this section, we want to conclusively find outthe top rank each university can attain.The approach we take to compute the top rank possible for a universityworks in three steps: 1) move the university to the top rank; 2) generate theconstraints to enforce that the score of the university dominates every otheruniversity score; 3) count the number of score constraints that are not satisfied.The last step ensures that for every violated constraint, the enforced order waswrong and the university in question needs to move one rank down for eachviolation. In the end the count of these violations gives us the top rank theuniversity can attain in presence of all the other universities in the dataset.We again want to use LP for the approach above. The formulation is similarto that of Eq. 20 but we need a trick to count the number of constraint violations.We found such a trick in [11], which when combined with our formulation getsthe job done. The trick involves generating two new variables y i and d i forevery constraint s − s i >
0, turning this constraint into s − s i + d i >
0. If s − s i >
0, i.e., the score constraint is satisfied, we want the LP to set d i ≤ s − s i ≤
0, i.e., the score constraint is violated, we want30igure 16: The next 27 universities that reach the optimal ranks up to 8.the LP to set d i >
0. In addition, we want to count the number of violations,i.e., the number of times d i >
0. This is where y i , which can only be 0 or 1,comes into the picture. We bring y i and d i in the form of a new constraint: d i − M y i ≤
0. For a large constant M , every time d i >
0, this new constraintgets satisfied only if y i = 1. Every time y i is 1, this means the university inquestion needs to be demoted by 1 in rank. This also means the total numberof times this demotion happens will gives us the top rank. One caveat here isto ensure y i is zero every time d i ≤ d i ≤ d i − M y i ≤ y i zero or one, i.e., the latter needs to be avoided. This is achieved withthe objective function: Minimize the sum of y i , which will avoid y i = 1 unlessit is absolutely required.Our LP formulation implementing the approach above isminimize n (cid:80) i =2 y i subject to( a ) w j ≥ ∀ j ∈ [1 , m ]) , ( b ) m (cid:80) j =1 w j = 1 , ( c ) ( a j − a ij ) w j + d i ≥ (cid:15) ( ∀ i ∈ [2 , n ] and ∀ j ∈ [1 , m ]) , ( d ) d i − M y i ≤ ∀ i ∈ [2 , n ]) , ( e ) y i ∈ { , } ( ∀ i ∈ [2 , n ]) , (20)where M is a large enough constant, which we set to 10 in our experiments.Fig. 15 and Fig. 16 present the results in two tables. The table in Fig. 1531ontains the top 27 universities and the table in Fig. 16 contains the next setof 27 universities. For each university, these tables have three top ranks thatthese universities can attain: the “Deterministic” one coming from the geomet-ric uniform ordering, the “Random” one coming from the Monte Carlo weightassignment, the “Optimal” one coming from the LP formulation in this section.The table in Fig. 15 shows that there exist some weight assignments thatcan guarantee the top rank to 27 universities. Weight assignments also exist formoving the next five universities to the rank 2 position, for moving the next sixuniversities to the rank 3 position, and so on.Both of these tables also show the key difference between the Monte Carlosearch and the optimal search to find the top rank. By optimality and also asthese tables demonstrate, the “Optimal” ranks necessarily lower than or equalto the “Random” ranks; however, for some universities the differences are quitelarge. Recall that the Monte Carlo search used 10,000 runs while the optimalsearch used a single run. On the other hand, the Monte Carlo search finds thetop ranks for every university while the optimal search needs a new run to findthe top rank for each university. Despite these differences, the optimal searchis far faster to run for 10s of universities.What is the implication of the experiments in this section? Recall thatany of the university rankings assigns the top rank to a single university. Theexperiments in this section indicate that actually 27 universities can attain thetop rank under some weight assignments. Does it then make sense to claimthat only a single university is the top university? This section indicates thatthe answer has to be a no. Then, how can we rank universities based on theresults of this section? We think within the limits of this section, we may rankuniversities in groups, the top group containing the universities that can attainrank 1, the next group containing the universities that can attain rank 2, andso on.One counter-argument to the argument in the paragraph above may be therealization that universities attain the top rank under some but different weightassignments. In other words, a given weight assignment that moves a particularuniversity to its best rank may not make another university to attain its bestrank. This means a single ranking cannot be used to rank the universities, whichis against the idea of university rankings in the first place. 
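The following is a minimal sketch of the violation-counting formulation in Eq. 20, again assuming PuLP and its bundled CBC solver; the function name and the toy data are illustrative, not taken from our repository.

```python
import pulp

def best_attainable_rank(A, target, eps=1e-6, M=10.0):
    """Best possible rank of `target` via the big-M trick of Eq. 20."""
    n, m = len(A), len(A[0])
    others = [i for i in range(n) if i != target]
    prob = pulp.LpProblem("best_rank", pulp.LpMinimize)
    w = [pulp.LpVariable(f"w{j}", lowBound=0) for j in range(m)]     # (a)
    d = {i: pulp.LpVariable(f"d{i}") for i in others}                # free slack
    y = {i: pulp.LpVariable(f"y{i}", cat="Binary") for i in others}  # (e)
    prob += pulp.lpSum(y.values())  # minimize the number of violations
    prob += pulp.lpSum(w) == 1      # (b)
    for i in others:
        # (c) dominate university i's score, up to the slack d_i
        prob += pulp.lpSum((A[target][j] - A[i][j]) * w[j]
                           for j in range(m)) + d[i] >= eps
        prob += d[i] - M * y[i] <= 0  # (d) d_i > 0 forces y_i = 1
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    violations = sum(int(round(pulp.value(v))) for v in y.values())
    return 1 + violations           # one demotion per violated constraint

# Toy example with 3 universities and 3 normalized attributes:
A = [[0.9, 0.5, 0.7],
     [0.8, 0.9, 0.6],
     [1.0, 1.0, 1.0]]
print(best_attainable_rank(A, 0))  # 2: row 2 dominates row 0 on every attribute
```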
In every problem we solved above, we had to derive weights. In this section, we explore the possibility of rankings that do not use weights at all. The idea is to rank every university on each attribute independently, and then aggregate these rankings into a final joint ranking. This ensures that if the attributes themselves (apart from which ones are selected) are objective enough, each per-attribute ranking is also objective. This leads to a far more objective ranking than any weight-based ranking. Similar proposals, independently made, appear in [4, 47].

Figure 17: The rank of each university with respect to each attribute for the universities in our original ranking. Here the label aij refers to $a_{ij}$.

The problem of aggregating multiple independent rankings into a final joint ranking is called the "rank aggregation" problem in the literature [14, 20], where the best method to use depends on the application area. Here we use the Kemeny rule, which is recognized as one of the best overall [20].

The Kemeny rule minimizes the total number of disagreements between the final aggregate ranking (called the Kemeny ranking) and the input rankings, which here are the independent per-attribute rankings. Unfortunately, computing the optimal Kemeny ranking is NP-hard [20], which means we can either resort to approximation or heuristic algorithms, or we can still seek the optimal ranking by reducing the problem size. We go for the latter.

The Kemeny ranking of a set of universities (or objects) can be computed optimally by solving the following integer linear programming (ILP) formulation:

$$
\begin{array}{ll}
\text{minimize} & \sum_{a \ne b} n_{ba}\, x_{ab} \\
\text{subject to} & x_{ab} + x_{ba} = 1 \quad (\forall a, b : a \ne b), \\
& x_{ab} + x_{bc} + x_{ca} \le 2 \quad (\forall a, b, c : a \ne b,\ b \ne c,\ c \ne a), \\
& x_{ab} \in \{0, 1\} \quad (\forall a, b : a \ne b),
\end{array}
\tag{21}
$$

where, for two universities $a$ and $b$, $x_{ab} = 1$ if $a$ is ranked ahead of $b$ in the aggregate ranking and 0 otherwise, and $n_{ba}$ is the number of input rankings that rank $b$ ahead of $a$. The second constraint above can also be written as $x_{ab} + x_{bc} + x_{ca} \ge 1$. Fig. 17 gives the rank of each university $i$ per attribute $a_{ij}$; these ranks were the input to the integer linear program in Eq. 21. (A sketch of this ILP follows below.)

Figure 18: Our original ranking reranked with respect to Kemeny ranks. Here the label aij refers to $a_{ij}$.

The "Kemeny rank" column shows the resulting rank for each university. For comparison, the "Average rank" column shows the average rank over all the per-attribute ranks for each university. The difference between the former and the latter figure is that the former shows the universities in our original top 20 ranking as in Fig. 1, whereas the latter reranks this top 20 based on the derived Kemeny ranks.
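Here is a minimal sketch of the ILP in Eq. 21, once more assuming PuLP with its bundled CBC solver; `rankings` is an illustrative list of input rankings, each listing the same object ids from best to worst.

```python
import itertools
import pulp

def kemeny_order(rankings):
    """Return the Kemeny aggregate order (best to worst) via the ILP in Eq. 21."""
    objs = sorted(rankings[0])
    pos = [{o: r.index(o) for o in r} for r in rankings]
    def n(b, a):  # number of input rankings placing b ahead of a
        return sum(1 for p in pos if p[b] < p[a])
    prob = pulp.LpProblem("kemeny", pulp.LpMinimize)
    x = {(a, b): pulp.LpVariable(f"x_{a}_{b}", cat="Binary")
         for a, b in itertools.permutations(objs, 2)}
    # minimize total pairwise disagreement with the input rankings
    prob += pulp.lpSum(n(b, a) * x[(a, b)] for (a, b) in x)
    for a, b in itertools.combinations(objs, 2):
        prob += x[(a, b)] + x[(b, a)] == 1               # one order per pair
    for a, b, c in itertools.permutations(objs, 3):
        prob += x[(a, b)] + x[(b, c)] + x[(c, a)] <= 2   # forbid 3-cycles
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    # an object's position is determined by how many others it is ranked ahead of
    wins = {a: sum(int(round(pulp.value(x[(a, b)]))) for b in objs if b != a)
            for a in objs}
    return sorted(objs, key=lambda a: -wins[a])

print(kemeny_order([[1, 2, 3], [1, 3, 2], [2, 1, 3]]))  # [1, 2, 3]
```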
In Fig. 17, the last row gives the average similarity score between each per-attribute ranking and the Kemeny ranking (computed using Spearman's footrule distance). A close inspection of the per-attribute (column) ranks and the similarity scores reveals two interesting observations: a) the ranks based on two of the attributes are very large; b) the ranks based on three of the attributes are usually small, with the first of them being the smallest. The "large" and "small" designations also apply to the similarity scores. It may be possible to reason from these observations that the attributes producing a ranking too dissimilar to the Kemeny ranking may not be good attributes to use for university ranking, but we leave this as a conjecture for future research at this point.

You may wonder whether or not it is possible (as done in Problem 2) to find a set of weights that guarantees the final Kemeny ranking for the top 20. The answer is unfortunately negative unless NP=P. The reason is that Problem 2 can be solved in polynomial time whereas Problem 3 is NP-hard; unless NP=P, we cannot use a polynomial-time algorithm to solve an NP-hard problem. This means that a Problem 2 version of the Kemeny ranking for the top 20 is necessarily an infeasible linear program. There are techniques, e.g., via the use of slack variables as shown in [7], to discover a set of weights that add up to 99% instead of the required 100%, but a full exploration of this avenue is left for future research.

Finally, if the number of universities is too large, it is possible to use approximate algorithms for the Kemeny ranking. If all else fails, even using the average rank over the per-attribute ranks as a crude approximation to the Kemeny ranking may work (a small sketch follows below). For example, the similarity distance between the average and Kemeny rankings is 3, the best over all the per-attribute rankings. Note that average ranks can be computed quickly in polynomial time.

Figure 19: Our original ranking reranked with respect to average ranks over all the attribute-based ranks. Here the label aij refers to $a_{ij}$.

Figure 20: The histogram (left, pdf) and cumulative histogram (right, cdf) of the percentage of the maximum rank improvement of a university with respect to the old rank of the university for Case One (one university improving at a time). Most of the improvements are above 80%.
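For reference, a small pure-Python sketch (with illustrative names) of the average-rank shortcut and of the Spearman footrule distance [51] used for the similarity scores in Fig. 17:

```python
def average_ranks(per_attribute_ranks):
    """Average each university's rank over all per-attribute rankings."""
    objs = per_attribute_ranks[0].keys()
    k = len(per_attribute_ranks)
    return {o: sum(r[o] for r in per_attribute_ranks) / k for o in objs}

def footrule(r1, r2):
    """Spearman footrule distance: total absolute rank displacement."""
    return sum(abs(r1[o] - r2[o]) for o in r1)

# Toy example: three universities ranked on two attributes.
ranks = [{"A": 1, "B": 2, "C": 3}, {"A": 2, "B": 1, "C": 3}]
print(average_ranks(ranks))          # {'A': 1.5, 'B': 1.5, 'C': 3.0}
print(footrule(ranks[0], ranks[1]))  # 2
```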
The problems we studied so far have shown that there are many ways of creating reasonable rankings. One problem left is the question of which attributes a university should focus on to improve its ranking. This section proposes a simple solution to this problem.

Figure 21: The histogram (left, pdf) and cumulative histogram (right, cdf) of the percentage of the maximum rank improvement of a university with respect to the old rank of the university for Case All (all universities improving simultaneously). A good number of improvements are in the vicinity of 20% to 40%.

One question to address is how many attributes a university should improve at the same time. The extreme answer is all of the attributes, but this is probably unrealistic due to the large amount of resources such a focus would require. For simplicity, we assume that the university focuses on only one attribute, the one that provides the best improvement in the score of the university.

Another question is how many other universities are improving their scores at the same time that the university in question is focusing on its own. The answer is difficult to know, but it is probably most of the universities, due to the drastic impact, positive or negative, of the university rankings. Again for simplicity, we consider one attribute at a time with the following two extremes: given an attribute, in one extreme only the university in question modifies the value of the attribute (called "Case One"), and in the other extreme every university modifies the value of the attribute simultaneously (called "Case All"). For the university in question, the improvement in score is likely somewhere in between these extremes.

Note that we will not delve into what it takes in terms of resources, e.g., time, money, or staff, for a university to improve its score. The cost of these resources is expected to be substantial. Interested readers can refer to the relevant references, e.g., [34].

Let us start with some definitions. Let $a^{*}_{ij}$ denote the maximum value that attribute $j$ attains across all universities, i.e., the best value to which the $i$th university can improve $a_{ij}$. Recall that due to the attribute value normalization, this maximum value is at most 1.0. Also recall that in our formulation, larger values of each attribute are the desired direction to maximize the scores.

Earlier in Eq. 8 we defined the score $s_i$ of the $i$th university in the ranking as

$$ s^{\mathrm{old}}_i = \sum_{j=1}^{m} w_j\, a_{ij}, \tag{22} $$

which we will refer to as the old score due to the change we introduce next. Suppose we maximized the value of the $k$th attribute; then the new score becomes

$$ s^{\mathrm{new}}_i = \sum_{j=1}^{m} w_j\, a_{ij} - w_k\, a_{ik} + w_k\, a^{*}_{ik}, \tag{23} $$

and the improvement in the score is

$$ \Delta s_i = s^{\mathrm{new}}_i - s^{\mathrm{old}}_i = w_k\, (a^{*}_{ik} - a_{ik}), \tag{24} $$

which is guaranteed to be non-negative.

Our algorithm for this problem follows these steps: 1) find the maximum of each attribute across all universities in the input list of universities; 2) compute new scores for each pair of university and attribute; 3) sort the list of universities using the new scores; 4) print the new ranks and rank changes. In step 3, sorting is done over each attribute and for both cases, Case One and Case All. (A minimal sketch follows below.)
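A minimal pure-Python sketch of these steps for Case One; the function and variable names are illustrative, not taken from our repository.

```python
def rank_of(scores):
    """Rank 1 = highest score; ties broken by list position."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for r, i in enumerate(order):
        ranks[i] = r + 1
    return ranks

def case_one_gains(A, w, k):
    """Ranks gained by each university if it alone maxes out attribute k."""
    a_star = max(row[k] for row in A)   # step 1: best value of attribute k
    old = [sum(wj * aj for wj, aj in zip(w, row)) for row in A]
    old_rank = rank_of(old)
    gains = []
    for i in range(len(A)):
        new_i = old[i] + w[k] * (a_star - A[i][k])  # Eq. 24: non-negative gain
        scores = old[:i] + [new_i] + old[i + 1:]    # Case One: only i improves
        gains.append(old_rank[i] - rank_of(scores)[i])
    return gains

# Case All is analogous: set attribute k of every university to a_star at once.
```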
The results are given as histograms in Fig. 20 for Case One and in Fig. 21 for Case All. In each plot in these figures, the x-axis represents the percentage of the maximum rank improvement, i.e., $\Delta s_i / s^{\mathrm{old}}_i$, in ten buckets; the y-axes represent different quantities: the y-axis on the left is the count of universities falling into each bucket on the x-axis, whereas the y-axis on the right is the cumulative count, i.e., the number of universities falling into all buckets from the leftmost up to and including the given one.

These histograms show that drastic rank improvements are possible for Case One, probably as expected since the changes happen one at a time. For the majority of universities, the rank improvements can be above 80%. As for Case All, the rank improvements are still impressive, with half of them in the vicinity of 20% to 40%.

Fig. 22 shows how the percentage of the maximum rank improvement changes with respect to the old rank for both Case One (blue, upper plot) and Case All (green, lower plot). These plots show that the rank improvements are roughly consistent across ranks. This also means that for lower-ranked universities, rank improvements can be significant. Note that the two figures above are distribution (histogram) versions of the data in this figure.

Figure 22: Percentage of the maximum rank improvement of a university with respect to the old rank of the university as a result of attribute value improvements. The blue (upper) plot is for Case One (one university improving at a time) and the green (lower) plot is for Case All (all universities improving simultaneously). These plots indicate a good amount of consistency in rank improvements across ranks.

Let us summarize the different ways we can produce a ranking. We first need to select attributes. The universities in our dataset have 52 attributes, 50 of which are suitable as ranking attributes. Suppose we want to select 20 attributes for our ranking. The number of ways of selecting 20 attributes out of 50 is approximately $4.7 \times 10^{13}$, i.e., roughly 47 trillion!

Then we need to decide whether or not to use weights at all. If we decide not to use any weights, then we have to use one of the rank aggregation algorithms. Each of these algorithms is highly likely to produce a different ranking.

If we decide to use weights, then we have multiple choices for selecting them: uniform weights, non-uniform weights derived subjectively, non-uniform weights derived randomly, or non-uniform weights derived optimally (which in turn has multiple possibilities based on the objective function used).

Together with the weights, we also need to select which aggregation formula to use: arithmetic or geometric. The combinations of how to derive weights and how to aggregate them lead to different rankings.

Once these are selected, we next need to decide whether we will derive the best rank each university can attain or run a Monte Carlo simulation. The latter leads to more ways of ranking universities based on one of these factors: the average score, the average rank, etc., as in many of the columns in Fig. 23 and Fig. 24.

All in all, the discussion above shows that there are many ways of ranking universities, each with its own pros and cons. We hope this provides further evidence regarding the reliability of university rankings.

At this point, a good question is what we would recommend. First, we would like to clarify that our aim in this paper is not to present a better way of ranking universities; we hope to realize that aim in a future study.
However, we will mention a couple of ways that might be interesting to explore further.

Figure 23: Ranking of the top 20 universities in increasing "Prod" (Column 14) order using the arithmetic mean formula and using uniformly random weights in all 10,000 runs. "Avr", "Std", and "CV" in Columns 2-7 stand for the average, the standard deviation, and the coefficient of variation, respectively. "Prob in Top 20" in Column 8 is the probability of falling in the top 20 universities when they are ranked in decreasing score order. Each column is explained in §.

One way is to rank universities in groups, where group i may consist of all universities whose best possible rank is i. For example, the top group would contain the universities that can attain rank 1. Another way is to assign a university to the rank group covering ranks i to i+9 if the university attains the ranks in this group more often than the other ranks, i.e., ranks smaller than i or larger than i+9. If there are ties for a university, we may assign the university to the highest rank group. In Fig. 23 and Fig. 24, Column 9 gives the group ids according to this way of assigning groups.

Figure 24: Ranking of the top 20 universities in increasing "Prod" (Column 14) order using the geometric mean formula and using uniformly random weights in all 10,000 runs. "Avr", "Std", and "CV" in Columns 2-7 stand for the average, the standard deviation, and the coefficient of variation, respectively. "Prob in Top 20" in Column 8 is the probability of falling in the top 20 universities when they are ranked in decreasing score order. Each column is explained in §.

This way of group assignment is actually not a good one if the rankings are done in score order. This is because of the high correlation between these group assignments and the score ordering, which follows from the first observation made in §.

Rankings of universities are attention-grabbing events for the public due to their impact on students, parents, universities, funding agencies, and even countries. Among a large number of such rankings, four are well known and attract the most interest.

These rankings use a similar methodology that ranks universities based on their scores, usually computed as a sum-of-products formula involving a set of attributes and their respective weights, all subjectively selected by the ranking organizations. There is a huge literature on these rankings and their many issues.

In this paper, we produce a university ranking of our own in a repeatable way and in the open by applying a generic ten-step ranking methodology to a public dataset of US universities. Using formal and algorithmic formulations on this ranking as our testbed, we explore multiple problems and provide convincing evidence that university rankings as commonly done today using the same generic ranking methodology (though with different attributes and weights) are not reliable, in that it is relatively easy to move many universities to the top rank or to automatically generate many reasonable rankings with appealing weights. Given the many applications of the generic ranking methodology in ranking objects other than universities, we believe our findings have wide applicability.

We share our datasets and software code in a public repository [16] to ensure repeatability and encourage further studies.

References
[1] P. Aghion, M. Dewatripont, C. Hoxby, A. Mas-Colell, and A. Sapir. Higher Aspirations: An Agenda for Reforming European Universities, volume 5 of Blueprint Series. Bruegel, Belgium, Jul 2008.
[2] T. Andews. Generating random numbers that sum to 100. https://math.stackexchange.com/questions/1276206/, May 2015.
[3] Computing Research Association. CRA statement on U.S. News and World Report rankings of computer science universities. Retrieved from https://cra.org/cra-statement-us-news-world-report-rankings-computer-science-universities/, July 2019.
[4] S. Athanassoglou. Multidimensional welfare rankings under weight imprecision: a social choice perspective. Social Choice and Welfare, 42:719–44, 2015.
[5] E. Berger. CSRankings: Computer science rankings. Retrieved from http://csrankings.org/, July 2019.
[6] M. Berkelaar and J. Dirks et al. lp_solve 5.5. Retrieved from http://lpsolve.sourceforge.net/, July 2019.
[7] M. Berkelaar and J. Dirks et al. lp_solve 5.5: Infeasible models. Retrieved from http://lpsolve.sourceforge.net/5.5/Infeasible.htm, July 2019.
[8] J.-C. Billaut, D. Bouyssou, and P. Vincke. Should you believe in the Shanghai ranking? Scientometrics, 84(1):237–63, 2010.
[9] The College Board, Peterson's, and U.S. News & World Report. Common Data Set initiative. Retrieved July 2019.
[10] M.-L. Bougnol and J.H. Dula. Technical pitfalls in university rankings. Int. J. Higher Education Research, 69(5):859–66, May 2015.
[11] M.-L. Bougnol and J.H. Dula. The other side of ranking schemes: Generating weights for specified outcomes. Int. Transactions in Operations Research, 23(4):655–668, Jul 2016.
[12] J.-H. Cho, Y. Wang, I.-R. Chen, K.S. Chan, and A. Swami. A survey on modeling and optimizing multi-objective systems. IEEE Communications Surveys and Tutorials, 9(3):1867–901, May 2017.
[13] M. Clarke. Some guidelines for academic quality rankings. Higher Education in Europe, 27(4):444–59, 2002.
[14] P. D'Alberto and A. Dasdan. On the weakness of correlation. CoRR, abs/1107.2691, 2011.
[15] A. Dasdan. A weighted generalization of the Graham-Diaconis inequality for ranked list similarity. https://arxiv.org/abs/1804.05420, 2018.
[16] A. Dasdan. University rankings. https://github.com/alidasdan/university-rankings, 2020.
[17] J.A. Dearden, R. Grewal, and G.L. Lilien. Framing the university ranking game: Actors, motivations, and actions. Ethics in Science and Environmental Politics, 13:1–9, Jan 2014.
[18] J.A. Dearden, R. Grewal, and G.L. Lilien. Strategic manipulation of university rankings, the prestige effect, and student university choice. J. Marketing Research, 56(4):691–707, May 2019.
[19] K. Decancq and M.A. Lugo. Weights in multidimensional indices of well-being: An overview. Econometric Reviews, 32(1):7–34, 2013.
[20] C. Dwork, R. Kumar, M. Naor, and D. Sivakumar. Rank aggregation methods for the web. In Proc. Int. Conf. on WWW, pages 613–22. ACM, 2001.
[21] R.G. Ehrenberg. Method or madness? Inside the U.S. News & World Report college rankings. J. College Admission, 189:29–35, 2005.
[22] E. Berger et al. GOTO rankings considered helpful. Communications of the ACM, 62(7):29–30, Jul 2019.
[23] S. Blackburn et al. Institutional publication metrics for computer science. Retrieved from http://csmetrics.org/, July 2019.
[24] R. Finnie and A. Usher. Measuring the quality of post-secondary education: Concepts, current practices and a strategic plan. Research Report W28, Canadian Policy Research Networks, Apr 2005.
[25] X. Gan, I.C. Fernandez, J. Guo, M. Wilson, Y. Zhao, B. Zhou, and J. Wu. When to use what: Methods for weighting and aggregating sustainability indicators. Ecological Indicators, 81:491–502, Oct 2017.
[26] M. Gladwell. The order of things. The New Yorker, Feb 2011.
[27] S. Greco, A. Ishizaka, M. Tasiou, and G. Torrisi. On the methodological framework of composite indices: A review of the issues of weighting, aggregation, and robustness. Social Indicators Research, 141(1):61–94, Jan 2019.
[28] Y. Guan, A. Asudeh, P. Mayuram, H.V. Jagadish, J. Stoyanovich, G. Miklau, and G. Das. MithraRanking: A system for responsible ranking design. In Proc. SIGMOD, pages 1913–6. ACM, Jun 2019.
[29] IHEP. College and university ranking systems: Global perspectives and American challenges. Technical report, Institute for Higher Education Policy, Apr 2007.
[30] IREG. Berlin principles on ranking of higher education institutions. http://ireg-observatory.org/en/index.php/berlin-principles, Sep 2019.
[31] IREG. IREG observatory on academic ranking and excellence. http://ireg-observatory.org/en/, Sep 2019.
[32] P. Jabjaimoh, K. Samart, N. Jansakul, and N. Jibenja. Optimization for better world university rank. J. Scientometric Research, 8(1):18–20, 2019.
[33] W.C. Kirby and J.W. Eby. 'World-Class' universities: Rankings and reputation in global higher education. Background Note 316-065, Harvard Business School, Nov 2016.
[34] M. Kutner. How to game the college rankings. Boston Magazine, Aug 2014.
[35] N.C. Liu and Y. Cheng. The academic ranking of world universities. Higher Education in Europe, 30(2):127–36, Jul 2005.
[36] M. Luca and J. Smith. Salience in quality disclosure: Evidence from the U.S. News college rankings. J. Economics and Management Strategy, 22(1):58–77, 2013.
[37] M. Luca and J. Smith. Strategic disclosure: The case of business school rankings. J. Economic Behavior and Organization, 112:17–25, Apr 2015.
[38] R. Morse, E. Brooks, and M. Mason. How U.S. News calculated the 2019 best colleges rankings. Retrieved July 2019.
[39] R. Morse and J. Vega-Rodriguez. How U.S. News calculated the 2019 best global universities rankings. Retrieved Oct 2019.
[40] Niche. About Niche's 2020 best places. Retrieved Mar 2020.
[41] OECD. Handbook on constructing composite indicators: Methodology and user guide. Technical report, Organization for Economic Co-Operation and Development (OECD), 2008.
[42] R.A. Pagell. University research rankings: From page counting to academic accountability. Evaluation in Higher Education, 3(1):71–101, 2009.
[43] I. Permanyer. Assessing the robustness of composite indices rankings. Review of Income and Wealth, 57(2):306–26, Jun 2011.
[44] Quacquarelli Symonds (QS). World university rankings. Retrieved July 2019.
[45] Quacquarelli Symonds (QS). World university rankings methodology. Retrieved Jul 2019.
[46] C. Rim. UC Berkeley removed from US News college rankings for misreporting statistics. Forbes, Jul 2019.
[47] E. Roszkowska. Rank ordering criteria weighting methods – a comparative overview. Optimum. Studia Ekonomiczne, 65(5):14–33, 2013.
[48] M. Saisana and B. D'Hombres. Higher education rankings: Robustness issues and critical assessment. Scientific and Technical Report JRC 47028, European Commission Joint Research Centre, 2008.
[49] ShanghaiRanking Consultancy (SC). Academic ranking of world universities 2018. Retrieved July 2019.
[50] ShanghaiRanking Consultancy (SC). Academic ranking of world universities methodology. Retrieved July 2019.
[51] C. Spearman. A footrule for measuring correlation. British J. of Psychology, 2(1):89–108, Jul 1906.
[52] StatLib. College datasets 1995 from U.S. News and AAUP. Retrieved from http://lib.stat.cmu.edu/datasets/colleges/, July 2019.
[53] Times Higher Education Supplement (THE). World university rankings 2019. Retrieved July 2019.
[54] Times Higher Education Supplement (THE). World university rankings 2019: methodology. Retrieved July 2019.
[55] C. Tofallis. On constructing a composite indicator with multiplicative aggregation and the avoidance of zero weights in DEA. J. of the Operational Research Society, May 2014.
[56] A. Usher and M. Savino. A global survey of university ranking and league tables. Higher Education in Europe, 32(1):5–15, Apr 2007.
[57] U.S. News & World Report (USNWR). Best global universities ranking. Retrieved Dec 2019.
[58] U.S. News & World Report (USNWR). National university ranking. Retrieved Dec 2019.
[59] U.S. News & World Report (USNWR). FAQ: How and why we rank and rate hospitals. Retrieved from https://health.usnews.com/health-care/best-hospitals/articles/faq-how-and-why-we-rank-and-rate-hospitals, Mar 2020.
[60] U.S. News & World Report (USNWR). Methodology: How the 2020 best countries were ranked. Retrieved Mar 2020.
[61] A.F.J. van Raan. Challenges in ranking of universities. In Proc. 1st Int. Conf. on World Class Universities, pages 613–22. Shanghai Ranking, Jun 2005.
[62] Wikipedia. Academic ranking of world universities. Retrieved from https://en.wikipedia.org/wiki/Academic_Ranking_of_World_Universities, July 2019.
[63] Wikipedia. College and university rankings. Retrieved from https://en.wikipedia.org/wiki/College_and_university_rankings, July 2019.
[64] Wikipedia. QS world university rankings. Retrieved from https://en.wikipedia.org/wiki/QS_World_University_Rankings, Jul 2019.
[65] Wikipedia. Times Higher Education world university rankings. Retrieved from https://en.wikipedia.org/wiki/Times_Higher_Education_World_University_Rankings, July 2019.
[66] Wikipedia. U.S. News & World Report best colleges ranking. Retrieved from https://en.wikipedia.org/wiki/U.S._News_%26_World_Report_Best_Colleges_Ranking, July 2019.
[67] Wikipedia. Common data set. Retrieved from https://en.wikipedia.org/wiki/Common_Data_Set.