Joint aggregation of cardinal and ordinal evaluations with an application to a student paper competition
Dorit S. Hochbaum
Department of Industrial Engineering and Operations Research
University of California, Berkeley
email: [email protected]
Erick Moreno-Centeno
Department of Industrial and Systems Engineering
Texas A&M University
email: [email protected]
January 14, 2021
Abstract
An important problem in decision theory concerns the aggregation of individual rankings/ratings into a collective evaluation. We illustrate a new aggregation method in the context of the 2007 MSOM's student paper competition. The aggregation problem in this competition poses two challenges. First, each paper was reviewed by only a very small fraction of the judges; thus the aggregate evaluation is highly sensitive to the subjective scales chosen by the judges. Second, the judges provided both cardinal and ordinal evaluations (ratings and rankings) of the papers they reviewed. The contribution here is a new robust methodology that jointly aggregates ordinal and cardinal evaluations into a collective evaluation. This methodology is particularly suitable in cases of incomplete evaluations, i.e., when the individuals evaluate only a strict subset of the objects. This approach is potentially useful in managerial decision-making problems such as a committee selecting projects from a large set, or capital budgeting involving multiple priorities.
Keywords:
Consensus ranking, group ranking, student paper competition, decision making, incomplete ranking aggregation, incomplete rating aggregation.
1 Introduction

We present here a new framework for group decision making in which a group of individuals, or judges, collectively ranks all of the objects in a universal set. This framework takes into consideration the pairwise comparisons implied by the individuals' evaluations, and furthermore, it is the first to combine ordinal rankings with cardinal ratings so as to achieve an aggregate ranking that represents the individuals' assessments as well as possible, as measured by pre-set penalty functions.

Group-ranking problems are differentiated by whether the evaluations are given in ordinal or cardinal scales. An ordinal evaluation, or ranking, is one where the objects are ordered from "most preferred" to "least preferred" in the form of an ordered list (allowing ties). On the other hand, a cardinal evaluation, or rating, is an assignment of scalars, which are cardinal scores/grades, to the objects evaluated. In a rating, the difference between the scores of two objects indicates the magnitude of separation between those objects. Depending on the type of evaluations to be aggregated, group-ranking problems are referred to as ranking aggregation problems or rating aggregation problems.

Previous work addressed either the rankings-alone aggregation problem (e.g., Kemeny and Snell 1962; Arrow 1963; Bartholdi et al. 1989; Hochbaum and Levin 2006) or the ratings-alone aggregation problem (e.g., Keeney 1976; Saaty 1977), but not both. One of the primary contributions here is a technique that permits the joint aggregation of rankings and ratings into a collective evaluation. The individual evaluations input to a group-ranking problem can be complete or incomplete. In cases when each individual in the group ranks (rates) all of the objects in the universal set, the ranking (rating) is said to be complete, or a full list; otherwise, it is said to be incomplete, or a partial list. The framework developed here is applicable when the judges' ratings and rankings are incomplete.

The power of the framework developed here is illustrated in ranking the participants of the 2007 MSOM's student paper competition (SPC). This SPC aggregate ranking problem poses challenges that are unique to that scenario:

1. The judges provided both ratings and rankings of the papers they reviewed. This requires reconciling the two, possibly conflicting, types of evaluations.

2. The incompleteness of the evaluations was extreme: each judge evaluated fewer than a tenth of the papers, and each paper was reviewed by fewer than a tenth of the judges. This caused the aggregation to be subject to the bias of the "incomplete evaluation" phenomenon, in which the individual scales used by the judges affect the average scores even when the preference orderings of all the judges agree with each other. Also, outlier scores that are too low or too high tend to dominate the aggregate scores of the papers.

The issue of subjective scales is well recognized within the aggregate ranking literature. French (1988) argues that the value difference functions (the rating scales) of two individuals involve an arbitrary choice of scale and origin, and thus the same numeric score from two different judges generally does not have the same meaning. Similarly, in the context of international surveys, a large number of studies (see, for example, Baumgartner and Steenkamp 2001; Smith 2004; Harzing 2006) show that the responses across different countries do not have the same meaning.
In particular, these studies showed that even when respondents are asked to rate each object on a simple 5-point rating scale, there are significant differences in response styles between countries. One example of such a difference is that in some countries there is a tendency to use only the extreme categories, while in others there is a tendency to use only the middle categories. Another example is that in some countries there is a tendency to use only the top categories, while in others there is a tendency to use only the bottom categories.

In a decision-making setup where the judges provide scores, one can generate implied pairwise comparisons that reflect the intensity of the preference. This is done by letting this intensity be the difference in the scores of the two respective objects (these are called additive comparisons, discussed further later). Hochbaum and Levin (2006) demonstrated that an aggregate rating that minimizes the penalties for differing from the individual judges' implied pairwise comparisons overcomes the issue of judges using different parts of the scale, and is less sensitive to subjective scales than the use of cardinal scores alone. These types of penalties are called separation penalties, and the optimization problem that seeks to assign scores minimizing the total separation penalties is called the separation problem. The separation-deviation (SD) model, proposed in (Hochbaum, 2004, 2006; Hochbaum and Levin, 2006), considers an aggregate rating scenario where the input to the rating process is given as separation gaps and point-wise scores. A separation gap is a quantity that expresses the intensity of the preference of one object i over another object j by one particular judge. A point-wise score is a cardinal score of an object. The SD optimization problem combines the objective of minimizing the penalties for the deviation of the assigned scores from the point-wise scores given by the judges with the minimization of the separation penalties. For any choice of penalty functions, the aggregate rating obtained by solving the SD model is a complete rating that minimizes the sum of penalties for deviating from the given point-wise scores and separation gaps. The SD model is solved in polynomial time if the penalty functions are convex; it is NP-hard otherwise.

In our problem setting the judges provided only point-wise scores; therefore no pairwise comparisons were provided directly. Instead, we use the pairwise comparisons implied by the scores. The mechanism we propose here uses the SD model for both the rankings and the ratings provided by the judges. For the rankings, the penalty functions proposed are not convex. We "convexify" those functions and obtain an optimization model that combines the separation and deviation penalties for deviating from the rankings and from the ratings of all judges. This is the first aggregate decision model that combines both ordinal and cardinal inputs.

The advantages of the proposed mechanism over standard approaches are readily apparent. It is easy to recognize a discrepancy in scores given to the same object by different judges. However, it is possible that the scores given are very close, yet each one is assigned from a different subjective scale. For one judge a score of 7 out of 10 can indicate the top evaluation, whereas for another it may mean the very bottom.
Such scale differences cannot be identified by considering the variance of the scores alone. An optimal solution to the SD problem, with the given penalty functions, allows one to immediately identify the largest-penalty pairs, which, if large enough, indicate that different judges disagreed significantly on the comparison between those pairs of objects. This permits the identification of inconsistencies and outliers, which could be judges who are too lenient or too strict, or who for other reasons had intensities of preference substantially different from the others. As such, the proposed methodology not only provides an aggregate ranking, but also clarifies the disagreements and inconsistencies, making it possible to go back and investigate the reasons for those outliers.

The paper is organized as follows: Section 2 provides a literature review of relevant aggregate group-decision-making techniques for the aggregation of rankings and ratings. Section 3 describes the evaluation methodology used in the 2007 MSOM's SPC and gives examples where the differences in the scales used by the judges are evident. Section 4 reviews the models and distance metrics used to construct the penalty functions and defines the notions of consensus ranking and consensus rating used here. Section 5 describes the methodology for the combined use of the given ratings and rankings in order to obtain the aggregate rating-ranking pair. Section 6 uses the methodology presented in Section 5 to rank the contestants in the 2007 MSOM's SPC and analyzes the obtained results. Finally, Section 7 provides comments on our group-decision-making framework and its usefulness for different applications and decision-making scenarios.

2 Literature Review

The ranking aggregation problem has been studied extensively, especially in the social choice literature. In this context, one of the most celebrated results is Arrow's impossibility theorem (Arrow, 1963), which states that there is no "satisfactory" method to aggregate a set of rankings. Kenneth Arrow defined a satisfactory method as one that satisfies the following properties: universal domain, no imposition, monotonicity, independence of irrelevant alternatives, and non-dictatorship.

Kemeny and Snell (1962) proposed a set of axioms that a distance metric between two complete rankings should satisfy. They proved that these axioms are jointly satisfied by a unique distance metric. This distance between two rankings is measured by the number of rank reversals between them. A rank reversal is incurred whenever two objects have a different relative order in the given rankings. Similarly, half a rank reversal is incurred whenever two objects are tied in one ranking but not in the other. Kemeny and Snell defined the consensus ranking as the ranking that minimizes the sum of the distances to each of the input rankings. Bartholdi et al. (1989) showed that the optimization problem that needs to be solved to find the Kemeny-Snell consensus ranking is NP-hard.

Following the work of Kemeny and Snell, several axiomatic approaches have been developed to determine consensus. For instance, Bogart (1973) developed an axiomatic distance between partial orders. One of the applications of Bogart's distance is to determine a consensus partial order from a set of partial orders. Moreno-Centeno (2010) developed an axiomatic distance between incomplete rankings that is used here.

The difficulties presented by Arrow's impossibility theorem and the NP-hardness of finding the Kemeny-Snell consensus ranking can be overcome by replacing ordinal rankings with (cardinal) ratings.
Following this direction, Keeney (1976) proved that the averaging method satisfies all of Arrow's desirable properties. In the averaging method, the consensus rating of each object is the average of the scores it received. The most immediate drawback of this approach is that the averaging method implicitly requires that all judges use the same rating scale; that is, that all individuals are equally strict or equally lenient in their score assignments. This work also ignores the aspect of pairwise comparisons, which is essential to the Kemeny-Snell model.

Pairwise comparison intensities are the input to Saaty's Analytic Hierarchy Process (Saaty, 1977). There, the optimal scores are found by the principal eigenvector technique. The reader is referred to (Hochbaum, 2010) for an analysis of the principal eigenvector method in the context of aggregate decision making.

The separation-deviation model of (Hochbaum, 2004, 2006; Hochbaum and Levin, 2006) addresses the computational shortcomings of the Kemeny-Snell model and the decision-quality inadequacies of the principal eigenvector method. This model takes point-wise scores, and potentially also pairwise comparisons, as inputs. It is the building block of the mechanism proposed here. As pointed out above, the respective separation-deviation optimization problem is solvable in polynomial time if all the penalty functions are convex (Hochbaum and Levin, 2006).

The rating aggregation problem has also been studied in the multi-criteria decision-making literature. Hochbaum and Levin (2006) showed the equivalence between the rating aggregation problem and the multi-criteria decision-making problem. In this context, the non-axiomatic ELECTRE (Brans et al., 1975) and PROMETHEE (Brans and Vincke, 1985) methods (and their extensions) solve the rating aggregation problem that arises from a multi-criteria decision problem by transforming it, in some sense, into a ranking aggregation problem. This transformation is claimed to be needed because each criterion is evaluated on a different scale.
3 The 2007 MSOM's Student Paper Competition

The data used here consists of the evaluations for the 2007 MSOM's SPC. There were 58 papers submitted to the competition, and 63 judges participated in the evaluation process. Each of the 63 judges evaluated only three to five of the 58 papers, and each of the 58 papers was evaluated by only three to five of the 63 judges. Each judge reviewed and evaluated the assigned papers on the following attributes (scale):

A) Problem importance/interest (1–10),
B) Problem modeling (0–10),
C) Analytical results (0–10),
D) Computational results (0–10),
E) Paper writing (1–10), and
F) Overall contribution to the field (Field Contribution, for short) (1–10).

On each attribute, the judges assigned scores according to the score guidelines provided (see Table 1). In addition, each judge also provided an ordinal ranking of the papers he/she reviewed (1 = best, 2 = second best, etc.).
Table 1: Interpretation of each numerical score. The journals considered are MSOM, Operations Research (OR), and Management Science (MS).
Score   Definition / Interpretation
10      Attribute considered is comparable to that of the best papers published in the journals.
8, 9    Attribute considered is comparable to that of the average papers published in the journals.
7       Attribute considered is at the minimum level for publication in the journals.
5, 6    Attribute considered independently would require a minor revision before publication in the journals.
3, 4    Attribute considered independently would require a major revision before publication in the journals.
1, 2    Attribute considered would warrant by itself a rejection if the paper were submitted to the journals.
0       Attribute considered is not relevant or applicable to the paper being evaluated.
Although precise score interpretations were provided to the judges (Table 1), the judges nevertheless appear to have differed significantly in their evaluations and must have interpreted the scale differently. Examples of this phenomenon are illustrated for paper 43, in Table 2, and for paper 26, in Table 3. To maintain the anonymity of judges and papers, the judge and paper identification numbers were assigned randomly.
Table 2: Evaluations received on paper 43.
Judge  Problem Importance  Problem Modeling  Analytical Results  Computational Results  Paper Writing  Field Contribution  Paper Ranking
47     9                   8                 8                   8                      9              9                   1
6      6                   4                 2                   4                      4              4.5                 1
55     9                   6                 0                   9                      8              6                   2
2      7                   7                 2                   6                      7.5            4                   3
A detailed examination of Table 2 shows that in the Problem Modeling category paper 43 received a score of 8 from one judge (meaning that the problem modeling in the paper is comparable to that in an average paper published in MSOM, OR, and MS) and a score of 4 from another judge (meaning that the problem modeling in the paper requires a major revision before publication in MSOM, OR, and MS). These score differences are not insignificant. Another example of the differences between the judges' evaluations is found in the Analytical Results category. In this category, one judge gave a score of 8 (meaning that the analytical results in the paper are comparable to those in an average paper published in MSOM, OR, and MS), two judges gave a score of 2 (meaning that the analytical results would by themselves warrant rejection from MSOM, OR, and MS), and the remaining judge considered that the category was not applicable to the paper (and thus assigned the value of zero).
Table 3: Evaluations received on paper 26.
Judge  Problem Importance  Problem Modeling  Analytical Results  Computational Results  Paper Writing  Field Contribution  Paper Ranking
21     8                   10                8                   8                      5              8                   3
24     8                   9                 8                   10                     7              8                   1
14     7                   2                 3                   2                      2              2                   5
26     8                   8                 7                   8                      8              7                   3
49     10                  7                 6                   9                      9              8                   1
The data in Table 3 show that judge 14's evaluations were not on the same scale as the evaluations of the other judges. In particular, on all attributes (with the exception of Problem Importance) judge 14 gave a score indicating that the paper should be rejected by MSOM, OR, and MS; on the other hand, on every attribute all of the other judges considered the paper worthy of publication (some of their evaluations even indicate that the paper would be among the best papers published in MSOM, OR, and MS!). Such discrepancies in the judges' evaluations are quite common throughout the data.

Henceforth, we use as the input point-wise ratings the (cardinal) scores on the attribute "Overall Contribution to the Field" ("Field Contribution", for short) only. This is because the authors and the head judge of the 2007 MSOM's SPC believe that, among all the attributes that were scored according to the cardinal scale in Table 1 (i.e., excluding the ordinal paper ranking), this attribute is the single most important attribute evaluated.
4 Preliminaries
This section introduces the notation used throughout the rest of the paper and reviews the concepts of the separation-deviation (SD) model, the distance between incomplete ratings, and the distance between incomplete rankings.
4.1 Notation

Let V be the ground set of n objects to be rated; without loss of generality, we assign a unique identifier to each element so that V = {1, 2, ..., n}. The judges are K individuals. Each judge k, k ∈ {1, 2, ..., K}, provides a set of scores, or ratings vector, a^{(k)}, of the objects in a subset A^{(k)} of V. Thus a^{(k)}_j is the score of object j by the k-th individual, and a^{(k)}_j is undefined if the k-th individual did not rate object j. Without loss of generality, we assume that the scores are integers contained in a pre-specified interval [ℓ, u]. The range of the ratings is defined as R ≡ u − ℓ.

We say that judge k's implied pairwise comparison, or separation gap, of i over j is p^{(k)}_{ij}, where

\[
p^{(k)}_{ij} =
\begin{cases}
a^{(k)}_i - a^{(k)}_j & \text{if } i \in A^{(k)} \text{ and } j \in A^{(k)}\\
\text{undefined} & \text{otherwise.}
\end{cases}
\]

Analogously, in the ordinal setting of the incomplete-ranking aggregation problem, each judge k provides an incomplete ranking b^{(k)} of the objects in B^{(k)}, a subset of V. Here b^{(k)}_i is the rank (an ordinal number) of object i in the ranking provided by the k-th individual, and b^{(k)}_i is undefined if individual k did not rank object i.

The implied separation gaps for ordinal rankings are sign(b^{(k)}_i − b^{(k)}_j) for i, j ∈ B^{(k)}, where the sign function is defined as

\[
\operatorname{sign}(x) =
\begin{cases}
-1 & \text{if } x < 0\\
0 & \text{if } x = 0\\
1 & \text{if } x > 0.
\end{cases}
\]

For a vector of scores, or ratings, a of a set of objects, we denote by rank(a) the complete ranking of those objects obtained by sorting the objects according to their scores in a. For example, a vector of scores such as (8.5, 7, 7, 9.5, 6) corresponds to the ranking (2, 3, 3, 1, 5).
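To make these definitions concrete, the following short Python sketch (our own illustration, not code from the paper) computes one judge's implied cardinal and ordinal separation gaps, and the rank(·) operator, from incomplete evaluations stored as dictionaries; the input format is an assumption:

```python
# Sketch: implied separation gaps of a single judge. A rating/ranking is a
# dict mapping object id -> score/rank; objects the judge did not evaluate
# are simply absent, so their gaps remain undefined.

def sign(x):
    return (x > 0) - (x < 0)

def cardinal_gaps(a):
    """p_ij = a_i - a_j for every ordered pair of objects the judge rated."""
    return {(i, j): a[i] - a[j] for i in a for j in a if i != j}

def ordinal_gaps(b):
    """sign(b_i - b_j) for every ordered pair of objects the judge ranked."""
    return {(i, j): sign(b[i] - b[j]) for i in b for j in b if i != j}

def rank_of(scores):
    """rank(a): the ranking implied by a score vector; tied scores share
    the best available position, as in the example above."""
    return [1 + sum(1 for t in scores if t > s) for s in scores]

# A hypothetical judge who evaluated objects 1, 2, and 5 of a larger set:
print(cardinal_gaps({1: 9, 2: 6, 5: 7}))  # e.g. p_12 = 3, p_25 = -1, ...
print(ordinal_gaps({1: 1, 2: 3, 5: 2}))   # gap values in {-1, 0, 1}
print(rank_of([8.5, 7, 7, 9.5, 6]))       # [2, 3, 3, 1, 5]
```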
4.2 The Separation-Deviation Model

The SD model can be applied to group-decision-making problems where the input is given as pairwise comparisons and/or point-wise scores. In the model formulation, the variable x_i is the aggregate score of the i-th object, and the variable z_{ij} is the aggregate separation gap of the i-th object over the j-th object. The aggregate separation gaps must be consistent. A set of separation gaps p_{ij} is said to be consistent if and only if, for all triplets i, j, k, p_{ij} + p_{jk} = p_{ik}. In (Hochbaum, 2010; Hochbaum and Levin, 2006) it was proved that the consistency of a set of separation gaps is equivalent to the existence of a set of scores ω_i, for i = 1, ..., n, such that p_{ij} = ω_i − ω_j.

The mathematical programming formulation of the SD model is:

\[
\text{(SD)}\quad \min_{x,z}\ \sum_{k=1}^{K}\sum_{i=1}^{n}\sum_{j=1}^{n} f^{(k)}_{ij}\bigl(z_{ij}-p^{(k)}_{ij}\bigr) \;+\; \sum_{k=1}^{K}\sum_{i=1}^{n} g^{(k)}_{i}\bigl(x_i-a^{(k)}_i\bigr) \tag{1a}
\]
subject to
\[
z_{ij}=x_i-x_j \qquad i=1,\dots,n;\ j=1,\dots,n \tag{1b}
\]
\[
\ell \le x_i \le u \qquad i=1,\dots,n \tag{1c}
\]
\[
x_i \in \mathbb{Z} \qquad i=1,\dots,n. \tag{1d}
\]

The function f^{(k)}_{ij}(·) penalizes the difference between the aggregate separation gap of the object pair (i, j) and the k-th reviewer's separation gap of the object pair (i, j). The function g^{(k)}_i(·) penalizes the difference between the aggregate score of object i and the k-th reviewer's score of object i. In order to ensure polynomial-time solvability, the functions f^{(k)}_{ij}(·) and g^{(k)}_i(·) must be convex. In the context of rating aggregation, the penalty functions assume the value 0 at the argument 0, meaning that if the aggregate separation gap z_{ij} for i and j agrees with p^{(k)}_{ij}, then f^{(k)}_{ij}(z_{ij} − p^{(k)}_{ij}) = f^{(k)}_{ij}(0) = 0. If i ∉ A^{(k)}, then g^{(k)}_i(·) is set to the constant function 0; similarly, if at least one of i or j is not in A^{(k)}, then f^{(k)}_{ij}(·) is set to the constant function 0. Constraints (1b) enforce the consistency of the aggregate separation gaps, which conform to the aggregate rating.

It was proved in (Hochbaum, 2004, 2006; Hochbaum and Levin, 2006) that problem (SD) is a special case of the convex dual of the minimum cost network flow (CDMCNF) problem. The most efficient algorithm known for the CDMCNF problem has a running time of O(mn log(n²/m) log(u − ℓ)) (Ahuja et al., 2003), where m is the total number of given separation gaps and n is the number of objects. Ahuja et al. (2004) presented an alternative algorithm that uses a minimum-cut algorithm as a subroutine.
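For illustration, with absolute-value (convex) penalties the SD model can be written directly as a small convex program. The sketch below is our own minimal illustration using the cvxpy modeling package, not the network-flow algorithms the paper relies on; the dictionary-based input format is an assumption, and the integrality constraint (1d) is relaxed:

```python
import cvxpy as cp

def solve_sd(n, gaps, scores, lo, hi):
    """Minimal SD model (1) with f = g = |.| penalties (our choice).
    gaps:   dict (k, i, j) -> judge k's separation gap p_ij^(k)
    scores: dict (k, i)    -> judge k's point-wise score a_i^(k)
    Objects are indexed 0..n-1. Integrality (1d) is relaxed here; the
    paper solves SD exactly via convex-dual network flow algorithms."""
    x = cp.Variable(n)
    separation = sum(cp.abs((x[i] - x[j]) - p)
                     for (k, i, j), p in gaps.items())
    deviation = sum(cp.abs(x[i] - a) for (k, i), a in scores.items())
    problem = cp.Problem(cp.Minimize(separation + deviation),
                         [x >= lo, x <= hi])
    problem.solve()
    return x.value
```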
4.3 Distance between Incomplete Ratings

Defining a penalty function on separation gaps is equivalent to quantifying the distance between them. Cook and Kress (1985) proposed a distance between complete ratings. This distance function was adapted to incomplete ratings in (Moreno-Centeno, 2010). It was shown there that, for a set of desirable properties, this adaptation, called the normalized projected Cook-Kress distance (NPCK), is the only one that satisfies all those properties.

Given two incomplete ratings a^{(1)} and a^{(2)}, the NPCK distance between the implied separation gaps is

\[
d_{\mathrm{NPCK}}\bigl(a^{(1)},a^{(2)}\bigr) \;=\; C \sum_{i \in A^{(1)} \cap A^{(2)}} \sum_{j \in A^{(1)} \cap A^{(2)}} \bigl| p^{(1)}_{ij} - p^{(2)}_{ij} \bigr|, \tag{2}
\]
where
\[
C \;=\; \left( 2 \cdot R \cdot \left\lceil \tfrac{|A^{(1)} \cap A^{(2)}|}{2} \right\rceil \cdot \left\lfloor \tfrac{|A^{(1)} \cap A^{(2)}|}{2} \right\rfloor \right)^{-1}. \tag{3}
\]

C is a normalization constant that guarantees that 0 ≤ d_NPCK(·,·) ≤ 1; R is the range of the ratings, R ≡ u − ℓ. We note that d_NPCK(a^{(1)}, a^{(2)}) = 0 when the separation gaps implied by a^{(1)} and a^{(2)} agree completely, and d_NPCK(a^{(1)}, a^{(2)}) = 1 when they disagree maximally. The normalization is important so that the distances in problem (4) are comparable to each other even when the individuals rate different numbers of objects. The normalization constant C was chosen to address the following difficulties: (a) each of the distances in problem (4) is between a complete rating x^{(c)} and an incomplete rating; (b) the number of objects rated in each incomplete rating differs, and therefore the distances in problem (4) are over spaces of different dimensions (the distance only considers the objects rated in the incomplete rating); (c) distances in higher-dimensional spaces tend to be larger than distances in lower-dimensional spaces; specifically, observe that the number of summands in equation (2) is the square of the number of objects rated in the incomplete rating.

In (Moreno-Centeno, 2010) the consensus rating, x^{(c)}, is the optimal solution to the following optimization problem:

\[
\min_{x} \sum_{k=1}^{K} d_{\mathrm{NPCK}}\bigl(a^{(k)}, x\bigr). \tag{4}
\]

The problem of finding the consensus rating is a special case of the SD model and is therefore solvable in polynomial time.
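A direct transcription of the NPCK distance (2)-(3) into Python (our sketch; object sets are represented as dictionary keys, and the constants follow the reconstruction displayed above):

```python
from math import ceil, floor

def d_npck(a1, a2, R):
    """NPCK distance (2)-(3) between two incomplete ratings (dicts
    object -> score); R = u - l is the range of the rating scale."""
    common = set(a1) & set(a2)
    m = len(common)
    if m < 2:                 # fewer than two common objects: no gaps
        return 0.0
    total = sum(abs((a1[i] - a1[j]) - (a2[i] - a2[j]))
                for i in common for j in common)
    C = 1.0 / (2 * R * ceil(m / 2) * floor(m / 2))
    return C * total

# Two judges overlapping on objects 1 and 2 of a 0-10 scale:
print(d_npck({1: 9, 2: 6, 3: 7}, {1: 5, 2: 5, 4: 8}, R=10))  # 0.3
```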
4.4 Distance between Incomplete Rankings

Given a set of incomplete rankings {b^{(k)}}, k = 1, ..., K, the consensus ranking is defined as the complete ranking closest to the given incomplete rankings. Kemeny and Snell (1962) proposed a distance between complete rankings. This distance function was adapted to incomplete rankings in (Moreno-Centeno, 2010). It was shown there that, for a set of desirable properties, this adaptation, called the normalized projected Kemeny-Snell distance (NPKS), is the only one that satisfies all those properties.

Given two incomplete rankings b^{(1)} and b^{(2)}, the NPKS distance between them is calculated as follows:

\[
d_{\mathrm{NPKS}}\bigl(b^{(1)},b^{(2)}\bigr) \;=\; D \sum_{i \in B^{(1)} \cap B^{(2)}} \sum_{j \in B^{(1)} \cap B^{(2)}} \Bigl| \operatorname{sign}\bigl(b^{(1)}_i - b^{(1)}_j\bigr) - \operatorname{sign}\bigl(b^{(2)}_i - b^{(2)}_j\bigr) \Bigr|, \tag{5}
\]
where
\[
D \;=\; \Bigl( \bigl|B^{(1)} \cap B^{(2)}\bigr|^2 - \bigl|B^{(1)} \cap B^{(2)}\bigr| \Bigr)^{-1}.
\]

D is a normalization constant that guarantees that 0 ≤ d_NPKS(·,·) ≤ 1; d_NPKS(b^{(1)}, b^{(2)}) = 0 when b^{(1)} and b^{(2)} agree completely on the relative order of the commonly ranked objects, and d_NPKS(b^{(1)}, b^{(2)}) = 1 when they disagree completely. The normalization is important so that the distances in problem (6) are comparable to each other even when the individuals rank different numbers of objects. The normalization constant D was chosen to address the following difficulties: (a) each of the distances in problem (6) is between a complete ranking x^{(o)} and an incomplete ranking; (b) the number of objects ranked in each incomplete ranking differs, and therefore the distances in problem (6) are over spaces of different dimensions (the distance only considers the objects ranked in the incomplete ranking); (c) distances in higher-dimensional spaces tend to be larger than distances in lower-dimensional spaces; specifically, observe that the number of summands in equation (5) is the square of the number of objects ranked in the incomplete ranking.

The distance d_NPKS(b^{(1)}, b^{(2)}) has the following natural interpretation: the distance between two incomplete rankings is proportional to the number of rank reversals between them, where a rank reversal is incurred whenever two objects have a different relative order in the rankings b^{(1)} and b^{(2)}; similarly, half a rank reversal is incurred whenever two objects are tied in one ranking but not in the other.

In (Moreno-Centeno, 2010) the consensus ranking, x^{(o)}, is the optimal solution to

\[
\min_{x} \sum_{k=1}^{K} d_{\mathrm{NPKS}}\bigl(b^{(k)}, x\bigr). \tag{6}
\]

In contrast to problem (4), problem (6) is NP-hard. We propose here to convexify the nonlinear sign functions in d_NPKS(·,·), as suggested in (Moreno-Centeno, 2010):

\[
h^{(k)}_{ij}(z_{ij}) \;=\;
\begin{cases}
\max\bigl\{0,\; z_{ij}+1\bigr\} & \text{if } \operatorname{sign}\bigl(b^{(k)}_i - b^{(k)}_j\bigr) = -1\\
\max\bigl\{-z_{ij},\; z_{ij}\bigr\} & \text{if } \operatorname{sign}\bigl(b^{(k)}_i - b^{(k)}_j\bigr) = 0\\
\max\bigl\{1 - z_{ij},\; 0\bigr\} & \text{if } \operatorname{sign}\bigl(b^{(k)}_i - b^{(k)}_j\bigr) = 1.
\end{cases} \tag{7}
\]

The resulting convexified ranking aggregation problem is:

\[
\min_{x,z} \sum_{k=1}^{K} D_k \sum_{i \in B^{(k)}} \sum_{j \in B^{(k)}} h^{(k)}_{ij}(z_{ij}) \tag{8a}
\]
subject to
\[
z_{ij} = x_i - x_j \qquad i = 1,\dots,n;\ j = 1,\dots,n. \tag{8b}
\]

We conclude this section by observing that, for the rankings given by the judges in the 2007 MSOM's SPC, the optimal solution to the convexified problem (8) is a good approximation to the optimal solution of problem (6). That is, the distance d_NPKS(·,·) between the optimal solution to problem (6) (obtained using the implicit hitting set approach of Karp and Moreno-Centeno 2013) and the optimal solution to the convex approximation, problem (8), is nearly zero.
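The NPKS distance (5) and its convex surrogate (7) translate directly into code. The sketch below is our own illustration, mirroring the formulas as reconstructed above; the surrogate agrees with the exact penalty at aggregate gaps z in {-1, 0, 1} and is convex and piecewise linear everywhere else:

```python
def sign(x):
    return (x > 0) - (x < 0)

def d_npks(b1, b2):
    """NPKS distance (5) between two incomplete rankings (dicts
    object -> rank), with D = 1/(m^2 - m) for an overlap of m objects."""
    common = set(b1) & set(b2)
    m = len(common)
    if m < 2:
        return 0.0
    total = sum(abs(sign(b1[i] - b1[j]) - sign(b2[i] - b2[j]))
                for i in common for j in common)
    return total / (m * m - m)

def h(z, s):
    """Convex surrogate (7) of the ranking penalty. s = sign(b_i - b_j)
    in the judge's ranking (rank 1 = best); z = aggregate gap z_ij."""
    if s == -1:                # judge placed i ahead of j
        return max(0, z + 1)
    if s == 0:                 # judge tied i and j
        return max(-z, z)      # equals |z|
    return max(1 - z, 0)       # judge placed j ahead of i
```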
5 Joint Aggregation of the Ratings and the Rankings

This section describes the model that jointly aggregates the ratings and the rankings. The goal of this model is not only to fairly represent the judges' ratings and the judges' rankings, but also to balance the cardinal and ordinal evaluations. We refer to this optimization model as the Combined Aggregate raTing problem, or (CAT).

The input to (CAT) is a set of ratings {a^{(k)}} and a set of rankings {b^{(k)}}, k = 1, ..., K. (CAT) is a combination of the rating aggregation problem (4) and the ranking aggregation problem (6). In order to guarantee that the ratings and the rankings weigh equally in the optimization problem (CAT), both distance functions, d_NPCK and d_NPKS, are normalized. Note that one can weigh these distances differently if justified by the circumstances of the decision context. Also, the choice of d_NPCK and d_NPKS as penalty functions, or distances, can be replaced by other distances between incomplete ratings and between incomplete rankings, respectively.

\[
\text{(CAT)}\quad \min_{x} \sum_{k=1}^{K} d_{\mathrm{NPCK}}\bigl(a^{(k)}, x\bigr) \;+\; \sum_{k=1}^{K} d_{\mathrm{NPKS}}\bigl(b^{(k)}, \operatorname{rank}(x)\bigr) \tag{9}
\]

We next establish that (CAT) is NP-hard by reducing problem (6), which is NP-hard, to it.

Lemma 1. (CAT) is NP-hard.

Proof.
Given an instance of problem (6), i.e., a set of incomplete rankings {b^{(k)}}, k = 1, ..., K, one can transform it in polynomial time into an instance of (CAT) as follows. Keep {b^{(k)}} unchanged, and create a set of ratings {a^{(k)}} such that each rating evaluates exactly one object (the choice of object is irrelevant; moreover, all of the ratings can evaluate the same object). From the definition of d_NPCK (equation (2)), it follows that, for every x, the first summand in (CAT) is equal to 0. Therefore, with this choice of ratings, rank(x*), where x* is the optimal solution to (CAT), is the optimal solution to problem (6). □

The (nonlinear, nonconvex) mathematical programming formulation of (CAT) is

\[
\min_{x,z} \sum_{k=1}^{K} C_k \sum_{i \in A^{(k)}} \sum_{j \in A^{(k)}} \bigl| z_{ij} - p^{(k)}_{ij} \bigr| \;+\; \sum_{k=1}^{K} D_k \sum_{i \in B^{(k)}} \sum_{j \in B^{(k)}} \Bigl| \operatorname{sign}(z_{ij}) - \operatorname{sign}\bigl(b^{(k)}_j - b^{(k)}_i\bigr) \Bigr| \tag{10a}
\]
subject to
\[
z_{ij} = x_i - x_j \qquad i = 1,\dots,n;\ j = 1,\dots,n \tag{10b}
\]
\[
\ell \le x_i \le u \qquad i = 1,\dots,n \tag{10c}
\]
\[
x_i \in \mathbb{Z} \qquad i = 1,\dots,n, \tag{10d}
\]
where C_k and D_k denote the normalization constants of d_NPCK and d_NPKS for judge k.

The convexification of the objective of problem (10), as described in Section 4.4, results in the convex formulation:

\[
\min_{x,z} \sum_{k=1}^{K} C_k \sum_{i \in A^{(k)}} \sum_{j \in A^{(k)}} \bigl| z_{ij} - p^{(k)}_{ij} \bigr| \;+\; \sum_{k=1}^{K} D_k \sum_{i \in B^{(k)}} \sum_{j \in B^{(k)}} h^{(k)}_{ij}(z_{ij}) \tag{11a}
\]
subject to
\[
z_{ij} = x_i - x_j \qquad i = 1,\dots,n;\ j = 1,\dots,n \tag{11b}
\]
\[
\ell \le x_i \le u \qquad i = 1,\dots,n \tag{11c}
\]
\[
x_i \in \mathbb{Z} \qquad i = 1,\dots,n, \tag{11d}
\]
where
\[
h^{(k)}_{ij}(z_{ij}) \;=\;
\begin{cases}
\max\bigl\{0,\; z_{ij}+1\bigr\} & \text{if } \operatorname{sign}\bigl(b^{(k)}_j - b^{(k)}_i\bigr) = -1\\
\max\bigl\{-z_{ij},\; z_{ij}\bigr\} & \text{if } \operatorname{sign}\bigl(b^{(k)}_j - b^{(k)}_i\bigr) = 0\\
\max\bigl\{1 - z_{ij},\; 0\bigr\} & \text{if } \operatorname{sign}\bigl(b^{(k)}_j - b^{(k)}_i\bigr) = 1.
\end{cases} \tag{11e}
\]

Problem (11) is a special case of the convex SD model and is thus solvable in polynomial time.
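As an illustration of how (11) can be assembled end to end, the sketch below builds the convexified (CAT) objective with the cvxpy modeling package, again with absolute-value separation penalties, the surrogate of (11e), and the integrality constraint (11d) relaxed. This is our own sketch under assumed input formats, not the paper's implementation:

```python
import cvxpy as cp
from math import ceil, floor

def sign(v):
    return (v > 0) - (v < 0)

def solve_cat(n, ratings, rankings, lo, hi):
    """Convexified (CAT), problem (11), with integrality relaxed.
    ratings:  dict judge -> {object index in 0..n-1: cardinal score}
    rankings: dict judge -> {object index in 0..n-1: ordinal rank, 1 = best}"""
    x = cp.Variable(n)
    objective = 0
    for a in ratings.values():             # NPCK-weighted separation terms
        objs, m = list(a), len(a)
        if m < 2:
            continue
        Ck = 1.0 / (2 * (hi - lo) * ceil(m / 2) * floor(m / 2))
        objective += Ck * sum(cp.abs((x[i] - x[j]) - (a[i] - a[j]))
                              for i in objs for j in objs if i != j)
    for b in rankings.values():            # convexified ranking terms (11e)
        objs, m = list(b), len(b)
        if m < 2:
            continue
        Dk = 1.0 / (m * m - m)
        for i in objs:
            for j in objs:
                if i == j:
                    continue
                z = x[i] - x[j]
                s = sign(b[j] - b[i])      # note b_j - b_i, as in (11e)
                if s == -1:
                    objective += Dk * cp.pos(z + 1)   # max{0, z + 1}
                elif s == 0:
                    objective += Dk * cp.abs(z)       # max{-z, z}
                else:
                    objective += Dk * cp.pos(1 - z)   # max{1 - z, 0}
    cp.Problem(cp.Minimize(objective), [x >= lo, x <= hi]).solve()
    return x.value
```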
Remark:
Note that in equations (10a) and (11e) the argument of the sign function is b^{(k)}_j − b^{(k)}_i, and not b^{(k)}_i − b^{(k)}_j as in equations (5) and (7). This is because of the classical convention that in the given ratings high cardinal numbers are assigned to the most preferred objects, while in the given rankings high ordinal numbers are assigned to the least preferred objects.

The optimal solution to (CAT) is a combined aggregate rating-ranking pair, which is denoted by x^{(cat)}; its implied ranking is denoted by rank(x^{(cat)}).

Next, we propose two mechanisms to identify inconsistencies in the given evaluations (e.g., outliers, judges that are too lenient or too strict, etc.). This information is helpful so that (say) the lead decision maker can initiate an investigation of the nature of the discrepancies and act appropriately (for example, by discussing these inconsistencies with the judges and promoting a discussion with the objective of alleviating them).

The first mechanism uses the solution x^{(cat)} to identify (a) judges whose evaluations differ the most from the rest of the evaluations, and (b) objects for which the evaluating judges were particularly divergent. These judges (objects) are those that assigned (received) scores that disagree the most with x^{(cat)}. Specifically, we use the separation penalty to identify the judges whose evaluations are at the farthest distance from x^{(cat)} (i.e., have the highest separation penalty). The contribution of judge k to the separation penalty is

\[
\sum_{i \in A^{(k)}} \sum_{j \in A^{(k)}} C_k \Bigl| \bigl(x^{(cat)}_i - x^{(cat)}_j\bigr) - \bigl(a^{(k)}_i - a^{(k)}_j\bigr) \Bigr|. \tag{12}
\]

Similarly, we use the separation penalty to identify the objects for which the evaluating judges were particularly divergent. These objects are those with the highest contribution to the separation penalty. The contribution of object i to the separation penalty is

\[
\sum_{k \,:\, i \in A^{(k)}} \sum_{j \in A^{(k)}} C_k \Bigl| \bigl(x^{(cat)}_i - x^{(cat)}_j\bigr) - \bigl(a^{(k)}_i - a^{(k)}_j\bigr) \Bigr|. \tag{13}
\]

The second mechanism to identify inconsistencies in the given evaluations is based on Brans and Vincke's PROMETHEE method (Brans and Vincke, 1985). The mechanism aggregates the consensus rating x^{(c)} (the solution to problem (4)) and the consensus ranking x^{(o)} (the solution to problem (6)) into a partial order (P, T, I) as follows:

\[
a \text{ is preferred to } b\ (a\,P\,b) \text{ if }
\begin{cases}
x^{(c)}(a) > x^{(c)}(b) \text{ and } x^{(o)}(a) \ge x^{(o)}(b), \text{ or}\\
x^{(c)}(a) \ge x^{(c)}(b) \text{ and } x^{(o)}(a) > x^{(o)}(b)
\end{cases} \tag{14a}
\]
\[
a \text{ and } b \text{ are tied } (a\,T\,b) \text{ if } x^{(c)}(a) = x^{(c)}(b) \text{ and } x^{(o)}(a) = x^{(o)}(b) \tag{14b}
\]
\[
a \text{ and } b \text{ are incomparable } (a\,I\,b) \text{ otherwise.} \tag{14c}
\]

Thus, by construction, the partial order (P, T, I) summarizes the agreement (or lack thereof) between the consensus rating x^{(c)} and the consensus ranking x^{(o)}. Section 6 illustrates these mechanisms and their usefulness for identifying objects whose evaluations deserve special attention and further discussion.
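Both inconsistency-detection mechanisms are straightforward to compute once x^{(cat)}, x^{(c)}, and x^{(o)} are available. The following Python sketch (our illustration; the input formats are assumptions) implements the separation-penalty contributions (12)-(13) and the partial order (14):

```python
def judge_contribution(k, ratings, Ck, x_cat):
    """Judge k's contribution to the separation penalty, equation (12)."""
    a = ratings[k]
    return Ck[k] * sum(abs((x_cat[i] - x_cat[j]) - (a[i] - a[j]))
                       for i in a for j in a if i != j)

def object_contribution(i, ratings, Ck, x_cat):
    """Object i's contribution to the separation penalty, equation (13)."""
    return sum(Ck[k] * sum(abs((x_cat[i] - x_cat[j]) - (a[i] - a[j]))
                           for j in a if j != i)
               for k, a in ratings.items() if i in a)

def partial_order(xc, xo):
    """The relation (P, T, I) of (14) over all ordered object pairs.
    xc and xo map each object to its consensus score and consensus rank,
    compared with the orientation displayed in (14a)."""
    P, T, I = [], [], []
    for a in xc:
        for b in xc:
            if a == b:
                continue
            if (xc[a] > xc[b] and xo[a] >= xo[b]) or \
               (xc[a] >= xc[b] and xo[a] > xo[b]):
                P.append((a, b))
            elif xc[a] == xc[b] and xo[a] == xo[b]:
                T.append((a, b))
            else:
                I.append((a, b))
    return P, T, I
```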
6 Ranking the Contestants of the 2007 MSOM's SPC

We illustrate here how to use the proposed mechanism to rank the contestants of the 2007 MSOM's SPC. These results are compared to those obtained by aggregating only the cardinal evaluations and to those obtained by aggregating only the ordinal evaluations.

Table 4 gives the consensus rating x^{(c)} (the optimal solution to problem (4)); the (approximate) consensus ranking x^{(o)} (the optimal solution to problem (8)); and the combined aggregate rating x^{(cat)} and ranking rank(x^{(cat)}) (the optimal solutions to problem (11)).
Table 4: Aggregate ratings and rankings for the 2007 MSOM's SPC. (Columns: Paper, x^{(c)}, x^{(o)}, x^{(cat)}, rank(x^{(cat)}); the individual paper rows are not reproduced here.)

In Table 4, the consensus rating x^{(c)} is non-integral because some of the judges assigned fractional scores (in particular, they assigned grades that are multiples of 1/2). In x^{(cat)}, which is the optimal solution to (CAT), and in its implied ranking rank(x^{(cat)}), paper 54 is rated and ranked higher than paper 14; this, as discussed below, seems appropriate. In contrast, the consensus rating x^{(c)} ranks paper 14 higher than paper 54. This provides some evidence that the combined aggregate rating-ranking pair indeed represents the judges' evaluations and opinions better than the consensus rating x^{(c)}, which takes into consideration only the ratings provided by the judges.
Table 5: Evaluations of papers 14 and 54.

Paper  Judge  Field Contribution Score  Paper Ranking
14     35     6                         1
14     23     6                         1
14     48     7                         1
14     57     4                         4
14     44     5                         4
54     30     5                         1
54     32     4                         4
54     25     6                         1
54     22     7                         1
Table 6: Evaluation statistics of the judges that evaluated papers 14 and 54.

Judge  Number of Papers Evaluated  Average Field Contribution
35     4                           4.50
23     4                           4.25
48     4                           5.25
57     4                           5.75
44     5                           7.00
30     5                           3.60
32     4                           5.25
25     5                           4.00
22     4                           4.75
Table 7: Adjusted Field Contribution received by papers 14 and 54.

Paper  Judge  Adjusted Field Contribution
14     35     1.33
14     23     1.41
14     48     1.33
14     57     0.70
14     44     0.71
54     30     1.39
54     32     0.76
54     25     1.50
54     22     1.47
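The adjusted scores in Table 7 are consistent with dividing each raw Field Contribution score by the evaluating judge's average score from Table 6 (for example, 6/4.50 = 1.33 for judge 35 on paper 14). The surrounding text does not spell this formula out, so the short sketch below reflects our inference:

```python
# Reproducing Table 7 from Tables 5 and 6 under the inferred normalization:
# adjusted score = raw score / judge's average Field Contribution.
raw = [(14, 35, 6), (14, 23, 6), (14, 48, 7), (14, 57, 4), (14, 44, 5),
       (54, 30, 5), (54, 32, 4), (54, 25, 6), (54, 22, 7)]
avg = {35: 4.50, 23: 4.25, 48: 5.25, 57: 5.75, 44: 7.00,
       30: 3.60, 32: 5.25, 25: 4.00, 22: 4.75}
for paper, judge, score in raw:
    print(paper, judge, round(score / avg[judge], 2))  # matches Table 7
```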
Next, we use the partial order (P, T, I) (created as described in Section 5) to highlight the discrepancies between the consensus rating x^{(c)} and the consensus ranking x^{(o)}. Figure 1 gives a graphical representation of the partial order that highlights the pairs of objects where x^{(c)} and x^{(o)} disagree on their relative order (that is, those object pairs that are members of the set I in the partial order (P, T, I)).

From Figure 1 we observe the following: (a) Paper 14 has the highest consensus score; however, this conflicts with several papers (e.g., paper 54) that have a lower consensus score but a higher consensus rank (this agrees with the analysis given above). (b) Paper 20 (lower left corner of Figure 1) should receive the lowest consensus evaluation. (c) Although the agreement between x^{(c)} and x^{(o)} is not perfect, there are subsets of papers that should receive a lower (or higher) collective evaluation than others. For example, one subset of eight papers should receive collective evaluations higher than that of paper 20, lower than or equal to those of four other papers, and lower than those of the rest of the papers.

Figure 1: The papers (circled) are ordered (top to bottom) in decreasing consensus score. There is an arc between two papers whenever the lower-rated paper has a better ranking than a higher-rated paper.

In the 2007 MSOM's SPC, papers 38, 14, 10, 1, and 42 had the highest contributions to the separation penalty. As noted previously, this indicates that these papers are the ones whose evaluations are inconsistent and deserve further discussion. For example, paper 38, a very low rated paper in the consensus rating, received scores from 2 to 5 and was ranked by all but one of the judges as their least preferred paper (see Tables 8 and 9). In particular, paper 38 was the second most preferred paper of judge 9. Perhaps this is because this judge was assigned other papers of lower quality than paper 38? We believe this is not the case since, as shown in Table 10, the paper ranked last by judge 9 was paper 10. As noted above, paper 10 is also among the highest contributors to the separation penalty. Paper 10 received three high evaluations and two very low evaluations (see Table 11). Therefore, we believe that, in order to reach a better consensus, the scores/ranks of papers 38 and 10 should be discussed by the judges assigned to these two papers.
Table 8: Evaluations of paper 38.

Judge  Field Contribution Score  Paper Ranking
30     3                         5
41     2                         5
44     3                         5
9      5                         2
20     5                         4
Table 9: Evaluation statistics of the judges that evaluated paper 38.

Judge  Number of Papers Evaluated  Average Field Contribution
30     5                           3.60
41     5                           5.00
44     5                           7.00
9      5                           4.60
20     4                           7.25
Table 10: Evaluations of judge 9.

Paper  Field Contribution Score  Paper Ranking
10     3                         5
19     4                         3
38     5                         2
50     4                         3
58     7                         1
Table 11: Evaluations of paper 10.

Judge  Field Contribution Score  Paper Ranking
33     7                         1
41     7                         1
19     2                         3
15     6                         1
9      3                         5
Next, we analyze the combined aggregate rating x^{(cat)} and ranking rank(x^{(cat)}) (the solution to problem (11)). We make the following observations:

1. The consensus rating, x^{(c)}, has a total rating distance (equation (4)) of 7.3611.
2. The consensus ranking, x^{(o)}, has a total ranking distance (equation (6)) of 13.8500.
3. (a) The combined aggregate rating, x^{(cat)}, has a total rating distance (equation (4)) of 8.16667.
   (b) The combined aggregate ranking, rank(x^{(cat)}), has a total ranking distance (equation (6)) of 13.9333.

This shows that, in this case, the combined aggregate rating x^{(cat)} and ranking rank(x^{(cat)}) achieve a very good compromise. In particular, x^{(cat)} remains almost as close to the judges' ratings as the consensus rating x^{(c)}, and rank(x^{(cat)}) remains almost as close to the judges' rankings as the consensus ranking x^{(o)}.

7 Conclusions

We propose here a new framework for group decision making that aggregates both cardinal and ordinal input evaluations (referred to as ratings and rankings, respectively). Our framework consists of finding the rating-ranking pair that minimizes the sum of the rating-distances from the rating to the given ratings plus the sum of the ranking-distances from the ranking to the given rankings.

The effectiveness of the new framework is illustrated by ranking the contestants of the 2007 MSOM's student paper competition. We provide evidence that a combined aggregate of the cardinal and ordinal evaluations represents the judges' opinions better than a rating that aggregates only the judges' cardinal evaluations or only the judges' ordinal evaluations.

Aggregating incomplete evaluations is challenging because the aggregate evaluation is prone to be biased by the judges' subjective scales; for example, objects assigned to a particularly strict (lenient) judge are at a disadvantage (advantage) compared to those objects not assigned to this specific judge. Our framework identifies such inconsistencies in the given evaluations. This information is helpful so that the lead decision maker can initiate an investigation of the nature of the conflicts and act accordingly (for example, by having the specific judges discuss, and possibly resolve, these inconsistencies).

The problem of aggregating complete evaluations (in which all judges evaluate all objects) is a special case of the problem of aggregating incomplete evaluations (in which the judges are allowed to evaluate only some of the objects). Therefore, our framework is also applicable to aggregating complete evaluations.
Acknowledgements
The authors gratefully acknowledge Jérémie Gallien, head judge of the 2007 MSOM's student paper competition, for using our methodology to aggregate the judges' evaluations in the competition.

The research of the first author is supported in part by NSF award No. CMMI-1760102. The research of the second author is supported in part by NSF award No. OAC-1835499.
References
Ahuja, R. K., D. S. Hochbaum, and J. B. Orlin (2003). Solving the convex cost integer dual network flow problem. Management Science 49, 950–964.

Ahuja, R. K., D. S. Hochbaum, and J. B. Orlin (2004). A cut-based algorithm for the nonlinear dual of the minimum cost network flow problem. Algorithmica 39, 189–208.

Arrow, K. J. (1963). Social Choice and Individual Values. New York: Wiley.

Bartholdi, J., C. A. Tovey, and M. A. Trick (1989, April). Voting schemes for which it can be difficult to tell who won the election. Social Choice and Welfare 6, 157–165.

Baumgartner, H. and J. E. M. Steenkamp (2001). Response styles in marketing research: A cross-national investigation. Journal of Marketing Research 38, 143–156.

Bogart, K. P. (1973). Preference structures I: Distances between transitive preference relations. Journal of Mathematical Sociology 3, 49–67.

Brans, J. P., B. Roy, and P. Vincke (1975). Aide à la décision multicritère. Revue Belge de Statistique, d'Informatique et de Recherche Opérationnelle 15.

Brans, J. P. and P. Vincke (1985, June). Note: A preference ranking organisation method (the PROMETHEE method for multiple criteria decision-making). Management Science 31(6), 647–656.

Cook, W. D. and M. Kress (1985). Ordinal ranking with intensity of preference. Management Science 31, 26–32.

French, S. (1988). Decision Theory: An Introduction to the Mathematics of Rationality. New York: Halsted Press.

Harzing, A. W. K. (2006). Response styles in cross-national survey research: A 26-country study. International Journal of Cross Cultural Management 6, 243–266.

Hochbaum, D. S. (2004). 50th anniversary article: Selection, provisioning, shared fixed costs, maximum closure, and implications on algorithmic methods today. Management Science 50, 709–723.

Hochbaum, D. S. (2006). Ranking sports teams and the inverse equal paths problem. In Internet and Network Economics, Second International Workshop, WINE 2006, Greece, pp. 307–318. Springer.

Hochbaum, D. S. (2010). The separation, and separation-deviation methodology for group decision making and aggregate ranking. In J. J. Hasenbein (Ed.), TutORials in Operations Research, Volume 7, pp. 116–141. Hanover, MD: INFORMS.

Hochbaum, D. S. and A. Levin (2006). Methodologies and algorithms for group-rankings decision. Management Science 52, 1394–1408.

Karp, R. M. and E. Moreno-Centeno (2013). The implicit hitting set approach to solve combinatorial optimization problems with an application to multigenome alignment. Operations Research (forthcoming).

Keeney, R. L. (1976, October). A group preference axiomatization with cardinal utility. Management Science 23, 140–145.

Kemeny, J. G. and J. L. Snell (1962). Preference ranking: An axiomatic approach. In Mathematical Models in the Social Sciences, pp. 9–23. Boston, MA: Ginn.

Moreno-Centeno, E. (2010). Use and analysis of new optimization techniques for decision theory and data mining. Ph.D. thesis, University of California, Berkeley.

Saaty, T. (1977). A scaling method for priorities in hierarchical structures. Journal of Mathematical Psychology 15(3), 234–281.

Smith, P. B. (2004). Acquiescent response bias as an aspect of cultural communication style.