CrowDEA: Multi-view Idea Prioritization with Crowds
Yukino Baba
University of Tsukuba
Jiyi Li
University of Yamanashi
Hisashi Kashima
Kyoto [email protected]
Abstract
Given a set of ideas collected from crowds in response to an open-ended question, how can we organize and prioritize them in order to determine the preferred ones based on preference comparisons by crowd evaluators? As there are diverse latent criteria for the value of an idea, multiple ideas can be considered as "the best". In addition, evaluators can have different preference criteria, and their comparison results often disagree. In this paper, we propose an analysis method for obtaining a subset of ideas, which we call frontier ideas, that are the best in terms of at least one latent evaluation criterion. We propose an approach, called CrowDEA, which estimates the embeddings of the ideas in the multiple-criteria preference space, the best viewpoint for each idea, and the preference criterion of each evaluator, to obtain a set of frontier ideas. Experimental results using real datasets containing numerous ideas or designs demonstrate that the proposed approach can effectively prioritize ideas from multiple viewpoints, thereby detecting frontier ideas. The embeddings of ideas learned by the proposed approach provide a visualization that facilitates observation of the frontier ideas. In addition, the proposed approach prioritizes ideas from a wider variety of viewpoints, whereas the baselines tend to rely on the same viewpoints; it can also handle various viewpoints and prioritize ideas in situations where only a limited number of evaluators or labels are available.
Despite the recent advances in artificial intelligence, there are still several challenges that humans can handle better than machines, especially abstract, open-ended, and context-dependent problems. Brainstorming new ideas is a typical example; for instance, to answer open-ended questions such as "What is the best logo for the next summer Olympic games?", "How can we reduce the number of latecomers at team meetings?", and "What are the most reasonable solutions for preventing global warming?", humans are expected to present more creative and reasonable solutions than machines. Existing studies demonstrate that crowdsourcing is an effective approach to collecting creative ideas from a wide range of people (Yu and Nickerson 2011;
Koyama, Sakamoto, and Igarashi 2014; Siangliulue et al. 2015; Prpić et al. 2015).

Let us consider the example of designing a suitable logo for the next Olympic games, and assume that we ask crowd workers to provide a set of candidate designs. After collecting several design ideas, we should organize and prioritize them to select the best. However, the criteria for the best design are usually multi-faceted; for example, there may be two different criteria for design, e.g., traditional aesthetics and contemporary aesthetics. Therefore, there rarely exists a single overwhelming winner over the other candidates in terms of all criteria. Moreover, it is often difficult to define the criteria in advance. Thus, we must turn to the crowd for assistance, with the expectation that crowd evaluators may be able to identify the unknown diverse criteria. We must ask them to evaluate the ideas, often in the form of pairwise preference comparisons. The criteria for these comparisons can also be diverse, depending on evaluators' personal viewpoints.

In this study, we consider the problem of aggregating the pairwise idea preference comparisons by crowds containing different viewpoints so that a set of best ideas from certain viewpoints may be obtained. These ideas are called frontier ideas. The proposed method, which is called CrowDEA, generates a priority map, a low-dimensional latent space where ideas are embedded such that the frontier ideas are furthest from the origin and the ideas projected onto the viewpoint of each evaluator are consistent with their pairwise comparisons.

Existing studies (Bradley and Terry 1952; Causeur and Husson 2005; Chen et al. 2013) estimate a unique rank list from the pairwise preference comparisons; they usually assume that there exists a unique rank list as the ground truth. In addition, as there are no explicit evaluation criteria readily available, existing methods, such as the skyline query (Borzsony, Kossmann, and Stocker 2001; Hose and Vlachou 2012; Lofi, El Maarry, and Balke 2013), cannot be used. The priority map of CrowDEA assists in making the final decision or in further analysis (such as next-round idea sourcing) by providing an organized view from various perspectives.

We provide an illustrative example in Fig. 1; there are nine objects with different sizes and colors (Fig. 1(a)), and we have to prioritize them in terms of various latent criteria, such as size and color. We ask crowd evaluators to make pairwise preference comparisons based on their own personal criteria (Fig. 1(b)). For example, some evaluators prefer darker objects regardless of the object size, whereas others prefer larger objects. CrowDEA outputs the priority map (Fig. 1(c)), where the frontier objects are placed on the convex hull (shown by the dotted line) of all the embedded objects. The x-axis is interpreted as the object size and the y-axis as the color darkness. The rightmost and topmost objects are the best according to the size and darkness criteria, respectively. In addition, the top-right object is the best in terms of an intermediate criterion; the object is both fairly large and dark-colored, making it also a promising candidate.

Figure 1: Illustrative example of the proposed multi-view analysis CrowDEA. (a) Target objects (ideas): balls with different sizes and colors. (b) Input: pairwise comparison results by crowd evaluators with individual preferences. (c) Output: CrowDEA yields a priority map, a multiple-criteria preference space where the objects are embedded so that promising candidates are found as the frontier objects. The largest object and the darkest-colored object, as well as the fairly large-and-dark object, are on the frontier (shown by the dotted line).

We verify the proposed approach using real datasets that contain numerous ideas or designs. The quantitative results and qualitative analysis demonstrate that CrowDEA outperforms the baselines. The contributions of this study are as follows:
• We define a problem that involves organizing and prioritizing a set of ideas from multiple preference viewpoints to support decision-making.
• We propose an approach that prioritizes ideas from multiple viewpoints based on pairwise preference comparisons by crowd evaluators. The proposed approach can effectively determine the frontier ideas in a set of ideas.
• The embeddings of ideas learned by the proposed approach provide a visualization that facilitates observation of the frontier ideas. In addition, the proposed approach prioritizes ideas from a wider variety of viewpoints, whereas the baselines tend to use the same viewpoints; it can also handle various viewpoints and prioritize ideas in situations where only a limited number of evaluators or labels are available.
Existing studies demonstrate that crowdsourcing is an effective method for collecting creative ideas from a wide range of people (Yu and Nickerson 2011; Siangliulue et al. 2015; Prpić et al. 2015). To understand a set of ideas, it is important to organize and visualize them, and several studies have applied crowdsourcing to this end. Siangliulue et al. proposed an idea map to visualize a set of ideas using triple-wise similarity queries (Siangliulue et al. 2015). Ahmed and Fuge proposed to find high-quality ideas by using community feedback, idea uniqueness, and text features (Ahmed and Fuge 2017). Li et al. proposed an approach that simultaneously ranks and clusters ideas (Li, Baba, and Kashima 2018). In contrast to these approaches, we allow multiple criteria so that promising candidates can be obtained from various viewpoints (i.e., frontier ideas). Similar to our work, Lykourentzou et al. proposed a strategy for ranking ideas according to quality and diversity (Lykourentzou et al. 2018). In their work, diversity was measured using the results of manual clustering, whereas our method does not require such manual effort.
Mathematical methods for supporting decision making have traditionally been studied in operations research. For example, data envelopment analysis (DEA) is a nonparametric method for estimating production frontiers (Seiford and Thrall 1990; Cooper, Seiford, and Zhu 2004), from which the proposed notion of frontier ideas was inspired. The skyline query method, which retains only the objects that are not worse than any others in terms of at least one evaluation criterion, has also been extensively studied (Borzsony, Kossmann, and Stocker 2001; Hose and Vlachou 2012; Lofi, El Maarry, and Balke 2013). In contrast with DEA and the skyline query, the proposed frontier analysis does not require explicit evaluation criteria; the latent evaluation criteria are learned from the data.

Pairwise preference aggregation
Methods for aggregating pairwise comparison results have long been discussed. The Bradley–Terry (BT) model (Bradley and Terry 1952) is a well-known model for pairwise comparisons. It estimates a single competency score for each object so that the scores are consistent with the pairwise comparison labels. To model more complex object relationships, multi-dimensional generalizations of the BT model have been proposed, such as the multi-dimensional BT model (Causeur and Husson 2005) and intransitivity models (Chen and Joachims 2016a; Chen and Joachims 2016b; Duan et al. 2017). The BT model has also been extended to allow variability among the evaluators (Chen et al. 2013). Our work can be considered as the intersection of the above two extensions; we consider multi-dimensional criteria for both evaluators and evaluated objects.
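To make the BT model concrete, the following is a minimal illustrative sketch (our own code, not any of the cited implementations): it fits BT scores θ by gradient ascent on the log-likelihood, using the logistic parameterization P(i ≻ j) = σ(θ_i − θ_j); the function name and the learning-rate/epoch settings are assumptions.

```python
import math

def fit_bt(n, comparisons, lr=0.1, epochs=200):
    """Estimate Bradley-Terry scores theta for n items from a list of
    (winner, loser) pairs by gradient ascent on the log-likelihood."""
    theta = [0.0] * n
    for _ in range(epochs):
        grad = [0.0] * n
        for i, j in comparisons:  # evaluator preferred item i over item j
            p = 1.0 / (1.0 + math.exp(-(theta[i] - theta[j])))  # P(i beats j)
            grad[i] += 1.0 - p
            grad[j] -= 1.0 - p
        for k in range(n):
            theta[k] += lr * grad[k]
        # center the scores: BT is identifiable only up to a constant shift
        mean = sum(theta) / n
        theta = [t - mean for t in theta]
    return theta
```

A single scalar θ per item is exactly the limitation discussed above: one global ranking, with no room for per-evaluator or per-idea viewpoints.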
In studies on learning multi-view representations, the term "multi-view" has multiple meanings. In several cases, it implies that data instances are described by different types of explicit features (Li, Yang, and Zhang 2016; Wang et al. 2015), for example, images and texts (Li, Yang, and Zhang 2016), texts in two different languages (Chandar et al. 2014), and audio and video media (Huang and Kingsbury 2013). Amid and Ukkonen targeted multiple implicit attributes, where object similarity from triple-wise questions is preserved (Amid and Ukkonen 2015). Their goal is to obtain a space reflecting object similarity, whereas we obtain a space reflecting idea priority.
Personalized ranking in recommender systems, in which the relative preference of each user is estimated, has been extensively studied. For example, Rendle et al. proposed Bayesian personalized ranking (BPR), which trains a matrix factorization model to optimize a ranking loss function (Rendle et al. 2009). This topic has been studied in various scenarios, such as group preference (Pan and Chen 2013), visual recommendation (He and McAuley 2016), and event recommendation (Qiao et al. 2014). Their focus is on predicting personalized sets of items for different users, whereas we are interested in obtaining the most advantageous evaluation criterion for each item so that all promising items (i.e., ideas) for decision making may be determined. This results in a different formulation.
When using web search, users expect not only the search results most relevant to a given query but also diverse ones. Some studies provide both diverse and representative results in terms of content and semantic information (Kennedy and Naaman 2008; Wang et al. 2010), and there are studies on the users' potential intents (such as navigational or informational) behind their queries, based on their search behaviors (Cheng, Gao, and Liu 2010; Santos, Macdonald, and Ounis 2011). An important difference between the above-mentioned studies and this one lies in the problem setting: explicit features such as content or context are not available, and we prioritize ideas based solely on pairwise preferences rather than features. Another difference is that many of these studies have predefined viewpoints, such as the types of user intents, whereas ours finds the viewpoints from preference comparisons.
We address the problem of prioritizing a collection of n ideas in terms of different latent evaluation criteria. Let [n] = {1, 2, ..., n}. We consider the embedding x_i of each idea i ∈ [n] in a d-dimensional space, which we call the priority map. Each axis of the priority map corresponds to a latent preference criterion, and a large value on an axis implies high preference in terms of the corresponding criterion.

Decisions are usually made not according to a single criterion alone but by balancing different criteria. For every idea, there should be a viewpoint that best emphasizes its merits, and it is beneficial to determine the set of all ideas that are "the best" from certain viewpoints. We define the best viewpoint for an idea i as a d-dimensional unit vector v_i, where the projection of x_i onto v_i (i.e., v_i^⊤ x_i) is considered to be its preference score from that viewpoint. If idea i is the most preferred among all the ideas, i.e., v_i^⊤ x_i > v_i^⊤ x_j for all j ≠ i, the idea is promising and should be further investigated. The goal is to determine these ideas, which we call frontier ideas; they are located on the convex hull (indicated by the dotted line in Fig. 1(c)) of all ideas in the embedding space. It should be noted that not all ideas can be the best, even from their best viewpoints.

To create the priority map, we collect preference data from m crowd evaluators in the form of pairwise comparisons. Let C_k = {(i, j) | i, j ∈ [n], i ≻_k j} be the set of pairwise comparison results by evaluator k ∈ [m], where i ≻_k j indicates that evaluator k prefers idea i over idea j. As in the case of the best viewpoints for ideas, every crowd evaluator has an individual viewpoint. We define the viewpoint of crowd evaluator k as a d-dimensional unit vector w_k.
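To make the frontier-idea definition concrete: in two dimensions, one approximate way to test whether an idea wins under some viewpoint is to sweep unit vectors v over the non-negative quadrant and record strict winners of argmax_i v^⊤ x_i. The sketch below uses made-up toy embeddings in the spirit of Fig. 1 (the function name and sampling resolution are our own choices, not part of the proposed method).

```python
import math

def frontier_ideas_2d(X, n_angles=360):
    """Approximate the frontier ideas in a 2-D priority map by sweeping
    unit viewpoints v = (cos t, sin t), t in [0, pi/2], and collecting
    every idea that is the strict argmax of v.x under some viewpoint."""
    frontier = set()
    for a in range(n_angles + 1):
        t = (math.pi / 2) * a / n_angles
        v = (math.cos(t), math.sin(t))
        scores = [v[0] * x[0] + v[1] * x[1] for x in X]
        best = max(range(len(X)), key=lambda i: scores[i])
        # record the winner only if it strictly beats every other idea
        if all(scores[best] > scores[j] for j in range(len(X)) if j != best):
            frontier.add(best)
    return sorted(frontier)

# toy embeddings: a "large", a "dark", and a "large-and-dark" object,
# plus two dominated ones (cf. Fig. 1)
X = [(1.0, 0.1), (0.1, 1.0), (0.8, 0.8), (0.4, 0.3), (0.2, 0.2)]
print(frontier_ideas_2d(X))  # -> [0, 1, 2]
```

The two dominated objects never win under any non-negative viewpoint, matching the remark that not all ideas can be the best even from their best viewpoints.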
The projections of {x_i}_{i=1}^n onto w_k, i.e., {w_k^⊤ x_i}_{i=1}^n, are regarded as the preference scores by the evaluator, and they are expected to be consistent with the pairwise comparison results C_k.

In summary, the inputs and outputs of the problem are as follows:
Inputs: n ideas, m crowd evaluators, and {C_k}_{k=1}^m, where C_k = {(i, j) | i, j ∈ [n], i ≻_k j} is the set of pairwise comparison results by evaluator k ∈ [m].
Outputs: {x_i}_{i=1}^n, {v_i}_{i=1}^n, {w_k}_{k=1}^m, where x_i is the d-dimensional embedding of idea i ∈ [n], v_i is the best viewpoint for idea i ∈ [n], and w_k is the viewpoint of crowd evaluator k ∈ [m].

We formulate the multi-view analysis as an optimization problem. Based on the discussions in the previous section, we have two optimization sub-goals: (i) determine as many frontier ideas as possible, and (ii) achieve consistency with the pairwise preference comparison results.

For the first sub-goal, we impose the best viewpoint for each idea, from which the idea is the most valuable among all ideas. That is, we require that the resultant idea embeddings {x_i}_{i=1}^n and the corresponding best viewpoints {v_i}_{i=1}^n satisfy the constraints

  v_i^⊤ x_i > v_i^⊤ x_j,  ∀i ∈ [n], ∀j ≠ i ∈ [n].  (1)

As it is not possible to satisfy all of the constraints, we quantify the number of constraint violations using a loss function. Specifically, we use the hinge loss:

  L_F({x_i}_{i=1}^n, {v_i}_{i=1}^n) = (1 / (n(n−1))) Σ_{i∈[n]} Σ_{j∈[n]\{i}} max{0, 1 − v_i^⊤ (x_i − x_j)}.  (2)

For the second sub-goal, the aim is to make the viewpoint of each evaluator consistent with the pairwise comparison results by that evaluator.
We assume that each crowd evaluator has their own viewpoint, and we define w_k as the preference criterion vector for the preference labels of evaluator k. From the viewpoint of evaluator k, the preference score of each idea i is given as w_k^⊤ x_i; therefore, the set C_k of all pairwise comparison results by evaluator k should be consistent with the preference scores, i.e.,

  w_k^⊤ x_i > w_k^⊤ x_j,  ∀k ∈ [m], ∀(i, j) ∈ C_k.  (3)

As before, it is not always possible to meet all of the constraints, and again we use the hinge loss:

  L_C({x_i}_{i=1}^n, {w_k}_{k=1}^m) = (1/c) Σ_{k∈[m]} Σ_{(i,j)∈C_k} max{0, 1 − w_k^⊤ (x_i − x_j)},  (4)

where c = Σ_k |C_k| is the number of observed preference labels.

In addition, we impose the constraint that all embeddings and preference criterion vectors should be non-negative for a more intuitive visualization (as shown in Fig. 1(c)). Furthermore, we add the constraints that all the preference criterion vectors w_k and v_i have unit length, i.e., ‖w_k‖ = 1 and ‖v_i‖ = 1. One advantage of this constraint is that it scales the embeddings for all objects. The unit length constraint also prevents a preference criterion vector from collapsing to zero.
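As an illustration, the two hinge losses and the projection used to maintain the non-negativity and unit-length constraints can be sketched in plain Python as follows (our own naming, assuming a unit hinge margin; this is not the authors' released implementation):

```python
def hinge(z):
    # hinge penalty max{0, 1 - z}: positive when a preference constraint
    # is violated or only weakly satisfied
    return max(0.0, 1.0 - z)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def loss_F(X, V):
    """Frontier loss (2): each idea i should win from its best viewpoint v_i."""
    n = len(X)
    total = sum(hinge(dot(V[i], [a - b for a, b in zip(X[i], X[j])]))
                for i in range(n) for j in range(n) if j != i)
    return total / (n * (n - 1))

def loss_C(X, W, C):
    """Consistency loss (4): evaluator k's viewpoint w_k should reproduce C_k,
    where C[k] is a list of (preferred, unpreferred) index pairs."""
    c = sum(len(pairs) for pairs in C)
    total = sum(hinge(dot(W[k], [a - b for a, b in zip(X[i], X[j])]))
                for k, pairs in enumerate(C) for (i, j) in pairs)
    return total / c

def project_unit_nonneg(v):
    """Clip negative entries to zero, then renormalize to unit length."""
    v = [max(0.0, x) for x in v]
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm > 0 else [1.0 / len(v) ** 0.5] * len(v)
```

Minimizing L_C + αL_F with any gradient-based optimizer and applying `project_unit_nonneg` to each w_k and v_i after every update step mirrors the projection procedure described for the constrained optimization.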
For example, for an object o_i that is not on the frontier and is ranked low even from its best viewpoint, v_i^⊤ (x_i − x_j) is below zero for many o_j; without the unit length constraint, this may result in v_i = 0 minimizing L_F.

By combining the loss functions for the two sub-goals and the constraints, the optimization problem can be fully formulated as follows:

  minimize over {x_i}_{i=1}^n, {v_i}_{i=1}^n, {w_k}_{k=1}^m:  L_C({x_i}_{i=1}^n, {w_k}_{k=1}^m) + α L_F({x_i}_{i=1}^n, {v_i}_{i=1}^n)
  subject to  x_i, v_i, w_k ∈ R_+^d, ∀i ∈ [n], k ∈ [m];  ‖w_k‖ = 1, ∀k ∈ [m];  ‖v_i‖ = 1, ∀i ∈ [n],

where α > 0 is a constant that controls the trade-off between L_C and L_F. The constrained optimization is performed in a straightforward fashion; after the optimization algorithm updates the parameters at each step, all negative entries are set to zero to satisfy the non-negativity constraints, and each w_k and v_i is then normalized to satisfy the unit length constraint. Finally, idea i is considered a frontier idea if there exists a v that satisfies ‖v‖ = 1, v ≥ 0, and v^⊤ x_i > v^⊤ x_j for all j ≠ i ∈ [n].

We empirically evaluate the proposed method using real datasets containing ideas and designs for pairwise comparison. The experiments were designed to answer the following questions:
Q1. Visualization: How successful is CrowDEA in organizing ideas?
Q2. Accuracy: How accurately does CrowDEA prioritize ideas according to multiple viewpoints?
Q3. Efficiency: How does the accuracy change according to the number of evaluators?
We constructed two types of real datasets (Table 1 summarizes the dataset statistics):
• Ideas: We prepared five open-ended day-to-day life questions, such as "How can we reduce the number of latecomers at team meetings?", and collected solution ideas from crowd workers on the crowdsourcing platform Lancers. We obtained approximately 80 ideas for each question. We hired another set of crowd workers to collect preference labels and asked them to compare pairs of ideas for each problem. Multiple workers were assigned to each pair of ideas, and each worker evaluated a minimum number of pairs. The order of the pairs and that of the ideas in each pair were randomized. There were approximately 200 evaluators and over 60K preference labels in total for each dataset.
• Designs:
We held a character design contest for an artificial intelligence (AI) research laboratory and collected 38 designs. We also prepared logos for the summer and winter Olympic games from 1948 to 2020, and we collected preference labels for these two design tasks in the same manner as for the idea datasets. There were 64 evaluators and 14K preference labels for the "Character" dataset and 183 evaluators and 42K labels for the "Olympic" dataset.
We compare CrowDEA with the following four baselines, which are summarized in Table 2. (Datasets, code, and a Jupyter notebook for reproducing the tables and figures are available at: https://github.com/yukinobaba/crowdea.)

Table 1: Summary of dataset statistics.
(a) Ideas

Dataset | Problem | #ideas | #evaluators | #labels
Cheat | "How can we effectively prevent students from cheating in exams?" | 81 | 217 | ≈64K
Meeting | "How can we reduce the number of latecomers for team meetings?" | 80 | 257 | ≈63K
Night | "How can we stay safe when walking alone at night?" | 80 | 177 | ≈63K
Visitor | "How can we support foreign tourists who encounter a language barrier?" | 80 | 171 | ≈63K
Bike | – | 81 | 158 | ≈64K

(b) Designs

Dataset | Problem | #ideas | #evaluators | #labels
Character | "Design a character for an AI research laboratory." | 38 | 64 | ≈14K
Olympic | – | 66 | 183 | ≈42K

Table 2: Comparison of CrowDEA and baselines.

Method | Multi-evaluators | Multi-dimensional | Multi-view
BT | - | - | -
CrowdBT | ✓ | - | -
Blade-Chest | - | ✓ | -
BPR | ✓ | ✓ | -
CrowDEA | ✓ | ✓ | ✓

• BT (Bradley and Terry 1952) is the Bradley–Terry (BT) model, a standard approach for aggregating pairwise preferences. This model represents the preference score of each item by a scalar value and does not assume a different viewpoint for each evaluator.
• CrowdBT (Chen et al. 2013) is an extension of BT that incorporates the diversity of evaluator reliability into the model.
• Blade-Chest (Chen and Joachims 2016a) is a multi-dimensional extension of BT that models intransitivity in pairwise preferences.
• BPR (Rendle et al. 2009) is a recommendation method that models both the item embedding and the user preference by d-dimensional vectors.

The regularization parameter of the baseline methods was chosen from three candidate values, and the best case for a target metric is presented in the results. Although there exist several related studies, most of them are not applicable to the present problem setting, in which only the results of pairwise comparisons are given and the features of each idea are unavailable. α was fixed in all experiments to achieve a good balance between L_C and L_F: if α is large, L_F pushes all ideas to the frontier, which does not promote detecting the best ideas, whereas a small α lets the frontier ideas form a small and meaningful subset. As the proposed method aims to generate priority maps, we set d = 2 or d = 3.

We conducted a case study with the design datasets to investigate how well CrowDEA visually organizes the ideas from multiple viewpoints. We applied CrowDEA (with d = 2) to all the preference labels in the dataset, and the estimated two-dimensional embeddings were used to generate the priority map shown in Fig. 2a. It can be observed that CrowDEA organizes the ideas along the frontier curve; CrowDEA can locate each idea where it receives a higher preference score (from its best viewpoint), and the priority map thus shows the frontier curve. This provides a well-organized visualization, which facilitates the evaluation of ideas from multiple viewpoints. As mentioned in the introduction, the priority map created by CrowDEA allows us to recognize a variety of viewpoints, such as contemporary aesthetics (x-axis) and traditional aesthetics (y-axis). Recent Olympic logos are placed in the bottom-right region, whereas older logos from the '60s to '80s are placed in the upper-left region, which possibly correlates with the ages of those who provided the preference labels. It should be noted that the above interpretations of the axes are not given in advance. In the priority map, the Nagano (1998) and Calgary (1988) Olympic logos, highlighted in red, are the two winners on the two axes. The priority map can also capture combinations of these two perspectives, and the winners on them are highlighted in blue in Fig. 2a. Fig. 2b shows the visualization produced by BPR, which achieves the highest accuracy among the baselines, as presented in Sec. 4.6. In contrast to CrowDEA, BPR assigns much higher priorities to modern logos than to traditional ones, and it thus does not produce a frontier curve.
We demonstrate how accurately CrowDEA determines the best ideas from various viewpoints.
Setup:
We prepared ground-truth idea priorities from various viewpoints to investigate accuracy. We first collected viewpoints for each dataset from crowdsourcing workers, who were shown a pair of ideas and asked to describe a viewpoint that distinguishes the two ideas. For instance, we obtained "This idea can be easily implemented" as a viewpoint for the "Cheat" problem. We then asked workers to grade each idea in terms of each viewpoint on a five-point scale. Ten workers were assigned to each idea–viewpoint pair, and the average grade was used as the ground truth priority p*_ij of idea i from viewpoint j. We removed overlapping or less popular viewpoints by applying k-means clustering to the obtained priorities; that is, we considered p*_j = (p*_1j, ..., p*_nj) to be the feature vector of viewpoint j and used it for clustering. The number of clusters was fixed in advance; clusters with only one sample were then omitted, and the viewpoint closest to the center of each remaining cluster was chosen as a representative viewpoint, yielding several representative viewpoints for each dataset. (The representative viewpoints were almost the same over a range of settings of the number of clusters.) We note that neither the proposed method nor the baseline methods can access the ground truth; it is used only for evaluation.

We applied CrowDEA, BPR, and Blade-Chest to the preference labels in each dataset and obtained the embeddings {x_i}_{i=1}^n. We used the embeddings to rank the ideas according to each representative viewpoint in the ground truth and evaluated the ranking accuracy. Given a viewpoint vector v, the projection of x_i onto v (i.e., v^⊤ x_i) is considered as the priority score for this viewpoint. We optimized a viewpoint vector v*_j that well represents viewpoint j according to an evaluation measure; this yielded p_ij = v*_j^⊤ x_i, the predicted priority score for that viewpoint. We also applied BT and CrowdBT to the preference labels and regarded the estimated score p_i as p_ij for each viewpoint j. Each method generated a ranking of the ideas for viewpoint j according to {p_ij}_i. We compared these with the rankings by the ground truth priorities, {p*_ij}_i, and evaluated the ranking accuracy.

The ranking accuracy (nDCG@k) for each viewpoint was calculated as follows: we took the top k ideas according to the predicted priorities and their true priorities y = (y_1, ..., y_k), where y_i is the true priority of the i-th ranked idea. We additionally took the true top k ideas and their true priorities t = (t_1, ..., t_k). We calculated DCG(k, y) = Σ_{i=1}^k y_i / log_2(i + 1) and IDCG(k, t) = Σ_{i=1}^k t_i / log_2(i + 1) to obtain nDCG@k = DCG(k, y) / IDCG(k, t).

Figure 2: Priority maps for the "Olympic" dataset generated by (a) CrowDEA and (b) BPR. CrowDEA produces a well-organized visualization and detects good ideas in diverse viewpoints. The top-right corner of each image corresponds to its embedding in the space. The frontier objects detected by CrowDEA are highlighted in red or blue. Both CrowDEA and BPR locate ideas with higher priorities further from the origin. In contrast to BPR, CrowDEA assigns high priorities to the ideas from multiple viewpoints and organizes the ideas along the frontier curve.

Note: the Nagano (1998) logo is actually regarded as the best use of athletic imagery by some professional critics; https://en.99designs.jp/blog/famous-design/olympic-logos/

Table 3: Examples of frontier ideas for the "Cheat" problem found by CrowDEA. CrowDEA finds worthy ideas from various viewpoints.
• Impose severe penalties for cheating, such as cancellation of modules for an entire year.
• Prepare two types of examination sheets with differently ordered items, and distribute one to every student such that neighboring students have different exam sheets.
• Have proctors watch students from the back of an examination room.
• Instead of multiple-choice questions or short answer questions, use essay questions to make it difficult to copy the answers of other students.

Results:
Table 3 lists examples of the frontier ideas obtained by CrowDEA for the "Cheat" dataset. It can be seen that CrowDEA provides useful ideas that are considered good from various viewpoints. Table 4 shows the average nDCG@5 and nDCG@10 over the representative viewpoints. CrowDEA outperforms the baselines in most cases; CrowDEA can capture the diversity of viewpoints that are not considered by the other, simpler methods. Moreover, CrowDEA with d = 3 achieves higher scores than with d = 2 in all datasets, as the higher-dimensional embedding handles various viewpoints.

We also quantitatively investigate the variety of the ideas prioritized by the proposed method. Fig. 3 shows the top-10 ideas and a heatmap of the ground truth priority p*_ij of each top-10 idea for each viewpoint. The top-10 ideas ranked by CrowdBT and BT are selected by using p_i, and those by CrowDEA (with d = 2) according to p_i = Σ_{j∈[n]\{i}} v_i^⊤ (x_i − x_j), which indicates how likely the ideas are to be frontier ideas. It is observed that CrowDEA prioritizes ideas from a wider variety of viewpoints, whereas the baselines tend to use the same viewpoints. Note that Blade-Chest and BPR cannot output a single priority score due to the absence of v_i.

Table 4: Average nDCG@5 and nDCG@10 scores among the representative viewpoints. CrowDEA accurately ranks the ideas according to various viewpoints. The cases in which CrowDEA outperforms the baselines are bold-faced, and the cases in which CrowDEA is the statistically significant winner by the Wilcoxon signed-rank test are underlined. Cells that could not be recovered are marked "–".

(a) d = 2
          | nDCG@5                                  | nDCG@10
Dataset   | BT    CrowdBT Blade-Chest BPR  CrowDEA | BT    CrowdBT Blade-Chest BPR  CrowDEA
Bike      | .772  .779    .757        –    –       | .798  .800    .756        –    –
Cheat     | .768  .767    .813        –    –       | .795  .791    .819        –    –
Meeting   | .817  .815    .829        –    –       | .824  .825    .837        –    –
Night     | .790  .790    .903        –    –       | .809  .808    .901        –    –
Visitor   | .818  .825    .868        –    –       | .832  .835    .874        –    –
Character | .902  .912    .866        –    –       | .911  .921    .865        –    –
Olympic   | .926  .926    –           –    –       | .936  .937    .937        –    –

(b) d = 3
          | nDCG@5                                  | nDCG@10
Dataset   | BT    CrowdBT Blade-Chest BPR  CrowDEA | BT    CrowdBT Blade-Chest BPR  CrowDEA
Bike      | .772  .779    .803        –    –       | .798  .800    .797        –    –
Cheat     | .768  .767    .847        –    –       | .795  .791    .839        –    –
Meeting   | .817  .815    .867        –    –       | .824  .825    .862        –    –
Night     | .790  .790    .907        –    –       | .809  .808    .894        –    –
Visitor   | .818  .825    .916        –    –       | .832  .835    .906        –    –
Character | .902  .912    .905        –    –       | .911  .921    –           –    –
Olympic   | .926  .926    .927        –    –       | .937  .937    .926        –    –

Each dataset contains preference labels from approximately 200 evaluators; however, it is not always feasible to collect labels from such a large group of evaluators. To demonstrate the efficiency of the proposed method, we evaluate the accuracy of CrowDEA with respect to the number of evaluators. Additionally, we evaluate the accuracy of CrowDEA in cases where the number of available labels per evaluator is limited.
Setup:
We randomly chose q evaluators or r labels, for several values of q and r, and applied CrowDEA to the corresponding subset of the preference labels in a dataset. For each q or r, we performed ten trials, selecting a different set of evaluators (or labels) for each trial.

Results:
Fig. 4a shows the average nDCG@5 of each method according to the number of evaluators used for model inference. The average nDCG@5 scores are shown for different viewpoints and ten different subsets of evaluators. The performance of CrowDEA declines as the number of evaluators decreases; however, the average nDCG@5 scores are still over . in all cases, even when the number of evaluators is only , and CrowDEA outperforms the baselines in all cases. Fig. 4b shows the average nDCG@5 of each method according to the number of labels. CrowDEA shows better performance than the other methods even when the number of labels is small. It is worth noting that CrowDEA can handle various viewpoints and prioritize ideas in situations where only a limited number of evaluators or labels are available.
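For reference, the nDCG@k metric reported in these experiments can be computed as in the following generic sketch (a standard implementation, not the authors' evaluation code; `true_rel` holds the ground-truth priority of each idea under one viewpoint and `scores` holds a method's estimated ranking scores):

```python
import numpy as np

def ndcg_at_k(true_rel, scores, k):
    """nDCG@k: discounted cumulative gain of the predicted top-k,
    normalized by the ideal (ground-truth-sorted) top-k."""
    true_rel = np.asarray(true_rel, dtype=float)
    order = np.argsort(scores)[::-1][:k]          # predicted top-k indices
    discounts = 1.0 / np.log2(np.arange(2, len(order) + 2))
    dcg = float(true_rel[order] @ discounts)
    ideal = np.sort(true_rel)[::-1][:k]           # best achievable top-k gains
    idcg = float(ideal @ discounts[:len(ideal)])
    return dcg / idcg if idcg > 0 else 0.0
```

Averaging this quantity over viewpoints and trials gives scores of the kind plotted in Fig. 4.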
We addressed the problem of idea prioritization with crowds. The proposed method estimates the best viewpoint for every idea and the preference criterion of every crowd evaluator. Experimental results based on real datasets containing ideas demonstrated that the proposed approach effectively prioritizes ideas from multiple viewpoints and obtains frontier ideas. The visualization based on the learned embeddings facilitates observation of the frontier ideas. Possible future work includes extensions to multiple best viewpoints for each idea, as the present formulation allows only a single best viewpoint. The interpretation of the obtained results is also an important issue; although it is left to users in the present study, systematic interpretation by crowds is an interesting future research direction.
Acknowledgments
This work was supported by JSPS KAKENHI Grant Number JP18K18105 and JST PRESTO Grant Number JPMJPR19J9, Japan.

[Figure 3 panels: (a) “Olympics”; (b) “Character”. Horizontal axis of each panel: rank of idea.]
Figure 3: (Left) The top-10 ideas prioritized by each method. The ideas are ordered from left to right according to their estimated preference scores. (Right) The ground truth priority of each of the top-10 ideas in each representative viewpoint. The ideas selected by CrowDEA are prioritized in different viewpoints, while those chosen by the baselines are prioritized in the same viewpoints.

[Figure 4 panels: (a) Efficiency with a small number of evaluators; (b) Efficiency with a small number of labels.]
Figure 4: Average of nDCG@5 scores for the representative viewpoints and ten trials. CrowDEA accurately ranks the ideas even when the number of evaluators or the number of labels is small. d is set to . Due to space limitations, we only present the results of the first four datasets.

References

[Ahmed and Fuge 2017] Ahmed, F., and Fuge, M. 2017. Capturing winning ideas in online design communities. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW), 1675–1687.
[Amid and Ukkonen 2015] Amid, E., and Ukkonen, A. 2015. Multiview triplet embedding: Learning attributes in multiple maps. In
Proceedings of the 32nd International Conference on Machine Learning (ICML), 1472–1480.
[Borzsony, Kossmann, and Stocker 2001] Borzsony, S.; Kossmann, D.; and Stocker, K. 2001. The skyline operator. In Proceedings of the 17th International Conference on Data Engineering (ICDE), 421–430.
[Bradley and Terry 1952] Bradley, R. A., and Terry, M. E. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika.
Journal of Statistical Planning and Inference.
Advances in Neural Information Processing Systems 27, 1853–1861.
[Chen and Joachims 2016a] Chen, S., and Joachims, T. 2016a. Modeling intransitivity in matchup and comparison data. In Proceedings of the 9th ACM International Conference on Web Search and Data Mining (WSDM), 227–236.
[Chen and Joachims 2016b] Chen, S., and Joachims, T. 2016b. Predicting matchups and preferences in context. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 775–784.
[Chen et al. 2013] Chen, X.; Bennett, P. N.; Collins-Thompson, K.; and Horvitz, E. 2013. Pairwise ranking aggregation in a crowdsourced setting. In Proceedings of the 6th ACM International Conference on Web Search and Data Mining (WSDM), 193–202.
[Cheng, Gao, and Liu 2010] Cheng, Z.; Gao, B.; and Liu, T.-Y. 2010. Actively predicting diverse search intent from user browsing behaviors. In Proceedings of the 19th International Conference on World Wide Web (WWW), 221–230.
[Cooper, Seiford, and Zhu 2004] Cooper, W. W.; Seiford, L. M.; and Zhu, J. 2004. Data envelopment analysis. In Handbook on Data Envelopment Analysis. Springer. 1–39.
[Duan et al. 2017] Duan, J.; Li, J.; Baba, Y.; and Kashima, H. 2017. A generalized model for multidimensional intransitivity. In Proceedings of the 21st Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), 840–852.
[He and McAuley 2016] He, R., and McAuley, J. 2016. VBPR: Visual Bayesian personalized ranking from implicit feedback. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI), 144–150.
[Hose and Vlachou 2012] Hose, K., and Vlachou, A. 2012. A survey of skyline processing in highly distributed environments. The VLDB Journal.
Proceedings of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7596–7599.
[Kennedy and Naaman 2008] Kennedy, L. S., and Naaman, M. 2008. Generating diverse and representative image search results for landmarks. In Proceedings of the 17th International Conference on World Wide Web (WWW), 297–306.
[Koyama, Sakamoto, and Igarashi 2014] Koyama, Y.; Sakamoto, D.; and Igarashi, T. 2014. Crowd-powered parameter analysis for visual design exploration. In Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology (UIST), 65–74.
[Li, Baba, and Kashima 2018] Li, J.; Baba, Y.; and Kashima, H. 2018. Simultaneous clustering and ranking from pairwise comparisons. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), 1554–1560.
[Li, Yang, and Zhang 2016] Li, Y.; Yang, M.; and Zhang, Z. 2016. Multi-view representation learning: A survey from shallow methods to deep methods. arXiv preprint arXiv:1610.01206.
[Lofi, El Maarry, and Balke 2013] Lofi, C.; El Maarry, K.; and Balke, W.-T. 2013. Skyline queries in crowd-enabled databases. In Proceedings of the 16th International Conference on Extending Database Technology (EDBT), 465–476.
[Lykourentzou et al. 2018] Lykourentzou, I.; Ahmed, F.; Papastathis, C.; Sadien, I.; and Papangelis, K. 2018. When crowds give you lemons: Filtering innovative ideas using a diverse-bag-of-lemons strategy. Proceedings of the ACM on Human-Computer Interaction.
[Pan and Chen 2013] Pan, W., and Chen, L. 2013. GBPR: Group preference based Bayesian personalized ranking for one-class collaborative filtering. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI), 2691–2697.
[Prpić et al. 2015] Prpić, J.; Shukla, P. P.; Kietzmann, J. H.; and McCarthy, I. P. 2015. How to work a crowd: Developing crowd capital through crowdsourcing. Business Horizons.
Proceedings of the 28th AAAI Conference on Artificial Intelligence (AAAI), 3130–3131.
[Rendle et al. 2009] Rendle, S.; Freudenthaler, C.; Gantner, Z.; and Schmidt-Thieme, L. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI), 452–461.
[Santos, Macdonald, and Ounis 2011] Santos, R. L.; Macdonald, C.; and Ounis, I. 2011. Intent-aware search result diversification. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 595–604.
[Seiford and Thrall 1990] Seiford, L. M., and Thrall, R. M. 1990. Recent developments in DEA: The mathematical programming approach to frontier analysis. Journal of Econometrics.
Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW), 937–945.
[Wang et al. 2010] Wang, M.; Yang, K.; Hua, X.-S.; and Zhang, H.-J. 2010. Towards a relevant and diverse search of social images. IEEE Transactions on Multimedia.
Proceedings of the 32nd International Conference on Machine Learning (ICML), 1083–1092.
[Yu and Nickerson 2011] Yu, L., and Nickerson, J. V. 2011. Cooks or cobblers?: Crowd creativity through combination. In