A preference elicitation interface for collecting dense recommender datasets with rich user information
Pantelis P. Analytis, Tobias Schnabel, Stefan Herzog, Daniel Barkoczi, Thorsten Joachims
Demo
Pantelis P. Analytis
Cornell University, [email protected]
Tobias Schnabel
Cornell University, [email protected]
Stefan Herzog
MPI for Human Development, [email protected]
Daniel Barkoczi
MPI for Human Development, [email protected]
Thorsten Joachims
Cornell University, [email protected]
ABSTRACT
We present an interface that can be leveraged to quickly and effortlessly elicit people's preferences for visual stimuli, such as photographs, visual art and screensavers, along with rich side information about its users. We plan to employ the new interface to collect dense recommender datasets that will complement existing sparse industry-scale datasets. The new interface and the collected datasets are intended to foster integration of research in recommender systems with research in the social and behavioral sciences. For instance, we will use the datasets to assess the diversity of human preferences in different domains of visual experience. Further, using the datasets we will be able to measure crucial psychological effects, such as preference consistency, scale acuity and anchoring biases. Last, the datasets will facilitate evaluation in counterfactual learning experiments.
CCS CONCEPTS
• Human-centered computing → Collaborative filtering; Social media; Collaborative and social computing devices

KEYWORDS
preference elicitation, recommender system datasets, visual art
INTRODUCTION

Over the last three decades the recommender systems community has made immense progress in the way we represent, understand and learn people's preferences as a function of previously collected explicit or implicit evaluations. Research in recommender systems has by all means increased the quality of the curated and recommended content in the online world. Several large datasets have been a crucial component of this success, as they have commonly functioned as test-beds on which new theories and algorithms have been compared (Movielens, LastFM and Netflix, to name just a few). Most of these datasets, however, are very sparse. They contain thousands of items, and even the most popular among the items have been evaluated only by a small subset of their users. Given the large fraction of missing ratings, it is challenging to accurately estimate even simple quantities like the average quality of an item, especially since the patterns of missing data are subject to strong selection biases [6].
This presents fundamental challenges when evaluating recommendation algorithms on sparse datasets. Further, it becomes an obstacle for scholars in the social and behavioral sciences, as workarounds have to be developed for dealing with missing values. To the best of our knowledge, the only dense collaborative filtering dataset was the outcome of the Jester interface [3]. The interface curated 100 jokes of various styles and topics. People utilized a slider to evaluate 5 jokes that were presented to them sequentially. The first evaluations were used to estimate people's preferences and to recommend them the remaining jokes. The users continued to read and evaluate jokes until the pool of 100 items was exhausted. In total, more than 70,000 people have evaluated at least some of the jokes, and more than 14,000 have evaluated all the jokes, resulting in a fully evaluated subset of the dataset.
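For readers unfamiliar with Jester's backend, the Eigentaste algorithm [3] projects every user's ratings of a common gauge set onto a low-dimensional space, groups similar users, and predicts ratings of the remaining items from the group's averages. The sketch below illustrates that general idea on synthetic data; it substitutes k-means for Eigentaste's own clustering scheme, and all sizes, ranges and names are illustrative assumptions rather than details of the original system.

```python
# Simplified sketch in the spirit of Eigentaste [3]: embed users by their
# ratings of a common "gauge set" of seed items, cluster them, and predict
# a held-out user's ratings of the remaining items from the cluster averages.
# All sizes and the data itself are synthetic placeholders, not Jester data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_users, n_gauge, n_rest = 500, 5, 95               # 5 seed items, 95 remaining
gauge = rng.uniform(-10, 10, (n_users, n_gauge))    # ratings on a [-10, 10] slider
rest = rng.uniform(-10, 10, (n_users, n_rest))

train, test = np.arange(400), np.arange(400, 500)   # hold out 100 users

# 1) Embed users by their gauge-set ratings and cluster them.
pca = PCA(n_components=2).fit(gauge[train])
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(pca.transform(gauge[train]))

# 2) Each cluster's profile = mean rating of every remaining item among its members.
profiles = np.vstack([rest[train][kmeans.labels_ == c].mean(axis=0)
                      for c in range(kmeans.n_clusters)])

# 3) Assign held-out users to a cluster from their seed ratings alone and
#    predict their ratings of the items they have not yet seen.
test_clusters = kmeans.predict(pca.transform(gauge[test]))
predicted = profiles[test_clusters]

rmse = np.sqrt(np.mean((predicted - rest[test]) ** 2))
print(f"RMSE of cluster-average predictions on held-out users: {rmse:.2f}")
```

With real ratings in place of the synthetic matrix, the held-out RMSE would indicate how informative a handful of seed evaluations are for the remaining items.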
Figure 1: The design of the preference elicitation interface. We replicate the design of the Jester interface, using a continuous bar that people can use to express how much they liked or disliked an item. Participants have to wait for at least 5 seconds before they can proceed to the next item.

THE INTERFACE AND DATA COLLECTION
We plan to collect new datasets in different domains of people's visual experience, ranging from photographs and paintings to designs for screensavers. Our interface replicates the design of the Jester interface, adding new elements that can counteract its limitations. At the outset, people are provided with instructions about how to use the interface. Then, before the presentation of the stimuli, we collect demographic information about the users. To reduce possible order effects, the visual stimuli are presented in random order. As in Jester, users are asked to evaluate items using a slider bar; they can move the marker of the slider bar to the left to indicate that they did not like the item, or to the right to indicate that they liked it. We implement a continuous scale, which allows a fine-grained evaluation of the presented items. Finally, to limit anchoring bias, the slider bar is initially semi-transparent and its colors become vivid only once the user has clicked on it.

Once all the items have been evaluated, we collect further psychologically relevant information about the users. Numerous studies have shown that side information can substantially improve estimates of people's preferences and complements first-hand evaluations [5]. In the first experiments we will deploy the visual-art expertise questionnaire developed by Chatterjee et al. [2] to gauge people's familiarity with the visual arts, and a succinct version of the big-five questionnaire to quickly assess people's personalities [7] (see Figure 2). It takes about 20 minutes to complete the current version of the interface, including the instructions, questionnaires and evaluation phase.

We intend to conduct the first experiments on Amazon's Mechanical Turk labor market. Several studies have shown that for effortless tasks the results produced on mTurk are comparable to those of laboratory studies [4]. The visual stimuli used in this interface evoke immediate aesthetic judgments, and thus can quickly be transformed into evaluations. Eventually, we intend to develop a data visualization tool that will reward people who complete the study with information about their preference profiles and how they relate to those of other individuals. Thus, we intend to create an inherently motivating interface, using as a reward the informational value generated by the collected data. In this way, we will reduce the cost of data collection, but also introduce the basic ideas behind collaborative filtering and recommender systems to the wider public.
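Returning to the role of side information noted above, the sketch below shows one simple way it could be used: a ridge regression that predicts a summary preference score from demographic, expertise and personality features. It is a deliberately simpler stand-in for methods such as pairwise preference regression [5], and the feature set, target and all numbers are hypothetical placeholders rather than properties of our interface or data.

```python
# Minimal sketch of how collected side information (demographics, art
# expertise, big-five scores) could help predict preferences. A plain ridge
# regression stands in for richer cold-start models such as pairwise
# preference regression [5]. Features, target and data are hypothetical.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_users = 300

# Hypothetical side-information features gathered by the interface:
# age, a visual-art expertise score, and five personality dimensions.
side_info = np.column_stack([
    rng.integers(18, 70, n_users),            # age
    rng.uniform(0, 1, n_users),               # visual-art expertise score
    rng.uniform(1, 5, (n_users, 5)),          # big-five scores
])

# Target: e.g., a user's mean slider rating in one domain (synthetic here,
# constructed so that expertise and one personality trait matter).
target = 2.0 * side_info[:, 1] + 0.3 * side_info[:, 3] + rng.normal(0, 1, n_users)

model = Ridge(alpha=1.0)
scores = cross_val_score(model, side_info, target, cv=5, scoring="r2")
print(f"Cross-validated R^2 using side information only: {scores.mean():.2f}")
```

With the actual questionnaire responses in place of the synthetic features, the same cross-validation would indicate how much preference variance side information alone can explain.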
The interface can be accessed at http://abc-webstudy.mpib-berlin.mpg.de/recstrgs/study_simulator.php. Both the code for the interface and the collected data will be publicly available.

Figure 2: At the end of the evaluation phase we collect additional information about the users. We invite the users to complete a questionnaire about their expertise in the visual arts and a 10-question version of the big-five questionnaire.

We envisage several new applications for the developed datasets. Here we foreshadow a few of these potential uses, keeping in mind that the community that will have access to the produced datasets will certainly come up with more. First, they will facilitate cross-fertilization with the cognitive and behavioral sciences. For instance, social and cognitive psychologists have extensively studied simple strategies for inference and estimation where different features are used to predict an objective truth. The new datasets will open the way to study strategies for social preference learning in domains where no objective truth exists [1]. Also, we can manipulate the design of the interface to study relevant behavioral effects, for example to study the consistency of evaluations or to investigate the effect of the granularity of the evaluation scale on the predictions. To sum up, the datasets will allow us to better understand preference diversity and its implications for different recommender systems algorithms as well as for psychological social learning strategies.

Moreover, we believe that the new datasets can fuel existing streams of research in recommender systems and machine learning. For instance, dealing with selection biases and with data missing not at random is a growing research stream in recommender systems and machine learning [9]. To evaluate algorithms tuned to deal with such problems, we can impose selection biases ex ante and remove data from the dense dataset accordingly. This set-up could complement existing sparse datasets for learning, with the difference that selection biases can be controlled and varied in order to test robustness. Moving on to the broader class of counterfactual simulations, dense datasets greatly simplify evaluation since they can serve as ground truth when conducting simulations [8].
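As a toy illustration of that evaluation protocol, the sketch below starts from a fully observed synthetic rating matrix, imposes a popularity-style selection bias ex ante, and compares the naive average of the observed ratings with an inverse-propensity-scored estimate against the ground truth that the dense matrix provides. The bias model and every number are assumptions made purely for illustration.

```python
# Toy illustration of counterfactual evaluation on a dense rating matrix:
# impose a known selection bias ex ante, then compare the naive mean of the
# observed ratings with an inverse-propensity-scored (IPS) estimate against
# the ground-truth mean the dense data provides. All numbers are synthetic.
import numpy as np

rng = np.random.default_rng(2)
n_users, n_items = 1000, 100
true_ratings = rng.normal(loc=0.0, scale=3.0, size=(n_users, n_items))

# Popularity-style bias: higher-rated items are more likely to be observed,
# mimicking the missing-not-at-random patterns of real sparse datasets [6].
propensities = 1.0 / (1.0 + np.exp(-true_ratings))      # in (0, 1)
propensities = 0.05 + 0.5 * propensities                 # bounded away from 0
observed = rng.uniform(size=true_ratings.shape) < propensities

ground_truth_mean = true_ratings.mean()
naive_mean = true_ratings[observed].mean()
ips_mean = (observed * true_ratings / propensities).sum() / (observed / propensities).sum()

print(f"ground-truth mean rating:           {ground_truth_mean:+.3f}")
print(f"naive mean over observed ratings:   {naive_mean:+.3f}  (biased upward)")
print(f"IPS-corrected mean:                 {ips_mean:+.3f}")
```

Because the dense matrix supplies the full ground truth, the error of each estimator can be measured exactly, which is precisely what sparse datasets do not allow.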
REFERENCES
[1] Pantelis P. Analytis, Daniel Barkoczi, and Stefan M. Herzog. You're special, but it doesn't matter if you're a greenhorn: Social recommender strategies for mere mortals. In Cognitive Science Society, pages 1799–1804. Cognitive Science Society, 2015.
[2] Anjan Chatterjee, Page Widick, Rebecca Sternschein, William B. Smith, and Bianca Bromberger. The assessment of art attributes. Empirical Studies of the Arts, 28(2):207–222, 2010.
[3] Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133–151, 2001.
[4] Gabriele Paolacci, Jesse Chandler, and Panagiotis G. Ipeirotis. Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5(5), 2010.
[5] Seung-Taek Park and Wei Chu. Pairwise preference regression for cold-start recommendation. In Proceedings of the Third ACM Conference on Recommender Systems, pages 21–28. ACM, 2009.
[6] B. Pradel, N. Usunier, and P. Gallinari. Ranking with non-random missing ratings: Influence of popularity and positivity on evaluation metrics. In RecSys, pages 147–154, 2012.
[7] Beatrice Rammstedt and Oliver P. John. Measuring personality in one minute or less: A 10-item short version of the Big Five Inventory in English and German. Journal of Research in Personality, 41(1):203–212, 2007.
[8] Matthew J. Salganik, Peter Sheridan Dodds, and Duncan J. Watts. Experimental study of inequality and unpredictability in an artificial cultural market. Science, 311(5762):854–856, 2006.
[9] Tobias Schnabel, Adith Swaminathan, Peter I. Frazier, and Thorsten Joachims. Unbiased comparative evaluation of ranking functions. In ICTIR, pages 109–118, 2016.