Label Visualization and Exploration in IR
Omar Alonso
Microsoft
ABSTRACT
There is a renaissance in visual analytics systems for data analysis and sharing, in particular in the current wave of big data applications. We introduce RAVE, a prototype that automates the generation of an interface that uses facets and visualization techniques for exploring and analyzing relevance assessment data sets collected via crowdsourcing. We present a technical description of the main components and demonstrate its use.
1. INTRODUCTION
The adoption of visual analytics in the information retrieval (IR) community has been relatively low compared to other areas in computer science, like databases, that process massive data sets. Visual analytics, a combination of techniques drawn from information visualization, data mining, and statistics, allows users to directly interact with the information presented to gain insights, deduce conclusions, and make better decisions.

With the ever-increasing number of data sets and emerging new data sources, the combination of automatic analysis and visual tools offers the potential to gain a better understanding of the many assets that the typical IR researcher, data analyst, or relevance engineer has to deal with on a daily basis. Furthermore, the inclusion of crowd data in the form of labels that can be used for training machine learning models or evaluating the quality of search engine results offers new opportunities for using visualization techniques.

In IR, collecting high-quality relevance assessments for query-document pairs is a crucial step for building relevance models. The task, which is very subjective, consists of assessing the relevance of a document to a given topic or query. The human (editor, judge, or worker) performs a visual inspection of the document and provides a label on a particular relevance scale. Finally, label quality control mechanisms are used to produce the final data set.

In contrast to infographics or static visualization tools that can present different types of metrics and summary statistics, we are interested in interactive visualization and exploration of preferences or labels from the crowd in the context of a particular IR experiment that contains a human intelligence task. That is, visual exploration of workers' preferences for certain scenarios as well as other behavioral data. In other words, we want visual analytics for data exploration where humans are also data providers, along with data about their work performance.

We argue that visual analytics techniques should be part of the relevance assessment process to help gather better labels and discover potential issues in the experimental design or worker performance by looking at the entire data set [1]. Visualization tools help users in situations where seeing the structure of a data set in detail is better than seeing only a brief summary [7].

Instead of focusing on the visualization of search results for a query or other descriptive statistics, we are interested in exploring the output of a relevance assessment task with the goal of recognizing patterns. Why do certain workers disagree on specific documents or topics? Are there any relevance cues in the presentation that may confuse a worker? Can we identify difficult tasks?

In industrial settings, many thousands of labels are collected weekly using data pipelines that select query-document pairs and upload them to internal or external crowdsourcing platforms. The data analyst then looks for specific metrics or anomalies in the data sets using traditional database-driven tools. In this context, visual analytics should allow users to analyze data when they do not know exactly what questions they need to ask in advance.

We automate the construction of a visualization interface given the results of a crowdsourcing task. RAVE (Relevance Assessments and Visual Exploration) is a prototype that takes as input the results of a labeling task and produces a facet-based visualization interface for exploring and analyzing relevance assessments. We demonstrate how RAVE can be utilized to gather better insights from judges, data sets, and labels.
2. SYSTEM OVERVIEW
We make the following assumptions in our architecture in terms of tools and data access. Our user, the data analyst, has access to a database of queries, topics, and documents. Human intelligence tasks are implemented in an external crowdsourcing platform (e.g., Mechanical Turk, CrowdFlower, etc.) or an internal equivalent tool.

The user begins the implementation of the experiment by sampling query-document pairs ⟨q, d⟩ from a database. The second step is to annotate the sampled data by running classifiers and NLP tools such as query type identification (e.g., navigational, informational, transactional) and named-entity recognition (e.g., person, organization, location, etc.) to augment the original data with annotations ⟨e_1, ..., e_n⟩. Once the crowdsourcing task is completed, the labels provided by workers and other assignment metadata (e.g., worker id, time spent, approval rate, etc.), ⟨l, a_1, ..., a_m⟩, are available. We can think of the underlying data representation as query-document pairs with query annotations, labels, and assignment metadata. That is, a single assessment is a tuple ⟨q, d, e_1, ..., e_n, l, a_1, ..., a_m⟩, where l represents the label. Table 1 shows an example.

query         doc_A  doc_B  query length  query type     has_entity  label  worker_id  work time
youtube       r1     r2     1             navigational   company     A      1          19
youtube       r2     r1     1             navigational   company     A      2          7
youtube       r1     r2     1             navigational   company     A      3          8
selena gomez  r2     r1     2             informational  person      same   1          21
selena gomez  r1     r2     2             informational  person      B      4          37
selena gomez  r2     r1     2             informational  person      B      3          9

Table 1: Data description example for a relevance assessment task. Columns doc_A and doc_B represent the content of the A-B comparison; r1 and r2 are the rankers. Query length (in words), query type, and has_entity are query annotations. Worker_id and work time (in seconds) represent assignment metadata.

Relevance assessment is a visual exercise, and capturing the image of what the worker sees at assessing time is an important part of our approach. Each document is saved as an image, and the workers perform the task looking at the same set of images. This allows our user to see the same content as the workers.
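To make this representation concrete, the following minimal sketch shows one assessment record in Python, with fields matching Table 1. It is illustrative only; the class and field names are ours and do not reflect the prototype's actual implementation.

    from dataclasses import dataclass

    @dataclass
    class Assessment:
        # One tuple <q, d, e_1..e_n, l, a_1..a_m>; fields follow Table 1.
        query: str          # q
        doc_a: str          # content shown in column A (ranker r1 or r2)
        doc_b: str          # content shown in column B
        query_length: int   # annotation: query length in words
        query_type: str     # annotation: navigational / informational / transactional
        has_entity: str     # annotation: named-entity type detected in the query
        label: str          # l: the worker's preference (A, B, or same)
        worker_id: str      # assignment metadata
        work_time: int      # assignment metadata: seconds spent on the assignment

    # First row of Table 1 as a record:
    row = Assessment("youtube", "r1", "r2", 1, "navigational", "company", "A", "1", 19)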
3. RELEVANCE ASSESSMENT VISUALIZATION AND EXPLORATION
As mentioned earlier, RAVE uses the image of the document as the visual focus, and query annotations and assignment metadata as facets. The prototype automatically generates visualizations and facets for two available tools: Pivot (http://bit.ly/1DsLdC6) and Exhibit (http://simile-widgets.org/exhibit/). We now describe the specific details of the automation.

As a driving example, we would like to evaluate the quality of a new ranking function against an existing baseline. That is, we assess the relevance of the results of two ranking functions r1 and r2, each returning a fixed number of documents d_1, ..., d_n for a query q. For collecting the assessments, we create an A-B comparison task that shows the query and the results of the two rankers in random order (a ranker may appear in column A or B). The task for the workers is to select which search results they prefer according to three choices: A is better, B is better, or they are the same.

As part of the data preparation step, the tool captures a debranded SERP (Search Engine Results Page) screenshot for each query-document pair. The document, in this case the ranked hit list, is saved as an image using standard libraries. A debranded page means that there are no specific user interface items that may bias workers.

A bit more processing is needed for Pivot, which requires the use of the Pauthor command line tool for generating Deep Zoom images. For Exhibit, a thumbnail version of the original image is also produced. A configuration file specifies which columns from the results data file correspond to which facets.

Pivot collections are stored in a CXML schema that defines facets and other properties. In essence, the collection is a cxml file that describes the facets and contains all the elements needed for presentation. The generation of the collection works as follows: the code first outputs the facets that we are interested in (FacetCategory) and then loops through the rows of the input file (the experiment task results) and outputs an item for each entry (Item).
Exhibit is implemented as an open source JavaScript library and there is no software to install; everything runs in the browser. RAVE generates two files for creating an Exhibit: an HTML file that contains the layout of the elements in the web page, and the data file in JSON format. For producing the JSON view, in a similar fashion to Pivot, the prototype reads the results of the experiment output and produces all the information needed.

    types: {'Item': {pluralLabel: 'Items'}},
    items: [
      { type: 'Item',
        label: '4', worker: '4',
        querytype: 'informational',
        answer: 'B',
        image: 'Selena_GomeztopB.png',
        thumbnail: 'Selena_GomeztopB_tb.png' },
      { type: 'Item',
        label: '1', worker: '1',
        querytype: 'informational',
        answer: 'Same',
        image: 'Selena_GomeztopB.png',
        thumbnail: 'Selena_GomeztopB_tb.png' },
      ...
    ]

Figure 3: Exhibit collection source code. A JSON snippet that describes the data set.

Figure 3 describes sample data, and Figure 4 shows the exploration of the result sets for a specific navigational query to investigate whether there was any difference in the preferences for the two rankers.
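A sketch of how such a data file could be produced from the task results is shown below. It is again illustrative Python rather than the prototype's code; the CSV column names and the image/thumbnail columns are assumptions based on Table 1 and the data preparation step.

    import csv
    import json

    def build_exhibit_json(results_csv, out_json):
        items = []
        with open(results_csv, newline="") as f:
            for row in csv.DictReader(f):
                # One Exhibit item per assessment, mirroring Figure 3.
                items.append({
                    "type": "Item",
                    "label": row["worker_id"],
                    "worker": row["worker_id"],
                    "querytype": row["query type"],
                    "answer": row["label"],
                    "image": row["image"],
                    "thumbnail": row["thumbnail"],
                })
        data = {"types": {"Item": {"pluralLabel": "Items"}}, "items": items}
        with open(out_json, "w") as f:
            json.dump(data, f, indent=2)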
Figure 5: Analyzing workers' work units by quantity.

Now that we have shown how to visualize a data set, we can explore workers' data to identify patterns. For example, of the nine workers who participated in the task, three of them performed most of the work and two worked on single assignments (Figure 5). By zooming into the workload of a particular worker, we can investigate his or her answers in more detail by comparing them to the answers of other workers. The tool thus allows the user to explore not only the final labels but also the specific worker data that can help determine worker quality and other behaviors.
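The same per-worker patterns can also be summarized outside the interface. The following hypothetical pandas sketch (the results.csv file and its column names are assumptions based on Table 1) computes each worker's workload and a simple agreement-with-majority score:

    import pandas as pd

    df = pd.read_csv("results.csv")

    # Number of assignments completed per worker (cf. Figure 5).
    workload = df.groupby("worker_id").size().sort_values(ascending=False)

    # Majority answer per query, and whether each assessment agrees with it.
    majority = df.groupby("query")["label"].agg(lambda s: s.mode().iloc[0])
    df["agrees_with_majority"] = df["label"] == df["query"].map(majority)
    agreement = df.groupby("worker_id")["agrees_with_majority"].mean()

    print(workload)
    print(agreement)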
4. RELATED WORK
As researchers and practitioners collect and analyze their own labeled data sets, new tools and solutions that facilitate such tasks are becoming available. Examples are end-to-end industrial crowdsourcing pipelines [3], the automation of crowdsourced relevance assessment with Terrier [6], and an open source system for collecting relevance assessments [4]. On the visual analytics front, VIRTUE, a system for exploring IR system performance and related metrics, is described in [2]. SeeDB, a visualization recommendation engine for fast visual analysis, is presented in [8]. Finally, there is emerging work on using visualization to help collect good labels via crowdsourcing for NLP annotations [5].
5. CONCLUSION AND FUTURE WORK
We showed a prototype that can automatically generate a facet-based visualization for exploring a collection of relevance assessments collected via crowdsourcing. While this may look like a very narrow space, in practice practitioners spend a considerable amount of time looking at labeled data before the relevance modeling phase. Our goal is to assist data analysts who need to collect and assess labels for relevance tasks by allowing them to visually explore those data sets in more detail. As an example, we showed an A-B comparison experiment, but the techniques presented work for any type of task that requires workers to visually inspect content and produce a label.

We are not interested in imposing a particular visualization metaphor but rather in suggesting the adoption of this type of tool as part of the relevance assessment gathering process in IR. The prototype offers visualizations for two tools and can be extended to others. The Exhibit example is very flexible and easy to deploy, making it a low-cost development alternative.

Visually exploring a data set can be useful for deciding whether the labels are of good quality, whether there are potential issues with the experiment, or whether the presentation of the results can bias the final labels. RAVE differs from previous research work in the sense that our focus is on exploring data sets instead of visualizing metrics. With RAVE the user can identify patterns and perform comparisons.

Future work includes automating the recommendation of visualizations, using the prototype to explore other existing assessment data sets such as the TREC relevance labels, and investigating the integration with other toolkits like D3.
6. REFERENCES

[1] Omar Alonso. Visualization for relevance assessments. SIGIR Forum, 48(2):14–21, 2014.
[2] Marco Angelini, Nicola Ferro, Giuseppe Santucci, and Gianmaria Silvello. VIRTUE: A visual tool for information retrieval performance evaluation and failure analysis. J. Vis. Lang. Comput., 25(4):394–413, 2014.
[3] Vasilis Kandylas, Omar Alonso, Shiroy Choksey, Kedar Rudre, and Prashant Jaiswal. Automating crowdsourcing tasks in an industrial environment. In HCOMP, 2013.
[4] Bevan Koopman and Guido Zuccon. Relevation!: An open source system for information retrieval relevance assessment. In SIGIR, pages 1243–1244, 2014.
[5] Hanchuan Li, Haichen Shen, Shengliang Xu, and Congle Zhang. Visualizing NLP annotations for crowdsourcing. CoRR, abs/1508.06044, 2015.
[6] Richard McCreadie, Craig Macdonald, and Iadh Ounis. CrowdTerrier: Automatic crowdsourced relevance assessments with Terrier. In SIGIR, page 1005, 2012.
[7] Tamara Munzner. Visualization Analysis and Design. A. K. Peters, 2014.
[8] Manasi Vartak, Samuel Madden, Aditya G. Parameswaran, and Neoklis Polyzotis. SeeDB: Automatically generating query visualizations. PVLDB, 7(13):1581–1584, 2014.

Figure 2: Pivot collection visualization. Three screenshots of the tool in action. From background to front: overview of the collection (all images) sorted by queries, focus on the search results for the query {Golden Globes 2013}, distribution of ranker preferences for a data set. (Panel annotations: Facets, Search results as image, Information panel.)