TurkEyes: A Web-Based Toolbox for Crowdsourcing Attention Data
Anelise Newman, Barry McNamara, Camilo Fosco, Yun Bin Zhang, Pat Sukhum, Matthew Tancik, Nam Wook Kim, Zoya Bylinskii
MIT: {apnewman, barryam3, camilolu}@mit.edu. Harvard: {ybzhang, psukhum}@g.harvard.edu. University of California, Berkeley: [email protected]. Boston College: [email protected]. Adobe: [email protected].
ABSTRACT
Eye movements provide insight into what parts of an image a viewer finds most salient, interesting, or relevant to the task at hand. Unfortunately, eye tracking data, a commonly-used proxy for attention, is cumbersome to collect. Here we explore an alternative: a comprehensive web-based toolbox for crowdsourcing visual attention. We draw from four main classes of attention-capturing methodologies in the literature. ZoomMaps is a novel zoom-based interface that captures viewing on a mobile phone. CodeCharts is a self-reporting methodology that records points of interest at precise viewing durations. ImportAnnots is an annotation tool for selecting important image regions, and with cursor-based BubbleView, viewers click to deblur a small area. We compare these methodologies using a common analysis framework in order to develop appropriate use cases for each interface. This toolbox and our analyses provide a blueprint for how to gather attention data at scale without an eye tracker.
Author Keywords
Eye tracking; attention; crowdsourcing; interaction techniques
CCS Concepts
• Human-centered computing → Interaction techniques; Web-based interaction; Interactive systems and tools; User studies; Empirical studies in interaction design;
INTRODUCTION
Gaze provides a window into what aspects of an image, design, or visualization people find most engaging. Where someone looks on an image can predict whether they remember it or not [6, 7]. Attention-grabbing regions of a poster can be used to summarize the design for later retrieval [9]. The most salient parts of an image can guide automatic cropping and retargeting [12]. All of these applications rely on inferring where people are paying attention by capturing where they are looking.
Figure 1. We consider approaches for crowdsourcing human visual attention data without the use of an eye tracker. The attention heatmaps generated by two of our interfaces, CodeCharts and ZoomMaps, mimic heatmaps obtained using eye tracking. While CodeCharts closely approximates eye movements, ZoomMaps gives a coarser approximation of attention with more emphasis on distant details. This paper will cover how the methodologies capture stable aspects of human attention and the unique features that make them suitable for different applications.
However, attention data has historically been difficult to collect at scale, as it involves in-lab eye tracking using dedicated hardware. The time it takes to recruit and run each participant prevents quick data collection and iteration. Meanwhile, online crowdsourcing allows for rapidly collecting large amounts of human data. Although webcam-based eye tracking has been proposed as a crowdsourceable alternative [25, 32, 42], it has many requirements, such as specific lighting conditions and participant pose, that are difficult to enforce. This has motivated a body of research on interactive user interfaces capable of capturing attention data without eye tracking.

In this paper, we analyze and expand the state of the art in interaction methodologies for capturing attention. We present TurkEyes (http://turkeyes.mit.edu/), a toolbox of four interfaces for gathering attention data using just a laptop or mobile phone. None of the interfaces we consider explicitly measure eye movements. Rather, we make use of interaction methodologies from the literature that are correlated with visual attention (Fig. 1). The interfaces we explore are:

ZoomMaps (zoom-based): Participants use the pinch-zoom gesture on a mobile phone to explore image content.

CodeCharts (self-report): Participants specify where on an image they gazed using a grid of codes that appears after image presentation, inspired by [35].

ImportAnnots (annotation): Participants paint over regions of a design they consider important using binary masks [31].

BubbleView (cursor-based): Participants click to deblur/expose small bubble regions on an otherwise blurry image [22].

We design our toolbox to represent four main categories of attention-capturing interfaces that we identify in the literature. Two of these categories lacked a well-studied concrete implementation, so we created novel interfaces (ZoomMaps and CodeCharts) to fill this gap. For the other two categories we draw on existing tools (ImportAnnots and BubbleView). We integrate these interfaces into a shared software framework and provide code to convert their attention data into a common format so that they can be directly compared.

Next, we do a deep dive on these interfaces by conducting extensive experiments on Amazon's Mechanical Turk. We carefully design tasks and validation procedures to produce high-quality data. We collect data on a variety of stimuli (natural and non-natural images) to determine what insights are discoverable by each tool. Although all the interfaces capture some common aspects of attention, they are best suited for different image types and tasks, and we provide guidelines as to applicable use cases for each.

Our contributions are: 1) a comprehensive toolbox that gathers attention-gathering interfaces into a common code and analysis framework, and 2) a user guide explaining how, when, and why to deploy each interface to gather attention data tailored to a particular use case.
RELATED WORK
Interaction data such as mouse/keyboard on desktop and touch/zoom on mobile provides a window into what people find relevant and interesting in online content [10, 18, 34, 41]. However, these same interaction methodologies can be harnessed to capture attention on images as a replacement for in-lab eye tracking.
Eye tracking.
Eye movements collected using dedicated hardware have long been used to quantify attention. Researchers have also used built-in webcams to obtain coarse-grained attention data from crowdworkers [25, 32], but these methods are insufficiently robust, requiring controlled conditions. Efforts have thus turned to interaction techniques that approximate eye movements, falling into one of the following four categories.
Cursor-based interfaces.
Prior work investigated the correlation between mouse and gaze locations [14, 18, 34]. Cursor movements can complement eye movements, especially when a participant can use both to interact with visual content. A separate line of work considered cursor-based interfaces as a proxy for eye tracking [3, 21, 38]. For instance, the moving-window methodology reveals only portions of an otherwise-obscured image depending on where a user positions the mouse cursor [20, 30, 33, 40]. This is the basis of the BubbleView methodology, which was extensively explored in [22] and provides a well-understood comparison point for our work.
Self-report interfaces.
Moving-window methodologies like BubbleView distort the underlying image. An alternative is to show viewers an undistorted image and ask them to report where they looked, often with the aid of an annotated grid [11, 35]. Our CodeCharts interface is based on [35].
Zoom-based interfaces.
Zoom allows users to expand content that they find engaging and want to view in greater detail [2, 4]. Previous work investigated the zoomable viewport on a mobile phone as a measure of user engagement with an interface or list of search results [15, 16, 26, 27, 28, 29]. Huang et al. even proposed generating heatmaps based on the viewport [17]. Our ZoomMaps methodology expands on this work by using viewport data to produce an attention heatmap on an arbitrary image, treating the mobile phone as a restricted window through which users explore areas of interest.
Annotation interfaces.
UI tools for collecting object segmentations in images were developed to produce training data for computer vision tasks such as object detection and recognition [36, 37]. However, they have also been used to identify graphic design elements that a viewer rates as important. ImportAnnots refers to the interface for capturing explicit "Importance Annotations", first introduced by O'Donovan et al. [31]; it has been used to collect data for training computational models to predict the importance of graphic designs [9, 31].
CONSIDERATIONS FOR EVALUATING ATTENTION
Here we discuss ideas, tools, and analysis methods for evaluating attention interfaces. These considerations will motivate an in-depth look at each interface and guide our comparisons between them.
Four classes of interfaces.
We group previous work on attention-gathering interfaces into four categories. Zoom-based interfaces use a viewer's zoom patterns as a signal of interest in the regions that are viewed or zoomed. Self-report interfaces show an image for a limited time and ask the viewer to report where they were looking using a visual guide. Annotation interfaces allow users to explicitly segment regions they judge to be relevant to the task at hand. Cursor-based interfaces leverage correlations between mouse movements and eye movements, often by incentivizing the viewer to click/hover to explore points of interest. Our toolbox contains one exemplar of each, and our analyses will consider the capabilities and drawbacks of these approaches.
Representations and metrics for comparing attention.
We convert the output from all of our interfaces into a common representation: an attention heatmap, where regions with higher heatmap values are more attended to. This is significant because it lets us directly compare output from all the interfaces.

| Interface    | Use Case                                                      | Advantages                                                             | Drawbacks                                                          |
| ZoomMaps     | Capturing exploration of large images at multiple scales     | Works on images with multi-scale content; natural form of interaction | Coarse approximation of attention                                  |
| CodeCharts   | Approximating eye tracking, esp. for precise viewing durations | Doesn't distort stimuli; experimenter controls timing; fun            | Little data per participant; images must fit on screen             |
| ImportAnnots | Comparing importance of graphic design elements              | Produces clean segmentations; captures importance                      | Not ideal for natural images; measures importance over attention   |
| BubbleView   | Approximating eye tracking, esp. during description tasks     | Versatile; cheap                                                       | Distorts stimuli and viewing experience                            |

Table 1. Use cases and trade-offs for the four TurkEyes interfaces.
To quantify the similarity of attention data captured in different ways, we use the Pearson's Correlation Coefficient (CC) and Normalized Scanpath Saliency (NSS) metrics. CC and NSS are the preferred metrics for evaluating saliency predictions and are highly correlated [8]. CC measures the pixel-wise correlation between two normalized heatmaps and ranges from -1 (inversely correlated) to 1 (perfectly correlated); we use it to compare attention heatmaps generated by different interfaces. NSS measures the mean value of a normalized attention heatmap evaluated at ground-truth eye fixation locations and ranges from 0 (no heatmap density at fixated locations) to infinity (all heatmap density at fixated locations). We report NSS when comparing our attention heatmaps to ground-truth eye fixations, which does not require post-processing the fixations into a heatmap (in contrast to CC). These metrics were used in [22] and studied in detail in [8].
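To make the two metrics concrete, here is a minimal numpy sketch of both computations (the function names are ours, not part of the released toolbox code):

```python
import numpy as np

def cc(map1, map2):
    """Pearson's Correlation Coefficient between two attention heatmaps."""
    return float(np.corrcoef(map1.ravel(), map2.ravel())[0, 1])

def nss(heatmap, fixations):
    """Normalized Scanpath Saliency: mean of the z-scored heatmap
    evaluated at ground-truth fixation locations, given as (row, col) pairs."""
    z = (heatmap - heatmap.mean()) / heatmap.std()
    return float(np.mean([z[r, c] for r, c in fixations]))
```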
Types of stimuli.
To investigate how well our interfaces function for different image types, we collect data using a subset of our interfaces on natural images, resumes, graphic designs, infographics, and data visualizations. The natural images are drawn from the CAT2000 dataset [5], which has ground-truth eye tracking data. For 35 images from CAT2000, we collected data using all four interfaces: ZoomMaps, CodeCharts (at 6 different viewing durations), ImportAnnots, and BubbleView. We also evaluated our interfaces on 116 resumes and 20 graphic designs that we downloaded from Canva.com. Additionally, we ran ZoomMaps on larger, more complex images than the other interfaces: infographics from MASSVIS [6] and data visualizations from personal collections.
Types of task.
There are several common viewing tasks used when collecting attention data, including: search (looking for a particular element in an image), description (describing or annotating an image while viewing it), memory (recalling some aspect of an image after viewing it), and free-viewing (exploring the image freely with no explicit task).

A limitation of relying on interaction to collect attention is that the interface cannot be completely decoupled from the task. For instance, it is impossible to have a free-viewing ImportAnnots task or a description CodeCharts task. For crowdsourcing, it is also important for (1) the task to incentivize participants to engage with the interface, and (2) data quality to be easily validated. For our experiments, we choose tasks that align with both the interaction methodology of each interface and the incentives of crowdworkers, and we explain how to validate the quality of the data captured with our procedures. For BubbleView and CodeCharts, we use a free-viewing task because the actions of clicking and reporting codes, respectively, naturally encourage interaction. For ZoomMaps we use a memory task to encourage participants to explore the image by zooming in on details. For ImportAnnots, we use the annotation (description) task intrinsic to the interface, and we also report on the description task using BubbleView.
Evaluation criteria.
We will consider several criteria when evaluating these interfaces, including: cost of data collection per image, type of stimuli and task that is appropriate for each interface, similarity of the data collected to eye movements, and what exactly each interface is measuring. Table 1 provides a high-level summary of the benefits and drawbacks of each. We present this table up front in order to contextualize the discussion of the details of each interface in the next section.
INTRODUCING THE TURKEYES TOOLBOX
In this section, we do a deep dive into the individual interfaces to describe how they work and present our experimental procedures for collecting attention data with each.
ZoomMaps
The mobile screen provides a naturally restricted window that is frequently used to explore multi-scale content with the help of the zoom functionality. We build a novel interface to capture the zoom patterns of participants viewing images on their mobile phones and show that these patterns can be used as an approximation of visual attention.
Task flow.
Participants are sent to a landing page that contains a QR code and a URL that they use to open an image gallery in their mobile browser. They are instructed to spend a minimum amount of time (5-15 seconds per image depending on the experiment) exploring each image by panning and zooming. Depending on the experiment, they either answer questions about each image (in which case they answer questions on mobile) or fill out a task-specific questionnaire at the end just before submission (which can be completed on a desktop). Once they are done viewing the images on mobile, participants receive a completion code to enter on the landing page and receive credit for the task.

Figure 2. ZoomMaps UI. Participants use the pinch-zoom gesture on their phones to explore image content at higher resolutions.
Implementation.
We built an image gallery webpage augmented with tracking capabilities using the PhotoSwipe JavaScript library (https://github.com/dimsemenov/photoswipe). We modify the library to capture any changes to the visible region of the image along with a timestamp. The interface allows pinching to zoom in or out on an image and swiping to switch images (Fig. 2). The interaction data contains viewport coordinates on the image and a timestamp for every event triggered by the user (i.e., the image is re-scaled or re-positioned on the screen).

Validation procedure.
We require that participants spend at least 1-5 seconds (depending on the experiment) on 85% of the images and at least 3 minutes in total on a task that we estimate should take 4-9 minutes. Furthermore, we require that participants zoom on at least 20% of the images. These thresholds were chosen empirically based on pilots. In most image collections, there are at least a few images that are uninteresting, receiving little viewing time, and there are frequently images that require no zooming because all elements are visible at the original scale. Participants are encouraged to keep exploring until they have met our time or zoom requirements; this ensures that we do not need to discard data retroactively.
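These checks are simple to express in code; a sketch with the thresholds from the text (function and parameter names are ours):

```python
def passes_validation(view_times_s, zoomed_flags,
                      min_time_s=1, frac_timed=0.85,
                      min_total_s=180, frac_zoomed=0.20):
    """ZoomMaps quality check: a minimum per-image viewing time on 85% of
    images, at least 3 minutes total, and zooming on at least 20% of images.
    `view_times_s` is seconds spent per image; `zoomed_flags` marks images
    on which the participant zoomed."""
    n = len(view_times_s)
    timed_ok = sum(t >= min_time_s for t in view_times_s) >= frac_timed * n
    total_ok = sum(view_times_s) >= min_total_s
    zoom_ok = sum(zoomed_flags) >= frac_zoomed * n
    return timed_ok and total_ok and zoom_ok
```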
Generating attention heatmaps.
Our mobile interface stores the bounding boxes of zoomed image regions along with timestamps of when they were in focus. We use this information to extract which parts of the image were viewed for how long and at what zoom level. We then construct the attention heatmap as follows: for every pixel in the image, we compute its average zoom level over the entire viewing interval. We define the zoom level for an image region as the full image area divided by the area of the image region that has been magnified. We assign this zoom level to all the pixels contained in the image region. We then compute each pixel's average zoom level over the viewing duration to obtain a ZoomMaps attention heatmap (Fig. 3). Higher values in the heatmap correspond to regions of the image that were inspected with closer zoom on average.
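A minimal sketch of this computation, assuming the viewport log has already been reduced to (x, y, w, h, duration) tuples in image pixel coordinates (the names here are ours):

```python
import numpy as np

def zoommaps_heatmap(viewports, img_w, img_h):
    """Time-weighted average zoom level per pixel.
    `viewports` is a list of (x, y, w, h, duration) visible-region records."""
    weighted = np.zeros((img_h, img_w))  # zoom level weighted by viewing time
    time = np.zeros((img_h, img_w))      # total time each pixel was visible
    for x, y, w, h, dur in viewports:
        zoom = (img_w * img_h) / (w * h)  # full image area / magnified region area
        weighted[y:y+h, x:x+w] += zoom * dur
        time[y:y+h, x:x+w] += dur
    # Average zoom over the viewing duration; pixels never shown stay at zero.
    return np.divide(weighted, time, out=np.zeros_like(weighted), where=time > 0)
```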
Cost.
Participants were paid $1.00-$1.25 to look at 5-35 images. Payment depended on how many questions we asked and how long participants should spend exploring each image, where we assumed that more complex/larger visuals like infographics would take longer to explore than natural images. For twenty impressions per image, data collection cost $0.72 per natural image, $1.33 per resume, and $5 per infographic.

Figure 3. A ZoomMaps attention heatmap, visualized alone (top) and overlaid on the image used to create it (bottom). For three pixels, labeled A, B, and C, zoom over time is averaged to produce a value in the corresponding attention heatmap.
Likeability.
Participants generally enjoyed the task and transitioned smoothly from browser to mobile. They were occasionally frustrated when asked to spend more time exploring the image after failing to meet our time or zoom requirements.
CodeCharts
CodeCharts collects individual gaze points by asking participants to self-report where they were looking on an image using a grid of three-character codes called a codechart. Because the timing of image presentation is controlled by the experimenter, it is possible to collect attention data at precise viewing durations.
Task flow.
A participant views an image on their screen for a preset duration, typically a few seconds. When the image disappears, it is replaced by a jittered grid of three-character codes. The participant notes the last triplet they see when the image vanishes, which approximates where on the image they were looking at the time. When the codechart vanishes, they self-report this alphanumeric code. This process, visualized in Fig. 4, repeats for a sequence of images. Trials are separated by a fixation cross to re-center the participant's gaze.

Figure 4. CodeCharts UI. Participants self-report a region of an image they gazed at using a grid of codes that appears after image presentation.

Figure 5. Sample validation images for the CodeCharts UI. We experimented with a regular fixation cross (left) and a red circle on a white background (center) before selecting a cropped face on a white background (right) as the image that most effectively encouraged participants to move their eyes to the cue. Participants were expected to type a triplet code that overlapped with the coordinates of the cue. The cues are plotted larger than actual scale for ease of viewing.
Implementation.
Each codechart is a grid of three-character alphanumeric codes. Each triplet is placed with a slight random horizontal and vertical jitter. Every time an image is presented, it is accompanied by a different codechart. We pad and resize all input images to a consistent size and generate codecharts with the same dimensions so they can be easily displayed in the browser. Both target images and codecharts are resized to fit into a display window of 1000 by 700 pixels, and triplets are displayed at a font size of around 16 pixels. To fit this display size, our CodeCharts implementation was designed for desktop, but the same methodology could be adapted for mobile.

The goal is to display the codechart long enough for participants to read a triplet but not so long that their eyes can wander. We found that 400ms was the optimal exposure duration. By systematically decreasing codechart exposure from 750ms to 400ms, the NSS similarity of CodeCharts data to ground-truth eye movements increased from 1.74 to 1.89. When we dropped the exposure time to 300ms, we saw a significant drop (24%) in accuracy at reporting a valid code.

One limitation of the interface is that the codechart can produce gridlike artifacts in the output. This can be mitigated by jittering the axes of the whole grid instead of jittering individual codes within a rectangular cell. Experimenters must take care that triplets are well-spaced and extend to the edge of the image to avoid collecting skewed data.
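As an illustration of the basic scheme (one code per cell, jittered within it), here is a sketch; the display dimensions and approximate cell spacing follow the text, while the function name and jitter amount are our assumptions:

```python
import random
import string

def make_codechart(width=1000, height=700, cell=100, jitter=15):
    """Generate a jittered grid of unique three-character codes.
    Returns (code, x, y) tuples giving each triplet's display position."""
    used = set()
    triplets = []
    for gy in range(0, height - cell + 1, cell):
        for gx in range(0, width - cell + 1, cell):
            # Codes must be unique so each reported code maps to one location.
            while True:
                code = ''.join(random.choices(
                    string.ascii_uppercase + string.digits, k=3))
                if code not in used:
                    used.add(code)
                    break
            # Slight random horizontal and vertical offset within the cell.
            x = gx + cell // 2 + random.randint(-jitter, jitter)
            y = gy + cell // 2 + random.randint(-jitter, jitter)
            triplets.append((code, x, y))
    return triplets
```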
Validation procedure.
Some of the images shown in our task are validation images. These images have a plain background and a single point of interest that aligns with one or more "correct" triplets in the corresponding codechart. We tried three different styles of validation images: a plain fixation cross, a red circle on a white background, and a cropped face image on a white background (Fig. 5). Face images were taken from the face dataset compiled by Bainbridge et al. [1]. We found that faces provide a more interesting cue than simpler stimuli, incentivizing participants to attend to the cue. We also found that explicitly instructing participants to look at validation images increased similarity to ground-truth eye movements over all experiment images (not just the validation ones!).

Our task starts with a screening phase of three normal and three validation images, where participants must correctly enter all validation codes and may only enter one nonexistent code that does not appear on the codechart. Validation images are also interspersed throughout the sequence and are used to retroactively discard data from participants who miss over 25% of validation codes. We also eliminate participants who look at the same spot on the screen for many images in a row. We find that participants with higher validation accuracy produce data that is more similar to human eye movements, which justifies our attempts to select interesting validation images. This could be because these participants are more attentive or get used to moving their eyes in response to stimuli.

Figure 6. ImportAnnots UI. Participants paint over regions of a design that they consider important using binary masks. The mask is shown as a transparent red overlay.
Generating attention heatmaps.
We combine all the gaze points for one image (one per participant) and blur them with a sigma of 50 to generate a heatmap. Our triplets are spaced approximately 100 pixels apart, so 50 is a good approximation of the radius of uncertainty in the interface.
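This step reduces to a Gaussian blur over a binary fixation map; a sketch using scipy (function name ours):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def points_to_heatmap(points, img_h, img_w, sigma=50):
    """Blur (row, col) gaze points into a continuous attention heatmap.
    sigma=50 matches the ~100-pixel triplet spacing of our codecharts."""
    fixmap = np.zeros((img_h, img_w))
    for r, c in points:
        fixmap[r, c] += 1
    heatmap = gaussian_filter(fixmap, sigma=sigma)
    return heatmap / heatmap.max() if heatmap.max() > 0 else heatmap
```

The same routine applies to BubbleView clicks, with a dataset-specific sigma in place of 50.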
Cost.
For a 48-trial experiment on natural images, participants spent on average 1 minute reading instructions and just over 6 minutes on the rest of the task, including a demographic survey. At an hourly rate of $10, this works out to $1.25-$2.00 per image for 50 participants' worth of data, depending on how long the images were shown.
Likeability.
The task was very well-received. Participants often described it as "fun" and "interesting" while also being "hard" and "fast". We think that the automatic timing of the task contributed to a game-like experience.
ImportAnnots
O'Donovan et al. introduced the idea of having crowdworkers annotate important elements on graphic designs using binary masks and averaging them to construct importance heatmaps [31]. For this paper, we re-purpose the initial interface, add a validation procedure, and test the interface on different image types including natural scenes, infographics, and resumes.
Task flow.
Participants are presented with a series of images one at a time and are asked to annotate the most important regions (Fig. 6). There are no definitions of what should be considered "important". We restrict participants to a maximum of one minute per image.

Figure 7. Sample validation images for the ImportAnnots UI: we simplified vector graphic designs to contain a single element that unambiguously stood out. Throughout the task, we computed intersection-over-union of participant annotations with ground-truth annotations (insets) to ensure annotation quality.
Implementation.
We used legacy code from O'Donovan et al. [31] for the annotation tool embedded in our task interface. It is a Flash application, designed for desktop, that provides three annotation tools: stroke fill, which allows tracing the contours of an object to provide a fine-grained segmentation; polygon fill, which allows plotting points with connected lines for coarser annotation; and regular stroke for painting over a region. Stroke fill is set as the default and is what most participants choose: 64% of images are annotated using stroke fill, compared to 34.2% with polygon fill and 1.8% with regular stroke.
Validation procedure.
Interspersed throughout the task were validation images in the form of graphic designs containing one main textual or graphical element (Fig. 7). Forty such validation images were constructed by manually deleting extra elements from existing vector designs to make annotation of importance more obvious. For a 5-minute task, participants annotated 10 images and 3 validation designs in a random order. To meet our quality thresholds, participants needed to annotate at least one object in all but one of the images and to correctly annotate 2/3 of the validation designs. Correctness on the validation designs was measured by an intersection-over-union (IoU) threshold of 0.55, where IoU is the area of overlap of the validation element and the user's selection over the total area of the two. We set this threshold by running a pilot experiment to capture reasonable variation in validation annotations and then calculating the mean IoU across annotations manually selected to be of high quality. We only collected data from participants who met the validation threshold and did not discard data retroactively.
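The IoU check itself reduces to a few lines over boolean masks; a sketch (names ours):

```python
import numpy as np

def iou(mask_a, mask_b):
    """Intersection-over-union of two boolean annotation masks of equal shape."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return intersection / union if union > 0 else 0.0

# A validation design is passed if iou(participant_mask, ground_truth) >= 0.55.
```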
Generating attention heatmaps.
Each participant generates one binary mask per image. The binary masks are averaged across participants to produce an overall attention heatmap for the image. Despite high inter-observer variability and noisy annotations, when averaged over a large number of participants (20-30), the mean importance maps give a plausible ranking of importance (see appendix of [31]).
Cost.
We paid participants $0.85-$1.00 for annotating 10 designs, bringing total costs to $2.55-$3.00 per image for 30 annotations.
Likeability.
Some of the participants found that the tools were not immediately intuitive. In general, this task takes longer per image than the others because of the level of annotation required.
Figure 8. BubbleView UI. Participants click to deblur/expose small regions of an otherwise blurry image.
BubbleView
BubbleView is a cursor-based moving-window methodology. The original image is blurred to distort text regions and disable legibility, requiring participants to click around to de-blur small, circular "bubble" regions at full resolution (Fig. 8). BubbleView was initially introduced in [23], which included a thorough comparison to eye movements on different image types and tasks. We reuse some of those analyses in this paper.
Task flow.
Participants are asked to explore one image at a time, using their mouse cursor to click and deblur regions of an image. Clicking on a new location re-blurs the previously clicked location. Blurring the image loosely approximates peripheral vision. Kim et al. found that a blur radius of 30-50 pixels, corresponding to 1-2 degrees of visual angle, produces clicks that most closely approximate eye movements without hindering search [22]. For free-viewing tasks (no instructions other than to freely explore the image), viewing time is fixed, while for description tasks, participants are presented with a text box on the side of the image where they are asked to describe the image. As noted in Kim et al. [22], it takes 2-3 times longer for participants to explore an image with BubbleView than to view it naturally for the same number of gaze points in the same unit time. In other words, this interface slows down visual processing relative to natural viewing.
Implementation.
BubbleView exposes several parameters to the experimenter: the blur sigma used to distort the image, the radius of the bubble region exposed on-click, task timing and set-up (whether or not to include a required text field entry with each image), and whether discrete mouse clicks or continuous mouse movements are collected. Different task types warrant different parameter choices to best approximate ground-truth eye movements; we refer to Kim et al. [22] for the details.
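For concreteness, the parameter set might look like the following; the values and key names here are illustrative assumptions, not the defaults of the released tool:

```python
# Illustrative BubbleView configuration; values are examples, not tool defaults.
bubbleview_params = {
    "blur_sigma": 40,              # Gaussian blur applied to the whole image (pixels)
    "bubble_radius": 40,           # radius of the deblurred region per click (pixels)
    "viewing_time_s": 10,          # fixed viewing time for free-viewing tasks
    "require_description": False,  # include a required text field with each image?
    "track_mouse_moves": False,    # continuous movements vs. discrete clicks only
}
```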
Validation procedure.
At task-time, the only validation procedure was a minimum of 150 characters required in the text entry box for describing each image. During post-processing, participant data that included too few clicks was removed. These thresholds were set empirically (to 10 clicks/image for a description task and 2 clicks/image for a free-viewing task [22]). Interquartile range-based outlier computation [24] was also used as a secondary threshold to remove participants with too few or too many clicks. This eliminated on average 2% of participants during post-processing [22].

Figure 9. Performance improves with number of participants for CodeCharts, ZoomMaps, and ImportAnnots. We chose our recommended number of participants to achieve 98% of the performance with the maximum number of participants tested. For CodeCharts and ZoomMaps, we use NSS similarity to eye movements to measure performance. For ImportAnnots, which is more tailored to ranking graphic design elements than approximating eye movements, we use Spearman's rank correlation over ranked graphic design elements, where we compare the ranking produced by a limited number of participants to that produced by many participants.
Generating attention heatmaps.
Given a set of BubbleView mouse clicks on an image, an attention heatmap is computed by blurring the click locations with a Gaussian with a particular sigma (a different one per image dataset [22]).
Cost.
Adjusting BubbleView's hourly rate from $6 to $10 (to be comparable to the compensation we use in our tasks), and accounting for data discarded by quality checks, we can re-estimate the numbers in Table 7 of [22]: $0.45 per image for 15 participants and a free-viewing duration of 10 seconds, including the cost of discarded data.
Likeability.
Participants' experience with this interface depended in part on the experimental parameters used. Kim et al. [22] indicated that participants complained about the difficulty and tediousness of the task at small bubble sizes.
CHOOSING A TOOL
In this section, we analyze the data we collected with our interfaces to evaluate them along various axes of interest. The interaction methods of each interface make each suitable to different types of tasks and stimuli. Additionally, the attention heatmaps generated by each method differ in some significant ways; we consider why, and the implications for researchers.
How many participants are required?
The interfaces collect different amounts of data per participant. For instance, CodeCharts yields just a single gaze point, whereas BubbleView produces many clicks per participant. Thus, we calculate for each interface how many participants we need to obtain a stable measure of attention on an image. For BubbleView, CodeCharts, and ZoomMaps, we measure performance as the NSS similarity of the generated attention heatmaps to ground-truth eye fixations. Similar to Kim et al. [22], we calculate the number of participants that yields 98% of maximum performance for a given interface. Our results are in Fig. 9. Kim et al. report that after about 10-15 participants, the NSS similarity of BubbleView was already 97-98% of the performance achievable with many more participants [22]. For the rest of our interfaces, we use performance with all participants as an upper bound. We find that CodeCharts requires 50 participants per image, significantly more than BubbleView. ZoomMaps requires 15-20 participants. For ImportAnnots, direct comparison to eye tracking data is not as meaningful because, by design, the generated heatmaps have a different structure than eye tracking data (see the low similarity in Table 2). As such, we use a performance metric more tailored to evaluating the importance of graphic design elements. We use the ImportAnnots heatmap to assign an importance score per element, taking the maximum value of the heatmap across an element, as in [9]. We then use these importance scores to rank the elements in a graphic design. We find that the ImportAnnots rankings produced by 30 participants are closely aligned (Spearman's rank correlation = 0.98) with those obtained using all participants.

Figure 10. Cost comparison (on natural images). We find that differences in price per image for each interface (top row) are driven more by the number of participants required (middle row) than by differences in the price of one person attending to an image (bottom row). ImportAnnots is a special case because, in addition to reporting attention, it requires the participant to segment objects, which drives up the time required per image and its price.
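The element-ranking comparison described above can be sketched as follows, assuming each design element is available as a boolean mask over the heatmap (function names are ours; scipy's spearmanr computes the rank correlation):

```python
import numpy as np
from scipy.stats import spearmanr

def element_scores(heatmap, element_masks):
    """Score each design element by the maximum heatmap value it covers, as in [9]."""
    return [float(heatmap[mask].max()) for mask in element_masks]

def ranking_agreement(heatmap_a, heatmap_b, element_masks):
    """Spearman's rank correlation between the element rankings induced by
    two attention heatmaps (e.g., few vs. many participants)."""
    rho, _ = spearmanr(element_scores(heatmap_a, element_masks),
                       element_scores(heatmap_b, element_masks))
    return rho
```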
How much will it cost?
The cost of obtaining an attention heatmap for an image is (number of participants) × (cost per image per participant). The results of this calculation for our experiments on natural images are shown in the first row of Fig. 10. BubbleView is the cheapest at $0.45 per image, followed by ZoomMaps, CodeCharts, and ImportAnnots at $3.00.

What drives this difference in the cost of attention? We observe that across interfaces, there is a remarkable consistency in cost per image per participant (bottom row of Fig. 10). We paid on average 2-4 cents per image per participant in the CodeCharts experiments, depending on how long the image was shown. The per-image per-participant costs of BubbleView and ZoomMaps are around 3 and 4 cents, respectively. As a comparison point, for in-lab eye tracking, we typically pay participants $20 for a single-hour sitting in which they view around 1000 images for 2-3 seconds each, which works out to 2 cents per image. ImportAnnots stands out from this trend at a higher rate of roughly 10 cents per participant per image; this is because it requires segmentation in addition to paying attention. This indicates that the price for simply attending to an image is relatively constant at around 3 cents, and that the difference in price of an attention heatmap is driven by the varying numbers of participants required to obtain stable data. Thus, as a rule of thumb, we can expect interfaces that collect a lot of attention data per participant to be cheaper.

Figure 11. ZoomMaps on data visualizations. ZoomMaps is an ideal tool for evaluating complex images because viewers can study content at multiple scales via a natural interface. Here, we see ZoomMaps has potential as a visualization debugging tool; participants who rated an image as more "well-designed" had different viewing patterns from those who did not.

Which interface is appropriate for which stimuli?
Some stimuli do not work equally well with each interface.
Image scale.
ZoomMaps, ImportAnnots, and BubbleView allow panning/scrolling and are therefore compatible with images larger than the screen. By contrast, CodeCharts' brisk task progression requires that stimuli fit on the screen. This makes CodeCharts inappropriate for images with an extreme aspect ratio and limits the amount of detail that can be seen. Only ZoomMaps supports viewing images at varying resolutions. This makes it uniquely qualified to collect viewing data on images with multiscale content, such as infographics or data visualizations (Fig. 11).
Natural vs. non-natural images.
ImportAnnots is most appropriate for easily-segmentable, non-natural images, whereas the other interfaces can handle both natural and non-natural images.
Dynamic content.
Although we did not explore this possibility in our experiments, CodeCharts can collect gaze data on videos, as suggested in [35]. Instead of showing an image for a fixed duration, one can show a short video clip ending at a moment of interest to capture gaze locations at that frame. CodeCharts can also be used to collect attention data at different viewing durations, thus giving insight into how attention evolves with time [13].
Combining insights from multiple interfaces.
These tools are not mutually exclusive. In fact, they can be used in combination to gain a more nuanced picture of attention (Fig. 12).
Which interface is appropriate for which task?
As with type of stimuli, some interfaces lend themselves better to certain task types than others. For instance, BubbleView and ImportAnnots support description tasks because they can be displayed concurrently with a text input and do not collect data unless the user actively clicks on the image. By contrast, CodeCharts requires that the user focus on the image at all times to avoid missing the codechart, and ZoomMaps collects data continuously, which could be disturbed by the participant stopping to type. All interfaces support search except ImportAnnots, which allows participants to carefully consider an image before annotating anything and thus would not capture the search process. CodeCharts is an ideal interface for free-viewing because the game-like pace of the experiment automatically engages participants, whereas they might not feel incentivized to engage with other interfaces without an explicit task. All interfaces support memory tasks in addition to their intrinsic interaction methodology.

Figure 12. ImportAnnots, ZoomMaps, and CodeCharts expose different aspects of how viewers explore a resume. CodeCharts shows what is immediately salient, ZoomMaps shows what people spent time exploring, and ImportAnnots shows what they think is most relevant after considering the entire document. In this case, the diffuse and center-biased CodeCharts data indicates that there was nothing immediately eye-catching about the resume. Viewers zoomed in to read the text, but they rated the title and the skills graph as the most important elements.
How similar is this data to eye movements?
|                | CodeCharts | BubbleView | ZoomMaps | ImportAnnots |
| CC (% of IOC)  |            | 72%        | 69%      | 59%          |
| NSS (% of IOC) |            | 65%        | 57%      | 50%          |

Table 2. Comparison of our toolbox to ground-truth eye movements. We report both Correlation Coefficient and Normalized Scanpath Saliency as percentages of Inter-Observer Consistency (IOC), where both metrics increase with higher similarity. See the section on metrics for an explanation of these metrics.
We ran data collection using all four interfaces on a set of 35 images sampled from the CAT2000 dataset [5]. We computed the NSS and CC scores for each of these attention heatmaps compared to ground-truth eye movements. As a human baseline, we computed Inter-Observer Consistency (IOC): for NSS, using attention heatmaps of N-1 participants to predict the remaining participant [8, 22]; for CC, comparing attention heatmaps of half the observers to the other half. The results are in Table 2. CodeCharts data is most similar to eye movements, accounting for over 80% of human consistency. It is followed by BubbleView, ZoomMaps, and ImportAnnots.

Figure 13. The TurkEyes interfaces compared to human eye movements on the CAT2000 dataset. CodeCharts best approximates human eye movements, including single fixations resulting from exploration. BubbleView also captures salient regions. ZoomMaps occasionally focuses on background elements instead of salient foreground objects, and ImportAnnots segments semantically important elements, often focusing on a single central object.

Fig. 13 shows some representative examples of the results on CAT2000 images. Human gaze (whether collected using an eye tracker or using the CodeCharts UI) falls on certain object regions only (e.g., faces, hands, points of contact, etc.), as does BubbleView. By contrast, ImportAnnots tends to highlight a few objects per scene, ascribing uniform importance over entire objects. ZoomMaps occasionally over-focuses on distant background objects at the expense of salient foreground objects, as in the middle row of Fig. 13.
Is the data measuring saliency or importance?
Figure 14. Our interfaces can be organized on an "intentionality" scale based on the degree to which they measure saliency (more spontaneous) or importance (more intentional). For BubbleView, we distinguish between a free-viewing task and a description task.
Attention comes in different flavors. Saliency is a bottom-up measure of what parts of an image are most attention-grabbing [19]. It is most commonly measured by aggregating eye movements across participants. Importance is a top-down measure of which elements in an image are most relevant [31]. The former is more spontaneous and happens automatically during viewing, while the latter requires the viewer to consider and evaluate the image before making a determination.

We hypothesize that an interface's place on the saliency-importance continuum is a function of its "intentionality": the amount of cognitive processing required to use a particular interface's interaction methodology while viewing an image. The more involved a given interaction methodology, the more it elicits top-down importance; the less interaction required, the more closely it measures saliency. Interaction has the effect of slowing down the viewing process, allowing the user to explore the image before attention data is recorded.

Fig. 14 places our attention-capturing interfaces on an intentionality scale, where intentionality increases and similarity to eye movements decreases to the right. Eye tracking requires no explicit user interaction and thus is the most direct measure of saliency. CodeCharts is the second-best measure of saliency because it does not distort the image or require user interaction while viewing the image. BubbleView (free-viewing) still captures image locations that draw people's attention, but is more intentional because it slows down viewing time, distorts the image, and requires users to click to expose areas of interest. ZoomMaps requires the user to decide to interact with the image by pinching and zooming, but uses a familiar and almost second-nature mechanism. BubbleView (description) asks participants to complete a specific task, so they are more likely to deliberate and click on an area important for understanding some part of a visualization as opposed to the areas most attractive at first glance. Finally, ImportAnnots measures importance instead of saliency: participants are given ample time to view the image and are asked to select regions (not single gaze points) that best represent the content of the image after considering the entire design.

To understand the difference between saliency and importance, we compare data from interfaces on either end of the intentionality spectrum: CodeCharts and ImportAnnots. CodeCharts reflects common patterns in eye tracking data like center bias (the tendency of humans to gaze at the center of an image [39]), exploration (gaze points scattered throughout an image), and emphasis on faces, while ImportAnnots produces large regions of uniform importance that coincide with discrete objects (Fig. 15, top). ImportAnnots and CodeCharts are most similar when the image contains a handful of objects that are both salient and segmentable (Fig. 15, bottom).

Figure 15. Natural images where CodeCharts and ImportAnnots agree and differ. Top: CodeCharts shows center bias and image exploration while ImportAnnots finds objects of interest. Bottom: The interfaces agree because the animals are salient and easily segmented.

Figure 16. Graphic designs where CodeCharts and ImportAnnots agree and differ. Top: The title is important but not salient. Bottom: The cat is salient and important, but saliency is concentrated at a point whereas importance segments the entire photo.

On graphic designs (Fig. 16), CodeCharts heatmaps have a strong center bias (probably because people do not have time to examine the details of the design), whereas ImportAnnots indicates that people find text to be important. Quantitatively, CodeCharts and ImportAnnots heatmaps are weakly correlated (CC of 0.413 for natural images, 0.491 for graphic designs). When using each to rank graphic design elements, the two interfaces achieve a Spearman's rank correlation of 0.509. An important object is not necessarily a salient one, and vice versa.
What insights arise from each interface?
Figure 17. Can you guess which attention heatmap was generated with which UI? Level sets are visualized to improve discriminability.

The attention heatmaps collected by these interfaces can vary dramatically for certain images. CodeCharts resembles eye tracking data, with an emphasis on salient regions and some exploration patterns. BubbleView tends to focus on salient regions. ZoomMaps can pick up on small details of interest that are missed by other interfaces. ImportAnnots produces high-fidelity, relatively uniform element segmentations. Examples of these differences are shown in Fig. 17. The maps visualized are, respectively: CodeCharts, ImportAnnots, and ZoomMaps. The CodeCharts attention heatmap focuses on faces, much as human eye movements would. The ImportAnnots heatmap highlights the full object that is the main focus of the photograph. The ZoomMaps heatmap includes interesting details in the background of the image, with small people on the mountainside.
WHICH INTERFACE SHOULD I USE?
After a thorough analysis of each interface, we refer back to Table 1 for a summary of the advantages of each. ZoomMaps can collect attention data on detailed, multi-scale content via an intuitive interface, but it provides a coarse-grained approximation of attention and sometimes places outsized emphasis on smaller items. CodeCharts most accurately replaces eye tracking, is the only interface where stimuli exposure time is carefully controlled by the experimenter, and does not require image distortion, but it does require many participants and is relatively expensive. ImportAnnots provides high-fidelity element segmentations and emphasizes importance over saliency. BubbleView is versatile, cheap, and a reasonable approximation of eye data, but it distorts the underlying image and slows down the viewing process. The best interface depends on the use case, stimuli, and type of data desired.
CONCLUSION AND FUTURE WORK
In this work we introduced TurkEyes, a crowdsourceable UI toolbox that relies on user interaction, not eye tracking, to gather attention data on images. The TurkEyes toolbox represents the state of the art in web-based attention tracking tools. It handles a wide variety of images and task designs, and it provides nuanced data about how viewers interact with an image, including what they find eye-catching, engaging, and important. We demonstrate how to convert data from disparate interfaces into a common format so that it can be combined, compared, and analyzed. Finally, we provide instructions for how to collect and validate data and how to choose the best interface for a particular use case.

TurkEyes provides the tools necessary to collect attention data at scale. This lays the groundwork for future work in exploring different image types, viewing tasks, and applications of attention. Crowdsourced attention could be used to identify areas of interest in satellite pictures or medical images. Attention can help designers verify that the correct parts of a design are attention-grabbing, or interfaces can be combined to give a nuanced picture of how a viewer explores a visualization. Computational models trained on cheap, scalably crowdsourced attention data can help machines understand images the way humans do. TurkEyes makes attention data an accessible tool for researchers and creators who want to better understand how humans respond to visual content.
ACKNOWLEDGEMENTS
We would like to thank Kimberli Zhong, Spandan Madan, and Dr. Frédo Durand for brainstorming with us and helping to develop earlier iterations of the ZoomMaps interface; Dr. Aude Oliva for feedback on our study designs; and Allen Lee for his magic touch on the TurkEyes website.
REFERENCES

[1] Wilma A. Bainbridge, Phillip Isola, and Aude Oliva. 2013. The intrinsic memorability of face photographs. Journal of Experimental Psychology: General 142, 4 (2013), 1323–1334.

[2] Lyn Bartram, Albert Ho, John Dill, and Frank Henigman. 1995. The Continuous Zoom: A Constrained Fisheye Technique for Viewing and Navigating Large Information Spaces. In Proceedings of the ACM Symposium on User Interface Software and Technology (UIST '95). 207–215. DOI: http://dx.doi.org/10.1145/215585.215977

[3] Roman Bednarik and Markku Tukiainen. 2005. Effects of Display Blurring on the Behavior of Novices and Experts During Program Debugging. In CHI '05 Extended Abstracts on Human Factors in Computing Systems (CHI EA '05). ACM, New York, NY, USA, 1204–1207. DOI: http://dx.doi.org/10.1145/1056808.1056879

[4] Eric A. Bier, Maureen C. Stone, Ken Pier, William Buxton, and Tony D. DeRose. 1993. Toolglass and Magic Lenses: The See-through Interface. In Proceedings of the 20th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '93). ACM, New York, NY, USA, 73–80. DOI: http://dx.doi.org/10.1145/166117.166126

[5] Ali Borji and Laurent Itti. 2015. CAT2000: A large scale fixation dataset for boosting saliency research. arXiv preprint arXiv:1505.03581 (2015).

[6] Michelle A. Borkin, Zoya Bylinskii, Nam Wook Kim, Constance May Bainbridge, Chelsea S. Yeh, Daniel Borkin, Hanspeter Pfister, and Aude Oliva. 2016. Beyond Memorability: Visualization Recognition and Recall. IEEE Transactions on Visualization and Computer Graphics 22, 1 (Jan 2016), 519–528. DOI: http://dx.doi.org/10.1109/TVCG.2015.2467732

[7] Zoya Bylinskii, Phillip Isola, Constance May Bainbridge, Antonio Torralba, and Aude Oliva. 2015. Intrinsic and extrinsic effects on image memorability. Vision Research 116 (2015), 165–178.

[8] Zoya Bylinskii, Tilke Judd, Aude Oliva, Antonio Torralba, and Frédo Durand. 2016. What do different evaluation metrics tell us about saliency models? CoRR abs/1604.03605 (2016). http://arxiv.org/abs/1604.03605

[9] Zoya Bylinskii, Nam Wook Kim, Peter O'Donovan, Sami Alsheikh, Spandan Madan, Hanspeter Pfister, Fredo Durand, Bryan Russell, and Aaron Hertzmann. 2017. Learning Visual Importance for Graphic Designs and Data Visualizations. In Proceedings of the 30th Annual ACM Symposium on User Interface Software & Technology (UIST '17). ACM. DOI: http://dx.doi.org/10.1145/3126594.3126653

[10] Mon Chu Chen, John R. Anderson, and Myeong Ho Sohn. 2001. What Can a Mouse Cursor Tell Us More?: Correlation of Eye/Mouse Movements on Web Browsing. In CHI '01 Extended Abstracts on Human Factors in Computing Systems (CHI EA '01). ACM, New York, NY, USA, 281–282. DOI: http://dx.doi.org/10.1145/634067.634234

[11] Shiwei Cheng, Zhiqiang Sun, Xiaojuan Ma, Jodi L. Forlizzi, Scott E. Hudson, and Anind Dey. 2015. Social Eye Tracking: Gaze Recall with Online Crowds. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing (CSCW '15). ACM, New York, NY, USA, 454–463. DOI: http://dx.doi.org/10.1145/2675133.2675249

[12] Rachel England. 2018. Twitter uses smart cropping to make image previews more interesting. (2018). https://engadget.com/2018/01/25/twitter-uses-smart-cropping-to-make-image-previews-more-interest

[13] Camilo Fosco*, Anelise Newman*, Pat Sukhum, Yun Bin Zhang, Aude Oliva, and Zoya Bylinskii. 2019. How Many Glances? Modeling Multi-duration Saliency. In SVRHM Workshop at NeurIPS 2019.

[14] Qi Guo and Eugene Agichtein. 2010. Towards Predicting Web Searcher Gaze Position from Mouse Movements. In CHI '10 Extended Abstracts on Human Factors in Computing Systems (CHI EA '10). ACM, New York, NY, USA, 3601–3606. DOI: http://dx.doi.org/10.1145/1753846.1754025

[15] Qi Guo, Haojian Jin, Dmitry Lagun, Shuai Yuan, and Eugene Agichtein. 2013. Mining Touch Interaction Data on Mobile Devices to Predict Web Search Result Relevance. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '13). ACM, New York, NY, USA, 153–162. DOI: http://dx.doi.org/10.1145/2484028.2484100

[16] Qi Guo and Yang Song. 2016. Large-Scale Analysis of Viewing Behavior: Towards Measuring Satisfaction with Mobile Proactive Systems. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM '16). ACM, New York, NY, USA, 579–588. DOI: http://dx.doi.org/10.1145/2983323.2983846

[17] Jeff Huang and Abdigani Diriye. 2012. Web User Interaction Mining from Touch-Enabled Mobile Devices. (2012).

[18] Jeff Huang, Ryen White, and Georg Buscher. 2012. User See, User Point: Gaze and Cursor Alignment in Web Search. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI '12). ACM, New York, NY, USA, 1341–1350. DOI: http://dx.doi.org/10.1145/2207676.2208591

[19] Laurent Itti and Christof Koch. 2000. A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research 40, 10 (2000), 1489–1506. DOI: http://dx.doi.org/10.1016/S0042-6989(99)00163-7

[20] Anthony R. Jansen, Alan F. Blackwell, and Kim Marriott. 2003. A tool for tracking visual attention: The Restricted Focus Viewer. Behavior Research Methods, Instruments, & Computers 35, 1 (2003), 57–69. DOI: http://dx.doi.org/10.3758/BF03195497

[21] Ming Jiang, Shengsheng Huang, Juanyong Duan, and Qi Zhao. 2015. SALICON: Saliency in Context. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1072–1080. DOI: http://dx.doi.org/10.1109/CVPR.2015.7298710

[22] Nam Wook Kim, Zoya Bylinskii, Michelle A. Borkin, Krzysztof Z. Gajos, Aude Oliva, Fredo Durand, and Hanspeter Pfister. 2017. BubbleView: an interface for crowdsourcing image importance maps and tracking visual attention. ACM Transactions on Computer-Human Interaction (TOCHI) 24, 5 (2017), Article 36.

[23] Nam Wook Kim, Zoya Bylinskii, Michelle A. Borkin, Aude Oliva, Krzysztof Z. Gajos, and Hanspeter Pfister. 2015. A Crowdsourced Alternative to Eye-tracking for Visualization Understanding. In Proceedings of the 33rd Annual ACM Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA '15). ACM, New York, NY, USA, 1349–1354. DOI: http://dx.doi.org/10.1145/2702613.2732934

[24] Steven Komarov, Katharina Reinecke, and Krzysztof Z. Gajos. 2013. Crowdsourcing performance evaluations of user interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 207–216.

[25] Kyle Krafka, Aditya Khosla, Petr Kellnhofer, Harini Kannan, Suchendra Bhandarkar, Wojciech Matusik, and Antonio Torralba. 2016. Eye Tracking for Everyone. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2176–2184. DOI: http://dx.doi.org/10.1109/CVPR.2016.239

[26] Dmitry Lagun, Chih-Hung Hsieh, Dale Webster, and Vidhya Navalpakkam. 2014. Towards Better Measurement of Attention and Satisfaction in Mobile Search. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval (SIGIR '14). ACM, New York, NY, USA, 113–122. DOI: http://dx.doi.org/10.1145/2600428.2609631

[27] Dmitry Lagun and Mounia Lalmas. 2016. Understanding User Attention and Engagement in Online News Reading. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (WSDM '16). ACM, New York, NY, USA, 113–122. DOI: http://dx.doi.org/10.1145/2835776.2835833

[28] F. Lamberti, Gianluca Paravati, Valentina Gatteschi, and Alberto Cannavò. 2017. Supporting Web Analytics by Aggregating User Interaction Data From Heterogeneous Devices Using Viewport-DOM-Based Heat Maps. IEEE Transactions on Industrial Informatics (2017). DOI: http://dx.doi.org/10.1109/TII.2017.2658663

[29] Yixuan Li, Pingmei Xu, Dmitry Lagun, and Vidhya Navalpakkam. 2017. Towards Measuring and Inferring User Interest from Gaze. In Proceedings of the 26th International Conference on World Wide Web Companion (WWW '17 Companion). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 525–533. DOI: http://dx.doi.org/10.1145/3041021.3054182

[30] George W. McConkie and Keith Rayner. 1975. The span of the effective stimulus during a fixation in reading. Perception & Psychophysics 17, 6 (1975), 578–586. DOI: http://dx.doi.org/10.3758/BF03203972

[31] Peter O'Donovan, Aseem Agarwala, and Aaron Hertzmann. 2014. Learning Layouts for Single-Page Graphic Designs. IEEE Transactions on Visualization and Computer Graphics 20, 8 (Aug 2014), 1200–1213. DOI: http://dx.doi.org/10.1109/TVCG.2014.48

[32] Alexandra Papoutsaki, Patsorn Sangkloy, James Laskey, Nediyana Daskalova, Jeff Huang, and James Hays. 2016. WebGazer: Scalable Webcam Eye Tracking Using User Interactions. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI). AAAI, 3839–3845.

[33] Keith Rayner. 2014. The gaze-contingent moving window in reading: Development and review. Visual Cognition 22, 3-4 (2014), 242–258. DOI: http://dx.doi.org/10.1080/13506285.2013.879084

[34] Kerry Rodden, Xin Fu, Anne Aula, and Ian Spiro. 2008. Eye-mouse Coordination Patterns on Web Search Results Pages. In CHI '08 Extended Abstracts on Human Factors in Computing Systems (CHI EA '08). ACM, New York, NY, USA, 2997–3002. DOI: http://dx.doi.org/10.1145/1358628.1358797

[35] Dmitry Rudoy, Dan B. Goldman, Eli Shechtman, and Lihi Zelnik-Manor. 2012. Crowdsourcing gaze data collection. arXiv preprint arXiv:1204.3367 (2012).

[36] Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. 2008. LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision 77, 1-3 (2008), 157–173.

[37] Amaia Salvador, Axel Carlier, Xavier Giro-i Nieto, Oge Marques, and Vincent Charvillat. 2013. Crowdsourced Object Segmentation with a Game. In Proceedings of the 2nd ACM International Workshop on Crowdsourcing for Multimedia (CrowdMM '13). ACM, New York, NY, USA, 15–20. DOI: http://dx.doi.org/10.1145/2506364.2506367

[38] Michael Schulte-Mecklenbeck, Ryan O. Murphy, and Florian Hutzler. 2011. Flashlight - Recording Information Acquisition Online. Computers in Human Behavior 27, 5 (Sept. 2011), 1771–1782. DOI: http://dx.doi.org/10.1016/j.chb.2011.03.004

[39] Benjamin W. Tatler. 2007. The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions. Journal of Vision 7, 14 (2007), 4. DOI: http://dx.doi.org/10.1167/7.14.4

[40] Benjamin W. Tatler, Roland J. Baddeley, and Iain D. Gilchrist. 2005. Visual correlates of fixation selection: effects of scale and time. Vision Research 45, 5 (2005), 643–659. DOI: http://dx.doi.org/10.1016/j.visres.2004.09.017

[41] Pingmei Xu, Yusuke Sugano, and Andreas Bulling. 2016. Spatio-Temporal Modeling and Prediction of Visual Attention in Graphical User Interfaces. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI '16). ACM, New York, NY, USA, 3299–3310. DOI: http://dx.doi.org/10.1145/2858036.2858479

[42] Xucong Zhang, Michael Xuelin Huang, Yusuke Sugano, and Andreas Bulling. 2018. Training person-specific gaze estimators from user interactions with multiple devices. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI '18). ACM.