Expectation Versus Reality: The Failed Evaluation of a Mixed-Initiative Visualization System
Sunwoo Ha*, Adam Kern†, Melanie Bancilhon‡, and Alvitta Ottley§

*Ha is with Washington University in St. Louis: [email protected]
†Kern is with MIT Lincoln Laboratory: [email protected]
‡Bancilhon is with Washington University in St. Louis: [email protected]
§Ottley is with Washington University in St. Louis: [email protected]

Figure 1: An overview of the system with visualization components labelled. Component 1 corresponds to the hover tooltip; component 2 corresponds to the information card; component 3 corresponds to the legend and filter.

ABSTRACT
Our research aimed to present the design and evaluation of a mixed-initiative system that aids the user in handling complex datasets and dense visualization systems. We attempted to demonstrate this system with two trials of an online between-groups, two-by-two study, measuring the effects of the mixed-initiative system on user interactions and system usability. However, due to flaws in the interface design and the expectations that we put on users, we were unable to show that the adaptive system had an impact on user interactions or system usability. In this paper, we discuss the unexpected findings from our "failed" experiments and examine how we can learn from our failures to improve further research.
INTRODUCTION
Evaluation in visualization studies shows how visual aids can support analysts, researchers, and all who must interact with data of all types in understanding, synthesizing, and communicating that data.
Historically, this has been accomplished using controlled, in-person laboratory experiments following practices from the psychology and broader HCI community. With calls for more ecologically valid evaluations that examine actions from a more representative user study population, there has been a relatively recent tendency to collect data via crowdsourced platforms in place of an undergraduate student population. Researchers have successfully replicated pioneering studies using platforms such as Amazon Mechanical Turk [3], and have produced more generalizable findings that involve a more extensive and diverse population (see [6] for a comprehensive survey of the prior work). However, there are a few caveats. Mechanical Turk data is notoriously noisy, and researchers typically need to use techniques such as attention checks, ground truths, and interaction analysis for quality assurance [12].

In this paper, we report the expectations, findings, and lessons learned from two "failed" Mechanical Turk experiments in which we aimed to evaluate a mixed-initiative visualization system. The research agenda was motivated by the need to improve data exploration for "small" but crowded data visualizations. In particular, for high-density data, visualizing every data point can lead to overplotting and information overload. Although there are several methods for reducing visual clutter, such as filtering and sampling [15], these methods largely focus on "big data." Information overload can still occur in small data settings, and naive data reduction methods applied to small samples can remove elements that are important to the user or exaggerate irrelevant points.

Our research aimed to manage overplotting by presenting a mixed-initiative information visualization system. The design uses a hidden Markov model algorithm, developed by Ottley et al. [18], to capture and predict user attention. The visualization then responds by emphasizing the points that likely fit the user's interest. After conducting two trials of a large-scale crowdsourced experiment to study the effect of the system described in this paper, we found no evidence to support our hypotheses. Furthermore, our analysis revealed two significant and unexpected findings:

• Participants just wanted to hover. We used clicking to trigger the visualization adaptation. However, the subjects in our online experiment overwhelmingly interacted with the visualization by hovering.

• Open-ended tasks were not appropriate for our online studies. A vast majority of our online participants failed to provide the quality of feedback that we expected and that was produced by our in-person pilot studies.

MIXED-INITIATIVE SYSTEM OVERVIEW
Our proposed mixed-initiative system was straightforward. We utilized the algorithm presented in Ottley et al. [18], which captures and predicts users' attention, to promote potential points of interest in the user's interface. Consider, for example, a visualization with large amounts of occlusion, where datapoints overlap and partially or fully obscure each other. The user would typically use zooming to handle this occlusion, but this solution has its problems and limitations. When the visualization is completely zoomed out, many of the datapoints are partially or entirely occluded, making large trends hard to see. When the visualization is completely zoomed in, focal points can be seen in their entirety, but such a close view can cause the analyst to lose context [11]. Our visualization system responds by adaptively re-drawing datapoints, bringing datapoints that the user is likely to be interested in to the foreground and sending "uninteresting" datapoints to the background (as seen in Figure 2). In doing so, we hoped to create a system that allows for an informative overview, encourages easier exploration, and reduces the need for visual transformations.

Figure 2: The changing z-order of the pins as the user clicks, with a particular interest in Mexican restaurants (coded in brown).
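To make the adaptation loop concrete, the following TypeScript sketch is a deliberately simplified stand-in for the hidden Markov model of Ottley et al. [18]: it keeps a heuristic per-category interest score that grows with each click and re-orders the pins so that likely-relevant points are drawn in the foreground. The Pin shape, the interest map, and the decay factor are our assumptions for illustration only, not the published algorithm or the deployed system's code.

```typescript
// Hypothetical sketch of the adaptation loop; the real system uses the
// hidden Markov model of Ottley et al. [18] rather than this heuristic.
interface Pin {
  id: string;
  category: string;    // e.g., a cuisine type or a crime type
  element: SVGElement; // the rendered map pin
}

const interest = new Map<string, number>(); // per-category interest score
const DECAY = 0.9;                          // older clicks matter less (assumed)

function onPinClick(clicked: Pin, pins: Pin[]): void {
  // Decay all scores, then reward the clicked pin's category.
  for (const [cat, score] of interest) interest.set(cat, score * DECAY);
  interest.set(clicked.category, (interest.get(clicked.category) ?? 0) + 1);

  // Re-draw: pins with higher predicted interest go to the foreground
  // (SVG paints later siblings on top), "uninteresting" pins recede.
  const byInterest = [...pins].sort(
    (a, b) => (interest.get(a.category) ?? 0) - (interest.get(b.category) ?? 0)
  );
  const parent = byInterest[0]?.element.parentNode;
  for (const pin of byInterest) parent?.appendChild(pin.element);
}
```

In SVG, z-order follows document order, so moving an element to the end of its parent is enough to bring it to the front; an equivalent effect could be achieved by re-rendering with whatever plotting library the interface uses.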
The system's interface is displayed in Figure 1. Our experiment used two datasets, Toronto restaurants [21] and St. Louis crimes [16]. We used pins to mark the location of each data point on the map and color-coded them according to categories. For the Toronto dataset, the color of a pin indicates the main cuisine of a given restaurant, and for the St. Louis crime data, the color of a pin indicates the type of crime. Details of points are provided in two forms: (1) a tooltip that appears on hover, and (2) "information cards" that are added to the sidebar on click. The hover tooltip, as seen in segment 1 of Figure 1, shows just a few details, such as the restaurant's name, rating, and price, or the crime's type and description. The information cards, on the other hand, show all the attributes of a given datapoint, as seen in segment 2 of Figure 1. The information cards also provide two more modes of interaction. The first is "View", which transitions the viewport to center and zoom in on the datapoint corresponding to that information card and temporarily enlarges the selected datapoint to bring the user's attention to the pin. The second is "Delete", which removes the card from the sidebar. This allows the user to keep a running list of datapoints of interest and refer back to them on the map on demand. A legend at the bottom right of the screen also serves as a filter, as seen in segment 3 of Figure 1.
TESTING OUR SYSTEM
We conducted two online experiments to determine the effect of a real-time adaptive system on user interactions and system usability. For each study, we recruited 200 participants via Amazon's Mechanical Turk. Our experiment used the two datasets introduced in the previous section and two conditions (responsive and unresponsive) to study the effects of adaptive systems, creating four groups:

1. Responsive Toronto, then Unresponsive St. Louis
2. Responsive St. Louis, then Unresponsive Toronto
3. Unresponsive Toronto, then Responsive St. Louis
4. Unresponsive St. Louis, then Responsive Toronto

If a user is in the responsive session of the experiment, the system adapts the visualization to the user's interests. In the unresponsive session, the system does not change the visualization at all as the user interacts.
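Operationally, this two-by-two between-groups design amounts to assigning each participant one of the four orderings uniformly at random. The sketch below only illustrates that assignment step; the type and function names (Session, assignGroup) are ours and are not taken from the study software.

```typescript
type Dataset = "Toronto" | "StLouis";
type Mode = "responsive" | "unresponsive";
interface Session { dataset: Dataset; mode: Mode }

// The four counterbalanced orderings used in the study design.
const GROUPS: [Session, Session][] = [
  [{ dataset: "Toronto", mode: "responsive" },   { dataset: "StLouis", mode: "unresponsive" }],
  [{ dataset: "StLouis", mode: "responsive" },   { dataset: "Toronto", mode: "unresponsive" }],
  [{ dataset: "Toronto", mode: "unresponsive" }, { dataset: "StLouis", mode: "responsive" }],
  [{ dataset: "StLouis", mode: "unresponsive" }, { dataset: "Toronto", mode: "responsive" }],
];

// Assign a participant to one of the four groups uniformly at random.
function assignGroup(): [Session, Session] {
  return GROUPS[Math.floor(Math.random() * GROUPS.length)];
}
```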
Task design is critical to the success of an evaluation [17]. As a result, we carefully considered the evaluation of our system and explored a variety of task taxonomies (e.g., [2] and [22]). Ultimately, we wanted to focus on exploratory data analysis. Specifically, we distinguish between bottom-up exploration and top-down exploration. Bottom-up explorations "are driven in reaction to the data" [1] or "may be triggered by salient visual cues" [14]. Top-down explorations, on the other hand, are based on a high-level goal or hypothesis [5, 14]. We settled on an open-ended task because we wanted to observe the users' instinctual behavior. We conducted a series of in-person pilot studies to determine the best phrasing for open-ended task prompts.
At the start of the experiment, the participants were randomly assigned to one of the four groups. Inspired by a laboratory study on latency by Liu and Heer [14], the participants were first asked to "take some time to interact with the dataset in front of [them], exploring the data and gathering insights". Once the participants felt that they were familiar with the dataset, they were given the opportunity to write down as many (or as few) insights as they would like. The participants were primed for interaction and insight gathering. Before the experiment started, examples of insights were shown to users (e.g., "There are more kid-friendly coffee shops Downtown than there are Uptown"). The goal of this priming was to introduce the participants to the visualization and make explicit the idea of an "insight" without biasing the user during either segment of the experiment. In this way, users were free to perform exploratory data analysis without guidance or restriction, creating a general-purpose task that can demonstrate the flexibility of the adaptive system. Additionally, the reward structure of the experiment was designed to encourage this insight-gathering: participants were awarded $1 for participating and $0.50 for every insight gathered.

After recording all the insights that they found, the participants completed the System Usability Scale (SUS) [8], a widely used, "robust and versatile tool for usability professionals" [4], with an added comments section at the end of each survey for general comments from users. Upon completion of the survey, the participants continued on to the second condition/visualization, which followed the same procedure as the first.

As the participants interacted with the system, we captured every mouse interaction: clicks, hovers, zooms, pans, views, deletes, and filter toggles. To separate intentional from unintentional hovers, we only recorded hovers with a duration of at least 250 milliseconds. Additionally, we captured all insights, interaction time, and survey responses.
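The 250-millisecond rule can be implemented by timestamping mouseenter and only logging on mouseleave once the threshold has passed. The following is our reconstruction of that logging rule, not the study's actual instrumentation; the event shape and the logEvent sink are assumed.

```typescript
// Hypothetical reconstruction of the hover logging rule: a hover is only
// recorded if the cursor stays on a pin for at least 250 ms.
const HOVER_THRESHOLD_MS = 250;

interface InteractionEvent {
  type: "click" | "hover" | "zoom" | "pan" | "view" | "delete" | "filter";
  pinId?: string;
  timestamp: number;
}

const log: InteractionEvent[] = [];
const logEvent = (e: InteractionEvent) => log.push(e);

function attachHoverLogging(pin: SVGElement, pinId: string): void {
  let enteredAt = 0;
  pin.addEventListener("mouseenter", () => { enteredAt = performance.now(); });
  pin.addEventListener("mouseleave", () => {
    const duration = performance.now() - enteredAt;
    if (duration >= HOVER_THRESHOLD_MS) {
      logEvent({ type: "hover", pinId, timestamp: Date.now() });
    }
  });
}
```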
WHAT WE EXPECTED
Before publishing our study on Mechanical Turk, we had some expectations of our users:

• One of the main ways that users would interact with our system would be through clicking on data points.
• The users would be able to provide useful insights about the data.
• The responsive system would elicit fewer zooms and more insights.

Overall, we hoped to see the mixed-initiative system have a significant positive impact on the users' interactions and subjective feedback.

THE REALITY OF STUDY 1

In our first study, hovering over a data point revealed a tooltip, as shown in Figure 1. Clicking on a point added a card to the sidebar that showed a historical log of click interactions.

Figure 3: A graph from our first experiment showing how often users clicked on the visualization within a condition. As the number of clicks increases on the x-axis, the percentage of users who clicked at least x times decreases. The dotted line represents the users whose click interactions were within the top 20% of their cohort.

Participants spent, on average, 27 minutes exploring the two datasets and recording their insights.

People wanted to hover.
Unsurprisingly, we observed a large variance in the number of interactions. However, on average, participants clicked on 9 out of 2915 points on the Toronto map and 7 out of 1951 points on the St. Louis map. Figure 3 plots the percentage of users by the number of clicks performed during a session. Since we used clicks as input to the machine learning algorithm that tracks and predicts future interactions, our responsive conditions were largely ineffective. We found no evidence that the adaptive system impacted analysis, and there was no indication that participants even noticed the adaptation. For further analysis, we narrowed the dataset to a group of users whose interactions met the expectations of the experimental design (41 participants remained). Table 1 displays the average number of each type of interaction, spread across datasets and conditions.
              Toronto                      St. Louis
              Responsive  Unresponsive     Responsive  Unresponsive
Clicks        7.9         12.6             11.8        11.0
Hovers        52.0        41.0             46.4        38.6
Zooms         57.4        25.4             45.8        35.6
Insights      5.3         4.8              5.5         6.2

Table 1: Average values for interaction data frequency from the first trial.
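For context, the curve in Figure 3 (the percentage of users who clicked at least x times) is a complementary cumulative distribution over per-participant click counts. A small sketch of that computation, using made-up counts rather than the study data:

```typescript
// Given one click count per participant, compute the percentage of
// participants who clicked at least x times, for x = 0..maxClicks.
function clickCCDF(clickCounts: number[]): number[] {
  const n = clickCounts.length;
  const maxClicks = Math.max(...clickCounts);
  const ccdf: number[] = [];
  for (let x = 0; x <= maxClicks; x++) {
    const atLeastX = clickCounts.filter((c) => c >= x).length;
    ccdf.push((100 * atLeastX) / n);
  }
  return ccdf;
}

// Illustrative values only, not study data. One reading of the dotted line
// in Figure 3 is the smallest x at which the curve drops to 20% or below.
const counts = [0, 2, 9, 3, 40, 7, 1, 15];
const curve = clickCCDF(counts);
const top20Cutoff = curve.findIndex((pct) => pct <= 20);
```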
Insights were shallow.
We coded the insights and categorized them as deep and shallow based on the amount of information that they contain. A shallow insight is an observation attained only through minimal interactions. For example,
ID893: "There is a large concentration of Chinese restaurants in one area."
ID950: "A majority of the crimes committed are theft-related."

A deep insight requires building knowledge of the dataset through interactions. For example,
ID821: "There must be a Chinatown, or Asian-American populated area down Spadina Avenue between and including Dundas Street West, College Street, and Beverley Street."
ID795: "There are incidences of larceny scattered all over the area. Most have to do with automobiles, but burglaries near the river tend to involve burglaries in businesses and buildings."

Participants in our study entered a total of 779 insights, of which 747 were shallow and 32 were classified as deep. We believed that the lack of differences between the conditions was due to a flaw in the system design: the information given on-hover was nearly identical to the information given on-click. It is possible that this information parity gave users no incentive to click on data points, other than to keep their information in a persistent state on the sidebar. This was particularly problematic in the context of our experiment, which relied on user clicks to trigger the experimental condition.

THE REALITY OF STUDY 2

Due to the flaws in the interface design and the overall null results, we reran the experiment to address the design missteps that caused the discrepancy between expectations and reality. We kept much of the experiment design from Experiment 1 but made two minor changes:

1. We removed the tooltip on hover. This meant that participants only saw details on click, and clicking also triggered the system adaptation.
2. We added additional guidance for formulating insights. In addition to the examples detailed in Section 3.2, the instructions dissuaded shallow insights by stating "Obvious insights like 'There are a lot of coffee shops' will not be rewarded the bonus."

In this second study, participants spent an average of 29 minutes exploring the two datasets and recording their insights.
People did not want to click.
We observed a moderate increase in the number of clicks. On average, participants clicked on 40 out of 2915 points on the Toronto map and 23 out of 1951 points on the St. Louis map. However, we observed a pattern similar to Experiment 1: 110 out of 200 participants clicked on fewer than 5 data points. An analysis of the remaining 90 participants revealed inconclusive results (see Table 2).
              Toronto                      St. Louis
              Responsive  Unresponsive     Responsive  Unresponsive
Clicks        36.5        63.0             31.0        21.2
Hovers        50.8        67.6             48.1        41.2
Zooms         58.5        54.6             38.6        49.2
Insights      4.8         5.3              5.6         4.7

Table 2: Average values for interaction data frequency from the second trial.
Insights were shallow again.
Similar to the click finding, we observed an increase in the number of insights and a moderate improvement in the quality of insights. Participants in the second study entered a total of 1045 insights, of which 966 were shallow and 79 were classified as deep.

LESSONS LEARNED
The finding that people may not interact with a visualization in the way that we expect them to is not new. In the storytelling realm, Boy et al. [7] found that participants in their web-based field experiments did not engage with visualizations as expected. Reports from The New York Times suggest that users prefer scrolling as a means of interaction, which has led them to reconsider their investment in interactive visualization [20]. These are only anecdotal results; however, along with the findings of the "failed" user studies in this paper, they echo the sentiments of Lam [13], who encourages designers to weigh the cost of interaction against its potential gains.

Many researchers believe that "the purpose of visualization is insight" [9]. In our studies, we opted for open-ended tasks and captured insights in addition to quantitative measures. Although there is prior work that defines [10] and characterizes [19] insights, insight-based evaluation methods, especially for online studies, are not well established. Many of the existing studies (e.g., [15] and [19]) captured insights in laboratory settings. In addition, the open-ended nature of our tasks made it difficult to filter out participants who were clicking through to get paid. This potentially highlights an important limitation of online studies and provides suggestive evidence that open-ended tasks may not be appropriate for this experimental setting.

Not all of our findings were negative. The fact that users found the visualization well-designed, yet were overwhelmed by the size and density of the datasets, indicates that the chosen datasets worked well to induce a need for an assistive or mixed-initiative agent to help users make sense of the data. When designing this adaptive system, we were nervous that constantly updating the visualization would be disorienting to the user, especially given failed examples of adaptive systems like the notorious Microsoft Office Clippy. However, the usability results for both responsive and unresponsive conditions were overwhelmingly positive, and we saw no impact on usability for the responsive conditions. Overall, these lessons learned are motivating, and we believe that we can move forward to create and evaluate a more rigorous and robust mixed-initiative system that actively supports the user during exploration.
FUTURE DIRECTIONS
The failures we discussed open up new directions for our adaptive system.

1. In both experiments, we saw that a majority of the users preferred to hover to interact and gain insights. Like clicks, hovers could give us a better understanding of the users' interests in real time. A possible way to improve the system would be to incorporate hovers along with clicks into the algorithm presented in Ottley et al. [18] (see the sketch after this list).
2. Again, in both experiments, we had difficulty obtaining insights that were deep and of good quality. It would be interesting to see if a mix of closed- and open-ended tasks would help increase the quality of insights from Mechanical Turk users. The closed-ended tasks would be asked first to get the users familiar with interacting with the system. Then, the users would be free to explore and complete the open-ended tasks.
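For the first direction, one lightweight way to fold hovers into the click-driven model would be to treat them as weaker evidence of interest than clicks. The sketch below is purely illustrative: the 0.3 weight and the function names are our assumptions, and a real extension would have to be integrated with, and validated against, the model in Ottley et al. [18].

```typescript
// Hypothetical weighting scheme: hovers count as weaker evidence than clicks.
const CLICK_WEIGHT = 1.0;
const HOVER_WEIGHT = 0.3; // assumed value; would need tuning and validation

const categoryInterest = new Map<string, number>();

function observeInteraction(category: string, kind: "click" | "hover"): void {
  const weight = kind === "click" ? CLICK_WEIGHT : HOVER_WEIGHT;
  categoryInterest.set(category, (categoryInterest.get(category) ?? 0) + weight);
}

// Example: three hovers on Mexican restaurants accumulate roughly as much
// evidence as one click on an Italian restaurant.
observeInteraction("Mexican", "hover");
observeInteraction("Mexican", "hover");
observeInteraction("Mexican", "hover");
observeInteraction("Italian", "click");
```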
CONCLUDING REMARKS
We designed a mixed-initiative system and attempted to investigate its effect on user interaction and system usability. It is tempting to say that the results we found support the null hypothesis and to conclude that "the adaptive system did nothing". However, it is more accurate to say that we do not have enough evidence to reject the null hypothesis. Why is this a crucial distinction? Because it does not destroy any hope of an effective mixed-initiative system. It is important to note that these results should not dissuade researchers from further work on mixed-initiative systems like the one we designed. Although we were unsuccessful in obtaining the data we expected, we uncovered significant and unexpected findings about our users. When developing these systems, it is easy to assume that the users will interact the way that we want them to. Since we assumed that users would mainly show interest by clicking on the points, our system was unable to aid the users when their interactions did not meet our expectations. We have learned from our failures and hope that the VIS community can learn from them too.

ACKNOWLEDGMENTS
This work was supported in part by the National Science Foundation under Grant No. 1755734.

REFERENCES

[1] S. Alspaugh, N. Zokaei, A. Liu, C. Jin, and M. A. Hearst. Futzing and moseying: Interviews with professional data analysts on exploration practices. IEEE Transactions on Visualization and Computer Graphics, 25(1):22–31, 2018. doi: 10.1109/TVCG.2018.2865040
[2] R. Amar, J. Eagan, and J. Stasko. Low-level components of analytic activity in information visualization. IEEE Symposium on Information Visualization, p. 7, 2005.
[3] Amazon. Mechanical Turk, 2020.
[4] A. Bangor, P. T. Kortum, and J. T. Miller. An empirical evaluation of the System Usability Scale. International Journal of Human–Computer Interaction, 24(6):574–594, 2008. doi: 10.1080/10447310802205776
[5] L. Battle and J. Heer. Characterizing exploratory visual analysis: A literature review and evaluation of analytic provenance in Tableau. Computer Graphics Forum, 38(3):145–159, 2019. doi: 10.1111/cgf.13678
[6] R. Borgo, L. Micallef, B. Bach, F. McGee, and B. Lee. Information visualization evaluation using crowdsourcing. In Computer Graphics Forum, vol. 37, pp. 573–595. Wiley Online Library, 2018.
[7] J. Boy, F. Detienne, and J.-D. Fekete. Storytelling in information visualizations: Does it engage users to explore data? In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 1449–1458, 2015.
[8] J. Brooke. SUS - A quick and dirty usability scale. Usability Evaluation in Industry, 189(194):4–7, 1996.
[9] M. Card. Readings in Information Visualization: Using Vision to Think. Morgan Kaufmann, 1999.
[10] R. Chang, C. Ziemkiewicz, T. M. Green, and W. Ribarsky. Defining insight for visual analytics. IEEE Computer Graphics and Applications, 29(2):14–17, 2009.
[11] M. R. Jakobsen and K. Hornbæk. Interactive visualizations on large and small displays: The interrelation of display size, information space, and scale. IEEE Transactions on Visualization and Computer Graphics, 19(12):2336–2345, 2013. doi: 10.1109/TVCG.2013.170
[12] R. Kosara and C. Ziemkiewicz. Do Mechanical Turks dream of square pie charts? In Proceedings of the 3rd BELIV'10 Workshop: Beyond Time and Errors: Novel Evaluation Methods for Information Visualization, pp. 63–70, 2010.
[13] H. Lam. A framework of interaction costs in information visualization. IEEE Transactions on Visualization and Computer Graphics, 14(6):1149–1156, 2008.
[14] Z. Liu and J. Heer. The effects of interactive latency on exploratory visual analysis. IEEE Transactions on Visualization and Computer Graphics, 20(12):2122–2131, 2014. doi: 10.1109/TVCG.2014.2346452
[15] Z. Liu, B. Jiang, and J. Heer. imMens: Real-time visual querying of big data. Computer Graphics Forum, 32(3):421–430, 2013. doi: 10.1111/cgf.12129
[16] Metropolitan Police Department of St. Louis. St. Louis Crime Data, 2017.
[17] T. Munzner. A nested model for visualization design and validation. IEEE Transactions on Visualization and Computer Graphics, 15(6):921–928, 2009. doi: 10.1109/TVCG.2009.111
[18] A. Ottley, R. Garnett, and R. Wan. Follow the clicks: Learning and anticipating mouse interactions during exploratory data analysis. Computer Graphics Forum, 38(3):41–52, 2019. doi: 10.1111/cgf.13670
[19] P. Saraiya, C. North, and K. Duca. An insight-based methodology for evaluating bioinformatics visualizations. IEEE Transactions on Visualization and Computer Graphics, 11(4):443–456, 2005.
[20] A. Tse. Why we are doing fewer interactives. The New York Times, 2016.
[21] Yelp. Yelp Open Dataset, 2018.
[22] J. S. Yi, Y. a. Kang, J. Stasko, and J. Jacko. Toward a deeper understanding of the role of interaction in information visualization. IEEE Transactions on Visualization and Computer Graphics, 13(6):1224–1231, 2007. doi: 10.1109/TVCG.2007.70515