How to evaluate data visualizations across different levels of understanding
© 2020 IEEE. This is the author's version of the article that has been published in the proceedings of the IEEE Visualization conference. The final version of this record is available at: xx.xxxx/TVCG.201x.xxxxxxx/
Alyxander Burns, UMass Amherst ([email protected])
Cindy Xiong, Northwestern University ([email protected])
Steven Franconeri, Northwestern University ([email protected])
Alberto Cairo, University of Miami ([email protected])
Narges Mahyar, UMass Amherst ([email protected])
Figure 1: How do design decisions affect levels of understanding of the data in a visualization? We propose a systematic framework to evaluate data visualization across multiple facets of understanding based on Bloom's taxonomy: 1. Knowledge, 2. Comprehension, 3. Application, 4. Analysis, 5. Synthesis, 6. Evaluation. [The figure shows two alternative bar charts of COVID-19 cases in Hall, Gwinnett, Fulton, Cobb, and DeKalb counties, containing the same underlying data but organized differently.]

ABSTRACT
Understanding a visualization is a multi-level process. A reader must extract and extrapolate from numeric facts, understand how those facts apply to both the context of the data and other potential contexts, and draw or evaluate conclusions from the data. A well-designed visualization should support each of these levels of understanding. We diagnose levels of understanding of visualized data by adapting Bloom's taxonomy, a common framework from the education literature. We describe each level of the framework and provide examples for how it can be applied to evaluate the efficacy of data visualizations along six levels of knowledge acquisition: knowledge, comprehension, application, analysis, synthesis, and evaluation. We present three case studies showing that this framework expands on existing methods to comprehensively measure how a visualization design facilitates a viewer's understanding of visualizations. Although Bloom's original taxonomy suggests a strong hierarchical structure for some domains, we found few examples of dependent relationships between performance at different levels for our three case studies. If this level-independence holds across newly tested visualizations, the taxonomy could serve to inspire more targeted evaluations of levels of understanding that are relevant to a communication goal.
Index Terms: Human-centered computing—Visualization—Visualization design and evaluation methods
1 INTRODUCTION
When designers create visualizations for communication, they make choices about encoding and design that they think will accurately and persuasively communicate their interpretation of the data. The ultimate interpretation of a visualization depends on both the designer and the reader [19]. Because visualizations are being used to communicate data about real-world problems and phenomena, it is important for visualization designers and practitioners to validate if their design decisions are successful [44].

Consider the two graphs shown above in Figure 1, which contain the same underlying data and differ only in the way the bars are organized. What do you notice when you look at each? How might we systematically test the impact of this reorganization on a typical viewer's understanding? We could ask how quickly and accurately a viewer extracts specific points, but we might expect to see no difference between the charts. In contrast, we would expect differences in how viewers describe their overall takeaway message (perhaps "COVID-19 cases are decreasing" for the left and "COVID-19 cases are not decreasing everywhere" for the right) or extrapolate predictions ("there will be no new COVID-19 cases on May 10" for the left and "there will be some new cases on May 10" for the right). What other differences in understanding might you predict?

We sought a more systematic way to evaluate these levels of understanding and how each level might differ depending on a visualization's design – understanding is far more than extraction of individual values, after all. Policy makers rely on visualized data to decide which public project to fund. Health officials rely on visualized data to monitor and predict the trends of a pandemic. Investors rely on visualized data to evaluate the success of their businesses and identify future directions. The public sees visualizations in the news and must extrapolate how the data relate to their community. In all of these cases, viewers must synthesize the visualized information and evaluate its soundness, then extrapolate the knowledge gained to other situations. An effective visualization should facilitate these interactions with data.

Existing evaluation methods tend to focus on perceptual tasks such as retrieving values or comparing means [30]. The visualization community has acknowledged that these tasks are effective for measuring perceptual accuracy, but insufficient for evaluating the knowledge or insight obtained [28, 36]. Visualization researchers have since introduced techniques to evaluate understanding beyond perception, such as answering factual questions about visualizations [50], summarizing key messages [7], or reporting what was learned [49]. Though these techniques can evaluate some aspects of the knowledge-generation process, the field still lacks a systematic method for evaluating the affordances for understanding that are provided by visualizations. We have seen in the visualization community that frameworks for evaluation (such as [4]) can be helpful for systematizing the design of experiments, so we looked outside of the discipline for inspiration.

Inspired by evaluation techniques from the field of education [32], we expand on the existing methods in the visualization community and introduce a comprehensive framework for evaluation based on Bloom's 6-level taxonomy.
Though the original taxonomy suggests a strict hierarchy between the levels, our results are consistent with challenges to a strict hierarchical structure (e.g., [6]), finding some dependencies among levels, but also independence among others.

With our proposed framework, we systematically construct a set of questions, moving from testing a person's perceptual understanding of a visualization to eliciting how they would apply the learned information from the visualization to a real-world problem. We demonstrate the applicability of our proposed framework with three case studies comparing two alternative designs of real-world visualizations, selected because they represent several common chart types and were identified by the data visualization or data journalism communities as being confusing or misleading. We found that different versions of each chart afford different percepts and conclusions across particular levels of the taxonomy, allowing a systematic evaluation of similarities and differences between designs. We also found interesting relationships between the levels of understanding measured, which suggest some level of dependence between skills.

The contributions of this work are the following: (1) a novel framework that provides a systematic way to evaluate levels of understanding in a visualization; (2) three "in the wild" case studies that demonstrate how this method might work; and (3) a discussion of future work and improvements in this area of research.

2 MEASURING UNDERSTANDING IN VISUALIZATION
Within the visualization community, there are a variety of metrics used to assess particular aspects of understanding. Critically, this includes the common metrics of response speed and accuracy. While the relevance of the accuracy component is fairly straightforward, Chen et al. [13] argued that if visualizations are intended to make content easier to understand, then "saving time" is a useful metric for measuring success. While this is true, we argue that visualizations are not just intended to help people understand information more easily or quickly, but also to better afford aspects of understanding such as applying or evaluating the visualized information. Therefore, in order to truly evaluate the efficacy of a visualization, we need to move beyond purely quantitative measures and ask more difficult, open-ended questions [36] which can capture different information than closed-ended methods [37, 42].

There are several existing methods which in some way accomplish this goal. For example, one study asked participants to describe one thing they found interesting or surprising [49]. Collecting salient points is a very direct way of assessing what participants found important, but only requiring one insight may mean that the information reported is not necessarily indicative of all that they have learned. Further, while it measures what the perceived takeaway is, it is not able to assess other related levels of learning, such as how the knowledge could be applied to a new situation – a step critical to the learning process [40].

Another approach asked participants to describe everything that came to mind as they interacted with a visualization [37]. Due to the open-ended nature of this method, it is certainly thorough, but it may not be possible in remote settings and relies on the ability of the participant to verbalize their thoughts reliably. Additionally, one downside of the approach (as identified in the original paper) is the sheer volume of qualitative information that is generated [37]. Our approach considerably reduces this burden, condensing the amount of information collected into only six questions.

A third approach in this vein asked participants to describe different aspects of a visualization such as the overall message, trend, and sentiment [7]. By requiring participants to comment on each aspect of interest, this method is effective at assessing the breadth of knowledge obtained by participants, but it too misses the opportunity to assess other kinds of knowledge acquisition which might have revealed other differences in the stimuli.

As we have summarized, while some researchers are asking more difficult questions that assess particular aspects of understanding, these methods all lack a system for comprehensively evaluating different levels of understanding. Such concrete models can also help to measure affordances that are provided by visualizations to help readers obtain knowledge. This will enable visualization designers to assess their readers beyond graph literacy and create more effective visualizations for their audience. As visualization is still a relatively new field, we can take inspiration from the effective methods used in other disciplines.
3 MEASURING UNDERSTANDING IN OTHER DISCIPLINES
In other disciplines, there is a larger variety of methods used to assess understanding. In visual literacy research, questionnaires comprised of multiple-choice questions have been used to quantify how fluently people understand graphics in general (e.g., [30]) and in medical contexts (e.g., [22, 38]). We see evidence of similar use of questionnaires in Medicine and Chemistry to, for example, measure patients' understanding of Informed Consent documents [25] and assess the public's understanding of nanotechnology [45].

Other than quantitative questionnaires, research has also used more qualitative or open-ended measures. For example, benchmarks have been used to measure knowledge of topics important to the job performance of professional psychologists and to evaluate the way that this knowledge appears in on-the-job behavior [21]. In another approach, researchers used a set of open-ended questions about climate change to establish the effect of combining refutation texts with graphics and analogies on understanding [18].

Inspired by the work done in other disciplines and the need for a systematic way to measure knowledge-related affordances in visualization, we endeavored to create a method which combines open-ended questioning techniques, which have been shown to capture data that would otherwise have been lost [42], with a comprehensive survey of different aspects of understanding. Thankfully, education has provided us with a helpful framework to do just this – Bloom's six-level taxonomy of educational objectives. In 2015, Mahyar et al. [32] proposed the use of Bloom's taxonomy in information visualization to measure the depth of engagement and knowledge obtained by viewers. However, as of yet, we do not have a concrete description of how it can be applied or a demonstration of its efficacy. In Section 5, we describe Bloom's taxonomy in detail and provide specific examples of how this framework can be used to assess visualizations.
4 EXISTING TAXONOMIES IN VISUALIZATION
Existing taxonomies within the visualization community can be roughly divided into three categories: taxonomies of visualizations, objectives, and actions. Where taxonomies of visualizations classify types of data visualizations (e.g., [12, 41]), taxonomies of objectives categorize the questions that a user wants answered in order to solve a problem (e.g., [11, 14, 24, 33, 43, 48, 55]). Finally, taxonomies of actions classify concrete actions done in pursuit of an objective (e.g., [1, 4, 27, 43, 46]).

Our proposed framework is a taxonomy of objectives and is most similar to the taxonomy of analytic tasks described by Amar and Stasko [2, 3]. In both taxonomies, the categories of objectives correspond to types of tasks which need to be completed in decision-making practices. However, our proposed taxonomy is more comprehensive, expanding the scope of objectives to include both simpler tasks and the final conclusions drawn.
Some existing work also focuses on participant learning outcomes, but it cannot be used to form evaluation questions – only to evaluate the responses. This work quantifies the completeness and correctness of a response (e.g., in [7, 53]) or the complexity of a reported insight (e.g., in [49]) in order to evaluate open-ended responses in a post-hoc way.
5 BLOOM'S TAXONOMY
In 1948, a group of psychologists and teachers had an informal meeting at the American Psychological Association Convention, brought together by a shared problem. They wanted their students to understand their lessons, but they each had different ideas about what they meant by terms like "understand" or "internalize," leaving them with no way to compare those objectives [10]. Motivated by a desire to find a common, rigorous way to categorize student understanding, the group came up with the Taxonomy of Educational Objectives, described in a handbook published in 1956. The system is now commonly referred to as Bloom's taxonomy, named after Benjamin Bloom, the educational psychologist who edited the handbook.

In the years since its creation, Bloom's taxonomy has been used broadly in Education [29]. Educators can create activities and assessments that target specific levels of the taxonomy [5] or use it to evaluate the assessments they already use to better understand the objectives being tested and potentially inspire a broadening of those objectives [26]. In addition, Bloom's taxonomy has been applied widely in other fields such as Biology, Business, and Health Sciences, to create and evaluate exam questions [16], teach critical thinking skills [35], guide course creation on substance use disorders [34], and visualize the breadth of goals present while developing curriculum on visual literacy [20].

Drawing inspiration from the taxonomies of Biology, Bloom's taxonomy is intended to be hierarchical, meaning that learning at the higher levels is dependent on demonstrating mastery of lower levels. However, this is not necessarily a realistic representation of the learning process, which is not linear [9]. Therefore, although we will now describe the levels of Bloom's taxonomy in the order originally presented, we reject the assumption that a strict hierarchy exists between levels. Instead, we view them as complementary skills. For a quick reference to the six levels and how they can be used for evaluating visualization, refer to Table 1.
5.1 Knowledge

The Knowledge level historically describes the simplest learning objective demonstrable by the learner. Associated with verbs such as retrieve, identify, and recall, at this level a learner is able to accurately recall or recognize factual information that they have learned. Note that this does not require the learner to understand any contextual information or the reason behind facts [10].

This level, in its original context, describes a very simple learning task – reporting back something already seen. Therefore, for visualization, we translate this level into tasks which ask participants to locate and report specific pieces of information. In both the original and the translation, no transformation is applied to the information by the viewer and no understanding of context is required. In the experiment described later in this paper, we used this question to ask participants to locate specific data points, though other appropriate tasks might include copying text from an annotation layer or identifying what manipulations an interactive visualization offers.
5.2 Comprehension

At the Comprehension level, learners begin to understand the underlying information as a whole [10]. Traditionally, questions at this level ask learners to write summaries or identify key ideas [17] and use verbs such as describe, explain, and summarize.

When applied to visualizations, we can translate this level's focus on understanding information as a whole into tasks that ask about features present in a dataset as a whole. For example, we suggest asking for a general summary of the data or the key takeaway messages. Questions formed at this level can be more open-ended than those at the Knowledge level and, through this, can reveal the different conclusions afforded by the visualization being evaluated.
5.3 Application

At the Application level, learners apply their knowledge to solve an unfamiliar problem [10]. This level is commonly associated with verbs such as translate, solve, calculate, and apply.

For visualization, this level could be translated to tasks where the participant solves a problem using the data from the visualization, such as identifying a proportion and using it in a simple computation. This approach may be most appropriate when the response modality for the participant is restricted to text. In our experiment, we asked participants to determine the difference between two data values. For situations where there is not an obvious problem to be solved, this level can also be interpreted as translating knowledge from one form to another. For example, one could ask participants to translate the data displayed in a visualization into another visual style, though this approach requires a more complicated response modality.
5.4 Analysis

At the Analysis level, the learner is expected to break down a topic into parts and understand the relationship between each part [10]. This level is therefore associated with verbs such as classify, break down, associate, and relate.

Questions targeting this level could ask about trends, as this requires the participant to identify relevant components and then compare their spatial relationship to each other. We rely on this type of question in the experiment presented in this paper. Alternately, questions could also ask participants to identify which pieces of evidence were used to support a specific conclusion drawn from the data. This translation views the conclusion as the "topic" to be broken down and the data points as the components. Acquiring a conclusion from the data points requires understanding the relationship of the points to each other.
5.5 Synthesis

Where the focus of the Analysis level was to test the learner's ability to decompose a topic into its requisite parts, the Synthesis level focuses on the learner's ability to put ideas together to create something new [10]. Among the hierarchy, this is the first level which requires a certain amount of creativity from the learner and is associated with verbs such as create, invent, predict, and devise [10, 17].

When applied to visualizations, questions targeting this level could ask participants to make predictions about what values will come next in a sequence. In this translation, the participant takes existing trends and values and extrapolates from them to form a prediction. Alternately, in the case of interactive visualizations, participants could instead use interactive features to find a view of the data which reveals something new. Because we used static data visualizations in our experiment, we used the former approach and asked participants to make a prediction.
5.6 Evaluation

Finally, we arrive at the sixth and final level of Bloom's taxonomy – Evaluation. This level evaluates a learner's ability to judge the value of a topic or idea based on criteria that are either provided or self-derived [10]. Therefore, rather than judgements, tasks at this level may look more like arguments or proofs, as evidenced by the verbs often associated with this level, which include judge, justify, argue, and recommend [17].

The most straightforward translation of this level to an evaluation task might be to ask participants to judge the quality of the visualization itself by some provided criteria (e.g., reliability). This method might be appropriate if the experiment aims to evaluate the participant's understanding of the visual encoding of the visualization. Alternately, if the experiment aims to evaluate the participant's understanding of the underlying data, we suggest instead either asking participants to come to a conclusion and provide a data-based justification for that conclusion, or providing a conclusion and asking participants to only provide the justification. With this tactic, participants judge the value of data features when deciding which are appropriate to justify the conclusion drawn. We use this approach in our experiments.
Knowledge – Recall basic facts and definitions. Example tasks: retrieve points; locate a value; identify axis labels.
Comprehension – Understand the information in context. Example tasks: summarize the main message/takeaway; describe the content of the visualization; explain the topic of the visualization.
Application – Apply knowledge to a new problem or represent it differently. Example tasks: use a percentage and total population to calculate a number; calculate the difference between two points; translate the data in a chart to a table.
Analysis – Break down a concept into parts and understand their relationship. Example tasks: describe a trend; describe the relationship between two variables; identify what data was used to come to a conclusion.
Synthesis – Use knowledge to create something new. Example tasks: predict a future value; generate a new visual representation.
Evaluation – Judge the value of information, backed by evidence. Example tasks: justify a conclusion based on data; judge which design is more appropriate.

Table 1: This table presents the 6 levels present in the original Bloom's taxonomy [10], a short description of each, and example tasks specific to the visualization community.
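To make Table 1 concrete before describing our experiment, the levels and example tasks can be organized as a simple per-chart question battery. The following is only an illustrative sketch of how such a battery might be structured in code; the templates, slot names, and wording are our own choices, not part of the framework or the study materials.

```python
# Illustrative sketch: Table 1's levels expressed as reusable question
# templates, filled in with one chart's specifics (all names hypothetical).
BLOOM_BATTERY = {
    "Knowledge": "What is the value of {series} at {point}?",
    "Comprehension": "How would you describe the data shown here, in your own words?",
    "Application": "What is the difference between {point_a} and {point_b}?",
    "Analysis": "How has {series} changed over {dimension}?",
    "Synthesis": "What value would you expect for {series} at {future_point}?",
    "Evaluation": "What conclusion would you argue for, and what evidence would you cite?",
}

def instantiate(slots: dict) -> dict:
    """Fill each template with one chart's specifics; unused slots are ignored."""
    return {level: q.format(**slots) for level, q in BLOOM_BATTERY.items()}

covid_questions = instantiate({
    "series": "the number of COVID-19 cases in Fulton County",
    "point": "Apr 28",
    "point_a": "May 1",
    "point_b": "Apr 28",
    "dimension": "time",
    "future_point": "May 10",
})

# Present the questions in taxonomy order, one level at a time.
for level, question in covid_questions.items():
    print(f"[{level}] {question}")
```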
6 EXPERIMENT
To demonstrate the kind of results an experiment could obtain with our evaluation method, we evaluated three pairs of visualizations as case studies.
For our stimuli, we selected three static, real-world data visualizations varying in complexity and design that were identified by the internet community as being confusing or misleading [15, 31, 52]. We also collected three corresponding redesigns of the confusing visualizations, created either by one of the authors, who is an expert in visualization design, or by the original visualization designer at The Economist [31], where the goal of the redesign was to clarify the message conveyed by the visualization.
Markets
The first stimulus (shown on the left in Figure 2) was highlighted in an online article from The Economist titled "Mistakes, we've drawn a few" [31]. The original chart depicts the US trade deficit with China and the number of people in the US employed in manufacturing between 1995 and 2016 as two line plots. The cardinal sin committed by this chart is its double y-axes – one positive and one negative, color-coded to match its associated line. In the article, the author presents a redesigned version of the chart that separates the two plotted quantities into separate, side-by-side charts and encodes the trade deficit using bars connected to the 0 baseline to emphasize the directionality of the plot.
Immigration
The second stimulus was produced by The Globe and Mail to discuss differences in immigrant populations in Thunder Bay, Ontario, and Canada (as shown on the right in Figure 2). The original version of the chart features three stacked bars, one per location. Each stack is divided into five sections, corresponding to the decade when people immigrated to Canada. At first glance, this chart might be confusing because the sections correspond to time – a variable conventionally reserved for the x-axis.
Figure 2: The Markets (left) and Immigration (right) charts used in our experiment. The original versions are on top.
Unlike the first chart, the structural problem with this chart is harder to identify. If the purpose of the chart is to highlight the differences in distribution or visualize the aspect of time, the original chart does a poor job. However, if the primary purpose of this chart is to highlight differences between the total percentage of immigrants in each area, the original chart might accomplish that goal. For the purpose of this paper, we assumed that the designer of this chart intended to show the differences between the distributions of immigrant year of arrival, and we emphasized this message in our redesigned chart by unstacking the bars and recoloring the chart such that each location uses a different color (instead of each decade).
COVID-19
The final chart we selected was a now-retracted chart created by the Georgia Health Department [23] depicting the number of COVID-19 cases reported over two weeks in the five counties with the largest number of cases (see Figure 3). The original version of this chart at first looks uncomplicated. It's just a simple bar chart, but a close reading of the labels along the x-axis reveals the problem – the dates are not in chronological order. Instead, both the days and the counties within each day are sorted by severity.

As made obvious by its retraction, if this chart was intended to show how the number of cases of COVID-19 changed over time, this chart is plainly misleading. However, like the chart on Canadian immigration, this chart would be appropriate for answering a different question. Namely: which days saw the largest or smallest number of COVID-19 cases reported, and which counties had the highest or lowest number of cases within each of those days? Unfortunately, this message is not supported by the annotation layer of the chart, which uses the phrase "the number of cases over time." For comparison, one of the authors created a second version of this chart with its dates in chronological order and with the bars divided by county to make it easier to compare how the number of cases changed over time within each region.
We conducted our experiment on Amazon's Mechanical Turk and recruited a total of 60 workers (Mean age = ….4, SD age = ….7, 18 women, 41 men, 1 other). In line with past research on acquiring quality results without attention check questions, we required that workers had completed at least 100 tasks with an approval rate of at least 95% [39]. The experiment took approximately 30 minutes and participants were compensated with $5.00 for their time.

In this experiment, participants were shown 3 charts and asked to answer 6 questions based on each one. Each question was designed to target a specific level of Bloom's taxonomy and was presented in order, beginning with Knowledge and ending with Evaluation. The order of charts was determined via a 2 by 3 Latin Square design (yielding 6 unique chart orderings and thus 6 conditions; see the sketch at the end of this section). Each participant saw 1 version of each chart in accordance with their assigned condition. In 3 of the conditions, participants saw 2 of the original charts and 1 redesigned chart, while participants assigned to the other 3 conditions saw 1 original and 2 redesigned charts.

We want to emphasize that the purpose of using the alternative designs in this study was to show the range of reader interpretations and visualization affordances our method could evaluate, rather than to generate concrete design guidelines from these case studies. Because of this, we note that our analysis utilizes an exploratory approach – not formal hypothesis testing.
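As a concrete illustration of the counterbalancing described above, the following is a minimal sketch of one way to generate the six conditions. The exact orderings and version patterns used in the study are an assumption here, and all names are hypothetical.

```python
# Hypothetical sketch: build 6 conditions from 6 unique chart orderings,
# pairing half with (2 originals + 1 redesign) and half with the reverse,
# then assign participants round-robin for balance.
from itertools import permutations

topics = ["Markets", "Immigration", "COVID-19"]
orders = list(permutations(topics))  # 6 unique chart orderings

patterns = [
    ("original", "original", "redesign"),  # 3 of the conditions
    ("redesign", "redesign", "original"),  # the other 3 conditions
]
conditions = [(order, patterns[i % 2]) for i, order in enumerate(orders)]

def assign(participant_id: int):
    """Round-robin assignment of a participant to one of the 6 conditions."""
    order, versions = conditions[participant_id % len(conditions)]
    return list(zip(order, versions))

for pid in range(6):
    print(pid, assign(pid))
```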
7 RESULTS
Figure 3 provides the questions used for the COVID-19 chart and a summary of results for all 6 levels.
7.1 Knowledge

We begin our analyses with the first question, which asked participants to locate particular values in the chart. We constructed a logistic regression predicting response accuracy with visualization design (original vs. redesigned) across the three chart topics. We found that participants were more likely to answer the question correctly when they saw the redesigned visualization compared to the original (χ²(…) = …, p < …; χ²(…) = …, p = .10) (see Supplemental materials for pair-wise odds ratios). This suggests that participants were retrieving values from the redesigned visualizations more accurately compared to the original version. Further, as the visualization was redesigned following common guidelines to increase clarity and afford more accurate value retrieval, this result suggests that the Knowledge level questions were successful in capturing the improvement brought by the redesign.
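For readers who want to reproduce this style of analysis, here is a minimal sketch of a comparable logistic regression, assuming a long-format response table. The column names and file name are hypothetical, not taken from the study materials.

```python
# Hypothetical sketch: logistic regression predicting Q1 accuracy from
# chart design (original vs. redesigned) across the three chart topics.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# One row per participant x chart: correct (0/1), design, topic.
df = pd.read_csv("q1_responses.csv")

model = smf.logit("correct ~ C(design) * C(topic)", data=df).fit()
print(model.summary())

# Exponentiated coefficients give odds ratios, comparable to the
# pair-wise odds ratios reported in supplemental materials.
print(np.exp(model.params))
```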
7.2 Comprehension

Questions targeting the Comprehension level were open-ended without defined correct or incorrect answers. Therefore, to analyze these responses, we read through the responses blind to the version seen and identified varying categories of conclusions. Each response was tagged as either containing or not containing each conclusion. This approach was selected in order to compare differences in patterns identified between the two versions of each chart. We note that while this approach was chosen for demonstration, this kind of analysis is not specific to our proposed method; many existing techniques for analyzing qualitative results could be appropriate here, such as content-based analyses or interpretive analyses (see [8] for some common practices). We conducted exploratory analysis using Pearson's chi-squared tests of independence to examine the relation between visualization design and user-identified salient patterns, with Yates' continuity correction and Bonferroni adjustments. See the Supplemental materials for details.

As shown in Figure 3, participants who saw the original and redesigned versions of the COVID chart described salient features with varying frequencies. For example, they showed a trend toward being more likely to identify and compare the counties included in the redesigned chart (χ²(1, N = …) = …, p = …; χ²(1, N = …) = …, p = …).

7.3 Application

Questions at this level challenged participants to determine the numerical difference between two specific points. As with Question 1, we constructed a logistic regression predicting response accuracy with the original and redesigned visualization across the three charts. Unlike with Question 1, participants who viewed the redesigned version of a chart were not statistically more likely to answer this question correctly (χ²(…) = …, p = …; χ²(…) = …, p = .23), and there was no significant interaction (χ²(…) = …, p = …).
7.4 Analysis

As with Question 2, questions for level 4 were open-ended. Participants were asked to describe a specific trend present in each chart. We identified a series of descriptions for each chart that either were mentioned by the participants or were determined by the authors as reasonable conclusions to mention. Their responses ranged from describing the direction of the trend (e.g., up or down) to commenting on modality. We coded each response blind to the chart version, tagging each description as present or not (e.g., was the trend described as positive or not). We then compared the frequency of descriptions between the two chart versions. We note that, as before, this is not the only way to complete this analysis, but it was selected for demonstrative purposes. We conducted chi-square analysis to examine the relation between chart design and chart descriptions, similar to that in Section 7.2.
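The coded-response comparisons in Sections 7.2, 7.4, and 7.6 all follow the same recipe. Here is a minimal sketch, assuming hand-coded binary tags per response; the code names and file name are hypothetical.

```python
# Hypothetical sketch: compare how often each coded description appears
# under the original vs. redesigned chart, using Pearson's chi-squared
# test with Yates' continuity correction and a Bonferroni adjustment.
import pandas as pd
from scipy.stats import chi2_contingency

# One row per response: design, plus one 0/1 column per coded description.
df = pd.read_csv("coded_responses.csv")
codes = ["trend_down", "multiple_peaks", "county_comparison"]

m = len(codes)  # family size for the Bonferroni adjustment
for code in codes:
    table = pd.crosstab(df["design"], df[code])  # 2x2 contingency table
    # correction=True applies Yates' continuity correction for 2x2 tables.
    chi2, p, dof, _ = chi2_contingency(table, correction=True)
    print(f"{code}: chi2({dof}) = {chi2:.2f}, "
          f"Bonferroni-adjusted p = {min(p * m, 1.0):.3f}")
```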
Q1 – Knowledge: Which county had the largest number of cases in 1 day during the period of time shown in the chart?
Q2 – Comprehension: Imagine that you want to explain the content of this chart to a friend without showing them the chart. How would you describe the data shown here, in your own words?
Q3 – Application: How many fewer cases did Hall County have on May 1 than Apr 28?
Q4 – Analysis: How have the number of COVID-19 cases in Fulton County changed over time?
Q5 – Synthesis: Given the data here, how many cases would you expect there to be in DeKalb County on May 10?
Q6 – Evaluation: If you were a lawmaker in this US state, what policy change would you use this information to argue for and what evidence would you provide?

Figure 3: A summary of the stimuli and questions used in the experiment for the COVID charts. Charts on the right show results for each level examined. [Result panels show the % of correct responses by version for Q1 and Q3; the % of responses containing each topic for Q2 and for Q6 (Increased Cases, Decreased Cases, Open Up, Don't Open, Current is Effective); the % of responses containing each trend description for Q4 (Varied, Down, Sharp End, Gradual End, Multiple Peaks); and the distribution of predicted values for Q5, where most participants predicted 0 cases, possibly directly from the count on May 9, and more participants provided higher values with the redesign.] Participants answered Question 1 correctly significantly more often when using the redesigned chart than the original. This does not hold for Question 3, however. When describing the contents of the chart (Q2), participants viewing the redesigned chart were more likely to talk about the chart on a county level, but otherwise answered similarly. Participants more often incorrectly classified the trend of the chart as going down when viewing the original version and were less likely to comment on the bi-modal shape (Q4). We observed no effect of the version on the prediction made about the number of cases (Q5) and, unexpectedly, participants argued for "opening up" and cited decreasing cases as frequently in both groups (Q6).
In response to the COVID-19 chart, we can see two distinct points of difference in the distributions (as shown in Figure 3). First, although participants were equally likely to describe the trend direction as containing both positive and negative sections (χ²(1, N = …) = …, p = …), participants who viewed the original version more often incorrectly classified the trend as going down (χ²(1, N = …) = …, p = …) and were less likely to comment on its bi-modal shape (χ²(1, N = …) = …, p = …). Responses to the Immigration chart likewise differed in how often participants correctly described the trend of immigration as increasing (χ²(1, N = …) = …, p = …).

7.5 Synthesis

At this level, participants were asked to make predictions about future data values and trends for each chart. By looking at the distributions of the predicted values, we begin to unpack the decision-making process afforded by each chart and version. In particular, we can glimpse both where the design had an effect on the average prediction made, as well as on the variance of those predictions. To identify statistical differences, we utilized Welch's two-sided t-tests with Bonferroni corrections to compare mean predictions and two-sided F-tests with Bonferroni corrections to compare variances. For this analysis, we only included responses that contained a single, numeric answer, excluding those with ranges or that described a trend. This excluded 17% of responses from the COVID-19 chart, 0% from Immigration, and 7% from the Markets charts.
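A minimal sketch of this pair of tests follows, assuming two arrays of single-value numeric predictions. The toy data and names are hypothetical; SciPy has no built-in two-sided variance F-test, so it is assembled from the F distribution directly.

```python
# Hypothetical sketch: Welch's t-test on mean predictions and a two-sided
# F-test on their variances, each with a Bonferroni correction.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
original = rng.poisson(5, size=28).astype(float)  # toy predictions (original)
redesign = rng.poisson(6, size=27).astype(float)  # toy predictions (redesign)

# Welch's two-sided t-test on the mean predictions (unequal variances).
t_stat, p_t = stats.ttest_ind(original, redesign, equal_var=False)

# Two-sided F-test on the variances: form the variance ratio and double
# the smaller one-tailed probability.
f_stat = np.var(original, ddof=1) / np.var(redesign, ddof=1)
dfn, dfd = len(original) - 1, len(redesign) - 1
p_one = stats.f.sf(f_stat, dfn, dfd) if f_stat > 1 else stats.f.cdf(f_stat, dfn, dfd)
p_f = min(2 * p_one, 1.0)

# Bonferroni correction across the three chart topics compared.
n_comparisons = 3
print(f"Welch t = {t_stat:.2f}, adjusted p = {min(p_t * n_comparisons, 1.0):.3f}")
print(f"F({dfn}, {dfd}) = {f_stat:.2f}, adjusted p = {min(p_f * n_comparisons, 1.0):.3f}")
```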
For the COVID chart, participants were asked to predict the number of COVID-19 cases for one day beyond the dates shown in the chart. Results showed no significant difference between the mean predictions (M = […, …], SD = […, …], t(…) = …, p = 1) nor the variance of the predictions made (F(…, …) = …, p = …). We similarly found no significant difference in the variance of predictions for Thunder Bay (F(…, …) = …, p = 1) or Ontario (F(…, …) = …, p = …) on the Immigration chart, nor in its mean predictions (M = […, …], SD = […, …], t(…) = −…, p = …; F(…, …) = …, p = …). For the Markets chart, mean predictions did not differ significantly between versions (M = […, …], SD = […, …], t(…) = …, p = …), though one variance comparison did reach significance (F(…, …) = …, p < …).

7.6 Evaluation

In the sixth and final question, participants were asked to apply their learning to a real-world situation by describing the argument that they would make. Similar to Sections 7.2 and 7.4, we identified common conclusions and evidence that was cited, tagged each response as containing or not containing mentions of each of these topics, and used the same chi-square analysis approach.

In response to the COVID-19 charts, the most common conclusion drawn by participants was for "opening up" (see Figure 3). This conclusion was mentioned by participants regardless of which version of the COVID-19 chart they were presented with (χ²(1, N = …) = …, p = …), and the same held for the other commonly mentioned conclusions and evidence, including citing decreasing cases (χ²(1, N = …) = …, p = …; χ²(1, N = …) = …, p = …; χ²(1, N = …) = …, p = …; χ²(1, N = …) = …, p = …).

7.7 Relationships Between Levels

The original Bloom's taxonomy argues that there is a strict hierarchical relationship between the levels [10]. To shed some light on this idea, we looked for evidence that would suggest that performance on earlier questions is correlated with performance on later ones. Because Questions 1 and 3 had correct answers, we first examined if success on Question 1 was correlated with success on Question 3. We constructed a logistic regression and observed that participants who answered Question 1 correctly were significantly more likely to answer Question 3 correctly compared to those who answered Question 1 wrong (χ²(…) = …, p < …); other cross-level comparisons were also examined (χ²(…) = …, p = …; χ²(…) = …, p = …; χ²(…) = …, p = …).

8 DISCUSSION
The visualization community needs better ways to evaluate what visualizations afford with respect to the understanding obtained by viewers. Being able to identify where affordances differ is critical to making smart design choices that enable different aspects of the knowledge acquisition process to occur. As we have demonstrated in this paper, the six levels of understanding from Bloom's taxonomy can provide a useful framework for generating new questions which comprehensively evaluate the affordances of visualizations across a spectrum of tasks and reveal differences in knowledge-making affordances that might otherwise have been missed.

Although we did not find differences between versions for every chart topic on every level, our evaluation method was able to capture some affordance differences, confirming the effectiveness of commonly-held beliefs about design and revealing under-explored directions in affordance evaluation. This method can complement perceptual tasks involving speed and accuracy by measuring what information readers can extract from visualizations, and it provides a systematic framework for designers to assess various levels of a reader's understanding.
8.1 Limitations and Future Work
We did not find significant differences between all design pairs for every level of understanding. There are several factors which might explain why we observed null effects in our case studies. First, the alternative visualizations were designed by domain experts, who may hold a different perspective from the average crowdsourced worker in interpreting visualizations [54]. The reason we observed no differences in worker responses could have been because there exists no significant difference in affordances between the designs in the eyes of an average worker, despite what the experts thought. While it is likely that prior knowledge contributes to how well one performs on our task, we emphasize that we chose these alternative designs for the study to demonstrate the range of reader interpretations and visualization affordances that this method could evaluate, rather than to generate concrete design guidelines from these case studies.

Additionally, we recognize that participants' ability to state a correct answer on the Knowledge, Application, and Analysis levels may depend on their numeracy skills. We recommend that future researchers who use this method to evaluate visualization affordances also include a participant numeracy evaluation in their experimental design (such as the one proposed in [51]). Even without numeracy evaluation, we maintain that it is important to evaluate visualizations at the Knowledge, Application, and Analysis levels to test whether the reader has correctly interpreted the visualization at a grammatical level. Further, the Application question is important because it is a measure of learning transfer, which is an important but challenging aspect of the learning process. People often fail to transfer learning to a novel context [40], which could explain why we did not observe differences in accuracy between chart versions at the Application level – this transfer task was difficult enough to overpower any effect of visualization design. Finally, we recognize that the Evaluation level does not capture individual differences or biases in beliefs that may drive differing responses.

While we maintain that it is a worthwhile effort to apply Bloom's taxonomy to user-study task generation, we recognize that it does not always translate perfectly. First, Bloom's taxonomy was intended to evaluate learning that takes place over a much longer span than a typical person spends looking at a visualization. However, the levels in this taxonomy also relate to processes inherent in decision-making procedures, which can occur over such short periods of time. Additionally, while we tried to capture the spirit of the levels of the original taxonomy, our translations may not measure identical skills. Some of the levels (particularly the upper levels) seem difficult to evaluate in a purely textual format, but may be more easily translated to creation or editing tasks.

Future iterations of this line of work could extend the list of example tasks in this paradigm to evoke specific, detailed responses that help designers and researchers gain insights regarding how a visualization reader is reacting to a visualization. We see strong potential for this method to evolve into a useful technique supplementing in-person interviews in remote environments where in-depth interviews may be difficult or impossible.
Future researchers could also combine this evaluation method with other measures of graphical, linguistic, or numerical literacy to generate a more comprehensive evaluation method, or use this method to identify concrete design guidelines. Alternatively, researchers could diversify the data analysis approaches to extend our method beyond just evaluating affordances to cover graphical literacy or numeracy. For example, one could compare the participants' trend predictions in the Synthesis (prediction) task to a ground truth to determine the accuracy of their prediction, which could inform researchers about their numeracy skills (e.g., how well the participant could interpolate/extrapolate trends in data). Additionally, although in our paradigm we only compared two alternative designs of the same chart, this framework is flexible enough to allow for single or multi-chart comparisons.
9 CONCLUSION
Motivated by a desire to design data visualizations that communicate information accurately and effectively, the visualization community has long asked for novel ways to measure what is understood by the reader, and through this, what is afforded by visualizations. In this paper, we proposed a concrete framework grounded in Bloom's taxonomy and demonstrated how it can be used to form a set of questions that systematically evaluate the kinds of affordances provided by visualizations. We demonstrated how the framework can be used through 3 case studies of real-world visualizations and showed that our comprehensive method was able to identify understanding-related affordances that would have been missed by existing methods of evaluation that focus on accuracy and speed. While it may not be appropriate to apply every level covered in the taxonomy to evaluate every visualization, this framework can help the community design questions that target specific aspects of the knowledge acquisition process that were previously unexplored or challenge existing assumptions about what makes one design choice "better" or "worse" than another. Finally, it allows us, for the first time, to systematically evaluate affordances in a way that is consistent with educational theories of learning and comparable across studies.

REFERENCES
[1] J. Ahn, C. Plaisant, and B. Shneiderman. A task taxonomy for network evolution analysis. IEEE Transactions on Visualization and Computer Graphics, 20(3):365–376, 2014.
[2] R. Amar and J. Stasko. A knowledge task-based framework for design and evaluation of information visualizations. In IEEE Symposium on Information Visualization, pp. 143–150, 2004.
[3] R. A. Amar and J. T. Stasko. Knowledge precepts for design and evaluation of information visualizations. IEEE Transactions on Visualization and Computer Graphics, 11(4):432–442, 2005.
[4] N. Andrienko and G. Andrienko. Exploratory Analysis of Spatial and Temporal Data: A Systematic Approach. Springer Science & Business Media, 2006.
[5] J. B. Arneson and E. G. Offerdahl. Visual literacy in Bloom: Using Bloom's taxonomy to support visual learning skills. CBE—Life Sciences Education, 17(1):ar7, 2018.
[6] S. N. Bagchi and R. Sharma. Hierarchy in Bloom's taxonomy: An empirical case-based exploration using MBA students. Journal of Case Research, 5(2), 2014.
[7] S. Bateman, R. L. Mandryk, C. Gutwin, A. Genest, D. McDine, and C. Brooks. Useful junk? The effects of visual embellishment on comprehension and memorability of charts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 2573–2582. Association for Computing Machinery, New York, NY, USA, 2010.
[8] P. Bazeley. Qualitative Data Analysis: Practical Strategies. Sage, 2013.
[9] R. Berger. Here's what's wrong with Bloom's taxonomy: A deeper learning perspective, 2018. https://blogs.edweek.org/edweek/learning_deeply/2018/03/heres_whats_wrong_with_blooms_taxonomy_a_deeper_learning_perspective.html
[10] B. S. Bloom. Taxonomy of Educational Objectives: The Classification of Educational Goals. Longman, 1956.
[11] M. Brehmer and T. Munzner. A multi-level typology of abstract visualization tasks. IEEE Transactions on Visualization and Computer Graphics, 19(12):2376–2385, 2013.
[12] A. Buja, D. Cook, and D. F. Swayne. Interactive high-dimensional data visualization. Journal of Computational and Graphical Statistics, 5(1):78–99, 1996.
[13] M. Chen, L. Floridi, and R. Borgo. What Is Visualization Really For?, pp. 75–93. Springer International Publishing, Cham, 2014.
[14] M. Chen and A. Golan. What may visualization processes optimize? IEEE Transactions on Visualization and Computer Graphics, 22(12):2619–2632, 2016.
[15] S. Collins. Georgia's COVID-19 cases aren't declining as quickly as initial data suggested they were, May 2020.
[16] A. Crowe, C. Dirks, and M. P. Wenderoth. Biology in Bloom: Implementing Bloom's taxonomy to enhance student learning in biology. CBE—Life Sciences Education, 7(4):368–381, 2008.
[17] J. Dalton and D. Smith. Extending Children's Special Abilities: Strategies for Primary Classrooms. Office of Schools Administration, Ministry of Education, Victoria, 1989.
[18] R. Danielson, G. Sinatra, and P. Kendeou. Augmenting the refutation text effect with analogies and graphics. Discourse Processes, 53, 2016.
[19] M. Dörk, P. Feng, C. Collins, and S. Carpendale. Critical InfoVis: Exploring the politics of visualization. In CHI '13 Extended Abstracts on Human Factors in Computing Systems, pp. 2189–2198. Association for Computing Machinery, New York, NY, USA, 2013.
[20] V. Faccin-Herman. Visual literacy education: Developing a curriculum for designers and nondesigners. Master's thesis, Iowa State University, Ames, Iowa, 2020.
[21] N. Fouad, C. Grus, R. Hatcher, N. Kaslow, P. Hutchings, M. Madson, F. Collins Jr., and R. Crossman. Competency benchmarks: A model for understanding and measuring competence in professional psychology across training levels. Training and Education in Professional Psychology, 3, 2009.
[22] M. Galesic and R. Garcia-Retamero. Graph literacy: A cross-cultural comparison. Medical Decision Making, 31(3):444–457, 2011.
[23] Georgia Department of Public Health. Georgia Department of Public Health daily status report, 2020. https://dph.georgia.gov/covid-19-daily-status-report
[24] A. J. Pretorius, H. C. Purchase, and J. T. Stasko. Tasks for multivariate network analysis. In A. Kerren, H. C. Purchase, and M. O. Ward, eds., Multivariate Network Visualization, vol. 8380. Springer, Cham, 2014.
[25] S. Joffe, E. F. Cook, P. D. Cleary, J. W. Clark, and J. C. Weeks. Quality of informed consent: A new measure of understanding among research subjects. JNCI: Journal of the National Cancer Institute, 93(2):139–147, 2001.
[26] K. O. Jones, J. Harland, J. M. V. Reid, and R. Bartlett. Relationship between examination questions and Bloom's taxonomy. In Proceedings of the 39th IEEE Frontiers in Education Conference, pp. 1–6, 2009.
[27] N. Kerracher, J. Kennedy, and K. Chalmers. A task taxonomy for temporal graph visualisation. IEEE Transactions on Visualization and Computer Graphics, 21(10):1160–1172, 2015.
[28] R. Kosara. An empire built on sand: Reexamining what we think we know about visualization. In Proceedings of the Sixth Workshop on Beyond Time and Errors on Novel Evaluation Methods for Visualization, pp. 162–168. Association for Computing Machinery, New York, NY, USA, 2016.
[29] D. R. Krathwohl. A revision of Bloom's taxonomy: An overview. Theory Into Practice, 41(4):212–218, 2002.
[30] S. Lee, S.-H. Kim, and B. C. Kwon. VLAT: Development of a visualization literacy assessment test. IEEE Transactions on Visualization and Computer Graphics, 23(1):551–560, 2017.
[31] S. Leo. Mistakes, we've drawn a few, 2019. https://medium.economist.com/mistakes-weve-drawn-a-few-8cdd8a42d368
[32] N. Mahyar, S.-H. Kim, and B. C. Kwon. Towards a taxonomy for evaluating user engagement in information visualization. In Workshop on Personal Visualization: Exploring Everyday Life, vol. 3, p. 2, 2015.
[33] P. Murray, F. McGee, and A. G. Forbes. A taxonomy of visualization tasks for the analysis of biological pathway data. BMC Bioinformatics, 18(2):21, 2017.
[34] A. J. Muzyk, C. Tew, A. Thomas-Fannin, S. Dayal, R. Maeda, N. Schramm-Sapyta, K. Andolsek, and S. Holmer. Utilizing Bloom's taxonomy to design a substance use disorders course for health professions students. Substance Abuse, 39(3):348–353, 2018.
[35] N. Nentl and R. Zietlow. Using Bloom's taxonomy to teach critical thinking skills to business students. College & Undergraduate Libraries, 15(1-2):159–172, 2008.
[36] C. North. Toward measuring visualization insight. IEEE Computer Graphics and Applications, 26(3):6–9, 2006.
[37] C. North, P. Saraiya, and K. Duca. A comparison of benchmark task and insight evaluation methods for information visualization. Information Visualization, 10(3):162–181, 2011.
[38] Y. Okan, E. Janssen, M. Galesic, and E. A. Waters. Using the short graph literacy scale to predict precursors of health behavior change. Medical Decision Making, 39(3):183–195, 2019.
[39] E. Peer, J. Vosgerau, and A. Acquisti. Reputation as a sufficient condition for data quality on Amazon Mechanical Turk. Behavior Research Methods, 46(4):1023–1031, 2014.
[40] D. Perkins and G. Salomon. Transfer of learning. In The International Encyclopedia of Education. Pergamon Press, 2nd ed., 1992.
[41] D. Pfitzner, V. Hobbs, and D. Powers. A unified taxonomic framework for information visualization. In Proceedings of the Asia-Pacific Symposium on Information Visualisation - Volume 24, pp. 57–66. Australian Computer Society, Inc., AUS, 2003.
[42] U. Reja, K. L. Manfreda, V. Hlebec, and V. Vehovar. Open-ended vs. close-ended questions in web questionnaires. Developments in Applied Statistics, 19(1):159–177, 2003.
[43] A. Rind, W. Aigner, M. Wagner, S. Miksch, and T. Lammarsch. Task cube: A three-dimensional conceptual space of user tasks in visualization design and evaluation. Information Visualization, 15(4):288–300, 2016.
[44] B. S. Santos. Evaluating visualization techniques and tools: What are the main issues? In the 2008 AVI Workshop on Beyond Time and Errors: Novel Evaluation Methods for Information Visualization (BELIV '08), 2008.
[45] K. J. Schönborn, G. E. Höst, and K. E. Lundin Palmerius. Measuring understanding of nanoscience and nanotechnology: Development and validation of the nano-knowledge instrument (NanoKI). Chem. Educ. Res. Pract., 16:346–354, 2015.
[46] H. Schulz, T. Nocke, M. Heitzler, and H. Schumann. A design space of visualization tasks. IEEE Transactions on Visualization and Computer Graphics, 19(12):2366–2375, 2013.
[47] M. Streit and N. Gehlenborg. Bar charts and box plots. Nature Methods, 11, 2014.
[48] E. R. A. Valiati, M. S. Pimenta, and C. M. D. S. Freitas. A taxonomy of tasks for guiding the evaluation of multidimensional visualizations. In Proceedings of the 2006 AVI Workshop on BEyond Time and Errors: Novel Evaluation Methods for Information Visualization, BELIV '06, pp. 1–6. Association for Computing Machinery, 2006.
[49] J. Walny and S. Carpendale. Data sketches: An exploratory study. In InfoVis 2014 Posters Compendium, 2014.
[50] Z. Wang, S. Wang, M. Farinella, D. Murray-Rust, N. Henry Riche, and B. Bach. Comparing effectiveness and engagement of data comics and infographics. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–12. Association for Computing Machinery, New York, NY, USA, 2019.
[51] J. A. Weller, N. F. Dieckmann, M. Tusler, C. Mertz, W. J. Burns, and E. Peters. Development and testing of an abbreviated numeracy scale: A Rasch analysis approach. Journal of Behavioral Decision Making, 26(2):198–212, 2013.
[52] WTF Visualizations, September 2019. https://viz.wtf/post/187558414596/the-axis-choices-are-interesting-this-thing-is
[53] C. Xiong, J. Shapiro, J. Hullman, and S. Franconeri. Illusion of causality in visualized data. IEEE Transactions on Visualization and Computer Graphics, 26(1):853–862, 2019.
[54] C. Xiong, L. van Weelden, and S. Franconeri. The curse of knowledge in visual data communication. IEEE Transactions on Visualization and Computer Graphics, 2019.
[55] M. X. Zhou and S. K. Feiner. Visual task characterization for automated visual discourse synthesis. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 392–399, 1998.