Learning to Automate Chart Layout Configurations Using Crowdsourced Paired Comparison
Aoyu Wu, Liwenhan Xie, Bongshin Lee, Yun Wang, Weiwei Cui, Huamin Qu
Aoyu Wu
Hong Kong University of Science and Technology
[email protected]

Liwenhan Xie
Hong Kong University of Science and Technology
[email protected]

Bongshin Lee
Microsoft Research
[email protected]

Yun Wang
Microsoft Research
[email protected]

Weiwei Cui
Microsoft Research
[email protected]

Huamin Qu
Hong Kong University of Science and Technology
[email protected]
ABSTRACT
We contribute a method to automate parameter configurations for chart layouts by learning from human preferences. Existing charting tools usually determine the layout parameters using predefined heuristics, producing sub-optimal layouts. People can repeatedly adjust multiple parameters (e.g., chart size, gap) to achieve visually appealing layouts. However, this trial-and-error process is unsystematic and time-consuming, without a guarantee of improvement. To address this issue, we develop Layout Quality Quantifier (LQ2), a machine learning model that learns to score chart layouts from paired crowdsourcing data. Combined with optimization techniques, LQ2 recommends layout parameters that improve the charts' layout quality. We apply LQ2 on bar charts and conduct user studies to evaluate its effectiveness by examining the quality of layouts it produces. Results show that LQ2 can generate more visually appealing layouts than both laypeople and baselines. This work demonstrates the feasibility and usages of quantifying human preferences and aesthetics for chart layouts.

CCS CONCEPTS
• Human-centered computing → Visualization design and evaluation methods; Visualization toolkits.
KEYWORDS
Machine Learning, Visualization, Crowdsourced, Visual Design, Image Quality Assessment
ACM Reference Format:
Aoyu Wu, Liwenhan Xie, Bongshin Lee, Yun Wang, Weiwei Cui, and Huamin Qu. 2021. Learning to Automate Chart Layout Configurations Using Crowdsourced Paired Comparison. In Proceedings of ACM Conference on Human Factors in Computing Systems (CHI '21). ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
CHI '21, May 08-13, 2021, Online Virtual
© 2021 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn
INTRODUCTION

Data visualizations are ubiquitous in everyday life, appearing in social media, magazines, and websites. They are widely used by the general public to express complex data in an intuitive, concise, and visually appealing manner. However, creating effective and elegant visualizations is a challenging task even for professionals [52]. Individuals usually need to engage in a time-consuming process to craft designs that clearly convey information and insights while satisfying aesthetic goals. As such, there have been huge efforts from both industry and research communities to aid the design process with automated approaches.

Existing approaches have predominantly focused on studying and optimizing performance metrics for data analytics concerning usability and utility. For example, commercial software such as Excel automatically recommends chart types based on selected data. Besides, much recent research proposes automated visualization systems that retain data integrity [28], highlight interesting data facts [37], and recommend effective visual encodings [14, 29, 47]. Nevertheless, those systems utilize pre-defined heuristics to generate visual styles, which could be sub-optimal (Figure 1). This paradigm results in a quasi-automated process where individuals need to manually adjust the visual style of the automatically generated charts (e.g., [63, 70]). However, performing manual adjustments can be unsystematic and difficult, especially for lay users without design backgrounds [73]. Users might be unaware of guidance or find it tedious to adjust multiple parameters simultaneously. To address this problem, we aim to propose a systematic data-driven approach that recommends parameter configurations by learning from crowdsourced human preference data.
Particularly, we study layouts because the layout is a fundamental element of chart design [61]. We focus on bar charts, which are one of the most common chart types [2].

Despite the increasing acknowledgement of the importance of visual styles in charts [9, 24, 33, 46, 54], little work has attempted to understand and quantify layout qualities through large-scale user studies. Research in graphic design has provided various layout metrics such as alignment and segmentation [49], but they are not readily applicable to charts, which are data-driven and yield different visual perception [7, 66]. There is a lack of empirical studies to understand metrics for chart layouts. This is challenging due to the subjective nature of layout qualities, which requires a large number of participants to score charts. The scores, however, might
Figure 1: Visualization tools such as Microsoft Excel utilize a default heuristic to generate layouts: A the chart with five bars; B increasing the number of bars results in a chart with the same size but different bandwidths. Those layouts have room for improvement through manual refinement.

not be precise since participants might be hesitant and find it difficult to give an accurate score [55, 64]. Besides, the scoring scales can be inconsistent among participants [21]. Those limitations constrain the reliability of utilizing the scores as benchmarks for machine learning models. To that end, our approach is inspired by the successful applications of pair-wise ranking for assessing natural image qualities [35]. We propose a two-alternative forced-choice experiment [17], asking participants to select the better chart layout between two candidates. This data acquisition method allows us to obtain more precise and consistent results [6, 55].

We propose a novel approach, called Layout Quality Quantifier (LQ2), for learning to score and rank chart layout configurations from human preference through crowdsourced pair-comparison experiments. LQ2 utilizes neural networks to predict the score of an individual chart by taking comparison pairs as training data. LQ2 predicts the pair-wise ranking with an accuracy of 78%, showing that it can reasonably learn human preference for layout configurations. We further interpret the trained model by investigating the impact of layout parameters on human preference, thereby summarizing rules of thumb for layout configurations in bar charts. Finally, quantitative user studies demonstrate that LQ2 can recommend more visually appealing layouts than manual results by laypeople and default styles in Excel and Vega-Lite [57]. Overall, our work demonstrates the possibility of quantifying human aesthetics for charts. We open source all our code and experimental material.
In summary, our contributions are as follows:

• A novel approach for quantifying human preference for chart layouts through crowdsourced paired comparison
• A machine-learning method, LQ2, for ranking and scoring layout configurations in bar charts
• A set of qualitative and quantitative evaluations as well as two user studies that demonstrate the effectiveness and usefulness of LQ2

RELATED WORK

Our work is related to aesthetics for visualizations, automated visualization designs, as well as data collection and training for visualization research.

https://github.com/shellywhen/LQ2

In a broader sense, our work is related to the aesthetic qualities of data visualizations. In the book
Information is Beautiful, McCandless [45] lists aesthetics as one of the four criteria for a good visualization. However, aesthetics were traditionally considered an add-on feature that was typically implemented at the very end of the design process. Already 13 years ago, Cawthon and Moere [9] argued for increased recognition of visualization aesthetics by demonstrating the relationships between aesthetics and usability in data visualizations. Since then, many empirical studies have shown that the aesthetics of data visualizations can contribute to various factors such as first impressions [24], memorability [5], emotional engagement [33], and task performance [54].

Nevertheless, little work has studied what makes a data visualization visually appealing. Moere et al. [46] demonstrated that visual styles could lead to different comments regarding aesthetics. Quispel et al. [53] found that laypeople were attracted to designs they perceived as familiar and easy to use. However, these studies investigate aesthetics as a qualitative reflection of personal judgment rather than a quantifiable and comparable entity. Human preferences for aesthetics in the context of charts are still not methodically quantifiable from a data-driven perspective, and seem underrepresented in large-scale empirical studies. We address this gap by proposing a systematic machine learning approach for ranking and scoring layout qualities from crowdsourced experiment data. Besides, we propose an approach for interpreting the trained model, whereby we summarize speculative hypotheses that warrant future empirical research to confirm.
Recently, there has been growing interest in applying machine learning methods to automated visualization design. Researchers have proposed many systems [14, 29, 44, 47] that recommend visualizations based on data structures and characteristics. Those systems focus on deciding the effective chart type, visual encoding, and data transformation. In addition to effectiveness, much research has been devoted to optimizing visualizations from other aspects. For example, VisuaLint [28] addresses data integrity by surfacing chart construction errors such as truncated axes. DataShot [70] and Calliope [59] focus on generating visualizations with interesting data-related facts from tabular data. Dziban [41] attempts to balance automated suggestions with user intent by preserving similarities with anchored visualizations. Different from them, our work aims to improve the aesthetic quality of chart layouts.

Researchers have also recently proposed many approaches for improving visualization layouts. Several works automatically extract reusable layout templates from visualizations [11, 12] and infographics [43]. Nevertheless, they do not propose metrics for the extracted layouts. Other systems optimize the layout according to various metrics such as mobile-friendliness [71], similarity with user-input layouts [63], and graph features [23, 69]. However, their metrics are not derived from empirical studies and therefore might not reflect the overall perceived quality [3]. Therefore, our work studies how to quantify and optimize layout qualities from crowdsourcing experiments.
[Figure 2: an example bar chart annotated with the layout parameters, and the table of sampled values below.]

Parameter         Exp. 1                   Exp. 2
Orientation       horizontal               horizontal | vertical
Label Rotation    0°                       0° | -45° | -90°
Max Label Length  -                        3 - 15, step = 1
Aspect Ratio      1 - 3.67, step = 0.33    0.5 - 2, step = 0.25
Bandwidth         0.1 - 1, step = 0.15     0.3 - 1, step = 0.1
Number of Bars    5 - 30, step = 1         5 - 25, step = 1
Figure 2: The layout parameters in Experiments 1 and 2. The table displays the sampled values of each parameter. Experiment 1 concerns 3 parameters with 1,575 possible combinations, while Experiment 2 includes 6 parameters with 87,360 combinations.
Recent years have witnessed a growing recognition of data collection to facilitate research in machine learning for visualizations. Many efforts have been made to collect real datasets of charts from websites [2, 29, 50] and scientific literature [10, 39]. Those datasets include annotations or original data as ground-truth labels and assume that the charts share the same quality. Therefore, another line of research conducts crowdsourcing user studies to obtain quality metrics such as task completion time and accuracy [56] as well as attention [34], which can be measured objectively by devices. However, it is much more challenging to generate a reliable dataset for subjective metrics, as participants might not share a reliable and consistent scoring scale [21, 55, 64]. To that end, Saket et al. [56] extended their experiment by asking participants to rank five different visualization types in the order of preference and found a positive correlation between user preference and task accuracy. However, the collected data is for statistical analysis instead of machine learning tasks.

To generate a dataset for machine learning, Luo et al. [44] propose a pair-wise comparison approach, i.e., asking participants to choose which chart is better from two candidates, which yields more precise results. They subsequently compute the overall order from pair-wise comparisons and choose the top-ranked ones as training data. However, their approach is limited for two reasons. First, it is inefficient, as they only obtained 2,520/30,892 good/bad charts after 285,236 comparisons. Second, they formulate the problem as a classification task, neglecting the subtle differences among charts. To that end, we propose LQ2, which predicts a numerical score for a single chart through regression neural networks, while directly taking paired comparisons as the training input.
LQ2 is built on learning frameworks similar to those for image assessment [68], but integrates a parameter module and two sampling strategies to learn visualization-specific features.

Data visualizations represent data with graphical elements according to visual specifications. Specifications can be classified into two types: visual encodings that map data to visual properties (e.g., color, position, size) of graphical elements, and visual styles that specify the remaining visual properties irrespective of data (e.g., label rotation, bar bandwidth). Our work aims to automate the parameter configurations for the latter, i.e., visual styles, which are largely neglected in existing automated charting tools. Concretely, we focus on the layout properties in bar charts, since this is one of the first works that attempts to rank and recommend parameter specifications of visual styles leveraging machine-learning approaches. In this section, we describe our experiments, the design considerations, and the problem formulation.
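To make the encoding/style distinction concrete, a bar-chart specification in the style of Vega-Lite can be written as a Python dict separating the two kinds of properties. The property names (`labelAngle`, `paddingInner`, `width`, `height`) follow Vega-Lite's documented options; the helper itself is an illustrative sketch, not the paper's code.

```python
# Illustrative split of a bar-chart spec into visual encodings (data-driven)
# and visual styles (data-independent layout parameters, the target of LQ2).
def make_spec(values, width=300, aspect_ratio=1.5, bandwidth=0.7, label_rotation=0):
    """Assemble a Vega-Lite-like bar chart spec (sketch, not the paper's code)."""
    return {
        "mark": "bar",
        # Visual encodings: map data fields to visual properties.
        "encoding": {
            "x": {"field": "label", "type": "nominal",
                  # Visual styles can live alongside encodings in Vega-Lite:
                  "axis": {"labelAngle": label_rotation},
                  # paddingInner controls the gap between bars, i.e. bandwidth.
                  "scale": {"paddingInner": 1 - bandwidth}},
            "y": {"field": "value", "type": "quantitative"},
        },
        # Visual styles: layout parameters independent of the data.
        "width": width,
        "height": round(width / aspect_ratio),
        "data": {"values": values},
    }

spec = make_spec([{"label": "A", "value": 3}, {"label": "B", "value": 5}])
```

Changing only the style parameters leaves the encoding untouched, which is exactly the search space the later experiments explore.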
Different from visual encodings, which are typically described as discrete mappings, visual styles usually have greater cardinality and continuous values that increase their complexity. To keep the study complexity manageable, we select two concrete yet underexplored experiments (Figure 2).

Our first experiment considers three basic layout-related parameters, i.e., the number of bars, the aspect ratio of the chart, and the bandwidth. This is because we observe that existing charting tools determine those values by predefined heuristics. For instance, Microsoft Excel fixes the aspect ratio and computes the bandwidth according to the number of bars (Figure 1). In this paper, we argue and demonstrate that such default heuristics could result in sub-optimal layouts. Individuals, therefore, need to repeatedly adjust several parameters to achieve visually appealing layouts. However, such manual adjustments are unsystematic and time-consuming, without a guarantee of improvement. Therefore, we study how to automatically configure those parameters by learning from human preferences.
Figure 3: The data collection process in an iterative manner: A Generating paired charts with the same data and different layout parameters; B Labelling the training data through crowdsourcing experiments; C Training the scoring models; D Utilizing two offline adaptive sampling strategies to increase the representativeness of the training dataset.
The second experiment is extended with another three parameters, namely, the chart's orientation and the max length and rotation degree of axis tick labels. This experiment is motivated by the practical needs of responsive visualization design, i.e., how to adjust the chart layout to fit different sizes. This is a challenging task since chart creators need to manually examine and edit layouts for multiple chart sizes [27]. Therefore, we add those three parameters, which are often subject to adjustment in responsive visualization designs. This experiment extends existing automated responsive visualization approaches [71] by considering the aspect ratio and allowing rotated chart orientations.
The problems above can be summarized as optimization problems, i.e., to find values of visual styles that maximize the layout quality. To guide the design of our solution, we summarize two primary considerations:
C1: To quantify and score layout qualities.
One of the primary challenges in optimization problems is to define the objective function. Hence, our primary goal is to learn a loss function that maps values of layout parameters onto a numerical score intuitively representing the layout quality. This loss function can subsequently be used for mathematical optimization.
C2: To learn the overall quality from human feedback.
Since judgments of layout qualities involve a wide range of factors, previous work in graphic design [40, 49] usually utilizes human-crafted metrics (e.g., symmetry) to measure layout qualities. These methods face challenges in the context of data visualizations since few human-crafted metrics are available for chart layouts. Besides, it is difficult to weigh different metrics to reflect the overall quality perceived by users. Therefore, we aim to measure the overall quality by learning from human feedback, and conduct post hoc analyses to summarize rules of thumb from the trained model.
Guided by the design considerations, our main task is to develop a machine-learning model that learns to predict a layout quality score for the given parameters from human feedback data. We formulate this task as a learning-to-rank problem [42], which aims to acquire a global ranking from partial orders. The ground truth of partial orders is harvested from experimental data on human preference. Specifically, we conduct a paired comparison experiment, asking participants to choose their preferred layout from two candidates. The results from paired comparisons contain partial orders, which constitute the training data.
We describe our process of constructing the training dataset containing ranked pairs of chart layout configurations. As shown in Figure 3, the process is iterative and contains four steps. In this section, we describe steps A, B, and D in Figure 3 in detail. Step C will be introduced in section 5.

Our first step is to create paired charts for crowdsourced comparison. We decide to synthesize charts since it is difficult to harvest real-world chart pairs that are fairly comparable, that is, controlled to represent the same data. Charts are created using Vega-Lite [58], which allows specifying the aforementioned parameters in a declarative manner. For each pair, we choose data from two popular real-world datasets, namely the Car dataset and the Baseball dataset, and randomly select entries according to the number of bars. In Experiment 1, we replace the tick labels with meaningless two-character tokens. In Experiment 2, the tick labels are truncated according to the parameter of label lengths.

The remaining parameters are generated with different values within a chart pair, including the aspect ratio, chart orientation, bandwidth, and rotation of axis labels. It is expensive to conduct controlled experiments for each parameter, since those parameters may not be independent. Therefore, we choose to randomize all parameters, intuitively intending to obtain a wide variety of chart configurations. However, exhaustive enumeration of possible values and combinations of parameters is infeasible due to their continuous distributions. Thus, we decide to randomly sample from uniformly distributed values (e.g., the bandwidths range from 0.1 to 1.0 with a step of 0.15).

https://vega.github.io/vega-datasets/data/cars.json
https://github.com/vincentarelbundock/Rdatasets/blob/e38552ac3cb40a532941b09d7332b03d19409919/doc/plyr/baseball.html
We choose a relatively large step in order to make the differences notable. While parameters are sampled randomly, we make the sampled values evenly distributed to avoid the data imbalance problem.

Figure 2 shows the sampled values and the possible combinations of parameter values in our experiments. We update the sampling values in Experiment 2 according to our findings in Experiment 1. For instance, we truncate the maximal aspect ratio to 2, since we observe that larger aspect ratios are less favored. It should be noted that the resulting design space is still considerably large, which poses challenges in solving the optimization problem.

We harvest ground truths of ranked pairs of charts through a two-step process.
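The uniform grids can be enumerated directly from the ranges in Figure 2. The sketch below does this for Experiment 1; note that the paper reports 1,575 combinations, so its exact endpoint handling presumably differs slightly from this naive enumeration.

```python
# Uniform grid sampling for Experiment 1, following the ranges in Figure 2.
def sample_grid(start, stop, step):
    """Inclusive range with a float step, rounded to avoid float drift."""
    n = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(n)]

bandwidths = sample_grid(0.1, 1.0, 0.15)      # 0.1, 0.25, ..., 0.85, 1.0
aspect_ratios = sample_grid(1.0, 3.67, 0.33)  # 1.0, 1.33, ..., 3.64
num_bars = sample_grid(5, 30, 1)              # 5, 6, ..., 30

# Size of the design space under this naive enumeration.
design_space = len(bandwidths) * len(aspect_ratios) * len(num_bars)
```

Sampling a chart configuration then reduces to drawing one value per parameter, uniformly at random, from these lists.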
Desk Reject.
First, we "desk reject" charts that violate a set of predefined rules and label them as negative in a pair. We discard a chart pair if both charts violate the rules. Specifically, two rules are included in Experiment 2: the axis labels should not overlap with each other, and the axis labels should not rotate in a horizontal bar chart. This approach allows us to train an ML model that learns human-crafted rules during training and therefore better reflects the overall quality. An alternative approach in visualization recommendation systems is to utilize rules as hard constraints in the optimization phase [47], which, however, poses challenges in solving the optimization problem as the number of rules increases. This might be undesirable since it could prolong the execution time and therefore degrade the usability.
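The two rules can be expressed as a single predicate over the layout parameters. The overlap check below is an illustrative heuristic (comparing an estimated label width against the horizontal space available per bar, with an assumed average character width), not the paper's exact rule.

```python
# Desk-reject predicate over layout parameters (illustrative sketch).
CHAR_WIDTH = 7  # assumed average character width in pixels

def desk_reject(orientation, rotation, label_length, chart_width, n_bars):
    """Return True if a chart violates a predefined rule."""
    # Rule 2: axis labels should not rotate in a horizontal bar chart.
    if orientation == "horizontal" and rotation != 0:
        return True
    # Rule 1 (heuristic): unrotated labels on a vertical chart overlap
    # when the estimated label width exceeds the slot width per bar.
    if orientation == "vertical" and rotation == 0:
        if label_length * CHAR_WIDTH > chart_width / n_bars:
            return True
    return False
```

A pair is discarded outright only when both charts are rejected; otherwise the rejected chart is labelled as the negative example.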
Figure 4: Illustration of the MTurk interface for crowdsourcing experiments: we propose a two-alternative forced-choice design that makes it easier for participants to evaluate the relative quality of paired charts than scoring a single chart.
Crowdsourcing Experiments.
Second, we conduct crowdsourcing experiments on Amazon Mechanical Turk (MTurk) to obtain experimental data on human preference. Figure 4 illustrates the settings of the MTurk experiment. We propose a two-alternative forced-choice (2AFC) experiment, asking participants to choose "which of the following two layouts do you prefer, in terms of aesthetics?". Two charts are placed vertically within a viewport since it is easy to compare by moving eyes between side-by-side views [48]. We choose a forced-choice method in an attempt to capture subtle differences [75]. Each MTurk HIT consists of 10 comparison tasks, and each task (paired comparison) is assigned to 3 participants. For quality control, we randomly duplicate one comparison task within a HIT and swap the order of the paired charts. We keep HITs where participants offer consistent answers for the duplicated tasks.

We measure the inter-observer reliability by the joint probability of agreement. We observe that all three participants make the same choice in 45.6% of pairs across the two experiments. This observed probability is much higher than the agreement by chance, i.e., 25%, showing that human preference exhibits a fair degree of agreement on layout qualities. This fair agreement can have several reasons. First, the differences between the two charts in a pair might be small and therefore cause uncertainty, since the layout parameters are generated randomly. Second, individual participants have different preferences. Third, it might be due to the noise of MTurk experiments.

We select paired comparisons with full agreement among participants as the training data to reduce noise [75]. Each pair consists of two charts, denoted ⟨I+, I−⟩, where I+ is preferred over I−.

It is crucial to employ an effective pair-sampling strategy to select the most important pairs for rank learning [68].
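The vote aggregation and the chance-agreement baseline can be sketched as follows. The vote data is hypothetical; with three independent coin-flip voters, all three agree with probability 2 × 0.5³ = 0.25, which is the 25% chance level mentioned above.

```python
from collections import Counter

# Hypothetical 2AFC data: each comparison stores the two chart ids and
# the id chosen by each of the 3 participants.
comparisons = [
    {"pair": ("c1", "c2"), "votes": ["c1", "c1", "c1"]},  # unanimous
    {"pair": ("c3", "c4"), "votes": ["c3", "c4", "c3"]},  # split, dropped
]

def unanimous_pairs(comparisons):
    """Keep only fully agreed comparisons as ⟨preferred, rejected⟩ pairs."""
    kept = []
    for c in comparisons:
        winner, n = Counter(c["votes"]).most_common(1)[0]
        if n == len(c["votes"]):  # full agreement only
            loser = next(x for x in c["pair"] if x != winner)
            kept.append((winner, loser))
    return kept

# Chance level for full agreement of 3 independent binary voters.
chance_agreement = 2 * 0.5 ** 3
```

Only the unanimous pairs enter the training set, which trades dataset size for label reliability.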
Our uniform sampling and random pairing strategy in subsection 4.1 is sub-optimal, since we are interested in finding the most "optimal" chart configurations. Therefore, we propose two offline adaptive sampling strategies to improve the quality and representativeness of the comparison pairs. The term "offline" here is used in the machine learning sense, that is, we re-sample comparison pairs after the initial training phase has finished.
Importance-based Sampling.
We are interested in finding important pairs that allow us to determine the "best" chart configurations. Therefore, we borrow the idea of elimination tournaments, intuitively conducting a second round of comparisons among previous winners. However, this is not readily applicable since our sample size is much smaller than the huge number of possible parameter combinations. Thus, we propose an importance-based sampling scheme, which intends to increase the probability of sampling important charts with "good" parameter values.

Suppose each chart I is configured by a set of parameters p_i ∈ P, where the possible values of p_i are v_i^j ∈ V_i. Let w_i^j denote the number of times that a chart I whose configuration contains v_i^j has won in the paired comparisons in Figure 3 B. We update the probability of sampling the value:

P(v_i^j) = min{w_i^j, T} / Σ_j min{w_i^j, T},    (1)

where T is a parameter controlling the exploration-exploitation trade-off by avoiding empty probabilities.

Gradient-based Sampling.
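Equation (1) translates directly into code; the win counts below are hypothetical.

```python
# Importance-based sampling (Eq. 1): per-value probabilities derived from
# win counts, capped at T so that frequent winners cannot dominate.
def sampling_probs(win_counts, T):
    capped = [min(w, T) for w in win_counts]
    total = sum(capped)
    return [c / total for c in capped]

# Hypothetical win counts for one parameter's sampled values, with T = 10.
probs = sampling_probs([2, 4, 12, 20], T=10)
```

The two values that won 12 and 20 times are both capped at 10, so they receive equal sampling probability while rarely winning values remain reachable.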
Having a large step size in uniform sampling might cause the model to overlook a maximum. To address this problem, we use a gradient-based sampling method to sample important parameter values with a smaller step size. As gradients are computed on a differentiable function, we refer to our scoring model trained in Figure 3 C. This scoring model learns a regression function f(·) that maps the parameter vector p = {p_1, p_2, ..., p_n} to a numerical score. We compute the locations where the gradient of f with respect to p, ∇f(p), is smaller than a given threshold. We sample parameter values within those locations with a smaller step size, i.e., 1/3 of the original step size.

In both experiments, we conduct each of the adaptive sampling strategies once and merge the resulting datasets. This procedure results in 1,177 pairs in Experiment 1 and 1,333 pairs in Experiment 2. Overall, our data collection process involves 416 unique MTurk participants.

With the obtained pairs D = {⟨I+, I−⟩}, LQ2 aims to quantify the aesthetic score of a given chart. Specifically, we formulate the problem as a regression problem, that is, to output a numerical score S for an input chart I. Our goal is to learn a regression function f(·) that predicts a higher score for the preferred chart in a pair:

S+ > S−,  ∀⟨I+, I−⟩ ∈ D,    (2)

where S = f(I).

Model Architecture. LQ2 adopts a Siamese neural network structure, i.e., two networks working in tandem on two different inputs with the same weights to compute comparable outputs [13]. As shown in Figure 5, it consists of two identical scoring networks, and the loss function is defined on the combined output of the scoring networks. The scoring network takes the parameter values as input and outputs a numerical score. We employ fully-connected neural networks (NN), which have proven effective in handling features describing design choices (i.e., parameters) in VizML [29].

Figure 5: LQ2 utilizes a Siamese neural network structure to work in tandem on a pair to compute comparable output.

Our NN contains 6 hidden layers, each consisting of different numbers of neurons with ReLU activation functions and dropout layers. We perform min-max normalization on the parameter values so that each parameter contributes approximately proportionately to the results.

We also tried taking the graphical features (i.e., images) as the training input with off-the-shelf Convolutional Neural Network (CNN) models. However, this method did not bring remarkable performance despite the expensive training time. Our findings conform to earlier work [20, 22] that CNNs might not readily capture human perception in visualizations. Thus, we utilize the parameters as the input, which serve as a compact learning representation that reduces the computational costs.
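The min-max normalization of the parameter vectors mentioned above can be sketched as follows, using the Experiment 1 ranges from Figure 2 as illustrative bounds.

```python
# Min-max normalize each layout parameter to [0, 1] so that parameters
# with large raw ranges (e.g., number of bars) do not dominate training.
def min_max_normalize(vector, ranges):
    return [(v - lo) / (hi - lo) for v, (lo, hi) in zip(vector, ranges)]

# Parameter order: number of bars, aspect ratio, bandwidth (Experiment 1).
ranges = [(5, 30), (1.0, 3.67), (0.1, 1.0)]
x = min_max_normalize([5, 3.67, 0.55], ranges)
```

The normalized vector is what each of the two identical scoring networks consumes.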
Loss Function.
We adopt the Pairwise Ranking Loss as the loss function, which explicitly exploits the relative rankings of chart pairs [35]:
L(S+, S−) = max(0, S− − S+ + m),    (3)

where m is a specified margin hyper-parameter. This loss imposes a ranking constraint by penalizing mistakes of assigning a lower score to the preferred chart.

Implementation and Training.
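In code, the margin ranking loss is a one-liner. The sketch below is framework-agnostic; in PyTorch, which the paper uses, `torch.nn.MarginRankingLoss` provides the same form.

```python
# Pairwise ranking (hinge) loss: zero when the preferred chart's score
# beats the other score by at least the margin m, linear penalty otherwise.
def ranking_loss(s_plus, s_minus, m=1.0):
    return max(0.0, s_minus - s_plus + m)
```

The margin m forces the preferred chart to win by a gap rather than by an arbitrarily small amount, which is the hyper-parameter the training section reports as most influential.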
We implement LQ2 with PyTorch. During training, we split the data with a ratio of 8:2 for training and validation. We tune several hyper-parameters by diagnosing the learning curves so that the plots on training and validation data converge to a good point of stability with a small gap. The model is trained with the Adadelta optimizer for 200 epochs. The learning rate is 1, and is subsequently halved every 30 epochs. We found that only the margin hyper-parameter m had a significant impact on the training performance, while weight decay, optimizer, and dropout had small effects.

To evaluate the effectiveness of our method, we conduct experiments with baseline approaches and perform qualitative analyses with the trained scoring network across different layout parameters.
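The stated schedule (initial rate 1, halved every 30 epochs) is a standard step decay, sketched below.

```python
# Step-decay schedule: learning rate halves every 30 epochs, starting at 1.0.
def learning_rate(epoch, base_lr=1.0, drop=0.5, every=30):
    return base_lr * drop ** (epoch // every)
```

Over the 200 training epochs this yields six halvings, ending around lr ≈ 0.016.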
We compare the performance of our model with several baseline approaches. For experiment reproducibility, we adopt a Monte Carlo cross-validation strategy [72], that is, we randomly split the data into training and testing data with an 80-20 ratio, run the experiment, and repeat the above process ten times.
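Monte Carlo cross-validation differs from k-fold in that each split is drawn independently. A sketch, where the `evaluate` callback (standing in for training plus accuracy measurement) is hypothetical:

```python
import random

# Monte Carlo cross-validation: n_runs independent random 80-20 splits.
def monte_carlo_cv(data, evaluate, n_runs=10, test_frac=0.2, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * (1 - test_frac))
        train, test = shuffled[:cut], shuffled[cut:]
        scores.append(evaluate(train, test))
    return sum(scores) / len(scores)
```

Because splits are independent, the same item may land in the test set of several runs, unlike the disjoint folds of k-fold cross-validation.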
Model Baseline.
Our problem is framed as a learning-to-rank problem. Therefore, we consider the Ranking Support Vector Machine (RankSVM) [31] as the baseline approach, which is a well-established method for computing an overall ranking based on pairwise preferences. Similar to Draco [47], we use a linear SVM model with hinge loss.
Scoring Baseline.
We also compare our learned scoring network with existing human-crafted metrics for layout qualities in graphic design [49]. We select the four metrics that are most applicable to the context of charts: White Space, Scale, Unity, and Balance. Those metrics are implemented according to the instructions in the supplemental material. We discard Alignment and Overlapping, whose values do not vary among our charts. Besides, Emphasis and Flow are not considered since they are mainly concerned with key text or graphics, which are not well defined in
Table 1: Comparison of the performances between our method and baseline approaches in terms of the prediction accuracy(%) via Monte-Carlo Cross-Validation for 10 runs with an 80-20 training-testing split ratio.Ours RankSVM White Space Scale Unity Balance AllExp. 1 (N = 1,177) 76.60 70.83 57.28 56.26 52.00 56.08 60.81
Exp. 2 (N = 1,333) 78.27 64.48 58.24 61.72 56.21 63.18 68.73
Table 2: The Pearson correlation between predicted scores and each layout parameter in Experiment 1 and 2.Number of Bars Aspect Ratio Bandwidth Max Label Length Label Rotation OrientationExp. 1 -0.38 0.20 0.27 - - -
Exp. 2 -0.09 0.37 -0.05 -0.09 -0.43 0.04 bandwidth sc o r e bandwidth sc o r e A aspect ratio o r i e n t a t i o n aspect ratio r o t a t i o n tick label length o r i e n t a t i o n tick label length r o t a t i o n o r i e n t a t i o n number of bars o r i e n t a t i o n r o t a t i o n number of bars r o t a t i o n B Exp. 1 Exp. 2
Figure 6: Visualizing the predicted scores with A a single parameter by box-plots and B multiple parameters by heat-maps. charts. We also combine those metrics (
All ). Each metric consists ofseveral features, which are fed into RankSVM to learn their weights.
Result.
Table 1 shows the results. In both experiments, our model outperforms the RankSVM baseline. In particular, RankSVM performs much worse in Experiment 2, suggesting that the impacts of the parameters on the predicted scores tend to be non-linear. None of the scoring baselines achieves the desired performance, suggesting that those hand-crafted features for layout qualities in graphic design are not readily applicable to charts.
To understand the impact of layout parameters on perceived layout quality, we conduct quantitative and qualitative analyses with the trained scoring model. Those analyses help relate our work to prior knowledge about chart layout design, inform design guidelines, and provide qualitative support for our methods. Specifically, we calculate the predicted layout quality score for different combinations of parameters.

We study the relationships between the predicted score and each parameter by computing correlations and visualizing distributions. Our findings are summarized below. They should be interpreted carefully since they are derived from black-box ML models; they should not be considered guidelines, but rather speculative hypotheses that warrant future empirical research to confirm.
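The Pearson correlation used in these analyses is standard; for reference, a minimal implementation:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient: covariance of xs and ys
    divided by the product of their standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)
```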
Table 2 shows the Pearson correlations between the predicted scores and each parameter. We first note the negative impact of the number of bars in both experiments, showing that it is more challenging to find good layouts with more bars. The aspect ratio contributes positively to the overall score, suggesting that landscape layouts might be superior to portrait layouts. The impact of bandwidth differs between our two experiments. Figure 6A visualizes the predicted scores versus bandwidths using box-plots. In Experiment 1, the average scores are relatively higher when the bandwidth is between 0.6 and 0.95, with a subtle peak at 0.8. However, in Experiment 2, where horizontal bar charts are introduced, the "optimal" interval becomes 0.5 to 0.85, followed by a sharp drop after 0.9. Based on those observations, we form hypotheses for future studies to confirm: first, as a rule of thumb, the optimal bandwidth in vertical bar charts is 0.8; second, the optimal bandwidth in horizontal bar charts is smaller than that in vertical bar charts. Our second hypothesis conforms to an existing rule of thumb that suggests a bandwidth between 0.57 and 0.67 in horizontal bar charts [19].

The label rotation has a moderate negative correlation (-0.43) with the score. Figure 6B presents the combined effects of label rotation and other parameters on the predicted score. Non-rotation (zero degrees) is acceptable when the number of bars is small, when the aspect ratio is large, and when axis labels are short, because axis labels are less likely to overlap with each other under those conditions. Otherwise, the axis labels need to rotate to avoid overlapping. In general, rotations by 45 degrees yield higher scores than rotations by 90 degrees.

We observe no correlation (0.04) between the orientation and the scores, showing that both horizontal and vertical bar charts have their own advantages. As shown in Figure 6B, vertical bar charts achieve much lower scores when the number of bars is large, when the aspect ratio is smaller than 1, and when the axis labels are long. On the other hand, the scores of horizontal bar charts are less sensitive to the number of bars and the aspect ratio, and horizontal charts seem especially useful when axis labels are lengthy.

We also note that the score distributions differ between the two experiments (Figure 6B). The predicted scores in Experiment 1 vary between 0 and 0.39, while the range in Experiment 2 is 0.03 to 0.72. This might be due to the existence of comparison "deadlocks" in Experiment 1, e.g., I_a ≻ I_b, I_b ≻ I_c, I_c ≻ I_a. This leaves the optimization constraints in Equation 3, i.e., S_a − S_b > m, S_b − S_c > m, S_c − S_a > m, without feasible solutions. In Experiment 1, we observe 21 three-node, 23 four-node, 10 five-node, and 32 six-node cycles, and the value of m is 0.12.

We investigate the effectiveness of the adaptive sampling strategies by conducting an ablation analysis. Specifically, we train two models on the dataset before and after the adaptive sampling process in Experiment 1, denoted BS and AS. The two datasets are down-sampled to ensure the same data size. Figure 7 presents two qualitative examples showing the learned relationships between the predicted scores and different parameters.

Figure 7: Heatmaps showing the predicted scores with different parameter combinations in Experiment 1 (Number of Bars = 20; Bandwidth = 0.8; before vs. after sampling). Our adaptive sampling strategies allow us to obtain fine-tuned results.
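Comparison deadlocks of the kind just described are cycles in the directed preference graph, where an edge points from the preferred chart to the other; a sketch of detecting them, with names of our own choosing:

```python
def has_preference_cycle(preferences):
    """preferences: list of (winner, loser) pairs. Returns True if the
    directed graph winner -> loser contains a cycle, i.e. a comparison
    deadlock that makes the margin constraints of Eq. 3 infeasible."""
    graph = {}
    for winner, loser in preferences:
        graph.setdefault(winner, []).append(loser)
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on DFS stack / done
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in graph.get(node, []):
            c = color.get(nxt, WHITE)
            if c == GRAY or (c == WHITE and visit(nxt)):
                return True  # back edge found: cycle
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in graph)
```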
It is observed that the BS model yields a small region of light colors (low scores) and a majority of dark colors (high scores). In other words, it learns to reject bad conditions but cannot further differentiate conditions scored "borderline and above". On the contrary, the AS model is able to identify a small region of parameters that yield higher scores, which accords with our goal of optimizing layout parameters. Besides, it identifies difficult conditions. For instance, Figure 7B (right) suggests that the optimal aspect ratio increases with the number of bars before it reaches 3. This implies the difficulty of finding good layout parameters for charts with an aspect ratio over 3, which are uncommon and less favored.

To demonstrate the usefulness of LQ, we present a novel application: automatic optimization of layouts. Existing charting tools typically generate layouts by predefined heuristics, which require tedious manual adjustments. It would therefore be useful to automate this process by recommending layout parameters that improve the quality. To that end, we propose an automatic optimization approach and conduct two user studies.

We present two user studies in line with our experiments. User Study 1 considers a common real-world scenario, presentations, where individuals usually wish to create charts that convey data insights in an aesthetically pleasing manner. The task is to adjust the aspect ratio and the bandwidth given the data. User Study 2 concerns adaptive visualization design, where a maximal width is posed as a hard constraint and the task is to adjust four parameters: the aspect ratio, the bandwidth, the orientation, and the label rotation. We create 50 and 80 design cases for the two studies, respectively, each case encoding randomly chosen data. We compare our results (Ours) with those generated by laypeople (Human), default parameters (Default), and random values (Random).

Our Approach.
Our optimization approach aims to find parameter values that maximize the layout quality score predicted by LQ. For that purpose, we adopt a brute-force method that enumerates combinations of values and selects the one with the highest predicted score. We choose a brute-force method since the maximal enumeration size is 87,360, which computers can handle within seconds. Advanced optimization techniques that avoid full enumeration [1] would be desirable for coping with an expanding parameter space.

Human Baseline.
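Such a brute-force optimizer simply enumerates the parameter grid and keeps the highest-scoring combination. A sketch with a toy scoring function standing in for the learned model; the grid values and names below are illustrative:

```python
from itertools import product

def best_layout(param_grid, score):
    """Enumerate every combination in `param_grid` (a dict mapping
    parameter name to candidate values) and return the combination
    with the highest predicted score, together with that score."""
    best, best_score = None, float("-inf")
    for combo in product(*param_grid.values()):
        params = dict(zip(param_grid.keys(), combo))
        s = score(params)
        if s > best_score:
            best, best_score = params, s
    return best, best_score

# Toy usage: a scorer that peaks at bandwidth 0.8 stands in for LQ.
grid = {
    "bandwidth": [round(0.05 * i, 2) for i in range(1, 20)],
    "orientation": ["vertical", "horizontal"],
}
params, s = best_layout(grid, lambda p: -abs(p["bandwidth"] - 0.8))
```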
To obtain human baselines, we run an experiment on MTurk. Participants are instructed to "adjust the parameters until you are mostly satisfied with the layout" with a What You See Is What You Get (WYSIWYG) editor implemented with Vega-Lite. Akin to standard charting tools, participants are provided with a slider and an input box for adjusting continuous values, and a radio group for editing discrete parameters. We record their editing history and the time from first edit to final submission. Each participant is assigned to one and only one design task.
Default Baseline.
In User Study 1, we choose Microsoft Excel as the default baseline. To keep the comparison fair, we remove components that do not exist under the other conditions, including the chart title and y-axis gridlines. Besides, the bars are filled with the default color of Vega-Lite. In User Study 2, we compare our method against the Responsive Bar Chart feature of Vega-Lite (https://vega.github.io/vega-lite/examples/bar_size_responsive.html).
Figure 8A data (percentages of favored votes; "ns" marks comparisons without statistical significance):

                     User Study 1 (N = 50)  User Study 2 (N = 80)
Ours vs. Human       59% / 41%              51% / 49% (ns)
Ours vs. Default     63% / 37%              66% / 34%
Ours vs. Random      67% / 33%              66% / 34%
Human vs. Default    52% / 48% (ns)         63% / 37%
Human vs. Random     64% / 36%              66% / 34%
Default vs. Random   54% / 46% (ns)         53% / 47% (ns)
Figure 8: Results of the user study: A displays the results of group-wise comparisons among the four groups in terms of percentages of favored votes; "ns" denotes no statistical significance via Wilcoxon signed-rank tests. B presents two box-plots visualizing the time used and the number of edits by laypeople in configuring the chart layout.
Random Baseline.
The random baseline takes random parameters, which are sampled from values observed in the training data to make the comparison fair.
We run another MTurk experiment, asking participants to compare the results among the above four groups. Similar to the labeling process, we conduct pair-wise comparisons between every two groups. Each between-group comparison includes 50 paired charts in User Study 1 and 80 in User Study 2. Each paired chart is evaluated in a two-alternative forced choice (2AFC) paradigm by 10 participants. There is one duplicate pair out of every 10 for quality control.

Figure 8A summarizes the results of the group-wise comparisons in terms of the percentages of preferred votes in the 2AFC procedures. In User Study 1 (US1), our method outperforms Human, Default, and Random (p <
.05, Wilcoxon signed-rank tests). It is also noted that while Human is preferred over Default, the difference is not statistically significant. Similarly, Default has a small yet not significant advantage over Random. However, Human performs significantly better than Random. That is, laypeople could only achieve a relatively small improvement in layout quality, although they spent notable effort, i.e., 49.7 seconds and 8.9 adjustments on average (Figure 8B).

In User Study 2 (US2), both Ours and Human outperform Default and Random (p < .05), while Ours and Human perform comparably. First, laypeople spent more effort than in US1, i.e., 70.4 seconds and 17.2 adjustments on average (Figure 8B); it is therefore expected that participants could achieve better results. Second, US2 presents a much more challenging task than US1, since the total number of possible parameter combinations is 1,575 in US1 and 87,360 in US2 (Figure 2). However, the training data size is only 1,333 in US2, which is far from fully representing the whole design space, so our method could not find the "optimal" solution every time. Still, our results are positive, as our method achieves human-level performance from small training data, showing the effectiveness of our sampling strategies. Future work could extend our work by augmenting the training data. Finally, our sample size (80) is relatively small, considering the variety of the parameter space and the random nature of task generation. In the future, we plan to conduct larger-scale user studies to better understand scalability.

In summary, our results show that the default heuristics for generating layouts in existing charting tools can produce sub-optimal results. To improve layout quality, laypeople need to engage in a time-consuming process of adjusting the parameters over and over. Our automatic approach achieves at least human-level performance via small-sample learning, while removing the heavy cost of manual adjustments.

We reflect on the implications and future work of our research.
Do not trust the defaults.
Charting tools and libraries provide default settings for user-configurable parameters. Default settings are proven to introduce a default effect: people blindly trust and stick with them [15]. However, default settings are designed to be reasonable in most cases, i.e., to prevent obvious mistakes. Thus, they are merely acceptable, not good for all. We provide empirical evidence that the default layout parameters for bar charts in Excel and Vega-Lite are sub-optimal and can be significantly improved by manual or automatic fine-tuning. Those results support the growing recognition that default values should be used prudently.
Augmenting empirical studies with a machine learning approach.
Our experiments can be considered empirical studies aiming to identify the "best" combinations of variables. This is challenging due to the vast design space, i.e., 87,360 possible combinations, which makes exhaustive enumeration and controlled studies infeasible. In response, we propose an ML approach that learns to rank the combinations from small samples (1,333 pairs), yielding notable results. More importantly, we formulate hypotheses about optimal variables by interpreting the ML model. Future work could verify those hypotheses by conducting controlled experiments.
Quantifying visualizations with subjective metrics.
Recent years have witnessed a growing research interest in quantifying and benchmarking visualizations for machine learning (e.g., [30, 56]). Those works have predominantly focused on objective metrics such as accuracy and effectiveness. Subjective metrics, however, are relatively neglected, as they are considered more challenging to measure. Our work extends this line of research by benchmarking charts with a subjective metric, i.e., human preference over layouts, through crowdsourcing experiments. We describe our procedures and strategies for quantifying subjective metrics, hoping to inform future research that measures and improves visualizations from more diverse perspectives, e.g., understandability [60].
Improving aesthetic qualities of visualizations from a data-driven perspective.
A good visualization consists of four necessary elements: information, story, goal, and beauty [45]. In a broad sense, our work addresses beauty, that is, aesthetic quality. We propose a data-driven method to learn human preferences for layouts, which outperforms hand-crafted layout metrics. Our results demonstrate promising research possibilities for understanding and improving aesthetic quality via data-driven machine learning approaches. This research direction is supported by real-life practical needs: existing charting tools generate sub-optimal results, while laypeople tend to rely on default values [15] or engage in a time-consuming process of tuning until they are satisfied (Figure 8B). These needs call for increasing recognition of what makes a chart visually appealing and for more advanced automatic methods to improve aesthetic quality.

We assess the quality of charts by asking participants "which do you prefer?". Compared with scoring a single chart, this paired comparison method is easier for participants and yields more precise and consistent results. As such, we see potential in adopting it for various purposes in visualization research. To better inform future research, we discuss our critical reflections on this method.
Combating decision paralysis.
Decision-making is not always easy, especially when the differences between two charts are small. It can cause analysis paralysis, where individuals overthink the situation until decision-making is "paralyzed" [38]. Subsequently, individuals tend to make an arbitrary decision hesitantly [64]. As shown in Figure 8B, some participants spent much more effort on editing the parameters than average, suggesting that they were subject to analysis paralysis. To alleviate this problem, future research should propose more effective sampling strategies that avoid over-subtle differences between paired charts. Besides, we might borrow the idea of agile methodologies in software engineering to overcome the anti-pattern of decision paralysis [8]. One promising approach in the context of empirical research is to set time limits for viewing visualizations and making decisions (e.g., [24]).

"Evils" can attract.

Psychological studies reveal the physical attractiveness stereotype: people tend to assume "what is beautiful is good" [16]. In the context of data visualizations, this is exemplified by chartjunk [18, 65], where laypeople are attracted by elements that are visually appealing but usually at the expense of effectiveness. This contributes to the paradigm in charting tools of "compromising" with such human preferences. For example, Google Sheets supports 3D pie charts despite criticism from the visualization research community. Google Sheets also offers a Smooth Line Chart that improves the aesthetics but compromises the integrity of the underlying numbers. Future research should be aware of this trade-off when designing experiment settings.
Incorporating crowdsourced opinions with expert knowledge.
Visualization researchers have increasingly leveraged crowdsourcing experiments for the sake of scalability and diversity. However, crowdsourcing experiments face challenges such as reduced control in assessing participants' capability, which might harm validity [4]. Besides, we observe disputes in crowdsourced opinions. To that end, we envision that expert knowledge could help increase validity, resolve disputes, and reduce costs. For instance, one might select expert-generated charts as positive examples and randomly-generated charts as negative examples in a pair [51]. However, it is worth noting that expert judgement could clash with crowdsourced opinions, which warrants deeper investigation [36].
Balancing human preference and perceptual effectiveness.
Our work takes only the first step in improving the visual quality of data visualizations via a data-driven approach that learns from human preferences. In particular, we study six layout parameters in bar charts. We do not conduct comprehension experiments to evaluate their effects on perceptual effectiveness, because the effect size of layouts on perception is typically small in standard bar charts [62, 74]. How to balance human preference and perceptual effectiveness is a clear next step for future work. This is critical because layouts have been shown to impose more influence in some other chart types (e.g., [25, 26]). A key challenge is that human preference and perception must be measured conjunctively in order to obtain the training data for machine learning approaches.
Moving towards a more adaptive approach.
Although we propose two sampling strategies that enable learning from small data sets (Small Sample Learning), our model in Experiment 2 only achieves human-level performance. This presents a significant challenge, as the size of the design space grows exponentially with the number of parameters. Future work should propose advanced sampling approaches to improve effectiveness. Recent research in online adaptive sampling [32], which automatically updates the sampling strategy during training, is a promising method to address this problem. An interesting research problem is how to dynamically adjust sampling probabilities during crowdsourcing experiments. Moreover, we see research opportunities in leveraging authoring provenance (e.g., editing histories) to augment the training data and in developing an ML model that adaptively recommends design suggestions based on the current configuration.
Understanding the representations and models for visualization research.
In a broader sense, it remains an open challenge to choose feature representations and machine learning models for visualizations. Similar to Draco [47] and VizML [29], LQ is trained on parameter features that are compact and computationally inexpensive but might not generalize to unobserved parameter values (e.g., more than 30 bars) or different chart types, and may require labour-intensive feature engineering. Although graphical features (i.e., bitmaps) might embrace generalisability, recent studies [20, 22] suggest that CNNs, the most common model for analyzing visual imagery [67], do not currently seem capable of processing visualization images. This underscores the need to explore advanced ML models, e.g., VAEs [20]. Furthermore, LQ does not include the underlying data distributions and non-layout parameters (e.g., colors) in the training representations, although they could influence perceived aesthetic quality. To that end, future research should study how to choose and fuse multiple representations, including the underlying data, parameters, and graphics.

Debating "what is beautiful is good".
Finally, we propose a research agenda towards a deeper understanding of the role of aesthetic quality in data visualizations. This is critical since more and more people are now able to create visualizations, increasing their exposure to the greater masses. This phenomenon contributes to the increasingly popular pursuit of aesthetic quality. We even see extreme cases where aesthetic concerns play a more crucial role than usability and even usefulness, e.g., the Smooth Line Chart. How should the research community respond to this shifting boundary?
ACKNOWLEDGMENTS
The research is partially supported by Hong Kong RGC GRF grant 16213317.
REFERENCES [1] Satyajith Amaran, Nikolaos V Sahinidis, Bikram Sharda, and Scott J Bury. 2016.Simulation optimization: a review of algorithms and applications.
Annals ofOperations Research
Proc. of the Conference on Human Factors inComputing Systems (CHI) . ACM, NY, USA, 1–8. https://doi.org/10.1145/3173574.3174168[3] Michael Behrisch, Michael Blumenschein, Nam Wook Kim, Lin Shao, MennatallahEl-Assady, Johannes Fuchs, Daniel Seebacher, Alexandra Diehl, Ulrik Brandes,Hanspeter Pfister, et al. 2018. Quality metrics for information visualization. In
Computer Graphics Forum , Vol. 37. Wiley Online Library, Hoboken, NJ, USA,625–662. https://doi.org/10.1111/cgf.13446[4] Rita Borgo, Luana Micallef, Benjamin Bach, Fintan McGee, and Bongshin Lee.2018. Information visualization evaluation using crowdsourcing. In
ComputerGraphics Forum , Vol. 37. Wiley Online Library, Hoboken, NJ, USA, 573–595.https://doi.org/10.1111/cgf.13444[5] Michelle A Borkin, Azalea A Vo, Zoya Bylinskii, Phillip Isola, Shashank Sunkavalli,Aude Oliva, and Hanspeter Pfister. 2013. What makes a visualization memorable?
IEEE Transactions on Visualization and Computer Graphics
19, 12 (2013), 2306–2315. https://doi.org/10.1109/TVCG.2013.234[6] Ralph Allan Bradley and Milton E Terry. 1952. Rank analysis of incomplete blockdesigns: I. The method of paired comparisons.
Biometrika
39, 3/4 (1952), 324–345.[7] Zoya Bylinskii, Nam Wook Kim, Peter O’Donovan, Sami Alsheikh, SpandanMadan, Hanspeter Pfister, Fredo Durand, Bryan Russell, and Aaron Hertzmann.2017. Learning visual importance for graphic designs and data visualizations. In
Proc. of the Annual Symposium on User Interface Software and Technology (UIST) .ACM, NY, USA, 57–69. https://doi.org/10.1145/3126594.3126653[8] Barbara A Carkenord. 2009.
Seven steps to mastering business analysis . J. RossPublishing, FL, USA.[9] Nick Cawthon and Andrew Vande Moere. 2007. The effect of aesthetic on theusability of data visualization. In
Proc. of the International Conference InformationVisualization (IV) . IEEE, NY, USA, 637–648. https://doi.org/10.1109/IV.2007.147[10] Xi Chen, Wei Zeng, Yanna Lin, Hayder Mahdi Al-maneea, Jonathan Roberts, andRemco Chang. 2020. Composition and configuration patterns in multiple-viewvisualizations.
IEEE Transactions on Visualization and Computer Graphics (2020),1–1. https://doi.org/10.1109/TVCG.2020.3030338 Early Access. [11] Zhutian Chen, Wai Tong, Qianwen Wang, Benjamin Bach, and Huamin Qu.2020. Augmenting Static Visualizations with PapARVis Designer. In
Proc. of theConference on Human Factors in Computing Systems (CHI) . ACM, NY, USA, 1–12.https://doi.org/10.1145/3313831.3376436[12] Zhutian Chen, Yun Wang, Qianwen Wang, Yong Wang, and Huamin Qu. 2019.Towards Automated Infographic Design: Deep Learning-based Auto-Extractionof Extensible Timeline.
IEEE Transactions on Visualization and Computer Graphics
26, 1 (2019), 917–926. https://doi.org/10.1109/TVCG.2019.2934810[13] Sumit Chopra, Raia Hadsell, and Yann LeCun. 2005. Learning a similarity metricdiscriminatively, with application to face verification. In
Proc. of the IEEE Confer-ence on Computer Vision and Pattern Recognition (CVPR) , Vol. 1. IEEE, NY, USA,539–546. https://doi.org/10.1109/CVPR.2005.202[14] Victor Dibia and Çağatay Demiralp. 2019. Data2Vis: Automatic generation ofdata visualizations using sequence-to-sequence recurrent neural networks.
IEEEComputer Graphics and Applications
39, 5 (2019), 33–46. https://doi.org/10.1109/MCG.2019.2924636[15] Isaac Dinner, Eric J Johnson, Daniel G Goldstein, and Kaiya Liu. 2011. Partition-ing default effects: why people choose not to choose.
Journal of ExperimentalPsychology: Applied
17, 4 (2011), 332. https://doi.org/10.1037/a0024354[16] Karen Dion, Ellen Berscheid, and Elaine Walster. 1972. What is beautiful isgood.
Journal of personality and social psychology
24, 3 (1972), 285. https://doi.org/10.1037/h0033731[17] Gustav Theodor Fechner. 1860.
Elemente der psychophysik . Vol. 2. Breitkopf u.Härtel, Germany.[18] Stephen Few and Perceptual Edge. 2011.
The chartjunk debate
Bar Widths and the Spaces in Between
Proc. of the IEEEVisualization Conference (VIS) . IEEE, NY, USA, 126–130. https://doi.org/10.1109/VISUAL.2019.8933570[21] Fei Gao, Dacheng Tao, Xinbo Gao, and Xuelong Li. 2015. Learning to rank for blindimage quality assessment.
IEEE Transactions on Neural Networks and LearningSystems
26, 10 (2015), 2275–2290. https://doi.org/10.1109/TIP.2017.2708503[22] Daniel Haehn, James Tompkin, and Hanspeter Pfister. 2018. Evaluating ‘graphicalperception’with CNNs.
IEEE Transactions on Visualization and Computer Graphics
25, 1 (2018), 641–650. https://doi.org/10.1109/TVCG.2018.2865138[23] Hammad Haleem, Yong Wang, Abishek Puri, Sahil Wadhwa, and Huamin Qu.2019. Evaluating the readability of force directed graph layouts: A deep learningapproach.
Computer Graphics and Applications
39, 4 (2019), 40–53. https://doi.org/10.1109/MCG.2018.2881501[24] Lane Harrison, Katharina Reinecke, and Remco Chang. 2015. Infographic aes-thetics: Designing for the first impression. In
Proc. of the Conference on Hu-man Factors in Computing Systems (CHI) . ACM, NY, USA, 1187–1190. https://doi.org/10.1145/2702123.2702545[25] Jeffrey Heer and Maneesh Agrawala. 2006. Multi-scale banking to 45 degrees.
IEEE Transactions on Visualization and Computer Graphics
12, 5 (2006), 701–708.https://doi.org/10.1109/TVCG.2006.163[26] Jeffrey Heer, Nicholas Kong, and Maneesh Agrawala. 2009. Sizing the horizon:the effects of chart size and layering on the graphical perception of time seriesvisualizations. In
Proc. of the Conference on Human Factors in Computing Systems(CHI) . ACM, NY, USA, 1303–1312. https://doi.org/10.1145/1518701.1518897[27] Jane Hoffswell, Wilmot Li, and Zhicheng Liu. 2020. Techniques for FlexibleResponsive Visualization Design. In
Proc. of the Conference on Human Factors inComputing Systems (CHI) . ACM, NY, USA, 1–13. https://doi.org/10.1145/3313831.3376777[28] Aspen K Hopkins, Michael Correll, and Arvind Satyanarayan. 2020. VisuaLint:Sketchy In Situ Annotations of Chart Construction Errors. In
Computer GraphicsForum , Vol. 39. Wiley Online Library, Hoboken, NJ, USA, 219–228. https://doi.org/10.1111/cgf.13975[29] Kevin Hu, Michiel A Bakker, Stephen Li, Tim Kraska, and César Hidalgo. 2019.VizML: A Machine Learning Approach to Visualization Recommendation. In
Proc. of the Conference on Human Factors in Computing Systems (CHI) . ACM, NY,USA, 1–12. https://doi.org/10.1145/3290605.3300358[30] Kevin Hu, Snehalkumar’Neil’S Gaikwad, Madelon Hulsebos, Michiel A Bakker,Emanuel Zgraggen, César Hidalgo, Tim Kraska, Guoliang Li, Arvind Satya-narayan, and Çağatay Demiralp. 2019. Viznet: Towards a large-scale visual-ization learning and benchmarking repository. In
Proc. of the Conference onHuman Factors in Computing Systems (CHI) . ACM, NY, USA, 1–12. https://doi.org/10.1145/3290605.3300892[31] Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In
Proc. of the ACM International Conference on Knowledge Discovery and Data Mining(SIGKDD) . ACM, NY, USA, 133–142. https://doi.org/10.1145/775047.775067[32] Angelos Katharopoulos and François Fleuret. 2018. Not all samples are createdequal: Deep learning with importance sampling. In
Proc. of the International
HI ’21, May 08–13, 2021, Online Virtual Wu, et al.
Conference on Machine Learning (ICML). PMLR, Georgia, USA, 2525–2534. http://proceedings.mlr.press/v80/katharopoulos18a.html
[33] Helen Kennedy and Rosemary Lucy Hill. 2018. The feeling of numbers: Emotions in everyday engagements with data and their visualisation. Sociology 52, 4 (2018), 830–848. https://doi.org/10.1177/0038038516674675
[34] Nam Wook Kim, Zoya Bylinskii, Michelle A. Borkin, Krzysztof Z. Gajos, Aude Oliva, Fredo Durand, and Hanspeter Pfister. 2017. BubbleView: an interface for crowdsourcing image importance maps and tracking visual attention. ACM Transactions on Computer-Human Interaction (TOCHI) 24, 5 (2017), 1–40. https://doi.org/10.1145/3131275
[35] Shu Kong, Xiaohui Shen, Zhe Lin, Radomir Mech, and Charless Fowlkes. 2016. Photo aesthetics ranking network with attributes and content adaptation. In Proc. of the European Conference on Computer Vision. Springer, London, UK, 662–679. https://doi.org/10.1007/978-3-319-46448-0_40
[36] Mucahid Kutlu, Tyler McDonnell, Yassmine Barkallah, Tamer Elsayed, and Matthew Lease. 2018. Crowd vs. expert: What can relevance judgment rationales teach us about assessor disagreement? In Proc. of the International Conference on Research & Development in Information Retrieval (SIGIR). ACM, NY, USA, 805–814. https://doi.org/10.1145/3209978.3210033
[37] Chufan Lai, Zhixian Lin, Ruike Jiang, Yun Han, Can Liu, and Xiaoru Yuan. 2020. Automatic Annotation Synchronizing with Textual Description for Visualization. In Proc. of the Conference on Human Factors in Computing Systems (CHI). ACM, NY, USA, 1–13. https://doi.org/10.1145/3313831.3376443
[38] Ann Langley. 1995. Between 'paralysis by analysis' and 'extinction by instinct'. MIT Sloan Management Review 36, 3 (1995), 63. https://doi.org/10.1016/0024-6301(95)94294-9
[39] Po-shen Lee, Jevin D. West, and Bill Howe. 2017. Viziometrics: Analyzing visual information in the scientific literature. IEEE Transactions on Big Data 4, 1 (2017), 117–129. https://doi.org/10.1109/TBDATA.2017.2689038
[40] Jianan Li, Jimei Yang, Jianming Zhang, Chang Liu, Christina Wang, and Tingfa Xu. 2020. Attribute-conditioned Layout GAN for Automatic Graphic Design. IEEE Transactions on Visualization and Computer Graphics (2020). https://doi.org/10.1109/TVCG.2020.2999335 Early Access.
[41] Halden Lin, Dominik Moritz, and Jeffrey Heer. 2020. Dziban: Balancing Agency & Automation in Visualization Design via Anchored Recommendations. In Proc. of the Conference on Human Factors in Computing Systems (CHI). ACM, NY, USA, 1–12. https://doi.org/10.1145/3313831.3376880
[42] Tie-Yan Liu. 2011. Learning to rank for information retrieval. Springer Science & Business Media, Berlin, Germany.
[43] Min Lu, Chufeng Wang, Joel Lanir, Nanxuan Zhao, Hanspeter Pfister, Daniel Cohen-Or, and Hui Huang. 2020. Exploring Visual Information Flows in Infographics. In Proc. of the Conference on Human Factors in Computing Systems (CHI). ACM, NY, USA, 1–12. https://doi.org/10.1145/3313831.3376263
[44] Yuyu Luo, Xuedi Qin, Nan Tang, and Guoliang Li. 2018. DeepEye: Towards automatic data visualization. In Proc. of the International Conference on Data Engineering (ICDE). IEEE, NY, USA, 101–112. https://doi.org/10.1109/ICDE.2018.00019
[45] David McCandless. 2009. Information is Beautiful. Collins, Scotland, UK.
[46] Andrew Vande Moere, Martin Tomitsch, Christoph Wimmer, Boesch Christoph, and Thomas Grechenig. 2012. Evaluating the effect of style in information visualization. IEEE Transactions on Visualization and Computer Graphics 18, 12 (2012), 2739–2748. https://doi.org/10.1109/TVCG.2012.221
[47] Dominik Moritz, Chenglong Wang, Greg L. Nelson, Halden Lin, Adam M. Smith, Bill Howe, and Jeffrey Heer. 2018. Formalizing visualization design knowledge as constraints: Actionable and extensible models in Draco. IEEE Transactions on Visualization and Computer Graphics 25, 1 (2018), 438–448. https://doi.org/10.1109/TVCG.2018.2865240
[48] Tamara Munzner. 2014. Visualization analysis and design. CRC Press, Boca Raton, FL, USA.
[49] Peter O'Donovan, Aseem Agarwala, and Aaron Hertzmann. 2014. Learning layouts for single-page graphic designs. IEEE Transactions on Visualization and Computer Graphics 20, 8 (2014), 1200–1213. https://doi.org/10.1109/TVCG.2014.48
[50] Jorge Poco and Jeffrey Heer. 2017. Reverse-Engineering Visualizations: Recovering Visual Encodings from Chart Images. In Computer Graphics Forum, Vol. 36. The Eurographics Association & John Wiley & Sons, Ltd., 353–363. https://doi.org/10.1111/cgf.13193
[51] Chunyao Qian, Shizhao Sun, Weiwei Cui, Jian-Guang Lou, Haidong Zhang, and Dongmei Zhang. 2020. Retrieve-Then-Adapt: Example-based Automatic Generation for Proportion-related Infographics. IEEE Transactions on Visualization and Computer Graphics (2020). https://doi.org/10.1109/TVCG.2020.3030448 Early Access.
[52] Xuedi Qin, Yuyu Luo, Nan Tang, and Guoliang Li. 2020. Making data visualization more efficient and effective: a survey. The VLDB Journal 29 (2020), 93–117. https://doi.org/10.1007/s00778-019-00588-3
[53] Annemarie Quispel, Alfons Maes, and Joost Schilperoord. 2016. Graph and chart aesthetics for experts and laymen in design: The role of familiarity and perceived ease of use. Information Visualization 15, 3 (2016), 238–252. https://doi.org/10.1177/1473871615606478
[54] Irene Reppa and Siné McDougall. 2015. When the going gets tough the beautiful get going: aesthetic appeal facilitates task performance. Psychonomic Bulletin & Review 22, 5 (2015), 1243–1254. https://doi.org/10.3758/s13423-014-0794-z
[55] David M. Rouse, Romuald Pépion, Patrick Le Callet, and Sheila S. Hemami. 2010. Tradeoffs in subjective testing methods for image and video quality assessment. In Human Vision and Electronic Imaging XV, Vol. 7527. International Society for Optics and Photonics, 75270F. https://doi.org/10.1117/12.845389
[56] Bahador Saket, Alex Endert, and Çağatay Demiralp. 2018. Task-based effectiveness of basic visualizations. IEEE Transactions on Visualization and Computer Graphics 25, 7 (2018), 2505–2512. https://doi.org/10.1109/TVCG.2018.2829750
[57] Arvind Satyanarayan, Dominik Moritz, Kanit Wongsuphasawat, and Jeffrey Heer. 2016. Vega-Lite: A grammar of interactive graphics. IEEE Transactions on Visualization and Computer Graphics 23, 1 (2016), 341–350. https://doi.org/10.1109/TVCG.2016.2599030
[58] Arvind Satyanarayan, Ryan Russell, Jane Hoffswell, and Jeffrey Heer. 2015. Reactive Vega: A streaming dataflow architecture for declarative interactive visualization. IEEE Transactions on Visualization and Computer Graphics 22, 1 (2015), 659–668. https://doi.org/10.1109/TVCG.2015.2467091
[59] Danqing Shi, Xinyue Xu, Fuling Sun, Yang Shi, and Nan Cao. 2020. Calliope: Automatic Visual Data Story Generation from a Spreadsheet. IEEE Transactions on Visualization and Computer Graphics (2020). https://doi.org/10.1109/TVCG.2020.3030403 Early Access.
[60] Xinhuan Shu, Aoyu Wu, Junxiu Tang, Benjamin Bach, Yingcai Wu, and Huamin Qu. 2020. What Makes a Data-GIF Understandable? IEEE Transactions on Visualization and Computer Graphics (2020). https://doi.org/10.1109/TVCG.2020.3030396 Early Access.
[61] Tableau. 2020. Visual Best Practices. Retrieved August 20, 2020 from https://help.tableau.com/current/blueprint/en-us/bp_visual_best_practices.htm
[62] Justin Talbot, Vidya Setlur, and Anushka Anand. 2014. Four experiments on the perception of bar charts. IEEE Transactions on Visualization and Computer Graphics 20, 12 (2014), 2152–2160. https://doi.org/10.1109/TVCG.2014.2346320
[63] Tan Tang, Renzhong Li, Xinke Wu, Shuhan Liu, Johannes Knittel, Steffen Koch, Thomas Ertl, Lingyun Yu, Peiran Ren, and Yingcai Wu. 2020. PlotThread: Creating Expressive Storyline Visualizations using Reinforcement Learning.
IEEE Transactions on Visualization and Computer Graphics (2020). https://doi.org/10.1109/TVCG.2020.3030467 Early Access.
[64] Kristi Tsukida and Maya R. Gupta. 2011. How to analyze paired comparison data. Technical Report. University of Washington, Seattle, Dept. of Electrical Engineering.
[65] Edward R. Tufte. 2001. The visual display of quantitative information. Vol. 2. Graphics Press, Cheshire, CT, USA.
[66] Sara Vaca. 2018.
Difference between Graphic Design and Data Visualization.
Mathematics and Computers in Simulation 177 (2020), 232–243. https://doi.org/10.1016/j.matcom.2020.04.031
[68] Jiang Wang, Yang Song, Thomas Leung, Chuck Rosenberg, Jingbin Wang, James Philbin, Bo Chen, and Ying Wu. 2014. Learning fine-grained image similarity with deep ranking. In
Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, NY, USA, 1386–1393. https://doi.org/10.1109/CVPR.2014.180
[69] Yong Wang, Zhihua Jin, Qianwen Wang, Weiwei Cui, Tengfei Ma, and Huamin Qu. 2019. DeepDrawing: A deep learning approach to graph drawing. IEEE Transactions on Visualization and Computer Graphics 26, 1 (2019), 676–686. https://doi.org/10.1109/TVCG.2019.2934798
[70] Yun Wang, Zhida Sun, Haidong Zhang, Weiwei Cui, Ke Xu, Xiaojuan Ma, and Dongmei Zhang. 2019. DataShot: Automatic Generation of Fact Sheets from Tabular Data. IEEE Transactions on Visualization and Computer Graphics 26, 1 (2019), 895–905. https://doi.org/10.1109/TVCG.2019.2934398
[71] Aoyu Wu, Wai Tong, Tim Dwyer, Bongshin Lee, Petra Isenberg, and Huamin Qu. 2020. MobileVisFixer: Tailoring Web Visualizations for Mobile Phones Leveraging an Explainable Reinforcement Learning Framework. IEEE Transactions on Visualization and Computer Graphics (2020). https://doi.org/10.1109/TVCG.2020.3030423 Early Access.
[72] Qing-Song Xu and Yi-Zeng Liang. 2001. Monte Carlo cross validation. Chemometrics and Intelligent Laboratory Systems 56, 1 (2001), 1–11. https://doi.org/10.1016/S0169-7439(00)00122-2
[73] Andre Ye. 2020. Easy Ways to Make Your Charts Look More Professional. Retrieved August 20, 2020 from https://towardsdatascience.com/easy-ways-to-make-your-charts-look-more-professional-9b081655eae7
[74] Mingqian Zhao, Huamin Qu, and Michael Sedlmair. 2019. Neighborhood Perception in Bar Charts. In Proc. of the Conference on Human Factors in Computing Systems (CHI). ACM, NY, USA, 1–12. https://doi.org/10.1145/3290605.3300462
[75] Nanxuan Zhao, Ying Cao, and Rynson W. H. Lau. 2018. What characterizes personalities of graphic designs?