An Aligned Rank Transform Procedure for Multifactor Contrast Tests

Abstract

Data from multifactor HCI experiments often violates the normality assumption of parametric tests (i.e., nonconforming data). The Aligned Rank Transform (ART) is a popular nonparametric analysis technique that can find main and interaction effects in nonconforming data, but leads to incorrect results when used to conduct contrast tests. We created a new algorithm called ART-C for conducting contrasts within the ART paradigm and validated it on 72,000 data sets. Our results indicate that ART-C does not inflate Type I error rates, unlike contrasts based on ART, and that ART-C has more statistical power than a t-test, Mann-Whitney U test, Wilcoxon signed-rank test, and ART. We also extended a tool called ARTool with our ART-C algorithm for both Windows and R. Our validation had some limitations (e.g., only six distribution types, no mixed factorial designs, no random slopes), and data drawn from Cauchy distributions should not be analyzed with ART-C.


Lisa A. Elkin

Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, [email protected]

Matthew Kay

School of Communication, Northwestern University, Evanston, IL, [email protected]

James J. Higgins

Department of Statistics, Kansas State University, Manhattan, KS, [email protected]

Jacob O. Wobbrock

The Information School, University of Washington, Seattle, WA, [email protected]


KEYWORDS

Statistical methods; data analysis; experiments; quantitative methods; nonparametric statistics; aligned rank transform.

Statistical procedures are a mainstay of human-computer interaction (HCI) research, particularly in the evaluation of human performance data, like times and error rates; subjective response data, like ordinal ratings and preference indications; and count data, like counts of participants, behaviors, or choices. To improve the quality of results and conclusions drawn from HCI experiments, many in the HCI community have pointed out inadequacies in the methods and tools we use to conduct our statistical analyses and have sought to improve them. For example, Jun et al. [20] and Wobbrock et al. [46] created new tools that make it easier for researchers to conduct analyses correctly, Kay et al. [24] and Robertson and Kaptein [36] introduced the community to modern statistical methods to better serve our needs, and Wobbrock et al. [46] and Kaptein et al. [22] developed statistical methods to better analyze data commonly arising in HCI.

Parametric tests such as the ANOVA and t-test are widely used within HCI, but when experiments give rise to data with residuals that are not normally distributed (i.e., nonconforming data), researchers and practitioners alike often turn to less familiar nonparametric tests. The Aligned Rank Transform (ART) [15, 16, 38] is a nonparametric procedure that can detect interaction effects in multifactor experiments. It pre-processes data with an alignment step [18] followed by a ranking step [8], and the resulting aligned-and-ranked data can be analyzed with an omnibus test, typically an ANOVA. Since its introduction to HCI by Wobbrock et al. [46] in 2011, the ART procedure has quickly become a popular technique within HCI, and many HCI venues have published papers that use the ART in their analyses (e.g., CHI [2, 13, 14], ASSETS [3], UIST [21, 37]). Wobbrock et al.’s ARTool [46] has also been used in publications in several other fields (e.g., cellular biology [7], dentistry [34], zoology [9], and cardiology [12]), and has been cited nearly 900 times thus far.

Although Wobbrock et al. [46] mentioned in passing that the original ART’s aligning and ranking procedure can be followed by contrast tests, a later R package vignette by Kay [23] indicated that contrasts involving combinations of levels across multiple factors (i.e., multifactor contrasts) cannot be conducted with the original ART without inflating Type I error rates. As it turns out, the data after aligning and ranking for the ART procedure are not properly aligned and ranked for the possible contrast tests that might be conducted. Rather, different alignment-and-ranking procedures must be carried out to enable accurate contrast tests. Our work in this paper contributes an algorithm for proper alignment-and-ranking for contrast tests, and updates open-source ART tools for Windows and R to enable them.

Specifically, we conducted a large-scale analysis to confirm the inappropriateness of using the ART to perform multifactor contrast tests. Inspired by the procedure presented in the aforementioned R package vignette [23], we devised a new procedure for nonparametric multifactor contrasts within the ART paradigm: Aligned Rank Transform Contrasts (ART-C). ART-C uses a novel aligning-and-ranking algorithm to pre-process data such that multifactor contrasts can be conducted on the resulting aligned-and-ranked data. To validate ART-C, we created 72,000 synthetic data sets simulating a range of experimental designs, sample sizes, and distribution families and used established statistical simulation procedures [1, 5, 26, 33]. We compared the Type I error rate of our new method to a t-test, and compared its statistical power to a t-test [40], Wilcoxon signed-rank test [44], Mann-Whitney U test [27], and the original ART [15, 16, 38].

Our key findings are that when used to conduct contrasts involving levels from multiple factors, the original ART’s Type I error rates are often far from their expected values, and ART’s statistical power for such contrasts is low besides. By comparison, ART-C’s Type I error rates are at their expected values and are generally not inflated. Also, for contrasts, ART-C has more statistical power than a t-test, Wilcoxon signed-rank test, Mann-Whitney U test, and the original ART procedure.

Although ART-C has numerous strengths and general applicability, it should not be used in cases where data appears to have been drawn from a Cauchy distribution (i.e., has Cauchy-distributed residuals). Additionally, the 72,000 data sets created and used in our validation cover a wide range of experimental designs but were not exhaustive. Our synthetic data was limited to two factors with at most three levels each, six types of population distributions, condition sample sizes between 8 and 40, and fully between-subjects or within-subjects designs, not mixed factorial designs. Finally, ART-C is an alignment-and-ranking procedure to be followed by a contrast test; we chose the t-test and did not validate ART-C with other tests.

To facilitate the use of our new ART-C procedure, we extended both the open-source R version and the Windows version of ARTool. Both tools are already in widespread use, and our modified versions seamlessly integrate our new ART-C procedure for multifactor contrasts. Thus, HCI researchers and others who use either tool can easily use our new versions, and no longer risk incorrectly conducting multifactor contrasts on their aligned-and-ranked data or have to break from the ART paradigm to conduct multifactor contrasts.

Our work contributes: (1) a careful elucidation of the problem of multifactor contrast testing using the ART method described by Wobbrock et al. [46], (2) a new algorithm, ART-C, to correctly align-and-rank data for multifactor contrasts within the ART paradigm, (3) validation results from simulation studies showing the correctness and statistical power of our new ART-C procedure, and (4) significant additions to the widely used ARTool R package and ARTool.exe Windows application.

We created a multifactor contrast testing procedure within the ART paradigm, called ART-C, to enable multifactor contrasts for nonconforming data. Thus, relevant prior research includes the ART procedure itself, the lack of a multifactor contrast testing method within the ART paradigm, and prior statistical contributions directed towards the HCI community.

R package: https://cran.r-project.org/package=ARTool
Code: https://dx.doi.org/10.5281/zenodo.594511
Windows tool and code: http://depts.washington.edu/acelab/proj/art/

Rank transforms have been explored in statistics for decades as a basis for nonparametric analyses (e.g., [11, 44]). Conover and Iman’s [8] popular rank transform (RT) procedure applies midranks on responses and then conducts an ANOVA on ranks. However, it was discovered for RT that while Type I error rates for main effects were reasonable, they were unreasonably high for interactions. The aligned rank transform (ART) procedure was developed in response to this problem [15, 16, 28, 29, 38], where responses are first “aligned” [18] with respect to the main effect or interaction being analyzed before midranks are applied. The upshot is that both main effects and interactions can be safely analyzed on aligned ranks using ANOVA-type procedures without inflating Type I errors. Owing to (1) the prevalence of multifactor experiments in HCI, (2) the likelihood of data arising that do not conform to the assumptions of parametric analysis, and (3) the dearth of common statistical procedures to analyze such data, the need for the ART was evident, and in 2011, a paper at CHI was published [46] that offered ARTool, a Windows application capable of performing data alignment-and-ranking otherwise tedious to conduct by hand. In the decade since, this CHI paper has garnered almost 900 citations according to Google Scholar, indicating the usefulness of ARTool.

Google Scholar citations: https://scholar.google.com/scholar?cites=16254127723353600671

However, to the best of our knowledge, no prior publication (or tool) has offered a method for conducting contrast tests in the ART paradigm, an essential missing piece particularly after detecting a statistically significant interaction. In this work, we supply this missing piece by devising ART-C and augmenting the open-source ARTool utilities for both Windows and R.

Using a single example, Kay [23] demonstrated that using the ART to conduct multifactor contrasts can lead to incorrect results; we validate this claim below. A thorough search of the statistics literature did not uncover a suitable solution to the problem of multifactor contrast testing within the ART paradigm. Here we discuss the most closely related statistics work.

ART contrast methods have been presented in the literature [1, 4, 28, 30], but the authors did not explain how or whether their methods can be used across multiple factors, showing only examples of single-factor contrasts even in the presence of significant interaction effects. Simulation studies analyzing the effectiveness of ART contrasts also only included data with a single factor [1, 4, 33].

Mansouri et al. [28] developed ART analogues of well-known contrast procedures (Tukey’s HSD, Scheffé’s method, Fisher’s least significant difference procedure) for data with two factors. However, they did not specify or demonstrate whether their methods are applicable to multifactor contrasts; nonetheless, devising ART analogues to complex contrast procedures is not our objective here. Rather, we sought to find an aligning-and-ranking procedure that can be followed by a common contrast test, especially the familiar t-test.

Peterson et al. [33] compared the effectiveness of the ART using six different statistics in the alignment process (sample mean, sample median, lightly trimmed Winsorized mean, heavily trimmed Winsorized mean, Huber M-estimator, and Harrell-Davis estimator of the median). Rather than changing the type of statistic used for alignment, our method changes the alignment process itself, distinguishing it from Peterson et al.’s work. The authors also did not test whether their methods can be used to analyze multifactor contrasts.

Statistical analyses are powerful, but only meaningful when statistical tests are used correctly. HCI researchers are well positioned to improve the quality of results drawn from statistical analyses by looking at them through a usability lens. Tools that aid researchers in using statistical tests correctly can improve the quality of our quantitative practices. Wobbrock et al. [46] argued for the importance of nonparametric tests that are as easy to use as ANOVA when they extended the ART procedure to multiple factors and provided a tool for carrying it out. Their ARTool Windows application and ARTool R package made the ART easy to use, in HCI and beyond.

Other researchers in HCI have also recognized the value of providing useful tools for statistical analysis. Jun et al. [20] provided Tea, a system in which users specify their study design and hypotheses at a high level, and then Tea figures out which tests to run, runs them, and returns the results, lowering the barrier to performing valid statistical tests. Kay et al. [24] took a different approach to user-centered statistics and looked at how using Bayesian analysis can help the HCI community accrue knowledge without having to conduct inconvenient larger studies or replication studies, which conflict with the priority the community places on novelty.

Another important aspect of usable statistics is their visibility and framing. Kay et al. [24], Wobbrock et al. [46], and Kaptein et al. [22] introduced methods from other fields into the HCI literature. Robertson and Kaptein’s [36] book, Modern Statistical Methods for HCI, introduced the HCI community to current statistical methods. None of these methods were wholly new, but framing them in an HCI context and curating them into an HCI book made them accessible to an HCI audience, which might not otherwise discover them.

Our work introduces the HCI community to our new method, ART-C, and an updated version of ARTool for both Windows and R that make multifactor contrasts on nonconforming data easier to conduct, lowering the barrier to performing correct statistical analyses within HCI and beyond.

The problem we address in this work is best explained with an example, which we refer to as our running example throughout this paper. Consider a within-subjects experiment with three factors having two levels each, $A: \{A_1, A_2\}$, $B: \{B_1, B_2\}$, and $C: \{C_1, C_2\}$, and response $Y$. There are 40 subjects, and data for each condition is drawn from a lognormal distribution. Table 1 shows the log-scale true population means for each condition, and Figure 1 shows the resulting sample data.

Table 1: Log-scale population means for each condition in our running example. [Columns: Condition (A, B, C) and Log-scale Population Mean; the cell values are not recoverable here.]

Figure 1: Sample data for each condition in our running example. Dots indicate condition means, lines connect condition means for visual comparison, and a plus indicates the mean of both connected conditions.

Suppose we analyze this data using the ART. A significant main effect of $A$ would tell us that the level of $A$, i.e., $A_1$ vs. $A_2$, has an effect on the value of $Y$. A significant $A \times B$ interaction would tell us that the effect $A$ has on the value of $Y$ is statistically significantly different for different levels of $B$. Indeed, the original ART procedure works very well for detecting main effects and interactions. But it lacks a suitable method for contrast tests. Contrast tests can tell us which levels of each factor cause these effects; they are commonly used to conduct post hoc pairwise comparisons following a statistically significant main effect or interaction, but they can also be used to compare levels of factors directly when warranted by the research question (i.e., planned contrasts). We use the term single-factor contrasts to refer to comparisons between levels within a single factor (e.g., post hoc tests following a main effect), and multifactor contrasts to refer to comparisons between combinations of levels from multiple factors (e.g., post hoc tests following a significant interaction effect).

Single-factor contrasts can be conducted safely using a t-test on data that has been aligned-and-ranked with the original ART procedure. However, conducting multifactor contrasts on data that has been aligned-and-ranked with the original ART procedure produces incorrect results.

We show this using our running example. Since we know the data is drawn from a lognormal distribution, we fit a linear mixed model (LMM) [10, 42] to log-transformed data as a baseline, and fit an ART model to the original (not log-transformed) data. Specifically, we wish to compare levels in $A$ and $B$, averaging over the levels of $C$. That is, $C$ is not directly involved in the contrasts. This is achieved in R using the following code.

[R code listing not recoverable here.]

There is no true difference between $(A_1, B_1)$ and $(A_1, B_2)$, and there is a true difference between $(A_1, B_1)$ and $(A_2, B_2)$ (Table 1). Comparing results of both tests, contrasts on the LMM produce results that match the true effects ($A_1,B_1 - A_1,B_2$: $p = .1792$; $A_1,B_1 - A_2,B_2$: $p < .0001$) (Table 2), whereas contrasts using the ART produce a Type I error (i.e., finding a significant difference when there is no true difference) ($A_1,B_1 - A_1,B_2$: $p < .0001$), and a Type II error (i.e., not finding a significant difference when there is a true difference) ($A_1,B_1 - A_2,B_2$: $p = .9144$) (Table 3, Figure 2). Both tests’ results agree for all other pairs of conditions.

Table 2: Highlighted results of contrasts conducted on a LMM of log-transformed responses, comparing levels of $A$ and $B$ in our running example. In the top row, a difference was correctly not detected between $A_1,B_1$ and $A_1,B_2$ ($p = .1792$), and there is no true difference. In the bottom row, a difference was correctly detected between $A_1,B_1$ and $A_2,B_2$ ($p < .0001$), and there is a true difference.

contrast         estimate   SE     df    t.ratio   p.value
A1 B1 - A1 B2    0.0        0.0    273   1.3       0.1792
A1 B1 - A2 B2    -0.5       0.0    273   -165.9    < .0001

Table 3: Highlighted results of contrasts conducted using ART, comparing levels of $A$ and $B$ in our running example. In the top row, a difference was incorrectly detected between $A_1,B_1$ and $A_1,B_2$ ($p < .0001$), but there is no true difference. In the bottom row, a difference was incorrectly not detected between $A_1,B_1$ and $A_2,B_2$ ($p = .9144$), but there is a true difference.

contrast         estimate   SE     df    t.ratio   p.value
A1 B1 - A1 B2    -43.2      3.8    273   -11.3     < .0001
A1 B1 - A2 B2    0.4        3.8    273   0.1       0.9144

Figure 2: Sample data for $(A_1, B_1)$, $(A_1, B_2)$, and $(A_2, B_2)$. Dots indicate condition means. Contrasts using the ART procedure found a difference between $A_1,B_1$ and $A_1,B_2$ even though there is not a true difference; also, no difference was found between $A_1,B_1$ and $A_2,B_2$ even though there is a true difference.

Obviously, we cannot judge the validity of a statistical procedure on one example alone. Therefore, we assessed the correctness of the original ART procedure applied to multifactor contrasts on 72,000 synthetic data sets representing several different experimental designs, and confirmed the patterns found in our running example. (More details on our simulation procedure are given below.) Our results show that using the original ART procedure to conduct multifactor contrasts on data drawn from lognormal, Cauchy, or exponential distributions produces inflated Type I error rates (Figure 3), and that conducting contrasts with the original ART procedure on data drawn from any distribution produces Type I error rates that are far from their expected value ($\alpha = .05$).

We developed ART-C, a procedure to conduct nonparametric multifactor contrasts within the ART paradigm. Using our new alignment process, ART-C aligns data, then ranks it with ascending midranks, and conducts multifactor contrasts comparing combinations of levels of factors for which the data was aligned. Thus, the process is much like the original ART procedure, but the data is aligned not for main effects and interactions, but for desired contrast tests.

Figure 3: ART Type I error rates (gray) compared to ART-C Type I error rates (teal), with one panel each for the lognormal, normal, Cauchy, t(3), exponential, and double exponential distributions (horizontal axis: observed Type I error rate). Each data point represents the observed Type I error rate of one "design," explained below. Values closer to $\alpha = .05$ are better, indicating greater correctness. ART-C Type I error rates are closer to .05 for all distributions. Figure annotations: using ART to conduct multifactor contrasts can result in Type I error rates that are inflated and/or far from $\alpha = .05$; conducting multifactor contrasts with ART-C typically results in low Type I error rates that are close to $\alpha = .05$; both ART and ART-C inflate Type I error rates on data drawn from a Cauchy distribution.

Data must be aligned and ranked for each set of factors whose levels will be compared. In our running example, we found an $A \times B$ interaction effect. Response $Y$ must be aligned and ranked to compare combinations of levels of $A$ and $B$. Had we also found an $A \times C$ interaction effect, response $Y$ would have to be aligned and ranked separately to compare combinations of levels of $A$ and $C$.

Like the original ART procedure, ART-C can be used on nonconforming data, i.e., data for which the residuals do not need to be normally distributed. (Note, however, that the original ART procedure has been shown to inflate Type I error rates on heteroscedastic data [35].) In this work, we validated ART-C on data with continuous responses.

In this section, we walk through the ART-C procedure with an example, similar to our running example, with three factors: $A$ with levels $A_i$, $i = 1, \ldots, a$; $B$ with levels $B_j$, $j = 1, \ldots, b$; and $C$ with levels $C_k$, $k = 1, \ldots, c$; and response $Y$. We present the ART-C procedure in four steps:

Step 1. Prepare data.

To prepare data for ART-C:
1. Concatenate the factors of interest to create a new factor. For example, when conducting contrasts on $A$ and $B$, we concatenate $A$ and $B$ and create a new factor labeled $AB$. For any response $Y$ for which $A$ has level $A_i$ and $B$ has level $B_j$, $AB$ has level $AB_{ij}$.
2. Remove original copies of the factors involved (here $A$ and $B$).
3. Keep as-is any factors not involved in the desired contrasts (here $C$).

Figure 4: ART statistical power (gray) compared to ART-C statistical power (teal), with one panel each for the lognormal, normal, exponential, Cauchy, t(3), and double exponential distributions (horizontal axis: observed statistical power). Each data point represents the observed statistical power of one "design," explained below. Larger values are better, indicating greater power. ART-C has greater power for all distributions except Cauchy. Figure annotation: ART has lower power than our new method, ART-C.

Step 2. Compute aligned response $Y'$. Regardless of which factors were concatenated in Step 1.1 and which original factors were removed in Step 1.2, $Y_{ijk}$ denotes all responses $Y$ where $A$ had level $A_i$, $B$ had level $B_j$, and $C$ had level $C_k$ before Step 1.1 was completed. Sometimes, the levels of all factors need to be taken into account when aligning $Y_{ijk}$: $\overline{AB_{ij}C_k}$ denotes the mean of all responses where the new concatenated factor $AB$ has level $AB_{ij}$ and factor $C$ has level $C_k$. Other times, we only care about the levels of the concatenated factor: $\overline{AB_{ij}}$ denotes the mean of all responses where $AB$ has level $AB_{ij}$ regardless of the level of $C$. $\mu$ denotes the grand mean (i.e., the mean of all responses).

In our running example, there are three possible types of contrasts: three-factor contrasts, two-factor contrasts, and single-factor contrasts. We demonstrate the ART-C aligning formula on all three types of contrasts and then present the general case. As an example, Table 4 shows a small subset of sample calculations for two-factor contrasts in a three-factor model.

Three-factor contrasts in a three-factor model.

To align response $Y_{ijk}$ for contrasts between levels of factors $A$, $B$, and $C$, compute the following, where an overlined term denotes the mean of all responses at the indicated level or combination of levels:

$$Y'_{ijk} = Y_{ijk} - \overline{ABC_{ijk}} + \overline{ABC_{ijk}} - \mu = Y_{ijk} - \mu.$$

Two-factor contrasts in a three-factor model. To align response $Y_{ijk}$ for contrasts between levels of factors $A$ and $B$, compute:

$$Y'_{ijk} = Y_{ijk} - \overline{AB_{ij}C_k} + \overline{AB_{ij}} - \mu$$

Single-factor contrasts in a three-factor model. The focus of this work is not on single-factor contrasts since the original ART alignment procedure can be used for single-factor contrasts, but it is worth noting that our method is mathematically equivalent to the ART in the single-factor case. To align response $Y_{ijk}$ for contrasts between levels of factor $A$, compute:

$$Y'_{ijk} = Y_{ijk} - \overline{A_iB_jC_k} + \overline{A_i} - \mu$$

General case: $M$-factor contrasts in an $N$-factor model. We need more complex notation to describe the general case. In the general case, we align response $Y_{ij \ldots n}$ for contrasts between levels of $M$ factors in an $N$-factor model. In the example above, we named our factors $A$, $B$, and $C$. Here, we name them $X_1, X_2, \ldots, X_N$ and denote level $j$ of factor $X_i$ as $X_{i,j}$ (e.g., level 2 of factor $X_1$ is denoted as $X_{1,2}$). In Step 1.1, we concatenated the $M$ factors for which we were aligning to create a new factor $X_1X_2 \ldots X_M$. The level of factor $X_1X_2 \ldots X_M$ that was created by concatenating $X_{1,i}, X_{2,j}, \ldots, X_{M,m}$ is denoted as $(X_1X_2 \ldots X_M)_{ij \ldots m}$. In Step 1.2, we removed the original copies of the $M$ factors concatenated in Step 1.1. After Step 1.3, there are $N - M$ other (not concatenated) factors in the model, denoted $X_{M+1}, X_{M+2}, \ldots, X_N$. Thus, $X_{M+1,m+1}$ denotes a level of factor $X_{M+1}$; $X_{M+2,m+2}$ denotes a level of factor $X_{M+2}$; and $X_{N,n}$ denotes a level of factor $X_N$. With this notation in hand, to align the data in the general case, we compute:

$$Y'_{ij \ldots n} = Y_{ij \ldots n} - \overline{(X_1X_2 \ldots X_M)_{ij \ldots m}\, X_{M+1,m+1}\, X_{M+2,m+2} \ldots X_{N,n}} + \overline{(X_1X_2 \ldots X_M)_{ij \ldots m}} - \mu$$

For example, with this notation, our "two-factor contrasts in a three-factor model" formula would be:

$$Y'_{ijk} = Y_{ijk} - \overline{(X_1X_2)_{ij}\, X_{3,k}} + \overline{(X_1X_2)_{ij}} - \mu$$

Step 3. Compute ranked response $Y''$. Apply midranks to all values $Y'$ in ascending order to create aligned-and-ranked responses $Y''$ (see example in Table 4).
That is, the smallest $Y'$ is given rank $Y'' = 1$, the next smallest $Y'$ is given rank $Y'' = 2$, and so on until all $Y'$ values have been assigned a rank. If there is a tie among $k$ values, the mean of the next $k$ ranks that would have been assigned is used as the rank for all $k$ values (i.e., midranks). For example, if there is a tie between the third and fourth smallest $Y'$, they would both be assigned rank $Y'' = (3 + 4) / 2 = 3.5$. This is a standard application of ascending midranks to data.

Step 4. Conduct contrasts on $Y''$. Contrast tests as post hoc tests are justified when the original ART procedure resulted in significant main effects or interactions. However, as stated above, contrasts do not need to follow significant omnibus tests if warranted by the research question (i.e., planned contrasts). Also, note that conducting multiple post hoc tests should be accompanied by a p-value correction for multiple comparisons (e.g., with the Bonferroni correction [43], Holm’s sequential Bonferroni procedure [19], or Tukey’s HSD test [41], to name a few). In the ART-C procedure, conduct contrasts using the full factorial model comprising all factors that remain after Step 1.3, but only interpret results of comparisons between levels of the concatenated factor created in Step 1.1; comparisons between levels of the non-concatenated factor(s) are meaningless.

Returning to our running example of conducting contrasts to compare levels of $A$ and $B$, we have factors $AB$ and $C$, and have computed $Y''$ aligned-and-ranked for factor $AB$. We would therefore conduct contrasts using a full-factorial model with factors $AB$ and $C$ (e.g., $Y \sim AB * C$). We ignore the omnibus test results for this model, but we follow it with contrasts among desired levels of $AB$. Contrasts that would involve $C$ are meaningless.

Table 4: Sample calculations to compute aligned response $Y'$ and aligned-and-ranked response $Y''$ for two-factor contrasts in a three-factor model using the ART-C procedure. Only 4 of 8 conditions are shown here for considerations of space. [Columns: AB, C, Y, Y′, Y″. The level subscripts, most Y values, and the intermediate means are not recoverable here; the final term subtracted in every row is 5. The surviving results for the eight rows are listed below.]

Row   Y'    Y''
1     0     5.5
2     -2    1
3     -1    3
4     -1    3
5     2     7
6     0     5.5
7     3     8
8     -1    3

In this section, we describe how we validated our ART-C procedure for multifactor contrasts. Specifically, we examined Type I error rates and statistical power. We approached our validation in ways consistent with simulation-based validations from the statistics literature [1, 5, 26, 33].
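Before turning to the validation details, Steps 1 to 3 of the ART-C procedure described above can be sketched in code. This is a minimal illustration in Python using only the standard library, not the ARTool implementation; the function names `midranks` and `art_c`, the row-of-dictionaries data layout, and the four-row data set are assumptions made for this example. Step 4 (fitting the full factorial model and conducting the contrasts) is not shown.

```python
from collections import defaultdict
from statistics import fmean


def midranks(values):
    """Ascending midranks: tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        # (Exact equality is used; real data may need a tolerance.)
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = ((start + 1) + (end + 1)) / 2  # mean of the 1-based ranks in the run
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def art_c(rows, response, contrast_factors, other_factors):
    """Return (aligned Y', ranked Y'') for contrasts among contrast_factors."""
    # Step 1: concatenate the factors of interest into one new factor;
    # the remaining factors are kept as they are.
    concat = [tuple(r[f] for f in contrast_factors) for r in rows]
    cells = [(c,) + tuple(r[f] for f in other_factors) for c, r in zip(concat, rows)]
    y = [float(r[response]) for r in rows]

    def group_means(keys):
        groups = defaultdict(list)
        for key, value in zip(keys, y):
            groups[key].append(value)
        return {key: fmean(vals) for key, vals in groups.items()}

    cell_mean = group_means(cells)      # e.g., mean for AB_ij C_k
    concat_mean = group_means(concat)   # e.g., mean for AB_ij
    grand_mean = fmean(y)               # mu

    # Step 2: Y' = Y - (cell mean) + (concatenated-level mean) - (grand mean).
    aligned = [v - cell_mean[k] + concat_mean[c] - grand_mean
               for v, k, c in zip(y, cells, concat)]
    # Step 3: ascending midranks of the aligned responses.
    return aligned, midranks(aligned)


# Hypothetical toy data: two AB levels crossed with two C levels.
rows = [
    {"A": "A1", "B": "B1", "C": "C1", "Y": 1},
    {"A": "A1", "B": "B1", "C": "C2", "Y": 3},
    {"A": "A2", "B": "B2", "C": "C1", "Y": 6},
    {"A": "A2", "B": "B2", "C": "C2", "Y": 10},
]
aligned, ranked = art_c(rows, "Y", ["A", "B"], ["C"])
print(aligned)  # [-3.0, -3.0, 3.0, 3.0]
print(ranked)   # [1.5, 1.5, 3.5, 3.5]
```

In this toy data every cell holds a single response, so the cell mean cancels the response and each aligned value reduces to the concatenated-level mean minus the grand mean (2 − 5 and 8 − 5). Passing all three factors as `contrast_factors` with no `other_factors` reproduces the three-factor case, where the aligned response is simply the response minus the grand mean.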

To create our 72,000 synthetic data sets, we drew responses as random samples from known populations. We use the term "condition" to refer to combinations of levels from any number of factors, but it is important to note that samples were only drawn for conditions with one level from each factor. Our synthetic data sets varied according to the following four properties:

• layout: The number of factors and number of levels per factor in the data set. Values: two factors with two levels each (2 × 2), […]
• population distribution: Type of distribution from which samples in the data set were drawn. Specific distributions (see Table 5) were chosen because they represent data frequently found in HCI studies (e.g., normal, lognormal, exponential), or because they are commonly used in simulation studies in statistics due to their heavy tails [1, 5] (e.g., Cauchy, t with 3 degrees of freedom, double exponential). Note that the mean is a type of location and the standard deviation is a type of scale; for consistency, we use the general terms "location" and "scale."
• condition sample size: The number of data points randomly sampled from a population for each condition. Values: 8, 16, 24, 32, and 40, selected because they represented typical sample sizes in HCI.
• between or within subjects: In a between-subjects design, each subject contributes one response to the data set, and the number of responses is equal to the number of subjects. In a within-subjects design, each subject contributes one response in each condition, and the number of subjects is equal to the condition sample size. Values: between and within. Mixed factorial designs were left for future work.

Table 5: Population distributions and their parameters.

Distribution          Parameters
Normal                Mean, Standard Deviation
Lognormal             Log Mean, Log Standard Deviation
Exponential           Rate
Cauchy                Location, Scale
t(3)                  Location, Scale
Double Exponential    Location, Scale

Our running example has a 2 × 2 × 2 layout […]

Step 1. Determine latent location.

We begin by determining a latent location for each condition ($\mu^*_c$), which will undergo several transformations before being used as parameter values in a population distribution. When conditions have equal population locations, the latent location is fixed at 0 (Equation (1a)). Otherwise, its value is sampled from a standard normal distribution (Equation (1b)). Scale is always equal to 1 in our analyses (Equation (2)).

$\mu^*_c = 0$    (1a)
Used when creating data to measure Type I error rates.

$\mu^*_c \sim N(0, 1)$    (1b)
Used when creating data to measure statistical power.

$\sigma_c = 1$    (2)

In Equations (1a) and (1b), $\mu^*_c$ is the latent location for condition $c$, and in Equation (2), $\sigma_c$ is the scale for condition $c$.

Step 2. Add random intercepts per subject.

When generating within-subjects data, each subject is assigned a unique random offset ($\beta_s$) sampled from a normal distribution with mean 0 and standard deviation $SD$ (Equation (3b)), where $SD$ is randomly chosen from { . , . , . } (Equation (3a)) and is the same value for the entire data set. These values were chosen to represent a reasonable ratio between within-subject variance and between-subject variance. We now update our latent mean notation to ($\mu^*_{c,s}$) to represent the latent mean for each combination of condition and subject, and a subject's random offset is added to all of its associated latent locations (Equation (3c)). For consistency, we use this notation for between-subjects data as well, but with a random per-subject offset of 0 (Equation (3d)).

$SD \sim Random( . , . , . )$    (3a)

$\beta_s \sim N(0, SD)$    (3b)

$\mu^*_{c,s} = \mu^*_c + \beta_s$    (3c)
Used when generating within-subjects data.

$\mu^*_{c,s} = \mu^*_c + 0$    (3d)
Used when generating between-subjects data.

In Equation (3b), $\beta_s$ is the random offset for subject $s$, and in Equation (3c), $\beta_s$ is added to all latent locations $\mu^*_c$ associated with subject $s$, resulting in a new latent location $\mu^*_{c,s}$ for condition $c$ and subject $s$. In Equation (3d), $\mu^*_c$ is simply relabeled $\mu^*_{c,s}$ for consistency; however, each subject still has the same latent location $\mu^*_c$ for condition $c$.

Step 3. Transform latent location with inverse link function.

Latent location is currently expressed as a linear model, but some distributions' parameters must be expressed as a function of a linear model. This function, called the inverse link function ($g$), transforms the latent location ($\mu^*$) into the appropriate location for the distribution ($\mu$) (Equation (4)).

$\mu_{c,s} = g_\mu(\mu^*_{c,s})$    (4)

In Equation (4), $g_\mu$ is the location inverse link function, and it transforms a latent location $\mu^*_{c,s}$ into a location $\mu_{c,s}$.

All population distributions (Table 5) use the identity inverse link function (Equation (5a)) except for the exponential distribution (Equation (5b)).

$g_{\mu,id}(x) = x$    (5a)

$g_{\mu,exp}(x) = \exp(x)$    (5b)

In Equation (5a), $g_{\mu,id}$ is the identity location inverse link function. In Equation (5b), $g_{\mu,exp}$ is the location inverse link function used by the exponential distribution.

Step 4. Generate data.

Response $Y_{c,s}$ is sampled from the relevant distribution, represented here by the generic function $Distribution(x, y)$ (Equation (6)). The exponential distribution only has a single parameter ($rate = 1 / location$) and follows Equation (7):

$Y_{c,s} \sim Distribution(\mu_{c,s}, \sigma_{c,s})$    (6)

$Y_{c,s} \sim Exp(1 / \mu_{c,s})$    (7)

Example.

We illustrate this process by generating response $Y_{5,2}$ for condition 5 and subject 2 in our running example.

Step 1. Since our example does not have equal population locations, we use Equations (1b) and (2).

$\mu^*_5 \sim N(0, 1) = .75$    (1b)

$\sigma_5 = 1$    (2)

Step 2. Our example uses a within-subjects design, so we add per-participant offsets. Note that $SD$ would have already been chosen for condition 1. The same value would be used here.

$SD \sim Random( . , . , . ) = .$    (3a)

$\beta_2 \sim N(0, . ) = .$    (3b)

$\mu^*_{5,2} = . + . = .85$    (3c)

Step 3. We use the inverse location link function for the lognormal distribution.

$\mu_{5,2} = g_{\mu,id}( . ) = .85$    (5a)

Step 4. Finally, we sample a lognormal distribution with log mean $\mu_{5,2}$ and log standard deviation $\sigma_{5,2}$ to get response $Y_{5,2}$.

$Y_{5,2} \sim Lognormal( . , ) = .27$    (6)
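Steps 1 through 4 can be sketched in a few lines of code. The following is a minimal Python sketch under stated assumptions, not the authors' R implementation: the function name `generate_data_set` and all of its parameters are hypothetical, only three of the six population distributions are covered, and the candidate values for $SD$ in Equation (3a) must be supplied by the caller through the `sd_choices` argument.

```python
import math
import random


def generate_data_set(distribution, n_conditions, sample_size, within,
                      equal_locations, sd_choices, seed=None):
    """Return (condition, subject, response) rows following Steps 1 to 4.

    Illustrative sketch only; names and signature are assumptions.
    """
    rng = random.Random(seed)

    # Step 1: latent location per condition (Eq. 1a or 1b); scale is 1 (Eq. 2).
    if equal_locations:
        mu_star = [0.0] * n_conditions
    else:
        mu_star = [rng.gauss(0.0, 1.0) for _ in range(n_conditions)]
    sigma = 1.0

    # Step 2: random intercept per subject, within-subjects only (Eq. 3a, 3b).
    beta = []
    if within:
        sd = rng.choice(sd_choices)  # one SD for the entire data set
        beta = [rng.gauss(0.0, sd) for _ in range(sample_size)]

    rows = []
    for c in range(n_conditions):
        for i in range(sample_size):
            if within:
                subject = i                      # same subjects in every condition
                latent = mu_star[c] + beta[i]    # Eq. 3c
            else:
                subject = c * sample_size + i    # a new subject for every response
                latent = mu_star[c]              # Eq. 3d (offset of 0)

            # Steps 3 and 4: apply the inverse link, then sample the response.
            if distribution == "exponential":
                location = math.exp(latent)            # Eq. 5b
                y = rng.expovariate(1.0 / location)    # Eq. 7, rate = 1 / location
            elif distribution == "lognormal":
                y = rng.lognormvariate(latent, sigma)  # Eq. 5a, then Eq. 6
            elif distribution == "normal":
                y = rng.gauss(latent, sigma)           # Eq. 5a, then Eq. 6
            else:
                raise ValueError(f"unsupported distribution: {distribution}")
            rows.append((c, subject, y))
    return rows


# Placeholder SD candidates; they are not the values used in the validation.
rows = generate_data_set("lognormal", n_conditions=8, sample_size=16,
                         within=True, equal_locations=False,
                         sd_choices=[0.5, 1.0], seed=1)
```

Setting `equal_locations=True` yields data for measuring Type I error rates, and `equal_locations=False` yields data for measuring power, mirroring Equations (1a) and (1b).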

To best explain our testing procedure, we first introduce some definitions:

• An x-factor contrast is a contrast between two conditions composed of one level each from x factors.
• contrast size is the x in x-factor contrast.
• A design is a unique combination of a layout, population distribution, condition sample size, between or within subjects, and contrast size.
• A trial consists of one contrast test result, and all possible contrasts were conducted. There were:
  – 8 trials in a data set with a 2 × 2 layout
  – 42 trials in a data set with a 3 × 3 layout
  – 49 trials in a data set with a 2 × 2 × 2 layout

In total, there were 24,000 × (8 + 42 + 49) = 2,376,000 trials. There were 1,094 data sets out of 72,000 data sets (1.5%) with at least one trial for which ART-C did not converge. All of these data sets were within-subjects and were therefore modeled as linear mixed models using the lmer function in R, for which convergence issues are common. These data sets were removed.

By definition, ART-C is an aligning and ranking procedure followed by a contrast testing method; our validation used a t-test. Since we were validating a contrast testing method and not investigating the cause of a significant omnibus result, we did not correct for multiple comparisons. The R programming language was used to generate all data sets, conduct all contrasts, and analyze the results. All R code is included as supplementary material and is available online for replication and extension.

Following a common approach in statistics literature [1, 5, 26, 31–33], we validated our method on two metrics: Type I error rate and power.

A significance level (𝛼) represents the probability of a Type I error (false positive) and is used as a threshold to reject a null hypothesis (𝑝 < 𝛼). Many readers will recognize that 𝛼 is typically set to .05, although other values may be used. A large-scale simulation shows a method is correct when the proportion of tests in which a true null hypothesis was rejected (observed Type I error rate) is close to the significance level. That is, the proportion of tests in which 𝑝 was less than 𝛼 should be close to 𝛼.

For example, 5,516 trials were conducted on data from a 2 × 2 × 2 […] design. With 𝛼 = .05, ART-C found a significant difference in 265 trials, resulting in a 265 / 5,516 = .048 observed Type I error rate, which is very close to the 𝛼 = .05 significance threshold, indicating the correctness of ART-C for this design.

Each data point in the following results represents the observed Type I error rate for one design. All population locations were set to 0 (Equation (1a)), and thus the null hypothesis that there is no true difference between conditions' population locations is true for all trials. As is common practice in statistics, we include observed Type I error rates for the t-test as a baseline comparison [31, 32].

Contrasts conducted with ART-C on designs with a combination of contrast size one, Cauchy population distribution, and a 3 × 3 or 2 × 2 × 2 layout had high observed Type I error rates (𝑀 = .373, 𝑆𝐷 = .076), while t-test contrasts did not (𝑀 = .025, 𝑆𝐷 = .004). Those Cauchy designs were considered outliers and were not included in the remainder of our analysis of Type I error rates; we address this further in our discussion.

Table 6: Mean Type I error rates (and standard deviations) for ART-C and, for comparison, the t-test, grouped by contrast size and layout over all designs, excluding designs with a Cauchy population distribution. Values closer to .05 are better, indicating greater correctness. ART-C has comparable Type I error rates to the t-test, but as our additional results show, much greater power.

[Table 6 values are not recoverable from this copy. Columns: Contrast Size, Layout, ART-C, t-test.]

Observed Type I error rates for ART-C and the t-test on remaining designs were clustered around .05: ART-C (𝑀 = .050, 𝑆𝐷 = .009) and t-test (𝑀 = .048, 𝑆𝐷 = .012), and design properties do not appear to have an effect on observed Type I error rates, confirming the robustness of the ART-C procedure. Observed Type I error rates for all designs are included as supplementary material (see "Type_I_all_designs.csv"). Table 6 and Figure 5 illustrate both methods' observed Type I error rates, closely clustered around .05, and show egregious Type I error rates for ART-C with a Cauchy distribution.

Figure 5: ART-C (teal) and t-test (gray) observed Type I error rates by contrast size and layout. Designs with a Cauchy population distribution are shown in red. Each point represents the observed Type I error rate for one design. Values closer to .05 are better, indicating greater correctness. ART-C has comparable Type I error rates to the t-test, but as our additional results show, much greater power.

Statistical power is the probability of rejecting a false null hypothesis (detecting a true difference) given a particular significance level. Observed power is the proportion of tests in which a false null hypothesis was rejected. Unlike for Type I errors, there is no expected value to compare observed power to; instead, we followed common practices in statistics [1, 5, 26] and compared to other methods, specifically the t-test, Mann-Whitney U test for between-subjects designs, and Wilcoxon signed-rank test for within-subjects designs.

For example, ART-C contrast tests conducted on data with a 2 × 2 × 2 […] = .76 observed power.

Population locations for each condition were randomly sampled from a standard normal distribution (Equation (1b)). Although we cannot guarantee these locations were different, there is an infinitesimally small chance they were the same, and we therefore assume that the null hypothesis of no difference between condition population locations is false. In the following results, a significance level of .05 was used, and each data point represents the observed power of one design.

When averaged over all designs, our results show that ART-C had the highest observed power (𝑀 = .598, 𝑆𝐷 = .143), followed by the Mann-Whitney U test / Wilcoxon signed-rank test (𝑀 = .521, 𝑆𝐷 = .149), and finally the t-test (𝑀 = .461, 𝑆𝐷 = .149). Observed powers for all designs are included as supplementary materials (see "power_all_designs.csv").

population distribution and condition sample size were the only design properties that had a large impact on observed power. ART-C had higher observed power than the t-test for all population distributions other than the normal distribution, for which it was the same, and had higher observed power than the Mann-Whitney U test and Wilcoxon signed-rank test for all population distributions (Table 7, Figure 6).

Table 7: Mean statistical power (and standard deviations) for ART-C, t-test, Mann-Whitney U test (M-W) / Wilcoxon signed-rank test (W.S.R.), and ART, grouped by population distribution. Higher values are better, representing more statistical power.

                Lognormal    Normal       Exponential  t(3)         Double Exp.  Cauchy
ART-C           .69 (.10)    .66 (.09)    .66 (.11)    .58 (.11)    .60 (.10)    .41 (.13)
t-test          .46 (.11)    .66 (.09)    .52 (.12)    .50 (.10)    .56 (.10)    .07 (.02)
M-W / W.S.R.    .58 (.12)    .59 (.12)    .56 (.12)    .52 (.13)    .54 (.12)    .34 (.12)
ART             .62 (.14)    .49 (.18)    .58 (.16)    .42 (.18)    .44 (.19)    .51 (.16)

condition sample size affected power in that ART-C had higher observed power than the t-test and Mann-Whitney U test / Wilcoxon signed-rank test regardless of condition sample size, but all tests' observed power increased as condition sample size increased, which is expected.

Figure 6: Mean statistical power by population distribution for ART-C (teal), t-test (gray), and Mann-Whitney U test / Wilcoxon signed-rank test (black outline). Higher values are better, indicating greater power. ART-C compares most favorably in all cases except when compared to a t-test for normal population distributions. Each point represents observed statistical power from one design.

Contrasts conducted with ART-C had lower observed Type I error (𝑀 = .067, 𝑆𝐷 = .072) than contrasts conducted on data aligned-and-ranked using the original ART procedure (𝑀 = .121, 𝑆𝐷 = .174), and higher observed power (ART-C: 𝑀 = .598, 𝑆𝐷 = .143 vs. ART: 𝑀 = .511, 𝑆𝐷 = .182).

When separated by population distribution, ART-C had lower observed Type I error rates than ART for the lognormal, exponential, and Cauchy population distributions (Table 8, Figure 3), but ART-C observed Type I error rates were closer to the significance level (𝛼 = .05) than ART's for all population distributions, indicating that ART-C is more correct. ART-C also had higher observed power than ART for all population distributions except Cauchy (Table 7, Figure 4).
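Both metrics reported in this section are proportions of trials in which 𝑝 < 𝛼: the observed Type I error rate when the null hypothesis is true, and observed power when it is false. The Python sketch below makes that computation explicit and reproduces the 265-of-5,516 figure quoted earlier. The helper names `rejection_rate` and `binomial_se` are hypothetical, and the standard-error helper is an added illustration of the sampling noise one would expect around 𝛼, not a criterion stated by the authors.

```python
import math


def rejection_rate(p_values, alpha=0.05):
    """Proportion of trials whose p-value falls below alpha."""
    if not p_values:
        raise ValueError("need at least one trial")
    return sum(p < alpha for p in p_values) / len(p_values)


def binomial_se(rate, n_trials):
    """Standard error of a proportion estimated from n_trials trials."""
    return math.sqrt(rate * (1.0 - rate) / n_trials)


# Counts quoted in the text: 265 rejections out of 5,516 trials.
observed_type_i = 265 / 5516
print(round(observed_type_i, 3))  # 0.048

# Sampling noise expected around alpha = .05 with 5,516 independent trials.
print(binomial_se(0.05, 5516))
```

The same `rejection_rate` function gives observed power when it is applied to p-values from data sets generated with unequal population locations.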

Table 8: Mean Type I error rates (and standard deviations) for ART-C and ART, grouped by population distribution. Values closer to 𝛼 = .05 are better, indicating greater correctness.

          Lognormal      Normal         Exponential    t(3)           Double Exp.    Cauchy
ART-C     .054 (.015)    .049 (.008)    .051 (.007)    .049 (.006)    .049 (.007)    .14 (.154)
ART       .141 (.096)    .024 (.023)    .065 (.048)    .033 (.015)    .026 (.023)    .425 (.209)

contrast size also had an interesting effect on power. Observed power with ART-C was highest for single-factor contrasts, followed by three-factor contrasts, and then two-factor contrasts, but the differences were small. However, ART's power decreased as contrast size increased, and the differences were much larger (Table 9, Figure 7). Recall that the alignment formulas for ART and ART-C become mathematically equivalent in the single-factor case.

Table 9: Mean statistical power (and standard deviations) for ART-C and ART, grouped by contrast size. Higher values are better, indicating greater power.

          1-Factor Contrasts    2-Factor Contrasts    3-Factor Contrasts
ART-C     .620 (.150)           .580 (.140)           .590 (.130)
ART       .620 (.150)           .460 (.150)           .340 (.180)

Figure 7: ART-C (teal) and ART (gray) observed statistical power by contrast size. Each point represents observed statistical power for one design. Higher values are better, indicating greater power. ART-C power is higher for all contrast sizes and increases with contrast size compared to ART, which decreases. Both methods are equivalent when conducting single-factor contrasts.

Recall in our running example that we conducted multifactor contrasts on levels of factors 𝐴 and 𝐵, and contrasts conducted with ART produced a Type I error and a Type II error (see Table 3, Figure 2), while contrasts conducted with a linear mixed model (LMM) on log-transformed data resulted in correct conclusions (see Table 2) that agreed with the ground truth (see Table 1). Contrasts conducted on the same data with ART-C agree with the LMM results and the ground truth in finding a difference between $A_1, B_1$ and $A_2, B_2$, and not finding a difference between $A_1, B_1$ and $A_1, B_2$.

Table 10: Highlighted results of contrasts conducted using ART-C, comparing levels of 𝐴 and 𝐵 in our running example. In the top row, a difference was correctly not detected between $A_1, B_1$ and $A_1, B_2$ ($p = .6758$), and there is not a true difference. In the bottom row, a difference was detected correctly between $A_1, B_1$ and $A_2, B_2$ ($p < .0001$), and there is a true difference.

contrast         estimate    SE     df     t.ratio    p.value
A1,B1 - A1,B2       1.5      3.7    273      0.4      0.6758
A1,B1 - A2,B2     -89.9      3.7    273    -24.5      < .0001

Thus, taken as a whole, our results show that ART-C has appropriate Type I error rates clustered around 𝛼 = .05, except for data sampled from Cauchy distributions, for which ART-C should not be used. Furthermore, ART-C has high statistical power, outperforming the t-test, Mann-Whitney U test, Wilcoxon signed-rank test, and ART. These results show that ART-C is a correct and powerful procedure for use within the ART paradigm for conducting contrast tests within or across levels of multiple factors.

To make ART-C widely available to the community, we updated the existing open-source ARTool.exe Windows application (see footnote 2) [46] and the ARTool R package (see footnote 1) to include ART-C.

The ARTool.exe Windows application was released as an open-source tool in 2011 [46] to facilitate the aligning and ranking of data for analysis using the ART procedure. We extended this open-source tool to include our ART-C procedure for multifactor contrasts. Users can now indicate that they want contrasts with a checkbox (Figure 8 (top)), which then offers them a separate dialog box (Figure 8 (bottom)) from which they can select the factors whose levels are involved in their desired contrast test. ARTool then uses the ART-C procedure to produce aligned and ranked output suitable for analysis. See footnote 3 for the link to our updated ARTool Windows application.

The current “ARTool” R package makes it easy to conduct nonparametric tests of main effects and interactions using the ART procedure. A single function call aligns and ranks data for each fixed effect in a formula f provided by the user. The result is an ART model m that keeps a copy of formula f and the data. Given m, another function in ARTool, anova, runs multiple ANOVAs behind the scenes, one for each fixed effect in f, and returns the results of each test. In this work, we have added a new function, art.con, that uses our ART-C procedure to conduct multifactor contrast tests. Given the same model m and a contrast formula $f_c$, the ART-C procedure is used to align and rank the data saved in m for the contrasts specified in $f_c$. It then parses the formula f saved in m, conducts the contrasts, and returns the results.

In our running example, we first conducted 𝐴 × 𝐵 contrasts with ART, which, of course, is incorrect given ART's propensity for Type I errors. Now, we can correctly use ART-C to perform these contrasts. Figure 9 shows how we would use ART-C to conduct contrasts correctly.

See footnote 1 for the link to our updated R package and footnote 2 for the link to the code.

Figure 8: Top: Our ARTool.exe tool with a “Want contrasts” checkbox. Table 5 from Higgins et al. [15] is being aligned and ranked. Bottom: Our new tool for specifying multifactor contrasts. Two factors, Moisture and Fertilizer, from the same Table 5, each have levels 1-4. Selecting both factors would allow, e.g., a comparison of (Moisture 2, Fertilizer 3) vs. (Moisture 4, Fertilizer 1).

Figure 9: Screenshot using ART-C to conduct 𝐴 × 𝐵 contrasts in our running example in R. The anova call first would produce omnibus test results for any 𝐴, 𝐵, and 𝐶 main effects and interactions; if, for example, the 𝐴 × 𝐵 interaction were statistically significant, art.con could be used to conduct post hoc pairwise comparisons as shown here.

Our results showed that ART-C's Type I error rate is typically clustered around 𝛼 = .05, offering strong evidence for its correctness. In addition to having more power than a t-test for all non-normal distributions and more power than a Mann-Whitney U test and Wilcoxon signed-rank test for all distributions, it is worth noting that the increase in power achieved by using ART-C is largest for data drawn from lognormal and exponential distributions. This finding is particularly meaningful because the lognormal and exponential distributions were included due to their frequent appearance in HCI.

That said, our results showed that single-factor ART-C contrasts conducted on data drawn from a Cauchy-distributed population had high observed Type I error rates. This is not unique to ART-C; the Cauchy distribution is known to be “pathological” and many well-known statistics concepts do not hold on Cauchy-distributed data (e.g., the Central Limit Theorem [25]). This situation occurs because Cauchy distributions have tails that are so fat that neither their mean nor variance is well defined. In practice, this concern can arise in data with extreme outliers. Thus, we encourage users to avoid using ART-C if they have theoretical reasons to suspect the data is drawn from a Cauchy-distributed population or if the data has extreme outliers.

In HCI, nonparametric tests are typically used as a catch-all when parametric tests are not appropriate. The particular Cauchy result above illuminates that this practice can be problematic. In fact, ART-C is mathematically equivalent to ART in the single-factor case, and ART was thought to be appropriate for single-factor contrasts, but would also be ill-suited for the Cauchy case. A disclaimer to not use a method to analyze data with a particular residual distribution is not useful unless researchers investigate experimental data distributions beyond checking for normality. The American Psychological Association's Taskforce on Statistical Inference encourages researchers to take a closer look at their data:

    As soon as you have collected your data, before you compute any statistics, look at your data. Data screening is not data snooping. It is not an opportunity to discard data or change values to favor your hypotheses. However, if you assess hypotheses without examining your data, you risk publishing nonsense. [45] (emphasis in original)

Even nonparametric tests are subject to assumptions. There are many tried-and-tested visualizations for model diagnostics that can be applied to assess assumptions relevant to ART: quantile-quantile (qq) plots [17], for example, allow one to check for the presence of fat tails in the distribution of residuals (i.e., excess kurtosis, which in extreme cases could indicate the presence of Cauchy-distributed data). Modern visualizations like worm plots [6] can make it even easier to diagnose fat tails. The point, though, is that there is no all-encompassing solution in statistical analysis: model fit and assumptions cannot be assumed and must be checked, and nonparametric approaches are no exception to this rule.
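As a rough illustration of the diagnostics mentioned above, the following Python sketch computes the sample excess kurtosis of a set of residuals and the point pairs for a normal quantile-quantile plot. It is a minimal sketch under stated assumptions: the function names `excess_kurtosis` and `qq_points` are hypothetical, the plotting positions (i + 0.5) / n are one common convention among several, and no cutoff for "too fat" is implied by the source.

```python
from statistics import NormalDist, fmean


def excess_kurtosis(residuals):
    """Sample excess kurtosis: about 0 for normal data, positive for fat tails."""
    n = len(residuals)
    mean = fmean(residuals)
    m2 = sum((r - mean) ** 2 for r in residuals) / n  # second central moment
    m4 = sum((r - mean) ** 4 for r in residuals) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0


def qq_points(residuals):
    """Pairs of (theoretical normal quantile, standardized sample quantile)."""
    n = len(residuals)
    mean = fmean(residuals)
    sd = (sum((r - mean) ** 2 for r in residuals) / n) ** 0.5
    std_normal = NormalDist()
    return [(std_normal.inv_cdf((i + 0.5) / n), (r - mean) / sd)
            for i, r in enumerate(sorted(residuals))]


# Two extreme values among otherwise identical residuals give a large
# positive excess kurtosis.
heavy = [0.0] * 18 + [10.0, -10.0]
print(excess_kurtosis(heavy))
```

On a qq plot built from `qq_points`, fat tails appear as points that bend away from the diagonal at both ends.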


There are infinitely many combinations of layouts, population distributions, and condition sample sizes one could examine in a study like ours, but we could only analyze a finite amount of data and had to be selective. These decisions were carefully made, considering the needs of the HCI community and statistical norms, but they were certainly not exhaustive.

Our validation only investigated data in which all conditions' populations had the same location or all conditions' populations had different locations. Additionally, even when parameter values were varied, conditions in the same data set were always drawn from the same distribution. Data in which there are differences between some conditions and not others arises frequently in HCI, but we chose this validation process because it is commonly used in statistics [1, 5, 26, 39].

We included models with random intercepts, which represented the impact each subject had on the response. However, we did not include models with random slopes, which allow, for some types of responses, better fitting models where subjects' responses vary differentially across another variable (e.g., time). Although random slope models would certainly be valuable, fixed effects models and models with random intercepts are used more frequently in HCI, so we chose to focus our validation on such models, leaving other models for future work.

ART-C is defined as an alignment procedure, followed by a ranking procedure, and then a contrast test. We chose to use the t-test as the contrast test in our analyses because it is the most familiar to the HCI community and therefore how we anticipate most researchers will use our new method. Still, it would be worthwhile to see how ART-C performs when a different contrast test is used.

In addition to our extension to the ARTool R package and ARTool.exe Windows application, we envision a platform-agnostic tool that does not require programming experience, and even an ART and ART-C package for other common statistics software (e.g., SAS, SPSS). With the algorithmic and validation work we have done here, it should be relatively straightforward to create additional add-ons for common statistical packages.

We leave addressing our limitations to future work. To facilitate this, we have published the R code used to generate synthetic data, conduct all contrasts, and log the results (see footnote 5), as well as our analysis code (also footnote 5). The “ARTool” R package (see footnotes 1 and 2) and ARTool.exe application (see footnote 3), including our new additions to them, are open source.

The ART method has enabled anyone familiar with an ANOVA to conduct nonparametric analyses on data from factorial experiments and detect main effects and interactions [46], but we have shown that it inflates Type I error rates and has low statistical power when used to conduct multifactor contrasts. To remedy this, we have developed, validated, and presented the ART-C procedure for aligning-and-ranking data for nonparametric multifactor contrasts within the ART paradigm, giving researchers a technique and tools to analyze data nonparametrically from factorial experiments. We have validated our method's correctness and statistical power on 72,000 synthetic data sets whose properties represent data commonly found within HCI and statistics. Our results show that ART-C does not inflate Type I error, and has higher statistical power than a t-test, Mann-Whitney U test, Wilcoxon signed-rank test, and ART.
To facilitate the widespread use of ART-C, we have addedit to the existing “ARTool” R package and ARTool.exe Windowsapplication.It is our hope that the simplicity of conducting multifactor con-trasts with ART-C will impact the HCI community by enablingresearchers to easily investigate their nonconforming multifactordata at a level of granularity that previously required statisticalexpertise and a departure from the ART paradigm. Owing to ART’sevident popularity, we believe ART-C can be immediately useful tomany researchers in HCI and beyond. REFERENCES [1] Marisela Abundis. 2001.

Multiple comparison procedures in factorial designs usingthe aligned rank transformation . Thesis. Texas Tech University. https://ttu-ir.tdl.org/handle/2346/22569[2] Saleema Amershi, James Fogarty, and Daniel Weld. 2012. Regroup: interactivemachine learning for on-demand group creation in social networks. In

Proceedingsof the 2012 ACM annual conference on Human Factors in Computing Systems - CHI’12 . ACM Press, Austin, Texas, USA, 21. https://doi.org/10.1145/2207676.2207680[3] Shiri Azenkot, Kyle Rector, Richard Ladner, and Jacob Wobbrock. 2012. Pass-Chords: secure multi-touch authentication for blind people. In

Proceedings of the14th international ACM SIGACCESS conference on Computers and accessibility -ASSETS ’12 . ACM Press, Boulder, Colorado, USA, 159. https://doi.org/10.1145/2384916.2384945[4] Eric Barefield and H. Mansouri. 2001. An empirical study of nonparametricmultiple comparison procedures in randomized blocks.

Journal of NonparametricStatistics

13, 4 (Jan. 2001), 591–604. https://doi.org/10.1080/10485250108832867[5] R. Clifford Blair and James J. Higgins. 1980. A Comparison of the Power ofWilcoxon’s Rank-Sum Statistic to That of Student’s t Statistic under VariousNonnormal Distributions.

Journal of Educational Statistics

5, 4 (1980), 309. https://doi.org/10.2307/1164905[6] Stef van Buuren and Miranda Fredriks. 2001. Worm plot: a simple diagnosticdevice for modelling growth reference curves.

Statistics in medicine

20, 8 (2001),1259–1277.[7] D. Ciavardelli, C. Rossi, D. Barcaroli, S. Volpe, A. Consalvo, M. Zucchelli, A.De Cola, E. Scavo, R. Carollo, D. D’Agostino, F. Forlì, S. D’Aguanno, M. Todaro,G. Stassi, C. Di Ilio, V. De Laurenzi, and A. Urbani. 2014. Breast cancer stem cellsrely on fermentative glycolysis and are sensitive to 2-deoxyglucose treatment.

Cell Death & Disease

5, 7 (July 2014), e1336–e1336. https://doi.org/10.1038/cddis.2014.285 Number: 7 Publisher: Nature Publishing Group.[8] W. J. Conover and Ronald L. Iman. 1981. Rank Transformations as a BridgeBetween Parametric and Nonparametric Statistics.

The American Statistician

Bioinspiration &Biomimetics

10, 3 (April 2015), 036002. https://doi.org/10.1088/1748-3190/10/3/036002 Publisher: IOP Publishing.[10] Brigitte N. Frederick. 1999.

Fixed-, Random-, and Mixed-Effects ANOVA Models: AUser-Friendly Guide for Increasing the Generalizability of ANOVA Results . https://eric.ed.gov/?id=ED426098[11] Milton Friedman. 1937. The Use of Ranks to Avoid the Assumption of NormalityImplicit in the Analysis of Variance.

J. Amer. Statist. Assoc.

32, 200 (1937), 675–701. https://doi.org/10.2307/2279372 Publisher: [American Statistical Association,Taylor & Francis, Ltd.].[12] António Gaspar, André P. Lourenço, Miguel Álvares Pereira, Pedro Azevedo,Roberto Roncon-Albuquerque, Jorge Marques, and Adelino F. Leite-Moreira. 2018.Randomized controlled trial of remote ischaemic conditioning in ST-elevationmyocardial infarction as adjuvant to primary angioplasty (RIC-STEMI).

BasicResearch in Cardiology

Proceedings of the 2017 CHI Conference on Human Factorsin Computing Systems . ACM, Denver Colorado USA, 4021–4033. https://doi.org/10.1145/3025453.3025683[14] Nur Al-huda Hamdan, Adrian Wagner, Simon Voelker, Jürgen Steimle, andJan Borchers. 2019. Springlets: Expressive, Flexible and Silent On-Skin Tac-tile Interfaces. In

Proceedings of the 2019 CHI Conference on Human Factorsin Computing Systems - CHI ’19 . ACM Press, Glasgow, Scotland Uk, 1–14. , Elkin, Kay, Higgins, Wobbrock https://doi.org/10.1145/3290605.3300718[15] James J. Higgins, R. Clifford Blair, and Suleiman Tashtoush. 1990. THE ALIGNEDRANK TRANSFORM PROCEDURE.

Conference on Applied Statistics in Agriculture (April 1990). https://doi.org/10.4148/2475-7772.1443[16] James J Higgins and Suleiman Tashtoush. 1994. An aligned rank transform testfor interaction.

Nonlinear World

1, 2 (1994), 201–211.[17] David C Hoaglin. 2006. Using quantiles to study shape.

Exploring data tables, trends, and shapes (2006), 417–460.
[18] J. L. Hodges and E. L. Lehmann. 1962. Rank Methods for Combination of Independent Experiments in Analysis of Variance. The Annals of Mathematical Statistics 33, 2 (June 1962), 482–497. https://doi.org/10.1214/aoms/1177704575
[19] Sture Holm. 1979. A simple sequentially rejective multiple test procedure. Scandinavian journal of statistics (1979), 65–70.
[20] Eunice Jun, Maureen Daum, Jared Roesch, Sarah Chasins, Emery Berger, Rene Just, and Katharina Reinecke. 2019. Tea: A High-level Language and Runtime System for Automating Statistical Analysis. In Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology. ACM, New Orleans LA USA, 591–603. https://doi.org/10.1145/3332165.3347940
[21] Shaun K. Kane, Meredith Ringel Morris, Annuska Z. Perkins, Daniel Wigdor, Richard E. Ladner, and Jacob O. Wobbrock. 2011. Access overlays: improving non-visual access to large touch screens for blind users. In Proceedings of the 24th annual ACM symposium on User interface software and technology - UIST ’11. ACM Press, Santa Barbara, California, USA, 273. https://doi.org/10.1145/2047196.2047232
[22] Maurits Clemens Kaptein, Clifford Nass, and Panos Markopoulos. 2010. Powerful and consistent analysis of likert-type ratingscales. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’10). Association for Computing Machinery, Atlanta, Georgia, USA, 2391–2394. https://doi.org/10.1145/1753326.1753686
[23] Matthew Kay. 2020. Contrast tests with ART. https://cran.r-project.org/web/packages/ARTool/vignettes/art-contrasts.html
[24] Matthew Kay, Gregory L. Nelson, and Eric B. Hekler. 2016. Researcher-Centered Design of Statistics: Why Bayesian Statistics Better Fit the Culture and Incentives of HCI. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16). Association for Computing Machinery, San Jose, California, USA, 4521–4532. https://doi.org/10.1145/2858036.2858465
[25] Kalimuthu Krishnamoorthy. 2016. Handbook of statistical distributions with applications. CRC Press.
[26] Dong Li. [n.d.]. Robustness And Power Of The Student T, Welch-Aspin, Yuen, Tukey Quick, And Haga Tests. ([n. d.]), 1753.
[27] H. B. Mann and D. R. Whitney. 1947. On a Test of Whether one of Two Random Variables is Stochastically Larger than the Other. The Annals of Mathematical Statistics

Computational Statistics & Data Analysis 29, 2 (Dec. 1998), 177–189. https://doi.org/10.1016/S0167-9473(98)00077-2
[29] H. Mansouri. 1999. Aligned rank transform tests in linear models. Journal of Statistical Planning and Inference 79, 1 (June 1999), 141–155. https://doi.org/10.1016/S0378-3758(98)00229-8
[30] H. Mansouri, R. L. Paige, and J. G. Surles. 2004. Aligned Rank Transform Techniques for Analysis of Variance and Multiple Comparisons. Communications in Statistics - Theory and Methods 33, 9 (Dec. 2004), 2217–2232. https://doi.org/10.1081/STA-200026599
[31] H. R. Neave and C. W. J. Granger. 1968. A Monte Carlo Study Comparing Various Two-Sample Tests for Differences in Mean. Technometrics 10, 3 (1968), 509–522. https://doi.org/10.2307/1267105 Publisher: [Taylor & Francis, Ltd., American Statistical Association, American Society for Quality].
[32] Daryle Alan Olson. 2013. The Efficacy Of Select Nonparametric And Distribution-Free Research Methods: Examining The Case Of Concomitant Heteroscedasticity And Effect Of Treatment.
[33] Kathleen Peterson. 2002. Six Modifications Of The Aligned Rank Transform Test For Interaction. Journal of Modern Applied Statistical Methods 1, 1 (May 2002), 100–109. https://doi.org/10.22237/jmasm/1020255240
[34] Marta Revilla-León, Peng Jiang, Mehrad Sadeghpour, Wenceslao Piedra-Cascón, Amirali Zandinejad, Mutlu Özcan, and Vinayak R. Krishnamurthy. 2019. Intraoral digital scans—Part 1: Influence of ambient scanning light conditions on the accuracy (trueness and precision) of different intraoral scanners. The Journal of Prosthetic Dentistry (Dec. 2019). https://doi.org/10.1016/j.prosdent.2019.06.003
[35] Scott J. Richter. 1999. Nearly exact tests in factorial experiments using the aligned rank transform. Journal of Applied Statistics 26, 2 (Feb. 1999), 203–217. https://doi.org/10.1080/02664769922548
[36] Judy Robertson and Maurits Kaptein (Eds.). 2016. Modern Statistical Methods for HCI. Springer International Publishing, Cham. https://doi.org/10.1007/978-3-319-26633-6
[37] Joan Sol Roo and Martin Hachet. 2017. One Reality: Augmenting How the Physical World is Experienced by combining Multiple Mixed Reality Modalities. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology. ACM, Québec City QC Canada, 787–795. https://doi.org/10.1145/3126594.3126638
[38] KC Salter and RF Fawcett. 1993. The ART test of interaction: a robust and powerful rank test of interaction in factorial models. Communications in Statistics-Simulation and Computation 22, 1 (1993), 137–153.
[39] Shlomo S. Sawilowsky and R. Clifford Blair. 1992. A more realistic look at the robustness and Type II error properties of the t test to departures from population normality. Psychological Bulletin

Biometrika 6, 1 (1908), 1–25. https://doi.org/10.2307/2331554 Publisher: [Oxford University Press, Biometrika Trust].
[41] John W Tukey. 1949. Comparing individual means in the analysis of variance. Biometrics (1949), 99–114.
[42] James H. Ware. 1985. Linear Models for the Analysis of Longitudinal Studies. The American Statistician 39, 2 (1985), 95–101. https://doi.org/10.2307/2682803 Publisher: [American Statistical Association, Taylor & Francis, Ltd.].
[43] Eric W Weisstein. 2004. Bonferroni correction. https://mathworld.wolfram.com/ (2004).
[44] Frank Wilcoxon. 1945. Individual Comparisons by Ranking Methods. Biometrics Bulletin 1, 6 (1945), 80–83. https://doi.org/10.2307/3001968 Publisher: [International Biometric Society, Wiley].
[45] Leland Wilkinson and The Task Force on Statistical Inference. 1999. Statistical methods in psychology journals: Guidelines and explanations. American psychologist 54, 8 (1999), 594.
[46] Jacob O. Wobbrock, Leah Findlater, Darren Gergle, and James J. Higgins. 2011. The Aligned Rank Transform for Nonparametric Factorial Analyses Using Only Anova Procedures. In