Empirical Decision Rules for Improving the Uncertainty Reporting of Small Sample System Usability Scale Scores
Nicholas Clark∗, Matthew Dabkowski†, Patrick J. Driscoll†, Dereck Kennedy†, Ian Kloo†, Heidy Shi∗
∗Department of Mathematical Sciences, †Department of Systems Engineering
U.S. Military Academy, West Point, New York 10996
Email: [email protected]
Telephone: (+1) 845-938-1111, Fax: (+1) 845-938-1111
Abstract
The System Usability Scale (SUS) is a short, survey-based approach used to determine the usability of a system from an end user perspective once a prototype is available for assessment. Individual scores are gathered using a 10-question survey, with the survey results reported in terms of central tendency (sample mean) as an estimate of the system's usability (the SUS study score), and confidence intervals on the sample mean are used to communicate uncertainty levels associated with this point estimate. When the number of individuals surveyed is large, the SUS study scores and accompanying confidence intervals relying upon the central limit theorem for support are appropriate. However, when only a small number of users are surveyed, reliance on the central limit theorem falls short, resulting in confidence intervals that suffer from parameter bound violations and interval widths that confound mappings to adjective and other constructed scales. These shortcomings are especially pronounced when the underlying SUS score data is skewed, as it is in many instances. This paper introduces an empirically-based remedy for such small-sample circumstances, proposing a set of decision rules that leverage either an extended bias-corrected accelerated (BCa) bootstrap confidence interval or an empirical Bayesian credibility interval about the sample mean to restore and bolster subsequent confidence interval accuracy. Data from historical SUS assessments are used to highlight shortfalls in current practices and to demonstrate the improvements these alternate approaches offer while remaining statistically defensible. A freely available, online application is introduced and discussed that automates SUS analysis under these decision rules, thereby assisting usability practitioners in adopting the advocated approaches.
Index Terms
System Usability Scale, small sample size, bias-corrected accelerated bootstrap, confidence interval, empirical Bayesian credible interval

DISCLAIMER

The views expressed herein are those of the authors and do not reflect the position of the United States Military Academy, the Department of the Army, or the U.S. Department of Defense.

I. INTRODUCTION
The System Usability Scale (SUS) has been employed for over 20 years as a reliable end-of-test subjective assessment tool to evaluate the perceived usability of a system (Brooke, 2013). First introduced by Brooke (1996), it has been used extensively in industry to provide valid and consistent design feedback for assessing the usability of human-machine systems, software, and websites (Peres et al., 2013), as well as everyday products (Kortum and Bangor, 2013). Usability in these settings encompasses a broader scope than "ease of use" or "user friendliness" (ISO/TC-159, 2018), dimensions of usability that affect system technical acceptance by end users (King and He, 2006). Among the SUS's greatest strengths is the simplicity of its design. As seen in Table I, the SUS is a Likert scale-based survey composed of ten questions with five levels of response used to elicit a user's level of agreement concerning system characteristics.

Post-survey analysis regarding individual SUS questions typically proceeds by treating responses as interval data and using appropriate parametric analysis (Boone and Boone, 2012; Carifio and Perla, 2008; Joshi et al., 2015). Of primary interest in systems analysis is the aggregate of each respondent's answers, dubbed a SUS score. This SUS score estimates an individual's subjective judgment regarding the usability of a system. The SUS score is obtained by converting each question's response x_i ∈ {1, 2, 3, 4, 5}, for i = 1, ..., 10, into a single value depending on whether the question was positively worded (the odd numbered questions): (x_i − 1), or negatively worded (the even numbered questions): (5 − x_i), and then multiplying the sum of these by 2.5, thus bounding a respondent's SUS score to the interval [0, 100] in increments of 2.5 units. The sample average of SUS scores across all n SUS respondents yields a SUS study score. The overall SUS results for a system are often reported as a single SUS study score (Költringer and Grechenig, 2004; Lewis and Sauro, 2018; Tullis and Stetson, 2004), a SUS study score with a standard error (Bangor et al., 2009; Everett et al., 2006; Kortum and Sorber, 2015), or a standard confidence
TABLE I
STANDARD SUS QUESTIONS.ᵃ

Q1 I think that I would like to use this system frequently.
Q2 I found the system unnecessarily complex.
Q3 I thought the system was easy to use.
Q4 I think that I would need the support of a technical person to be able to use this system.
Q5 I found the various functions in this system were well integrated.
Q6 I thought there was too much inconsistency in this system.
Q7 I would imagine that most people would learn to use this system very quickly.
Q8 I found the system very cumbersome to use.
Q9 I felt very confident using the system.
Q10 I needed to learn a lot of things before I could get going with this system.

ᵃ Respondents indicate their agreement with each statement from "1 = Strongly disagree" to "5 = Strongly agree."

interval (CI) for the mean of the SUS study score constructed from a z or t statistic (Blažica and Lewis, 2015; Borsci et al., 2015; Orfanou et al., 2015).

In some applications, organizations establish specific descriptors associated with numerical results based on past experience with similar systems. When organizations map SUS study scores to pre-defined usability labels in this manner using acceptability ranges (Bangor et al., 2008), letter grades (Sauro and Lewis, 2016), adjective ratings (Bangor et al., 2009), or score percentiles (Sauro and Lewis, 2016), SUS study scores can be communicated to project managers and acquisition officials in an intuitive and actionable way. Figure 1 shows an example mapping of SUS scores to all four of the noted usability labels. Here, all four mappings are shown for convenience in exposition, as typically an organization would adopt only one for their use.

Mapping the CI bounds of the SUS study score to usability labels adds further insights into survey results by portraying underlying uncertainty associated with SUS study results. When the CI for the SUS study score is mostly contained in a single usability label's interval, sufficient evidence supports the usability characterization associated with that label. For example, the Figure 1 SUS results support the associated system's usability being labeled as nearly 100% acceptable, with little data in the category labeled 'Marginal'. On the other hand, CIs that cross interval bounds can complicate one's usability interpretation.
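The scoring arithmetic described above (convert each item response, sum, scale by 2.5, then average across respondents) can be sketched as follows; the function names are our own illustrations, not part of any SUS library.

```python
# Sketch of SUS scoring: odd (positively worded) items contribute x - 1,
# even (negatively worded) items contribute 5 - x, and the sum is scaled by 2.5.

def sus_score(responses):
    """Convert ten 1-5 Likert responses into a single SUS score in [0, 100]."""
    if len(responses) != 10:
        raise ValueError("the standard SUS has exactly ten items")
    contributions = []
    for i, x in enumerate(responses, start=1):
        if i % 2 == 1:
            contributions.append(x - 1)   # positively worded item
        else:
            contributions.append(5 - x)   # negatively worded item
    return 2.5 * sum(contributions)

def sus_study_score(all_responses):
    """SUS study score: the sample mean of individual SUS scores."""
    scores = [sus_score(r) for r in all_responses]
    return sum(scores) / len(scores)
```

For example, a respondent who strongly agrees with every positive item and strongly disagrees with every negative item scores 100, while uniform neutral responses score 50.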
Returning to Figure 1, mapping the same CI to letter grades indicates that the system's usability ranges between a high C and a low A, adjective ratings indicate the results exist mostly in the category labeled 'Good' with high-side spillover into 'Excellent', and score percentiles are concentrated between roughly 50 and 90% when compared to previously evaluated similar systems. One can see from this example that tighter CIs are more desirable when SUS results are used in this context because they narrow the range of mapped SUS score interpretations.

The SUS has undergone extensive psychometric evaluation since its introduction, consistently demonstrating its reliability (Bangor et al., 2008; Lewis and Sauro, 2009b; Lucey, 1991; Sauro and Lewis, 2016) and content validity (Finstad, 2010; Lewis et al., 2013). Moreover, these favorable characteristics appear to persist despite language translations (Blažica and Lewis, 2015; Dianat et al., 2014; Katsanos et al., 2012), using English language versions with non-native English speakers (Finstad, 2006), modifications that use an all-positive version of the survey (Sauro and Lewis, 2011), as well as removing a single test item from the questionnaire (Lewis and Sauro, 2017) if deemed inappropriate for use with the system at hand. In this last instance, practitioners must adjust the multiplier from 2.5 to 2.78 to accommodate the reduction in test items from 10 to 9.

Fig. 1. SUS scores mapped to published usability scales, namely: acceptability ranges (from Figure 13 of Bangor et al. (2008, p. 592)), letter grades (from Table 8.5 of Sauro and Lewis (2016, p. 204)), adjective ratings with 95% CIs (derived from Table 3 of Bangor et al. (2009, p. 118)), and score percentiles (from Table 8.4 of Sauro and Lewis (2016, p. 203)).

SUS scores have been shown to be sensitive to types of interfaces and changes to a product (Bangor et al., 2008), the task order used for assessment (Tullis and Stetson, 2004), user experience (Kortum and Bangor, 2013; McLellan et al., 2012), and differences in user age but not gender (Bangor et al., 2008). Moreover, the SUS has been evaluated for efficacy against other known methods for testing usability such as the Computer System Usability Questionnaire (CSUQ) and the Usability Metric for User Experience (UMUX) (Borsci et al., 2015; Lewis, 2018a; Tullis and Stetson, 2004). Interest in capturing, confirming, and expounding on SUS utility through research continues to grow. Brooke's (1996) seminal article on the SUS reported 8,948 Google Scholar citations as of 20 May 2020, a gain of 3,284 since reported by Lewis (2018b) two years earlier.

This article has been accepted for publication in the International Journal of Human-Computer Interaction, published by Taylor & Francis.

While the vast majority of research efforts have focused on validating the SUS as a sound methodology and leveraging its normative data to evaluate system usability, there remains opportunity to improve the analysis and reporting of SUS study scores when practitioners are faced with very small sample (n ≤ 10) and extremely small sample (n ≤ 5) circumstances. In practice, this most often happens when cost constraints, security considerations, system operational complexity, or the limited availability of users with highly specialized skills impose on the system assessment process (Chen, 1995).

When the system under assessment represents a capability gap-filling, new technology prototype, it tends to also receive more enthusiastically positive responses from users who quickly comprehend the system's potential to solve or partially solve nagging operational deficiencies. This effect, along with its counterpart, introduces skewness in the data that further complicates statistical methods. The challenge is exacerbated still when assumptions regarding the continuity of the underlying distribution are not appropriate. In this context, we have two concerns.

Firstly, the distribution of historical SUS study scores is skewed (Bangor et al., 2008; Sauro and Lewis, 2016), violating symmetry assumptions that call into question methods relying on the central limit theorem (CLT) when the sample size is small. Historically, SUS study scores across all usability studies analyzed for this paper have an average skewness of about -0.4 (Lewis and Sauro, 2009a), as given in Sauro and Lewis (2016). Data from Bangor et al. (2008) demonstrate that for a single study, this skewness can range from highly negative to highly positive. One logical explanation for this characteristic is that by the time users engage with system prototypes, features are refined to a point that a majority of users will find the system under investigation to be reasonably simple to use.

Secondly, SUS study scores are bounded to the interval [0, 100] in discrete increments of 2.5.
Therefore, a system perceived as highly usable will tend to produce SUS scores closer to the upper bound. The theory allowing practitioners to use standard methods for constructing confidence intervals relies on asymptotics that manifest slowly when skewness is present, a condition exacerbated when skewness is high. The resulting negative skewness in SUS studies causes the upper bounds of CIs to violate the parameter space of the scoring interval in small sample studies where higher levels of variability are expected. Figure 2 shows an example of this for a SUS study with 6 participants, where 4 of the 6 SUS scores were near the upper bound of 100, yielding a sample skewness of -0.9.

From a practical perspective, if small sample SUS studies were uncommon, the concerns identified would be real but somewhat irrelevant. This is not the case. In particular, determining the usability of highly complicated or complex systems tends to be restricted to small sample sizes, mainly due to limited accessibility to these systems and cost considerations involving highly specialized test subjects. As some datasets suggest (e.g., Bangor et al. (2008)), very small and extremely small SUS studies are effectively the rule rather than the exception, and in such situations, reporting a standard error or using traditional confidence intervals neglects potential statistical issues that exist in the data. Ultimately, the method of CI construction should generate narrow confidence intervals with nominal coverage of the underlying population mean.

In this paper, the impact of very small sample (n ≤ 10) SUS study results is examined with specific focus on achieving desirable characteristics for a reported SUS study confidence interval, and the accessibility of potential SUS study scores given the discrete nature of the underlying survey scale for individual responses. The results are intended to assist usability professionals in increasing the accuracy and validity of their SUS results when faced with a very small number of SUS respondents.

In Section II, two alternatives to the common practice of using t distribution-based confidence intervals for small
Fig. 2. Upper CI bound for the mean SUS score violating the upper bound of the parameter space.

sample SUS study scores are examined to assess their efficacy towards achieving the previously stated desirable characteristics. Section III applies these alternative approaches to a repository of data from actual SUS studies used with permission from the owners. Results demonstrate achievable improvements in reporting accuracy regarding SUS study results. In Section IV, a publicly available, custom R-based application that implements the three major computational approaches presented in this paper is described. This application is intended to enable practitioners to leverage the recommendations provided herein in an efficient manner. Section V summarizes this paper's major results and recommendations along with potential future opportunities.

II. METHODS FOR QUANTIFYING UNCERTAINTY
Generally speaking, the CLT states that the distribution of the sample mean approximates a normal distribution as the sample size becomes large, regardless of the shape of the population's distribution. Oftentimes, sample sizes greater than 30 are considered sufficient for the CLT to hold. In such cases, a CI that reflects a range of plausible values for the population mean (µ) can be calculated using the standard expression:

x̄ ± z_{α/2} (s/√n)

where x̄ is the sample mean, z_{α/2} is the critical value from the standard normal distribution evaluated at α/2 when 1 − α is the confidence level, s is the sample standard deviation, and n is the sample size.

For the reasons mentioned earlier, the distribution of the sample mean X̄ is likely not symmetric for small samples. To demonstrate this, assume that the underlying distribution of SUS scores follows Azzalini's (2005) skew-normal distribution with a population mean (µ), standard deviation (σ), and skewness (λ) of 65, 20, and -0.4, respectively. Due to the upper truncation of the distribution at 100, these correspond to µ = 63, σ = 19, and a skewness slightly smaller in magnitude than -0.4, which closely mirror the values reported in Lewis and Sauro (2009a). To visualize the distribution of the sample mean when n = 5, 100,000 random samples from this skew-normal parent distribution were generated with realizations rounded to the nearest 2.5 to match the domain of observable SUS scores. As seen in the blue density plot shown in Figure 3, the sample mode sits to the right of the sample mean and the skewness of X̄ is approximately -0.22. Moreover, as the population mean gets closer to the SUS score's upper bound, the magnitude of X̄'s skewness increases. For instance, if the mean shifts to 81, the skewness of X̄ becomes roughly -0.38, which is noticeable in Figure 3's red density plot. While rules of thumb suggest that skewness values between -1 and 1 are acceptable, normal-based confidence intervals often fail to achieve nominal coverage within this range.

Fig. 3. Density plots of the sample means from 100,000 samples of size n = 5 from skew-normal distributions.

To understand why the CLT may be inadequate for very small sample SUS studies, it is beneficial to consider the Edgeworth expansion for the studentized sample mean: n^{1/2}(X̄ − µ)/σ̂. As given in Hall (2013), the probability density function (PDF) of this statistic can be expressed by:

f(x) = φ(x) − (λ/(6√n)) φ^{(3)}(x) + O(n^{−1})   (1)

where φ^{(i)}(x) is the i-th derivative of the standard normal distribution and λ is the third central moment (i.e., skewness) over the standard deviation cubed. Therefore, with skewness present the distribution of the sample mean approaches a standard normal at a rate of n^{−1/2}, whereas in the absence of skewness it approaches a standard normal at a rate of at most n^{−1}.
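The persistence of skewness in X̄ for small n can be checked with a small simulation. The sketch below draws skew-normal variates via the standard |Z| construction; the shape parameter (α ≈ −1.8) and the location and scale values are our own calibration choices for illustration, not values taken from the paper.

```python
import math
import random
import statistics

random.seed(7)  # reproducibility

def skew_normal(alpha, loc, scale):
    """One skew-normal draw via the |Z| construction: delta*|Z0| + sqrt(1-delta^2)*Z1."""
    delta = alpha / math.sqrt(1 + alpha**2)
    z0, z1 = random.gauss(0, 1), random.gauss(0, 1)
    return loc + scale * (delta * abs(z0) + math.sqrt(1 - delta**2) * z1)

def sample_skewness(xs):
    """Third central moment over the (population) standard deviation cubed."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum((x - m) ** 3 for x in xs) / (len(xs) * s**3)

# Simulate many SUS studies of size n = 5: left-skewed parent distribution,
# scores clamped to [0, 100] and rounded to the SUS grid of 2.5.
means = []
for _ in range(20000):
    draws = [
        min(max(round(skew_normal(-1.8, 84, 28) / 2.5) * 2.5, 0.0), 100.0)
        for _ in range(5)
    ]
    means.append(statistics.fmean(draws))

print(round(sample_skewness(means), 2))  # negative: X-bar inherits the parent's skew
```

Consistent with the Edgeworth argument above, the skewness of the simulated sample means is clearly negative at n = 5 and shrinks only at the slow n^{−1/2} rate as n grows.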
Practically, this means that if n is small (i.e., few respondents participating in a SUS study), the distribution of the sample mean should still have some amount of skewness present. When sample sizes are small, the standard normal distribution is often replaced with the Student's t distribution to generate the critical value to construct the CI:

x̄ ± t_{α/2, n−1} (s/√n)

where t_{α/2, n−1} corresponds to the α/2 quantile from a t distribution with n − 1 degrees of freedom. In such situations, a symmetric CI, such as one formed using the t distribution, may be inappropriate, with this condition becoming exacerbated as n approaches one. On the other hand, when n is large, the n^{−1/2} term in (1) goes to zero, meaning the distribution of the sample mean becomes symmetric.

A. Options for Confidence Interval Formulation
Despite the lack of evidence supporting a CLT assumption for small sample data, it does not necessarily mean that using the CLT is categorically bad or always inappropriate in such cases. Several researchers have proposed modifications to the t distribution approach when sample sizes are as small as 13 (Chen, 1978; Sutton, 1993), although as researchers note, results can be quite inaccurate. And, although the t distribution has nice coverage properties, it is not guaranteed to obey the parameter space. For example, consider the case when n = 5. Suppose that among the five SUS responses, three respondents find the product exceptional, and two rate it as good, resulting in potential SUS scores of 97.5, 97.5, 97.5, 80, and 80. In this case, the 95% CI formed by using a t distribution is (78.5, 102.4), which is nonsensical as µ cannot be greater than 100. As described earlier, this occurs because using the t distribution assumes that the distribution of X̄ is symmetric, and when n is small this is not guaranteed to be true.

While it is tempting to truncate the above CI to (78.5, 100), the result is no longer a valid 95% confidence interval. In general, a confidence interval of a parameter, µ, is of level 1 − α if P(L(X) < µ < U(X)) = 1 − α (Casella and Berger, 2002). Using the t distribution in the above example yields a 95% CI with L(x) = 78.5 and U(x) = 102.4. Inherent to this construction is the belief that P(X̄ ∈ (100, 102.4] | µ) > 0, and in this case, the confidence interval calculation used implies P(X̄ ∈ (100, 102.4] | µ) ≈ 0.045. Simply truncating the interval at 100 renders the probability that X̄ is above 100 to be zero, resulting in a confidence interval with less than its nominal coverage (Agresti and Coull, 1998), perhaps significantly (Mandelkern et al., 2002). Under these conditions, a practitioner should not be comfortable reporting results at the nominal level.
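The bound violation in the five-score example above is easy to reproduce. The sketch below hardcodes the t critical value, since the Python standard library has no t quantile function; small differences from the interval reported above are due to rounding.

```python
import math
import statistics

# The five SUS scores from the example above.
scores = [97.5, 97.5, 97.5, 80.0, 80.0]
n = len(scores)
xbar = statistics.mean(scores)   # 90.5
s = statistics.stdev(scores)     # sample standard deviation

# t critical value for alpha = 0.05 and n - 1 = 4 degrees of freedom,
# hardcoded from a standard t table.
t_crit = 2.776

half_width = t_crit * s / math.sqrt(n)
lower, upper = xbar - half_width, xbar + half_width
print((round(lower, 1), round(upper, 1)))  # (78.6, 102.4)
print(upper > 100)                          # True: the CI violates [0, 100]
```

The upper bound exceeds 100, illustrating why a symmetric t interval is untrustworthy here.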
In order to guarantee a 95% confidence interval in this instance, the lower bound must be shifted left to account for the missing 4.5% probability. In other words, a properly constructed CI requires an L_shifted(X) such that P(L_shifted(X) < µ < 100) = 1 − α. In the above example, this would result in an interval of (70, 100). Depending on the adjective label intervals being used, the usability of a system might easily be considered marginal instead of acceptable.

An additional detractor to a truncating approach is that the distribution for X̄ is no longer valid. To fix this, one might abandon the belief that X̄ follows a non-central t distribution and instead assume that it follows a truncated t distribution, and hence formulate a one-sided confidence interval. While explorations involving the truncated t distribution are beyond the scope of this paper, preliminary simulations suggest that using this distribution may have benefits. Additional alternate ways to form a confidence interval on a bounded parameter space are given in Wu and Neale (2012), Bebu and Mathew (2009), and Andrews (1999).

In sum, while the t distribution may validly address some issues with small samples, it fails to account for bounds on the underlying sample space for µ in SUS applications, and it does not account for potential skewness in the distribution of the sample mean. As demonstrated in Section III that follows, many SUS studies limited to small sample sizes encounter this issue (Bangor et al., 2008). Fortunately, two alternatives offer relief: an expanded version of the bias-corrected accelerated bootstrap approach (Efron, 1987) and uncertainty bounds created using a Bayesian credible interval.

B. CIs Built Using the Expanded Bias-Corrected Accelerated Bootstrap
In general, bootstrap methods are empirical statistical sampling approaches used to estimate characteristics of unknown distributions when faced with small sample sizes or inappropriate use of parametric assumptions. In parametric bootstrap methods, random samples are generated from a parametric model fit to the data. In non-parametric resampling, bootstrap samples are constructed using resampling with replacement from the original sample (Efron and Tibshirani, 1986; Kysely, 2010). Given the limitations noted earlier, a non-parametric random sampling approach is warranted for this exploration (Diciccio and Romano, 1988; Flowers-Cano et al., 2018).

In comparison to the detractors present in a t distribution approach noted earlier, a percentile bootstrap CI for the population mean (µ) is simple to form, easy to understand, and guaranteed to obey bounds on a parameter space. It uses the α/2 and the 1 − α/2 percentiles of the bootstrap distribution as its bounds, where α is the likelihood that µ lies outside of the CI. For example, to construct a 95% CI using the percentile bootstrap method, α = 0.05, and the CI would be constructed using the 2.5 and 97.5 percentiles of the bootstrap distribution.

To generate sufficient realizations of a SUS sample mean, the percentile bootstrap CI for µ resamples from the SUS score data with replacement B times to form B bootstrap samples. At each iteration, the sample mean is calculated for each bootstrap sample. Taken together, these B realizations of the sample mean approximate a sampling distribution for the population mean by ordering them from smallest to largest such that:

θ̂*_(1) ≤ θ̂*_(2) ≤ θ̂*_(3) ≤ ... ≤ θ̂*_(B),

where θ̂*_(i) represents the i-th smallest sample mean for i = 1, 2, ..., B. Using this approach, the corresponding 95% confidence interval for the population mean is given by:

[θ̂*_(0.025B), θ̂*_(0.975B)].
As an illustration, consider the small SUS score dataset introduced earlier, namely {97.5, 97.5, 97.5, 80, 80}, and set B = 1,000. The percentile bootstrap with B = 1,000 resamples 1,000 datasets with n = 5, subsequently calculating the sample mean for each dataset. With resampling with replacement, it is possible to obtain bootstrap samples with many repeated values, such as {80, 80, 80, 80, 80}, which would result in a mean of 80. Once resampling is complete and 1,000 sample means are calculated, the 25th and 975th smallest values would be extracted and used as the 95% confidence interval's lower and upper bounds, respectively.

To account for potential skewness in data, Efron (1987) introduced a bias-corrected and accelerated (BCa) bootstrap methodology for both parametric and nonparametric situations, improving an earlier percentile method (Efron, 1981). Similar to confidence intervals constructed using the percentile bootstrap method, BCa bootstrap CIs are formed using percentiles of the bootstrap distribution. However, unlike the percentile bootstrap method, the choice of which percentiles to use is more complicated and requires one to calculate two additional factors: a bias correction factor b and an acceleration factor a. The bias correction factor b is calculated as b = Φ^{−1}(p), where Φ is the cumulative distribution function of a standard normal random variable and p is the proportion of bootstrap samples less than the average. This factor estimates the difference between the median of the bootstrap distribution and X̄. Clearly, if there is no skew to the sampling distribution of X̄, this bias correction factor will be zero on average.

The acceleration factor a is obtained by jackknife resampling of the n original data to estimate the second term in (1) using the following relation:

a = (1/(6√n)) λ φ^{(3)}(x).   (2)

This involves generating n replicates of the original sample. The first jackknife sample is formed by leaving out the first value of the original sample, the second by leaving out the second value, and so on until n samples of size (n − 1) are obtained (Carpenter and Bithell, 2000).
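The resampling walk-through above, together with the bias-correction factor b, can be sketched in a few lines; this is an illustrative sketch, not the full BCa procedure (the acceleration factor and adjusted percentiles are omitted).

```python
import random
import statistics
from statistics import NormalDist

random.seed(1)  # reproducibility of the resampling

scores = [97.5, 97.5, 97.5, 80.0, 80.0]
n, B = len(scores), 1000
xbar = statistics.mean(scores)

# B bootstrap samples: resample the n scores with replacement, keep each mean.
boot_means = sorted(
    statistics.mean(random.choices(scores, k=n)) for _ in range(B)
)

# 95% percentile interval: the 25th and 975th smallest bootstrap means.
lower, upper = boot_means[24], boot_means[974]

# Bias-correction factor b = Phi^-1(p), with p the proportion of bootstrap
# means falling below the original sample mean.
p = sum(m < xbar for m in boot_means) / B
b = NormalDist().inv_cdf(p)
print((lower, upper), round(b, 3))
```

Because every bootstrap mean is an average of observed scores, the resulting bounds necessarily respect the [0, 100] parameter space.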
Once this term is estimated, both the bias correction and the acceleration terms can be used to make a bootstrap estimate that converges at a rate of O(n^{−1}) rather than O(n^{−1/2}), as shown in Hall (2013).

Although the BCa bootstrap method is an improvement over the percentile bootstrap technique, the CIs produced by the BCa bootstrap method tend to be too narrow for small samples and may fail to achieve their nominal coverage probability. In other words, when the sample size is small, BCa bootstrap CIs may not cover their parameters' true values as often as they claim to. One approach to addressing this issue is given in Hesterberg (2015a) and adopted herein, which is similar to using a t distribution instead of the CLT-based normal distribution to form traditional CIs. Using a t distribution instead of a normal distribution is the same as multiplying the length of a CI by (s × t_{α/2, n−1})/(σ̂ × z_{α/2}). If the underlying distribution is not normally distributed, applying this correction is not theoretically sound, but it is a commonly used correction factor in practice.

This same approach can be applied to bootstrap CIs in a straightforward manner, most easily explained using the percentile bootstrap as an example. The percentile bootstrap uses the α/2 and the 1 − α/2 percentiles of the bootstrap sample to form a (1 − α)100% CI. The expanded percentile bootstrap instead sets α′/2 = Φ(−√(n/(n−1)) t_{α/2, n−1}), where Φ is the standard normal CDF and t_{α/2, n−1} is the critical value found from the t distribution. The expanded bootstrap then uses the α′/2 and the 1 − α′/2 percentiles of the bootstrap sample. This expansion can also be easily applied to the BCa bootstrap using the resample library (Hesterberg, 2015b) within the statistical software R. When applied to the example dataset introduced earlier, {97.5, 97.5, 97.5, 80, 80}, the expanded BCa bootstrap yields an interval of (80, 97.5), thereby supporting the conclusion that the system's usability is acceptable.
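Hesterberg's expansion can be sketched by applying the α′/2 formula above to the plain percentile bootstrap (the paper applies it to the BCa interval via R's resample package; this Python version is a simplified illustration, with the t critical value hardcoded since the standard library has no t quantile function).

```python
import math
import random
import statistics
from statistics import NormalDist

random.seed(1)
scores = [97.5, 97.5, 97.5, 80.0, 80.0]
n, B = len(scores), 1000
t_crit = 2.776  # t_{0.025, 4}, from a standard t table

# Expanded tail probability: alpha'/2 = Phi(-sqrt(n/(n-1)) * t_{alpha/2, n-1}).
alpha_prime_half = NormalDist().cdf(-math.sqrt(n / (n - 1)) * t_crit)

boot_means = sorted(
    statistics.mean(random.choices(scores, k=n)) for _ in range(B)
)
# Use the alpha'/2 and 1 - alpha'/2 order statistics instead of 2.5% / 97.5%.
lo_idx = max(math.floor(alpha_prime_half * B) - 1, 0)
hi_idx = min(math.ceil((1 - alpha_prime_half) * B), B) - 1
lower, upper = boot_means[lo_idx], boot_means[hi_idx]
print(round(alpha_prime_half, 5), (lower, upper))
```

The expanded tail probability is far smaller than the unadjusted 0.025, widening the interval to counteract the narrowness of small-sample bootstrap CIs while still respecting the [0, 100] bounds.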
Additionally, Figure 4 illustrates a 95% expanded BCa bootstrap CI for the same SUS study highlighted in Figure 2.

Fig. 4. 95% expanded BCa bootstrap CI for the mean SUS score of the survey highlighted in Figure 2.

Comparing the t distribution CI in Figure 2 to the expanded BCa bootstrap CI in Figure 4 reveals two important differences. First, unlike the upper bound of the CI derived from the t distribution (106.16), the upper bound of the BCa bootstrap CI (97.50) abides by the parameter space of the SUS score. In essence, a CI represents a plausible range of values for a parameter of interest. CIs that exclusively cover the feasible values for this parameter should be preferred. In the case of mean SUS scores, values below 0 or above 100 are not realizable, and hence not feasible. Additionally, the width of the expanded BCa bootstrap CI (51.67) is slightly narrower than the t distribution CI (53.99). In general, when choosing between multiple valid CIs, the narrowest interval is preferred, assuming that the CI's construction method preserves the nominal coverage probability. While in this instance the practical conclusions would not change if an analyst selected the expanded BCa CI over the t distribution CI, in some instances they would. For example, consider the small dataset presented earlier. After proper truncation, the adjusted lower bound of the t distribution CI would lead a practitioner to conclude that the system's usability is unacceptable, whereas the expanded BCa CI would not.

The above discussion indicates the expanded BCa bootstrap CI is the better option for this small sample SUS study, and Section III investigates the generalizability of these results using simulation. However, before moving on, there is a final, more subtle point worth mentioning.

When confronted with small samples, using the t distribution assumes the distribution of the underlying data is normally distributed or at least reasonably symmetric and continuous.
As noted earlier, SUS scores often lacksymmetry; case in point, the sample skewness of the SUS scores in Figure 2 is -0.9. Moreover, while SUS scoresare clearly discrete, leading SUS literature currently suggests the sampling distribution of the mean SUS score iseffectively continuous, stating “the combination of average SUS scores for a study is virtually infinite” (Sauro andLewis, 2016, p. 202). Figure 5 suggests otherwise, especially for small n . Fig. 5. Possible mean SUS scores for samples of size n , where c is the number SUS score combinations of size n and m is the number ofdistinct mean SUS scores. As seen in Figure 5, when n = 6 over 9 million SUS score combinations are possible, but there are only 241distinct SUS means available. Increasing the sample size to n = 10 produces more than 10 billion SUS scorecombinations. However, these combinations yield a mere 401 distinct SUS means. This is far from infinite, and thecounter-intuitive result is a function of the SUS scores’ special structure. Specifically, prior to scaling an individualSUS score by 2.5, the 41 possible scores are S = { , , , ..., } , which can be rewritten as S = { s + id : i =0 , , ..., k − } with s = 0 , d = 1 , and k = 41 . With s ∈ S and d ≥ , S is a 41-term arithmetic progression.Leveraging a result from number theory, the mean SUS score for a sample of size n is simply an n-fold sumset of S (scaled by . /n ), and when n ≥ this sumset’s cardinality is n − ( n − (Mistri and Pandey, 2014, p.335). Each time the sample size increases 1, the number of distinct SUS means increases by 40. When n is small,these means are sparsely distributed along the real line between 0 and 100. Accordingly, treating the SUS meanas continuous and using a CI construction method that relies on this continuity, such as the t distribution, appearsill-advised. This sentiment is reinforced in recent literature, notably Liddell and Kruschke (2018). 
Fortunately, the expanded BCa bootstrap CI accommodates discrete data and provides a defensible alternative CI construction in both theory and practice.
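To see why resampling-based intervals respect the parameter space, note that a bootstrap mean can never leave the range of the observed scores. The minimal sketch below uses a plain percentile bootstrap for illustration only; the paper's actual method, the BCa interval with Hesterberg's expansion, adds bias, acceleration, and width corrections not shown here, and the example scores are hypothetical.

```python
import random
import statistics

def percentile_bootstrap_ci(scores, alpha=0.05, n_boot=2000, seed=1):
    """Simplified percentile bootstrap CI for the mean SUS score.

    Because each resample draws only from the observed scores, every
    bootstrap mean, and hence each interval bound, lies between the
    sample minimum and maximum, and therefore inside [0, 100].
    """
    rng = random.Random(seed)
    n = len(scores)
    boot_means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_boot)
    )
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical small-sample SUS study (n = 6); scores are multiples of 2.5.
scores = [92.5, 97.5, 90.0, 57.5, 75.0, 60.0]
lo, hi = percentile_bootstrap_ci(scores)
assert 0 <= lo <= hi <= 100            # bounds abide the parameter space
assert min(scores) <= lo and hi <= max(scores)
```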
Fig. 6. Two choices of priors for µ overlaid on top of the empirical distribution.

C. Credible Intervals Constructed Using an Empirical Bayesian Approach
Although the expanded BCa bootstrap method is an attractive option, as it obeys the parameter space for µ, it ignores any preconceived belief or historical evidence regarding a given system's mean SUS score. For example, it fails to account for the extremely low probability that the population mean is 0 or 100. Furthermore, as we show in Section III, when n is less than 5, CIs formed using even the expanded BCa bootstrap method do not cover the true mean as often as they purport to.

Bayesian inference offers an approach that both takes advantage of prior information and properly facilitates inference when n is extremely small by relying on prior beliefs to inform posterior probabilities regarding a parameter of interest. While in some cases this might appear superfluous, historical SUS scores can meaningfully inform inference regarding sample means. Notably, Bangor et al. (2008) demonstrated that across 206 usability studies using the SUS, there was never a mean SUS score below 30 or above 95. Moreover, their data suggest a prior distribution (density) for the true mean SUS score, π(µ), similar to the empirical density denoted by the blue line in Figure 6, created by assuming X̄ follows a truncated normal distribution and matching moments. This yields a truncated normal distribution with mean 70 and standard deviation 12, as indicated by the thick pink line in Figure 6. This choice appears to align with the empirical distribution of X̄ fairly well, as does a non-truncated normal distribution with mean 70 and standard deviation 12. Thus, as can be seen in Figure 6, although the truncated normal prior distribution is theoretically sound, including truncation in the model used to construct a prior density on µ increases the model's complexity without substantively improving the results. The largest difference between the two exists in the region above 100, and this is only 0.007% of the data.
Lest this appear contradictory to earlier evidence against relying on the CLT, the concern here is with a prior distribution of µ rather than a sampling distribution of X̄.

The marginal posterior distribution for µ given the data from any given usability assessment can be estimated by appealing to Bayes' theorem as follows:

π(µ, σ | y) ∝ ∏_{i,j} f(y_{i,j} | µ_i, σ) π(µ_i) π(σ).   (3)

Here f(y_{i,j} | µ_i, σ) is the likelihood function for observation j within test i, assuming a truncated normal distribution. Letting π(µ_i) denote the prior distribution of the true average score of test i, π(µ_i) can be determined from the data in Bangor et al. (2008). Similarly, π(σ) denotes the prior distribution for the population standard deviation, which is assumed to be Uniform(0, 30), as it is highly unlikely that the standard deviation exceeds 30. The joint posterior distribution for µ and σ, π(µ, σ | y), can then be used to find credible intervals for both parameters or, by integrating out σ, a marginal posterior distribution for µ.

Typically, (3) is too complex to evaluate directly without the use of specialized software such as Stan (Stan Development Team, 2019), which relies on Markov chain Monte Carlo (MCMC) techniques to simulate from the posterior distributions. Practically, an empirical Bayesian approach prevents reporting values for CI or credible interval bounds on µ that are unrealistic. This is done above by assuming a truncated normal distribution rather than a distribution with support outside of (0, 100). Figure 7 shows a 95% empirical Bayesian credible interval for the same SUS study highlighted earlier in Figure 2.
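For intuition, the empirical Bayesian machinery can be approximated without Stan by evaluating the posterior on a grid. The sketch below deliberately simplifies the model in (3): σ is held fixed rather than receiving a Uniform(0, 30) prior, the likelihood is an untruncated normal, and the prior on µ is the non-truncated N(70, 12) discussed above. The data are hypothetical; this illustrates the idea, not the paper's full MCMC model.

```python
import math

def grid_posterior_ci(scores, sigma=20.0, alpha=0.05, grid_step=0.1):
    """Grid-approximated credible interval for the mean SUS score.

    Simplifying assumptions (a sketch, not the paper's Stan model):
    fixed sigma, untruncated normal likelihood, and a N(70, 12) prior on mu
    matched to the historical means in Bangor et al. (2008).
    """
    grid = [i * grid_step for i in range(int(100 / grid_step) + 1)]
    logs = []
    for mu in grid:
        log_prior = -0.5 * ((mu - 70.0) / 12.0) ** 2
        log_lik = sum(-0.5 * ((y - mu) / sigma) ** 2 for y in scores)
        logs.append(log_prior + log_lik)
    m = max(logs)                                  # stabilize the exponentials
    weights = [math.exp(l - m) for l in logs]
    total = sum(weights)
    # Walk the normalized CDF to the alpha/2 and 1 - alpha/2 quantiles.
    cdf, lo, hi = 0.0, None, None
    for mu, w in zip(grid, weights):
        cdf += w / total
        if lo is None and cdf >= alpha / 2:
            lo = mu
        if hi is None and cdf >= 1 - alpha / 2:
            hi = mu
            break
    return lo, hi

lo, hi = grid_posterior_ci([92.5, 97.5, 90.0, 57.5, 75.0, 60.0])
assert 0 <= lo < hi <= 100    # the interval cannot leave the parameter space
```

Because the grid only covers [0, 100], the resulting interval can never report an unrealistic bound, mirroring the role of truncation in the paper's model.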
As demonstrated in what follows, using empirical Bayesian techniques dictates that a practitioner believes the current study's SUS scores are likely similar to those that have been collected in the past. When sample sizes are extremely small, the mean SUS scores will shift towards the historical norms. It is also important to note that the resulting intervals are not confidence intervals but rather credible intervals, which yield a true probability statement about µ, a useful distinction when communicating results.

III. SIMULATION STUDY
A simulation experiment was conducted to assess the coverage of the expanded BCa bootstrap and t distribution CI approaches on very small SUS study results, to determine whether 95% CIs cover the true mean 95% of the time, and to measure the proportion of CI bounds that fall outside of the SUS score's parameter space.

A. Simulation Methodology and Results
Small sample SUS studies with 4 to 10 respondents were created using a skew normal distribution (Azzalini, 2005) with a mean of 68, a standard deviation of 20, and a skewness ranging between -0.99 and 0.99. These choices mirror those seen in practice (Bangor et al., 2008; Lewis and Sauro, 2009a) while abiding the theoretical bounds for skewness under the skew normal distribution (Azzalini, 1985). For each combination of sample size and skewness, 500 sets of scores were generated, along with 95% CIs for each set using the expanded BCa bootstrap and t distribution techniques. The results of this experiment are summarized in the four panels shown in Figure 8.

The results illustrated in Panel (a) of Figure 8 show that the t distribution CIs begin violating the SUS score's parameter space at n = 8, and the situation gets progressively worse as n decreases. At a skewness of -0.39, the average skewness observed in real-world SUS studies (Sauro and Lewis, 2016), roughly 30% of the t distribution CIs exceed the SUS score's upper bound when n = 4. As seen earlier, the expanded BCa bootstrap CI's bounds
Fig. 7. 95% empirical Bayesian credible interval for the mean SUS score of the survey highlighted in Figure 2.

are percentiles of the bootstrap sample drawn with replacement from the observed scores. As a consequence, the expanded BCa bootstrap CI always abides the parameter space. Moreover, as seen in Panel (b), for n ≤ 8 the expanded BCa bootstrap also tends to produce much narrower intervals. However, Panel (c) demonstrates that this narrowness comes at the expense of coverage, and for n ≤ 5 the expanded BCa bootstrap CI fails to cover the true mean at an unacceptable rate. The results in Panel (d) show that the t distribution CI's coverage performance is significantly better for extremely small samples, but this is not surprising given its considerably wider intervals, intervals that often violate the SUS score's parameter space.

In summary, the following decision rules apply:
1) n ≤ 5: With 5 or fewer respondents, both the expanded BCa bootstrap and the t distribution CI have significant shortcomings. Additional information is at a premium, and the empirical Bayesian approach offers a way to harness it, albeit at the cost of assuming that the current study's mean SUS score will follow patterns similar to previously recorded studies.
2) n ∈ {6, 7, 8}: When compared to the t distribution, the expanded BCa bootstrap CI offers acceptable, comparable coverage and narrower or similar widths. It also abides the SUS score's parameter space, and its bounds represent feasible realizations of the true mean SUS score for a sample of size n.
3) n ≥ 9: With 9 or more respondents, Figure 8 suggests the t distribution CI abides the parameter space, is
Fig. 8. Summary of 95% CI performance for a simulation of 500 samples of size n ∈ {4, 5, ..., 10} from a skew normal distribution with a mean of 68, a standard deviation of 20, and a skewness ranging between -0.99 and 0.99. Panel (a) highlights the proportion of the t distribution CIs that exceed the bounds of the SUS score's parameter space. Panel (b) shows the ratio of the mean CI widths, where values less than 1 indicate the BCa bootstrap (with Hesterberg's (2015a) expansion) is narrower. Panels (c) and (d) give the observed coverage for the BCa bootstrap and t distribution CIs, respectively.
slightly narrower than the expanded BCa bootstrap CI, and has good coverage. Based on this finding, one could argue that a practitioner should strive for at least 9 participants when conducting a SUS study, as this would allow for safely constructing traditional CIs using the t distribution. Although this logic has merit, being narrower on average does not imply always narrower, and it is easy to construct pathological yet representative SUS score examples where a t distribution CI will exceed the SUS score's upper bound. With this in mind, it appears prudent to construct CIs using both the expanded BCa bootstrap and the t distribution, and to subsequently pick the one that is both narrower and abiding of the parameter space. Simulation validated this methodology, yielding observed coverage probabilities ranging from a low of 0.92 to a high of 0.96 with an average of 0.943. In short, nominal coverage is preserved.

Although it may seem contrary to the philosophy of Bayesian statistics, for completeness the simulation described above was repeated using credible intervals from the empirical Bayesian methodology outlined in Section II-C. Echoing a comment made earlier, unlike a traditional confidence interval, a Bayesian credible interval allows a probabilistic statement to be made directly about the parameter of interest. For example, if [L, U] is a 100(1 − α)% credible interval for µ, then there is a (1 − α) probability that the population mean lies between L and U. To verify that such probabilistic statements are reasonable, the percentage of Bayesian credible intervals that covered µ = 68 over the values of n and skewness used in Figure 8 was calculated. These percentages ranged from 92% to 99%, which agrees nicely with the intended 95% chance of the credible intervals containing µ.
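The three rules can be stated compactly in code. The function below is an illustration of the logic, not the authors' implementation; it takes the study size and the two candidate intervals and returns the recommended method:

```python
def recommend_interval(n, t_ci, bca_ci):
    """Encode the paper's three decision rules for a SUS study of size n.

    t_ci and bca_ci are (lower, upper) tuples for the t distribution CI and
    the expanded BCa bootstrap CI, respectively.
    """
    if n <= 5:
        return "empirical Bayes"            # rule 1: too few respondents
    if n <= 8:
        return "expanded BCa bootstrap"     # rule 2: n in {6, 7, 8}
    # Rule 3: with n >= 9, prefer the t CI only if it is narrower
    # and stays inside the [0, 100] parameter space.
    t_in_bounds = 0 <= t_ci[0] and t_ci[1] <= 100
    t_narrower = (t_ci[1] - t_ci[0]) < (bca_ci[1] - bca_ci[0])
    if t_in_bounds and t_narrower:
        return "t distribution"
    return "expanded BCa bootstrap"

assert recommend_interval(4, (30, 90), (35, 85)) == "empirical Bayes"
assert recommend_interval(7, (40, 105), (45, 95)) == "expanded BCa bootstrap"
assert recommend_interval(12, (60, 80), (58, 83)) == "t distribution"
assert recommend_interval(12, (55, 101), (58, 95)) == "expanded BCa bootstrap"
```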
However, this level of agreement should be expected, as the simulation was run with a true mean (68) close to the mean of the prior distribution (70). If the true mean happened to be far from the mean of the prior distribution, this would not be the case. That said, in the absence of additional SUS scores, the Bayesian approach assumes the system under consideration will most likely have a mean similar to those of systems that have been previously studied.

B. Implications for a Practitioner
While the above simulations may appear academic in nature, there are several practical reasons to care about the reporting of the upper confidence bound for SUS study scores. In practice, a usability practitioner can make one of two errors when conducting a SUS study: one could fail to conclude a system is acceptable when it is acceptable, or one could fail to conclude a system is unacceptable when it is unacceptable. The latter appears to be the graver error in the context of assessing usability, as it potentially places an unacceptable system in the hands of the target user base, or it wastes additional resources on further confirmatory studies.

Examining the upper bound of a SUS study score's confidence interval helps the practitioner avoid this type of error. For example, if the true SUS study score is indeed unacceptable, say 50, one would not want to conclude that the product's usability is acceptable. To test whether earlier results hold in this regard, SUS study scores for n ∈ {4, 5, 6, 7, 8, 9, 10} over a range of skewness were simulated, and the number of times the upper bounds exceeded 70 was counted for both the t distribution and the expanded BCa bootstrap. Overall, 40% of the t distribution confidence intervals contained 70, while the expanded BCa bootstrap intervals contained 70 only 31% of the time. Furthermore, over the range of skewness tested, the expanded BCa bootstrap intervals reported fewer errors than the t distribution intervals in 62% of the simulations.

(a) Stacked dot plot of sample sizes. (b) Density of sample skewness for small sample SUS studies.
Fig. 9. Sample sizes and skewness of the 206 SUS studies in Bangor et al. (2008).
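The error-rate simulation just described can be sketched in a few lines. The version below generates scores via Azzalini's (1985) representation of the skew normal and counts how often the t distribution CI's upper bound crosses the acceptability threshold of 70 when the true mean is an unacceptable 50. The shape parameter, sample size, and replication count are illustrative choices, not the paper's exact settings, and the simulated scores are neither discretized nor truncated.

```python
import math
import random
import statistics

def skew_normal_sample(n, mean, sd, shape, rng):
    """Draw n values from a skew normal via Azzalini's representation:
    Z = delta*|U0| + sqrt(1 - delta^2)*U1, rescaled to the target mean/sd."""
    delta = shape / math.sqrt(1 + shape ** 2)
    mz = delta * math.sqrt(2 / math.pi)        # mean of the unscaled Z
    sz = math.sqrt(1 - mz ** 2)                # sd of the unscaled Z
    sample = []
    for _ in range(n):
        u0, u1 = rng.gauss(0, 1), rng.gauss(0, 1)
        z = delta * abs(u0) + math.sqrt(1 - delta ** 2) * u1
        sample.append(mean + sd * (z - mz) / sz)
    return sample

# How often does the t CI's upper bound exceed 70 when the true mean is 50?
# (n = 5, so t_{0.975, 4} = 2.776; all settings here are illustrative.)
rng = random.Random(11)
N, T_CRIT, REPS = 5, 2.776, 2000
exceed = 0
for _ in range(REPS):
    s = skew_normal_sample(N, 50.0, 20.0, -2.0, rng)   # left-skewed scores
    ucb = statistics.fmean(s) + T_CRIT * statistics.stdev(s) / math.sqrt(N)
    exceed += ucb > 70
error_rate = exceed / REPS
assert 0.0 < error_rate < 1.0   # a non-trivial share of UCBs cross 70
```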
Although the focus here is on the upper bound, if a practitioner's primary concern is with reporting an acceptable system as unacceptable, the lower bound should be emphasized. In such circumstances, simulation testing suggests the t distribution and expanded BCa bootstrap intervals perform similarly for n ∈ {6, 7, 8}. Although beyond the scope of this paper, preliminary analysis indicates that truncated t distribution intervals are a promising alternative.

C. Validation of Decision Rules against Actual SUS Studies
The effectiveness of the decision rules introduced in the previous section was assessed using a dataset of 206 SUS studies (Bangor et al., 2008). As seen in Panel (a) of Figure 9, these SUS studies range from a minimum of 3 usability respondents to a maximum of 32, with a median of 10, as indicated by the dashed vertical red line. Additionally, 15 of the studies (7.2%) had sample sizes of 3 to 5, and 44 (21.4%) had sample sizes of 6 to 8. Moreover, as Panel (b) of Figure 9 shows, for the 109 studies with 10 or fewer respondents, the sample skewness ranged from -2.06 to 0.85 with a mean of -0.43. Taken together, these observations suggest small sample SUS studies are common, and the range of sample skewness simulated earlier is reasonable. Hence, when practitioners find themselves confronted with small sample SUS studies, the procedures outlined above are recommended.

To investigate Bangor et al.'s (2008) dataset further, 95% CIs for the mean of each SUS study were constructed using the t distribution, expanded BCa bootstrap, and empirical Bayesian methods. Panels (a) and (b) of Figure 10 show the upper confidence bounds (UCBs) of these CIs for the t distribution and the expanded BCa bootstrap methods, respectively. Additionally, within each plot, blue colored points denote that the associated CI construction method is preferred based on the decision rules outlined in the previous section.

Applying the first decision rule, for n ∈ {3, 4, 5} neither the t distribution nor the expanded BCa bootstrap method should be trusted to generate 95% CIs that abide the SUS score's parameter space and attain the nominal coverage. Accordingly, none of the points for n ≤ 5 are colored blue in either of Figure 10's panels.
Moving to the second decision rule, for samples with 6 to 8 respondents, Panel (a) shows that the UCBs of the t distribution CIs often exceed the SUS score's maximum value of 100, while Panel (b) highlights that the UCBs of the expanded BCa bootstrap CIs abide the parameter space.

(a) UCBs using the t distribution. (b) UCBs using BCa.
Fig. 10. Plot of sample sizes vs. upper confidence bounds for the data of Bangor et al. (2008), where the sample sizes have been jittered for readability. For each of the 206 SUS studies represented in these plots, a blue colored point denotes that the associated CI construction method is preferred based on the decision rules outlined in the previous section. As seen in Panel (a), it is common for the t distribution CIs' upper confidence bounds to exceed the parameter space for n ≤ 8. Additionally, there are two studies with n > ; in both cases, the t distribution CI is preferred.

In accordance with the second decision rule, all of the points in Panel (b) for n ∈ {6, 7, 8} are blue. Finally, for samples with 9 or more respondents, the third decision rule implies that if the t distribution CI abides the parameter space and is narrower than the expanded BCa bootstrap CI, its representative point in Panel (a) will be blue. Otherwise, the expanded BCa bootstrap CI is the better option, and the associated point in Panel (b) will be blue. Notably, the expanded BCa bootstrap CI is the preferred option in 53 of the 147 SUS studies (36%) with n ≥ 9, including several of the studies with the largest number of respondents. Clearly, the expanded BCa bootstrap method is not simply a small sample solution.
Fig. 11. Comparison of the labels corresponding to the 95% CI bounds of the gray points in Panel (a) of Figure 10, where U = unacceptable, M = marginal, and A = acceptable. Specifically, for n ≤ 5, the t distribution CI is compared to the empirical Bayesian interval, and for n ≥ 6, it is compared to the expanded BCa bootstrap CI. The numbers in the cells represent counts, and pink colored cells highlight SUS studies where the acceptability labels corresponding to the CIs' bounds disagree.

Figure 11 provides a tabular summary of the practical consequences of utilizing the empirical Bayesian or expanded BCa bootstrap methods in lieu of the commonly used t distribution. In particular, it displays the agreement between the acceptability labels corresponding to the bounds of the t distribution CI and the alternative interval suggested by the decision rules, where the numbers in the cells represent counts. For example, when n ≤ 5 the first decision rule suggests the empirical Bayesian method should be applied. In Bangor et al.'s (2008) dataset, there are 15 SUS studies that fall into this category. After applying both the t distribution and empirical Bayesian methods, there are 7 SUS studies where the acceptability labels disagree. As seen in Panel (a), in 5 of these studies the t distribution CI suggests the system's usability ranges from unacceptable to acceptable, while the empirical Bayesian interval sees it as marginal to acceptable. Although the underlying reason for the extremely small number of respondents is unknown, it is reasonable to assume that one or more of the practical limitations mentioned earlier are present. Applying the more appropriate empirical Bayesian method has returned tighter results and sharpened the conclusiveness of the studies.
For the remaining two studies in Panel (a), the opposite is true, as the empirical Bayesian method suggests the plausible range of acceptability labels should be loosened to include marginal.

Moving to the second decision rule, in Panel (b) of Figure 11 the agreement between the acceptability labels improves, as only 6 of 44 SUS studies (13.6%) disagree. In all 6 cases, the expanded BCa bootstrap method returned a more pessimistic usability result than the t distribution, and in 3 of the 6 the upper bound of the t distribution CI exceeded 100. Finally, after applying the third decision rule, there are 53 SUS studies where the expanded BCa bootstrap CI is preferred over the t distribution CI. Among these 53 studies, Panel (c) shows only 5 (9.4%) disagree, and once again, each disagreement is due to the more pessimistic usability result of the expanded BCa bootstrap CI.

IV. DISCUSSION AND CUSTOM APPLICATION
While the boundaries of the confidence interval often shift only slightly, Figure 11 highlights that such shifts may affect the usability labels assigned to systems. Given that practitioners make decisions based on these labels, accurately calculating the boundaries of the confidence interval has practical significance. To this end, the empirical Bayesian or expanded BCa bootstrap methods should be used in practice; however, the statistical acumen necessary for a usability practitioner to employ them presents a potential barrier. With this in mind, the authors developed an intuitive, freely accessible online application that automates the calculations and decision rules for these alternative CIs (available at http://sus.dse-apps.com). The user interface accepts and processes SUS data, provides a recommendation for which method(s) to use, and creates effective visualizations to communicate results to both technical and non-technical clients. This application complements a recently introduced free mobile device application called
SUSapp (Xiong et al., 2020) that helps practitioners administer the SUS and collect data for later analysis.

While the full code base is available under an open source license, the application removes the tasks of configuring and running the code, which can be complicated by dependencies (e.g., the need to work with Stan (Stan Development Team, 2019) for Bayesian computation). Furthermore, correctly interpreting the direct code output can be difficult without prior knowledge of each method.
Fig. 12. The System Usability Scale (SUS) Analyzer application splash page.
The application accepts data in CSV format, where each row represents an individual's response to the SUS questionnaire. As seen in Figure 12, a sample data set showing the required formatting is available on the welcome page via the blue hyperlink. After loading data using the "Upload File" button, users can view and confirm that their data were loaded correctly (see Figure 13). Once complete, users select the appropriate button to either edit the input data or submit them for the decision rules to be applied as described in this paper (see Figure 14). Recommended methods that adhere to these decision rules appear as blue tabs. If a particular method is not recommended (e.g., using the t distribution method with a sample size lower than 5), the tab displays as grey and cannot be selected.
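For reference, each respondent's row of raw questionnaire answers is reduced to an individual SUS score via the standard scoring formula (Brooke, 1996). The sketch below assumes each row holds the ten raw 1-5 Likert responses; whether the application expects raw responses or pre-computed scores is an assumption here.

```python
def sus_score(responses):
    """Convert one respondent's ten 1-5 Likert answers into a SUS score.

    Standard SUS scoring: odd-numbered items contribute (response - 1),
    even-numbered items contribute (5 - response); the total is scaled
    by 2.5, giving a score in [0, 100] in steps of 2.5.
    """
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("expected ten responses, each between 1 and 5")
    total = sum((r - 1) if i % 2 == 0 else (5 - r)   # i = 0 is item 1 (odd)
                for i, r in enumerate(responses))
    return 2.5 * total

assert sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]) == 100.0   # best possible
assert sus_score([1, 5, 1, 5, 1, 5, 1, 5, 1, 5]) == 0.0     # worst possible
assert sus_score([3] * 10) == 50.0                          # all neutral
```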
Fig. 13. Users are able to verify their data were loaded correctly before proceeding to the recommendations and visualizations.Fig. 14. The app recommends which method(s) to use based on the decision rules presented in this paper.
A user can move between the recommended tabs to see visualizations for Bayesian, expanded BCa bootstrap, and t distribution frequency plots, the resulting means and confidence intervals, and the appropriate mapping to the four common scales mentioned in Section I. All visualizations can be exported as high-resolution images using the "Save as PNG" button at the bottom of each plot (see Figure 15).

Fig. 15. Results of empirical Bayesian analysis on SUS data showing a 95% credible interval.

While there are many additional features that could be included in this application to meet the needs of specific user bases, the available version is intended as a general purpose tool for usability practitioners. Because the source code is available under the MIT open source license, practitioners needing to modify the application, its underlying analysis options, decision rules, or output visualizations can do so to add features or modify any of the methods to accommodate specific needs.

V. CONCLUSIONS
The effectiveness of the SUS for assessing the usability of systems is well-established among usability practitioners. When a sufficient number of survey respondents is available to leverage the central limit theorem for analyzing and reporting SUS study uncertainties, current practices invoking either a normal distribution or t distribution for constructing confidence intervals are sound. However, when only a small number of users are surveyed, as in cases in which the desired user pool is not available or affordable, reliance on the central limit theorem yields confidence intervals that suffer from parameter bound violations and interval widths that confound mappings to adjective and other constructed scales that organizations rely upon for decision making. These shortcomings are especially pronounced when the underlying SUS score data is skewed, as it is in many instances. Using actual SUS data made available for this study, the t distribution's specific inadequacies are illustrated.

Unfortunately, when the sample size is small, there is not a single tool that a user can apply in all situations. This paper helps to remedy this by introducing two attractive alternatives that improve the accuracy and reporting of SUS study results when practitioners are faced with very small (n ≤ 8) and extremely small (n ≤ 5) samples, namely the expanded BCa bootstrap and an empirical Bayesian approach.
These alternative approaches facilitate three novel decision rules for constructing confidence or credible intervals on small sample SUS study means. Additionally, a freely accessible online application that implements these decision rules and produces effective visualizations is developed and presented for the usability practitioner's general use.

It is important to note that for the empirical Bayesian approach, the interpretation of the resulting interval is not the same as that of parametric or non-parametric confidence intervals. In fact, under this paradigm the definition of probability itself is different. Specifically, Bayesian inference relies on subjective probability, meaning it is a measure of belief rather than long-run frequency. As such, the posterior distribution and associated statistics represent a belief about the population mean, and probabilities can be directly calculated from it. So, if the interval for a mean SUS score is, for example, (45, 75), then it would be correct to conclude that there is a 95% probability that the population mean is between 45 and 75. While this interpretation is only available when using the empirical Bayesian construction and not the expanded BCa bootstrap technique, it does provide a much more intuitive and natural interpretation of uncertainty than a confidence interval, offering a small but significant advantage when communicating SUS results to a non-technical client.

While this paper expands the current options for practitioners reporting SUS scores, several questions remain unexplored. In particular, the current empirical Bayesian approach does not account for the differing types of systems whose usability is assessed using the SUS. For example, a software program may have important inherent differences from a cellphone.
To account for system-to-system differences, a hierarchical Bayesian model could potentially be useful. However, recognizing that more nuanced models such as this require additional data to develop, practitioners should be encouraged to share their SUS data to the maximum extent possible for the benefit of the broader usability community. In this spirit of continuous improvement, the authors welcome any feedback or suggestions for improving the online application. Ultimately, our hope is that usability practitioners will gain value from its use in this setting or in other topical areas that exhibit the type of problem characteristics addressed in this paper.

DATA AVAILABILITY
The data that support the findings of this study were obtained from the corresponding author of Bangor et al. (2008). Restrictions apply to the availability of these data, which are not publicly available.

REFERENCES
A. Agresti and B. A. Coull. Approximate is better than "exact" for interval estimation of binomial proportions. The American Statistician, 52(2):119–126, 1998.
D. W. Andrews. Estimation when a parameter is on a boundary. Econometrica, 67(6):1341–1383, 1999.
A. Azzalini. A class of distributions which includes the normal ones. Scandinavian Journal of Statistics, 12(2):171–178, 1985.
A. Azzalini. The skew-normal distribution and related multivariate families. Scandinavian Journal of Statistics, 32(2):159–188, 2005.
A. Bangor, P. Kortum, and J. Miller. An empirical evaluation of the system usability scale. International Journal of Human-Computer Interaction, 24(6):574–594, 2008.
A. Bangor, P. Kortum, and J. Miller. Determining what individual SUS scores mean: Adding an adjective rating scale. Journal of Usability Studies, 4(3):114–123, 2009.
I. Bebu and T. Mathew. Confidence intervals for limited moments and truncated moments in normal and lognormal models. Statistics & Probability Letters, 79(3):375–380, 2009.
B. Blažica and J. Lewis. A Slovene translation of the system usability scale: The SUS-SI. International Journal of Human-Computer Interaction, 31(2):112–117, 2015.
H. Boone and D. Boone. Analyzing Likert data. Journal of Extension, 50(2):1–5, 2012.
S. Borsci, S. Federici, S. Bacci, M. Gnaldi, and F. Bartolucci. Assessing user satisfaction in the era of user experience: Comparison of the SUS, UMUX, and UMUX-LITE as a function of product experience. International Journal of Human-Computer Interaction, 31(8):484–495, 2015.
J. Brooke. SUS: A quick and dirty usability scale. Usability Evaluation in Industry, 189(194):4–7, 1996.
J. Brooke. SUS: A retrospective. Journal of Usability Studies, 8(2):29–40, 2013.
J. Carifio and R. Perla. Resolving the 50-year debate around using and misusing Likert scales. Medical Education, 42(12):1150–1152, 2008.
J. Carpenter and J. Bithell. Bootstrap confidence intervals: When, which, what? A practical guide for medical statisticians. Statistics in Medicine, 19:1141–1164, 2000.
G. Casella and R. L. Berger. Statistical Inference, volume 2. Duxbury, Pacific Grove, CA, 2002.
L. Chen. Modified t tests and confidence intervals for asymmetrical populations. Journal of the American Statistical Association, 73:536–544, 1978.
L. Chen. Testing the mean of skewed distributions. Journal of the American Statistical Association, 90(430):767–772, 1995.
I. Dianat, Z. Ghanbari, and M. Asghari-Jafarabadi. Psychometric properties of the Persian language version of the system usability scale. Health Promotion Perspectives, 4(1):82, 2014.
T. Diciccio and J. Romano. A review of bootstrap confidence intervals. Journal of the Royal Statistical Society, 50(3):338–354, 1988.
B. Efron. Nonparametric standard errors and confidence intervals. Canadian Journal of Statistics, 9:139–172, 1981.
B. Efron. Better bootstrap confidence intervals. Journal of the American Statistical Association, 82(397):171–185, 1987.
B. Efron and R. Tibshirani. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science, 1:54–75, 1986.
S. Everett, M. Byrne, and K. Greene. Measuring the usability of paper ballots: Efficiency, effectiveness, and satisfaction. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting, volume 50, pages 2547–2551. SAGE Publications, Los Angeles, CA, 2006.
K. Finstad. The system usability scale and non-native English speakers. Journal of Usability Studies, 1(4):185–188, 2006.
K. Finstad. The usability metric for user experience. Interacting with Computers, 22:323–327, 2010.
R. Flowers-Cano, R. Ortiz-Gómez, J. León-Jiménez, R. Rivera, and L. Perera-Cruz. Comparison of bootstrap confidence intervals using Monte Carlo simulations. Water, 10:166–187, 2018.
P. Hall. The Bootstrap and Edgeworth Expansion. Springer Science & Business Media, 2013.
T. Hesterberg. What teachers should know about the bootstrap: Resampling in the undergraduate statistics curriculum. The American Statistician, 69(4):371–386, 2015a.
T. Hesterberg. resample: Resampling Functions, 2015b. URL https://CRAN.R-project.org/package=resample. R package version 0.4.
ISO/TC-159. Usability: Definitions and concepts. International Organization for Standardization, Geneva, Switzerland, 2018.
A. Joshi, S. Kale, S. Chandel, and D. Pal. Likert scale: Explored and explained. British Journal of Applied Science and Technology, 7:396–403, 2015.
C. Katsanos, N. Tselios, and M. Xenos. Perceived usability evaluation of learning management systems: A first step towards standardization of the system usability scale in Greek. In 2012 16th Panhellenic Conference on Informatics, pages 302–307. IEEE, 2012.
W. King and J. He. A meta analysis of the technology acceptance model. Information & Management, 43(6):740–755, 2006.
T. Költringer and T. Grechenig. Comparing the immediate usability of Graffiti 2 and virtual keyboard. In CHI'04 Extended Abstracts on Human Factors in Computing Systems, pages 1175–1178. ACM, 2004.
P. Kortum and A. Bangor. Usability ratings for everyday products measured with the system usability scale. International Journal of Human-Computer Interaction, 29(2):67–76, 2013.
P. Kortum and M. Sorber. Measuring the usability of mobile applications for phones and tablets. International Journal of Human-Computer Interaction, 31(8):518–529, 2015.
J. Kysely. Coverage probability of bootstrap confidence intervals in heavy tailed frequency models. Theoretical and Applied Climatology, 101:345–361, 2010.
J. Lewis. Measuring perceived usability: The CSUQ, SUS, and UMUX. International Journal of Human-Computer Interaction, 34(12):1148–1156, 2018a.
J. Lewis. The system usability scale: Past, present, and future. International Journal of Human-Computer Interaction, 34(7):577–590, 2018b.
J. Lewis and J. Sauro. The factor structure of the system usability scale. In International Conference on Human Centered Design, pages 94–103. Springer, New York, 2009a.
J. Lewis and J. Sauro. The factor structure of the system usability scale. Human Centered Design, 12(2):94–103, 2009b.
J. Lewis and J. Sauro. Can I leave this one out? The effect of dropping an item from the SUS. Journal of Usability Studies, 13(1):38–46, 2017.
J. Lewis and J. Sauro. Item benchmarks for the system usability scale. Journal of Usability Studies, 13(3):158–167, 2018.
J. Lewis, B. Utesch, and D. Maher. UMUX-LITE: When there's no time for the SUS. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 2099–2102. ACM, 2013.
T. Liddell and J. Kruschke. Analyzing ordinal data with metric models: What could possibly go wrong? Journal of Experimental Social Psychology, 79:328–348, 2018.
International Journal of Human-Computer Interaction ,published by Taylor & Francisre-Print 26N. Lucey. More than meets the i: User-satisfaction of computer systems. Master’s thesis, Cork, Ireland, 1991.M. Mandelkern et al. Setting confidence intervals for bounded parameters. Statistical Science, 17(2):149–172, 2002.S. McLellan, A. Muddimer, and S. C. Peres. The effect of experience on system usability scale ratings. Journal ofUsability Studies, 7(2):56–67, 2012.R. Mistri and R. Pandey. A generalization of sumsets of set of integers. Journal of Number Theory, 143(2):334–356,2014.K. Orfanou, N. Tselios, and C. Katsanos. Perceived usability evaluation of learning management systems: Empiricalevaluation of the system usability scale. International Review of Research in Open and Distributed Learning, 16(2):227–246, 2015.S. Peres, T. Pham, and R. Phillips. Validation of the system usability scale: Sus in the wild. In Human Factors andErgonomics Society 57th Annual Meeting, volume 1. HFES: Santa Monica, CA, 2013.J. Sauro and J. Lewis. When designing usability questionnaires, does it hurt to be positive? In Proceedings of theSIGCHI Conference on Human Factors in Computing Systems, pages 2215–2224. ACM, May 2011.J. Sauro and J. Lewis. Quantifying the user experience: Practical statistics for user research. Morgan Kaufmann,2016.Stan Development Team. RStan: the R interface to Stan, 2019. URL http://mc-stan.org/. R package version 2.19.2.C. Sutton. Computer-intensive methods for tests about the mean of an asymmetrical distribution. Journal of theAmerican Statistical Society, 88:802–810, 1993.T. Tullis and J. Stetson. A comparison of questionnaires for assessing website usability. In Usability ProfessionalAssociation Conference, volume 1. Minneapolis, USA, 2004.H. Wu and M. C. Neale. Adjusted confidence intervals for a bounded parameter. Behavior genetics, 42(6):886–898,2012.J. Xiong, C. Acemyan, and P. Kortum. 
Susapp: A free mobile application that makes the system usability scale(sus) easier to administer. Journal of Usability Studies, 15(3):135–144, 2020.This article has been accepted for publication in the