Evaluating the Success of a Data Analysis
Stephanie C. Hicks∗ and Roger D. Peng

Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health

April 29, 2019
Abstract
A fundamental problem in the practice and teaching of data science is how to evaluate the quality of a given data analysis, which is different from the evaluation of the science or question underlying the data analysis. Previously, we defined a set of principles for describing data analyses that can be used to create a data analysis and to characterize the variation between data analyses. Here, we introduce a metric of quality evaluation that we call the success of a data analysis, which is different from other potential metrics such as completeness, validity, or honesty. We define a successful data analysis as the matching of principles between the analyst and the audience for which the analysis is developed. In this paper, we propose a statistical model and general framework for evaluating the success of a data analysis. We argue that this framework can be used as a guide for practicing data scientists and students in data science courses for how to build a successful data analysis.

Keywords: Data science, data analyses, quality, evaluation, education
Running title : Evaluating the success of a data analysis
Author Contributions : SCH and RDP equally conceptualized, wrote and approved the manuscript.
Disclosures : The authors do not have any disclosures.
Acknowledgements: The authors do not have any funding to acknowledge.

∗Corresponding author email: [email protected]

Introduction
Within the practice and teaching of data science [1–10], a data scientist builds a data analysis [11–17] to extract knowledge and insights from examining data [18]. However, there is surprisingly little discussion on how to evaluate the quality of a given data analysis, which is different from the evaluation of the science or question underlying the data analysis. Three possible reasons for this include (1) there is an insufficient vocabulary to describe how to characterize the variation between data analyses, (2) there is a lack of definitive and precise performance metrics to evaluate the quality of the analyses, and (3) there is a lack of specificity about by whom the data analysis is being evaluated. This leaves the educator or the practicing data scientist to focus the discussion of data analysis quality assessment on specific methods, technologies, or programming languages used in a data analysis, with the vague hope that such discussion will lead to success.

Much previous work dedicated to studying data analysis has focused primarily on the notion of "statistical thinking", or developing an understanding of the mental processes that occur within the analyst while doing data analysis [7, 16, 18, 19]. Such an approach is beneficial in that by understanding how data analyses are conceived we can design teaching strategies that are purpose-built to emphasize certain processes. An alternate approach is to characterize the data analytic process based on its observed outputs (the data analysis) and provide principled feedback on why it might have failed or how it could be more successful. However, the literature provides little insight into how we might execute this approach, largely because there is no rigorous description of a "successful" data analysis.

The current situation leads us to re-think the purpose of a data analysis and the audience that it serves.
While the audience could be one individual or a group of individuals, each individual audience member plays a critical role in evaluating the quality of a given data analysis. Each audience member evaluates the quality with her or his own preconceived notions, characteristics, and biases towards valuing what makes a good or bad analysis [16]. Therefore, to be able to define precise performance metrics to evaluate the quality of a data analysis, we first need to formally specify (i) who is the audience and (ii) what characteristics they value, or do not value, in a given data analysis. With this information in hand, a data analyst could then hypothetically choose to adjust or tailor a given data analysis to the characteristics that the audience members value, leading to a potentially more successful data analysis, compared to one that did not take into account the audience and the characteristics in a data analysis that the audience members value [20, 21].

In contrast, there are other potential metrics of quality evaluation one could consider, such as whether or not an analysis is valid or complete, or even evaluating the strength and quality of evidence in a given data analysis for the particular hypothesis of interest [22]. While all of these quality evaluations of data analyses are important, in this paper we are focused on the question of how to evaluate the success of a data analysis, which will depend on formally specifying who is the audience and the characteristics in a data analysis that the audience members value.

To tackle this question, we start by leveraging a set of principles of data analyses that we previously introduced that can be used to create a data analysis and to characterize the variation between data analyses. These principles of data analysis are "prioritized qualities or characteristics that are relevant to the analysis, as a whole or individual components, and that can be objectively observed or measured" [22].
For a given data analysis, the inclusion or exclusion of certain principles does not convey a judgment or assessment with respect to the overall quality of the data analysis. However, a data analyst can assign weights to these principles to increase or decrease the presence of these objective characteristics in a given data analysis, which can also be highly influenced by outside constraints or resources, such as time or budget. In this way, different weightings of the principles by the analyst can lead to different data analyses, all addressing the same primary question underlying the data analysis [23].

Next, we use this set of principles for data analysis to propose a framework for evaluating the quality of a data analysis that relies critically on the audience for which the analysis is developed. In particular, as every data analysis has an audience that views the analysis with her or his own preconceived notions, characteristics, and biases, we consider the weights of the principles by both the analyst and the audience members, who may have a different perspective of how these various principles should be weighted for a given data analysis. For example, one audience (Audience A) may value one set of principles while another audience (Audience B) may value a different set of principles. Neither the set of principles weighted by the analyst nor either set weighted by the audiences is correct or incorrect. However, we previously hypothesized that the success of a data analysis may depend on how well-matched the analyst's weightings are to the audience's weightings for a given analysis [22]. In this way, educators can use this idea in the classroom to teach students how to build more successful analyses that take into account who is the audience and what principles of data analysis they value.
In addition, managers of data analysts in industry can use this idea to frame the discussion of how to build more successful data analyses for their clients, customers, or executives.

In this paper, we make these ideas more concrete and introduce a metric of quality evaluation that we call the success of a data analysis. We define a successful data analysis as the matching of weighted principles between the analyst and the audience for which the analysis is developed. In the following sections, we mathematically formalize these ideas by proposing a statistical model and general framework for evaluating the success of a data analysis (Section 2). Then, we discuss the implications of this framework (Section 3) and argue how this framework can be used as a guide for practicing data scientists and students in data science courses for how to build a successful data analysis.
As described above and in our previous work, we consider data analyses to be constructed in a manner guided by a set of $K$ principles [22], or objective characteristics about the data analysis. Specifically, we defined the principles of data analysis as data-matching, exhaustive, skeptical, second-order, transparent, and reproducible. In this paper, we assume that for each principle, the data analyst assigns a positive integer score whose interpretation corresponds to how much weight that individual gives to that principle. We consider smaller values to be interpreted as a "lower weight" assigned to a given principle and larger values as a "higher weight". In some circumstances, it may make sense to think of the weight as the number of units of a particular resource, such as time or budget, that is devoted to a given principle.

For a given data analysis, an analyst will assign a weight $W^{(k)}$ to principle $k$. For example, if we assume principle $k$ is reproducibility, the analyst might assign a weight $W^{(k)} = 100$ to a data analysis because the analyst believes reproducibility is very important for that analysis. For a different analysis, where the reproducibility of the results is perhaps not so critical, the analyst may assign a weight $W^{(k)} = 10$ for this specific principle. Given a set of $K$ principles, an analyst assigns a set of weights $\{W^{(1)}, \ldots, W^{(K)}\} \in \mathbb{Z}_+^K$ to guide the development of the data analysis. The sets of weights assigned to each principle may differ from analysis to analysis.

Data analyses are built to be viewed by an audience, which can be an individual person or a group of people and can include the data analyst themselves. For now, we will consider the audience to be an individual person, other than the data analyst, and consider the case when an audience is more than one individual in Section 2.5.
As such, the audience has their own weights for each principle governing a data analysis, which reflect how they balance the importance of various properties of a data analysis. For a given data analysis, the audience weights will be denoted by the set $\{A^{(1)}, \ldots, A^{(K)}\} \in \mathbb{Z}_+^K$. These values are assigned before seeing the full data analysis, but may be based on partial information available about the analysis or analyst beforehand.

We allow for the possibility that there will be variation in the weightings of the principles from analysis to analysis, for both analyst and audience. Some of that variation can be characterized as fixed, while other variation may be best considered as random. From the analyst's perspective, some of the determinants of how a given principle may be weighted are:

1. Analysis-specific Resources. Considerations about computing resources, time, budget, personnel, and other such resources and analysis characteristics can often require that an analyst place more or less weight on certain principles for an analysis. For example, analyses that must be conducted in a short amount of time may be limited in their ability to explore multiple competing hypotheses and exhibit low skepticism.

2. Question Significance and Problem Characteristics. The significance of the question being addressed with the data may play a role in determining principle weightings. Questions of high significance, for example, may require a high degree of transparency or reproducibility. Questions of lower significance may be done in a "quick-and-dirty" fashion; should the question's significance change in the future, the analysis may need to be re-done with a different set of principle weightings.

3. Field-specific Conventions. Analysts are often members of a field from which they may have received their training (e.g. statistics, economics, computer science, bioinformatics). Each field develops conventions regarding how analyses in that field should be conducted, and we characterize this using a field-specific mean value for a given principle. Tukey [11] emphasized that in data analysis there is a heavy emphasis on "judgment", one particular form of which is based upon the experience of members of a given field.

4. Analytic Product. Depending on the analytic product that will ultimately be presented to the audience (e.g. PDF document, web-based dashboard, executable R Markdown document), the analyst may determine that certain principles should receive more or less weight.

Similarly, the audience for whom the analysis is being developed will determine their principle weightings based on a variety of factors, including their perception of the resources available to the analyst, their judgment of the significance of the question, their own field-specific conventions (assuming the audience and the analyst are not members of the same field), and their perception of what the analytic product should contain.
The above-enumerated list describes some of the fixed factors that may drive variation in how various data analytic principles are weighted. However, there may be variation that is more random in nature. In particular, we consider the randomness as arising from sampling from a population of analysts or potential audience members. Different analysts, presented with the exact same question and data, will likely weight principles differently and hence produce different analyses based on their own personal characteristics. Similarly, different audience members, seeing the same analytic product, will weight principles differently and evaluate the success of the analysis differently.

We consider each analyst and each audience member to be a member of a field or profession. Let $f_i \in \{1, \ldots, F\}$ be the index into a set of $F$ fields or professions for analyst $i$. One source of random variation that we highlight here is what we call an individual's field-specific deviation for principle $k$. An analyst who belongs to field $f_i$ will be trained in the conventions of that field, which places a field-specific mean value $\lambda^{(k)}_{f_i}$ on a given principle $k$. An individual analyst $i$ will deviate from their field-specific mean by an amount $\delta^{(k)}_i$, which we think of as being randomly distributed with mean 0 and finite variance. Therefore, the field-specific principle contribution for analyst $i$ is $\lambda^{(k)}_{f_i} + \delta^{(k)}_i$ for principle $k$ in any given data analysis. Similarly, audience member $j$ who belongs to field $f_j$ will have a field-specific principle contribution of $\lambda^{(k)}_{f_j} + \eta^{(k)}_j$, where $\eta^{(k)}_j$ is randomly distributed with mean 0 and finite variance.

Throughout the text, we consider just one data analysis $a$ at a time, but we do not include the notation for the $a$th data analysis to keep the notation minimal.
Now, for a given analysis and analyst $i$, the weight assigned to a specific principle $k$ is $W^{(k)}_i$ and $N_i = \sum_{k=1}^K W^{(k)}_i$ is the total weight assigned to the analysis by analyst $i$. Given the total weight $N_i$, we model the individual principle-specific weights $W^{(k)}_i$ with the multinomial distribution,
$$
W_i = \left(W^{(1)}_i, \ldots, W^{(K)}_i\right) \sim \text{Multinomial}\left(N_i; \pi^{(1)}_i, \ldots, \pi^{(K)}_i\right). \quad (1)
$$
The parameters $\pi^{(k)}_i$ from the multinomial distribution can be thought of as the probability of analyst $i$ assigning weight to a specific principle $k$, where the probabilities must sum to 1 across the $K$ principles, i.e. $\sum_{k=1}^K \pi^{(k)}_i = 1$, reflecting the reality that all analysts must decide how to allocate their priorities towards each principle when building a data analysis. For a given principle $k$, we can derive the marginal distribution from the multinomial and have $W^{(k)}_i \sim \text{Binomial}(N_i; \pi^{(k)}_i)$. We can then model the $\pi^{(k)}_i$s as
$$
\psi^{(k)}_i = \log\left(\frac{\pi^{(k)}_i}{1 - \pi^{(k)}_i}\right) = \lambda^{(k)}_{f_i} + \delta^{(k)}_i + x_i' \beta^{(k)}_i, \quad (2)
$$
where $\lambda^{(k)}_{f_i}$ is the field-specific mean for principle $k$ and analyst $i$ in field $f_i$, $\delta^{(k)}_i$ is analyst $i$'s deviation from the field-specific mean for principle $k$, $x_i$ is a vector of analysis-specific resources and characteristics for the analysis (i.e. time, budget, personnel, significance), and $\beta^{(k)}_i$ is a vector of coefficients that indicate how each resource is related to the up-weighting or down-weighting of the $k$th principle for this analysis. We consider the analyst deviation $\delta^{(k)}_i$ to be randomly distributed across the set of potential analysts with mean 0 and finite variance.

Analogous to the analyst's weights, the weight given to principle $k$ by audience member $j$ (who is a member of field $f_j$) can be written as $A^{(k)}_j$, with $N_j = \sum_{k=1}^K A^{(k)}_j$ being the total weight given to the analysis. We similarly model the vector $A_j = \left(A^{(1)}_j, \ldots, A^{(K)}_j\right)$ as multinomial with total $N_j$ and proportions $\omega^{(1)}_j, \ldots, \omega^{(K)}_j$. We then similarly model the proportions $\omega^{(k)}_j$ as
$$
\alpha^{(k)}_j = \log\left(\frac{\omega^{(k)}_j}{1 - \omega^{(k)}_j}\right) = \lambda^{(k)}_{f_j} + \eta^{(k)}_j + z_j' \gamma^{(k)}_j, \quad (3)
$$
where $z_j$ is the audience's perception of the resources available and question significance, $\lambda^{(k)}_{f_j}$ and $\eta^{(k)}_j$ are the field-specific mean and individual-level deviation for the $j$th audience member, respectively, and $\gamma^{(k)}_j$ is the audience member's sense of the relationship between a given resource and the weight that should be given to the principle. Note that we consider $\eta^{(k)}_j$ to be independent of $\delta^{(k)}_i$ in the analyst's weight model.

With the analyst weightings in Equation (1) and the audience weightings, we can then write the principle-specific weight difference for a given data analysis as
$$
D^{(k)}_{ij} = \psi^{(k)}_i - \alpha^{(k)}_j = \left(\lambda^{(k)}_{f_i} - \lambda^{(k)}_{f_j}\right) + \left(\delta^{(k)}_i - \eta^{(k)}_j\right) + \left(x_i' \beta^{(k)}_i - z_j' \gamma^{(k)}_j\right). \quad (4)
$$
The overall analyst-audience distance for a given data analysis is then characterized by the collection of distances for the set of $K$ principles, $D_{ij} = \left(D^{(1)}_{ij}, \ldots, D^{(K)}_{ij}\right)$.

In the next section, we will introduce three ways that a given data analysis can be defined as successful.

In this section, we propose three ways to achieve a successful data analysis pairwise between analyst $i$ and audience member $j$: Strong Pairwise Success (Definition 1),
Weak Pairwise Success (Definition 2),
Potential Pairwise Success (Definition 3).
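Before turning to the definitions, the weight model in Equations (1)–(4) can be made concrete with a short simulation. The following is a minimal sketch in Python; all parameter values are illustrative assumptions of ours, not taken from the model above, and for simplicity the logits are mapped to probabilities with a softmax-style normalization (enforcing the sum-to-1 constraint) rather than the per-principle logit of Equations (2)–(3).

```python
import numpy as np

rng = np.random.default_rng(0)

K = 6  # number of principles (data-matching, exhaustive, skeptical, ...)

# Hypothetical field-specific means (lambda), individual deviations (delta,
# eta), and resource effects (x'beta, z'gamma) on the logit scale.
lam_analyst = np.array([0.5, -1.0, 0.0, -0.5, 1.0, 1.5])
lam_audience = np.array([0.4, -0.8, 0.1, -0.5, 0.9, 1.2])
delta = rng.normal(0.0, 0.2, K)    # analyst's deviation, mean 0
eta = rng.normal(0.0, 0.2, K)      # audience member's deviation, mean 0
resource_analyst = np.zeros(K)     # x_i' beta_i^(k), set to 0 here
resource_audience = np.zeros(K)    # z_j' gamma_j^(k), set to 0 here

# Logits psi (Eq. 2) and alpha (Eq. 3).
psi = lam_analyst + delta + resource_analyst
alpha = lam_audience + eta + resource_audience

def to_probs(logits):
    """Softmax-style normalization so the K probabilities sum to 1."""
    p = np.exp(logits)
    return p / p.sum()

N = 100  # total weight allocated across the K principles
W = rng.multinomial(N, to_probs(psi))    # analyst weights, Eq. (1)
A = rng.multinomial(N, to_probs(alpha))  # audience weights

# Principle-specific weight differences on the logit scale, Eq. (4).
D = psi - alpha
print(D)
```

The vector `D` is the input to the success definitions that follow: the analyst and audience agree when every entry of `D` is small.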
Definition 1 (Strong Pairwise Success). A data analysis is strongly successful for the pairing of analyst $i$ with audience member $j$ if
$$
\|D_{ij}\|_\infty = \max_{k=1,\ldots,K} \left|D^{(k)}_{ij}\right| < \varepsilon
$$
for some small $\varepsilon$. Because of the randomness in $\delta^{(k)}_i$ and $\eta^{(k)}_j$, the $D^{(k)}_{ij}$ values can never be equal to zero. However, the definition of strong pairwise success requires that the differences are never too large for any given principle.

We can propose a weaker form of analysis success that allows for some differences in how the principles are weighted, but places a limit on the total variation of those differences.

Definition 2 (Weak Pairwise Success). A data analysis is weakly successful for the pairing of analyst $i$ with audience member $j$ if, for some $p \geq 1$,
$$
\|D_{ij}\|_p = \left(\sum_{k=1}^K \left|D^{(k)}_{ij}\right|^p\right)^{1/p} < \varepsilon. \quad (5)
$$
With this definition, the analyst and audience may differ slightly with respect to how each principle is weighted, but the overall differences between analyst and audience must be small. The choice of $p$ here (and hence, the norm) will have an impact on how much deviation is allowed between analyst and audience and how much any single principle may differ. For now, we do not comment on which norm is most appropriate or useful, but only note that different circumstances may require the use of different norms.

From our definition of strong pairwise success of a data analysis, we can see how success may be achieved or, in some circumstances, may never be achieved.
In particular, if we consider $\delta^{(k)}_i$ and $\eta^{(k)}_j$ to be random (with mean 0 and finite variance) and independent, then the principle-specific weight difference has expectation
$$
E\left[D^{(k)}_{ij}\right] = \left(\lambda^{(k)}_{f_i} - \lambda^{(k)}_{f_j}\right) + \left(x_i' \beta^{(k)}_i - z_j' \gamma^{(k)}_j\right), \quad (6)
$$
which in general will be different from 0.

A separate measure of success can be defined in situations where the analyst $i$ may only have general information about the audience member $j$, but may not know specifically who the audience will be. In such cases, the analyst may have information about the population parameters of the audience and so may wish to measure success based on the mean values for the population. We look at the difference in expected values for the weightings for all $K$ principles and denote this the potential pairwise success of an analysis, because we have not yet observed the audience's principle weightings.

Definition 3 (Potential Pairwise Success). A data analysis is potentially successful for the pairing of analyst $i$ with audience member $j$ if $E[D_{ij}] = 0$.

A key distinction between strong (or weak) pairwise success and potential pairwise success is that the former can only be evaluated when analyst and audience meet and a data analysis is presented. Potential pairwise success can be evaluated before an analyst presents the analysis to the audience. As such, the potential pairwise success metric could serve as a target for optimization by the analyst, and we discuss this briefly in the Discussion below.
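Definitions 1 and 2 can be checked numerically once the distance vector $D_{ij}$ is in hand. The sketch below uses invented distance values for illustration; it shows how the choice of norm changes the verdict for the same analysis.

```python
import numpy as np

def strong_success(D, eps):
    """Definition 1: the largest principle-wise gap must be below eps."""
    return np.max(np.abs(D)) < eps

def weak_success(D, eps, p=2):
    """Definition 2: the p-norm of the gaps must be below eps (p >= 1)."""
    return np.sum(np.abs(D) ** p) ** (1.0 / p) < eps

# Illustrative distances for K = 6 principles (not from real data).
D = np.array([0.1, -0.2, 0.05, 0.3, -0.1, 0.15])

print(strong_success(D, eps=0.25))    # False: the 0.3 gap alone is too large
print(weak_success(D, eps=0.5, p=2))  # True: the 2-norm (~0.42) is below 0.5
```

The same analysis can thus fail the strong criterion on a single principle while still passing a weak criterion that only bounds the aggregate disagreement.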
Up until this point, we have assumed the audience consisted of a single member indexed by $j$. However, it is common that a data analysis will be reviewed by or presented to a group of audience members. If there are $J$ members of the audience, then we can extend Equation (4) as follows:
$$
D^{(k)}_{i\cdot} = \frac{1}{J}\sum_{j=1}^J D^{(k)}_{ij} = \psi^{(k)}_i - \frac{1}{J}\sum_{j=1}^J \alpha^{(k)}_j = \left(\lambda^{(k)}_{f_i} - \frac{1}{J}\sum_j \lambda^{(k)}_{f_j}\right) + \left(\delta^{(k)}_i - \frac{1}{J}\sum_j \eta^{(k)}_j\right) + \left(x_i' \beta^{(k)}_i - \frac{1}{J}\sum_j z_j' \gamma^{(k)}_j\right). \quad (7)
$$
In this formulation, $D^{(k)}_{i\cdot}$ is small if principle $k$ is weighted by the analyst in a manner that is equal to the mean of the members of the audience. With this extension of the principle-specific weight difference to group audiences, we can modify our definition of potential pairwise success to be

Definition 4 (Potential Group Success). A data analysis is potentially successful for analyst $i$ presenting to a group consisting of members $j = 1, \ldots, J$ if, for the vector $D_{i\cdot} = \left(D^{(1)}_{i\cdot}, \ldots, D^{(K)}_{i\cdot}\right)$, we have $E[D_{i\cdot}] = 0$.

Analogous definitions for strong group success and weak group success could be constructed, but we omit them here. We believe the definition of potential group success is the most relevant to data analysts who will be presenting their work to multiple people and may need to consider the heterogeneity of the audience to which they will be presenting.
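The group extension in Equation (7) is an average of the pairwise distances over the audience, which reduces to comparing the analyst's logits to the audience mean. A minimal sketch, with invented logit values for $K = 3$ principles and $J = 4$ audience members:

```python
import numpy as np

def group_distance(psi, alphas):
    """Average principle-wise distance over J audience members, Eq. (7).

    psi:    length-K logit vector for the analyst
    alphas: (J, K) array of logit vectors, one row per audience member
    """
    return psi - alphas.mean(axis=0)

# Illustrative values: the analyst happens to match the audience mean.
psi = np.array([1.0, 0.0, -1.0])
alphas = np.array([
    [1.2, 0.1, -1.1],
    [0.8, -0.1, -0.9],
    [1.1, 0.0, -1.0],
    [0.9, 0.0, -1.0],
])

D_group = group_distance(psi, alphas)
print(D_group)  # per-principle gap between the analyst and the audience mean
```

Here no single audience member agrees with the analyst exactly, yet the group distance is essentially zero, which is precisely the situation Definition 4 rewards.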
The definitions of pairwise success and potential pairwise success presented in Section 2 lead to several implications about how data analyses may or may not succeed and what could potentially be done to improve the success of any given analysis. We discuss some of these implications in this section.

First, it follows from Equation (4) that one way in which $E[D^{(k)}_{ij}]$ could be made smaller would be to have the analyst and audience member be from the same field. If analyst $i$ and audience member $j$ have $f_i = f_j$, then we have $\lambda^{(k)}_{f_i} - \lambda^{(k)}_{f_j} = 0$. The interpretation of this is that members of the same field share similar conventions with respect to a given principle. For example, if "computational reproducibility" is the $k$th principle, then members of the field of computational biology (for example), which generally places a high weight on computational reproducibility, would on average place a high weight on that principle. We might then expect data analyses in this field to generally demonstrate a high weight on reproducibility, with perhaps code and data routinely made available. As a result, we would expect a higher potential for success (i.e. smaller $E[D^{(k)}_{ij}]$) if analyst $i$ and audience member $j$ are both in the field of computational biology.

The random variation in $D^{(k)}_{ij}$ ensures that $D^{(k)}_{ij}$ can never be equal to 0. In defining strongly successful and weakly successful analyses, we allow for some differences between analyst and audiences. The magnitude of allowable differences in principle weightings, $\varepsilon$, is likely to be analysis-specific and will depend in part on the context and circumstances surrounding the analysis. For a quick, "work-in-progress" type of analysis, the audience may allow for larger deviations, with the presumption that the final version will have the appropriate principle weighting.
More "final" analyses, such as a published paper, may require a stricter adherence to the audience's principle weightings in order to declare success.

In addition, the resources available to the analyst and the specific characteristics of the problem being addressed may lead an analyst to re-prioritize the weights assigned to different principles, leading to a deviation from what they might typically assign based solely on field conventions and personal preference. For analyst $i$, these resources and problem characteristics are denoted by $x_i$, and the manner in which an analyst re-prioritizes principle $k$ in response to changes in resources or problem characteristics is encoded in the vector $\beta^{(k)}_i$.

For example, if principle $k$ is computational reproducibility and $x_i$ is the time available to analyst $i$ to do the analysis, then $\beta^{(k)}_i = 0$ would imply that the analyst places the same amount of weight on this principle no matter how much time is available. However, if principle $k$ is "exhaustiveness" and $x_i$ is time available, then $\beta^{(k)}_i > 0$ would imply that the more time that is available for an analysis, the more the analyst prioritizes exhaustiveness relative to the other principles (and similarly, less time available would lead to less weight on exhaustiveness). Furthermore, in our formulation, $\exp\left(\beta^{(k)}_i\right)$ can be interpreted as how many more times principle $k$ is weighted versus all of the other principles under consideration.

The audience's perception of the resources available for conducting the analysis and the problem-specific characteristics of the analysis is encoded in $z_j$ and can play a role in how different principles are weighted via $\gamma^{(k)}_j$. If $x_i = z_j$, then the audience's perception of the resources and characteristics is equal to that of the analyst.
If $\beta^{(k)}_i = \gamma^{(k)}_j$, then the audience and analyst have the same understanding of how resources and problem-specific characteristics should affect principle weightings (if at all).

Ultimately, even if $E[D^{(k)}_{ij}] = 0$, we can still observe a mismatch between analyst and audience based on individual-level random variation. Each analyst and audience member will randomly deviate from their field-specific mean, and the variance of those deviations will play a role in the likelihood of success for a given analysis. If analyst $i$ presents an analysis to audience member $j$ of a field that exhibits wide variation in how they weight a given principle, then the probability of a mismatch is large, even if the analyst's and audience's fields have similar mean values on that principle.

Our definition of potential group success in Definition 4 suggests that analysts are successful when their individual principle weightings match those of the audience's mean values. If the audience members are all members of the same field, then we have $\lambda^{(k)}_{f_1} = \cdots = \lambda^{(k)}_{f_J}$, which greatly simplifies any analytic presentation. However, if the audience members are all of different fields, then it may be more difficult to identify the mean of the audience members' $\lambda^{(k)}_{f_j}$ values. Interestingly, as the audience size grows, we should have that $\frac{1}{J}\sum_j \eta^{(k)}_j \to 0$, regardless of the individual audience members' respective fields, because we assume that an individual audience member's field-specific deviation is random with mean 0. This suggests that for audiences beyond a certain size, that component of $D^{(k)}_{i\cdot}$ will always be near 0. This is intuitive, as when presenting to a very large audience, it is generally challenging for the analyst to consider the individual needs of every audience member.
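The claim that the audience-average deviation $\frac{1}{J}\sum_j \eta^{(k)}_j$ shrinks as the audience grows can be illustrated with a small simulation. This is a sketch under assumptions of ours: the model only requires mean 0 and finite variance, and the normal distribution and unit scale below are chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# eta_j are mean-0 deviations with finite variance; by the law of large
# numbers their average over an audience of size J approaches 0 as J grows.
for J in (5, 50, 5000):
    eta = rng.normal(0.0, 1.0, size=J)
    print(J, abs(eta.mean()))
```

The printed averages tend to shrink roughly like $1/\sqrt{J}$, matching the intuition that a large audience "averages out" individual idiosyncrasies.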
In proposing how to define the success of a data analysis, our goal is to provide a guide for practicing data scientists and students in data science courses for how to build a successful data analysis. We aim for this to serve as one of many possible performance metrics in the evaluation of data analyses as part of our larger goal of developing a theory of data analysis and data science. Other performance metrics could include the validity of a data analysis, the evaluation of the strength of evidence in the science or question underlying the data analysis, or the honesty or intention of the data analyst when building a data analysis.

We have presented the idea of data analysis success in a manner that places the analyst and the audience as essentially passive actors embedded in a larger framework. However, the reality is that the analyst will typically have far more agency in determining the success of an analysis. Furthermore, when the audience is small, there will likely be some communication between analyst and audience, either before the analysis is conducted or while it is ongoing, to better lay the groundwork for success. Such communication could be used to broker agreement on which principles guide the analysis and how each principle is weighted. We have presented the notion of a data analysis as a snapshot of what is more likely a complex dynamical activity, with constant feedback and adjustments taking place over time.

One important feature that we have not discussed is how a data analyst $i$ can adjust her or his weights (before presenting the analysis) for each principle $k$ based on how they perceive the audience would weight each principle. Another way of stating this is that as analyst $i$ might have partial information about audience member $j$, such as their background and other contextual information, analyst $i$ may choose to adjust their principle weightings based on what they perceive the audience preferences to be.
One way to obtain this information would be to directly ask audience member $j$ about their principle-specific weights $A^{(k)}_j$ for a given data analysis. In this way, one could imagine an audience correction factor added to Equation (4) to allow for the possibility that an analyst might attempt to adjust their principle-specific weighting $W^{(k)}_i$ based on what they expect the audience to prefer. In addition, this correction factor could be a function of the degree by which the analyst attempts to correct for the audience's weighting preferences. In some cases, the analyst will make a strong attempt to adjust the analysis to the audience's preferences, and in other cases the analyst will make minor adjustments, if any at all. Developing and characterizing strategies for analysts to actively improve the chances of success is an important area of future work.

Our definition of success in data analysis depends solely on the participants, the analyst and the audience, and the outputs of the data analysis. In theory, one could calculate the pairwise success of an analysis with just those elements. Critically, we do not consider events or information that occur outside the analysis or perhaps in the future. For example, an analysis may make certain conclusions based on the evidence available in the data that are later invalidated by more in-depth analysis (perhaps with better data). We do not therefore conclude that the original analysis was by definition a failure. At any given moment, an analysis can only draw on the data and evidence that are available. It therefore seems inappropriate to judge the success of a data analysis based on information that was not accessible at the time.

Our approach to defining success in data analysis shares many elements with the field of design thinking in its approach to building a solution matched to a specific audience [24, 25].
In some ways, one could think of a data analysis as a kind of “product”, in the sense that it is not a naturally occurring object in nature. As such, someone (the analyst) must design the analysis in a manner that makes it useful to the audience. The success of the analysis will depend in part on considering the audience, much like the success of any designed product.

Finally, a benefit of defining success in data analysis is that we can now clearly recognize failure. Learning from failed data analyses is an important aspect of the training of any data analyst, and the first step in that process is knowing when failure has occurred. Dialog between audience and analyst about why an analysis has failed can improve the quality of future analyses, as well as improve the quality of the relationship between analyst and audience. Critical to such “post-mortem” discussions is that they be conducted in a blameless manner [26] so that analyst and audience can quickly come to a resolution over how problems should be fixed.

The practice and teaching of data science requires the careful evaluation of data analyses. In this paper, we introduce a precise metric of quality evaluation to assess the success of a data analysis. This metric depends on input from both the data analyst and the audience evaluating the data analysis. The benefits of this general framework for evaluating the success of a data analysis include providing a guide to practicing data scientists and students in data science courses on how to build a successful data analysis.

References

[1] William S. Cleveland. Data science: An action plan for expanding the technical areas of the field of statistics.
International Statistical Review / Revue Internationale de Statistique, 69(1):21–26, 2001. ISSN 03067734, 17515823.

[2] Deborah Nolan and Duncan Temple Lang. Computing in the statistics curricula.
The American Statistician, 64(2):97–107, 2010. doi: 10.1198/tast.2010.09132. URL https://doi.org/10.1198/tast.2010.09132.

[3] American Statistical Association Undergraduate Guidelines Workgroup. American Statistical Association, 2014.

[4] Ben Baumer. A data science course for undergraduates: Thinking with data.
The American Statistician, 69:334–342, 2015.

[5] PricewaterhouseCoopers.
What’s next for the data science and analytics job market?, 2019.

[6] Johanna Hardin, Roger Hoerl, Nicholas J. Horton, and Deborah Nolan. Data science in statistics curricula: Preparing students to “think with data”.
The American Statistician, 69:343–353, 2015.

[7] Nicholas J. Horton and Johanna S. Hardin. Teaching the Next Generation of Statistics Students to "Think With Data": Special Issue on Statistics and the Undergraduate Curriculum.
The American Statistician, 69(4):259–265, 2015. doi: 10.1080/00031305.2015.1094283. URL https://doi.org/10.1080/00031305.2015.1094283.

[8] David Donoho. 50 years of data science.
Journal of Computational and Graphical Statistics, 26(4):745–766, 2017. doi: 10.1080/10618600.2017.1384734. URL https://doi.org/10.1080/10618600.2017.1384734.

[9] Daniel Kaplan. Teaching stats for data science.
The American Statistician, 72(1):89–96, 2018. doi: 10.1080/00031305.2017.1398107. URL https://doi.org/10.1080/00031305.2017.1398107.

[10] Stephanie C. Hicks and Rafael A. Irizarry. A guide to teaching data science.
The American Statistician, 72(4):382–391, 2018. doi: 10.1080/00031305.2017.1356747. URL https://doi.org/10.1080/00031305.2017.1356747.

[11] John W. Tukey. The future of data analysis.
The Annals of Mathematical Statistics, 33(1):1–67, 1962.

[12] J. W. Tukey and M. B. Wilk. Data analysis and statistics: an expository overview. In Proceedings of the November 7-10, 1966, Fall Joint Computer Conference, pages 695–709, 1966.

[13] G. E. P. Box. Science and statistics.
Journal of the American Statistical Association, 71(356):791–799, 1976.

[14] C. J. Wild. Embracing the "wider view" of statistics.
The American Statistician, 48(2):163–171, 1994.

[15] C. Chatfield.
Problem solving: a statistician’s guide. Chapman and Hall/CRC, 1995.

[16] C. J. Wild and M. Pfannkuch. Statistical thinking in empirical enquiry.
International Statistical Review / Revue Internationale de Statistique, 1999.

[17] D. Cook and D. F. Swayne.
Interactive and dynamic graphics for data analysis with R and GGobi. Springer Publishing Company, Incorporated, 2007.

[18] Garrett Grolemund and Hadley Wickham. A cognitive interpretation of data analysis.
International Statistical Review, 82(2):184–204, 2014.

[19] Jane Watson and Rosemary Callingham. Statistical literacy: A complex hierarchical construct.
Statistics Education Research Journal, 2(2):3–46, 2003.

[20] Roger D. Peng. What is a successful data analysis? Technical report, 2018. URL https://simplystatistics.org/2018/04/17/what-is-a-successful-data-analysis/.

[21] Roger D. Peng.
Essays on Data Analysis. Leanpub, 2018. URL https://leanpub.com/dataanalysisessays.

[22] Stephanie C. Hicks and Roger D. Peng. Elements and Principles of Data Analysis. arXiv, pages 1–13, 2019. URL https://arxiv.org/abs/1903.07639.

[23] R. Silberzahn, E. L. Uhlmann, D. P. Martin, P. Anselmi, F. Aust, E. Awtrey, Š. Bahník, F. Bai, C. Bannard, E. Bonnier, R. Carlsson, F. Cheung, G. Christensen, R. Clay, M. A. Craig, A. Dalla Rosa, L. Dam, M. H. Evans, I. Flores Cervantes, N. Fong, M. Gamez-Djokic, A. Glenz, S. Gordon-McKeon, T. J. Heaton, K. Hederos, M. Heene, A. J. Hofelich Mohr, F. Högden, K. Hui, M. Johannesson, J. Kalodimos, E. Kaszubowski, D. M. Kennedy, R. Lei, T. A. Lindsay, S. Liverani, C. R. Madan, D. Molden, E. Molleman, R. D. Morey, L. B. Mulder, B. R. Nijstad, N. G. Pope, B. Pope, J. M. Prenoveau, F. Rink, E. Robusto, H. Roderique, A. Sandberg, E. Schlüter, F. D. Schönbrodt, M. F. Sherman, S. A. Sommer, K. Sotak, S. Spain, C. Spörlein, T. Stafford, L. Stefanutti, S. Tauber, J. Ullrich, M. Vianello, E.-J. Wagenmakers, M. Witkowiak, S. Yoon, and B. A. Nosek. Many analysts, one data set: Making transparent how variations in analytic choices affect results.
Advances in Methods and Practices in Psychological Science, 1(3):337–356, 2018. doi: 10.1177/2515245917747646.

[24] Nigel Cross.