Perspective from the Literature on the Role of Expert Judgment in Scientific and Statistical Research and Practice
Naomi C. Brownstein∗

Abstract
This article, produced as a result of the Symposium on Statistical Inference, is an introduction to the literature on the function of expertise, judgment, and choice in the practice of statistics and scientific research. In particular, expert judgment plays a critical role in conducting Frequentist hypothesis tests and Bayesian models, especially in selection of appropriate prior distributions for model parameters. The subtlety of interpreting results is also discussed. Finally, external recommendations are collected for how to more effectively encourage proper use of judgment in statistics. The paper synthesizes the literature for the purpose of creating a single reference and inciting more productive discussions on how to improve the future of statistics and science.
Keywords:
Bayesian modeling, Hypothesis testing, Inference, Insight, Interpretation, Knowledge, Recommendations, Significance, Statisticians, Subjectivity

∗ This work was not supported by any grant. However, the author would like to thank Jeff Harman, Tom Louis, Tony O'Hagan, and Jane Pendergast for their helpful comments during the preparation of this manuscript.

1 Introduction
As a quantitative discipline, statistics is often considered by practitioners as objective in its methods, which proliferate throughout the scientific enterprise. Basic statistical methods, including Frequentist hypothesis tests and p-values, are ubiquitous in the academic literature. The American Statistical Association (ASA) recently warned the research community about common misuses of p-values in a statement (Wasserstein & Lazar 2016) that has received widespread attention in a variety of fields, even outside of statistics. As a follow-up, the ASA organized the 2017 Symposium on Statistical Inference (SSI), where statisticians and consumers of data presented research, contemplated these issues, and brainstormed possible solutions. Yet, the statistical community remains divided on explicit recommendations for researchers to improve their quantitative practice (Matthews et al. 2017). Due to the inherently quantitative nature of the field of statistics, much of the conversation on the p-value statement has revolved around quantitative solutions. Discussions of the statistical properties of these and other remedies are included elsewhere in this special issue.

Despite the perceived objectivity of the field, statisticians have previously argued that uncertainty and choice abound in the scientific process, from the definitions of questions of interest to the analysis and interpretation of results (Gelman & Hennig 2017, Goldstein et al. 2006, Berger & Berry 1988). Hence, expert judgment is required to implement scientific research (Bertolaso & Sterpetti 2017). A session at SSI examined the role of judgment in statistical and scientific practice. Session participants present their joint expert opinions on the topic in Brownstein et al. (2018). While preparing that paper, a wealth of related literature was collected and synthesized. The present article provides an overview of the literature on the role of expert judgment in statistics and science.

First, section 2 defines expertise as it relates to this paper. Then, sections 3, 4 and 5 discuss the role of expert judgment in Frequentist hypothesis testing, Bayesian inference, and general interpretation. Recommendations are provided in section 6 for stakeholders in education, publishing, and funding. Finally, appendices are provided for interested readers, including overviews of the Frequentist and Bayesian paradigms.

2 What is Expert Judgment?
Before discussing the role of expert judgment in science, definitions of the relevant terms are needed. Weinstein (1993) provides the following definitions of expert, expertise, and expert opinion that are invoked throughout this paper:

1. An individual is an expert in the 'epistemic' sense if and only if he or she is capable of offering strong justifications for a range of propositions in a domain.

2. An individual is an expert in the 'performative' sense if and only if he or she is able to perform a skill well.

3. A claim is an 'expert opinion' if and only if it is offered by an expert, the expert provides a strong justification for it, and the claim is in the domain of the expert's expertise.

4. 'Expertise' is the capacity either to offer expert opinions or to demonstrate one or more skills in a domain, and expertise in a domain does not entail expertise in the entire range of the domain.

In brief, epistemic experts possess deep knowledge about a field and are considered credible by others (Carrier 2010), while performative experts are adept at completing actions for the field. The late Stephen Hawking (Gribbin & White 2016) is a famous epistemic expert in physics. An Olympic athlete, such as Usain Bolt (Gómez et al. 2013), exemplifies performative expertise in their sport. The two types of expertise frequently overlap. Physicians, for instance, are both versed in fields such as anatomy and proficient at diagnosing ailments, performing medical procedures, and prescribing treatments and cures.

Expert judgment involves one or more evidence-based claims proposed by credible experts. Judgments may arise when multiple conclusions or actions are possible and supported by credible evidence. For example, multiple treatments may be possible for a condition, and two physicians may disagree on the preferred treatment plan for the same patient. Sections 2.1 and 2.2 relate these definitions to statistics and science.

2.1 Statistical expertise
Merriam-Webster Online Dictionary (2018b) defines statistics as "a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data." In general, like any professional skill, statistical analysis should be conducted by practitioners who are qualified based on their knowledge of and experience with statistical procedures and principles. Example key principles include knowledge of modeling assumptions and diagnostics, as well as experience using and interpreting statistical software.

Based on the definition of expert in Weinstein (1993), an active statistician, defined as "one versed in or engaged in compiling statistics" (Merriam-Webster Online Dictionary 2018a), is clearly considered an expert. Similar to the ability of a primary care doctor to identify and treat general health needs for their patients, statisticians are trained with knowledge and skill in choosing, creating, implementing, and validating statistical procedures for a wide variety of settings. Others can be considered statistical experts as well if they have sufficient relevant knowledge and training in statistical theory and practice. Both types of expertise are needed for valid statistics, even if divided among multiple people. A common model is for a person with extensive training in statistical theory to supervise the study design and statistical analysis, which is executed by an analyst facile in programming.

Statistics is a broad field with numerous subfields, rendering it impossible to know all aspects of each. Instead, analogous to how physicians choose medical specialties, most statisticians specialize in one or more subfields, such as genetics, Bayesian modeling, or survival analysis. Classifications of subfields in statistics are found in Schell (2010) and De Battisti et al. (2015). Specialized problems require expertise in different types of analytical methods. Identifying and deciding between reasonable analytical choices requires judgment, as described throughout this paper and highlighted by others (Francis 2017).

2.2 Content expertise

While statisticians are critical members of scientific teams, interdisciplinary collaborations include experts in multiple fields. For instance, to develop a clinical trial to test an oncology drug, expertise could be required in fields such as medicine, biology, and genetics. For the purpose of this article, a content expert, or subject matter expert, has expertise in a field of science other than statistics. Content experts are key in advancing scientific endeavors, such as generating hypotheses based on knowledge of biological mechanisms, designing and supervising experiments using state of the art laboratory techniques, providing intuition on clinically meaningful values, and contextualizing statistical results. Communication of content knowledge to the statistician is also critical, as such insight may shape the study design and analysis in subtle ways, as described in Brownstein et al. (2018).
3 Expert Judgment in Frequentist Hypothesis Testing

This section discusses the judgment of experts, as defined in Section 2, in Frequentist hypothesis testing. Topics include finding a balance between type I and type II errors, adjustment for multiple comparisons, and special considerations such as orphan drugs and non-inferiority trials.
3.1 Balancing Type I and Type II Errors

It is well known that, for a fixed sample size, there is a trade-off between minimizing false positives (type I error) and maximizing true positives (power). Clearly, it is desirable both to have small error probabilities and high power. One way to reduce the probabilities of both errors simultaneously is to increase the sample size of an experiment (Asendorpf et al. 2013). However, this solution may not be feasible, as outlined in Section 6.3. Instead, experts should consider their preferred balance between the two types of errors and design their experiments appropriately. The statistician is responsible for engaging the subject matter experts in discussions to determine the levels of each type of error that are scientifically and ethically reasonable for their study. Current methods of using judgment to balance errors in Frequentist hypothesis testing are discussed below along with their implications.
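To make the trade-off concrete, the following minimal sketch (not taken from the paper; the effect size, sample size, and α values are illustrative assumptions) uses a standard normal approximation for a two-sided two-sample test to show how power falls as the significance level is tightened at a fixed sample size.

```python
# Minimal sketch (illustrative assumptions): the type I / type II trade-off at a
# fixed sample size, using a normal approximation for a two-sample z-test.
from scipy.stats import norm

def approx_power(effect_size, n_per_group, alpha):
    """Approximate power of a two-sided two-sample z-test."""
    z_crit = norm.ppf(1 - alpha / 2)
    noncentrality = effect_size * (n_per_group / 2) ** 0.5
    # Probability of exceeding the critical value under the alternative
    # (the contribution of the opposite tail is negligible here).
    return 1 - norm.cdf(z_crit - noncentrality)

for alpha in [0.10, 0.05, 0.01, 0.005]:
    power = approx_power(effect_size=0.3, n_per_group=100, alpha=alpha)
    print(f"alpha = {alpha:>5}: power ~ {power:.2f}, type II error ~ {1 - power:.2f}")
```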
3.1.1 Minimizing Type I Error

For most experiments, the significance level is set first. In other words, the judgment is that type I errors should be held at a certain low level, after which other considerations can be made. The most famous value is 5%, usually chosen by convention. Rather, Lakens et al. (2018) argue that researchers should determine the ideal level for their individual study beforehand using decision theory and report the decision and methodology in the manuscript. Unfortunately, this practice is uncommon; a brief discussion is provided in section 3.1.3. Instead, most researchers compare to a single threshold, the customary value for which is currently under discussion. A summary of the discussion is provided here.

Recent concern about reproducibility in science focuses on the fact that interesting associations in the literature are not easily replicated in future studies (Ioannidis 2005). A hypothesized cause is that the observed type I error rate may far exceed the nominal significance level. This hypothesis inspired recent recommendations to lower the most commonly used significance level further, e.g. to 0.005 (Benjamin et al. 2017, Johnson et al. 2017, Johnson 2013). The goal is to minimize the probability of a type I error in hopes of filtering out weak effects and increasing the reproducibility of science as a whole.

On the other hand, there is concern that the reproducibility crisis may be due to a widespread lack of power stemming from inadequate power calculations (Marino 2017, Smaldino & McElreath 2016, Asendorpf et al. 2013). Calls to lower the type I error rate have been criticized for enabling underpowered studies if no other action is taken, such as raising funding levels to accommodate increased sample sizes (Asendorpf et al. 2013, Lakens et al. 2018). Further, Lakens et al. (2018) argue that the recommendation may be hurting the very cause that it champions by decreasing resources and incentives for replication studies.
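As a rough illustration of the resource implications (all numbers below are assumptions for illustration, not values from the cited works), the same normal approximation shows how the per-group sample size needed to maintain 80% power grows when α is lowered from 0.05 to 0.005.

```python
# Minimal sketch (illustrative assumptions): required per-group sample size for
# 80% power under a two-sided two-sample z-test, at two significance levels.
from scipy.stats import norm

def n_per_group(effect_size, alpha, power):
    """Per-group n from the standard normal-approximation formula."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

for alpha in [0.05, 0.005]:
    print(f"alpha = {alpha}: n per group ~ {n_per_group(0.3, alpha, 0.80):.0f}")
```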
3.1.2 Minimizing Type II Error

The paradigm of prioritizing a fixed significance level chosen only by convention may be inappropriate for specific applications, especially if excessive harm could be associated with a lack of discovery. For example, if the question evaluates the harm of a certain environmental factor, then the precautionary principle (Fischer & Ghelardi 2016, Fjelland 2016) states that it is better to err on the side of caution, as failing to detect underlying harm would result in failure to reduce personal harm for individuals currently affected and future individuals. In these cases, minimizing the type II error probability is judged to be of the utmost concern. Power calculations necessitate specification of particular realizations of the alternative hypothesis or effect size. Determination of clinically and statistically meaningful effect sizes is highly dependent on the judgment of both statisticians and content experts; guidance is provided elsewhere (Murphy et al. 2014, Ellis 2010, Lakens 2013).

The story of Love Canal, a suburban neighborhood near Niagara Falls, illustrates prioritizing minimization of the harm of a type II error over that of a type I error. The general null hypothesis is the safety of the area; the alternative is an association between toxic chemicals from the Love Canal and increased cancer prevalence. The relative (bodily) harm of a type II error, declaring the area safe when it was hazardous to the residents, exceeded the (financial) harm of a type I error. Initially, an investigation erroneously concluded that the area was safe. The judgment of experts, namely a scientist with detailed observational data from the residents, helped diagnose the error in the initial investigation. Eventually, the area was closed. Briefly, the first study included a reasonable test to answer the wrong question. The original investigation tested the seemingly intuitive hypothesis that proximity to a chemical waste site was associated with more negative outcomes. The second investigation refined the hypothesis, namely that residents in homes near the site built over former stream-beds were more at risk than those in homes built over dry land. The latter test found a strong association based on sound biological principles with clear implications for evacuation prioritization. A detailed discussion of the story and the role of judgment is found in Fjelland (2016).

Similarly prioritizing power, the United States Food and Drug Administration has a separate regulatory category for orphan drugs to allow new treatments to have a higher chance of adoption if no other therapeutic option is available to mitigate a rare condition (Braun et al. 2010). By definition, small populations of patients with rare conditions may make traditional recruitment targets infeasible and may require creativity in designing valid trials. Experts may design lower powered studies, increase the target significance level, consider alternative or surrogate outcomes, or even develop new methodology for small samples (Parmar et al. 2016, Billingham et al. 2016). Orphan drug trials, while usually relatively small, have sometimes, but not always (Orfali et al. 2012), been found to suffer from methodological shortcomings, such as lack of blinding or randomization (Kesselheim et al. 2011, Bell & Smith 2014). Furthermore, determination of the value of a drug is complex, based on factors such as disease prevalence, severity, mortality, morbidity, treatment benefit and safety, and cost effectiveness (Paulden et al. 2015).

3.1.3 Minimizing Overall Error
Mudge et al. (2012) recommend considering the two types of error together and minimizing the overall error rate. Two approaches include averaging the errors themselves and calculating the overall cost based on individual error costs. Similar approaches are applied in fields such as climate science, where the optimal α level is calculated via simulation (Kemp 2016). (It is worth noting that the optimal α in Kemp's study was close to the recommendation by Benjamin et al. 2017.) Mudge et al. (2017) even argue that these methods can simplify analyses with multiple hypothesis tests, which also require judgment as described in section 3.3. More generally, Grieve (2015) analyzes how overall minimization approaches relate to the likelihood principle and Bayesian decision making. Implications of minimizing the overall error rate for research outcomes are modeled in Miller & Ulrich (2016).

3.2 Non-Inferiority Trials

The two hypotheses (the null and alternative) tested in standard statistical methods serve different functions, described in the appendix (section 8). Sprenger (2018) elaborates on the asymmetry between the two hypotheses and the inherent value judgments implicit in Frequentist methods. In non-inferiority trials, the roles of the two hypotheses are switched. (See the appendix for more details.) Non-inferiority studies require care in their design, analysis, and interpretation (Mauri & D'Agostino Sr 2017, Fleming et al. 2011, Fleming 2008). Examples include justification of an appropriate comparison group, choice of endpoint and non-inferiority margins "based on statistical reasoning and clinical judgment" (Fleming 2008), and careful a priori design to handle known challenges (Mauri & D'Agostino Sr 2017, D'Agostino Sr et al. 2003). Rehal et al. (2016) found that statistical recommendations are unclear and that those that exist are not necessarily followed.
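As a rough sketch of the arithmetic behind such a design (the margin, response rates, and sample sizes below are invented for illustration and are not from the cited trials), a non-inferiority comparison of two proportions can be framed as a one-sided test against a pre-specified margin, with the null hypothesis stating that the new treatment is worse than the control by more than that margin.

```python
# Minimal sketch (made-up numbers): a non-inferiority comparison of two
# proportions. The roles of the hypotheses are switched relative to a
# superiority test: H0 is that the new treatment is inferior by more than
# the pre-specified margin.
import numpy as np
from scipy.stats import norm

margin = 0.10                       # hypothetical non-inferiority margin
x_new, n_new = 312, 400             # responders / sample size, new treatment
x_ctl, n_ctl = 328, 400             # responders / sample size, active control

p_new, p_ctl = x_new / n_new, x_ctl / n_ctl
diff = p_new - p_ctl
se = np.sqrt(p_new * (1 - p_new) / n_new + p_ctl * (1 - p_ctl) / n_ctl)

# H0: p_new - p_ctl <= -margin   versus   H1: p_new - p_ctl > -margin
z = (diff + margin) / se
p_value = 1 - norm.cdf(z)
ci_lower = diff - norm.ppf(0.975) * se

print(f"Estimated difference: {diff:.3f}, one-sided p-value: {p_value:.4f}")
print(f"Lower bound of 95% CI: {ci_lower:.3f} (non-inferior if this exceeds -{margin})")
```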
3.3 Multiple Comparisons

While testing multiple hypotheses in a single study is common, how and whether to adjust each test in light of the others is a non-trivial question. In certain settings, such as exploratory research, adjustment may be optional (Wason et al. 2014, Li et al. 2017, Gelman et al. 2012). Often, whether or not adjustment is needed may require judgment from a statistical expert about relationships between questions of interest (e.g. correlations or other measures of association). For example, tests comparing the effects of distinct treatments to a single control may not need adjustment, but tests comparing repeated measurements or multiple outcomes likely do (Candlish et al. 2017, Li et al. 2017, Wason et al. 2014).

Even when adjustment for multiple comparisons is generally agreed to be important, expertise is required in specifying an appropriate procedure for a particular problem. One issue to consider first is the number of tests for which to adjust (primary questions of interest only, primary and secondary questions, all planned tests, etc.). This issue requires input from both statisticians and subject matter experts about research priorities. The choice of adjustment method depends on the desired balance of type I and II errors, as discussed in Section 3.1. Additional considerations for multiple testing are provided elsewhere (Li et al. 2017, Wason et al. 2014, Alosh & Huque 2009, Proschan & Waclawiw 2000).
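For concreteness, the sketch below (with made-up p-values) contrasts two common adjustments, the Bonferroni correction and the Benjamini & Hochberg (1995) false discovery rate step-up procedure. Whether either is appropriate for a given study remains a judgment call of the kind described above.

```python
# Minimal sketch (made-up p-values): two common multiplicity adjustments.
import numpy as np

def bonferroni(pvals):
    """Bonferroni-adjusted p-values."""
    return np.minimum(np.asarray(pvals) * len(pvals), 1.0)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg (FDR) adjusted p-values."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downward.
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.minimum(adjusted, 1.0)
    return out

pvals = [0.001, 0.008, 0.039, 0.041, 0.27, 0.74]
print("Bonferroni:", np.round(bonferroni(pvals), 3))
print("BH (FDR):  ", np.round(benjamini_hochberg(pvals), 3))
```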
3.4 Judgment of p-Values

The p-value was designed as an informal measure of the evidence that could cast doubt on the null hypothesis, not to inform a binary choice. Fisher (1955) objected to the decision theory framework of Neyman & Pearson (1928). Christensen (2005) describes a Fisherian view that "an α level should never be chosen; that a scientist should simply evaluate the evidence embodied in the p-value." The modern use of p-values arguably has lost the emphasis on scientific judgment and extrapolates far beyond their intended use. Commentaries abound on the scientific and moral implications of the abuse of the p-value (Goodman 2016, Ziliak & McCloskey 2009, Steel et al. 2013, Pittenger 2001, Nickerson 2000, Folger 1989). This section discusses judgment of p-values outside of the decision-theory paradigm.

While some authors (e.g. Cumming 2014) and journals (see section 6.3) recommend abandoning p-values entirely, others argue that alternatives are not necessarily better (Murtaugh 2014, Macnaughton n.d., Ionides et al. 2017, de Valpine 2014). Instead, p-values could be used in ways other than as sole decision-makers.

First, "borderline" values in either direction could be judged accordingly (Cohen 2011), with replication studies encouraged to reevaluate the findings. Given that the nominal significance level α is an arbitrary choice, there is little qualitative difference between p-values in the interval (α − ε, α + ε), where ε is a number much smaller than the significance level. Yet, many researchers, even statisticians (McShane & Gal 2017), reject null hypotheses corresponding to p-values in the lower half of the interval and fail to reject null hypotheses corresponding to p-values in the upper half. While intuitively there is little difference between p-values of 0.049 and 0.051, papers associated with the former are far more likely to be published than papers associated with the latter.

Additionally, p-values can be considered just one piece of evidence to be evaluated with other factors, such as study design quality and effect size (Spurlock 2017, Wasserstein & Lazar 2016). In fact, the United States Supreme Court issued an opinion supporting the use of statistical information as evidence rather than as a decision tool with blunt cut-offs (Liptak 2011).

4 Expert Judgment in Bayesian Inference

To mitigate problems with Frequentist inference, a large portion of the statistical community recommends greater engagement with Bayesian methods (Held & Ott 2018, Page & Satake 2017, Goldstein et al. 2006). The Bayesian paradigm enables more intuitive interpretations that directly answer questions of interest (Brownstein et al. 2018, Goldstein et al. 2006). The vast collection of literature on Bayesian methods extends far beyond the scope of the present paper. Appendix 8.2 provides a brief overview of the Bayesian paradigm.

It is well known that Bayesian statistics requires judgment in the choice of a prior distribution and that results may be sensitive to the choice of a prior (Gelman & Hennig 2017, Gelman et al. 2014). Therefore, careful consideration should be paid to the specification of the prior and its parameters. Determination of a prior requires input from both statisticians and experts in the fields of application. The focus of this section is an overview of prior development in Bayesian methods with emphasis on expert judgment and choice.

4.1 Expert Judgment in Choice of a Prior
Priors can be chosen in a variety of ways. Certain "standard" priors may be chosen for mathematical elegance, computational simplicity, or posterior robustness. Classes of priors, such as non-informative priors, conjugate families, and reference priors, are described extensively elsewhere (Bernardo 1979, Berger & Bernardo 1992, Syversveen 1998, Berger et al. 2009, DeGroot 2005, Fraser et al. 2010). More commonly, prior distributions and parameters may be determined from pilot data or other values from the literature. Details on the field of empirical Bayes, involving priors derived from the current data, are found in chapter 5 of Carlin & Louis (2008). Little (2011) describes a hybrid between Bayesian and Frequentist methods called calibrated Bayes, in which Frequentist methods help determine the prior for a Bayesian model, which is then used for inference. Priors for Bayesian models can also be chosen to coincide with Frequentist procedures (Datta & Ghosh 1995).
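As a small illustration of why the choice matters (the pilot data and priors below are hypothetical and not drawn from the cited works), a conjugate Beta prior for a binomial proportion yields a closed-form posterior, and different plausible priors can lead to noticeably different conclusions with modest data.

```python
# Minimal sketch (hypothetical data): sensitivity of a conjugate Beta-Binomial
# posterior to the choice of prior. The posterior for the success probability
# is Beta(a + successes, b + failures).
from scipy.stats import beta

successes, trials = 7, 20  # made-up pilot data

priors = {
    "flat Beta(1, 1)":        (1, 1),
    "sceptical Beta(2, 18)":  (2, 18),   # mass concentrated near 0.1
    "optimistic Beta(12, 8)": (12, 8),   # mass concentrated near 0.6
}

for label, (a, b) in priors.items():
    post = beta(a + successes, b + trials - successes)
    lo, hi = post.ppf(0.025), post.ppf(0.975)
    print(f"{label:<24} posterior mean {post.mean():.2f}, "
          f"95% interval ({lo:.2f}, {hi:.2f})")
```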
Alternatively, experts can directly develop and calibrate a prior distribution for their project. The process, called elicitation, is detailed in numerous places, such as Morgan (2014), O'Hagan et al. (2006), Garthwaite et al. (2005), and even in this special issue (O'Hagan 2018). In brief, a facilitator carefully works with content experts to quantify their judgment about parameters of interest and either directly uses or transforms the elicited densities into prior distributions for the model. Elicitation, which falls into the subjective Bayesian paradigm (Goldstein et al. 2006), is used in a variety of fields, such as clinical trials, environmental modeling, economics, and ecology (Mason et al. 2017, Krueger et al. 2012, O'Hagan 2012, Kuhnert et al. 2010, Martel et al. 2009).

Johnson et al. (2010) urge examination of elicitation methods for validity and reliability. Morgan (2014) details uses and abuses of elicitation, and Heitjan (2017) provides recent criticisms. While bias may be induced if elicitation is done poorly (Kynn 2008), elicitation can be used to estimate and mitigate bias (Turner et al. 2009). In fact, expert judgment and elicitation are critical to modern practice in environmental modeling (Krueger et al. 2012).
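The following is a minimal sketch, under simplifying assumptions, of one way a facilitator might convert elicited judgments into a prior: the expert's stated percentiles (hypothetical values here) are matched to a Beta distribution by least squares. Practical elicitation protocols, such as those described in O'Hagan et al. (2006), involve considerably more care than this.

```python
# Minimal sketch (hypothetical elicited values): fitting a Beta prior to an
# expert's stated 5th, 50th, and 95th percentiles for a proportion.
import numpy as np
from scipy.stats import beta
from scipy.optimize import minimize

elicited = {0.05: 0.10, 0.50: 0.25, 0.95: 0.45}  # made-up expert percentiles

def mismatch(log_params):
    a, b = np.exp(log_params)  # keep the shape parameters positive
    return sum((beta.ppf(q, a, b) - v) ** 2 for q, v in elicited.items())

fit = minimize(mismatch, x0=np.log([2.0, 5.0]), method="Nelder-Mead")
a_hat, b_hat = np.exp(fit.x)
print(f"Fitted prior: Beta({a_hat:.1f}, {b_hat:.1f})")
print("Implied percentiles:",
      {q: round(float(beta.ppf(q, a_hat, b_hat)), 2) for q in elicited})
```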
4.2 Model Checking in the Bayesian Paradigm

Regardless of the type of prior chosen for modeling, Gelman & Shalizi (2013) and Sprenger (2018) stress the importance of checking that the observed data match the prior well enough for the application and of considering alternative action if they diverge in key ways. For another example and discussion of implications for clinical trials when the prior does not match the observed data, please see Brownstein et al. (2018).
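One simple way to operationalize such a check is a prior predictive comparison, sketched below with hypothetical numbers: data are simulated from the prior predictive distribution and compared with the observed outcome, and a very small tail probability signals a prior-data conflict worth revisiting with the experts.

```python
# Minimal sketch (hypothetical numbers): a prior predictive check for
# prior-data conflict in a Beta-Binomial setting.
import numpy as np

rng = np.random.default_rng(1)

a, b = 12, 8              # prior Beta(12, 8): expects roughly 60% successes
observed, trials = 3, 20  # observed data look much less favourable

theta = rng.beta(a, b, size=50_000)   # draw parameters from the prior
sim = rng.binomial(trials, theta)     # prior predictive draws of the count
tail_prob = np.mean(sim <= observed)  # how surprising are the observed data?

print(f"Prior predictive P(successes <= {observed}) = {tail_prob:.3f}")
# A very small tail probability flags a conflict between prior and data.
```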
5 Expert Judgment in Interpretation

Interpretation of results involves both knowledge of the statistical modeling process and broader experience with the subject matter. For example, to understand the results of a t-test of the null hypothesis that means in two populations are equal, one must first understand the basics behind hypothesis testing in general as well as assumptions and considerations specific to the two-sample t-test, such as normally distributed populations or large samples. Misinterpretation at this stage, such as thinking a p-value means something it does not, or otherwise wrongly acting on a basic definition (confidence intervals, odds ratios, areas under ROC curves, etc.), is straightforward to correct with training in statistical literacy, as discussed in section 6.1. Of course, interpretation requires more than simply correctly applying definitions. Contextualizing results requires content expertise within the scientific field. Both types of expertise are required for other aspects, such as planning an analysis and evaluating potential bias, to name a few. This section focuses on examples including post-hoc difficulties with the planned analysis and missing data. Additional biases in interpretation of scientific evidence are examined in Kaptchuk (2003).

Choices abound in designing a study, from the experimental design to the statistical analysis and interpretation; these choices require scientific judgment. (Please see Brownstein et al. 2018 for details.) Ideally, content experts and statisticians should jointly define and prioritize scientific questions and plan statistically valid, practical, and interpretable methods to answer them. Examples requiring judgment include decisions on the number and type of primary questions of interest, defining practically significant effect sizes on which to base sample size calculations, and whether and when to conduct interim analyses.

A statistical analysis plan (SAP) is a formal document written prior to a study that describes the planned protocol. Adams-Huet & Ahn (2009) provide guidance for clinicians to work with statisticians in writing SAPs. Detailed SAPs are important to safeguard teams from changing their analyses after seeing the data (Finfer & Bellomo 2009). Such changes bias results and reduce reliability and validity.

Even with a well-written SAP, difficulties may arise. For instance, interpretation of the SAP for adjudication of complex potential events may be difficult. In one example, a single complex case in a clinical trial required an extensive investigation to interpret the statistical analysis plan properly (Gibson et al. 2017). In fact, the status of the single patient determined whether or not the p-value fell below the a priori defined significance level! When possible, SAPs can include strategies for dealing with missing or ambiguous data, such as sensitivity analyses and multiple imputation.

Another area which may necessitate careful statistical consideration is the presence of excessive missing data. While a full review of missing data is outside of the scope of this paper, readers may consult Little & Rubin (2014). Briefly, analysis with missing data requires consideration of potential relationships between missing observations and the variables in the study. Standard methods can be complicated if missing data arises in non-random ways.
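The simulation below (entirely synthetic data, not from the studies discussed here) illustrates why the missingness mechanism matters: complete-case estimates remain roughly unbiased when observations are missing completely at random, but they are biased when missingness depends on the unobserved values themselves.

```python
# Minimal sketch (simulated data): complete-case means under two missingness
# mechanisms, missing completely at random (MCAR) and missing not at random (MNAR).
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=50, scale=10, size=100_000)  # true mean is 50

# MCAR: every observation is equally likely to be missing.
mcar_observed = y[rng.random(y.size) > 0.3]

# MNAR: larger values are more likely to be missing.
p_missing = 1 / (1 + np.exp(-(y - 50) / 5))
mnar_observed = y[rng.random(y.size) > p_missing]

print("True mean: 50.0")
print(f"Complete-case mean under MCAR: {mcar_observed.mean():.1f}")
print(f"Complete-case mean under MNAR: {mnar_observed.mean():.1f}")
```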
As an illustrative example, the gold standard diagnosis of temporomandibular disorders (TMD) requires an invasive examination by an expert dentist (Dworkin 2010); missing data arises when subjects who are suspected to be new cases fail to complete the examination (Slade et al. 2013). The research team showed that the proportion of missing data was large, the assumptions of standard methods were violated, and new methods were required for modeling in the presence of the missing data (Brownstein et al. 2015, Bair et al. 2013). Recommendations, detailed further in section 6, include experimental protocols to check for and minimize missing data during data collection, and a priori plans (including existing methods or development of new methods) to handle analyses with missing data.

6 Recommendations from the Literature
Commentary on the role of statistical inference in the reproducibility crisis includes calls to action for members of the scientific community, including educational reform, improved standards for publication and funding, and incentives for scientists. Section 6.1.1 details recommendations for statistics education, especially regarding teaching statistical inference to students not planning to become statisticians. Section 6.1.2 focuses on how to train statisticians with an eye toward improving reproducibility. Sections 6.2, 6.3, and 6.4 synthesize recommendations for stakeholders to facilitate better research practice for all fields.
6.1 Education

The need for better statistical training in many fields has been discussed (Ogino & Nishihara 2016, Sørensen & Rothman 2010, Peng 2015). Other fields are recognizing the importance of detail at each step of the research process (Shipworth & Huebner 2018). Indeed, Crane & Martin (2018) claim that statistical training can "[empower] scientists to make sound judgements." Educational programs are discussed for scientists inside and outside of statistics.
6.1.1 Statistical Training for Non-Statisticians

Despite the tendency for misuse among practitioners (Colquhoun 2017, Gardenier & Resnik 2002) and even statisticians (McShane & Gal 2017), basic statistical inference procedures and the accompanying p-values remain ubiquitous. One facilitator of the presence of significance tests is their central role in introductory courses and ease of implementation in statistical software (Searle 1989, Dallal 1990). In fact, the "cook-book" approach to statistical analysis commonly taught in introductory courses (Gigerenzer 2004) tempts researchers to apply basic procedures without investigating model diagnostics (Steel et al. 2013) or considering the appropriateness of the procedures for particular applications.

Crane & Martin (2018) and Brown & Kass (2009) stress that students in statistics courses should learn how to think critically. Furthermore, "statistical practice is complex, relying on nuanced principles honed through years of experience" (Crane & Martin 2018). In writing their guidelines for weed researchers, Onofri et al. (2010) make the following statement: "We would like to reinforce the idea that statistical methods are not a set of recipes whose mindless application is required by convention; each experiment or study may involve subtleties that these guidelines cannot cover."

Paralleling the discussion for research, curriculum reform discussions are ongoing (Gould, Wild, Baglin, McNamara, Ridgway & McConway 2018, Park 2018, Gould, Peng, Kreuter, Pruim, Witmer & Cobb 2018). Example remedies for teaching students to use better judgment include increased focus on effect sizes and confidence intervals (Calin-Jageman 2017, Fritz et al. 2012), greater attention to the Bayesian paradigm (Page & Satake 2017), and practical demonstrations (Park 2018, Mitchell 2018).

Beyond the classroom, recommendations for biomedical scientists include programs on experimental design for both students and mentors, emphasis on methodology in journal clubs, developing continuing education materials as new methods arise, and increasing training for peer reviewers (Casadevall & Fang 2016). It is especially important to better train peer reviewers in all fields to enforce statistical standards in published literature (Peng 2015). Scientists can pursue additional formal statistical training while in graduate school or postdoctoral programs through funded mechanisms, such as NIH T32 training grants. However, the implementation of these specialized programs requires care to prevent trainees from simply learning a small number of statistical tools and more confidently using them even when the tools are inappropriate for future problems (Gelfond et al. 2011). Bell et al. (2013) recommend both improved statistical training for content experts and continued collaboration with biostatisticians throughout the research process.

6.1.2 Training for Statisticians

Statisticians serve as collaborators in a wide variety of fields, some of which strongly recommend or even require a statistician on the team (Obremskey & Archer 2011, Bell et al. 2013). Consequently, proper training for statisticians and other quantitative scientists is paramount. Recognizing the widespread need for biostatistics in research and health practice as early as the 1960s, the National Institutes of Health (NIH) created biostatistics training programs (Hemphill 1961). A more modern overview of training opportunities for biostatisticians is included in Kennedy et al.
(2007). Often working in fields outside of their primary educational training, statisticians should strive to continually learn about the subject-matter problems at hand (Brown & Kass 2009). Brown & Kass (2009) encourage graduate students in statistics to pursue a second program of study in joint or separate programs focused on the subject matter of their future research. Training opportunities, such as NIH K25 awards, are available for this purpose, as described in Pickering et al. (2015). Formal training in a second field may not be feasible for all statisticians. However, applied practitioners could instead be encouraged to concentrate their collaborations in a small number of areas where they can slowly gain deeper familiarity with the scientific content, rather than to consult in a large number of disciplines where such depth is infeasible. In fact, this recommendation is often communicated in informal settings to junior faculty members with foci in applied statistics.

In addition, statisticians should hone their soft skills, especially listening, communication, and leadership skills (Gibson et al. 2017, Califf 2016). While listening and communication are obviously essential for collaboration, leadership training can provide statisticians with the tools and confidence to continually and actively shape the statistical validity of a project from its inception. In her March 2018 president's corner article in Amstat News, LaVange (2018) reiterated the importance of leadership in biostatistics and announced the ASA's vision for a new leadership initiative. As an example of the addition of soft skills to the curriculum, the biostatistics department at the University of North Carolina at Chapel Hill previously offered a course in leadership (LaVange et al. 2012).
A final idea about educational interventions is to study the research process scientifically and use the findings to improve training. Leek et al. (2017) summarized the state of affairs with a few apt quotes:

"The root problem is that we know very little about how people analyse and process information... We need to appreciate that data analysis is not purely computational and algorithmic — it is a human behaviour... We need more observational studies and randomized trials — more epidemiology on how people collect, manipulate, analyse, communicate and consume data. We can then use this evidence to improve training programmes for researchers and the public."

To this end, in addition to actions described in section 6.2, funders could seek studies of scientific judgment with components to develop training materials based on the findings.
6.2 Funding Agencies

Because funding is often necessary for research and is highly valued or even required for promotions, funding agencies should take an active role in defining standards. To this end, the NIH defined guidelines for increased rigor, transparency, and reproducibility (Hewitt et al. 2017, Collins & Tabak 2014). The NSF followed with guidelines to encourage data sharing and citation (Stan Ahalt et al. 2015). The ASA defined its own recommended actions for funding agencies to improve reproducibility, including funding mechanisms for training in reproducible research methods, development of reproducible software, and replication studies (Broman et al. 2017). Actions by funding agencies should catalyze more widespread adoption of the recommendations for scientific rigor.
6.3 Journals and Publication Standards

Changes in publication standards percolate to the scientific practice of authors, as opined by Asher (1993) over two decades ago. In fact, this special TAS issue stems from high profile journal articles and rules. Namely, after Nuzzo (2014) bemoaned the ritual, often thoughtless, use of p-values, the editors of Basic and Applied Social Psychology banned p-values in manuscripts submitted thereafter (Trafimow & Marks 2015, Trafimow 2014). Recently, in BMC Medical Research Methodology, hypothesis testing has been "discouraged" (Hanin 2017). This section considers journal guidelines and their implications.

A recent set of journal guidelines for more careful publication practices (McNutt 2014) has been adopted by hundreds of journals thus far (Hewitt et al. 2017). Similarly, BMC Medical Research Methodology (Hanin 2017) advises authors that "Health care decisions [should be] based on... a combination of statistical and biomedical evidence." Sørensen & Rothman (2010) propose that statistical training for journal editors could enable editors to better enforce statistical standards for publication in the journals that they oversee.

Journal guidelines include calls for transparency, including sharing of data and code, such as with pre-submission check-lists to ensure key aspects are considered (McNutt 2014). Other key aspects include reporting and justification of sample size calculations, randomization and blinding procedures, and inclusion and exclusion criteria (Asendorpf et al. 2013). Finally, emphasis can shift from inference to parameter estimation (Asendorpf et al. 2013, McNutt 2014). These guidelines encourage authors to confront, acknowledge, and justify the scientific judgments used in their studies and enable others to examine or reproduce their findings. The recommendations will likely improve rigor throughout science. Yet, the implications of the guidelines of some journals (Hanin 2017, Trafimow 2014) to avoid sharp thresholds for statistical inference are unclear, and they underlie a large portion of this special issue.
Because statistics requires choices throughout the process, statistics has elements of subjectivity and objectivity, and Gelman & Hennig (2017) argue that the discussion of relative subjectivity and objectivity distracts from solutions for best research practices, including transparency. Instead, paralleling the aim of journals to strive for transparency, Gelman & Hennig (2017) detail principles for which to strive during analysis, including, among others, acknowledgment and investigation of multiple perspectives, impartiality in decision making, and transparency in reporting. In brief, they argue that statistical judgment is inevitable, and therefore, those judgments should be detailed and shared for future evaluation. Gibson (2017) stresses that statisticians should communicate the "relative advantages" and disadvantages of each choice.

It has been shown that not only are descriptions in methods sections insufficiently detailed, but also that the quality of the reporting, methodology, and analysis is not necessarily associated with increased visibility (Nieminen et al. 2006). In addition, poor descriptions or unwillingness to share data may be associated with author concern over the robustness of results, especially for borderline results (Wicherts et al. 2011). This may be related to the need discussed in section 6.1 to increase statistical literacy broadly across other fields and professions. Statisticians can serve key roles as co-investigators by co-writing methods sections with sufficient detail for reproducibility, as peer reviewers who are specially trained to look for detail in reporting of study design, methods, and results, and as mentors and teachers who help others improve the completeness of the writing and peer review skills.

The ASA statement urges that "Proper inference requires full reporting and transparency." Others similarly recommend increased transparency of study design and analysis (Gelman & Hennig 2017, Greenland et al. 2016, Peng 2015, McNutt 2014, Asendorpf et al. 2013), which would allow readers of the literature to evaluate findings in light of the strength of the methodology. There are even journals dedicated to reporting raw data (Data in Brief - Making Your Data Count 2014). However, a priori fixation and reporting of analysis plans stifles discovery (Poole 2010), and transparency alone cannot overcome poor data quality (Gelman 2017).
6.4 Coordination Among Stakeholders

Clearly, changes to scientific and statistical standards will require coordination and buy-in from multiple groups concurrently to ensure that changes improve scientific practice as a whole (Ioannidis 2014, Sørensen & Rothman 2010). Asendorpf et al. (2013) include an excellent summary of recommended changes for these groups. Importantly, Smaldino & McElreath (2016) point out that incentives of various groups may conflict, such as funding agencies requesting publications from their grantees but not necessarily checking the publication quality. As another example of potential misalignment of incentives, a seemingly simple solution advocated in Asendorpf et al. (2013) to increase sample sizes across science would improve reproducibility by improving power overall and would likely subsequently increase the proportion of reported discoveries that are true. On the other hand, as the cost of each additional study participant may be large, increasing sample sizes may pose financial difficulties for already strained funding bodies. If total budgets were not increased, then requiring studies to be larger (and thus more expensive) could result in fewer projects funded. Such a consequence would negatively impact individual researchers, who would face increased competition for comparatively fewer grants. Additional discussion of guidelines and their implications for various groups is found in Williams et al. (2018).
7 Conclusion
There is broad consensus that rigorous statistical practice and teaching require a mixture of technical knowledge, communication skills, and the ability to interpret data. The present paper highlights the less frequently discussed presence of choice and judgment in scientific and statistical practice. The universe of potential methods for analysis is large, and deciding among the options is challenging, even for experts (Gelman 2014). Properly synthesizing expert judgment along with quantitative tools is critical for high quality evidence-based research. Failure to appropriately capitalize on statistical and scientific judgment could further erode public trust in science (Spiegelhalter 2017, Saltelli & Funtowicz 2017) and policy (Sutherland et al. 2015, Weinberg & Elliott 2012). More importantly, scientists have an ethical obligation to conduct valid statistical analyses, which eventually have broader societal impacts (Shmueli 2017, Zook et al. 2017, Gelfond et al. 2011).

Rather than argue for a single solution to the problem of improving quantitative practice, this paper collects in a single document many of the thoughts on expert judgment and recommendations in statistics. (In a related piece submitted to this special issue, a strong statement on expert judgment in science with example applications by participants in the SSI is provided in Brownstein et al. 2018.) The present article gives the reader a starting point to understand the vast literature related to expert judgment in statistics. Viewpoints are highly diverse, indicating that there is much more work to be done toward developing concise, unified recommendations for improved methods. The author intends for the present paper to facilitate ongoing discussions of expert judgment and recommendations to improve statistical practice in the twenty-first century and beyond.
References
Adams-Huet, B. & Ahn, C. (2009), ‘Bridging clinical investigators and statisticians’,
Jour-nal of Investigative Medicine (8), 818–824.Alosh, M. & Huque, M. F. (2009), ‘A flexible strategy for testing subgroups and overallpopulation’, Statistics in Medicine (1), 3–23.20sendorpf, J. B., Conner, M., De Fruyt, F., De Houwer, J., Denissen, J. J. A., Fiedler, K.,Fiedler, S., Funder, D. C., Kliegl, R., Nosek, B. A., Perugini, M., Roberts, B. W.,Schmitt, M., van Aken, M. A. G., Weber, H. & Wicherts, J. M. (2013), ‘Recom-mendations for increasing replicability in psychology’, European Journal of Personality (2), 108–119.Asher, W. (1993), ‘The role of statistics in research’, The Journal of Experimental Education (4), 388–393.Bair, E., Brownstein, N. C., Ohrbach, R., Greenspan, J. D., Dubner, R., Fillingim, R. B.,Maixner, W., Smith, S. B., Diatchenko, L., Gonzalez, Y. et al. (2013), ‘Study protocol,sample characteristics, and loss to follow-up: The oppera prospective cohort study’, TheJournal of Pain (12), T2–T19.Bell, M. L., Olivier, J. & King, M. T. (2013), ‘Scientific rigour in psycho-oncology trials:why and how to avoid common statistical errors’, Psycho-Oncology (3), 499–505.Bell, S. A. & Smith, C. T. (2014), ‘A comparison of interventional clinical trials in rareversus non-rare diseases: an analysis of clinicaltrials. gov’, Orphanet journal of rarediseases (1), 170.Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E.-J., Berk,R., Bollen, K. A., Brembs, B., Brown, L., Camerer, C. et al. (2017), ‘Redefine statisticalsignificance’, Nature Human Behaviour p. 1.Benjamini, Y. & Hochberg, Y. (1995), ‘Controlling the false discovery rate: a practical andpowerful approach to multiple testing’,
Journal of the royal statistical society. Series B(Methodological) pp. 289–300.Berger, J. O. & Bernardo, J. M. (1992), ‘On the development of the reference prior method’,
Bayesian statistics , 35–60.Berger, J. O., Bernardo, J. M., Sun, D. et al. (2009), ‘The formal definition of referencepriors’, The Annals of Statistics (2), 905–938.21erger, J. O. & Berry, D. A. (1988), ‘Statistical analysis and the illusion of objectivity’, American Scientist (2), 159–165.Bernardo, J. M. (1979), ‘Reference posterior distributions for bayesian inference’, Journalof the Royal Statistical Society. Series B (Methodological) pp. 113–147.Bertolaso, M. & Sterpetti, F. (2017), ‘Evidence amalgamation, plausibility, and cancerresearch’,
Synthese pp. 1–39.Billingham, L., Malottki, K. & Steven, N. (2016), ‘Research methods to change clinicalpractice for patients with rare cancers’,
The Lancet Oncology (2), e70 – e80.Braun, M. M., Farag-El-Massah, S., Xu, K. & Coté, T. R. (2010), ‘Emergence of orphandrugs in the united states: a quantitative assessment of the first 25 years’, Nature ReviewsDrug Discovery (7), 519.Broman, K., Cetinkaya-Rundel, M., Nussbaum, A., Paciorek, C., Peng, R., Turek, D. &Wickham, H. (2017), Recommendations to funding agencies for supporting reproducibleresearch, in ‘American Statistical Association’, Vol. 2.Brown, E. N. & Kass, R. E. (2009), ‘What is statistics?’, The American Statistician (2), 105–110.Brownstein, N. C., Cai, J., Slade, G. D. & Bair, E. (2015), ‘Parameter estimation in Coxmodels with missing failure indicators and the OPPERA study’, Statistics in Medicine pp. 3984–3996.Brownstein, N. C., Louis, T. A., O’Hagan, A. & Pendergast, J. (2018), ‘The role of expertjudgement in statistical inference and evidence-based decision-making’,
The AmericanStatistician p. In press.Byun, K. & Croucher, J. S. (2018), ‘Teaching statistics through the law’,
Teaching Statistics pp. n/a–n/a. TST-Nov-17-1005.R2.Califf, R. M. (2016), ‘Pragmatic clinical trials: Emerging challenges and new roles forstatisticians’,
Clinical Trials (5), 471–477.22alin-Jageman, R. J. (2017), ‘After p values: The new statistics for undergraduate neuro-science education’, Journal of Undergraduate Neuroscience Education (1), E1.Candlish, J., Pate, A., Sperrin, M. & van Staa, T. (2017), ‘Evaluation of biases presentin the cohort multiple randomised controlled trial design: a simulation study’, BMCMedical Research Methodology (1), 17.Carlin, B. P. & Louis, T. A. (2008), Bayesian methods for data analysis , CRC Press.Carrier, M. (2010), ‘Scientific knowledge and scientific expertise: Epistemic and socialconditions of their trustworthiness’,
Analyse & Kritik (2), 195–212.Casadevall, A. & Fang, F. C. (2016), ‘Rigorous science: a how-to guide’, mBio (6).Christensen, R. (2005), ‘Testing fisher, neyman, pearson, and bayes’, The American Statis-tician (2), 121–126.Cohen, H. W. (2011), ‘P values: Use and misuse in medical literature’, American Journalof Hypertension (1), 18–23.Collins, F. S. & Tabak, L. A. (2014), ‘Nih plans to enhance reproducibility’, Nature (7485), 612.Colquhoun, D. (2017), ‘The reproducibility of research and the misinterpretation of p-values’,
Royal Society Open Science (12).Crane, H. & Martin, R. (2018), ‘Is statistics meeting the needs of science?’.Cumming, G. (2014), ‘The new statistics: Why and how’, Psychological Science (1), 7–29. PMID: 24220629.D’Agostino Sr, R. B., Massaro, J. M. & Sullivan, L. M. (2003), ‘Non-inferiority trials:design concepts and issues–the encounters of academic consultants in statistics’, Statisticsin medicine (2), 169–186.Daita, G. S. & Ghosh, J. K. (1995), ‘On priors providing frequentist validity for bayesianinference’, Biometrika (1), 37–45. 23allal, G. E. (1990), ‘Statistical computing packages: dare we abandon their teaching toothers?’, The American Statistician (4), 265–266. Data in Brief - Making Your Data Count (2014),
Data in Brief , 5 – 6.De Battisti, F., Ferrara, A. & Salini, S. (2015), ‘A decade of research in statistics: a topicmodel approach’, Scientometrics (2), 413–433.de Valpine, P. (2014), ‘The common sense of p values’,
Ecology (3), 617–621.DeGroot, M. H. (2005), Optimal Statistical Decisions , John Wiley & Sons, Inc., chapterConjugate Prior Distributions, pp. 155–189.Dworkin, S. F. (2010), ‘Research diagnostic criteria for temporomandibular disorders: cur-rent status & future relevance1’,
Journal of Oral Rehabilitation (10), 734–743.Ellis, P. D. (2010), The essential guide to effect sizes: Statistical power, meta-analysis, andthe interpretation of research results , Cambridge University Press.Finfer, S. & Bellomo, R. (2009), ‘Why publish statistical analysis plans’,
Crit Care Resusc (1), 5–6.Fischer, A. & Ghelardi, G. (2016), ‘The precautionary principle, evidence-based medicine,and decision theory in public health evaluation’, Frontiers in Public Health , 107.Fisher, R. (1955), ‘Statistical methods and scientific induction’, Journal of the Royal Sta-tistical Society. Series B (Methodological) (1), 69–78. URL:
Fjelland, R. (2016), ‘When laypeople are right and experts are wrong: Lessons from lovecanal’,
HYLE–International Journal for Philosophy of Chemistry , 105–125.Fleming, T. R. (2008), ‘Current issues in non-inferiority trials’, Statistics in medicine (3), 317–332.Fleming, T. R., Odem-Davis, K., Rothmann, M. D. & Li Shen, Y. (2011), ‘Some essen-tial considerations in the design and conduct of non-inferiority trials’, Clinical Trials (4), 432–439. 24olger, R. (1989), ‘Significance tests and the duplicity of binary decisions.’.Francis, G. (2017), ‘Equivalent statistics and data interpretation’, Behavior Research Meth-ods (4), 1524–1538.Fraser, D., Reid, N., Marras, E. & Yi, G. (2010), ‘Default priors for bayesian and frequentistinference’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) (5), 631–654.Fritz, C. O., Morris, P. E. & Richler, J. J. (2012), ‘Effect size estimates: current use,calculations, and interpretation.’, Journal of experimental psychology: General (1), 2.Gardenier, J. & Resnik, D. (2002), ‘The misuse of statistics: Concepts, tools, and a researchagenda’,
Accountability in Research (2), 65–74. PMID: 12625352.Garthwaite, P. H., Kadane, J. B. & O’Hagan, A. (2005), ‘Statistical methods for elicitingprobability distributions’, Journal of the American Statistical Association (470), 680–701.Gelfond, J. A., Heitman, E., Pollock, B. H. & Klugman, C. M. (2011), ‘Principles for theethical analysis of clinical and translational research’,
Statistics in medicine (23), 2785–2792.Gelman, A. (2014), ‘How do we choose our default methods’, Past, Present, and Future ofStatistical Science pp. 293–301.Gelman, A. (2017), ‘Ethics and statistics: Honesty and transparency are not enough’,
CHANCE (1), 37–39.Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A. & Rubin, D. B. (2014), Bayesian data analysis , Vol. 2, CRC press Boca Raton, FL.Gelman, A. & Hennig, C. (2017), ‘Beyond subjective and objective in statistics’,
Journalof the Royal Statistical Society: Series A (Statistics in Society) (4), 967–1033.Gelman, A., Hill, J. & Yajima, M. (2012), ‘Why we (usually) don’t have to worry aboutmultiple comparisons’,
Journal of Research on Educational Effectiveness (2), 189–211.
Gelman, A. & Shalizi, C. R. (2013), ‘Philosophy and the practice of Bayesian statistics’, British Journal of Mathematical and Statistical Psychology (1), 8–38.
Gibson, C. M., Goldhaber, S. Z., Cohen, A. T., Nafee, T., Hernandez, A. F., Hull, R., Korjian, S., Daaboul, Y., Chi, G., Yee, M. et al. (2017), ‘When academic research organizations and clinical research organizations disagree: Processes to minimize discrepancies prior to unblinding of randomized trials’, American Heart Journal, 1–8.
Gibson, E. W. (2017), ‘Leadership in statistics: Increasing our value and visibility’, The American Statistician (ja), 00–00.
Gigerenzer, G. (2004), ‘Mindless statistics’, The Journal of Socio-Economics (5), 587–606. Statistical Significance.
Goldstein, M. et al. (2006), ‘Subjective Bayesian analysis: principles and practice’, Bayesian Analysis (3), 403–420.
Gómez, J. H., Marquina, V. & Gómez, R. (2013), ‘On the performance of Usain Bolt in the 100 m sprint’, European Journal of Physics (5), 1227.
Goodman, S. N. (2016), ‘Aligning statistical and scientific reasoning’, Science (6290), 1180–1181.
Gould, R., Peng, R. D., Kreuter, F., Pruim, R., Witmer, J. & Cobb, G. W. (2018), Challenge to the established curriculum: A collection of reflections, in ‘International Handbook of Research in Statistics Education’, Springer, pp. 415–432.
Gould, R., Wild, C. J., Baglin, J., McNamara, A., Ridgway, J. & McConway, K. (2018), Revolutions in Teaching and Learning Statistics: A Collection of Reflections, Springer International Publishing, Cham, pp. 457–472.
Greenland, S. & Poole, C. (2010-2011), ‘Problems in common interpretations of statistics in scientific articles, expert reports, and testimony’, Jurimetrics, 113.
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N. & Altman, D. G. (2016), ‘Statistical tests, p values, confidence intervals, and power: a guide to misinterpretations’, European Journal of Epidemiology (4), 337–350.
Gribbin, J. & White, M. (2016), Stephen Hawking: a life in science, Pegasus Books.
Grieve, A. P. (2015), ‘How to test hypotheses if you must’, Pharmaceutical Statistics (2), 139–150.
Hanin, L. (2017), ‘Why statistical inference from clinical trials is likely to generate false and irreproducible results’, BMC Medical Research Methodology (1), 127.
Heitjan, D. F. (2017), ‘Commentary on Mason et al.’, Clinical Trials (4), 368–369.
Held, L. & Ott, M. (2018), ‘On p-values and Bayes factors’, Annual Review of Statistics and Its Application (1).
Hemphill, E. C. (1961), ‘The probable impact of the NIH training grant program on the future supply of biostatisticians’, American Journal of Public Health and the Nation’s Health (12), 1775–1779.
Hewitt, J. A., Brown, L. L., Murphy, S. J., Grieder, F. & Silberberg, S. D. (2017), ‘Accelerating biomedical discoveries through rigor and transparency’, ILAR Journal (1), 115–128.
Ioannidis, J. P. (2005), ‘Why most published research findings are false’, PLoS Medicine (8), e124.
Ioannidis, J. P. (2014), ‘How to make more published research true’, PLoS Medicine (10), e1001747.
Ionides, E. L., Giessing, A., Ritov, Y. & Page, S. E. (2017), ‘Response to the ASA’s statement on p-values: Context, process, and purpose’, The American Statistician (1), 88–89.
Johnson, S. R., Tomlinson, G. A., Hawker, G. A., Granton, J. T. & Feldman, B. M. (2010), ‘Methods to elicit beliefs for Bayesian priors: a systematic review’, Journal of Clinical Epidemiology (4), 355–369.
Johnson, V. E. (2013), ‘Revised standards for statistical evidence’, Proceedings of the National Academy of Sciences (48), 19313–19317.
Johnson, V. E., Payne, R. D., Wang, T., Asher, A. & Mandal, S. (2017), ‘On the reproducibility of psychological science’, Journal of the American Statistical Association (517), 1–10.
Kaptchuk, T. J. (2003), ‘Effect of interpretive bias on research evidence’, BMJ (7404), 1453–1455.
Kemp, D. B. (2016), ‘Optimizing significance testing of astronomical forcing in cyclostratigraphy’, Paleoceanography (12), 1516–1531.
Kennedy, R. E., Yeatts, S. D., Archer, K. J., Gennings, C. & Ramakrishnan, V. (2007), ‘Opportunities for biostatistics students: Training and fellowship grants from the National Institutes of Health’, The American Statistician (2), 120–126.
Kesselheim, A., Myers, J. & Avorn, J. (2011), ‘Characteristics of clinical trials to support approval of orphan vs nonorphan drugs for cancer’, JAMA (22), 2320–2326.
Krueger, T., Page, T., Hubacek, K., Smith, L. & Hiscock, K. (2012), ‘The role of expert opinion in environmental modelling’, Environmental Modelling & Software, 4–18.
Kuhnert, P. M., Martin, T. G. & Griffiths, S. P. (2010), ‘A guide to eliciting and using expert knowledge in Bayesian ecological models’, Ecology Letters (7), 900–914.
Kynn, M. (2008), ‘The ‘heuristics and biases’ bias in expert elicitation’, Journal of the Royal Statistical Society: Series A (Statistics in Society) (1), 239–264.
Lakens, D. (2013), ‘Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs’, Frontiers in Psychology, 863.
Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A., Argamon, S. E., Baguley, T., Becker, R. B., Benning, S. D., Bradford, D. E. et al. (2018), ‘Justify your alpha’, Nature Human Behaviour (3), 168.
LaVange, L. (2018), ‘Building a leadership institute (from the ground up)’, Amstat News, 3–4.
LaVange, L., Sollecito, W., Steffen, D., Evarts, L. & Kosorok, M. (2012), ‘Preparing biostatisticians for leadership opportunities’, Amstat News, 5–6.
Leek, J., McShane, B. B., Gelman, A., Colquhoun, D., Nuijten, M. B. & Goodman, S. N. (2017), ‘Five ways to fix statistics’, Nature (7682), 557–559.
Li, G., Taljaard, M., Van den Heuvel, E. R., Levine, M. A., Cook, D. J., Wells, G. A., Devereaux, P. J. & Thabane, L. (2017), ‘An introduction to multiplicity issues in clinical trials: the what, why, when and how’, International Journal of Epidemiology (2), 746–755.
Liptak, A. (2011), ‘Supreme Court rules against Zicam maker’, New York Times.
Little, R. (2011), ‘Calibrated Bayes, for statistics in general, and missing data in particular’, Statistical Science (2), 162–174.
Little, R. J. & Rubin, D. B. (2014), Statistical analysis with missing data, Vol. 333, John Wiley & Sons.
Macnaughton, D. B. (n.d.), ‘The p-value is best to detect effects’. [Online; accessed 13-July-2018].
URL: https://matstat.com/macnaughton2017b.pdf
Marino, M. J. (2017), ‘How often should we expect to be wrong? Statistical power, p-values, and the expected prevalence of false discoveries’, Biochemical Pharmacology.
Martel, M., Hernández, M. Á. N. & Polo, F. J. V. (2009), ‘Eliciting expert opinion for cost-effectiveness analysis: a flexible family of prior distributions’, SORT-Statistics and Operations Research Transactions (2), 193–212.
Mason, A. J., Gomes, M., Grieve, R., Ulug, P., Powell, J. T. & Carpenter, J. (2017), ‘Development of a practical approach to expert elicitation for randomised controlled trials with missing health outcomes: application to the IMPROVE trial’, Clinical Trials (4), 357–367.
Matthews, R., Wasserstein, R. & Spiegelhalter, D. (2017), ‘The ASA’s p-value statement, one year on’, Significance (2), 38–41.
Mauri, L. & D’Agostino Sr, R. B. (2017), ‘Challenges in the design and interpretation of noninferiority trials’, New England Journal of Medicine (14), 1357–1367.
McNutt, M. (2014), ‘Journals unite for reproducibility’, Science (6210), 679–679.
McShane, B. B. & Gal, D. (2017), ‘Statistical significance and the dichotomization of evidence’, Journal of the American Statistical Association (519), 885–895.
Merriam-Webster Online Dictionary (2018a), ‘Statistician’. [Online; accessed 11-July-2018].
Merriam-Webster Online Dictionary (2018b), ‘Statistics’. [Online; accessed 11-July-2018].
Miller, J. & Ulrich, R. (2016), ‘Optimizing research payoff’, Perspectives on Psychological Science (5), 664–691. PMID: 27694463.
Mitchell, P. (2018), ‘Teaching statistical appreciation in quantitative methods’, MSOR Connections (2), 37–42.
Morey, R. D., Romeijn, J.-W. & Rouder, J. N. (2016), ‘The philosophy of Bayes factors and the quantification of statistical evidence’, Journal of Mathematical Psychology, 6–18.
Morgan, M. G. (2014), ‘Use (and abuse) of expert elicitation in support of decision making for public policy’, Proceedings of the National Academy of Sciences (20), 7176–7184.
Motulsky, H. (2014), Intuitive biostatistics: a nonmathematical guide to statistical thinking, Oxford University Press, USA.
Mudge, J. F., Baker, L. F., Edge, C. B. & Houlahan, J. E. (2012), ‘Setting an optimal α that minimizes errors in null hypothesis significance tests’, PLOS ONE (2), 1–7.
Mudge, J. F., Martyniuk, C. J. & Houlahan, J. E. (2017), ‘Optimal alpha reduces error rates in gene expression studies: a meta-analysis approach’, BMC Bioinformatics (1), 312.
Murphy, K. R., Myors, B. & Wolach, A. (2014), Statistical power analysis: A simple and general model for traditional and modern hypothesis tests, Routledge.
Murtaugh, P. A. (2014), ‘In defense of p values’, Ecology (3), 611–617.
Neyman, J. & Pearson, E. S. (1928), ‘On the use and interpretation of certain test criteria for purposes of statistical inference: Part I’, Biometrika, pp. 175–240.
Nickerson, R. S. (2000), ‘Null hypothesis significance testing: a review of an old and continuing controversy’, Psychological Methods (2), 241.
Nieminen, P., Carpenter, J., Rucker, G. & Schumacher, M. (2006), ‘The relationship between quality of research and citation frequency’, BMC Medical Research Methodology (1), 42.
Nuzzo, R. (2014), ‘Scientific method: statistical errors’, Nature News (7487), 150.
Obremskey, W. T. & Archer, K. R. (2011), ‘Getting started in an academic setting’, Journal of Orthopaedic Trauma, S124–S127.
Ogino, S. & Nishihara, R. (2016), ‘All biomedical and health science researchers, including laboratory physicians and scientists, need adequate education and training in study design and statistics’, Clinical Chemistry (7), 1039–1040.
O’Hagan, A. (2012), ‘Probabilistic uncertainty specification: Overview, elaboration techniques and their application to a mechanistic model of carbon flux’, Environmental Modelling & Software, 35–48.
O’Hagan, A. (2018), ‘Expert knowledge elicitation: Subjective, but scientific’, The American Statistician, in press.
O’Hagan, A., Buck, C. E., Daneshkhah, A., Eiser, J. R., Garthwaite, P. H., Jenkinson, D. J., Oakley, J. E. & Rakow, T. (2006), Uncertain judgements: eliciting experts’ probabilities, John Wiley & Sons.
Onofri, A., Carbonell, E. A., Piepho, H.-P., Mortimer, A. M. & Cousens, R. D. (2010), ‘Current statistical issues in weed research’, Weed Research (1), 5–24.
Orfali, M., Feldman, L., Bhattacharjee, V., Harkins, P., Kadam, S., Lo, C., Ravi, M., Shringarpure, D., Mardekian, J., Cassino, C. et al. (2012), ‘Raising orphans: how clinical development programs of drugs for rare and common diseases are different’, Clinical Pharmacology & Therapeutics (2), 262–264.
Page, R. & Satake, E. (2017), ‘Beyond p values and hypothesis testing: Using the minimum Bayes factor to teach statistical inference in undergraduate introductory statistics courses’, Journal of Education and Learning (4), 254.
Park, R. (2018), ‘Practical teaching strategies for hypothesis testing’, The American Statistician (ja), 0–0.
Parmar, M. K. B., Sydes, M. R. & Morris, T. P. (2016), ‘How do you design randomised trials for smaller populations? A framework’, BMC Medicine (1), 183.
Paulden, M., Stafinski, T., Menon, D. & McCabe, C. (2015), ‘Value-based reimbursement decisions for orphan drugs: a scoping review and decision framework’, Pharmacoeconomics (3), 255–269.
Peng, R. (2015), ‘The reproducibility crisis in science: A statistical counterattack’, Significance (3), 30–32.
Pickering, C. R., Bast, R. C. & Keyomarsi, K. (2015), ‘How will we recruit, train, and retain physicians and scientists to conduct translational cancer research?’, Cancer (6), 806–816.
Pittenger, D. J. (2001), ‘Hypothesis testing as a moral choice’, Ethics & Behavior (2), 151–162.
Poole, C. (2010), ‘A vision of accessible epidemiology’, Epidemiology (5), 616–618.
Proschan, M. A. & Waclawiw, M. A. (2000), ‘Practical guidelines for multiplicity adjustment in clinical trials’, Contemporary Clinical Trials (6), 527–539.
Rehal, S., Morris, T. P., Fielding, K., Carpenter, J. R. & Phillips, P. P. J. (2016), ‘Non-inferiority trials: are they inferior? A systematic review of reporting in major medical journals’, BMJ Open (10). URL: https://bmjopen.bmj.com/content/6/10/e012594
Saltelli, A. & Funtowicz, S. (2017), ‘What is science’s crisis really about?’, Futures, 5–11. Post-Normal science in practice.
Schell, M. J. (2010), ‘Identifying key statistical papers from 1985 to 2002 using citation data for applied biostatisticians’, The American Statistician (4), 310–317.
Searle, S. R. (1989), ‘Statistical computing packages: Some words of caution’, The American Statistician (4), 189–190.
Shipworth, D. & Huebner, G. M. (2018), Designing research, in ‘Exploring Occupant Behavior in Buildings’, Springer, pp. 39–76.
Shmueli, G. (2017), ‘Research dilemmas with behavioral big data’, Big Data (2), 98–119.
Slade, G. D., Fillingim, R. B., Sanders, A. E., Bair, E., Greenspan, J. D., Ohrbach, R., Dubner, R., Diatchenko, L., Smith, S. B., Knott, C. et al. (2013), ‘Summary of findings from the OPPERA prospective cohort study of incidence of first-onset temporomandibular disorder: implications and future directions’, The Journal of Pain (12), T116–T124.
Smaldino, P. E. & McElreath, R. (2016), ‘The natural selection of bad science’, Royal Society Open Science (9), 160384.
Sørensen, H. T. & Rothman, K. J. (2010), ‘The prognosis for research’.
Spiegelhalter, D. (2017), ‘Trust in numbers’, Journal of the Royal Statistical Society: Series A (Statistics in Society) (4), 948–965.
Sprenger, J. (2018), ‘The objectivity of subjective Bayesianism’, European Journal for Philosophy of Science, pp. 1–20.
Spurlock, D. (2017), ‘Beyond p < .05: Toward a Nightingalean perspective on statistical significance for nursing education researchers’, Journal of Nursing Education (8), 453–455.
Stan Ahalt, R., Couch, A., Ibanez, L. & Ray Idaszak, R. (2015), ‘NSF workshop on supporting scientific discovery through norms and practices for software and data citation and attribution’.
Steel, E. A., Kennedy, M. C., Cunningham, P. G. & Stanovick, J. S. (2013), ‘Applied statistics in ecology: common pitfalls and simple solutions’, Ecosphere (9), 1–13.
Sutherland, W. J., Burgman, M. A. et al. (2015), ‘Use experts wisely’, Nature (7573), 317–318.
Syversveen, A. R. (1998), ‘Non-informative Bayesian priors: Interpretation and problems with construction and applications’.
Tibshirani, R. (1989), ‘Noninformative priors for one parameter of many’, Biometrika (3), 604–608.
Trafimow, D. (2014), ‘Editorial’, Basic and Applied Social Psychology (1), 1–2.
Trafimow, D. & Marks, M. (2015), ‘Editorial’, Basic and Applied Social Psychology (1), 1–2.
Turner, R. M., Spiegelhalter, D. J., Smith, G. & Thompson, S. G. (2009), ‘Bias modelling in evidence synthesis’, Journal of the Royal Statistical Society: Series A (Statistics in Society) (1), 21–47.
Vasilevsky, N. A., Minnier, J., Haendel, M. A. & Champieux, R. E. (2017), ‘Reproducible and reusable research: are journal data sharing policies meeting the mark?’.
Wang, H.-R. (2014), ‘“Publish or perish”: Should this still be true for your data?’, Data in Brief, 85–86.
Wason, J. M. S., Stecher, L. & Mander, A. P. (2014), ‘Correcting for multiple-testing in multi-arm trials: is it necessary and is it done?’, Trials (1), 364.
Wasserstein, R. L. & Lazar, N. A. (2016), ‘The ASA’s statement on p-values: Context, process, and purpose’, The American Statistician (2), 129–133.
Weinberg, J. & Elliott, K. C. (2012), ‘Science, expertise, and democracy’, Kennedy Institute of Ethics Journal (2), 83–90.
Weinberger, D. (2009), ‘Transparency: The new objectivity’, KM World, Trend-Setting Products 2009.
Weinstein, B. D. (1993), ‘What is an expert?’, Theoretical Medicine (1), 57–73.
Wicherts, J. M., Bakker, M. & Molenaar, D. (2011), ‘Willingness to share research data is related to the strength of the evidence and the quality of reporting of statistical results’, PLoS ONE (11), e26828.
Williams, M., Mullane, K. & Curtis, M. J. (2018), Chapter 5 - Addressing reproducibility: Peer review, impact factors, checklists, guidelines, and reproducibility initiatives, in M. Williams, M. J. Curtis & K. Mullane, eds, ‘Research in the Biomedical Sciences’, Academic Press, pp. 197–306.
Ziliak, S. T. & McCloskey, D. N. (2009), ‘The cult of statistical significance’.
Zook, M., Barocas, S., Crawford, K., Keller, E., Gangadharan, S. P., Goodman, A., Hollander, R., Koenig, B. A., Metcalf, J., Narayanan, A. et al. (2017), ‘Ten simple rules for responsible big data research’, PLoS Computational Biology (3), e1005399.
Hypothesis testing compares two hypotheses, called the null and the alternative. Two paradigms are presented for comparing these hypotheses using data collected in scientific experiments. Other overviews are found elsewhere, e.g. Greenland & Poole (2010-2011).

8.1 The Frequentist Paradigm
In Frequentist hypothesis testing, the null hypothesis is considered the default, assumed to be true unless the data cast doubt on it. The alternative hypothesis is often the claim that a content expert hopes to establish by casting doubt on the null hypothesis. For example, the developer of a new cancer treatment may compare changes in tumor size for patients randomized to either the new treatment or a placebo. In this case, the null hypothesis is that the average change in tumor size is identical for patients randomized to the two interventions, and the alternative hypothesis is that the change in tumor size differs depending on whether patients were randomized to the new drug or the placebo. Put another way, the drug developer hopes to convince regulatory bodies that the drug is effective by casting doubt on the default theory that the effect of the drug is identical to the effect of a placebo.

If one assumes that the two (mutually exclusive) hypotheses exhaust the reasonable possibilities, then exactly one of the hypotheses should be true and the other should be false. The practitioner either rejects or fails to reject the null hypothesis based on the evidence in the experiment. While the hope is that the conclusion aligns with the (unknown) truth, two errors can take place. The first, called a type I error, occurs when the null hypothesis is true, but the practitioner rejects the null hypothesis. The second, called a type II error, occurs when the alternative hypothesis is true, but the practitioner fails to reject the null hypothesis. Type I errors can be conceptualized as false positives, while type II errors can be considered false negatives. The significance level is the probability of a type I error. Power, which equals one minus the probability of a type II error, is the probability of (correctly) rejecting the null hypothesis when the alternative hypothesis is true.

For a fixed sample size, type I and type II error rates are inversely related (when one increases, the other decreases), and the appropriate balance should be decided carefully. The type I error rate is typically set at a small value to limit false discoveries, which are generally considered worse than missing a true discovery. The sample size for the experiment is then chosen based on both the significance level (e.g. 5%) and a desired level of power (e.g. 80%) under a specified realization of the alternative hypothesis. Alternatively, often in the presence of extremely limited resources, power is calculated based on the fixed significance level and the maximum sample size considered feasible.

The significance level is arbitrary, but 5% has been the most frequently adopted value by tradition (Ziliak & McCloskey 2009). The rationale is that the probability of falsely rejecting the null hypothesis, e.g. concluding that an association of interest is present in the population when in reality no such association exists, needs to be kept small. Otherwise, not only will the original findings mislead the research community, but future research will be designed around the previous (incorrect) findings.

As detailed in Motulsky (2014), one can view the Frequentist hypothesis testing framework as similar to a jury evaluating the evidence in a criminal trial: the defendant is presumed by default not to be guilty of the crime and is deemed guilty only if there is sufficient evidence of guilt beyond a reasonable doubt.
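As a brief aside before continuing the courtroom analogy, the interplay among the significance level, power, and sample size described above can be made concrete with a short calculation. The following is a minimal sketch, not part of the original text, using the standard normal-approximation formula for comparing two means; the effect size, error rates, and resulting sample sizes are purely illustrative.

```python
# Minimal sketch: approximate per-group sample size for a two-sided,
# two-sample comparison of means, using the normal approximation.
# All numerical choices below are illustrative only.
from math import ceil
from scipy.stats import norm

def per_group_n(effect_size, alpha=0.05, power=0.80):
    """Approximate n per arm to detect a standardized difference in means
    (Cohen's d = effect_size) at significance level alpha with the given power."""
    z_alpha = norm.ppf(1 - alpha / 2)  # critical value tied to the type I error rate
    z_beta = norm.ppf(power)           # quantile corresponding to the desired power
    return 2 * ((z_alpha + z_beta) / effect_size) ** 2

# A 5% significance level and 80% power for a moderate effect (d = 0.5)
print(ceil(per_group_n(0.5)))                # about 63 patients per arm

# Demanding a smaller type I error rate, with power held fixed, raises the
# required sample size, illustrating the trade-off discussed above.
print(ceil(per_group_n(0.5, alpha=0.005)))   # about 107 patients per arm
```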
Returning to the courtroom analogy, wrongful convictions are considered worse than failing to convict a person who committed a crime when the evidence is insufficient. Similarly, type I error rates are typically set at lower values (e.g. 5%) than type II error rates (10% or 20%). Additional connections between statistics and law, along with examples for teaching, are provided in Byun & Croucher (2018).

It is important to note that the null and alternative hypotheses serve different functions and are not interchangeable. A common misconception is to interpret a failure to reject the null hypothesis as "accepting the null hypothesis." Critically, p-values do not quantify the evidence in favor of the null hypothesis, analogous to the fact that criminal trials never claim to prove the innocence of the defendant and merely deliver verdicts of "not guilty" rather than convictions (Motulsky 2014). In some cases, the order of the hypotheses may need to be flipped, such as in non-inferiority trials, where the goal is to disprove the notion that a generic compound differs from the original version in favor of the alternative that the two compounds are sufficiently similar. Thus the framework, analysis, and interpretation of non-inferiority studies differ markedly from the standard hypothesis testing framework.

The body of the paper discusses an inverse relationship between type I and type II errors when the sample size is fixed. Exacerbating the trade-off is the fact that most research aims to answer more than one question of interest, and therefore a significant portion of the literature includes more than one hypothesis test. The family-wise error rate, defined as the probability of at least one type I error among a fixed number of tests, is known to be higher than the nominal significance level of any single test, as shown in Li et al. (2017). In fields with a large number of planned hypothesis tests, such as genomics, adjusted methods aim to control the false discovery rate, defined as the proportion of tests for which the null hypothesis is true among all tests for which the null hypothesis is rejected (Benjamini & Hochberg 1995). It is important to note that family-wise error rates and false discovery rates are not equivalent. Calculation of the false discovery rate utilizes Bayes' rule and necessitates a judgment about the proportion of null hypotheses that are true. A small numerical sketch of these multiplicity calculations appears at the end of this subsection.

As described in Section 8.1, the Frequentist paradigm answers questions about the data assuming that one of the competing hypotheses is true. In particular, a p-value quantifies the evidence an experiment produces assuming that the null hypothesis is true.
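The statement that a p-value is computed under the assumption that the null hypothesis is true can be illustrated by simulation. In the following minimal sketch, which is illustrative only and not drawn from the paper, both groups are generated from the same distribution, so the null hypothesis holds and every rejection is a type I error; the rejection rate at the 5% level should be close to the nominal significance level.

```python
# Minimal sketch: behaviour of p-values when the null hypothesis is true.
# Simulated two-sample comparisons where both groups share the same mean.
# Sample sizes and the number of replications are illustrative only.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
alpha = 0.05
pvals = []
for _ in range(10_000):
    a = rng.normal(0.0, 1.0, size=50)  # both groups drawn from the same distribution
    b = rng.normal(0.0, 1.0, size=50)
    pvals.append(ttest_ind(a, b).pvalue)

pvals = np.array(pvals)
# Under the null hypothesis, p-values are approximately uniform, so the
# proportion rejected at the 5% level is close to the nominal type I error rate.
print(f"Proportion rejected at alpha = {alpha}: {np.mean(pvals < alpha):.3f}")
```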
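The multiplicity calculations discussed above can also be sketched in code. The following minimal illustration, not taken from the paper, shows how the family-wise error rate grows with the number of independent tests conducted at the 5% level and applies the Benjamini-Hochberg step-up procedure to a handful of invented p-values.

```python
# Minimal sketch: family-wise error rate for m independent tests and the
# Benjamini-Hochberg (1995) step-up procedure. The p-values below are
# invented purely for illustration.
import numpy as np

alpha = 0.05
for m in (1, 5, 20):
    fwer = 1 - (1 - alpha) ** m  # P(at least one type I error) under independence
    print(f"{m:2d} tests: family-wise error rate = {fwer:.2f}")

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean array marking which null hypotheses are rejected
    while controlling the false discovery rate at level q."""
    pvals = np.asarray(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    sorted_p = pvals[order]
    # Find the largest k with p_(k) <= (k/m) * q; reject the k smallest p-values.
    below = sorted_p <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # index of the largest qualifying p-value
        reject[order[: k + 1]] = True
    return reject

example_p = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
print(benjamini_hochberg(example_p, q=0.05))
```

With these illustrative values, only the two smallest p-values are declared discoveries when the false discovery rate is controlled at 5%, even though four of the six unadjusted p-values fall below 0.05.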