[PDF] A few statistical principles for data science

Abstract

In any other circumstance, it might make sense to define the extent of the terrain (Data Science) first, and then locate and describe the landmarks (Principles). But this data revolution we are experiencing defies a cadastral survey. Areas are continually being annexed into Data Science. For example, biometrics was traditionally statistics for agriculture in all its forms but now, in Data Science, it means the study of characteristics that can be used to identify an individual. Examples of non-intrusive measurements include height, weight, fingerprints, retina scan, voice, photograph/video (facial landmarks and facial expressions), and gait. A multivariate analysis of such data would be a complex project for a statistician, but a software engineer might appear to have no trouble with it at all. In any applied-statistics project, the statistician worries about uncertainty and quantifies it by modelling data as realisations generated from a probability space. Another approach to uncertainty quantification is to find similar data sets, and then use the variability of results between these data sets to capture the uncertainty. Both approaches allow 'error bars' to be put on estimates obtained from the original data set, although the interpretations are different. A third approach, that concentrates on giving a single answer and gives up on uncertainty quantification, could be considered as Data Engineering, although it has staked a claim in the Data Science terrain. This article presents a few (actually nine) statistical principles for data scientists that have helped me, and continue to help me, when I work on complex interdisciplinary projects.

Full PDF

aa r X i v : . [ s t a t . O T ] F e b A few statistical principles for data science

Noel CressieNational Institute for Applied Statistics AustraliaUniversity of WollongongWollongong NSW 2522, Australiaemail: [email protected] February 2021

Abstract

In any other circumstance, it might make sense to deﬁne the extent of the terrain (Data Science)ﬁrst, and then locate and describe the landmarks (Principles). But this data revolution we areexperiencing deﬁes a cadastral survey. Areas are continually being annexed into Data Science. Forexample, biometrics was traditionally statistics for agriculture in all its forms but now, in DataScience, it means the study of characteristics that can be used to identify an individual. Examples ofnon-intrusive measurements include height, weight, ﬁngerprints, retina scan, voice, photograph/video(facial landmarks and facial expressions), and gait. A multivariate analysis of such data would be acomplex project for a statistician, but a software engineer might appear to have no trouble with it atall. In any applied-statistics project, the statistician worries about uncertainty and quantiﬁes it bymodelling data as realisations generated from a probability space. Another approach to uncertaintyquantiﬁcation is to ﬁnd similar data sets, and then use the variability of results between thesedata sets to capture the uncertainty. Both approaches allow ‘error bars’ to be put on estimatesobtained from the original data set, although the interpretations are diﬀerent. A third approach,that concentrates on giving a single answer and gives up on uncertainty quantiﬁcation, could beconsidered as Data Engineering, although it has staked a claim in the Data Science terrain. Thisarticle presents a few (actually nine) statistical principles for data scientists that have helped me,and continue to help me, when I work on complex interdisciplinary projects.

Keywords : Applied statistics; hierarchical statistical models; measurement error; regression; spatialstatistics

Science and Engineering have long been held separate by our universities and learned academies. Broadlyspeaking, Science is concerned with ‘why,’ and Engineering is concerned with ‘how.’ Data Science seemsto include both (Cressie, 2020), and sometimes it may be hard to ﬁnd the science in a Data Sciencesolution. For example, an idea is born on how a particular data set could be analysed, perhaps with onlya vague idea of the question being answered. Then algorithms and software are developed and, in somecases, the analysis results in superior performance as judged by ﬁgures-of-merit applied to that data set,which leads to a publication. Are the results extendible and do they have ‘error bars’ ?Usually ﬁgures-of-merit are concerned with the eﬃciency of a method. However, validity should beestablished ﬁrst, and error bars allow this to be done. No matter how eﬃcient a prediction method is, if1ts error bars deﬁne a nominal 90% prediction interval but the true quantity (i.e., the predictand) is onlycontained in these intervals 60% of the time, the method lacks validity. This important ﬁgure-of-meritis called coverage and is analogous to one in the medical literature called speciﬁcity.This article is aimed at data scientists who want their error bars to be based on probabilities andtheir coverages to be (approximately) as stated. There are data scientists who have error bars thatare based on empirical distributions derived from ensembles of like data sets or from partitioning asingle data set into like subsets. While the principles I present are for data scientists who quantifyuncertainties using probabilities, they could act as a guide for both types. If your analyses includeuncertainty quantiﬁcations, then you may ﬁnd some of the principles in the following sections useful forhypothesis generation, for climbing the knowledge pyramid from data to information to knowledge, andfor scientiﬁc inference.

When I was planning this article, I hovered between the words ‘laws,’ ‘rules,’ ‘principles,’ and ‘guidelines’to establish some landmarks in Data Science. Laws should not be broken, exceptions prove the rule,and guidelines prevent accidents. But principles should make the analysis go better without interdictingother approaches: Improvising on something Groucho Marx once said, if you don’t like my principles,you may have others you like better . . .

A data scientist might see variability between strata and within strata. A statistical scientist usesprobability distributions to capture some or all of this variability. For example, probability densitieswould typically be used to model within-strata variabilities, and these densities may be diﬀerent acrossstrata in order to capture the between-strata variability. In many, many cases, those strata diﬀerencesare modelled via a regression model, leaving the errors around the regression line (i.e., the within-stratavariability) to be described by a probability distribution, often a Gaussian (or sometimes called normal)distribution.In my opinion, the worst word in the Statistics lexicon, which happens to have four letters, is ‘mean.’It is the ﬁrst moment of a probability distribution, and it is also used to describe the average of acollection of numbers, leading to many misunderstandings in Data Science. To data scientists whocapture variability through the empirical distributions of summary statistics from similar data sets, theaverage is their ﬁrst moment, so they could legitimately call it a mean of their empirical distribution. Toavoid confusion, it will be henceforth assumed in this article that variability is captured by probability distributions, and a mean iss (is and only is) the ﬁrst moment of that distribution.In this section, ﬁve general statistical principles are presented. Diﬀerent areas of study will have theirown and, in Section 3, I present three principles that are important in Spatial Statistics. In an epilogicalsection (Section 4), I present one more, the principle of respecting units, which makes nine principles intotal. . . . but some are wrong-er than others

In my work on remote sensing of carbon dioxide, data z from a single sounding is a more-than-2,000-dimensional vector of radiances observed at wavelengths in the near infra-red part of the spectrum.Cressie (2018) contains a literature review that features the relationship between z and the state of theatmosphere y (a 40-to-50 dimensional vector) via the statistical model, z = f ( y ) + ε , (1)2here unobserved y and ε are statistically independent; ε ∼ N( , Σ ε ), where N( µ , Σ ) refers to a mul-tivariate Gaussian distribution with mean vector µ and covariance matrix Σ ; f is a known vector ofnonlinear forward models obtained from atmospheric physics/chemistry; Σ ε is a covariance matrix knownfrom experiments on the remote sensing instrument, performed in the laboratory before the satellite islaunched; y = µ α + α , where α ∼ N( , Σ α ); and µ α and Σ α are the ﬁrst two moments of the (hidden)atmospheric state y . The goal is to predict y from data z . In my experience, it is the speciﬁcation of Σ α (the covariance matrix of the state of the atmosphere) where the statistical model can be seriouslywrong (Cressie, 2018). Principle 2.1 : Establish a true model (TM), perhaps diﬀerent from the scientist’s working model (WM).Critically, compute the TM-distributional properties of the WM estimators.

At the very least, this principle gives the data scientist an idea of the sensitivity of the model tomisspeciﬁcation. In the remote sensing example, the WM-based statistical analysis provides a predictedatmospheric state, ˆ y WM , called a retrieval. However, the atmosphere does not care what WM waschosen; it acts under the true model (TM), and the principal ﬁgures-of-merit that are calculated are(using obvious notation): True bias: E TM (ˆ y WM − y ) , True uncertainty: var TM (ˆ y WM − y ) . (2)While ˆ y WM may be optimal under WM, it is not under TM, and the properties given by (2) will exposethe potential weaknesses of a WM chosen for its simplicity or its computational convenience.It is tempting (but wrong) for scientists to calculate instead:E WM (ˆ y WM − y ) , var WM (ˆ y WM − y ) , (3)which are often facile calculations compared to (2). This was the case for NASA’s Orbiting CarbonObservatory-2 (OCO-2) satellite. Over time, empirical distributions of retrieval-error ensembles showedthat bias and variance should have been calculated from (2), not (3). This is discussed by Cressie (2018),and an illustration of applying Principal 2.1 is given in Nguyen et al. (2019).In the rest of this article, I shall assume that the WM and TM are one and the same. However, thisprinciple is potentially applicable in all the examples below. There is an equivalent way to write the model (1) that shows how statistical scientists can naturallybuild complexity into their models: z | y ∼ N( f ( y ) , Σ ε ) y ∼ N( µ α , Σ α ) . (4)This is a hierarchical statistical model (HM) , which is deﬁned by layers of conditional probabilities (heretwo layers).The top layer is the data model and can be written in abbreviated notation as [ z | y ]. It models whatwe see – the data (conditional on the process y ). The next layer is the process model that can be writtenas [ y ]. It models what we want to get – the latent process, or state, hidden behind the data. In fact,in the remote sensing problem described in Section 2.1, the distribution of y is conditional on w , themeteorology of the atmosphere. If need be, the process model could be modiﬁed to now consist of [ y | w ]and [ w ]. Then the HM would be made up of three levels, and the sequence of conditional probabilities3ould be [ z | y ], [ y | w ], and [ w ]. In the speciﬁc case of OCO-2 retrievals, the meteorological process, w ,is obtained from a numerical weather forecasting model and is considered known. Principle 2.2 : Build statistical models conditionally, through a data model and a process model. Inferthe unknown process from the predictive distribution.

The predictive distribution is, [ y | z ] = [ z | y ] · [ y ][ z ] , (5)where for convenience any parameters θ in [ z | y ] and [ y ] are dropped from the notation (but they arestill there). The HM components are featured in the numerator, which is equal to the joint distribution,[ z , y ]; and the denominator, [ z ], is simply a ‘normaliser’ to ensure that [ y | z ] is a density (or probabilitymass) function that integrates (or sums) to 1. Equation (5) is Bayes’ Theorem but, in this case, theunknowns are the state elements in y , not the parameters θ . No prior distributions on θ have beenassumed in the making of this HM! However, a prior could be assumed or, pragmatically, an estimate ˆ θ could be obtained from z .Derivation of the predictive distribution in (5) can be problematic in many cases of practical interest,such as when [ y ] is a highly multivariate probability distribution of a complex scientiﬁc process. Com-putational methods, particularly Markov chain Monte Carlo (MCMC), are in constant development toallow realisations from [ y | z ] to be simulated, from which distributional properties such as the predictivemean and the predictive quantiles can be used to predict the state y and the error bars, respectively.The simplest special case of (4) is when the data (here, z ) and the process (here, y ) are univariate,and f ( · ) is the identity function: z | y ∼ N( y, σ ε ) ,y ∼ N( µ α , σ α ) . Then [ y | z ] is Gaussian, characterised by its ﬁrst two moments:E( y | z ) = { (1 /σ α ) µ α + (1 /σ ε ) z }{ /σ α + 1 /σ ε } − , var( y | z ) = { /σ α + 1 /σ ε } − . The predictor, ˆ y = E( y | z ), is a weighted combination of the data and the mean of the unknown state y ,where the weights depend on the signal-to-noise ratio, σ α /σ ε . These results can be generalised to themultivariate case with linear retrieval equations in (4), namely f ( y ) = c + Ky . In the case of OCO-2, K has more rows than columns, and the ﬁrst two moments of the predictive distribution, which is Gaussian,are: E( y | z ) = y α + G ( z − c − Ky α ) , var( y | z ) = { Σ − α + K ⊤ Σ − ε K } − , (6)where G = { Σ − α + K ⊤ Σ − ε K } − K ⊤ Σ − ε is sometimes called the gain matrix.From (6), it can be seen that the precision matrix (i.e., the inverse of the variance matrix), writtenprec( · ), of each component of the predictive distribution satisﬁes,prec( y | z ) = prec( y ) + K ⊤ prec( z | y ) K , a result that holds when the matrix K is any size. This decomposition of precision demonstratesthat, when going from [ y ] to [ y | z ], the precision increases (i.e., the variance decreases). Moreover, thepredictive mean is unbiased; that is, E(E( y | z )) = E( y ) = µ α .In geostatistics, the importance of (5) is deeply misunderstood, because Matheron (1963) originallyformulated kriging in terms of what he called a regionalised variable, { z ( s ) : s ∈ D ⊂ R d } , where R d is4 -dimensional Euclidean space and D is a subset of R d with volume | D | >

0. In this formulation, the datamodel and the process model are collapsed into the single probability distribution, [ { z ( s ) : s ∈ D } ]. Thenthe goal is to predict z ( s ) from data z = ( z ( s ) , . . . , z ( s n )) ⊤ , for which the generic kriging predictor isE( z ( s ) | z ), the mean of [ z ( s ) | z ].However, when measurement of the process is taken into account, there is a HM that diﬀerentiatesbetween the observations z and the underlying latent process { y ( s ) : s ∈ D } and that is observedimperfectly through z . The spatial trend and the spatial covariance function are deﬁned in the processmodel, although they are estimated from the noisy data z . Once the measurement error is includedin the model, it is clear that geostatistics should do kriging using E( y ( s ) | z ) and not E( z ( s ) | z ); seeCressie and Wikle (2011, Section 4.1).Consequently, much of the earlier geostatistics software did not make inference on y ( s ) but choseto make inference on z ( s ) where the measurement error is confounded with the process error. Userbeware: some still do, but those written for environmental applications (e.g., geoR , gstat , FRK ) give thecorrect kriging equations. The diﬀerence is most apparent when the kriging variance is computed asvar( z ( s ) | z ) at location s , but then interpreted incorrectly as the predictive variance var( y ( s ) | z ) ofthe true process at s . Building physical models usually involves ensuring that mass or energy is conserved. If the system isleaking energy, then it needs to be plugged or the energy needs to be followed as it moves into anothersystem.Now consider a designed experiment where data z are obtained and the statistical model ﬁtted is z = Xβ + ξ . (7)The model in (7) is a linear regression with covariate (or design) matrix X , β is an unknown p -dimensionalvector of regression coeﬃcients, and ξ = ( ξ , . . . , ξ n ) ⊤ consists of random variables that are independentand identically distributed (iid) Gaussian random variables with mean 0 and variance σ ξ .Suppose that the measuring instrument was carefully calibrated and, in the study protocol, its samplevariance from repeated measurements was reported as a number that I shall write as s ε >

0. Note thatthe uncertainty in the measurements usually needs to be quantiﬁed outside the experiment in order toidentify the linear-model-error variance. Scientiﬁc interest is primarily in β , but σ ξ is by no means anuisance parameter. Its restricted maximum likelihood (REML) estimate is: s ξ = ( z − X ˆ β ) ⊤ ( z − X ˆ β ) / ( n − p ) , where ˆ β is the maximum likelihood (and also the ordinary least squares) estimate of β .It could almost be a rule that in any study of this sort, you will see s ξ > s ε . Where did the componentof variability, ( s ξ − s ε ), go? Principle 2.3 : In any well deﬁned statistical model, there is conservation of variability.

Indeed, the model given by (7) is not deﬁned well enough: The error term ξ i is, in fact, made up of twocomponents of variance: ξ i = δ i + ε i ; i = 1 , . . . , n , where { ε i : i = 1 , . . . , n } represent measurement errors. Often forgotten are { δ i : i = 1 , . . . , n } , whichrepresent model errors resulting from using the linear-regression model, { x ⊤ i β : i = 1 , . . . , n } , and whichare key in accounting for the inexactness of using any model (linear or nonlinear).5he scientiﬁc process { y i : i = 1 , . . . , n } is given by y i = x i β + δ i , and an observation of it is z i = y i + ε i , for i = 1 , . . . , n . The HM captures the variability of z and y beautifully, as follows: z | y ∼ N( y , σ ε I ) [or, z = y + ε ] , y ∼ N( Xβ , σ δ I ) [or, y = Xβ + δ ] . (8)Hence, from (8), we obtain the earlier result in vector form: z = ( Xβ + δ ) + ε = Xβ + ( δ + ε ) = Xβ + ξ , (9)where δ and ε are independent mean-zero random vectors.Comparing (7) and (8), we see that the model given by (7) should be augmented with the equation, ξ = δ + ε , which results in the conservation-of-variability equation, σ ξ = σ δ + σ ε . Consequently, knowing s ε (an estimate of σ ε that is obtained outside the experiment) and, having computed the REML estimateof σ ε , an estimate of the model error, σ δ , can be obtained as, s δ = s ξ − s ε (10)(provided the right-hand side is non-negative).In its simplest form, conservation of variability says that the total variability is equal to the variabilitydue to model uncertainty plus the variability due to measurement uncertainty . When variability iscaptured using a probability space deﬁned by a HM, this principle can be expressed as:var( z ) = var(E( z | y )) + E(var( z | y )) . (11)For example, consider the HM deﬁned by (8). Since var( z ) = var( ξ ) and var( y ) = var( δ ), we have σ ξ I = σ δ I + σ ε I , and variability is conserved.This might seem obvious to you, or perhaps even trivial. Again consider (7), and suppose you wantto predict an unknown value, y n +1 , outside the data set, but only the p -dimensional estimate ˆ β andthe covariate, x n +1 , are at your disposal. Most regression textbooks would say that you should useas predictor, x ⊤ n +1 ˆ β . However, a scientist wants to predict the value of the scientiﬁc process, y n +1 = x ⊤ n +1 β + δ n +1 . Using the well deﬁned HM (8), its predictor is ˆ y n +1 = E( y n +1 | z ) = x ⊤ n +1 ˆ β + ˆ δ n +1 , whereˆ δ n +1 = E( δ n +1 | z ). Since { δ i } are iid N(0 , σ δ ), ˆ δ n +1 = E( δ n +1 | z ) = 0; however, var( δ n +1 | z ) is not zero ,and that has to be recognised in order to perform valid predictive inference, as given below (e.g., Cressie,2020).The prediction error is (ˆ y n +1 − y n +1 ), and its ﬁrst two moments are:E(ˆ y n +1 − y n +1 ) = 0 , E(ˆ y n +1 − y n +1 ) = E( x ⊤ n +1 (ˆ β − β )) + E(var( δ n +1 | z )) , since the expectation of the cross-product is zero. Thus, if var( δ n +1 | z ) were forgotten by the datascientist, such as would happen if the presence of { δ i } were not recognised in the statistical model, aforecasting decision about future values of the process { y i } would be overly optimistic and potentiallyharmful.Principle 2.3 also shows up in the analysis of variance (ANOVA) method, where the ‘betweensum of squares’ corresponds to the variance of the conditional expectation in (11), the ‘within sumof squares’ corresponds to the expectation of the conditional variance in (11), and the ‘total sum of6quares’ corresponds to the left-hand side of (11). Conservation of variability implicitly includes vari-ability and covariability . For example, if two random variables ε and ε have correlation ρ = 0, thenit should be recognised that var( ε + ε ) = var( ε ) + var( ε ) This has often been ignored by scien-tists doing an error budget (e.g., in remote sensing retrievals; see Connor et al., 2016). Depending onthe sign of ρ , var( ε ) + var( ε ), may be either larger than or smaller than the total variability, sincevar( ε + ε ) = var( ε ) + var( ε ) + 2 ρ var( ε ) / var( ε ) / .The rules of probability theory can explain easily how the variability can seem to disappear, and thenthey can show us where to ﬁnd it. It is less natural to do so simply with empirical distributions, becausethe pairs { ( z i , y i ) } involve the unavailable, hidden variable { y i } . Further, if { z i } and { x j } are two setsof observations, there is no guarantee that they will occur in pairs, and hence the empirical correlation, r , might be diﬃcult to obtain.The famous bias–variance trade-oﬀ is another manifestation of Principle 2.3. In the context ofestimation of a ﬁxed but unknown parameter θ with an estimator ˆ θ ( z ), where recall z is the data vector,the mean-squared error is the sum of the estimator’s squared bias and its variance:E(ˆ θ ( z ) − θ ) = (E(ˆ θ ( z )) − θ ) + var(ˆ θ ( z )) . Generally speaking, an estimator that decreases the bias will increase the variance, and vice versa . Themean-squared error might be decreased using diﬀerent estimators, but there is a trade-oﬀ to be madewhen trying to decrease both bias and variance at the same time.

A statistical analysis cannot get very much simpler than ﬁtting a simple linear regression (a special caseof (7)) to { ( z i , x i ) : i = 1 , . . . , n } , as follows: z i = β + β x i + ξ i ; i = 1 , . . . , n , (12)where { ξ i : i = 1 , . . . , n } are iid N(0 , σ ξ ). However, is (12) appropriate if { z i : i = 1 , . . . , n } are photoncounts at wavelengths in a given band of the electro-magnetic spectrum, or if they are percentages oftrace elements in soil?Exploratory data analysis (EDA) might reveal a highly skewed histogram of { z i } . After plotting thehistogram of { log z i } , we might achieve a more symmetric appearance; assume for the moment that wedo. This would prompt a plot of not only { z i } versus { x i } , but also of { log z i } versus { x i } , and thelatter might look more linear. Further EDA, in the form of residual plots, after ﬁtting a simple linearregression to the data { z i } and one to the transformed { log z i } , would be carried out. And, for example,it might be that the residual variability of { z i } appears to increase with x (i.e., heteroskedasticity), butthe residual variability of { log z i } shows no dependence on x (i.e., homoskedasticity).Admittedly, this is a story, not methodology, but in my experience with analysing data, it happensenough to propose the next principle (Cressie, 1985). Principle 2.4 : Seek a transformation of the scientiﬁc process where all components of variation act andinteract additively.

This principle looks more like a doctrine but, as I explain below, a ﬁtting of (12) to the data { z i } isfollowing this principle, where the transformation is simply the identity.Suppose the quantity y has variabilities that can be expressed as ‘large-scale’ and ‘small-scale.’ Sci-entists usually put their theories in the large scale and their errors (both in the theory and in themeasurements) in the small scale. Statistical scientists know that to make inference on the large scale7f the scientiﬁc process, its small scale has to be modelled well, usually by a random error term, δ .However, the model I now give for the scientiﬁc process is more basic, given in terms of those large andsmall scales of variability: It is additive and of the following form, y = ( µ (1) + . . . + µ ( p ) ) + ( δ (1) + . . . + δ ( N ) ) , (13)where p and N are positive integers and, for convenience, the subscript ‘ i ’ has been dropped. Forexample, the large-scale variation in simple linear regression has p = 2, µ (1) = β , and µ (2) = β x .For the small-scale variation in (13), N is large, and { δ ( l ) : l = 1 , . . . , N } are physically interpretablecomponents with variabilities that are very small but such that their total variability is the variabilityof the diﬀerence, y − µ (1) − . . . − µ ( p ) ; see Principle 2.3.The model (13) is a physical model where the large scales act additively with each other, the smallscales act additively with each other, and the large scales and the small scales interact additively. It can bemade statistical by assuming that these many small-scale eﬀects, δ (1) , . . . , δ ( N ) , are random, statisticallyindependent, have mean zero and variances that are O (1 /N ), with the same leading coeﬃcient, denotedhere as σ δ . Now let N be large and deﬁne the small-scale variations collectively as δ ; then (13) can bewritten as, y = µ (1) + . . . + µ ( p ) + δ , where1. E( y ) = µ (1) + . . . + µ ( p ) (additivity) ,2. var( y ) = σ δ (homoskedasticity) ,3. δ ∼ N(0 , σ δ ) (central limit theorem) .The regression (7) is obtained when the large-scale eﬀects { µ ( k ) : k = 1 , . . . , p } are identiﬁed with { β k x k : k = 1 , . . . , p } . It is the default model used in much of science, but this discussion shows that itoriginates from the imposition of additivity within and between scales of variability. From Principle 2.2,the measurement of a phenomenon should be separated from how the phenomenon behaves in nature and,indeed, the generalised linear model does this through a link function (McCullagh and Nelder, 1989).Hence, Principle 2.4 covers many statistical models and oﬀers a plausible explanation of why a linearrelation, homoskedasticity, and Gaussianity are often found to occur together after a transformation ofthe data (e.g., Cressie, 1978). Of course, it is easy to construct probability models where the principledoes not hold, but one should keep in mind that probability theory is used to model nature’s variability,not the other way around.When nonlinearity is inherent to the physical system, such as would occur when there are barriers orthresholds, the quest looks to be futile. However, after following Principle 2.2, it may just be that thegrail, Principle 2.4, is hidden deep in the process model (Berliner et al., 2000). Taleb (2007) published a book about ‘Black Swan’ events that, when they happen, are considered tohave been unpredictable. Such events could have a major impact on Earth’s environment or, as hashappened in 2020, on global-population health and economies around the globe.This black-swan meme has its roots in seventeenth-century Europe. Up to then, no swans had everbeen observed that were black but, in 1697, Dutch explorers became the ﬁrst Europeans to observe blackswans during a voyage that took them to Western Australia. The implication in Taleb’s book is that if8he scientiﬁc world in the 1600s could not predict black swans, how can scientists predict catastrophicenvironmental events in the twenty-ﬁrst century?My response is that 300+ years ago the best scientiﬁc minds in Europe were too certain about theirscience; that is, if they had been asked to put probabilities (according to abundance) on the coloursthat swans could be: red, orange, . . . , violet, white, black, they would have put 0 , , . . . , ,

0, which is awell deﬁned probability mass function. Certainly, black-swan events are not predictable if the scientiﬁcmodel is 100% certain that they do not exist. Because their probability model gave black swans andindeed coloured (including red) swans, zero probability, this unimagined event did not emerge out ofthe ‘ether’ in subsequent inferences, until one was observed. Do we have to wait until a highly unusualevent occurs before we are forced to change the probability model? The real lesson from the black-swanmeme is that scientiﬁc knowledge is never perfect, that modellers need to explore the parameter spacethoroughly, and that they need to ‘spend’ some probability on highly unusual events.At the time of writing this article, our species is under attack from a virus that was originally called‘novel coronavirus,’ so new that it took several weeks before the infection it caused had a name: COVID-19. Virologists certainly had assigned a small positive probability that each new decade would have itsown ‘novel’ pandemic. However, politicians (and most economists) appear to have put zero probabilityon a severe, worldwide economic disruption.Given a swine-ﬂu-like pandemic has occurred, the conditional probability of some economic disruptionis not zero . But, given a severe pandemic has occurred, the conditional probability of severe economicdisruption is substantial. This conditional probability is then multiplied by the probability of a severepandemic, which is by no means zero given the ability of viruses to mutate and occasionally jump fromanimals to humans. The product of these two is the joint probability of a severe pandemic followedby severe economic disruption, which is not negligible. This has happened twice in the last hundredyears, and it will likely happen more often with humans and animals sharing in an ever-more-crowdedenvironment.

Principle 2.5 : When building probability models, look carefully where zero probabilities are assumed(perhaps implicitly) and, with the same care, move appropriate probabilities away from zero. Calculatejoint probabilities from products of conditional (not marginal) probabilities, unless entropy is maximal.

Unfortunately, uncertainty quantiﬁcation through joint probabilities all too often comes from multi-plying marginal probabilities as if each event in the causal chain were independent. It does not ‘hurt’for small probabilities to be assigned to Pr(bird is colour c | bird is a swan) where c also covers thecolour ‘black.’ Then Pr(bird is a swan and it is black) = Pr(bird is black | bird is a swan) × Pr(bird is aswan). The point being made here is philosophical, and it could be applied to Pr(pandemic followed byglobal economic disruption), for diﬀerent severities of both events. The worst thing would be to makePr(global economic disruption | pandemic) equal to zero, since then there would be a lack of planningfor the health and economic crises brought on by a pandemic (Mackenzie, 2020).If nothing were known about the conditional probability, a fall-back is the maximum-entropy modelwhere the marginal probability, Pr(bird is colour c ), is used in place of Pr(bird is colour c | bird is aswan). The marginal probability is not zero, since it is based on abundance and, for example, thereare birds that are predominantly coloured red (e.g., Australia’s king parrot). The maximum-entropyprinciple is discussed in Cressie et al. (2004).Experiences over the last 100 years mean that Pr(global economic disruption), whether caused bya pandemic or not, should not be zero, yet governments described the events of March–April 2020 asunimaginable. Events of small (but not zero) probabilities with consequences expressed through a lossfunction, allow a non-zero expected loss to be calculated, which can then be used to make optimal9ecisions that mitigate the consequences (e.g., see Berger, 1985 for a discussion of decision theory).I have just presented ﬁve statistical principles that should be useful in Data Science. Data often comewith location information, in which case the data scientist will likely be using spatial-analysis methods.In the next section, I present three principles that are speciﬁc to Spatial Statistics. Those of us who work in Spatial Statistics will know of Tobler’s First Law of Geography (Tobler, 1970).In spatial statistical science, it really is a ‘principle’ rather than a ‘law’ and, in the following subsections,I shall present it and two other principles that I have found useful in this area of research. . . .

In his famous 1935 book on experimental design, Fisher (1935, p. 66), wrote: ‘After choosing the area weusually have no guidance beyond the widely veriﬁed fact that patches in close proximity are commonlymore alike, as judged by the yield of crops, than those which are further apart.’ A spatial statisticiansees this as the making of a principle but, at that time, Fisher made a sharp right turn. In his analyses ofﬁeld trials he applied randomisation to eradicate the pest: spatial correlation! Cressie and Wikle (2011,Ch.1 and Ch. 4) give some historical perspective to the work of researchers who took roads less travelledand developed this vibrant area we now call Spatial Statistics. Some of this development comes fromGeography, and so it is ﬁtting that the ﬁrst and most important principle in this section has becomeknown as the First Law of Geography. Originally articulated by Tobler (1970), it is given here in exactlyhis words.

Principle 3.1

Everything is related to everything else, but near things are more related than distantthings (Tobler, 1970).

This principle is at the core of what we do in spatial and spatio-temporal statistics. For example, inremote sensing of Earth’s surface, the scene of interest D contains many retrievals, { z ( s i ) : i = 1 , . . . , n } ,where { s i : i = 1 , . . . , n } are the (lon, lat) locations of the n data inside D (retrieved over a short timeperiod). There is a hidden spatial process { y ( s ) : s ∈ D } that the geophysicist would like to infer and,in a spatial HM, a spatial statistical model for { y ( s ) : s ∈ D } is built around Principle 3.1.For example, consider a simple process model, appropriate for a small scene D : y ( s ) = β + β lat( s ) + δ ( s ) , for s ∈ D , where the mean of the process is a linear function of latitude, and { δ ( s ) : s ∈ D } is a spatial error processwith mean zero and stationary covariance function, C δ ( h ) = cov( δ ( s + h ) , δ ( s )) = σ δ · exp( −k h k /φ ).Notice that C y ( h ) = cov( y ( s + h ) , y ( s )) = C δ ( h ); C y ( ) = C δ ( ) = σ δ ; and the scale parameter φ controls how ‘more related’ things are. With this parameterisation, φ = 0 is the degenerate case of nospatial correlation and, as φ increases, the spatial correlation increases for a given distance k h k . Thedata model in this example is also simple, that of additive independent measurement error. The datavector is z = ( z ( s ) , . . . , z ( s n )) ⊤ and z ( s i ) = y ( s i ) + ε i ; i = 1 , . . . , n , where { ε i : i = 1 , . . . , n } are independent mean-zero errors.Assuming for the moment that all the parameters θ = ( β ⊤ , σ δ , φ, σ ε ) ⊤ are known, inference on { y ( s ) : s ∈ D } comes from summaries of the predictive distribution, [ { y ( s ) : s ∈ D }| z ]. For a Gaussian process10odel y ( · ) and Gaussian measurement errors { ε i } , the predictive distribution is also a Gaussian processwhose ﬁrst two moments, E( y ( s ) | z ) and cov( y ( s ) , y ( u ) | z ), for s , u ∈ D , can be obtained analytically.From Section 2.2, a common predictor of y ( s ) is ˆ y ( s ) = E( y ( s ) | z ), whose statistical properties are neededfor inference. It is straightforward to see that E(ˆ y ( s )) = E( y ( s )) (i.e., the predictor is unbiased) and themean-squared predictor error is:E(ˆ y ( s ) − y ( s )) = E(var( y ( s ) | z )) = var( y ( s ) | z ) , which is the predictive variance (the last equality being due to the Gaussian assumptions made). Furthersummaries might come from well chosen percentiles (e.g., 2.5%, 25%, 50%, 75%, 97.5%) of [ y ( s ) | z ]. In spatial statistics, it might appear that the spatial-prediction problem is quite diﬃcult, because thenumber of predictions to be made is often greater than the number of data. Even for the problem ofparameter estimation, the number of ‘degrees of freedom’ in n spatially dependent data will not be n , as Inow show. Applying Principle 3.1 with a stationary spatial covariance model C z ( h ) = cov( z ( s + h ) , z ( s )),for h ∈ R d , that exhibits only positive correlations, it is easy to see that for z = (1 /n ) P ni =1 z ( s i ),var( z ) = (1 /n ) { σ z + 2 X X i σ z n , where σ z = C z ( ). Hence one can deﬁne the eﬀective degrees of freedom , n eﬀ , as n eﬀ = σ z / var( z ) < n ;that is, under the almost ubiquitous spatial model of positive spatial correlation, n eﬀ < n . In the remotesensing context where there are many observations taken over a short period of time, Zhang et al. (2017)gave an example where n = 2961 and n eﬀ = 202 .

3, less than one-tenth of the sample size!These calculations are a warning that, in the spatial (and spatio-temporal) setting, intuition learnedfrom the ‘iid errors’ model has to be modiﬁed, sometimes substantially. Critically, the probability modelthat captures the spatial variability has to be well speciﬁed, and we now discuss how this can be done.The classical spatial statistical model consists of a deterministic mean process, { µ ( s ) : s ∈ D } , anda random spatial error process, δ ( s ) = y ( s ) − µ ( s ), so that y ( s ) = µ ( s ) + δ ( s ) , for s ∈ D ; (14)see Cressie (1993, Ch. 2). It is often a personal choice what components go into µ ( s ). For linearregression, µ ( s ) = x ( s ) ⊤ β , for s ∈ D , but the set of possible covariates, x ( s ), typically comes from themodeller’s subject-matter knowledge, augmented by spatial trend terms that are linear and higher-orderfunctions of s = ( s , . . . , s d ) ⊤ .What is the eﬀect of not having included an important spatially varying covariate, x p +1 ( s ) say, in µ ( s )? From Principle 2.3, { δ ( s ) } has to absorb the contribution to spatial variability that { x p +1 ( s ) } would have made had it been included in the regression. This means that a ﬁxed eﬀect that has beeninadvertently left out will be picked up in the spatial statistical model (14), as a random eﬀect. Principle 3.2 : Assume a decomposition of a spatial process into ﬁxed eﬀects plus random eﬀects. Whileit is not unique, the decomposition must be chosen to conserve variability. regional variance of µ ( · ) as, s µ = (cid:18) | D | (cid:19) Z D ( µ ( s ) − µ ) d s , where µ = (1 / | D | ) R D µ ( s ) d s is the regional average of µ ( · ) (Lahiri et al., 1999). Suppose that a sta-tionary covariance function, C δ ( · ), describes the covariability in { δ ( s ) : s ∈ D } , with σ δ = C δ ( ); recallthat C δ ( h ) = cov( δ ( s + h ) , δ ( s )), for h ∈ R d . Let ˆ C δ ( · ) denote the empirical covariance function, and s δ = ˆ C δ ( ). Then, according to Principle 3.2, approximately speaking, s µ + s δ should not change asdiﬀerent decompositions given by (14) are ﬁtted. Now suppose that the additional covariate vector, x p +1 = ( x p +1 ( s ) , . . . , x p +1 ( s n )) ⊤ , is included in the regression, and it is not in the column space of X . Then generally, the new empirical covariance function ˆ C δ ( · ), yields a new estimate s δ = ˆ C δ (0).The principle says that this new s δ should decrease to allow for the additional spatial variability in { µ ( s ) : s ∈ D } . (Understandably, the general shape of the new empirical covariance function will changeas well.)How will we ever know which model is better after each is ﬁtted to the same spatial data { z ( s i ) } ?Model-selection criteria such as ‘Akaike Information Criterion,’ ‘Deviance Information Criterion,’ andcross-validation will remove bad models but, in the ‘diﬃcult middle,’ there is no way to know whethera graph of { z ( s i ) } versus { x p +1 ( s i ) } is showing the behaviour of a component of the mean function orthe behaviour of a component of the error process. Germane to Principle 3.2, it is well known to spatialstatisticians that a single simulation of a strongly spatially dependent random eﬀect, { δ ( s ) : s ∈ D } , canlook like deterministic spatial trend. If a replicate of the data were available, and if the analogous plotshowed the same behaviour as seen in the original plot, then { x p +1 ( s ) : s ∈ D } would probably belongin the mean function µ ( · ). From just one realisation, the decomposition (14) is not unique, and hencewhat is one person’s mean function could be another person’s spatial error. Most of us have heard of, or encountered,

Simpson’s Paradox , which basically says that two variables x and y could be positively correlated but, conditional on a third variable w , it is perfectly acceptable that x and y be negatively correlated (or uncorrelated, or positively correlated)! Usually, Simpson’s Paradoxis expressed in terms of categorical data, which we could do here by deﬁning ordinal categories, or bins,from the ranges of w, x , and y .Let a sample { ( w i , x i , y i ) : i = 1 , . . . , n } be assigned to the pre-deﬁned bins, creating a three-waycontingency table with counts in each cell of the table. A two-way table showing (marginal) dependencebetween binned x and binned y is obtained by aggregating the counts across the third variable, namelybinned w . Then a measure of dependence in the two-way table (e.g., the statistic gamma , due toGoodman and Kruskal, 1954) could show positive dependence, but the same measure applied to thetwo-way tables of binned x and binned y conditional on each of the values of binned w , could all shownegative (or no, or positive) dependence.What can you do about it? First, understand why it happens, and then spend the rest of yourdata-science career looking for those lurking variables like w that you may have missed! It manifests inother settings as well, as I now discuss.Let x and y be jointly Gaussian random variables and related through the simple linear regressionmodel given by (12); the probability model of how y varies with x is as follows: Conditional on x , the12andom variable y is Gaussian with mean and variance, respectively,E( y | x ) = E( y ) + { ρ xy var( y ) / / var( x ) / }{ x − E( x ) } , var( y | x ) = var( y ) { − ρ xy } , where ρ xy = corr( x, y ).Now consider regression of y on both x and w , where again joint Gaussianity of random variables w, x ,and y is assumed. To make the calculations easier to follow, in the rest of this subsection consider thesimple case where E( w ) = E( x ) = E( y ) = 0, and var( w ) = var( x ) = var( y ) = 1. Then the covariances,cov( x, y ), cov( w, y ), and cov( x, w ) are identical to correlations, which are denoted here as ρ xy , ρ wy , and ρ xw , respectively. The regression lines to be compared are,E( y | x, w ) = (cid:26) ρ wy − ρ xy ρ xw − ρ xw (cid:27) w + (cid:26) ρ xy − ρ wy ρ xw − ρ xw (cid:27) x , (15)and E( y | x ) = 0 + ρ xy · x . Then, conditional on w , corr( x, y | w ) = (cid:26) ρ xy − ρ wy ρ xw − ρ xw (cid:27) var( x | w ) / var( y | w ) / = ( ρ xy − ρ wy ρ xw (1 − ρ xw ) / (1 − ρ wy ) / ) , (16)which is to be compared to corr( x, y ) = ρ xy .If w has zero correlation with both x and y , then from (16), corr( x, y | w ) = corr( x, y ) = ρ xy , which isan intuitively reasonable result. In general, ρ wy = 0 and ρ xw = 0, so from (15) it is clear that a lurkingvariable w can wreak havoc on any honest attempt to interpret a simple linear regression of y on x . But,what does Simpson’s Paradox have to do with spatial statistics?The answer is ‘plenty,’ if you think about w as being a variable that describes a range of geographicalstrata. For example, the Australian Bureau of Statistics divides Australia up into small areas , at diﬀerentlevels of aggregation. A study of y =mean weekly income in NSW, Australia, regressed on selecteddemographic variables x at the ﬁnest level of aggregation, was given in Burden et al. (2015). Given thegreat divide in Australia between ‘city’ and ‘country,’ the most obvious variable w to choose is: w = 1if the small area is in Greater Sydney (city), and otherwise w = 0 (country). In this set-up, it is easy toimagine that the two regressions, E( y | x , w = 1) and E( y | x , w = 0), would result in more interpretableresults than the single regression, E( y | x ). (In fact, Burden et al., 2015, used Principle 3.2 and capturedthe ‘geography’ by using a spatial error term rather than a geographic covariate w .)Simpson’s Paradox is potentially everywhere in the spatial context, because regressions of y on x canbe done at diﬀerent levels of spatial aggregation. A regression of y on x at the ﬁnest level of aggregationmay show positive dependence, but when the variables are aggregated to a coarser level, the regressionmay show negative (or no, or positive) dependence! In Sociology, this phenomenon has been called the ecological eﬀect (Robinson, 1950), an unfortunate and misleading name that has no direct connectionto Ecology. In Geography, the combination of Simpson’s Paradox and the ecological eﬀect has beencalled the Modiﬁable Areal Unit Problem, or MAUP; a spatial statistical discussion of MAUP is givenin Cressie (1996).Because it is so natural to aggregate processes over space, meta-data in the spatial data set shouldinclude the spatial support of each datum. Deﬁne y ( B ) = (1 / | B | ) R B y ( s ) d s ; then y ( B ) has spatial upport B with volume, | B | = R B d s >

0. In geostatistics, B is called a block , and we say that y ( B ) is anaggregation (or block average) of the original process. Then a change-of-support (COS) is said to haveoccurred from point support to spatial support B , with a corresponding change of statistical properties.Data on mutually exclusive supports, { B , . . . , B n } , are typically represented as: z ( B i ) = y ( B i ) + ε i ; i = 1 , . . . , n . The simplest model would be { ε i : i = 1 , . . . , n } iid N(0 , σ ε ), but modiﬁcations to account for theprotocols of measurement and the possible overlap of the supports { B i } have been developed (e.g.,Wikle and Berliner, 2005).The two earlier principles of Spatial Statistics (nearby things are more alike; and decompose spatialvariability into ﬁxed eﬀects plus random eﬀects) are joined here by a change-of-support principle. Principle 3.3 : The variance of the aggregation, y ( B ) , is a decreasing function of the volume, | B | . This is the most important of a number of COS properties (e.g., Gotway and Young, 2002), whosediscussion I have left for a future time.Let y ( s ) = y + δ ( s ), for s ∈ D , where y is a non-degenerate random variable with possibly non-zeromean and independent of { δ ( s ) : s ∈ D } . Then for B ⊂ D ,var( y ( B ))) = σ + var( δ ( B )) , (17)where σ = var( y ) >

0. Therefore, if var( δ ( B )) decreases to 0 as the volume | B | increases, var( y ( B ))decreases to σ >

0. It is this type of behaviour that is of interest to geoscientists analysing remotesensing data. In that literature (reviewed in Cressie, 2018), they distinguish between two types of error,‘systematic error’ and ‘random error,’ as follows.

Random error : An error is a random error if the average of a collection of n of them has variability thatdecreases to zero like 1 /n , as n becomes large. Systematic error : An error is a systematic error if the average of a collection of n of them has variabilitythat does not decrease to zero, as n becomes large.These might be considered ‘verbal working deﬁnitions,’ but a statistical scientist looks at these andtries to ﬁnd an appropriate probability model. The ones I give below are building blocks for parts of aHM. In practice, the most diﬃcult aspect is to know which errors are of which type and how to groupthem together for inference.For ‘random error,’ the obvious probability model is: Let δ ( s ) , . . . , δ ( s n ) be iid random variableswith mean 0 and variance σ δ . Now the average, δ = ( P ni =1 δ ( s i )) /n , has E( δ ) = 0 and var( δ ) = σ δ /n ,which decreases to zero like 1 /n . In a ‘data-rich’ situation and with enough averaging of the data,random errors can be shown to be annihilated by the averaging (using the law of large numbers). Ina spatial setting, Principle 3.1 suggests that the independence assumption between the { δ ( s i ) } is notappropriate. Under increasing-domain asymptotics, it can be shown that var( δ ) still decreases like1 /n (Cressie, 1993), however strong spatial dependence can make the errors look more systematic thanrandom (Morris and Ebey, 1984).For ‘systematic error,’ an obvious probability model is a random-eﬀects model: Let e ( s ) , . . . , e ( s n )be written as e ( s i ) = y + δ ( s i ) ; i = 1 , . . . , n , where y has mean 0 and variance σ >

0, and y and { δ ( s i ) } are independent. Now, if { δ ( s i ) } are iidmean 0 and variance σ δ , then the average error e = P ni =1 e ( s i ) /n has variance,var( e ) = σ + σ δ /n . n becomes large, and hence e ( · ) is a form of systematic error.Zhang et al. (2019) used spatial dependence in the { δ ( s i ) } and an additive random eﬀect y with mean0, as part of the OCO-2 calibration of remote sensing data to ‘ground-truth’ data.Consider spatial prediction errors, { ˆ y ( s i ) − y ( s i ) : i = 1 , . . . , n } , which are located at { s , . . . , s n } in the spatial domain D . Typically, these prediction errors have individual means that are not zero,and it is often the case in Spatial Statistics that there is no replication to estimate them. A way out ofthe conundrum is to assume exchangeability (e.g., Spiegelhalter et al., 2004, Section 3.17), which resultsin a spatial statistical model that exhibits both systematic error and random error; see Cressie (2018,Rejoinder). In the previous sections, I presented ﬁve principles for Statistical Science and three special ones forSpatial Statistics. In what follows, I add a ninth principle that speaks to all of Science, not just DataScience, and I give some concluding remarks.

There is one further principle that has been my ‘rock’ in the diverse applied-statistics projects in whichI have participated. I once saw a cartoon that showed a small town’s ‘Welcome’ sign, and it lookedsomething like this:

Centerville

Population : 2,390Elevation : 862 feetTotal : 3,252Keeping track of units is a fundamental part of all of science, including Data Science.

Principle 4.1 : Only add quantities that have the same units. Multiply quantities that have diﬀerentunits, and cancel units from the product whenever possible. Take logs, exponents, and other specialfunctions of unit-free quantities.

The ﬁrst part of the principle is illustrated with the cartoon I referred to above. You might say youwould never add apples and oranges but, if in (7), z is measured in petagrams (Pg) of carbon in Earth’satmosphere and x = 1, x = t (in years), and x = t , then a unit analysis using the second part ofthe principle would reveal that regression coeﬃcient β is in units of Pg, β is in Pg/yr, and β is inPg/yr . Principle 4.1 is respected, since it is the regression components { β k x k } that are added. However,it is the regression coeﬃcients { β k } that are often interpreted, so I often standardise the covariates by,respectively, subtracting their averages and dividing by their sample standard deviations. Then, afterthis standardisation, β , β , . . . , β k all have the same units as those of z .Statistical scientists know that probabilities have no units. Using Principle 4.1, a unit analysis ofthe fundamental probability equation, R f ( y ) dy = 1, where f ( · ) is the probability density function ofthe random variable y , whose units are, say, Pg of carbon, reveals that f ( y ) has units of (Pg) − . Usingthe same principle, E( y ) has units of Pg, and so forth. While probability density functions have units,cumulative distribution functions and probability mass functions do not.The last part of Principle 4.1 applies to any special function that can be expressed as a Taylor series.The most common ones in science are logs and exponents, whose Taylor series arelog e (1 + x ) = x − x / x / − . . . e x = 1 + x + x /

2! + x /

3! + . . . , which only make sense when x has no units; see the ﬁrst part of Principle 4.1. Euler’s number, e , hasno units, because e = lim n →∞ (1 + 1 /n ) n .Counts, percentages, and correlations have no units. Also, the Box-Cox transformation (Box and Cox,1964), g λ ( x ) =  ( x λ − /λ , for λ ∈ ( −∞ , ∪ (0 , ∞ )log x , for λ = 0 , only makes sense if x has been rendered unit-free. One way to accomplish this is to specify x relative toa given standard.Every term in the process model (and data model) in a HM, should be assigned their rightful units.The scientiﬁc models that make up parts of the process model are sometimes derived theoretically,sometimes empirically. Beware of a ‘scientiﬁc constant.’ It may be an (estimated) regression coeﬃcient,in which case its units (the ratio of the dependent variable’s units to the corresponding covariate’s units)are key, since that ‘constant’ might change if the units change.In conclusion, it is a critical part of every collaboration to do a ‘unit analysis,’ which avoids obviousmistakes made by confusing diﬀerent parts of diﬀerent systems of units (e.g., metric and imperial), aswell as the more subtle ones discussed above. Principle 2.1 : Establish a true model (TM), perhaps diﬀerent from the scientist’s working model (WM).Critically, compute the TM-distributional properties of the WM estimators. [All models are wrong . . . but some are wrong-er than others.]

Principle 2.2 : Build statistical models conditionally, through a data model and a process model. Inferthe unknown process from the predictive distribution. [What you see is not what you want to get.]

Principle 2.3 : In any well deﬁned statistical model, there is conservation of variability. [Geophysicistsconserve energy but what do data scientists conserve?]

Principle 2.4 : Seek a transformation of the scientiﬁc process where all components of variation act andinteract additively. [The holy grail: all scales of variation are additive.]

Principle 3.1

Everything is related to everything else, but near things are more related than distantthings (Tobler, 1970). [Patches in close proximity are commonly more alike . . . ] Principle 3.2 : Assume a decomposition of a spatial process into ﬁxed eﬀects plus random eﬀects. Whileit is not unique, the decomposition must be chosen to conserve variability. [What is one person’s meanfunction could be another person’s spatial error.]

Principle 3.3 : The variance of the aggregation, y ( B ) , is a decreasing function of the volume, | B | . [COSis the DNA of Spatial Statistics.] Principle 4.1 : Only add quantities that have the same units. Multiply quantities that have diﬀerentunits, and cancel units from the product whenever possible. Take logs, exponents, and other specialfunctions of unit-free quantities. [You can’t add apples and oranges.]16 .3 A disclaimer

These nine principles of Statistical Science are personal, leading to a certain amount of self-referencing,but I hope others ﬁnd them useful. There are no theorem-proof results, but there are back-of-the-envelopecalculations that I use to justify the principles in simple settings. At the very least, they should givedata scientists boundaries in their analytics that should be respected, and criteria by which supervisedand unsupervised machine-learning methods could be assessed.

I once introduced Adrian Baddeley at a conference session I was chairing (ASC2010, Fremantle, WesternAustralia) as a national treasure. I repeat it here, and I wish him a very happy 65th birthday!

Acknowledgements : This research was supported by the Australian Research Council under DiscoveryProject DP190100180. The Sydney Business School, University of Wollongong, generously provided anenvironment where most of this article was written. The referees have generously given comments thathave helped me clarify the exposition of my nine principles. My thanks go to Dr B. Maloney for his skillfultechnical assistance. I would like to express my appreciation to Dr J. Wong for insightful discussions onthis and other material.

References

Berger, J. O. (1985).

Statistical Decision Theory . Springer, New York, NY.Berliner, L. M., Wikle, C. K., and Cressie, N. (2000). Long-lead prediction of Paciﬁc SSTs via Bayesiandynamic modeling.

Journal of Climate , 13:3953–3968.Box, G. E. P. and Cox, D. R. (1964). An analysis of transformations.

Journal of the Royal StatisticalSociety, Series B , 26:211–252.Burden, S., Cressie, N., and Steel, D. (2015). The SAR model for very large datasets: A reduced-rankapproach.

Econometrics , 3:317–338.Connor, B., B¨osch, H., McDuﬃe, J., Taylor, T., Fu, D., Frankenberg, C., O’Dell, C., Payne, V. H.,Gunson, M., Pollock, R., Hobbs, J., Oyafuso, F., and Jiang, Y. (2016). Quantiﬁcation of uncertaintiesin OCO-2 measurements of XCO2: Simulations and linear error analysis.

Atmospheric MeasurementTechniques , 9:5227–5238.Cressie, N. (1978). Removing nonadditivity from two-way tables with one observation per cell.

Biomet-rics , 34:505–513.Cressie, N. (1985). When are relative variograms useful in geostatistics?

Journal of the InternationalAssociation for Mathematical Geology ( now

Mathematical Geosciences) , 17:693–702.Cressie, N. (1993).

Statistics for Spatial Data, rev. edn.

Wiley, New York, NY.Cressie, N. (1996). Change of support and the modiﬁable areal unit problem.

Geographical Systems ,3:159–180.Cressie, N. (2018). Mission CO ntrol: A statistical scientist’s role in remote sensing of atmosphericcarbon dioxide (with discussion and a rejoinder). Journal of the American Statistical Association ,113:152–181. 17ressie, N. (2020). When is it Data Science and when is it Data Engineering? Comment on “Prediction,Estimation, and Attribution” by B. Efron.

Journal of the American Statistical Association, , 115:660–662.Cressie, N., Richardson, S., and Jaussent, I. (2004). Ecological bias: Use of maximum-entropy approxi-mations.

Australian and New Zealand Journal of Statistics , 46:233–255.Cressie, N. and Wikle, C. K. (2011).

Statistics for Spatio-Temporal Data . John Wiley & Sons, Hoboken,NJ.Fisher, R. A. (1935).

The Design of Experiments . Oliver and Boyd, Edinburgh, UK.Goodman, L. A. and Kruskal, W. H. (1954). Measures of association for cross classiﬁcations.

Journal ofthe American Statistical Association , 49:732–764.Gotway, C. and Young, L. J. (2002). Combining incompatible spatial data.

Journal of the AmericanStatistical Association , 97:632–648.Lahiri, S. N., Kaiser, M. S., Cressie, N., and Hsu, N. J. (1999). Prediction of spatial cumulative distribu-tion functions using subsampling (with discussion and a rejoinder).

Journal of the American StatisticalSociety , 94:86–110.Mackenzie, D. (2020).

Covid-19: The Pandemic that Should Never Have Happened, and How to Stop theNext One . The Bridge Street Press, London, UK.Matheron, G. (1963).

Trait´e de Geostatistique Appliqu´ee, Tome II: Le Krigeage . Memoires du Bureaude Recherche G´eologique et Mini`ere, No. . Paris, France.McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd edn.

John Wiley & Sons, NewYork, NY.Morris, M. D. and Ebey, S. F. (1984). An interesting property of the sample mean under a ﬁrst-orderautoregressive model.

The American Statistician , 38:127–129.Nguyen, H., Cressie, N., and Hobbs, J. (2019). Sensitivity of Optimal Estimation satellite retrievalsto misspeciﬁcation of the prior mean and covariance, with application to OCO-2 retrievals.

RemoteSensing , 11:2770.Robinson, W. S. (1950). Ecological correlations and the behavior of individuals.

American SociologicalReview , 15:351–357.Spiegelhalter, D. J., Abrams, K. R., and Myles, J. P. (2004).

Bayesian Approaches to Clinical Trials andHealth-Care Evaluation . John Wiley & Sons, Chichester, UK.Taleb, N. (2007).

The Black Swan: The Impact of the Highly Improbable . Random House, New York,NY.Tobler, W. R. (1970). A computer movie simulating urban growth in the Detroit region.

EconomicGeography , 46:234–240.Wikle, C. K. and Berliner, L. M. (2005). Combining information across spatial scales.

Technometrics ,47:80–91. 18hang, B., Cressie, N., and Wunch, D. (2017). Statistical properties of atmospheric greenhouse gasmeasurements looking down from space and looking up from the ground.

Chemometrics and IntelligentLaboratory Systems , 162:214–222.Zhang, B., Cressie, N., and Wunch, D. (2019). Inference for errors-in-variables models in the presenceof systematic errors with an application to a remote sensing campaign.