Bernoulli Trials With Skewed Propensities for Certification and Validation
Nozer D. Singpurwalla, The George Washington University, Washington, D.C.
Boya Lai, The City University of Hong Kong, Hong Kong

January 2020
Abstract
The impetus for writing this paper is the well publicized media reports that software failure was the cause of the two recent mishaps of the Boeing 737 Max aircraft. The problem considered here, though, is a specific one, in the sense that it endeavors to address the general matter of the conditions under which an item such as a drug, a material specimen, or a complex system can be certified for use based on a large number of Bernoulli trials, all successful. More broadly, the paper is an attempt to answer the old and honorable philosophical question, namely, "when can empirical testing on its own validate a law of nature?" Our message is that the answer depends on what one starts with, namely, what is one's prior distribution, what unknown does this prior distribution endow, and what has been observed as data.

The paper is expository in that it begins with a historical overview, and ends with some new ideas and proposals for addressing the question posed. In the sequel, it also articulates Popper's notion of "propensity" and its role in providing a proper framework for Bayesian inference under Bernoulli trials, as well as the need to engage with posterior distributions that are subjectively specified; that is, without recourse to the usual Bayesian prior-to-posterior iteration.
Keywords:
Bayes-Laplace Priors, Jeffreys' Prior, Probabilistic Induction, Exchangeability, Drug Approval, Subjective Posteriors, Machine Learning

Preamble: Perspective
Despite the science and engineering oriented implication of the title of this paper, its essential contribution is to address the general philosophical question as to whether empirical evidence on its own can ever prove a law of nature. The related practical question is when can a complex system, like computer software, a new drug, or an autonomous system, be certified for use based on testing alone. Karl Pearson (1920) characterized this matter as a "Fundamental Problem of Practical Statistics"; his foresight continues to hold almost 100 years later. Later, in 1927, E.B. Wilson wrote an article alluding to this matter, which the June 2015 issue of the Proceedings of the (US) National Academy of Sciences has labelled
"A Sleeping Beauty in Science" [cf. Ke et al. (2015)]. Whereas the problem considered is as ancient as Hume, Newton, and Kant, it was Bayes in 1763, followed by Laplace in 1774, who proposed probabilistic induction as a definitive response. Indeed, this was one of their signal achievements. As a matter of perspective, Bayes' Law turns out to be merely a methodological by-product of the bigger philosophical "problem of induction", of Thomas Hobbes (1588-1679), that Bayes and Laplace were trying to address. In effect, probabilistic induction can be seen as a compromise between objections to the principles of induction and of deduction. An interested reader may also want to see Zabell (1989) for a more philosophical as well as a detailed technical discourse on the problem.

With the advent of Data Science and Machine Learning, one is tempted to re-visit the broader question and ask, "Can Big Data and Machine Learning, on their own, certify a complex system, or ever prove a Law of Nature?" Were we to adopt a Bayes-Laplace like disposition, then we would arrive upon the viewpoint that data science and machine learning are the modern day enablers of probabilistic induction. Computer scientists may disagree with this viewpoint, and declare it limited. All the same, the material of this paper can help address the nagging question as to whether, and when, AI based systems, like self-driving cars, and other robotic entities, can be trusted for routine use.

The matter of certification and/or asserting a law of nature based on evidence alone can be explicated via the archetypal example of Bernoulli trials under almost identical conditions. For the purpose of the next few sections, we take the notion "almost identical" as an undefined primitive, and refrain from any attempt to make it more precise.

1 The Bernoulli Trials Framework
Consider an experiment whose purpose is to certify a product, or to ascertain a law of nature, which entails Bernoulli trials, where the outcome of each trial is a success if the law be true, and a failure otherwise; similarly with the product. Suppose that N such Bernoulli trials are to be performed under similar conditions, and let R be the number of trials leading to a success; 0 ≤ R ≤ N. To assert a law, or to certify a product, we need to think of N as being very large, conceptually infinite. The law is deemed asserted if R = N, but since N is infinite, we are unable to conduct all N trials; thus R will be unknown. To overcome this obstacle, n out of the N trials are chosen at random (i.e. without prejudice) and tested for successes and failures. Suppose that all the n trials result in success. Based on the above, can one declare with certitude that R = N, even if n is very, very large? Because of well documented difficulties with induction, the answer is an emphatic no! Indeed, from a mathematical point of view, one cannot invoke the inductive hypothesis, because n out of n successes does not imply (n+1) out of (n+1) successes. However, under probabilistic induction, one is able to assert that, under certain conditions, with n out of n successes at hand, and n large, P(R = N) ≈ 1, where P denotes probability. In this paper, all probabilities are personal. Strategies for articulating these conditions, and approaches for assessing P(R = N), have been the topic of many investigations; some are highlighted in Section 2.

Our focus is on Bayesian approaches; that is, methods which assign priors and invoke Bayes' formula. There are two general directions in which this has been done. The first is the classical approach of Bayes and Laplace, in which a prior probability is assigned to the various values that R can take. The second is, by now, a commonly used modern approach, whose foundation lies in de Finetti's famous representation theorem on exchangeable sequences.

2 The Bayes-Laplace Approach and Its Variants
Suppose, as a start, that N is finite, so that the unknown R can take the values 0, 1, ..., N. Then, in a sample of size n, the probability of observing T successes is given by the hypergeometric distribution, so that

P(T = n | R) = C(R, n) / C(N, n), for R = n, n+1, ..., N, and P(T = n | R) = 0, for R = 0, 1, ..., n−1,

where C(a, b) denotes the binomial coefficient "a choose b". Our aim is to find P(R = N | T = n), which, by an application of Bayes' Law with P(R = r) as a prior for R, is

P(R = N | T = n) = P(R = N) / Σ_{r=n}^{N} [C(r, n)/C(N, n)] P(R = r).

Thus all that is required to assess the probability that (R = N) is a prior distribution for R. The essential spirit of the Bayes-Laplace approach (and its variants) is the assignment of a prior distribution on observables, like R. Of the several priors on R that have been considered, those given in Sections 2.1-2.4 are noteworthy. Also discussed therein are the consequences of each prior vis a vis their relevance to certification and validation.

The Bayes-Laplace strategy is that of expressing prior indifference among all possible values that R can take, namely, R = 0, 1, ..., N. This is tantamount to a discrete uniform prior on R, namely P(R = r) = (N+1)^(−1), for r = 0, 1, ..., N. Such a prior accords well with the principle of insufficient reason, and is therefore also known as a public prior.

Under this prior, it can be seen that P(R = N | T = n) = (n+1)/(N+1), so that for any fixed and large n, lim_{N→∞} (n+1)/(N+1) = 0. This means that even if all the n observed trials result in success, then irrespective of how large n is, the probability that all future N trials (where N is large) will lead to success is zero! A result such as this goes against the grain of experimental scientists, who, having observed a slew of successes, would balk at the prospect of being told that the probability of all future trials being a success is zero; they would prefer that the answer be something closer to one, if not one itself.

As a reaction to the above concern spawned by the Bayes-Laplace prior, Jeffreys (1961) proposed a prior which places a large point mass at 0 and at N, and spreads the difference over the remaining values of R. Specifically,

P(R = r) = (1−k)/(N+1), for r = 1, ..., N−1, and P(R = r) = (1−k)/(N+1) + k/2, for r = 0, N,

for 0 < k ≤ 1. For k = 0, Jeffreys' prior reduces to the Bayes-Laplace prior. Under the Jeffreys prior, it can be seen that

lim_{N→∞} P(R = N | T = n) = (n+1)k / [(n+1)k + 2(1−k)],

which for any large value of n and k ≠ 0 is not zero. Specifically, for k = 1/2 the above limit is (n+1)/(n+3), which increases in n, and for large n attains one as a limit.

Jeffreys' prior thus yields a result which accords well with the intuition of experimental scientists, but its construction, though ingenious, is ad hoc. An improvement over Jeffreys' prior, vis a vis a faster rate of convergence to one (as a function of n), as well as a justification that is grounded in information theoretic arguments, is a prior proposed by Bernardo (1985).

The essence of Bernardo's prior is a large point mass at R = N only, as opposed to Jeffreys' large point masses at both R = 0 and R = N. Here

P(R = r) = (1−k)/N, for r = 0, 1, ..., N−1, and P(R = N) = k.

Under Bernardo's prior, it can be seen that

lim_{N→∞} P(R = N | T = n) = (n+1)k / [(n+1)k + (1−k)],

and this, for k = 1/2, yields lim_{N→∞} P(R = N | T = n) = (n+1)/(n+2), which converges to one faster than Jeffreys' (n+1)/(n+3) with its comparable k of 1/2 split between R = 0 and R = N. Bernardo's improvement over Jeffreys' prior, via its grounding in information theoretic ideas, is a noteworthy development; see Section 3.2.
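As an illustrative aside (ours, not part of the original development), the following minimal sketch evaluates P(R = N | T = n) for a finite N under the three priors just discussed; the function names and the choices n = 20, N = 10,000, and k = 1/2 are ours, for illustration only.

    # Sketch: numerically check the limiting behaviour of P(R = N | T = n)
    # under the Bayes-Laplace, Jeffreys, and Bernardo priors.
    from math import comb

    def posterior_R_equals_N(prior, n, N):
        """P(R=N | T=n) = prior(N) / sum_{r=n..N} [C(r,n)/C(N,n)] prior(r)."""
        denom = sum(comb(r, n) / comb(N, n) * prior(r) for r in range(n, N + 1))
        return prior(N) / denom

    def bayes_laplace(N):
        return lambda r: 1.0 / (N + 1)

    def jeffreys(N, k=0.5):
        return lambda r: (1 - k) / (N + 1) + (k / 2 if r in (0, N) else 0.0)

    def bernardo(N, k=0.5):
        return lambda r: k if r == N else (1 - k) / N

    n, N = 20, 10_000
    print(posterior_R_equals_N(bayes_laplace(N), n, N))  # (n+1)/(N+1): near zero
    print(posterior_R_equals_N(jeffreys(N), n, N))       # ~ (n+1)/(n+3) = 21/23
    print(posterior_R_equals_N(bernardo(N), n, N))       # ~ (n+1)/(n+2) = 21/22

The numbers agree with the closed forms above: the uniform prior drives the posterior probability toward zero, while the Jeffreys and Bernardo priors keep it near one.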
The matter of certifying a product, as a way of ensuring its future performance, is a topic in reliability and risk analysis that is akin to proving a law of nature. An archetypal example is a piece of pre-tested and debugged software which will now be tested n times, under different inputs, to see if it satisfactorily processes them all. Once it does so successfully, the software is certified and released for actual use. In Singpurwalla and Wilson (2004), a prior distribution for R which satisfies certain criteria germane to efficient certification testing is motivated and proposed. Whereas the details of this prior are cumbersome to present, Figure 1 below illustrates its general character. There are finite point masses at R = 0 and R = N, with q ∈ (0, 1), k > 0, and λ > 0 as parameters. The parameter q, together with a function f(x) = kx, for 0 < x < ∞, controls the rate of exponential decay (and revival) over R = 0, 1, 2, ..., N, and λ controls the sizes of the point masses at R = 0 and R = N. This prior is not unlike Jeffreys' prior, save for the exponential decay and revival between the point masses.

For k = 5, λ = 3, and q = 0.5, Figure 2 illustrates the behavior of lim_{N→∞} P(R = N | T = n), as n → ∞.

A virtue of this prior is that it leads to an assertion of an item's high probability of not failing after only a few successful trials, and this probability monotonically increases to one with additional successful tests. A prior such as this will be germane to testing for certification when the item in question has been previously vetted, and most of its flaws eliminated. Like the Jeffreys and Bernardo priors, this prior also leads to results that accord with the intuition of scientists.

A noteworthy feature of the four classes of priors discussed above is that there is a non-zero probability assigned to all the values that R can take. In the contexts considered, an assignment like this is understandable.

3 Exchangeability and Priors on Propensity

We stated that the notion of "almost identical trials" is to be taken as an undefined primitive. In Sections 1 and 2, we leaned on this primitive, and, following the Bayes-Laplace prescription of assigning prior distributions on observable quantities, have proposed several classes of priors on R. Recall that R, being the unknown number of successes in N trials, is, in principle, observable, and thus any personal probability assignment on R, when viewed as a 2-sided bet, can eventually be settled.

An approach alternate to that of Section 2, to be described in this section, has as its foundation de Finetti's (1974) notion of exchangeability. This notion endeavors to make more precise one's sense of almost identical trials. Exchangeability is a judgment which implies permutation invariance. When an infinite sequence of Bernoulli trials X_1, X_2, ..., is declared exchangeable, de Finetti's famous representation theorem comes into play.
It claims that for every n ≥ 1, and every subset X_1, ..., X_n of X_1, X_2, ..., with X_i = 1 (0) if the i-th trial is a success (failure):

P(X_1 = x_1, ..., X_n = x_n) = ∫_0^1 ∏_{i=1}^{n} P(X_i = x_i | p) Π(p) dp = ∫_0^1 p^(Σ x_i) (1−p)^(n − Σ x_i) Π(p) dp.

Here P and Π denote personal probabilities of the X_i's, i = 1, ..., n, and of a fictional unknown quantity [cf. De Groot (1975), p. 135], p, respectively; 0 ≤ p ≤ 1. Statisticians refer to p as a Bernoulli parameter, whereas Good (1965) calls p a physical (or objective) probability; de Finetti refers to p as a chance. None of these labels provide any sense of what p means. The quantity Π(p) is one's personal prior probability distribution of p.

It is important to remark that under a personalistic theory of probability, p should not be interpreted as a personal probability. Doing so would be tantamount to looking at Π(p) as a personal probability of a personal probability, and this cannot make Π(p) operational [see Marschak (1975), p. 121-153, for a discussion]. We find it useful to look at p as a propensity (see Section 3.1), in the sense of Popper (1959), and in a manner akin to Kolmogorov (1969).

In specifying Π(p) one is assigning a personal probability to an "unobservable fictional quantity" p, and this would go against the Mach-Einstein dictum of operationalization, because Π(p), as a 2-sided bet on any value of p ∈ [0, 1], can never be settled. However, an entity like Π(p), as a mainstay of applied Bayesian statistics, is so commonly used that it warrants comment. Specifically, Π(p) enables one to automate an application of Bayes' Law in the light of new information, as an enforcer of coherence, and this in turn enables one to place bets on observables via their predictive probabilities.

3.1 The Notion of Propensity

The notion of a propensity was introduced by Karl Popper (1959) in connection with his attempts to interpret quantum theory. However, propensity is also implicit in the writings of de Finetti (1974), Kolmogorov (1969), and Kendall (1949). By propensity is meant a physical tendency for the occurrence, or not, of a certain event, say success, under repeated incidences of its possible occurrence. An archetypal example is the occurrence of heads in infinite tosses of a coin under almost identical conditions. Since indefinitely tossing a coin under almost identical conditions is metaphysical, propensities are not observable, therefore not measurable, and thus not actionable. Indeed, propensities are not probabilities [cf. Humphreys (1985)], and certainly not personal probabilities. They encapsulate an interaction between the outcome of a random phenomenon and the conditions under which the phenomenon occurs [cf. Kolmogorov (1969)]. Observed relative frequencies, being the outcome of a finite number of occurrences, are not propensities either. Relative frequencies are viewed as manifestations of propensities. Propensities are unobservable abstractions useful for obtaining predictive distributions via Bayes' Law, once they are endowed with personal probabilities.

To the best of our knowledge, the term "propensity" has appeared in at least three different contexts. The first is in the context of quantum theory mentioned above; the second in educational testing and the social sciences, wherein "propensity scores" are used for assigning treatment effects. A propensity score is, de facto, a conditional probability, and thus has no bearing on what we consider here. The third context in which the term propensity has appeared is that of the chemical Langevin equation [cf. Gillespie (2000)], and comes closest in spirit to what we think of here.

In the context of exchangeable Bernoulli trials, an extension of de Finetti's theorem exhibits the propensity p as lim_{n→∞} (1/n) Σ_{i=1}^{n} X_i. Since this limit cannot be known, de Finetti endows p with a prior distribution Π(p), and in so doing provides a foundation for Bayesian inference under Bernoulli trials. In what follows, we first discuss a commonly used choice for Π(p), and then introduce a new choice which is a scale transform of the commonly used choice. This new choice could be more suitable for the contexts considered in this paper.
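Before turning to specific choices for Π(p), a small simulation sketch (ours, not the authors') may help fix ideas: draw a propensity p from a prior, generate exchangeable Bernoulli trials given p, and watch the observed relative frequency, a manifestation of the propensity, settle near the drawn value. The Beta(2, 2) prior and the seed are arbitrary choices.

    # Sketch: relative frequencies as manifestations of a latent propensity.
    import random

    random.seed(7)
    p = random.betavariate(2.0, 2.0)  # the fictional propensity, drawn from Pi(p)
    flips = [1 if random.random() < p else 0 for _ in range(100_000)]

    for n in (10, 100, 10_000, 100_000):
        print(n, sum(flips[:n]) / n)  # relative frequency after n trials
    print("drawn propensity p:", round(p, 4))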
3.2 The Beta Family of Priors on Propensity

Since 0 ≤ p ≤ 1, a natural choice for Π(p) would be a standard beta distribution with constants (parameters) α > 0 and β > 0. Different choices for α and β enable one to represent different shapes for Π(p), and thus different judgments about the propensity p. For example, α = β = 1 makes Π(p) uniform over p, so that Π(p) can be seen as an analogue of the Bayes-Laplace public prior on R. Identical values of α > 1 and β > 1 make Π(p) symmetric around p = 1/2, whereas unequal values α < (>) β make Π(p) skewed to the right (left). To encapsulate the prior opinion that values of p are closer to one than to zero, one would set α > β; similarly with α < β.

Π(p) will be L-shaped when α < 1 and β > 1; it is J-shaped when α > 1 and β < 1; and it is U-shaped when α < 1 and β < 1, with infinite probability densities at p = 0 and/or p = 1. However, when α < 1 and β = 1, Π(p) is L-shaped with a positive probability density at p = 1; with α = 1 and β < 1, Π(p) is J-shaped with a positive probability density at p = 0; see Figures 3 and 4.

Of the above choices, the ones which appear to be the closest in terms of relevance for the scenarios of asserting a law, or certifying a product, are the U-shaped Π(p) with α < 1 and β < 1, the L-shaped prior with β = 1 and α < 1, and the J-shaped prior with α = 1 and β < 1. Of these, the L-shaped prior with a finite positive probability density at p = 1 seems most promising.
All of the above priors cover the entire [0, 1] range of p, and there could be scenarios wherein subjective opinion is such that one need focus only on a segment of this range, say the range [ω, 1] for some ω > 0. This matter is taken up later in Section 3.4.

For now, it is well known that when Π(p) has a beta distribution with parameters α, β > 1, then, having observed n successes in n Bernoulli trials, the predictive probability that N future trials will yield N successes is given by the beta-binomial [cf. Singpurwalla (2006), p. 158] as

B(α + n, β) / B(α + n + N, β), where B(a, b) = Γ(a + b) / (Γ(a) Γ(b)).

It can be seen that for any fixed n, no matter how large, the limit of the above probability, as N → ∞, is zero. This conclusion is analogous to that of Bayes and Laplace when a public prior is assigned to R; it too will not accord with the intuition of experimenters.
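To make the limiting behavior concrete, the following sketch (ours) evaluates the beta-binomial expression above via log-gamma functions; the parameter choices α = β = 2 and n = 50 are illustrative.

    # Sketch: the beta-binomial predictive probability of N future successes,
    # using the paper's convention B(a, b) = Gamma(a+b) / (Gamma(a) Gamma(b)).
    from math import exp, lgamma

    def log_B(a, b):
        return lgamma(a + b) - lgamma(a) - lgamma(b)

    def predictive(alpha, beta, n, N):
        return exp(log_B(alpha + n, beta) - log_B(alpha + n + N, beta))

    n = 50
    for N in (10, 1_000, 100_000):
        print(N, predictive(2.0, 2.0, n, N))
    # 0.705..., 0.0024..., ~2.8e-07: for fixed n, the probability tends to zero
    # as N grows, no matter how large n is.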
Alternatively, suppose that the prior Π(p) is L-shaped with a positive density at p = 1; this is obtained via a beta distribution with parameters α < 1 and β = 1. Specifically, for 0 ≤ p ≤ 1,

Π(p) = [Γ(α + 1)/(Γ(α) Γ(1))] p^(α−1) = α p^(α−1) = α / p^(1−α).

As can be seen from Figure 3, this prior emphasizes small values of p, even though there is no probability mass at p = 0.

Figure 3: L-Shaped Prior for Propensity p (α < 1, β = 1)

Under the above prior, if all n trials result in successes, then via routine calculations it can be seen that the posterior distribution of p is of the form (n + α) p^(n+α−1), which is a beta distribution with parameters (n + α) and 1. The effect of testing is to increase the density of p at p = 1 from its original value α to (α + n). A consequence of this posterior of p is that the predictive probability of observing all N successes in N future trials is given by the beta-binomial distribution as (α + n)/(α + n + N), which as N → ∞ goes to zero; see the Appendix for details. Here again, this result would be against the grain of experimentalists, making a prior such as this unsuitable for certification and validation.

What if the prior chosen is J-shaped, with a finite density at p = 0 and an infinite probability density at p = 1? See Figure 4. This prior, which is similar in spirit to that of Bernardo's, can be obtained via a beta distribution with parameters α = 1 and β < 1; that is, Π(p) = β (1−p)^(β−1) = β / (1−p)^(1−β).

Figure 4: J-Shaped Prior for Propensity p (α = 1, β < 1)

Under the above prior, if all n trials result in successes, then the posterior distribution of p is, for some normalizing constant C, of the form C p^n (1−p)^(β−1); this is a beta distribution with parameters (n + 1) and β. Consequently, the predictive probability of observing N successes in N future trials is proportional to B(n + 1, β) / B(n + 1 + N, β).
Because 0 < β < 1, the limit of this probability, as N → ∞, is zero for small to moderate n, whereas it approaches one as n becomes large. As is shown in the Appendix, the predictive probability is the product of N terms, each of the form (n + j)/(n + j + β), j = 1, ..., N, which for any β ∈ (0, 1) is a number less than one when n is small or moderate, and is approximately equal to one when both n and N are large. Figure 5 illustrates the behavior of the predictive probability for a fixed β < 1 and N = 10, as a function of n. This behavior of the predictive probability better encapsulates the intuition of experimentalists than the other predictive probabilities discussed heretofore, because one's certitude about a law should increase as the number of successful test cases increases.

Thus, like the priors of Jeffreys, Bernardo, and the portmanteau prior of Sections 2.2, 2.3, and 2.4, a prior such as this is viable for certification and validation, provided that n is large.

Figure 5: Predictive Probability Under a J-Shaped Prior (N = 10)
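The claimed behavior is easy to check numerically; the sketch below (ours) evaluates the product form of B(n + 1, β)/B(n + N + 1, β), with β = 0.5 and N = 10 as illustrative choices.

    # Sketch: the J-shaped prior's predictive probability as a product of
    # N terms (n+j)/(n+j+beta), j = 1..N.
    def predictive_product(n, N, beta):
        out = 1.0
        for j in range(1, N + 1):
            out *= (n + j) / (n + j + beta)
        return out

    for n in (10, 100, 1_000, 10_000):
        print(n, predictive_product(n, N=10, beta=0.5))
    # The values climb toward one as n grows, roughly like (n/(n+N))**beta.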
3.3 Scale Transformed and Reflected Beta Priors

The beta family of prior distributions considered in Section 3.2 covers the entire range of 0 to 1 for values of the propensity p. In many applications, especially those pertaining to system certification, this could be seen as being restrictive, because the performance of systems improves over time as a consequence of the process of debugging and successive fixes. In such circumstances, a prior on p over the range [ω, 1], for some value of ω > 0, may be judged more appropriate. One way to achieve this form of left truncation is via a transformation of the beta distribution of p, by scaling it by the factor e^(−η), for some η ≥ 0, and then taking its reflection. Specifically, we let p = 1 − e^(−η) z, where η ≥ 0, with z having a beta distribution with parameters α > 0 and β > 0. When η = 0, the distribution of p will be a standard beta over [0, 1]; for η > 0, the range of values of p will be restricted to the left, so that the distribution of p will concentrate its mass more and more towards one. See Figure 6, wherein α = 3, β = 2, and three increasing values of η are used; the support of the distribution of p decreases with increasing values of η.

Figure 6: Plots of the Left Truncated Beta Distribution of p (α = 3, β = 2)
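To visualize the construction, the following sketch (ours) samples p = 1 − e^(−η) z with z ~ Beta(3, 2), the parameters used in Figure 6; the η values 0, 0.5, and 1 are our illustrative choices.

    # Sketch: the scale-and-reflect construction p = 1 - exp(-eta)*z, whose
    # support [1 - exp(-eta), 1] shrinks toward one as eta increases.
    import random
    from math import exp
    from statistics import mean

    random.seed(11)
    for eta in (0.0, 0.5, 1.0):
        ps = [1.0 - exp(-eta) * random.betavariate(3.0, 2.0) for _ in range(50_000)]
        print(f"eta={eta}: support starts at {1 - exp(-eta):.3f}, "
              f"min={min(ps):.3f}, mean={mean(ps):.3f}")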
Verify that, for c = e^(−η) with η > 0, the prior distribution of p is

Π(p) = [Γ(α + β)/(Γ(α) Γ(β))] (1/c)^(α+β−1) (1−p)^(α−1) (p − (1−c))^(β−1), 1−c ≤ p ≤ 1,

a shifted beta distribution with parameters α and β, over (1−c) and 1. Because this distribution does not have a probability mass at p = 1, it will be unsuitable for the purpose of certification and validation. The distribution is meritorious all the same, because it has been used by us as a model for the posterior distribution of the threshold 1−c; see Section 3.4.1.

Were one to set α = 1 and β < 1, so that the distribution is better aligned with the J-shaped distribution of Figure 4, the prior distribution of p would simplify as

Π(p) = β (1/c)^β (p − (1−c))^(β−1), (1−c) ≤ p ≤ 1.

Figure 7: L-Shaped Prior by Transformation and Reflection (α = 1, β < 1, and c = e^(−η))

Whereas this prior, which is analogous to the L-shaped prior of Figure 3, has a probability density of β/c at p = 1, it will also have an infinite probability density at p = 1−c, making it unsuitable for use in certification and validation. Indeed, with n successes out of n trials, the posterior distribution of p will continue to be a beta distribution with parameters (n + 1) and β over (1−c) and 1. This distribution will have a probability density of

[Γ(n + 1 + β)/(Γ(n + 1) Γ(β))] c^(β−1), at p = 1,

which, even for large n, is only slightly greater than β/c, for β < 1, suggesting an insensitivity of the distribution at p = 1 to the observed success data.

Based on the above, as well as the illustrations of Figure 6, it appears that reflections of scale transformed beta distributions are inappropriate for the task of certification and validation. We therefore consider in Section 3.4 transformations entailing location shifts of suitable beta distributions. The illustrations of Figures 6 and 7 are presented here for the sake of completeness, though Figure 6 is germane to the material of Section 3.4.1.
3.4 Left Truncated Beta Family

Suppose that the prior distribution of p is a location shifted beta distribution with parameters α = 1 and β < 1, over the range [ω, 1], 0 < ω < 1, where ω is analogous to the parameter (1−c) of Figure 7. Verify that the probability density of p is of the form

Π(p) = [β/(1−ω)^β] (1−p)^(β−1), for ω ≤ p ≤ 1.

This distribution places a probability density of β/(1−ω) at p = ω, and an infinite probability density at p = 1; see Figure 8.

Figure 8: Location Shifted Beta (α = 1, β < 1, and ω > 0)

If all n out of n Bernoulli trials result in a success, then the posterior distribution of p is also a beta distribution, with parameters (n + 1) and β, over the range [ω, 1]; for ω specified,

Π(p; n) = [Γ(n + β + 1)/(Γ(n + 1) Γ(β))] (p − ω)^n (1−p)^(β−1) / (1−ω)^(n+β), ω ≤ p ≤ 1.

This posterior distribution has an infinite probability density at p = 1, but here the (prior) probability density of β/(1−ω) at p = ω vanishes. This latter feature is to be expected, since none of the n Bernoulli trials have experienced a failure.

It is easy to verify that the predictive probability of observing all N successes in N future trials is given as B(n + 1, β)/[B(n + N + 1, β) (1−ω)^(n+β)], which, when n is moderate to large and ω, β < 1, converges to one as N → ∞. This result is analogous to that associated with the prior of Figure 4, save for the extra (1−ω)^(n+β) term in the denominator. The effect of this extra denominator term is to enable a faster rate of convergence to one as compared to that given by B(n + 1, β)/B(n + N + 1, β) alone.
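As an illustrative check (ours, not from the paper), the sketch below computes the predictive probability E[p^N] exactly under the posterior just displayed, by expanding (ω + (1−ω)t)^N with t ~ Beta(n+1, β); this sidesteps the closed form quoted above. The choices β = 0.5, N = 50, and the values of ω and n are ours.

    # Sketch: predictive probability E[p^N] when p = omega + (1-omega)*t,
    # with t ~ Beta(n+1, beta); E[t^k] = prod_{j=1..k} (n+j)/(n+j+beta).
    from math import comb

    def t_moment(n, beta, k):
        out = 1.0
        for j in range(1, k + 1):
            out *= (n + j) / (n + j + beta)
        return out

    def predictive_shifted(n, N, beta, omega):
        # binomial expansion of E[(omega + (1-omega)*t)^N]
        return sum(comb(N, k) * omega**(N - k) * (1 - omega)**k * t_moment(n, beta, k)
                   for k in range(N + 1))

    for omega in (0.0, 0.9):
        for n in (10, 100, 1_000):
            print(omega, n, round(predictive_shifted(n, N=50, beta=0.5, omega=omega), 4))
    # omega = 0 recovers the J-shaped product form; omega > 0 pushes the
    # predictive probability toward one markedly faster in n.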
The location-shifted prior of this section is therefore suitable for asserting certification and validation, and more so than that associated with the prior of Figure 4, because of its faster rate of convergence. Provided that one is able to specify ω in a meaningful fashion, the prior of Figure 8 seems to be the most rewarding of the priors on propensity that we have considered thus far. However, specifying a precise value for ω would be enigmatic. A challenge therefore is how best to specify a prior on ω, and, based on n out of n successful Bernoulli trials, how to induce its posterior. This matter is taken up below.

We start by recalling that, given ω, ω ∈ [0, 1], the posterior distribution of p, having observed n out of n successes, is:

Π(p | ω; (n out of n successes)) = [Γ(n + β + 1)/(Γ(n + 1) Γ(β))] (p − ω)^n (1−p)^(β−1) / (1−ω)^(n+β), ω ≤ p ≤ 1.

Were we to average out this posterior distribution with respect to a posterior distribution of ω (in the light of n out of n successes), we would obtain the posterior of p unconditional on ω. However, obtaining a posterior of ω via the usual Bayesian prior to posterior iteration poses a difficulty when one does not have a probability model for inducing a likelihood function for ω. All the same, there is nothing within the Bayesian paradigm which mandates that the only way to obtain posterior distributions is via a legislated application of Bayes' Law. Indeed, the philosopher Richard Jeffrey has proposed an alternative to Bayes' Law [cf. Shafer (1981), Diaconis and Zabell (1982)], called Jeffrey's Rule of Conditioning. Furthermore, since the prior is to be subjectively specified, and so can the likelihood be [were one not to subscribe to the philosophical principle of conditionalization, cf. Williams (1980)], one is also at liberty to subjectively specify a posterior distribution, subject to the usual constraints of coherence. This is what we choose to do, and the material below indicates a possible approach.
3.4.1 Specifying a Subjective Posterior for ω

To facilitate a subjective specification of the posterior distribution of ω which is coherent, we lean on the plots of Figure 6, each appropriate to a value of n successes (out of n trials). These plots encapsulate the feature that their thresholds move towards one as n increases. The functional form describing these plots is taken to be the posterior distribution of ω, which is a shifted beta distribution over (1−c) and 1, with parameters a and b. Specifically, we suppose that

Π(ω; a, b, c) = [Γ(a + b)/(Γ(a) Γ(b))] (1/c)^(a+b−1) (1−ω)^(a−1) (ω − (1−c))^(b−1), 1−c ≤ ω ≤ 1.

In Table 1, we indicate possible choices of a, b, and c for different values of n, assuming all n successes. The entries of Table 1 are by no means unique.
    n       a    b    c
    1       1    1    1
    2       2    2    1
    5       2    3    1
    10      3    4    0.95
    25      3    4    0.9
    50      3    4    0.75
    75      4    5    0.5
    100     4    5    0.25
    1000    4    5    0.05

Table 1: Posterior Distribution of ω as a Function of n

Averaging out ω, with respect to the above posterior distribution of ω, in Π(p | ω; n) given before, gives us the posterior distribution of p, unconditional on ω, as:

Π(p; n) = [Γ(n + β + 1)/(Γ(n + 1) Γ(β))] [Γ(a + b)/(Γ(a) Γ(b))] (1/c)^(a+b−1) (1−p)^(β−1) ∫_{1−c}^{p} (p − ω)^n (ω − (1−c))^(b−1) / (1−ω)^(n+β+1−a) dω, 1−c ≤ p ≤ 1,

an expression difficult to evaluate in closed form. Its numerical evaluation, however, is straightforward.

Whereas knowing the ω-averaged posterior distribution of p is of limited interest, the more relevant entity to know is the predictive probability of N out of N successes in N future trials. With ω specified, this probability is given as B(n + 1, β)/[B(n + N + 1, β) (1−ω)^(n+β)]. Averaging out ω with respect to its posterior distribution gives the unconditional predictive probability as:

[B(n + 1, β)/B(n + N + 1, β)] [Γ(a + b)/(Γ(a) Γ(b))] (1/c)^(a+b−1) ∫_{1−c}^{1} (ω − (1−c))^(b−1) / (1−ω)^(n+β+1−a) dω.

For n large, the integral term simplifies to ∫_{1−c}^{1} (ω − (1−c))^(b−1) (1−ω)^(−n) dω, and this can be numerically assessed.
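One possible route to the numerical assessment (ours, not the authors') is Monte Carlo: draw ω from Π(ω; a, b, c) by setting ω = 1 − c·u with u ~ Beta(a, b), draw p from its conditional posterior by setting p = ω + (1−ω)t with t ~ Beta(n+1, β), and average p^N. The parameter values below mirror rows of Table 1; β = 0.5 and N = 50 are illustrative.

    # Sketch: Monte Carlo evaluation of the unconditional predictive
    # probability of N future successes, averaging over the subjective
    # posterior of omega and the conditional posterior of p.
    import random
    from statistics import mean

    def sample_omega(a, b, c):
        # If u ~ Beta(a, b), then omega = 1 - c*u has the shifted beta
        # density Pi(omega; a, b, c) on [1-c, 1].
        return 1.0 - c * random.betavariate(a, b)

    def sample_p(omega, n, beta):
        # Posterior of p given omega: shifted Beta(n+1, beta) on [omega, 1].
        return omega + (1.0 - omega) * random.betavariate(n + 1, beta)

    def predictive(n, N, beta, a, b, c, draws=100_000):
        return mean(sample_p(sample_omega(a, b, c), n, beta) ** N
                    for _ in range(draws))

    random.seed(42)
    print(predictive(n=100,  N=50, beta=0.5, a=4, b=5, c=0.25))
    print(predictive(n=1000, N=50, beta=0.5, a=4, b=5, c=0.05))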
4 Summary and Conclusions

The aim of this paper is to articulate the conditions and circumstances in which observing a slew of successes, and no failures, enables one to assert a law of nature, or to claim the certification of an entity. These circumstances pertain to the characteristics of one's prior opinion, assuming that it is honest and genuine, and how it is expressed. One cannot assert certitude if prior opinion is of the "objective" type, or is encapsulated via distributions for which the predictive probability converges to zero. There do exist priors for which the predictive probability goes to one, for moderate, or even small, values of the number of items tested. If such priors reflect true opinion, then their use could be beneficial, costwise. In an adversarial circumstance, a use of such priors can be subject to a challenge.

It is not the purpose, nor the intent, of this paper to show, or to make the case, that by manipulating priors one can get what one wants, namely, to obtain certitude and claim certification. Rather, it is to make the case that by choosing certain priors one can be assured of getting what one is entitled to get, because, in observing a large sequence of successes and no failures, a user of certain kinds of priors could face the dilemma of a reductio ad absurdum.

Appendix (Convergence of Posterior Predictive Probabilities)
1. For the L-shaped prior, the predictive probability that all N future trials will lead to success is

B(α + n, 1)/B(α + n + N, 1) = [Γ(α + n + 1)/Γ(α + n)] · [Γ(α + n + N)/Γ(α + n + N + 1)] = (α + n)/(α + n + N),

which tends to 0 as N → ∞.

2. For the J-shaped prior, the predictive probability that all N future trials will lead to success is proportional to

B(n + 1, β)/B(n + N + 1, β) = [Γ(n + 1 + β)/(Γ(n + 1) Γ(β))] · [Γ(n + N + 1) Γ(β)/Γ(n + N + 1 + β)] = [Γ(n + N + 1)/Γ(n + 1)] · [Γ(n + 1 + β)/Γ(n + N + 1 + β)] = ∏_{j=1}^{N} (n + j)/(n + j + β),

which for any β < 1 is a product of N terms, each of which is close to one when n gets large. The product behaves approximately as (n/(n + N))^β, and so tends to 1 as n → ∞.

Acknowledgements

The idea of scale transforming a beta distributed variable to induce a threshold was suggested to the first author by Professor N. Balakrishnan of McMaster University. Whereas this idea did not fly in its intended spirit, it proved valuable for the subjective specification of posterior distributions. We thank Prof. Balakrishnan for this suggestion. The work reported here was supported by a grant from the City University of Hong Kong, Project Number 9380068, and by the Research Grants Council Theme-Based Research Scheme Grants T32-102/14N and T32-101/15R.

References

[1] Bernardo, J. M. (1985). "On a Famous Problem of Induction." Trabajos de Estadistica, 36(1), 24-30.
[2] De Finetti, B. (1974). "Theory of Probability: A Critical Introductory Treatment." Vol. 2. Wiley, London and New York.

[3] De Groot, M. (1975). "In Personal Probabilities of Probabilities." Theory and Decision, 6(2), p. 135.

[4] Diaconis, P., and Zabell, S. L. (1982). "Updating Subjective Probability." Journal of the American Statistical Association, 77(380), 822-830.

[5] Gillespie, D. T. (2000). "The Chemical Langevin Equation." The Journal of Chemical Physics, 113(1), 297-306.

[6] Good, I. J. (1965). "The Estimation of Probabilities: An Essay on Modern Bayesian Methods." M.I.T. Press, Cambridge, MA.

[7] Humphreys, P. (1985). "Why Propensities Cannot Be Probabilities." Philosophical Review, 94, 557-570.

[8] Jeffreys, H. (1961). "Theory of Probability." Third Edition. Clarendon Press, Oxford, England.

[9] Ke, Q., Ferrara, E., Radicchi, F., and Flammini, A. (2015). "Defining and Identifying Sleeping Beauties in Science." Proceedings of the National Academy of Sciences, 112(24), 7426-7431.

[10] Kendall, M. G. (1949). "On the Reconciliation of Theories of Probability." Biometrika, 36, 101-116.

[11] Kolmogorov, A. N. (1969). "The Theory of Probability." In Mathematics: Its Content, Methods, and Meaning, Vol. 2, Part 3. A. D. Aleksandrov, A. N. Kolmogorov, and M. A. Lavrentev, Eds. M.I.T. Press, Cambridge, MA.

[12] Marschak, J. (1975). "Do Personal Probabilities of Probabilities Have an Operational Meaning? In Personal Probabilities of Probabilities." Theory and Decision, 6(2), 127-133.

[13] Pearson, K. (1920). "The Fundamental Problem of Practical Statistics." Biometrika, 13(1), 1-16.

[14] Popper, K. R. (1959). "The Propensity Interpretation of Probability." The British Journal for the Philosophy of Science, 10(37), 25-42.

[15] Shafer, G. (1981). "Jeffrey's Rule of Conditioning." Philosophy of Science, 48(3), 337-362.

[16] Singpurwalla, N. D., and Wilson, P. (2004). "When Can Finite Testing Ensure Infinite Trustworthiness?" Journal of the Iranian Statistical Society, 3, 1-37.

[17] Singpurwalla, N. D. (2006). "Reliability and Risk: A Bayesian Perspective." John Wiley & Sons, New York.

[18] Williams, P. M. (1980). "Bayesian Conditionalization and the Principle of Minimum Information." British Journal for the Philosophy of Science, 31, 131-144.

[19] Wilson, E. B. (1927). "Probable Inference, the Law of Succession, and Statistical Inference." Journal of the American Statistical Association, 22(158), 209-212.

[20] Zabell, S. (1989). "The Rule of Succession." Erkenntnis, 31, 283-321.