Bernoulli Trials With Skewed Propensities for Certification and Validation
Nozer D. Singpurwalla, The George Washington University, Washington, D.C.
Boya Lai, The City University of Hong Kong, Hong Kong

January 2020
Abstract
The impetus for writing this paper is the well publicized media reports that software failure was the cause of the two recent mishaps of the Boeing 737 Max aircraft. The problem considered here, though, is a specific one, in the sense that it endeavors to address the general matter of the conditions under which an item such as a drug, a material specimen, or a complex system can be certified for use based on a large number of Bernoulli trials, all successful. More broadly, the paper is an attempt to answer the old and honorable philosophical question, namely, "when can empirical testing on its own validate a law of nature?" Our message is that the answer depends on what one starts with, namely, what is one's prior distribution, what unknown does this prior distribution endow, and what has been observed as data.

The paper is expository in that it begins with a historical overview, and ends with some new ideas and proposals for addressing the question posed. In the sequel, it also articulates Popper's notion of "propensity" and its role in providing a proper framework for Bayesian inference under Bernoulli trials, as well as the need to engage with posterior distributions that are subjectively specified; that is, without recourse to the usual Bayesian prior-to-posterior iteration.
Keywords:
Bayes-Laplace Priors, Jeffreys' Prior, Probabilistic Induction, Exchangeability, Drug Approval, Subjective Posteriors, Machine Learning

Preamble: Perspective
Despite the science and engineering oriented implication of the title of this paper, its essential contribution is to address the general philosophical question as to whether empirical evidence on its own can ever prove a law of nature. The related practical question is when can a complex system, like computer software, a new drug, or an autonomous system, be certified for use based on testing alone. Karl Pearson (1920) characterized this matter as a "Fundamental Problem of Practical Statistics"; his foresight continues to hold almost 100 years later. Later, in 1927, E.B. Wilson wrote an article alluding to this matter, which the June 2015 issue of the Proceedings of the (US) National Academy of Sciences has labelled
"A Sleeping Beauty in Science" [cf. Ke et al. (2015)]. Whereas the problem considered is as ancient as Hume, Newton, and Kant, it was Bayes in 1763, followed by Laplace in 1774, who proposed probabilistic induction as a definitive response. Indeed, this was one of their signal achievements. As a matter of perspective, Bayes' Law turns out to be merely a methodological by-product of the bigger philosophical "problem of induction", of Thomas Hobbes (1588-1679), that Bayes and Laplace were trying to address. In effect, probabilistic induction can be seen as a compromise between objections to the principles of induction and of deduction. An interested reader may also want to see Zabell (1989) for a more philosophical as well as a detailed technical discourse on the problem.

With the advent of Data Science and Machine Learning, one is tempted to re-visit the broader question and ask, "Can Big Data and Machine Learning, on their own, certify a complex system, or ever prove a Law of Nature?" Were we to adopt a Bayes-Laplace like disposition, then we would arrive upon the viewpoint that data science and machine learning are the modern day enablers of probabilistic induction. Computer scientists may disagree with this viewpoint, and declare it limited. All the same, the material of this paper can help address the nagging question as to whether, and when, AI based systems, like self-driving cars, and other robotic entities, can be trusted for routine use.

The matter of certification and/or asserting a law of nature based on evidence alone can be explicated via the archetypal example of Bernoulli trials under almost identical conditions. For the purpose of the next few sections, we take the notion "almost identical" as an undefined primitive, and refrain from any attempt to make it more precise.

1 The Bernoulli Trials Framework
Consider an experiment whose purpose is to certify a product, or to ascertain a law of nature, which entails Bernoulli trials, where the outcome of each trial is a success if the law be true, and a failure otherwise; similarly with the product. Suppose that N such Bernoulli trials are to be performed under similar conditions, and let R be the number of trials leading to a success; 0 ≤ R ≤ N. To assert a law, or to certify a product, we need to think of N as being very large, conceptually infinite. The law is deemed asserted if R = N, but since N is infinite, we are unable to conduct all N trials; thus R will be unknown. To overcome this obstacle, n out of the N trials are chosen at random (i.e. without prejudice) and tested for successes and failures. Suppose that all the n trials result in success. Based on the above, can one declare with certitude that R = N, even if n is very, very large? Because of well documented difficulties with induction, the answer is an emphatic no! Indeed, from a mathematical point of view, one cannot invoke the inductive hypothesis, because n out of n successes does not imply (n+1) out of (n+1) successes. However, under probabilistic induction, one is able to assert that, under certain conditions, with n out of n successes at hand, and n large, P(R = N) ≈ 1, where P denotes probability. In this paper, all probabilities are personal. Strategies for articulating these conditions, and approaches for assessing P(R = N), have been the topic of many investigations; some are highlighted in Section 2.

Our focus is on Bayesian approaches; that is, methods which assign priors and invoke Bayes' formula. There are two general directions in which this has been done. The first is the classical approach of Bayes and Laplace, in which a prior probability is assigned to the various values that R can take. The second is, by now, a commonly used modern approach, whose foundation lies in de Finetti's famous representation theorem on exchangeable sequences.

2 The Bayes-Laplace Approach and Its Variants
Suppose, as a start, that N is finite, so that the unknown R can take the values 0, 1, ..., N. Then, in a sample of size n, the probability of observing T successes is given by the hypergeometric distribution, so that

P(T = n | R) = C(R, n) / C(N, n), for R = n, n+1, ..., N, and P(T = n | R) = 0, for R = 0, 1, ..., n−1,

where C(a, b) denotes the binomial coefficient "a choose b". Our aim is to find P(R = N | T = n), which, by an application of Bayes' Law with P(R = r) as a prior for R, is

P(R = N | T = n) = P(R = N) / Σ_{r=n}^{N} [C(r, n)/C(N, n)] P(R = r).

Thus all that is required to assess the probability that (R = N) is a prior distribution for R. The essential spirit of the Bayes-Laplace approach (and its variants) is the assignment of a prior distribution on observables, like R. Of the several priors on R that have been considered, those given in Sections 2.1-2.4 are noteworthy. Also discussed therein are the consequences of each prior vis a vis their relevance to certification and validation.

The Bayes-Laplace strategy is that of expressing prior indifference among all possible values that R can take, namely, R = 0, 1, ..., N. This is tantamount to a discrete uniform prior on R, namely P(R = r) = (N+1)^(−1), for r = 0, 1, ..., N. Such a prior accords well with the principle of insufficient reason, and is therefore also known as a public prior.

Under this prior, it can be seen that P(R = N | T = n) = (n+1)/(N+1), so that for any fixed and large n, lim_{N→∞} (n+1)/(N+1) = 0. This means that even if all the n observed trials result in success, then irrespective of how large n is, the probability that all future N trials (where N is large) will lead to success is zero! A result such as this goes against the grain of experimental scientists, who, having observed a slew of successes, would balk at the prospect of being told that the probability of all future trials being a success is zero; they would prefer that the answer be something closer to one, if not one itself.

As a reaction to the above concern spawned by the Bayes-Laplace prior, Jeffreys (1961) proposed a prior which places a large point mass at 0 and at N, and spreads the difference over the remaining values of R. Specifically,

P(R = r) = (1−k)/(N+1), for r = 1, ..., N−1, and P(R = r) = (1−k)/(N+1) + k/2, for r = 0, N,

for 0 < k ≤ 1. For k = 0, Jeffreys' prior reduces to the Bayes-Laplace prior. Under the Jeffreys prior, it can be seen that

lim_{N→∞} P(R = N | T = n) = (n+1)k / [(n+1)k + 2(1−k)],

which for any large value of n and k ≠ 0 is not zero. Specifically, for k = 1/2 the above limit is (n+1)/(n+3), which increases in n, and for large n attains one as a limit.

Jeffreys' prior thus yields a result which accords well with the intuition of experimental scientists, but its construction, though ingenious, is ad hoc. An improvement over Jeffreys' prior, vis a vis a faster rate of convergence to one (as a function of n), as well as a justification that is grounded in information theoretic arguments, is a prior proposed by Bernardo (1985).

The essence of Bernardo's prior is a large point mass at R = N only, as opposed to Jeffreys' large point masses at both R = 0 and R = N. Here

P(R = r) = (1−k)/N, for r = 0, 1, ..., N−1, and P(R = N) = k.

Under Bernardo's prior, it can be seen that

lim_{N→∞} P(R = N | T = n) = (n+1)k / [(n+1)k + (1−k)],

and this, for k = 1/2, yields lim_{N→∞} P(R = N | T = n) = (n+1)/(n+2), which converges to one faster than Jeffreys' (n+1)/(n+3) with its comparable k of 1/2 split between R = 0 and R = N. Bernardo's improvement over Jeffreys' prior, via its grounding in information theoretic ideas, is a noteworthy development; see Section 3.2.
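As an illustrative aside (ours, not part of the original development), the following minimal sketch evaluates P(R = N | T = n) for a finite N under the three priors just discussed; the function names and the choices n = 20, N = 10,000, and k = 1/2 are ours, for illustration only.

    # Sketch: numerically check the limiting behaviour of P(R = N | T = n)
    # under the Bayes-Laplace, Jeffreys, and Bernardo priors.
    from math import comb

    def posterior_R_equals_N(prior, n, N):
        """P(R=N | T=n) = prior(N) / sum_{r=n..N} [C(r,n)/C(N,n)] prior(r)."""
        denom = sum(comb(r, n) / comb(N, n) * prior(r) for r in range(n, N + 1))
        return prior(N) / denom

    def bayes_laplace(N):
        return lambda r: 1.0 / (N + 1)

    def jeffreys(N, k=0.5):
        return lambda r: (1 - k) / (N + 1) + (k / 2 if r in (0, N) else 0.0)

    def bernardo(N, k=0.5):
        return lambda r: k if r == N else (1 - k) / N

    n, N = 20, 10_000
    print(posterior_R_equals_N(bayes_laplace(N), n, N))  # (n+1)/(N+1): near zero
    print(posterior_R_equals_N(jeffreys(N), n, N))       # ~ (n+1)/(n+3) = 21/23
    print(posterior_R_equals_N(bernardo(N), n, N))       # ~ (n+1)/(n+2) = 21/22

The numbers agree with the closed forms above: the uniform prior drives the posterior probability toward zero, while the Jeffreys and Bernardo priors keep it near one.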
The matter of certifying a product, as a way of ensuring its future performance, is a topic in reliability and risk analysis that is akin to proving a law of nature. An archetypal example is a piece of pre-tested and debugged software which will now be tested n times, under different inputs, to see if it satisfactorily processes them all. Once it does so successfully, the software is certified and released for actual use. In Singpurwalla and Wilson (2004), a prior distribution for R which satisfies certain criteria germane to efficient certification testing is motivated and proposed. Whereas the details of this prior are cumbersome to present, Figure 1 below illustrates its general character. There are finite point masses at R = 0 and R = N, with q ∈ (0, 1), k > 0, and λ > 0 as parameters. The parameter q, together with a function f(x) = kx, for 0 < x < ∞, controls the rate of exponential decay (and revival) over R = 0, 1, 2, ..., N, and λ controls the sizes of the point masses at R = 0 and R = N. This prior is not unlike Jeffreys' prior, save for the exponential decay and revival between the point masses.

For k = 5, λ = 3, and q = 0.5, Figure 2 illustrates the behavior of lim_{N→∞} P(R = N | T = n), as n → ∞.

A virtue of this prior is that it leads to an assertion of an item's high probability of not failing after only a few successful trials, and this probability monotonically increases to one with additional successful tests. A prior such as this will be germane to testing for certification when the item in question has been previously vetted, and most of its flaws eliminated. Like the Jeffreys and Bernardo priors, this prior also leads to results that accord with the intuition of scientists.

A noteworthy feature of the four classes of priors discussed above is that there is a non-zero probability assigned to all the values that R can take. In the contexts considered, an assignment like this is understandable.

3 Exchangeability and Priors on Propensity

We stated that the notion of "almost identical trials" is to be taken as an undefined primitive. In Sections 1 and 2, we leaned on this primitive, and, following the Bayes-Laplace prescription of assigning prior distributions on observable quantities, have proposed several classes of priors on R. Recall that R, being the unknown number of successes in N trials, is, in principle, observable, and thus any personal probability assignment on R, when viewed as a 2-sided bet, can eventually be settled.

An approach alternate to that of Section 2, to be described in this section, has as its foundation de Finetti's (1974) notion of exchangeability. This notion endeavors to make more precise one's sense of almost identical trials. Exchangeability is a judgment which implies permutation invariance. When an infinite sequence of Bernoulli trials X_1, X_2, ..., is declared exchangeable, de Finetti's famous representation theorem comes into play.
It claims that for every n ≥ 1, and every subset X_1, ..., X_n of X_1, X_2, ..., with X_i = 1 (0) if the i-th trial is a success (failure):

P(X_1 = x_1, ..., X_n = x_n) = ∫_0^1 ∏_{i=1}^{n} P(X_i = x_i | p) Π(p) dp = ∫_0^1 p^(Σ x_i) (1−p)^(n − Σ x_i) Π(p) dp.

Here P and Π denote personal probabilities of the X_i's, i = 1, ..., n, and of a fictional unknown quantity [cf. De Groot (1975), p. 135], p, respectively; 0 ≤ p ≤ 1. Statisticians refer to p as a Bernoulli parameter, whereas Good (1965) calls p a physical (or objective) probability; de Finetti refers to p as a chance. None of these labels provide any sense of what p means. The quantity Π(p) is one's personal prior probability distribution of p.

It is important to remark that under a personalistic theory of probability, p should not be interpreted as a personal probability. Doing so would be tantamount to looking at Π(p) as a personal probability of a personal probability, and this cannot make Π(p) operational [see Marschak (1975), p. 121-153, for a discussion]. We find it useful to look at p as a propensity (see Section 3.1), in the sense of Popper (1959), and in a manner akin to Kolmogorov (1969).

In specifying Π(p) one is assigning a personal probability to an "unobservable fictional quantity" p, and this would go against the Mach-Einstein dictum of operationalization, because Π(p), as a 2-sided bet on any value of p ∈ [0, 1], can never be settled. However, an entity like Π(p), as a mainstay of applied Bayesian statistics, is so commonly used that it warrants comment. Specifically, Π(p) enables one to automate an application of Bayes' Law in the light of new information, as an enforcer of coherence, and this in turn enables one to place bets on observables via their predictive probabilities.

3.1 The Notion of Propensity

The notion of a propensity was introduced by Karl Popper (1959) in connection with his attempts to interpret quantum theory. However, propensity is also implicit in the writings of de Finetti (1974), Kolmogorov (1969), and Kendall (1949). By propensity is meant a physical tendency for the occurrence, or not, of a certain event, say success, under repeated incidences of its possible occurrence. An archetypal example is the occurrence of heads in infinite tosses of a coin under almost identical conditions. Since indefinitely tossing a coin under almost identical conditions is metaphysical, propensities are not observable, therefore not measurable, and thus not actionable. Indeed, propensities are not probabilities [cf. Humphreys (1985)], and certainly not personal probabilities. They encapsulate an interaction between the outcome of a random phenomenon and the conditions under which the phenomenon occurs [cf. Kolmogorov (1969)]. Observed relative frequencies, being the outcome of a finite number of occurrences, are not propensities either. Relative frequencies are viewed as manifestations of propensities. Propensities are unobservable abstractions useful for obtaining predictive distributions via Bayes' Law, once they are endowed with personal probabilities.

To the best of our knowledge, the term "propensity" has appeared in at least three different contexts. The first is in the context of quantum theory mentioned above; the second in educational testing and the social sciences, wherein "propensity scores" are used for assigning treatment effects. A propensity score is, de facto, a conditional probability, and thus has no bearing on what we consider here. The third context in which the term propensity has appeared is that of the chemical Langevin equation [cf. Gillespie (2000)], and comes closest in spirit to what we think of here.

In the context of exchangeable Bernoulli trials, an extension of de Finetti's theorem exhibits the propensity p as lim_{n→∞} (1/n) Σ_{i=1}^{n} X_i. Since this limit cannot be known, de Finetti endows p with a prior distribution Π(p), and in so doing provides a foundation for Bayesian inference under Bernoulli trials. In what follows, we first discuss a commonly used choice for Π(p), and then introduce a new choice which is a scale transform of the commonly used choice. This new choice could be more suitable for the contexts considered in this paper.
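Before turning to specific choices for Π(p), a small simulation sketch (ours, not the authors') may help fix ideas: draw a propensity p from a prior, generate exchangeable Bernoulli trials given p, and watch the observed relative frequency, a manifestation of the propensity, settle near the drawn value. The Beta(2, 2) prior and the seed are arbitrary choices.

    # Sketch: relative frequencies as manifestations of a latent propensity.
    import random

    random.seed(7)
    p = random.betavariate(2.0, 2.0)  # the fictional propensity, drawn from Pi(p)
    flips = [1 if random.random() < p else 0 for _ in range(100_000)]

    for n in (10, 100, 10_000, 100_000):
        print(n, sum(flips[:n]) / n)  # relative frequency after n trials
    print("drawn propensity p:", round(p, 4))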
3.2 The Beta Family of Priors on Propensity

Since 0 ≤ p ≤ 1, a natural choice for Π(p) would be a standard beta distribution with constants (parameters) α > 0 and β > 0. Different choices for α and β enable one to represent different shapes for Π(p), and thus different judgments about the propensity p. For example, α = β = 1 makes Π(p) uniform over p, so that Π(p) can be seen as an analogue of the Bayes-Laplace public prior on R. Identical values of α > 1 and β > 1 make Π(p) symmetric around p = 1/2, whereas unequal values α < (>) β make Π(p) skewed to the right (left). To encapsulate the prior opinion that values of p are closer to one than to zero, one would set α > β; similarly with α < β.

Π(p) will be L-shaped when α < 1 and β > 1; it is J-shaped when α > 1 and β < 1; and it is U-shaped when α < 1 and β < 1, with infinite probability densities at p = 0 and/or p = 1. However, when α < 1 and β = 1, Π(p) is L-shaped with a positive probability density at p = 1; with α = 1 and β < 1, Π(p) is J-shaped with a positive probability density at p = 0; see Figures 3 and 4.

Of the above choices, the ones which appear to be the closest in terms of relevance for the scenarios of asserting a law, or certifying a product, are the U-shaped Π(p) with α < 1 and β < 1, the L-shaped prior with β = 1 and α < 1, and the J-shaped prior with α = 1 and β < 1. Of these, the L-shaped prior with a finite positive probability density at p = 1 seems most promising.
All of the above priors cover the entire [0, 1] range of p, and there could be scenarios wherein subjective opinion is such that one need focus only on a segment of this range, say the range [ω, 1] for some ω > 0. This matter is taken up later in Section 3.4.

For now, it is well known that when Π(p) has a beta distribution with parameters α, β > 1, then, having observed n successes in n Bernoulli trials, the predictive probability that N future trials will yield N successes is given by the beta-binomial [cf. Singpurwalla (2006), p. 158] as

B(α + n, β) / B(α + n + N, β), where B(a, b) = Γ(a + b) / (Γ(a) Γ(b)).

It can be seen that for any fixed n, no matter how large, the limit of the above probability, as N → ∞, is zero. This conclusion is analogous to that of Bayes and Laplace when a public prior is assigned to R; it too will not accord with the intuition of experimenters.
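To make the limiting behavior concrete, the following sketch (ours) evaluates the beta-binomial expression above via log-gamma functions; the parameter choices α = β = 2 and n = 50 are illustrative.

    # Sketch: the beta-binomial predictive probability of N future successes,
    # using the paper's convention B(a, b) = Gamma(a+b) / (Gamma(a) Gamma(b)).
    from math import exp, lgamma

    def log_B(a, b):
        return lgamma(a + b) - lgamma(a) - lgamma(b)

    def predictive(alpha, beta, n, N):
        return exp(log_B(alpha + n, beta) - log_B(alpha + n + N, beta))

    n = 50
    for N in (10, 1_000, 100_000):
        print(N, predictive(2.0, 2.0, n, N))
    # 0.705..., 0.0024..., ~2.8e-07: for fixed n, the probability tends to zero
    # as N grows, no matter how large n is.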
Alternatively, suppose that the prior Π(p) is L-shaped with a positive density at p = 1; this is obtained via a beta distribution with parameters α < 1 and β = 1. Specifically, for 0 ≤ p ≤ 1,

Π(p) = [Γ(α + 1)/(Γ(α) Γ(1))] p^(α−1) = α p^(α−1) = α / p^(1−α).

As can be seen from Figure 3, this prior emphasizes small values of p, even though there is no probability mass at p = 0.

Figure 3: L-Shaped Prior for Propensity p (α < 1, β = 1)

Under the above prior, if all n trials result in successes, then via routine calculations it can be seen that the posterior distribution of p is of the form (n + α) p^(n+α−1), which is a beta distribution with parameters (n + α) and 1. The effect of testing is to increase the density of p at p = 1 from its original value α to (α + n). A consequence of this posterior of p is that the predictive probability of observing all N successes in N future trials is given by the beta-binomial distribution as (α + n)/(α + n + N), which as N → ∞ goes to zero; see the Appendix for details. Here again, this result would be against the grain of experimentalists, making a prior such as this unsuitable for certification and validation.

What if the prior chosen is J-shaped, with a finite density at p = 0 and an infinite probability density at p = 1? See Figure 4. This prior, which is similar in spirit to that of Bernardo's, can be obtained via a beta distribution with parameters α = 1 and β < 1; that is, Π(p) = β (1−p)^(β−1) = β / (1−p)^(1−β).

Figure 4: J-Shaped Prior for Propensity p (α = 1, β < 1)

Under the above prior, if all n trials result in successes, then the posterior distribution of p is, for some normalizing constant C, of the form C p^n (1−p)^(β−1); this is a beta distribution with parameters (n + 1) and β. Consequently, the predictive probability of observing N successes in N future trials is proportional to B(n + 1, β) / B(n + 1 + N, β).
Because 0 < β < 1, the limit of this probability, as N → ∞, is zero for small to moderate n, whereas it approaches one as n becomes large. As is shown in the Appendix, the predictive probability is the product of N terms, each of the form (n + j)/(n + j + β), j = 1, ..., N, which for any β ∈ (0, 1) is a number less than one when n is small or moderate, and is approximately equal to one when both n and N are large. Figure 5 illustrates the behavior of the predictive probability for a fixed β < 1 and N = 10, as a function of n. This behavior of the predictive probability better encapsulates the intuition of experimentalists than the other predictive probabilities discussed heretofore, because one's certitude about a law should increase as the number of successful test cases increases.

Thus, like the priors of Jeffreys, Bernardo, and the portmanteau prior of Sections 2.2, 2.3, and 2.4, a prior such as this is viable for certification and validation, provided that n is large.

Figure 5: Predictive Probability Under a J-Shaped Prior (N = 10)
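The claimed behavior is easy to check numerically; the sketch below (ours) evaluates the product form of B(n + 1, β)/B(n + N + 1, β), with β = 0.5 and N = 10 as illustrative choices.

    # Sketch: the J-shaped prior's predictive probability as a product of
    # N terms (n+j)/(n+j+beta), j = 1..N.
    def predictive_product(n, N, beta):
        out = 1.0
        for j in range(1, N + 1):
            out *= (n + j) / (n + j + beta)
        return out

    for n in (10, 100, 1_000, 10_000):
        print(n, predictive_product(n, N=10, beta=0.5))
    # The values climb toward one as n grows, roughly like (n/(n+N))**beta.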
3.3 Scale Transformed and Reflected Beta Priors

The beta family of prior distributions considered in Section 3.2 covers the entire range of 0 to 1 for values of the propensity p. In many applications, especially those pertaining to system certification, this could be seen as being restrictive, because the performance of systems improves over time as a consequence of the process of debugging and successive fixes. In such circumstances, a prior on p over the range [ω, 1], for some value of ω > 0, may be judged more appropriate. One way to achieve this form of left truncation is via a transformation of the beta distribution of p, by scaling it by the factor e^(−η), for some η ≥ 0, and then taking its reflection. Specifically, we let p = 1 − e^(−η) z, where η ≥ 0, with z having a beta distribution with parameters α > 0 and β > 0. When η = 0, the distribution of p will be a standard beta over [0, 1]; for η > 0, the range of values of p will be restricted to the left, so that the distribution of p will concentrate its mass more and more towards one. See Figure 6, wherein α = 3, β = 2, and three increasing values of η are used; the support of the distribution of p decreases with increasing values of η.

Figure 6: Plots of the Left Truncated Beta Distribution of p (α = 3, β = 2)
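To visualize the construction, the following sketch (ours) samples p = 1 − e^(−η) z with z ~ Beta(3, 2), the parameters used in Figure 6; the η values 0, 0.5, and 1 are our illustrative choices.

    # Sketch: the scale-and-reflect construction p = 1 - exp(-eta)*z, whose
    # support [1 - exp(-eta), 1] shrinks toward one as eta increases.
    import random
    from math import exp
    from statistics import mean

    random.seed(11)
    for eta in (0.0, 0.5, 1.0):
        ps = [1.0 - exp(-eta) * random.betavariate(3.0, 2.0) for _ in range(50_000)]
        print(f"eta={eta}: support starts at {1 - exp(-eta):.3f}, "
              f"min={min(ps):.3f}, mean={mean(ps):.3f}")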
Verify that, for c = e^(−η) with η > 0, the prior distribution of p is

Π(p) = [Γ(α + β)/(Γ(α) Γ(β))] (1/c)^(α+β−1) (1−p)^(α−1) (p − (1−c))^(β−1), 1−c ≤ p ≤ 1,

a shifted beta distribution with parameters α and β, over (1−c) and 1. Because this distribution does not have a probability mass at p = 1, it will be unsuitable for the purpose of certification and validation. The distribution is meritorious all the same, because it has been used by us as a model for the posterior distribution of the threshold 1−c; see Section 3.4.1.

Were one to set α = 1 and β < 1, so that the distribution is better aligned with the J-shaped distribution of Figure 4, the prior distribution of p would simplify as

Π(p) = β (1/c)^β (p − (1−c))^(β−1), (1−c) ≤ p ≤ 1.

Figure 7: L-Shaped Prior by Transformation and Reflection (α = 1, β < 1, and c = e^(−η))

Whereas this prior, which is analogous to the L-shaped prior of Figure 3, has a probability density of β/c at p = 1, it will also have an infinite probability density at p = 1−c, making it unsuitable for use in certification and validation. Indeed, with n successes out of n trials, the posterior distribution of p will continue to be a beta distribution with parameters (n + 1) and β over (1−c) and 1. This distribution will have a probability density of

[Γ(n + 1 + β)/(Γ(n + 1) Γ(β))] c^(β−1), at p = 1,

which, even for large n, is only slightly greater than β/c, for β < 1, suggesting an insensitivity of the distribution at p = 1 to the observed success data.

Based on the above, as well as the illustrations of Figure 6, it appears that reflections of scale transformed beta distributions are inappropriate for the task of certification and validation. We therefore consider in Section 3.4 transformations entailing location shifts of suitable beta distributions. The illustrations of Figures 6 and 7 are presented here for the sake of completeness, though Figure 6 is germane to the material of Section 3.4.1.
3.4 Left Truncated Beta Family

Suppose that the prior distribution of p is a location shifted beta distribution with parameters α = 1 and β < 1, over the range [ω, 1], 0 < ω < 1, where ω is analogous to the parameter (1−c) of Figure 7. Verify that the probability density of p is of the form

Π(p) = [β/(1−ω)^β] (1−p)^(β−1), for ω ≤ p ≤ 1.

This distribution places a probability density of β/(1−ω) at p = ω, and an infinite probability density at p = 1; see Figure 8.

Figure 8: Location Shifted Beta (α = 1, β < 1, and ω > 0)

If all n out of n Bernoulli trials result in a success, then the posterior distribution of p is also a beta distribution, with parameters (n + 1) and β, over the range [ω, 1]; for ω specified,

Π(p; n) = [Γ(n + β + 1)/(Γ(n + 1) Γ(β))] (p − ω)^n (1−p)^(β−1) / (1−ω)^(n+β), ω ≤ p ≤ 1.

This posterior distribution has an infinite probability density at p = 1, but here the (prior) probability density of β/(1−ω) at p = ω vanishes. This latter feature is to be expected, since none of the n Bernoulli trials have experienced a failure.

It is easy to verify that the predictive probability of observing all N successes in N future trials is given as B(n + 1, β)/[B(n + N + 1, β) (1−ω)^(n+β)], which, when n is moderate to large and ω, β < 1, converges to one as N → ∞. This result is analogous to that associated with the prior of Figure 4, save for the extra (1−ω)^(n+β) term in the denominator. The effect of this extra denominator term is to enable a faster rate of convergence to one as compared to that given by B(n + 1, β)/B(n + N + 1, β) alone.
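As an illustrative check (ours, not from the paper), the sketch below computes the predictive probability E[p^N] exactly under the posterior just displayed, by expanding (ω + (1−ω)t)^N with t ~ Beta(n+1, β); this sidesteps the closed form quoted above. The choices β = 0.5, N = 50, and the values of ω and n are ours.

    # Sketch: predictive probability E[p^N] when p = omega + (1-omega)*t,
    # with t ~ Beta(n+1, beta); E[t^k] = prod_{j=1..k} (n+j)/(n+j+beta).
    from math import comb

    def t_moment(n, beta, k):
        out = 1.0
        for j in range(1, k + 1):
            out *= (n + j) / (n + j + beta)
        return out

    def predictive_shifted(n, N, beta, omega):
        # binomial expansion of E[(omega + (1-omega)*t)^N]
        return sum(comb(N, k) * omega**(N - k) * (1 - omega)**k * t_moment(n, beta, k)
                   for k in range(N + 1))

    for omega in (0.0, 0.9):
        for n in (10, 100, 1_000):
            print(omega, n, round(predictive_shifted(n, N=50, beta=0.5, omega=omega), 4))
    # omega = 0 recovers the J-shaped product form; omega > 0 pushes the
    # predictive probability toward one markedly faster in n.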
The location-shifted prior of this section is therefore suitable for asserting certification and validation, and more so than that associated with the prior of Figure 4, because of its faster rate of convergence. Provided that one is able to specify ω in a meaningful fashion, the prior of Figure 8 seems to be the most rewarding of the priors on propensity that we have considered thus far. However, specifying a precise value for ω would be enigmatic. A challenge therefore is how best to specify a prior on ω, and, based on n out of n successful Bernoulli trials, how to induce its posterior. This matter is taken up below.

We start by recalling that, given ω, ω ∈ [0, 1], the posterior distribution of p, having observed n out of n successes, is:

Π(p | ω; (n out of n successes)) = [Γ(n + β + 1)/(Γ(n + 1) Γ(β))] (p − ω)^n (1−p)^(β−1) / (1−ω)^(n+β), ω ≤ p ≤ 1.

Were we to average out this posterior distribution with respect to a posterior distribution of ω (in the light of n out of n successes), we would obtain the posterior of p unconditional on ω. However, obtaining a posterior of ω via the usual Bayesian prior to posterior iteration poses a difficulty when one does not have a probability model for inducing a likelihood function for ω. All the same, there is nothing within the Bayesian paradigm which mandates that the only way to obtain posterior distributions is via a legislated application of Bayes' Law. Indeed, the philosopher Richard Jeffrey has proposed an alternative to Bayes' Law [cf. Shafer (1981), Diaconis and Zabell (1982)], called Jeffrey's Rule of Conditioning. Furthermore, since the prior is to be subjectively specified, and so can the likelihood be [were one not to subscribe to the philosophical principle of conditionalization, cf. Williams (1980)], one is also at liberty to subjectively specify a posterior distribution, subject to the usual constraints of coherence. This is what we choose to do, and the material below indicates a possible approach.
3.4.1 Specifying a Subjective Posterior for ω

To facilitate a subjective specification of the posterior distribution of ω which is coherent, we lean on the plots of Figure 6, each appropriate to a value of n successes (out of n trials). These plots encapsulate the feature that their thresholds move towards one as n increases. The functional form describing these plots is taken to be the posterior distribution of ω, which is a shifted beta distribution over (1−c) and 1, with parameters a and b. Specifically, we suppose that

Π(ω; a, b, c) = [Γ(a + b)/(Γ(a) Γ(b))] (1/c)^(a+b−1) (1−ω)^(a−1) (ω − (1−c))^(b−1), 1−c ≤ ω ≤ 1.

In Table 1, we indicate possible choices of a, b, and c for different values of n, assuming all n successes. The entries of Table 1 are by no means unique.
    n       a    b    c
    1       1    1    1
    2       2    2    1
    5       2    3    1
    10      3    4    0.95
    25      3    4    0.9
    50      3    4    0.75
    75      4    5    0.5
    100     4    5    0.25
    1000    4    5    0.05

Table 1: Posterior Distribution of ω as a Function of n

Averaging out ω, with respect to the above posterior distribution of ω, in Π(p | ω; n) given before, gives us the posterior distribution of p, unconditional on ω, as:

Π(p; n) = [Γ(n + β + 1)/(Γ(n + 1) Γ(β))] [Γ(a + b)/(Γ(a) Γ(b))] (1/c)^(a+b−1) (1−p)^(β−1) ∫_{1−c}^{p} (p − ω)^n (ω − (1−c))^(b−1) / (1−ω)^(n+β+1−a) dω, 1−c ≤ p ≤ 1,

an expression difficult to evaluate in closed form. Its numerical evaluation, however, is straightforward.

Whereas knowing the ω-averaged posterior distribution of p is of limited interest, the more relevant entity to know is the predictive probability of N out of N successes in N future trials. With ω specified, this probability is given as B(n + 1, β)/[B(n + N + 1, β) (1−ω)^(n+β)]. Averaging out ω with respect to its posterior distribution gives the unconditional predictive probability as:

[B(n + 1, β)/B(n + N + 1, β)] [Γ(a + b)/(Γ(a) Γ(b))] (1/c)^(a+b−1) ∫_{1−c}^{1} (ω − (1−c))^(b−1) / (1−ω)^(n+β+1−a) dω.

For n large, the integral term simplifies to ∫_{1−c}^{1} (ω − (1−c))^(b−1) (1−ω)^(−n) dω, and this can be numerically assessed.
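One possible route to the numerical assessment (ours, not the authors') is Monte Carlo: draw ω from Π(ω; a, b, c) by setting ω = 1 − c·u with u ~ Beta(a, b), draw p from its conditional posterior by setting p = ω + (1−ω)t with t ~ Beta(n+1, β), and average p^N. The parameter values below mirror rows of Table 1; β = 0.5 and N = 50 are illustrative.

    # Sketch: Monte Carlo evaluation of the unconditional predictive
    # probability of N future successes, averaging over the subjective
    # posterior of omega and the conditional posterior of p.
    import random
    from statistics import mean

    def sample_omega(a, b, c):
        # If u ~ Beta(a, b), then omega = 1 - c*u has the shifted beta
        # density Pi(omega; a, b, c) on [1-c, 1].
        return 1.0 - c * random.betavariate(a, b)

    def sample_p(omega, n, beta):
        # Posterior of p given omega: shifted Beta(n+1, beta) on [omega, 1].
        return omega + (1.0 - omega) * random.betavariate(n + 1, beta)

    def predictive(n, N, beta, a, b, c, draws=100_000):
        return mean(sample_p(sample_omega(a, b, c), n, beta) ** N
                    for _ in range(draws))

    random.seed(42)
    print(predictive(n=100,  N=50, beta=0.5, a=4, b=5, c=0.25))
    print(predictive(n=1000, N=50, beta=0.5, a=4, b=5, c=0.05))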
4 Summary and Conclusions

The aim of this paper is to articulate the conditions and circumstances in which observing a slew of successes, and no failures, enables one to assert a law of nature, or to claim the certification of an entity. These circumstances pertain to the characteristics of one's prior opinion, assuming that it is honest and genuine, and how it is expressed. One cannot assert certitude if prior opinion is of the "objective" type, or is encapsulated via distributions for which the predictive probability converges to zero. There do exist priors for which the predictive probability goes to one, for moderate, or even small, values of the number of items tested. If such priors reflect true opinion, then their use could be beneficial, costwise. In an adversarial circumstance, a use of such priors can be subject to a challenge.

It is not the purpose, nor the intent, of this paper to show, or to make the case, that by manipulating priors one can get what one wants, namely, to obtain certitude and claim certification. Rather, it is to make the case that by choosing certain priors one can be assured of getting what one is entitled to get, because, in observing a large sequence of successes and no failures, a user of certain kinds of priors could face the dilemma of a reductio ad absurdum.

Appendix (Convergence of Posterior Predictive Probabilities)
1. For the L-shaped prior, the predictive probability that all N future trials will lead to success is

B(α + n, 1)/B(α + n + N, 1) = [Γ(α + n + 1)/Γ(α + n)] · [Γ(α + n + N)/Γ(α + n + N + 1)] = (α + n)/(α + n + N),

which tends to 0 as N → ∞.

2. For the J-shaped prior, the predictive probability that all N future trials will lead to success is proportional to

B(n + 1, β)/B(n + N + 1, β) = [Γ(n + 1 + β)/(Γ(n + 1) Γ(β))] · [Γ(n + N + 1) Γ(β)/Γ(n + N + 1 + β)] = [Γ(n + N + 1)/Γ(n + 1)] · [Γ(n + 1 + β)/Γ(n + N + 1 + β)] = ∏_{j=1}^{N} (n + j)/(n + j + β),

which for any β < 1 is a product of N terms, each of which is close to one when n gets large. The product behaves approximately as (n/(n + N))^β, and so tends to 1 as n → ∞.

Acknowledgements

The idea of scale transforming a beta distributed variable to induce a threshold was suggested to the first author by Professor N. Balakrishnan of McMaster University. Whereas this idea did not fly in its intended spirit, it proved valuable for the subjective specification of posterior distributions. We thank Prof. Balakrishnan for this suggestion. The work reported here was supported by a grant from the City University of Hong Kong, Project Number 9380068, and by the Research Grants Council Theme-Based Research Scheme Grants T32-102/14N and T32-101/15R.

References

[1] Bernardo, J. M. (1985). "On a Famous Problem of Induction." Trabajos de Estadistica, 36(1), 24-30.
[2] De Finetti, B. (1974). "Theory of Probability: A Critical Introductory Treatment." Vol. 2. Wiley, London and New York.

[3] De Groot, M. (1975). "In Personal Probabilities of Probabilities." Theory and Decision, 6(2), p. 135.

[4] Diaconis, P., and Zabell, S. L. (1982). "Updating Subjective Probability." Journal of the American Statistical Association, 77(380), 822-830.

[5] Gillespie, D. T. (2000). "The Chemical Langevin Equation." The Journal of Chemical Physics, 113(1), 297-306.

[6] Good, I. J. (1965). "The Estimation of Probabilities: An Essay on Modern Bayesian Methods." M.I.T. Press, Cambridge, MA.

[7] Humphreys, P. (1985). "Why Propensities Cannot Be Probabilities." Philosophical Review, 94, 557-570.

[8] Jeffreys, H. (1961). "Theory of Probability." Third Edition. Clarendon Press, Oxford, England.

[9] Ke, Q., Ferrara, E., Radicchi, F., and Flammini, A. (2015). "Defining and Identifying Sleeping Beauties in Science." Proceedings of the National Academy of Sciences, 112(24), 7426-7431.

[10] Kendall, M. G. (1949). "On the Reconciliation of Theories of Probability." Biometrika, 36, 101-116.

[11] Kolmogorov, A. N. (1969). "The Theory of Probability." In Mathematics: Its Content, Methods, and Meaning, Vol. 2, Part 3. A. D. Aleksandrov, A. N. Kolmogorov, and M. A. Lavrentev, Eds. M.I.T. Press, Cambridge, MA.

[12] Marschak, J. (1975). "Do Personal Probabilities of Probabilities Have an Operational Meaning? In Personal Probabilities of Probabilities." Theory and Decision, 6(2), 127-133.

[13] Pearson, K. (1920). "The Fundamental Problem of Practical Statistics." Biometrika, 13(1), 1-16.

[14] Popper, K. R. (1959). "The Propensity Interpretation of Probability." The British Journal for the Philosophy of Science, 10(37), 25-42.

[15] Shafer, G. (1981). "Jeffrey's Rule of Conditioning." Philosophy of Science, 48(3), 337-362.

[16] Singpurwalla, N. D., and Wilson, P. (2004). "When Can Finite Testing Ensure Infinite Trustworthiness?" Journal of the Iranian Statistical Society, 3, 1-37.

[17] Singpurwalla, N. D. (2006). "Reliability and Risk: A Bayesian Perspective." John Wiley & Sons, New York.

[18] Williams, P. M. (1980). "Bayesian Conditionalization and the Principle of Minimum Information." British Journal for the Philosophy of Science, 31, 131-144.

[19] Wilson, E. B. (1927). "Probable Inference, the Law of Succession, and Statistical Inference." Journal of the American Statistical Association, 22(158), 209-212.

[20] Zabell, S. (1989). "The Rule of Succession." Erkenntnis, 31, 283-321.