Justifying the Norms of Inductive Inference
Olav Benjamin Vassend

September 17, 2019
Abstract
Bayesian inference is limited in scope because it cannot be applied in idealized contexts where none of the hypotheses under consideration is true and because it is committed to always using the likelihood as a measure of evidential favoring, even when that is inappropriate. The purpose of this paper is to study inductive inference in a very general setting where finding the truth is not necessarily the goal and where the measure of evidential favoring is not necessarily the likelihood. I use an accuracy argument to argue for probabilism, and I develop a new kind of argument to argue for two general updating rules, both of which are reasonable in different contexts. One of the updating rules has standard Bayesian updating, Bissiri et al.'s (2016) general Bayesian updating, Douven's (2016) IBE-based updating, and Vassend's (2019a) quasi-Bayesian updating as special cases. The other updating rule is novel.
Contents
Conclusion
A Characterization of the combination function
B Characterization of the normalization step
C Characterization of inferential updating
D Characterization of predictive updating
E General Bayesian updating is a special case of inferential updating
F An alternative characterization of the combination step
Bayesians hold that inductive inference requires two ingredients. First, a prior probability function defined on the hypotheses under consideration. Second, a likelihood function, which assigns a probability to the evidence conditional on each hypothesis. Intuitively, the prior probability assigned to a hypothesis represents how plausible it is that the hypothesis is true before the evidence has been taken into account. The likelihood, on the other hand, is a measure of evidential favoring: if H_1's likelihood on the evidence is greater than H_2's likelihood on the same evidence, then the evidence favors H_1 over H_2. Given a prior and likelihood, Bayesians hold that the prior probability of each hypothesis should be updated to a posterior probability through the use of Bayes's formula, so that the posterior probability of H is proportional to the prior probability of H multiplied by its likelihood.

Bayesianism has become the most common formal framework used by philosophers of science to study scientific methodology, and it is also an influential framework for statistical inference. But it rests on an assumption that is often violated in scientific practice, namely that one of the hypotheses under consideration is true. (This limitation is well known, but often ignored. For discussion of the problem, see, e.g., Box (1980); Bernardo and Smith (1994); Forster and Sober (1994); Forster (1995); Key et al. (1999); Shaffer (2001); Sprenger (2009); Gelman and Shalizi (2013); Vassend (2019b); Walker (2013); and Sprenger (forthcoming).) Suppose none of the hypotheses under consideration is true, so that the goal is instead to find the hypothesis that is – in some sense – best. Depending on what is meant by "best," the likelihood may not be an appropriate measure of evidential favoring. For example, suppose the goal is to identify the hypothesis whose maximal prediction error is expected to be lowest. A natural evidential measure would then say that the evidence favors H_1 over H_2 if and only if H_1's maximal prediction error on the evidence is lower than H_2's maximal prediction error on the evidence.
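To make the contrast concrete, here is a small illustrative sketch in Python (the data, the two hypotheses, and the Gaussian error model are invented for illustration, not taken from the paper) in which the likelihood and a maximal-prediction-error measure rank the same two hypotheses in opposite ways:

```python
# Illustrative sketch (invented example): the likelihood and a max-error
# evidential measure can rank the same two hypotheses in opposite ways.
# Data: y = x at ten points, except for one large outlier at x = 9.
xs = list(range(10))
ys = [x for x in xs]
ys[9] = 9 + 3  # the outlier

h1 = lambda x: x      # hypothesis H_1: y = x
h2 = lambda x: x + 1  # hypothesis H_2: y = x + 1

def errors(predict):
    return [abs(predict(x) - y) for x, y in zip(xs, ys)]

def log_likelihood(predict):
    # Gaussian log-likelihood (unit variance, constants dropped):
    # rewards small total squared error, so it favors H_1 (SSE 9 vs. 13).
    return -0.5 * sum(e ** 2 for e in errors(predict))

def max_error_score(predict):
    # Rewards a small worst-case error, so it favors H_2 (max 2 vs. 3).
    return -max(errors(predict))

assert log_likelihood(h1) > log_likelihood(h2)
assert max_error_score(h2) > max_error_score(h1)
```

Which hypothesis the evidence "favors" thus depends on the evidential measure chosen, which is precisely the point at issue.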
The fact that Bayesianism is tied to using the likelihood as a measure of evidential favoring is therefore a limitation of the framework.

The goal of this paper is to study inductive inference in a very general setting. Suppose our goal is to identify the best hypothesis H (where "best" does not necessarily mean "true"). Let p be a function that assigns a number between 0 and 1 (inclusive) to each hypothesis, such that p(H) is interpreted as representing a prior judgment of how plausible it is that H is best (in the relevant sense) out of the hypotheses under consideration. In the rest of the paper, I will refer to any such function as a "credibility function". Suppose, moreover, that Ev[E|H] is an evidential measure that is sensible given the purpose at hand. Then the questions to consider are as follows: (1) What norms should p obey? (2) How should p(H) and Ev[E|H] be combined in order to produce a posterior score p_E(H) that represents how plausible it is that H is best in light of E and the prior information?

As we will see, one of the standard Bayesian arguments for probabilism generalizes, so that – given widely applicable conditions – p and p_E ought to be probability functions. The more interesting results concern updating. I will show that, depending on what the goal is, the prior probability function and evidential measure should be combined in one of the following two ways in order to produce a posterior probability:

Inferential updating. Given evidential measure Ev and prior probability function p, update p to the posterior p_E by way of the following formula:

p_E(H) = Ev[E|H] p(H) / Σ_i Ev[E|H_i] p(H_i)

Predictive updating. Given evidential measure Ev and prior probability function p, update p to the posterior p_E by way of the following procedure:

Step 1. For each i, calculate q(H_i) = p(H_i) + Ev[E|H_i].

Step 2. Transform q to p_E as follows: for each i, p_E(H_i) = 0 or p_E(H_i) = q(H_i) + d, where d is the unique number such that d is minimal and, for all i, p_E(H_i) ≥ 0 and Σ_i p_E(H_i) = 1.

The justification for the names of the two updating procedures will become clearer later. Inferential updating is clearly a generalization of Bayesian updating. Indeed, Bayesian updating is just inferential updating with the likelihood used as the measure of evidential favoring. What separates inferential updating from predictive updating is the former rule's commitment to
Regularity: inferential updating will never assign a probability of 0 to any hypothesis, whereas predictive updating typically will. In Section 4, we'll see that a commitment to Regularity is sometimes reasonable and sometimes not.

The plan for the rest of the paper is as follows. In Section 2, I sketch an argument for why any credibility function ought to be probabilistic, regardless of whether the goal is truth or something else. Since the argument is a straightforward adaptation of Pettigrew's (2016) accuracy argument for probabilism, the section is brief. In Section 3, I give characterizations of inferential and predictive updating from a set of plausible assumptions. The strategy is to divide inductive updating into two steps: in the first step, the prior plausibility of a hypothesis is combined with the hypothesis's score on the evidence according to some measure of evidential favoring in order to produce a posterior score. In the second step, the posterior scores are normalized so that they are probabilistic. As we'll see, the requirement that the combination step and normalization step commute in certain desirable ways, together with a few other plausible assumptions, results in the conclusion that the combination step and normalization step must both be either multiplicative or additive. The characterizations of inferential and predictive updating are then just a few short steps away. I end the paper with a discussion of inferential and predictive updating, including their relationship to each other and to other updating rules.

(Predictive updating may remind the reader of the alternative to Jeffrey conditionalization derived by Leitgeb and Pettigrew (2010). The two rules do indeed share several features, although they are also importantly different. In fact, it is possible to derive a special case of predictive updating by using a proof strategy that resembles the one in Leitgeb and Pettigrew (2010).)
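Both rules are straightforward to state algorithmically. The following Python sketch (the function names are mine, and the example numbers are invented) implements inferential updating directly, and implements Step 2 of predictive updating by finding the shift d at which the clipped, shifted scores sum to 1:

```python
def inferential_update(prior, ev):
    """Multiply each prior probability by its evidential score, then renormalize."""
    weighted = [e * p for e, p in zip(ev, prior)]
    total = sum(weighted)
    return [w / total for w in weighted]

def predictive_update(prior, ev):
    """Step 1: add evidential scores to priors. Step 2: shift every score by
    the unique d (clipping negatives to 0) so the result sums to 1."""
    q = [p + e for p, e in zip(prior, ev)]
    u = sorted(q, reverse=True)
    running = 0.0
    for j, uj in enumerate(u, start=1):
        running += uj
        d = (1.0 - running) / j
        # stop once every remaining (smaller) entry would be clipped to 0
        if j == len(u) or u[j] + d <= 0:
            break
    return [max(qi + d, 0.0) for qi in q]

# With the likelihood as Ev, inferential updating is just Bayesian updating:
post = inferential_update([0.5, 0.5], [0.8, 0.2])   # -> [0.8, 0.2]
# Predictive updating can assign probability 0 to a poorly scoring hypothesis:
pred = predictive_update([0.2, 0.3, 0.5], [0.0, 0.6, 0.9])  # -> [0, 0.25, 0.75]
```

In the second call, the worst-scoring hypothesis is excluded outright (posterior probability 0), which is exactly the behavior that distinguishes predictive from inferential updating.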
2 Why credibility functions should be probabilistic
Before we can show that credibility functions ought to be probabilistic, we need to get clearer on what this claim amounts to. Let H be a set of hypotheses and suppose the goal is to identify the hypothesis in H that is best rather than true (where "best" can mean anything we like). One complication that arises when "true" is replaced by "best" is that whereas there is only one true hypothesis, there may be several that are best. For example, if "best" means "having a minimal maximum expected prediction error," then there may be several hypotheses that are tied for best. Note, however, that this is more a theoretical possibility than a practical one, since it is quite unlikely that multiple hypotheses would have (say) exactly the same predictive accuracy score, especially if the number of hypotheses is large. I will henceforth assume that at most one hypothesis out of the hypotheses under consideration is best. Note that if we make this assumption, then the hypotheses will also be mutually exclusive in the sense that in any subset of hypotheses at most one hypothesis can be best.

Another theoretical possibility is that none of the hypotheses under consideration is best. This can, for example, happen if the hypothesis space is infinite and does not contain a single best hypothesis, but rather an infinite sequence of hypotheses in ascending order of goodness. To preclude this possibility, we must also assume that at least one of the hypotheses under consideration is best.

Provided we make the above assumptions (i.e. that exactly one of the hypotheses in H is best), then there is nothing mathematical or philosophical that prevents us from treating H as a sample space. I.e., H consists of hypotheses that are exhaustive in the sense that one of the hypotheses is best and mutually exclusive in the sense that at most one of the hypotheses is best in any collection of hypotheses. Note also that there is a natural σ-algebra on H.
More precisely, union (or disjunction) and intersection (or conjunction) are defined in the normal way, the identity element for conjunction (i.e. the top element of the algebra) is H, and the complement (negation) of any set A formed through unions and intersections of subsets of H is defined in the following way: ¬A := H − A. The main difference from the definition given in most philosophical treatments of Bayesianism is that the top element is now H rather than the tautology. This makes a big interpretive difference, but no difference to the mathematics. (I thank X for pointing this out to me. I thank a referee for pointing out this possibility.) We can now say what it is for a credibility function defined on the σ-algebra, H∗, generated by H to be probabilistic in the following way:

Probability axioms.
A function p defined on H∗ is probabilistic if and only if it satisfies the following requirements:

1. p(H) = 1.
2. p(A) ≥ 0, for all subsets A of H∗.
3. p(A ∨ B) = p(A) + p(B) − p(A & B), for all subsets A and B of H∗.

Note that credibility functions automatically satisfy 2, since we have defined them to have a range between 0 and 1, so the real question is whether they ought to satisfy 1 and 3. One of the standard arguments for why ordinary credence functions (or degrees of belief) ought to be probabilistic is the accuracy argument (Joyce (1998), Joyce (2009), Pettigrew (2016), Predd et al. (2009)). Briefly, the argument is as follows: the ideal credence function to have is the function that assigns 1 to the hypothesis that is true and 0 to all hypotheses that are false. Suppose now that we have a divergence measure (satisfying certain reasonable properties) that quantifies the distance between the ideal function and any other candidate credence function. It can then be shown that any credence function that is not probabilistic will be dominated by some probabilistic function, in the sense that the probabilistic function is guaranteed to have a smaller divergence from the ideal function. Since it is irrational to choose an option that is known to be dominated, it follows that it is irrational to use a non-probabilistic credence function.

An interesting fact about the accuracy argument for probabilism is that it does not depend for its validity on any specific interpretation of the credence function, nor does it depend on the assumption that the ideal credibility function is the function that assigns 1 to the hypothesis that is true and 0 to all hypotheses that are false. Indeed, nothing in the accuracy argument prevents us from designating the ideal credibility function otherwise. Hence, we can easily adapt the argument to a context where the goal is to identify the hypothesis that is best rather than true.
In such a context, the ideal function would clearly be one that assigns 1 to the hypothesis that is best and 0 to all other hypotheses. We can then formulate the following version of the accuracy argument. (There are several versions of the argument; here, I present a variant of Pettigrew's (2016) version.)

P1: The ideal credibility function is the function that assigns 1 to the hypothesis that is best and 0 to all other hypotheses.

P2: Given any non-probabilistic function, there is a probabilistic function that is guaranteed to have a smaller divergence from the ideal function (given that the divergence measure has certain reasonable properties).

P3: Given any probabilistic function, there does not exist any function that is guaranteed to have a smaller divergence from the ideal function (given that the divergence measure has certain reasonable properties).

P4: If P1-P3, then non-probabilistic credibility functions are irrational.

C: Non-probabilistic credibility functions are irrational.

P2 and P3 are mathematical theorems (proven by Predd et al. (2009)) that hold regardless of what we choose as the ideal function. P1 and P4, on the other hand, are intuitively reasonable general rational principles. The main question that may be raised about the generalized version of the accuracy argument is whether the conditions on the divergence measure are still reasonable when truth is no longer the goal. For example, P2 and P3 require the assumption that the divergence measure belongs to the class of Bregman divergences. Is this a reasonable requirement to make? My only response to this question is that I do not see how this assumption (and other necessary mathematical assumptions) is more plausible if truth is the goal than if the goal is to identify the hypothesis that is best in some other sense. So, at least in my eyes, the generalized accuracy argument is at least as plausible as the original argument. In any case, my main goal in this paper is not to give a careful analysis of the accuracy argument. From now on, I will assume that any credibility function ought to be probabilistic. That is, I will assume that if p is a function that assigns a number between 0 and 1 to each hypothesis H that represents how plausible it is that H is best (in some sense), then p ought to be probabilistic.
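P2 can be illustrated numerically with the Brier score, a standard inaccuracy measure generated by a Bregman divergence. In this invented two-hypothesis example, the non-probabilistic credibility function (0.6, 0.6) is dominated by the probabilistic function (0.5, 0.5) no matter which hypothesis turns out to be best:

```python
def brier(credence, ideal):
    """Squared distance from the ideal (best-indicating) function."""
    return sum((c - v) ** 2 for c, v in zip(credence, ideal))

worlds = [(1.0, 0.0), (0.0, 1.0)]  # either H_1 is best or H_2 is best
nonprob = (0.6, 0.6)               # violates additivity: sums to 1.2
prob = (0.5, 0.5)                  # a probabilistic alternative

# The probabilistic function is strictly closer to the ideal in every world:
assert all(brier(prob, w) < brier(nonprob, w) for w in worlds)
```

Here brier(prob, w) = 0.5 in each world, while brier(nonprob, w) = 0.52, so (0.6, 0.6) is accuracy-dominated regardless of which hypothesis is best.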
In the next section, I turn to the main question of the paper: given a probability function p and given a piece of evidence E, how should p be updated in light of E?

Suppose we have a credibility function defined on a hypothesis set H that is probabilistic in the sense of the preceding section. Suppose, also, that we have an evidential measure function Ev[E|H] defined on the set of evidence and the set of hypotheses under consideration. Note that we are not assuming that Ev[E|H] is probabilistic (e.g. Σ_i Ev[E|H_i] need not be 1). It is widely accepted that if the goal is to find the true hypothesis in a partition of hypotheses and the evidential measure is the likelihood, i.e. Ev[E|H] = p(E|H), then any probability function over the hypotheses ought to be updated through Bayesian updating:

Bayesian updating: p_E(H) = p(E|H) p(H) / Σ_i p(E|H_i) p(H_i)

The natural generalization of Bayesian updating is what I have called inferential updating in the introduction. However, it is not clear why the prior probability function and the evidential measure should always be combined in a Bayesian-like manner, regardless of what the evidential measure is and regardless of what the purpose of updating is. Unfortunately, whereas the accuracy argument for probabilism does not make any assumptions about how the credibility function is interpreted, the standard accuracy argument for Bayesian updating (Greaves and Wallace, 2006) relies on properties that are unique to the likelihood, in particular the fact that the likelihood forms a joint distribution with the prior. Thus, the standard accuracy argument does not generalize to cases where the evidential measure is not the likelihood. Other standard arguments for Bayesian updating have the same limitation (e.g. Dutch book arguments). A different kind of approach is therefore needed.

Bissiri et al. (2016) provide a different approach.
They show that provided that the evidential measure is a function of an additive loss function, L(E, H), such that Ev[E_1 & E_2 | H] = f(L(E_1, H) + L(E_2, H)), and given that a few other assumptions are met, the updating procedure must have the following form, where c is some constant:

p_E(H) = e^(−c·L(E,H)) p(H) / Σ_i e^(−c·L(E,H_i)) p(H_i)    (3.1)

Bissiri et al. (2016) call the above updating procedure "general Bayesian updating." General Bayesian updating traces back to Zhang (2006) and has been increasingly influential in statistics in recent years. (See Grünwald and van Ommen (2017) for a thorough discussion of general Bayesian updating and related updating rules.) Although Bissiri et al.'s (2016) argument for general Bayesian updating is interesting, it has several limitations. One problem is that, as Vassend (2019b) argues, the probabilities in (3.1) cannot be interpreted in the standard Bayesian way as plausibilities of truth. But if the probabilities are not standard credibility functions, then the decision-theoretic framework assumed by Bissiri et al. (2016) would seem to lack justification. The argument also assumes that the relevant divergence measures are f-divergences. This assumption rules out many standard divergence measures, including all Bregman divergences aside from the Kullback-Leibler divergence (Amari, 2009). A final limitation of Bissiri et al.'s (2016) derivation is that there are many reasonable evidential measures that cannot be written as a function of an additive loss function. Indeed, even the likelihood will only have such a form if the evidence is independent conditional on H_i, for all i. Thus, although their argument is interesting, a more general approach that makes less restrictive and more philosophically defensible assumptions is desirable. That is the goal of this section. Later we will see that Bissiri et al.'s (2016) updating rule may be derived as a special case.

To start, note that ordinary Bayesian updating can be decomposed into two steps:
Combination step. For each i, calculate p∗(H_i) = p(E|H_i) p(H_i).

Normalization step. Transform p∗ to p′ as follows: for each i, p′(H_i) = p∗(H_i) / p(E).

In the first step, the prior plausibility of the hypothesis is combined with the evidential score (i.e. likelihood) of the hypothesis in order to produce an overall judgment of the hypothesis's posterior plausibility. In the second step, the posterior plausibilities of all the hypotheses are rescaled in such a way that they jointly obey the probability axioms, i.e. such that all the posterior plausibility scores fall between 0 and 1, inclusive, and jointly sum to 1.

Bayesian updating is a special case of a much broader class of updating rules that decompose into a combination step and a normalization step. The purpose of the remainder of this paper will be to study this class of updating rules. The combination step requires a combination function, c, that takes as its input a prior probability, p(H), and a set of evidential scores, Ev[E_1|H], Ev[E_2|H, E_1], Ev[E_3|H, E_1, E_2], etc., and that assigns a total score to H, taking into consideration both its prior probability and its performance on the evidence. The normalization step then transforms the scores so that they are jointly probabilistic. In general, then, an updating rule in this class has the following two components:

(Bissiri et al. also give an alternative derivation that does not make this assumption. However, the alternative derivation makes other suspect assumptions. In particular, it assumes that the normalization procedure is multiplicative, which we'll see later in this paper can be put into question.)

(Recall that Bregman divergences play a crucial role in the accuracy argument for probabilism. The justification for the focus on Bregman divergences is their tight connection to strict propriety (see Predd et al. (2009)).)

(If p(E_1, E_2|H) = p(E_1|H) p(E_2|H), we can write p(E_1, E_2|H) = e^(log p(E_1|H) + log p(E_2|H)), i.e. the likelihood is of the form required by Bissiri et al. (2016). But if p(E_1, E_2|H) ≠ p(E_1|H) p(E_2|H), then we cannot write the likelihood in this way.)

Combination step:
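The two-step decomposition can be checked directly in code (a minimal sketch with invented numbers): combining and then normalizing reproduces Bayes's formula.

```python
def bayes_two_step(prior, likelihood):
    # Combination step: p*(H_i) = p(E|H_i) p(H_i)
    p_star = [l * p for l, p in zip(likelihood, prior)]
    # Normalization step: divide each p*(H_i) by p(E) = sum_i p(E|H_i) p(H_i)
    p_of_E = sum(p_star)
    return [ps / p_of_E for ps in p_star]

# Invented numbers: prior (0.7, 0.3), likelihoods (0.1, 0.4).
posterior = bayes_two_step([0.7, 0.3], [0.1, 0.4])
# p* = [0.07, 0.12], p(E) = 0.19, so the posterior is [7/19, 12/19]
```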
For each hypothesis, H_i, a set of evidential scores and a prior probability are combined using some combination function c in order to produce an overall posterior score for H_i.

Normalization step: The posterior scores of all the H_i are transformed using some function N such that they jointly satisfy the probability axioms.

In the next two subsections, the combination step and the normalization step are analyzed in detail. The goal is to show that – given reasonable assumptions – the combination function c and the normalization function N both have a very limited set of possible functional forms.

3.1 The combination step

Let e_1 and e_2 represent the evidential scores of a hypothesis H on some evidence, and let h represent H's prior probability; then there are two candidate forms for the combination function that arguably stand out as being particularly plausible:

Additive combination: c(e_1, e_2, h) = e_1 + e_2 + h

Multiplicative combination: c(e_1, e_2, h) = e_1 · e_2 · h

Note that e_1 and e_2 here may represent either conditional or unconditional evidential scores. For example, e_1 may represent Ev[E_1|H], i.e. the unconditional evidential score of H on E_1, or it may represent Ev[E_1|H, E_2], i.e. the conditional evidential score of H on E_1 given that E_2 has already been taken into account. Note, also, that to say that the combination function is additive or multiplicative is not the same as saying that the evidential measure is additive or multiplicative in the sense that Ev[E_1, E_2|H] = Ev[E_1|H] + Ev[E_2|H] or Ev[E_1, E_2|H] = Ev[E_1|H] · Ev[E_2|H]. The latter assumptions are much stronger, and amount to assuming that E_1 and E_2 are independent conditional on H (relative to the evidential measure Ev).

If we make a few reasonable assumptions, we can prove that the combination function must be multiplicative or additive. First of all, suppose we have evidential scores e_1 and e_2, and a prior probability h. Clearly, the order in which we combine the evidential scores and the prior should not matter for the final result we get. That is not to say that the order in which the evidence is received does not matter; it may.
For example, if we flip a coin and the outcomes are six heads in a row and then six tails in a row, then the order of the outcomes strongly suggests that the outcomes are probabilistically dependent. Nevertheless, the order in which we evaluate the available pieces of evidence in order to produce an overall judgment should not influence the overall judgment at which we arrive. For that reason, the combination function should be commutative: c(e_1, e_2) = c(e_2, e_1). Furthermore, it clearly should not matter whether we first combine e_1 and e_2 and then combine the result of that with e_3, or whether we combine e_2 with e_3 and then combine the result with e_1, or whether we combine all three pieces of evidence at the same time. In other words, c should be associative: c(e_1, c(e_2, e_3)) = c(c(e_1, e_2), e_3) = c(e_1, e_2, e_3).

The final reasonable requirement is more quantitative. Clearly, the impact that e_2 has on H's overall evidential score, after e_1 has already been taken into account, should not depend on the impact that e_1 has on H. That is not to say that a piece of evidence E_1 should not influence the impact that a different piece of evidence E_2 has on H's evidential score; it may well, but if it does, it should do so through Ev[E_2|H, E_1]. A piece of evidence may influence the evidential impact conferred by another piece of evidence, but the evidential scores themselves should not influence each other. In other words, the requirement is that the impact that, for example, e_2 = Ev[E_2|H, E_1] makes on H's total evidential score should not depend on the impact that e_1 = Ev[E_1|H] makes on H's total evidential score, nor vice versa.

Given that we are willing to suppose that the combination function is twice differentiable, the preceding requirement may be naturally formalized as constraints on the partial derivatives of the combination function. Let c(x, y) be the combination function as a function of variables x and y.
Then the impact that the evidential score e_1 makes on H's total evidential score is plausibly the value of the partial derivative of c(x, y) with respect to x, evaluated at x = e_1. If ∂c(x, y)/∂x, evaluated at x = e_1, is a large number, then that means setting x to e_1 makes a large difference to H's overall evidential score; if it is 0, then e_1 makes no difference.

The requirement that the impact that e_1 makes should not depend on the impact that e_2 makes, nor vice versa, for any e_1 and e_2, may then be formalized in terms of a constraint on the higher-order partial derivatives of c, namely that for some constant k the following equation be obeyed:

∂²c(x, y) / ∂x∂y = k

The above equation formalizes the idea that the impact that x makes, i.e. ∂c/∂x, should not depend on the impact that y makes, i.e. ∂c/∂y, where x and y represent any possible evidential scores. We can now show the following (the derivation is in Appendix A):

Characterization of the combination function.
Suppose the combination function, c(x, y), satisfies the following requirements:

1. c is commutative.
2. c is associative.
3. c is twice differentiable.
4. c's partial derivatives satisfy the following equation, for some number k: ∂²c(x, y)/∂x∂y = k.

Then c must have one of the following two forms:

1. If k = 0, then c(x, y) = x + y.
2. If k ≠ 0, then c(x, y) = xy.

Hence, it follows that the combination function must be additive or multiplicative. Of course, this conclusion is only as plausible as the assumptions from which it is derived, and some people may be uncomfortable with some of the assumptions that have been made, in particular the condition on the partial derivatives of the combination function. As it happens, it is possible to derive the conclusion from quite different assumptions. Hence, in order to show the robustness of the conclusion, I provide an alternative characterization of the combination function in Appendix F.

3.2 The normalization step

After the combination function has produced a posterior plausibility score, the posterior score must be normalized to be a probability. In theory, normalizing a set of numbers means transforming the numbers in such a way that they are all between 0 and 1 and jointly sum to 1, while at the same time retaining as much of their internal structure as possible. In practice, this means that the most extreme numbers in the set may be forced to take the value 0, while the remaining numbers in the set are rescaled by some function, f. In other words, normalization in general takes the following functional form:

N(x) = 0 if x is sufficiently low; N(x) = f(x) otherwise.    (3.2)

For example, in the normalization step of standard Bayesian updating, N(x) = f(x) (i.e. no non-zero numbers are normalized to 0), and if the set to be normalized is {a_1, a_2, ..., a_n}, then f(x) = x / Σ_i a_i.
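Returning to the characterization of the combination function, its two admissible forms can be checked numerically (an illustrative sketch): a finite-difference estimate of the mixed second partial is constantly 1 for c(x, y) = xy and constantly 0 for c(x, y) = x + y, and both forms are commutative.

```python
def mixed_partial(c, x, y, h=1e-4):
    """Central finite-difference estimate of d^2 c / (dx dy) at (x, y)."""
    return (c(x + h, y + h) - c(x + h, y - h)
            - c(x - h, y + h) + c(x - h, y - h)) / (4 * h * h)

mult = lambda x, y: x * y   # multiplicative combination
add = lambda x, y: x + y    # additive combination

for x in (0.2, 1.0, 3.5):
    for y in (0.4, 2.0):
        assert abs(mixed_partial(mult, x, y) - 1.0) < 1e-6  # k = 1 everywhere
        assert abs(mixed_partial(add, x, y) - 0.0) < 1e-6   # k = 0 everywhere
        assert mult(x, y) == mult(y, x) and add(x, y) == add(y, x)
```

A combination function whose mixed partial varied with x or y would make the impact of one evidential score depend on the other, which is what requirement 4 rules out.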
Note that both N and f are relative to the set that is being normalized; hence, if we need to be precise, we should write N_S and f_S, where the subscript indicates the set that is being normalized. Nevertheless, I will typically leave off the subscripts in order to avoid clutter.

Clearly, f should be a one-to-one function. Indeed, except in the case where x and y are both normalized to 0, it should be the case that if x < y then f(x) < f(y). Furthermore, it is clear that the function f ought to commute with the combination function. Suppose we have scores e_1, e_2, and h. Then we should arrive at the same posterior probability regardless of whether we do either of the following: first we combine h and e_1, normalize, then combine the normalized result with e_2 and normalize again; or we first combine h and e_2, normalize, and then combine that normalized result with e_1 before normalizing again. In symbols, we require, for all possible scores x, y, and z, that: f(c(x, f(c(y, z)))) = f(c(f(c(x, y)), z)). The justification for this requirement is, again, that the order in which we evaluate our evidence – which is arbitrary – should not have an influence on our final judgment. By combining just the preceding two requirements, we can show the following:

Characterization of the normalization procedure.
Suppose we have a normalization procedure as in (3.2) that satisfies the following requirements:

1. f commutes with the combination function c: for all x, y, and z, f(c(x, f(c(y, z)))) = f(c(f(c(x, y)), z)).
2. f is one-to-one: for all x and y, f(x) = f(y) if and only if x = y.

Then the normalization process must have one of the following forms, for some constant k that depends on the set, S, of numbers being normalized:

1. If the combination function is multiplicative, then, for all x in S, f(x) = k·x.
2. If the combination function is additive, then, for all x in S, f(x) = x + k.

The proof, which again is straightforward, is in Appendix B.

3.3 Characterizations of inferential and predictive updating

The results so far show that any updating procedure needs to have either (1) a multiplicative combination step and a multiplicative normalization step, or (2) an additive combination step and an additive normalization step. Call an updating procedure that satisfies either (1) or (2) a legitimate updating procedure.

To characterize inferential updating, we now introduce the following principle:
Regularity: No hypothesis is ever conclusively ruled out by any evidence unless the evidence logically refutes the hypothesis; i.e., the posterior probability of any hypothesis is always greater than 0.

We can then show the following (see Appendix C):
Characterization of inferential updating. The only legitimate updating procedure that satisfies Regularity is inferential updating. I.e., given evidential measure Ev and prior probability function p, update p to the posterior p_E by way of the following formula:

p_E(H) = Ev[E|H] p(H) / Σ_i Ev[E|H_i] p(H_i)

Inferential updating satisfies Regularity; it will never result in any hypothesis having a posterior probability of 0. On the other hand, in Appendix C, I show that an updating procedure that uses an additive combination function and an additive normalization function must violate Regularity; most of the time, any such updating rule must assign a posterior probability of 0 to some hypotheses. But this does not mean that such an updating rule should never be used. As we will see in the next section, sometimes we may want to be able to exclude certain hypotheses from consideration, i.e., assign them a posterior probability of 0.

Nevertheless, we do not want to exclude more hypotheses than is warranted by the data. The updating procedure ought to be conservative and exclude as few hypotheses as possible at every step. In other words, any updating procedure that violates Regularity should plausibly still satisfy the following principle:

(Note that not every updating rule that has been suggested in the literature is legitimate in this sense of the word. For example, Douven and Wenmackers (2017) consider a rule according to which p_E(H) = c·(p(H)·p(E|H) + f(E, H)), where c is a normalization constant and f(E, H) is a "bonus" assigned to H in case H is the best explanation of E. This updating rule is not legitimate because it is neither purely additive nor purely multiplicative. On the other hand, the class of rules considered in Douven (2016) is legitimate.)
Conservativeness: The updating procedure assigns a posterior probability of 0 to as few hypotheses as possible, given the combination function, the normalization procedure, and the evidence available.
We are now in a position to characterize predictive updating:
Characterization of predictive updating.
The only legitimate updating procedure that violates Regularity, but satisfies Conservativeness, is predictive updating. I.e., given evidential measure Ev and prior probability function p, update p to the posterior p_E by way of the following procedure: Step 1.
For each i, calculate q(H_i) = p(H_i) + Ev[E | H_i]. Step 2.
Transform q to p_E as follows: for each i, p_E(H_i) = 0 or p_E(H_i) = q(H_i) + d, where d is the unique number such that d is minimal and, for all i, p_E(H_i) ≥ 0 and Σ_i p_E(H_i) = 1.
Inferential updating and predictive updating differ in that the former updating rule obeys Regularity while the latter rule does not. Is Regularity a reasonable constraint? In some contexts it is, but in others it is not. Suppose our main priority is to identify the hypothesis that is true or (if none of the hypotheses is true) the hypothesis that is closest to the truth according to some appropriate measure of closeness to the truth. Given this goal, it is reasonable to be risk-averse and open-minded: we do not want to rule out any hypothesis as potentially being the hypothesis that is true. Even if a lot of evidence strongly suggests that a hypothesis is false, there is always the possibility that the evidence is unrepresentative or misleading. And so Regularity is a reasonable constraint in this context.
However, suppose we do not care about which of our hypotheses is true or closest to the truth; our goal is not inferential, but predictive. We wish to find, as efficiently as possible, the subset of hypotheses that can be expected to be as predictively accurate as possible. In this context, there is no theoretical justification for requiring that the updating rule obey Regularity; on the contrary, there are good reasons why we might want an updating rule that violates Regularity. In particular, suppose the posterior distribution will be used in order to make a weighted probabilistic prediction, i.e. the goal is for Σ_i p(D | H_i) p_E(H_i) to be as accurate on future data D as possible.
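As an illustration, the two updating rules just characterized can be sketched in a few lines of Python. The priors and evidential scores below are made-up toy numbers, and the helper names are my own; this is a sketch, not the paper's official algorithm.

```python
def inferential_update(priors, ev):
    """Inferential updating: p_E(H_i) is proportional to Ev[E|H_i] * p(H_i)."""
    w = [e * p for e, p in zip(ev, priors)]
    z = sum(w)
    return [x / z for x in w]

def predictive_update(priors, ev):
    """Predictive updating.

    Step 1: q_i = p(H_i) + Ev[E|H_i].
    Step 2: zero out as few hypotheses as possible and shift the rest
    by the minimal additive constant d so that the result is a
    probability distribution (Conservativeness).
    """
    q = [p + e for p, e in zip(priors, ev)]
    order = sorted(range(len(q)), key=lambda i: q[i])  # ascending in q
    for m in range(len(q)):  # m = number of hypotheses assigned probability 0
        kept = order[m:]
        d = (1 - sum(q[i] for i in kept)) / len(kept)
        if all(q[i] + d >= 0 for i in kept):  # valid distribution found
            post = [0.0] * len(q)
            for i in kept:
                post[i] = q[i] + d
            return post

# toy example: four hypotheses, one with a very poor evidential score
priors = [0.25, 0.25, 0.25, 0.25]
scores = [-0.5, 0.1, 0.2, 0.2]
print(predictive_update(priors, scores))  # first hypothesis is excluded
```

Note how the loop realizes Conservativeness directly: it tries to exclude zero hypotheses first, and only zeroes out more when no valid distribution results.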
In that case, it would seem inadvisable to assign positive probability to any hypothesis that has shown itself to be very predictively inaccurate, since the predictions made by such a hypothesis would likely throw off the weighted prediction. On the other hand, we do not want to go to the opposite extreme and base the prediction on the single hypothesis that has performed best on the evidence, as that is liable to lead to overfitting (Forster and Sober, 1994). Predictive updating enables one to set the probabilities of predictively inaccurate hypotheses to 0 in a principled (and conservative) way.
Let's consider a specific example. When the hypotheses under consideration make probabilistic predictions and the goal is maximal predictive accuracy, it is natural to use a strictly proper scoring rule as the measure of evidential favoring (Gneiting and Raftery, 2007). For various reasons, the most popular scoring rule in applied research is probably the Continuous Ranked Probability Score (CRPS). Suppose we have a set of competing statistical models M_1, M_2, etc., and for each model, let p_{M_i} be the marginal (cumulative) probability forecast distribution corresponding to M_i. Suppose, moreover, that p_{M_i} has finite first moment, that X and X′ are independent and identically distributed random variables that follow the distribution of p_{M_i}, and that x is the actual observed outcome. Then the CRPS can be written in the following way (where the expectations are taken relative to p_{M_i}):

CRPS(p_{M_i}, x) = E|X − x| − (1/2) E|X − X′|   (4.1)

As (4.1) makes clear, the CRPS is a statistical generalization of absolute error. As Gneiting and Raftery (2007) point out, a significant benefit of the CRPS is that it is easily interpretable, since the outputs of (4.1) can be reported in the same units as the measurements. For example, suppose the measurements are in terms of meters. Then the CRPS score of a model on an observation will be a representation of how many meters inaccurate the model's predictions are of that observation, on average (since the prediction is a probability distribution rather than a single number, the average is needed).
If we let Ev[x | p_{M_i}] = a ∗ CRPS(p_{M_i}, x), where a is some constant, and assign prior probabilities to all the models, then predictive updating can be used to assign posterior probabilities to all the models. Importantly, given sufficient evidence (and depending on how a is chosen), many of the models will receive a posterior probability of 0. These posterior probabilities can then be used for model selection or for making a weighted prediction using all the models. Of course, it is an empirical question whether predictive updating is better (for predictive purposes) than inferential updating (including standard Bayesian updating). An empirical evaluation of predictive updating will have to wait for a different occasion, however. In this section I have simply tried to suggest one concrete way in which predictive updating may be implemented.
If the models contain parameters, then the probability distributions over those parameters may be updated using either inferential or predictive updating.
As was already mentioned in the introduction to the paper, standard Bayesian updating is clearly a special case of inferential updating: more precisely, we get Bayesian updating if and only if Ev[E | H] ∝ p(E | H), i.e. if and only if the evidential measure is proportional to the likelihood.
What Vassend (2019a) calls "quasi-Bayesian updating" is also a special case of inferential updating; indeed, quasi-Bayesian updating is simply inferential updating with an evidential measure that has been suitably calibrated to a verisimilitude measure. Similarly, Douven's (2016) IBE-based updating rule is also clearly a kind of inferential updating.
Perhaps more interestingly, Bissiri et al.'s (2016) general Bayesian updating is also a special case of inferential updating. More precisely, we have:
General Bayesian updating is a special case of inferential updating.
Suppose the evidential measure Ev is a strictly decreasing function f of some loss function, L(E, H), such that for all E_1 and E_2, Ev satisfies the following conditions:
1. Ev[E_2 | H, E_1] = Ev[E_2 | H] = f(L(E_2, H)).
2. Ev[E_1, E_2 | H] = f(L(E_1, H) + L(E_2, H)).
Then inferential updating has the following form:

p(H | E) = e^{−c L(E,H)} p(H) / Σ_i e^{−c L(E,H_i)} p(H_i)

for some constant c.
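The additivity condition in the theorem is what makes the exponential form work: updating on E_1 and then on E_2 coincides with a one-shot update on the summed loss. A quick numerical check of this (the losses, prior, and constant c below are made-up illustrative numbers):

```python
import math

def inferential_update(priors, ev_scores):
    # p_E(H_i) is proportional to Ev[E|H_i] * p(H_i)
    w = [ev * p for ev, p in zip(ev_scores, priors)]
    z = sum(w)
    return [x / z for x in w]

c = 2.0
priors = [0.5, 0.3, 0.2]   # hypothetical prior over three hypotheses
L1 = [0.2, 1.5, 0.7]       # losses on evidence E1
L2 = [0.9, 0.1, 0.4]       # losses on evidence E2

# sequential updating: first on E1, then on E2
step1 = inferential_update(priors, [math.exp(-c * l) for l in L1])
seq = inferential_update(step1, [math.exp(-c * l) for l in L2])

# one-shot general Bayesian update on (E1, E2) with the summed loss
joint = inferential_update(priors, [math.exp(-c * (a + b)) for a, b in zip(L1, L2)])

assert all(abs(x - y) < 1e-12 for x, y in zip(seq, joint))
```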
A sketch of the proof, which is straightforward, is given in Appendix E. Although general Bayesian updating is a special case of inferential updating, the reverse is not the case because – as was previously mentioned – many reasonable evidential measures cannot be written as a function of an additive loss function. Suppose, for example, that the hypotheses under consideration are real-valued functions, f_i, and that the evidential measure is of the form Ev[(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n) | f_i] = Minimum(|y_1 − f_i(x_1)|, |y_2 − f_i(x_2)|, . . . , |y_n − f_i(x_n)|). It is clear in this case that the evidential measure cannot be written as a function of an additive loss function, simply because the Minimum operator is not additive.
A diagram depicting the relationship between inferential updating, predictive updating, and various updating rules that have been suggested in the literature is given in Figure 1.
Figure 1: Overview of various updating rules. (The diagram divides legitimate updating rules into inferential updating – with standard Bayesian, quasi-Bayesian (Vassend, 2019a), IBE-based (Douven, 2016), and general Bayesian (Bissiri et al., 2016) updating as special cases – and predictive updating.)
The primary purpose of this paper has been to justify a set of very general synchronic and diachronic inductive norms. The resulting normative framework can be put to both philosophical and scientific use. In philosophy of science, a standard way of analyzing scientific methodology is by seeing whether the methodology makes sense from a Bayesian perspective. For example, in this way, Sober (2015) analyzes parsimony inference, Dawid et al. (2015) analyze no-alternatives arguments in physics,
Sober uses a likelihoodist approach, which is Bayesianism without the priors.
References
Aczél, J. (2006). Lectures on Functional Equations and Their Applications. Dover Books on Mathematics. Dover Publications.
Amari, S.-I. (2009). alpha-Divergence is Unique, Belonging to Both f-Divergence and Bregman Divergence Classes. IEEE Transactions on Information Theory 55(11), 4925–4931.
Bernardo, J. M. and A. F. M. Smith (1994). Bayesian Theory. Wiley, New York, NY.
Bissiri, P. G., C. Holmes, and S. Walker (2016). A General Framework for Updating Belief Distributions. Journal of the Royal Statistical Society. Series B (Methodological) 78(5), 1103–1130.
Box, G. E. P. (1980). Sampling and Bayes' Inference in Scientific Modelling and Robustness. Journal of the Royal Statistical Society. Series A (General) 143(4), 383–430.
Dawid, R., S. Hartmann, and J. Sprenger (2015). The No Alternatives Argument. British Journal for the Philosophy of Science 66(1), 213–234.
Douven, I. (2016). Explanation, Updating, and Accuracy. Journal of Cognitive Psychology 28(8), 1004–1012.
Douven, I. and S. Wenmackers (2017). Inference to the Best Explanation versus Bayes's Rule in a Social Setting. British Journal for the Philosophy of Science 68(2), 535–570.
Forster, M. R. (1995). Bayes and Bust: Simplicity as a Problem for a Probabilist's Approach to Confirmation. British Journal for the Philosophy of Science 46(3), 399–424.
Forster, M. R. and E. Sober (1994). How To Tell When Simpler, More Unified, or Less Ad Hoc Theories Will Provide More Accurate Predictions. The British Journal for the Philosophy of Science 45(1), 1–35.
Gelman, A. and C. R. Shalizi (2013). Philosophy and the Practice of Bayesian Statistics. British Journal of Mathematical and Statistical Psychology 66, 8–38.
Gneiting, T. and A. E. Raftery (2007). Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association 102(477), 359–378.
Greaves, H. and D. Wallace (2006). Justifying Conditionalization: Conditionalization Maximizes Epistemic Utility. Mind 115(459), 607–632.
Grünwald, P. and T. van Ommen (2017). Inconsistency of Bayesian Inference for Misspecified Linear Models, and a Proposal for Repairing It. Bayesian Analysis 12(4), 1069–1103.
Jeffrey, R. (1983). The Logic of Decision (Second ed.). Cambridge University Press, Cambridge.
Joyce, J. (1998). A Non-Pragmatic Vindication of Probabilism. Philosophy of Science 65(4), 575–603.
Joyce, J. (2009). Accuracy and Coherence: Prospects for an Alethic Epistemology of Partial Belief. In F. Huber and C. Schmidt-Petri (Eds.), Degrees of Belief. Synthese.
Key, J. T., L. R. Pericchi, and A. F. M. Smith (1999). Bayesian Model Choice: What and Why? In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith (Eds.), Bayesian Statistics 6, pp. 343–370. Oxford: Oxford University Press.
Kopytov, V. M. and N. Y. Medvedev (1996). Right-Ordered Groups. Siberian School of Algebra and Logic. Springer.
Leitgeb, H. and R. Pettigrew (2010). An Objective Justification of Bayesianism II: The Consequences of Minimizing Inaccuracy. Philosophy of Science 77, 236–272.
Levinstein, B. A. (2012). Leitgeb and Pettigrew on Accuracy and Updating. Philosophy of Science 79(3), 413–424.
Myrvold, W. (2016). On the Evidential Import of Unification. Unpublished manuscript.
Pettigrew, R. (2016). Accuracy and the Laws of Credence. Oxford University Press.
Predd, J. B., R. Seiringer, E. H. Lieb, D. N. Osherson, H. V. Poor, and S. R. Kulkarni (2009). Probabilistic Coherence and Proper Scoring Rules. IEEE Transactions on Information Theory 55(10), 4786–4792.
Schupbach, J. N. (2018). Robustness Analysis as Explanatory Reasoning. British Journal for the Philosophy of Science 69(1), 275–300.
Shaffer, M. J. (2001). Bayesian Confirmation of Theories That Incorporate Idealizations. Philosophy of Science 68(1), 36–52.
Sober, E. (2015). Ockham's Razors: A User's Manual. Cambridge University Press.
Sprenger, J. (2009). Statistics Between Inductive Logic and Empirical Science. Journal of Applied Logic 7(2), 239–250.
Sprenger, J. (forthcoming). Conditional Degree of Belief. To appear in Philosophy of Science.
Tanton, J. (2005). Encyclopedia of Mathematics. Science Encyclopedia. Facts on File.
Vassend, O. B. (2019a). A Verisimilitude Framework for Inductive Inference, with an Application to Phylogenetics. To appear in British Journal for the Philosophy of Science.
Vassend, O. B. (2019b). New Semantics for Bayesian Inference: The Interpretive Problem and Its Solutions. To appear in Philosophy of Science.
Walker, S. G. (2013). Bayesian Inference with Misspecified Models. Journal of Statistical Planning and Inference 143(10), 1621–1633.
Zhang, T. (2006). From e-Entropy to KL-Entropy: Analysis of Minimum Information Complexity Density Estimation. The Annals of Statistics 34(5), 2180–2210.
A Characterization of the combination function
The goal of this section is to show the characterization of the combination function in Section 3.1. There are two cases to consider: k = 0 and k ≠ 0. Since the two cases are very similar, I will only consider the case where k ≠ 0. So suppose that for some non-zero k, we have:

∂²c(x, y)/∂x∂y = k   (A.1)

Taking the antiderivative with respect to x, it follows that:

∂c(x, y)/∂y = kx + C(y) + D   (A.2)

where C(y) is a function of y, but not x, and D is some real number. Taking the antiderivative of (A.2) with respect to y, we get:

c(x, y) = kxy + ∫ C(y) dy + Dy + G(x) + F   (A.3)

where G is a function of x and F is some real number. Moreover, exchanging the labels x and y in (A.3) gives us:

c(y, x) = kyx + ∫ C(x) dx + Dx + G(y) + F   (A.4)

But since c(x, y) = c(y, x), (A.3) and (A.4) must be equal, which means that kxy + ∫ C(y) dy + Dy + G(x) + F = kxy + ∫ C(x) dx + Dx + G(y) + F, and hence ∫ C(y) dy + Dy + G(x) = ∫ C(x) dx + Dx + G(y). Rearranging, we get:

G(x) = ∫ C(x) dx + Dx + G(y) − ∫ C(y) dy − Dy   (A.5)

But since G(x) does not depend on y, the only way for (A.5) to be true is if G(y) − ∫ C(y) dy − Dy is equal to some constant number, c. Hence, ∫ C(y) dy + Dy = G(y) − c. Plugging this back into (A.3) (and absorbing the constant c into F), we get:

c(x, y) = kxy + G(x) + G(y) + F   (A.6)

Without loss of generality, we may assume that G(0) = 0, because if G(0) = A for some non-zero A, then we can just put G′(x) = G(x) − A and F′ = F + 2A, and we get: c(x, y) = kxy + G′(x) + G′(y) + F′, with G′(0) = 0 (i.e.
we simply absorb the constant A into F′).
Now the fact that c is associative and commutative means that c(c(x, y), z) = c(c(y, z), x), and hence (A.6) implies that, for all x, y, and z:

k(kxy + G(x) + G(y) + F)z + G(kxy + G(x) + G(y) + F) + G(z) + F = k(kyz + G(y) + G(z) + F)x + G(kyz + G(y) + G(z) + F) + G(x) + F   (A.7)

Simplifying, we have:

[G(x) + G(y) + F]kz + G[kxy + G(x) + G(y) + F] + G(z) = G(y)kx + G(z)kx + Fkx + G[kyz + G(y) + G(z) + F] + G(x)   (A.8)

Note that because c is twice differentiable, so is G. Taking the derivative of each side of (A.8) with respect to z gives:

[G(x) + G(y) + F]k + ∂G(z)/∂z = ∂G(z)/∂z · kx + G′[kyz + G(y) + G(z) + F] ∗ ∂G(z)/∂z   (A.9)

Next, taking the derivative of each side of (A.9) with respect to x gives:

∂G(x)/∂x · k = ∂G(z)/∂z · k   (A.10)

Hence, since k ≠ 0, it follows that ∂G(x)/∂x = ∂G(z)/∂z. But since G(x) does not depend on z and G(z) does not depend on x, this means that ∂G(x)/∂x must be a constant number, i.e. ∂G(x)/∂x = a for some constant a. Since we are assuming that G(0) = 0, it follows that G(x) = ax. Next, the fact that c(x, y, z) = c(c(x, y), z) implies:

kxyz + ax + ay + az + F = k(kxy + ax + ay + F)z + a(kxy + ax + ay + F) + az + F   (A.11)

Comparing the terms that contain xyz, we see that k = 1, and hence:

ax + ay = axz + ayz + Fz + axy + a²x + a²y + Fa   (A.12)

Comparing the terms that contain z, we see that a(x + y) + F = 0 for all x and y. The only way this can be true is if a = F = 0. Hence we have, finally, that c(x, y) = xy.
B Characterization of the normalization step
The goal of this section is to show the characterization of the normalization step in Section 3.2. Let {a_i} be an arbitrary set of n numbers, S_1, with normalization function f_{S_1}. Consider the set S_1 = {a_i} and the set S_2 = {1, . . . , 1}, which consists of n copies of 1. Then condition (1) implies that, for all i, f(c(f(c(a_i, a_i)), 1)) = f(c(a_i, f(c(a_i, 1)))), where the f's are relative to the relevant sets. For example, in f(c(a_i, a_i)), f is a rescaling function defined on the set {c(a_i, a_i)}. Note that we are abusing notation here: strictly speaking the various f's are not the same function, since they are defined over different sets. However, to avoid needless clutter, I use f without subscripts.
According to the characterization of the combination function, the combination function is either multiplicative or additive. Since the derivations are very similar, I will only show that the normalization function must be multiplicative given that the combination function is multiplicative. So suppose that the combination function is c(a, b) = ab. Then we get: f(f(a_i ∗ a_i) ∗ 1) = f(a_i ∗ f(a_i ∗ 1)), and hence f(f(1)) = f(a_i⁻¹ ∗ f(a_i)), i.e. f(a_i⁻¹ ∗ f(a_i)) is a constant. But since f is one-to-one, that means a_i⁻¹ ∗ f(a_i) must also be a constant. That is, there exists a constant k such that, for all a_i in S_1, f(a_i)/a_i = k. Hence f(a_i) = k ∗ a_i for all a_i. Since S_1 was arbitrary, the normalization function is multiplicative.
Which we can do, as before, by successively differentiating with respect to x, y, and z. This proof method is sometimes called "equating coefficients" (Tanton, 2005, p. 169).
C Characterization of inferential updating
The goal in this section is to show that the only legitimate updating rule that satisfies Regularity is inferential updating. According to the results in sections 3.1 and 3.2, any legitimate updating rule must either have (1) a multiplicative combination step and a multiplicative normalization step, or (2) an additive combination step and an additive normalization step. It is easy to show that it is possible for an updating rule that satisfies (1) to satisfy Regularity, and that – indeed – the resulting updating rule is inferential updating. In order to show that inferential updating is the only updating rule that satisfies Regularity, it suffices to show that there is no updating rule satisfying (2) that also satisfies Regularity.
Suppose, for the sake of contradiction, that there is some updating rule that satisfies both (2) and Regularity. In order for Regularity to be obeyed, it has to be the case that given any set of non-zero prior probabilities over a set of hypotheses, h_1, h_2, . . . , h_n, and given any set of evidential scores for the hypotheses, e_1, e_2, . . . , e_n, the posteriors are also all non-zero. Thus, if N is the normalization function, then the following must be true for all h_i: N(e_i + h_i) > 0. Since the normalization step is additive, this means that, for all i, where d is an additive normalization constant:

e_i + h_i + d > 0   (C.2)

Moreover, the posteriors must sum to 1:

Σ_i (e_i + h_i + d) = 1   (C.3)

And therefore (using the fact that the priors sum to 1), d = −(1/n) Σ e_i. And so we have, for all h_i:

e_i + h_i − (1/n) Σ e_i > 0   (C.4)

Now suppose the e_i are not all equal, and suppose e_1 is the smallest e_i. Then r = e_1 − (1/n) Σ e_i < 0. Now suppose it's also the case that h_1 < −r. Then we have:

e_1 + h_1 − (1/n) Σ e_i = r + h_1 < 0

which contradicts (C.4). Hence no updating rule satisfying (2) can satisfy Regularity.
D Characterization of predictive updating
The goal in this section is to show that the only legitimate updating rule that violates Regularity but satisfies Conservativeness is predictive updating. It is clear that any updating rule that satisfies Conservativeness but violates Regularity must be additive. This is because any multiplicative updating rule that satisfies Conservativeness clearly also satisfies Regularity.
So suppose the updating rule is additive and satisfies Conservativeness. Then the goal is to show that the updating rule must be equivalent to predictive updating. Since the rule is additive, it must have the following form, where p_E is the posterior probability distribution, H_i is a hypothesis, h_i is the prior probability of the hypothesis, e_i is the evidential score of the hypothesis, and d is a normalization constant:

p_E(H_i) = 0 if h_i + e_i + d is sufficiently low; p_E(H_i) = h_i + e_i + d otherwise   (D.1)

If the updating rule is conservative, then as few hypotheses as possible should be assigned a posterior probability of 0. It remains to show that this uniquely happens when d is minimal. Suppose there are n hypotheses. Without loss of generality, suppose the hypotheses are ordered such that p_E(H_1) ≤ p_E(H_2) ≤ . . . ≤ p_E(H_n). Then there is some index m such that p_E(H_i) = 0 for i ≤ m and p_E(H_i) > 0 for i > m. Note that the updating procedure is conservative if and only if m is minimal, because m is minimal if and only if a minimal number of hypotheses have a posterior probability of 0. In order for the posterior probabilities to be probabilistic, we must have:

Σ_i p_E(H_i) = Σ_{i>m} (h_i + e_i) + (n − m)d = 1   (D.2)

Now suppose we have a different updating rule resulting in some posterior p′ that is not conservative: i.e. there is an index m′ > m such that p′_E(H_i) = 0 for i ≤ m′ and p′_E(H_i) > 0 for i > m′. Then p′ must satisfy the following constraint for some normalization constant d′:

Σ_{i>m′} (h_i + e_i) + (n − m′)d′ = 1   (D.3)

Comparing D.2 and D.3 and remembering that m′ > m, we see that:

0 < Σ_{i=m+1}^{m′} (h_i + e_i) = (n − m′)d′ − (n − m)d   (D.4)

And hence,

d < ((n − m′)/(n − m)) d′ < d′   (D.5)

Hence, d < d′. What the above proof shows is that any conservative updating rule has a smaller additive normalization constant than any non-conservative updating rule. To finish the proof, we show that there is just one conservative updating rule. Here we can use D.4 again. If both updating rules are conservative, then we have m = m′, and hence – making the necessary amendments in D.4 – we have:

0 = Σ_{i=m+1}^{m′} (h_i + e_i) = (n − m)d′ − (n − m)d   (D.6)

Hence it follows that d′ = d. But then the two updating rules are equivalent. Hence, there is only one conservative updating rule, namely the one that uses a minimal additive normalization constant. This is predictive updating.
E General Bayesian updating is a special case of inferential updating
The goal in this section is to show that Bissiri et al.'s (2016) general Bayesian updating is a special case of inferential updating. For some normalization constant k, we have:

p(H | E_1, E_2) = k ∗ Ev[E_2 | H, E_1] Ev[E_1 | H] p(H) = k ∗ f(L(E_2, H)) f(L(E_1, H)) p(H)   (E.1)

But we also have:

p(H | E_1, E_2) = k ∗ Ev[E_1, E_2 | H] p(H) = k ∗ f(L(E_1, H) + L(E_2, H)) p(H)   (E.2)

Comparing E.1 and E.2, we see that f obeys the following functional equation for all x and y: f(x)f(y) = f(x + y). Let g(x) = log f(x). Then g(x + y) = g(x) + g(y), which is the well known Cauchy equation whose solution is g(x) = −cx, for some positive constant c (Aczél, 2006, p. 31) (since f, and therefore g, is strictly decreasing). Consequently f(x) = e^{−cx}, and hence p(H | E) = k ∗ e^{−c∗L(E,H)} p(H), which is Bissiri et al.'s (2016) general Bayesian updating rule.
F An alternative characterization of the combination step
In both everyday and scientific contexts, it's common to think of evidence algebraically: multiple lines of evidence combine in order to provide stronger evidence; some evidence favors a hypothesis, while other evidence goes against it; a piece of evidence here can cancel out a piece of evidence there; and some purported evidence has no effect at all. In other words, evidential favoring has all the hallmarks of a mathematical group. Now, suppose – as we have been doing up to now – that we use real numbers to represent evidential scores. Then the set of all possible evidential scores, G, together with the combination function plausibly form a mathematical group. Indeed, they plausibly form an Archimedean group, because intuitively there is no maximal evidential score. That is, if we use • to denote the combination function, i.e. e_1 • e_2 = c(e_1, e_2), then it is plausible that (G, •) satisfies the following axioms:
1. Closure.
For all possible evidential scores e_1 and e_2, e_1 • e_2 is also a possible evidential score.
2. Associativity.
For all possible evidential scores e_1, e_2 and e_3, (e_1 • e_2) • e_3 = e_1 • (e_2 • e_3).
3. Identity.
There exists a possible evidential score i such that for all e, i • e = e • i = e. I.e., there exists a real number that represents evidence that has no effect (either favorable or unfavorable).
4. Inverse.
For each possible evidential score e, there exists a possible evidential score e′ such that e • e′ = e′ • e = i. I.e. every evidential score could potentially (in principle) be cancelled out by other countervailing evidence.
5. Commutativity.
For all possible evidential scores e_1 and e_2, e_1 • e_2 = e_2 • e_1. I.e. the order in which the evidence is considered is irrelevant.
6. Archimedean property.
For all possible evidential scores e_1 and e_2, there exists an integer n such that e_1 < e_2 • e_2 • . . . • e_2 (n times).
Suppose, in addition, that the set of evidential scores is totally ordered: for all evidential scores e_1 and e_2, either e_1 > e_2 or e_1 ≤ e_2. Then we can use the following important result from group theory (see Kopytov and Medvedev, 1996, p. 33, for a proof):
Hölder's theorem.
Every Archimedean totally ordered group is order-isomorphic to a subgroup of the additive group of real numbers with the natural order.
The fact that (G, •) is order-isomorphic to a subgroup of the additive group of real numbers with the natural order means there exists some subgroup, (S, +), of the real numbers and a one-to-one function, g, from (G, •) to (S, +) that obeys the following equation for all e_1 and e_2 in G: g(e_1 • e_2) = g(e_1) + g(e_2). Since g is one-to-one, it has an inverse, f. Hence, for all e_1 and e_2 in G, we can write: e_1 • e_2 = f(g(e_1) + g(e_2)).
In the main text, I showed that the normalization procedure must be either additive or multiplicative, given that the combination function is either multiplicative or additive. But, arguably, it is not unreasonable to simply assume that the normalization must be either multiplicative or additive. Indeed, all updating rules that have been proposed in the literature have implicitly relied on a normalization procedure that is either multiplicative or additive. In particular, the normalization procedure implicit in both standard Bayesian updating and Jeffrey updating (Jeffrey, 1983) is multiplicative, and the normalization procedure implicit in Leitgeb and Pettigrew's (2010) alternative to Jeffrey updating is additive.
Finally, it is reasonable to assume – as we did in the main text – that the normalization procedure commutes with the combination function in the sense that, for all a and b, we have: N(a • N(b)) = N(N(a) • b) = N(a • b). We can now give the following characterization of the combination function:
A referee points out that this is a bit of an idealization, since a piece of evidence and a defeater of that evidence will not typically cancel each other out precisely. A referee rightly points out that this assumption is also idealized.
Alternative characterization of the combination function. Suppose the combination function, c(x, y), satisfies the following requirements:
1.
The set of all evidential scores, G, and the combination function c(x, y) = x • y together form a totally ordered Archimedean group.
2. The combination function commutes with the normalization function N in the sense that, for all a and b: N(a • N(b)) = N(N(a) • b) = N(a • b).
Then c must have one of the following two forms:
1. If the normalization function is additive, then c(x, y) = x + y.
2. If the normalization function is multiplicative, then c(x, y) = xy.
Proof.
The fact that the combination function commutes with the normalization function implies that, for every e with inverse e⁻¹:

N(e • e⁻¹) = N(N(e) • e⁻¹) = N(f(g(N(e)) + g(e⁻¹)))   (F.1)

Therefore, for all e, N(f(g(N(e)) + g(e⁻¹))) = N(i), where i is the identity element of the group. Since N is one-to-one, this means that f(g(N(e)) + g(e⁻¹)) = k, for some constant k that does not depend on e. Furthermore, since f is one-to-one, this in turn implies that g(N(e)) + g(e⁻¹) = k′, for some constant k′ that does not depend on e. For the same reason, (F.1) also implies that g(e) + g(e⁻¹) = k″, for some constant k″ that does not depend on e. Hence we have, finally, that g(N(e)) − g(e) = K, where K = k′ − k″. Hence, g(N(e)) = g(e) + K.
If the normalization procedure is multiplicative, then for some normalization constant a, we have g(ae) = g(e) + K. Note that a depends on the set to which e belongs. If {e_i} is the set, then

a = 1 / Σ e_i   (F.2)

Hence, depending on the other members of the set to which e belongs, a can be any number in the half-open interval (0, 1/e]. Thus we have, for all e and all a in (0, 1/e], that g(ae) = g(e) + K, where K is a constant that may depend on a, but does not depend on e.
Similarly, we have – for some normalization constant b – that g(bae) = g(ae) + K′ = g(e) + K″. Here, b can be any number in the range (0, 1/(ae)], or in other words in (0, ∞). But if we let y = ab and x = e, then the preceding means that for all x and y in (0, ∞) we have:

g(yx) = g(x) + K″   (F.3)

where K″ depends on y, but not on x. Interchanging the roles of y and x, we also have:

g(xy) = g(y) + K‴   (F.4)

where K‴ depends on x, but not on y. Comparing the above equations, we see that g(x) + K″ = g(y) + K‴. This implies the following:

g(xy) = g(x) + g(y) + C   (F.5)

where C is a constant that depends on neither x nor y.
Now note that f(2g(i)) = i • i = i = f(g(i)). Since f is one-to-one, this implies that g(i) = 0. Next, (F.5) implies that g(i) = g(1 ∗ i) = g(1) + g(i) + C. Thus g(1) = −C. Using (F.5) again, we have g(1) = g(i ∗ i) = g(i) + g(i) = g(i). But since g is one-to-one, this implies that i = 1. Hence −C = g(1) = g(i) = 0, so C = 0. Finally, then, we have, for all x > 0 and y > 0:

g(xy) = g(x) + g(y)   (F.6)

Now put r(x) = g(e^x). Then (F.6) becomes, for all real x and y:

r(x + y) = r(x) + r(y)   (F.7)

This is the Cauchy functional equation, whose only solution is r(x) = cx, for an arbitrary constant c (Aczél, 2006, p. 31). Hence, g(x) = r(log x) = log x^c. Since f is the inverse of g, we have that f(x) = e^{x/c}. Finally, then, we have:

x • y = f(g(x) + g(y)) = e^{(log(x^c) + log(y^c))/c} = e^{(c ∗ log(xy))/c} = xy   (F.8)

I.e. the combination function is multiplicative, c(x, y) = xy.
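As a quick numerical sanity check of (F.8): with g(x) = c · log x and its inverse f(x) = e^{x/c}, the combination e_1 • e_2 = f(g(e_1) + g(e_2)) reduces to ordinary multiplication. The constant c = 3.7 below is an arbitrary made-up choice; any nonzero c works.

```python
import math

c = 3.7  # arbitrary nonzero constant

def g(x):
    return c * math.log(x)   # g(x) = log x^c

def f(x):
    return math.exp(x / c)   # inverse of g

def combine(e1, e2):
    return f(g(e1) + g(e2))  # e1 . e2 = f(g(e1) + g(e2))

# the combination function collapses to multiplication, as (F.8) claims
assert abs(combine(2.0, 5.0) - 10.0) < 1e-9
assert abs(combine(0.3, 0.7) - 0.21) < 1e-9
```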