Quantifying knowledge with a new calculus for belief functions - a generalization of probability theory
arXiv preprint [math.PR]
Timber Kerkvliet and Ronald Meester
VU University Amsterdam
October 10, 2018
Abstract
We first show that there are practical situations, in for instance forensic and gambling settings, in which applying classical probability theory, that is, probability theory based on the axioms of Kolmogorov, is problematic. We then introduce and discuss Shafer belief functions. Technically, Shafer belief functions generalize probability distributions. Philosophically, they pertain to individual or shared knowledge of facts, rather than to facts themselves, and therefore can be interpreted as generalizing epistemic probability, that is, probability theory interpreted epistemologically. Belief functions are more flexible and better suited to deal with certain types of uncertainty than classical probability distributions. We develop a new calculus for belief functions which does not use the much criticized Dempster's rule of combination, by generalizing the classical notions of conditioning and independence in a natural and uncontroversial way. Using this calculus, we explain our rejection of Dempster's rule in detail. We apply the new theory to a number of examples, including a gambling example and an example in a forensic setting. We prove a law of large numbers for belief functions and offer a betting interpretation similar to the Dutch Book Theorem for probability distributions.
Keywords:
Belief functions, Conditioning, Independence, Modeling ignorance, Law of large numbers, Epistemic interpretation, Gambling, Rejection of Dempster's rule, Lack of additivity.
1 Introduction

In many situations, the classical Kolmogorov axioms for probability lead to a very useful theory, with many connections to other branches of mathematics and with numerous important applications. The axioms themselves can be justified in many ways, for instance via a frequentistic interpretation of probabilities. In such a frequentistic interpretation, we take relative frequencies in repeated experiments as the motivation and justification of the axioms. Other justifications for the axioms of Kolmogorov are possible as well, see e.g. [9] and references below.

This is not to say, however, that the Kolmogorov axioms should be the only and exclusive way to deal with uncertainty. Especially when uncertainty is interpreted epistemologically, that is, relating to knowledge of facts rather than to facts themselves, it is not always the case that the classical axiom of additivity adequately describes the situation at hand. For instance, in a legal or forensic setting it has been debated for several decades to what extent the classical theory of probability, and alternatives to it, are useful and/or suitable for assessing the value of evidence, see e.g. [7], [3], [13], [1], [18]. There are a number of aspects of modeling epistemic uncertainty for which the classical approach is problematic, and we start our contribution with a short discussion of these.

First, it has been observed by many that the classical theory cannot distinguish between lack of belief and disbelief. Here, disbelief is associated with evidence indicating the negation of a proposition, whereas lack of belief is associated with not having evidence at all. As Shafer [15] puts it, the classical theory does not allow one to withhold belief from a proposition without according that belief to the negation of the proposition. When we want to apply a theory of probabilities to legal issues, this becomes a relevant issue.
Indeed, if certain exculpatory evidence in a case is dismissed, then this may result in less belief in the innocence of the suspect, but it gives no further indication of guilt.

The second shortcoming of the classical theory is its inability to model ignorance on an individual level in situations where only group information is available. Here is a classical example.
Example 1.1 (The island problem). In the classical version of the island problem (see e.g. [17] and [2]) a crime has been committed on an island, making it a certainty that an inhabitant of the island committed it. In the absence of any further information, the classical point of view is to assign a uniform prior probability over all inhabitants concerning the question of who is the culprit. However, this does not correspond to our knowledge. We know for sure that someone in the population committed the crime, but have no further belief about any individual. It would, therefore, be unreasonable to assign any further individual belief to the guilt of an individual, other than the fact that the population to which he or she belongs receives belief 1. This last fact distinguishes members of the population from individuals outside it. However, the combination of assigning degree of belief 1 to the collection of all inhabitants and 0 to each individual is impossible under the classical axioms of probability, although this may be exactly the prior one wants to impose. We will apply the theory we are about to develop to this example in Section 6.1.
There we will see that using an uninformative prior, which is possible in our theory, leads to a different result than using a uniform prior. The fact that these priors lead to different results confirms that they are really distinct: a uniform prior is not a prior representing ignorance, and using a uniform prior does not lead to the same results as using a prior that does represent ignorance.

This example suggests that the usual additivity, that is, P(A) + P(B) = P(A ∪ B) if A and B are disjoint, is not always desirable when P is interpreted epistemologically, as is often the case in legal or forensic settings. In such a setting, one needs to model uncertainty on the level of the knowledge that one actually has, and there is no reason to suppose that individual or shared knowledge can always be adequately described by a classical underlying probability distribution, known or unknown. We next give a gambling example which is, like Example 1.1, an example of the inability of probability distributions to express ignorance.

Example 1.2. Suppose a fair coin is flipped. However, with probability p > 0, the person flipping the coin gets the opportunity to change the outcome of the flip. In ignorance about the way the person makes his or her decisions, we cannot give a probability distribution describing the outcome of this process. We cannot even assume that there necessarily exists a probability distribution describing the decisions of this person. Our theory will, nevertheless, allow us to make quantitative statements on which we can base gambling strategies in this situation; see Section 6.2.

Already back in the seventies of the previous century, there was an attempt by Shafer [15] to develop a theory of probabilities outside the realm of the axioms of Kolmogorov. He introduced the concept of a belief function, which is a generalization of a probability distribution. Belief functions are not necessarily additive and allow for the flexibility that our examples ask for. Based on this concept, Shafer also introduced a calculus for belief functions centered around the so-called Dempster's rule of combination. However, his attempt has been criticized fiercely (references below), for various and good reasons, and nowadays belief functions are hardly used, if at all, in mainstream applied probability.

In this article, we aim to re-develop the theory of belief functions, using the basic concept of Shafer, but setting up a new calculus without using Dempster's rule. We think that our revision takes away the three reasons why people have rejected Shafer's belief functions before, which we now discuss.

The first important obstacle is reported by Shafer himself in [16]. Probability has a betting interpretation based on the Dutch Book Theorem, which traces back to Ramsey [12] and de Finetti [5]. Shafer writes that many of his critics rejected belief functions because of the lack of a suitable betting interpretation for them. Shafer himself argues in [16] that no such behavioral interpretation is necessary.
We do not have to follow that line of reasoning, because in Section 9 we present a betting interpretation based on a characterization of belief functions (Theorem 9.2), much like the betting interpretation of probability based on the Dutch Book Theorem.

What is probably the biggest concern about Shafer's belief functions, see e.g. [14] and [6], is Dempster's rule of combination. This rule of combination is supposed to describe how different belief functions that are based on 'independent evidence' should be combined into a new one. Let us be clear about this point: we also reject Dempster's rule and the calculus that stems from it. In Section 5 we explain how the rule confuses 'independence' of evidence and 'independence' of phenomena. While the troubled notion of 'independent evidence' has no place in our new theory, we do have a mathematical notion of 'independent phenomena' as a generalization of independence in probability theory (see Section 4). Despite our rejection of Dempster's rule, in the current article we develop a very useful calculus of belief functions without it. The only thing we need is a proper rule for conditioning, and this is much less controversial, if at all. This should take care of the points raised in [14] and [6].

The third concern about belief functions is the question of whether or not they represent knowledge or belief in a meaningful epistemological way. Pearl in [11], for instance, questions whether or not belief functions respect certain 'rules' of reasoning. For the most part, Pearl's criticism does not apply to our theory and how we want to use it, the exception being a point about belief updating.
In Section 3, after we have introduced our rule of conditioning, we address this point.

The theory of belief functions which we are about to re-develop is, foremost, an epistemic theory on the level of individual or shared knowledge. The fact that we undertake this effort implies that we think there are many situations in which classical epistemic interpretations fall short; see our examples above. Our motivation for reviving the theory of belief functions hence lies in the fact that knowledge (or lack thereof) does not always fit into the classical framework, and that the classical axioms of probability theory are not always suitable for an epistemic interpretation. For an overview of classical epistemic interpretations we refer to [8] and the references therein.

On the technical level, belief functions are a generalization of classical probability theory in the sense that any probability distribution is also a belief function. So, if there are reasons to assume that the quantities we want to describe can be adequately modeled by assuming an underlying classical probability distribution, then the theory of belief functions allows for that. In other words, we lose nothing.

For applications of the theory to forensic examples, we refer to our companion paper [10], which has an in-depth discussion of such applications. We will restrict our discussion to finite outcome spaces. In all examples in which we want to apply the theory, the outcome spaces are finite, and it is probably a good idea to study and develop a new theory in the simplest possible setting first anyway.

The current paper is organized as follows. In Section 2 we introduce belief functions and some basic results. In Sections 3 and 4 we develop the backbone of our calculus by discussing the concepts of, respectively, conditioning and independence. In Section 5 we explain why we reject Dempster's rule of combination, using the insights of our own calculus.
In Section 6 we discuss the behavior of the theory in a gambling example and in an example in a forensic setting. After that we discuss the relation between belief functions and a special collection of classical probability distributions in Section 7, a law of large numbers for belief functions in Section 8, and finally, in Section 9, we offer a betting interpretation for belief functions.

It is unavoidable that in the discussion of the basics of the theory there is some overlap with our companion paper [10]. It has been our aim to make both papers self-contained.
2 Belief functions

Let Ω be a finite outcome space. We want to make statements about the elements of Ω in the presence of uncertainty. The classical way to do this is by means of a suitable probability distribution on Ω. A probability distribution assigns a non-negative weight p(ω) to each element ω ∈ Ω in such a way that the total weight is equal to 1. We may, for instance, express our uncertainty about who is the culprit by means of such a probability distribution. The probability that the culprit can be found in a subset A of Ω is then equal to

P(A) := Σ_{ω ∈ A} p(ω).   (2.1)

The probability measure P can be interpreted as epistemic, frequentistic or otherwise, depending on the context and personal taste. The weight p(ω) represents the probability, degree of belief, or confidence in the outcome ω, and P(A) represents our probability, degree of belief, or confidence in an outcome which is contained in A. In classical probability theory, a subset of Ω is also called an event or a hypothesis, and P describes the probability of all such events or hypotheses.

Next we define basic belief assignments and belief functions. The difference between a basic belief assignment and a probability distribution is that the former assigns weights to nonempty subsets of Ω rather than to individual outcomes. We write 2^Ω for the collection of all subsets of Ω.

Definition 2.1. A function m : 2^Ω → [0, 1] is a basic belief assignment if m(∅) = 0 and

Σ_{C ⊆ Ω} m(C) = 1.   (2.2)

Whereas p(ω) represents the probability or confidence in the outcome ω, m(C) represents our confidence in an outcome in C without any further specification of which element of C is the outcome. In slightly different words, we can interpret m(C) as the probability of having knowledge precisely C.

It may appear that there is not much difference between P and m, but in fact there is. The crucial difference between P and m is that the weight of a subset C of Ω is not immediately related to the weights of the elements or subsets of C. For instance, if we have no clue whatsoever about the outcome, that is, if we have no information at all other than that the outcome is in Ω, then we may express this by putting m(Ω) = 1 and m(A) = 0 for all strict subsets A of Ω. If we want to assign belief 1/2, say, to the set {a, b} without making individual statements about a and b, then we can express this with m({a, b}) = 1/2 and m({a}) = m({b}) = 0. It is also possible that a basic belief assignment only assigns positive weight to singletons. In such a case, we are back in the classical situation.

The quantity m(C) is sometimes referred to as the weight of the evidence that points precisely to C. We should view m as the analogue of p in the classical description above. Next we define the analogue of P, which is called a belief function.

We want to quantify how much belief we can assign to a subset A of Ω. To this end, we consider all sets C with C ⊆ A, which are precisely the events whose occurrence implies the occurrence of A. The belief in a set A now is the sum of the weights of all subsets of A. In terms of evidence, the belief in A is the total weight of all evidence which implies A.

Definition 2.2. Given a basic belief assignment m : 2^Ω → [0, 1], the corresponding belief function Bel : 2^Ω → [0, 1] is defined by

Bel(A) := Σ_{C ⊆ A} m(C).   (2.3)

The most natural interpretation of the theory is to see a subset of outcomes as a representation or description of individual or shared knowledge, and that we quantify our knowledge with belief functions. In slightly different words, the belief in A is the probability of having information or evidence which implies the occurrence of A.

The set Ω on the one hand, and singletons on the other, are the extreme states of knowledge, representing respectively ignorance (we do not know anything about the outcome other than that it is in Ω) and complete knowledge (we know which element of Ω is the outcome). The empty set represents knowing a contradiction, which is impossible and hence has probability zero.

Example 2.3 (Probability distributions). Every probability distribution is a belief function. To see this, let P : 2^Ω → [0, 1] be a probability distribution. Set m({ω}) = P({ω}) for all ω ∈ Ω and m(C) = 0 for all C such that |C| > 1. Then we get

Bel(A) = Σ_{a ∈ A} m({a}) = Σ_{a ∈ A} P({a}) = P(A)   (2.4)

for every A ⊆ Ω. Probability distributions are belief functions for which the corresponding basic belief assignment only assigns positive weight to singletons.

If m(C) > 0 for some C with |C| > 1, then Bel is not a probability distribution because it is not additive: for any nonempty, disjoint A, B ⊆ Ω such that A ∪ B = C we find

Bel(A ∪ B) > Bel(A) + Bel(B).   (2.5)

Example 2.4. Suppose we want to state our beliefs about a suspect being guilty or innocent, so Ω = {guilty, innocent} is our outcome space. If the only evidence we have is evidence of weight p that the suspect is innocent, then we have m({innocent}) = p, m({guilty}) = 0 and m(Ω) = 1 − p. Notice that the belief that the suspect is guilty is not equal to 1 minus the belief that the suspect is innocent. The corresponding belief function Bel is given by Bel({guilty}) = 0, Bel({innocent}) = p and Bel(Ω) = 1.

Example 2.5. The function m for which m(Ω) = 1 and m(A) = 0 for all other A ⊆ Ω is a basic belief assignment. The corresponding belief function assigns belief 1 to Ω and belief zero to all strict subsets of Ω. This belief function expresses total ignorance within a given population Ω, except for the fact that the outcome must be in Ω. As such it addresses the problem noticed in Example 1.1.
Example 2.6. With reference to the situation described in Example 1.2, we let Ω = {h, t} be the outcome space of the first croupier, where h stands for 'head' and t for 'tail'. We define the basic belief assignment m : 2^Ω → [0, 1] by m({h}) = m({t}) = (1 − p)/2 and m({h, t}) = p.

There is a natural way to identify a belief function with a collection of probability distributions, namely the collection P_Bel of all probability distributions that one can obtain by distributing a total mass of m(C) over the elements of C, for all C ⊆ Ω. For instance, if m(Ω) = 1, then the corresponding P_Bel consists of all probability distributions on Ω. If Ω = {0, 1} and m({0}) = 1 − m({0, 1}), then P_Bel consists of all probability distributions that assign probability at least m({0}) to 0. It is not difficult to see that

Bel(A) = min_{P ∈ P_Bel} P(A).   (2.6)

Indeed, only sets C that are completely contained in A contribute their mass m(C) to Bel(A), and those are precisely the sets whose mass cannot be moved outside A. This identity might lead to the idea of interpreting P_Bel as describing all the possible underlying probability distributions, one of which is the "correct" one, still keeping the idea that the actual situation is described by an (unknown) probability distribution from the collection P_Bel. It is this interpretation which is criticized in [11]. This interpretation of P_Bel is closely tied to the theory of 'minimum probability' that has been studied, see e.g. [16]. This theory, however, is distinct from our theory. In Section 3, for example, we will see that conditional belief is not equal to the infimum over all conditional probabilities in P_Bel. In Section 7 we discuss the interpretation of P_Bel in more detail.

Whereas belief functions are in general not additive, it follows directly from the definition that they are superadditive, i.e.

Bel(A ∪ B) ≥ Bel(A) + Bel(B)   (2.7)

for all disjoint A, B ⊆ Ω. The following theorem by Shafer [15] shows that belief functions are characterized by a property between additivity and super-additivity.
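Identity (2.6) can be checked numerically on small examples. The sketch below uses the fact that the minimum over P_Bel is attained by moving each weight m(C) entirely to a single element of C (such allocations are the extreme points of P_Bel); the value p = 0.4 for the coin of Example 2.6 is a hypothetical choice:

```python
# A numerical check of (2.6): Bel(A) equals the minimum of P(A) over the
# distributions obtained by allocating each weight m(C) to one element of C.
# Uses the coin bba of Example 2.6 with the hypothetical value p = 0.4.
from itertools import product

def bel(m, A):
    A = frozenset(A)
    return sum(w for C, w in m.items() if C <= A)

def min_prob(m, A):
    """Minimise P(A) over all single-element allocations of each m(C)."""
    A = frozenset(A)
    focal = list(m.items())
    best = float("inf")
    for choice in product(*(sorted(C) for C, _ in focal)):
        best = min(best, sum(w for (C, w), x in zip(focal, choice) if x in A))
    return best

p = 0.4  # hypothetical probability that the flipper may change the outcome
m = {frozenset({"h"}): (1 - p) / 2,
     frozenset({"t"}): (1 - p) / 2,
     frozenset({"h", "t"}): p}

for A in ({"h"}, {"t"}, {"h", "t"}):
    assert abs(bel(m, A) - min_prob(m, A)) < 1e-12
print("Bel(A) = min over P_Bel of P(A) for all tested A")
```

Brute-forcing the allocations is exponential in the number of focal sets, but for the small outcome spaces considered in this paper it is more than adequate.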
Theorem 2.7. A function Bel : 2^Ω → [0, 1] is a belief function if and only if

(B1) Bel(Ω) = 1;

(B2) for all A, B ⊆ Ω we have

Bel(A ∪ B) ≥ Bel(A) + Bel(B) − Bel(A ∩ B).   (2.8)

Remark 2.8. To see that (B2) is stricter than super-additivity, consider the following example. Let Ω = {a, b, c} and set f(Ω) = f({a, b}) = f({b, c}) = 1, f({b}) = 1/2 and f(C) = 0 for all other C. It is easy to check that f is superadditive, but (B2) does not hold since

1 = f(Ω) < f({a, b}) + f({b, c}) − f({b}) = 3/2.   (2.9)

Theorem 2.7 can be used to give an alternative definition of belief functions without deriving them from basic belief assignments. Using the theorem we can directly check whether or not a function f : 2^Ω → [0, 1] is a belief function. We can use the following lemma by Shafer [15] to retrieve the basic belief assignment corresponding to a given belief function.

Lemma 2.9.
Every belief function
Bel : 2^Ω → [0, 1] has a unique corresponding basic belief assignment m : 2^Ω → [0, 1], which is given by

m(A) = Σ_{C ⊆ A} (−1)^{|A|−|C|} Bel(C).   (2.10)

3 Conditioning

We have discussed the mathematical definitions and basic properties of belief functions. The next thing on the agenda is to investigate how belief functions change when additional or new information is provided. This is akin to the classical situation in which a prior probability is updated into a posterior one, based on additional information. In this section we explain how this works in our setting. The first thing to do is to determine how a belief function changes under additional information, or under a certain hypothesis. This means we need to understand how conditioning works in our context.

The rule we propose for conditioning is described as follows. Suppose we have a basic belief assignment m and corresponding belief function Bel. We want to condition on an event H. The weight of the evidence m(A) for A now becomes weight of evidence for A ∩ H, if A is consistent with H in the sense that A ∩ H ≠ ∅. If A ∩ H = ∅, then the new weight of evidence for A becomes zero. Next we rescale the weights of the evidence in such a way that the weights again sum up to 1. This can of course only be done if there is evidence with positive weight that is consistent with H. This leads to the following definition.

Definition 3.1.
Let m : 2^Ω → [0, 1] be a basic belief assignment and Bel the corresponding belief function. For H ⊆ Ω such that Bel(H^c) < 1 we define the conditional basic belief assignment m_H : 2^Ω → [0, 1] by

m_H(A) := Σ_{B : B∩H=A} m(B) / (1 − Σ_{B : B∩H=∅} m(B))   (3.1)

for A ≠ ∅, and m_H(∅) = 0.

The corresponding conditional belief function Bel_H can now be obtained in the obvious way from the basic belief assignment m_H, as follows:

Bel_H(A) = Σ_{B ⊆ A} m_H(B)
         = Σ_{C : ∅ ≠ C∩H ⊆ A} m(C) / (1 − Σ_{C : C∩H=∅} m(C))
         = (Σ_{C ⊆ A∪H^c} m(C) − Σ_{C ⊆ H^c} m(C)) / (1 − Σ_{C ⊆ H^c} m(C))
         = (Bel(A ∪ H^c) − Bel(H^c)) / (1 − Bel(H^c))   (3.2)

for all A and H such that Bel(H^c) < 1.

Readers familiar with the work of Shafer [15] will notice that (3.2) is the same formula as (3.8) in [15], and is sometimes called Dempster-conditioning. This name is somewhat misleading though. Indeed, Shafer derives the formula as a special case of Dempster's rule, a rule which we reject. It so happens that one can derive (3.2) without Dempster's rule, only making use of our definition of conditional belief.

Interpreting m(A) as the probability of having knowledge A naturally leads to a probability distribution P on the collection of subsets of Ω. That is, for a collection 𝒜 of subsets of Ω we write

P(𝒜) = Σ_{A ∈ 𝒜} m(A).   (3.3)

The belief in A now is the probability that A is implied, that is,

Bel(A) = P({C ⊆ Ω : C ⊆ A}).   (3.4)

We can now express (3.1) in terms of P by writing, for A ≠ ∅,

m_H(A) = Σ_{C : C∩H=A} m(C) / Σ_{C : C∩H≠∅} m(C)
       = P({C ⊆ Ω : C∩H = A}) / P({C ⊆ Ω : C∩H ≠ ∅})
       = P({C ⊆ Ω : C∩H = A} | {C ⊆ Ω : C∩H ≠ ∅}).   (3.5)

Conditional belief can be expressed as

Bel_H(A) = Σ_{B ⊆ A} m_H(B)
         = Σ_{B ⊆ A} P({C ⊆ Ω : C∩H = B} | {C ⊆ Ω : C∩H ≠ ∅})
         = P({C ⊆ Ω : C∩H ⊆ A} | {C ⊆ Ω : C∩H ≠ ∅}).   (3.6)

In words, (3.5) and (3.6) show that our notion of conditioning can be seen as classically conditioning P on the collection of outcomes that are consistent with H, and then lumping together all outcomes that are the same under H.

Example 3.2. In the special case that Bel = P is a probability distribution, the notion of conditional belief coincides with the notion of conditional probability, i.e.
Bel_H(A) = Σ_{ω ∈ A∩H} m({ω}) / Σ_{ω ∈ H} m({ω}) = P(A | H)   (3.7)

for every A ⊆ Ω and H such that 1 − Bel(H^c) = P(H) > 0.

Example 3.3. Suppose we have a case in which the suspects are two parents and their son, so Ω = {Father, Mother, Son}. We have a lot of evidence that points to the parents, none of which points to one of them in particular. Further, we have some evidence that points to the son. The corresponding basic belief assignment is, say, m({Father, Mother}) = 9/10 and m({Son}) = 1/10. Under the hypothesis H that the culprit is a man, i.e. H = {Father, Son}, the evidence against the parents counts as evidence against the father, so

m_H({Father}) = 9/10.   (3.8)

The next example shows that while (2.6) holds, there are Bel and A, H ⊆ Ω such that

Bel_H(A) ≠ inf{P(A | H) : P ∈ P_Bel}.   (3.9)

The right-hand side in (3.9) goes under the name FH-conditioning, after Fagin and Halpern [4]. Hence, the example shows that Dempster-conditioning and FH-conditioning need not lead to the same result.

Example 3.4 (Continuation of Example 3.3). The collection of probability distributions that we can obtain by distributing the weight on {Father, Mother} over {Father} and {Mother} is

P_Bel = {P_c : 0 ≤ c ≤ 9/10},   (3.10)

where the probability distribution P_c : 2^Ω → [0,
1] is given by

P_c({Father}) := c,  P_c({Mother}) := 9/10 − c,  P_c({Son}) := 1/10.   (3.11)

Since

P_c({Father} | H) = c / (c + 1/10),   (3.12)

we find

inf{P({Father} | H) : P ∈ P_Bel} = 0.   (3.13)

We think the answer in (3.8) is more appropriate than the one in (3.13).

Although our notion of conditioning generalizes the classical notion, there are significant differences between our conditioning and the classical one. To this end, consider the following instructive example, which we subsequently discuss, and which appears in a slightly different formulation also in [11].

Example 3.5. Suppose Ω = {0, 1}², and write an element of Ω as (x, y). Consider the following basic belief assignment m on Ω:

m({(0, 0), (0, 1)}) = 1/2   (3.14)

and

m({(1, 0), (1, 1)}) = 1/2.   (3.15)

In other words, m assigns weight 1/2 to each of the events {x = 0} and {x = 1}, without any commitment about y.

Following the rules of our calculus, it is not difficult to see that Bel(x = y) = 0. However, this outcome may not be so intuitive at first sight, which becomes apparent when we first condition on the outcome of y. Indeed, we have that Bel_{y=0}(x = y) = Bel_{y=1}(x = y) = 1/2, while at the same time Bel(x = y) = 0. This phenomenon deserves a discussion.

We can gain understanding of the paradox in Example 3.5 by looking at the law of total probability from classical probability calculus:

P(A) = P(A | B^c)P(B^c) + P(A | B)P(B)   (3.16)

for all events A and B with P(B) >
0. This law, combined with the fact that P(B) + P(B^c) = 1, gives the very intuitive result that if P(A | B) = P(A | B^c) = α, say, then P(A) must also be equal to α. (In some references this phenomenon is called the sandwich principle, see e.g. [11] and references therein.) However, the analogue for belief functions, that is, that Bel_B(A) = Bel_{B^c}(A) = α implies Bel(A) = α, does not hold in general, as Example 3.5 illustrates. We now explain why it fails there.

First we note that Bel_{y=0}(x = y) = Bel_{y=1}(x = y) = 1/2 is not at all controversial: if we simply know the value of y, then all uncertainty that is left is that of a fair coin flip. The paradox arises because at the same time we have Bel(x = y) = 0. This zero belief is explained by the fact that we do not know how the outcome of y is produced. It may in fact be the case that, for some reason we do not know, the outcome of y is always the opposite of x. Therefore, our belief in x = y should indeed be zero. Based on Bel_{y=0}(x = y) and Bel_{y=1}(x = y), for which the way y is produced is completely irrelevant, we cannot infer anything about Bel(x = y).

It would be an entirely different matter to condition on the outcome of x. We know how the outcome of x is produced, namely as the result of a fair coin flip. And once we know the outcome of x, we are still ignorant about the outcome of y, i.e. Bel_{x=0}(x = y) = Bel_{x=1}(x = y) = 0, and therefore it is completely reasonable to conclude that Bel(x = y) = 0. So when we condition on x, the paradox does not arise. This is a special case of a more general situation in which an analogue of the law of total probability does hold, which we express in the following lemma. The proof follows immediately from the definitions.

Lemma 3.6.
Let B_1, ..., B_n ⊆ Ω be a partition of Ω such that Bel(B_i) > 0 for every i, and such that for every C with m(C) > 0 we have C ⊆ B_i for some i. Then

Σ_{i=1}^n Bel(B_i) = 1  and  Bel(A) = Σ_{i=1}^n Bel(B_i) Bel_{B_i}(A)   (3.17)

for all A ⊆ Ω.

Notice that in Example 3.5 with B_1 = {y = 0} and B_2 = {y = 1}, the condition of the lemma is, as expected, not satisfied. For A = {x = y}, the second part of (3.17) does hold, because we do have that

Bel(x = y) = Bel_{y=0}(x = y)Bel(y = 0) + Bel_{y=1}(x = y)Bel(y = 1),

since both sides are equal to zero. However, the second part of (3.17) does not always hold, since for A = {x = 0} we find

Bel(x = 0) ≠ Bel_{y=0}(x = 0)Bel(y = 0) + Bel_{y=1}(x = 0)Bel(y = 1).

4 Independence

Now that we have a notion of conditioning, we can also introduce a notion of independence, and to this end we consider the following situation. Let Ω_1 and Ω_2 be the outcome spaces of two phenomena that we would describe as 'independent'. Now we consider these two phenomena simultaneously. We set Ω := Ω_1 × Ω_2 and let X : Ω → Ω_1 and Y : Ω → Ω_2 be the projections onto respectively the first and second coordinate. On Ω we define a basic belief assignment m : 2^Ω → [0,
1] and corresponding belief function Bel. We want m and Bel to reflect the 'independent' nature of the two phenomena. To do that, we need a mathematical definition of independence that is consistent with our intuitive idea about 'independence'. There are at least three natural ways to proceed, and we explore them now, together with their relationships.

In the first approach we take conditional beliefs as the starting point, and require that

Bel_{Y ∈ B}(X ∈ A) = Bel(X ∈ A)   (4.1)

and

Bel_{X ∈ A}(Y ∈ B) = Bel(Y ∈ B)   (4.2)

for all A ⊆ Ω_1 and B ⊆ Ω_2 for which the conditional beliefs are defined. It is not difficult to show directly that (4.1) and (4.2) are equivalent, but this will also follow from Theorem 4.3 below, so we do not prove it here.

In the second approach, we proceed via a product form for the belief Bel(X ∈ A; Y ∈ B). As a natural generalization of independence in probability theory, we require that this belief equals the product of the 'marginal' beliefs, i.e.

Bel(X ∈ A; Y ∈ B) = Bel(X ∈ A)Bel(Y ∈ B)   (4.3)

for all A ⊆ Ω_1 and B ⊆ Ω_2.

Instead of looking at a product form for Bel(X ∈ A; Y ∈ B), we can also look at a product form for m(X ∈ A; Y ∈ B), which is our third approach. We first define Bel_1 and Bel_2 to be the 'marginal' belief functions, i.e.

Bel_1(A) := Bel(X ∈ A)  and  Bel_2(B) := Bel(Y ∈ B)   (4.4)

for all A ⊆ Ω_1 and B ⊆ Ω_2. It follows from Theorem 2.7 that Bel_1 and Bel_2 are indeed belief functions. Let m_1 and m_2 be the corresponding 'marginal' basic belief assignments of respectively Bel_1 and Bel_2. Since

Bel_1(A) = Σ_{B ⊆ A × Ω_2} m(B) = Σ_{B : X(B) ⊆ A} m(B) = Σ_{B ⊆ A} Σ_{C : X(C) = B} m(C),   (4.5)

it follows that

m_1(A) = Σ_{C : X(C) = A} m(C).   (4.6)

A similar expression is of course true for m_2. Notice that m_1(A) is in general not the same as m(X ∈ A), since the set {X ∈ A} is in general not the only set with positive basic belief which projects onto A.
Our third approach to independence uses classical independence of m_1 and m_2 to require that

m(X ∈ A; Y ∈ B) = m_1(A)m_2(B) (4.7)

for all A ⊆ Ω_1 and B ⊆ Ω_2.

Now we investigate the relations between the three approaches. The next two examples show that the requirements of respectively the first and second approach are weaker than the requirement of the third.

Example 4.1. Suppose Ω_1 = Ω_2 = {0, 1} and

m(Ω) = 1/4, m({(0,0)}) = 1/2, m({(0,0), (1,0), (0,1)}) = 1/4. (4.8)

Then

Bel_{Y=1}(X = 1) = Bel_{Y=0}(X = 1) = Bel(X = 1) = 0,
Bel_{Y=1}(X = 0) = Bel_{Y=0}(X = 0) = Bel(X = 0) = 1/2, (4.9)

so the requirement of the first approach is satisfied. But

1/2 = m(X = 0; Y = 0) ≠ m_1({0})m_2({0}) = 1/2 · 1/2 = 1/4. (4.10)

Example 4.2. Suppose Ω_1 = Ω_2 = {0, 1} and

m({(0,0), (1,1)}) = 1. (4.11)

We have

Bel(X ∈ A; Y ∈ B) = Bel(X ∈ A)Bel(Y ∈ B) (4.12)

trivially for all A, B ⊆ {0, 1}, but

0 = m(X ∈ {0,1}; Y ∈ {0,1}) ≠ m_1({0,1})m_2({0,1}) = 1. (4.13)

As suggested by the examples, the problem is that there is positive mass on sets that are not 'rectangular', i.e. on S ⊆ Ω such that S ≠ X(S) × Y(S). Consider a C ⊆ Ω with m(C) > 0. The set X(C) ⊆ Ω_1 gives the outcomes of the first phenomenon that C is consistent with. If we condition on {Y = y} for some y ∈ Y(C), evidence for C will become evidence for C ∩ {Y = y}. Since we want to model the two phenomena as independent, it is reasonable to ask that conditioning on {Y = y} changes nothing about the outcomes of the first phenomenon that this individual piece of evidence is consistent with. So

∀ y ∈ Y(C): X(C ∩ {Y = y}) = X(C). (4.14)

It follows directly that (4.14) holds if and only if C = X(C) × Y(C). If m(C) > 0 implies C = X(C) × Y(C), we say that m concentrates on rectangles. Adding this constraint to the requirements of the first two approaches makes all three approaches equivalent.

Theorem 4.3.
The following statements are equivalent:

(1) m concentrates on rectangles and

Bel_{Y∈B}(X ∈ A) = Bel(X ∈ A) (4.15)

for all A ⊆ Ω_1 and B ⊆ Ω_2 with Bel(Y ∈ B^c) < 1;

(2) m concentrates on rectangles and

Bel(X ∈ A; Y ∈ B) = Bel(X ∈ A)Bel(Y ∈ B) (4.16)

for all A ⊆ Ω_1 and B ⊆ Ω_2;

(3) we have

m(X ∈ A; Y ∈ B) = m_1(A)m_2(B) (4.17)

for all A ⊆ Ω_1 and B ⊆ Ω_2.

Proof. We prove (1) ⇒ (2), (2) ⇒ (3) and (3) ⇒ (1).

We start with (1) ⇒ (2), so assume (1) holds. First we note that in general it holds that Bel(A ∪ B) = Bel(A) + Bel(B) − Bel(A ∩ B) if for every C ⊆ A ∪ B with m(C) > 0 we have C ⊆ A or C ⊆ B. Because m concentrates on rectangles, we have that for every C ⊆ {X ∈ A ∨ Y ∈ B} with m(C) > 0, either C ⊆ {X ∈ A} or C ⊆ {Y ∈ B}. So

Bel(X ∈ A ∨ Y ∈ B) = Bel(X ∈ A) + Bel(Y ∈ B) − Bel(X ∈ A; Y ∈ B) (4.18)

for every A ⊆ Ω_1 and B ⊆ Ω_2.

Now let A ⊆ Ω_1 and B ⊆ Ω_2 be such that Bel(Y ∈ B^c) < 1. We find, using respectively (1), (3.2) and (4.18), that

Bel(X ∈ A) = Bel_{Y∈B}(X ∈ A)
= [Bel(X ∈ A ∨ Y ∈ B^c) − Bel(Y ∈ B^c)] / [1 − Bel(Y ∈ B^c)]
= [Bel(X ∈ A) − Bel(X ∈ A; Y ∈ B^c)] / [1 − Bel(Y ∈ B^c)]. (4.19)

Rewriting this equation gives

Bel(X ∈ A; Y ∈ B^c) = Bel(X ∈ A)Bel(Y ∈ B^c). (4.20)

Since B^c runs over all subsets D of Ω_2 with Bel(Y ∈ D) < 1 as B runs over their complements, (4.20) gives (4.16) whenever Bel(Y ∈ B) < 1. Finally, (4.16) is trivially true for A ⊆ Ω_1 and B ⊆ Ω_2 with Bel(Y ∈ B) = 1, since then C ⊆ {Y ∈ B} for every C with m(C) > 0. So (2) holds.

We continue with (2) ⇒ (3), so assume (2) holds. We prove (3) by induction on |A| + |B|. Clearly, if |A| + |B| = 0 for A ⊆ Ω_1 and B ⊆ Ω_2, then m(X ∈ A; Y ∈ B) = 0 = m_1(A)m_2(B). Now suppose that m(X ∈ A′; Y ∈ B′) = m_1(A′)m_2(B′) for all A′ ⊆ Ω_1 and B′ ⊆ Ω_2 with |A′| + |B′| < n. Let A ⊆ Ω_1 and B ⊆ Ω_2 be such that |A| + |B| = n. Because m concentrates on rectangles, every C ⊆ {X ∈ A; Y ∈ B} with m(C) > 0 is of the form {X ∈ A′; Y ∈ B′} with A′ ⊆ A and B′ ⊆ B, and if moreover |A′| + |B′| = n then (A′, B′) = (A, B). Because of (2) and the induction hypothesis we therefore have

m(X ∈ A; Y ∈ B)
= Bel(X ∈ A; Y ∈ B) − ∑_{A′⊆A, B′⊆B, |A′|+|B′|<n} m(X ∈ A′; Y ∈ B′)
= Bel_1(A)Bel_2(B) − ∑_{A′⊆A, B′⊆B, |A′|+|B′|<n} m_1(A′)m_2(B′)
= ∑_{A′⊆A} m_1(A′) ∑_{B′⊆B} m_2(B′) − ∑_{A′⊆A, B′⊆B, |A′|+|B′|<n} m_1(A′)m_2(B′)
= m_1(A)m_2(B).

This completes the induction, so (3) holds.

Finally we show (3) ⇒ (1), so assume (3) holds. Summing (4.17) over all A ⊆ Ω_1 and B ⊆ Ω_2 gives

∑_{A⊆Ω_1, B⊆Ω_2} m(X ∈ A; Y ∈ B) = ∑_{A⊆Ω_1} m_1(A) ∑_{B⊆Ω_2} m_2(B) = 1,

so all mass of m sits on sets of the form {X ∈ A; Y ∈ B} = A × B, i.e. m concentrates on rectangles, with m(A × B) = m_1(A)m_2(B). Summing over subsets then gives (4.16). Substituting (4.16) into (3.2), as in (4.19) but read in the reverse direction, yields (4.15). So (1) holds.

Definition 4.4. The projections X and Y are independent if for all A ⊆ Ω_1 and B ⊆ Ω_2 we have

m(X ∈ A; Y ∈ B) = m_1(A)m_2(B).

Our notion of independence generalizes the classical notion.
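Definition 4.4 can be checked numerically. The sketch below (an illustration under our own choice of marginals, not the authors' code) builds the product bba m(A × B) = m_1(A)m_2(B) and verifies the product form (4.16) of the resulting belief function on every pair of sets:

```python
from itertools import combinations

def subsets(s):
    """All subsets of s as frozensets, including the empty set."""
    s = list(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def bel(m, a):
    """Belief of a: total mass of the focal sets contained in a."""
    return sum(mass for c, mass in m.items() if c <= a)

def product_bba(m1, m2):
    """All mass on rectangles A x B, with m(A x B) = m1(A) * m2(B)."""
    m = {}
    for a, pa in m1.items():
        for b, pb in m2.items():
            rect = frozenset((x, y) for x in a for y in b)
            m[rect] = m.get(rect, 0.0) + pa * pb
    return m

# Hypothetical marginals on {0, 1}: partial knowledge plus some ignorance.
m1 = {frozenset({0}): 0.5, frozenset({0, 1}): 0.5}
m2 = {frozenset({1}): 0.3, frozenset({0, 1}): 0.7}
m = product_bba(m1, m2)

# Bel(X in A; Y in B) = Bel_1(A) * Bel_2(B) for all A and B, as in (4.16).
for a in subsets({0, 1}):
    for b in subsets({0, 1}):
        rect = frozenset((x, y) for x in a for y in b)
        assert abs(bel(m, rect) - bel(m1, a) * bel(m2, b)) < 1e-9
```

Since all mass of the product bba sits on rectangles by construction, this is the situation in which Theorem 4.3 makes the three approaches coincide.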
In fact, in the classical situation m concentrates on singletons, which of course are all rectangles, and hence the three approaches to independence coincide in that case.

To see how we can interpret independence, we write P for the probability distribution from (3.3) corresponding to m on Ω = Ω_1 × Ω_2, and we write P_1 and P_2 for the probability distributions corresponding to respectively m_1 and m_2. It follows directly from the definitions that

P({{X ∈ A} ∩ {Y ∈ B} : A ∈ 𝒜, B ∈ ℬ}) = P_1(𝒜)P_2(ℬ) (4.25)

for all 𝒜 ⊆ 2^{Ω_1} and ℬ ⊆ 2^{Ω_2}, is equivalent with Definition 4.4. So (4.25) gives an interpretation of our notion of independence: for any 𝒜 ⊆ 2^{Ω_1} and ℬ ⊆ 2^{Ω_2}, the probability that we have evidence for X in 𝒜 and evidence for Y in ℬ is the product of the individual probabilities.

Now that we have introduced independence, we revisit Example 3.5.

Example 4.5. Consider the situation in Example 3.5 again. The basic belief assignment described there arises for instance when X denotes the outcome of someone flipping a fair coin, and we have no information whatsoever about the outcome of Y and the way it is produced. Indeed, the marginal belief function of X simply assigns mass 1/2 to both outcomes, whereas the marginal of Y represents complete ignorance.

It is easy to check that X and Y are independent. In fact, if we want the marginal belief functions of X and Y to be as given, X and Y are necessarily independent. This may sound strange, but notice that in our theory Y is 'degenerate', since its marginal belief function concentrates on {0, 1}. In the classical theory it is also the case that a degenerate random variable is independent of any other random variable on the same probability space.

As we have mentioned in the introduction, we do not need Dempster's rule of combination. Nevertheless we want to spend some lines on it, because this rule has been the most important reason to reject the theory of belief functions in the past.
Now that we have developed our notions of conditioning and independence, we can explain why we think Dempster's rule deserves no place in our theory of belief functions.

We start by introducing the rule. Suppose m_1 and m_2 are two basic belief assignments on the same space Ω. Dempster's rule of combination states that if m_1 and m_2 are 'based on independent evidence', we can define a canonical basic belief assignment m_1 ⊕ m_2 on Ω 'combining' m_1 and m_2, which is given by

(m_1 ⊕ m_2)(A) = [∑_{B,C : B∩C=A} m_1(B)m_2(C)] / [1 − ∑_{B,C : B∩C=∅} m_1(B)m_2(C)]. (5.1)

It is easy to check that m_1 ⊕ m_2 is indeed a basic belief assignment on Ω, but we think this basic belief assignment is in general not in any way a meaningful 'combination' of m_1 and m_2. To see why, we first use our own theory of belief functions to interpret (5.1). We define a basic belief assignment m on the product space Ω² by treating m_1 and m_2 as basic belief assignments corresponding to independent phenomena, using Definition 4.4, i.e.

m(A × B) := m_1(A)m_2(B) (5.2)

for A, B ⊆ Ω. Let H := {(ω, ω) : ω ∈ Ω} ⊆ Ω², which is the set on which the two outcomes are identical. Then for nonempty A ⊆ Ω we find

m_H({(ω, ω) : ω ∈ A}) = [∑_{D : D∩H = {(ω,ω) : ω∈A}} m(D)] / [1 − ∑_{D : D∩H=∅} m(D)]
= [∑_{D_1,D_2 : D_1∩D_2 = A} m_1(D_1)m_2(D_2)] / [1 − ∑_{D_1,D_2 : D_1∩D_2 = ∅} m_1(D_1)m_2(D_2)]
= (m_1 ⊕ m_2)(A). (5.3)

Equation (5.3) shows that if m_1 and m_2 describe independent phenomena, then m_1 ⊕ m_2 is the basic belief assignment after we have learned that they had the same outcome. This is curious, because m_1 and m_2 are supposed to concern the same phenomenon, which is the very opposite of m_1 and m_2 describing independent phenomena. We think that this is the heart of the problem of Dempster's rule: it does a computation treating m_1 and m_2 as corresponding to independent phenomena, while the description claims it does a computation treating m_1 and m_2 as corresponding to the same phenomenon.
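The identity (5.3) is easy to verify numerically. The sketch below (our own illustration; the two bbas are hypothetical) implements Dempster's rule (5.1) directly, and also builds the product bba on Ω × Ω and conditions on the diagonal H, checking that the two computations agree:

```python
def dempster(m1, m2):
    """Dempster's rule of combination (5.1)."""
    raw, conflict = {}, 0.0
    for b, pb in m1.items():
        for c, pc in m2.items():
            inter = b & c
            if inter:
                raw[inter] = raw.get(inter, 0.0) + pb * pc
            else:
                conflict += pb * pc
    return {a: mass / (1.0 - conflict) for a, mass in raw.items()}

def condition(m, h):
    """Conditioning: evidence for C becomes evidence for C ∩ H, renormalized."""
    cond = {}
    for c, mass in m.items():
        inter = c & h
        if inter:
            cond[inter] = cond.get(inter, 0.0) + mass
    total = sum(cond.values())
    return {a: mass / total for a, mass in cond.items()}

omega = {'a', 'b', 'c'}
m1 = {frozenset({'a'}): 0.3, frozenset({'a', 'b'}): 0.7}
m2 = {frozenset({'b'}): 0.5, frozenset({'a', 'c'}): 0.5}

# Treat m1 and m2 as describing independent phenomena on omega x omega ...
m_prod = {}
for s1, p1 in m1.items():
    for s2, p2 in m2.items():
        rect = frozenset((x, y) for x in s1 for y in s2)
        m_prod[rect] = m_prod.get(rect, 0.0) + p1 * p2

# ... and then learn that the outcomes were identical: condition on H.
diag = frozenset((w, w) for w in omega)
m_h = condition(m_prod, diag)

comb = dempster(m1, m2)
for a, mass in comb.items():
    assert abs(m_h[frozenset((w, w) for w in a)] - mass) < 1e-9
```

The agreement is exactly the point of (5.3): Dempster's rule secretly treats the two pieces of evidence as concerning independent phenomena that happen to coincide.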
The confusion between 'independent evidence' and independent phenomena may find its origin in the problematic nature of the notion of 'independent evidence': if the evidence concerns the same phenomenon, then all the evidence is automatically dependent, because all the evidence depends on the true outcome of that phenomenon. Whatever one precisely means by 'independent evidence', however, it is clear that it should not be the same as evidence concerning independent phenomena. This leads to absurd results, as the following example illustrates.

Example 5.1. Suppose we are going to flip a coin and set Ω = {h, t} for the outcomes, where h stands for head and t for tail. Previous flips of the coin have given us the information that head comes up 60% of the time, hence m_1({h}) = 3/5 and m_1({t}) = 2/5. A second investigation, based on the shape of the coin, leads to the same conclusion, so m_2 = m_1. Then combining these two basic belief assignments using Dempster's rule gives

(m_1 ⊕ m_2)({h}) = (3/5)² / [(3/5)² + (2/5)²] = 9/13 ≈ 0.69 (5.4)

and

(m_1 ⊕ m_2)({t}) = (2/5)² / [(3/5)² + (2/5)²] = 4/13 ≈ 0.31. (5.5)

Hence Dempster's rule leads to the conclusion that the coin is much more biased than both of our two sources of evidence agreed on. Although one could try to argue that confirmation could lead to more belief, it should at least not lead to less belief. This is exactly what happens with our belief in the outcome 'tail': while m_1({t}) = m_2({t}) = 2/5, we have (m_1 ⊕ m_2)({t}) = 4/13 < 2/5. From this it follows that the only acceptable outcome of 'combining' m_1 and m_2 = m_1 would be m_1 again, and not m_1 ⊕ m_2.

We reject Dempster's rule and do not see any need for a rule that 'combines' basic belief assignments on the same space. Since the theory of belief functions is a generalization of probability theory, we think it makes sense to generalize concepts like conditioning and independence.
Probability theory, however, does not have a canonical rule for 'combining' probability distributions on the same outcome space into a new canonical probability distribution. Both probability theory and our theory of belief functions, as applied to forensic cases in our companion article [10], seem to function fine without such a rule. Furthermore, the fact that no such rule is known in the well-studied case of probability distributions gives us reason to be skeptical about the possibility of constructing a plausible rule for the general case.

We start with an example from the forensic setting, and show how our theory can avoid the problem of having to choose a prior, producing a less arbitrary result. This is a typical case in which we think our theory should be used, and we once more refer to our companion paper [10] for further forensic examples. After that, we also discuss a more traditional gambling example.

Suppose we know that one of two persons, say A and B, has committed a certain crime. We write E for the event that the person that committed the crime and A both have a certain (DNA) characteristic that occurs in the total population with probability p. Notice that, before we know E, we do not know that the person that committed the crime has the characteristic. We write G for the guilt of A. We are interested in G given E.

The typical way to deal with this classically is to use Bayes' rule, which states that

P(G | E) = P(E | G)P(G) / [P(E | G)P(G) + P(E | G^c)P(G^c)]. (6.1)

Whatever the interpretation of the classical probability measure P, subjective, frequentistic or otherwise, P(G | E) represents the posterior probability of guilt conditioned on the available evidence, while P(G) denotes the prior probability of guilt, before taking the evidence E into account. From our information, we can conclude that P(E | G) = p and P(E | G^c) = p². There is, however, no reason to assign any positive prior probability to either G or G^c.
But since classical probability requires that P(G) + P(G^c) = 1, by lack of a better alternative a uniform prior P(G) = P(G^c) = 1/2 is usually taken, which leads to

P(G | E) = p / (p + p²) = 1 / (1 + p). (6.2)

We can re-derive the answer in (6.2) in our setting, using the following basic belief assignment based on a uniform prior on the guilt of A:

m(E ∩ G) = p/2,
m(E^c ∩ G) = (1 − p)/2,
m(E ∩ G^c) = p²/2,
m(E^c ∩ G^c) = (1 − p²)/2. (6.3)

Then we find that

Bel_E(G) = (p/2) / (p/2 + p²/2) = 1 / (1 + p). (6.4)

In our theory the problem of choosing a prior can be resolved, since belief functions are more flexible than probability distributions. We will explain this now.

What we do is determine what we actually know in certain scenarios. If A does not have the characteristic, which happens with probability 1 − p, then we know E^c but we do not know anything about the guilt of A. If A and B both have the characteristic, which happens with probability p², then we know E but again we do not know anything about the guilt of A. If A has the characteristic and B has not, which happens with probability p(1 − p), we know that either E^c or G holds. This leads to the following basic belief assignment:

m(E^c) = 1 − p,
m(E) = p²,
m(E^c ∪ G) = p(1 − p). (6.5)

This basic belief assignment does not give any prior belief on the guilt of A, since Bel(G) = Bel(G^c) = 0. If we condition on E, however, we can simply compute, using our rule of conditioning, that

Bel_E(G) = p(1 − p) / (p(1 − p) + p²) = 1 − p, (6.6)

which is a smaller number than the classical answer.

In a casino, two croupiers execute the following procedure, independently of each other. First, a fair coin is flipped. After that, each croupier gets, with probability p, the opportunity to change the outcome of the coin flip, again independently of each other.
How the croupiers make their decisions about changing or not is unknown to the player. After the two results are produced, but before the player knows the outcomes, the player is told whether or not the produced outcomes of the two croupiers are the same. This means that of the four possible combinations of outcomes, only two remain, and these are the two outcomes on which the player can bet.

We write Ω_1 = {h, t} for the outcome space of the first croupier, where h stands for 'head' and t for 'tail', and we define the basic belief assignment m_1 : 2^{Ω_1} → [0, 1] by m_1({h}) = m_1({t}) = (1 − p)/2 and m_1({h, t}) = p. We write Ω_2 = {h, t} for the outcome space of the second croupier and define m_2 : 2^{Ω_2} → [0, 1] similarly to m_1. On Ω = Ω_1 × Ω_2, using our definition of independence, we get m : 2^Ω → [0, 1] given by

m({(h, h)}) = m({(h, t)}) = m({(t, h)}) = m({(t, t)}) = (1 − p)²/4, (6.7)

m({h} × {h, t}) = m({t} × {h, t}) = m({h, t} × {h}) = m({h, t} × {t}) = p(1 − p)/2, (6.8)

and

m(Ω) = p². (6.9)

We write S = {(h, h), (t, t)} for the event that the outcomes of the two croupiers are the same. Using our rule of conditioning, we compute our belief in (h, h) in case we get the information that the results were the same:

Bel_S({(h, h)}) = [(1 − p)²/4 + p(1 − p)] / [1 − (1 − p)²/2]. (6.10)

Obviously, Bel_S({(t, t)}) has the same value, and

Bel_S({(t, t)}) + Bel_S({(h, h)}) = 1 − m_S(S). (6.11)

The quantity m_S(S) can be seen as the price we have to pay for our uncertainty about the decisions of the croupiers. Notice that if p = 0, we simply have two fair coin flips and (6.10) equals 1/2. If p = 1, we are completely ignorant and (6.10) equals 0.

We now investigate how we can model this game with classical probability distributions.
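Before turning to the classical analysis, the conditional belief in (6.10) can be checked with a short computation (a sketch of our own, for one arbitrary value of p, with the conditioning rule written in terms of the basic belief assignment):

```python
def bel(m, a):
    """Belief of a: total mass of the focal sets contained in a."""
    return sum(mass for c, mass in m.items() if c <= a)

def condition(m, h):
    """Conditioning: evidence for C becomes evidence for C ∩ H, renormalized."""
    cond = {}
    for c, mass in m.items():
        inter = c & h
        if inter:
            cond[inter] = cond.get(inter, 0.0) + mass
    total = sum(cond.values())
    return {a: mass / total for a, mass in cond.items()}

p = 0.4  # arbitrary probability of getting the chance to change the outcome
m1 = {frozenset({'h'}): (1 - p) / 2, frozenset({'t'}): (1 - p) / 2,
      frozenset({'h', 't'}): p}

# Product bba on the two croupiers, using our definition of independence.
m = {}
for a, pa in m1.items():
    for b, pb in m1.items():
        rect = frozenset((x, y) for x in a for y in b)
        m[rect] = m.get(rect, 0.0) + pa * pb

s = frozenset({('h', 'h'), ('t', 't')})
m_s = condition(m, s)

# Compare with the closed form in (6.10).
closed_form = (0.25 * (1 - p) ** 2 + p * (1 - p)) / (1 - 0.5 * (1 - p) ** 2)
assert abs(bel(m_s, frozenset({('h', 'h')})) - closed_form) < 1e-9
```

Note how the rectangles {h} × {h, t} and {h, t} × {h} intersect S in {(h, h)} and therefore contribute to the conditional belief in (h, h), rather than to the normalizing factor.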
In that case we have to make the assumption that both croupiers make their decisions according to some probability distribution. This means that there are probabilities p_1 and p_2 that respectively the first and second result (after possible changes by the croupiers) are head. The definition of the game implies that

p_1, p_2 ∈ [(1 − p)/2, (1 + p)/2]. (6.12)

Now we define P : 2^Ω → [0, 1] by P({(h, h)}) = p_1 p_2, P({(h, t)}) = p_1(1 − p_2), P({(t, h)}) = (1 − p_1)p_2 and P({(t, t)}) = (1 − p_1)(1 − p_2). It follows that

P({(h, h)} | S) = p_1 p_2 / [p_1 p_2 + (1 − p_1)(1 − p_2)], (6.13)

and an easy computation shows that (6.13) is contained in the interval

[(1 − p)² / (2(1 + p²)), (1 + p)² / (2(1 + p²))]. (6.14)

Note that our answer in (6.10) is contained in this interval.

To understand the difference between (6.10) and (6.13), note that the two approaches treat conditioning fundamentally differently. In a classical setting, one first needs to choose and fix p_1 and p_2 before the concept of conditioning even makes sense. In our approach with belief functions, however, we treat the uncertainty about the decisions of the croupiers on the same level as our other uncertainty, making it possible to condition without making any assumptions about the decisions of the croupiers first.

We can make this global assessment more concrete by looking at an example. Suppose that we are in the classical situation, and that the croupiers try to get tail. That is, if head comes up and they get the opportunity to change, then they choose tail. If tail comes up, they never change. This boils down to p_1 = p_2 = (1 − p)/2 and corresponds to the left endpoint of the interval in (6.14).

Now consider the event that the first croupier flips a head and does not get the chance to revise the outcome, and the second croupier does get the chance to revise the outcome, an event with probability p(1 − p)/2.
In the classical setting which we described, this event implies that the second croupier chooses tail, and hence the outcome (h, t) is not in S. The mass assigned to this event therefore only plays a role in the normalizing factor when we condition on S.

In the theory of belief functions, however, the conditioning works fundamentally differently. Considering the same event as described above, the probability mass of this event does not end up in the normalizing factor, but is instead added to the final belief in (h, h), because given S, it is implied that the choice of the second croupier was head.

It is an interesting question which answer one would choose in a real betting situation. The conditional probability of (h, h) given S can safely be said to be at least the left endpoint of the interval in (6.14), and perhaps this is the only number someone analyzing the situation classically wants to use if you can bet only once. In the theory of belief functions, however, one would choose the answer in (6.10) in case of a unique betting situation. Of course, when we repeat the betting experiment many times, one might get insight in the strategy of the croupiers, and this might be a reason to use the classical theory with appropriate values of p_1 and p_2.

The collection P_Bel

Let m : 2^Ω → [0, 1] be a basic belief assignment. Let P_Bel be the collection of probability distributions (already introduced in Section 2) on Ω that we can obtain by distributing, for every C ⊆ Ω, a probability mass of m(C) over the elements of C. If A ⊆ Ω, then for every C ⊆ Ω with C \ A ≠ ∅, we can assign a probability mass of m(C) to an element outside A. Thus we have

inf{P(A) : P ∈ P_Bel} = ∑_{C ⊆ A} m(C) = Bel(A). (7.1)

This leads to an interpretation of belief as 'minimum probability'. As we already showed in Example 3.4, there are Bel and A, H ⊆ Ω such that

Bel_H(A) ≠ inf{P(A | H) : P ∈ P_Bel}.
(7.2)

This makes our theory distinct from a 'lower probability' theory of belief functions, see e.g. [16]. Lemma 7.1 shows how we can properly express Bel_H in terms of P_Bel.

Lemma 7.1. Bel_H(A) = inf{P(A | H) : P ∈ P_Bel and P(H^c) = Bel(H^c)}.

Proof.

Bel_H(A) = [Bel(A ∪ H^c) − Bel(H^c)] / [1 − Bel(H^c)]
= [inf{P(A ∪ H^c) : P ∈ P_Bel} − inf{P(H^c) : P ∈ P_Bel}] / [1 − inf{P(H^c) : P ∈ P_Bel}]
= inf{ [P(A ∪ H^c) − P(H^c)] / [1 − P(H^c)] : P ∈ P_Bel and P(H^c) = Bel(H^c) }
= inf{ P(A ∩ H)/P(H) : P ∈ P_Bel and P(H^c) = Bel(H^c) }
= inf{ P(A | H) : P ∈ P_Bel and P(H^c) = Bel(H^c) }. (7.3)

Lemma 7.1 tells us that the conditional belief is not the infimum over all conditional probabilities in P_Bel, but only over a sub-collection of P_Bel. Notice that this implies that Bel_H(A) ≥ inf{P(A | H) : P ∈ P_Bel}. By only considering P ∈ P_Bel with P(H^c) = Bel(H^c) in (7.3), we are discarding distributions of P_Bel on the basis that we have learned H. This means that in our theory, the collection P_Bel should be interpreted as a collection from which we can discard distributions if we have reasons to do so. In particular, this means that we cannot interpret P_Bel as containing the 'correct' or 'actual' probability distribution without knowing which one it is. This is because if we interpret P_Bel that way, we are not allowed to discard any element of P_Bel, since by discarding a distribution we might discard the actual distribution.

We conclude by expressing independence in terms of P_Bel. Consider the situation of Section 4 again and observe that

inf{P(X ∈ A; Y ∈ B) : P ∈ P_Bel} = inf{P_1(A)P_2(B) : P_1 ∈ P_{Bel_1}, P_2 ∈ P_{Bel_2}} (7.4)

for all A ⊆ Ω_1 and B ⊆ Ω_2, is equivalent with the requirement of the second approach.

A law of large numbers

Since we have only developed our theory for finite outcome spaces, we present a 'weak' law of large numbers.
Let X : Ω → ℝ be a function and m : 2^Ω → [0, 1] a basic belief assignment. We want to define the expectation of X in such a way that it is a 'guaranteed lower bound' for the average of many independent 'copies' of X. This leads to the following definition.

Definition 8.1. The expectation of X (with respect to m) is

Exp(X) := ∑_{C ⊆ Ω} m(C) min_{ω ∈ C} X(ω). (8.1)

In case Bel = P is a probability distribution, we have

Exp(X) = ∑_{ω ∈ Ω} m({ω})X(ω) = ∑_{ω ∈ Ω} P({ω})X(ω) = E(X), (8.2)

and thus Definition 8.1 is consistent with the concept of expectation for probability distributions.

First, we want to show that in the long run, Exp(X) is a lower bound for the average of n independent 'copies' of X. Secondly, we want to show that there is no bigger lower bound than Exp(X). We make this precise. On Ω^n we use Definition 4.4 to define the basic belief assignment m_n : 2^{Ω^n} → [0, 1] by

m_n(A_1 × · · · × A_n) := ∏_{j=1}^n m(A_j), (8.3)

making all projections independent. Let Bel_n be the corresponding belief function. We define X_j : Ω^n → ℝ by

X_j((ω_1, ω_2, . . . , ω_n)) := X(ω_j). (8.4)

Theorem 8.2. For every ε > 0 we have

lim_{n→∞} Bel_n( (1/n) ∑_{j=1}^n X_j ≥ Exp(X) − ε ) = 1 (8.5)

and

lim_{n→∞} Bel_n( (1/n) ∑_{j=1}^n X_j ≥ Exp(X) + ε ) = 0. (8.6)

Proof. Let ε > 0. Define the probability distribution P on 2^Ω by P({C}) := m(C), and P_n on (2^Ω)^n by

P_n({(C_1, C_2, . . . , C_n)}) := ∏_{j=1}^n m(C_j) = ∏_{j=1}^n P({C_j}). (8.7)

We define the random variable X̂ : 2^Ω → ℝ by

X̂(C) := min_{ω ∈ C} X(ω) (8.8)

and let X̂_j : (2^Ω)^n → ℝ be given by

X̂_j((C_1, C_2, . . . , C_n)) := X̂(C_j). (8.9)

Observe that for any α ∈ [0, 1] we have

Bel_n( (1/n) ∑_{j=1}^n X_j ≥ α ) = P_n( { (C_1, C_2, . . . , C_n) : min_{ω_j ∈ C_j} (1/n) ∑_{j=1}^n X(ω_j) ≥ α } )
= P_n( { (C_1, C_2, . . . , C_n) : (1/n) ∑_{j=1}^n X̂(C_j) ≥ α } )
= P_n( (1/n) ∑_{j=1}^n X̂_j ≥ α ). (8.10)

The (classical) expectation of the X̂_j is

E(X̂_j) = E(X̂) = ∑_{C ⊆ Ω} P({C})X̂(C) = ∑_{C ⊆ Ω} m(C) min_{ω ∈ C} X(ω) = Exp(X), (8.11)

and by the definition of P_n all the X̂_1, . . . , X̂_n are (classically) independent. With the classical weak law of large numbers, we then find that

lim_{n→∞} Bel_n( (1/n) ∑_{j=1}^n X_j ≥ Exp(X) − ε ) = lim_{n→∞} P_n( (1/n) ∑_{j=1}^n X̂_j ≥ E(X̂) − ε ) = 1 (8.12)

and

lim_{n→∞} Bel_n( (1/n) ∑_{j=1}^n X_j ≥ Exp(X) + ε ) = lim_{n→∞} P_n( (1/n) ∑_{j=1}^n X̂_j ≥ E(X̂) + ε ) = 0. (8.13)

If we take for A ⊆ Ω the random variable 1_A, then we find

Exp(1_A) = ∑_{C ⊆ Ω} m(C) min_{ω ∈ C} 1_A(ω) = ∑_{C ⊆ A} m(C) = Bel(A). (8.14)

This gives us a special case of Theorem 8.2.

Lemma 8.3 (Corollary of Theorem 8.2). For every ε > 0 and every A ⊆ Ω we have

lim_{n→∞} Bel_n( (1/n) ∑_{j=1}^n 1_A(ω_j) ≥ Bel(A) − ε ) = 1 (8.15)

and

lim_{n→∞} Bel_n( (1/n) ∑_{j=1}^n 1_A(ω_j) ≥ Bel(A) + ε ) = 0. (8.16)

Lemma 8.3 tells us that if we write F_n(A) ∈ [0, 1] for the relative frequency of A occurring after n independent repetitions, then the belief Bel(A) is the greatest lower bound for F_n(A) we can give if n is large. Note, however, that this is not the same as knowing that for every large k there is an n > k such that F_n(A) is close to Bel(A).

Lemma 8.3 provides a frequency interpretation of belief functions which is analogous to the interpretation of the classical law of large numbers for classical probability theory. It gives a mathematical formulation of the intuitive idea that when we independently repeat an experiment many times, the relative frequency of occurrence of an event A should be related to the belief in A. The fact that the belief in A is related to the greatest lower bound for F_n(A) we can give based on our knowledge, and not to a limit of F_n(A), reflects the difference between probabilities and belief functions. This difference, we say once more, is the difference between on the one hand quantifying what F_n(A) is, and on the other hand quantifying what we know about F_n(A).

The extent to which the law of large numbers is useful depends very much on the situation at hand.
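Theorem 8.2 and Lemma 8.3 can be illustrated by simulation. In the sketch below (our own illustration on a hypothetical two-point example), each round draws a focal set C with probability m(C), after which an adversary picks the minimizing outcome in C; even then the running average stays near Exp(X):

```python
import random

def exp_lower(m, x):
    """Exp(X) as in (8.1): sum over C of m(C) * min over omega in C of X(omega)."""
    return sum(mass * min(x[w] for w in c) for c, mass in m.items())

# Hypothetical example on Omega = {0, 1} with X the identity:
# mass 1/2 on {1} (we know the outcome is 1) and 1/2 on {0, 1} (ignorance).
m = {frozenset({1}): 0.5, frozenset({0, 1}): 0.5}
x = {0: 0.0, 1: 1.0}
assert exp_lower(m, x) == 0.5

random.seed(0)
focal_sets = list(m)
weights = [m[c] for c in focal_sets]
n = 20000
# Adversarial resolution of the ignorance: always take the minimum in C.
total = sum(min(x[w] for w in random.choices(focal_sets, weights)[0])
            for _ in range(n))
avg = total / n
assert abs(avg - exp_lower(m, x)) < 0.02
```

With any less adversarial resolution of the drawn sets, the average can only be larger; this is the content of (8.5), with the adversarial choice showing that Exp(X) is not just a guaranteed lower bound but also the best one.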
Note that we need independent repetitions of the same experiment, and in many applications in, say, legal or forensic settings, such independent repetitions do not make much sense. Nevertheless, even in these cases it might be reassuring that under a hypothetical assumption of independent repetitions, there is a law of large numbers which reflects the nature of belief functions quite well. Section 6.2 contains an example which is repeatable.

In a certain interpretation of probability distributions [12, 5], the probability of a set A is seen as the price an agent is willing to buy and sell a bet for that pays out 1 if A turns out to be true. Given the constraint that agents cannot assign prices in such a way that they can have a guaranteed loss (a Dutch Book), the Dutch Book Theorem tells us that probability distributions are exactly the functions that obey the Kolmogorov axioms. Here we want to give a similar interpretation for belief functions and derive a theorem similar to the Dutch Book Theorem. The crux here is that we do not look at the price an agent is willing to buy and sell for, but only at the maximum price an agent is willing to buy for. We make this idea precise.

We consider the following scenario. An agent assigns to every subset S ⊆ Ω the maximum price P(S) ∈ [0, 1] she is willing to pay for the bet that pays out 1 if S turns out to be true. First, we look at the following theorem, which gives us the constraints corresponding to probability distributions.

Theorem 9.1. A function P : 2^Ω → [0, 1] is a probability distribution if and only if

(P1) P(Ω) = 1;

(P2) for all A_1, A_2, ..., A_N ⊆ Ω and B_1, B_2, ..., B_M ⊆ Ω such that

∀ ω ∈ Ω: ∑_{i=1}^N 1_{A_i}(ω) ≥ ∑_{j=1}^M 1_{B_j}(ω), (9.1)

we have

∑_{i=1}^N P(A_i) ≥ ∑_{j=1}^M P(B_j). (9.2)

Proof. It is sufficient to show that (P2) is equivalent with finite additivity. First suppose (P2) holds. Let A, B ⊆ Ω be disjoint.
We have

∀ ω ∈ Ω: 1_A(ω) + 1_B(ω) = 1_{A∪B}(ω), (9.3)

so by (P2) we find both P(A) + P(B) ≥ P(A ∪ B) and P(A ∪ B) ≥ P(A) + P(B). So P is finitely additive.

Now suppose that P is finitely additive. Let A_1, A_2, ..., A_N ⊆ Ω and B_1, B_2, ..., B_M ⊆ Ω be such that (9.1) holds. Then

∑_{i=1}^N P(A_i) = ∑_{ω ∈ Ω} P({ω}) ∑_{i=1}^N 1_{A_i}(ω) ≥ ∑_{ω ∈ Ω} P({ω}) ∑_{j=1}^M 1_{B_j}(ω) = ∑_{j=1}^M P(B_j). (9.4)

So (P2) holds.

Constraint (P1) says that an agent always pays 1 for bets on tautologies, i.e. P(Ω) = 1. Constraint (P2) says that an agent must assign prices in such a way that if she buys a set of bets that is guaranteed to pay out at least as much as another set of bets, then the total price of the first set must be at least as much as the total price of the second set of bets.

Given our interpretation of the maximum price an agent is willing to pay for a bet, however, we think (P2) is too restrictive, as illustrated by the following example. Let Ω = {ω_1, ω_2} and consider an agent that is completely ignorant about how likely ω_1 or ω_2 is. Of course she will be ready to pay 1 for a bet on Ω, since payout is guaranteed. But she could feel conservative in her ignorance and not be ready to pay anything for a bet on {ω_1} or {ω_2}. However, if P({ω_1}) = P({ω_2}) = 0 while P(Ω) = 1, then (P2) is violated.

The problem is that (P2) only compares actual payout under realizations ω ∈ Ω. Our example shows that an agent may also be interested in the guaranteed payout of a bet on A if she only knows that the actual outcome is in a given set S. This is in line with the epistemic interpretation which we discussed earlier, since in this epistemic interpretation, a subset S ⊆ Ω corresponds to knowledge about the outcome being in S without further specification. Hence we suggest to change (9.1) into

∀ S ⊆ Ω: ∑_{i=1}^N 1(S ⊆ A_i) ≥ ∑_{j=1}^M 1(S ⊆ B_j).
(9.5)

We then force the total price of the first set of bets to be at least the total price of the second set if not only the payout is at least as big in all cases, but also the guaranteed payout under any S is at least as big in all cases. The following theorem states that if we make this change, we get a characterization of belief functions.

Theorem 9.2. A function Bel : 2^Ω → [0, 1] is a belief function if and only if

(B1) Bel(Ω) = 1;

(B2*) for all A_1, A_2, . . . , A_N ⊆ Ω and B_1, B_2, . . . , B_M ⊆ Ω such that

∀ S ⊆ Ω: ∑_{i=1}^N 1(S ⊆ A_i) ≥ ∑_{j=1}^M 1(S ⊆ B_j), (9.6)

we have

∑_{i=1}^N Bel(A_i) ≥ ∑_{j=1}^M Bel(B_j). (9.7)

Proof. Suppose (B1) and (B2*) hold. Let A, B ⊆ Ω. For all S ⊆ Ω we have

1(S ⊆ A ∪ B) + 1(S ⊆ A ∩ B) ≥ 1(S ⊆ A) + 1(S ⊆ B). (9.8)

So by (B2*) we find

Bel(A ∪ B) + Bel(A ∩ B) ≥ Bel(A) + Bel(B). (9.9)

So by Theorem 2.7, Bel is a belief function.

Now suppose Bel is a belief function. Then (B1) is immediate and we have to show (B2*). Let m : 2^Ω → [0, 1] be the basic belief assignment corresponding to Bel. Let A_1, A_2, . . . , A_N ⊆ Ω and B_1, B_2, . . . , B_M ⊆ Ω be such that (9.6) holds. Then

∑_{i=1}^N Bel(A_i) = ∑_{S ⊆ Ω} m(S) ∑_{i=1}^N 1(S ⊆ A_i) ≥ ∑_{S ⊆ Ω} m(S) ∑_{j=1}^M 1(S ⊆ B_j) = ∑_{j=1}^M Bel(B_j). (9.10)

So (B2*) holds.

Theorem 9.2 tells us that an agent following the relaxed constraints is precisely an agent assigning, for some belief function Bel, Bel(A) as the maximum price she is willing to pay for a bet on A (that pays out 1 if A is true). This gives us our betting interpretation of belief functions.

References

[1] Aitken, C.G.G.: Statistics and the evaluation of evidence for forensic scientists. Chichester: John Wiley & Sons (1995).

[2] Balding, D.J. and Donnelly, P.: Inference in forensic identification. J. Royal Stat. Soc., 21 (1995).

[3] Dawid, A.: Probability and Proof. In: Anderson, T., Schum, D., Twining, W.: Analysis of Evidence. Cambridge University Press, second edn. (2005), on-line appendix.

[4] Fagin, R.
and Halpern, J.Y.: Uncertainty, belief and probability. Proc. Int. Joint Conference on AI, Detroit, 1161-1167 (1989).

[5] de Finetti, B.: La Prévision: Ses Lois Logiques, Ses Sources Subjectives. Annales de l'Institut Henri Poincaré, 1-68 (1937).

[6] Fienberg, S.E., Schervish, M.J.: The relevance of Bayesian inference for the presentation of statistical evidence and for legal decisionmaking. Boston University Law Review, special issue on Probability and Inference in the Law of Evidence, 771 (1986).

[7] Friedman, R.D.: Assessing evidence. Mich. L. Rev., 1810 (1996).

[8] Galavotti, M.C.: The modern epistemic interpretations of probability: logicism and subjectivism. In: Handbook of the History of Logic, D.M. Gabbay, S. Hartmann, J. Woods (eds.), Elsevier (2009).

[9] Goosens, W.K.: Alternative axiomatizations of elementary probability theory. Notre Dame Journal of Formal Logic XX, 227-239 (1979).

[10] Kerkvliet, T., Meester, R.: Assessing forensic evidence by computing belief functions - theory and applications, to appear (2015).

[11] Pearl, J.: Reasoning with belief functions: an analysis of compatibility. Int. J. of Approximate Reasoning, 363-389 (1990).

[12] Ramsey, F.P.: Truth and probability. In: The Foundations of Mathematics and other Logical Essays, 156-198, Routledge and Kegan Paul, London (1931).

[13] Robertson, B., Vignaux, G.A.: Interpreting evidence: evaluating forensic science in the courtroom. John Wiley & Sons (1995).

[14] Schum, D.A.: The evidential foundations of probabilistic reasoning. Northwestern University Press (1994).

[15] Shafer, G.: A mathematical theory of evidence. Princeton University Press, Princeton (1976).

[16] Shafer, G.: Constructive probability.