Maximum-entropy from the probability calculus: exchangeability, sufficiency
P.G.L. Porta Mana
Dedicated to my wonderful little sister Marianna for her birthday
The classical maximum-entropy principle (Jaynes 1963) appears in the probability calculus as an approximation of a particular model by exchangeability or of a particular model by sufficiency. The approximation from the exchangeability model can be inferred from an analysis by Jaynes (1996) and to some extent from works on entropic priors (Rodríguez 1989; 2002; Skilling 1989a; 1990); I tried to show it explicitly in a simple context (Porta Mana 2009). The approximation from the sufficiency model can be inferred from Bernardo & Smith (2000 § 4.5) and Diaconis & Freedman (1981) in combination with the Koopman-Pitman-Darmois theorem (see references in § 3). In this note I illustrate how either approximation arises, in turn, and then give a heuristic synopsis of both. At the end I discuss some questions: Prediction or retrodiction? Which of the two models is preferable? (the exchangeable one.) How good is the maximum-entropy approximation? Is this a "derivation" of maximum-entropy?

I assume that you are familiar with: the maximum-(relative-)entropy method (Jaynes 1957a; much clearer in Jaynes 1963; Sivia 2006; Hobson et al. 1973), especially the mathematical form of its distributions and its prescription "expectations = empirical averages"; the probability calculus (Jaynes 2003; Hailperin 1996; Jeffreys 2003; Lindley 2014); the basics of models by exchangeability and sufficiency (Bernardo et al. 2000 ch. 4), although I'll try to explain the basic ideas behind them – likely you've often worked with them even if you've never heard of them under these names.
We have a potentially infinite set of measurements, each having K possible outcomes. Dice rolls and their six outcomes are a typical example. I use the terms "measurement" and "outcome" to lend concreteness to the discussion, but the formulae below apply to much more general contexts.
The proposition that the n-th measurement has outcome k is denoted E^(n)_k. The relative frequencies of the K possible outcomes in a set of measurements are denoted f := (f_k). It may happen that in a measurement we observe not directly an outcome but an "observable" having values (O_k) =: O for the K outcomes. This observable may be vector-valued. The empirical average of the observable in a set of N measurements with outcomes {k_1, …, k_N} is ∑_{n=1}^{N} O_{k_n}/N, equivalent to ∑_k O_k f_k.

Probabilities have propositions as arguments (for good definitions of what a proposition is – it isn't a sentence, for example – see Strawson 1964; Copi 1979; Barwise et al. 2003). Johnson's definition remains one of the simplest and most beautiful: "Probability is a magnitude to be attached to any possibly true or possibly false proposition; not, however, to the proposition in and for itself, but in reference to another proposition the truth of which is supposed to be known" (Johnson 1924 Appendix, § 2). See also Hailperin's (1996; 2011) formalization, sadly neglected in the literature. The assumptions or knowledge underlying our probabilities – our "model" – will be generically denoted by I, with subscripts denoting specific assumptions. We will sometimes let a quantity stand as abbreviation for a proposition, for example f for "the observed relative frequencies in N measurements are f". In such cases the probability symbol will be in lower-case to remind us of our notational sins.

Lest this note become an anthill of indices let's use the following notation: for positive K-tuples x := (x_i), y := (y_i), and number a,

a x := (a x_i),  x/y := (x_i/y_i),  x y := (x_i y_i),  x^y := (x_i^{y_i}),  exp x := (exp x_i),  ln x := (ln x_i),  x! := (x_i!),
∑x := ∑_k x_k,  ∏x := ∏_k x_k,  binom(a, a x) := a! / ∏[(a x)!].   (1)

The symbol δ indicates a Dirac delta (Lighthill 1964; even better: Egorov 1990; 2001) or a characteristic function (cf. Knuth 1992), depending on the context. The Shannon entropy is H(x) := −∑ x ln x, and the relative Shannon entropy or negative discrimination information is H(x; y) := −∑ x ln(x/y). Let's keep in mind the important properties

H(x; y) ≤ 0,    H(x; y) = 0 ⇔ x = y.   (2)

The problem typically addressed by maximum-entropy is this: given that in a large number N of measurements we have observed an average having value in a convex set A (which can consist of a single number),

∑ O f ∈ A,   (3)

what is the probability of having outcome k in an (N+1)-th measurement? In symbols,

P[E^(N+1)_k | ∑ O f ∈ A, I] = ?   (4)

where I denotes our state of knowledge. The maximum-entropy answer (Mead et al. 1984; Fang et al. 1997; Boyd et al. 2009) has the form

r_k exp(λ·O_k) / ∑ r exp(λ·O)   (5)

where r is a reference distribution and λ is determined by the constraints in a way that we don't need to specify here. The convexity of A ensures the uniqueness of this answer.

Let's assume that in our state of knowledge I_x we deem the measurements to be infinitely exchangeable (Bernardo et al. 2000 § 4.2); that is, there can be a potentially unlimited number of them and their indices are irrelevant for our inferences. De Finetti's theorem (1930; 1937; Heath et al. 1976) states that this assumption forces us to assign probabilities of this form:

P[E^(1)_{k_1}, …, E^(N)_{k_N} | I_x] = ∫ q_{k_1} ⋯ q_{k_N} p(q | I_x) dq ≡ ∫ (∏ q^{Nf}) p(q | I_x) dq,   (6)

where the distribution q can be interpreted as the relative frequencies in the long run,¹ and integration is over the (K−1)-dimensional simplex (Grünbaum 2003) of such distributions, {q ∈ R^K : q ≥ 0, ∑q = 1}. The term p(q | I_x) dq can be interpreted as the prior probability density of observing the long-run frequencies q in an infinite number of measurements. This probability is not determined by the theorem. Let's call the expression above an exchangeability model (Bernardo et al. 2000 § 4.3).

¹ "But this long run is a misleading guide to current affairs. In the long run we are all dead." (Keynes 2013 § 3.I, p. 65)
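As a concrete aside, here is a minimal numerical sketch of the maximum-entropy answer (5), assuming Python with NumPy and SciPy; the die observable O_k = k, the uniform reference r and the target average a = 5 are illustrative choices of mine, not taken from the text.

# Minimal sketch: compute the maximum-entropy distribution (5) for a scalar
# observable by finding the lambda that satisfies the constraint (3) with A = {a}.
# The die values, uniform reference r and target a = 5 are illustrative choices.
import numpy as np
from scipy.optimize import brentq

O = np.arange(1, 7)           # observable values O_k (pips of a die)
r = np.full(6, 1/6)           # reference distribution r
a = 5.0                       # observed empirical average

def p_lambda(lam):
    """Distribution of maximum-entropy form (5) for multiplier lam."""
    w = r * np.exp(lam * O)
    return w / w.sum()

def mean_gap(lam):
    """Expectation of O under p_lambda minus the target average a."""
    return p_lambda(lam) @ O - a

lam_star = brentq(mean_gap, -50, 50)   # lambda fixed by "expectation = average"
p_star = p_lambda(lam_star)
print("lambda* =", round(lam_star, 4))
print("maximum-entropy distribution (5):", np.round(p_star, 4))
print("check: expectation of O =", round(float(p_star @ O), 4))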
We assume that our state of knowledge I_x is also expressed by a particular prior density for the long-run frequencies:

p(q | I_x) dq = κ(L, r) binom(L, Lq) ∏ r^{Lq} dq,   L ≥ 0,   (7)

which we can call a "multinomial prior" because it is a sort of continuous interpolation of the multinomial distribution (Johnson et al. 1996 ch. 35). In the latter each q_k assumes discrete values in {0, 1/L, …, 1} and the normalizing constant is unity; for this reason the normalizing constant κ(L, r) in eq. (7) is of order L^{K−1}. The results that follow also hold for any other prior density that is asymptotically equal to the one above for L large, for example one proportional to exp[L H(q; r)], which appears in Rodríguez's (1989; 2002) entropic prior and in Skilling's (1989a; 1990) prior for "classical" and "quantified" maximum-entropy.

To find the probability (4) queried by maximum-entropy we need the probability for each possible frequency distribution in the N measurements, which by combinatorial arguments is

p(f | I_x) = ∫ binom(N, Nf) (∏ q^{Nf}) p(q | I_x) dq.   (8)

There are binom(N+K−1, K−1) possible frequency distributions (Csiszár et al. 2004). By marginalization over the subset of frequencies consistent with our data, the probability for the empirical average is

P(∑ O f ∈ A | I_x) = ∑_f δ(∑ O f ∈ A) ∫ binom(N, Nf) (∏ q^{Nf}) p(q | I_x) dq.   (9)

Finally, using Bayes's theorem with the probabilities (6)–(9) we find

P[E^(N+1)_k | ∑ O f ∈ A, I_x] = { ∫ q_k ∑_f δ(∑ O f ∈ A) binom(N, Nf) (∏ q^{Nf}) p(q | I_x) dq } / { ∫ ∑_f δ(∑ O f ∈ A) binom(N, Nf) (∏ q^{Nf}) p(q | I_x) dq },   (10)

where the density p(q | I_x) dq is specified in eq. (7), even though the formula above holds as well with any other prior density.

I have graphically emphasized this formula because it is the exact answer given to the question (4) by a general exchangeability model: it holds for all numbers K of possible outcomes, all numbers N of observations, and all sets A – even non-convex ones.

If N and L are large we can use the bounds of the multinomial (Csiszár et al. 1981 Lemma 2.3)

binom(N, Nf) = ε(N, f) exp[N H(f)],   (N+1)^{−K} ≤ ε(N, f) ≤ 1,   (11)

and analogously for binom(L, Lq). From the bounds above it can be shown that the exact probability expression (10) has the asymptotic form

P[E^(N+1)_k | ∑ O f ∈ A, I_x] ≃ κ(N, L, r) ∫ q_k ∑_f δ(∑ O f ∈ A) exp[N H(f; q) + L H(q; r)] dq,   N, L large.   (12)

I prefer the symbol "≃", "is asymptotically equal to" (iso 2009; ieee 1993; iupac 2007), to the limit symbol "→" because the latter may invite one to think about a sequence, but no such sequence exists. In each specific problem N has one, fixed, possibly unknown value, and cannot be increased at will. The symbol "≃" says that the right side differs from the left side by an error that may be negligible. It is our duty to check whether this error is really negligible for our purposes.

The asymptotic expression above shows an interesting interplay of two relative entropies. The two exponential terms give rise to two Dirac deltas. The delta in f requires some mathematical care owing to the discreteness of this quantity; see Csiszár (1984; 1985). In particular, if N < K the discrete set of binom(N+K−1, K−1) possible frequency distributions lies within the (N−1)-dimensional facets of the (K−1)-dimensional simplex of distributions q; it does not "fill" the simplex. In this case the frequency sum ∑_f cannot be meaningfully approximated by an integral. The approximations below are valid if the number N of observations is much larger than the number K of possible outcomes.

If L/N is also large, taking limits in the proper order gives

P[E^(N+1)_k | ∑ O f ∈ A, I_x] ≃ r_k,   N, L, L/N large.   (13)

Note how the data about the average (3) are practically discarded in this (L/N)-large case. Compare with Skilling's remark that the parameter L (his α) shouldn't be "particularly large" (cf. Skilling 1998 p. 2).
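The exact formula (10) can also be evaluated numerically for small problems. The following sketch, again an illustration of mine assuming Python with NumPy/SciPy, uses a die, N = 12 observations with average a = 5, the multinomial prior (7) with L = 50, and plain Monte Carlo over the simplex, so the output is only approximate.

# Sketch of the exact exchangeability-model answer (10) with the multinomial
# prior (7), for a die with N = 12 and average a = 5.  All settings are
# illustrative; the simplex integral is estimated by plain Monte Carlo.
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)
K, N, L, a = 6, 12, 50, 5.0
O = np.arange(1, K + 1)                 # observable values O_k
r = np.full(K, 1 / K)                   # reference distribution r

def compositions(n, k):
    """All count vectors (n_1, ..., n_k) with non-negative entries summing to n."""
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

# Frequency vectors f = n/N satisfying the constraint  sum_k O_k f_k = a
# (the delta in eq. (10)), together with their multinomial coefficients.
admissible = np.array([n for n in compositions(N, K) if np.dot(O, n) == a * N])
log_multinom = gammaln(N + 1) - gammaln(admissible + 1).sum(axis=1)

def log_prior(q):
    """Unnormalised multinomial prior (7): binom(L, Lq) * prod_k r_k^(L q_k)."""
    return gammaln(L + 1) - gammaln(L * q + 1).sum(axis=1) + L * (q @ np.log(r))

q = rng.dirichlet(np.ones(K), size=10000)          # uniform samples on the simplex
# S(q) = sum over admissible f of binom(N, Nf) * prod_k q_k^(N f_k)
logS = np.log(np.exp(log_multinom[:, None] + admissible @ np.log(q).T).sum(axis=0))
w = np.exp(logS + log_prior(q))                    # integrand weight per sample
predictive = (q * w[:, None]).sum(axis=0) / w.sum()
print("P[E^(N+1)_k | sum O f = a, I_x] ≈", np.round(predictive, 4))

The same machinery works for a general constraint set A: replace the equality test on the average by a membership test in A.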
The asymptotic case that interests us is N/L large: the exponential in N dominates the integral of eq. (12), which becomes

κ(L, r) ∑_f f_k δ(∑ O f ∈ A) exp[L H(f; r)] ≃ f*_k,  with f* := arg sup_{f : ∑ O f ∈ A} H(f; r),   (14)

so that, finally,

P[E^(N+1)_k | ∑ O f ∈ A, I_x] ≃ f*_k,   N, L, N/L large, with f* maximizing H(f; r) under the constraint ∑ O f ∈ A,   (15)

which is the maximum-entropy recipe, giving the distribution (5).

Consider the following assumption or working hypothesis, denoted I_s: To predict the outcome of an (N+1)-th measurement given knowledge of the outcomes of N measurements, all we need to know is the average ∑ O f of an observable O in those N measurements, no matter the value of N. In other words, any data about known measurements, besides the empirical average of O, are irrelevant for our prediction. The average ∑ O f is then called a minimal sufficient statistic (Bernardo et al. 2000 § 4.5; Lindley 2008 § 5.5). In symbols,

P[E^(N+1)_k | E^(1)_{k_1}, …, E^(N)_{k_N}, I_s] = p[E^(N+1)_k | ∑ O f, N, I_s].   (16)

Note that the data {E^(n)_{k_n}} determine the data {∑ O f, N} but not vice versa, so some data have effectively been discarded in the conditional.

The Koopman-Pitman-Darmois theorem (Koopman 1936; Pitman 1936; Darmois 1935; see also later analyses: Hipp 1974; Andersen 1970; Denny 1967; Fraser 1963; Barankin et al. 1963) states that this assumption forces us to assign probabilities of this form:

P[E^(1)_{k_1}, …, E^(N)_{k_N} | I_s] = ∫ p(k_1 | λ, r, I_s) ⋯ p(k_N | λ, r, I_s) p(λ | I_s) dλ ≡ ∫ [∏ p(k | λ, r, I_s)^{Nf}] p(λ | I_s) dλ,   (17a)

with

p(k | λ, r, I_s) := r exp(λ·O) / Z(λ),   Z(λ) := ∑ r exp(λ·O),   (17b)

and we have defined p(k | …) := (p(1 | …), …, p(K | …)). The integration of the parameter λ is over R^M, with M the dimension of the vector-valued observable O, and r is a K-dimensional distribution. Neither r nor the distribution p(λ | I_s) is determined by the theorem. Let's call the expression above a sufficiency model (Bernardo et al. 2000 § 4.5). A sufficiency model can be viewed as a mixture, with weight density p(λ | I_s) dλ, of distributions having maximum-entropy form (5) with multipliers λ.

To find the probability (4) we calculate, as in the previous section, the probabilities for the frequencies:

p(f | I_s) = ∫ binom(N, Nf) [∏ p(k | λ, r, I_s)^{Nf}] p(λ | I_s) dλ,   (18)

and for the empirical average by marginalization:

P(∑ O f ∈ A | I_s) = ∑_f δ(∑ O f ∈ A) ∫ binom(N, Nf) [∏ p(k | λ, r, I_s)^{Nf}] p(λ | I_s) dλ.   (19)

From these, using Bayes's theorem, we finally find

P[E^(N+1)_k | ∑ O f ∈ A, I_s] = { ∫ p(k | λ, r, I_s) ∑_f δ(∑ O f ∈ A) binom(N, Nf) [∏ p(k | λ, r, I_s)^{Nf}] p(λ | I_s) dλ } / { ∫ ∑_f δ(∑ O f ∈ A) binom(N, Nf) [∏ p(k | λ, r, I_s)^{Nf}] p(λ | I_s) dλ }.   (20)

This is the exact answer given to the maximum-entropy question by a sufficiency model if the constraints used in maximum-entropy are considered to be a sufficient statistic. This proviso has serious consequences, discussed in § 5.2. The expression above holds for all N and all sets A, even non-convex ones.

The asymptotic analysis for large N uses again the multinomial's bounds (11). We find

P[E^(N+1)_k | ∑ O f ∈ A, I_s] ≃ κ(N, r) ∫ p(k | λ, r, I_s) ∑_f δ(∑ O f ∈ A) exp{N H[f; p(k | λ, r, I_s)]} p(λ | I_s) dλ,   N large.   (21)
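The exact sufficiency-model formula (20) can be evaluated in the same spirit. In the sketch below, an illustration of mine rather than anything from the text, the prior p(λ | I_s) is taken to be a broad Gaussian purely for concreteness; the die, N = 12 and a = 5 match the earlier sketch.

# Sketch of the exact sufficiency-model answer (20): a Monte Carlo mixture over
# the multiplier lambda of exponential-family distributions (17b).  The Gaussian
# prior for lambda and all numerical settings are illustrative assumptions.
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)
K, N, a = 6, 12, 5.0
O = np.arange(1, K + 1)
r = np.full(K, 1 / K)

def compositions(n, k):
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

admissible = np.array([n for n in compositions(N, K) if np.dot(O, n) == a * N])
log_multinom = gammaln(N + 1) - gammaln(admissible + 1).sum(axis=1)

lam = rng.normal(0.0, 2.0, size=10000)           # samples from p(lambda | I_s)
w_k = r * np.exp(np.outer(lam, O))               # unnormalised (17b), per lambda
p_k = w_k / w_k.sum(axis=1, keepdims=True)       # p(k | lambda, r, I_s)

# Data weight: sum over admissible f of binom(N, Nf) * prod_k p(k|lambda)^(N f_k)
log_data = np.log(np.exp(log_multinom[:, None] + admissible @ np.log(p_k).T).sum(axis=0))
weights = np.exp(log_data)
predictive = (p_k * weights[:, None]).sum(axis=0) / weights.sum()
print("P[E^(N+1)_k | sum O f = a, I_s] ≈", np.round(predictive, 4))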
A rigorous analysis of this limit can be done using "information projections" (Csiszár 1984; 1985); here is a heuristic summary. Consider the sum in f for fixed λ. We have two cases. (1) If λ is such that ∑ O p(k | λ, r, I_s) ∈ A, there exists a unique f in the sum for which the relative entropy in the exponential reaches its maximum, zero, making the exponential unity. For all other f the relative entropy is negative and the exponential asymptotically vanishes for large N. The integral therefore doesn't vanish asymptotically. (2) If λ is such that p(k | λ, r, I_s) doesn't satisfy the constraints, the relative entropy in the exponential will be negative for all f in the sum, making the exponential asymptotically vanish for all f. The integral therefore vanishes asymptotically. The distinction between these two cases actually requires mathematical care owing to the discreteness of the sum. The f sum then acts as a delta or characteristic function (depending on whether A has measure zero or not):

∑_f δ(∑ O f ∈ A) exp{N H[f; p(k | λ, r, I_s)]} ≃ δ[∑ O p(k | λ, r, I_s) ∈ A].   (22)

Thus asymptotically we have, using the explicit expression (17b) for p(k | λ, r, I_s):

P[E^(N+1)_k | ∑ O f ∈ A, I_s] ≃ ∫ δ[∑ O r exp(λ·O)/Z(λ) ∈ A] · [r_k exp(λ·O_k)/Z(λ)] · p(λ | I_s) dλ,   N large.   (23)

This result can also be found by first integrating over λ and then summing over f, using a heuristic argument similar to the one above. This is a mixture, with weight density p(λ | I_s) dλ, of maximum-relative-entropy distributions f* that satisfy the individual constraints ∑ O f* = a, a ∈ A. The final distribution thus differs from the maximum-entropy one if the set A is not a singleton: maximum-entropy would pick out only one distribution. But if the constraint set is a singleton, A = {a}, we do obtain the same answer (5) as the maximum-entropy recipe:

P[E^(N+1)_k | ∑ O f = a, I_s] ≃ f*_k,   N large, with f* maximizing H(f; r) under the constraint ∑ O f = a.   (24)
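Here is a crude numerical check of this heuristic, assuming Python/NumPy and arbitrary test values of mine: for a λ whose distribution (17b) satisfies the constraint the constrained sum stays of order one as N grows, while for a λ violating it the sum decays exponentially.

# Heuristic check of eq. (22): the constrained sum over frequency vectors acts,
# for large N, like a characteristic function of whether p(.|lambda, r, I_s)
# itself satisfies the constraint.  (The sub-exponential factor epsilon(N, f)
# of the multinomial bound (11) is left out.)  Test values are illustrative.
import numpy as np
from scipy.optimize import brentq

K, a = 6, 5.0
O = np.arange(1, K + 1)
r = np.full(K, 1 / K)

def p_lambda(lam):
    w = r * np.exp(lam * O)
    return w / w.sum()

def compositions(n, k):
    if k == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, k - 1):
            yield (first,) + rest

def constrained_sum(N, p):
    """sum over {f : sum O f = a} of exp(N * H(f; p)),  cf. eq. (22)."""
    total = 0.0
    for n in compositions(N, K):
        if np.dot(O, n) != a * N:
            continue
        f = np.array(n) / N
        pos = f > 0
        H = -(f[pos] * np.log(f[pos] / p[pos])).sum()   # relative entropy H(f; p) <= 0
        total += np.exp(N * H)
    return total

lam_match = brentq(lambda l: p_lambda(l) @ O - a, -50, 50)   # expectation of O equals a
for lam, label in [(lam_match, "constraint satisfied"), (0.0, "constraint violated")]:
    p = p_lambda(lam)
    for N in (6, 12, 24):
        print(f"{label:21s}  N = {N:2d}  constrained sum ≈ {constrained_sum(N, p):.3e}")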
First of all let's note that both the exchangeability (6) and sufficiency (17) models have the parametric form

P[E^(1)_{k_1}, …, E^(N)_{k_N} | I] = ∫ p(k_1 | ν, I) ⋯ p(k_N | ν, I) p(ν | I) dν ≡ ∫ [∏ p(k | ν, I)^{Nf}] p(ν | I) dν.   (25)

The final probability distribution p for the K outcomes of the (N+1)-th measurement belongs to a (K−1)-dimensional simplex {p ∈ R^K : p ≥ 0, ∑p = 1}. The expression above first selects, within this simplex, a family of distributions {p(k | ν, I)} parametrized by ν; then it delivers the distribution p as a mixture of the distributions of this family, with weight density p(ν | I) dν. In the exchangeability model this family is actually the whole simplex (that's why it's sometimes called a "non-parametric" model). In the sufficiency model it is an exponential family (Bernardo et al. 2000 § 4.5.3; Barndorff-Nielsen 2014).

When we conditionalize on data D, the weight density is determined by the mutual modulation of two weights: that of the probability of the data p(D | ν, I) and the initial weight p(ν | I). Pictorially, if

p(D | ν, I) × p(ν | I) = κ p(ν | D, I),   (26)

the final p is given by the mixture with the weight density p(ν | D, I) dν ensuing from this modulation. The mathematical expression of the data weight p(D | ν, I) is typically exponentiated to the number of measurements N from which the data originate; compare with eqs (19), (20). If N is large this weight is very peaked on the subset of distributions that give highest probability to the data, that is, that have expectations very close to the empirical averages. It effectively restricts the second weight p(ν | I) dν to such a "data subset". In our case the data subset consists of all distributions satisfying the constraints.

The mechanism described so far is common to the exchangeability and the sufficiency model. Their difference lies in how they choose the final distribution from the data subset.
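As a toy illustration of this modulation (assuming Python/NumPy, a binary case K = 2 with ν the probability of outcome 1, and data and prior shapes chosen arbitrarily by me): the data weight, exponentiated to N, pins the final weight down to distributions whose expectation is close to the empirical average.

# Toy version of the weight modulation (26) for K = 2, with nu = P(outcome 1)
# on a grid.  Data weight p(D | nu, I) times prior weight p(nu | I) gives the
# final weight; the data and prior shapes are illustrative assumptions.
import numpy as np

nu = np.linspace(0.001, 0.999, 999)       # grid over the 1-dimensional simplex
N, n1 = 50, 40                            # illustrative data: 40 "ones" in 50 trials

log_data_weight = n1 * np.log(nu) + (N - n1) * np.log(1 - nu)    # p(D | nu, I)
log_prior_weight = 5 * (np.log(nu) + np.log(1 - nu))             # a mild prior p(nu | I)

final_weight = np.exp(log_data_weight + log_prior_weight)
final_weight /= final_weight.sum()

print("empirical frequency of outcome 1:", n1 / N)
print("mean of nu under the final weight:", round(float((nu * final_weight).sum()), 3))
print("final weight within ±0.05 of the empirical frequency:",
      round(float(final_weight[np.abs(nu - n1 / N) < 0.05].sum()), 3))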
In the exchangeability model (6) the choice is made by the weight density p(ν | I) dν, i.e. the multinomial prior (7). It is extremely peaked owing to the large parameter L, and its level curves are isentropics. Once it's restricted to the data subset by the data weight p(D | ν, I), it gives highest weight to the distribution p lying on the highest isentropic curve, which is unique if the data subset is convex; compare with fig.-eq. (26). Hence this is a maximum-entropy distribution satisfying the data constraints. For this mechanism to work it's necessary that the dominance of the data weight comes first, and the dominance of the multinomial prior comes second. This is the reason why the correct asymptotic limit (15) has N, L, and N/L large.

In the sufficiency model (17) the choice is made by the family of distributions {p(k | ν, I)}_ν. These distributions have by construction a maximum-entropy form for the particular observable O. This family intersects the data subset in only one point if the constraint has the form ∑ O f = a. This point is therefore the maximum-entropy distribution satisfying the data constraints.

The mechanism above also explains why these two models still work if the data subset is non-convex and touches the highest isentropics (exchangeability model) or the exponential family (sufficiency model) in multiple points, bringing the maximum-entropy recipe to an impasse. The final distribution will simply be an equal mixture of such tangency points; it may well lie outside of the data subset.

An essential aspect of the maximum-entropy method is surprisingly often disregarded in the literature. If we have data from N measurements, we can ask two questions:

"Prediction": what is the outcome of a further similar measurement?

"Retrodiction": what is the outcome of the first of the N measurements?

Note that despite the literal meaning of these terms the distinction is not between future and past, but between unknown and partially known. It's rarely made clear whether the maximum-entropy probabilities refer to the first or to the second question. Yet these two questions are fundamentally different; their answers rely on very different principles.

To answer the first question we can – but need not – fully rely on symmetry principles in the discrete case. It is a matter of combinatorics and equal probabilities; a drawing-from-an-urn problem. Most derivations of the maximum-entropy method (e.g. Jaynes 1963; Shore et al. 1980; van Campenhout et al. 1981; Csiszár 1985) address this question only, as often betrayed by the presence of "p(x)" or similar expressions in their final formulae.

To answer the second question, symmetry and combinatorics alone are no use: additional principles are needed. This is the profound philosophical question of induction, with its ocean of literature; my favourite samples are the classics Hume (1896 book I, § III.VI), Johnson (1922 esp. chs VIII ff; 1924 Appendix; 1932), de Finetti (1937; 1959), Jeffreys (1955; 1973 ch. I; 2003 § 1.0), Jaynes (2003 § 9.4). De Finetti, foreshadowed by Johnson, was probably the one who expressed most strongly, and explained brilliantly, that the probability calculus does not and cannot explain or justify our inductive reasoning; it only expresses it in a quantitative way. This shift in perspective was very much like Galilei's shift from why to how in the study of physical phenomena.² We do inductive inferences in many different ways (Jaynes 2003 § 9.4). The notion of exchangeability (de Finetti 1937; Johnson 1924 Appendix; 1932) captures one of the most intuitive and expresses it mathematically.

² "According to credible traditions it was in the sixteenth century, an age of very intense spiritual emotions, that people gradually ceased trying, as they had been trying all through two thousand years of religious and philosophic speculation, to penetrate into the secrets of Nature, and instead contented themselves, in a way that can only be called superficial, with investigations of its surface. The great Galileo, who is always the first to be mentioned in this connection, did away with the problem, for instance, of the intrinsic reasons why Nature abhors a vacuum, so that it will cause a falling body to enter into and occupy space after space until it finally comes to rest on solid ground, and contented himself with a much more general observation: he simply established the speed at which such a body falls, what course it takes, what time it takes, and what its rate of acceleration is. The Catholic Church made a grave mistake in threatening this man with death and forcing him to recant, instead of exterminating him without more ado." (Musil 1979 vol. 1, ch. 72)

The calculations of the previous sections and the final probabilities (10), (20) for our two models pertain to the predictive question, as is clear from the E^(N+1) in their arguments. The two models can also be used to answer the retrodictive question. The resulting formulae are different; they can again be found by applying the rules of the probability calculus and Bayes's theorem. The retrodictive formula for the exchangeability model is (proof in Porta Mana 2009 § B):

P[E^(n)_k | ∑ O f ∈ A, I_x] = { ∫ ∑_f f_k δ(∑ O f ∈ A) binom(N, Nf) (∏ q^{Nf}) p(q | I_x) dq } / { ∫ ∑_f δ(∑ O f ∈ A) binom(N, Nf) (∏ q^{Nf}) p(q | I_x) dq },   n ∈ {1, …, N}.   (27)

Graphically it differs from the predictive one (10) only in the replacement of q_k by f_k. An analogous replacement appears in the retrodictive formula for the sufficiency model. But this graphically simple replacement leads to a mechanism very different from the one of § 4 in delivering the final probability: it's a mixture on the data subset rather than on the whole simplex. Predictive and retrodictive probabilities can therefore be very different for small N. See for example figs 1 and 2 below and their accompanying discussion. This means that the goodness of the maximum-entropy distribution as an approximation of our two models can depend on whether we are asking a predictive or a retrodictive question. This fact is very important in every application.

A maximum-entropy distribution can be seen as an approximation of the distribution obtained from the exchangeability model or the sufficiency one (repetita iuvant). The two inferential models are not equivalent though, and there are reasons to prefer the exchangeability one – despite the frequent association, in the literature, of maximum-entropy with exponential families. The most important and quite serious difference is this:

Suppose that we have used either model to assign a predictive distribution conditional on the empirical average a of the observable O, obtained from N measurements. If N is large the distributions obtained from either model will be approximately equal, and equal to the maximum-entropy one. Now someone gives us a new empirical average a′ of a different observable O′, obtained from the same N measurements. This observable turns out to be complementary to the previous one, in the sense that in general from knowing the value of ∑ O f we cannot deduce the value of ∑ O′ f, and vice versa. These new data therefore reveal more about the outcomes of our N measurements and of possible further measurements. The new empirical average a′ can be incorporated in the exchangeability model; the resulting predictive and retrodictive distributions conditional on (a′, a) will be numerically different from the ones conditional on a only.
They will be approximated by a maximum-entropy one based on the old and new constraints.

If we incorporate the new average in the sufficiency model, however, the resulting predictive conditional distribution will be unchanged: knowledge of the new data has no effect in the prediction of new measurements. The reason is simple: the sufficiency model expresses by construction that the average of the old observable O is all we need for our inferences about further measurements. Any other observable is irrelevant. The new average automatically drops out under predictive conditioning. The only way to obtain a different predictive conditional distribution would be to discard the sufficiency model based on O, and use a new one based on (O, O′). But that would be cheating!

This shows how dramatically absolute and categorical the assumption of the existence of a sufficient statistic is. The difficulty above doesn't happen for the retrodictive distribution; the proof is left as an exercise for you.

Since the maximum-entropy method is meant to always employ new constraints, we deduce that it's more correct to interpret it as an approximation of the exchangeability model than of the sufficiency model.

How does maximum-entropy compare with the exchangeability model (6) with multinomial prior (7) away from the asymptotic approximation? Their distributions are compared in the classic example of dice rolling in figs 1 and 2 for empirical averages of 5 and 6 (see Porta Mana 2009 for the calculations). The maximum-entropy distribution (red) is at the top; the distribution of the exchangeability model for two values of L – the smaller one in blue, L = 50 in bluish purple – is shown underneath for the cases N = 2, N = 12, N = ∞, and for the retrodiction of an "old roll" E^(n)_k, n ∈ {1, …, N}, and the prediction of a "new roll" E^(N+1)_k. The charts also report the Shannon entropies H of the distributions.

The exchangeability model gives very reasonable and even "logical" probabilities for small N. For example, if you obtain an average of 5 in two rolls, it's impossible that either of them was ⚀ – unless, of course, you own a six-sided die with nine pips on one face. The exchangeability model logically gives zero probability in this case (fig. 1 bottom left). Maximum-entropy gives an erroneous non-zero probability. And having obtained an average of 5 or 6 in two rolls, would you really give a much higher probability to ⚄ or ⚅ for a third roll? I'd still give 1/6.
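The "average 5 in two rolls" case can be checked numerically. The sketch below is my own illustration, assuming Python with NumPy/SciPy, the multinomial prior (7) with L = 50, and Monte Carlo over the simplex (so the numbers are approximate); it reproduces the qualitative behaviour of the bottom panels of fig. 1: the retrodictive probabilities of ⚀, ⚁, ⚂ come out exactly zero, while the predictive distribution stays close to uniform.

# Predictive (10) vs retrodictive (27) probabilities for two die rolls with
# average a = 5, under the exchangeability model with multinomial prior (7).
# L and the Monte Carlo settings are illustrative choices.
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)
K, N, L, a = 6, 2, 50, 5.0
O = np.arange(1, K + 1)
r = np.full(K, 1 / K)

# Count vectors compatible with "two rolls, sum 10": {4,6} and {5,5}.
counts = np.array([[0, 0, 0, 1, 0, 1],
                   [0, 0, 0, 0, 2, 0]])
f = counts / N                                                    # frequency vectors
log_multinom = gammaln(N + 1) - gammaln(counts + 1).sum(axis=1)   # coefficients 2 and 1

def log_prior(q):
    """Unnormalised multinomial prior (7)."""
    return gammaln(L + 1) - gammaln(L * q + 1).sum(axis=1) + L * (q @ np.log(r))

q = rng.dirichlet(np.ones(K), size=50000)            # uniform samples on the simplex
terms = np.exp(log_multinom[:, None] + counts @ np.log(q).T)   # per f, per sample
S = terms.sum(axis=0)                                # data weight, per sample
prior_w = np.exp(log_prior(q))

predictive = (q * (S * prior_w)[:, None]).sum(axis=0) / (S * prior_w).sum()   # eq. (10)
retrodictive = (f.T @ terms) @ prior_w / (S * prior_w).sum()                  # eq. (27)
print("predictive   P(new roll = k | a = 5):", np.round(predictive, 3))
print("retrodictive P(old roll = k | a = 5):", np.round(retrodictive, 3))

The exact zeros in the retrodictive case come from the frequency vectors themselves: no admissible count vector gives any weight to the faces 1, 2, 3.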
[Figure 1: Maximum-entropy distribution and exchangeability-model distributions for the empirical average a = 5. Top: the maximum-entropy distribution with its Shannon entropy H. Below, in pairs of panels (left: retrodictive "old roll"; right: predictive "new roll"): the exchangeability-model distributions for the two values of L, for N = ∞, N = 12, and N = 2; each panel reports the Shannon entropies of the plotted distributions.]

[Figure 2: The same comparison for the empirical average a = 6.]
The exchangeability model reasonably gives an almost uniform distribution, especially for large L (both figures bottom right). The maximum-entropy distribution is unreasonably biased towards high values. If we observe a high average in twelve rolls we start to suspect that the die/dice or the roll technique are biased. The exchangeability model expresses this bias, but more conservatively than maximum-entropy.

In fact the predictive exchangeability-model distribution can have higher entropy than the maximum-entropy one! This happens because, when N is small compared to L, the maximum-entropy prescription "what you've seen in N measurements = what you should expect in an (N+1)-th measurement" is silly (MacKay 2003 Exercise 22.13). The exchangeability model intelligently doesn't respect this prescription strictly, if N isn't large.³ See Porta Mana (2009) for comparisons under other values of the empirical average and of the number of measurements.

³ "Obedience is no longer a virtue." (Milani 1965)

When is N large enough for the prescription to become reasonable? In other words, when is maximum-entropy a good approximation of the exchangeability model with multinomial prior? The answer depends on the interplay among the number of measurements N, the number of possible outcomes K, the parameter L, the reference distribution r, and the value a (or range A) of the observed average. The first three ingredients determine the maximum heights of the densities involved in the integral and sum of eq. (10); the last three ingredients determine the size of the effective integration and sum region relative to the integration simplex, and the distance between the peaks of the data weights and the prior weights of fig.-eq. (26). All five ingredients determine how good the delta approximations in the integral and sum of eq. (10) are. We saw in § 2, after eq. (12), that N needs to be much larger than K for the integral and delta approximations of the frequency sum to be meaningful. Maximum-entropy approximations are not meaningful if the number of possible outcomes is much larger than the number of observations.

It would be very useful to have explicit estimates of the maximum-entropy-approximation error as a function of the quantities above. I hope to analyse them in a future note, and promise it would be a shorter note.

The heuristic explanation of § 4 shows that the maximum-entropy distributions appear asymptotically owing to our specific choices of a multinomial prior in the exchangeability model, and of an exponential family with observable O in the sufficiency model. They are therefore not derived only from first principles or from some sort of universal limit. This is why I don't call the asymptotic analysis discussed in this note a "derivation" of the maximum-entropy "principle". In my opinion this analysis shows that it is not a principle at all.

The information-theoretic arguments – or should we say incentives – behind the standard maximum-entropy recipe can be lifted to a meta-level⁴ and applied to the choice of L and r, though. They seem to be prone to an infinite regress; Jaynes was aware of this (Jaynes 2003 § 11.1, p. 344).

⁴ "This is an expression used to hide the absence of any mathematical idea [. . .]. Personally, I never use this expression in front of children." (Girard 2001 p. 446)

It would be useful if the multinomial or entropic priors could be uniquely determined by intuitive inferential assumptions, as for example is the case with the Johnson-Dirichlet prior, proportional to ∏ q^{Lr} dq: this prior must be used if we believe (denote this by I_J) that the frequencies of other outcomes are irrelevant for predicting a particular one:

p[E^(N+1)_k | f, N, I_J] = p[E^(N+1)_k | f_k, N, I_J],   k ∈ {1, …, K},   (28)

a condition called "sufficientness" (Johnson 1924; 1932; Good 1965 ch. 4; Zabell 1982; Jaynes 1996). Asymptotically it leads to a maximum-entropy distribution with Burg's (1975) entropy ∑ ln x (see Jaynes 1996; Porta Mana 2009).

But, after all, the logical calculus doesn't tell us which truths to choose at the beginning of a logical deduction. Why should the probability calculus tell us which probabilities to choose at the beginning of a probabilistic induction?

Interpreting the maximum-entropy method as an approximation of the exchangeable model (6) with multinomial prior (7) has many advantages:

• it clears up the meaning of the "expectation = average" prescription of the maximum-entropy method;

• it identifies the range of validity of such a prescription;

• it quantifies the error of the maximum-entropy approximation;
• it gives a more sensible solution when this approximation doesn't hold;

• it clearly differentiates between prediction and retrodiction;

• it can be backed up by information-theoretic incentives (Rodríguez 1989; 2002) if you're into those.

Disadvantages:

• It can't be used to answer the question "Where did the cat go?". But this question lies forever beyond the reach of the probability calculus.

That's all (Hanshaw 1928).
Thanks . . . to Philip Goyal, Moritz Helias, Vahid Rostami, Jakob Jordan, Alper Yegenoglu, Emiliano Torre for many insightful discussions about maximum-entropy. To Mari & Miri for continuous encouragement and affection, and to Buster Keaton and Saitama for filling life with awe and inspiration. To the developers and maintainers of LaTeX, Emacs, AUCTeX, Open Science Framework, PhilSci, Hal archives, bioRxiv, Python, Inkscape, Sci-Hub for making a free and unfiltered scientific exchange possible.
Bibliography
("van X" is listed under V; similarly for other prefixes, regardless of national conventions.)

Andersen, E. B. (1970): Sufficiency and exponential families for discrete sample spaces. J. Am. Stat. Assoc., 1248–1255.
Barankin, E. W., Maitra, A. P. (1963): Generalization of the Fisher-Darmois-Koopman-Pitman theorem on sufficient statistics. Sankhyā A, 217–244.
Barnard, G. A., Jaynes, E. T., Seidenfeld, T., Polasek, W., Csiszár, I. (1985): Discussion [An extended maximum entropy principle and a Bayesian justification] and Reply. In: (Bernardo, DeGroot, Lindley, Smith 1985), 93–98. See (Csiszár 1985).
Barndorff-Nielsen, O. E. (2014): Information and Exponential Families: In Statistical Theory, reprint. (Wiley, New York). First publ. 1978.
Barwise, J., Etchemendy, J. (2003): Language, Proof and Logic. (CSLI, Stanford). Written in collaboration with Gerard Allwein, Dave Barker-Plummer, Albert Liu. First publ. 1999.
Bernardo, J.-M., DeGroot, M. H., Lindley, D. V., Smith, A. F. M., eds. (1985): Bayesian Statistics 2. (Elsevier and Valencia University Press, Amsterdam and Valencia).
Bernardo, J.-M., Smith, A. F. (2000): Bayesian Theory, reprint. (Wiley, New York). First publ. 1994.
Boyd, S., Vandenberghe, L. (2009): Convex Optimization, 7th printing with corrections. (Cambridge University Press, Cambridge). First publ. 2004.
Burg, J. P. (1975): Maximum entropy spectral analysis. PhD thesis. (Stanford University, Stanford).
Copi, I. M. (1979): Symbolic Logic, 5th ed. (Macmillan, New York). First publ. 1954.
Csiszár, I. (1984): Sanov property, generalized I-projection and a conditional limit theorem. Ann. Prob., 768–793.
— (1985): An extended maximum entropy principle and a Bayesian justification. In: (Bernardo, DeGroot, Lindley, Smith 1985), 83–93. With discussion and reply (Barnard, Jaynes, Seidenfeld, Polasek, Csiszár 1985).
Csiszár, I., Körner, J. (1981): Information Theory: Coding Theorems for Discrete Memoryless Systems. (Academic Press, New York). Second ed. (Csiszár, Körner 2011).
— (2011): Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed. (Cambridge University Press, Cambridge). First publ. 1981.
Csiszár, I., Shields, P. C. (2004): Information theory and statistics: a tutorial. Foundations and Trends in Communications and Information Theory, 417–528.
Curien, P.-L. (2001): Preface to Locus solum. Math. Struct. in Comp. Science, 299–300. See also (Girard 2001).
Darmois, G. (1935): Sur les lois de probabilité à estimation exhaustive. Comptes rendus hebdomadaires des séances de l'Académie des sciences, 1265–1266.
de Finetti, B. (1930): Funzione caratteristica di un fenomeno aleatorio. Atti Accad. Lincei: Sc. Fis. Mat. Nat. IV, 86–133.
— (1937): La prévision : ses lois logiques, ses sources subjectives. Ann. Inst. Henri Poincaré, 1–68. Transl. as (de Finetti 1964).
— (1959): La probabilità e la statistica nei rapporti con l'induzione, secondo i diversi punti di vista. In: (de Finetti 2011), 1–115. Transl. as (de Finetti 1972a).
— (1964): Foresight: its logical laws, its subjective sources. In: (Kyburg, Smokler 1980), 53–118. Transl. of (de Finetti 1937) by Henry E. Kyburg, Jr.
— (1972a): Probability, statistics and induction: their relationship according to the various points of view. In: (de Finetti 1972b), ch. 9, 147–227. Transl. of (de Finetti 1959).
— (1972b): Probability, Induction and Statistics: The art of guessing. (Wiley, London).
— ed. (2011): Induzione e statistica, reprint. (Springer, Berlin). First publ. 1959.
Demidov, A. S. (2001): Generalized Functions in Mathematical Physics: Main Ideas and Concepts. (Nova Science, Huntington, USA). With an addition by Yu. V. Egorov.
Denny, J. L. (1967): Sufficient conditions for a family of probabilities to be exponential. Proc. Natl. Acad. Sci. (USA), 1184–1187.
Diaconis, P., Freedman, D. (1981): Partial exchangeability and sufficiency. In: (Ghosh, Roy 1981), 205–236. Also publ. 1982 as technical report, http://statweb.stanford.edu/~cgates/PERSI/year.html .
Egorov, Yu. V. (1990): A contribution to the theory of generalized functions. Russ. Math. Surveys (Uspekhi Mat. Nauk), 1–49.
— (2001): A new approach to the theory of generalized functions. In: (Demidov 2001), 117–123.
Erickson, G. J., Rychert, J. T., Smith, C. R., eds. (1998): Maximum Entropy and Bayesian Methods. (Springer, Dordrecht).
Fang, S.-C., Rajasekera, J. R., Tsao, H.-S. J. (1997): Entropy Optimization and Mathematical Programming, reprint. (Springer, New York).
Ford, K. W., ed. (1963): Statistical Physics. (Benjamin, New York).
Fougère, P. F., ed. (1990): Maximum Entropy and Bayesian Methods: Dartmouth, U.S.A., 1989. (Kluwer, Dordrecht).
Fraser, D. A. S. (1963): On sufficiency and the exponential family. J. Roy. Stat. Soc. B, 115–123.
Ghosh, J. K., Roy, J., eds. (1981): Statistics: Applications and New Directions. (Indian Statistical Institute, Calcutta).
Girard, J.-Y. (2001): Locus solum: From the rules of logic to the logic of rules. Math. Struct. in Comp. Science, 301–506. http://iml.univ-mrs.fr/~girard/Articles.html . See also (Curien 2001).
Good, I. J. (1965): The Estimation of Probabilities: An Essay on Modern Bayesian Methods. (MIT Press, Cambridge, USA).
Grünbaum, B. (2003): Convex Polytopes, 2nd ed. (Springer, New York). Prep. by Volker Kaibel, Victor Klee, and Günter M. Ziegler. First publ. 1967.
Hailperin, T. (1996): Sentential Probability Logic: Origins, Development, Current Status, and Technical Applications. (Associated University Presses, London).
— (2011): Logic with a Probability Semantics: Including Solutions to Some Philosophical Problems. (Lehigh University Press, Plymouth, UK).
Hanshaw, A. (1928): My Blackbirds are Bluebirds now. (Velvet Tone, Washington, D.C.). With her Sizzlin' Syncopators; written by Cliff Friend, composed by Irving Caesar.
Heath, D., Sudderth, W. (1976): De Finetti's theorem on exchangeable variables. American Statistician, 188–189.
Hipp, C. (1974): Sufficient statistics and exponential families. Ann. Stat., 1283–1292.
Hobson, A., Cheng, B.-K. (1973): A comparison of the Shannon and Kullback information measures. J. Stat. Phys., 301–310.
Hume, D. (1896): A Treatise of Human Nature: Being an Attempt to Introduce the Experimental Method of Reasoning into Moral Subjects, reprint. (Oxford University Press, London). Ed., with an analytical index, by L. A. Selby-Bigge. https://archive.org/details/treatiseofhumann00hume_0 . First publ. 1739–1740.
ieee (1993): ANSI/IEEE Std 260.3-1993: American National Standard: Mathematical signs and symbols for use in physical sciences and technology. Institute of Electrical and Electronics Engineers.
iso (2009): ISO 80000:2009: Quantities and units. International Organization for Standardization. First publ. 1993.
iupac (2007): Quantities, Units and Symbols in Physical Chemistry, 3rd ed. (RSC, Cambridge). Prepared for publication by E. Richard Cohen, Tomislav Cvitaš, Jeremy G. Frey, Bertil Holmström, Kozo Kuchitsu, Roberto Marquardt, Ian Mills, Franco Pavese, Martin Quack, Jürgen Stohner, Herbert L. Strauss, Michio Takami, Anders J Thor. First publ. 1988.
Jaynes, E. T. (1957a): Information theory and statistical mechanics. Phys. Rev., 620–630. http://bayes.wustl.edu/etj/node1.html , see also (Jaynes 1957b).
— (1957b): Information theory and statistical mechanics. II. Phys. Rev., 171–190. http://bayes.wustl.edu/etj/node1.html , see also (Jaynes 1957a).
— (1963): Information theory and statistical mechanics. In: (Ford 1963), 181–218. Repr. in (Jaynes 1989), ch. 4, 39–76. http://bayes.wustl.edu/etj/node1.html .
— (1989): E. T. Jaynes: Papers on Probability, Statistics and Statistical Physics, reprint. (Kluwer, Dordrecht). Ed. by R. D. Rosenkrantz. First publ. 1983.
Jaynes, E. T. (1996): Monkeys, kangaroos, and N. http://bayes.wustl.edu/etj/node1.html . First publ. 1986. (Errata: in equations (29)–(31), (33), (40), (44), (49) the commas should be replaced by gamma functions, and the value 0.915 on p. 19 should be replaced.)
— (2003): Probability Theory: The Logic of Science. (Cambridge University Press, Cambridge). Ed. by G. Larry Bretthorst; http://omega.albany.edu:8008/JaynesBook.html , http://omega.albany.edu:8008/JaynesBookPdf.html . First publ. 1994.
Jeffreys, H. (1955): The present position in probability theory. Brit. J. Phil. Sci., 275–289.
— (1973): Scientific Inference, 3rd ed. (Cambridge University Press, Cambridge). First publ. 1931.
— (2003): Theory of Probability, 3rd ed. (Oxford University Press, London). First publ. 1939.
Johnson, N. L., Kotz, S., Balakrishnan, N. (1996): Discrete Multivariate Distributions. (Wiley, New York). First publ. 1969 in chapter form.
Johnson, W. E. (1922): Logic. Part II: Demonstrative Inference: Deductive and Inductive. (Cambridge University Press, Cambridge).
— (1924): Logic. Part III: The Logical Foundations of Science. (Cambridge University Press, Cambridge). https://archive.org/details/logic03john , https://archive.org/details/johnsonslogic03johnuoft .
— (1932): Probability: the deductive and inductive problems. Mind, 409–423. With some notes and an appendix by R. B. Braithwaite.
Keynes, J. M. (2013): A Tract on Monetary Reform, repr. of second ed. (Cambridge University Press, Cambridge). First publ. 1923.
Knuth, D. E. (1992): Two notes on notation. Am. Math. Monthly, 403–422. arXiv:math/9205211 .
Koopman, B. O. (1936): On distributions admitting a sufficient statistic. Trans. Am. Math. Soc., 399–409.
Kyburg Jr., H. E., Smokler, H. E., eds. (1980): Studies in Subjective Probability, 2nd ed. (Robert E. Krieger, Huntington, USA). First publ. 1964.
Lighthill, M. J. (1964): Introduction to Fourier Analysis and Generalised Functions. (Cambridge University Press, London). First publ. 1958.
Lindley, D. V. (2008): Introduction to Probability and Statistics from a Bayesian Viewpoint. Part 2: Inference, reprint. (Cambridge University Press, Cambridge). First publ. 1965.
— (2014): Understanding Uncertainty, rev. ed. (Wiley, Hoboken, USA). First publ. 2006.
MacKay, D. J. C. (2003): Information Theory, Inference, and Learning Algorithms. (Cambridge University Press, Cambridge). First publ. 1995.
Mead, L. R., Papanicolaou, N. (1984): Maximum entropy in the problem of moments. J. Math. Phys., 2404–2417. http://bayes.wustl.edu/Manual/MeadPapanicolaou.pdf .
Milani, L. (1965): L'obbedienza non è piú una virtú. (Libreria Editrice Fiorentina, Florence). https://cleliabartoli.files.wordpress.com/2015/09/lobbedienza-non-c3a8-pic3b9-una-virtc3b9.pdf .
Musil, R. (1979): The Man Without Qualities. (Picador, London). Transl. by E. Wilkins and E. Kaiser. First publ. in German 1930 as (Musil 2000).
— (2000): Der Mann ohne Eigenschaften. (Rowohlt, Reinbek bei Hamburg). Herausgegeben von Adolf Frisé. First publ. 1930. Transl. as (Musil 1979).
Pitman, E. J. G. (1936): Sufficient statistics and intrinsic accuracy. Math. Proc. Camb. Phil. Soc., 567–579.
Porta Mana, P. G. L. (2009): On the relation between plausibility logic and the maximum-entropy principle: a numerical study. arXiv:0911.2197 . Presented as invited talk at the 31st International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering "MaxEnt 2011", Waterloo, Canada.
Rodríguez, C. C. (1989): The metrics induced by the Kullback number. In: (Skilling 1989b), 415–422.
— (2002): Entropic priors for discrete probabilistic networks and for mixtures of Gaussians models. Am. Inst. Phys. Conf. Proc., 410–432. arXiv:physics/0201016 .
Shore, J. E., Johnson, R. W. (1980): Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans. Inform. Theor. IT-26, 26–37. See also comments and correction (Shore, Johnson 1983).
— (1983): Comments on and correction to "Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy". IEEE Trans. Inform. Theor. IT-29, 942–943.
Sivia, D. S. (2006): Data Analysis: A Bayesian Tutorial, 2nd ed. (Oxford University Press, Oxford). Written with J. Skilling. First publ. 1996.
Skilling, J. (1989a): Classic maximum entropy. In: (Skilling 1989b), 45–52.
— ed. (1989b): Maximum Entropy and Bayesian Methods: Cambridge, England, 1988. (Kluwer, Dordrecht).
— (1990): Quantified maximum entropy. In: (Fougère 1990), 341–350.
— (1998): Massive inference and maximum entropy. In: (Erickson, Rychert, Smith 1998), 1–14.
Strawson, P. F. (1964): Introduction to Logical Theory. (Methuen, London). First publ. 1952.
van Campenhout, J. M., Cover, T. M. (1981): Maximum entropy and conditional probability. IEEE Trans. Inform. Theor. IT-27, 483–489.
Zabell, S. L. (1982): W. E. Johnson's "sufficientness" postulate. Ann. Stat., 1090–1099. Repr. in (Zabell 2005 pp. 84–95).
— (2005): Symmetry and Its Discontents: Essays on the History of Inductive Probability. (Cambridge University Press, Cambridge).