Learning about a Categorical Latent Variable under Prior Near-Ignorance

Technical Report IDSIA-05-07

Alberto Piatti, IDSIA, Switzerland, [email protected]
Marco Zaffalon, IDSIA, Switzerland, [email protected]
Fabio Trojani, University of St. Gallen, Switzerland, [email protected]
Marcus Hutter, ANU & NICTA, Australia, [email protected]

May 2007
Abstract
It is well known that complete prior ignorance is not compatible with learning, at least in a coherent theory of (epistemic) uncertainty. What is less widely known is that there is a state similar to full ignorance, that Walley calls near-ignorance, that permits learning to take place. In this paper we provide new and substantial evidence that near-ignorance, too, cannot really be regarded as a way out of the problem of starting statistical inference in conditions of very weak beliefs. The key to this result is focusing on a setting characterized by a variable of interest that is latent. We argue that such a setting is by far the most common case in practice, and we show, for the case of categorical latent variables (and general manifest variables), that there is a sufficient condition that, if satisfied, prevents learning from taking place under prior near-ignorance. This condition is shown to be easily satisfied in the most common statistical problems.
Keywords: prior near-ignorance, latent and manifest variables, observational processes, vacuous beliefs, imprecise probabilities.

1 Introduction
Epistemic theories of statistics are often concerned with the question of prior ignorance. Prior ignorance means that a subject, who is about to perform a statistical analysis, does not have any substantial belief about the underlying data-generating process. Yet, the subject would like to exploit the available sample to draw some statistical inference, i.e., the subject would like to use the data to learn, moving away from the initial condition of ignorance. This situation is very important, as it is often desirable to start a statistical analysis with weak assumptions about the problem of interest, thus trying to implement an objective-minded approach to statistics.

A fundamental question is whether prior ignorance is compatible with learning. Walley gives a negative answer for the case of his self-consistent (or coherent) theory of statistics: he shows, in a very general sense, that vacuous prior beliefs lead to vacuous posterior beliefs, irrespective of the type and amount of observed data [Walley (1991), Section 7.3.7]. But, at the same time, he proposes focusing on a slightly different state of beliefs, called near-ignorance, that does enable learning to take place [Walley (1991), Section 4.6.9]. Loosely speaking, near-ignorant beliefs are beliefs close but not equal to vacuous (see Section 3). The possibility of learning under prior near-ignorance is shown, for instance, in the special case of the near-ignorance prior defining the imprecise Dirichlet model (IDM), a popular model used for inference from categorical data generated by a discrete process ([Walley (1996), Bernard (2005)]).

In this paper, we also focus on a categorical random variable X, expressing the outcomes of a multinomial process, but we assume that such a variable is latent. This means that we cannot observe the realizations of X, so we can learn about it only by means of another (not necessarily categorical) variable S, related to X in some known way. Variable S is assumed to be manifest, in the sense that its realizations can be observed (see Section 2).

In such a setting, we introduce a condition in Section 4, related to the likelihood of the observed data, that is shown to be sufficient to prevent learning about X under prior near-ignorance. The condition is very general, as it is developed for any prior that models near-ignorance (not only the one used in the IDM), and for very general kinds of relation between X and S. We then show, by simple examples, that such a condition is easily satisfied, even in the most elementary and common statistical problems.

In order to appreciate this result, it is important to realize that latent variables are ubiquitous in problems of uncertainty. It can be argued, indeed, that there is a persistent distinction between (latent) facts (e.g., health, state of the economy, color of a ball) and (manifest) observations of facts: one can regard them as being related by a so-called observational process; and the point is that these kinds of processes are imperfect in practice. Observational processes are often neglected in statistics, when their imperfection is deemed to be tiny. But a striking outcome of the present research is that, no matter how tiny the imperfection, provided it exists, learning is not possible under prior near-ignorance.

In our view, the present results raise serious doubts about the possibility of adopting a condition of prior near-ignorance in real, as opposed to idealized, applications of statistics.
As a consequence, it may make sense to consider re-focusing the research about this subject on developing models of very weak states of belief that are, however, stronger than near-ignorance.
2 Latent and manifest variables

In this paper, we follow the general definition of latent and manifest variables given by [Skrondal and Rabe-Hesketh (2004)]: a latent variable is a random variable whose realizations are unobservable (hidden), while a manifest variable is a random variable whose realizations can be directly observed. The concept of latent variable is central in many sciences, like for example psychology and medicine. [Skrondal and Rabe-Hesketh (2004)] list several fields of application and several phenomena that can be modeled using latent variables, and conclude that latent variable modeling "pervades modern mainstream statistics," although "this omni-presence of latent variables is commonly not recognized, perhaps because latent variables are given different names in different literatures, such as random effects, common factors and latent classes," or hidden variables.

But what are latent variables in practice? According to [Borsboom et al. (2002)], there may be different interpretations of latent variables. A latent variable can be regarded, for example, as an unobservable random variable that exists independently of the observation; an example is the unobservable health status of a patient that is subject to a medical test. Another possibility is to regard a latent variable as a product of the human mind, a construct that does not exist independently of the observation; an example is the unobservable state of the economy, often used in economic models. In this paper, we assume the existence of a latent categorical random variable X, with outcomes in X = {x_1, . . . , x_k} and unknown chances θ ∈ Θ := {θ = (θ_1, . . . , θ_k) | Σ_{i=1}^k θ_i = 1, 0 ≤ θ_i ≤ 1}, without stressing any particular interpretation.

Suppose now that our aim is to predict, after N realizations of the variable X, the next outcome (or the next N′ outcomes). Because the variable X is latent and therefore unobservable by definition, the only possible way to learn something about the probabilities of the next outcome is to observe the realizations of some manifest variable S related, in a known way, to the (unobservable) realizations of X. An example of a known relationship between latent and manifest variables is the following.
Example 1 We consider a binary medical diagnostic test used to assess the health status of a patient with respect to a given disease. The accuracy of a diagnostic test is determined by two probabilities: the sensitivity of a test is the probability of obtaining a positive result if the patient is diseased; the specificity is the probability of obtaining a negative result if the patient is healthy. Medical tests are assumed to be imperfect indicators of the unobservable true disease status of the patient. Therefore, we assume that the probability of obtaining a positive result when the patient is healthy, respectively of obtaining a negative result when the patient is diseased, is non-zero. Suppose, to make things simpler, that the sensitivity and the specificity of the test are known. In this example, the unobservable health status of the patient can be considered as a binary latent variable X with values in the set {Healthy, Ill}, while the result of the test can be considered as a binary manifest variable S with values in the set {Negative result, Positive result}. Because the sensitivity and the specificity of the test are known, we know how X and S are related. (For further details about the modeling of diagnostic accuracy with latent variables, see [Yang and Becker (1997)].) ♦

We continue the discussion about this example later on, in the light of our results, in Example 2 of Section 4.
3 Prior near-ignorance

Consider a categorical random variable X with outcomes in X = {x_1, . . . , x_k} and unknown chances θ ∈ Θ. Suppose that we have no relevant prior information about θ and we are therefore in a situation of prior ignorance. How should we model our prior beliefs in order to reflect the initial lack of knowledge?

Let us give a brief overview of this topic in the case of coherent models of uncertainty, such as Bayesian probability and Walley's theory of coherent lower previsions. In the traditional Bayesian setting, prior beliefs are modeled using a single prior probability distribution. The problem of defining a standard prior probability distribution modeling a situation of prior ignorance, a so-called noninformative prior, has been an important research topic in the last two centuries, starting from the work of Laplace at the beginning of the 19th century ([Laplace (1820)]), and, despite the numerous contributions, it remains an open research issue, as illustrated by [Kass and Wasserman (1996)]. See also [Hutter (2006)] for recent developments and complementary considerations. There are many principles and properties that are desirable for modeling a situation of prior ignorance and that have been used in past research to define noninformative priors. For example, Laplace's symmetry or indifference principle has suggested, in the case of finite possibility spaces, the use of the uniform distribution. Other principles, like the principle of invariance under group transformations, the maximum entropy principle, the conjugate priors principle, etc., have suggested the use of other noninformative priors, in particular for continuous possibility spaces, satisfying one or more of these principles. But, in general, it has proven difficult to define a standard noninformative prior satisfying, at the same time, all the desirable principles.
In the case of finite possibility spaces, we agree with [De Cooman and Miranda (2006)] when they say that there are at least two principles that should be satisfied to model a situation of prior ignorance: the symmetry principle and the embedding principle. The symmetry principle states that, if we are completely ignorant a priori about θ, then we have no reason to favour one possible outcome of X over another, and therefore our probability model on θ should be symmetric. This principle recalls Laplace's symmetry or indifference principle that, in the past decades, has suggested the use of the uniform prior as standard noninformative prior. The embedding principle states that, for each possible event A, the probability assigned to A should not depend on the possibility space X in which A is embedded. In particular, the probability assigned a priori to the event A should be invariant with respect to refinements and coarsenings of X. It is easy to show that the embedding principle is not satisfied by the uniform distribution. How should we model our prior ignorance in order to satisfy these two principles? [Walley (1991)] gives a compelling answer to this question (in Note 7, p. 526; see also Section 5.5 of that book and, for a complementary point of view, [Hutter (2006)]): he proves that the only probability model consistent with coherence and with the two principles is the vacuous probability model, i.e., the model that assigns, for each non-trivial event A, lower probability P̲(A) = 0 and upper probability P̄(A) = 1. It is evident that this model cannot be expressed using a single probability distribution. It follows that, to model properly and in a coherent way a situation of prior ignorance, we need imprecise probabilities.

Unfortunately, adopting the vacuous probability model for X is not a practical solution to our initial problem, because it produces only vacuous posterior probabilities. [Walley (1991)] suggests, as a practical solution, the use of near-ignorance priors. A near-ignorance prior is a large closed convex set M of probability distributions for θ, very close to the vacuous probability model, which produces a priori vacuous expectations for various functions f on Θ, i.e., such that E̲(f) = inf_{θ∈Θ} f(θ) and Ē(f) = sup_{θ∈Θ} f(θ).

An example of a near-ignorance prior that is particularly instructive is the set of priors M used in the imprecise Dirichlet model (IDM). The IDM models a situation of prior ignorance about the chances θ of a categorical random variable X. The near-ignorance prior M used in the IDM consists of the set of all Dirichlet densities p(θ) = dir_{s,t}(θ) for a fixed s > 0 and all t ∈ T, where

    dir_{s,t}(θ) := (Γ(s) / ∏_{i=1}^k Γ(s t_i)) · ∏_{i=1}^k θ_i^{s t_i − 1},    (1)

and

    T := {t = (t_1, . . . , t_k) | Σ_{j=1}^k t_j = 1, 0 < t_j < 1}.    (2)

The particular choice of M in the IDM implies vacuous prior expectations for all functions f(θ) = θ_i^{N′}, for all N′ ≥ 1 and i ∈ {1, . . . , k}, i.e., E̲(θ_i^{N′}) = 0 and Ē(θ_i^{N′}) = 1. Choosing N′ = 1, we have, a priori,

    P̲(X = x_i) = E̲(θ_i) = 0,  P̄(X = x_i) = Ē(θ_i) = 1.

It follows that the particular near-ignorance prior M used in the IDM implies vacuous prior probabilities for each possible outcome of the variable X. It can be shown that this particular set of priors satisfies both the symmetry and embedding principles.
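To see where these vacuous prior expectations come from, recall the standard moment formula for a Dirichlet density: under dir_{s,t},

    E_{s,t}(θ_i^{N′}) = ∏_{j=0}^{N′−1} (s t_i + j) / (s + j).

The first factor equals t_i, so the expectation tends to 0 as t_i → 0, while every factor tends to (s + j)/(s + j) = 1 as t_i → 1; taking the infimum and supremum over t ∈ T therefore yields E̲(θ_i^{N′}) = 0 and Ē(θ_i^{N′}) = 1.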
But what is the difference between the vacuous probability model and the near-ignorance prior used in the IDM? Although both models produce vacuous prior probabilities, and both satisfy the symmetry and embedding principles, the IDM yields posterior probabilities that are not vacuous, while the vacuous probability model produces only vacuous posterior probabilities. The answer to this question is the reason why we use the term near-ignorance: in the IDM, although we are completely ignorant about the possible outcomes of the variable X, we are not completely ignorant about the chances θ, because we assume a particular class of prior distributions, i.e., the Dirichlet distributions for a fixed value of s.

4 Limits of learning under prior near-ignorance

Consider a sequence of independent and identically distributed (IID) categorical latent variables (X_i)_{i∈N} with outcomes in X and unknown chances θ ∈ Θ, and a sequence of independent manifest variables (S_i)_{i∈N}. We assume that a realization of the manifest variable S_i can be observed only after an (unobservable) realization of the latent variable X_i, and that the probability distribution of S_i given X_i is known for each i ∈ N. Furthermore, we assume S_i to be independent of the chances θ of X_i given X_i. Define the random variables X := (X_1, . . . , X_N), S := (S_1, . . . , S_N) and X′ := (X_{N+1}, . . . , X_{N+N′}). The (in)dependence structure can be depicted graphically as the chain θ → X_i → S_i, for i = 1, . . . , N + N′.

We focus on the problem of predictive inference. (For a general presentation of predictive inference see [Geisser (1993)]; for a discussion of the imprecise probability approach to predictive inference see [Walley et al. (1999)].) Suppose that we observe a dataset s of realizations of the manifest variables S_1, . . . , S_N, related to the (unobservable) dataset x ∈ X^N of realizations of the variables X_1, . . . , X_N. Using the notation defined above we have S = s and X = x. Our aim is to predict the outcomes of the next N′ variables X_{N+1}, . . . , X_{N+N′}. In particular, given x′ ∈ X^{N′}, our aim is to calculate P̲(X′ = x′ | S = s) and P̄(X′ = x′ | S = s). To simplify notation, when no confusion is possible, we denote in the rest of the paper S = s with s and X′ = x′ with x′.

Modeling our prior beliefs about θ with a near-ignorance prior M, and denoting by n′ := (n′_1, . . . , n′_k) the frequencies of the dataset x′, we have

    P̲(x′ | s) = inf_{p∈M} P_p(x′ | s) = inf_{p∈M} ∫_Θ ∏_{i=1}^k θ_i^{n′_i} p(θ | s) dθ = inf_{p∈M} E_p(∏_{i=1}^k θ_i^{n′_i} | s) = E̲(∏_{i=1}^k θ_i^{n′_i} | s),    (3)

where, according to Bayes' theorem,

    p(θ | s) = P(s | θ) p(θ) / ∫_Θ P(s | θ) p(θ) dθ,

provided that ∫_Θ P(s | θ) p(θ) dθ > 0. Analogously, substituting sup for inf in (3), we obtain

    P̄(x′ | s) = Ē(∏_{i=1}^k θ_i^{n′_i} | s).

The central problem now is to choose M so as to be as ignorant as possible a priori and, at the same time, to be able to learn something from the observed dataset of manifest variables s. Theorem 1 and the following corollaries yield a first partial solution to the above problem, stating several conditions for learning under prior near-ignorance.
Theorem 1 Let s be given. Consider a bounded continuous function f defined on Θ and denote with f_max the supremum of f on Θ. If the likelihood function P(s | θ) is strictly positive at each point where f reaches its maximum value f_max, and is continuous in an arbitrarily small neighborhood of those points, and M is such that, a priori, Ē(f) = f_max, then

    Ē(f | s) = Ē(f) = f_max.

The assumption about P(s | θ) in Theorem 1 can be substituted by the following weaker one: for an arbitrarily small δ > 0, denote with Θ_δ the measurable set Θ_δ := {θ ∈ Θ | f(θ) ≥ f_max − δ}; if P(s | θ) is such that lim_{δ→0} inf_{θ∈Θ_δ} P(s | θ) = c > 0, then Theorem 1 still holds.

Many corollaries to Theorem 1 are listed in Section B of the Appendix. Here we discuss only the most important one. Consider, given a dataset x′, the particular function f(θ) = ∏_{i=1}^k θ_i^{n′_i}. This function is particularly important for the prediction of the dataset x′: by (3), its lower and upper posterior expectations are exactly the lower and upper predictive probabilities of x′. It is easy to show that, in this case, the minimum of f is 0 and is reached at all the points θ ∈ Θ with θ_i = 0 for some i such that n′_i > 0, while the maximum of f is reached at a single point of Θ corresponding to the relative frequencies of the sample x′, i.e., at f′ = (n′_1/N′, . . . , n′_k/N′) ∈ Θ, and the maximum of f is given by ∏_{i=1}^k (n′_i/N′)^{n′_i}. It follows that vacuous probabilities regarding the dataset x′ are given by

    P̲(x′) = E̲(∏_{i=1}^k θ_i^{n′_i}) = 0,  P̄(x′) = Ē(∏_{i=1}^k θ_i^{n′_i}) = ∏_{i=1}^k (n′_i/N′)^{n′_i}.
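The maximizer can be verified with a short Lagrange-multiplier argument: maximizing log f(θ) = Σ_i n′_i log θ_i subject to Σ_i θ_i = 1 gives the stationarity conditions n′_i/θ_i = λ for all i with n′_i > 0, hence θ_i = n′_i/λ; the constraint then forces λ = N′, i.e., θ_i = n′_i/N′ (and θ_i = 0 wherever n′_i = 0, since those coordinates do not increase f).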
Corollary 1 Let s be given and let P(s | θ) be a continuous strictly positive function on Θ. Then, if M implies vacuous prior probabilities for a dataset x′ ∈ X^{N′}, the predictive probabilities of x′ are vacuous also a posteriori, after having observed s, i.e.,

    P̲(x′ | s) = P̲(x′) = 0,  P̄(x′ | s) = P̄(x′) = ∏_{i=1}^k (n′_i/N′)^{n′_i}.

In other words, Corollary 1 states a sufficient condition that prevents learning from taking place under prior near-ignorance: if the likelihood function P(s | θ) is continuous and strictly positive on Θ, then all the datasets x′ ∈ X^{N′} for which M implies vacuous prior probabilities have vacuous probabilities also a posteriori, after having observed s. It follows that, if this sufficient condition is satisfied, we cannot use near-ignorance priors to model a state of prior ignorance, for the same reason for which, in Section 3, we have excluded the vacuous probability model: because only vacuous posterior probabilities are produced.

The sufficient condition described above is satisfied very often in practice, as illustrated by the following striking examples.
Example 2 Consider the medical test introduced in Example 1 and an (ideally) infinite population of individuals. Denote with the binary variable X_i ∈ {H, I} the health status of the i-th individual of the population, and with S_i ∈ {+, −} the result of the diagnostic test applied to the same individual. We assume that the variables in the sequence (X_i)_{i∈N} are IID with unknown chances (θ, 1 − θ), where θ corresponds to the (unknown) proportion of diseased individuals in the population. Denote with 1 − ε_1 the sensitivity and with 1 − ε_2 the specificity of the test. Then it holds that

    P(S_i = + | X_i = H) = ε_2 > 0,  P(S_i = − | X_i = I) = ε_1 > 0,

where (I, H, +, −) denote (patient ill, patient healthy, test positive, test negative). Suppose that we observe the results of the test applied to N different individuals of the population; using our previous notation we have S = s. For each individual we have

    P(S_i = + | θ) = P(S_i = + | X_i = I) P(X_i = I | θ) + P(S_i = + | X_i = H) P(X_i = H | θ) = (1 − ε_1) θ + ε_2 (1 − θ) > 0.

Analogously,

    P(S_i = − | θ) = P(S_i = − | X_i = I) P(X_i = I | θ) + P(S_i = − | X_i = H) P(X_i = H | θ) = ε_1 θ + (1 − ε_2)(1 − θ) > 0.

Denote with n_s the number of positive tests in the observed sample s. Then, because the variables S_i are independent, we have

    P(S = s | θ) = ((1 − ε_1) θ + ε_2 (1 − θ))^{n_s} · (ε_1 θ + (1 − ε_2)(1 − θ))^{N − n_s} > 0

for each θ ∈ [0, 1] and each possible s.
Therefore, according to Corollary 1, all the predictive probabilities that, according to M, are vacuous a priori remain vacuous a posteriori. It follows that, if we want to avoid vacuous posterior predictive probabilities, then we cannot model our prior knowledge (ignorance) using a near-ignorance prior implying some vacuous prior predictive probabilities. This simple example shows that our theoretical results raise serious questions about the use of near-ignorance priors even in very simple, common, and important situations. The situation presented in this example can be extended, in a straightforward way, to the general categorical case; it has been studied, in the special case of the near-ignorance prior used in the imprecise Dirichlet model, in [Piatti et al. (2005)]. ♦
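To see the mechanism numerically, here is a small Monte Carlo sketch (our illustration: the error rates ε_1 = ε_2 = 0.05, the counts N = 100 and n_s = 30, and the choice s = 2 are hypothetical; the Beta priors mimic the binary version of the set M used in the IDM). It approximates the posterior expectation of θ for priors near the two extremes of T and shows that the posterior mean stays stuck near 0 and near 1 respectively, so the prior vacuity survives the data:

```python
import numpy as np

# Monte Carlo sketch of Example 2 (illustrative, hypothetical numbers).
rng = np.random.default_rng(0)
eps1, eps2 = 0.05, 0.05        # 1 - sensitivity, 1 - specificity
N, n_pos = 100, 30             # number of tests and of positive results

def likelihood(theta):
    """P(s | theta): strictly positive for every theta in [0, 1]."""
    p_pos = (1 - eps1) * theta + eps2 * (1 - theta)
    p_neg = eps1 * theta + (1 - eps2) * (1 - theta)
    return p_pos ** n_pos * p_neg ** (N - n_pos)

def posterior_mean(t, s=2.0, samples=200_000):
    """E_p(theta | s) under the Beta(s*t, s*(1-t)) prior, estimated by
    importance sampling with the prior itself as proposal."""
    theta = rng.beta(s * t, s * (1 - t), size=samples)
    w = likelihood(theta)
    return np.sum(w * theta) / np.sum(w)

# As t ranges over (0, 1) the prior mean of theta is vacuous, and the
# posterior mean stays essentially vacuous too: the strictly positive
# likelihood cannot pull the extreme priors away from 0 and 1.
for t in (1e-4, 0.5, 1 - 1e-4):
    print(f"t = {t:8.4f}  posterior mean of theta ~ {posterior_mean(t):.4f}")
```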
Example 2 focuses on discrete latent and manifest variables. In the next example, we show that our theoretical results have important implications also in models with discrete latent variables and continuous manifest variables.

Example 3 Consider the sequence of IID categorical latent variables (X_i)_{i∈N} with outcomes in X and unknown chances θ ∈ Θ. Suppose that, for each i ≥ 1, after a realization of the latent variable X_i, we can observe a realization of a continuous manifest variable S_i. Assume that p(S_i | X_i = x_j) is a continuous positive probability density, e.g., a normal N(µ_j, σ_j) density, for each x_j ∈ X. We have

    p(S_i | θ) = Σ_{x_j∈X} p(S_i | X_i = x_j) · P(X_i = x_j | θ) = Σ_{x_j∈X} p(S_i | X_i = x_j) · θ_j > 0,

because θ_j is positive for at least one j ∈ {1, . . . , k} and we have assumed S_i to be independent of θ given X_i. Because we have assumed (S_i)_{i∈N} to be a sequence of independent variables, we have

    p(S = s | θ) = ∏_{i=1}^N p(S_i = s_i | θ) > 0.

Therefore, according to Corollary 1, if we model our prior knowledge using a near-ignorance prior M, the vacuous prior predictive probabilities implied by M remain vacuous a posteriori. It follows that, if we want to avoid vacuous posterior predictive probabilities, we cannot model our prior knowledge using a near-ignorance prior implying some vacuous prior predictive probabilities. ♦
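As a quick numerical check of the condition of Corollary 1 in this continuous setting, the following sketch (our illustration, with hypothetical parameters µ_0 = 0, µ_1 = 1, σ = 1, simulated data, and k = 2) verifies that the mixture likelihood p(s | θ) is bounded away from zero on the whole of [0, 1]:

```python
import numpy as np
from scipy.stats import norm

# Numerical check for Example 3 with k = 2 (illustrative parameters).
rng = np.random.default_rng(1)
mu0, mu1, sig = 0.0, 1.0, 1.0
s = rng.normal(mu1, sig, size=50)   # observed realizations of S_1..S_N

def likelihood(theta):
    """p(s | theta) = prod_i [theta*N(mu1,sig) + (1-theta)*N(mu0,sig)](s_i)."""
    dens = theta * norm.pdf(s, mu1, sig) + (1 - theta) * norm.pdf(s, mu0, sig)
    return np.prod(dens)

# The likelihood is strictly positive on all of [0, 1], which is exactly
# the sufficient condition of Corollary 1: near-ignorance is not overcome.
print(min(likelihood(t) for t in np.linspace(0.0, 1.0, 101)) > 0.0)  # True
```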
Examples 2 and 3 raise, in general, serious criticisms about the use of near-ignorance priors in practical applications. The only predictive model in the literature, of which we are aware, where a near-ignorance prior is used successfully to obtain non-vacuous posterior predictive probabilities is the IDM. In the next example, we explain how the IDM avoids our theoretical limitations.
Example 4 In the IDM, we assume that the IID categorical variables (X_i)_{i∈N} are observable. In other words, we have S_i = X_i for each i ≥ 1. Given an observed dataset S = X = x, we have

    P(S = x | θ) = P(X = x | θ) = ∏_{i=1}^k θ_i^{n_i},

where n_i denotes the number of times that x_i ∈ X has been observed in x. We have P(X = x | θ) = 0 for all θ such that θ_j = 0 for at least one j with n_j > 0, and P(X = x | θ) > 0 for all other θ ∈ Θ, in particular for all θ in the interior of Θ. The near-ignorance prior M used in the IDM consists of the set of all the Dirichlet densities dir_{s,t}(θ) for a fixed s > 0 and all t ∈ T, where dir_{s,t}(θ) and T have been defined in (1) and (2). The particular choice of M in the IDM implies, for each N′ ≥ 1 and i ∈ {1, . . . , k}, that

    E̲(θ_i^{N′}) = 0,  Ē(θ_i^{N′}) = 1.

Consequently, denoting with d_i ∈ X^{N′} the dataset with n′_i = N′ and n′_j = 0 for each j ≠ i, a priori we have

    P̲(X′ = d_i) = 0,  P̄(X′ = d_i) = 1,

and in particular

    P̲(X = x_i) = 0,  P̄(X = x_i) = 1.

It can be shown that other prior predictive probabilities are not vacuous. For example, for i ≠ j, we have

    Ē(θ_i θ_j) = s / (4(s + 1)) < 1/4 = sup_{θ∈Θ} θ_i θ_j.

The IDM produces, for each possible observed dataset x, non-vacuous posterior predictive probabilities for each possible future dataset (see [Walley (1996)]). This means that our previous theoretical limitations are avoided in some way.
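For comparison, in this observable case the IDM's posterior predictive probabilities have the well-known closed form P̲(x_i | x) = n_i/(N + s) and P̄(x_i | x) = (n_i + s)/(N + s) (see [Walley (1996)]). A minimal sketch with illustrative counts and s = 2:

```python
def idm_predictive_bounds(counts, i, s=2.0):
    """Lower and upper posterior predictive probability of outcome x_i
    under the IDM: [n_i / (N + s), (n_i + s) / (N + s)] (Walley, 1996)."""
    N = sum(counts)
    return counts[i] / (N + s), (counts[i] + s) / (N + s)

# After observing counts (4, 0, 6) over k = 3 categories:
print(idm_predictive_bounds([4, 0, 6], i=0))  # (0.333..., 0.5)
print(idm_predictive_bounds([4, 0, 6], i=1))  # (0.0, 0.1666...): upper < 1
```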
To explain this result we consider two cases: first, an observed dataset x in which at least two different outcomes occur; second, a dataset x formed exclusively by outcomes of the same type, in other words, a dataset of the type d_i.

In the first case, P(x | θ) = ∏_{j=1}^k θ_j^{n_j} is equal to zero at θ = e_i for each i ∈ {1, . . . , k}: in fact, θ_i = 1 implies θ_j = 0 for each j ≠ i, and there is at least one j ≠ i with n_j > 0. Therefore, the assumptions of Corollaries 4 and 5 are not satisfied, and in fact the IDM produces non-vacuous posterior predictive probabilities for each dataset that, a priori, has vacuous predictive probabilities. On the other hand, all the datasets whose prior predictive probability reaches its maximum at a relative frequency f ∈ Θ such that P(x | f) > 0 are characterized by non-vacuous prior predictive probabilities.

The second case yields similar results. The only difference is that now P(x | θ) = θ_i^N for a given i ∈ {1, . . . , k}, so that P(x | e_i) = 1 > 0 and, according to Corollaries 4 and 5,

    P̄(x_i | x) = P̄(x_i) = 1,  P̄(X′ = d_i | x) = P̄(d_i) = 1,

and consequently, for each j ≠ i and each y ≠ d_i,

    P̲(x_j | x) = P̲(x_j) = 0,  P̲(X′ = y | x) = P̲(y) = 0.

The remaining bounds, however, are not vacuous: in the IDM,

    P̲(x_i | x) > 0,  P̲(X′ = d_i | x) > 0,  P̄(x_j | x) < 1,  P̄(X′ = y | x) < 1,

and therefore the posterior predictive probabilities are not vacuous for each possible future dataset. ♦

Yet, since the variables (X_i)_{i∈N} are assumed to be observable, the successful application of a near-ignorance prior in the IDM is not helpful in addressing the doubts raised by our theoretical results about the applicability of near-ignorance priors in situations where the variables (X_i)_{i∈N} are latent.

5 Conclusions

In this paper we have proved a sufficient condition that prevents learning about a latent categorical variable from taking place under prior near-ignorance about the data-generating process. The condition holds as soon as the likelihood is strictly positive (and continuous), and so is satisfied frequently, even in the simplest settings. Taking into account that the considered framework is very general and pervasive of statistical practice, we regard this result as a form of substantial evidence against the possibility of using prior near-ignorance in real statistical problems. Given that complete prior ignorance is not compatible with learning, as is well known, we deduce that there is little hope of using any form of prior ignorance to do objective-minded statistical inference in practice. As a consequence, we suggest that future research efforts be directed to study and develop new forms of knowledge that are close to near-ignorance but that do not coincide with it.
Acknowledgements
This work was partially supported by Swiss NSF grants 200021-113820/1 (Alberto Piatti), 200020-109295/1 (Marco Zaffalon) and 100012-105745/1 (Fabio Trojani).
A Technical preliminaries
In this appendix we provide some technical results that are used to prove the theorems in the paper. First of all, we introduce some notation used in this appendix. Consider a sequence of probability densities (p_n)_{n∈N} and a function f defined on a set Θ. Then we use the notation

    E_n(f) := ∫_Θ f(θ) p_n(θ) dθ,  P_n(Θ̃) := ∫_{Θ̃} p_n(θ) dθ,  Θ̃ ⊆ Θ.

In addition, for a given probability density p on Θ,

    E_p(f) := ∫_Θ f(θ) p(θ) dθ,  P_p(Θ̃) := ∫_{Θ̃} p(θ) dθ,  Θ̃ ⊆ Θ.

Finally, with → we denote lim_{n→∞}.
Theorem 2 Let Θ ⊂ R^k be the closed k-dimensional simplex and let (p_n)_{n∈N} be a sequence of probability densities defined on Θ w.r.t. the Lebesgue measure. Let f ≥ 0 be a bounded continuous function on Θ and denote with f_max the supremum of f on Θ. For this function define the measurable sets

    Θ_δ = {θ ∈ Θ | f(θ) ≥ f_max − δ}.    (4)

Assume that (p_n)_{n∈N} concentrates on a maximum of f for n → ∞, in the sense that

    E_n(f) → f_max;    (5)

then, for all δ > 0, it holds that P_n(Θ_δ) → 1.
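The argument behind Theorem 2 is short: outside Θ_δ we have f < f_max − δ by (4), and f ≤ f_max everywhere, so

    E_n(f) ≤ f_max P_n(Θ_δ) + (f_max − δ)(1 − P_n(Θ_δ)) = f_max − δ (1 − P_n(Θ_δ));

since E_n(f) → f_max by (5), it must be that 1 − P_n(Θ_δ) → 0.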
Theorem 3 Let L(θ) ≥ 0 be a bounded measurable function with

    lim_{δ→0} inf_{θ∈Θ_δ} L(θ) =: c > 0,    (6)

under the same assumptions of Theorem 2. Then

    E_n(Lf) / E_n(L) = ∫_Θ f(θ) L(θ) p_n(θ) dθ / ∫_Θ L(θ) p_n(θ) dθ → f_max.    (7)

Remark 1 If f has a unique maximum at θ = θ_0, and L is a function, continuous in an arbitrarily small neighborhood of θ_0, such that L(θ_0) > 0, then (6) is satisfied.

B Corollaries to Theorem 1
The following corollaries to Theorem 1 are necessary to prove Corollary 1, and are useful to understand more deeply the limiting results implied by the use of near-ignorance priors with latent variables.
Corollary 2 Let x′ and s be given. Denote with f′ := (n′_1/N′, . . . , n′_k/N′) ∈ Θ the vector of relative frequencies of the dataset x′. If P(s | θ) is continuous in an arbitrarily small neighborhood of θ = f′, P(s | f′) > 0, and M is such that

    P̄(x′) = sup_{θ∈Θ} (∏_{i=1}^k θ_i^{n′_i}) = ∏_{i=1}^k (n′_i/N′)^{n′_i},

then P̄(x′ | s) = P̄(x′).
Corollary 3 Let x′ and s be given. If P(s | θ) > 0 for each θ ∈ Θ with θ_i = 0 for at least one i with n′_i > 0, and M is such that P̲(x′) = 0, it follows that P̲(x′ | s) = P̲(x′) = 0.
Corollary 4 Let s be given. Consider an arbitrary x_i ∈ X and denote with e_i the particular vector of chances with θ_i = 1 and θ_j = 0 for each j ≠ i. Suppose that M is such that, a priori, P̄(X = x_i) := Ē(θ_i) = 1. Then, if P(s | e_i) > 0 and P(s | θ) is continuous in a neighborhood of θ = e_i, we have

    P̄(X_{N+1} = x_i | s) = P̄(X = x_i) = 1,    (8)

and consequently,

    P̲(X_{N+1} = x_j | s) = P̲(X = x_j) = 0,    (9)

for each j ≠ i.
Corollary 5 Let s and N′ be given and consider an arbitrary x_i ∈ X. Suppose that M is such that, a priori, P̄(X = x_i) := Ē(θ_i) = 1. Denote with d_i ∈ X^{N′} the dataset with n′_i = N′ and n′_j = 0 for each j ≠ i. Then, if P(s | e_i) > 0 and P(s | θ) is continuous in a neighborhood of θ = e_i, we have

    P̄(X′ = d_i | s) = 1,

and consequently,

    P̲(X′ = y | s) = 0,

for each y ≠ d_i.

References
[Bernard (2005)] Bernard J. M. (2005) An introduction to the imprecise Dirichlet model for multinomial data. International Journal of Approximate Reasoning, 39(2–3), 123–150.

[Borsboom et al. (2002)] Borsboom D., Mellenbergh G. J., van Heerden J. (2002) The theoretical status of latent variables. Psychological Review, 110(2), 203–219.

[De Cooman and Miranda (2006)] De Cooman G., Miranda E. (2006) Symmetry of models versus models of symmetry. In Probability and Inference: Essays in Honor of Henry E. Kyburg, Jr., eds. Harper W. and Wheeler G., 82 pages. King's College Publications, London.

[Geisser (1993)] Geisser S. (1993) Predictive Inference: An Introduction. Monographs on Statistics and Applied Probability 55. Chapman and Hall, New York.

[Hutter (2006)] Hutter M. (2006) On the foundations of universal sequence prediction. In Proc. 3rd Annual Conference on Theory and Applications of Models of Computation (TAMC'06), 408–420, Beijing.

[Kass and Wasserman (1996)] Kass R., Wasserman L. (1996) The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91, 1343–1370.

[Laplace (1820)] Laplace P. S. (1820) Essai Philosophique sur les Probabilités. English translation: Philosophical Essays on Probabilities (1951), Dover, New York.

[Piatti et al. (2005)] Piatti A., Zaffalon M., Trojani F. (2005) Limits of learning from imperfect observations under prior ignorance: the case of the imprecise Dirichlet model. In Cozman F. G., Nau B., Seidenfeld T. (eds), ISIPTA '05: Proceedings of the Fourth International Symposium on Imprecise Probabilities and Their Applications, Manno (Switzerland), 276–286.

[Skrondal and Rabe-Hesketh (2004)] Skrondal A., Rabe-Hesketh S. (2004) Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. Chapman and Hall/CRC, Boca Raton.

[Yang and Becker (1997)] Yang I., Becker M. P. (1997) Latent variable modeling of diagnostic accuracy. Biometrics, 53, 948–958.

[Walley (1991)] Walley P. (1991) Statistical Reasoning with Imprecise Probabilities. Chapman and Hall, New York.

[Walley (1996)] Walley P. (1996) Inferences from multinomial data: learning about a bag of marbles. Journal of the Royal Statistical Society, Series B, 58(1), 3–57.