Asymptotic optimality of myopic information-based strategies for Bayesian adaptive estimation
aa r X i v : . [ m a t h . S T ] J a n Bernoulli (1), 2016, 615–651DOI: 10.3150/14-BEJ670 Asymptotic optimality of myopicinformation-based strategies for Bayesianadaptive estimation
JANNE V. KUJALA
Department of Mathematical Information Technology, University of Jyv¨askyl¨a, P.O. Box 35,FI-40014 Jyv¨askyl¨a, Finland. E-mail: jvk@iki.fi
This paper presents a general asymptotic theory of sequential Bayesian estimation giving resultsfor the strongest, almost sure convergence. We show that under certain smoothness conditionson the probability model, the greedy information gain maximization algorithm for adaptiveBayesian estimation is asymptotically optimal in the sense that the determinant of the posteriorcovariance in a certain neighborhood of the true parameter value is asymptotically minimal.Using this result, we also obtain an asymptotic expression for the posterior entropy based on anovel definition of almost sure convergence on “most trials” (meaning that the convergence holdson a fraction of trials that converges to one). Then, we extend the results to a recently publishedframework, which generalizes the usual adaptive estimation setting by allowing different trialplacements to be associated with different, random costs of observation. For this setting, theauthor has proposed the heuristic of maximizing the expected information gain divided by theexpected cost of that placement. In this paper, we show that this myopic strategy satisfies ananalogous asymptotic optimality result when the convergence of the posterior distribution isconsidered as a function of the total cost (as opposed to the number of observations).
Keywords: active data selection; active learning; asymptotic optimality; Bayesian adaptiveestimation; cost of observation; D-optimality; decision theory; differential entropy; sequentialestimation
1. Introduction
The theoretical framework of this paper is that of Bayesian adaptive estimation withan information based objective function (see, e.g., MacKay [9], Kujala and Lukka [7],Kujala [6]). Following the notation of Kujala [5, 6], the basic problem we consider is theestimation of an unobservable random variable Θ : Ω O - based on a sequence y x , . . . , y x t of independent (given θ ) realizations from some conditional densities p ( y x t | θ ) indexed bytrial placements x t , each of which can be adaptively chosen from some set X based on theoutcomes ( y x , . . . , y x t − ) of the earlier observations. A commonly used greedy strategy This is an electronic reprint of the original article published by the ISI/BS in
Bernoulli ,2016, Vol. 22, No. 1, 615–651. This reprint differs from the original in pagination andtypographic detail. (cid:13)
J.V. Kujala is to choose the next placement so as to maximize the expected immediate informationgain, that is, the decrease of the (differential) entropy of the posterior distribution giventhe next observation.Previous work on the asymptotics of Bayesian estimation (see, e.g., Schervish [11],van der Vaart [13]) has mostly concentrated on the i.i.d. case, and in the few cases wherethe independent (given θ ) but not identical case is considered, it is customarily assumedthat a certain fixed sequence of variables is given. Hence, these results do not apply tothe present situation where the sequence X t of placements is also random.Paninski [10] has developed an asymptotic theory for this adaptive setting. He statesconsistency and asymptotic normality results for the greedy information maximizationplacement strategy and quantifies the asymptotic efficiency of the method. However, theproofs therein are not complete and hence do not provide a sufficient foundation forsome generalizations and theorems we are interested in. In this paper, we develop a moregeneral theory which allows us to generalize the main results of Paninski [10] to almostsure convergence (with novel proofs) and to show that the greedy method is in a certainsense asymptotically optimal among all placement methods. Furthermore, we provide arigorous and general framework that lends itself to further extensions of the theory.One particular extension we are interested in is analyzing the asymptotic properties ofthe novel framework proposed in Kujala [5]. In this framework, the observation of Y x isassociated with some random cost C x (see Section 4.4 for details). To make measurement“cost-effective”, a myopic placement rule is considered that on each trial t maximizes theexpected value of the information gain (decrease of entropy) G t = H(Θ | Y X , . . . , Y X t − ) − H(Θ | Y X , . . . , Y X t )divided by the expected value of the cost C t = C X t . This is called a myopic strategy asit looks only one step ahead. However, it is not a greedy strategy as it does not optimizethe immediate gain.In Kujala [5], the following fairly simple asymptotic optimality result is given for thismyopic strategy. Theorem 1.1.
Suppose that there exists a constant α > such that max x ∈ X E( G t | y , X t = x )E( C t | y , X t = x ) = α (1.1) for all possible sets y of past observations. If the next placement X t is defined as themaximizer of (1.1) and if for some σ < ∞ and ε > , Var( G t | Y X , . . . , Y X t − ) ≤ σ , Var( C t | Y X , . . . , Y X t − ) ≤ σ , E( C t | Y X , . . . , Y X t − ) ≥ ε (1.2) for all t , then the gain-to-cost ratio satisfies lim t →∞ G + · · · + G t C + · · · + C t a . s . = α. symptotic optimality of myopic strategies This is asymptotically optimal in the sense that for any other strategy that satisfies (1.2),we have lim sup t →∞ G + · · · + G t C + · · · + C t a . s . ≤ α. However, this result requires the obtainable information gains to not decrease overtime for the optimality condition to make sense and hence does not in general applyto smooth models. In this paper, we provide a counterpart of the above result using anoptimality criterion (D-optimality) relevant to smooth models.Our results are structured as follows. In Section 2, we derive strong consistency of theposterior distributions under extremely mild, purely topological conditions on the familyof likelihood functions. In Section 3, we consider the local smoothness assumptions (to beassumed in a certain neighborhood of the true parameter value) required for asymptoticnormality. In Section 4.1, we develop a theory of asymptotic proportions and use it for anovel type of convergence of random variables that is required in our analysis. Then, inSections 4.2 and 4.3, we are able to quantify the asymptotic covariance and asymptoticentropy of the posterior distribution and to show a form of asymptotic optimality forthe standard greedy information maximization strategy. In Section 4.4, these resultsare generalized to the situation with random costs of observation associated with eachplacement as discussed above. The heuristically justified, myopic placement strategyproposed in Kujala [5] turns out to be asymptotically optimal also in the sense of thepresent paper, supporting the view that this strategy is the most natural generalizationof the greedy information maximization strategy to the situation where the costs ofobservation can vary. We give concrete examples of the optimality results in Section 5and then end with general discussion in Section 6.
We shall denote random variables by upper case letters and their specific values by lowercase letters. The information theoretic definitions that we will use are the (differential)entropy H( A ) = − R p ( a ) log p ( a ) d a , which does depend on the parameterization of a , theKullback–Leibler divergence D KL ( p ( a ) k p ( b )) = Z p ( a ) log p ( a ) p ( b ) d a, which is independent of the parameterization, and the mutual informationI( A ; B ) = Z p ( a, b ) log p ( a, b ) p ( a ) p ( b ) d( a, b )= Z p ( a ) D KL ( p ( b | a ) k p ( b )) d a = Z p ( b ) D KL ( p ( a | b ) k p ( a )) d b, J.V. Kujala which is also independent of the parameterization as well as symmetric. Also, the iden-tities I( A ; B ) = H( A ) − E(H( A | B )) = H( B ) − E(H( B | A )) hold whenever the differencesare well defined. This is all standard notation (see, e.g., Cover and Thomas [3]) exceptthat in our notation, there is no implicit expectation over the values of A in H( B | A ), andso it is a random variable depending on the value of A . Similarly, a conditional density p ( b | a ) as an argument to D KL ( · · · ) is treated the same way as any other density of b ,with no implicit expectation over a .The densities p ( a ) and p ( b ) above are assumed to be taken w.r.t. arbitrary dominatingmeasures “d a ” and “d b ”. Thus, following Lindley [8], we are in fact working in full mea-sure theoretic generality even though we use the more familiar notation. The underlyingprobability space is (Ω , F , P ) and so, for example, P { Θ ∈ U } means the probability thatthe value of Θ : Ω → O - is within the measurable set U ⊂ O - . In some places we may abbre-viate this by p ( U ), but it will be clear from the context what random variable is referredto. When we say “for a.e. θ ”, it is w.r.t. the prior distribution of Θ. The σ -algebra of O - is assumed to contain at least the Borel sets of the topology which O - is assumed to beendowed with.For any fixed x ∈ X , we assume that the conditional densities p ( y x | θ ) are given w.r.t.the same dominating σ -finite measure “d y x ” for all θ ∈ O - and when we say “for a.e. y x ”,it is w.r.t. this measure. For brevity, we shall indicate conditioning on the data Y t :=( Y X , . . . , Y X t ) by the subscript t on any quantities that depend on them. For example, p t ( θ ) = p ( θ | Y t ) is the posterior density of Θ given Y t and E t ( f (Θ)) = E t ( f (Θ) | Y t ) isthe posterior expectation of f (Θ) given Y t .It is often assumed that one can observe multiple independent (given θ ) copies ofthe same random variable Y x . However, instead of complicating the general notationwith something like Y ( t ) x t , we rely on the fact that the set X can explicitly includeseparate indices for any identically distributed copies, for example, one might have[ Y ( x,t ) | θ ] i . i . d . ∼ [ Y ( x,t ′ ) | θ ] for all t, t ′ ∈ N , t = t ′ . Hence, we can use the simple notationwith no loss of generality.The greedy information gain maximization strategy can be formally defined as choosingthe placement X t to be the value x that maximizes the mutual information I t − (Θ; Y x ) =H t − (Θ) − E t − (H t − (Θ | Y x )), the expected decrease in the entropy of Θ after the nextobservation. In some models, there may be no maximum of the mutual information inwhich case the placement should be chosen sufficiently close to the supremum, which weformally define as the ratio of the mutual information and its supremum converging toone (condition O4 in Section 4).
2. Consistency
The general assumptions for consistency are:C1. The parameter space O - is a compact topological space.C2. The family of log-likelihoods is (essentially) equicontinuous, that is, for all θ ∈ O - and ε >
0, there exists a neighborhood U of θ such that whenever θ ′ ∈ U , | log p ( y x | θ ) − log p ( y x | θ ′ ) | < ε symptotic optimality of myopic strategies y x for all x ∈ X .C3. All points in O - are statistically distinguishable from each other. That is, for alldistinct θ, θ ′ ∈ O - , d x ( θ, θ ′ ) := Z | p ( y x | θ ) − p t ( y x | θ ′ ) | d y x > x ∈ X .C4. For some γ >
0, the placements X t satisfyI t − (Θ; Y X t ) ≥ γ sup x ∈ X I t − (Θ; Y x )for all sufficiently large t . Remark 2.1.
These assumptions for consistency are considerably weaker than thoseformulated in Paninski [10]. In particular, the assumptions C1–C3 only pertain to thelikelihood function p ( y x | θ ), absolutely nothing is assumed about the prior distributionof Θ. Furthermore, these assumptions are purely topological in the sense that they arepreserved by all homeomorphic transformations of O - . Also, in C4, we do not requireperfect maximization of information gain; this is useful as it allows us to apply the sameresult to the non-greedy strategy discussed in Section 4.4 as well. Remark 2.2.
Non-compact spaces can be handled if the log-likelihood has an (essen-tially) equicontinuous extension to a compactification of O - . This happens precisely whenthe following conditions hold:C1 ′ . The parameter space O - is a topological space.C2 ′ . The function f ( θ ) = (( x, y x ) log p ( y x | θ )), with the topology of the target spaceinduced by the ([0 , ∞ ]-valued) norm k v k = sup x ∈ X ess sup y x | v ( x, y x ) | , is continuous (this is just restating C2) and the closure of the range f ( O - ) iscompact (this is the extra condition needed for non-compact spaces).C3 ′ . For all distinct θ, θ ′ ∈ O - , the inequality f ( θ ) = f ( θ ′ ) holds true, where equality isinterpreted w.r.t. a.e. y x . (This is equivalent to C3.)In that case, f lifts continuously to the Stone– ˇCech compactification β O - of O - (Theo-rem A.1). Condition C3 may not hold for the points added by the compactification, butthis can be fixed by moving to the compact quotient space β O - / ker( f ). Thus, C1–C3 canalways be replaced by the strictly weaker conditions C1 ′ –C3 ′ . Lemma 2.1.
Suppose that
C1–C3 hold. Then, there exists a metric d : O - × O - → R thatis consistent with the topology of O - , and an estimator ˆΘ t such that for each t there exists x ∈ X such that I t ( Y x ; Θ) ≥ E t ( d (Θ , ˆΘ t ) ) . J.V. Kujala
Proof.
First, we show that the pseudometric d x defined in C3 is continuous in O - × O - for all x ∈ X . It can be shown using C2 that for any θ ∈ O - and ε >
0, there existsa neighborhood U θ,ε such that d x ( θ, θ ′ ) ≤ ε for all θ ′ ∈ U θ,ε . Thus, for any ε > θ , θ ∈ O - , the triangle inequality implies | d x ( θ ′ , θ ′ ) − d x ( θ , θ ) | ≤ d x ( θ , θ ′ ) + d x ( θ , θ ′ ) ≤ ε whenever ( θ ′ , θ ′ ) ∈ U θ ,ε × U θ ,ε , and so d x is continuous.As d x is continuous, the set S x = { ( θ, θ ′ ) ∈ O - × O - : d x ( θ, θ ′ ) > } is open for every x ∈ X . Now C3 implies that S x ∈ X S x covers O - × O - , and as O - × O - iscompact, there exists a finite subcover S x ∈ X ′ S x . It follows that d ( θ, θ ′ ) = (cid:20) | X ′ | X x ∈ X ′ (cid:18)Z | p ( y x | θ ) − p t ( y x | θ ′ ) | d y x (cid:19) (cid:21) / is positive definite and hence a metric. Since X ′ is finite, this metric inherits the continuityof d x .To show that the topology induced by d coincides with that of O - , let U be an ar-bitrary open neighborhood of θ . Then U c is compact and so its continuous image S := { d ( θ , θ ): θ ∈ U c } is compact, too. It follows that S c is open and as 0 ∈ S c , weobtain [0 , δ U ) ⊂ S c for some δ U >
0. Thus, we obtain { θ ∈ O - : d ( θ , θ ) < δ U } ⊂ U , and sothe topology induced by d is finer than the default topology of O - . As d is continuous, weobtain the converse, and so the topologies coincide.Let then t be arbitrary. We extend d ( θ, θ ′ ) with a special point ¯Θ t / ∈ O - for which wedefine the distances d ( θ, ¯Θ t ) = (cid:20) | X ′ | X x ∈ X ′ (cid:18)Z | p ( y x | θ ) − p t ( y x ) | d y x (cid:19) (cid:21) / . The extended distance function may not be strictly positive definite, but it is still apseudometric and satisfies the triangle inequality. DenotingˆΘ t = arg min θ ∈ O - d ( θ, ¯Θ t ) , we have d ( θ, ¯Θ t ) ≥ d ( ˆΘ t , ¯Θ t ) for all θ ∈ O - , and the triangle inequality yields d ( θ, ¯Θ t ) ≥ d ( θ, ˆΘ t ) − d ( ˆΘ t , ¯Θ t ). Adding both inequalities, we obtain 2 d ( θ, ¯Θ t ) ≥ d ( θ, ˆΘ t ) for all θ ∈ O - .Now, the L -bound of Kullback–Leibler divergence [3], Lemma 11.6.1, yieldsmax x ∈ X ′ I t ( Y x ; Θ) ≥ | X ′ | X x ′ ∈ X ′ I t ( Y x ; Θ) symptotic optimality of myopic strategies Z | X ′ | X x ′ ∈ X ′ D KL ( p ( y x | θ ) k p t ( y x )) p t ( θ ) d θ ( L bound) ≥ Z | X ′ | X x ′ ∈ X ′ (cid:20)Z | p ( y x | θ ) − p t ( y x ) | d y x (cid:21) p t ( θ ) d θ = 4 Z d ( θ, ¯Θ t ) p t ( θ ) d θ ≥ Z d ( θ, ˆΘ t ) p t ( θ ) d θ. (cid:3) Lemma 2.2.
Suppose that K is a function of Θ and has a finite range K . Then, forarbitrarily chosen placements X t , the inequality P ∞ t =1 I t − ( K ; Y X t ) < ∞ holds almostsurely (which implies I t − ( K ; Y X t ) a . s . −→ ). Proof.
As I t − ( K ; Y X t ) = H t − ( K ) − E t − (H t ( K )), where 0 ≤ H t ( K ) ≤ log | K | for all t ,we obtain E t X k =1 I k − ( K ; Y X k ) ! = E(H ( K ) − E t − (H t ( K ))) ≤ log | K | for all t . As I t − ( K ; Y X t ) is nonnegative, the sequence of partial sums is non-decreasing,and Lebesgue’s monotone convergence theorem yieldsE ∞ X k =1 I k − ( K ; Y X k ) ! = lim t →∞ E t X k =1 I k − ( K ; Y X k ) ! ≤ log | K | < ∞ , which implies the statement. (cid:3) Lemma 2.3.
Suppose that C1 and C2 hold. Then I t − (Θ; Y X t ) a . s . −→ for arbitrarily cho-sen placements X t . Proof.
Let ε > O - is compact, a finite number of the sets U θ,ε given byC2 cover it. Thus, we can partition the parameter space into a finite number of subsets O - k each one contained in some U θ,ε . Letting the random variable K denote the index ofthe subset that Θ falls into, the chain rule of mutual information yieldsI t − (Θ; Y t ) = I t − (Θ , K ; Y t ) = I t − ( K ; Y t ) + X k p t − ( k )I t − (Θ; Y t | k ) , (2.1)where Y t := Y X t and Lemma 2.2 implies that I t − ( K ; Y t ) a . s . −→
0. Let us then look at thelatter term. Convexity of the Kullback–Leibler divergence yieldsI t − (Θ; Y t | k ) = Z p t − ( θ | k ) D KL ( p ( y t | θ ) k p t − ( y t | k )) d θ ≤ Z p t − ( θ | k ) (cid:20)Z p t − ( θ ′ | k ) D KL ( p ( y t | θ ) k p ( y t | θ ′ )) d θ ′ (cid:21) d θ J.V. Kujala = Z Z p t − ( θ | k ) p t − ( θ ′ | k ) (cid:20)Z p ( y t | θ ) log p ( y t | θ ) p ( y t | θ ′ ) | {z } ≤ ε for a . e . y t d y t (cid:21) d θ d θ ′ ≤ ε for all t . Thus, lim sup t →∞ I t − (Θ; Y t ) ≤ ε almost surely. As ε > t − (Θ; Y t ) a . s . −→ (cid:3) Lemma 2.4.
For any measurable function f : O - → R , if the prior expectation E f (Θ) iswell-defined and finite, then lim t →∞ E t f (Θ) exists as a finite number almost surely. Proof.
The finiteness of E f (Θ) implies that E | f (Θ) | must also be finite and so Z t :=E t f (Θ) satisfies E | Z t | = E | E t f (Θ) | ≤ E | f (Θ) | < ∞ for all t . Furthermore, since Z t +1 depends linearly on the posterior p t +1 whose expectation E t ( p t +1 ) equals the prior p t ,we obtain E t ( Z t +1 ) = Z t for all t and so Z t is a martingale. As sup t E | Z t | ≤ E | f (Θ) | < ∞ ,Theorem A.2 implies that lim Z t exists as a finite number almost surely. (cid:3) Theorem 2.1 (Strong consistency).
Suppose that
C1–C4 hold. Then, conditioned onalmost any θ ∈ O - as the true parameter value, the posteriors are strongly consistent, thatis, P t { Θ ∈ U } a . s . −→ for any neighborhood U of θ . Proof.
As the metric d given by Lemma 2.1 is bounded, Lemma 2.4 implies thatlim t →∞ E t ( d (Θ , θ )) exists and is finite for all θ in a countable dense subset of O - almostsurely, in which case continuity of d implies the same for all θ ∈ O - .Lemmas 2.1 and 2.3 and C4 yield E t ( d (Θ , ˆΘ t )) a . s . −→
0. As d is bounded, Lebesgue’sdominated convergence theorem and Markov’s inequality imply P { d (Θ , ˆΘ t ) > ε } ≤ E( d (Θ , ˆΘ t )) ε = E(E t ( d (Θ , ˆΘ t ))) ε → ε > d (Θ , ˆΘ t ) P →
0. Convergence in probability implies that there exists asubsequence t k such that d (Θ , ˆΘ t k ) a . s . −→
0. Thus, conditioned on almost any θ as the truevalue, we obtain d ( θ , ˆΘ t k ) a . s . −→
0, and the triangle inequality yieldsE t k ( d (Θ , θ )) ≤ E t k ( d (Θ , ˆΘ t k )) + d ( θ , ˆΘ t k ) a . s . −→ . As we have already established that the full sequence E t ( d (Θ , θ )) almost surely con-verges, it now follows that the limit must almost surely be zero. Thus, given any neigh-borhood U ⊃ B d ( θ , ε ) of θ , Markov’s inequality yields P t { Θ ∈ U c } ≤ P t { Θ ∈ B d ( θ , ε ) c } ≤ E t ( d (Θ , θ )) ε a . s . −→ . (cid:3) symptotic optimality of myopic strategies Lemma 2.5.
Suppose that
C1–C3 hold and assume that conditioned on θ ∈ O - as thetrue parameter value, the posteriors are strongly consistent. Then: Given any metric d consistent with the topology of O - , Θ ∗ t := arg min θ ∈ O - E t ( d (Θ , θ ) ) a . s . −→ θ . For any neighborhood U of θ there exists a constant c > such that, almost surely, I t ( Y x ; Θ) ≥ c P t { Θ ∈ U c } for some x ∈ X for all sufficiently large t . Proof.
Let D be the diameter of Θ. The triangle inequality a ≤ b + c implies a ≤ ( b + c ) ≤ b + c ) and so consistency of the posteriors yields d ( θ , Θ ∗ t ) ≤ t ( d (Θ , θ ) + d (Θ , Θ ∗ t ) ) ≤ t ( d (Θ , θ ) ) ≤ r + D P t { Θ ∈ B d ( θ , r ) c } ) a . s . −→ r + D · r >
0, which implies Θ ∗ t a . s . −→ θ .Let us then assume that the metric d is the one given by Lemma 2.1 and choose ε > B d ( θ , ε ) ⊂ U . As Θ ∗ t a . s . −→ θ , we have B d (Θ ∗ t , ε ) ⊂ U for all sufficiently large t ,and so Lemma 2.1 and Markov’s inequality yieldI t ( Y x ; Θ) ≥ E t ( d (Θ , ˆΘ t ) ) ≥ E t ( d (Θ , Θ ∗ t ) ) ≥ ε P t { Θ ∈ B d (Θ ∗ t , ε ) c } ≥ ε P t { Θ ∈ U c } for some x ∈ X . (cid:3) The differential entropy is sensitive to the parameterization, but asymptotically, we canin most cases ignore this due to the following lemma.
Lemma 2.6.
Suppose that the prior entropy
H(Θ) is well-defined and finite. Then, lim t →∞ [H t (Θ) + D KL ( p t ( θ ) k p ( θ ))] exists as a finite number almost surely. Proof.
As H t (Θ) + D KL ( p t ( θ ) k p ( θ )) = E t log p (Θ) and E log p (Θ) = − H(Θ) is well-defined and finite, the statement follows from Lemma 2.4. (cid:3)
Lemma 2.7.
Suppose that C1 ′ holds and let f be defined as in C2 ′ . Then, for any subset S ⊂ O - , | log p t +1 ( θ | S ) − log p t ( θ | S ) | ≤ f ( S )0 J.V. Kujalafor all θ ∈ S . If C2 ′ holds, then this upper bound is finite. Proof.
Let θ ∈ S be fixed. If p t ( θ | S ) is multiplied by p ( y x | θ ) /p ( y x | θ ), it can changeby at most a factor of exp(diam f ( S )), and for the same reason, the normalization con-stant for this density is within a factor of exp(diam f ( S )) from 1. The statement follows.Suppose then that C2 ′ holds. As f ( O - ) is compact, it follows that f ( S ) ⊂ f ( O - ) mustbe bounded. (cid:3) Lemma 2.8.
Suppose that C1 and C2 hold. Then, for any ε > , the inequality D KL ( p t ( θ ) k p ( θ )) < εt holds true for all sufficiently large t . Proof.
Let ε > O - into a finitenumber of subsets O - k such that | log p ( y x | θ ) − log p ( y x | θ k ) | ≤ ε for all θ ∈ O - k , y x , and x ∈ X , where θ k is some fixed point of O - k . Let the random variable K denote the indexof the subset that Θ falls into. Lemma 2.7 implies that | log p t +1 ( θ | k ) − log p t ( θ | k ) | ≤ ε for all θ ∈ O - k , which yields D KL ( p t ( θ | k ) k p ( θ | k )) = E t (cid:18) log p t (Θ | k ) p (Θ | k ) (cid:12)(cid:12)(cid:12) k (cid:19) ≤ εt for all t and k . The chain rule of Kullback–Leibler divergence now yields D KL ( p t ( θ ) k p ( θ )) = D KL ( p t ( k ) k p ( k )) + X k p t ( k ) D KL ( p t ( θ | k ) k p ( θ | k )) ≤ log max k p ( k ) − + 2 εt, where we may assume that p ( k ) is positive since we can drop any set O - k with p ( k ) = 0from the partition. (cid:3) Lemma 2.9.
Suppose that O - ⊂ R n is bounded and the family of log-likelihoods is uni-formly Lipschitz, that is, | log p ( y x | θ ) − log p ( y x | θ ′ ) | ≤ M | θ − θ ′ | for all θ, θ ′ ∈ O - for all y x and x ∈ X . Then, for arbitrarily chosen placements X t , theexpected gain over t trials is bounded by I(Θ; Y t ) ≤ n log t + c for some constant c < ∞ . Proof.
For each t , we can subdivide the bounded parameter space O - into ≤ ct n subsets O - k , each having diameter ≤ t − . Letting the random variable K t denote the index of thesubset that Θ falls into, the chain rule of mutual information yieldsI(Θ; Y t ) = I( K t ; Y t ) | {z } ≤ log( ct n ) + X k t p ( k t ) I(Θ; Y t | k t ) | {z } ≤ M ≤ n log t + log c + M (2.2) symptotic optimality of myopic strategies (cid:3)
3. Asymptotic normality
In this section, we assume that:N1. The parameter space O - is a subset of R n .N2. The true parameter value θ is an interior point of O - .N3. The log-likelihood θ log p ( y x | θ ) is twice continuously differentiable with |∇ θ log p ( y x | θ ) | ≤ M and |∇ θ log p ( y x | θ ) | ≤ M for all x ∈ X and y x .N4. The family of Hessians θ
7→ ∇ θ log p ( y x | θ ) is equicontinuous at θ over all x ∈ X and y x .N5. The prior density is absolutely continuous w.r.t. the Lebesgue measure with pos-itive and continuous density at θ .For simplicity of notation, all statements are implicitly conditioned on θ being the trueparameter value. Throughout this section, we will denote the posterior mean and covari-ance by ˆΘ t := E t (Θ) and Σ t = Cov t (Θ). Note that the expected square error E t ( | Θ − θ | )is minimized by the mean θ = E t (Θ). Thus, if the posteriors are strongly consistent,then Lemma 2.5 implies that ˆΘ t a . s . −→ θ . Note also that the square error is related to thevariance through the identity E t ( | Θ − ˆΘ t | ) = tr(Σ t ). Lemma 3.1.
Suppose that N1 and N3 hold and O - is a bounded convex set withdiameter ≤ D < ∞ . Then, there exists a constant C M,D < ∞ such that for all t , and x , | I t ( Y x ; Θ) − ( Σ t ) ⊙ I x ( ˆΘ t ) | ≤ C M,D E t ( | Θ − ˆΘ t | ) , where ⊙ denotes the Frobenius product A ⊙ B = P i,j A ij B ij = tr( A T B ) , and I x ( θ ) is theFisher information matrix I x ( θ ) := Z (cid:20) ∇ θ p ( y x | θ ) p ( y x | θ ) (cid:21)(cid:20) ∇ θ p ( y x | θ ) p ( y x | θ ) (cid:21) T p ( y x | θ ) d y x . Proof.
We can formally expand the mutual information asI t ( Y x ; Θ) = H t ( Y x ) − E t (H( Y x | Θ))= Z g (cid:18)Z p ( y x | θ ) p t ( θ ) d θ (cid:19) d y x − Z (cid:18)Z g ( p ( y x | θ )) d y x (cid:19) p t ( θ ) d θ = Z (cid:20) g (cid:18)Z p ( y x | θ ) p t ( θ ) d θ (cid:19) − Z g ( p ( y x | θ )) p t ( θ ) d θ (cid:21) d y x , where g ( p ) = − p log p . (Although H t ( Y x ) − E t (H( Y x | Θ)) may not be well defined here,the last line is always well-defined and equal to the mutual information.) Denoting p y x :=2 J.V. Kujala p ( y x | ˆΘ t ), Taylor’s theorem yields g ( p ) = − p y x log p y x − (1 + log p y x )( p − p y x ) − ( p − p y x ) p y x + ( p − p y x ) q p,y x , where q p,y x is some number between p y x and p . The error term is bounded by | ε y x ( p ) | := (cid:12)(cid:12)(cid:12)(cid:12) ( p − p y x ) q p,y x (cid:12)(cid:12)(cid:12)(cid:12) ≤ | p − p y x | { p, p y x } p y x = 16 (exp( | log p − log p y x | ) − p y x , and as | log p ( y x | θ ) − log p ( y x | ˆΘ t ) | ≤ M | θ − ˆΘ t | ≤ M D , we further obtain | ε y x ( p ( y x | θ )) | ≤
16 (exp( | log p ( y x | θ ) − log p ( y x | ˆΘ t ) | ) − p ( y x | ˆΘ t ) ≤
16 (exp( M | θ − ˆΘ t | ) − p ( y x | ˆΘ t ) ≤ (cid:18) exp( M D ) − M D M | θ − ˆΘ t | (cid:19) p ( y x | ˆΘ t )= C | θ − ˆΘ t | p ( y x | ˆΘ t ) . Due to the linearity of the integral, the constant and first order terms of the expansioncancel out, leaving justI t ( Y x ; Θ) ≈ Z − [ R p ( y x | θ ) p t ( θ ) d θ − p y x ] + R [ p ( y x | θ ) − p y x ] p t ( θ ) d θ p y x d y x = Z
12 Var t (cid:18) p ( y x | Θ) p ( y x | ˆΘ t ) (cid:19) p ( y x | ˆΘ t ) d y x , where the error is bounded by (cid:12)(cid:12)(cid:12)(cid:12)Z ε y x (cid:18)Z p ( y x | θ ) p t ( θ ) d θ (cid:19) d y x − Z Z ε y x ( p ( y x | θ )) p t ( θ ) d θ d y x (cid:12)(cid:12)(cid:12)(cid:12) ≤ Z (cid:26)(cid:12)(cid:12)(cid:12)(cid:12) ε y x (cid:18)Z p ( y x | θ ) p t ( θ ) d θ (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) + Z | ε y x ( p ( y x | θ )) | p t ( θ ) d θ (cid:27) d y x Jensen ≤ Z (cid:26)Z | ε y x ( p ( y x | θ )) | p t ( θ ) d θ + Z | ε y x ( p ( y x | θ )) | p t ( θ ) d θ (cid:27) d y x ≤ Z Z C | θ − ˆΘ t | p ( y x | ˆΘ t ) p t ( θ ) d θ d y x ≤ C E t ( | Θ − ˆΘ t | )for all t , ˆΘ t , and x (Jensen’s inequality applies as | ε y x ( p ) | is convex). symptotic optimality of myopic strategies p ( y x | θ ) p ( y x | ˆΘ t ) = 1 + ∇ θ p ( y x | ˆΘ t ) T p ( y x | ˆΘ t ) ( θ − ˆΘ t ) + 12 ( θ − ˆΘ t ) T ∇ θ p ( y x | θ ′ ) p ( y x | ˆΘ t ) ( θ − ˆΘ t ) T , where θ ′ is a convex combination of ˆΘ t and θ . The coefficients are uniformly bounded by (cid:12)(cid:12)(cid:12)(cid:12) ∇ θ p ( y x | ˆΘ t ) p ( y x | ˆΘ t ) (cid:12)(cid:12)(cid:12)(cid:12) = |∇ θ log p ( y x | ˆΘ t ) | ≤ M and (cid:12)(cid:12)(cid:12)(cid:12) ∇ θ p ( y x | θ ′ ) p ( y x | ˆΘ t ) (cid:12)(cid:12)(cid:12)(cid:12) = p ( y x | θ ′ ) p ( y x | ˆΘ t ) | {z } ≤ exp( MD ) |∇ θ log p ( y x | θ ′ ) | {z } |·|≤ M ∇ θ log p ( y x | θ ′ ) T | {z } |·|≤ M + ∇ θ log p ( y x | θ ′ ) | {z } |·|≤ M |≤ exp( M D )( M + M ) =: C . Thus, denoting the linear term by A and the error term by B , we obtainVar t (cid:18) p ( y x | Θ) p ( y x | ˆΘ t ) (cid:19) = Var t ( A ) + Var t ( B ) + 2 Cov t ( A, B ) , where Var t ( A ) = Σ t ⊙ (cid:20) ∇ θ p ( y x | ˆΘ t ) p ( y x | ˆΘ t ) (cid:21)(cid:20) ∇ θ p ( y x | ˆΘ t ) p ( y x | ˆΘ t ) (cid:21) T , Var t ( B ) ≤ E t ( | B | ) ≤ ( C ) E t ( | Θ − ˆΘ t | ) ≤ ( C ) D E t ( | Θ − ˆΘ t | ) , | Cov t ( A, B ) | = | E t ( AB ) − E t ( A ) | {z } =0 E t ( B ) | ≤ E t ( | A || B | ) ≤ M C E t ( | Θ − ˆΘ t | ) . (cid:3) For the next theorems and lemmas, we define the following conditions that depend ona subset U ⊂ O - :L1. |∇ θ log | p ( y x | θ ) − ∇ θ log | p ( y x | θ ′ ) | < µ/ θ, θ ′ ∈ U , x ∈ X , and y x .L2. | log p ( θ ) − log p ( θ ′ ) | ≤ C for all θ, θ ′ ∈ U .L3. The maximum likelihood estimator Θ ∗ t := arg max θ ∈ U p ( Y t | θ ) is eventually well-defined and converges to θ as t increases within indices satisfying λ t ≥ tµ , where λ t is the smallest eigenvalue of −∇ θ log p ( Y t | θ ). Lemma 3.2.
Suppose that N4 and N5 hold. Then, for any µ, C > , there exists a con-stant δ µ,C < ∞ such that L1 and L2 hold for any neighborhood U of θ having diameterless than δ µ,C . J.V. Kujala
Lemma 3.3.
Suppose that N1 , N3 , and L1 hold. If p ( Y t | θ ) ≥ p ( Y t | θ ) for some θ ∈ U ,then | θ − θ | ≤ | A t | t / µ , where A t = t − / ∇ log p ( Y t | θ ) . Furthermore, conditioned on θ as the true parametervalue, P {| A t | ≥ a } ≤ n exp (cid:18) − a nM (cid:19) for all t satisfying λ t ≥ tµ , where λ t is the smallest eigenvalue of −∇ θ log p ( Y t | θ ) . Proof.
Taylor’s theorem yieldslog p ( Y t | θ ) = log p ( Y t | θ ) + =: Z t z }| { ∇ θ log p ( Y t | θ ) T ( θ − θ )+ ( θ − θ ) T ∇ θ log p ( Y t | θ ′ )( θ − θ ) | {z } ≤− (1 / λ t | θ − θ | ≤− (1 / tµ | θ − θ | , for some θ ′ between θ and θ . Thus, p ( Y t | θ ) ≥ p ( Y t | θ ) implies Z Tt ( θ − θ ) ≥ tµ | θ − θ | , which in turn implies | Z t | ≥ tµ | θ − θ | . This is equivalent to the first statement.Let us then prove the latter statement. Now | Z t | t − / = | A | t ≥ a implies that | Z ( k ) t | ≥ t / a/ √ n holds for at least one component k ∈ { , . . . , n } . But as each Z ( k ) t is a martingalesatisfying Z ( k )0 = 0 and | Z ( k ) k +1 − Z ( k ) k | ≤ M , Theorem A.4 yields P {| Z ( k ) t | ≥ t / a/ √ n } ≤ (cid:18) − ta ntM (cid:19) for all k ∈ { , . . . , n } . Summing these probabilities over k so as to give an upper boundon the probability that at least one component is over the limit gives the statement. (cid:3) Lemma 3.4.
Suppose that
N1–N3 and L1 hold. Then, L3 holds almost surely. Proof.
For any sufficiently small ε >
0, N2 implies that the set V = B ( θ , ε ) is a subsetof O - . Lemma 3.3 applied to this set implies that Θ ∗ t converges fast in probability to θ ,that is, the probability P { Θ ∗ t / ∈ B ( θ , ε ) } sums to a finite value over all t . This impliesthat Θ ∗ t a . s . −→ θ . (cid:3) Theorem 3.1 (Asymptotic normality).
Suppose that
N1–N5 hold and let
L1–L3 holdfor some µ > , C > , and U ⊂ O - . Then, the following conditions surely hold when t increases within indices satisfying λ t ≥ tµ :symptotic optimality of myopic strategies The posterior density of the scaled variable Φ t = t / (Θ − Θ ∗ t ) satisfies Z | p t ( φ t | Θ ∈ U ) − N ( φ t ; 0 , B − t ) | d φ t → , where N ( · · · ) denotes a normal density with given mean and covariance and B t = − t − ∇ θ log p ( Y t | θ ) . All moments as well as the entropy of p t ( φ t | Θ ∈ U ) are asymptotically equal tothose of N ( φ t ; 0 , B − t ) , that is, the difference converges to zero. Adjusting for the t / scaling factor, this implies in particular that t Cov t (Θ | U ) − B − t → and t / E t ( | Θ − E t (Θ | U ) | | U ) ≤ c n µ − / for sufficiently large t for someconstant c n , and so (assuming that U is bounded and convex), Lemma 3.1 yields sup x ∈ X (cid:12)(cid:12)(cid:12)(cid:12) t I t (Θ; Y x | U ) − B − t ⊙ I x ( θ ) (cid:12)(cid:12)(cid:12)(cid:12) → . Proof.
The scaled variable Φ t takes values in the set V t := { φ t ∈ R n : Θ ∗ t + t − / φ t ∈ U } .A Taylor expansion of log p ( Y t | φ t ) at φ t = 0 yields p t ( φ t ) p t ( φ t = 0) = exp( ± ε ( r )) exp (cid:18) − φ Tt B t φ t ± ε ( r ) | φ t | (cid:19) for all φ t satisfying Θ ∗ t + t − / φ t ∈ B ( θ , r ), where ε ( r ) = sup x,y x ,θ ∈ B ( θ ,r ) max (cid:26)(cid:12)(cid:12)(cid:12)(cid:12) log p ( θ ) p ( θ ′ ) (cid:12)(cid:12)(cid:12)(cid:12) , |∇ θ log p ( y x | θ ) − ∇ θ log p ( y x | θ ′ ) | (cid:27) . Denoting r t = t / , we have S t := B (0 , r t ) ⊂ V t for sufficiently large t and ε t = ε ( r t t − / + | Θ ∗ t − θ | ) →
0. It follows p t ( φ t ) ∝ f t ( φ t ) := exp( − φ Tt B t φ t ) | {z } =: N t ( φ t ) g t ( φ t )for all φ t ∈ V t , where g t ( φ ) = exp( ± ε t ± ε / t ) → φ ∈ S t . As N t ( φ ) is uniformlybounded and S t → R n , it follows [ φ ∈ V t ] f t ( φ ) − N t ( φ ) → φ ∈ R n . Furthermore,as N t ( φ t ) ≤ exp( − µ | φ | ) and g t ( φ ) = exp( ± C ± µ | φ | ) for all φ ∈ V t , it follows Z [ φ ∈ V t ] f t ( φ ) | φ | k ≤ Z exp (cid:18) C − µ | φ | (cid:19) | φ | k < ∞ , Z N t ( φ ) | φ | k < ∞ for all k ≥
0, and so Lebesgue’s dominated convergence theorem implies that Z | [ φ ∈ V t ] f t ( φ ) u ( φ ) − N t ( φ ) u ( φ ) | d φ → J.V. Kujala for any function | u ( φ ) | ≤ | φ | k . This implies that all moments of [ φ ∈ V t ] f t ( φ ) are asymp-totically equal to those of N t ( φ ). As the eigenvalues of B t are between µ and M , the nor-malization constant Z := R N t ( φ ) d φ is within the constant range [(2 π /M ) n/ , (2 π /µ ) n/ ],and it follows that the moments of the normalized densities p t ( φ t ) and N ( φ t ; 0 , B − t )are also asymptotically equal. Similarly, as f t ( φ ) log f t ( φ ) − N t ( φ ) log N t ( φ ) →
0, wherethe log-factors can be bounded by polynomials of | φ | , it follows that the entropies of p t ( φ t ) and N ( φ t ; 0 , B − t ) are asymptotically equal. (Note that the entropy of a density p ( x ) = f ( x ) /Z can be calculated as − ( R f log f ) /Z + log( Z ).) (cid:3) Lemma 3.5.
Suppose that N1 and N3 hold. Then, conditioned on θ as the true param-eter value, E( −∇ θ log p ( Y x | θ )) = I x ( θ ) for all x ∈ X , and B t − P tk =1 I X t ( θ ) t a . s . −→ , where B t = − t − ∇ θ log p ( Y t | θ ) . Proof. E( −∇ θ log p ( Y x | θ ) | Θ = θ )= Z p ( y x | θ ) (cid:26)(cid:20) ∇ θ p ( y x | θ ) p ( y x | θ ) (cid:21)(cid:20) ∇ θ p ( y x | θ ) p ( y x | θ ) (cid:21) T − ∇ θ p ( y x | θ ) p ( y x | θ ) (cid:27) d y x = I x ( θ ) − Z ∇ θ p ( y x | θ ) d y x = I x ( θ ) − ∇ θ Z ∇ θ p ( y x | θ ) d y x = I x ( θ ) − ∇ θ Z p ( y x | θ ) d y x = I x ( θ ) , where the interchange of the order of integration and differentiation is justified byLebesgue’s dominated convergence theorem for the d y x -integrable dominating functions f x ( y x ) and g x ( y x ) given by |∇ θ p ( y x | θ ) | = p ( y x | θ ) |∇ θ log p ( y x | θ ) ∇ θ log p ( y x | θ ) T + ∇ θ log p ( y x | θ ) |≤ p ( y x | θ ) exp( M | θ − θ | ) · ( M + M ) ≤ p ( y x | θ ) exp( M D ) · ( M + M ) =: f x ( y x )and |∇ θ p ( y x | θ ) | = p ( y x | θ ) |∇ θ log p ( y x | θ ) |≤ p ( y x | θ ) exp( M D ) · M =: g x ( y x ) . symptotic optimality of myopic strategies Z k = −∇ θ log p ( Y x k | θ ) − I X k ( θ ), given Θ = θ , the sequence Z + · · · + Z k of partial sums is a martingale and satisfies E( | Z k | ) ≤ ( M + M ) < ∞ for all k , andso Theorem A.3 implies that ( Z + · · · + Z t ) /t a . s . −→
0, which is the statement. (cid:3)
Corollary 3.1.
Suppose that
N1–N5 hold. Then, for all µ > , almost surely t Σ t > ( B t + µI ) − (meaning that the difference is positive definite) for all sufficiently large t , where B t := − t − ∇ θ log p ( Y t | θ ) . In particular, tr( t Σ t ) ≥ (2 µ ) − and det( t Σ t ) ≥ (2 µ ) − (2 M ) − ( n − for all sufficiently large t satisfying min λ B t ≤ µ ≤ M , where min λ B t denotes the smallest eigenvalue of B t . Proof.
Let µ > Y ′ x := ( Y x , Z ),where Z ∼ N (Θ , µ − I ) is independent (given θ ) from Y x . Let U be a neighborhood of θ satisfying L1 and L2 as well as L3 almost surely. If we choose the auxiliary component z t so as to obtain t − P tk =1 z k = E(Θ | y t ) for each t , then L3 remains satisfied given theaugmented data and we also obtain Σ t > Σ ′ t , because the augmented data will strictlydecrease the square error from the original mean, and moving to the new mean canonly further reduce this error. The normalized Hessian at θ for the augmented datais B ′ t = B t + µI , and so, due to Lemma 3.5, min λ B ′ t ≥ µ/ t (although we have fiddled with the z k values, Lemma 3.5 still applies as it does not dependon these values). Thus, Theorem 3.1(3) implies that t Cov(Θ | y ′ t , U ) − ( B ′ t ) − → z k values).Since P t { Θ ∈ U c } decays exponentially in the augmented model, it follows that also t Σ ′ t − ( B ′ t ) − →
0. As the eigenvalues of B ′ t are within the range [ µ/ , M + µ/ t Σ ′ t ) − − B ′ t →
0, which implies ( t Σ ′ t ) − − B ′ t < εI for all sufficiently large t for any ε >
0. It follows t Σ t > t Σ ′ t > ( B t + ( µ + ε ) I ) − for allsufficiently large t . (cid:3)
4. Asymptotic optimality
In this section, we assume that:O1. C1–C4 hold globally.O2. Some neighborhood U of θ ∈ O - is homeomorphic to a subset of R n that satisfiesN1–N5.O3. There exists placements x , . . . , x m ∈ X and nonnegative weights α + · · · + α m = 1such that P mj =1 α j I x j ( θ ) is positive definite.O4. The placements X t satisfy R t := I t (Θ; Y X t +1 )sup x ∈ X I t (Θ; Y x ) . (See Section 4.1 below for the definition of “ ”.)First, let us say a few words about the main difficulty related to the adaptivity of theplacements, namely the complications caused by any secondary modes in the posterior8 J.V. Kujala distribution. This issue is discussed by Paninski [10] in the context of consistency, but itseems that even after consistency has been established, the issue cannot be ignored.The information maximization strategy decreases the relative weights of any secondarymodes only at a rate approximately proportional to 1 /t [10]. Therefore, any secondarymode may have a contribution proportional to 1 /t to all moments of the posterior dis-tribution. This means that only the first order moments of the approximating normaldistribution remain asymptotically accurate, even though its total variation distancefrom the posterior does tend to zero. In particular, the inverse Hessian of the likelihoodgenerally does not give an asymptotically accurate approximation of the global posteriorcovariance. (In fact, the global posterior covariance may be undefined as O - need not havea global Euclidean structure.)For this reason, the asymptotic approximation to the expected information gainI t (Θ; Y x | U ) given by Theorem 3.1(3) only applies within a sufficiently small neighbor-hood U of the true parameter value, where the posterior can be shown to be asymptoti-cally unimodal. Nonetheless, even though the local and global moments are not in goodagreement asymptotically, it turns out that I t (Θ; Y X t +1 | U ) is in fact in good agreementwith I t (Θ; Y X t +1 ) on “most trials”. Indeed, as the relative weights of any secondary modestypically decay at an exponential rate with the number of trials whose placements candistinguish between them, it follows that the placements of only a decreasing fraction oftrials can be significantly affected by the secondary modes.To formalize this intuition, we will first develop a theory for measuring asymptoticproportions. Definition 4.1.
To measure subsets K ⊂ N , we use the proportion measures ρ ( K ) = lim n →∞ ρ ,n ( K ) , ρ a,b ( K ) = | K ∩ [ a, b [ | b − a , where | · | indicates the cardinality of a set. (Note that although ρ a,b is a measure in themeasure-theoretic sense for any a, b ∈ N , the limit ρ is only a finitely additive measure.)When we say “for almost every n ∈ N ”, we mean that the set where the statement doesnot hold is a null set w.r.t. ρ . We use the notation x k x to mean that there exists asubset K ⊂ N with ρ ( K ) = 1 such that [ k ∈ K ]( x k − x ) → . We also define lim sup k ∞ x k := inf { x ∈ R : x k ≤ x for a.e. k ∈ N } , lim inf k ∞ x k := sup { x ∈ R : x k ≥ x for a.e. k ∈ N } , and when both equal x , we write lim k ∞ x k = x . Lemma 4.1.
Suppose that for all j ∈ N , the proposition P jk holds for a.e. k ∈ N . Thenthere exists an increasing sequence j ( k ) → ∞ such that P k ∧ · · · ∧ P j ( k ) k holds for a.e. k ∈ N .symptotic optimality of myopic strategies Proof.
For all j ∈ N , Q jk := P k ∧ · · · ∧ P jk holds for a.e. k ∈ N . Thus, for all j ∈ N , f j ( k ) := inf k ′ ≥ k P k ′ i =1 Q ji k ′ is increasing in k and tends to one as k → ∞ . Choosing j ( k ) = max { j ′ ∈ N : f j ′ ( k ) ≥ − /j ′ } yields the statement. (cid:3) Lemma 4.2. If x k is a bounded sequence, then the following are equivalent: x k x , | x k − x | < ε for a.e. k ∈ N for all ε > ,
3. lim k ∞ x k = x , t P tk =1 | x k − x | → .If x k is not bounded, then 1–3 are equivalent and implied by 4. Proof.
All implications are fairly obvious. As an example, “2 ⇒
1” follows fromLemma 4.1 applied to P jk = [ | x k − x | < /j ]. (cid:3) Lemma 4.3.
Let x k be a nonnegative sequence. If P ∞ k =1 x k < ∞ , then for any ε > ,the inequality x k < ε/k holds true for almost every k ∈ N (which implies k · x k ). Proof.
Assume the contrary: for some ε > K ⊂ N such that x k ≥ ε/k for all k ∈ K and for some c > ρ ,k ( K ) > c for arbitrarily large k . As ρ ,n +1 ( K ) − ρ k,n + k ( K ) ≤ k/n → n → ∞ for all k , we can recursively find an increasing sequenceof indices k = 1, k i +1 ≥ k i , such that ρ k i ,k i +1 ( K ) ≥ c for all i . This yields ∞ X k =1 x k ≥ ∞ X i =1 c ( k i +1 − k i ) εk i ≥ ∞ X i =1 c (2 k i − k i ) εk i = ∞ , which contradicts the assumption. (cid:3) Lemma 4.4.
Suppose that a sequence of random variables X k : Ω → [ − M, M ] satisfies X k X almost surely. Then, E( | X k − X | ) . Proof.
By Lemma 4.2(4) and the dominated convergence theorem,1 t t X k =1 E( | X k − X | ) = E t t X k =1 | X k − X | ! → E lim t →∞ t t X k =1 | X k − X | ! = 0 . (cid:3) Corollary 4.1.
Suppose that the event A k happens for a.e. k ∈ N a.s. Then, P { A k } . J.V. Kujala
Definition 4.2.
We use the notation X k P X to mean that there exists a subset K ⊂ N with ρ ( K ) = 1 such that [ k ∈ K ]( X k − X ) P → . Lemma 4.5. X k P X if and only if P {| X k − X | ≥ ε } for all ε > . Proof.
The “only if” direction is obvious. We will prove the “if” direction.By definition, we have P {| X k − X | ≥ /j } ≤ /j for a.e. k ∈ N for all j ∈ N . Lemma 4.1then implies that there exists an increasing sequence j ( k ) → ∞ such that P {| X k − X | ≥ /j ( k ) } ≤ /j ( k ) → k ∈ N . (cid:3) Lemma 4.6.
Suppose that a sequence of random variables X k satisfies X k X almostsurely. Then, X k P X . Proof.
Let ε > Y t = 1 t t X k =1 [ | X k − X | ≥ ε ] ,X k X implies that Y t →
0. As Y t is bounded, the dominated convergence theoremimplies 0 = E (cid:16) lim t →∞ Y t (cid:17) = lim t →∞ E( Y t ) = lim t →∞ t t X k =1 P {| X k − X | ≥ ε } and so Lemma 4.2(4) yields P {| X k − X | ≥ ε }
0. Now Lemma 4.5 implies the state-ment. (cid:3)
In this section, we show that the greedy information maximization strategy satisfiesasymptotically a condition known as D-optimality. This condition is defined as maximal-ity of the determinant of the Fisher information matrix of the experiment at the trueparameter value θ . The D-optimality criterion is special among all functionals of theinformation matrix (such as the trace, minimum eigenvalue, etc.) in that it is insensitiveto linear or affine transformations of the parameter space O - . Furthermore, in the asymp-totically normal models that we are interested in, it yields a (local) approximation ofthe posterior entropy, which is the utility function commonly used in adaptive estima-tion settings. We will make use of this fact in the next section to derive an asymptoticexpression of the posterior entropy. symptotic optimality of myopic strategies Lemma 4.7.
For almost any θ ∈ O - satisfying O1–O3 , there exists a constant c such thatfor all µ > , given θ as the true parameter value, almost surely I t (Θ; Y X t +1 ) ≥ c ( tµ ) − for all sufficiently large t satisfying λ t ≤ tµ , where λ t denotes the smallest eigenvalue of −∇ θ log p ( Y t | θ ) . Proof.
Denoting I := P mj =1 α j I j , where α j and I j := I x j ( θ ) are given by O3, the small-est eigenvalue min λ I is positive.Suppose that U has diameter D and let C M,D be the constant of Lemma 3.1 applied to U as the parameter space. The same constant also applies to any subset U = B ( θ , δ/ ⊂ U with diameter δ ≤ D and as the posteriors are strongly consistent in U , too, Lemma 2.5implies that E t (Θ | U ) a . s . −→ θ . Thus, N3 and N4 imply that | I x (E t (Θ | U )) − I x ( θ ) | < δ for all x for all sufficiently large t . We obtainI t ( Y x ; Θ | U ) ≥
12 Cov t (Θ | U ) ⊙ I x (E t (Θ | U )) − C M,D E t ( | Θ − E t (Θ | U ) | | U ) ≥
12 Cov t (Θ | U ) ⊙ I x (E t (Θ | U )) − C M,D E t ( δ | Θ − E t (Θ | U ) | | U )= 12 tr(Cov t (Θ | U ) I x (E t (Θ | U ))) − C M,D δ tr(Cov t (Θ | U )) ≥
12 tr(Cov t (Θ | U ) I x ( θ )) − (cid:18) C M,D + 12 (cid:19) δ tr(Cov t (Θ | U )) ≥
12 max j =1 ,...,m tr(Cov t (Θ | U ) I j ) − (cid:18) C M,D + 12 (cid:19) δ tr(Cov t (Θ | U )) ≥
12 tr(Cov t (Θ | U ) I ) − (cid:18) C M,D + 12 (cid:19) δ tr(Cov t (Θ | U )) ≥
12 tr(Cov t (Θ | U )) min λ I − (cid:18) C M,D + 12 (cid:19) δ tr(Cov t (Θ | U ))= (cid:18) min λ I − (cid:18) C M,D + 12 (cid:19) δ (cid:19) tr(Cov t (Θ | U )) =: c tr(Cov t (Θ | U ))for some x ∈ X (fourth inequality) for all sufficiently large t (third inequality), where wehave used the fact that tr( A ) min λ B ≤ tr( AB ) ≤ tr( A ) max λ B (sixth and third inequal-ities). Let us then choose δ < min λ I / (2 C M,D + 1) so that c as defined above is positive.Now, the inequality I t (Θ; Y x ) ≥ p t ( U )I t (Θ; Y x | U ), which follows from the chain rule ofmutual information (cf. the proof of the next lemma), and C4 + Corollary 3.1 implyI t (Θ; Y t +1 ) ≥ γ sup x ∈ X I t (Θ; Y x ) ≥ γ sup x ∈ X p t ( U )I t (Θ; Y x | U ) ≥ γp t ( U ) c tr(Cov t (Θ | U )) ≥ γp t ( U ) c (2 tµ ) − . As Lemma 2.5 yields p t ( U ) a . s . −→
1, the statement follows. (cid:3) J.V. Kujala
Lemma 4.8.
For almost any θ ∈ O - satisfying O1–O3 , there exists a neighborhood U ⊂ U of θ such that conditioned on θ as the true parameter value, almost surely, Q t := I t (Θ; Y X t +1 | U )I t (Θ; Y X t +1 ) . Proof.
By Lemmas 2.2, 2.3 and 4.3, almost surely, the convergencesI t (Θ; Y X t +1 | U ) → ,t I t ([Θ ∈ U ]; Y X t +1 ) U in a countable basis of the compact metrizable space O - .It follows that the same is true conditioned on almost any θ ∈ O - as the true parametervalue. Thus, given almost any θ ∈ O - , we can pick a neighborhood U ⊂ U of θ from thecountable basis such that the above convergences almost surely hold.Lemma 4.7 (applied to µ = M ) almost surely yieldsI t (Θ; Y t +1 ) ≥ c ( M t ) − =: c t − for all sufficiently large t , where we denote Y t +1 = Y X t +1 . Condition C4 + Lemma 2.5yields I t (Θ; Y t +1 ) ≥ γ sup x ∈ X I t (Θ; Y x ) ≥ γcp t ( U c ) =: c p t ( U c )for all sufficiently large t , and the chain rule of mutual information yieldsI t (Θ; Y t +1 ) = I t ([Θ ∈ U ]; Y t +1 ) + p t ( U )I t (Θ; Y t +1 | U ) + p t ( U c )I t (Θ; Y t +1 | U c ) . Thus, almost surely,I t (Θ; Y t +1 | U )I t (Θ; Y t +1 ) = 1 p t ( U ) | {z } → (cid:20) − ≤ I t (Θ; Y t +1 ) /c z }| { p t ( U c ) → z }| { I t (Θ; Y t +1 | U c ) + z }| { t I t ([Θ ∈ U ]; Y t +1 ) t − I t (Θ; Y t +1 ) | {z } ≥ c t − (cid:21) . (cid:3) Corollary 4.2.
Conditioned on almost any θ satisfying O1–O4 , the sequence D t := sup x ∈ X B − t ⊙ I x ( θ ) − B − t ⊙ I X t +1 ( θ ) satisfies [min λ B t ≥ µ ] D t a.s. for any given µ > , where min λ B t denotes the smallesteigenvalue of B t := − t − ∇ θ log p ( Y t | θ ) . Proof.
Let us first shrink the neighborhood U of θ as necessary to make its diametersmaller than the constant δ µ,C given by Lemma 3.2. Then, let U ⊂ U be the neighbor-hood of θ given by Lemma 4.8. By Theorem 3.1(3), there now exist random sequences symptotic optimality of myopic strategies E t → E ′ t → θ as the true value,12 sup x ∈ X B − t ⊙ I x ( θ ) = sup x ∈ X t I t (Θ; Y x | U ) + E t , B − t ⊙ I X t +1 ( θ ) = t I t (Θ; Y X t +1 | U ) + E ′ t whenever min λ B t ≥ µ . For these t , it follows12 D t = (cid:18)
12 sup x ∈ X B − t ⊙ I x ( θ ) | {z } =tr( B − t I x ( θ )) ≤ nµ − M − E t (cid:19)(cid:18) − I t (Θ; Y X t +1 | U )sup x ∈ X I t (Θ; Y x | U ) (cid:19) + E t − E ′ t , where Lemma 4.8 and the inequality I t (Θ; Y x ) ≥ p t ( U )I t (Θ; Y x | U ) yieldI t (Θ; Y X t +1 | U )sup x ∈ X I t (Θ; Y x | U ) ≥ p t ( U ) I t (Θ; Y X t +1 | U )sup x ∈ X I t (Θ; Y x ) = p t ( U ) Q t R t , and so [min λ B t ≥ µ ] D t (cid:3) Lemma 4.9.
Conditioned on almost any θ satisfying O1–O3 , there exists µ such that min λ B t ≥ µ for infinitely many t ∈ N , where min λ B t denotes the smallest eigenvalue of B t = − t − ∇ θ p ( Y t | θ ) . Proof.
Let µ > t − (Θ; Y X t ) ≥ c ( tµ ) − forall sufficiently large t satisfying min λ B t < µ and Lemma 4.8 implies that I t − (Θ; Y X t | U ) ≥ c ( tµ ) − for a.e. t satisfying min λ B t ≤ µ . Let then K µ := { t ∈ N : min λ B t ≥ µ } and suppose that ρ ( K µ ) = 0. Then, ρ j := ρ j , j +1 ( K µ ) →
0, and then exists j such that ρ j ≤ / j ≥ j . It follows j − X t =1 I t − (Θ; Y X t | U ) ≥ cµ j − X t =1 [ t / ∈ K µ ] 1 t ≥ cµ j − X j = j j +1 − X t =2 j (1+ ρ j ) t ≥ cµ ( j − j ) log 23 / , and so t X k =1 I k − (Θ; Y X k | U ) ≥ (cid:18) cµ log 43 (cid:19) log ( t − − c c,µ for all t = 2 j , j ≥ j . Since µ was arbitrary, this implies that the sum grows asymptoticallysuperlogarithmically if ρ ( K µ ) = 0 holds for all µ >
0. If this event has positive probabilityamong all θ ∈ U , then alsoI(Θ; Y t | U ) = E t X k =1 I k − (Θ; Y X k | U ) (cid:12)(cid:12)(cid:12) U ! J.V. Kujala grows superlogarithmically, contradicting Lemma 2.9. Thus, for almost all θ ∈ U satis-fying O1–O3, either K µ is not ρ -measurable or ρ ( K µ ) >
0. In either case K µ is infinite. (cid:3) Theorem 4.1 (Asymptotic D-optimality, part 1).
Conditioned on almost any θ ∈ O - satisfying O1–O4 , almost surely, B t := − t − ∇ θ log p ( Y t | θ ) → B ∗ := arg max B ∈I det( B ) , where I is the convex hull of the closure of { I x ( θ ) } x ∈ X . The maximizer B ∗ is unique,because the determinant is log-concave on the compact convex set I . This result is optimalin the sense that for any strategy of choosing the placements X t (instead of O4 and C4 ),almost surely lim sup t →∞ det( B t ) ≤ det( B ∗ ) . Proof.
The objective function is f ( B ) = (cid:26) log det( B ) , min λ B > −∞ , otherwise,where λ B denotes the set of eigenvalues of B . Lemma 3.5 implies that B t is asymptoticallya convex combination of matrices in the closure of { I x ( θ ) } x ∈ X and so lim sup t →∞ f ( B t ) ≤ f ( B ∗ ). Let us then show that this upper bound is tight.First, we choose some representation B ∗ = P mk =1 α k I k of the optimum point, where I k are matrices in the closure of { I x ( θ ) } x ∈ X and P mk =1 α k = 1.For any symmetric real matrix B t , we have (with slight abuse of notation) ∇ f ( B t ) = B − t , ∇ f ( B t ) = − [( B − t ) i ( B − t ) Tj ] ni,j , [ ∇ f ( B t )] B = − [( B − t ) i ( B − t ) Tj ⊙ B ] ni,j = − B − t BB − t ,B ⊙ [ ∇ f ( B t )] B = − tr( B − t BB − t B ) , and Taylor’s theorem yields f ( B t +1 ) = f ( B t ) + B − t ⊙ ( B t +1 − B t ) − tr( B − t B ′ B − t B ′ ) , where B ′ is between 0 and B t +1 − B t . Denoting B := −∇ θ log( p ( Y X t +1 | θ )), we obtain f ( B t +1 ) − f ( B t ) = f (cid:18) tB t + Bt + 1 (cid:19) − f ( B t )= B − t ⊙ B − B t t + 1 −
12 tr( B − t B ′ B − t B ′ ) | {z } |·|≤ n M µ − ( t +1) − ≥ t + 1 (cid:18) B − t ⊙ B − n − nM µ − t + 1 (cid:19) , symptotic optimality of myopic strategies t satisfying min λ B t ≥ µ for any µ >
0. Denoting by λ i the eigenvalues of B − t B ∗ , Corollary 4.2 now implies that B − t ⊙ I X t +1 ( θ ) + D t = sup x ∈ X B − t ⊙ I x ( θ ) ≥ max k B − t ⊙ I k ≥ X k α k ( B − t ⊙ I k ) = B − t ⊙ B ∗ = tr( B − t B ∗ ) = n X i =1 λ i = n + n X i =1 ( λ i − ≥ n + n X i =1 log( λ i )= n + log det( B − t B ∗ ) = n + f ( B ∗ ) − f ( B t ) , where [min λ B t ≥ µ ] D t µ >
0. Noting that I X t +1 ( θ ) = E t ( B | θ ), we obtainE t ( f ( B t +1 ) | θ ) − f ( B t ) ≥ t + 1 ( f ( B ∗ ) − f ( B t ) − D µ,t ) , where D µ,t = D t + (2 nM µ − ) / ( t + 1).From now on, in order to keep the notation clean, we will implicitly condition allprobability statements on Θ = θ .Let the constants f < f < f ( B ∗ ) be arbitrary and define µ := exp( f ) M − n / > t satisfies f ( B t ) ≥ f . Then, the definition of µ guarantees thatmin λ B t ≥ µ . Let then α ∈ ]1 , exp( µ/M )] be arbitrary. Since min λ B t can decrease by atmost M/t per each step, we obtainmin λ B t ≥ µ − t X t = t +1 Mt ≥ µ − M log t t ≥ µ for all t between t and t := ⌊ αt ⌋ . Thus, the following inequalities hold true for all t ∈ [ t , t [: E t − f ( B t ) − f ( B t − ) ≥ t ( f ( B ∗ ) − f ( B t − ) − D µ,t − ) , E t − ( tf ( B t ) − ( t − f ( B t − )) ≥ f ( B ∗ ) − D µ,t − , E t ( tf ( B t ) − ( t − f ( B t − )) ≥ f ( B ∗ ) − E t D µ,t − , t X t = t +1 E t ( tf ( B t ) − ( t − f ( B t − )) ≥ t X t = t +1 ( f ( B ∗ ) − E t D µ,t − ) , E t ( t f ( B t )) − t f ( B t ) ≥ ( t − t ) f ( B ∗ ) − t − X t = t E t D µ,t , J.V. Kujala and dividing by t , we obtain the inequalityE t f ( B t ) − α − f ( B t ) ≥ (cid:18) − t t (cid:19) f ( B ∗ ) − E t t t − X t = t D µ,t ! → (1 − α − ) f ( B ∗ ) , where we have used the fact that t ≤ αt , and where the convergence holds for anyincreasing sequence of indices t satisfying f ( B t ) ≥ f (which implies min λ B t ≥ µ forall t ∈ [ t , t [). This convergence is obtained by applying Lemma 4.2(3) to the boundedsequence [min λ B t ≥ µ ] D µ,t
0, which yields (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) t t − X t = t D µ,t (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ t t − X t =0 | [min λ B t ≥ µ ] D µ,t | → | D µ,t | ≤ nM µ − + 2 nM µ − for all t , Lebesgue’s dominated convergencetheorem allows us to take this limit inside the expectation). Thus, there exists a positiveconstant s such that E t f ( B t ) ≥ f ( B t ) + 2 s for all sufficiently large t satisfying f ≤ f ( B t ) ≤ f . Also, since the maximum changein the value of f over one step is bounded by v/t for some constant v > µ ), we obtainVar t f ( B t ) ≤ t X t = t +1 (cid:18) vt (cid:19) ≤ Z t t (cid:18) vt (cid:19) d t = v (cid:18) t − t (cid:19) ≤ v t . Now Markov’s inequality yields P t { f ( B t ) < f ( B t ) + s } ≤ P t { f ( B t ) < E t f ( B t ) − s }≤ P t {| E t f ( B t ) − f ( B t ) | > s }≤ Var t f ( B t ) s ≤ v t s . As this upper bound on the probability sums to a finite number over the sequence t ( k )determined by t ( k + 1) = t ( k ) = ⌊ αt ( k ) ⌋ , the Borel–Cantelli lemma implies that al-most surely f ( B t ( k +1) ) < f ( B t ( k ) ) + s holds for only finitely many indices k ∈ N sat-isfying f ≤ f ( B t ( k ) ) ≤ f . Thus, there exists k such that for all k ≥ k , whenever f ≤ f ( B t ( k ) ) ≤ f , the value f ( B t ( k )) will increase by at least s on each step as k increases. Furthermore, since | f ( B t ) − f ( B t ( k ) ) | ≤ t ( k ) X t = t ( k )+1 vt ≤ v log t ( k ) t ( k ) ≤ v log α symptotic optimality of myopic strategies t ∈ [ t ( k ) , t ( k )[, it follows that if f ( B t ( k ) ) ≥ f for any k ≥ k , then f ( B t ) ≥ f − v log α for all sufficiently large t (provided that f − v log α ≥ f ). Since f − v log α canbe made arbitrarily close to f ( B ∗ ) by appropriate choices of rational α > f < f ( B ∗ ) for arbitrarily small rational f , we almost surely obtain lim inf t →∞ f ( B t ) ≥ f ( B ∗ ) unless f ( B t ) eventually stays below any number. But this would imply thatlim sup t →∞ min λ B t ≤
0, which is almost surely contradicted by Lemma 4.9. (cid:3)
Corollary 4.3 (Asymptotic D-optimality, part 2).
Conditioned on almost any θ ∈ O - satisfying O1–O4 , there exists a neighborhood U of θ such that t Cov t (Θ | U ) a . s . −→ ( B ∗ ) − . This is optimal in the sense that for any other strategy in place of O4 and C4 , almost surely lim inf t →∞ det( t Cov t (Θ | U )) ≥ det( B ∗ ) − . Proof.
Given O4, Theorems 4.1 and 3.1(2) imply that $t\,\mathrm{Cov}_t(\Theta \mid U) \xrightarrow{a.s.} (B^*)^{-1}$. For any other strategy, we have $\limsup_{t\to\infty}\det(B_t) \le \det(B^*)$ a.s., and so Theorem 3.1(2) yields $\liminf_{t\to\infty}\det(t\,\mathrm{Cov}_t(\Theta \mid U)) \ge \det(B^*)^{-1}$ a.s. as $t$ increases within indices satisfying $\min\lambda\,B_t > \mu$ for some given $\mu > 0$. But Corollary 3.1 implies that if we choose a sufficiently small $\mu > 0$, then $\det(t\,\mathrm{Cov}_t(\Theta \mid U)) \ge \det(B^*)^{-1}$ also for $\min\lambda\,B_t \le \mu$, and the statement follows. $\square$

Remark 4.1.
As discussed in the beginning of this section, secondary modes with weights proportional to $1/t$ may remain outside $U$, and they do contribute to the asymptotic variance. Thus, the D-optimality result (part 2) shown here is only a local form of optimality.

The situation would be different if the placements were chosen so as to minimize the determinant of the posterior covariance $\mathrm{Cov}_t(\Theta)$ directly (which, of course, presupposes that the parameter space has a global Euclidean structure). Then, slightly more trials would be spent to decrease the weights of the secondary modes, but they should remain insignificant in proportion. Thus, we can conjecture that $B_t \xrightarrow{a.s.} B^*$ would still obtain in Theorem 4.1 with $t\,\mathrm{Cov}_t(\Theta)$ asymptotically equal to $(B_t)^{-1}$, making the result globally optimal.

Here we use the D-optimality result to derive an expression for the asymptotic entropy.
Corollary 4.4.
Conditioned on almost any $\theta \in O$ satisfying O1–O4, for any neighborhood $U$ of $\theta$, there exists a constant $c_U$ such that almost surely, $p_t(U^c) \le c_U/t$ for a.e. $t \in \mathbb{N}$.

Proof.
Theorem 4.1 implies that $\min\lambda\,B_t \ge \mu$ for all sufficiently large $t$, for some $\mu > 0$. Then, for any $\varepsilon > 0$, Theorem 3.1(3) yields
$$t\,\mathrm{I}_t(\Theta; Y_{X_{t+1}} \mid U) \le \sup_{x\in X} B_t^{-1}\odot I_x(\theta) + \varepsilon \le n\mu^{-1}M + \varepsilon =: c$$
for all sufficiently large $t$, where $U$ is any sufficiently small neighborhood of $\theta$. Combined with Lemma 4.8, this implies that $\mathrm{I}_t(\Theta; Y_{X_{t+1}}) \le c/t$ for a.e. $t \in \mathbb{N}$, and so Lemma 2.5(2) yields the statement. $\square$

Remark 4.2.
Note that the statement of Corollary 4.4 holds only for a.e. $t \in \mathbb{N}$. What happens in a sufficiently long run is that most trials are spent on increasing the accuracy around the global mode, and an approximately logarithmically growing number of trials is spent on placements that decrease the weights of secondary modes. However, on any such trial there is a small probability that the weight of the secondary mode actually increases, and given a sufficiently long run, this will eventually happen arbitrarily many times in a row, making the weight of the secondary mode temporarily arbitrarily much larger than the $c_U/t$ bound that holds on most trials.

Theorem 4.2.
Conditioned on almost any $\theta \in O$ satisfying O1–O4, if the prior entropy $\mathrm{H}(\Theta)$ w.r.t. a parameterization that is consistent with the local Euclidean structure (i.e., the prior density $p(\theta)$ is given w.r.t. a measure that coincides with the Lebesgue measure on subsets of $U$) is well-defined and finite, then, almost surely,
$$\mathrm{H}_t(\Theta) + \frac{n}{2}\log t \to H^* := -\frac{1}{2}\log\det(B^*) + \frac{n}{2}\log(2\pi e)$$
for a.e. $t \in \mathbb{N}$.

Proof.
Let us condition everything on $\theta$ being the true value. Theorem 3.1(2) implies that for some sufficiently small neighborhood $U$ of $\theta$,
$$\mathrm{H}_t(\Theta \mid U) + \frac{n}{2}\log t \xrightarrow{a.s.} H^*.$$
Lemmas 2.6 and 2.8 imply that for any $\varepsilon > 0$, $|\mathrm{H}_t(\Theta \mid U^c)| < \varepsilon t$ for all sufficiently large $t$, and as Corollary 4.4 yields $p_t(U^c) \le c_U/t$ for a.e. $t$, Lemma 4.2(2) implies $p_t(U^c)\mathrm{H}_t(\Theta \mid U^c) \to 0$ for a.e. $t$. The statement now follows from the chain rule of entropy
$$\mathrm{H}_t(\Theta) = p_t(U)\mathrm{H}_t(\Theta \mid U) + \underbrace{p_t(U^c)\mathrm{H}_t(\Theta \mid U^c)}_{\to 0} + \underbrace{\mathrm{H}_t([\Theta \in U])}_{\to 0 \text{ a.s.}},$$
where the first term satisfies
$$p_t(U)\mathrm{H}_t(\Theta \mid U) + \frac{n}{2}\log t = p_t(U)\Bigl[\mathrm{H}_t(\Theta \mid U) + \frac{n}{2}\log t\Bigr] + \underbrace{p_t(U^c)}_{\le c_U/t}\frac{n}{2}\log t \to H^*. \qquad\square$$

Corollary 4.5.
Suppose that O1–O4 hold for almost all $\theta \in O$ and that the prior entropy $\mathrm{H}(\Theta)$ w.r.t. a parameterization that is consistent with the local Euclidean structures $U$ in O2 is well-defined and finite. Then,
$$\mathrm{H}_t(\Theta) + \frac{n}{2}\log t \xrightarrow{P} H^*$$
for a.e. $t \in \mathbb{N}$. In other words, there exists a set $K \subset \mathbb{N}$ of indices with $\rho(K) = 1$ such that $\mathrm{H}_t(\Theta) + \frac{n}{2}\log t \xrightarrow{P} H^*$ as $t$ increases within $K$.

Proof.
Apply Lemma 4.6 to the statement of Theorem 4.2. (cid:3)
In Kujala [5] the adaptive sequential estimation framework is generalized to the situation where the observation of $Y_x$ is associated with some random cost $C_x$ of observation which, given the value of $Y_x$, is independent of $\Theta$ and of the results and costs of any other observations:
$$\begin{array}{ccccc}
& & \Theta & & \\
& \swarrow & \downarrow & \searrow & \\
Y_x & & Y_{x'} & & \cdots \\
\downarrow & & \downarrow & & \\
C_x & & C_{x'} & & \cdots
\end{array}$$
The technical requirement that $C_x$ depends on $\Theta$ only through $Y_x$ is satisfied in particular if $C_x$ is a component of $Y_x$. Thus, it leads to no loss of generality if the incurred costs are observable.

The goal considered in Kujala [5] is maximization of the expected information gain of a sequential experiment that terminates when the total cost overruns a given budget. To achieve this goal, the heuristic of maximizing the expected information gain $\mathrm{I}_t(\Theta; Y_x)$ divided by the expected cost $\mathrm{E}_t(C_x)$ on each trial is proposed. In this section, we are able to show that this heuristic is in fact asymptotically optimal (as the budget tends to infinity) under essentially the same conditions that the plain information gain maximization is.

Thus, condition O4 is now replaced by the following:

O4′. The placements satisfy
$$R'_t := \frac{\mathrm{I}_t(\Theta; Y_{X_{t+1}})/\mathrm{E}_t(C_{X_{t+1}})}{\sup_{x\in X}\bigl(\mathrm{I}_t(\Theta; Y_x)/\mathrm{E}_t(C_x)\bigr)} \to 1 \quad\text{a.s.},$$
where $|C_x| \le M$, $\mathrm{E}(C_x \mid \theta) \ge \gamma' > 0$, and the family of expected cost functions $\{\theta \mapsto \mathrm{E}(C_x \mid \theta) : x \in X\}$ is equicontinuous at $\theta$.

Due to the assumed bounds on the expected cost $\mathrm{E}(C_x \mid \theta)$, condition C4 is still satisfied and so all the previous lemmas depending on it apply. Together with the following lemma, these bounds also imply that the total cost grows asymptotically within linear bounds.
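The myopic rule in O4′ is straightforward to implement for a discretized model. The following sketch is not from the original article; it assumes a finite grid of $K$ parameter values and $P$ candidate placements, and the array names (`post`, `like`, `exp_cost`) are illustrative only.

```python
import numpy as np

def myopic_placement(post, like, exp_cost):
    """Myopic cost-aware rule: maximize I_t(Theta; Y_x) / E_t(C_x).

    post     -- (K,) posterior weights over a grid of K parameter values
    like     -- (P, Y, K) array, like[i, y, k] = p(y | x_i, theta_k)
    exp_cost -- (P, Y) array, exp_cost[i, y] = E(C_{x_i} | Y_{x_i} = y);
                by the framework's assumption C_x depends on Theta only
                through Y_x, so E_t(C_x) = sum_y p_t(y | x) E(C_x | y).
    Returns the index of the placement maximizing the ratio.
    """
    pred = like @ post                 # predictive p_t(y | x_i), shape (P, Y)
    safe = lambda p: np.log(np.clip(p, 1e-300, None))
    # I_t(Theta; Y_x) = H(Y_x) - H(Y_x | Theta) under the current posterior
    H_pred = -np.sum(pred * safe(pred), axis=1)            # (P,)
    H_cond = -np.sum(like * safe(like), axis=1) @ post     # (P,)
    info_gain = H_pred - H_cond
    exp_c = np.sum(pred * exp_cost, axis=1)                # E_t(C_x), (P,)
    return int(np.argmax(info_gain / exp_c))
```

Setting `exp_cost` to all ones recovers the plain greedy information gain maximization of the unit-cost framework.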
Lemma 4.10.
Suppose that O4′ holds. Then, conditioned on $\theta$ as the true parameter value,
$$\frac{C_t - \sum_{k=1}^t \mathrm{E}(C_{X_k} \mid \theta)}{t} \xrightarrow{a.s.} 0,$$
where $C_t := \sum_{k=1}^t C_{X_k}$. In particular, for any $\gamma < \gamma'$, almost surely $C_t \ge t\gamma$ for all sufficiently large $t$ (as well as $C_t \le tM$ for all $t$).
Denoting $Z_k = C_{X_k} - \mathrm{E}(C_{X_k} \mid \theta)$, given $\Theta = \theta$, the sequence $Z_1 + \cdots + Z_k$ of partial sums is a martingale and satisfies $\mathrm{E}(|Z_k|^2) \le M^2 < \infty$ for all $k$, and so Theorem A.3 implies that $(Z_1 + \cdots + Z_t)/t \xrightarrow{a.s.} 0$, which is the statement. $\square$
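As a quick sanity check of this linear growth of the total cost, the following snippet (an illustration, not part of the original article; it borrows the logistic model and outcome-dependent cost of Example 5.3 below) simulates the running average $C_t/t$ at a fixed placement:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, x, t = 3.0, 2.0, 100_000
p_detect = 1.0 / (1.0 + np.exp(-(x - theta)))  # p(Y_x = 1 | theta)
y = rng.random(t) < p_detect                   # simulated outcomes
costs = np.where(y, 1.0, 4.0)                  # C_x = 1 + 3*[Y_x = 0]
print(costs.mean())                            # C_t / t
print(1.0 + 3.0 * (1.0 - p_detect))            # E(C_x | theta), ~3.19 here
```

In line with Lemma 4.10, the running average settles between $\gamma'$ and $M$.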
Next, we will generalize Corollary 4.2 to the cost-aware placements.
Corollary 4.6.
Conditioned on almost any $\theta$ satisfying O1–O3 and O4′, the sequence
$$D_t := \sup_{x\in X}\frac{B_t^{-1}\odot I_x(\theta)}{\mathrm{E}(C_x\mid\theta)} - \frac{B_t^{-1}\odot I_{X_{t+1}}(\theta)}{\mathrm{E}(C_{X_{t+1}}\mid\theta)}$$
satisfies $[\min\lambda\,(C_t/t)B_t \ge \mu]\,D_t \xrightarrow{a.s.} 0$ for any given $\mu > 0$, where $\min\lambda\,(C_t/t)B_t$ denotes the smallest eigenvalue of $(C_t/t)B_t$, with $B_t := -C_t^{-1}\nabla^2_\theta\log p(Y_t\mid\theta)$ and $C_t := \sum_{k=1}^t C_{X_k}$.

Proof.
Let us first shrink the neighborhood $U$ of $\theta$ as necessary to make its diameter smaller than the constant $\delta_{\mu,C}$ given by Lemma 3.2. Then, let $U_0 \subset U$ be the neighborhood of $\theta$ given by Lemma 4.8. The boundedness and equicontinuity at $\theta$ of $\theta \mapsto \mathrm{E}(C_x\mid\theta) \in [\gamma', M]$ imply that conditioned on $\Theta = \theta$, almost surely, $\mathrm{E}_t(C_x) \to \mathrm{E}(C_x\mid\theta)$ uniformly over all $x \in X$. Combined with Theorem 3.1(3), this implies that there exist random sequences $E_t \to 0$ and $E'_t \to 0$ such that conditioned on $\theta$ as the true value,
$$\frac12\sup_{x\in X}\frac{B_t^{-1}\odot I_x(\theta)}{\mathrm{E}(C_x\mid\theta)} = \sup_{x\in X}\frac{C_t\,\mathrm{I}_t(\Theta;Y_x\mid U)}{\mathrm{E}_t(C_x)} + E_t,$$
$$\frac12\,\frac{B_t^{-1}\odot I_{X_{t+1}}(\theta)}{\mathrm{E}(C_{X_{t+1}}\mid\theta)} = \frac{C_t\,\mathrm{I}_t(\Theta;Y_{X_{t+1}}\mid U)}{\mathrm{E}_t(C_{X_{t+1}})} + E'_t$$
whenever $\min\lambda\,(C_t/t)B_t \ge \mu$. For these $t$, it follows that
$$\frac12 D_t = \Bigl(\underbrace{\frac12\sup_{x\in X}\frac{B_t^{-1}\odot I_x(\theta)}{\mathrm{E}(C_x\mid\theta)}}_{\le\operatorname{tr}(B_t^{-1}I_x(\theta))/\gamma' \le n(\gamma'\mu)^{-1}M^2} - E_t\Bigr)\Biggl(1 - \frac{\mathrm{I}_t(\Theta;Y_{X_{t+1}}\mid U)/\mathrm{E}_t(C_{X_{t+1}})}{\sup_{x\in X}\bigl(\mathrm{I}_t(\Theta;Y_x\mid U)/\mathrm{E}_t(C_x)\bigr)}\Biggr) + E_t - E'_t,$$
and the inequalities $\mathrm{I}_t(\Theta;Y_x) \ge p_t(U)\,\mathrm{I}_t(\Theta;Y_x\mid U)$ yield
$$\frac{\mathrm{I}_t(\Theta;Y_{X_{t+1}}\mid U)/\mathrm{E}_t(C_{X_{t+1}})}{\sup_{x\in X}\bigl(\mathrm{I}_t(\Theta;Y_x\mid U)/\mathrm{E}_t(C_x)\bigr)} \ge p_t(U)\,\frac{\mathrm{I}_t(\Theta;Y_{X_{t+1}}\mid U)/\mathrm{E}_t(C_{X_{t+1}})}{\sup_{x\in X}\bigl(\mathrm{I}_t(\Theta;Y_x)/\mathrm{E}_t(C_x)\bigr)} = p_t(U)\,Q_t\,R'_t,$$
and so $[\min\lambda\,(C_t/t)B_t \ge \mu]\,D_t \xrightarrow{a.s.} 0$. $\square$

Lemma 4.11.
The range of the expression
$$r_t = \frac{\sum_{k=1}^t I_{x_k}(\theta)}{\sum_{k=1}^t \mathrm{E}(C_{x_k}\mid\theta)}$$
over all sequences $x_k$ in $X$ and all finite $t$ is a dense subset of the set $\mathcal{I}$ defined as the closure of the convex hull of
$$S = \Bigl\{\frac{I_x(\theta)}{\mathrm{E}(C_x\mid\theta)}\Bigr\}_{x\in X}.$$
Furthermore, the range of the limits of all converging $r_t$ equals $\mathcal{I}$.

Proof.
For any sequence $\{x_k\}$, we have
$$r_t = \frac{\sum_{k=1}^t I_{x_k}(\theta)}{\sum_{k=1}^t \mathrm{E}(C_{x_k}\mid\theta)} = \sum_{k=1}^t \underbrace{\Biggl(\frac{\mathrm{E}(C_{x_k}\mid\theta)}{\sum_{j=1}^t \mathrm{E}(C_{x_j}\mid\theta)}\Biggr)}_{=:\,\alpha_{k,t}}\frac{I_{x_k}(\theta)}{\mathrm{E}(C_{x_k}\mid\theta)},$$
and so $r_t$ is always a convex combination of elements of $S$. The convex combination is not exactly linear w.r.t. the number of occurrences of each $x$ in the sequence because of the different $\mathrm{E}(C_{x_k}\mid\theta)$ weights, but nonetheless, by varying the proportions of the different $x$ in a sufficiently long sequence, any convex combination can be approximated arbitrarily well. $\square$
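To make the last step concrete, here is a worked two-placement case (not from the original text): suppose placements $x_1$ and $x_2$ are used $m_1$ and $m_2$ times, respectively, and write $c_i = \mathrm{E}(C_{x_i}\mid\theta)$. Then
$$r_t = \frac{m_1 I_{x_1}(\theta) + m_2 I_{x_2}(\theta)}{m_1 c_1 + m_2 c_2} = \beta\,\frac{I_{x_1}(\theta)}{c_1} + (1-\beta)\,\frac{I_{x_2}(\theta)}{c_2}, \qquad \beta = \frac{m_1 c_1}{m_1 c_1 + m_2 c_2},$$
so any target weight $\beta \in [0,1]$ can be approached by choosing the trial counts in the proportion $m_1/m_2 \approx \bigl(\beta/(1-\beta)\bigr)(c_2/c_1)$.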
Theorem 4.3 (Asymptotic D-optimality, part 1).

Conditioned on almost any $\theta \in O$ satisfying O1–O3, O4′, almost surely,
$$B_t := \frac{-\nabla^2_\theta\log p(Y_t\mid\theta)}{C_t} \to B^* := \arg\max_{B\in\mathcal{I}}\det(B),$$
where $C_t := \sum_{k=1}^t C_{X_k}$ and $\mathcal{I}$ is the convex hull of the closure of
$$S = \Bigl\{\frac{I_x(\theta)}{\mathrm{E}(C_x\mid\theta)} : x \in X\Bigr\}.$$
This is optimal in the sense that for any strategy of choosing the placements $X_t$ (instead of O4′ and C4), almost surely
$$\limsup_{t\to\infty}\det(B_t) \le \det(B^*).$$
Proof.
Since $S$ is bounded, $\mathcal{I}$ is a compact convex set and $B^*$ is well defined. Lemmas 3.5, 4.10, and 4.11 imply that $\limsup_{t\to\infty}\det(B_t) \le \det(B^*)$ a.s. Let us then show that this upper bound is tight.

Lemma 4.11 implies that there exists a representation
$$B^* = \lim_{m\to\infty}\frac{\sum_{k=1}^m I_k}{\sum_{k=1}^m c_k}$$
of the optimum point $B^*$, where the $(I_k, c_k)$ are elements of $\{(I_x(\theta), \mathrm{E}(C_x\mid\theta)) : x \in X\}$.

Denoting $B := -\nabla^2_\theta\log p(Y_{X_{t+1}}\mid\theta)$ and $C := C_{X_{t+1}}$, and assuming $\min\lambda\,(C_t/t)B_t \ge \mu$, we obtain
$$|B| \le M,\quad |C| \le M,\quad |B_t^{-1}| \le (\mu/M)^{-1},\quad |B - CB_t| \le M + M^2/\mu,\quad C_t + C \ge \gamma(t+1),$$
and so, for some $B'$ between $0$ and $B_{t+1} - B_t$, we obtain
$$f(B_{t+1}) - f(B_t) = f\Bigl(\frac{C_t B_t + B}{C_t + C}\Bigr) - f(B_t) = B_t^{-1}\odot\frac{B - CB_t}{C_t + C} - \frac12\operatorname{tr}\bigl((B_t + B')^{-1}(B_{t+1}-B_t)(B_t + B')^{-1}(B_{t+1}-B_t)\bigr)$$
$$\ge \frac{1}{C_t + C}\Bigl(B_t^{-1}\odot B - nC - \frac{n[(\mu/M)^{-1}(M + M^2/\mu)]^2}{C_t + C}\Bigr) \ge \underbrace{\frac{\mathrm{E}_t(C\mid\theta)}{C_t + C}}_{\ge(\gamma/M)/(t+1)}\Bigl(\frac{B_t^{-1}\odot B}{\mathrm{E}_t(C\mid\theta)} - \frac{nC}{\mathrm{E}_t(C\mid\theta)} - \frac{C_{M,\mu,\gamma}}{t+1}\Bigr).$$
Denoting by $\lambda_i$ the eigenvalues of $B_t^{-1}B^*$, we obtain
$$\frac{B_t^{-1}\odot I_{X_{t+1}}(\theta)}{\mathrm{E}(C_{X_{t+1}}\mid\theta)} + D_t = \sup_{x\in X}\frac{B_t^{-1}\odot I_x(\theta)}{\mathrm{E}(C_x\mid\theta)} \ge \sup_k\Bigl(\frac{B_t^{-1}\odot I_k}{c_k}\Bigr) \ge \lim_{m\to\infty}\frac{\sum_{k=1}^m(B_t^{-1}\odot I_k)}{\sum_{k=1}^m c_k} = B_t^{-1}\odot B^*$$
$$= \operatorname{tr}(B_t^{-1}B^*) = \sum_{i=1}^n\lambda_i = n + \sum_{i=1}^n(\lambda_i - 1) \ge n + \sum_{i=1}^n\log\lambda_i = n + \log\det(B_t^{-1}B^*) = n + f(B^*) - f(B_t),$$
where Corollary 4.6 implies that $[\min\lambda\,(C_t/t)B_t \ge \mu]\,D_t \xrightarrow{a.s.} 0$. Noting that $\mathrm{E}_t(B/\mathrm{E}_t(C\mid\theta)\mid\theta) = I_{X_{t+1}}(\theta)/\mathrm{E}(C_{X_{t+1}}\mid\theta)$, it follows that
$$\mathrm{E}_t(f(B_{t+1})\mid\theta) - f(B_t) \ge \frac{\gamma/M}{t+1}\bigl(f(B^*) - f(B_t) - D_{\mu,t}\bigr),$$
where $D_{\mu,t} = D_t + C_{M,\mu,\gamma}/(t+1)$.

From here on, the proof is essentially the same as in the maximum information case; we just redefine the constant $\mu$ (now depending on $\gamma$ as well) so as to guarantee $\min\lambda\,(C_t/t)B_t \ge 2\mu$ whenever $f(B_t) \ge f_0$. $\square$

The part 2 of the D-optimality result as well as analogs of the asymptotic entropy results follow with essentially the same proofs (just replacing $t$ with $C_t$ at appropriate places):

Corollary 4.7 (Asymptotic D-optimality, part 2).
Conditioned on almost any $\theta \in O$ satisfying O1–O3, O4′, there exists a neighborhood $U$ of $\theta$ such that $C_t\,\mathrm{Cov}_t(\Theta\mid U) \xrightarrow{a.s.} (B^*)^{-1}$, where $C_t := \sum_{k=1}^t C_{X_k}$. This is optimal in the sense that for any other strategy in place of O4′ and C4, almost surely
$$\liminf_{t\to\infty}\det\bigl(C_t\,\mathrm{Cov}_t(\Theta\mid U)\bigr) \ge \det(B^*)^{-1}.$$

Theorem 4.4.
Conditioned on almost any $\theta \in O$ satisfying O1–O3, O4′, if the prior entropy $\mathrm{H}(\Theta)$ w.r.t. a parameterization that is consistent with the local Euclidean structure (i.e., the prior density $p(\theta)$ is given w.r.t. a measure that coincides with the Lebesgue measure on subsets of $U$) is well-defined and finite, then, almost surely,
$$\mathrm{H}_t(\Theta) + \frac{n}{2}\log C_t \to H^* := -\frac12\log\det(B^*) + \frac{n}{2}\log(2\pi e)$$
for a.e. $t \in \mathbb{N}$, where $C_t := \sum_{k=1}^t C_{X_k}$.

Corollary 4.8.
Suppose that O1–O3 and O4′ hold for almost all $\theta \in O$ and that the prior entropy $\mathrm{H}(\Theta)$ w.r.t. a parameterization that is consistent with the local Euclidean structures $U$ in O2 is well-defined and finite. Then,
$$\mathrm{H}_t(\Theta) + \frac{n}{2}\log C_t \xrightarrow{P} H^*,$$
where $C_t := \sum_{k=1}^t C_{X_k}$. In other words, there exists a set $K \subset \mathbb{N}$ of indices with $\rho(K) = 1$ such that $\mathrm{H}_t(\Theta) + \frac{n}{2}\log C_t \xrightarrow{P} H^*$ as $t$ increases within $K$.
5. Examples
In this section, we give specific examples illustrating the optimality results.
Example 5.1 (Psychometric model).
Consider the psychometric model, where an observer's unknown intensity threshold $\Theta$ for detecting a stimulus of intensity $x$ is distributed uniformly on $[0, 100]$ and the binary answer $Y_x \in \{0, 1\}$ for a test intensity $x \in [0, 100]$ is distributed as
$$p(y_x\mid\theta) = \begin{cases}\psi(x - \theta), & y_x = 1 \text{ (detected)},\\ 1 - \psi(x - \theta), & y_x = 0 \text{ (not detected)},\end{cases}$$
where $\psi(x)$ is the psychometric function, here assumed to be the sigmoid
$$\psi(x) = \frac{1}{1 + e^{-x}}$$
for simplicity (for more general psychometric models, see Kujala and Lukka [7], and the references therein).

In this model, the Fisher information of a given placement $x$ is calculated as
$$I_x(\theta) = \sum_{y_x=0}^{1} p(y_x\mid\theta)\Bigl[\frac{\partial}{\partial\theta}\log p(y_x\mid\theta)\Bigr]^2 = \frac{\psi'(x-\theta)^2}{\psi(x-\theta)[1-\psi(x-\theta)]} = \frac{e^{\theta-x}}{[1+e^{\theta-x}]^2}.$$
Thus, for any given $\theta$, the D-optimal value of the averaged Fisher information in Theorem 4.1 is $B^* = 1/4$, given by the placement $x = \theta$, to which the greedy algorithm eventually converges. Now Corollary 4.5 yields
$$\mathrm{H}_t(\Theta) + \frac{n}{2}\log t \xrightarrow{P} H^* = \underbrace{-\frac12\log\det(B^*)}_{\approx 0.69} + \frac{n}{2}\log(2\pi e) \qquad (5.1)$$
and this is the asymptotically optimal posterior entropy. In this example, the same expression also gives the asymptotically optimal expected utility $\mathrm{E}(\mathrm{H}_t(\Theta)) + \frac{n}{2}\log t$, which we will next compare to that of the offline design.
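The numbers above are easy to verify numerically. The following snippet (illustrative only, not part of the original article) checks that the Fisher information peaks at $x = \theta$ with value $1/4$ and evaluates $H^*$:

```python
import numpy as np

theta = 30.0
x = np.linspace(0.0, 100.0, 100_001)
info = np.exp(theta - x) / (1.0 + np.exp(theta - x))**2  # I_x(theta)
print(x[np.argmax(info)], info.max())                    # ~ theta, 0.25
H_star = -0.5 * np.log(0.25) + 0.5 * np.log(2 * np.pi * np.e)
print(H_star)                                            # ~ 0.693 + 1.419 = 2.11
```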
Example 5.2 (Offline design).

A rigorous study of the optimal offline design is beyond the scope of the present article, so we will not go into detailed proofs here but only sketch the general ideas. Suffice it to say that for an offline design optimizing the expected utility $\mathrm{E}(\mathrm{H}_t(\Theta))$, one cannot do much better than to use the usual strategy of placing the trials evenly on the interval $[0, 100]$. In that case,
$$B_t \xrightarrow{a.s.} \frac{1}{100}\int_0^{100} I_x(\theta)\,\mathrm{d}x = \frac{1}{100}\Bigl(\frac{1}{1+e^{\theta-100}} - \frac{1}{1+e^{\theta}}\Bigr) \in [0.005,\ 0.01],$$
where $B_t = -t^{-1}\nabla^2_\theta\log p(Y_t\mid\theta)$, and it can be shown that the asymptotic posterior entropy satisfies
$$\mathrm{H}_t(\Theta) + \frac{n}{2}\log t - \Bigl[-\frac12\log\det(B_t) + \frac{n}{2}\log(2\pi e)\Bigr] \xrightarrow{a.s.} 0,$$
which, as $\det(B_t) \le 0.01$ asymptotically, yields the lower bound
$$\liminf_{t\to\infty}\Bigl[\mathrm{H}_t(\Theta) + \frac{n}{2}\log t\Bigr] \ge -\frac12\log 0.01 + \frac{n}{2}\log(2\pi e)$$
on the posterior entropy. Comparing to the asymptotically optimal posterior entropy (5.1), it follows that the offline design needs asymptotically at least $e^{2(2.30 - 0.69)/n} = 25$ times as many trials as the optimal adaptive design for the same accuracy (here $n = 1$). If the range $[0, 100]$ were wider, the gap would be even larger.
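Under the same illustrative setup as the previous snippet (again a sketch, not the paper's code), the offline average information and the resulting trial-count ratio can be checked as follows:

```python
import numpy as np

theta = 30.0
B_offline = (1/(1 + np.exp(theta - 100)) - 1/(1 + np.exp(theta))) / 100
print(B_offline)         # ~ 0.01 for theta well inside [0, 100]
print(0.25 / B_offline)  # ~ 25x more trials needed than the adaptive design
```

The ratio $0.25/0.01 = 25$ matches the entropy-based count $e^{2(2.30-0.69)}$, since for $n = 1$ a fixed entropy difference $\Delta$ translates into a factor $e^{2\Delta}$ in the number of trials.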
Example 5.3 (Varying cost of observation).

Let us then return to the adaptive case and suppose that instead of a unit cost, each trial costs $C_x = 1 + 3[Y_x = 0]$ units. Such a formulation could be based on the assumption that the observer takes four times as long to respond when the stimulus is not detected. Then, the asymptotic efficiency of a placement $x$ in Theorem 4.3 is characterized by the expression
$$\frac{I_x(\theta)}{\mathrm{E}(C_x\mid\theta)} = \frac{I_x(\theta)}{1 + 3[1 - \psi(x-\theta)]} = \frac{1}{5 + 5\cosh(x-\theta) - 3\sinh(x-\theta)}. \qquad (5.2)$$
This expression is maximized by the placement $x = \theta + \log 2$, to which the myopic algorithm eventually converges (provided it is within the range $[0, 100]$, i.e., $\theta \le 100 - \log 2 \approx 99.3$), yielding $B^* = 1/9$. Comparing to the asymptotically optimal placement $x = \theta$ for the unit cost (yielding $B^* = 1/10$ in (5.2)), we see that the cost-aware strategy reaches the same accuracy in 10% less cost (time) in this example.
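A direct numerical check of the maximizer (again an illustration under the same assumed setup) is:

```python
import numpy as np

theta = 30.0
x = np.linspace(0.0, 100.0, 1_000_001)
v = x - theta
ratio = 1.0 / (5.0 + 5.0*np.cosh(v) - 3.0*np.sinh(v))  # I_x / E(C_x | theta)
print(x[np.argmax(ratio)] - theta)    # ~ log(2) = 0.693
print(ratio.max())                    # ~ 1/9
print(1.0 / (5.0 + 5.0*np.cosh(0.0))) # 1/10 at the unit-cost optimum x = theta
```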
6. Discussion
We have derived an expression for the asymptotic efficiency of any sequential experiment design, both for the standard framework with a unit cost of observation and for the generalized framework with random costs of observation as proposed in Kujala [5]. We have shown an asymptotic D-optimality result for the greedy information maximization strategy in the standard framework, and we have extended this result to the novel myopic strategy proposed in Kujala [5] for the situation with random costs of observation. These results indicate that for (almost) all true parameter values $\theta$, the greedy or myopic adaptive design is asymptotically optimal among all placement strategies in a well-defined sense.

Assuming the standard sequential estimation framework with a unit cost of observation, Lemma 3.5 together with the asymptotic normality result implies that the asymptotic efficiency of any given design is characterized by the average
$$\frac{1}{t}\sum_{k=1}^t I_{X_k}(\theta)$$
of the Fisher information matrices $I_x(\theta)$ over the sequence of placements $X_t$, and the D-optimality criterion of a design refers to the maximality of the determinant of this averaged information matrix at the limit. For any given $\theta$, there is a distribution (or sequence) of placements $x \in X$ yielding the D-optimal average information matrix. For (almost) all $\theta$, the placements of the greedy adaptive design converge to such an optimum, whereas the offline design cannot adjust the distribution of the placements $x \in X$ depending on the true value $\theta$. Thus, the offline design can be equally efficient for a given true value of $\Theta$, but generally not for all values $\theta \in O$, and depending on the model, the gap in efficiency can be arbitrarily large, as seen in Example 5.2.

The situation is essentially the same in the framework with random costs of observation, the only difference being that the convergence of the estimate of $\Theta$ is not measured in relation to $t$ but in relation to the total cost $C_t = C_{X_1} + \cdots + C_{X_t}$ of the placements. In this situation, the asymptotic efficiency is characterized by the ratio
$$\frac{\sum_{k=1}^t I_{X_k}(\theta)}{\sum_{k=1}^t \mathrm{E}(C_{X_k}\mid\theta)}$$
and the limit is again determined by the distribution (or sequence) of the placements $x \in X$. Theorem 4.3 shows that the myopic strategy of maximizing
$$\frac{\mathrm{I}_t(\Theta; Y_x)}{\mathrm{E}_t(C_x)}$$
yields the asymptotically D-optimal efficiency in this situation.

However, the actual utility function assumed in both of the frameworks considered is the differential entropy, and so the most relevant asymptotic optimality criterion should be based on the asymptotic properties of the differential entropy, as shown in, for example, Corollaries 4.5 and 4.8. Thus, a topic for future work is finding conditions under which the results of Corollaries 4.5 and 4.8 can be said to be optimal among all placement strategies.

Appendix: Auxiliary theorems
Theorem A.1 (Stone–Čech compactification).
Suppose that $X$ is a Tychonoff space. Then there exists a compact Hausdorff space $\beta X$ that embeds $X$ as a dense subspace. Any continuous map $f: X \to K$, where $K$ is a compact Hausdorff space, lifts uniquely to a continuous map $\beta X \to K$.
Let $X_k$ be a submartingale (i.e., $\mathrm{E}(X_{k+1}\mid X_1,\dots,X_k) \ge X_k$) and suppose that $\sup_k \mathrm{E}|X_k| < \infty$. Then, $X = \lim_{k\to\infty}X_k$ exists almost surely and $\mathrm{E}|X| < \infty$.

Proof.
For example, [11], Theorem B.117, page 648, or [12], Theorem 1, page 508. (cid:3)
Theorem A.3 (A strong law of large numbers for martingales).
Let $X_k = Z_1 + \cdots + Z_k$ be a martingale and let $\delta > 1/2$. If
$$\sum_{k=1}^{\infty}\frac{\mathrm{E}(|Z_k|^2)}{k^{2\delta}} < \infty,$$
then $X_k/k^{\delta} \xrightarrow{a.s.} 0$.

Proof.
For example, [2] or [12], Theorem 4, page 519. (cid:3)
Theorem A.4 (Hoeffding–Azuma inequality).
Let $X_k$ be a martingale and suppose that $|X_k - X_{k-1}| \le c_k$ for all $k$. Then, for all $t > 0$ and $n \in \mathbb{N}$,
$$\mathrm{P}\{X_n - X_0 \ge t\} \le \exp\Bigl(\frac{-t^2}{2\sum_{k=1}^n c_k^2}\Bigr)$$
and
$$\mathrm{P}\{|X_n - X_0| \ge t\} \le 2\exp\Bigl(\frac{-t^2}{2\sum_{k=1}^n c_k^2}\Bigr).$$

Proof.
See [4], Theorem 2 and note around (2.18) on page 18, or [1]. (cid:3)
Acknowledgements
The author is grateful to Matti Vihola for many stimulating discussions. This research was supported by the Academy of Finland (grant number 121855).
References

[1] Azuma, K. (1967). Weighted sums of certain dependent random variables. Tôhoku Math. J. (2) 19 357–367.
[2] Chow, Y.S. (1967). On a strong law of large numbers for martingales. Ann. Math. Statist. 38 610.
[3] Cover, T.M. and Thomas, J.A. (2006). Elements of Information Theory, 2nd ed. Hoboken, NJ: Wiley. MR2239987
[4] Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58 13–30.
[5] Kujala, J.V. (2010). Obtaining the best value for money in adaptive sequential estimation. J. Math. Psych. 54 475–480.
[6] Kujala, J.V. (2012). Bayesian adaptive estimation: A theoretical review. In Descriptive and Normative Approaches to Human Behavior (E.N. Dzhafarov and L. Perry, eds.). Adv. Ser. Math. Psychol. Hackensack, NJ: World Scientific.
[7] Kujala, J.V. and Lukka, T.J. (2006). Bayesian adaptive estimation: The next dimension. J. Math. Psych. 50 369–389.
[8] Lindley, D.V. (1956). On a measure of the information provided by an experiment. Ann. Math. Statist. 27 986–1005.
[9] MacKay, D.J.C. (1992). Information-based objective functions for active data selection. Neural Comput. 4 590–604.
[10] Paninski, L. (2005). Asymptotic theory of information-theoretic experimental design. Neural Comput. 17 1480–1507.
[11] Schervish, M.J. (1995). Theory of Statistics. Springer Series in Statistics. New York: Springer. MR1354146
[12] Shiryaev, A.N. (1996). Probability, 2nd ed. Graduate Texts in Mathematics 95. New York: Springer. MR1368405
[13] van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge Univ. Press. MR1652247