Monotonic convergence of a general algorithm for computing optimal designs
The Annals of Statistics, 2010, Vol. 38, No. 3, 1593–1606
© Institute of Mathematical Statistics, 2010
By Yaming Yu
University of California, Irvine
Monotonic convergence is established for a general class of multiplicative algorithms introduced by Silvey, Titterington and Torsney [Comm. Statist. Theory Methods 7 (1978) 1379–1389] for computing optimal designs. A conjecture of Titterington [Appl. Stat. 27 (1978) 227–234] is confirmed as a consequence. Optimal designs for logistic regression are used as an illustration.
1. A general class of algorithms.
Optimal experimental design (approximate theory) is a well-developed area, and we refer to Kiefer (1974), Silvey (1980), Pázman (1986) and Pukelsheim (1993) for a general introduction and basic results. We consider computational aspects of optimal design, focusing on a finite design space $\mathcal{X} = \{x_1, \ldots, x_n\}$. Suppose the probability density or mass function of the response is specified as $p(y \mid x, \theta)$, where $\theta = (\theta_1, \ldots, \theta_m)^\top$ is the parameter of interest. Let $A_i$ denote the $m \times m$ expected Fisher information matrix from a unit assigned to $x_i$, with $(j,k)$ entry [the expectation is with respect to $p(y \mid x_i, \theta)$]
\[
A_i(j,k) = E\left[ \frac{\partial \log p(y \mid x_i, \theta)}{\partial \theta_j}\, \frac{\partial \log p(y \mid x_i, \theta)}{\partial \theta_k} \right].
\]
The moment matrix, as a function of the design measure $w = (w_1, \ldots, w_n)$, is defined as
\[
M(w) = \sum_{i=1}^n w_i A_i,
\]
which is proportional to the Fisher information for $\theta$ when the number of units assigned to $x_i$ is proportional to $w_i$. Here $w \in \bar{\Omega}$, where $\bar{\Omega}$ denotes the closure of $\Omega = \{w : w_i > 0,\ \sum_{i=1}^n w_i = 1\}$.

Received May 2009; revised October 2009.
AMS 2000 subject classifications.
Key words and phrases. A-optimality, auxiliary variables, c-optimality, D-optimality, experimental design, generalized linear models, multiplicative algorithm.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2010, Vol. 38, No. 3, 1593–1606. This reprint differs from the original in pagination and typographic detail.
Throughout we assume that the $A_i$ are well defined and hence nonnegative definite. The set
\[
\Omega^+ \equiv \{w \in \bar{\Omega} : M(w) > 0\}
\]
is assumed nonempty. Our approach may conceivably extend to the case where $M(w)$ is allowed to be singular, by using generalized inverses, although we do not pursue this here.

Given an optimality criterion $\phi$ defined on positive definite matrices, the goal is to maximize $\phi(M(w))$ with respect to $w \in \Omega^+$. Typical optimality criteria include:

(i) the D-criterion $\phi_0(M) = \log\det(M)$;
(ii) the A-criterion $\phi_{-1}(M) = -\operatorname{tr}(M^{-1})$;
(iii) more generally, the $p$th mean criterion $\phi_p(M) = -\operatorname{tr}(M^p)$, $p < 0$;
(iv) the c-criterion $\phi_{-1,c}(M) = -c^\top M^{-1} c$, where $c$ is a nonzero constant vector.

Often only a linear combination $K^\top\theta$, for example, a subvector of $\theta$, is of interest. The Fisher information for $K^\top\theta$ is naturally defined as $(K^\top M^{-1} K)^{-1}$, assuming invertibility [Pukelsheim (1993)]. We may therefore consider the D- and A-criteria for $K^\top\theta$ defined, respectively, as
\[
\phi_{0,K}(M) = -\log\det(K^\top M^{-1} K); \qquad \phi_{-1,K}(M) = -\operatorname{tr}(K^\top M^{-1} K). \tag{1}
\]
The c-criterion is a special case of $\phi_{-1,K}(M)$. Motivations for such optimality criteria are well known. In a linear problem, the A-criterion seeks to minimize the sum of variances of the best linear unbiased estimators (BLUEs) for all coordinates of $\theta$, while the c-criterion seeks to minimize the variance of the BLUE for $c^\top\theta$. Similar interpretations (with asymptotic arguments) apply to nonlinear problems.

In general $M(w)$ also depends on the unknown parameter $\theta$, which complicates the definition of an optimality criterion. A simple solution is to maximize $\phi(M(w))$ with $\theta$ fixed at a prior guess $\theta^*$; this leads to local optimality [Chernoff (1953)]. Local optimality may be criticized for ignoring uncertainty in $\theta$. However, in a situation where real prior information is available, or where the dependence of $M$ on $\theta$ is weak, it is nevertheless a viable approach and has been adopted routinely [see, e.g., Li and Majumdar (2008)]. Henceforth we assume a fixed $\theta^*$ and suppress the dependence of $M$ on $\theta$. Possible extensions are mentioned in Section 5.

Optimal designs do not usually come in closed form. As early as Wynn (1972), Fedorov (1972), Atwood (1973) and Wu and Wynn (1978), and as late as Torsney (2007), Harman and Pronzato (2007) and Dette, Pepelyshev and Zhigljavsky (2008), various procedures have been studied for numerical computation. We shall focus on the following multiplicative algorithm [Titterington (1976, 1978), Silvey, Titterington and Torsney (1978)], which is specified through a power parameter $\lambda \in (0, 1]$.

Algorithm I.
Set $\lambda \in (0, 1]$ and $w^{(0)} \in \Omega$. For $t = 0, 1, \ldots$, compute
\[
w_i^{(t+1)} = \frac{w_i^{(t)}\, d_i^{\lambda}(w^{(t)})}{\sum_{j=1}^n w_j^{(t)}\, d_j^{\lambda}(w^{(t)})}, \qquad i = 1, \ldots, n, \tag{2}
\]
where
\[
d_i(w) = \operatorname{tr}(\phi'(M(w)) A_i), \qquad \phi'(M) \equiv \frac{\partial\phi(M)}{\partial M}.
\]
Iterate until convergence.

For a heuristic explanation, observe that (2) is equivalent to
\[
w_i^{(t+1)} \propto w_i^{(t)} \biggl(\frac{\partial\phi(M(w))}{\partial w_i}\bigg|_{w = w^{(t)}}\biggr)^{\lambda}, \qquad i = 1, \ldots, n. \tag{3}
\]
The value of $\partial\phi(M(w))/\partial w_i$ indicates the amount of gain in information, as measured by $\phi$, from a slight increase in $w_i$, the weight on the $i$th design point. So (3) can be seen as adjusting $w$ so that relatively more weight is placed on design points whose increased weight may result in a larger gain in $\phi$. If $\phi$ is increasing and concave, then a convenient convergence criterion, based on the general equivalence theorem [Kiefer and Wolfowitz (1960), Whittle (1973)], is
\[
\max_{1 \le i \le n} d_i(w^{(t)}) \le (1 + \delta)\, \bar{d}(w^{(t)}), \tag{4}
\]
where $\bar{d}(w) \equiv \sum_{i=1}^n w_i d_i(w)$ and $\delta$ is a small positive constant.
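To make the update concrete, here is a minimal Python sketch of Algorithm I (our illustration, not code from the paper; the names `algorithm_I` and `phi_grad` are ours). It implements update (2) for a user-supplied matrix gradient $\phi'(M)$ and stops using criterion (4); it assumes an increasing concave $\phi$, so that the $d_i$ are nonnegative.

```python
import numpy as np

def algorithm_I(A, phi_grad, lam=1.0, w0=None, delta=1e-6, max_iter=100000):
    """Multiplicative algorithm (2) with power parameter lam in (0, 1].

    A        : array of shape (n, m, m), the per-point information matrices A_i
    phi_grad : callable M -> dphi/dM (an m x m matrix); for the D-criterion
               phi_0 = log det M, this is M -> inv(M)
    Stops via the equivalence-theorem criterion (4):
    max_i d_i(w) <= (1 + delta) * sum_i w_i d_i(w).
    """
    n = A.shape[0]
    w = np.full(n, 1.0 / n) if w0 is None else np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        M = np.einsum('i,ijk->jk', w, A)       # moment matrix M(w)
        C = phi_grad(M)                        # phi'(M)
        d = np.einsum('jk,ikj->i', C, A)       # d_i(w) = tr(phi'(M) A_i)
        if d.max() <= (1 + delta) * (w @ d):   # criterion (4)
            break
        w = w * d**lam                         # update (2), up to normalization
        w /= w.sum()
    return w

# Example usage: locally D-optimal weights, phi'(M) = M^{-1}.
# w = algorithm_I(A, phi_grad=np.linalg.inv, lam=1.0)
```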
Algorithm I is remarkable in its generality. For example, little restriction is placed on the underlying model $p(y \mid x, \theta)$. Part of the reason, of course, is that we focus on Fisher information and local optimality, which essentially reduces the problem to a linear one.

There exists a large literature on Algorithm I and its relatives [see, e.g., Titterington (1976, 1978), Silvey, Titterington and Torsney (1978), Pázman (1986), Fellman (1989), Pukelsheim and Torsney (1991), Torsney and Mandal (2006), Harman and Pronzato (2007), Dette, Pepelyshev and Zhigljavsky (2008) and Torsney and Martín-Martín (2009)]. One feature that has attracted much attention is that Algorithm I appears to be monotonic, that is, $\phi(M(w^{(t)}))$ increases in $t$, at least in some special cases. For example, when $\phi = \phi_0$ (for D-optimality) and $\lambda = 1$, Titterington (1976) and Pázman (1986) have shown monotonicity using clever probabilistic and analytic inequalities [see also Dette, Pepelyshev and Zhigljavsky (2008) and Harman and Trnovská (2009)]. Algorithm I is also known to be monotonic for $\phi = \phi_{-1,K}$ as in (1), assuming $\lambda = 1/2$ and rank-one $A_i$ [Fellman (1974), Torsney (1983)]. Monotonicity is important because convergence then holds under mild assumptions (see Section 4). Results in these special cases suggest a monotonic convergence theory for a broad class of $\phi$, which is also supported by numerical evidence presented in some of the references above.
2. Main result.
We aim to state general conditions on $\phi$ that ensure that Algorithm I converges monotonically. As a consequence, certain known theoretical results are unified and generalized, and one particular conjecture [Titterington (1978)] is confirmed. Define
\[
\psi(M) \equiv -\phi(M^{-1}), \qquad M > 0.
\]
The functions $\phi$ and $\psi$ are assumed to be differentiable on invertible matrices. Our conditions are conveniently stated in terms of $\psi$. As usual, for two symmetric matrices, $M_1 \le (<)\ M_2$ means $M_2 - M_1$ is nonnegative (positive) definite.

• $\psi(M)$ is increasing:
\[
0 < M_1 \le M_2 \implies \psi(M_1) \le \psi(M_2) \tag{5}
\]
or, equivalently, $\psi'(M)$ is nonnegative definite for positive definite $M$.

• $\psi(M)$ is concave:
\[
\alpha\,\psi(M_1) + (1 - \alpha)\,\psi(M_2) \le \psi(\alpha M_1 + (1 - \alpha) M_2) \tag{6}
\]
for $\alpha \in [0, 1]$ and $M_1, M_2 > 0$. Equivalently,
\[
\psi(M_1) \le \psi(M_2) + \operatorname{tr}(\psi'(M_2)(M_1 - M_2)), \qquad M_1, M_2 > 0. \tag{7}
\]

Condition (5) is usually satisfied by any reasonable information criterion [Pukelsheim (1993)]. Also note that, if (5) fails, then $\partial\phi(M(w))/\partial w_i$ on the right-hand side of (3) is not even guaranteed to be nonnegative. The real restriction is the concavity condition (6). For example, (6) is not satisfied by $\psi_p(M) = -\phi_p(M^{-1})$ (the $p$th mean criterion) when $p < -1$. [It is usually assumed that $\phi(M)$, rather than $\psi(M)$, is concave.] Nevertheless, (6) is satisfied by a wide range of criteria, including the commonly used D-, A- or c-criteria [see cases (i) and (ii) in the illustration of the main result below].
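As a quick check (ours, not in the original), these conditions can be verified directly for the D-criterion, for which $\psi_0(M) \equiv -\phi_0(M^{-1}) = \log\det M$. The gradient $\psi_0'(M) = M^{-1}$ is positive definite, giving (5); and writing $\mu_1, \ldots, \mu_m > 0$ for the eigenvalues of $M_2^{-1/2} M_1 M_2^{-1/2}$,
\[
\psi_0(M_1) - \psi_0(M_2) = \log\det(M_2^{-1/2} M_1 M_2^{-1/2}) = \sum_{j=1}^m \log\mu_j \le \sum_{j=1}^m (\mu_j - 1) = \operatorname{tr}(M_2^{-1}(M_1 - M_2)),
\]
which is exactly (7) with $\psi_0'(M_2) = M_2^{-1}$.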
Our main result is as follows.

Theorem 1 (General monotonicity). Assume (5) and (6). Assume that in iteration (2), with $0 < \lambda \le 1$, we have $M(w^{(t)}) > 0$, $\phi'(M(w^{(t)})) \neq 0$ and $M(w^{(t+1)}) > 0$. Then
\[
\phi(M(w^{(t+1)})) \ge \phi(M(w^{(t)})).
\]

In other words, under mild conditions which ensure that (2) is well defined [specifically, the denominator in (2) is nonzero], (5) and (6) imply that (2) never decreases the criterion $\phi$. Let us illustrate Theorem 1 with some examples. For simplicity, in (i)–(iv) we display formulae for $\lambda = 1$ only, although monotonicity holds for all $\lambda \in (0, 1]$.

(i) Consider
\[
\phi_p(M) = \begin{cases} \log\det M, & p = 0; \\ -\operatorname{tr}(M^p), & p \in [-1, 0). \end{cases}
\]
Then $\psi_p(M) \equiv -\phi_p(M^{-1})$ satisfies (5) and (6). By Theorem 1, Algorithm I is monotonic for $\phi = \phi_p$, $p \in [-1, 0]$ ($p = 0$ and $p = -1$ give the D- and A-criteria, respectively) and $\lambda \in (0, 1]$. The iteration (2) reads
\[
w_i^{(t+1)} = w_i^{(t)}\, \frac{\operatorname{tr}(M^{p-1}(w^{(t)}) A_i)}{\operatorname{tr}(M^{p}(w^{(t)}))}, \qquad i = 1, \ldots, n.
\]

(ii) More generally, given a full-rank $m \times r$ matrix $K$ ($r \le m$), consider
\[
\psi_{p,K}(M^{-1}) \equiv -\phi_{p,K}(M) = \begin{cases} \log\det(K^\top M^{-1} K), & p = 0; \\ \operatorname{tr}((K^\top M^{-1} K)^{-p}), & p \in [-1, 0). \end{cases}
\]
Then $\psi_{p,K}(M)$ satisfies (5) and (6). By Theorem 1, Algorithm I is monotonic for $\phi = \phi_{p,K}$, $p \in [-1, 0]$, and the iteration (2) reads
\[
w_i^{(t+1)} = w_i^{(t)}\, \frac{\operatorname{tr}(M^{-1} K (K^\top M^{-1} K)^{-p-1} K^\top M^{-1} A_i)}{\operatorname{tr}((K^\top M^{-1} K)^{-p})}\bigg|_{M = M(w^{(t)})}. \tag{8}
\]

(iii) In particular, taking $r = 1$ and $K = c$ (an $m \times 1$ vector) with $p = -1$ yields the c-criterion $\phi_{-1,c}$. The iteration (8) reduces to
\[
w_i^{(t+1)} = w_i^{(t)}\, \frac{c^\top M^{-1}(w^{(t)}) A_i M^{-1}(w^{(t)}) c}{c^\top M^{-1}(w^{(t)}) c}, \qquad i = 1, \ldots, n.
\]
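As an illustration of example (iii), a one-step implementation of the c-criterion update might look as follows (a sketch under our own naming, not the paper's code):

```python
import numpy as np

def c_update(w, A, c):
    """One step of iteration (8) specialized to the c-criterion (example (iii)),
    i.e. phi = phi_{-1,c}, displayed here for lam = 1."""
    M = np.einsum('i,ijk->jk', w, A)                   # M(w)
    Minv_c = np.linalg.solve(M, c)                     # M^{-1} c
    num = np.einsum('j,ijk,k->i', Minv_c, A, Minv_c)   # c^T M^{-1} A_i M^{-1} c
    den = c @ Minv_c                                   # c^T M^{-1} c
    w_new = w * num / den
    return w_new / w_new.sum()                         # guards against rounding only
```

Since $\sum_i w_i\, c^\top M^{-1} A_i M^{-1} c = c^\top M^{-1} c$, the new weights already sum to one; the final normalization only absorbs floating-point error.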
As noted by a referee, with $p = -1$, the choice $\lambda = 1$ may lead to an oscillating behavior, in the sense that $w^{(t)}$ alternates between two points at which $\phi_{-1,c}(M(w))$ takes the same value. While this does not contradict Theorem 1, it suggests that other values of $\lambda$ are more desirable for fast convergence. Following Fellman (1974) and Torsney (1983), a practical recommendation is $\lambda = 1/2$ for $p = -1$.

(iv) Consider $p = 0$, $r = m - 1$ and $K = (0_r, I_r)^\top$. Henceforth $0_r$ denotes the $r \times 1$ zero vector and $I_r$ denotes the $r \times r$ identity matrix. Assume $A_i = x_i x_i^\top$, where $x_i^\top = (1, z_i^\top)$ and $z_i$ is $(m-1) \times 1$. That is, we consider D-optimality for $(\theta_2, \ldots, \theta_m)$ under the linear model
\[
y \mid (x, \theta) \sim \mathrm{N}(x^\top\theta, \sigma^2), \qquad x^\top = (1, z^\top),
\]
where the parameter is $\theta = (\theta_1, \theta_2, \ldots, \theta_m)^\top$. That is, interest centers on all coefficients other than the intercept. Nevertheless, as far as the design measure $w$ is concerned, the optimality criterion $\phi_{0,K}(M)$ coincides with $\phi_0(M)$, that is,
\[
-\log\det(K^\top M^{-1}(w) K) = \log\det M(w).
\]
After some algebra, (8) reduces to
\[
w_i^{(t+1)} = w_i^{(t)}\, \frac{(z_i - \bar{z})^\top M_c^{-1}(w^{(t)})(z_i - \bar{z})}{m - 1}, \qquad i = 1, \ldots, n, \tag{9}
\]
where
\[
\bar{z} = \sum_{i=1}^n w_i^{(t)} z_i; \qquad M_c(w^{(t)}) = \sum_{i=1}^n w_i^{(t)} (z_i - \bar{z})(z_i - \bar{z})^\top.
\]
Thus (9) satisfies $\det M(w^{(t+1)}) \ge \det M(w^{(t)})$ (a numerical sketch of iteration (9) is given at the end of this section).

Monotonicity of (9) has been conjectured since Titterington (1978), and considerable numerical evidence has accumulated over the years. Recently, extending the arguments of Pázman (1986), Dette, Pepelyshev and Zhigljavsky (2008) obtained results which come very close to resolving Titterington's conjecture. Nevertheless, we have been unable to extend their arguments further. Instead we prove the general Theorem 1 using a different approach, and settle this conjecture as a consequence.

The proof of Theorem 1 is achieved by using a method of auxiliary variables. When a function $f(w)$ [e.g., $-\det M(w)$] to be minimized is complicated, we introduce a new variable $Q$ and a function $g(w, Q)$ such that $\min_Q g(w, Q) = f(w)$ for all $w$, thus transforming the problem into minimizing $g(w, Q)$ over $w$ and $Q$ jointly. Then we may use an iterative conditional minimization strategy on $g(w, Q)$. This is inspired by the EM algorithm [Dempster, Laird and Rubin (1977), Meng and van Dyk (1997); in particular, see Csiszár and Tusnády's (1984) interpretation; see Yu (2008) for a related interpretation of the data augmentation algorithm].

In Section 3 we analyze Algorithm I using this strategy. Although attention is paid to the mathematics, our focus is on intuitively appealing interpretations which may lead to further extensions of Algorithm I with the same desirable monotonicity properties. If the algorithm is monotonic, then convergence can be established under mild conditions (Section 4). Section 5 contains an illustration with optimal designs for a simple logistic regression model.
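As promised above, here is a minimal sketch of iteration (9) (our own code; `titterington_step` is a hypothetical name). `Z` holds the points $z_i$ as rows and `w` is the current design measure:

```python
import numpy as np

def titterington_step(w, Z):
    """One step of iteration (9): D-optimality for all coefficients except the
    intercept, with A_i = x_i x_i^T, x_i = (1, z_i^T)^T.  Z has shape (n, m-1)."""
    zbar = w @ Z                                 # weighted mean of the z_i
    Zc = Z - zbar                                # centered design points z_i - zbar
    Mc = Zc.T @ (w[:, None] * Zc)                # centered moment matrix M_c(w)
    q = np.einsum('ij,ij->i', Zc @ np.linalg.inv(Mc), Zc)  # (z_i-zbar)^T Mc^{-1} (z_i-zbar)
    return w * q / Zc.shape[1]                   # divide by m - 1
```

Since $\sum_i w_i^{(t)} q_i = \operatorname{tr}(M_c^{-1} M_c) = m - 1$, the returned weights automatically sum to one.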
3. Explaining the monotonicity.
A key observation is that the problem of maximizing $\phi(M(w))$ or, equivalently, minimizing $\psi(M^{-1}(w))$, can be formulated as a joint minimization over both the design and the estimator. Specifically, let us compare the original Problem P1 with its companion Problem P2. Throughout, $A^{1/2}$ denotes the symmetric nonnegative definite (SNND) square root of an SNND matrix $A$.

Problem P1.
Minimize $-\phi(M(w)) \equiv \psi((\sum_{i=1}^n w_i A_i)^{-1})$ over $w \in \Omega$.

Problem P2.
Minimize
\[
g(w, Q) \equiv \psi(Q \Delta_w Q^\top) \tag{10}
\]
over $w \in \Omega$ and $Q$ [an $m \times (mn)$ matrix], subject to $QG = I_m$, where
\[
\Delta_w \equiv \operatorname{Diag}(w_1^{-1}, \ldots, w_n^{-1}) \otimes I_m; \qquad G \equiv (A_1^{1/2}, \ldots, A_n^{1/2})^\top.
\]

Though not immediately obvious, Problems P1 and P2 are equivalent, and this may be explained in statistical terms as follows. In (10), $Q \Delta_w Q^\top$ is simply the variance matrix of a linear unbiased estimator, $QY$, of the $m \times 1$ parameter $\theta$ in the model
\[
Y = G\theta + \varepsilon, \qquad \varepsilon \sim \mathrm{N}(0, \Delta_w),
\]
where $Y$ is the $(mn) \times 1$ vector of observations; the constraint $QG = I_m$ ensures unbiasedness. [Note that $G$ is full-rank since $M(w)$ is nonsingular by assumption.] Of course, the weighted least squares (WLS) estimator is the best linear unbiased estimator, having the smallest variance matrix (in the sense of positive definite ordering) and, by (5), the smallest $\psi$ for that matrix. It follows that, for fixed $w$, $g(w, Q)$ is minimized by choosing $QY$ as the WLS estimator:
\[
g(w, \hat{Q}_{\mathrm{WLS}}) = \inf_{QG = I_m} g(w, Q), \tag{11}
\]
\[
\hat{Q}_{\mathrm{WLS}} = M^{-1}(w)(w_1 A_1^{1/2}, \ldots, w_n A_n^{1/2}). \tag{12}
\]
However, from (10) and (12) we get
\[
g(w, \hat{Q}_{\mathrm{WLS}}) = \psi(M^{-1}(w)). \tag{13}
\]
That is, Problem P2 reduces to Problem P1 upon minimizing over $Q$.

Since Problem P2 is not immediately solvable, it is natural to consider the subproblems: (i) minimizing $g(w, Q)$ over $Q$ for fixed $w$, and (ii) minimizing $g(w, Q)$ over $w$ for fixed $Q$. Part (ii) is again formulated as a joint minimization problem. For a fixed $m \times (mn)$ matrix $Q$ such that $QG = I_m$, let us consider Problems P3 and P4.
Problem P3.
Minimize $g(w, Q)$ as in (10) over $w \in \Omega$.

Problem P4.
Minimize the function
\[
h(\Sigma, w, Q) = \psi(\Sigma) + \operatorname{tr}(\psi'(\Sigma)(Q \Delta_w Q^\top - \Sigma)) \tag{14}
\]
over $w \in \Omega$ and the $m \times m$ positive definite matrix $\Sigma$.

The concavity assumption (7) implies that
\[
h(\Sigma, w, Q) \ge \psi(Q \Delta_w Q^\top) \tag{15}
\]
with equality when $\Sigma = Q \Delta_w Q^\top$; that is, Problem P4 reduces to Problem P3 upon minimizing over $\Sigma$.

Since Problem P4 is not immediately solvable, it is natural to consider the subproblems: (i) minimizing $h(\Sigma, w, Q)$ over $\Sigma$ for fixed $w$ and $Q$, and (ii) minimizing $h(\Sigma, w, Q)$ over $w$ for fixed $\Sigma$ and $Q$. Part (ii), which amounts to minimizing
\[
\operatorname{tr}(\psi'(\Sigma) Q \Delta_w Q^\top) = \operatorname{tr}(Q^\top \psi'(\Sigma) Q \Delta_w),
\]
admits a closed-form solution: if we write $Q = (Q_1, \ldots, Q_n)$, where each $Q_i$ is $m \times m$, then $w_i$ should be proportional to $\sqrt{\operatorname{tr}(Q_i^\top \psi'(\Sigma) Q_i)}$. But Algorithm I may not perform an exact minimization here [see (16)].

Based on the above discussion, we can express Algorithm I as an iterative conditional minimization algorithm involving $w$, $Q$ and $\Sigma$. At iteration $t$, define
\[
Q^{(t)} = (Q_1^{(t)}, \ldots, Q_n^{(t)}); \qquad Q_i^{(t)} = w_i^{(t)} M^{-1}(w^{(t)}) A_i^{1/2}, \quad i = 1, \ldots, n;
\]
\[
\Sigma^{(t)} = Q^{(t)} \Delta_{w^{(t)}} Q^{(t)\top} = M^{-1}(w^{(t)}).
\]
Then we have
\begin{align}
\psi(M^{-1}(w^{(t)})) &= g(w^{(t)}, Q^{(t)}) && \text{[by (13)]} \nonumber \\
&= h(\Sigma^{(t)}, w^{(t)}, Q^{(t)}) && \text{[by (14)]} \nonumber \\
&\ge h(\Sigma^{(t)}, w^{(t+1)}, Q^{(t)}) && \text{(see below)} \tag{16} \\
&\ge g(w^{(t+1)}, Q^{(t)}) && \text{[by (15), (10)]} \tag{17} \\
&\ge \psi(M^{-1}(w^{(t+1)})) && \text{[by (11), (13)]}. \tag{18}
\end{align}
The choice of $w^{(t+1)}$ leads to (16) as follows. After simple algebra, the iteration (2) becomes
\[
w_i^{(t+1)} = \frac{r_i^{\lambda}\, w_i^{1 - 2\lambda}}{\sum_{j=1}^n r_j^{\lambda}\, w_j^{1 - 2\lambda}}, \qquad i = 1, \ldots, n,
\]
where $w_i \equiv w_i^{(t)}$ and $r_i \equiv \operatorname{tr}(Q_i^{(t)\top} \psi'(\Sigma^{(t)}) Q_i^{(t)})$.
Since $0 < \lambda \le 1$, Jensen's inequality yields
\[
\Biggl(\sum_{i=1}^n \frac{r_i}{w_i}\Biggr)^{1-\lambda} \ge \sum_{i=1}^n w_i \biggl(\frac{r_i}{w_i^2}\biggr)^{1-\lambda}; \qquad \Biggl(\sum_{i=1}^n \frac{r_i}{w_i}\Biggr)^{\lambda} \ge \sum_{i=1}^n w_i \biggl(\frac{r_i}{w_i^2}\biggr)^{\lambda}.
\]
That is,
\[
\sum_{i=1}^n \frac{r_i}{w_i} \ge \Biggl(\sum_{i=1}^n r_i^{1-\lambda} w_i^{2\lambda - 1}\Biggr) \Biggl(\sum_{i=1}^n r_i^{\lambda} w_i^{1 - 2\lambda}\Biggr).
\]
Hence
\[
\operatorname{tr}(\psi'(\Sigma^{(t)}) Q^{(t)} \Delta_{w^{(t)}} Q^{(t)\top}) = \sum_{i=1}^n \frac{r_i}{w_i^{(t)}} \ge \sum_{i=1}^n \frac{r_i}{w_i^{(t+1)}} = \operatorname{tr}(\psi'(\Sigma^{(t)}) Q^{(t)} \Delta_{w^{(t+1)}} Q^{(t)\top}),
\]
which produces (16).
Choosing $\lambda = 1/2$, that is, $w_i^{(t+1)} \propto \sqrt{r_i}$, leads to exact minimization in (16); choosing $\lambda = 1$ yields equality in (16). But any choice of $w^{(t+1)}$ that decreases $h(\Sigma^{(t)}, w, Q^{(t)})$ at (16) would have resulted in the desired inequality,
\[
\psi(M^{-1}(w^{(t)})) \ge \psi(M^{-1}(w^{(t+1)})).
\]
We may allow $\lambda$ to change from iteration to iteration, and monotonicity still holds, as long as $\lambda \in (0, 1]$ at each iteration. Also note that we assume $w_i^{(t)}, w_i^{(t+1)} > 0$ for all $i$. This is not essential, however, because (i) the possibility of $w_i^{(t)} = 0$ can be handled by restricting our analysis to all design points $i$ such that $w_i^{(t)} > 0$, and (ii) the possibility of $w_i^{(t+1)} = 0$ can be handled by a standard limiting argument. Monotonicity holds as long as $M(w^{(t)})$ and $M(w^{(t+1)})$ are both positive definite, as noted in the statement of Theorem 1.
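To make the chain (16)–(18) tangible, here is a small numerical check (our sketch, not from the paper), specialized to the D-criterion, for which $\psi(S) = \log\det S$ and $\psi'(S) = S^{-1}$; the rank-one $A_i = x_i x_i^\top$ make the square roots $A_i^{1/2}$ easy to form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 6, 3, 0.7                        # any lam in (0, 1]
X = rng.normal(size=(n, m))
A = np.einsum('ij,ik->ijk', X, X)            # rank-one A_i = x_i x_i^T
Asqrt = A / np.linalg.norm(X, axis=1)[:, None, None]   # A_i^{1/2} = x_i x_i^T / |x_i|

psi  = lambda S: np.linalg.slogdet(S)[1]     # psi = log det  (D-criterion)
dpsi = lambda S: np.linalg.inv(S)            # psi'(S) = S^{-1}
moment = lambda w: np.einsum('i,ijk->jk', w, A)

def g(w, Q):                                 # g(w, Q) = psi(Q Delta_w Q^T), eq. (10)
    V = np.einsum('ijk,i,ilk->jl', Q, 1.0 / w, Q)
    return psi(V)

def h(S, w, Q):                              # h(Sigma, w, Q), eq. (14)
    V = np.einsum('ijk,i,ilk->jl', Q, 1.0 / w, Q)
    return psi(S) + np.trace(dpsi(S) @ (V - S))

w = rng.dirichlet(np.ones(n))
Minv = np.linalg.inv(moment(w))
Q = w[:, None, None] * (Minv @ Asqrt)        # Q_i = w_i M^{-1} A_i^{1/2}
r = np.einsum('ijk,jl,ilk->i', Q, dpsi(Minv), Q)   # r_i = tr(Q_i^T psi'(Sigma) Q_i)
w_new = r**lam * w**(1 - 2 * lam)
w_new /= w_new.sum()                         # the rewritten update (2)

vals = [psi(Minv), h(Minv, w_new, Q), g(w_new, Q),
        psi(np.linalg.inv(moment(w_new)))]
assert all(a >= b - 1e-10 for a, b in zip(vals, vals[1:]))   # chain (16)-(18)
```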
4. Global convergence.
Monotonicity (Theorem 1) plays an important role in the following convergence theorem.
Theorem 2 (Global convergence).
Denote the mapping (2) by $T$.

(a) Assume $\phi'(M(w)) \ge 0$ and $\phi'(M(w)) A_i \neq 0$ for all $w \in \Omega^+$ and $i = 1, \ldots, n$.

(b) Assume (2) is strictly monotonic, that is,
\[
w \in \Omega^+,\ Tw \neq w \implies \phi(M(Tw)) > \phi(M(w)). \tag{19}
\]

(c) Assume $\phi$ is strictly concave and $\phi'$ is continuous on positive definite matrices.

(d) Assume that, if $M$ (a positive definite matrix) tends to $M^*$ such that $\phi(M)$ increases monotonically, then $M^*$ is nonsingular.

Let $w^{(t)}$ be generated by (2) with $w_i^{(0)} > 0$ for all $i$. Then: (i) all limit points of $w^{(t)}$ are global maxima of $\phi(M(w))$ on $\Omega^+$, and (ii) as $t \to \infty$, $\phi(M(w^{(t)}))$ increases monotonically to $\sup_{w \in \Omega^+} \phi(M(w))$.

The proof of Theorem 2 is somewhat subtle. Standard arguments show that all limit points of $w^{(t)}$ are fixed points of the mapping $T$. This alone does not imply convergence to a global maximum, however, because there often exist sub-optimal fixed points on the boundary of $\Omega$. (Global maxima occur routinely on the boundary also.) Our goal is therefore to rule out possible convergence to such sub-optimal points; details of the proof are presented in Yu (2009), an extended version of this paper. We shall comment on conditions (a)–(d).

Condition (a) ensures that, starting with $w^{(0)} \in \Omega^+$, all iterations are well defined. Moreover, if $w_i^{(0)} > 0$ for all $i$, then $w_i^{(t)} > 0$ for all $t$ and $i$. This highlights the basic idea that, in order to converge to a global maximum $w^*$, the starting value $w^{(0)}$ must assign positive weight to every support point of $w^*$. Such a requirement is not necessary for monotonicity. On the other hand, assigning weight to nonsupport points of $w^*$ tends to slow the algorithm down. Hence methods that quickly eliminate nonoptimal support points are valuable [Harman and Pronzato (2007)].

Condition (b) simply says that unless $w$ is a fixed point, the mapping $T$ should produce a better solution. Let us assume (5), (7) and condition (a), so that Theorem 1 applies. Then, by checking the equality condition in (16), it is easy to see that condition (b) is satisfied if $0 < \lambda < 1$. [The argument leading to (19) technically assumes that all coordinates of $w$ are nonzero, but we can apply it to the appropriate subvector of $w$.] If $\lambda = 1$, then (16) reduces to an equality. However, by checking the equality conditions in (17) and (18), we can show that condition (b) is satisfied if $\psi$ is strictly increasing and strictly concave:
\[
M_1 \ge M_2 > 0,\ M_1 \neq M_2 \implies \psi(M_2) < \psi(M_1); \tag{20}
\]
\[
M_1, M_2 > 0,\ M_1 \neq M_2 \implies \psi(M_1) < \psi(M_2) + \operatorname{tr}(\psi'(M_2)(M_1 - M_2)). \tag{21}
\]

Conditions (c) and (d) are technical requirements that concern $\phi$ alone. Condition (c) ensures uniqueness of the optimal moment matrix, which simplifies the analysis. Condition (d) ensures that positive definiteness of $M(w)$ is maintained in the limit. Conditions (c) and (d) are satisfied by $\phi = \phi_p$ with $p \le 0$, for example.

Let us mention a typical example of Theorem 2.
Corollary 1.
Assume $A_i \neq 0$ and $w_i^{(0)} > 0$ for $i = 1, \ldots, n$, and $M(w^{(0)}) > 0$. Then the conclusion of Theorem 2 holds for Algorithm I with $\phi = \phi_0$.

Proof.
Conditions (a), (c) and (d) are readily verified. Condition (b) is satisfied by (20) and (21). The claim follows from Theorem 2. $\square$
When (20) or (21) fails and $\lambda = 1$, it is often difficult to appeal to Theorem 2 because strict monotonicity [condition (b)] may not hold. We illustrate this with an example where the monotonicity is not strict, and the algorithm does not converge [see Pronzato, Wynn and Zhigljavsky (2000), Chapter 7; also the remark in case (iii) following Theorem 1]. Consider iteration (9) ($\lambda = 1$) with $n = m = 2$ and design space $\mathcal{X} = \{x_i = (1, z_i)^\top,\ i = 1, 2\}$, $z_1 = -z_2 = 1$. It is easy to show that, for any $w^{(t)} = (w_1, w_2) \in \Omega$, iteration (9) maps $w^{(t)}$ to $w^{(t+1)} = (w_2, w_1)$. Thus, unless $w_1 = w_2 = 1/2$, the iterates simply alternate between two designs with the same criterion value and never converge.
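This two-point oscillation is easy to reproduce numerically; the following sketch (ours) applies iteration (9) to the example just described:

```python
import numpy as np

# Oscillation example: n = m = 2, z_1 = -z_2 = 1, iteration (9) with lam = 1.
Z = np.array([[1.0], [-1.0]])
w = np.array([0.3, 0.7])
for t in range(4):
    zbar = w @ Z
    Zc = Z - zbar                                      # centered points
    Mc = Zc.T @ (w[:, None] * Zc)                      # here Mc = 4 w_1 w_2
    q = np.einsum('ij,ij->i', Zc @ np.linalg.inv(Mc), Zc)
    w = w * q                                          # m - 1 = 1
    print(t, w)                                        # alternates (0.7, 0.3) <-> (0.3, 0.7)
```

Here $\det M(w) = 4 w_1 w_2$ takes the same value at $(w_1, w_2)$ and $(w_2, w_1)$, so the criterion sequence is monotone but not strictly so, and the weights never converge.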
5. Further remarks and illustrations.
One can think of several reasons for the wide interest in Algorithm I and its relatives. Similar to the EM algorithm, Algorithm I is simple, easy to implement and monotonically convergent for a large class of optimality criteria (although this was not proved in the present generality). Algorithm I is known to be slow sometimes. But it serves as a foundation upon which more effective variants can be built [see, e.g., Harman and Pronzato (2007) and Dette, Pepelyshev and Zhigljavsky (2008)]. While solving the conjectured monotonicity of (9) holds mathematical interest, our main contribution is a way of interpreting such algorithms as optimization on augmented spaces. This opens up new possibilities in constructing algorithms with the same desirable monotonic convergence properties.

As a numerical example, consider the logistic regression model
\[
p(y \mid x, \theta) = \frac{\exp(y\, x^\top \theta)}{1 + \exp(x^\top \theta)}, \qquad y = 0, 1.
\]
The expected Fisher information for $\theta$ from a unit assigned to $x_i$ is
\[
A_i = \frac{\exp(x_i^\top \theta)}{(1 + \exp(x_i^\top \theta))^2}\, x_i x_i^\top.
\]
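In code, these per-point information matrices are easy to assemble. The sketch below is ours; the grid and the prior guess $\theta^* = (1, 1)^\top$ are stand-ins, since the paper's exact values are not reproduced in this reprint. The result could be fed to the `algorithm_I` sketch from Section 1:

```python
import numpy as np

def logistic_info(X, theta):
    """Per-point information matrices A_i = p_i (1 - p_i) x_i x_i^T for the
    logistic model, where p_i = exp(x_i^T theta) / (1 + exp(x_i^T theta))."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return np.einsum('i,ij,ik->ijk', p * (1 - p), X, X)

# Hypothetical setup: an equally spaced grid of points x_i = (1, z)^T.
z = np.linspace(0.1, 1.0, 10)
X = np.column_stack([np.ones_like(z), z])
A = logistic_info(X, theta=np.array([1.0, 1.0]))      # stand-in prior guess
# w = algorithm_I(A, phi_grad=np.linalg.inv, lam=1.0)  # locally D-optimal weights
```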
Fig. 1. Values of $\phi_0 = \log\det M$ and $\phi_{-2} = -\operatorname{tr}(M^{-2})$ for Algorithm I with design spaces $\mathcal{X}_1$ and $\mathcal{X}_2$.

We compute locally optimal designs with a fixed prior guess $\theta^*$ ($m = 2$) and two design spaces $\mathcal{X}_1$ and $\mathcal{X}_2$, each an equally spaced grid of points of the form $x_i = (1, i/k)^\top$, with $\mathcal{X}_2$ the finer of the two. The design criteria considered are $\phi_0$ (for D-optimality) and $\phi_{-2}$. We use Algorithm I with $\lambda = 1$, starting with equally weighted designs.

For $\phi_0$, Corollary 1 guarantees monotonic convergence. This is illustrated by Figure 1, first row, where $\phi_0 = \log\det M(w^{(t)})$ is plotted against iteration $t$. Under the convergence criterion (4) with a small $\delta$, a substantial number of iterations is required (2121 for $\mathcal{X}_2$). The actual locally D-optimal designs place weight $0.5$ on each of two support points, as can be verified using the general equivalence theorem. This simple example serves to illustrate both the monotonicity of Algorithm I (when Theorem 1 applies) and its potential slow convergence.

For $\phi_{-2}$, although Algorithm I can be implemented just as easily, Theorem 1 does not apply because the concavity condition (7) no longer holds. Indeed, Algorithm I (with $\lambda = 1$) is not monotonic, as is evident from Figure 1, second row, where $\phi_{-2} = -\operatorname{tr}(M^{-2}(w^{(t)}))$ is plotted against iteration $t$. This shows the potential danger of using Algorithm I when monotonicity is not guaranteed.
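For readers who wish to observe this behavior, the following sketch (ours) runs the plain $\lambda = 1$ update under $\phi_{-2}$ on a hypothetical logistic grid. Whether non-monotone steps actually occur depends on the grid and the prior guess, but Figure 1 shows that they do in the paper's setting:

```python
import numpy as np

# phi_{-2}(M) = -tr(M^{-2}) has matrix gradient phi'(M) = 2 M^{-3}.
inv = np.linalg.inv
phi_m2      = lambda M: -np.trace(inv(M) @ inv(M))
phi_m2_grad = lambda M: 2 * inv(M) @ inv(M) @ inv(M)

# Hypothetical logistic grid (the paper's exact grid is not reproduced here).
z = np.linspace(0.1, 1.0, 10)
X = np.column_stack([np.ones_like(z), z])
p = 1.0 / (1.0 + np.exp(-X @ np.array([1.0, 1.0])))
A = np.einsum('i,ij,ik->ijk', p * (1 - p), X, X)

w, vals = np.full(len(A), 1.0 / len(A)), []
for t in range(200):                         # plain lam = 1 updates under phi_{-2}
    M = np.einsum('i,ijk->jk', w, A)
    vals.append(phi_m2(M))
    d = np.einsum('jk,ikj->i', phi_m2_grad(M), A)
    w = w * d / (w @ d)
print(min(np.diff(vals)))                    # a negative value flags a non-monotone step
```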
Although Theorem 1 does not cover the $\phi_p$ criterion for $p < -1$, it is still possible that monotonicity holds for a smaller range of $\lambda$. Calculations in special cases lead to the conjecture [Silvey, Titterington and Torsney (1978)] that Algorithm I is monotonic if $0 < \lambda \le 1/(1 - p)$. Theorem 1 provides further evidence for this conjecture, but new insights are needed to resolve it.

We have focused on local optimality. An alternative, Bayesian optimality [Chaloner and Larntz (1989), Chaloner and Verdinelli (1995)], seeks to maximize the expected value of $\phi(M(\theta; w))$ over a prior distribution $\pi(\theta)$. The notation $M(\theta; w)$ emphasizes the dependence of the moment matrix on the parameter $\theta$. It would be worthwhile to extend our strategy in Section 3 to Bayesian optimality, and we plan to report both theoretical and empirical evaluations of such extensions in future work.

Acknowledgment.
The author would like to thank Don Rubin, Xiao-Li Meng and David van Dyk for introducing him to the field of statistical computing. He is also grateful to Mike Titterington, Ben Torsney and the referees for their valuable comments.

REFERENCES
Atwood, C. L. (1973). Sequences converging to D-optimal designs of experiments. Ann. Statist.
Chaloner, K. and Larntz, K. (1989). Optimal Bayesian design applied to logistic regression experiments. J. Statist. Plann. Inference.
Chaloner, K. and Verdinelli, I. (1995). Bayesian experimental design: A review. Statist. Sci.
Chernoff, H. (1953). Locally optimal design for estimating parameters. Ann. Math. Statist.
Csiszár, I. and Tusnády, G. (1984). Information geometry and alternating minimization procedures. Statist. Decisions (Suppl. 1) 205–237. MR0785210
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B.
Dette, H., Pepelyshev, A. and Zhigljavsky, A. (2008). Improving updating rules in multiplicative algorithms for computing D-optimal designs. Comput. Statist. Data Anal.
Fedorov, V. V. (1972). Theory of Optimal Experiments. Academic Press, New York. MR0403103
Fellman, J. (1974). On the allocation of linear observations (Thesis). Soc. Sci. Fenn. Comment. Phys.-Math.
Fellman, J. (1989). An empirical study of a class of iterative searches for optimal designs. J. Statist. Plann. Inference.
Harman, R. and Pronzato, L. (2007). Improvements on removing nonoptimal support points in D-optimum design algorithms. Statist. Probab. Lett.
Harman, R. and Trnovská, M. (2009). Approximate D-optimal designs of experiments on the convex hull of a finite set of information matrices. Math. Slovaca.
Kiefer, J. (1974). General equivalence theory for optimum designs (approximate theory). Ann. Statist.
Kiefer, J. and Wolfowitz, J. (1960). The equivalence of two extremum problems. Canad. J. Math.
Li, G. and Majumdar, D. (2008). D-optimal designs for logistic models with three and four parameters. J. Statist. Plann. Inference.
Meng, X.-L. and van Dyk, D. (1997). The EM algorithm: an old folk-song sung to a fast new tune (with discussion). J. Roy. Statist. Soc. Ser. B.
Pázman, A. (1986). Foundations of Optimum Experimental Design. Reidel, Dordrecht.
Pronzato, L., Wynn, H. and Zhigljavsky, A. (2000). Dynamical Search: Applications of Dynamical Systems in Search and Optimization. Chapman & Hall/CRC, Boca Raton.
Pukelsheim, F. (1993). Optimal Design of Experiments. Wiley, New York. MR1211416
Pukelsheim, F. and Torsney, B. (1991). Optimal weights for experimental designs on linearly independent support points. Ann. Statist.
Silvey, S. D. (1980). Optimal Design. Chapman & Hall, London. MR0606742
Silvey, S. D., Titterington, D. M. and Torsney, B. (1978). An algorithm for optimal designs on a finite design space. Comm. Statist. Theory Methods 7 1379–1389.
Titterington, D. M. (1976). Algorithms for computing D-optimal designs on finite design spaces. In Proc. of the 1976 Conf. on Information Science and Systems.
Titterington, D. M. (1978). Estimation of correlation coefficients by ellipsoidal trimming. Appl. Stat. 27 227–234.
Torsney, B. (1983). A moment inequality and monotonicity of an algorithm. In Proceedings of the International Symposium on Semi-Infinite Programming and Applications (K. O. Kortanek and A. V. Fiacco, eds.). Lecture Notes in Economics and Mathematical Systems. Springer.
Torsney, B. (2007). W-iterations and ripples therefrom. In Optimal Design and Related Areas in Optimization and Statistics (L. Pronzato and A. Zhigljavsky, eds.) 1–12. Springer, New York. MR2513344
Torsney, B. and Mandal, S. (2006). Two classes of multiplicative algorithms for constructing optimizing distributions. Comput. Statist. Data Anal.
Torsney, B. and Martín-Martín, R. (2009). Multiplicative algorithms for computing optimum designs. J. Statist. Plann. Inference.
Whittle, P. (1973). Some general points in the theory of optimal experimental design. J. Roy. Statist. Soc. Ser. B.
Wu, C. F. and Wynn, H. P. (1978). The convergence of general step-length algorithms for regular optimum design criteria. Ann. Statist.
Wynn, H. P. (1972). Results in the theory and construction of D-optimum experimental designs. J. Roy. Statist. Soc. Ser. B.
Yu, Y. (2008). A bit of information theory, and the data augmentation algorithm converges.