Monotonic convergence of a general algorithm for computing optimal designs
The Annals of Statistics, 2010, Vol. 38, No. 3, 1593–1606
© Institute of Mathematical Statistics, 2010
By Yaming Yu
University of California, Irvine
Monotonic convergence is established for a general class of multiplicative algorithms introduced by Silvey, Titterington and Torsney [Comm. Statist. Theory Methods 7 (1978) 1379–1389] for computing optimal designs. A conjecture of Titterington [Appl. Stat. 27 (1978) 227–234] is confirmed as a consequence. Optimal designs for logistic regression are used as an illustration.
1. A general class of algorithms.
Optimal experimental design (approximate theory) is a well-developed area, and we refer to Kiefer (1974), Silvey (1980), Pázman (1986) and Pukelsheim (1993) for a general introduction and basic results. We consider computational aspects of optimal design, focusing on a finite design space $\mathcal{X} = \{x_1, \ldots, x_n\}$. Suppose the probability density or mass function of the response is specified as $p(y \mid x, \theta)$, where $\theta = (\theta_1, \ldots, \theta_m)^\top$ is the parameter of interest. Let $A_i$ denote the $m \times m$ expected Fisher information matrix from a unit assigned to $x_i$, with $(j,k)$ entry [the expectation is with respect to $p(y \mid x_i, \theta)$]
\[
A_i(j,k) = E\left[ \frac{\partial \log p(y \mid x_i, \theta)}{\partial \theta_j}\, \frac{\partial \log p(y \mid x_i, \theta)}{\partial \theta_k} \right].
\]
The moment matrix, as a function of the design measure $w = (w_1, \ldots, w_n)$, is defined as
\[
M(w) = \sum_{i=1}^n w_i A_i,
\]
which is proportional to the Fisher information for $\theta$ when the number of units assigned to $x_i$ is proportional to $w_i$. Here $w \in \bar{\Omega}$, where $\bar{\Omega}$ denotes the closure of $\Omega = \{w : w_i > 0,\ \sum_{i=1}^n w_i = 1\}$.

Received May 2009; revised October 2009.
AMS 2000 subject classifications.
Key words and phrases. A-optimality, auxiliary variables, c-optimality, D-optimality, experimental design, generalized linear models, multiplicative algorithm.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2010, Vol. 38, No. 3, 1593–1606. This reprint differs from the original in pagination and typographic detail.
Throughout we assume that the $A_i$ are well defined and hence nonnegative definite. The set
\[
\Omega^+ \equiv \{w \in \bar{\Omega} : M(w) > 0\}
\]
is assumed nonempty. Our approach may conceivably extend to the case where $M(w)$ is allowed to be singular, by using generalized inverses, although we do not pursue this here.

Given an optimality criterion $\phi$ defined on positive definite matrices, the goal is to maximize $\phi(M(w))$ with respect to $w \in \Omega^+$. Typical optimality criteria include:

(i) the D-criterion $\phi_0(M) = \log\det(M)$;
(ii) the A-criterion $\phi_{-1}(M) = -\operatorname{tr}(M^{-1})$;
(iii) more generally, the $p$th mean criterion $\phi_p(M) = -\operatorname{tr}(M^p)$, $p < 0$;
(iv) the c-criterion $\phi_{-1,c}(M) = -c^\top M^{-1} c$, where $c$ is a nonzero constant vector.

Often only a linear combination $K^\top\theta$, for example, a subvector of $\theta$, is of interest. The Fisher information for $K^\top\theta$ is naturally defined as $(K^\top M^{-1} K)^{-1}$, assuming invertibility [Pukelsheim (1993)]. We may therefore consider the D- and A-criteria for $K^\top\theta$ defined, respectively, as
\[
\phi_{0,K}(M) = -\log\det(K^\top M^{-1} K); \qquad \phi_{-1,K}(M) = -\operatorname{tr}(K^\top M^{-1} K). \tag{1}
\]
The c-criterion is a special case of $\phi_{-1,K}(M)$. Motivations for such optimality criteria are well known. In a linear problem, the A-criterion seeks to minimize the sum of variances of the best linear unbiased estimators (BLUEs) for all coordinates of $\theta$, while the c-criterion seeks to minimize the variance of the BLUE for $c^\top\theta$. Similar interpretations (with asymptotic arguments) apply to nonlinear problems.

In general $M(w)$ also depends on the unknown parameter $\theta$, which complicates the definition of an optimality criterion. A simple solution is to maximize $\phi(M(w))$ with $\theta$ fixed at a prior guess $\theta^*$; this leads to local optimality [Chernoff (1953)]. Local optimality may be criticized for ignoring uncertainty in $\theta$. However, in a situation where real prior information is available, or where the dependence of $M$ on $\theta$ is weak, it is nevertheless a viable approach and has been adopted routinely [see, e.g., Li and Majumdar (2008)]. Henceforth we assume a fixed $\theta^*$ and suppress the dependence of $M$ on $\theta$. Possible extensions are mentioned in Section 5.

Optimal designs do not usually come in closed form. As early as Wynn (1972), Fedorov (1972), Atwood (1973) and Wu and Wynn (1978), and as late as Torsney (2007), Harman and Pronzato (2007) and Dette, Pepelyshev and Zhigljavsky (2008), various procedures have been studied for numerical computation. We shall focus on the following multiplicative algorithm [Titterington (1976, 1978), Silvey, Titterington and Torsney (1978)], which is specified through a power parameter $\lambda \in (0, 1]$.

Algorithm I.
Set $\lambda \in (0, 1]$ and $w^{(0)} \in \Omega$. For $t = 0, 1, \ldots$, compute
\[
w_i^{(t+1)} = \frac{w_i^{(t)}\, d_i^{\lambda}(w^{(t)})}{\sum_{j=1}^n w_j^{(t)}\, d_j^{\lambda}(w^{(t)})}, \qquad i = 1, \ldots, n, \tag{2}
\]
where
\[
d_i(w) = \operatorname{tr}(\phi'(M(w)) A_i), \qquad \phi'(M) \equiv \frac{\partial\phi(M)}{\partial M}.
\]
Iterate until convergence.

For a heuristic explanation, observe that (2) is equivalent to
\[
w_i^{(t+1)} \propto w_i^{(t)} \biggl(\frac{\partial\phi(M(w))}{\partial w_i}\bigg|_{w = w^{(t)}}\biggr)^{\lambda}, \qquad i = 1, \ldots, n. \tag{3}
\]
The value of $\partial\phi(M(w))/\partial w_i$ indicates the amount of gain in information, as measured by $\phi$, from a slight increase in $w_i$, the weight on the $i$th design point. So (3) can be seen as adjusting $w$ so that relatively more weight is placed on design points whose increased weight may result in a larger gain in $\phi$. If $\phi$ is increasing and concave, then a convenient convergence criterion, based on the general equivalence theorem [Kiefer and Wolfowitz (1960), Whittle (1973)], is
\[
\max_{1 \le i \le n} d_i(w^{(t)}) \le (1 + \delta)\, \bar{d}(w^{(t)}), \tag{4}
\]
where $\bar{d}(w) \equiv \sum_{i=1}^n w_i d_i(w)$ and $\delta$ is a small positive constant.
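To make the update concrete, here is a minimal Python sketch of Algorithm I (our illustration, not code from the paper; the names `algorithm_I` and `phi_grad` are ours). It implements update (2) for a user-supplied matrix gradient $\phi'(M)$ and stops using criterion (4); it assumes an increasing concave $\phi$, so that the $d_i$ are nonnegative.

```python
import numpy as np

def algorithm_I(A, phi_grad, lam=1.0, w0=None, delta=1e-6, max_iter=100000):
    """Multiplicative algorithm (2) with power parameter lam in (0, 1].

    A        : array of shape (n, m, m), the per-point information matrices A_i
    phi_grad : callable M -> dphi/dM (an m x m matrix); for the D-criterion
               phi_0 = log det M, this is M -> inv(M)
    Stops via the equivalence-theorem criterion (4):
    max_i d_i(w) <= (1 + delta) * sum_i w_i d_i(w).
    """
    n = A.shape[0]
    w = np.full(n, 1.0 / n) if w0 is None else np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        M = np.einsum('i,ijk->jk', w, A)       # moment matrix M(w)
        C = phi_grad(M)                        # phi'(M)
        d = np.einsum('jk,ikj->i', C, A)       # d_i(w) = tr(phi'(M) A_i)
        if d.max() <= (1 + delta) * (w @ d):   # criterion (4)
            break
        w = w * d**lam                         # update (2), up to normalization
        w /= w.sum()
    return w

# Example usage: locally D-optimal weights, phi'(M) = M^{-1}.
# w = algorithm_I(A, phi_grad=np.linalg.inv, lam=1.0)
```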
Algorithm I is remarkable in its generality. For example, little restriction is placed on the underlying model $p(y \mid x, \theta)$. Part of the reason, of course, is that we focus on Fisher information and local optimality, which essentially reduces the problem to a linear one.

There exists a large literature on Algorithm I and its relatives [see, e.g., Titterington (1976, 1978), Silvey, Titterington and Torsney (1978), Pázman (1986), Fellman (1989), Pukelsheim and Torsney (1991), Torsney and Mandal (2006), Harman and Pronzato (2007), Dette, Pepelyshev and Zhigljavsky (2008) and Torsney and Martín-Martín (2009)]. One feature that has attracted much attention is that Algorithm I appears to be monotonic, that is, $\phi(M(w^{(t)}))$ increases in $t$, at least in some special cases. For example, when $\phi = \phi_0$ (for D-optimality) and $\lambda = 1$, Titterington (1976) and Pázman (1986) have shown monotonicity using clever probabilistic and analytic inequalities [see also Dette, Pepelyshev and Zhigljavsky (2008) and Harman and Trnovská (2009)]. Algorithm I is also known to be monotonic for $\phi = \phi_{-1,K}$ as in (1), assuming $\lambda = 1/2$ and rank-one $A_i$ [Fellman (1974), Torsney (1983)]. Monotonicity is important because convergence then holds under mild assumptions (see Section 4). Results in these special cases suggest a monotonic convergence theory for a broad class of $\phi$, which is also supported by numerical evidence presented in some of the references above.
2. Main result.
We aim to state general conditions on $\phi$ that ensure that Algorithm I converges monotonically. As a consequence, certain known theoretical results are unified and generalized, and one particular conjecture [Titterington (1978)] is confirmed. Define
\[
\psi(M) \equiv -\phi(M^{-1}), \qquad M > 0.
\]
The functions $\phi$ and $\psi$ are assumed to be differentiable on invertible matrices. Our conditions are conveniently stated in terms of $\psi$. As usual, for two symmetric matrices, $M_1 \le (<)\ M_2$ means $M_2 - M_1$ is nonnegative (positive) definite.

• $\psi(M)$ is increasing:
\[
0 < M_1 \le M_2 \implies \psi(M_1) \le \psi(M_2) \tag{5}
\]
or, equivalently, $\psi'(M)$ is nonnegative definite for positive definite $M$.

• $\psi(M)$ is concave:
\[
\alpha\,\psi(M_1) + (1 - \alpha)\,\psi(M_2) \le \psi(\alpha M_1 + (1 - \alpha) M_2) \tag{6}
\]
for $\alpha \in [0, 1]$ and $M_1, M_2 > 0$. Equivalently,
\[
\psi(M_1) \le \psi(M_2) + \operatorname{tr}(\psi'(M_2)(M_1 - M_2)), \qquad M_1, M_2 > 0. \tag{7}
\]

Condition (5) is usually satisfied by any reasonable information criterion [Pukelsheim (1993)]. Also note that, if (5) fails, then $\partial\phi(M(w))/\partial w_i$ on the right-hand side of (3) is not even guaranteed to be nonnegative. The real restriction is the concavity condition (6). For example, (6) is not satisfied by $\psi_p(M) = -\phi_p(M^{-1})$ (the $p$th mean criterion) when $p < -1$. [It is usually assumed that $\phi(M)$, rather than $\psi(M)$, is concave.] Nevertheless, (6) is satisfied by a wide range of criteria, including the commonly used D-, A- or c-criteria [see cases (i) and (ii) in the illustration of the main result below].
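As a quick check (ours, not in the original), these conditions can be verified directly for the D-criterion, for which $\psi_0(M) \equiv -\phi_0(M^{-1}) = \log\det M$. The gradient $\psi_0'(M) = M^{-1}$ is positive definite, giving (5); and writing $\mu_1, \ldots, \mu_m > 0$ for the eigenvalues of $M_2^{-1/2} M_1 M_2^{-1/2}$,
\[
\psi_0(M_1) - \psi_0(M_2) = \log\det(M_2^{-1/2} M_1 M_2^{-1/2}) = \sum_{j=1}^m \log\mu_j \le \sum_{j=1}^m (\mu_j - 1) = \operatorname{tr}(M_2^{-1}(M_1 - M_2)),
\]
which is exactly (7) with $\psi_0'(M_2) = M_2^{-1}$.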
Our main result is as follows.

Theorem 1 (General monotonicity). Assume (5) and (6). Assume that in iteration (2), with $0 < \lambda \le 1$, we have $M(w^{(t)}) > 0$, $\phi'(M(w^{(t)})) \neq 0$ and $M(w^{(t+1)}) > 0$. Then
\[
\phi(M(w^{(t+1)})) \ge \phi(M(w^{(t)})).
\]

In other words, under mild conditions which ensure that (2) is well defined [specifically, the denominator in (2) is nonzero], (5) and (6) imply that (2) never decreases the criterion $\phi$. Let us illustrate Theorem 1 with some examples. For simplicity, in (i)–(iv) we display formulae for $\lambda = 1$ only, although monotonicity holds for all $\lambda \in (0, 1]$.

(i) Consider
\[
\phi_p(M) = \begin{cases} \log\det M, & p = 0; \\ -\operatorname{tr}(M^p), & p \in [-1, 0). \end{cases}
\]
Then $\psi_p(M) \equiv -\phi_p(M^{-1})$ satisfies (5) and (6). By Theorem 1, Algorithm I is monotonic for $\phi = \phi_p$, $p \in [-1, 0]$ ($p = 0$ and $p = -1$ give the D- and A-criteria, respectively) and $\lambda \in (0, 1]$. The iteration (2) reads
\[
w_i^{(t+1)} = w_i^{(t)}\, \frac{\operatorname{tr}(M^{p-1}(w^{(t)}) A_i)}{\operatorname{tr}(M^{p}(w^{(t)}))}, \qquad i = 1, \ldots, n.
\]

(ii) More generally, given a full-rank $m \times r$ matrix $K$ ($r \le m$), consider
\[
\psi_{p,K}(M^{-1}) \equiv -\phi_{p,K}(M) = \begin{cases} \log\det(K^\top M^{-1} K), & p = 0; \\ \operatorname{tr}((K^\top M^{-1} K)^{-p}), & p \in [-1, 0). \end{cases}
\]
Then $\psi_{p,K}(M)$ satisfies (5) and (6). By Theorem 1, Algorithm I is monotonic for $\phi = \phi_{p,K}$, $p \in [-1, 0]$, and the iteration (2) reads
\[
w_i^{(t+1)} = w_i^{(t)}\, \frac{\operatorname{tr}(M^{-1} K (K^\top M^{-1} K)^{-p-1} K^\top M^{-1} A_i)}{\operatorname{tr}((K^\top M^{-1} K)^{-p})}\bigg|_{M = M(w^{(t)})}. \tag{8}
\]

(iii) In particular, taking $r = 1$ and $K = c$ (an $m \times 1$ vector) with $p = -1$ yields the c-criterion $\phi_{-1,c}$. The iteration (8) reduces to
\[
w_i^{(t+1)} = w_i^{(t)}\, \frac{c^\top M^{-1}(w^{(t)}) A_i M^{-1}(w^{(t)}) c}{c^\top M^{-1}(w^{(t)}) c}, \qquad i = 1, \ldots, n.
\]
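As an illustration of example (iii), a one-step implementation of the c-criterion update might look as follows (a sketch under our own naming, not the paper's code):

```python
import numpy as np

def c_update(w, A, c):
    """One step of iteration (8) specialized to the c-criterion (example (iii)),
    i.e. phi = phi_{-1,c}, displayed here for lam = 1."""
    M = np.einsum('i,ijk->jk', w, A)                   # M(w)
    Minv_c = np.linalg.solve(M, c)                     # M^{-1} c
    num = np.einsum('j,ijk,k->i', Minv_c, A, Minv_c)   # c^T M^{-1} A_i M^{-1} c
    den = c @ Minv_c                                   # c^T M^{-1} c
    w_new = w * num / den
    return w_new / w_new.sum()                         # guards against rounding only
```

Since $\sum_i w_i\, c^\top M^{-1} A_i M^{-1} c = c^\top M^{-1} c$, the new weights already sum to one; the final normalization only absorbs floating-point error.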
As noted by a referee, with $p = -1$, the choice $\lambda = 1$ may lead to an oscillating behavior, in the sense that $w^{(t)}$ alternates between two points at which $\phi_{-1,c}(M(w))$ takes the same value. While this does not contradict Theorem 1, it suggests that other values of $\lambda$ are more desirable for fast convergence. Following Fellman (1974) and Torsney (1983), a practical recommendation is $\lambda = 1/2$ for $p = -1$.

(iv) Consider $p = 0$, $r = m - 1$ and $K = (0_r, I_r)^\top$. Henceforth $0_r$ denotes the $r \times 1$ zero vector and $I_r$ denotes the $r \times r$ identity matrix. Assume $A_i = x_i x_i^\top$, where $x_i^\top = (1, z_i^\top)$ and $z_i$ is $(m-1) \times 1$. That is, we consider D-optimality for $(\theta_2, \ldots, \theta_m)$ under the linear model
\[
y \mid (x, \theta) \sim \mathrm{N}(x^\top\theta, \sigma^2), \qquad x^\top = (1, z^\top),
\]
where the parameter is $\theta = (\theta_1, \theta_2, \ldots, \theta_m)^\top$. That is, interest centers on all coefficients other than the intercept. Nevertheless, as far as the design measure $w$ is concerned, the optimality criterion $\phi_{0,K}(M)$ coincides with $\phi_0(M)$, that is,
\[
-\log\det(K^\top M^{-1}(w) K) = \log\det M(w).
\]
After some algebra, (8) reduces to
\[
w_i^{(t+1)} = w_i^{(t)}\, \frac{(z_i - \bar{z})^\top M_c^{-1}(w^{(t)})(z_i - \bar{z})}{m - 1}, \qquad i = 1, \ldots, n, \tag{9}
\]
where
\[
\bar{z} = \sum_{i=1}^n w_i^{(t)} z_i; \qquad M_c(w^{(t)}) = \sum_{i=1}^n w_i^{(t)} (z_i - \bar{z})(z_i - \bar{z})^\top.
\]
Thus (9) satisfies $\det M(w^{(t+1)}) \ge \det M(w^{(t)})$ (a numerical sketch of iteration (9) is given at the end of this section).

Monotonicity of (9) has been conjectured since Titterington (1978), and considerable numerical evidence has accumulated over the years. Recently, extending the arguments of Pázman (1986), Dette, Pepelyshev and Zhigljavsky (2008) obtained results which come very close to resolving Titterington's conjecture. Nevertheless, we have been unable to extend their arguments further. Instead we prove the general Theorem 1 using a different approach, and settle this conjecture as a consequence.

The proof of Theorem 1 is achieved by using a method of auxiliary variables. When a function $f(w)$ [e.g., $-\det M(w)$] to be minimized is complicated, we introduce a new variable $Q$ and a function $g(w, Q)$ such that $\min_Q g(w, Q) = f(w)$ for all $w$, thus transforming the problem into minimizing $g(w, Q)$ over $w$ and $Q$ jointly. Then we may use an iterative conditional minimization strategy on $g(w, Q)$. This is inspired by the EM algorithm [Dempster, Laird and Rubin (1977), Meng and van Dyk (1997); in particular, see Csiszár and Tusnády's (1984) interpretation; see Yu (2008) for a related interpretation of the data augmentation algorithm].

In Section 3 we analyze Algorithm I using this strategy. Although attention is paid to the mathematics, our focus is on intuitively appealing interpretations which may lead to further extensions of Algorithm I with the same desirable monotonicity properties. If the algorithm is monotonic, then convergence can be established under mild conditions (Section 4). Section 5 contains an illustration with optimal designs for a simple logistic regression model.
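As promised above, here is a minimal sketch of iteration (9) (our own code; `titterington_step` is a hypothetical name). `Z` holds the points $z_i$ as rows and `w` is the current design measure:

```python
import numpy as np

def titterington_step(w, Z):
    """One step of iteration (9): D-optimality for all coefficients except the
    intercept, with A_i = x_i x_i^T, x_i = (1, z_i^T)^T.  Z has shape (n, m-1)."""
    zbar = w @ Z                                 # weighted mean of the z_i
    Zc = Z - zbar                                # centered design points z_i - zbar
    Mc = Zc.T @ (w[:, None] * Zc)                # centered moment matrix M_c(w)
    q = np.einsum('ij,ij->i', Zc @ np.linalg.inv(Mc), Zc)  # (z_i-zbar)^T Mc^{-1} (z_i-zbar)
    return w * q / Zc.shape[1]                   # divide by m - 1
```

Since $\sum_i w_i^{(t)} q_i = \operatorname{tr}(M_c^{-1} M_c) = m - 1$, the returned weights automatically sum to one.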
3. Explaining the monotonicity.
A key observation is that the problem of maximizing $\phi(M(w))$ or, equivalently, minimizing $\psi(M^{-1}(w))$, can be formulated as a joint minimization over both the design and the estimator. Specifically, let us compare the original Problem P1 with its companion Problem P2. Throughout, $A^{1/2}$ denotes the symmetric nonnegative definite (SNND) square root of an SNND matrix $A$.

Problem P1.
Minimize $-\phi(M(w)) \equiv \psi((\sum_{i=1}^n w_i A_i)^{-1})$ over $w \in \Omega$.

Problem P2.
Minimize
\[
g(w, Q) \equiv \psi(Q \Delta_w Q^\top) \tag{10}
\]
over $w \in \Omega$ and $Q$ [an $m \times (mn)$ matrix], subject to $QG = I_m$, where
\[
\Delta_w \equiv \operatorname{Diag}(w_1^{-1}, \ldots, w_n^{-1}) \otimes I_m; \qquad G \equiv (A_1^{1/2}, \ldots, A_n^{1/2})^\top.
\]

Though not immediately obvious, Problems P1 and P2 are equivalent, and this may be explained in statistical terms as follows. In (10), $Q \Delta_w Q^\top$ is simply the variance matrix of a linear unbiased estimator, $QY$, of the $m \times 1$ parameter $\theta$ in the model
\[
Y = G\theta + \varepsilon, \qquad \varepsilon \sim \mathrm{N}(0, \Delta_w),
\]
where $Y$ is the $(mn) \times 1$ vector of observations; the constraint $QG = I_m$ ensures unbiasedness. [Note that $G$ is full-rank since $M(w)$ is nonsingular by assumption.] Of course, the weighted least squares (WLS) estimator is the best linear unbiased estimator, having the smallest variance matrix (in the sense of positive definite ordering) and, by (5), the smallest $\psi$ for that matrix. It follows that, for fixed $w$, $g(w, Q)$ is minimized by choosing $QY$ as the WLS estimator:
\[
g(w, \hat{Q}_{\mathrm{WLS}}) = \inf_{QG = I_m} g(w, Q), \tag{11}
\]
\[
\hat{Q}_{\mathrm{WLS}} = M^{-1}(w)(w_1 A_1^{1/2}, \ldots, w_n A_n^{1/2}). \tag{12}
\]
However, from (10) and (12) we get
\[
g(w, \hat{Q}_{\mathrm{WLS}}) = \psi(M^{-1}(w)). \tag{13}
\]
That is, Problem P2 reduces to Problem P1 upon minimizing over $Q$.

Since Problem P2 is not immediately solvable, it is natural to consider the subproblems: (i) minimizing $g(w, Q)$ over $Q$ for fixed $w$, and (ii) minimizing $g(w, Q)$ over $w$ for fixed $Q$. Part (ii) is again formulated as a joint minimization problem. For a fixed $m \times (mn)$ matrix $Q$ such that $QG = I_m$, let us consider Problems P3 and P4.
Problem P3.
Minimize $g(w, Q)$ as in (10) over $w \in \Omega$.

Problem P4.
Minimize the function
\[
h(\Sigma, w, Q) = \psi(\Sigma) + \operatorname{tr}(\psi'(\Sigma)(Q \Delta_w Q^\top - \Sigma)) \tag{14}
\]
over $w \in \Omega$ and the $m \times m$ positive definite matrix $\Sigma$.

The concavity assumption (7) implies that
\[
h(\Sigma, w, Q) \ge \psi(Q \Delta_w Q^\top) \tag{15}
\]
with equality when $\Sigma = Q \Delta_w Q^\top$; that is, Problem P4 reduces to Problem P3 upon minimizing over $\Sigma$.

Since Problem P4 is not immediately solvable, it is natural to consider the subproblems: (i) minimizing $h(\Sigma, w, Q)$ over $\Sigma$ for fixed $w$ and $Q$, and (ii) minimizing $h(\Sigma, w, Q)$ over $w$ for fixed $\Sigma$ and $Q$. Part (ii), which amounts to minimizing
\[
\operatorname{tr}(\psi'(\Sigma) Q \Delta_w Q^\top) = \operatorname{tr}(Q^\top \psi'(\Sigma) Q \Delta_w),
\]
admits a closed-form solution: if we write $Q = (Q_1, \ldots, Q_n)$, where each $Q_i$ is $m \times m$, then $w_i$ should be proportional to $\sqrt{\operatorname{tr}(Q_i^\top \psi'(\Sigma) Q_i)}$. But Algorithm I may not perform an exact minimization here [see (16)].

Based on the above discussion, we can express Algorithm I as an iterative conditional minimization algorithm involving $w$, $Q$ and $\Sigma$. At iteration $t$, define
\[
Q^{(t)} = (Q_1^{(t)}, \ldots, Q_n^{(t)}); \qquad Q_i^{(t)} = w_i^{(t)} M^{-1}(w^{(t)}) A_i^{1/2}, \quad i = 1, \ldots, n;
\]
\[
\Sigma^{(t)} = Q^{(t)} \Delta_{w^{(t)}} Q^{(t)\top} = M^{-1}(w^{(t)}).
\]
Then we have
\begin{align}
\psi(M^{-1}(w^{(t)})) &= g(w^{(t)}, Q^{(t)}) && \text{[by (13)]} \nonumber \\
&= h(\Sigma^{(t)}, w^{(t)}, Q^{(t)}) && \text{[by (14)]} \nonumber \\
&\ge h(\Sigma^{(t)}, w^{(t+1)}, Q^{(t)}) && \text{(see below)} \tag{16} \\
&\ge g(w^{(t+1)}, Q^{(t)}) && \text{[by (15), (10)]} \tag{17} \\
&\ge \psi(M^{-1}(w^{(t+1)})) && \text{[by (11), (13)]}. \tag{18}
\end{align}
The choice of $w^{(t+1)}$ leads to (16) as follows. After simple algebra, the iteration (2) becomes
\[
w_i^{(t+1)} = \frac{r_i^{\lambda}\, w_i^{1 - 2\lambda}}{\sum_{j=1}^n r_j^{\lambda}\, w_j^{1 - 2\lambda}}, \qquad i = 1, \ldots, n,
\]
where $w_i \equiv w_i^{(t)}$ and $r_i \equiv \operatorname{tr}(Q_i^{(t)\top} \psi'(\Sigma^{(t)}) Q_i^{(t)})$.
Since $0 < \lambda \le 1$, Jensen's inequality yields
\[
\Biggl(\sum_{i=1}^n \frac{r_i}{w_i}\Biggr)^{1-\lambda} \ge \sum_{i=1}^n w_i \biggl(\frac{r_i}{w_i^2}\biggr)^{1-\lambda}; \qquad \Biggl(\sum_{i=1}^n \frac{r_i}{w_i}\Biggr)^{\lambda} \ge \sum_{i=1}^n w_i \biggl(\frac{r_i}{w_i^2}\biggr)^{\lambda}.
\]
That is,
\[
\sum_{i=1}^n \frac{r_i}{w_i} \ge \Biggl(\sum_{i=1}^n r_i^{1-\lambda} w_i^{2\lambda - 1}\Biggr) \Biggl(\sum_{i=1}^n r_i^{\lambda} w_i^{1 - 2\lambda}\Biggr).
\]
Hence
\[
\operatorname{tr}(\psi'(\Sigma^{(t)}) Q^{(t)} \Delta_{w^{(t)}} Q^{(t)\top}) = \sum_{i=1}^n \frac{r_i}{w_i^{(t)}} \ge \sum_{i=1}^n \frac{r_i}{w_i^{(t+1)}} = \operatorname{tr}(\psi'(\Sigma^{(t)}) Q^{(t)} \Delta_{w^{(t+1)}} Q^{(t)\top}),
\]
which produces (16).
Choosing $\lambda = 1/2$, that is, $w_i^{(t+1)} \propto \sqrt{r_i}$, leads to exact minimization in (16); choosing $\lambda = 1$ yields equality in (16). But any choice of $w^{(t+1)}$ that decreases $h(\Sigma^{(t)}, w, Q^{(t)})$ at (16) would have resulted in the desired inequality,
\[
\psi(M^{-1}(w^{(t)})) \ge \psi(M^{-1}(w^{(t+1)})).
\]
We may allow $\lambda$ to change from iteration to iteration, and monotonicity still holds, as long as $\lambda \in (0, 1]$ at each iteration. Also note that we assume $w_i^{(t)}, w_i^{(t+1)} > 0$ for all $i$. This is not essential, however, because (i) the possibility of $w_i^{(t)} = 0$ can be handled by restricting our analysis to all design points $i$ such that $w_i^{(t)} > 0$, and (ii) the possibility of $w_i^{(t+1)} = 0$ can be handled by a standard limiting argument. Monotonicity holds as long as $M(w^{(t)})$ and $M(w^{(t+1)})$ are both positive definite, as noted in the statement of Theorem 1.
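To make the chain (16)–(18) tangible, here is a small numerical check (our sketch, not from the paper), specialized to the D-criterion, for which $\psi(S) = \log\det S$ and $\psi'(S) = S^{-1}$; the rank-one $A_i = x_i x_i^\top$ make the square roots $A_i^{1/2}$ easy to form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 6, 3, 0.7                        # any lam in (0, 1]
X = rng.normal(size=(n, m))
A = np.einsum('ij,ik->ijk', X, X)            # rank-one A_i = x_i x_i^T
Asqrt = A / np.linalg.norm(X, axis=1)[:, None, None]   # A_i^{1/2} = x_i x_i^T / |x_i|

psi  = lambda S: np.linalg.slogdet(S)[1]     # psi = log det  (D-criterion)
dpsi = lambda S: np.linalg.inv(S)            # psi'(S) = S^{-1}
moment = lambda w: np.einsum('i,ijk->jk', w, A)

def g(w, Q):                                 # g(w, Q) = psi(Q Delta_w Q^T), eq. (10)
    V = np.einsum('ijk,i,ilk->jl', Q, 1.0 / w, Q)
    return psi(V)

def h(S, w, Q):                              # h(Sigma, w, Q), eq. (14)
    V = np.einsum('ijk,i,ilk->jl', Q, 1.0 / w, Q)
    return psi(S) + np.trace(dpsi(S) @ (V - S))

w = rng.dirichlet(np.ones(n))
Minv = np.linalg.inv(moment(w))
Q = w[:, None, None] * (Minv @ Asqrt)        # Q_i = w_i M^{-1} A_i^{1/2}
r = np.einsum('ijk,jl,ilk->i', Q, dpsi(Minv), Q)   # r_i = tr(Q_i^T psi'(Sigma) Q_i)
w_new = r**lam * w**(1 - 2 * lam)
w_new /= w_new.sum()                         # the rewritten update (2)

vals = [psi(Minv), h(Minv, w_new, Q), g(w_new, Q),
        psi(np.linalg.inv(moment(w_new)))]
assert all(a >= b - 1e-10 for a, b in zip(vals, vals[1:]))   # chain (16)-(18)
```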
4. Global convergence.
Monotonicity (Theorem 1) plays an important role in the following convergence theorem.
Theorem 2 (Global convergence).
Denote the mapping (2) by $T$.

(a) Assume $\phi'(M(w)) \ge 0$ and $\phi'(M(w)) A_i \neq 0$ for all $w \in \Omega^+$ and $i = 1, \ldots, n$.

(b) Assume (2) is strictly monotonic, that is,
\[
w \in \Omega^+,\ Tw \neq w \implies \phi(M(Tw)) > \phi(M(w)). \tag{19}
\]

(c) Assume $\phi$ is strictly concave and $\phi'$ is continuous on positive definite matrices.

(d) Assume that, if $M$ (a positive definite matrix) tends to $M^*$ such that $\phi(M)$ increases monotonically, then $M^*$ is nonsingular.

Let $w^{(t)}$ be generated by (2) with $w_i^{(0)} > 0$ for all $i$. Then: (i) all limit points of $w^{(t)}$ are global maxima of $\phi(M(w))$ on $\Omega^+$, and (ii) as $t \to \infty$, $\phi(M(w^{(t)}))$ increases monotonically to $\sup_{w \in \Omega^+} \phi(M(w))$.

The proof of Theorem 2 is somewhat subtle. Standard arguments show that all limit points of $w^{(t)}$ are fixed points of the mapping $T$. This alone does not imply convergence to a global maximum, however, because there often exist sub-optimal fixed points on the boundary of $\Omega$. (Global maxima occur routinely on the boundary also.) Our goal is therefore to rule out possible convergence to such sub-optimal points; details of the proof are presented in Yu (2009), an extended version of this paper. We shall comment on conditions (a)–(d).

Condition (a) ensures that, starting with $w^{(0)} \in \Omega^+$, all iterations are well defined. Moreover, if $w_i^{(0)} > 0$ for all $i$, then $w_i^{(t)} > 0$ for all $t$ and $i$. This highlights the basic idea that, in order to converge to a global maximum $w^*$, the starting value $w^{(0)}$ must assign positive weight to every support point of $w^*$. Such a requirement is not necessary for monotonicity. On the other hand, assigning weight to nonsupport points of $w^*$ tends to slow the algorithm down. Hence methods that quickly eliminate nonoptimal support points are valuable [Harman and Pronzato (2007)].

Condition (b) simply says that unless $w$ is a fixed point, the mapping $T$ should produce a better solution. Let us assume (5), (7) and condition (a), so that Theorem 1 applies. Then, by checking the equality condition in (16), it is easy to see that condition (b) is satisfied if $0 < \lambda < 1$. [The argument leading to (19) technically assumes that all coordinates of $w$ are nonzero, but we can apply it to the appropriate subvector of $w$.] If $\lambda = 1$, then (16) reduces to an equality. However, by checking the equality conditions in (17) and (18), we can show that condition (b) is satisfied if $\psi$ is strictly increasing and strictly concave:
\[
M_1 \ge M_2 > 0,\ M_1 \neq M_2 \implies \psi(M_2) < \psi(M_1); \tag{20}
\]
\[
M_1, M_2 > 0,\ M_1 \neq M_2 \implies \psi(M_1) < \psi(M_2) + \operatorname{tr}(\psi'(M_2)(M_1 - M_2)). \tag{21}
\]

Conditions (c) and (d) are technical requirements that concern $\phi$ alone. Condition (c) ensures uniqueness of the optimal moment matrix, which simplifies the analysis. Condition (d) ensures that positive definiteness of $M(w)$ is maintained in the limit. Conditions (c) and (d) are satisfied by $\phi = \phi_p$ with $p \le 0$, for example.

Let us mention a typical example of Theorem 2.
Corollary 1.
Assume $A_i \neq 0$ and $w_i^{(0)} > 0$ for $i = 1, \ldots, n$, and $M(w^{(0)}) > 0$. Then the conclusion of Theorem 2 holds for Algorithm I with $\phi = \phi_0$.

Proof.
Conditions (a), (c) and (d) are readily verified. Condition (b) is satisfied by (20) and (21). The claim follows from Theorem 2. $\square$
When (20) or (21) fails and $\lambda = 1$, it is often difficult to appeal to Theorem 2 because strict monotonicity [condition (b)] may not hold. We illustrate this with an example where the monotonicity is not strict, and the algorithm does not converge [see Pronzato, Wynn and Zhigljavsky (2000), Chapter 7; also the remark in case (iii) following Theorem 1]. Consider iteration (9) ($\lambda = 1$) with $n = m = 2$ and design space $\mathcal{X} = \{x_i = (1, z_i)^\top,\ i = 1, 2\}$, $z_1 = -z_2 = 1$. It is easy to show that, for any $w^{(t)} = (w_1, w_2) \in \Omega$, iteration (9) maps $w^{(t)}$ to $w^{(t+1)} = (w_2, w_1)$. Thus, unless $w_1 = w_2 = 1/2$, the iterates simply alternate between two designs with the same criterion value and never converge.
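This two-point oscillation is easy to reproduce numerically; the following sketch (ours) applies iteration (9) to the example just described:

```python
import numpy as np

# Oscillation example: n = m = 2, z_1 = -z_2 = 1, iteration (9) with lam = 1.
Z = np.array([[1.0], [-1.0]])
w = np.array([0.3, 0.7])
for t in range(4):
    zbar = w @ Z
    Zc = Z - zbar                                      # centered points
    Mc = Zc.T @ (w[:, None] * Zc)                      # here Mc = 4 w_1 w_2
    q = np.einsum('ij,ij->i', Zc @ np.linalg.inv(Mc), Zc)
    w = w * q                                          # m - 1 = 1
    print(t, w)                                        # alternates (0.7, 0.3) <-> (0.3, 0.7)
```

Here $\det M(w) = 4 w_1 w_2$ takes the same value at $(w_1, w_2)$ and $(w_2, w_1)$, so the criterion sequence is monotone but not strictly so, and the weights never converge.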
5. Further remarks and illustrations.
One can think of several reasons for the wide interest in Algorithm I and its relatives. Similar to the EM algorithm, Algorithm I is simple, easy to implement and monotonically convergent for a large class of optimality criteria (although this was not proved in the present generality). Algorithm I is known to be slow sometimes. But it serves as a foundation upon which more effective variants can be built [see, e.g., Harman and Pronzato (2007) and Dette, Pepelyshev and Zhigljavsky (2008)]. While solving the conjectured monotonicity of (9) holds mathematical interest, our main contribution is a way of interpreting such algorithms as optimization on augmented spaces. This opens up new possibilities in constructing algorithms with the same desirable monotonic convergence properties.

As a numerical example, consider the logistic regression model
\[
p(y \mid x, \theta) = \frac{\exp(y\, x^\top \theta)}{1 + \exp(x^\top \theta)}, \qquad y = 0, 1.
\]
The expected Fisher information for $\theta$ from a unit assigned to $x_i$ is
\[
A_i = \frac{\exp(x_i^\top \theta)}{(1 + \exp(x_i^\top \theta))^2}\, x_i x_i^\top.
\]
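In code, these per-point information matrices are easy to assemble. The sketch below is ours; the grid and the prior guess $\theta^* = (1, 1)^\top$ are stand-ins, since the paper's exact values are not reproduced in this reprint. The result could be fed to the `algorithm_I` sketch from Section 1:

```python
import numpy as np

def logistic_info(X, theta):
    """Per-point information matrices A_i = p_i (1 - p_i) x_i x_i^T for the
    logistic model, where p_i = exp(x_i^T theta) / (1 + exp(x_i^T theta))."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return np.einsum('i,ij,ik->ijk', p * (1 - p), X, X)

# Hypothetical setup: an equally spaced grid of points x_i = (1, z)^T.
z = np.linspace(0.1, 1.0, 10)
X = np.column_stack([np.ones_like(z), z])
A = logistic_info(X, theta=np.array([1.0, 1.0]))      # stand-in prior guess
# w = algorithm_I(A, phi_grad=np.linalg.inv, lam=1.0)  # locally D-optimal weights
```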
Fig. 1. Values of $\phi_0 = \log\det M$ and $\phi_{-2} = -\operatorname{tr}(M^{-2})$ for Algorithm I with design spaces $\mathcal{X}_1$ and $\mathcal{X}_2$.

We compute locally optimal designs with a fixed prior guess $\theta^*$ ($m = 2$) and two design spaces $\mathcal{X}_1$ and $\mathcal{X}_2$, each an equally spaced grid of points of the form $x_i = (1, i/k)^\top$, with $\mathcal{X}_2$ the finer of the two. The design criteria considered are $\phi_0$ (for D-optimality) and $\phi_{-2}$. We use Algorithm I with $\lambda = 1$, starting with equally weighted designs.

For $\phi_0$, Corollary 1 guarantees monotonic convergence. This is illustrated by Figure 1, first row, where $\phi_0 = \log\det M(w^{(t)})$ is plotted against iteration $t$. Under the convergence criterion (4) with a small $\delta$, a substantial number of iterations is required (2121 for $\mathcal{X}_2$). The actual locally D-optimal designs place weight $0.5$ on each of two support points, as can be verified using the general equivalence theorem. This simple example serves to illustrate both the monotonicity of Algorithm I (when Theorem 1 applies) and its potential slow convergence.

For $\phi_{-2}$, although Algorithm I can be implemented just as easily, Theorem 1 does not apply because the concavity condition (7) no longer holds. Indeed, Algorithm I (with $\lambda = 1$) is not monotonic, as is evident from Figure 1, second row, where $\phi_{-2} = -\operatorname{tr}(M^{-2}(w^{(t)}))$ is plotted against iteration $t$. This shows the potential danger of using Algorithm I when monotonicity is not guaranteed.
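For readers who wish to observe this behavior, the following sketch (ours) runs the plain $\lambda = 1$ update under $\phi_{-2}$ on a hypothetical logistic grid. Whether non-monotone steps actually occur depends on the grid and the prior guess, but Figure 1 shows that they do in the paper's setting:

```python
import numpy as np

# phi_{-2}(M) = -tr(M^{-2}) has matrix gradient phi'(M) = 2 M^{-3}.
inv = np.linalg.inv
phi_m2      = lambda M: -np.trace(inv(M) @ inv(M))
phi_m2_grad = lambda M: 2 * inv(M) @ inv(M) @ inv(M)

# Hypothetical logistic grid (the paper's exact grid is not reproduced here).
z = np.linspace(0.1, 1.0, 10)
X = np.column_stack([np.ones_like(z), z])
p = 1.0 / (1.0 + np.exp(-X @ np.array([1.0, 1.0])))
A = np.einsum('i,ij,ik->ijk', p * (1 - p), X, X)

w, vals = np.full(len(A), 1.0 / len(A)), []
for t in range(200):                         # plain lam = 1 updates under phi_{-2}
    M = np.einsum('i,ijk->jk', w, A)
    vals.append(phi_m2(M))
    d = np.einsum('jk,ikj->i', phi_m2_grad(M), A)
    w = w * d / (w @ d)
print(min(np.diff(vals)))                    # a negative value flags a non-monotone step
```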
Although Theorem 1 does not cover the $\phi_p$ criterion for $p < -1$, it is still possible that monotonicity holds for a smaller range of $\lambda$. Calculations in special cases lead to the conjecture [Silvey, Titterington and Torsney (1978)] that Algorithm I is monotonic if $0 < \lambda \le 1/(1 - p)$. Theorem 1 provides further evidence for this conjecture, but new insights are needed to resolve it.

We have focused on local optimality. An alternative, Bayesian optimality [Chaloner and Larntz (1989), Chaloner and Verdinelli (1995)], seeks to maximize the expected value of $\phi(M(\theta; w))$ over a prior distribution $\pi(\theta)$. The notation $M(\theta; w)$ emphasizes the dependence of the moment matrix on the parameter $\theta$. It would be worthwhile to extend our strategy in Section 3 to Bayesian optimality, and we plan to report both theoretical and empirical evaluations of such extensions in future work.

Acknowledgment.
The author would like to thank Don Rubin, Xiao-Li Meng and David van Dyk for introducing him to the field of statistical computing. He is also grateful to Mike Titterington, Ben Torsney and the referees for their valuable comments.

REFERENCES
Atwood, C. L. (1973). Sequences converging to D-optimal designs of experiments. Ann. Statist.
Chaloner, K. and Larntz, K. (1989). Optimal Bayesian design applied to logistic regression experiments. J. Statist. Plann. Inference.
Chaloner, K. and Verdinelli, I. (1995). Bayesian experimental design: A review. Statist. Sci.
Chernoff, H. (1953). Locally optimal design for estimating parameters. Ann. Math. Statist.
Csiszár, I. and Tusnády, G. (1984). Information geometry and alternating minimization procedures. Statist. Decisions (Suppl. 1) 205–237. MR0785210
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. Ser. B.
Dette, H., Pepelyshev, A. and Zhigljavsky, A. (2008). Improving updating rules in multiplicative algorithms for computing D-optimal designs. Comput. Statist. Data Anal.
Fedorov, V. V. (1972). Theory of Optimal Experiments. Academic Press, New York. MR0403103
Fellman, J. (1974). On the allocation of linear observations (Thesis). Soc. Sci. Fenn. Comment. Phys.-Math.
Fellman, J. (1989). An empirical study of a class of iterative searches for optimal designs. J. Statist. Plann. Inference.
Harman, R. and Pronzato, L. (2007). Improvements on removing nonoptimal support points in D-optimum design algorithms. Statist. Probab. Lett.
Harman, R. and Trnovská, M. (2009). Approximate D-optimal designs of experiments on the convex hull of a finite set of information matrices. Math. Slovaca.
Kiefer, J. (1974). General equivalence theory for optimum designs (approximate theory). Ann. Statist.
Kiefer, J. and Wolfowitz, J. (1960). The equivalence of two extremum problems. Canad. J. Math.
Li, G. and Majumdar, D. (2008). D-optimal designs for logistic models with three and four parameters. J. Statist. Plann. Inference.
Meng, X.-L. and van Dyk, D. (1997). The EM algorithm: an old folk-song sung to a fast new tune (with discussion). J. Roy. Statist. Soc. Ser. B.
Pázman, A. (1986). Foundations of Optimum Experimental Design. Reidel, Dordrecht.
Pronzato, L., Wynn, H. and Zhigljavsky, A. (2000). Dynamical Search: Applications of Dynamical Systems in Search and Optimization. Chapman & Hall/CRC, Boca Raton.
Pukelsheim, F. (1993). Optimal Design of Experiments. Wiley, New York. MR1211416
Pukelsheim, F. and Torsney, B. (1991). Optimal weights for experimental designs on linearly independent support points. Ann. Statist.
Silvey, S. D. (1980). Optimal Design. Chapman & Hall, London. MR0606742
Silvey, S. D., Titterington, D. M. and Torsney, B. (1978). An algorithm for optimal designs on a finite design space. Comm. Statist. Theory Methods 7 1379–1389.
Titterington, D. M. (1976). Algorithms for computing D-optimal designs on finite design spaces. In Proc. of the 1976 Conf. on Information Science and Systems.
Titterington, D. M. (1978). Estimation of correlation coefficients by ellipsoidal trimming. Appl. Stat. 27 227–234.
Torsney, B. (1983). A moment inequality and monotonicity of an algorithm. In Proceedings of the International Symposium on Semi-Infinite Programming and Applications (K. O. Kortanek and A. V. Fiacco, eds.). Lecture Notes in Economics and Mathematical Systems. Springer.
Torsney, B. (2007). W-iterations and ripples therefrom. In Optimal Design and Related Areas in Optimization and Statistics (L. Pronzato and A. Zhigljavsky, eds.) 1–12. Springer, New York. MR2513344
Torsney, B. and Mandal, S. (2006). Two classes of multiplicative algorithms for constructing optimizing distributions. Comput. Statist. Data Anal.
Torsney, B. and Martín-Martín, R. (2009). Multiplicative algorithms for computing optimum designs. J. Statist. Plann. Inference.
Whittle, P. (1973). Some general points in the theory of optimal experimental design. J. Roy. Statist. Soc. Ser. B.
Wu, C. F. and Wynn, H. P. (1978). The convergence of general step-length algorithms for regular optimum design criteria. Ann. Statist.
Wynn, H. P. (1972). Results in the theory and construction of D-optimum experimental designs. J. Roy. Statist. Soc. Ser. B.
Yu, Y. (2008). A bit of information theory, and the data augmentation algorithm converges.