What is the gradient of a scalar function of a symmetric matrix?
SHRIRAM SRINIVASAN AND NISHANT PANDA

Applied Mathematics Group (T-5), Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545 ([email protected]). SS thanks Los Alamos National Laboratory (US) for funding.
Abstract. Perusal of research articles in the statistics and electrical engineering communities that deal with the topic of matrix calculus reveals a different approach to calculation of the gradient of a real-valued function of a symmetric matrix. In contrast to the standard approach, wherein the gradient is calculated using the definition of a Fréchet derivative for matrix functionals, there is a notion of the gradient that explicitly takes into account the symmetry of the matrix, and this "symmetric gradient" $G_s$ is reported to be related to the gradient $G$, computed by ignoring symmetry, as $G_s = G + G^T - G \circ I$, where $\circ$ denotes the elementwise Hadamard product of the two matrices. The idea of the "symmetric gradient" has now appeared in several publications, as well as in textbooks and handbooks on matrix calculus which are often cited in the context of machine learning. After setting up the problem in a finite-dimensional inner-product space, we demonstrate rigorously that $G_s = (G + G^T)/2$.

Key words. matrix calculus, symmetric matrix, Fréchet derivative, gradient, matrix functional
AMS subject classifications.
1. Introduction.
Matrix functionals defined over an inner-product space of square matrices are a common construct in applied mathematics. In most cases, the object of interest is not the matrix functional itself, but its derivative or gradient (if it be differentiable), and this notion is unambiguous. The Fréchet derivative, see for e.g. [1] and [2], being a linear functional, readily yields the definition of the gradient via the Riesz Representation Theorem. However, there is a sub-class of matrix functionals that frequently occurs in practice whose argument is a symmetric matrix. Such functionals and their gradients occur in the analysis and control of dynamical systems which are described by matrix differential equations [3], maximum likelihood estimation in statistics, econometrics and machine learning [4], and in the theory of elasticity and continuum thermodynamics [5, 6]. For this sub-class of matrix functionals with a symmetric matrix argument, there appear to be two approaches to define the gradient that lead to different results.

By working with the definition of the Fréchet derivative over the vector space of square matrices and specializing it to that of the symmetric matrices, which are a proper subspace, the gradient (denoted by $G_{sym}$ for convenience) can be obtained as described in [7]. However, a different idea emerged through matrix calculus as practiced by the statistics and control systems communities: that of a "symmetric gradient". The root of this idea is the fact that while the space of square matrices $\mathbb{R}^{n \times n}$ has dimension $n^2$, the subspace of symmetric matrices has a dimension of $n(n+1)/2$. The second approach aims to explicitly take into account the symmetry of the matrix elements, and view the matrix functional as one defined on the vector
space $\mathbb{R}^{n(n+1)/2}$, compute its gradient in this space, and finally reinterpret it as a symmetric matrix (the "symmetric gradient" $G_s$) in $\mathbb{R}^{n \times n}$. However, the gradients computed by the two different methods, $G_{sym}$ and $G_s$, are not equal. The question raised in the title of this article refers to this dichotomy.

A perusal of the literature reveals how the idea of the "symmetric gradient" came into being among the community of statisticians and electrical engineers that dominantly used matrix calculus. Early work in statistics in the 1960s such as [8, 9] does not make any mention of a need for special formulae to treat the case of a symmetric matrix, but does note that all the matrix elements must be "functionally independent". The notion of "independence" of matrix elements was a recurring theme, and symmetric matrices, by dint of equality of the off-diagonal elements, violated this condition. Gebhardt [10] in 1971 seems to have been the first to remark that the derivative formulae do not consider symmetry explicitly, but he concluded that no adjustment was necessary in his case since the gradient obtained was already symmetric. Tracy and Singh [11] in 1975 echo the same sentiments as Gebhardt about the need for special formulae. By the end of the decade, the "symmetric gradient" makes its appearance in some form or the other in the work of Henderson [12] in 1979, a review by Nel [13], and a book by Rogers [14] in 1980. McCulloch [15] proves the expression for the "symmetric gradient" that we quote here and notes that it applies to calculating derivatives with respect to variance-covariance matrices, and thus derives the information matrix for the multivariate normal distribution. By 1982, the "symmetric gradient" is included in the authoritative and influential textbook by Searle [16]. Today the idea is firmly entrenched, as evidenced by the books [17, 18, 19] and the notes by Minka [20].

The idea of the "symmetric gradient" seems to have come up in the control systems community (as represented by publications in IEEE) as an offshoot of the extension of the Pontryagin Maximum principle for matrices of controls and states, when Athans and Schweppe [3] remark that the formulae for gradient matrices are derived under the assumption of "functional independence" of matrix elements. Later, they warned [21, 22] that special formulae were necessary for symmetric matrices. Geering [23] in 1976 exhibited an example calculation (the gradient of the determinant of a symmetric $2 \times 2$ matrix) to justify the definition of a "symmetric gradient". We shall show that his reasoning was flawed. Brewer [24] in 1977 remarked that the formulae for gradient matrices in [3, 22, 21] can only be applied when the elements of the matrix are independent, which is not true for a symmetric matrix, and so proceeded to derive a general formula for the "symmetric gradient" (identical to McCulloch [15]) through the rules of matrix calculus, for use in sensitivity analysis of optimal estimation systems. The same flaw underpins the example calculation of Geering [23] and the putative "proof" of the expression for $G_s$ derived by Brewer [24].
Following on from these works, it appears in other instances [25, 26, 27, 28]. At present, the "symmetric gradient" formula is also recorded in [29], a handy reference for engineers and scientists working on inter-disciplinary topics using statistics and machine learning, and the formula's appearance in [30] shows that it is no longer restricted to a particular community of researchers.

Thus, both notions of the gradient are well-established, and hence the fact that these two notions do not agree is a source of enormous confusion for researchers who straddle application areas, a point to which the authors can emphatically attest. On the popular site Mathematics Stack Exchange, there are multiple questions (for example [31, 32, 33, 34, 35, 36, 37]) related to this theme, but their answers deepen and misguide rather than alleviate the existing confusion. Depending on the context, this disagreement between the two notions of gradient has implications of varying import. We shall show that the expression for $G_s$ in $\mathbb{R}^{n \times n}$ is a misinterpretation of the gradient in $\mathbb{R}^{n(n+1)/2}$. In other words, the lifting from $\mathbb{R}^{n(n+1)/2}$ to $\mathbb{R}^{n \times n}$ is incorrect. When interpreted correctly, we are inexorably led to $G_s = G_{sym}$.

That finally brings us to the most important reason for writing this article, which is that derivatives and gradients are fundamental ideas, and there should not be any ambiguity about their definitions. Thus, we felt the urgent need to clarify the issues muddying the waters, and show that the "symmetric gradient", when calculated correctly, leads to the expected result.

The paper is organized as follows: after stating the problem, we begin with two illustrative examples in Section 2 that allow us to see concretely what we later prove in the abstract. After that, Section 3 lays out all the machinery of linear algebra that we shall need, ending with the proof of the main result.
2. Problem formulation.
To fix our notation, we introduce the following. We denote by $S^{n \times n}$ the subspace of all symmetric matrices in $\mathbb{R}^{n \times n}$. The space $\mathbb{R}^{n \times n}$ (and subsequently $S^{n \times n}$) is an inner product space with the following natural inner product $\langle \cdot, \cdot \rangle_F$.

Definition 2.1. For two matrices $A, B$ in $\mathbb{R}^{n \times n}$,
$$\langle A, B \rangle_F := \mathrm{tr}(A^T B)$$
defines an inner product and induces the Frobenius norm on $\mathbb{R}^{n \times n}$ via $\|A\|_F := \sqrt{\mathrm{tr}(A^T A)}$.

Corollary 2.2. We collect a few useful facts about the inner product defined above essential for this paper.
1. For $A$ symmetric and $B$ skew-symmetric in $\mathbb{R}^{n \times n}$, $\langle A, B \rangle_F = 0$.
2. If $\langle A, H \rangle_F = 0$ for every $H$ in $S^{n \times n}$, then the symmetric part of $A$, given by $\mathrm{sym}(A) := (A + A^T)/2$, is equal to $0$.
3. For $A$ in $\mathbb{R}^{n \times n}$ and $H$ in $S^{n \times n}$, $\langle A, H \rangle_F = \langle \mathrm{sym}(A), H \rangle_F$.
Proof. See, for e.g., [40].
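These facts are easy to confirm numerically. The following minimal NumPy sketch (an illustration added for concreteness, with arbitrary random test matrices) verifies facts 1 and 3:

```python
# Numerical spot-check of Corollary 2.2 (illustrative, not part of the proof).
import numpy as np

rng = np.random.default_rng(0)
n = 4

def frob(A, B):
    # <A, B>_F := tr(A^T B), Definition 2.1
    return np.trace(A.T @ B)

def sym(A):
    # symmetric part: sym(A) := (A + A^T)/2
    return (A + A.T) / 2

A = rng.standard_normal((n, n))            # arbitrary square matrix
H = sym(rng.standard_normal((n, n)))       # arbitrary symmetric matrix
S = sym(rng.standard_normal((n, n)))       # symmetric
W = rng.standard_normal((n, n))
W = (W - W.T) / 2                          # skew-symmetric

print(np.isclose(frob(S, W), 0.0))              # fact 1: <sym, skew>_F = 0
print(np.isclose(frob(A, H), frob(sym(A), H)))  # fact 3
```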
Consider a real-valued function $\phi: \mathbb{R}^{n \times n} \to \mathbb{R}$. We say that $\phi$ is differentiable if its Fréchet derivative, defined below, exists.

Definition 2.3. The Fréchet derivative of $\phi$ at $A$ in $\mathbb{R}^{n \times n}$ is the unique linear functional $D\phi(A)$ on $\mathbb{R}^{n \times n}$ such that
$$\lim_{\|H\| \to 0} \frac{|\phi(A + H) - \phi(A) - D\phi(A)[H]|}{\|H\|} = 0,$$
for any $H$ in $\mathbb{R}^{n \times n}$. The Riesz Representation theorem then asserts the existence of the gradient $\nabla\phi(A)$ in $\mathbb{R}^{n \times n}$ such that
$$\langle \nabla\phi(A), H \rangle_F = D\phi(A)[H].$$

Note that if $A$ is a symmetric matrix, then by the Fréchet derivative defined above, the gradient $\nabla\phi(A)$ is not guaranteed to be symmetric. Also, observe that the dimension of $S^{n \times n}$ is $m = n(n+1)/2$, which allows us to identify $S^{n \times n}$ with $\mathbb{R}^m$. The reduced dimension, along with the fact that Definition 2.3 does not account for the symmetry of the matrix argument of $\phi$, served as a motivation to define a "symmetric gradient" in $\mathbb{R}^{n \times n}$ to account for the symmetry in $S^{n \times n}$.
Claim 2.4. Let $\phi: \mathbb{R}^{n \times n} \to \mathbb{R}$, and let $\phi_{sym}$ be the real-valued function that is the restriction of $\phi$ to $S^{n \times n}$, i.e., $\phi_{sym} := \phi|_{S^{n \times n}}: S^{n \times n} \to \mathbb{R}$. Let $G$ be the gradient of $\phi$ as defined in Definition 2.3. Then $G_s$, the matrix in $S^{n \times n}$ that is claimed to be the "symmetric gradient" of $\phi_{sym}$, is related to the gradient $G$ as follows [15, 17, 18, 29]:
$$G_s(A) = G(A) + G^T(A) - G(A) \circ I,$$
where $\circ$ denotes the element-wise Hadamard product of $G(A)$ and the identity $I$.
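For later comparison, the two competing formulas can be stated compactly in code; the short sketch below is an illustration (the helper names are ours, not from any library), encoding Claim 2.4 alongside the answer $\mathrm{sym}(G)$ that the rest of the paper establishes:

```python
# Two candidate "symmetric gradients" built from the unrestricted gradient G.
import numpy as np

def claimed_symmetric_gradient(G):
    # Claim 2.4 (shown to be false below): G + G^T - G o I,
    # where the Hadamard product G o I retains only the diagonal of G
    return G + G.T - np.diag(np.diag(G))

def symmetric_part(G):
    # sym(G) = (G + G^T)/2, the gradient given by the Frechet derivative
    return (G + G.T) / 2
```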
Theorem 3.8 in the next section will demonstrate that this claim is false. Before that, however, note that $S^{n \times n}$ is a subspace of $\mathbb{R}^{n \times n}$ with the induced inner product in Definition 2.1. Thus, the derivative in Definition 2.3 is naturally defined for all scalar functions of symmetric matrices. The Fréchet derivative of $\phi$, when restricted to the subspace $S^{n \times n}$, automatically accounts for the symmetry structure. Harville [17] notes that the interior of $S^{n \times n}$ is empty, and states that hence Definition 2.3 is not applicable for symmetric matrices. The inference is incorrect, for while it is true that the interior of $S^{n \times n}$ in $\mathbb{R}^{n \times n}$ is empty, the interior of $S^{n \times n}$ in $S^{n \times n}$ is non-empty, and this is the key to Definition 2.5. For completeness, we re-iterate the definition of the Fréchet derivative of $\phi$ restricted to the subspace $S^{n \times n}$.

Definition 2.5. The Fréchet derivative of the function $\phi_{sym} := \phi|_{S^{n \times n}}: S^{n \times n} \to \mathbb{R}$ at $A$ in $S^{n \times n}$ is the unique linear functional $D\phi_{sym}(A)$ on $S^{n \times n}$ such that
$$\lim_{\|H\| \to 0} \frac{|\phi(A + H) - \phi(A) - D\phi_{sym}(A)[H]|}{\|H\|} = 0,$$
for any $H$ in $S^{n \times n}$. The Riesz Representation theorem then asserts the existence of the gradient $G_{sym}(A) := \nabla\phi_{sym}(A)$ in $S^{n \times n}$ such that $\langle G_{sym}(A), H \rangle_F = D\phi_{sym}(A)[H]$.

There is a simple relationship between the gradients of $\phi$ over $\mathbb{R}^{n \times n}$ and over the restricted subspace $S^{n \times n}$; the following corollary states it.

Corollary 2.6. If $G \in \mathbb{R}^{n \times n}$ is the gradient of $\phi: \mathbb{R}^{n \times n} \to \mathbb{R}$, then $G_{sym} = \mathrm{sym}(G)$ is the gradient in $S^{n \times n}$ of $\phi_{sym} := \phi|_{S^{n \times n}}: S^{n \times n} \to \mathbb{R}$.

Proof. From Definition 2.3, we know that $D\phi(A)[H] = \langle G(A), H \rangle_F$ for any $H$ in $\mathbb{R}^{n \times n}$. If we restrict attention to $H$ in $S^{n \times n}$, then
$$D\phi(A)[H] = \langle G(A), H \rangle_F = \langle \nabla\phi_{sym}(A), H \rangle_F.$$
This is true for any $H$ in $S^{n \times n}$, so that by Corollary 2.2 and uniqueness of the gradient, $G_{sym}(A) = \mathrm{sym}(G(A))$ is the gradient in $S^{n \times n}$.

2.1. Example. This example will illustrate the difference between the gradient on $\mathbb{R}^{n \times n}$ and on $S^{n \times n}$. Fix a non-symmetric matrix $A$ in $\mathbb{R}^{n \times n}$ and consider the linear functional $\phi: \mathbb{R}^{n \times n} \to \mathbb{R}$ given by $\phi(X) = \mathrm{tr}(A^T X)$ for any $X$ in $\mathbb{R}^{n \times n}$. The gradient $\nabla\phi$ in $\mathbb{R}^{n \times n}$ is equal to $A$, as defined by the Fréchet derivative of Definition 2.3. However, if $\phi$ is restricted to $S^{n \times n}$, then observe that $\nabla\phi|_{S^{n \times n}} = \mathrm{sym}(A) = (A + A^T)/2$; the gradient in $S^{n \times n}$ given by Corollary 2.6 is ensured to be symmetric.

We will demonstrate that Claim 2.4 is incorrect. In fact, the correct symmetric gradient is the one given by the Fréchet derivative in Definition 2.5 and Corollary 2.6, i.e., $\mathrm{sym}(G)$. To do this, we first illustrate through a simple example how $G_s$ as defined in Claim 2.4 gives an incorrect gradient.

2.2. Geering's example. This short section is meant to highlight the inconsistencies that result from defining a symmetric gradient by Claim 2.4. We reconsider Geering's example [23] and demonstrate the flaw in the argument that led to Claim 2.4. Define $\phi: \mathbb{R}^{2 \times 2} \to \mathbb{R}$ given by $\phi(A) = \det(A)$. For any symmetric matrix $A$ in $S^{2 \times 2}$, let
$$A = \begin{pmatrix} x & y \\ y & z \end{pmatrix}.$$
The gradient defined by the Fréchet derivative of Definition 2.3 is
$$G(A) = \det(A)\, A^{-T} = \begin{pmatrix} z & -y \\ -y & x \end{pmatrix}.$$
If $\phi$ is restricted to $S^{2 \times 2}$, then observe that
$$\mathrm{sym}(G) = \det(A)\, A^{-1} = \begin{pmatrix} z & -y \\ -y & x \end{pmatrix}.$$
Geering identifies $A$ with the triple $[x, y, z]^T$ in $\mathbb{R}^3$, and consequently we identify $\phi(A)$ with $\phi_s(x, y, z) = xz - y^2$ as a functional on $\mathbb{R}^3$. Then $\nabla\phi_s$ in $\mathbb{R}^3$ is given by $[z, -2y, x]^T$. Geering identifies $\nabla\phi_s$ with a matrix
$$G_s = \begin{pmatrix} g_1 & g_2 \\ g_2 & g_3 \end{pmatrix}$$
in $S^{2 \times 2}$, where $g_1 = z$, $g_2 = -2y$, $g_3 = x$. Notice that $G_s$ then agrees with Claim 2.4, in that
$$G_s(A) = \begin{pmatrix} z & -2y \\ -2y & x \end{pmatrix}.$$
However, this identification is inconsistent, because the gradients $\nabla\phi_s$ in $\mathbb{R}^3$ and $G_s$ in $S^{2 \times 2}$ are not independent; rather, for any perturbation
$$H = \begin{pmatrix} h_1 & h_2 \\ h_2 & h_3 \end{pmatrix}$$
identified by $h_s = [h_1, h_2, h_3]^T$, the inner products must agree:
(2.1) $\langle \nabla\phi_s, h_s \rangle_{\mathbb{R}^3} = \langle G_s, H \rangle_F$.
This relationship (2.1) expresses the simple idea that follows from the chain and product rules for derivatives: a small perturbation in the argument, identified either by $H$ or $h_s$, leads to the same change in the value of the function, identified as either $\phi(A)$ or $\phi_s(x, y, z)$.

The crux of the issue is that Geering's identification, which agrees with Claim 2.4, violates this relationship, whereas the correct identification $g_1 = z$, $g_2 = (-2y)/2$, $g_3 = x$ satisfies it. Thus, we have shown using Geering's example that Claim 2.4 cannot hold and that $G_s = \mathrm{sym}(G)$ is the correct gradient. In the subsequent sections, we shall prove (2.1) in greater generality and rigour for any differentiable function $\phi: S^{n \times n} \to \mathbb{R}$, and also show how the same incorrect identification leads to the spurious Claim 2.4.
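The inconsistency is easy to check numerically. The sketch below (an illustration with arbitrary values of $x, y, z$ and an arbitrary symmetric perturbation $H$) tests which lifting satisfies (2.1):

```python
# Geering's example, phi(A) = det(A) on S^{2x2}: check constraint (2.1).
import numpy as np

x, y, z = 1.0, 2.0, 3.0
grad_phi_s = np.array([z, -2 * y, x])        # gradient of phi_s = x*z - y^2

G = np.array([[z, -y], [-y, x]])             # Frechet gradient of det at A
sym_G = (G + G.T) / 2                        # correct symmetric gradient
claim_Gs = G + G.T - np.diag(np.diag(G))     # Claim 2.4: [[z, -2y], [-2y, x]]

H = np.array([[0.3, -0.7], [-0.7, 1.1]])     # arbitrary symmetric perturbation
h_s = np.array([H[0, 0], H[0, 1], H[1, 1]])  # [h1, h2, h3]

lhs = grad_phi_s @ h_s                       # <grad(phi_s), h_s> in R^3
print(np.isclose(lhs, np.trace(sym_G.T @ H)))     # True: sym(G) satisfies (2.1)
print(np.isclose(lhs, np.trace(claim_Gs.T @ H)))  # False: Claim 2.4 violates it
```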
3. Gradient of real-valued functions of symmetric matrices.
Matrices in $\mathbb{R}^{n \times n}$ can be naturally identified with vectors in $\mathbb{R}^{n^2}$. Thus a real-valued function defined on $\mathbb{R}^{n \times n}$ can be naturally identified with a real-valued function defined on $\mathbb{R}^{n^2}$. Moreover, the inner product on $\mathbb{R}^{n \times n}$ defined in Definition 2.1 is naturally identified with the Euclidean inner product on $\mathbb{R}^{n^2}$. This identification is useful when the goal is to find derivatives of scalar functions on $\mathbb{R}^{n \times n}$. The scheme then is to identify the scalar function on $\mathbb{R}^{n \times n}$ with a scalar function on $\mathbb{R}^{n^2}$, compute its gradient, and use the identification to go back and construct the gradient in $\mathbb{R}^{n \times n}$. In the case of symmetric matrices, the equation in Claim 2.4 is claimed to be the identification of the gradient in $S^{n \times n}$ after computations in $\mathbb{R}^m$, since symmetric matrices are identified with $\mathbb{R}^m$, where $m = n(n+1)/2$. In this section, we show that the claim is false. We first begin by formalizing the natural identifications discussed in this paragraph.
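As a concrete preview of this scheme (a toy illustration, using the vec and mat operators formalized below), consider $\phi(X) = \mathrm{tr}(A^T X)$; its identified function on $\mathbb{R}^{n^2}$ is $v \mapsto \langle \mathrm{vec}(A), v \rangle$, whose Euclidean gradient $\mathrm{vec}(A)$ maps back to the matrix gradient $A$:

```python
# Identifying a matrix functional on R^{nxn} with one on R^{n^2} (sketch).
import numpy as np

n = 3
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n))

vec = lambda M: M.reshape(-1, order="F")     # stack columns
mat = lambda v: v.reshape(n, n, order="F")   # inverse reshape

phi = lambda X: np.trace(A.T @ X)            # functional on R^{nxn}
f = lambda v: vec(A) @ v                     # identified functional on R^{n^2}

X = rng.standard_normal((n, n))
print(np.isclose(phi(X), f(vec(X))))         # same function, two views
print(np.allclose(mat(vec(A)), A))           # gradient vec(A) lifts back to A
```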
Definition 3.1. The function $\mathrm{vec}: \mathbb{R}^{n \times n} \to \mathbb{R}^{n^2}$ given by
$$\mathrm{vec}(A) := [A_{11}, \ldots, A_{n1}, A_{12}, \ldots, A_{n2}, \ldots, A_{1n}, \ldots, A_{nn}]^T,$$
obtained by stacking the columns of $A$, identifies a matrix $A$ in $\mathbb{R}^{n \times n}$ with a vector $\mathrm{vec}(A)$ in $\mathbb{R}^{n^2}$.

This operation can be inverted in obvious fashion, i.e., given the vector, one can reshape to form the matrix through the $\mathrm{mat}$ operator defined below.
Definition 3.2. $\mathrm{mat}: \mathbb{R}^{n^2} \to \mathbb{R}^{n \times n}$ is the function given by
$$\mathrm{mat}(\mathrm{vec}(A)) = A,$$
for any $A$ in $\mathbb{R}^{n \times n}$.

The subset $S^{n \times n}$ of $\mathbb{R}^{n \times n}$ is the subspace of all symmetric matrices and the object of investigation in this paper. Since this subspace has dimension $m = n(n+1)/2$, it can be identified with $\mathbb{R}^m$. This identification is given by the elimination operator $P$ defined below.
Definition 3.3. Let $V$ be the range of $\mathrm{vec}$ restricted to $S^{n \times n}$, i.e., $V = \mathrm{vec}(S^{n \times n})$. The elimination operator $P$ is the function $P: V \to \mathbb{R}^m$ that eliminates the redundant entries of a vector $v$ in $V$.

The operator $P$ lets us identify symmetric matrices in $S^{n \times n}$ with vectors in $\mathbb{R}^m$ via the $\mathrm{vech}$ operator defined below.
Definition 3.4. The operator $\mathrm{vech}$ is the function $\mathrm{vech}: S^{n \times n} \to \mathbb{R}^m$ given by
(3.1) $\mathrm{vech}(A) = P\, \mathrm{vec}(A)$,
for any symmetric matrix $A$ in $S^{n \times n}$.
Definition 3.5. Conversely, the duplication operator $D: \mathbb{R}^m \to V$, given by
(3.2) $\mathrm{vec}(A) = D\, \mathrm{vech}(A)$,
for any $A$ in $S^{n \times n}$, acts as the inverse of the elimination operator $P$.
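For computations, $D$ can be realized as an explicit $n^2 \times m$ matrix of zeros and ones. The sketch below (our own construction, following the standard column-major conventions for $\mathrm{vec}$ and $\mathrm{vech}$) builds $D$ and checks Equation (3.2):

```python
# Explicit duplication matrix D with vec(A) = D @ vech(A) for symmetric A.
import numpy as np

def vec(A):
    return A.reshape(-1, order="F")          # column stacking

def vech(A):
    # on-and-below-diagonal entries, column by column (length m = n(n+1)/2)
    return np.concatenate([A[j:, j] for j in range(A.shape[0])])

def duplication_matrix(n):
    m = n * (n + 1) // 2
    D = np.zeros((n * n, m))
    col = 0
    for j in range(n):
        for i in range(j, n):                # half-entry (i, j), i >= j
            D[i + j * n, col] = 1.0          # position of A_ij in vec(A)
            D[j + i * n, col] = 1.0          # position of A_ji (same if i == j)
            col += 1
    return D

n = 3
A = np.array([[1., 2., 3.], [2., 4., 5.], [3., 5., 6.]])  # symmetric
D = duplication_matrix(n)
print(np.allclose(vec(A), D @ vech(A)))      # Equation (3.2) holds
```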
Lemma 3.6. Any $a$ in $\mathbb{R}^m$ can be lifted to a symmetric matrix in $S^{n \times n}$.

Proof. $\mathrm{mat}(D(a))$ lifts $a$ in $\mathbb{R}^m$ to a symmetric matrix in $S^{n \times n}$.

We record some properties of the duplication operator $D$ that will be useful in proving our main theorem, Theorem 3.8, later.
Lemma 3.7. Let $D$ be the duplication operator defined in Definition 3.5. The following are true:
1. $\mathrm{Null}(D) = \{0\}$;
2. $D^T \mathrm{vec}(A) = \mathrm{vech}(A) + \mathrm{vech}(A^T) - \mathrm{vech}(A \circ I)$ for all $A \in S^{n \times n}$;
3. $D^T D \in \mathbb{R}^{m \times m}$ is a positive-definite, symmetric matrix;
4. $(D^T D)^{-1}$ exists.

Proof. See [42].
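These properties are easily confirmed numerically with the explicit $D$ constructed above; a spot-check for $n = 3$ follows (the helpers are repeated so that the block runs on its own):

```python
# Spot-check of Lemma 3.7 for n = 3.
import numpy as np

def vech(A):
    return np.concatenate([A[j:, j] for j in range(A.shape[0])])

def duplication_matrix(n):
    m = n * (n + 1) // 2
    D = np.zeros((n * n, m))
    col = 0
    for j in range(n):
        for i in range(j, n):
            D[i + j * n, col] = 1.0
            D[j + i * n, col] = 1.0
            col += 1
    return D

n = 3
D = duplication_matrix(n)
A = np.array([[1., 2., 3.], [2., 4., 5.], [3., 5., 6.]])  # symmetric

# 1. Null(D) = {0}: D has full column rank m
print(np.linalg.matrix_rank(D) == n * (n + 1) // 2)

# 2. D^T vec(A) = vech(A) + vech(A^T) - vech(A o I)
lhs = D.T @ A.reshape(-1, order="F")
rhs = 2 * vech(A) - vech(np.diag(np.diag(A)))  # A symmetric: vech(A^T) = vech(A)
print(np.allclose(lhs, rhs))

# 3 and 4. D^T D is symmetric positive-definite, hence invertible
print(np.all(np.linalg.eigvalsh(D.T @ D) > 0))
```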
Consider a real-valued function $\phi: \mathbb{R}^{n \times n} \to \mathbb{R}$ and its restriction $\phi_{sym} := \phi|_{S^{n \times n}}: S^{n \times n} \to \mathbb{R}$. Then $\phi_{sym}$ can be identified with a scalar function $\phi_s: \mathbb{R}^m \to \mathbb{R}$. Moreover, there is a relationship between the gradients calculated from the different representations of the function. The next theorem formalizes this concept and demonstrates two fundamental ideas: 1) the notion of the Fréchet derivative naturally carries over to the subspace of symmetric matrices, hence there is no need to identify an equivalent representation of the functional in a lower-dimensional space, and 2) if such an equivalent representation is constructed, a careful analysis leads to the correct gradient defined by the Fréchet derivative.
Theorem 3.8. Consider a real-valued function $\phi: \mathbb{R}^{n \times n} \to \mathbb{R}$ whose restriction $\phi_{sym}$ is the function $\phi_{sym} = \phi|_{S^{n \times n}}: S^{n \times n} \to \mathbb{R}$. Let $G := \nabla\phi$ be the gradient of $\phi$, so that $\nabla\phi_{sym} = \mathrm{sym}(G)$ is the gradient of $\phi_{sym}$. $\phi_{sym}$ can be identified with a scalar function $\phi_s: \mathbb{R}^m \to \mathbb{R}$ given by $\phi_s = \phi_{sym} \circ \mathrm{mat} \circ D$, with $m = n(n+1)/2$. If $\nabla\phi_s$ in $\mathbb{R}^m$ is the gradient of $\phi_s$, then the symmetric matrix $G_s$ in $S^{n \times n}$ given by
$$G_s = \mathrm{mat}(D (D^T D)^{-1} \nabla\phi_s)$$
is the correct "symmetric gradient" of $\phi$, in the sense that
$$\langle G_s, H \rangle_F = \langle \nabla\phi_s, \mathrm{vech}(H) \rangle_{\mathbb{R}^m} = \langle \mathrm{sym}(G), H \rangle_F,$$
for all $H$ in $S^{n \times n}$. Thus, $G_s = \mathrm{sym}(G)$.

Before proving Theorem 3.8, we establish a few useful lemmas that are interesting in their own right. Remark 3.11 will illustrate a plausible argument for Claim 2.4.
Lemma 3.9. Let $A, B$ be two symmetric matrices in $S^{n \times n}$. Then we have the following equivalence:
$$\langle A, B \rangle_F = \langle \mathrm{vec}(A), \mathrm{vec}(B) \rangle_{\mathbb{R}^{n^2}} = \langle D^T D\, \mathrm{vech}(A), \mathrm{vech}(B) \rangle_{\mathbb{R}^m} = \langle \mathrm{vech}(A), D^T D\, \mathrm{vech}(B) \rangle_{\mathbb{R}^m},$$
where $\langle \cdot, \cdot \rangle_{\mathbb{R}^{n^2}}$ and $\langle \cdot, \cdot \rangle_{\mathbb{R}^m}$ are the usual Euclidean inner products.

Proof.
$$\langle A, B \rangle_F = \mathrm{tr}(A^T B) = \langle \mathrm{vec}(A), \mathrm{vec}(B) \rangle_{\mathbb{R}^{n^2}} \quad \text{by Definition 3.1}$$
$$= \langle D\, \mathrm{vech}(A), D\, \mathrm{vech}(B) \rangle_{\mathbb{R}^{n^2}} \quad \text{from Equation (3.2)}$$
$$= \langle D^T D\, \mathrm{vech}(A), \mathrm{vech}(B) \rangle_{\mathbb{R}^m} \quad \text{by definition of the transpose operator.}$$

The observation that $\langle A, B \rangle_F \neq \langle \mathrm{vech}(A), \mathrm{vech}(B) \rangle_{\mathbb{R}^m}$ is a crucial one and lies at the heart of the discrepancy alluded to in the title of this article. Instead, if we want to refactor the inner product of two elements in $\mathbb{R}^m$ into one in $\mathbb{R}^{n \times n}$, one has Lemma 3.10.

Lemma 3.10. For any $a, b$ in $\mathbb{R}^m$,
$$\langle a, b \rangle_{\mathbb{R}^m} = \langle \mathrm{mat}(D (D^T D)^{-1} a), \mathrm{mat}(D b) \rangle_F.$$

Proof.
$$\langle a, b \rangle_{\mathbb{R}^m} = \langle (D^T D)(D^T D)^{-1} a, b \rangle_{\mathbb{R}^m} \quad \text{by Lemma 3.7}$$
$$= \langle D (D^T D)^{-1} a, D b \rangle_{\mathbb{R}^{n^2}} \quad \text{by definition of the transpose operator}$$
$$= \langle \mathrm{mat}(D (D^T D)^{-1} a), \mathrm{mat}(D b) \rangle_F \quad \text{by Lemma 3.9.}$$

We are now ready to prove in full generality, for $\phi_{sym} := \phi|_{S^{n \times n}}: S^{n \times n} \to \mathbb{R}$, what we demonstrated through the example earlier: that Claim 2.4 is false, and that the "symmetric gradient" is $G_s = \mathrm{sym}(G)$, computed using Corollary 2.6.
Proof of Theorem 3.8. The operators defined above establish the commutative diagram
(3.3) [diagram: $\mathrm{vec}$ maps $S^{n \times n}$ to $V \subset \mathbb{R}^{n^2}$, with inverse $\mathrm{mat}$; $P$ maps $V$ to $\mathbb{R}^m$, with inverse $D$; and $\phi_{sym}$, $\phi_s$ map $S^{n \times n}$, $\mathbb{R}^m$, respectively, to $\mathbb{R}$.]
These yield the following relations:
(3.4) $\phi_s = \phi_{sym} \circ \mathrm{mat} \circ D$,
where $\circ$ represents the usual composition of functions and $m = n(n+1)/2$, and
(3.5) $\phi_{sym} = \phi_s \circ P \circ \mathrm{vec}$.

Thus, for any symmetric matrix $H$, the chain rule for Fréchet derivatives yields
(3.6) $D\phi_{sym}(H) = D\phi_s \circ DP \circ D\mathrm{vec}(H)$.
By noting that $P$, $D$ and $\mathrm{vec}$ are linear operators, the equation above yields the following relationship between the Fréchet derivatives:
(3.7) $D\phi_{sym}(H) = D\phi_s \circ \mathrm{vech}(H)$.

With the usual inner products defined earlier, the Riesz Representation Theorem gives us the following relationship between the gradients:
(3.8) $\langle \mathrm{sym}(G), H \rangle_F = \langle \nabla\phi_s, \mathrm{vech}(H) \rangle_{\mathbb{R}^m}$
for each $H$ in $S^{n \times n}$ (see Corollary 2.6 and Subsection 2.1 for the fact that the gradient $\nabla\phi_{sym}$ is given by $\mathrm{sym}(G)$).

Since $\nabla\phi_s$ is a vector in $\mathbb{R}^m$, it can be lifted to the space of symmetric matrices to yield the matrix $G_s$ as in Lemma 3.6. However, such a lifting fails to satisfy the inner product relationship given by (3.8) and yields the incorrect gradient; see Remark 3.11. By Lemma 3.10 and the fact that $\mathrm{mat}(D\, \mathrm{vech}(H)) = \mathrm{mat}(\mathrm{vec}(H)) = H$, we find that the correct lifting is given by $G_s = \mathrm{mat}(D (D^T D)^{-1} \nabla\phi_s)$ in $S^{n \times n}$, such that
(3.9) $\langle \nabla\phi_s, \mathrm{vech}(H) \rangle_{\mathbb{R}^m} = \langle G_s, H \rangle_F$
for each $H$ in $S^{n \times n}$.

To show that $G_s$ is indeed the correct expression for the "symmetric gradient", we need to show that $G_s = \mathrm{sym}(G)$. This follows immediately, since we now have
$$\langle \mathrm{sym}(G), H \rangle_F = \langle G_s, H \rangle_F = \langle \nabla\phi_s, \mathrm{vech}(H) \rangle_{\mathbb{R}^m}.$$
This completes our proof of Theorem 3.8.
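Before discussing how the spurious claim arises, it is instructive to confirm the theorem numerically. The sketch below (an illustration with $\phi = \det$ on $S^{3 \times 3}$; the matrix $A$ and the finite-difference step are arbitrary choices) computes $\nabla\phi_s$ directly in $\mathbb{R}^m$ and verifies that the lifting of Theorem 3.8 recovers $\mathrm{sym}(G)$:

```python
# End-to-end check of Theorem 3.8 with phi(X) = det(X) on S^{3x3}.
import numpy as np

def duplication_matrix(n):
    m = n * (n + 1) // 2
    D = np.zeros((n * n, m))
    col = 0
    for j in range(n):
        for i in range(j, n):
            D[i + j * n, col] = 1.0
            D[j + i * n, col] = 1.0
            col += 1
    return D

n = 3
m = n * (n + 1) // 2
D = duplication_matrix(n)
mat = lambda v: v.reshape(n, n, order="F")

def phi_s(a):
    # phi_s = phi o mat o D (Equation (3.4)), with phi = det
    return np.linalg.det(mat(D @ a))

A = np.array([[4., 1., 0.], [1., 3., 1.], [0., 1., 2.]])   # symmetric
vech_A = np.concatenate([A[j:, j] for j in range(n)])

eps = 1e-6                                   # central finite differences
grad_phi_s = np.array([(phi_s(vech_A + eps * e) - phi_s(vech_A - eps * e))
                       / (2 * eps) for e in np.eye(m)])

G = np.linalg.det(A) * np.linalg.inv(A).T    # Frechet gradient of det
G_s = mat(D @ np.linalg.solve(D.T @ D, grad_phi_s))   # Theorem 3.8 lifting
print(np.allclose(G_s, (G + G.T) / 2, atol=1e-5))     # True: G_s = sym(G)
```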
Remark 3.11. From Equation (3.8) and the property of the duplication operator $D$ stated in Lemma 3.7, we can relate $\nabla\phi_s$ to $\nabla\phi_{sym} = \mathrm{sym}(G)$ in the following way:
(3.10) $\nabla\phi_s = D^T \mathrm{vec}(\nabla\phi_{sym}) = \mathrm{vech}(\nabla\phi_{sym}) + \mathrm{vech}(\nabla\phi_{sym}^T) - \mathrm{vech}(\nabla\phi_{sym} \circ I)$.
This is equivalent to
$$\nabla\phi_s = \mathrm{vech}(G + G^T - G \circ I).$$
If we ignore Lemma 3.10 and the constraint Equation (3.8), and instead naively use Lemma 3.6 to set $G_s = \mathrm{mat}(D \nabla\phi_s)$, as illustrated in the example in Subsection 2.2 earlier, we get
$$G_s = \mathrm{mat}(D\, \mathrm{vech}(G + G^T - G \circ I)),$$
which simplifies to
$$G_s = \mathrm{mat}(\mathrm{vec}(G + G^T - G \circ I)) = G + G^T - G \circ I.$$
Thus, we have shown that the same fundamental flaw discovered in Subsection 2.2 underpins the "proof" of the spurious Claim 2.4.
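Continuing the sketch above (same helpers and variables as in the previous code block), the naive lifting reproduces exactly the formula of Claim 2.4:

```python
# Continuing the previous sketch: omit the (D^T D)^{-1} factor, i.e., apply
# the Lemma 3.6 lifting to the gradient, and recover the spurious formula.
naive_G_s = mat(D @ grad_phi_s)
claim_2_4 = G + G.T - np.diag(np.diag(G))            # G + G^T - G o I
print(np.allclose(naive_G_s, claim_2_4, atol=1e-5))  # True: same flawed result
```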
Remark 3.12. Though we have worked over the field $\mathbb{R}$, the same arguments will be valid for matrix functionals defined over the complex field $\mathbb{C}$, with an appropriate modification of the definition of the inner product in Definition 2.1.

Remark 3.13. The Fréchet derivative over a subspace of $\mathbb{R}^{n \times n}$ can be obtained from the Fréchet derivative over $\mathbb{R}^{n \times n}$ by an appropriate projection/restriction to the relevant linear manifold, as shown in Corollary 2.6. In this paper, the linear manifold was the set of symmetric matrices, designated $S^{n \times n}$. One can adapt the same ideas expressed here to obtain the derivative over the subspace of skew-symmetric, diagonal, upper-triangular, or lower-triangular matrices. However, note that this remark does not apply to the set of orthogonal matrices, since it is not a linear manifold.

Thus, the crux of the theorem was the recognition that while the lifting of an element in $\mathbb{R}^m$ to $S^{n \times n}$ follows Lemma 3.6, the gradient in $\mathbb{R}^m$ must satisfy Equation (3.8) and instead must be lifted by Lemma 3.10.

In the context of an application, the implication of Theorem 3.8 depends on the way the gradient is calculated, and on what the quantity of interest is. If the quantity of interest is the gradient of a scalar function defined over $S^{n \times n}$, then the correct gradient is the one given by Theorem 3.8, whatever be the method used to evaluate it. However, if the quantity of interest is not the gradient but the argument of the function, say one obtained by gradient descent in an optimization algorithm, then the implication depends on how the argument is represented. If the argument is represented as an element of $S^{n \times n}$, then the gradient $\mathrm{sym}(G)$ should be used. Most solvers and optimization routines, however, do not accept matrices as arguments. In these cases, one actually works with the function $\phi_s$, whose gradient $\nabla\phi_s$ in $\mathbb{R}^m$ poses no complications. However, the output argument returned by the gradient descent will still lie in $\mathbb{R}^m$ and will have to be lifted to $S^{n \times n}$ by Lemma 3.6 to be useful, as the sketch below illustrates.
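In the following hedged sketch of this workflow (the objective, step size, and iteration count are arbitrary illustrative choices), gradient descent runs on $\phi_s$ over $\mathbb{R}^m$, and the returned argument, not the gradient, is lifted to $S^{n \times n}$ by Lemma 3.6:

```python
# Gradient descent in R^m, with the final argument lifted by Lemma 3.6.
import numpy as np

def duplication_matrix(n):
    m = n * (n + 1) // 2
    D = np.zeros((n * n, m))
    col = 0
    for j in range(n):
        for i in range(j, n):
            D[i + j * n, col] = 1.0
            D[j + i * n, col] = 1.0
            col += 1
    return D

n = 3
m = n * (n + 1) // 2
D = duplication_matrix(n)
mat = lambda v: v.reshape(n, n, order="F")

# Illustrative objective on S^{3x3}: phi(X) = ||X - B||_F^2, B symmetric.
B = np.array([[2., 1., 0.], [1., 3., 1.], [0., 1., 4.]])

def grad_phi_s(a):
    # chain rule: grad(phi_s)(a) = D^T vec(2 (mat(Da) - B))
    return D.T @ (2 * (mat(D @ a) - B)).reshape(-1, order="F")

a = np.zeros(m)                      # the argument lives in R^m
for _ in range(200):
    a -= 0.1 * grad_phi_s(a)         # descent poses no complications in R^m

X = mat(D @ a)                       # lift the argument via Lemma 3.6
print(np.allclose(X, B))             # minimizer of phi is B itself
```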
4. Conclusions.
In this article, we investigated the two different notions of a gradient that exist for a real-valued function when the argument is a symmetric matrix. The first notion is the mathematical definition of a Fréchet derivative on the space of symmetric matrices. The other definition aims to eliminate the redundant degrees of freedom present in a symmetric matrix, perform the gradient calculation in the space of reduced dimension, and finally map the result back into the space of matrices. We showed, both through an example and rigorously through a theorem, that the problem in the second approach lies in the final step, as the gradient in the reduced-dimension space is mapped into a symmetric matrix. Moreover, the approach does not recognize that Definition 2.5, restricted to $S^{n \times n}$, already accounts for the symmetry in the matrix argument; thus there is no need to identify an equivalent representation of the functional in a lower-dimensional space of dimension $m = n(n+1)/2$. However, we demonstrated that if such an equivalent representation is constructed, a careful analysis of the lifting from $\mathbb{R}^m$ to $S^{n \times n}$ leads to the correct gradient, $G_s = \mathrm{sym}(G)$.
REFERENCES

[1] James Munkres. Analysis on Manifolds. Addison-Wesley, New York, 1991.
[2] Ward Cheney. Analysis for Applied Mathematics, volume 208. Springer Science & Business Media, 2013.
[3] Michael Athans and Fred C. Schweppe. Gradient matrices and matrix calculations. Technical Note 1965-53, Massachusetts Institute of Technology Lincoln Laboratory, Nov 1965. [Online; accessed 1 Nov. 2019].
[4] J. R. Magnus and H. Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley, New York, 1988.
[5] Morton E. Gurtin, Eliot Fried, and Lallit Anand. Mechanics and Thermodynamics of Continua. Cambridge University Press, 2010.
[6] Ray W. Ogden. Non-linear Elastic Deformations. Dover Publications, New York, 1997.
[7] Mikhail Itskov. Tensor Algebra and Tensor Analysis for Engineers. Springer, Cham, 2019.
[8] Paul S. Dwyer. Some applications of matrix derivatives in multivariate analysis. Journal of the American Statistical Association, 62(318):607–625, Jun 1967.
[9] Derrick S. Tracy and Paul S. Dwyer. Multivariate maxima and minima with matrix derivatives. Journal of the American Statistical Association, 64(328):1576–1594, Dec 1969.
[10] Friedrich Gebhardt. Maximum likelihood solution to factor analysis when some factors are completely specified. Psychometrika, 36(2):155–163, Jun 1971.
[11] D. S. Tracy and R. P. Singh. Some applications of matrix differentiation in the general analysis of covariance structures. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 37(2):269–280, Apr 1975.
[12] H. V. Henderson and S. R. Searle. Vec and vech operators for matrices, with some uses in Jacobians and multivariate statistics. Canadian Journal of Statistics, 7:65–81, 1979.
[13] D. G. Nel. On matrix differentiation in statistics. South African Statistical Journal, 14(2):137–193, Jan 1980.
[14] G. S. Rogers. Matrix Derivatives. Marcel Dekker, New York, 1980.
[15] Charles E. McCulloch. Symmetric matrix derivatives with applications. Journal of the American Statistical Association, Mar 1980.
[16] S. R. Searle. Matrix Algebra for Statistics. John Wiley, New York, 1982.
[17] David A. Harville. Matrix Algebra from a Statistician's Perspective. Springer-Verlag, New York, 1997.
[18] George Arthur Frederick Seber. A Matrix Handbook for Statisticians. John Wiley & Sons, New Jersey, 2008.
[19] A. M. Mathai. Jacobians of Matrix Transformations and Functions of Matrix Argument. World Scientific, Singapore, 1997.
[20] Thomas Minka. Old and New Matrix Algebra Useful for Statistics, 2001. [Online; accessed 1 Nov. 2019].
[21] Michael Athans. Matrix minimum principle. Report ESL-R-317, Massachusetts Institute of Technology Lincoln Laboratory, Aug 1967. [Online; accessed 4 Mar. 2020].
[22] Michael Athans. The matrix minimum principle. Information and Control, 11(5):592–606, Nov 1967.
[23] H. Geering. On calculating gradient matrices. IEEE Transactions on Automatic Control, 21(4):615–616, Aug 1976.
[24] J. Brewer. The gradient with respect to a symmetric matrix. IEEE Transactions on Automatic Control, 22(2):265–267, Apr 1977.
[25] P. Walsh. On symmetric matrices and the matrix minimum principle. IEEE Transactions on Automatic Control, 22(6):995–996, Dec 1977.
[26] A. E. Yanchevsky and V. J. Hirvonen. Optimization of feedback systems with constrained information flow. International Journal of Systems Science, 12(12):1459–1468, 1981.
[27] A.-M. Parring. About the concept of the matrix derivative. Linear Algebra and its Applications, 176:223–235, Nov 1992.
[28] D. H. van Hessem and O. H. Bosgra. A full solution to the constrained stochastic closed-loop MPC problem via state and innovations feedback and its receding horizon implementation. In Proceedings of the IEEE Conference on Decision and Control, volume 1, pages 929–934, 2003.
[29] Kaare Brandt Petersen and Michael Syskind Pedersen. The Matrix Cookbook, Nov 2012. Version 20121115.
[30] Iain Murray. Differentiation of the Cholesky decomposition. ArXiv e-prints, Feb 2016.
[31] me10240 (https://math.stackexchange.com/users/66158/me10240). Making sense of matrix derivative formula for determinant of symmetric matrix as a Fréchet derivative? Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/2436680 (version: 2017-09-21).
[32] tomka (https://math.stackexchange.com/users/118706/tomka). What is the derivative of the determinant of a symmetric positive definite matrix? Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/1981210 (version: 2017-04-13).
[33] wueb (https://math.stackexchange.com/users/238307/wueb). Understanding notation of derivatives of a matrix. Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/2131708 (version: 2017-04-13).
[34] me10240 (https://math.stackexchange.com/users/66158/me10240). Can derivative formulae in matrix cookbook be interpreted as Fréchet derivatives? Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/2508276 (version: 2017-11-06).
[35] TopDog (https://math.stackexchange.com/users/569224/topdog). Matrix gradient of ln(det(X)). Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/2816512 (version: 2018-06-12).
[36] jds (https://math.stackexchange.com/users/31891/jds). Derivative with respect to symmetric matrix. Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/3005374 (version: 2018-11-19).
[37] kMaster (https://math.stackexchange.com/users/58360/kmaster). Derivative of the inverse of a symmetric matrix. Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/982386 (version: 2014-10-20).
[38] William Thomson, Lord Kelvin. Elements of a mathematical theory of elasticity. Philosophical Transactions of the Royal Society of London, 146:481–498, 1856.
[39] W. Voigt. Lehrbuch der Kristallphysik. B. G. Teubner, Leipzig, 1910.
[40] Steven Roman. Advanced Linear Algebra, volume 3. Springer.
[41] Sören Laue, Matthias Mitterreiter, and Joachim Giesen. Computing higher order derivatives of matrix and tensor expressions. In Advances in Neural Information Processing Systems (NIPS), 2018.
[42] Jan R. Magnus and H. Neudecker. The elimination matrix: some lemmas and applications. SIAM Journal on Algebraic and Discrete Methods, 1(4):422–449, 1980.