What is the gradient of a scalar function of a symmetric matrix?
SHRIRAM SRINIVASAN AND NISHANT PANDA

Applied Mathematics Group (T-5), Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545 ([email protected]). SS thanks Los Alamos National Laboratory (US) for funding.
Abstract. Perusal of research articles in the statistics and electrical engineering communities that deal with the topic of matrix calculus reveals a different approach to calculation of the gradient of a real-valued function of a symmetric matrix. In contrast to the standard approach, wherein the gradient is calculated using the definition of a Fréchet derivative for matrix functionals, there is a notion of the gradient that explicitly takes into account the symmetry of the matrix, and this "symmetric gradient" $G_s$ is reported to be related to the gradient $G$, computed by ignoring symmetry, as $G_s = G + G^T - G \circ I$, where $\circ$ denotes the elementwise Hadamard product of the two matrices. The idea of the "symmetric gradient" has now appeared in several publications, as well as in textbooks and handbooks on matrix calculus which are often cited in the context of machine learning. After setting up the problem in a finite-dimensional inner-product space, we demonstrate rigorously that $G_s = (G + G^T)/2$.

Key words. matrix calculus, symmetric matrix, Fréchet derivative, gradient, matrix functional
AMS subject classifications.
1. Introduction.
Matrix functionals defined over an inner-product space of square matrices are a common construct in applied mathematics. In most cases, the object of interest is not the matrix functional itself, but its derivative or gradient (if it be differentiable), and this notion is unambiguous. The Fréchet derivative, see for e.g. [1] and [2], being a linear functional, readily yields the definition of the gradient via the Riesz Representation Theorem. However, there is a sub-class of matrix functionals that frequently occurs in practice whose argument is a symmetric matrix. Such functionals and their gradients occur in the analysis and control of dynamical systems which are described by matrix differential equations [3], maximum likelihood estimation in statistics, econometrics and machine learning [4], and in the theory of elasticity and continuum thermodynamics [5, 6]. For this sub-class of matrix functionals with a symmetric matrix argument, there appear to be two approaches to define the gradient that lead to different results.

By working with the definition of the Fréchet derivative over the vector space of square matrices and specializing it to that of the symmetric matrices, which are a proper subspace, the gradient (denoted by $G_{sym}$ for convenience) can be obtained as described in [7]. However, a different idea emerged through matrix calculus as practiced by the statistics and control systems communities: that of a "symmetric gradient". The root of this idea is the fact that while the space of square matrices $\mathbb{R}^{n \times n}$ has dimension $n^2$, the subspace of symmetric matrices has a dimension of $n(n+1)/2$. The second approach aims to explicitly take into account the symmetry of the matrix elements, and view the matrix functional as one defined on the vector
space $\mathbb{R}^{n(n+1)/2}$, compute its gradient in this space, and finally reinterpret it as a symmetric matrix (the "symmetric gradient" $G_s$) in $\mathbb{R}^{n \times n}$. However, the gradients computed by the two different methods, $G_{sym}$ and $G_s$, are not equal. The question raised in the title of this article refers to this dichotomy.

A perusal of the literature reveals how the idea of the "symmetric gradient" came into being among the community of statisticians and electrical engineers that dominantly used matrix calculus. Early work in statistics in the 1960s such as [8, 9] does not make any mention of a need for special formulae to treat the case of a symmetric matrix, but does note that all the matrix elements must be "functionally independent". The notion of "independence" of matrix elements was a recurring theme, and symmetric matrices, by dint of equality of the off-diagonal elements, violated this condition. Gebhardt [10] in 1971 seems to have been the first to remark that the derivative formulae do not consider symmetry explicitly, but he concluded that no adjustment was necessary in his case since the gradient obtained was already symmetric. Tracy and Singh [11] in 1975 echo the same sentiments as Gebhardt about the need for special formulae. By the end of the decade, the "symmetric gradient" makes its appearance in some form or the other in the work of Henderson [12] in 1979, a review by Nel [13], and a book by Rogers [14] in 1980. McCulloch [15] proves the expression for the "symmetric gradient" that we quote here and notes that it applies to calculating derivatives with respect to variance-covariance matrices, and thus derives the information matrix for the multivariate normal distribution. By 1982, the "symmetric gradient" is included in the authoritative and influential textbook by Searle [16]. Today the idea is firmly entrenched, as evidenced by the books [17, 18, 19] and the notes by Minka [20].

The idea of the "symmetric gradient" seems to have come up in the control systems community (as represented by publications in IEEE) as an offshoot of the extension of the Pontryagin Maximum principle for matrices of controls and states, when Athans and Schweppe [3] remark that the formulae for gradient matrices are derived under the assumption of "functional independence" of matrix elements. Later, they warned [21, 22] that special formulae were necessary for symmetric matrices. Geering [23] in 1976 exhibited an example calculation (the gradient of the determinant of a symmetric $2 \times 2$ matrix) to justify the definition of a "symmetric gradient". We shall show that his reasoning was flawed. Brewer [24] in 1977 remarked that the formulae for gradient matrices in [3, 22, 21] can only be applied when the elements of the matrix are independent, which is not true for a symmetric matrix, and so proceeded to derive a general formula for the "symmetric gradient" (identical to McCulloch [15]) through the rules of matrix calculus, for use in sensitivity analysis of optimal estimation systems. The same flaw underpins the example calculation of Geering [23] and the putative "proof" of the expression for $G_s$ derived by Brewer [24].
Following on from these works, it appears in other instances [25, 26, 27, 28]. At present, the "symmetric gradient" formula is also recorded in [29], a handy reference for engineers and scientists working on inter-disciplinary topics using statistics and machine learning, and the formula's appearance in [30] shows that it is no longer restricted to a particular community of researchers.

Thus, both notions of the gradient are well-established, and hence the fact that these two notions do not agree is a source of enormous confusion for researchers who straddle application areas, a point to which the authors can emphatically attest. On the popular site Mathematics Stack Exchange, there are multiple questions (for example [31, 32, 33, 34, 35, 36, 37]) related to this theme, but their answers deepen and misguide rather than alleviate the existing confusion. Depending on the context, this disagreement between the two notions of gradient has implications of varying import. We shall show that the expression for $G_s$ in $\mathbb{R}^{n \times n}$ is a misinterpretation of the gradient in $\mathbb{R}^{n(n+1)/2}$. In other words, the lifting from $\mathbb{R}^{n(n+1)/2}$ to $\mathbb{R}^{n \times n}$ is incorrect. When interpreted correctly, we are inexorably led to $G_s = G_{sym}$.

That finally brings us to the most important reason for writing this article, which is that derivatives and gradients are fundamental ideas, and there should not be any ambiguity about their definitions. Thus, we felt the urgent need to clarify the issues muddying the waters, and show that the "symmetric gradient", when calculated correctly, leads to the expected result.

The paper is organized as follows: after stating the problem, we begin with two illustrative examples in Section 2 that allow us to see concretely what we later prove in the abstract. After that, Section 3 lays out all the machinery of linear algebra that we shall need, ending with the proof of the main result.
2. Problem formulation.
To fix our notation, we introduce the following. We denote by $S^{n \times n}$ the subspace of all symmetric matrices in $\mathbb{R}^{n \times n}$. The space $\mathbb{R}^{n \times n}$ (and subsequently $S^{n \times n}$) is an inner product space with the following natural inner product $\langle \cdot, \cdot \rangle_F$.

Definition 2.1. For two matrices $A, B$ in $\mathbb{R}^{n \times n}$,
$$\langle A, B \rangle_F := \mathrm{tr}(A^T B)$$
defines an inner product and induces the Frobenius norm on $\mathbb{R}^{n \times n}$ via $\|A\|_F := \sqrt{\mathrm{tr}(A^T A)}$.

Corollary 2.2. We collect a few useful facts about the inner product defined above essential for this paper.
1. For $A$ symmetric and $B$ skew-symmetric in $\mathbb{R}^{n \times n}$, $\langle A, B \rangle_F = 0$.
2. If $\langle A, H \rangle_F = 0$ for every $H$ in $S^{n \times n}$, then the symmetric part of $A$, given by $\mathrm{sym}(A) := (A + A^T)/2$, is equal to $0$.
3. For $A$ in $\mathbb{R}^{n \times n}$ and $H$ in $S^{n \times n}$, $\langle A, H \rangle_F = \langle \mathrm{sym}(A), H \rangle_F$.
Proof. See, for e.g., [40].
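These facts are easy to confirm numerically. The following minimal NumPy sketch (an illustration added for concreteness, with arbitrary random test matrices) verifies facts 1 and 3:

```python
# Numerical spot-check of Corollary 2.2 (illustrative, not part of the proof).
import numpy as np

rng = np.random.default_rng(0)
n = 4

def frob(A, B):
    # <A, B>_F := tr(A^T B), Definition 2.1
    return np.trace(A.T @ B)

def sym(A):
    # symmetric part: sym(A) := (A + A^T)/2
    return (A + A.T) / 2

A = rng.standard_normal((n, n))            # arbitrary square matrix
H = sym(rng.standard_normal((n, n)))       # arbitrary symmetric matrix
S = sym(rng.standard_normal((n, n)))       # symmetric
W = rng.standard_normal((n, n))
W = (W - W.T) / 2                          # skew-symmetric

print(np.isclose(frob(S, W), 0.0))              # fact 1: <sym, skew>_F = 0
print(np.isclose(frob(A, H), frob(sym(A), H)))  # fact 3
```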
Consider a real-valued function $\phi: \mathbb{R}^{n \times n} \to \mathbb{R}$. We say that $\phi$ is differentiable if its Fréchet derivative, defined below, exists.

Definition 2.3. The Fréchet derivative of $\phi$ at $A$ in $\mathbb{R}^{n \times n}$ is the unique linear functional $D\phi(A)$ on $\mathbb{R}^{n \times n}$ such that
$$\lim_{\|H\| \to 0} \frac{|\phi(A + H) - \phi(A) - D\phi(A)[H]|}{\|H\|} = 0,$$
for any $H$ in $\mathbb{R}^{n \times n}$. The Riesz Representation theorem then asserts the existence of the gradient $\nabla\phi(A)$ in $\mathbb{R}^{n \times n}$ such that
$$\langle \nabla\phi(A), H \rangle_F = D\phi(A)[H].$$

Note that if $A$ is a symmetric matrix, then by the Fréchet derivative defined above, the gradient $\nabla\phi(A)$ is not guaranteed to be symmetric. Also, observe that the dimension of $S^{n \times n}$ is $m = n(n+1)/2$, which allows us to identify $S^{n \times n}$ with $\mathbb{R}^m$. The reduced dimension, along with the fact that Definition 2.3 does not account for the symmetry of the matrix argument of $\phi$, served as a motivation to define a "symmetric gradient" in $\mathbb{R}^{n \times n}$ to account for the symmetry in $S^{n \times n}$.
Claim 2.4. Let $\phi: \mathbb{R}^{n \times n} \to \mathbb{R}$, and let $\phi_{sym}$ be the real-valued function that is the restriction of $\phi$ to $S^{n \times n}$, i.e., $\phi_{sym} := \phi|_{S^{n \times n}}: S^{n \times n} \to \mathbb{R}$. Let $G$ be the gradient of $\phi$ as defined in Definition 2.3. Then $G_s$, the matrix in $S^{n \times n}$ that is claimed to be the "symmetric gradient" of $\phi_{sym}$, is related to the gradient $G$ as follows [15, 17, 18, 29]:
$$G_s(A) = G(A) + G^T(A) - G(A) \circ I,$$
where $\circ$ denotes the element-wise Hadamard product of $G(A)$ and the identity $I$.
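For later comparison, the two competing formulas can be stated compactly in code; the short sketch below is an illustration (the helper names are ours, not from any library), encoding Claim 2.4 alongside the answer $\mathrm{sym}(G)$ that the rest of the paper establishes:

```python
# Two candidate "symmetric gradients" built from the unrestricted gradient G.
import numpy as np

def claimed_symmetric_gradient(G):
    # Claim 2.4 (shown to be false below): G + G^T - G o I,
    # where the Hadamard product G o I retains only the diagonal of G
    return G + G.T - np.diag(np.diag(G))

def symmetric_part(G):
    # sym(G) = (G + G^T)/2, the gradient given by the Frechet derivative
    return (G + G.T) / 2
```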
Theorem 3.8 in the next section will demonstrate that this claim is false. Before that, however, note that $S^{n \times n}$ is a subspace of $\mathbb{R}^{n \times n}$ with the induced inner product in Definition 2.1. Thus, the derivative in Definition 2.3 is naturally defined for all scalar functions of symmetric matrices. The Fréchet derivative of $\phi$, when restricted to the subspace $S^{n \times n}$, automatically accounts for the symmetry structure. Harville [17] notes that the interior of $S^{n \times n}$ is empty, and states that hence Definition 2.3 is not applicable for symmetric matrices. The inference is incorrect, for while it is true that the interior of $S^{n \times n}$ in $\mathbb{R}^{n \times n}$ is empty, the interior of $S^{n \times n}$ in $S^{n \times n}$ is non-empty, and this is the key to Definition 2.5. For completeness, we re-iterate the definition of the Fréchet derivative of $\phi$ restricted to the subspace $S^{n \times n}$.

Definition 2.5. The Fréchet derivative of the function $\phi_{sym} := \phi|_{S^{n \times n}}: S^{n \times n} \to \mathbb{R}$ at $A$ in $S^{n \times n}$ is the unique linear functional $D\phi_{sym}(A)$ on $S^{n \times n}$ such that
$$\lim_{\|H\| \to 0} \frac{|\phi(A + H) - \phi(A) - D\phi_{sym}(A)[H]|}{\|H\|} = 0,$$
for any $H$ in $S^{n \times n}$. The Riesz Representation theorem then asserts the existence of the gradient $G_{sym}(A) := \nabla\phi_{sym}(A)$ in $S^{n \times n}$ such that $\langle G_{sym}(A), H \rangle_F = D\phi_{sym}(A)[H]$.

There is a simple relationship between the gradients of $\phi$ over $\mathbb{R}^{n \times n}$ and over the restricted subspace $S^{n \times n}$; the following corollary states it.

Corollary 2.6. If $G \in \mathbb{R}^{n \times n}$ is the gradient of $\phi: \mathbb{R}^{n \times n} \to \mathbb{R}$, then $G_{sym} = \mathrm{sym}(G)$ is the gradient in $S^{n \times n}$ of $\phi_{sym} := \phi|_{S^{n \times n}}: S^{n \times n} \to \mathbb{R}$.

Proof. From Definition 2.3, we know that $D\phi(A)[H] = \langle G(A), H \rangle_F$ for any $H$ in $\mathbb{R}^{n \times n}$. If we restrict attention to $H$ in $S^{n \times n}$, then
$$D\phi(A)[H] = \langle G(A), H \rangle_F = \langle \nabla\phi_{sym}(A), H \rangle_F.$$
This is true for any $H$ in $S^{n \times n}$, so that by Corollary 2.2 and uniqueness of the gradient, $G_{sym}(A) = \mathrm{sym}(G(A))$ is the gradient in $S^{n \times n}$.

2.1. Example. This example will illustrate the difference between the gradient on $\mathbb{R}^{n \times n}$ and on $S^{n \times n}$. Fix a non-symmetric matrix $A$ in $\mathbb{R}^{n \times n}$ and consider the linear functional $\phi: \mathbb{R}^{n \times n} \to \mathbb{R}$ given by $\phi(X) = \mathrm{tr}(A^T X)$ for any $X$ in $\mathbb{R}^{n \times n}$. The gradient $\nabla\phi$ in $\mathbb{R}^{n \times n}$ is equal to $A$, as defined by the Fréchet derivative of Definition 2.3. However, if $\phi$ is restricted to $S^{n \times n}$, then observe that $\nabla\phi|_{S^{n \times n}} = \mathrm{sym}(A) = (A + A^T)/2$; the gradient in $S^{n \times n}$ given by Corollary 2.6 is ensured to be symmetric.

We will demonstrate that Claim 2.4 is incorrect. In fact, the correct symmetric gradient is the one given by the Fréchet derivative in Definition 2.5 and Corollary 2.6, i.e., $\mathrm{sym}(G)$. To do this, we first illustrate through a simple example how $G_s$ as defined in Claim 2.4 gives an incorrect gradient.

2.2. Geering's example. This short section is meant to highlight the inconsistencies that result from defining a symmetric gradient by Claim 2.4. We reconsider Geering's example [23] and demonstrate the flaw in the argument that led to Claim 2.4. Define $\phi: \mathbb{R}^{2 \times 2} \to \mathbb{R}$ given by $\phi(A) = \det(A)$. For any symmetric matrix $A$ in $S^{2 \times 2}$, let
$$A = \begin{pmatrix} x & y \\ y & z \end{pmatrix}.$$
The gradient defined by the Fréchet derivative of Definition 2.3 is
$$G(A) = \det(A)\, A^{-T} = \begin{pmatrix} z & -y \\ -y & x \end{pmatrix}.$$
If $\phi$ is restricted to $S^{2 \times 2}$, then observe that
$$\mathrm{sym}(G) = \det(A)\, A^{-1} = \begin{pmatrix} z & -y \\ -y & x \end{pmatrix}.$$
Geering identifies $A$ with the triple $[x, y, z]^T$ in $\mathbb{R}^3$, and consequently we identify $\phi(A)$ with $\phi_s(x, y, z) = xz - y^2$ as a functional on $\mathbb{R}^3$. Then $\nabla\phi_s$ in $\mathbb{R}^3$ is given by $[z, -2y, x]^T$. Geering identifies $\nabla\phi_s$ with a matrix
$$G_s = \begin{pmatrix} g_1 & g_2 \\ g_2 & g_3 \end{pmatrix}$$
in $S^{2 \times 2}$, where $g_1 = z$, $g_2 = -2y$, $g_3 = x$. Notice that $G_s$ then agrees with Claim 2.4, in that
$$G_s(A) = \begin{pmatrix} z & -2y \\ -2y & x \end{pmatrix}.$$
However, this identification is inconsistent, because the gradients $\nabla\phi_s$ in $\mathbb{R}^3$ and $G_s$ in $S^{2 \times 2}$ are not independent; rather, for any perturbation
$$H = \begin{pmatrix} h_1 & h_2 \\ h_2 & h_3 \end{pmatrix}$$
identified by $h_s = [h_1, h_2, h_3]^T$, the inner products must agree:
(2.1) $\langle \nabla\phi_s, h_s \rangle_{\mathbb{R}^3} = \langle G_s, H \rangle_F$.
This relationship (2.1) expresses the simple idea that follows from the chain and product rules for derivatives: a small perturbation in the argument, identified either by $H$ or $h_s$, leads to the same change in the value of the function, identified as either $\phi(A)$ or $\phi_s(x, y, z)$.

The crux of the issue is that Geering's identification, which agrees with Claim 2.4, violates this relationship, whereas the correct identification $g_1 = z$, $g_2 = (-2y)/2$, $g_3 = x$ satisfies it. Thus, we have shown using Geering's example that Claim 2.4 cannot hold and that $G_s = \mathrm{sym}(G)$ is the correct gradient. In the subsequent sections, we shall prove (2.1) in greater generality and rigour for any differentiable function $\phi: S^{n \times n} \to \mathbb{R}$, and also show how the same incorrect identification leads to the spurious Claim 2.4.
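The inconsistency is easy to check numerically. The sketch below (an illustration with arbitrary values of $x, y, z$ and an arbitrary symmetric perturbation $H$) tests which lifting satisfies (2.1):

```python
# Geering's example, phi(A) = det(A) on S^{2x2}: check constraint (2.1).
import numpy as np

x, y, z = 1.0, 2.0, 3.0
grad_phi_s = np.array([z, -2 * y, x])        # gradient of phi_s = x*z - y^2

G = np.array([[z, -y], [-y, x]])             # Frechet gradient of det at A
sym_G = (G + G.T) / 2                        # correct symmetric gradient
claim_Gs = G + G.T - np.diag(np.diag(G))     # Claim 2.4: [[z, -2y], [-2y, x]]

H = np.array([[0.3, -0.7], [-0.7, 1.1]])     # arbitrary symmetric perturbation
h_s = np.array([H[0, 0], H[0, 1], H[1, 1]])  # [h1, h2, h3]

lhs = grad_phi_s @ h_s                       # <grad(phi_s), h_s> in R^3
print(np.isclose(lhs, np.trace(sym_G.T @ H)))     # True: sym(G) satisfies (2.1)
print(np.isclose(lhs, np.trace(claim_Gs.T @ H)))  # False: Claim 2.4 violates it
```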
3. Gradient of real-valued functions of symmetric matrices.
Matrices in $\mathbb{R}^{n \times n}$ can be naturally identified with vectors in $\mathbb{R}^{n^2}$. Thus a real-valued function defined on $\mathbb{R}^{n \times n}$ can be naturally identified with a real-valued function defined on $\mathbb{R}^{n^2}$. Moreover, the inner product on $\mathbb{R}^{n \times n}$ defined in Definition 2.1 is naturally identified with the Euclidean inner product on $\mathbb{R}^{n^2}$. This identification is useful when the goal is to find derivatives of scalar functions on $\mathbb{R}^{n \times n}$. The scheme then is to identify the scalar function on $\mathbb{R}^{n \times n}$ with a scalar function on $\mathbb{R}^{n^2}$, compute its gradient, and use the identification to go back and construct the gradient in $\mathbb{R}^{n \times n}$. In the case of symmetric matrices, the equation in Claim 2.4 is claimed to be the identification of the gradient in $S^{n \times n}$ after computations in $\mathbb{R}^m$, since symmetric matrices are identified with $\mathbb{R}^m$, where $m = n(n+1)/2$. In this section, we show that the claim is false. We first begin by formalizing the natural identifications discussed in this paragraph.
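As a concrete preview of this scheme (a toy illustration, using the vec and mat operators formalized below), consider $\phi(X) = \mathrm{tr}(A^T X)$; its identified function on $\mathbb{R}^{n^2}$ is $v \mapsto \langle \mathrm{vec}(A), v \rangle$, whose Euclidean gradient $\mathrm{vec}(A)$ maps back to the matrix gradient $A$:

```python
# Identifying a matrix functional on R^{nxn} with one on R^{n^2} (sketch).
import numpy as np

n = 3
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n))

vec = lambda M: M.reshape(-1, order="F")     # stack columns
mat = lambda v: v.reshape(n, n, order="F")   # inverse reshape

phi = lambda X: np.trace(A.T @ X)            # functional on R^{nxn}
f = lambda v: vec(A) @ v                     # identified functional on R^{n^2}

X = rng.standard_normal((n, n))
print(np.isclose(phi(X), f(vec(X))))         # same function, two views
print(np.allclose(mat(vec(A)), A))           # gradient vec(A) lifts back to A
```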
Definition 3.1. The function $\mathrm{vec}: \mathbb{R}^{n \times n} \to \mathbb{R}^{n^2}$ given by
$$\mathrm{vec}(A) := [A_{11}, \ldots, A_{n1}, A_{12}, \ldots, A_{n2}, \ldots, A_{1n}, \ldots, A_{nn}]^T,$$
obtained by stacking the columns of $A$, identifies a matrix $A$ in $\mathbb{R}^{n \times n}$ with a vector $\mathrm{vec}(A)$ in $\mathbb{R}^{n^2}$.

This operation can be inverted in obvious fashion, i.e., given the vector, one can reshape to form the matrix through the $\mathrm{mat}$ operator defined below.
Definition 3.2. $\mathrm{mat}: \mathbb{R}^{n^2} \to \mathbb{R}^{n \times n}$ is the function given by
$$\mathrm{mat}(\mathrm{vec}(A)) = A,$$
for any $A$ in $\mathbb{R}^{n \times n}$.

The subset $S^{n \times n}$ of $\mathbb{R}^{n \times n}$ is the subspace of all symmetric matrices and the object of investigation in this paper. Since this subspace has dimension $m = n(n+1)/2$, it can be identified with $\mathbb{R}^m$. This identification is given by the elimination operator $P$ defined below.
Definition 3.3. Let $V$ be the range of $\mathrm{vec}$ restricted to $S^{n \times n}$, i.e., $V = \mathrm{vec}(S^{n \times n})$. The elimination operator $P$ is the function $P: V \to \mathbb{R}^m$ that eliminates the redundant entries of a vector $v$ in $V$.

The operator $P$ lets us identify symmetric matrices in $S^{n \times n}$ with vectors in $\mathbb{R}^m$ via the $\mathrm{vech}$ operator defined below.
Definition 3.4. The operator $\mathrm{vech}$ is the function $\mathrm{vech}: S^{n \times n} \to \mathbb{R}^m$ given by
(3.1) $\mathrm{vech}(A) = P\, \mathrm{vec}(A)$,
for any symmetric matrix $A$ in $S^{n \times n}$.
Definition 3.5. Conversely, the duplication operator $D: \mathbb{R}^m \to V$, given by
(3.2) $\mathrm{vec}(A) = D\, \mathrm{vech}(A)$,
for any $A$ in $S^{n \times n}$, acts as the inverse of the elimination operator $P$.
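For computations, $D$ can be realized as an explicit $n^2 \times m$ matrix of zeros and ones. The sketch below (our own construction, following the standard column-major conventions for $\mathrm{vec}$ and $\mathrm{vech}$) builds $D$ and checks Equation (3.2):

```python
# Explicit duplication matrix D with vec(A) = D @ vech(A) for symmetric A.
import numpy as np

def vec(A):
    return A.reshape(-1, order="F")          # column stacking

def vech(A):
    # on-and-below-diagonal entries, column by column (length m = n(n+1)/2)
    return np.concatenate([A[j:, j] for j in range(A.shape[0])])

def duplication_matrix(n):
    m = n * (n + 1) // 2
    D = np.zeros((n * n, m))
    col = 0
    for j in range(n):
        for i in range(j, n):                # half-entry (i, j), i >= j
            D[i + j * n, col] = 1.0          # position of A_ij in vec(A)
            D[j + i * n, col] = 1.0          # position of A_ji (same if i == j)
            col += 1
    return D

n = 3
A = np.array([[1., 2., 3.], [2., 4., 5.], [3., 5., 6.]])  # symmetric
D = duplication_matrix(n)
print(np.allclose(vec(A), D @ vech(A)))      # Equation (3.2) holds
```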
Lemma 3.6. Any $a$ in $\mathbb{R}^m$ can be lifted to a symmetric matrix in $S^{n \times n}$.

Proof. $\mathrm{mat}(D(a))$ lifts $a$ in $\mathbb{R}^m$ to a symmetric matrix in $S^{n \times n}$.

We record some properties of the duplication operator $D$ that will be useful in proving our main theorem, Theorem 3.8, later.
Lemma 3.7. Let $D$ be the duplication operator defined in Definition 3.5. The following are true:
1. $\mathrm{Null}(D) = \{0\}$;
2. $D^T \mathrm{vec}(A) = \mathrm{vech}(A) + \mathrm{vech}(A^T) - \mathrm{vech}(A \circ I)$ for all $A \in S^{n \times n}$;
3. $D^T D \in \mathbb{R}^{m \times m}$ is a positive-definite, symmetric matrix;
4. $(D^T D)^{-1}$ exists.

Proof. See [42].
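These properties are easily confirmed numerically with the explicit $D$ constructed above; a spot-check for $n = 3$ follows (the helpers are repeated so that the block runs on its own):

```python
# Spot-check of Lemma 3.7 for n = 3.
import numpy as np

def vech(A):
    return np.concatenate([A[j:, j] for j in range(A.shape[0])])

def duplication_matrix(n):
    m = n * (n + 1) // 2
    D = np.zeros((n * n, m))
    col = 0
    for j in range(n):
        for i in range(j, n):
            D[i + j * n, col] = 1.0
            D[j + i * n, col] = 1.0
            col += 1
    return D

n = 3
D = duplication_matrix(n)
A = np.array([[1., 2., 3.], [2., 4., 5.], [3., 5., 6.]])  # symmetric

# 1. Null(D) = {0}: D has full column rank m
print(np.linalg.matrix_rank(D) == n * (n + 1) // 2)

# 2. D^T vec(A) = vech(A) + vech(A^T) - vech(A o I)
lhs = D.T @ A.reshape(-1, order="F")
rhs = 2 * vech(A) - vech(np.diag(np.diag(A)))  # A symmetric: vech(A^T) = vech(A)
print(np.allclose(lhs, rhs))

# 3 and 4. D^T D is symmetric positive-definite, hence invertible
print(np.all(np.linalg.eigvalsh(D.T @ D) > 0))
```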
Consider a real-valued function $\phi: \mathbb{R}^{n \times n} \to \mathbb{R}$ and its restriction $\phi_{sym} := \phi|_{S^{n \times n}}: S^{n \times n} \to \mathbb{R}$. Then $\phi_{sym}$ can be identified with a scalar function $\phi_s: \mathbb{R}^m \to \mathbb{R}$. Moreover, there is a relationship between the gradients calculated from the different representations of the function. The next theorem formalizes this concept and demonstrates two fundamental ideas: 1) the notion of the Fréchet derivative naturally carries over to the subspace of symmetric matrices, hence there is no need to identify an equivalent representation of the functional in a lower-dimensional space, and 2) if such an equivalent representation is constructed, a careful analysis leads to the correct gradient defined by the Fréchet derivative.
Theorem 3.8. Consider a real-valued function $\phi: \mathbb{R}^{n \times n} \to \mathbb{R}$ whose restriction $\phi_{sym}$ is the function $\phi_{sym} = \phi|_{S^{n \times n}}: S^{n \times n} \to \mathbb{R}$. Let $G := \nabla\phi$ be the gradient of $\phi$, so that $\nabla\phi_{sym} = \mathrm{sym}(G)$ is the gradient of $\phi_{sym}$. $\phi_{sym}$ can be identified with a scalar function $\phi_s: \mathbb{R}^m \to \mathbb{R}$ given by $\phi_s = \phi_{sym} \circ \mathrm{mat} \circ D$, with $m = n(n+1)/2$. If $\nabla\phi_s$ in $\mathbb{R}^m$ is the gradient of $\phi_s$, then the symmetric matrix $G_s$ in $S^{n \times n}$ given by
$$G_s = \mathrm{mat}(D (D^T D)^{-1} \nabla\phi_s)$$
is the correct "symmetric gradient" of $\phi$, in the sense that
$$\langle G_s, H \rangle_F = \langle \nabla\phi_s, \mathrm{vech}(H) \rangle_{\mathbb{R}^m} = \langle \mathrm{sym}(G), H \rangle_F,$$
for all $H$ in $S^{n \times n}$. Thus, $G_s = \mathrm{sym}(G)$.

Before proving Theorem 3.8, we establish a few useful lemmas that are interesting in their own right. Remark 3.11 will illustrate a plausible argument for Claim 2.4.
Lemma 3.9. Let $A, B$ be two symmetric matrices in $S^{n \times n}$. Then we have the following equivalence:
$$\langle A, B \rangle_F = \langle \mathrm{vec}(A), \mathrm{vec}(B) \rangle_{\mathbb{R}^{n^2}} = \langle D^T D\, \mathrm{vech}(A), \mathrm{vech}(B) \rangle_{\mathbb{R}^m} = \langle \mathrm{vech}(A), D^T D\, \mathrm{vech}(B) \rangle_{\mathbb{R}^m},$$
where $\langle \cdot, \cdot \rangle_{\mathbb{R}^{n^2}}$ and $\langle \cdot, \cdot \rangle_{\mathbb{R}^m}$ are the usual Euclidean inner products.

Proof.
$$\langle A, B \rangle_F = \mathrm{tr}(A^T B) = \langle \mathrm{vec}(A), \mathrm{vec}(B) \rangle_{\mathbb{R}^{n^2}} \quad \text{by Definition 3.1}$$
$$= \langle D\, \mathrm{vech}(A), D\, \mathrm{vech}(B) \rangle_{\mathbb{R}^{n^2}} \quad \text{from Equation (3.2)}$$
$$= \langle D^T D\, \mathrm{vech}(A), \mathrm{vech}(B) \rangle_{\mathbb{R}^m} \quad \text{by definition of the transpose operator.}$$

The observation that $\langle A, B \rangle_F \neq \langle \mathrm{vech}(A), \mathrm{vech}(B) \rangle_{\mathbb{R}^m}$ is a crucial one and lies at the heart of the discrepancy alluded to in the title of this article. Instead, if we want to refactor the inner product of two elements in $\mathbb{R}^m$ into one in $\mathbb{R}^{n \times n}$, one has Lemma 3.10.

Lemma 3.10. For any $a, b$ in $\mathbb{R}^m$,
$$\langle a, b \rangle_{\mathbb{R}^m} = \langle \mathrm{mat}(D (D^T D)^{-1} a), \mathrm{mat}(D b) \rangle_F.$$

Proof.
$$\langle a, b \rangle_{\mathbb{R}^m} = \langle (D^T D)(D^T D)^{-1} a, b \rangle_{\mathbb{R}^m} \quad \text{by Lemma 3.7}$$
$$= \langle D (D^T D)^{-1} a, D b \rangle_{\mathbb{R}^{n^2}} \quad \text{by definition of the transpose operator}$$
$$= \langle \mathrm{mat}(D (D^T D)^{-1} a), \mathrm{mat}(D b) \rangle_F \quad \text{by Lemma 3.9.}$$

We are now ready to prove in full generality, for $\phi_{sym} := \phi|_{S^{n \times n}}: S^{n \times n} \to \mathbb{R}$, what we demonstrated through the example earlier: that Claim 2.4 is false, and that the "symmetric gradient" is $G_s = \mathrm{sym}(G)$, computed using Corollary 2.6.
Proof of Theorem 3.8. The operators defined above establish the commutative diagram
(3.3) [diagram: $\mathrm{vec}$ maps $S^{n \times n}$ to $V \subset \mathbb{R}^{n^2}$, with inverse $\mathrm{mat}$; $P$ maps $V$ to $\mathbb{R}^m$, with inverse $D$; and $\phi_{sym}$, $\phi_s$ map $S^{n \times n}$, $\mathbb{R}^m$, respectively, to $\mathbb{R}$.]
These yield the following relations:
(3.4) $\phi_s = \phi_{sym} \circ \mathrm{mat} \circ D$,
where $\circ$ represents the usual composition of functions and $m = n(n+1)/2$, and
(3.5) $\phi_{sym} = \phi_s \circ P \circ \mathrm{vec}$.

Thus, for any symmetric matrix $H$, the chain rule for Fréchet derivatives yields
(3.6) $D\phi_{sym}(H) = D\phi_s \circ DP \circ D\mathrm{vec}(H)$.
By noting that $P$, $D$ and $\mathrm{vec}$ are linear operators, the equation above yields the following relationship between the Fréchet derivatives:
(3.7) $D\phi_{sym}(H) = D\phi_s \circ \mathrm{vech}(H)$.

With the usual inner products defined earlier, the Riesz Representation Theorem gives us the following relationship between the gradients:
(3.8) $\langle \mathrm{sym}(G), H \rangle_F = \langle \nabla\phi_s, \mathrm{vech}(H) \rangle_{\mathbb{R}^m}$
for each $H$ in $S^{n \times n}$ (see Corollary 2.6 and Subsection 2.1 for the fact that the gradient $\nabla\phi_{sym}$ is given by $\mathrm{sym}(G)$).

Since $\nabla\phi_s$ is a vector in $\mathbb{R}^m$, it can be lifted to the space of symmetric matrices to yield the matrix $G_s$ as in Lemma 3.6. However, such a lifting fails to satisfy the inner product relationship given by (3.8) and yields the incorrect gradient; see Remark 3.11. By Lemma 3.10 and the fact that $\mathrm{mat}(D\, \mathrm{vech}(H)) = \mathrm{mat}(\mathrm{vec}(H)) = H$, we find that the correct lifting is given by $G_s = \mathrm{mat}(D (D^T D)^{-1} \nabla\phi_s)$ in $S^{n \times n}$, such that
(3.9) $\langle \nabla\phi_s, \mathrm{vech}(H) \rangle_{\mathbb{R}^m} = \langle G_s, H \rangle_F$
for each $H$ in $S^{n \times n}$.

To show that $G_s$ is indeed the correct expression for the "symmetric gradient", we need to show that $G_s = \mathrm{sym}(G)$. This follows immediately, since we now have
$$\langle \mathrm{sym}(G), H \rangle_F = \langle G_s, H \rangle_F = \langle \nabla\phi_s, \mathrm{vech}(H) \rangle_{\mathbb{R}^m}.$$
This completes our proof of Theorem 3.8.
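Before discussing how the spurious claim arises, it is instructive to confirm the theorem numerically. The sketch below (an illustration with $\phi = \det$ on $S^{3 \times 3}$; the matrix $A$ and the finite-difference step are arbitrary choices) computes $\nabla\phi_s$ directly in $\mathbb{R}^m$ and verifies that the lifting of Theorem 3.8 recovers $\mathrm{sym}(G)$:

```python
# End-to-end check of Theorem 3.8 with phi(X) = det(X) on S^{3x3}.
import numpy as np

def duplication_matrix(n):
    m = n * (n + 1) // 2
    D = np.zeros((n * n, m))
    col = 0
    for j in range(n):
        for i in range(j, n):
            D[i + j * n, col] = 1.0
            D[j + i * n, col] = 1.0
            col += 1
    return D

n = 3
m = n * (n + 1) // 2
D = duplication_matrix(n)
mat = lambda v: v.reshape(n, n, order="F")

def phi_s(a):
    # phi_s = phi o mat o D (Equation (3.4)), with phi = det
    return np.linalg.det(mat(D @ a))

A = np.array([[4., 1., 0.], [1., 3., 1.], [0., 1., 2.]])   # symmetric
vech_A = np.concatenate([A[j:, j] for j in range(n)])

eps = 1e-6                                   # central finite differences
grad_phi_s = np.array([(phi_s(vech_A + eps * e) - phi_s(vech_A - eps * e))
                       / (2 * eps) for e in np.eye(m)])

G = np.linalg.det(A) * np.linalg.inv(A).T    # Frechet gradient of det
G_s = mat(D @ np.linalg.solve(D.T @ D, grad_phi_s))   # Theorem 3.8 lifting
print(np.allclose(G_s, (G + G.T) / 2, atol=1e-5))     # True: G_s = sym(G)
```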
Remark 3.11. From Equation (3.8) and the property of the duplication operator $D$ stated in Lemma 3.7, we can relate $\nabla\phi_s$ to $\nabla\phi_{sym} = \mathrm{sym}(G)$ in the following way:
(3.10) $\nabla\phi_s = D^T \mathrm{vec}(\nabla\phi_{sym}) = \mathrm{vech}(\nabla\phi_{sym}) + \mathrm{vech}(\nabla\phi_{sym}^T) - \mathrm{vech}(\nabla\phi_{sym} \circ I)$.
This is equivalent to
$$\nabla\phi_s = \mathrm{vech}(G + G^T - G \circ I).$$
If we ignore Lemma 3.10 and the constraint Equation (3.8), and instead naively use Lemma 3.6 to set $G_s = \mathrm{mat}(D \nabla\phi_s)$, as illustrated in the example in Subsection 2.2 earlier, we get
$$G_s = \mathrm{mat}(D\, \mathrm{vech}(G + G^T - G \circ I)),$$
which simplifies to
$$G_s = \mathrm{mat}(\mathrm{vec}(G + G^T - G \circ I)) = G + G^T - G \circ I.$$
Thus, we have shown that the same fundamental flaw discovered in Subsection 2.2 underpins the "proof" of the spurious Claim 2.4.
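Continuing the sketch above (same helpers and variables as in the previous code block), the naive lifting reproduces exactly the formula of Claim 2.4:

```python
# Continuing the previous sketch: omit the (D^T D)^{-1} factor, i.e., apply
# the Lemma 3.6 lifting to the gradient, and recover the spurious formula.
naive_G_s = mat(D @ grad_phi_s)
claim_2_4 = G + G.T - np.diag(np.diag(G))            # G + G^T - G o I
print(np.allclose(naive_G_s, claim_2_4, atol=1e-5))  # True: same flawed result
```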
Remark 3.12. Though we have worked over the field $\mathbb{R}$, the same arguments will be valid for matrix functionals defined over the complex field $\mathbb{C}$, with an appropriate modification of the definition of the inner product in Definition 2.1.

Remark 3.13. The Fréchet derivative over a subspace of $\mathbb{R}^{n \times n}$ can be obtained from the Fréchet derivative over $\mathbb{R}^{n \times n}$ by an appropriate projection/restriction to the relevant linear manifold, as shown in Corollary 2.6. In this paper, the linear manifold was the set of symmetric matrices, designated $S^{n \times n}$. One can adapt the same ideas expressed here to obtain the derivative over the subspace of skew-symmetric, diagonal, upper-triangular, or lower-triangular matrices. However, note that this remark does not apply to the set of orthogonal matrices, since it is not a linear manifold.

Thus, the crux of the theorem was the recognition that while the lifting of an element in $\mathbb{R}^m$ to $S^{n \times n}$ follows Lemma 3.6, the gradient in $\mathbb{R}^m$ must satisfy Equation (3.8) and instead must be lifted by Lemma 3.10.

In the context of an application, the implication of Theorem 3.8 depends on the way the gradient is calculated, and on what the quantity of interest is. If the quantity of interest is the gradient of a scalar function defined over $S^{n \times n}$, then the correct gradient is the one given by Theorem 3.8, whatever be the method used to evaluate it. However, if the quantity of interest is not the gradient but the argument of the function, say one obtained by gradient descent in an optimization algorithm, then the implication depends on how the argument is represented. If the argument is represented as an element of $S^{n \times n}$, then the gradient $\mathrm{sym}(G)$ should be used. Most solvers and optimization routines, however, do not accept matrices as arguments. In these cases, one actually works with the function $\phi_s$, whose gradient $\nabla\phi_s$ in $\mathbb{R}^m$ poses no complications. However, the output argument returned by the gradient descent will still lie in $\mathbb{R}^m$ and will have to be lifted to $S^{n \times n}$ by Lemma 3.6 to be useful, as the sketch below illustrates.
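In the following hedged sketch of this workflow (the objective, step size, and iteration count are arbitrary illustrative choices), gradient descent runs on $\phi_s$ over $\mathbb{R}^m$, and the returned argument, not the gradient, is lifted to $S^{n \times n}$ by Lemma 3.6:

```python
# Gradient descent in R^m, with the final argument lifted by Lemma 3.6.
import numpy as np

def duplication_matrix(n):
    m = n * (n + 1) // 2
    D = np.zeros((n * n, m))
    col = 0
    for j in range(n):
        for i in range(j, n):
            D[i + j * n, col] = 1.0
            D[j + i * n, col] = 1.0
            col += 1
    return D

n = 3
m = n * (n + 1) // 2
D = duplication_matrix(n)
mat = lambda v: v.reshape(n, n, order="F")

# Illustrative objective on S^{3x3}: phi(X) = ||X - B||_F^2, B symmetric.
B = np.array([[2., 1., 0.], [1., 3., 1.], [0., 1., 4.]])

def grad_phi_s(a):
    # chain rule: grad(phi_s)(a) = D^T vec(2 (mat(Da) - B))
    return D.T @ (2 * (mat(D @ a) - B)).reshape(-1, order="F")

a = np.zeros(m)                      # the argument lives in R^m
for _ in range(200):
    a -= 0.1 * grad_phi_s(a)         # descent poses no complications in R^m

X = mat(D @ a)                       # lift the argument via Lemma 3.6
print(np.allclose(X, B))             # minimizer of phi is B itself
```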
4. Conclusions.
In this article, we investigated the two different notions of a gradient that exist for a real-valued function when the argument is a symmetric matrix. The first notion is the mathematical definition of a Fréchet derivative on the space of symmetric matrices. The other definition aims to eliminate the redundant degrees of freedom present in a symmetric matrix, perform the gradient calculation in the space of reduced dimension, and finally map the result back into the space of matrices. We showed, both through an example and rigorously through a theorem, that the problem in the second approach lies in the final step, as the gradient in the reduced-dimension space is mapped into a symmetric matrix. Moreover, the approach does not recognize that Definition 2.5, restricted to $S^{n \times n}$, already accounts for the symmetry in the matrix argument; thus there is no need to identify an equivalent representation of the functional in a lower-dimensional space of dimension $m = n(n+1)/2$. However, we demonstrated that if such an equivalent representation is constructed, a careful analysis of the lifting from $\mathbb{R}^m$ to $S^{n \times n}$ leads to the correct gradient, $G_s = \mathrm{sym}(G)$.
REFERENCES

[1] James Munkres. Analysis on Manifolds. Addison-Wesley, New York, 1991.
[2] Ward Cheney. Analysis for Applied Mathematics, volume 208. Springer Science & Business Media, 2013.
[3] Michael Athans and Fred C. Schweppe. Gradient matrices and matrix calculations. Technical Note 1965-53, Massachusetts Institute of Technology Lincoln Laboratory, Nov 1965. [Online; accessed 1 Nov. 2019].
[4] J. R. Magnus and H. Neudecker. Matrix Differential Calculus with Applications in Statistics and Econometrics. John Wiley, New York, 1988.
[5] Morton E. Gurtin, Eliot Fried, and Lallit Anand. Mechanics and Thermodynamics of Continua. Cambridge University Press, 2010.
[6] Ray W. Ogden. Non-linear Elastic Deformations. Dover Publications, New York, 1997.
[7] Mikhail Itskov. Tensor Algebra and Tensor Analysis for Engineers. Springer, Cham, 2019.
[8] Paul S. Dwyer. Some applications of matrix derivatives in multivariate analysis. Journal of the American Statistical Association, 62(318):607–625, Jun 1967.
[9] Derrick S. Tracy and Paul S. Dwyer. Multivariate maxima and minima with matrix derivatives. Journal of the American Statistical Association, 64(328):1576–1594, Dec 1969.
[10] Friedrich Gebhardt. Maximum likelihood solution to factor analysis when some factors are completely specified. Psychometrika, 36(2):155–163, Jun 1971.
[11] D. S. Tracy and R. P. Singh. Some applications of matrix differentiation in the general analysis of covariance structures. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 37(2):269–280, Apr 1975.
[12] H. V. Henderson and S. R. Searle. Vec and vech operators for matrices, with some uses in Jacobians and multivariate statistics. Canadian Journal of Statistics, 7:65–81, 1979.
[13] D. G. Nel. On matrix differentiation in statistics. South African Statistical Journal, 14(2):137–193, Jan 1980.
[14] G. S. Rogers. Matrix Derivatives. Marcel Dekker, New York, 1980.
[15] Charles E. McCulloch. Symmetric matrix derivatives with applications. Journal of the American Statistical Association, Mar 1980.
[16] S. R. Searle. Matrix Algebra for Statistics. John Wiley, New York, 1982.
[17] David A. Harville. Matrix Algebra from a Statistician's Perspective. Springer-Verlag, New York, 1997.
[18] George Arthur Frederick Seber. A Matrix Handbook for Statisticians. John Wiley & Sons, New Jersey, 2008.
[19] A. M. Mathai. Jacobians of Matrix Transformations and Functions of Matrix Argument. World Scientific, Singapore, 1997.
[20] Thomas Minka. Old and New Matrix Algebra Useful for Statistics, 2001. [Online; accessed 1 Nov. 2019].
[21] Michael Athans. Matrix minimum principle. Report ESL-R-317, Massachusetts Institute of Technology Lincoln Laboratory, Aug 1967. [Online; accessed 4 Mar. 2020].
[22] Michael Athans. The matrix minimum principle. Information and Control, 11(5):592–606, Nov 1967.
[23] H. Geering. On calculating gradient matrices. IEEE Transactions on Automatic Control, 21(4):615–616, Aug 1976.
[24] J. Brewer. The gradient with respect to a symmetric matrix. IEEE Transactions on Automatic Control, 22(2):265–267, Apr 1977.
[25] P. Walsh. On symmetric matrices and the matrix minimum principle. IEEE Transactions on Automatic Control, 22(6):995–996, Dec 1977.
[26] A. E. Yanchevsky and V. J. Hirvonen. Optimization of feedback systems with constrained information flow. International Journal of Systems Science, 12(12):1459–1468, 1981.
[27] A.-M. Parring. About the concept of the matrix derivative. Linear Algebra and its Applications, 176:223–235, Nov 1992.
[28] D. H. van Hessem and O. H. Bosgra. A full solution to the constrained stochastic closed-loop MPC problem via state and innovations feedback and its receding horizon implementation. In Proceedings of the IEEE Conference on Decision and Control, volume 1, pages 929–934, 2003.
[29] Kaare Brandt Petersen and Michael Syskind Pedersen. The Matrix Cookbook, Nov 2012. Version 20121115.
[30] Iain Murray. Differentiation of the Cholesky decomposition. ArXiv e-prints, Feb 2016.
[31] me10240 (https://math.stackexchange.com/users/66158/me10240). Making sense of matrix derivative formula for determinant of symmetric matrix as a Fréchet derivative? Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/2436680 (version: 2017-09-21).
[32] tomka (https://math.stackexchange.com/users/118706/tomka). What is the derivative of the determinant of a symmetric positive definite matrix? Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/1981210 (version: 2017-04-13).
[33] wueb (https://math.stackexchange.com/users/238307/wueb). Understanding notation of derivatives of a matrix. Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/2131708 (version: 2017-04-13).
[34] me10240 (https://math.stackexchange.com/users/66158/me10240). Can derivative formulae in matrix cookbook be interpreted as Fréchet derivatives? Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/2508276 (version: 2017-11-06).
[35] TopDog (https://math.stackexchange.com/users/569224/topdog). Matrix gradient of ln(det(X)). Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/2816512 (version: 2018-06-12).
[36] jds (https://math.stackexchange.com/users/31891/jds). Derivative with respect to symmetric matrix. Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/3005374 (version: 2018-11-19).
[37] kMaster (https://math.stackexchange.com/users/58360/kmaster). Derivative of the inverse of a symmetric matrix. Mathematics Stack Exchange. URL: https://math.stackexchange.com/q/982386 (version: 2014-10-20).
[38] William Thomson, Lord Kelvin. Elements of a mathematical theory of elasticity. Philosophical Transactions of the Royal Society of London, 146:481–498, 1856.
[39] W. Voigt. Lehrbuch der Kristallphysik. B. G. Teubner, Leipzig, 1910.
[40] Steven Roman. Advanced Linear Algebra, volume 3. Springer.
[41] Sören Laue, Matthias Mitterreiter, and Joachim Giesen. Computing higher order derivatives of matrix and tensor expressions. In Advances in Neural Information Processing Systems (NIPS), 2018.
[42] Jan R. Magnus and H. Neudecker. The elimination matrix: some lemmas and applications. SIAM Journal on Algebraic and Discrete Methods, 1(4):422–449, 1980.