Metrics for matrix-valued measures via test functions


Lipeng Ning and Tryphon T. Georgiou

Abstract

It is perhaps not widely recognized that certain common notions of distance between probability measures have an alternative dual interpretation which compares corresponding functionals against suitable families of test functions. This dual viewpoint extends in a straightforward manner to suggest metrics between matrix-valued measures. Our main interest has been in developing weakly-continuous metrics that are suitable for comparing matrix-valued power spectral density functions. To this end, and following the suggested recipe of utilizing suitable families of test functions, we develop a weakly-continuous metric that is analogous to the Wasserstein metric and applies to matrix-valued densities. We use a numerical example to compare this metric to certain standard alternatives including a different version of a matricial Wasserstein metric developed in [1], [2].

I. INTRODUCTION

Consider the set of probability measures
\[ \mathcal{P}(I) := \{ \mu : d\mu(\theta) \geq 0 \text{ for } \theta \in I \text{ and } \mu(I) = 1 \} \]
where, herein, $I$ is always thought to be an interval. There is a large family of metrics, comparing $\mu_1, \mu_2 \in \mathcal{P}$, that are expressed in the form
\[ \sup_{f \in \mathcal{F}} \left| \int f \,(d\mu_1 - d\mu_2) \right| \tag{1} \]
with $\mathcal{F}$ being a suitable set of functions on $I$. Probability metrics that can be expressed in the form (1) are often referred to either as metrics with a $\zeta$-structure [3] or as integral probability metrics [4], [5].

The family of metrics that can be expressed as in (1) includes many familiar ones such as the total variation, Kolmogorov's distance and the 1-Wasserstein metric. In point of fact, the total variation, which is defined by
\[ \|\mu_1 - \mu_2\|_{TV} := \int_I |d\mu_1(x) - d\mu_2(x)|, \]
can also be expressed as
\[ \|\mu_1 - \mu_2\|_{TV} = \sup_f \left\{ \int f \,(d\mu_1 - d\mu_2) \;\Big|\; \|f\|_\infty \leq 1 \right\}, \]
Kolmogorov's metric, which is defined by
\[ d_K(\mu_1, \mu_2) := \sup_{x \in I} |F_1(x) - F_2(x)| \]
with $F_1$ and $F_2$ being the cumulative distribution functions of $\mu_1$ and $\mu_2$, respectively, can be expressed as
\[ d_K(\mu_1, \mu_2) = \sup_f \left\{ \int_I f \,(d\mu_1 - d\mu_2) \;\Big|\; \int_I |f'| \, dx \leq 1 \right\} \]
(see [6, page 73]), and so can the 1-Wasserstein metric [7].

L. Ning is with Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115. T. T. Georgiou is with the Department of Electrical & Computer Engineering, University of Minnesota, Minneapolis, MN 55455. This research was supported in part by the NSF under Grant 1027696, the AFOSR under Grant FA9550-12-1-0319, and the Vincentine Hermes-Luh Endowment. arXiv preprint [cs.SY], September.

We focus on the 1-Wasserstein metric, which can be defined over more general spaces. It is defined with respect to the metric $d(x, y) = |x - y|$, as follows.
If $\mathcal{P}_{\mu_1,\mu_2} \subset \mathcal{P}(I \times I)$ denotes the subset of probability measures on $I \times I$ that have $\mu_1$ and $\mu_2$ as marginals, the 1-Wasserstein distance between $\mu_1$ and $\mu_2$ is
\[ d_{W_1}(\mu_1, \mu_2) := \inf_{m \in \mathcal{P}_{\mu_1,\mu_2}} \int_{I \times I} d(x, y) \, dm(x, y). \]
Naturally, it can be expressed in a dual form as
\[ d_{W_1}(\mu_1, \mu_2) = \sup_f \left\{ \int_I f \,(d\mu_1 - d\mu_2) \;\Big|\; \|f\|_{Lip} \leq 1 \right\} \tag{2} \]
where
\[ \|f\|_{Lip} := \sup_{x, y \in I,\ x \neq y} \frac{|f(x) - f(y)|}{|x - y|} \]
denotes the Lipschitz semi-norm.

An important property of the Wasserstein metric is that it metrizes weak$^*$ convergence of probability measures [7, page 212], that is, given $\mu$ and any sequence of measures $\{\mu_k, k = 1, 2, \ldots\}$,
\[ d_{W_1}(\mu_k, \mu) \to 0 \quad \text{as } k \to \infty \]
if and only if, for all continuous and bounded $f$,
\[ \int_I f \, d\mu_k \to \int_I f \, d\mu \quad \text{as } k \to \infty. \]
Intuitively, small changes in a weak$^*$ sense reflect small changes in any relevant statistics. This is clearly a desirable feature in any experimental engineering quantification of distances between distributions, whether these represent probability, power, spectral power or other entities. For this precise reason, the Wasserstein metric has turned out to be a useful tool in modeling of slowly varying time-series [8] and for comparing covariance matrices [9], among many other applications [6].

In the sequel we say that a metric is weakly continuous if it metrizes weak$^*$ convergence. The use of the terminology "weak" instead of "weak$^*$" is common in probability theory and is adopted in this paper.

The main contribution of the present paper is a particular generalization of the Wasserstein metric to the space of matrix-valued measures with possibly non-equal mass. This metric differs from an analogous metric in our recent work [1] which also represents a generalization of Wasserstein distances to matricial measures.
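As a rough numerical illustration of the metrics discussed so far, the sketch below assumes that measures are discretized as lists of masses on a uniform grid. The function names are illustrative and not from the paper, and the 1-Wasserstein distance is computed through the standard one-dimensional identity $d_{W_1} = \int |F_1 - F_2| \, dx$ rather than through the dual form (2). It shows a point mass moving towards another one: the total variation and Kolmogorov distances do not shrink, while the 1-Wasserstein distance does, which is the weak continuity property in miniature.

```python
from itertools import accumulate


def total_variation(p, q):
    """Total variation: sum of |p_i - q_i| over the grid."""
    return sum(abs(a - b) for a, b in zip(p, q))


def kolmogorov(p, q):
    """Largest gap between the two cumulative distribution functions."""
    return max(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def wasserstein1(p, q, dx):
    """1-Wasserstein distance on a uniform 1-D grid with spacing dx.

    Uses the one-dimensional identity d_W1 = integral of |F_p - F_q|,
    which on a grid is the sum of CDF gaps times the spacing.
    """
    cdf_gap = [abs(a - b) for a, b in zip(accumulate(p), accumulate(q))]
    return dx * sum(cdf_gap[:-1])


def point_mass(n, i):
    """Unit mass at grid index i on a grid with n points."""
    return [1.0 if j == i else 0.0 for j in range(n)]


if __name__ == "__main__":
    n, dx = 101, 0.01
    target = point_mass(n, 50)
    for shift in (8, 4, 2, 1):
        moved = point_mass(n, 50 + shift)
        print(shift * dx,
              total_variation(moved, target),
              kolmogorov(moved, target),
              wasserstein1(moved, target, dx))
```

The grid-based formula is only one convenient way to evaluate the distance for scalar measures on an interval; it does not extend to the matrix-valued setting treated later.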
The present metric is based on a dual formalism where we compare measures on a suitable set of test functions.

Before dealing with the matricial case, in Section II, we first discuss how to modify the Wasserstein metric so as to compare measures with non-equal masses; a subsequent matricial generalization follows along similar lines. In Section III, we present certain related ideas from non-commutative geometry for devising metrics to compare states of non-commutative algebras; such states are "non-commutative" generalizations of probability measures. In Section IV, we develop the sought weakly-continuous metric between matrix-valued measures. Our interest is in spectral analysis of multivariate time-series and, thereby, we appeal to a pertinent numerical example to highlight differences and similarities of the proposed metric to alternatives; this is given in Section V.

The notational convention we follow is to use regular font, as in $\mu, m$, for scalar values, variables, and functions, and to use boldface fonts, as in $\boldsymbol{\mu}, \mathbf{m}$, for matrix-valued functions or elements of a general algebra.

II. WASSERSTEIN-LIKE METRIC FOR UNBALANCED MEASURES

We begin by discussing a certain adaptation of the Wasserstein metric for use on unbalanced measures [10], that is, for the case when the measures we deal with may have unequal integrals. Our motivation stems from the need for a weakly continuous distance to be used on power spectral densities of stationary stochastic processes. The dual formulation serves as a template for a subsequent matricial version.

Consider the set of non-negative scalar measures
\[ \mathcal{M}(I) := \{ \mu : d\mu(\theta) \geq 0 \text{ for } \theta \in I \}. \]
For any two $\mu_1, \mu_2 \in \mathcal{M}$, let
\[ d_{W_1,\kappa}(\mu_1, \mu_2) := \inf_{\hat{\mu}_1, \hat{\mu}_2} \left\{ d_{W_1}(\hat{\mu}_1, \hat{\mu}_2) + \kappa \sum_{k=1}^{2} \|\mu_k - \hat{\mu}_k\|_{TV} \right\} \tag{3} \]
where $\kappa > 0$ is used to weigh the relative importance of the two terms. In [10], the optimizing variables $\hat{\mu}_1, \hat{\mu}_2$ represent "noise-free" measures having equal mass while the "error" differences $\mu_k - \hat{\mu}_k$ for $k = 1, 2$ are attributed to statistical variability. The dual of (3) is
\[ d_{W_1,\kappa}(\mu_1, \mu_2) = \sup_f \left\{ \int_I f \,(d\mu_1 - d\mu_2) \;\Big|\; \|f\|_{Lip} \leq 1,\ \|f\|_\infty \leq \kappa \right\} \tag{4} \]
where $\|f\|_\infty = \max_{x \in I} |f(x)|$. Since test functions in (4) are bounded, $d_{W_1,\kappa}(\mu_1, \mu_2)$ is bounded as well [10]; thereby, it can be easily shown that $d_{W_1,\kappa}$ is a weakly continuous metric (see [10] for details). Applications of this metric to power spectral analysis have been pursued in [11].

We remark that the essence in (1) is to postulate a family of test functions which is rich enough to distinguish measures while the functions in the class are equicontinuous and uniformly bounded [11]. For the Wasserstein metric, the family of test functions is the class of Lipschitz functions. A very similar rationale has been introduced in non-commutative geometry, where via a suitable generalization of the Lipschitz semi-norm, one obtains a metric between non-commutative states, namely Connes' spectral distance. This is discussed next.

III. METRICS IN NON-COMMUTATIVE GEOMETRY

We specialize our discussion to the algebra of $n \times n$ complex-valued matrices $M_n(\mathbb{C})$; for the more general setting of non-commutative algebras see [12], [13]. Given $A, B \in M_n(\mathbb{C})$, or any algebra, we denote by
\[ [A, B] = AB - BA \]
the commutator of $A$ and $B$. In general, a state $\rho$ of a non-commutative algebra $\mathcal{A}$ is a positive linear functional that maps $f \in \mathcal{A}$ to $\mathbb{R}$ or $\mathbb{C}$. The set of states of $\mathcal{A}$ is denoted as $S(\mathcal{A})$. For example, a state $\rho$ of the commutative algebra $C(I)$ of continuous functions on $I$ uniquely corresponds to a probability measure $\mu$ such that for any $f \in C(I)$
\[ \rho(f) = \int_I f \, d\mu. \]
In quantum mechanics, a state $\rho$ of a non-commutative algebra of "observables" is a density matrix; this is positive semidefinite with trace equal to one. For $f \in \mathcal{A}$,
\[ \rho(f) = \mathrm{tr}(\rho f). \]
Thus, states are thought of as a generalization of probability measures and $\rho(\cdot) = E_\rho\{\cdot\}$ is thought of as the expectation operator.

A. Connes' spectral distance

Consider a non-commutative algebra of operators $\mathcal{A}$ on a Hilbert space. Let $D$ be a specific fixed operator often referred to as the Dirac operator. Connes' spectral distance [12] between $\rho_1, \rho_2 \in S(\mathcal{A})$ is defined as
\[ d_D(\rho_1, \rho_2) := \sup_{f \in \mathcal{A}} \left\{ |\rho_1(f) - \rho_2(f)| \;\Big|\; \|[D, f]\| \leq 1 \right\}. \]
The operator norm of the commutator $[D, f]$ takes the role of the Lipschitz semi-norm.

To illustrate the insight in viewing $\|[D, f]\|$ as the analogue of a Lipschitz semi-norm, consider $f$ to be a smooth function on $\mathbb{R}$ and choose $D = -i\partial_x$. For any smooth function $g$,
\[ [D, f]g = Dfg - fDg = -i\partial_x(fg) + if\partial_x g = -i(\partial_x f)g. \]
Thus $[D, f]$ takes $g \mapsto -i(\partial_x f)g$. The operator norm $\|[D, f]\|$ equals the operator norm of the mapping $g \mapsto -i(\partial_x f)g$, which is precisely the Lipschitz semi-norm of $f$.

Consider $\mathcal{A}$ to be the algebra $C(I)$ of continuous functions on $I$. Thus, for any two $\rho_1, \rho_2 \in S(C(I))$ which correspond to probability measures $\mu_1$ and $\mu_2$, respectively,
\[ d_D(\rho_1, \rho_2) = \sup_{f \in \mathcal{A}} \left\{ \int_I f \,(d\mu_1 - d\mu_2) \;\Big|\; \|[D, f]\| \leq 1 \right\}. \]
The relation of this distance to the 1-Wasserstein metric is explained in [13], [14]. We elaborate with an example and provide directions for further generalization.

Connes' spectral distance can be unbounded. For instance, let $\mathcal{A} = M_2(\mathbb{R})$ and let
\[ D = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}. \]
Consider as states the set $S(\mathcal{A})$ of positive semi-definite matrices with trace equal to one; observables can also be taken in $\mathcal{A}$. Take an observable
\[ f = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \]
and note that
\[ [D, f] = \begin{bmatrix} c - b & d - a \\ a - d & b - c \end{bmatrix}. \]
Thus, the distance between states
\[ \rho_1 = \begin{bmatrix} p_1 & q_1 \\ q_1 & 1 - p_1 \end{bmatrix} \quad \text{and} \quad \rho_2 = \begin{bmatrix} p_2 & q_2 \\ q_2 & 1 - p_2 \end{bmatrix}, \]
namely,
\[ d_D(\rho_1, \rho_2) = \sup_{f \in \mathcal{A}} \left\{ |\mathrm{tr}(\rho_1 f) - \mathrm{tr}(\rho_2 f)| \;\Big|\; \|[D, f]\| \leq 1 \right\} = \sup \left\{ |(p_1 - p_2)(a - d) + (q_1 - q_2)(b + c)| \;\Big|\; \left\| \begin{bmatrix} c - b & d - a \\ a - d & b - c \end{bmatrix} \right\| \leq 1 \right\} \]
is $\infty$.
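The unboundedness can also be checked numerically. The following is a minimal sketch, assuming $D = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$ (the choice consistent with the commutator displayed above) and two illustrative states `RHO1` and `RHO2` that differ only in their off-diagonal entry. The helper names are hypothetical. Observables of the form $f = \begin{bmatrix} 0 & b \\ b & 0 \end{bmatrix}$ commute with $D$, so they satisfy the constraint for every $b$ while the gap between the two states grows linearly in $b$.

```python
import math


def matmul(A, B):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def commutator(D, f):
    """[D, f] = D f - f D."""
    Df, fD = matmul(D, f), matmul(f, D)
    return [[Df[i][j] - fD[i][j] for j in range(2)] for i in range(2)]


def spectral_norm(M):
    """Largest singular value of a real 2x2 matrix (eigenvalues of M^T M)."""
    (a, b), (c, d) = M
    s = a * a + b * b + c * c + d * d      # trace of M^T M
    det = a * d - b * c                    # det(M^T M) equals det^2
    return math.sqrt((s + math.sqrt(max(s * s - 4 * det * det, 0.0))) / 2)


def state_gap(rho1, rho2, f):
    """|tr(rho1 f) - tr(rho2 f)| for 2x2 states and an observable f."""
    tr1 = sum(matmul(rho1, f)[i][i] for i in range(2))
    tr2 = sum(matmul(rho2, f)[i][i] for i in range(2))
    return abs(tr1 - tr2)


# Assumed Dirac operator and two illustrative trace-one, PSD states.
D = [[0, 1], [1, 0]]
RHO1 = [[0.5, 0.3], [0.3, 0.5]]
RHO2 = [[0.5, 0.1], [0.1, 0.5]]

if __name__ == "__main__":
    for b in (1.0, 10.0, 100.0):
        f = [[0.0, b], [b, 0.0]]
        print(b, spectral_norm(commutator(D, f)), state_gap(RHO1, RHO2, f))
```

Since the commutator norm stays at zero while the gap keeps growing, no finite value can bound the supremum for these two states.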
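To see this, note that if $q_1 \neq q_2$, then choosing $a = d$ and $b = c$ gives $[D, f] = 0$, so the constraint holds for every $b$ while the objective $2|(q_1 - q_2)b|$ can take any positive value for a suitable choice of $b = c$.

Remark 1: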

In analogy with (4),
\[ d_{D,\kappa}(\rho_1, \rho_2) := \sup_{f \in \mathcal{A}} \left\{ |\rho_1(f) - \rho_2(f)| \;\Big|\; \|[D, f]\| \leq 1,\ \|f\| \leq \kappa \right\} \tag{5} \]
defines a bounded metric. Likewise, Connes' spectral distance can be readily generalized to
\[ d_D(\rho_1, \rho_2) = \sup_{f \in \mathcal{A}} \left\{ |\rho_1(f) - \rho_2(f)| \;\Big|\; \|[D_i, f]\| \leq 1 \text{ for } i \in \{1, \ldots, n\} \right\} \tag{6} \]
for a suitable set of $n$ Dirac operators quantifying "slope" in several possible "directions."

In the next section we deal in more detail with the algebra of matrix-valued continuous functions on $I$, namely, $C(\mathbb{C}^{n \times n}, I)$ where states correspond to matrix-valued measures. Our interest stems from spectral analysis of multivariate time-series and in the next section, we present a Wasserstein-like metric between matrix-valued measures. The formalism is completely analogous in that the metric is constructed in a dual formalism by quantifying how measures act on suitably constrained matrix-functions.

IV. A WASSERSTEIN-LIKE METRIC BETWEEN MATRIX-VALUED MEASURES

Spectral analysis of time-series aims at detecting power and correlations between signals at different parts of the frequency spectrum. Power spectral estimates are typically based on moments, i.e., integrals of the power density, or simply measurements. Hence, it is unreasonable to utilize metrics between power spectral densities that are not weakly continuous.

For the case of multivariable time-series, power densities are matrix-valued. Thus, any metric must weigh in both the frequency content of the power as well as its directionality. Typically, the directionality of singular vectors of a matrix-power spectral density at a given frequency relates to the relative strength of the corresponding signal-components at the location of the measurement channels (sensors). Therefore, in order to quantify the performance of estimation algorithms, accurately detect changes in time series (events), localize the directionality of echo (e.g., in radar), etc., one needs physically meaningful metrics that weigh in relevant characteristics of power spectra. Thus, at the very least, metrics ought to be weakly continuous and allow the comparison of matrix-valued densities.

Several distance measures have been proposed and extensively used in applications and the literature, see e.g., [15], [16], [17], [18], [19], [1]. However, besides the fact that these fail to be metrics, most fail to be weakly continuous as well (e.g., the Itakura-Saito distance, etc.). In the present section, following the recipe outlined earlier, in (4)-(6), we develop a weakly continuous metric between matrix-valued measures that can be thought of as a generalization of the Wasserstein metric. Our viewpoint herein differs from, yet can be seen as complementary to, the one in [1].

For a Hermitian matrix-valued function $\mathbf{f}$, let
\[ \|\mathbf{f}\|_{Lip} := \sup_{x \neq y} \frac{\|\mathbf{f}(x) - \mathbf{f}(y)\|}{d(x, y)} \]
where $d(x, y)$ is a metric on $I$.
Then $\|\mathbf{f}\|_{Lip} \leq 1$ implies that
\[ \mathbf{f}(x) - \mathbf{f}(y) \leq d(x, y) I \]
in the sense of positive semi-definiteness, with $I$ being the identity matrix. For any two matrix-valued power spectral measures $\boldsymbol{\mu}_1, \boldsymbol{\mu}_2$ we define:
\[ d_{W_1,\kappa}(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2) = \sup_{\mathbf{f}} \left\{ \int_I \mathrm{tr}\big(\mathbf{f}\,(d\boldsymbol{\mu}_1 - d\boldsymbol{\mu}_2)\big) \;\Big|\; \|\mathbf{f}\|_{Lip} \leq 1,\ \|\mathbf{f}\| \leq \kappa \right\}. \tag{7} \]
It is quite clear that this defines a metric and, as we state next, it is also weakly continuous.

Proposition 2:

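Consider a sequence of matrix-valued power spectral measures $\{\boldsymbol{\mu}_k : k = 1, 2, \ldots\}$ and $\boldsymbol{\mu}$. Then
\[ d_{W_1,\kappa}(\boldsymbol{\mu}_k, \boldsymbol{\mu}) \to 0 \quad \text{as } k \to \infty \tag{8} \]
if and only if for any continuous, bounded, Hermitian-valued function $\mathbf{f}$ on $I$ the following holds
\[ \mathrm{tr}\left(\int_I \mathbf{f} \, d\boldsymbol{\mu}_k\right) \to \mathrm{tr}\left(\int_I \mathbf{f} \, d\boldsymbol{\mu}\right) \quad \text{as } k \to \infty. \tag{9} \]

Proof: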

See Appendix A.

We provide an interpretation of (7) that draws a connection to optimal mass transport very much like in Section II. For this, we need the following expression for the total variation:
\[ \|\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2\|_{TV} := \int_I \|d\boldsymbol{\mu}_1 - d\boldsymbol{\mu}_2\|_* \]
where $\|\cdot\|_*$ denotes the nuclear norm, i.e., the sum of singular values. We also need the following matricial analog of the 1-Wasserstein metric between matrix-valued measures $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ with the same "total matricial mass" $\boldsymbol{\mu}_1(I) = \boldsymbol{\mu}_2(I)$,
\[ d_{W_1}(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2) = \inf_{\mathbf{m}} \left\{ \int_{I \times I} d(x, y) \|d\mathbf{m}(x, y)\|_* \;\Big|\; \int_{y \in I} d\mathbf{m}(x, y) = d\boldsymbol{\mu}_1(x),\ \int_{x \in I} d\mathbf{m}(x, y) = d\boldsymbol{\mu}_2(y) \right\}. \]
We remark that, in the above, $d\mathbf{m}(x, y)$ need not be positive semidefinite. The optimization seeks a distribution for the nuclear norm of $d\mathbf{m}(x, y)$ which may now be thought of as the "mass" transported from $x$ to $y$. When $\mu$ and $\nu$ are scalar-valued probability measures, clearly, $d_{W_1}(\mu, \nu)$ is the 1-Wasserstein metric. The following statement represents a generalization of (3).

Proposition 3:

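For two matrix-valued measures $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ on $I$,
\[ d_{W_1,\kappa}(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2) = \inf_{\hat{\boldsymbol{\mu}}_1, \hat{\boldsymbol{\mu}}_2} \left\{ d_{W_1}(\hat{\boldsymbol{\mu}}_1, \hat{\boldsymbol{\mu}}_2) + \kappa \sum_{k=1}^{2} \|\boldsymbol{\mu}_k - \hat{\boldsymbol{\mu}}_k\|_{TV} \right\}. \tag{10} \]

Proof: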

See Appendix B.

V. EXAMPLE

We highlight the characteristics of the proposed distance $d_{W_1,\kappa}$ with a numerical example. In this, we compare three matrix-valued densities $\mathbf{f}_1, \mathbf{f}_2, \mathbf{f}_3$ and compute the values assigned by the metric between them, for each pair. The nature and directionality of their spectral content is such that $\mathbf{f}_2$ can naturally be thought of as "sitting" in the middle of the other two. Thus, if a metric is to be intuitive, it ought to assign distances accordingly. We compare the relative values assigned by our metric as well as three alternatives. These are $d_{IS}$, $d_{TV}$ and the metric introduced in our earlier work [1]. It is seen that (7), as well as the metric in [1], are quite similar, while the other two, $d_{IS}$ and $d_{TV}$, give relative distances that do not reflect the intuition suggested above.

The three chosen matricial densities $\mathbf{f}_i(\theta)$, $i \in \{1, 2, 3\}$, are each specified as a product of three $2 \times 2$ matrix factors, the last involving $e^{-j\theta}$, in terms of polynomials $a_1(z)$, $a_2(z)$, $a_3(z)$ that are each a product of two quadratic factors, for $\theta \in [0, \pi]$. [The explicit expressions for $\mathbf{f}_i(\theta)$ and $a_i(z)$, the peak frequencies quoted below, the grid resolution $\Delta\theta$, and the trace normalization constant are garbled in the source text and are omitted here.]

These are shown in Figure 1 and since, for each $i \in \{1, 2, 3\}$, $\mathbf{f}_i$ is Hermitian-valued and positive semi-definite, our convention is to display the magnitude and phase of their entries as follows: $|f_{i,(1,1)}|$, $|f_{i,(1,2)}|$ ($= |f_{i,(2,1)}|$) and $|f_{i,(2,2)}|$ are displayed in subplots (1,1), (1,2) and (2,2), respectively, and $\angle f_{i,(1,2)}$ ($= -\angle f_{i,(2,1)}$) is shown in subplot (2,1). Most of the peak power in $\mathbf{f}_1$ resides in the second channel, i.e., in $f_{1,(2,2)}$. The power in $\mathbf{f}_2$ splits equally between the two channels. In $\mathbf{f}_3$, peak power is in the first channel. Similarly, it is worth observing the characteristics and phase angles of the cross spectra. All in all, $\mathbf{f}_2$ appears to be "sitting in the middle" between $\mathbf{f}_1$ and $\mathbf{f}_3$. By comparing the relative distances we can now assess whether these reflect the above intuition, i.e., that $\mathbf{f}_2$ is in the middle between $\mathbf{f}_1$ and $\mathbf{f}_3$.

We compute and compare distances given by the generalized Itakura-Saito distance [15]
\[ d_{IS}(\mathbf{f}_1, \mathbf{f}_2) := \int_0^{\pi} \mathrm{tr}\left( \mathbf{f}_1 \mathbf{f}_2^{-1} - \log(\mathbf{f}_1 \mathbf{f}_2^{-1}) - I \right) d\theta, \]
the total variation
\[ d_{TV}(\mathbf{f}_1, \mathbf{f}_2) := \|\mathbf{f}_1 - \mathbf{f}_2\|_{TV}, \]
and the metric $d_{T,\kappa}$ proposed in [1]. To this end, we sample the $\mathbf{f}_i$'s on a uniform frequency grid. To obtain $d_{T,\kappa}$, we first scale so that traces are normalized. The distances are given in Table I where we have chosen $\kappa = 1$ for both $d_{W_1,\kappa}$ and $d_{T,\kappa}$.

Fig. 1. Matrix-valued power spectra $\mathbf{f}_1$, $\mathbf{f}_2$ and $\mathbf{f}_3$ are shown in red dashed line, blue solid line and green dash-dot line, respectively. Subplots (1,1), (1,2) and (2,2) show $f_{i,(1,1)}$, $|f_{i,(1,2)}|$ (same as $|f_{i,(2,1)}|$) and $f_{i,(2,2)}$. Subplot (2,1) shows $\angle(f_{i,(1,2)})$ for $i \in \{1, 2, 3\}$.

TABLE I. DISTANCES BETWEEN DENSITY FUNCTIONS. [Rows: $d_{IS}$, $d_{TV}$, $d_{W_1,\kappa}$, $d_{T,\kappa}$; columns: the three pairs of density functions. The numerical entries did not survive in the source text.]

We observe that $d_{IS}$ gives a rather unintuitive result since we expect similar distances between the two pairs $\mathbf{f}_1, \mathbf{f}_2$ and $\mathbf{f}_2, \mathbf{f}_3$. The total variation does not differentiate the three pairs since all three are found equally close to each other. Using $d_{T,1}$, $d_{T,1}(\mathbf{f}_1, \mathbf{f}_2)$ is very close to $d_{T,1}(\mathbf{f}_2, \mathbf{f}_3)$ and they are both at nearly half compared to $d_{T,1}(\mathbf{f}_1, \mathbf{f}_3)$. Then $d_{W_1,1}$ is quite similar to $d_{T,1}$. These comparisons suggest that $d_{W_1,\kappa}$ as well as $d_{T,\kappa}$ both reflect quite closely the intuition and what one would expect after inspecting the relative distribution of power and directionality in all three densities.

VI. CONCLUDING REMARKS

The main thesis of this paper is that it is natural to quantify distances by evaluating measures against suitable families of test functions. After explaining the general recipe on representative metrics on scalar densities, including the one proposed in our earlier work [10], we expand on possible directions that lead to metrics for matrix-valued measures and density functions. We discuss in detail one so-derived weakly continuous metric which can be thought of as a natural generalization of the 1-Wasserstein metric between matrix-valued densities. Comparison with alternatives as well as a similar metric in [1] is explained on an academic example. A key property of the metric introduced herein is the weak continuity which is not shared by any of the alternatives.

APPENDIX A
PROOF OF PROPOSITION 2

[…] $\mathbf{f}$. To see this, for any Lipschitz and bounded $\mathbf{f}$, it can be scaled so that
\[ \mathbf{f} \Big/ \max\left\{ \|\mathbf{f}\|_{Lip},\ \frac{\max_x \|\mathbf{f}(x)\|}{\kappa} \right\} \]
is a feasible element in (7). Then, the proof requires showing that any element in the set of continuous, bounded Hermitian matrix-valued functions $C_b(\mathbb{H}^{\ell \times \ell}, I)$ can be approached by sequences of Lipschitz and bounded ones from above and below, respectively.

We need the following expressions for a suitable inf and sup of an $\mathbf{f} \in C_b(\mathbb{H}^{\ell \times \ell}, I)$:
\[ \inf_{x \in I} \mathbf{f}(x) := \arg\sup_{\mathbf{m}} \{ \mathrm{tr}(\mathbf{m}) \mid \mathbf{f}(x) - \mathbf{m} \geq 0 \} \]
\[ \sup_{x \in I} \mathbf{f}(x) := \arg\inf_{\mathbf{m}} \{ \mathrm{tr}(\mathbf{m}) \mid \mathbf{m} - \mathbf{f}(x) \geq 0 \} \]
with ordering in the sense of positive semi-definiteness. For $n \in \mathbb{N}$, we denote
\[ \mathbf{f}_{low,n}(x) := \inf_{y \in I} \{ \mathbf{f}(y) + n\,d(x, y) I \} \]
\[ \mathbf{f}_{up,n}(x) := \sup_{y \in I} \{ \mathbf{f}(y) - n\,d(x, y) I \}. \]
Then, $\mathbf{f}_{low,n}(x) \leq \mathbf{f}(x) \leq \mathbf{f}_{up,n}(x)$. For any fixed $x$, $\{\mathbf{f}_{low,n}(x) : n \in \mathbb{N}\}$ and $\{\mathbf{f}_{up,n}(x) : n \in \mathbb{N}\}$ are increasing and decreasing sequences, respectively, and they both converge to $\mathbf{f}(x)$. It is also important to note that $\|\mathbf{f}_{low,n}\|_{Lip} \leq n$ and $\|\mathbf{f}_{up,n}\|_{Lip} \leq n$. To see this, for any $x, y \in I$, we have
\[ \mathbf{f}_{low,n}(x) - \mathbf{f}_{low,n}(y) = \inf_z \{ \mathbf{f}(z) + n\,d(x, z) I \} - \inf_{\hat{z}} \{ \mathbf{f}(\hat{z}) + n\,d(y, \hat{z}) I \} \]
\[ \leq \sup_z \{ (\mathbf{f}(z) + n\,d(x, z) I) - (\mathbf{f}(z) + n\,d(y, z) I) \} = \sup_z\, n\,(d(x, z) - d(y, z)) I \leq n\,d(x, y) I. \]
Thus $\|\mathbf{f}_{low,n}\|_{Lip} \leq n$, and $\|\mathbf{f}_{up,n}\|_{Lip} \leq n$ can be proved in a similar manner. Then we have
\[ \limsup_{k \to \infty} \mathrm{tr}\left(\int \mathbf{f} \, d\boldsymbol{\mu}_k\right) \leq \liminf_{n \to \infty} \limsup_{k \to \infty} \mathrm{tr}\left(\int \mathbf{f}_{up,n} \, d\boldsymbol{\mu}_k\right) = \liminf_{n \to \infty} \mathrm{tr}\left(\int \mathbf{f}_{up,n} \, d\boldsymbol{\mu}\right) = \mathrm{tr}\left(\int \mathbf{f} \, d\boldsymbol{\mu}\right). \]
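The envelope construction used in this proof can be tried out in the scalar case. The sketch below is a simplified illustration, not the matrix-valued construction of the appendix: it assumes a scalar function sampled on a grid, replaces the inf and sup in the positive semi-definite order by ordinary `min` and `max`, and uses hypothetical helper names. It builds the lower and upper envelopes for a function that is continuous but not Lipschitz and reports how closely they sandwich it as $n$ grows.

```python
def lower_envelope(f_vals, xs, n):
    """f_low,n(x) = min over y of f(y) + n*|x - y| (scalar, on a grid)."""
    return [min(fy + n * abs(x - y) for fy, y in zip(f_vals, xs)) for x in xs]


def upper_envelope(f_vals, xs, n):
    """f_up,n(x) = max over y of f(y) - n*|x - y| (scalar, on a grid)."""
    return [max(fy - n * abs(x - y) for fy, y in zip(f_vals, xs)) for x in xs]


def lipschitz_seminorm(vals, xs):
    """Largest slope |v_i - v_j| / |x_i - x_j| over all pairs of grid points."""
    return max(abs(vals[i] - vals[j]) / abs(xs[i] - xs[j])
               for i in range(len(xs)) for j in range(i))


if __name__ == "__main__":
    xs = [i / 200 for i in range(201)]
    # Continuous on [0, 1] but with unbounded slope near x = 0.5.
    f_vals = [abs(x - 0.5) ** 0.5 for x in xs]
    for n in (2, 5, 20):
        low = lower_envelope(f_vals, xs, n)
        up = upper_envelope(f_vals, xs, n)
        gap = max(u - l for u, l in zip(up, low))
        print(n, lipschitz_seminorm(low, xs), lipschitz_seminorm(up, xs), gap)
```

Each envelope has slope at most $n$ on the grid, and the gap between the upper and lower envelopes shrinks as $n$ increases, mirroring the monotone convergence invoked in the proof.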
Similarly, using the sequence $\{\mathbf{f}_{low,n}(x) : n \in \mathbb{N}\}$ we can also show that
\[ \liminf_{k \to \infty} \mathrm{tr}\left(\int \mathbf{f} \, d\boldsymbol{\mu}_k\right) \geq \mathrm{tr}\left(\int \mathbf{f} \, d\boldsymbol{\mu}\right). \]
This completes the proof.

APPENDIX B
PROOF OF PROPOSITION 3

The right-hand side of (10) is
\[ \inf_{\mathbf{m}, \hat{\boldsymbol{\mu}}_1, \hat{\boldsymbol{\mu}}_2} \left\{ \int_{I \times I} d(x, y) \|d\mathbf{m}(x, y)\|_* + \kappa \sum_{k=1}^{2} \|\boldsymbol{\mu}_k - \hat{\boldsymbol{\mu}}_k\|_{TV} \;\Big|\; \int_{y \in I} d\mathbf{m}(x, y) = d\hat{\boldsymbol{\mu}}_1(x),\ \int_{x \in I} d\mathbf{m}(x, y) = d\hat{\boldsymbol{\mu}}_2(y) \right\}. \tag{11} \]
To derive the dual formulation, we need to rewrite the nuclear norm as follows
\[ \|d\mathbf{m}\|_* = \max_{\|\mathbf{w}\| \leq 1} \mathrm{tr}(\mathbf{w} \, d\mathbf{m}) \]
\[ \|d\boldsymbol{\mu}_k - d\hat{\boldsymbol{\mu}}_k\|_* = \max_{\|\boldsymbol{\lambda}_k\| \leq 1} \mathrm{tr}\big(\boldsymbol{\lambda}_k (d\boldsymbol{\mu}_k - d\hat{\boldsymbol{\mu}}_k)\big). \]
Let $\boldsymbol{\phi}_1, \boldsymbol{\phi}_2$ be the Lagrange multipliers for the two constraints. Using the Lagrange multiplier method, (11) equals the following
\[ \inf_{\mathbf{m}, \hat{\boldsymbol{\mu}}_1, \hat{\boldsymbol{\mu}}_2}\ \sup_{\substack{\boldsymbol{\phi}_1, \boldsymbol{\phi}_2 \\ \|\mathbf{w}\|, \|\boldsymbol{\lambda}_1\|, \|\boldsymbol{\lambda}_2\| \leq 1}} \Big\{ \int \mathrm{tr}\big( (\boldsymbol{\phi}_1(x) - \kappa \boldsymbol{\lambda}_1(x)) \, d\hat{\boldsymbol{\mu}}_1(x) \big) + \int \mathrm{tr}\big( (\boldsymbol{\phi}_2(y) - \kappa \boldsymbol{\lambda}_2(y)) \, d\hat{\boldsymbol{\mu}}_2(y) \big) \]
\[ \qquad + \int \mathrm{tr}\big( (d(x, y) \mathbf{w}(x, y) - \boldsymbol{\phi}_1(x) - \boldsymbol{\phi}_2(y)) \, d\mathbf{m}(x, y) \big) + \kappa \int \mathrm{tr}\big( \boldsymbol{\lambda}_1(x) \, d\boldsymbol{\mu}_1(x) + \boldsymbol{\lambda}_2(x) \, d\boldsymbol{\mu}_2(x) \big) \Big\}. \tag{12a} \]
Since this optimization problem is convex, the optimal value does not change by switching the inf and sup in (12a). The optimal assignment for $\mathbf{w}$, $\boldsymbol{\phi}_k$, $\boldsymbol{\lambda}_k$ must satisfy
\[ d(x, y) \mathbf{w}(x, y) - \boldsymbol{\phi}_1(x) - \boldsymbol{\phi}_2(y) = 0, \tag{13a} \]
\[ \boldsymbol{\phi}_1(x) - \kappa \boldsymbol{\lambda}_1(x) = 0, \tag{13b} \]
\[ \boldsymbol{\phi}_2(y) - \kappa \boldsymbol{\lambda}_2(y) = 0. \tag{13c} \]
Setting $x = y$ in (13a), we obtain $\boldsymbol{\phi}_1 = -\boldsymbol{\phi}_2 =: \boldsymbol{\phi}$. Thus
\[ \boldsymbol{\phi}(x) - \boldsymbol{\phi}(y) \leq d(x, y) I, \quad \text{and} \quad \|\boldsymbol{\phi}\| \leq \kappa. \]
By substituting these conditions into (12a), we derive (7).

REFERENCES

[1] L. Ning, T. Georgiou, and A. Tannenbaum, "Matrix-valued Monge-Kantorovich optimal mass transport," IEEE Transactions on Automatic Control, to appear, 2015.
[2] L. Ning, "Matrix-valued optimal mass transportation and its applications," Ph.D. dissertation, University of Minnesota, December 2013.
[3] V. Zolotarev, "Probability metrics," Theory of Probability & Its Applications, vol. 28, no. 2, pp. 278–302, 1984.
[4] A. Müller, "Integral probability metrics and their generating classes of functions," Advances in Applied Probability, pp. 429–443, 1997.
[5] S. T. Rachev, L. Klebanov, S. V. Stoyanov, and F. Fabozzi, The methods of distances in the theory of probability and statistics. Springer, 2013.
[6] S. T. Rachev, Probability metrics and the stability of stochastic models. Wiley Chichester, 1991, vol. 334.
[7] C. Villani, Topics in optimal transportation. American Mathematical Society, 2003, vol. 58.
[8] X. Jiang, Z. Luo, and T. Georgiou, "Geometric methods for spectral analysis," IEEE Transactions on Signal Processing, vol. 60, no. 3, pp. 1064–1074, 2012.
[9] L. Ning, X. Jiang, and T. Georgiou, "On the geometry of covariance matrices," IEEE Signal Processing Letters, vol. 20, no. 8, pp. 787–790, 2013.
[10] T. Georgiou, J. Karlsson, and M. Takyar, "Metrics for power spectra: an axiomatic approach," IEEE Transactions on Signal Processing, vol. 57, no. 3, pp. 859–867, 2009.
[11] J. Karlsson and T. Georgiou, "Uncertainty bounds for spectral estimation," IEEE Transactions on Automatic Control, vol. 58, no. 7, pp. 1659–1673, 2013.
[12] A. Connes, Noncommutative geometry. Academic Press, 1995.
[13] M. Rieffel, "Metrics on state spaces," Doc. Math., J. DMV, vol. 4, pp. 559–600, 1999.
[14] F. D'Andrea and P. Martinetti, "A view on optimal transport from noncommutative geometry," SIGMA, vol. 6, no. 057, p. 24, 2010.
[15] A. Ferrante, C. Masiero, and M. Pavon, "Time and spectral domain relative entropy: A new approach to multivariate spectral estimation," IEEE Transactions on Automatic Control, vol. 57, no. 10, pp. 2561–2575, 2012.
[16] A. Ferrante, M. Pavon, and F. Ramponi, "Hellinger versus Kullback–Leibler multivariable spectrum approximation," IEEE Transactions on Automatic Control, vol. 53, no. 4, pp. 954–967, 2008.
[17] A. Ferrante, M. Pavon, and M. Zorzi, "A maximum entropy enhancement for a family of high-resolution spectral estimators," IEEE Transactions on Automatic Control, vol. 57, no. 2, pp. 318–329, 2012.
[18] B. Afsari and R. Vidal, "The alignment distance on space of linear dynamical systems," in IEEE Conference on Decision and Control, 2013.
[19] X. Jiang, L. Ning, and T. Georgiou, "Distances and Riemannian metrics for multivariate spectral densities,"