MCMC Algorithms for Posteriors on Matrix Spaces
MMCMC Algorithms for Posteriorson Matrix Spaces
Alexandros Beskos ∗ Department of Statistical Science, University College LondonandKengo Kamatani † Department of Engineering Science, Osaka UniversityAugust 10, 2020
Abstract
We study Markov chain Monte Carlo (MCMC) algorithms for target distributionsdefined on matrix spaces. Such an important sampling problem has yet to be ana-lytically explored. We carry out a major step in covering this gap by developing theproper theoretical framework that allows for the identification of ergodicity proper-ties of typical MCMC algorithms, relevant in such a context. Beyond the standardRandom-Walk Metropolis (RWM) and preconditioned Crank–Nicolson (pCN), a con-tribution of this paper in the development of a novel algorithm, termed the ‘Mixed’pCN (MpCN). RWM and pCN are shown not to be geometrically ergodic for an im-portant class of matrix distributions with heavy tails. In contrast, MpCN has verygood empirical performance within this class. Geometric ergodicity for MpCN is notfully proven in this work, as some remaining drift conditions are quite challengingto obtain owing to the complexity of the state space. We do, however, make a lotof progress towards a proof, and show in detail the last steps left for future work.We illustrate the computational performance of the various algorithms through sim-ulation studies, first for the trivial case of an Inverse-Wishart target, and then for achallenging model arising in financial statistics.
Keywords:
Preconditioned Crank–Nicolson; Drift Condition; Matrix-Valued stochastic dif-ferential equation. ∗ Alexandros Beskos acknowledges support from a Leverhulme Trust Prize. † Kengo Kamatani acknowledges support from JSPS KAKENHI Grant Numbers 16K00046, 20H04149and JST CREST Grant Number JPMJCR14D7. a r X i v : . [ s t a t . C O ] A ug Introduction
Statistical models with parameters defined on matrix spaces arise naturally in many ap-plications, with maybe most typical the case of covariance matrices. Related measureshave thus been developed, with most prominent the Matrix-Normal, Wishart and Inverse-Wishart distributions (Barnard et al., 2000). Numerous extensions have appeared, see e.g.O’Malley and Zaslavsky (2008); Huang and Wand (2013) for a scaled and an hierarchicalInverse-Wishart, Barnard et al. (2000) for a strategy that extracts the correlation matrix,Roverato (2002) and Dobra et al. (2011) for the Hyper-Inverse-Wishart and G-Wishartdistributions, respectively.Beyond standard conjugate settings – e.g. Inverse-Wishart prior for the covariance ofGaussian observations – more involved hierarchical models have generated a need for de-veloping a suite of accompanying MCMC methods. This work reviews standard algorithmsand introduces a novel one (MpCN), motivated by an MCMC method on vector-spacesused in Kamatani (2017). The paper invokes a theoretical framework that permit the anal-ysis of the ergodicity properties of some of the presented algorithms or – in the case ofMpCN – makes a lot of progress towards a (quite challenging) proof, and describes the lastremaining steps left for future research. We show that RWM and pCN do not work welleven for an Inverse-Wishart target as they are not geometrically ergodic for heavy-taileddistributions. In contrast, MpCN has much better empirical performance on such targets.Our main contributions are summarised as follows.(i) We develop a new MCMC method – MpCN –, and provide a motivation for itsunderpinnings. MpCN is characterised by better empirical performance against RWM orpCN, in a number of numerical studies.(ii) We prove that targets on the space of positive definite matrices can be ‘upcasted’onto corresponding laws on unrestricted matrices, the latter space permitting a direct pathfor the development of MCMC methodology.(iii) We prove that RWM, pCN are not geometrically ergodic for a wide class of matrix-valued targets. We make a lot of progress into demonstrating geometric ergodicity forMpCN, and highlight the remaining steps for the completed proof.(iv) We run MpCN on a challenging hierarchical model providing a matrix-extension ofthe influential scalar Stochastic Volatility (SV) dynamic jump-models by Barndorff-Nielsenand Shephard (2001). We stress that the paper focuses on MCMC methods with ‘blind’proposals. This is in agreement with the selected SV application – and, more generally,modern pseudo-marginal methods (Andrieu and Roberts, 2009) for complex models – whereguided proposals or Gibbs sampler schemes are typically impractical.The paper develops as follows. Section 2 introduces relevant measures on matrix spaces.Section 3 presents MCMC algorithms on such spaces. Section 4 develops ergodicity resultsfor the RWM and pCN algorithms. Section 5 investigates ergodicity properties of MpCN.Section 6 shows a collection of numerical results. Section 7 provides conclusions and pointsat future work.
Notation: R + := (0 , ∞ ). We write X = d Y if variables X , Y have the same law. M ( p, q ) is the set of p × q (real-valued) matrices, GL( p ) the group of p × p invertiblematrices, Sym( p ) the set of p × p symmetric matrices, P + ( p ) the set of p × p symmetric,2ositive-definite matrices, O ( p ) the set of orthogonal matrices, O ( p, q ), p ≥ q , the space of p × q matrices U such that U (cid:62) U = I q . We use the notation A = ( A ij ) to indicate individualmatrix elements. Let (cid:107) A (cid:107) F be the Frobenius norm, with inner product (cid:104) A, B (cid:105) F = tr( A (cid:62) B ). A (cid:62) is the transpose of A , and det( A ) its determinant. The derivative of h : M ( p, q ) → R is the p × q matrix Dh ( A ) := ( ∂h ( A ) /∂A ij ) . For A ∈ P + ( p ), we denote by A / the matrix B ∈ P + ( p ) such that BB = A . For
U ∈ P + ( p ), p ≥ q , we define ν U on M ( p, q ), ν U (d X ) := Γ q ( p/ q π pq (cid:0) (det U ) − q/ det( X (cid:62) U − X ) p/ (cid:1) Leb(d X ) , Leb(d X ) = p (cid:89) i =1 q (cid:89) j =1 d X ij , Γ p ( · ) is the multivariate Gamma function. We denote ν U simply as ν when U = I p . Lebdenotes the Lebesgue measure, on a space of dimensions implied by the context. TheLebesgue measure and ν U will be used as reference measures on M ( p, q ). Note that ν U is invariant on M ( p, q ) under right multiplication, X (cid:55)→ Xa , a ∈ GL( q ). If p = q then ν ≡ ν U is a unimodular Haar measure on the locally compact topological group GL( q ) and ν ≡ ν − , where ν − ( A ) := ν ( A − ) for any Borel set A ⊆ GL( q ); see e.g. Sections 5, 7 ofFarrell (1985), Section 60 of Halmos (1950).On P + ( q ), we define the measure, µ (d S ) := (det S ) − ( q +1) / (cid:89) ≤ i ≤ j ≤ q d S ij . (1)We note that µ is invariant under the transform S (cid:55)→ aSa (cid:62) , a ∈ GL( q ); see Section 5 ofFarrell (1985). Also, as in Problem 7.10.6 of Farrell (1985), µ ≡ µ − .For the compact topological space O ( p, q ), there is a uniform distribution that will bedenoted d U in this work. See pg. 18 of Chikuse (2003) and Section 7.6 of Farrell (1985) formore details. Note that the uniform distribution on O ( p, q ), for p ≥ q , is the marginal on O ( p, q ) of the uniform distribution on O ( p ), see Theorem 3.3.1 of Chikuse (2003). Example 2.1.
For parameters M ∈ M ( p, q ) , Σ ∈ P + ( p ) , T ∈ P + ( q ) , the Matrix-Normaldistribution N p,q ( M, Σ , T) has density with respect to Leb , φ p,q ( X ; M, Σ , T) = exp (cid:0) − tr [ T − ( X − M ) (cid:62) Σ − ( X − M ) ] / (cid:1) (2 π ) pq/ (det Σ) q/ (det T) p/ . xample 2.2. Let r ∈ R , r > q − , and T ∈ P + ( q ) . The Wishart distribution W q ( r, T ) is a probability measure on P + ( q ) with density with respect to µ , π ( S ) = det( S ) r/ exp( − tr [ T − S ] / rq/ (det T ) r/ Γ q ( r/ . (2) Example 2.3.
Let r ∈ R , r > q − , and T ∈ P + ( q ) . The Inverse-Wishart distribution W − q ( r, T ) is a probability measure on P + ( q ) with density with respect to µ , π ( S ) = (det T ) p/ exp( − tr [ T S − ] / rq/ det( S ) r/ Γ q ( r/ . Remark 1.
We summarize some relevant distribution properties.(i) If X ∼ N p,q ( M, Σ , T) then, AXB ∼ N r,s ( AM B, A Σ A (cid:62) , B (cid:62) T B ) , where A ∈ M ( r, p ) , B ∈ M ( q, s ) are full rank matrices such that r ≤ p and q ≥ s .(ii) If X i ∼ N p,q ( M i , Σ i , T) ( i = 1 , are independent then, X + X ∼ N p,q ( M + M , Σ + Σ , T) . (iii) If X ∼ N p,q (0 , I p , T ) then S = X (cid:62) X ∼ W q ( p, T ) .(iv) If S ∼ W q ( r, T ) then S − ∼ W − q ( r, T − ) . P + ( q ) Onto M ( p, q ) Let p ≥ q and U ∈ P + ( p ). We show that a distribution on P + ( q ) can be expressed astransform of one on the larger space M ( p, q ) via the surjective mapping x (cid:55)→ x (cid:62) U − x . Theorem 2.4.
Let ˜Π(d S ) = ˜ π ( S ) µ (d S ) be a probability measure on P + ( q ) . Consider thedistribution Π(d x ) on M ( p, q ) , p ≥ q , defined as, Π(d x ) := ˜ π ( x (cid:62) U − x ) ν U (d x ) . (3) If X ∼ Π(d x ) then S = X (cid:62) U − X ∼ ˜Π(d S ) .Proof. We use a result from multivariate calculus. By Theorem 5.2.2 of Farrell (1985), for f : M ( p, q ) → R , (cid:90) M ( p,q ) f ( x ) ν (d x ) = (cid:90) P + ( q ) µ (d S ) (cid:90) O ( p,q ) f ( U S / )d U . (4)In particular, if f ( x ) = g ( x (cid:62) U − x ), for g : P + ( q ) → R , we have (cid:90) M ( p,q ) g ( x (cid:62) U − x ) ν U (d x ) = (cid:90) M ( p,q ) g ( x (cid:62) x ) ν (d x )= (cid:90) P + ( q ) g ( S ) µ (d S ) .
4e have obtained a change of variables formula for map x (cid:55)→ x (cid:62) U − x = S . Replacing g ( · )with ( g · ˜ π )( · ) gives, (cid:90) P + ( q ) g ( S ) ˜Π(d S ) = (cid:90) M ( p,q ) g ( x (cid:62) U − x )Π(d x ) . This completes the proof.
Remark 2.
We will consider MCMC methods on M ( p, q ) . In light of Theorem 2.4, suchalgorithms are directly relevant for distributions ˜Π(d S ) = ˜ π ( S ) µ (d S ) on P + ( q ) since wecan execute the MCMC algorithm on M ( p, q ) , with target law Π(d x ) = ˜ π ( x (cid:62) U − x ) ν U (d x ) ,and apply the transform x (cid:55)→ x (cid:62) U − x = S on the collected x -samples. We provide some MCMC methods on M ( p, q ), p ≥ q . As noted in Section 1, we focuson MCMC algorithms involving blind proposals, i.e. containing no information about thetarget. Let Π(d x ) be a target distribution on M ( p, q ), and – in a Metropolis–Hastingssetting – Q ( x, d y ) a proposal Markov kernel. Assume that ξ is a σ -finite measure such thatΠ is absolutely continuous with respect to ξ , and Q ( x, d y ) is ξ -reversible. Then, the triplet(Π , Q, ξ ) give rise to a Metropolis–Hastings kernel, so that, for Borel sets A ⊆ M ( p, q ), P ( x, A ) = (cid:90) A Q ( x, d y ) α ( x, y ) + R ( x ) · δ x ( A ) , where we have defined R ( x ) = 1 − (cid:82) Q ( x, d y ) α ( x, y ), for acceptance probability functionthat has the simple form, α ( x, y ) = min { , π ( y ) /π ( x ) } , π ( x ) := (dΠ / d ξ )( x ) . (5)All Metropolis–Hastings kernels – on M ( p, q ) – in this paper will correspond to instancesof such triplets (Π , Q, ξ ). A similar notation is adopted for measures restricted on P + ( q ). The proposal kernel Q ( x, d y ) of the RWM algorithm on M ( p, q ) is defined by the updateRWM : y = x + w, (6)with w ∼ N p,q (0 , U, V ), U ∈ P + ( p ), V ∈ P + ( q ). The proposal kernel Q ( x, · ) = N p,q ( x, U, V )is reversible with respect to the Lebesgue measure on M ( p, q ). Thus, following the notationwe established above, we now have the triplet (Π , Q, Leb), and the acceptance probability a ( x, y ) is as in (5), with π ( x ) the density of Π with respect to Leb. Let ρ ∈ [0 , M ( p, q ) is determined via the proposal,pCN : y = ρ / x + (1 − ρ ) / w, (7)5ith w ∼ N p,q (0 , U, V ). The proposal kernel Q ( x, · ) = N p,q ( ρ / x, U, (1 − ρ ) V ) is reversiblewith respect to N p,q (0 , U, V ). We have the triplet (Π , Q, N p,q (0 , U, V )), so the acceptanceprobability is as in (5), where π ( x ) is density of Π with respect to N p,q (0 , U, V ). The pCNalgorithm was introduced in Section 4.2 of Neal (1999). As shown there, and in severalmore recent works (see e.g. Beskos et al. (2008); Cotter et al. (2013)) pCN can be veryeffective in high dimensions in the context of Gaussian priors and not overly informativeobservations. Algorithms are well-defined even on infinite-dimensional Hilbert spaces; thishas sparked the use of pCN in the area of Bayesian Inverse Problems, see e.g. the overviewin Stuart (2010). M ( p, q ) We return briefly at the scenario of Section 2.3, i.e. adopt the viewpoint that the originaltarget distribution is given as ˜Π(d S ) = ˜ π ( S ) µ (d S ) on P + ( q ), and one aims to generate X ∼ Π(d x ) = ˜ π ( x (cid:62) U − x ) ν U (d x ) on M ( p, q ), p ≥ q , and return S = X (cid:62) U − X . In such anupcasted setting, one can provide a motivation for the derivation of MpCN. Let x i · denotethe i th row of x ∈ M ( p, q ). Since the density of Π(d x ) with respect to Leb writes as, f (cid:0) z (cid:62) · z · + · · · + z (cid:62) p · z p · (cid:1) ; z := U − / x, where f ( S ) = ˜ π ( S ) det( S ) − p/ , S ∈ P + ( q ), it is clear that, under X ∼ Π(d x ), Z · = d Z · = d · · · = d Z p · , so all rows of Z := U − / X have the same marginal law. Looking back at the pCN proposal(7), as adjusted under the linear transform x (cid:55)→ U − / x =: z , the above understandingcan provide guidance for tuning the algorithmic parameter V ∈ P + ( q ) – corresponding tothe variance of each row-vector of the noise U − / w – given information about the currentposition z = U − / x of the MCMC chain. In a ‘classical’ approach, a likelihood-basedchoice would simply be the sample variance over the rows, (cid:98) V := (cid:80) pi =1 z (cid:62) i · z i · /p . A Bayesianapproach seems preferable as it will ultimately provide an algorithm with a heavier-tailedproposal. A choice of Jeffrey’s prior µ (d V ) (see e.g. Geisser and Cornfield (1963)), with µ as given in (1), combined with a likelihood φ p,q ( z ; 0 , U − / U U − / , V ), U ∈ P + ( p ), is easilyshown to provide the posterior W − q ( p, x (cid:62) U − x ) for V . The above thinking gives rise to the following proposal, applicable for a general law Π(d x )on M ( p, q ) – not only in the upcasted scenario adapted above for purposes of providingsome rationale under a particular viewpoint –,MpCN : y = ρ / x + (1 − ρ ) / w ; (8) w ∼ N p,q (0 , U, V ); V ∼ W − q ( p, x (cid:62) U − x ) . emark 3. The law of [ w | x ] is that of a Matrix-Student-t distribution with Lebesguedensity proportional to z (cid:55)→ det( x (cid:62) U − x + z (cid:62) U − z ) − p ; see e.g. Dickey (1967). MpCN is a Metropolis–Hastings method with target Π(d x ) on M ( p, q ), and proposal asdetermined in (8). Let Q denote the related proposal transition kernel. The lemma thatfollows shows that Q is ν U -reversible, so we have the triplet (Π , Q, ν U ) and the acceptanceprobability is as in (5), where π ( x ) is the density of Π with respect to ν U . Lemma 3.1.
The MpCN proposal kernel Q on M ( p, q ) has density q ( x, y ) with respect to ν U (d y ) , that writes as, q ( x, y ) = (1 − ρ ) pq/ c p,q det( x (cid:62) U − x ) p/ × det( y (cid:62) U − y ) p/ det( R ( x, y )) p , (9) where, c p,q = Γ q ( p )2 q π pq/ Γ q ( p/ , and, R ( x, y ) = x (cid:62) U − x + y (cid:62) U − y − ρ / x (cid:62) U − y − ρ / y (cid:62) U − x. Therefore, Q is ν U -reversible.Proof. Following the definition of MpCN, the joint distribution of [
V, y | x ] writes as, W − q (d V ; p, x (cid:62) U − x ) ⊗ N p,q (d y ; ρ / x, U, (1 − ρ ) V ) . (10)It remains to integrate out V in (10), and take the density of the resulted distribution of[ y | x ] with respect to ν U (d y ). All such calculations can be carried out analytically due to µ (d V ) being conjugate with respect to Matrix-Normal distribution with right-covariancematrix V . Thus, tedious but otherwise straightforward calculations give the density q ( x, y )in expression (9). Since q ( x, y ) = q ( y, x ), we have ν U (d x ) Q ( x, d y ) = ν U (d y ) Q ( y, d x ). P + ( q ) under U = U Returning at the context of an initial target ˜Π(d S ) on P + ( q ), we show here that, whenthe operators U and U – used, respectively, when upcasting the target ˜Π onto M ( p, q )and as a parameter in the MpCN proposal – coincide, then MpCN on M ( p, q ) inducesa Markovian kernel on P + ( q ) under the transform x (cid:55)→ x (cid:62) U − x . Such kernel is easier toanalyse as it exhibits random-walk-type behavior, as shown in the Proposition and Theoremthat follow. We define the operation A ◦ B = B / AB / , A, B ∈ P + ( q ); it follows that( A ◦ B ) − = A − ◦ B − and tr( A ◦ B ) = tr( AB ). Recall also the definition of the uniformmeasure d U in Section 2.1. Proposition 3.2.
Let ˜Π(d S ) = ˜ π ( S ) µ (d S ) and Π(d x ) = ˜ π ( x (cid:62) U − x ) ν U (d x ) , be probabilitymeasures on P + ( q ) and M ( p, q ) respectively, with U ∈ P + ( q ) , p ≥ q . Given x ∈ M ( p, q ) ,set y ∼ Q ( x, d y ) , where Q = Q ( · ; ρ, U ) is the proposal kernel of MpCN, with parameters ρ ∈ [0 , and U . i) We have the representation, for det( x (cid:62) U − x ) (cid:54) = 0 , y (cid:62) U − y = (cid:15) ◦ ( x (cid:62) U − x ) , for some P + ( q ) -valued random variable (cid:15) with law that does not depend on x , U .(ii) The law of (cid:15) , L (d (cid:15) ) , writes as, L (d (cid:15) ) = (1 − ρ ) pq/ c p,q µ (d (cid:15) ) (cid:90) O ( p,q ) det( (cid:15) ) p/ det( I q + (cid:15) − ρ / ( U (cid:15) / + (cid:15) / U (cid:62) )) p d U , (11) with c p,q as given in Lemma 3.1, where U denotes the first q rows of U . In particular, L (d (cid:15) ) depends on ρ , but not on U .(iii) We have (cid:15) = d (cid:15) − .Proof. Let u ( x ) = U − / x ( x (cid:62) U − x ) − / ∈ O ( p, q ). There exists an orthogonal comple-ment, u ( x ) ∈ M ( p, p − q ) as an analytic function, such that u ( x ) = ( u ( x ) , u ( x )) ∈ O ( p ).We consider linear transformations of y , z i = u i ( x ) (cid:62) U − / y ( x (cid:62) U − x ) − / , i = 1 , , and set z = ( z , z ) ∈ M ( p, q ). Then (cid:15) = z (cid:62) z . By Lemma 3.1, and the change of variablesformula (Lemma 1.5.1 of Chikuse (2003)), the law of z is,(1 − ρ ) pq/ c p,q det( z (cid:62) z ) q/ det( I q + z (cid:62) z − ρ / ( z + z (cid:62) )) p ν (d z ) , which does not involve x or U . Also, by the change of variables formula (4), the law of (cid:15) = z (cid:62) z is as in (11). Finally, the law, L − (d (cid:15) ), of the inverse (cid:15) − writes as (upon recallingthat µ − ≡ µ ), L − (d (cid:15) ) = (1 − ρ ) pq/ c p,q µ (d (cid:15) ) × (cid:90) O ( p,q ) det( (cid:15) − ) p/ det( I q + (cid:15) − − ρ / ( U (cid:15) − / + (cid:15) − / U (cid:62) )) p d U = (1 − ρ ) pq/ c p,q µ (d (cid:15) ) × (cid:90) O ( p,q ) det( (cid:15) ) p/ det( I q + (cid:15) − ρ / ( U (cid:62) (cid:15) / + (cid:15) / U )) p d U . From Theorem 3.3.1 of Chikuse (2003), the uniform distribution on O ( p, q ) is the marginal(on O ( p, q )) of the uniform distribution on O ( p ). Also, by Section 1.4.1 of Chikuse (2003),the uniform distribution on O ( p ) is invariant under the matrix transpose operation. Thus,the distribution of U is also invariant under the matrix transpose operation. The proof isnow complete.Proposition 3.2 leads to the Theorem below.8 heorem 3.3. Let ˜Π(d S ) = ˜ π ( S ) µ (d S ) be a target on P + ( q ) . Define the correspondingupcasted law on M ( p, q ) , Π(d x ) = ˜ π ( x (cid:62) U − x ) ν U (d x ) , U ∈ P + ( q ) , and let Q = Q ( · ; ρ, U ) be the MpCN proposal kernel. Consider also the kernel ˜ Q ( S, · ) := d (cid:15) ◦ S , S ∈ P + ( q ) , withthe law L (d (cid:15) ) of (cid:15) ∈ P + ( q ) as determined in Proposition 3.2(ii), for parameters ρ ∈ [0 , and p , p ≥ q .If { X n } n ≥ is the MpCN Markov chain with target Π and proposal Q , then the process { S n } n ≥ with, S n := X (cid:62) n U − X n , is a Metropolis–Hastings Markov chain, with respect to its own filtration, with target ˜Π andproposal ˜ Q (and initial position S = d X (cid:62) U − X ). Remark 4. (i) Given Theorem 3.3, we will refer – without confusion – to the Markovkernel, denoted ˜ P , with target ˜Π and proposal ˜ Q = ˜ Q ( · ; ρ, p ) as “an MpCN kernelon P + ( q ) ”. Notice that ˜ Q does not depend on the choice of parameter U .(ii) The result of Theorem 3.3 is not an immediate implication of Proposition 3.2, as notall Markov chains with a proposal, accept/reject rule, and given stationary distribu-tion, are necessarily of Metropolis–Hastings type.Proof of Theorem 3.3. We show that { S n } n ≥ is in the class of Metropolis–Hastings Markovchains introduced in the beginning of Section 3, with triplet ( ˜Π , ˜ Q, µ ). First, we prove µ -reversibility of ˜ Q , by making use of the ν U -reversibility of Q . Indeed, for any given Borelsets A, B ⊆ P + ( q ), upon defining the sets A ∗ = { x ∈ M ( p, q ) : x (cid:62) U − x ∈ A } and B ∗ = { x ∈ M ( p, q ) : x (cid:62) U − x ∈ B } , we have, (cid:90) A ˜ Q ( S, B ) µ (d S ) = (cid:90) A ∗ Q ( x, B ∗ ) ν U (d x )= (cid:90) B ∗ Q ( x, A ∗ ) ν U (d x ) = (cid:90) B ˜ Q ( S, A ) µ (d S ) . The acceptance probability of { X n } n ≥ coincides with that of the triplet ( ˜Π , ˜ Q, µ ) via thesurjective mapping x (cid:55)→ x (cid:62) U − x . The proof is complete. Proposition 3.4.
Let { S n } n ≥ be a Markov process on P + ( q ) with transition kernel corre-sponding to that of an MpCN chain with target ˜Π and proposal ˜ Q ( · ; ρ, p ) . Then, { S − n } n ≥ is a Markov process on P + ( q ) with transition kernel corresponding to that of an MpCNtransition kernel with target ˜Π − and identical proposal ˜ Q ( · ; ρ, p ) .Proof. Recall that { S n } n ≥ corresponds to the triplet ( ˜Π , ˜ Q, µ ). We show that { S − n } n ≥ corresponds to triplet ( ˜Π − , ˜ Q, µ ). Since S (cid:55)→ S − is a bijection in P + ( q ), { S − n } n ≥ isalso a Markov chain. The third component of the triplet is invariant under this transform,since µ − = µ . The second component is also invariant since ˜ Q ( S, A ) = ˜ Q ( S − , A − ), for A ⊆ P + ( q ), because (cid:15) = d (cid:15) − by Proposition 3.2(iii). The first component of the triplet˜Π becomes ˜Π − . Finally, the acceptance probability α ( S, S ∗ ) of the triplet ( ˜Π , ˜ Q, µ ) isthe same as that of ( ˜Π − , ˜ Q, µ ) via the projection S (cid:55)→ S − . Therefore { S − n } n ≥ is theMetropolis–Hastings chain determined by ( ˜Π − , ˜ Q, µ ). The proof is complete.9 emark 5.
If a Markov chain { S n } n ≥ is geometrically ergodic, and T n = f ( S n ) formsa Markov chain – for a map f , on appropriate domains – then { T n } n ≥ is also geomet-rically ergodic, since the σ -algebra generated by { S n } n ≥ contains that of { T n } n ≥ . Thisobservation is used in the following two statements.(i) If the upcasted MpCN kernel on M ( p, q ) is geometrically ergodic, then the deducedMpCN kernel on P + ( q ) is also geometrically ergodic.(ii) Proposition 3.4 implies that, if the MpCN kernel on P + ( q ) targeting ˜Π is geometricallyergodic, then the MpCN kernel targeting ˜Π − (for the same ρ , p ) is also geometricallyergodic. Indicatively, geometric ergodicity of MpCN with Wishart target ˜Π ≡ W q ( r, T ) is equivalent to geometric ergodicity of MpCN with Inverse-Wishart target ˜Π − ≡ W − q ( r, T − ) , for r ∈ R , r > q − , and T ∈ P + ( q ) . We defined MpCN by using Jeffrey’s prior µ (d V ) on the right-covariance parameter V ofpCN. We can deduce a general class of kernels, special instances of which correspond topCN and MpCN. Consider the prior W − q ( r, T ) for V , for r > q −
1, and T ∈ P + ( q ). Then,working as in the derivation of MpCN, we obtain the following proposal,vMpCN : y = ρ / x + (1 − ρ ) / w ; w ∼ N p,q (0 , U, V ); V ∼ W − q ( p + r, T + x (cid:62) U − x ) , where ‘v’ stands for ‘variation’. This kernel can be shown to be reversible with respect to aMatrix-Student- t law (this can provide the acceptance probability) with Lebesgue density z (cid:55)→ det( T + z (cid:62) U − z ) − ( p + r ) / . MpCN corresponds to r = 0, T = 0, and pCN to T = rT , T ∈ P + ( q ), r → ∞ . Throughout this section, and unless specified otherwise, U ∈ P + ( p ), V ∈ P + ( q ), ρ ∈ [0 , M ( p, q ), p ≥ q ;also U ∈ P + ( q ) is the matrix involved when upcasting the target from P + ( q ) to M ( p, q ). Many MCMC methods do not work well for target distributions with contour manifoldsthat degenerate – in a proper sense – in the tails; see, e.g., Section 5 of Jarner and Hansen(2000). To exclude such cases, we consider the following class of functions.
Definition 4.1.
A continuously differentiable function h : M ( p, q ) (cid:55)→ R + satisfies thecontour condition if, lim (cid:107) x (cid:107) F → + ∞ (cid:10) Dh ( x ) (cid:107) Dh ( x ) (cid:107) F , x (cid:107) x (cid:107) F (cid:11) F < . (12)10any laws on P + ( q ) have densities with regular variations. Regular variation on R is aconcept met in a large literature; see, e.g., Resnick (2007); Bingham et al. (1989). Regularvariation on P + ( q ) is a natural extension from R . We first define an appropriate metric. Remark 6.
Matrix A ∈ P + ( q ) writes as A = U Λ U (cid:62) , for U ∈ O ( q ) and positive diagonalmatrix Λ . The logarithmic map log : P + ( q ) → Sym( q ) is given by log A = U (log Λ) U (cid:62) where log Λ := diag { log Λ , . . . , log Λ qq } . We consider a metric d ( · , · ) on P + ( q ) , such that, d ( A, B ) := (cid:107) log( A ◦ B − ) (cid:107) F = (cid:8) q (cid:88) i =1 log λ i (cid:9) / , (13) where λ , . . . , λ q are eigenvalues of A ◦ B − , A, B ∈ P + ( q ) . This is a natural distance in-duced by the logarithmic map; see Theorem XII.1.3 of Lang (1999). Note that a matrix canhave several square roots, in general, but the value of d ( A, B ) is unique. The topology in-duced by the metric d and that by the Frobenius norm are different; the former fits naturallyto P + ( q ) . P + ( q ) is a complete metric space under d , but not under the Frobenius norm:e.g., observe that A n = diag { n − , , . . . , } forms a Cauchy sequence under the Frobeniusnorm, with a limit A = diag { , , . . . , } / ∈ P + ( q ) . Definition 4.2.
A function h : P + ( q ) → R + is regularly varying if there exists r ∈ R , with r > q − , such that, h ( zx ) h ( zI q ) −→ z → + ∞ det( x ) r/ , locally uniformly in x ∈ P + ( q ) under the topology induced by the metric d ( · , · ) in (13). Remark 7.
The probability density function of the Inverse-Wishart law W − q ( r, T ) is reg-ularly varying since, π ( zx ) π ( zI q ) = det( x ) − r/ exp (cid:0) − z − tr (cid:0) T ( x − − I q ) (cid:1)(cid:1) −→ z → + ∞ det( x ) − r/ . We have the following positive result about the RWM algorithm on M ( p, q ). Proposition 4.3 (Jarner and Hansen (2000)) . Consider the law
Π(d x ) = π ( x ) Leb(d x ) on M ( p, q ) . Assume that π is continuously differentiable, satisfies the contour condition, and, lim (cid:107) x (cid:107) F → + ∞ (cid:10) D log π ( x ) , x (cid:107) x (cid:107) F (cid:11) = −∞ . (14) Then, RWM with target Π is geometrically ergodic. We have the following negative result – in the Corollary following the Proposition – forRWM on targets with regularly varying density functions.
Proposition 4.4 (Proposition 2 of Kamatani (2017)) . (i) If the RWM kernel with target Π(d x ) on M ( p, q ) is geometrically ergodic, then, forsome s > , (cid:90) M ( p,q ) exp( s (cid:107) x (cid:107) F )Π(d x ) < ∞ . ii) Let ˜Π(d S ) = ˜ π ( S ) µ (d S ) be a probability distribution on P + ( q ) , with correspondingupcasted law on M ( p, q ) , Π(d x ) = ˜ π ( x (cid:62) U − x ) ν U (d x ) . If the RWM chain with target Π is geometrically ergodic, then, for some s > , (cid:90) P + ( q ) exp (cid:16) s (cid:112) tr( S ) (cid:17) ˜Π(d S ) < ∞ . Corollary 4.5.
Let ˜Π(d S ) = ˜ π ( S ) µ (d S ) be a probability law on P + ( q ) . If ˜ π is continuous,regularly varying, then the RWM chain with target Π(d x ) = ˜ π ( x (cid:62) U − x ) ν U (d x ) on M ( p, q ) is not geometrically ergodic.Proof. Any S ∈ P + ( q ) writes as S = U (cid:62) Λ U = ( (cid:80) i λ i ) U (cid:62) Λ ∗ U = z U (cid:62) Λ ∗ U , with z = (cid:80) i λ i , U ∈ O ( q ), Λ = diag { λ , · · · , λ q } , Λ ∗ = diag { λ ∗ , · · · , λ ∗ q } , 0 < λ < · · · < λ q , where we haveset λ ∗ i = λ i / (cid:80) j λ j . We define a bounded set, for ε > ε := (cid:110) λ ∗ = ( λ ∗ , . . . , λ ∗ q − ) ∈ R q − : ε < λ ∗ < · · · < λ ∗ q , λ ∗ q = 1 − q − (cid:88) i =1 λ ∗ i (cid:111) , We introduced ε here so that we can use Lebesgue’s dominated convergence theorem in thefollowing inequality. By Theorem 5.3.1 of Farrell (1985), for f : R + → R + , we have, (cid:90) S ∈ P + ( q ) f (tr( S )) ˜Π(d S ) = c (cid:90) f (tr Λ) (cid:89) i
1; hence, from Lebesgue’s dominated convergence theorem, h ε ( zx ) h ε ( z ) = h ε ( zx ) / ˜ π ( zI q ) h ε ( z ) / ˜ π ( zI q ) −→ z →∞ x − rq/ . Thus, for f ( x ) = exp( s √ x ), s, x >
0, we have obtained, (cid:90) S ∈ P + ( q ) exp (cid:0) s (cid:112) tr( S ) (cid:1) ˜Π(d S ) ≥ c (cid:90) ∞ exp (cid:0) s √ z (cid:1) h ε ( z ) z − d z.
12y Theorem 1.5.6(iii) of Bingham et al. (1989), h ε ( z ) z − ≥ Cz − rq/ − − δ , for any δ >
0, forsome
C >
0, and sufficiently large z >
0. The claim of Corollary 4.5 follows, since,lim z →∞ exp( s √ z ) h ε ( z ) z − = + ∞ , for any s >
0, that violates the integrability condition in Proposition 4.4(ii).
Using the above results, we check geometric ergodicity for RWM for the standard proba-bility measures written down in Section 2.2.1. Let π ( x ) be the density function of N p,q ( M, Σ , T) under Leb. Then, we have thederivative D log π ( x ) = − Σ − ( x − M )T − . Thus, as (cid:107) x (cid:107) F → ∞ , the inner productterm at the contour condition (12) is dominated from above by, − inf e : (cid:107) e (cid:107) F =1 (cid:104) Σ − e T − , e (cid:105) F (cid:107) Σ − e T − (cid:107) F ≤ − inf e : (cid:107) e (cid:107) F =1 (cid:104) Σ − e T − , e (cid:105) F sup e : (cid:107) e (cid:107) F =1 (cid:107) Σ − e T − (cid:107) F . The denominator is bounded from above, and the numerator is bounded away from 0since, (cid:10) Σ − e T − , e (cid:11) F = tr (cid:2) T − / e (cid:62) Σ − e T − / (cid:3) ≥ λ − (cid:107) e (cid:107) F = λ − , (15)where λ is the largest among all eigenvalues of matrices Σ and T. Thus, the densitysatisfies the contour condition. We can also check that π satisfies (14). Therefore,RWM is geometrically ergodic when the target is N p,q ( M, Σ , T).2. Consider the Wishart distribution W q ( r, T ), r ∈ R , r > q −
1. Since the probabilitymeasure is defined on P + ( q ), we use the upcasting strategy to apply RWM. Thedensity function of the upcasted law of W q ( r, T ) under Leb is, π ( x ) = const. × det( x (cid:62) U − x ) r/ exp (cid:0) − tr [ T − ( x (cid:62) U − x ) ] / (cid:1) det( x (cid:62) U − x ) − p/ . Thus, D log π ( x ) = r − p D log(det( x (cid:62) U − x )) − U − xT − . By the triangle inequality, inequalities (12) and (14) will follow once we show,lim (cid:107) x (cid:107) F →∞ (cid:107) D log(det( x (cid:62) U − x )) (cid:107) F (cid:107) x (cid:107) F = 0 . By Equation (15.8.6) of Harville (1997),dd x ij log(det( x (cid:62) U − x )) = tr (cid:104) ( x (cid:62) U − x ) − dd x ij ( x (cid:62) U − x ) (cid:105) = 2 [ U − x ( x (cid:62) U − x ) − ] ij . (cid:107) D log(det( x (cid:62) U − x )) (cid:107) F = 4 (cid:88) ij [ U − x ( x (cid:62) U − x ) − ] ij ≤ (cid:107) x (cid:107) − F sup e : (cid:107) e (cid:107) F =1 (cid:88) ij [ U − e ( e (cid:62) U − e ) − ] ij , which converges to 0 as (cid:107) x (cid:107) F → ∞ . So inequalities (12) and (14) hold. This impliesthat RWM is geometrically ergodic for the Wishart distribution.3. By Corollary 4.5, since the Inverse-Wishart distribution has regularly varying densityfunction (see Remark 7), RWM is not geometrically ergodic for W − q ( r, T ). We can now prove that pCN is not geometrically ergodic in heavy tailed scenarios.
Proposition 4.6. (i) Consider
Π(d x ) = π ( x ) Leb(d x ) . Let λ be the largest amongstall eigenvalues of U , V (recall these are parameters appearing at the pCN proposalkernel). Set, C z,ε ( s ) := sup s (cid:48) : (cid:107) s − s (cid:48) (cid:107) F ≤ ε z − (cid:12)(cid:12) log π ( zs ) − log π ( zs (cid:48) ) (cid:12)(cid:12) . If, for some ε > − ρ / , inf s : (cid:107) s (cid:107) F =1 lim inf z →∞ C z,ε ( s ) < − ρ λ − , then the pCN kernel with target Π is not geometrically ergodic.(ii) Consider ˜Π(d S ) = ˜ π ( S ) µ (d S ) . Let ˜ λ be the largest amongst all eigenvalues of U , V and U − (recall U is the matrix appearing in the definition of the upcasted target Π ).Set, C z,ε ( S ) := sup S (cid:48) : d ( S,S (cid:48) ) <ε z − (cid:12)(cid:12) log ˜ π ( z S ) − log ˜ π ( z S (cid:48) ) (cid:12)(cid:12) . If, for some ε > q / − ρ / ) , inf S ∈ P + ( q ) , tr( S )=1 lim inf z →∞ C z,ε ( S ) < − ρ ˜ λ − , then the pCN chain with target Π is not geometrically ergodic.Proof. First we prove (i). Let P be the Markov kernel of the pCN chain. From Proposition5.1 of Roberts and Tweedie (1996) and continuity of x (cid:55)→ P ( x, { x } ), it suffices to showthat sup x ∈ M ( p,q ) P ( x, { x } ) = 1. By definition,1 − P ( x, { x } ) = (cid:90) M ( p,q ) α ( x, y ( w )) φ p,q ( w ; 0 , U, V )d w, (16)14or y = y ( w ) = ρ / x + (1 − ρ ) / w ; notice also that the acceptance probability writes as, α ( x, y ( w )) = min (cid:110) , π ( y ( w )) φ p,q ( x ; 0 , U, V ) π ( x ) φ p,q ( y ( w ); 0 , U, V ) (cid:111) . Let x = zs , for z > (cid:107) s (cid:107) F = 1, so that (cid:107) x (cid:107) F = z (cid:107) s (cid:107) F = z . For any w ∈ M ( p, q ), (cid:107) xz − y ( w ) z (cid:107) F = (cid:107) s − ρ / s + z − (1 − ρ ) / w (cid:107) F −→ z →∞ − ρ / . For each w ∈ M ( p, q ) choose z = z ( w ) > (cid:107) xz − y ( w ) z (cid:107) F < ε for any z > z . Then,the definition of C z,ε ( s ) in the statement of the Proposition implies that for z = (cid:107) x (cid:107) F > z , | log π ( y ( w )) − log π ( x ) | = | log π ( z y ( w ) z ) − log π ( z xz ) | ≤ C z,ε ( z − x ) z . (17)We have that, for fixed w , y ( w ) = ρ / x + O (1) and, − log φ p,q ( y ( w ); 0 , U, V ) + log φ p,q ( x ; 0 , U, V )= (cid:8) tr [ V − y ( w ) (cid:62) U − y ( w ) ] − tr [ V − x (cid:62) U − x ] (cid:9) = − − ρ tr [ V − x (cid:62) U − x ] + O ( (cid:107) x (cid:107) F ) ≤ − − ρ λ − (cid:107) x (cid:107) F + O ( (cid:107) x (cid:107) F ) . (18)Choose a sequence x n ∈ M ( p, q ), for n = 1 , , . . . , with z n = (cid:107) x n (cid:107) F ≥ z , such that,lim sup n →∞ C z n ,ε ( s n ) < − ρ λ − , where s n = x n / (cid:107) x n (cid:107) F . Then, for y n ( w ) = ρ / x n + (1 − ρ ) / w , we have, α ( x n , y n ( w )) = min (cid:110) , π ( y n ( w )) φ p,q ( x n ; 0 , U, V ) π ( x n ) φ p,q ( y n ( w ); 0 , U, V ) (cid:111) ≤ π ( y n ( w )) φ p,q ( x n ; 0 , U, V ) π ( x n ) φ p,q ( y n ( w ); 0 , U, V )= exp (log π ( y n ( w )) − log π ( x n )) × exp ( − log φ p,q ( y n ( w ); 0 , U, V ) + log φ p,q ( x n ; 0 , U, V )) ≤ exp (cid:0)(cid:0) C z n ,ε ( s n ) − − ρ λ − (cid:1) (cid:107) x n (cid:107) F + O ( (cid:107) x n (cid:107) F ) (cid:1) −→ n →∞ , for any fixed w ∈ M ( p, q ). The proof of (i) is completed via the dominated convergencetheorem, as from (16) we obtain that sup P ( x, { x } ) = 1.The proof for (ii) is similar. Let x = zs , for z > (cid:107) U − / s (cid:107) F = 1 as above. Then,for any w ∈ M ( p, q ), d (cid:0) x (cid:62) U − x, y ( w ) (cid:62) U − y ( w ) (cid:1) = d (cid:0) xz (cid:62) U − xz , y ( w ) z (cid:62) U − y ( w ) z (cid:1) = d (cid:0) s (cid:62) U − s, ( s − ρ / s + z − (1 − ρ ) / w ) (cid:62) U − ( s − ρ / s + z − (1 − ρ ) / w ) (cid:1) → d (cid:0) s (cid:62) U − s, ( s − ρ / s ) (cid:62) U − ( s − ρ / s ) (cid:1) = d (cid:0) I q , (1 − ρ / ) I q (cid:1) = q / − ρ / ) . π ( x ) = ˜ π ( x (cid:62) U − x ). For each w ∈ M ( p, q ), there exists z = z ( w ) > (cid:107) U − / x (cid:107) F = z > z then, | log π ( y ( w )) − log π ( x ) | ≤ C z,ε ( z − x (cid:62) U − x ) z . Observe that the trace of z − x (cid:62) U − x is 1. As in the proof of (i),log φ p,q ( y ( w ); 0 , U, V ) + log φ p,q ( x ; 0 , U, V ) = − − ρ tr [ V − x (cid:62) U − x ] + O ( (cid:107)U − / x (cid:107) F ) ≤ − − ρ ˜ λ − (cid:107)U − / x (cid:107) F + O ( (cid:107)U − / x (cid:107) F ) . The rest of the proof is the same as above.We state an immediate consequence of Proposition 4.6.
Corollary 4.7.
Consider the target ˜Π(d S ) = ˜ π ( S ) µ (d S ) on P + ( q ) . If ˜ π is continuous,regularly varying, then the pCN kernel with the corresponding upcasted target Π(d x ) on M ( p, q ) is not geometrically ergodic.Proof. By the regularly varying property, for some r ∈ R , r > q − S (cid:48) : d ( S,S (cid:48) ) <ε (cid:12)(cid:12) log ˜ π ( zS ) − log ˜ π ( zS (cid:48) ) (cid:12)(cid:12) = sup S (cid:48) : d ( S,S (cid:48) ) <ε (cid:12)(cid:12)(cid:12) log ˜ π ( zS ) / ˜ π ( zI q )˜ π ( zS (cid:48) ) / ˜ π ( zI q ) (cid:12)(cid:12)(cid:12) −→ z →∞ sup S (cid:48) : d ( S,S (cid:48) ) <ε (cid:12)(cid:12)(cid:12) log (det S ) − r/ (det S (cid:48) ) − r/ (cid:12)(cid:12)(cid:12) < ∞ . In particular, sup S (cid:48) : d ( S,S (cid:48) ) <ε z − | log ˜ π ( zS ) − log ˜ π ( zS (cid:48) ) | −→ z →∞ . Thus, the pCN kernel is not geometrically ergodic.
We check geometric ergodicity for pCN for the standard targets in Section 2.2. We onlystate negative results as we did not have positive results in this paper.1. Let π ( x ) be the density function of N p,q ( M, Σ , T). Let λ (cid:63) be the smallest amongstall eigenvalues of Σ and T. Then, C z,ε ( s ) = sup s (cid:48) : (cid:107) s − s (cid:48) (cid:107) F <ε z − | log π ( zs ) − log( zs (cid:48) ) |→ z →∞ sup s (cid:48) : (cid:107) s − s (cid:48) (cid:107) F <ε (cid:12)(cid:12) tr (cid:2) T − s (cid:62) Σ − s − T − s (cid:48)(cid:62) Σ − s (cid:48) (cid:3) / (cid:12)(cid:12) = sup s (cid:48) : (cid:107) s − s (cid:48) (cid:107) F <ε (cid:12)(cid:12) tr (cid:2) T − ( s − s (cid:48) ) (cid:62) Σ − ( s + s (cid:48) ) (cid:3) / (cid:12)(cid:12) = sup s (cid:48) : (cid:107) s − s (cid:48) (cid:107) F <ε (cid:12)(cid:12) (cid:10) ( s − s (cid:48) )T − , Σ − ( s + s (cid:48) ) (cid:11) F / (cid:12)(cid:12) ≤ λ − (cid:63) sup s (cid:48) : (cid:107) s − s (cid:48) (cid:107) F <ε (cid:8) (cid:107) s − s (cid:48) (cid:107) F (cid:107) s + s (cid:48) (cid:107) F / (cid:9) ≤ λ − (cid:63) ε ( (cid:107) s (cid:107) F + ε/ . λ − (cid:63) (1 − ρ / )(1 + (1 − ρ / ) / < − ρ λ − , which is simplified to, 3 − ρ / ρ / < λ (cid:63) λ , then pCN is not geometrically ergodic. Therefore, pCN is not always geometricallyergodic for the normal distribution.2. Similar calculations yield that, for a Wishart distribution W q ( r, T ), we have, C z,ε ( S ) −→ z →∞ sup S (cid:48) : d ( S,S (cid:48) ) <ε (cid:12)(cid:12) tr [ T − ( S − S (cid:48) ) ] (cid:12)(cid:12) / . Therefore, if T is large enough, then the condition in Proposition 4.6(ii) is satisfied,and pCN is not geometrically ergodic.3. pCN is not geometrically ergodic for the upcasted Inverse-Wishart distribution byCorollary 4.7, since the relevant probability density function is a regularly varyingfunction (see Remark 7). Consider a target law Π(d x ) = π ( x ) ν U (d x ). If π ( x ) > x ∈ M ( p, q ) thenMpCN is ergodic. We have made a lot of progress towards proving geometric ergodicity forMpCN, via use of Foster-Lyapunov drift criterion (Meyn and Tweedie, 1993) and Dirichletform. The last remaining steps for a fully completed proof remain subject of future research. Our investigation focuses mainly at the setting of Section 3.3, when: the initial target is˜Π(d S ) = ˜ π (d S ) µ (d S ), S ∈ P + ( q ); the upcasted distribution is Π(d x ) = ˜ π ( x (cid:62) U − x ) ν U (d x ), x ∈ M ( p, q ); the selection U = U gives rise to a Markov chain on P + ( q ) defined via what wehave called the MpCN kernel ˜ P on P + ( q ), with target ˜Π and proposal kernel ˜ Q = ˜ Q ( · ; ρ, p )– see also Remark 4(i). To prove geometric ergodicity, it suffices to show, that for driftfunction V : P + ( q ) → R + , lim sup tr( S ) →∞ ˜ P V ( S ) − V ( S ) V ( S ) <
0; (19)lim sup det( S ) → P V ( S ) − V ( S ) V ( S ) < . (20)We prove (19), for an appropriate drift function. We start with a definition (for matrices S, T ∈ P + ( q ), we write S < T if T − S ∈ P + ( q )).17 efinition 5.1. (i) For T ∈ P + ( p ) , h : M ( p, q ) → R is rapidly varying if, for full rank x, y ∈ M ( p, q ) , and z ∈ P + ( q ) , h ( yz ) h ( xz ) −→ tr( z ) →∞ (cid:26) ∞ , if x (cid:62) T − x > y (cid:62) T − y ;0 , if x (cid:62) T − x < y (cid:62) T − y. (ii) For target distribution ˜Π(d S ) = ˜ π ( S ) µ (d S ) we will call ˜ π rapidly varying when π ( x ) =˜ π ( x (cid:62) U − x ) is rapidly varying. The above is a natural extension of rapid variation for scalar-valued functions; see, e.g.,Section 2.4 of Resnick (2007) for details on the scalar case. Rapid variation is relevant forseveral light-tailed distributions. The Lebesgue density of Matrix-Normal distribution israpidly varying, as is the density ˜ π = ˜ π ( S ) of the Wishart distribution in (2). Proposition 5.2. If ˜ π ( S ) is strictly positive, continuous, rapidly varying function, then(19) holds for drift function V ( S ) = ˜ π ( S ) − α , with any α ∈ (0 , .Proof. The proof is given in Appendix A.It remains to establish (20), where det( S ) →
0, or equivalently, S → S , with S degenerate,positive semi-definite, symmetric matrix. We prove the drift inequality for the special case S = 0; this also provides a proof of (20) in the trivial scalar case q = 1. Proposition 5.3.
Assume that ˜ π is strictly positive, continuous. Suppose that for some ζ (cid:54) = 0 , for any (cid:15) ∈ P + ( q ) , ˜ π ( (cid:15) ◦ S )˜ π ( S ) −→ tr( S ) → det( (cid:15) ) ζ/ . For V ( S ) = ˜ π ( S ) − α , α > , the MpCN kernel ˜ P on P + ( q ) , with target ˜Π , satisfies, lim sup tr( S ) → ˜ P V ( S ) − V ( S ) V ( S ) < . Proof.
Recall that the proposal writes as (cid:15) ◦ S with (cid:15) ∼ L (d (cid:15) ). By the dominated conver-gence theorem,˜ P V ( S ) − V ( S ) V ( S ) = E (cid:104) (cid:110)(cid:16) ˜ π ( (cid:15) ◦ S )˜ π ( S ) (cid:17) − α − (cid:111) min (cid:110) , ˜ π ( (cid:15) ◦ S )˜ π ( S ) (cid:111) (cid:105) → E (cid:2) (cid:8) det( (cid:15) ) − αζ/ − (cid:9) min (cid:8) , det( (cid:15) ) ζ/ (cid:9) (cid:3) . Since (cid:15) and (cid:15) − have the same law, we obtain, E (cid:2) (cid:8) det( (cid:15) ) − αζ/ − (cid:9) min (cid:8) , det( (cid:15) ) ζ/ (cid:9) (cid:3) = E (cid:2) (cid:8) det( (cid:15) ) − αζ/ − (cid:9) , det (cid:15) ≥ (cid:3) + E (cid:2) (cid:8) det( (cid:15) ) − αζ/ − (cid:9) det( (cid:15) ) ζ/ , det (cid:15) < (cid:3) = E (cid:2) (cid:8) det( (cid:15) ) − αζ/ − (cid:9) , det (cid:15) ≥ (cid:3) + E (cid:2) (cid:8) det( (cid:15) ) αζ/ − (cid:9) det( (cid:15) ) − ζ/ , det (cid:15) > (cid:3) = E (cid:2) (cid:8) det( (cid:15) ) − αζ/ − (cid:9)(cid:8) − det( (cid:15) ) − (1 − α ) ζ/ (cid:9) , det (cid:15) ≥ (cid:3) , with the last expression being negative. 18o stress the effect of algorithmic parameter ρ , we write the MpCN kernel on M ( p, q )(resp. P + ( q )) as P ρ (resp. ˜ P ρ ) and the corresponding proposal as ˜ Q ρ . In all cases, ρ ∈ [0 , Proposition 5.4. (i) If MpCN kernel P on M ( p, q ) has a spectral gap, then so does P ρ for any ρ ∈ [0 , .(ii) If MpCN kernel ˜ P on P + ( q ) has a spectral gap, then so does ˜ P ρ for any ρ ∈ [0 , .Proof. The proof is given in Appendix B.
Remark 8.
See Appendix B for details on the concept of spectral gap. By Proposition 5.4,geometric ergodicity of the MpCN kernel with target Π on M ( p, q ) (or ˜Π on P + ( q ) ) andparameter ρ ∈ [0 , is implied by that of the MpCN kernel for ρ = 0 . The result isimportant, as working with ˜ P simplifies a lot the involved matrix calculations for derivingdrift conditions. From a practical point of view, the result allows for numerical evidenceover inequality (20) – as in Section 5.2 that follows – by taking advantage of the fact thatthe choice ρ = 0 provides a much more manageable expression for the distribution of thenoise (cid:15) in (11), involved in the MpCN proposal (Proposition 5.5 below exploits this fact). Proposition 5.5. If ρ = 0 , the eigenvalues λ ≤ λ ≤ · · · ≤ λ q of (cid:15) ∼ L (d (cid:15) ) have thefollowing joint Lebesgue density function, up to a normalizing constant, q (cid:89) i =1 λ ( p − q − / i (1 + λ i ) p (cid:89) i This follows from Theorem 5.3.1 of Farrell (1985) together with the analytical ex-pression of L (d (cid:15) ), since, for any integrable f , (cid:90) f ( (cid:15) ) L (d (cid:15) ) ∝ (cid:90) f ( (cid:15) ) det( (cid:15) ) p/ det( I q + (cid:15) ) p µ (d (cid:15) )= (cid:90) f ( (cid:15) ) det( (cid:15) ) ( p − q − / det( I q + (cid:15) ) p Leb q ( q +1) / (d (cid:15) )= (cid:90) q (cid:89) i =1 λ ( p − q − / i (1 + λ i ) p (cid:89) i 1. In this case,˜ P V ( S ) − V ( S ) V ( S ) = (cid:90) P + ( q ) { η ( (cid:15), S ) − α − } min { , η ( (cid:15), S ) }L (d (cid:15) ) , (21)19here, η ( (cid:15), S ) = ˜ π ( (cid:15) ◦ S )˜ π ( S ) = (det (cid:15) ) r/ exp (cid:0) − tr [ ( (cid:15) − I q ) S ] (cid:1) . (Note that the right-hand side of (21) can be defined even if S is degenerate.) We want tonumerically investigate (20). By continuity, it suffices to show that (21) is always negativeif S is degenerate. Furthermore, we can assume that S is diagonal. To see this, firstobserve that for any S ∈ P + ( q ), there is U ∈ O ( q ) such that U (cid:62) S U is diagonal. We alsohave η ( U (cid:15) U (cid:62) , S ) = η ( (cid:15), U (cid:62) S U ), and the law of U (cid:15) U (cid:62) is the same as that of (cid:15) when ρ = 0.Therefore, it is enough to show that (21) is always negative if S is a degenerate diagonalmatrix. The law of the eigenvalues λ , . . . , λ q is determined in Proposition 5.5, and (cid:15) isdecomposed as (cid:15) = U Λ U (cid:62) where U is uniformly distributed in O ( q ) and Λ is a diagonalmatrix with diagonal elements λ , . . . , λ q . Thus, we can now evaluate (21) via numericalintegration.We fix q = 2, so we can assume that S = diag( s, s > 0. As we see in Fig. 1, forsmall enough α ∈ (0 , W q ( r, I q ) for r = p = 2 from Fig. 1,and for r = 4, p = 5 from Fig. 2. Other choices of r and p yield similar figures. A Dirichletform argument can be used to show that geometric ergodicity for ˜Π = W q ( r, I q ) impliesgeometric ergodicity for ˜Π = W q ( r, T ), for general T ∈ P + ( q ). Recall also – see Remark 5– that geometric ergodicity of the MpCN kernel for a Wishart target implies geometricergodicity for an Inverse-Wishart target. −0.015−0.010−0.0050.000 0.0 2.5 5.0 7.5 10.0 s v a l alpha Figure 1: Numerical evaluation of the relative expected drift change ( ˜ P V ( S ) − V ( S )) / V ( S )for target ˜Π = W q ( r, I q ), at S = diag( s, r = p = 2 (and ρ = 0).20 s v a l alpha Figure 2: Numerical evaluation of the relative expected drift change ( ˜ P V ( S ) − V ( S )) / V ( S )for target ˜Π = W q ( r, I q ), at S = diag( s, r = 4, p = 5 (and ρ = 0). We show the following statistics in Tables 1 and 2 in the sequel. TIME: computing time;BIAS: distance between true mean and Monte Carlo average; ESS: effective sample sizeover MCMC log-˜ π trajectory. These results are run on 3.40 GHz Intel Core i7-6700.We consider a Wishart target W q ( r, T ), with T generated from W q ( q, I q ) and r = q + 10.The MCMC initial values are the mean of W q ( r, T ) added to a sample of W q ( q, I q ). Spacedimensions vary from q = 3 to 10, and the number of MCMC iterations are q × . Wehave run with 10 parallel with the TIME and BIAS values shown in tables being averagesover values obtained over these parallel chains. ESS is obtained after merging all parallelchains, i.e., over a total of q × iterations. We choose the step of the RWM proposalso that the acceptance probability is about 23 . p = q , U = V = I q ; for MpCN, ρ is chosen so thatthe acceptance probability is about 10% to 20%.The computing cost of MpCN is about 2 to 3 times higher than that of RWM andpCN, yet the bias is at least 180 times smaller. Fig. 3 suggests that the bias for MpCNis approximately of order 1 /M , with M the number of iterations. ESS is 20 to 70 timesbetter for MpCN vs RWM and 400 to 800 times better for MpCN vs pCN. Fig. 6, 7 showautocorrelations for the traceplots of the log-eigenvalues of the algorithms.We next consider an Inverse-Wishart target W − q ( r, T ) (Table 2). T is generated from W q ( q, I q ), whereas r = q + 10. As proven earlier, RWM and pCN are not geometrically21able 1: Results for Wishart target. MCMC TIME BIAS ESSRWM q = 3 0 . 132 1 . × − q = 5 1 . 32 7 . × − q = 10 13 . . × q = 3 0 . 132 2 . × q = 5 1 . 34 5 . × q = 10 13 . . × q = 3 0 . 354 6 . × − q = 5 2 . 81 1 . × − q = 10 36 . . × − Table 2: Results for Inverse-Wishart target. MCMC TIME BIAS ESSRWM q = 3 0 . 138 9 . × − q = 5 1 . 30 7 . × q = 10 13 . . × q = 3 0 . 152 6 . × − q = 5 1 . 55 7 . × q = 10 17 . . × q = 3 0 . 340 2 . × − q = 5 2 . 74 6 . q = 10 35 . . ergodic in this case, thus are sensitive to the choice of initial value. The latter is set as themean of W − q ( r, T ) added to a sample of √ q − W q ( q, I q ), and is close to the center of thetarget. MpCN is robust to the choice of initial value, a fact that supports the hypothesisedgeometric ergodicity. The Tables and the ACF plots show superior performance for MpCNagainst RWM or pCN – the results for pCN and RWM are fairly similar.22 iteration d i s t an c e method mpcnpcnrwm Distance of the mean and the running averages for dim = 10 Figure 3: Distance between then mean of the Wishart target and the MCMC average. lag v a l ue eigenvalue ACF plot of the log−eigenvalues for method = rwm, dim = 10 Figure 4: Autocorrelations of the trajectories of the log-eigenvalues for RWM with Wisharttarget, for dimension q = 10. 23 .000.250.500.751.00 0 2500 5000 7500 10000 lag v a l ue eigenvalue ACF plot of the log−eigenvalues for method = pcn, dim = 10 Figure 5: Autocorrelations of the trajectories of the log-eigenvalues for pCN with Wisharttarget, for dimension q = 10. lag v a l ue eigenvalue ACF plot of the log−eigenvalues for method = mpcn, dim = 10 Figure 6: Autocorrelations of the trajectories of the log-eigenvalues for MpCN with Wisharttarget, for dimension q = 10. Consider a finite-variation stochastic volatility (SV) model defined via an Ornstein-Uhlenbeck(OU) process on P + ( q ), see Barndorff-Nielsen and Stelzer (2007)),dΣ t = (ΩΣ t − + Σ t − Ω (cid:62) )d t + d L t , (22)24 .000.250.500.751.00 0 2500 5000 7500 10000 lag v a l ue eigenvalue ACF plot of the log−eigenvalues for method = rwm, dim = 10 Figure 7: Autocorrelations of the trajectories of the log-eigenvalues for RWM with Inverse-Wishart target, for dimension q = 10. lag v a l ue eigenvalue ACF plot of the log−eigenvalues for method = pcn, dim = 10 Figure 8: Autocorrelations of the trajectories of the log-eigenvalues for pCN with Inverse-Wishart target, for dimension q = 10.for Σ = σ ∈ P + ( q ), Ω ∈ M ( q, q ), and matrix subordinator process { L t } , such that,d L t = γ d t + (cid:90) P + ( q ) \{ } x µ (d t, d x ) , .000.250.500.751.00 0 2500 5000 7500 10000 lag v a l ue eigenvalue ACF plot of the log−eigenvalues for method = mpcn, dim = 10 Figure 9: Autocorrelations of the trajectories of the log-eigenvalues for MpCN with Inverse-Wishart target, for dimension q = 10.where γ ∈ P + ( q ) is a matrix-constant and µ (d s, d x ) the extended Poisson random measureon [0 , ∞ ) × P + ( q ) such that E [ µ (d s, d x ) ] = Leb(d s ) ⊗ ν (d x ), with ν (d x ) denoting theL´evy measure of { L t } - thus getting the L´evy triplet ( γ, , ν ). We further assume that E log + (cid:107) L t (cid:107) < ∞ and the spectrum of Ω is σ (Ω) ⊂ ( −∞ , 0) + i R , so that SDE (22) has astationary solution. The SDE in (22) gives rise to a process { Σ t } with values in P + ( q ), seeBarndorff-Nielsen and Stelzer (2007) for details. The solution of (22) in-between events ofthe Poisson measure, writes as,Σ t = exp (cid:8) Ω( t − s ) (cid:9) Σ s exp (cid:8) Ω (cid:62) ( t − s ) (cid:9) . The inferential objective is estimate Ω, γ and parameters of ν (d x ) based on observationsrelated to Σ t .This design gives rise to a flexible, intuitive model, building upon mean-reverting-typedynamics for the complete covariance matrix. Such direction differs from typical approachesin the literature that work with various covariance decompositions (e.g. Dellaportas et al.(2015)), thus lacking direct interpretation of the dynamical behaviour for the completecovariance matrix. To simplify the setting, we consider a case where: i) the unknownparameter is Ω, and Ω ∈ P + ( q ); ii) the L´evy process corresponds to a compound Poissonone, with iid jumps from a Wishart law, with all involved parameters assumed known.Data { y i } are obtained at times 0 < t < · · · < t n , n ≥ 1, with [ y i | Σ t i ] ∼ N (0 , Σ t i ),1 ≤ i ≤ n .We apply a pseudo-marginal algorithm (see e.g. Andrieu and Roberts (2009); Andrieuet al. (2010) replacing the intractable likelihood with unbiased estimators found by a par-ticle filter. We use p = q = 2, n = 50, t i = i , σ = diag { . , . } , and true Ω equalto [0 . , . 02; 0 . , . . 5, and forthe jumps a Wishart distribution W (4 , diag(0 . , . 10 20 30 40 50 . . . . . . time S i gna l : X [ , ] Figure 10: Simulated signal for component Σ . . . . . . . . MCMC iterations O m ega [ , ] Figure 11: MpCN traceplot for the posterior of Ω given all n = 50 observations; the redline shows the true value. MCMC iterations O m ega [ , ] _1da t u m Figure 12: MpCN traceplot for the posterior of Ω given a single observation at time t = 1;the red line shows the true value.et al. (2010) for details on the use of a particle filter as means of obtaining an unbiasedestimate of the likelihood of data to be used within the pseudo-marginal MCMC – a priorfor Ω that is an inverse Wishart distribution W − (5 + 2 − , diag(1 . , . ρ = 0 . U = diag(0 . , . 01) (such choices give prior meandiag(0 . , . 5) and prior marginal standard deviations of about 23 and 13 for the diagonaland off-diagonal elements of Ω resp.). Fig. 10-12 show plots of the MCMC output producedfrom a PC in about 45mins (with C code). MpCN is quite effective in this involved pseudo-marginal setting, even if most parts of the algorithm were not optimised (we made no useof parallelisation and applied the vanilla bootstrap filter; also, MpCN tuning parametersnot carefully optimised). Fig. 11, 12, suggest that n = 50 observation times are alreadyinformative about Ω. Note that the posterior inherits heavy tails from the prior, so MpCNis highly competitive for such a target distribution. Our work presents a first contribution towards a systematic analysis (including derivationand ergodicity properties) of MCMC algorithms on matrix spaces. A number of interestingdirections have now opened up to future research. We summarise some of them here.(i) As a starting point we focused on blind proposals (this matched the requirements ofthe SV model). It is natural to move to derivative-driven methods, e.g. MALA orHMC.(ii) There are some last steps remaining to obtain of complete proof of geometric ergodic-ity of the newly developed MpCN algorithm – this is left for future research, buildingupon the important progress made here.(iii) The SV model sets up a new research direction in financial statistics. It leads tofurther investigations on differential models on matrix spaces and Monte Carlo meth-ods that can be effective therein. This is a challenging task, requiring calibration ofhigh-dimensional matrix-parameters for latent dynamical models (much beyond the2 × P + ( q ) arising in thefield of graphical models. Here interest lies in exploring, e.g., the space of precisionmatrices, given zeros for a number of partial auto-correlations determined by graphs.For relevant references, see Section 3 of Lenkoski (2013) or Wang and Li (2012). In-dicatively, application of the upcasting approach in this context will be important,as it will open up directions for the development of effective MCMC algorithms thatrespect the space restrictions. Scalability with respect to dimensionality q is also ofhigh significance in this setting. A Proof of Proposition 5.2 Recall the concept of rapid variation in Definition 5.1. In the next lemma, we show a keyproperty of probability measures on P + ( q ) with rapidly varying densities.28 emma A.1. Suppose that the target ˜Π(d S ) on P + ( q ) has a strictly positive, continuous,rapidly varying density ˜ π ( S ) with respect to µ (d S ) . Then, w.p.1, lim tr( S ) → + ∞ (cid:12)(cid:12)(cid:12) log (cid:16) ˜ π ( (cid:15) ◦ S )˜ π ( S ) (cid:17)(cid:12)(cid:12)(cid:12) = + ∞ , where (cid:15) ∼ L (d (cid:15) ) – see Proposition 3.2 for the definition of L (d (cid:15) ) .Proof. By tightness of L (d (cid:15) ), for any δ ∈ (0 , r ∈ (0 , 1) such that, P [ r ≤ λ min ( (cid:15) ) ≤ λ max ( (cid:15) ) ≤ r − ] > − δ, where λ min ( S ) and λ max ( S ) are the smallest and largest eigenvalues of S ∈ P + ( q ). Thus,for any C > δ ∈ (0 , P (cid:104) (cid:12)(cid:12)(cid:12) log (cid:16) ˜ π ( (cid:15) ◦ S )˜ π ( S ) (cid:17)(cid:12)(cid:12)(cid:12) ≤ C (cid:105) ≤ δ + P [ (cid:15) ∈ A ( S ) ] , where, A ( S ) := (cid:110) (cid:15) ∈ P + ( q ) : (cid:12)(cid:12)(cid:12) log (cid:16) ˜ π ( (cid:15) ◦ S )˜ π ( S ) (cid:17)(cid:12)(cid:12)(cid:12) ≤ C, r ≤ λ min ( (cid:15) ) ≤ λ max ( (cid:15) ) ≤ r − (cid:111) . We write Leb q ( q +1) / for (cid:81) ≤ i ≤ j ≤ q d S ij . Recall that reference measure µ (d S ) has density(det( S )) − ( q +1) / with respect to Leb q ( q +1) / . Thus, L (d (cid:15) ) has bounded density (with respectto Leb q ( q +1) / ) on, { (cid:15) ∈ P + ( q ) : r ≤ λ min ( (cid:15) ) ≤ λ max ( (cid:15) ) ≤ r − } . So, there exists a constant c > P [ (cid:15) ∈ A ( S ) ] ≤ c Leb q ( q +1) / [ (cid:15) ∈ A ( S ) ] . Now, we consider the variable transformation (cid:15) = U (cid:62) Λ U where U ∈ O ( q ) and Λ is adiagonal matrix with positive diagonal elements r ≤ λ < λ < · · · < λ q ≤ r − . FromTheorem 5.3.1 of Farrell (1985), for some constant c > 0, we have,Leb q ( q +1) / [ (cid:15) ∈ A ( S ) ] = c (cid:90) U ∈O ( q ) d U (cid:90) { U (cid:62) Λ U ∈ A ( S ) } (cid:89) i Consider the standard Hilbert space L (Π) = { f : M ( p, q ) → R ; (cid:107) f (cid:107) L (Π) < ∞} where (cid:107) f (cid:107) L (Π) = (cid:112) (cid:104) f, f (cid:105) with inner product, (cid:104) f, g (cid:105) = (cid:104) f, g (cid:105) L (Π) = (cid:90) x ∈ M ( p,q ) f ( x ) g ( x )Π(d x ) . A Markov kernel P on M ( p, q ) is a linear operator on this space via the action,( P f )( x ) = (cid:90) y ∈ E P ( x, d y ) f ( y ) , f ∈ L (Π) . A linear operator P is self-adjoint if (cid:104) P f, g (cid:105) = (cid:104) f, P g (cid:105) , f, g ∈ L (Π); this is equivalentto Π-reversibility of P . Also, a linear operator is positive if (cid:104) f, P f (cid:105) ≥ f ∈ L (Π). BySection XI.8 of Yosida (1995), if P is self-adjoint, its spectrum lies on the real line; if it isalso a positive operator, its spectrum lies on R + . First we will show positivity of the pCNkernel. Using this, we will show positivity of the MpCN kernel. Lemma B.1. The pCN kernel P ( x, d y ) on M ( p, q ) is a positive, self-adjoint linear operatoron L (Π) .Proof. First, we show that the proposal kernel Q of pCN is positive, self-adjoint on L (Π (cid:48) ),where, Q ( x, · ) = N p,q ( ρ / x, (1 − ρ ) U, V ) , Π (cid:48) = N p,q (0 , U, V ) . The self-adjointness property is immediate upon observing, φ p,q ( y ; ρ / x, (1 − ρ ) U, V ) φ p,q ( x ; 0 , U, V ) = φ p,q ( x ; ρ / y, (1 − ρ ) U, V ) φ p,q ( y ; 0 , U, V ) . By the reproductive property of the Gaussian distribution (Remark 1(i, ii)), we have Q / ( x, · ) = N p,q ( ρ / x, (1 − ρ / ) U, V ) , where the linear operator Q / is defined by Q / ( Q / f ) = Qf , f ∈ L (Π (cid:48) ). Observe that Q / is the same as Q with ρ / in the place of ρ . By this fact, Q / is also self-adjoint in L (Π (cid:48) ). Then, (cid:104) f, Qf (cid:105) L (Π (cid:48) ) = (cid:104) f, Q / Q / f (cid:105) L (Π (cid:48) ) = (cid:107) Q / f (cid:107) L (Π (cid:48) ) ≥ . We now show that P itself is positive, self-adjoint on L (Π). Since P is a Π-reversibleMetropolis–Hastings kernel, we proceed to the proof of positiveness. We use the decom-position approach introduced in Lemma 3.1 of Rudolf and Ullrich (2013). Recall that theacceptance probability is α ( x, y ) = min { , π ( y ) π ( x ) } , where π ( x ) is the density of Π with respectto Π (cid:48) . We can write, π ( x ) α ( x, y ) = min { π ( x ) , π ( y ) } = (cid:90) ∞ K ( t ) ( x )1 K ( t ) ( y )d t K ( t ) = { x ∈ M ( p, q ) : π ( x ) ≥ t } . Using the above, we have, (cid:104) f, P f (cid:105) L (Π) = (cid:90) x,y ∈ M ( p,q ) f ( x ) f ( y )Π(d x ) P ( x, d y ) ≥ (cid:90) x,y ∈ M ( p,q ) f ( x ) f ( y ) α ( x, y )Π(d x ) Q ( x, d y )= (cid:90) x,y ∈ M ( p,q ) f ( x ) f ( y ) α ( x, y ) π ( x )Π (cid:48) (d x ) Q ( x, d y )= (cid:90) ∞ (cid:90) x,y ∈ M ( p,q ) [1 K ( t ) f ]( x ) [1 K ( t ) f ]( y )Π(d x ) Q ( x, d y )d t = (cid:90) ∞ (cid:104) K ( t ) f, Q [1 K ( t ) f ] (cid:105) L (Π (cid:48) ) d t ≥ . We used the fact that 1 K ( t ) f ∈ L (Π (cid:48) ) by Markov’s inequality. Thus, we have that P is apositive operator.Next, we prove positivity and self-adjointness of the MpCN kernel. Such properties areinherited by the pCN kernel. Let, Q V ( x, · ) = N p,q ( ρ / x, U, (1 − ρ ) V ) , Π (cid:48) V = N p,q (0 , U, V ) , (23)be the proposal kernel of pCN and its invariant distribution, respectively, for matrix pa-rameter V . Observe that ν U (d x ) W − q (d V ; p, x (cid:62) U − x ) = c ∗ p,q Π (cid:48) V (d x ) µ (d V )where c ∗ p,q = Γ q ( p/ / q π pq/ . As in the proof of Lemma 3.1 and equation (10), followingthe Bayesian paradigm construction of the MpCN kernel, we have, ν U (d x ) Q ( x, d y ) = (cid:90) V ∈ P + ( q ) ν U (d x ) W − q (d V ; p, x (cid:62) U − x ) Q V ( x, d y )= c ∗ p,q (cid:90) V ∈ P + ( q ) Π (cid:48) V (d x ) Q V ( x, d y ) µ (d V ) , where Q is the MpCN kernel. Proposition B.2. The MpCN kernel P ( x, d y ) is a positive, self-adjoint linear operator on L (Π) .Proof. Self-adjointness follows from reversibility. We will prove the positivity part. As inthe pCN kernel case, for the target probability distribution Π(d x ) = π ( x ) ν U (d x ) we have, (cid:104) f, P f (cid:105) L (Π) = (cid:90) f ( x ) f ( y )Π(d x ) P ( x, d y ) ≥ (cid:90) x,y ∈ M ( p,q ) f ( x ) f ( y ) α ( x, y )Π(d x ) Q ( x, d y )= (cid:90) ∞ (cid:90) x,y ∈ M ( p,q ) [1 K ( t ) f ]( x )[1 K ( t ) f ]( y ) ν U (d x ) Q ( x, d y )d t. 32y (23), together with positivity and self-adjointness of the pCN proposal kernel, we obtain, (cid:104) f, P f (cid:105) L (Π) ≥ c ∗ p,q (cid:90) ∞ (cid:90) V ∈ P + ( q ) (cid:90) x,y ∈ M ( p,q ) [1 K ( t ) f ]( x )[1 K ( t ) f ]( y )Π (cid:48) V (d x ) Q V ( x, d y ) µ (d V )d t = c ∗ p,q (cid:90) ∞ (cid:90) V ∈ P + ( q ) (cid:104) K ( t ) f, Q V [1 K ( t ) f ] (cid:105) L (Π (cid:48) V ) µ (d V )d t ≥ . Here, we used the fact that 1 K ( t ) f ∈ L (Π (cid:48) V ) by Markov’s inequality.Thus, all eigenvalues of the MpCN kernel are in R + . Let spec( P ) be the eigenvalues ofa self-adjoint Markov kernel P on L (Π) – excluding the constant function. A transitionkernel P is said to have a spectral gap if 1 − sup spec( P ) > − sup | spec( P ) | > P is a positive operator. It is known that, if P is positive andself-adjoint, it is geometrically ergodic if and only if it has a spectral gap (see, e.g., Robertsand Rosenthal (1997); Roberts and Tweedie (2001)). Note that, by Theorem XI.8.2 ofYosida (1995), the spectral gap 1 − sup spec( P ) writes as,inf f ∈ L (Π); Π( f )=0 E P ( f, f ) (cid:107) f (cid:107) = 1 − sup f ∈ L (Π); Π( f )=0 (cid:104) f, P f (cid:105)(cid:107) f (cid:107) , (24)where, E P ( f, f ) = (cid:90) x,y ∈ M ( p,q ) ( f ( x ) − f ( y )) Π(d x ) P ( x, d y ) . We study the spectral gap of the MpCN kernel. Recall from the main text, that to stressthe involvement of parameter ρ , we write the MpCN kernel as P ρ . We denote by Q ρ theproposal kernel for P ρ . Proof of Proposition 5.4 . Part (i):For q × q symmetric matrices A , B , we write A ≤ B if v (cid:62) Av ≤ v (cid:62) Bv for any vector v ; if A ≤ B then det( A ) ≤ det( B ). First we note that,(1 − ρ ) x (cid:62) U − x + ( y − ρ / x ) (cid:62) U − ( y − ρ / x ) ≤ x (cid:62) U − x + 2 y (cid:62) U − y, where the left-hand side is equal to R ( x, y ) defined in Lemma 3.1; the above inequalityfollows from the right-hand side being equal to,[(1 − ρ ) x (cid:62) U − x + ( y − ρ / x ) (cid:62) U − ( y − ρ / x )] +[(1 − ρ ) x (cid:62) U − x + ( y + ρ / x ) (cid:62) U − ( y + ρ / x )] . Let q ρ ( x, y ) be the probability density function of Q ρ ( x, · ) with respect to ν . Via theexplicit form of q ρ ( x, y ) (see Lemma 3.1), the above inequality implies, q ρ ( x, y ) ≥ − pq q ( x, y ) . E P ρ ( f, f ) = 12 (cid:90) x,y ∈ M ( p,q ) ( f ( x ) − f ( y )) Π(d x ) P ρ ( x, d y )= 12 (cid:90) x,y ∈ M ( p,q ) ( f ( x ) − f ( y )) Π(d x ) α ( x, y ) Q ρ ( x, d y ) ≥ − pq (cid:90) x,y ∈ M ( p,q ) ( f ( x ) − f ( y )) Π(d x ) α ( x, y ) Q ( x, d y ) ≥ − pq E P ( f, f ) . Thus, existence of a spectral gap of P implies that P ρ also has one.Part (ii):We define the set, L (cid:63) (Π) := { f : M ( p, q ) → R : f ( x ) = g ( x (cid:62) U − x ) for a g ∈ L ( ˜Π) } ⊆ L (Π) . Spaces L ∗ (Π), L ( ˜Π) are isomorphic via the mapping f ↔ g . For f ∈ L (cid:63) (Π) we have that P f ( x ) ∈ L (cid:63) (Π), thus operator P restricted on L (cid:63) (Π) is positive, self-adjoint, with P f ( x ) ≡ ˜ P g ( s ), where s = x (cid:62) U − x . It follows trivially, that we also have E P ρ ( f, f ) ≥ − pq E P ( f, f )– as obtained in Part (i) – for f ∈ L (cid:63) (Π), or equivalently E ˜ P ρ ( g, g ) ≥ − pq E ˜ P ( g, g ), for E ˜ P ρ ( g, g ) defined in an obvious way. The proof is now complete. References Andrieu, C., A. Doucet, and R. Holenstein (2010). Particle Markov chain Monte Carlomethods. J. R. Stat. Soc. Ser. B Stat. Methodol. 72 (3), 269–342.Andrieu, C. and G. O. Roberts (2009). The pseudo-marginal approach for efficient MonteCarlo computations. Ann. Statist. 37 (2), 697–725.Barnard, J., R. McCulloch, and X.-L. Meng (2000). Modeling covariance matrices interms of standard deviations and correlations, with application to shrinkage. Statist.Sinica 10 (4), 1281–1311.Barndorff-Nielsen, O. E. and N. Shephard (2001). Non-gaussian ornstein–uhlenbeck-basedmodels and some of their uses in financial economics. Journal of the Royal StatisticalSociety: Series B (Statistical Methodology) 63 (2), 167–241.Barndorff-Nielsen, O. E. and R. Stelzer (2007). Positive-definite matrix processes of finitevariation. Probability and Mathematical Statistics-Wroclaw University 27 (1), 3.Beskos, A., G. Roberts, A. Stuart, and J. Voss (2008). MCMC methods for diffusionbridges. Stoch. Dyn. 8 (3), 319–350.Bingham, N. H., C. M. Goldie, and J. L. Teugels (1989). Regular variation , Volume 27 of Encyclopedia of Mathematics and its Applications . Cambridge University Press, Cam-bridge. 34hikuse, Y. (2003). Statistics on special manifolds , Volume 174 of Lecture Notes in Statis-tics . Springer-Verlag, New York.Cotter, S. L., G. O. Roberts, A. M. Stuart, and D. White (2013). MCMC methods forfunctions: modifying old algorithms to make them faster. Statist. Sci. 28 (3), 424–446.Dellaportas, P., A. Plataniotis, and M. K. Titsias (2015). Scalable inference for a fullmultivariate stochastic volatility model. arXiv preprint arXiv:1510.05257 .Dickey, J. M. (1967). Matricvariate generalizations of the multivariate t distribution andthe inverted multivariate t distribution. The Annals of Mathematical Statistics 38 (2),511–518.Dobra, A., A. Lenkoski, and A. Rodriguez (2011). Bayesian inference for general gaussiangraphical models with application to multivariate lattice data. Journal of the AmericanStatistical Association 106 (496), 1418–1433.Farrell, R. H. (1985). Multivariate calculation . Springer Series in Statistics. Springer-Verlag,New York.Geisser, S. and J. Cornfield (1963). Posterior distributions for multivariate normal param-eters. J. Roy. Statist. Soc. Ser. B 25 , 368–376.Halmos, P. R. (1950). Measure Theory . D. Van Nostrand Company, Inc., New York, N. Y.Harville, D. A. (1997). Matrix algebra from a statistician’s perspective . Springer-Verlag,New York.Huang, A. and M. P. Wand (2013). Simple marginally noninformative prior distributionsfor covariance matrices. Bayesian Analysis 8 (2), 439–452.Jarner, S. and E. Hansen (2000). Geometric ergodicity of Metropolis algorithms. StochasticProcess. Appl. 85 (2), 341–361.Kamatani, K. (2017). Ergodicity of Markov chain Monte Carlo with reversible proposal. Journal of Applied Probability 54 (2), 638–654.Lang, S. (1999). Fundamentals of differential geometry , Volume 191 of Graduate Texts inMathematics . Springer-Verlag, New York.Lenkoski, A. (2013). A direct sampler for G-Wishart variates. Stat 2 (1), 119–128.Meyn, S. P. and R. L. Tweedie (1993). Markov Chains and Stochastic Stability. Springer.Neal, R. M. (1999). Regression and classification using Gaussian process priors. In Bayesianstatistics, 6 (Alcoceber, 1998) , pp. 475–501. Oxford Univ. Press, New York.O’Malley, A. J. and A. M. Zaslavsky (2008). Domain-level covariance analysis for mul-tilevel survey data with structured nonresponse. Journal of the American StatisticalAssociation 103 (484), 1405–1418. 35esnick, S. I. (2007). Heavy-tail phenomena: probabilistic and statistical modeling . Springerseries in operations research and financial engineering. New York: Springer.Roberts, G. O. and J. S. Rosenthal (1997). Geometric ergodicity and hybrid Markov chains. Electron. Comm. Probab. 2 , no. 2, 13–25 (electronic).Roberts, G. O. and R. L. Tweedie (1996). Geometric convergence and central limit theoremsfor multidimensional Hastings and Metropolis algorithms. Biometrika 83 (1), 95–110.Roberts, G. O. and R. L. Tweedie (2001). Geometric L and L convergence are equivalentfor reversible Markov chains. J. Appl. Probab. 38A , 37–41. Probability, statistics andseismology.Roverato, A. (2002). Hyper inverse Wishart distribution for non-decomposable graphsand its application to Bayesian inference for Gaussian graphical models. ScandinavianJournal of Statistics 29 (3), 391–411.Rudolf, D. and M. Ullrich (2013). Positivity of hit-and-run and related algorithms. Electron.Commun. Probab. 18 , no. 49, 8.Stuart, A. M. (2010). Inverse problems: a Bayesian perspective. Acta numerica 19 , 451–559.Wang, H. and S. Z. Li (2012). Efficient Gaussian graphical model determination underG-Wishart prior distributions. Electronic Journal of Statistics 6 , 168–198.Yosida, K. (1995).