Parsimonious Feature Extraction Methods: Extending Robust Probabilistic Projections with Generalized Skew-t
Dorota Toczydlowska
School of Mathematical and Physical Sciences, University of Technology Sydney
[email protected]
Gareth W. Peters
Department of Actuarial Mathematics and Statistics, Heriot-Watt University
[email protected]
Pavel V. Shevchenko
Department of Actuarial Studies and Business Analytics, Macquarie University
[email protected]
September 25, 2020

ABSTRACT
We propose a novel generalisation of the Student-t Probabilistic Principal Component methodology which: (1) accounts for an asymmetric distribution of the observation data; (2) provides a framework for grouped and generalised multiple-degree-of-freedom structures, offering a more flexible approach to modelling groups of marginal tail dependence in the observation data; and (3) separates the tail effect of the error terms and factors. The new feature extraction methods are derived in an incomplete data setting to efficiently handle the presence of missing values in the observation vector. We discuss various special cases of the algorithm that result from simplified assumptions on the process generating the data. The applicability of the new framework is illustrated on a data set that consists of cryptocurrencies with the highest market capitalisation.

Keywords: probabilistic PCA; EM algorithm; robust orthogonal projections; skew grouped t-copula; missing data; tail dependence; dependence modelling

The study focuses on extensions to the approach of Principal Component Analysis (PCA), as defined in [1], [2] or [3]. PCA and related matrix factorisation methodologies are widely used in data-rich environments for dimensionality reduction, data compression, feature extraction or data de-noising. These methodologies identify a lower-dimensional linear subspace to represent the data, which captures second-order dominant information contained in high-dimensional data sets. PCA can be viewed as a matrix factorisation problem which aims to learn a lower-dimensional representation of the data that preserves its Euclidean structure.
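As a point of reference for the probabilistic extensions developed below, the classical PCA factorisation can be sketched via the singular value decomposition. This is an illustrative sketch of standard PCA, not the paper's algorithm, and the function and variable names are ours.

```python
import numpy as np

def pca_lowrank(Y, k):
    """Classical PCA: best rank-k approximation of the centred data
    matrix Y (N x d) in the Euclidean (Frobenius) sense."""
    mu = Y.mean(axis=0)
    Yc = Y - mu
    # SVD of the centred data; the leading right singular vectors span
    # the principal subspace (columns of the loading matrix W).
    U, s, Vt = np.linalg.svd(Yc, full_matrices=False)
    W = Vt[:k].T                      # d x k loading matrix, orthonormal
    X = Yc @ W                        # N x k scores (projected data)
    Y_hat = mu + X @ W.T              # rank-k reconstruction
    return W, X, Y_hat

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 5))
W, X, Y_hat = pca_lowrank(Y, k=2)
# Unlike the probabilistic variants below, the loading matrix here
# satisfies the orthonormality condition W^T W = I_k by construction.
print(np.allclose(W.T @ W, np.eye(2)))
```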
However, in the presence of either a non-Gaussian data generating distribution or outliers which corrupt the data, the standard PCA methodology provides biased information about the lower-rank representation. In many applications, the stochastic noise or observation errors in the data set are assumed to be, in some sense, "well-behaved"; for instance, additive, light-tailed, symmetric and zero-mean. When non-robust feature extraction methods are naively utilised in the presence of violations of these implicit statistical assumptions, the information contained in the extracted features cannot be relied upon, resulting in misleading inference. Therefore, it is critical to ensure that the feature extraction captures information about the correct characteristics of the process generating the data. In the following study, we relax the inherent assumption of "well-behaved" observation noise by developing a class of robust estimators that can withstand violations of such assumptions, which routinely arise in real data sets. The investigated framework facilitates the incorporation of prior assumptions about the data distribution into the model to ensure robust analysis of large incomplete datasets.

Many improvements to the classical PCA methodology have been introduced in the literature to accommodate various characteristics that may deviate from the standard assumptions on the data when applying classical PCA methods.
For instance, robust variants of standard PCA modify the distance measure between each observation and its lower-rank approximation. Since the standard problem of PCA corresponds to finding the directions which maximise the covariance of the projected data, one group of improvements focuses on robustifying the calculation of the estimators of mean and covariance matrices, see [4], [5], [6], [7] or [8]. The use of a robust estimator is, in fact, equivalent to introducing observation-specific weights to the loss function of the standard PCA. The majority of these approaches are based on down-weighting, or even removing, observations with outlying distances; this categorises them as a non-probabilistic set of methodologies, as they do not directly incorporate any assumption about the noise distribution. The concept of observation-specific weights is also addressed in [9] with weights which are inversely proportional to the distance between the data points and which specify the distance measure on the sample. The study of [10] assumes the local representation of the data to be Gaussian.

The other class of approaches to robust PCA investigates different types of measures which assess the distance between a set of observations and its projection. Consequently, new frameworks provide procedures which are efficient under various assumptions on the noise. The framework of PCA in the presence of sparse noise of arbitrary amplitude has been studied by [11], [12], [13] and [14], [15], among many others, where the authors proposed various PCA algorithms which incorporate regularised covariance shrinkage of an L-type loss function. A set of alternative methodologies, called projection pursuit methods, address the problem of representing a data matrix by sparse and dense components, see the review in [16]. Starting with the studies of [17], [18] or [19], the alternative PCA methodology aims to combine both sparse and dense noise patterns.
These studies investigate a model for PCA which decomposes the data into a lower-rank matrix, a small and dense noise matrix, and a large and sparse noise matrix. Most of these methods are very sensitive to the initialisation step. For instance, the framework investigated by [19] requires a good estimate of the magnitude of the dense noise, which is usually difficult to obtain. The work of [20] proposes a formulation of the problem which focuses on the exact recovery of the eigenvector representation of the data matrix, rather than the recovery of the data matrix itself as is broadly approached in other studies.

The other class of methodologies, which extends PCA to its probabilistic interpretation, was introduced by [21] and [22] as Probabilistic Principal Component Analysis (PPCA). In its first formulation, the standard PPCA assumes the observation vector to be Gaussian, which allows for a straightforward interpretation of the representation obtained by PCA in terms of PPCA, see [3]. PPCA can be easily tailored to handle incomplete information in the sample data and allows the utilisation of probabilistic assumptions about both the type of missingness as well as the distribution of missing values. The Expectation-Maximisation (EM) algorithm, formalised by [23] and discussed in detail, with its extensions, in [24], is especially suited for inference of probabilistic models with unobserved or hidden variables.

One of the natural extensions of the standard formulation of PPCA is to introduce a heavy-tailed distribution to the process generating the data. [25], [26], [27], [28] and [29] address this problem and explore the use of the Student-t assumption on the noise distribution and its impact on the robustness of the methodology with respect to dense noise. [30] formulates the PPCA problem with a Laplace error term and Gaussian latent variables.
Other works introducing sparsity to the solution of PPCA are [31] and [32], both of which incorporate a sparse prior distribution on the model's parameters via a variational Expectation-Maximisation. [33] adds sparse domain constraints on the distribution of latent variables. [34] improve the robustness of PPCA to both sparse and dense outliers of significant magnitude by assuming that the error term and latent random vector follow a Cauchy distribution. Another flexible framework for PPCA is introduced in the works of [35], [36], [37], [38], which propose PPCA frameworks for mixture models that follow Gaussian or Student-t distributions, in order to model arbitrary probability density functions of the observation process or the distribution of the noise which corrupts the data.

The reviewed robust feature-extraction methods are primarily based on the assumption that observations are independent over time and that the marginal distributions of their components have the same profile of heavy tails. This reasoning might be criticised for having a limited ability to capture various tail-dependence patterns in multivariate data analysis. Therefore, we want to investigate an approach that is able to accommodate a broader range of dependence and marginal distribution assumptions in the data-generating mechanism. We comment that this new model for PPCA can easily be reduced to simpler representations, such as Gaussian PPCA and Student-t PPCA, if the data reveals such characteristics.

Therefore, our first contribution is to separate the tail effect of the error terms and of the factors that reflect the representation of the original data in the new basis. This allows for independent assumptions about the profiles of heavy tails of the error term and the original representation. Secondly, we show how to employ the Grouped t-copula in the PPCA framework. The Grouped t-copula allows for a grouped or individual degrees-of-freedom parameter per marginal of a random vector.
It has been explored in [39], [40] and [41] in the context of risk modelling. The new component allows the marginal elements of unobserved vectors to have individual or grouped profiles of heavy-tail dependency structures and, consequently, provides greater flexibility in capturing second-order characteristics of the data set.

Our next contribution is to combine the described concepts with the flexibility of modelling an asymmetric correlation and heavy-tail dependence in a multivariate setting. We focus on the skewed Student-t distribution from the Generalized Hyperbolic family of distributions, as defined and discussed in [42] or [43], and comprehensively compared with other families of skewed distributions in [44]. The type of Student-t copula that accommodates skewness and individual degrees of freedom is studied in [45].

In addition, we study the robustness of the developed class of PPCA as defined by [46]. The introduced structural components of the representation of the data generating process increase the flexibility of PPCA frameworks to take into account different features of the data. These features may impact the obtained projection as well as its rotation. Through numerical examination, we show that this flexibility allows capturing complex characteristics of the data generating process, when they appear. It also results in accurate estimation in the presence of dynamics that are consistent with the assumptions of standard PCA or PPCA approaches. Lastly, we develop an efficient Expectation-Maximisation (EM) algorithm of [23] that estimates the parameters of this new class of PPCA methods.
The framework handles the presence of missing data, and we comment on how the procedure can be adjusted to various assumptions about the patterns of missing data. We apply our framework to cryptocurrency data, and show how the new methodology can be accommodated to guide portfolio construction by measuring market concentration, the potential for diversification or hedging.

Let the $d$-dimensional random vector $Y_t$ represent a process that generates the observation data, with realisation $y_t$ at time $t$. We observe $N$ realisations of $Y_t$, $y^N = \{y_1, \ldots, y_N\}$. The standard PPCA, which assumes a Gaussian distribution of $Y_t$, was introduced by [21]. The method seeks a $k$-dimensional uncorrelated latent vector $X_t$ which provides the most meaningful model of $Y_t$,
$$Y_t = \mu + X_t W_{d \times k}^{T} + \epsilon_t, \qquad (1)$$
for a vector of constants $\mu \in \mathbb{R}^d$ and a $d$-dimensional error term random vector $\epsilon_t$. As remarked in [21], in contrast to the standard PCA, the probabilistic version does not require the orthogonality condition on $W$, that is $W^T W = I_k$, where $I_k$ denotes a $k$ by $k$ identity matrix. This condition was essential in non-probabilistic PCA in order to impose a restriction for identification of a unique solution. In the optimisation problem of classical PCA, the condition limits the space of possible solutions that minimise the distance between the original data and its projection onto the new orthonormal space.
On the other hand, the objective function of probabilistic PPCA can be represented by the likelihood of the considered model; when this is combined in PPCA with a distribution on the factors, the marginal likelihood, having integrated out the random factors, removes the need for such a constraint.

In the classical Gaussian PPCA formulation it is assumed that the latent random vectors being sought in the feature extraction, which characterise the data, are distributed according to multivariate normal distributions, that is $X_t \sim \mathcal{N}(0, I_k)$ and $\epsilon_t \sim \mathcal{N}(0, \sigma^2 I_d)$, where $\sigma^2 \geq 0$. They are also assumed to be mutually independent and independent over time. Given the model in (1), $Y_t$ is also a $d$-dimensional random vector which follows a Gaussian distribution and is independent over time, with mean $\mu$ and covariance matrix $C = WW^T + \sigma^2 I_d$.

In Gaussian PPCA, the objective is to estimate the projection matrix $W$, the vector $\mu$ and the scalar $\sigma^2$ given the marginal distribution of $Y_t$,
$$Y_t \,|\, \Psi \sim \mathcal{N}\!\left(\mu,\; WW^T + \sigma^2 I_d\right),$$
for the static parameters $\Psi = [W, \mu, \sigma^2]$ of the model in (1). Given $N$ realisations of $Y_t \,|\, \Psi$, the marginal likelihood $L(\Psi; y^N) := \pi_{Y^N | \Psi}(y^N)$ of the model under the Gaussian case can be factorised as
$$L(\Psi; y^N) = (2\pi)^{-\frac{Nd}{2}} \left| WW^T + \sigma^2 I_d \right|^{-\frac{N}{2}} \exp\left\{ -\frac{1}{2} \sum_{t=1}^{N} (y_t - \mu)\left(WW^T + \sigma^2 I_d\right)^{-1}(y_t - \mu)^T \right\}, \qquad (2)$$
where the marginalisation is undertaken with respect to the $X_t$ random vectors.

In order to calculate the covariance matrix we have to estimate the parameters $\Psi$ and marginalise $X_t$, which is achieved by the iterative procedure of the EM algorithm of [23]. The steps and derivation of the algorithm have been described in [47] or [21], where no missingness is assumed. The EM algorithm finds Maximum Likelihood Estimation (MLE) estimates of parameters in probabilistic models when the direct optimisation of a likelihood function is not feasible.
The MLE estimates of $\Psi \in \Omega$ are computed by maximising the marginal likelihood function $L(\Psi; y^N)$, which in the Gaussian PPCA model is given in (2). The space $\Omega$ represents the parameter space of $\Psi$.

In order to iteratively find a stationary point of the function in (2), the EM algorithm exploits an artificial formulation of probabilistic models which regards $Y_t$ as incomplete information about the studied model, with the latent vector $X_t$ assumed to be the missing part of the complete random vector $(Y_t, X_t)$. The joint model is known under certain assumptions about the distribution of the error and of $X_t$, and so the joint likelihood of $Y_t$ and $X_t$ is known, given by its probability density function $\pi_{Y_t, X_t | \Psi}(y_t, x_t)$. The random vector $Y_t$ acts as the observable element of this vector.

Each iteration of the EM algorithm seeks maximisers of $L(\Psi; y^N)$ with respect to $\Psi$, and consists of two steps: an expectation step (E-step) and a maximisation step (M-step). The E-step infers the missing values or latent variables, $X^N$, by finding their distribution given the observed values $Y^N$ and the current estimates of the parameters. It then integrates the joint log-likelihood, or complete data likelihood, with respect to the distribution of these latent random vectors. At the $i$-th iteration, the E-step specifies an estimate of the complete information formulated as a function of parameters, that is
$$Q(\Psi^*, \Psi) := \mathbb{E}_{X^N | Y^N, \Psi^*}\left[ \log \pi_{Y^N, X^N | \Psi}(y^N, x^N) \right] \quad \text{for } \Psi^* = \Psi^{(i)}. \qquad (3)$$
Next, the M-step maximises the marginalised complete data likelihood obtained from the E-step in (3), which is now just a function of the observed $Y^N$ and the parameters $\Psi$,
$$\Psi^{(i+1)} = \operatorname*{argmax}_{\Psi \in \Omega} Q(\Psi^{(i)}, \Psi).$$
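For the Gaussian case, both steps admit closed forms and the resulting updates are the well-known Tipping and Bishop recursions. The following is a sketch of that standard Gaussian PPCA EM, not the paper's generalised algorithm; the synthetic data and parameter values are our own.

```python
import numpy as np

def ppca_em(Y, k, n_iter=300, sigma2=1.0, seed=0):
    """EM for Gaussian PPCA (standard Tipping & Bishop updates).
    Y is N x d; returns the loading matrix W (d x k), mu and sigma^2."""
    N, d = Y.shape
    rng = np.random.default_rng(seed)
    mu = Y.mean(axis=0)                 # MLE of mu is the sample mean
    Yc = Y - mu
    W = rng.normal(size=(d, k))
    for _ in range(n_iter):
        # E-step: posterior moments of the latent scores X_t | y_t
        M = W.T @ W + sigma2 * np.eye(k)            # k x k
        Minv = np.linalg.inv(M)
        EX = Yc @ W @ Minv                          # N x k, rows E[x_t]
        sumExx = N * sigma2 * Minv + EX.T @ EX      # sum_t E[x_t x_t^T]
        # M-step: closed-form updates of W and sigma^2
        W_new = (Yc.T @ EX) @ np.linalg.inv(sumExx)
        sigma2 = (np.sum(Yc**2)
                  - 2.0 * np.sum(EX * (Yc @ W_new))
                  + np.trace(sumExx @ W_new.T @ W_new)) / (N * d)
        W = W_new
    return W, mu, sigma2

# Synthetic check: data generated from the PPCA model itself.
rng = np.random.default_rng(1)
W_true = rng.normal(size=(5, 2))
X = rng.normal(size=(500, 2))
Y = X @ W_true.T + 0.1 * rng.normal(size=(500, 5))
W_hat, mu_hat, s2_hat = ppca_em(Y, k=2)
print(round(s2_hat, 3))   # should be close to the true noise variance 0.01
```

Note that the recovered $W$ is only identified up to rotation, so one compares the spanned subspace rather than the matrix itself.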
The key idea behind the steps of the algorithm is to use the following representation of the logarithm of the likelihood function, which exploits Bayes' rule applied to $\pi_{Y_t, X_t | \Psi}(y_t, x_t)$, that is
$$l(\Psi, y^N) := \log L(\Psi; y^N) = \log \pi_{Y^N, X^N | \Psi}(y^N, x^N) - \log \pi_{X^N | Y^N, \Psi}(x^N). \qquad (4)$$
By noting that the term $l(\Psi, y^N)$ is invariant under the expectation with respect to the conditional distribution $X^N | Y^N, \Psi^*$, we have that
$$\log \pi_{Y^N | \Psi}(y^N) = \mathbb{E}_{X^N | Y^N, \Psi^*}\left[ \log \pi_{Y^N, X^N | \Psi}(y^N, x^N) \right] - \mathbb{E}_{X^N | Y^N, \Psi^*}\left[ \log \pi_{X^N | Y^N, \Psi}(x^N) \right],$$
for some $\Psi^* \in \Omega$. In their study, [23] show that the maximisers of $\log \pi_{Y^N | \Psi}(y^N)$ can be found by iteratively optimising the first component of the representation in (4), the expectation $Q(\Psi^*, \Psi)$ defined in (3), using the steps of the EM algorithm. Indeed, the difference between the second expectation evaluated at $\Psi^*$ and at any other $\Psi$ is a Kullback-Leibler divergence and hence non-negative, so any update that increases $Q(\Psi^*, \cdot)$ cannot decrease $l(\cdot, y^N)$. [23] show that the sequence of log-likelihood function evaluations obtained from the iterative EM updates of the parameters, denoted by $\{l^{(i)}\}_{i \in \mathbb{N}}$ for $l^{(i)} = l(\Psi^{(i)}, y^N)$, is non-decreasing; consequently, each iteration of the EM algorithm results in an update of the parameter $\Psi$ which increases the log-likelihood in (2) or leaves it unchanged. Therefore, the EM algorithm monotonically increases the likelihood function at each iteration. The studies of [23], [48] and [49] investigate additional assumptions, such as the monotonicity of the sequence $\{l^{(i)}\}_{i \in \mathbb{N}}$ or the smoothness of the objective function, which need to be satisfied in order to ensure that the sequence $\{l^{(i)}\}_{i \in \mathbb{N}}$ converges to a stationary point or, more specifically, a local or global maximum.

Following the concept of combining Skew-t and Grouped t-copula distributions discussed in [45], we introduce a PPCA model which allows one to develop tail dependence structures more flexible than the ones under Gaussian PPCA.
Our proposed PPCA model allows a greater degree of flexibility, especially when asymmetry is present in the tail dependence between pairs or subsets of the multivariate random observation vectors. We achieve this by developing novel extensions based on grouped and generalised Student-t PPCA models.

Consider the stochastic representation of Student-t random variables, which can be expressed as a scale mixture of a Gaussian random vector and a Gamma variable, as formulated in [50] or [51]. Our extension to PPCA represents the scale mixtures of $X_t$ and $\epsilon_t$ by independent Gamma random variables. Consequently, the vectors themselves are mutually independent and have individual dependency structures. This assumption provides the model with the additional flexibility to determine which component of the model impacts the marginal tail behaviour of the observation vector. We introduce the coefficient of skewness, which specifies the strength of asymmetry in the distribution of the unobserved random vectors, using the definition of the hyperbolic Skew-t distribution as introduced in [42] or [43]. We choose this definition of the Skew-t distribution for two reasons: the simplicity of the conditional distributions given by the stochastic representation; and the appealing property of the hyperbolic Skew-t distribution, remarked in [44], that the two tails of the corresponding distributions have different behaviours, polynomial and exponential.
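In one dimension the scale-mixture construction behind the hyperbolic Skew-t is easy to illustrate: with a Gamma mixing variable $V$, the variable $\delta/V + \sqrt{1/V}\,Z$ is skewed for $\delta \neq 0$ and reduces to a symmetric Student-t for $\delta = 0$. The parameter values below are ours, chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
nu, delta, n = 12.0, 1.5, 1_000_000

# V ~ Gamma(nu/2, rate nu/2); numpy parameterises by shape and scale.
V = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)   # mean 1
Z = rng.normal(size=n)

X = delta / V + np.sqrt(1.0 / V) * Z     # skewed for delta != 0
X0 = np.sqrt(1.0 / V) * Z                # delta = 0: symmetric Student-t

def sample_skewness(x):
    c = x - x.mean()
    return np.mean(c**3) / np.std(x)**3

# The skewed variant has clearly positive sample skewness, while the
# symmetric delta = 0 case has sample skewness near zero.
print(sample_skewness(X) > 0.5, abs(sample_skewness(X0)) < 0.2)
```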
Therefore, the tails of the distribution can have different magnitudes of heaviness. Lastly, we want to highlight that, under appropriate assumptions on the deterministic parameters of the introduced models, the generalized PPCA framework reduces to various special cases, such as the PPCA model under a Grouped t-copula distribution (when the skewness is equal to zero), or the PPCA model under a Skew-t distribution (when the degrees of freedom are equal across marginals).

We consider two cases of the distributions: the first in which the random vectors $X_t$ and $\epsilon_t$ are independently and non-identically distributed, and the second in which they are identically and conditionally independently distributed over time; the assumptions and derivations of the latter model are given in [52].

Let us denote two mutually independent uniform random variables, identically distributed over time, $S_{\epsilon,t}, S_{x,t} \sim U(0,1)$. For convenience of notation, we define the $d$- and $k$-dimensional random vectors $U_t$ and $V_t$, respectively,
$$U_t = \left[ \frac{\chi^{-1}_{\nu^1_\epsilon}(S_{\epsilon,t})}{\nu^1_\epsilon}, \ldots, \frac{\chi^{-1}_{\nu^d_\epsilon}(S_{\epsilon,t})}{\nu^d_\epsilon} \right]_{1 \times d} \quad \text{and} \quad V_t = \left[ \frac{\chi^{-1}_{\nu^1_x}(S_{x,t})}{\nu^1_x}, \ldots, \frac{\chi^{-1}_{\nu^k_x}(S_{x,t})}{\nu^k_x} \right]_{1 \times k}, \qquad (5)$$
for vectors of non-negative real numbers $\nu_\epsilon = \{\nu^1_\epsilon, \ldots, \nu^d_\epsilon\}$ and $\nu_x = \{\nu^1_x, \ldots, \nu^k_x\}$, and $\chi^{-1}_\nu$ denoting the quantile function of the Chi-square distribution with $\nu$ degrees of freedom. Note that the vectors $U_t$ and $V_t$ each follow a multivariate Gamma distribution, and are mutually independent and independent in time. However, the components of each vector are dependent; in fact, they are co-monotonic, since they are constructed as transformations of a common uniform variable at time $t$.

Let us denote the $d$-dimensional and $k$-dimensional real-valued model parameter vectors by $\delta_\epsilon$ and $\delta_x$.
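The construction in (5) can be simulated directly: a single uniform draw per time point is pushed through several chi-square quantile functions, so the components are co-monotonic while each marginal is $\Gamma(\nu/2, \nu/2)$. A sketch using SciPy, with function and variable names ours:

```python
import numpy as np
from scipy.stats import chi2

def sample_scale_vector(nu, n, rng):
    """Sample n realisations of the co-monotonic scale vector in (5):
    component i is chi2.ppf(S_t, nu[i]) / nu[i] for a common uniform S_t."""
    nu = np.asarray(nu, dtype=float)
    S = rng.uniform(size=(n, 1))       # one common uniform per time point
    return chi2.ppf(S, df=nu) / nu     # n x len(nu); Gamma(nu/2, nu/2) marginals

rng = np.random.default_rng(0)
nu_eps = [5.0, 10.0, 30.0]
U = sample_scale_vector(nu_eps, n=20_000, rng=rng)
# Each marginal has mean 1 (a chi-square with df nu has mean nu), and the
# components are perfectly rank-correlated since they share one uniform.
print(np.round(U.mean(axis=0), 2))
```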
The stochastic representation of the $d$-dimensional error term $\epsilon_t$ and the $k$-dimensional latent variable $X_t$ is given by
$$X_t = V_t^{-1} \circ \delta_x + \sqrt{V_t^{-1}} \circ Z_{x,t} \quad \text{and} \quad \epsilon_t = U_t^{-1} \circ \delta_\epsilon + \sqrt{\sigma^2 U_t^{-1}} \circ Z_{\epsilon,t}, \qquad (6)$$
for $Z_{x,t}$ and $Z_{\epsilon,t}$ being mutually independent standard normal multivariate variables, $k$- and $d$-dimensional respectively, and being independent of $U_t$ and $V_t$ (or, equivalently, of $S_{x,t}$ and $S_{\epsilon,t}$). The operator $\circ$ denotes the Hadamard product; that is, for two $d$-dimensional vectors $a$ and $b$, the product of the vectors results in the $d$-dimensional vector $a \circ b = \left(a^1 b^1, \ldots, a^d b^d\right)$. Consequently, we have the following joint probability density function of the Generalized Skew-t PPCA (GSt PPCA) model, given $N$ realisations of the random vectors at times $t = 1, \ldots, N$, that is
$$\pi_{Y^N, X^N, U^N, V^N | \Psi}(y^N, x^N, u^N, v^N) = \prod_{t=1}^{N} \left\{ \pi_{Y_t | X_t, U_t, V_t, \Psi}(y_t) \cdot \pi_{X_t | U_t, V_t, \Psi}(x_t) \cdot \pi_{U_t | \Psi}(u_t) \cdot \pi_{V_t | \Psi}(v_t) \right\}, \qquad (7)$$
for $\Psi = [W, \mu, \sigma^2, \delta_\epsilon, \delta_x, \nu_x, \nu_\epsilon]$ being a vector which consists of all static parameters of the model specified in (1) under the GSt PPCA assumptions. Recall that the distributions of $Y_t$, $X_t$ and $\epsilon_t$ are conditionally multivariate Gaussian, such that
$$Y_t | X_t, U_t, V_t, \Psi \sim \mathcal{N}\!\left( \mu + \delta_\epsilon D_{\epsilon,t}^{-1} + X_t W^T, \; \sigma^2 D_{\epsilon,t}^{-1} \right), \quad X_t | V_t, \Psi \sim \mathcal{N}\!\left( \delta_x D_{x,t}^{-1}, \; D_{x,t}^{-1} \right), \quad \epsilon_t | U_t, \Psi \sim \mathcal{N}\!\left( \delta_\epsilon D_{\epsilon,t}^{-1}, \; \sigma^2 D_{\epsilon,t}^{-1} \right),$$
where
$$D_{\epsilon,t} = \begin{bmatrix} U_t^1 & & 0 \\ & \ddots & \\ 0 & & U_t^d \end{bmatrix}_{d \times d}, \qquad D_{x,t} = \begin{bmatrix} V_t^1 & & 0 \\ & \ddots & \\ 0 & & V_t^k \end{bmatrix}_{k \times k}.$$
Under appropriate assumptions on the deterministic parameters of the model, the GSt PPCA reduces to simpler special cases. We show how these special cases are defined using the independent PPCA model assumptions from Subsection 3.1, as the identical and conditionally distributed case proceeds analogously. We can straightforwardly obtain the following three models.
Special Case 1: $\delta_\epsilon = \delta_x = 0$.
Given skewness coefficients equal to zero, the generalized PPCA model reduces to the Grouped-t GSt PPCA case, in which the representation of the latent processes,
$$X_t = \sqrt{V_t^{-1}} \circ Z_{x,t} \quad \text{and} \quad \epsilon_t = \sqrt{\sigma^2 U_t^{-1}} \circ Z_{\epsilon,t},$$
results in the conditional distributions given by
$$Y_t | X_t, U_t, V_t, \Psi \sim \mathcal{N}\!\left( \mu + X_t W^T, \sigma^2 D_{\epsilon,t}^{-1} \right), \quad X_t | V_t, \Psi \sim \mathcal{N}\!\left( 0, D_{x,t}^{-1} \right), \quad \epsilon_t | U_t, \Psi \sim \mathcal{N}\!\left( 0, \sigma^2 D_{\epsilon,t}^{-1} \right).$$

Special Case 2: $\nu^1_\epsilon = \ldots = \nu^d_\epsilon = \nu_\epsilon$ and $\nu^1_x = \ldots = \nu^k_x = \nu_x$.
When we assume that the marginal distributions of $X_t$ and $\epsilon_t$ are characterised by the same heavy tails, the scaling variables $U_t$ and $V_t$ become one-dimensional, as their components are identical, $U_t = \chi^{-1}_{\nu_\epsilon}(S_{\epsilon,t})/\nu_\epsilon$ and $V_t = \chi^{-1}_{\nu_x}(S_{x,t})/\nu_x$. Therefore, the generalized PPCA model reduces to the Skew-t GSt PPCA case, in which the representation of the latent processes,
$$X_t = V_t^{-1} \delta_x + \sqrt{V_t^{-1}}\, Z_{x,t} \quad \text{and} \quad \epsilon_t = U_t^{-1} \delta_\epsilon + \sqrt{\sigma^2 U_t^{-1}}\, Z_{\epsilon,t},$$
results in the conditional distributions given by
$$Y_t | X_t, U_t, V_t, \Psi \sim \mathcal{N}\!\left( \mu + U_t^{-1} \delta_\epsilon + X_t W^T, \sigma^2 U_t^{-1} I_d \right), \quad X_t | V_t, \Psi \sim \mathcal{N}\!\left( V_t^{-1} \delta_x, V_t^{-1} I_k \right), \quad \epsilon_t | U_t, \Psi \sim \mathcal{N}\!\left( U_t^{-1} \delta_\epsilon, \sigma^2 U_t^{-1} I_d \right).$$

Special Case 3: $\delta_\epsilon = \delta_x = 0$, $\nu^1_\epsilon = \ldots = \nu^d_\epsilon = \nu_\epsilon$ and $\nu^1_x = \ldots = \nu^k_x = \nu_x$.
When we assume that the marginal distributions of $X_t$ and $\epsilon_t$ are characterised by the same heavy tails and the skewness coefficients are equal to zero, the generalized PPCA model reduces to the Student-t GSt PPCA case, which separates the tail effect of the error term $\epsilon_t$ and the latent process $X_t$ on the observation vector $Y_t$.
Given the independent case from Subsection 3.1, the representation of the latent processes,
$$X_t = \sqrt{V_t^{-1}}\, Z_{x,t} \quad \text{and} \quad \epsilon_t = \sqrt{\sigma^2 U_t^{-1}}\, Z_{\epsilon,t},$$
results in the conditional distributions given by
$$Y_t | X_t, U_t, V_t, \Psi \sim \mathcal{N}\!\left( \mu + X_t W^T, \sigma^2 U_t^{-1} I_d \right), \quad X_t | V_t, \Psi \sim \mathcal{N}\!\left( 0, V_t^{-1} I_k \right), \quad \epsilon_t | U_t, \Psi \sim \mathcal{N}\!\left( 0, \sigma^2 U_t^{-1} I_d \right).$$
Recall that the above formulation is an additional variant of the Student-t PPCA model derived in [25], [26] or [53]. The difference lies in the stochastic representation of the random vectors $\epsilon_t$ and $X_t$. In contrast to the existing models, our formulation assumes the independence of the scaling random variables $U_t$ and $V_t$, and therefore allows one to model the tails of $\epsilon_t$ and $X_t$ separately.

We want to define the eigenvectors and eigenvalues of the covariance matrix of $Y_t$ given its formulation according to the model in (1), under the model assumptions discussed in Section 3.1, that is
$$\mathrm{Cov}_{Y_t | \Psi}\!\left[ Y_t \right] := \mathbb{E}_{Y_t | \Psi}\!\left[ Y_t^T Y_t \right] - \mathbb{E}_{Y_t | \Psi}\!\left[ Y_t \right]^T \mathbb{E}_{Y_t | \Psi}\!\left[ Y_t \right] = W \,\mathrm{Cov}_{X_t | \Psi}\!\left[ X_t \right] W^T + \mathrm{Cov}_{\epsilon_t | \Psi}\!\left[ \epsilon_t \right].$$
We discuss the parametrisation of the covariance matrix of the observation process in terms of the static parameters of the model defined in Section 3.1. We highlight how the assumptions of the special cases of the GSt PPCA model listed in Subsection 3.1.1 impact the eigendecomposition of
$\mathrm{Cov}_{Y_t | \Psi}\!\left[ Y_t \right]$, as well as the characteristics of the representation determined by the latent vector $X_t$. We show that:
(1) when skewness is not present in the model (Cases 1 & 3), the eigenvectors of $\mathrm{Cov}_{Y_t | \Psi}\!\left[ Y_t \right]$ are defined by the singular vectors of $W$. In the presence of skewness, the eigenvectors are also defined by the parameters of skewness, $\delta_x$ and $\delta_\epsilon$;
(2) when skewness is present in the model, the expectations and covariance matrices of the random vectors $X_t$ and $\epsilon_t$ are defined by the corresponding parameters of skewness and might not be zero-mean or diagonal, respectively. Therefore, the new representation determined by $X_t$ might not be orthogonal;
(3) the parameters of degrees of freedom influence the eigenvalues of $\mathrm{Cov}_{Y_t | \Psi}\!\left[ Y_t \right]$.

Solution for GSt PPCA Without Grouped Heavy Tails of Marginals (Cases 2 & 3)

The latent variable and the error terms have the following stochastic representation
$$X_t = V_t^{-1} \delta_x + \sqrt{V_t^{-1}}\, Z_{x,t} \quad \text{and} \quad \epsilon_t = U_t^{-1} \delta_\epsilon + \sqrt{\sigma^2 U_t^{-1}}\, Z_{\epsilon,t}.$$
Firstly, let us recall that since $V_t \sim \Gamma\!\left( \frac{\nu_x}{2}, \frac{\nu_x}{2} \right)$ and $U_t \sim \Gamma\!\left( \frac{\nu_\epsilon}{2}, \frac{\nu_\epsilon}{2} \right)$, the moments of their inverses are
$$\mathbb{E}_{V_t | \Psi}\!\left[ V_t^{-1} \right] = \frac{\nu_x}{\nu_x - 2}, \qquad \mathrm{Cov}_{V_t | \Psi}\!\left[ V_t^{-1} \right] = \frac{2 \nu_x^2}{(\nu_x - 2)^2 (\nu_x - 4)},$$
$$\mathbb{E}_{U_t | \Psi}\!\left[ U_t^{-1} \right] = \frac{\nu_\epsilon}{\nu_\epsilon - 2}, \qquad \mathrm{Cov}_{U_t | \Psi}\!\left[ U_t^{-1} \right] = \frac{2 \nu_\epsilon^2}{(\nu_\epsilon - 2)^2 (\nu_\epsilon - 4)}.$$
The moments of the marginal distributions of $X_t$ and $\epsilon_t$, which are needed to specify the covariance matrix of $Y_t$, are then
$$\mathbb{E}_{X_t | \Psi}\!\left[ X_t \right] = \mathbb{E}_{V_t | \Psi}\!\left[ \mathbb{E}_{X_t | V_t, \Psi}\!\left[ X_t \right] \right] = \frac{\nu_x}{\nu_x - 2}\, \delta_x,$$
$$\mathrm{Cov}_{X_t | \Psi}\!\left[ X_t \right] = \mathbb{E}_{V_t | \Psi}\!\left[ \mathrm{Cov}_{X_t | V_t, \Psi}\!\left[ X_t \right] \right] + \mathrm{Cov}_{V_t | \Psi}\!\left[ \mathbb{E}_{X_t | V_t, \Psi}\!\left[ X_t \right] \right] = \frac{\nu_x}{\nu_x - 2}\, I_k + \delta_x^T \delta_x\, \frac{2 \nu_x^2}{(\nu_x - 2)^2 (\nu_x - 4)},$$
and
$$\mathbb{E}_{\epsilon_t | \Psi}\!\left[ \epsilon_t \right] = \mathbb{E}_{U_t | \Psi}\!\left[ \mathbb{E}_{\epsilon_t | U_t, \Psi}\!\left[ \epsilon_t \right] \right] = \frac{\nu_\epsilon}{\nu_\epsilon - 2}\, \delta_\epsilon,$$
$$\mathrm{Cov}_{\epsilon_t | \Psi}\!\left[ \epsilon_t \right] = \mathbb{E}_{U_t | \Psi}\!\left[ \mathrm{Cov}_{\epsilon_t | U_t, \Psi}\!\left[ \epsilon_t \right] \right] + \mathrm{Cov}_{U_t | \Psi}\!\left[ \mathbb{E}_{\epsilon_t | U_t, \Psi}\!\left[ \epsilon_t \right] \right] = \sigma^2 \frac{\nu_\epsilon}{\nu_\epsilon - 2}\, I_d + \delta_\epsilon^T \delta_\epsilon\, \frac{2 \nu_\epsilon^2}{(\nu_\epsilon - 2)^2 (\nu_\epsilon - 4)}.$$
Hence, the covariance matrix of the observation process $Y_t$ is given by
$$\mathrm{Cov}_{Y_t | \Psi}\!\left[ Y_t \right] = W \left( \frac{\nu_x}{\nu_x - 2}\, I_k + \delta_x^T \delta_x\, \frac{2 \nu_x^2}{(\nu_x - 2)^2 (\nu_x - 4)} \right) W^T + \sigma^2 \frac{\nu_\epsilon}{\nu_\epsilon - 2}\, I_d + \delta_\epsilon^T \delta_\epsilon\, \frac{2 \nu_\epsilon^2}{(\nu_\epsilon - 2)^2 (\nu_\epsilon - 4)}.$$
The $d \times k$ matrix $W$ has the singular value decomposition $W = M_{d \times d} D_{d \times k} N_{k \times k}$. If $\delta_\epsilon = \delta_x = 0$, we can show that
$$\mathrm{Cov}_{Y_t | \Psi}\!\left[ Y_t \right] = M \left( DD^T \frac{\nu_x}{\nu_x - 2} + \sigma^2 \frac{\nu_\epsilon}{\nu_\epsilon - 2}\, I_d \right) M^T.$$
Since the matrix $D$ has only non-zero diagonal elements, the eigenvectors of the covariance matrix correspond to the left singular vectors of the matrix $W$. However, when we relax the assumption about the parameters which control the skewness of the distribution, and allow $\delta_\epsilon \neq 0$ and $\delta_x \neq 0$, then the eigenvectors also depend on the outer products of the parameters of skewness.
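The moment identities above are easy to check numerically. The following Monte Carlo sketch (parameter values ours) simulates the zero-skewness scalar case and compares the sample covariance of $Y_t$ against $W W^T\, \nu_x/(\nu_x - 2) + \sigma^2\, \nu_\epsilon/(\nu_\epsilon - 2)\, I_d$:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, N = 4, 2, 400_000
nu_x, nu_eps, sigma2 = 10.0, 8.0, 0.5
W = rng.normal(size=(d, k))

# Scalar scale mixtures: V, U ~ Gamma(nu/2, rate nu/2), i.e. chi2_nu / nu.
V = rng.chisquare(nu_x, size=(N, 1)) / nu_x
U = rng.chisquare(nu_eps, size=(N, 1)) / nu_eps
X = np.sqrt(1.0 / V) * rng.normal(size=(N, k))       # zero-skewness case
eps = np.sqrt(sigma2 / U) * rng.normal(size=(N, d))
Y = X @ W.T + eps

C_theory = (nu_x / (nu_x - 2)) * W @ W.T \
         + sigma2 * (nu_eps / (nu_eps - 2)) * np.eye(d)
C_sample = np.cov(Y, rowvar=False)
print(np.max(np.abs(C_sample - C_theory)))           # small Monte Carlo error
```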
The latent variable and the error terms have the stochastic representation in (6), and their moments are determined as
$$\mathbb{E}_{X_t | \Psi}\!\left[ X_t \right] = \mathbb{E}_{V_t | \Psi}\!\left[ \mathbb{E}_{X_t | V_t, \Psi}\!\left[ X_t \right] \right] = \mathbb{E}_{V_t | \Psi}\!\left[ V_t^{-1} \right] \circ \delta_x,$$
$$\mathrm{Cov}_{X_t | \Psi}\!\left[ X_t \right] = \mathbb{E}_{V_t | \Psi}\!\left[ \mathrm{Cov}_{X_t | V_t, \Psi}\!\left[ X_t \right] \right] + \mathrm{Cov}_{V_t | \Psi}\!\left[ \mathbb{E}_{X_t | V_t, \Psi}\!\left[ X_t \right] \right] = \mathbb{E}_{V_t | \Psi}\!\left[ V_t^{-1} \right] \circ I_k + \left( \delta_x^T \delta_x \right) \circ \mathrm{Cov}_{V_t | \Psi}\!\left[ V_t^{-1} \right],$$
and
$$\mathbb{E}_{\epsilon_t | \Psi}\!\left[ \epsilon_t \right] = \mathbb{E}_{U_t | \Psi}\!\left[ \mathbb{E}_{\epsilon_t | U_t, \Psi}\!\left[ \epsilon_t \right] \right] = \mathbb{E}_{U_t | \Psi}\!\left[ U_t^{-1} \right] \circ \delta_\epsilon,$$
$$\mathrm{Cov}_{\epsilon_t | \Psi}\!\left[ \epsilon_t \right] = \mathbb{E}_{U_t | \Psi}\!\left[ \mathrm{Cov}_{\epsilon_t | U_t, \Psi}\!\left[ \epsilon_t \right] \right] + \mathrm{Cov}_{U_t | \Psi}\!\left[ \mathbb{E}_{\epsilon_t | U_t, \Psi}\!\left[ \epsilon_t \right] \right] = \sigma^2\, \mathbb{E}_{U_t | \Psi}\!\left[ U_t^{-1} \right] \circ I_d + \left( \delta_\epsilon^T \delta_\epsilon \right) \circ \mathrm{Cov}_{U_t | \Psi}\!\left[ U_t^{-1} \right].$$
Hence, the covariance matrix of the observation process $Y_t$ is given by
$$\mathrm{Cov}_{Y_t | \Psi}\!\left[ Y_t \right] = W \left( \mathbb{E}_{V_t | \Psi}\!\left[ V_t^{-1} \right] \circ I_k + \left( \delta_x^T \delta_x \right) \circ \mathrm{Cov}_{V_t | \Psi}\!\left[ V_t^{-1} \right] \right) W^T + \sigma^2\, \mathbb{E}_{U_t | \Psi}\!\left[ U_t^{-1} \right] \circ I_d + \left( \delta_\epsilon^T \delta_\epsilon \right) \circ \mathrm{Cov}_{U_t | \Psi}\!\left[ U_t^{-1} \right],$$
which is defined by
$$\mathbb{E}_{V_t | \Psi}\!\left[ V_t^{-1} \right]_{1 \times k} = \mathbb{E}_{S_{x,t} | \Psi}\!\left[ T_x(s_{x,t})^{-1} \right] = \left( \frac{\nu^1_x}{\nu^1_x - 2}, \ldots, \frac{\nu^k_x}{\nu^k_x - 2} \right)_{1 \times k},$$
$$\mathbb{E}_{U_t | \Psi}\!\left[ U_t^{-1} \right]_{1 \times d} = \mathbb{E}_{S_{\epsilon,t} | \Psi}\!\left[ T_\epsilon(s_{\epsilon,t})^{-1} \right] = \left( \frac{\nu^1_\epsilon}{\nu^1_\epsilon - 2}, \ldots, \frac{\nu^d_\epsilon}{\nu^d_\epsilon - 2} \right)_{1 \times d}.$$
The covariances of the mixing variables are
$$\mathrm{Cov}_{V_t|\Psi}\big[V_t^{-1}\big]_{k\times k} = \int \Big(T_x(s_{x,t})^{-1} - \mathbb{E}_{S_{x,t}|\Psi}\big[T_x(s_{x,t})^{-1}\big]\Big)^T\Big(T_x(s_{x,t})^{-1} - \mathbb{E}_{S_{x,t}|\Psi}\big[T_x(s_{x,t})^{-1}\big]\Big)\,ds_{x,t},$$
$$\mathrm{Cov}_{U_t|\Psi}\big[U_t^{-1}\big]_{d\times d} = \int \Big(T_\epsilon(s_{\epsilon,t})^{-1} - \mathbb{E}_{S_{\epsilon,t}|\Psi}\big[T_\epsilon(s_{\epsilon,t})^{-1}\big]\Big)^T\Big(T_\epsilon(s_{\epsilon,t})^{-1} - \mathbb{E}_{S_{\epsilon,t}|\Psi}\big[T_\epsilon(s_{\epsilon,t})^{-1}\big]\Big)\,ds_{\epsilon,t},$$
where we applied the Jacobian of the transformation and the following relations between the probability density functions of the random variables $S_{\epsilon,t}$ and $S_{x,t}$ defined in Subsection 3.1, that is
$$\pi_{U_t|\Psi}(u_t) = \pi_{S_{\epsilon,t}|\Psi}(s_{\epsilon,t})\,\Big|\frac{\partial T_\epsilon(s_{\epsilon,t})}{\partial s_{\epsilon,t}}\Big|^{-1} = \mathbb{1}_{[0,1]}(s_{\epsilon,t})\,\Big|\frac{\partial T_\epsilon(s_{\epsilon,t})}{\partial s_{\epsilon,t}}\Big|^{-1},\qquad
\pi_{V_t|\Psi}(v_t) = \pi_{S_{x,t}|\Psi}(s_{x,t})\,\Big|\frac{\partial T_x(s_{x,t})}{\partial s_{x,t}}\Big|^{-1} = \mathbb{1}_{[0,1]}(s_{x,t})\,\Big|\frac{\partial T_x(s_{x,t})}{\partial s_{x,t}}\Big|^{-1}.$$
Let the $d\times k$ matrix $W$ have the singular value decomposition $W = M_{d\times d} D_{d\times k} N_{k\times k}$. If $\delta_\epsilon = \delta_x = 0$, we can show that
$$\mathrm{Cov}_{Y_t|\Psi}\big[Y_t\big] = M\Big(DD^T \circ \mathbb{E}_{S_{x,t}|\Psi}\big[T_x(S_{x,t})^{-1}\big] + \sigma^2\,\mathbb{E}_{S_{\epsilon,t}|\Psi}\big[T_\epsilon(S_{\epsilon,t})^{-1}\big] \circ I_d\Big)M^T.$$
Since the matrix $D$ has only non-zero diagonal elements, the eigenvectors of the covariance matrix correspond to the left singular vectors of the matrix $W$.
However, when we relax the assumption on the parameters which control the skewness of the distribution and allow $\delta_\epsilon \neq 0$ and $\delta_x \neq 0$, then the eigenvectors again depend on the outer products of the parameters of skewness, and are additionally defined by $\mathrm{Cov}_{U_t|\Psi}[U_t^{-1}]$ and $\mathrm{Cov}_{V_t|\Psi}[V_t^{-1}]$, which are not necessarily diagonal matrices.

Until now, we assumed that the data set which we analyse does not contain any missing observations. We start by discussing the concept of missingness and address the question of how such characteristics of the data can be incorporated. To achieve this we introduce new notation where the random vector $Y_t$ is partitioned into two subvectors, one which contains the observed values, $Y_t^o$, and a second which indicates the missing entries, $Y_t^m$, such that $Y_t = [Y_t^o, Y_t^m]$. We denote by $d_o$ the number of observed elements of the vector $Y_t$ and by $d_m = d - d_o$ the number of missing entries at time $t$. The numbers $d_o$ and $d_m$ can vary over time.

In the sections related to the incomplete-data case we denote by $W_o$ and $W_m$ the $d_o \times k$ and $d_m \times k$ non-square submatrices of $W$ whose rows correspond to the elements of the vector $Y_t$ which are observed and missing, respectively. In general, by the lower indices $o$ and $m$ we further refer to the elements of an object corresponding to the observed and missing values of $Y_t$, respectively.

Let us define the random vector $R_t$ which indicates which entries of $Y_t$ are missing, with entries equal to $1$ for missing elements and $0$ otherwise. Recall that a single observation consists of the pair $[Y_t^o, R_t]$ with distribution parameters $[\Psi, \Theta]$, respectively. We assume the parameters to be distinct. The likelihood of the parameters is proportional to the conditional probability of $Y_t^o, R_t \mid \Psi, \Theta$, that is
$$\pi_{Y_t^o, R_t|\Psi,\Theta}(y_t^o, r_t) = \int \pi_{Y_t^o, Y_t^m, R_t|\Psi,\Theta}(y_t^o, y_t^m, r_t)\,dy_t^m = \int \pi_{R_t|Y_t,\Psi,\Theta}(r_t)\,\pi_{Y_t|\Psi,\Theta}(y_t)\,dy_t^m. \tag{8}$$
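The bookkeeping behind this partition is mechanical; a small sketch (ours, not the paper's code) shows how, for a realisation $y_t$ with NaN marking missing entries, one builds the indicator $r_t$ and the row-submatrices $W_o$, $W_m$ of a hypothetical loading matrix $W$:

```python
# Observed/missing partition of a single observation y_t: build the missingness
# indicator r_t (1 = missing, 0 = observed) and the submatrices W_o, W_m whose
# rows correspond to the observed and missing entries of y_t.
import numpy as np

d, k = 5, 2
rng = np.random.default_rng(1)
W = rng.standard_normal((d, k))          # a hypothetical d x k loading matrix

y_t = np.array([0.3, np.nan, -1.2, np.nan, 0.7])

r_t = np.isnan(y_t).astype(int)          # indicator vector R_t
obs = r_t == 0

y_o = y_t[obs]                           # observed subvector Y_t^o
W_o = W[obs, :]                          # d_o x k rows for observed entries
W_m = W[~obs, :]                         # d_m x k rows for missing entries

print(r_t.tolist())                      # [0, 1, 0, 1, 0]
print(W_o.shape, W_m.shape)              # (3, 2) (2, 2)
```

Since $d_o$ and $d_m$ vary with $t$, these submatrices are rebuilt per observation rather than fixed once.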
In our study, we assume the pattern of missing data to be MAR (missing at random), as defined in [54]. The assumption imposes that the indicator variable $R_t$ is independent of the values of the missing data. Then the vector $Y_t$ which is MAR satisfies $\pi_{R_t|Y_t,\Psi}(r_t) = \pi_{R_t|Y_t^o,\Psi}(r_t)$, giving
$$\pi_{Y_t^o, R_t|\Psi,\Theta}(y_t^o, r_t) = \pi_{R_t|Y_t^o,\Theta}(r_t)\int \pi_{Y_t|\Psi}(y_t)\,dy_t^m = \pi_{R_t|Y_t^o,\Theta}(r_t)\,\pi_{Y_t^o|\Psi}(y_t^o).$$
Under the MAR assumption, the estimation of $\Psi$ via maximum likelihood of the joint distribution of $Y_t^o, R_t \mid \Psi, \Theta$ is equivalent to the maximisation of the likelihood of the marginal distribution $Y_t^o \mid \Psi$. Hence, we need not model the distribution of the indicator random variable $R_t$ or the joint distribution of $Y_t^o$ and $R_t$. If the MAR assumption does not hold, one needs to solve the integral in (8) in order to maximise the joint likelihood in the corresponding EM algorithm.

The two iterative steps of the EM algorithm for Generalized Skew-t PPCA, which jointly maximise the expected log-likelihood of the observed and hidden variables, are the following.

Step 1:
Expectation Step (E-step). We calculate the expectation of the complete-data log-likelihood of model (1) with respect to the conditional distribution of $Y_N^m, X_N, U_N, V_N \mid Y_N^o, \Psi$. Given $N$ realisations of the variables, the expectation is a function of two vectors of static parameters, $\Psi = [W, \mu, \sigma^2, \delta_\epsilon, \delta_x]$ and $\Psi^* = [W^*, \mu^*, \sigma^{2*}, \delta_\epsilon^*, \delta_x^*]$, that is
$$Q(\Psi, \Psi^*) = \mathbb{E}_{Y_N^m, X_N, U_N, V_N|Y_N^o,\Psi}\Big[\log \pi_{Y_N, X_N, U_N, V_N|\Psi^*}\big(Y_N, X_N, U_N, V_N\big)\Big].$$

Step 2:
Maximisation Step (M-step). We update the vector of static parameters $\Psi$ via maximisation of the resulting $Q_{GSt,ind}$ function with respect to the vector $\Psi^*$, that is
$$\hat\Psi^* = \operatorname*{argmax}_{\Psi^*} Q\big(\Psi, \Psi^*\big).$$
The $Q$ function of the E-step for the Generalized Skew-t PPCA is provided in Theorem 1, whereas the derivation of its maximizers is given in Theorem 2. The steps of the EM algorithm for special cases of the GSt PPCA are provided in [52].

Theorem 1.
Consider the observation vector $Y_t$ modelled according to (1), with the latent processes $\epsilon_t$, $X_t$, $U_t$ and $V_t$ following the assumptions of GSt PPCA given in (5) and (6). Given $N$ realisations of the observed entries of the random vector $Y_t^o$, denoted by $y_N^o := [y_1^o, \dots, y_N^o]$, the E-step of the Expectation-Maximisation algorithm for Generalized Skew-t PPCA in the incomplete data setting is specified as follows:
$$Q(\Psi, \Psi^*) = \frac{1}{\pi_{Y_N^o|\Psi}(y_N^o)}\sum_{t=1}^N\Big\{ I_1(y_t^o; \Psi, \Psi^*)\prod_{s=1, s\neq t}^N I_2(y_s^o; \Psi)\Big\}.$$
The functions $I_1, I_2: \mathbb{R}^{d_o}\to\mathbb{R}$ are defined as
$$I_1(y_t^o; \Psi, \Psi^*) := \int\!\!\int \tilde v(y_t^o, s_{\epsilon,t}, s_{x,t}; \Psi, \Psi^*)\, m(y_t^o, s_{\epsilon,t}, s_{x,t}; \Psi)\,ds_{\epsilon,t}\,ds_{x,t},$$
$$I_2(y_t^o; \Psi) := \int\!\!\int m(y_t^o, s_{\epsilon,t}, s_{x,t}; \Psi)\,ds_{\epsilon,t}\,ds_{x,t},$$
where $\tilde v(y_t^o, s_{\epsilon,t}, s_{x,t}; \Psi, \Psi^*) := v(y_t^o, T_\epsilon(s_{\epsilon,t}), T_x(s_{x,t}); \Psi, \Psi^*)$ is specified in Lemma A.5 in Appendix A, and the function $m: \mathbb{R}^{d_o}\times[0,1]^2\to\mathbb{R}$ is given by
$$m(y_t^o, s_{\epsilon,t}, s_{x,t}; \Psi) = e^{-\frac{1}{2}\,\delta_x\big(D_{x,t}^{-1} - \sigma^{-2}M_t^{-1}W^T D_{\epsilon,t}N_t^{-1}D_{\epsilon,t}W M_t^{-1} - \sigma^{-2}M_t^{-1}\big)\delta_x^T}\;\pi_{Y_t^o|S_{\epsilon,t},S_{x,t},\Psi}(y_t^o)\;\big(\sigma^2\big)^{-\frac{k}{2}}\,\big|D_{x,t}\big|^{-\frac{1}{2}}\,\big|M_t\big|^{-\frac{1}{2}}\,\big|N_t\big|^{-\frac{1}{2}}.$$
The probability density function $\pi_{Y_t^o|S_{\epsilon,t},S_{x,t},\Psi}(y_t^o)$ equals the density function $\pi_{Y_t^o|T_\epsilon(S_{\epsilon,t}), T_x(S_{x,t}), \Psi}(y_t^o)$ specified in Lemma A.4 in Appendix A, where $M_t = \sigma^{-2}D_{x,t} + W^T D_{\epsilon,t}W$ and $N_t = I_d - W M_t^{-1}W^T D_{\epsilon,t}$.

Proof. The proof of Theorem 1 is given in Subsection A.1 in Appendix A.
Theorem 2.
The solution to the system of equations $\nabla_{\Psi^*} Q(\Psi, \Psi^*) = 0$, which determines the maximizers of the function $Q$ from Theorem 1 with respect to the parameters $\mu^*$, $\delta_\epsilon^*$, $\delta_x^*$, $\sigma^{2*}$, is given by explicit formulas defined by two-dimensional integration problems on the hypercube $[0,1]^2$. Each of $\mu^*$, $\delta_\epsilon^*$ and $\delta_x^*$ is expressed as a product of linear combinations of the integrals $A_i(y_N^o; \Psi)$ and $A_i(y_N^o; \Psi, \Psi^*)$, with coefficients involving $\mu$, $W^*$, $\delta_\epsilon$, $\delta_x$, $W$ and $\sigma^2$, and the inverse of one such integral. The maximizer $\sigma^{2*}$ is expressed as $\frac{1}{d}$ times the ratio of a linear combination of the integrals, of their products with $\mu^*$ and $\delta_\epsilon^*$, and of traces of the form $\mathrm{Tr}\{A_i(y_N^o; \Psi)^T W^*\}$, to one of the integrals. The remaining condition,
$$f(W^*; \Psi, \Psi^*) = \frac{\partial Q(\Psi, \Psi^*)}{\partial W^*} = 0,$$
determines $W^*$. The solutions with respect to $\mu^*$, $\delta_\epsilon^*$ and $\sigma^{2*}$ are linear functions of the parameter $W^*$.
The maximizer of the function $Q$ with respect to $W^*$ is determined by the function $f: \mathbb{R}^{d\times k}\to\mathbb{R}^{d\times k}$, a linear combination of the integrals $A_i(y_N^o; \Psi)$ and $A_i(y_N^o; \Psi, \Psi^*)$ with coefficients involving $\mu$, $\mu^*$, $\delta_\epsilon^*$, $\delta_\epsilon$, $\delta_x$, $W$ and $\sigma^2$. The function $f$ is linear with respect to the parameters $\mu^*$ and $\delta_\epsilon^*$. The two-dimensional integrals $A_i$ on the hypercube $[0,1]^2$ are defined in Subsection A.2 in Appendix A.

Proof. The proof of Theorem 2 is given in Subsection A.2 in Appendix A.
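To build intuition for the E/M alternation above, it is useful to recall its simplest special case: for Gaussian PPCA (Section 2), both steps collapse to the closed-form updates of Tipping and Bishop. The sketch below implements those closed forms; the GSt algorithm of Theorems 1-2 replaces them with the two-dimensional integrals $I_1$, $I_2$ and $A_i$.

```python
# Closed-form EM for Gaussian PPCA (Tipping & Bishop) -- the degenerate special
# case of the E/M scheme above, shown for intuition only.
import numpy as np

def ppca_em(Y, k, n_iter=50, sigma2=1.0, seed=0):
    """EM for Gaussian PPCA: Y is N x d; returns (W, mu, sigma2, log-liks)."""
    N, d = Y.shape
    rng = np.random.default_rng(seed)
    mu = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False, bias=True)          # sample covariance
    W = rng.standard_normal((d, k))
    lls = []
    for _ in range(n_iter):
        # E-step quantities enter through M = W^T W + sigma^2 I_k
        M = W.T @ W + sigma2 * np.eye(k)
        Minv = np.linalg.inv(M)
        SW = S @ W
        # M-step (closed form under Gaussianity)
        W_new = SW @ np.linalg.inv(sigma2 * np.eye(k) + Minv @ W.T @ SW)
        sigma2 = np.trace(S - SW @ Minv @ W_new.T) / d
        W = W_new
        # marginal log-likelihood under C = W W^T + sigma^2 I_d
        C = W @ W.T + sigma2 * np.eye(d)
        ll = -0.5 * N * (d * np.log(2 * np.pi)
                         + np.linalg.slogdet(C)[1]
                         + np.trace(np.linalg.solve(C, S)))
        lls.append(ll)
    return W, mu, sigma2, lls

# toy data generated from the Gaussian PPCA model itself
rng = np.random.default_rng(42)
W_true = rng.standard_normal((5, 2))
X = rng.standard_normal((500, 2))
Y = X @ W_true.T + 0.1 * rng.standard_normal((500, 5))

W_hat, mu_hat, s2_hat, lls = ppca_em(Y, k=2)
print(s2_hat > 0, all(b >= a - 1e-8 for a, b in zip(lls, lls[1:])))
```

The monotone increase of the marginal log-likelihood is the generic EM guarantee, and it carries over to the GSt algorithm even though its steps are only available as integration problems.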
Here we explore our proposed methodology from the perspective of robust estimation rather than model selection. The introduced models provide estimators of parameters, such as the mean and covariance, that contain additional flexibility to accommodate characteristics of the data such as skewness or various patterns of marginal tail dependence. The proposed estimators are obtained as solutions to estimating equations that may be seen as 'distorted' in comparison to the equations for the same parameters in the standard PPCA. We argue that this new class of estimators is more robust than the standard PPCA frameworks.

In general, to verify this conjecture, one may analyse a non-linear system of equations with weights that specify the contributions of sample points to the formulation of an estimator, and study the properties of the robustness imposed by these weights. However, our challenge is that for the new class of estimators under the GSt PPCA model, the weighting functions do not have analytic closed forms. Instead, we propose to study the characteristics of robustness through the notion of influence functions, and to study them numerically in three different ways: by the asymptotic bias of an estimator, by the asymptotic variance of an estimator, i.e., the precision of the estimation, and by the sensitivity of an estimator to the effect of outliers.

As shown in the final part of this section, the GSt PPCA is characterized by the highest robustness in comparison to the standard PPCA methods according to these three measures. If one suspects that the data might reveal characteristics captured by the GSt PPCA family of models, the best choice is to use these PPCA frameworks, as we show a significant loss of accuracy and robustness for simpler standard approaches, such as the Gaussian PPCA of [21] or the Student-t PPCA of [25] or [26], when these characteristics appear in the data.
On the other hand, if the data follows simpler distributions, the class of GSt PPCA methods is flexible enough to also accommodate simple Gaussian and Student-t structures.
Under the Gaussian PPCA model from Section 2, we can specify the marginal distribution of the observation vector $Y_t$ to be Gaussian with mean vector $\mu$ and covariance matrix $C = WW^T + \sigma^2 I_d$. The MLE estimates using this marginalised likelihood are then
$$\hat\mu, \hat C = \operatorname*{argmax}_{\mu, C}\sum_{t=1}^N \log\big(\pi_{Y_t|\Psi}(y_t)\big) = \operatorname*{argmin}_{\mu, C}\Big\{ N\log\big|C\big| + \sum_{t=1}^N \big(y_t-\mu\big)C^{-1}\big(y_t-\mu\big)^T\Big\}.$$
After differentiating the objective function, one obtains the following system of equations
$$\begin{cases} -N\mu + \sum_{t=1}^N y_t = 0,\\[1mm] N C^{-1} - \sum_{t=1}^N C^{-1}\big(y_t-\mu\big)^T\big(y_t-\mu\big)C^{-1} = 0,\end{cases} \iff \begin{cases} \hat\mu = \frac{1}{N}\sum_{t=1}^N y_t,\\[1mm] \hat C = \frac{1}{N}\sum_{t=1}^N \big(y_t-\hat\mu\big)^T\big(y_t-\hat\mu\big).\end{cases}$$
The MLE estimators of the Gaussian PPCA model can be obtained in closed form; however, they assume that all samples are of equal importance, and hence an outlying observation contributes to the construction of the estimators equally. As a result, the information captured by the provided eigenvectors loses its intended interpretation (a statistical summary) if the data is corrupted by even a single observation. In order to introduce a general notion of observation-specific weights, the generalisation of MLE by M-estimators for independently distributed data was proposed in [55] and [56], defined by the following system of normal equations
$$\sum_{t=1}^N \frac{\psi(D_t)}{D_t}\, C^{-1}\big(y_t-\mu\big)^T = 0,\qquad \sum_{t=1}^N \frac{\psi(D_t)}{D_t}\, C^{-1}\big(y_t-\mu\big)^T\big(y_t-\mu\big)C^{-1} - N C^{-1} = 0,$$
where $D_t = (y_t-\mu)C^{-1}(y_t-\mu)^T$ is the Mahalanobis distance of a sample point $y_t$ and $\psi: \mathbb{R}_+\to\mathbb{R}$.
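The effect of such weighting is easy to see on a toy example (ours, not the paper's). Below a Huber-type weight $w(D) = \min(1, c/D)$, chosen purely for illustration, down-weights a point with large Mahalanobis distance, while the plain MLE mean treats it like any other sample:

```python
# M-estimator weighting idea: down-weight points with large Mahalanobis
# distance D_t instead of letting every sample contribute equally.
# w(D) = min(1, c/D) is a simple Huber-type choice, picked for illustration.
import numpy as np

rng = np.random.default_rng(3)
Y = rng.standard_normal((200, 2))          # clean data centred at 0
Y[0] = [25.0, 25.0]                        # one gross outlier

C = np.eye(2)                              # scatter fixed for simplicity
mu = Y.mean(axis=0)                        # plain MLE mean, pulled by outlier

for _ in range(20):                        # iteratively reweighted mean
    D = np.sqrt(np.einsum('ij,jk,ik->i', Y - mu, np.linalg.inv(C), Y - mu))
    w = np.minimum(1.0, 3.0 / np.maximum(D, 1e-12))
    mu = (w[:, None] * Y).sum(axis=0) / w.sum()

print(np.linalg.norm(Y.mean(axis=0)))      # plain mean, pulled by the outlier
print(np.linalg.norm(mu))                  # reweighted mean, much closer to 0
```

The reweighted estimate solves a fixed-point version of the first normal equation above with $\psi(D)/D$ replaced by the Huber weight.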
We may view the estimators of the mean and covariance obtained by the Student-t PPCA of [25] from the same perspective, as the solutions to the system of equations given by
$$\begin{cases}\displaystyle \sum_{t=1}^N \frac{(v+d)\,C^{-1}\big(y_t-\mu\big)^T}{v + \big(y_t-\mu\big)C^{-1}\big(y_t-\mu\big)^T} = 0,\\[3mm] \displaystyle N C^{-1} - (v+d)\sum_{t=1}^N \frac{C^{-1}\big(y_t-\mu\big)^T\big(y_t-\mu\big)C^{-1}}{v + \big(y_t-\mu\big)C^{-1}\big(y_t-\mu\big)^T} = 0,\end{cases}\iff\begin{cases}\displaystyle \sum_{t=1}^N \frac{y_t-\mu}{v + \big(y_t-\mu\big)C^{-1}\big(y_t-\mu\big)^T} = 0,\\[3mm] \displaystyle C = \frac{v+d}{N}\sum_{t=1}^N \frac{\big(y_t-\mu\big)^T\big(y_t-\mu\big)}{v + \big(y_t-\mu\big)C^{-1}\big(y_t-\mu\big)^T},\end{cases}$$
and remark on the special weighting function that the method introduces to down-weight outliers. If $v \to 0$, the estimator of the covariance matrix under Student-t PPCA corresponds to the Tyler estimator of [57].

The above examples of estimating equations for the PPCA models' parameters explicitly state the functional formulation of the weighting function. This is not the case for the family of GSt models.
The analogous system of normal equations for the mean and covariance matrix of random vectors which follow the GSt PPCA model is given by the optimisation problem
$$\hat\mu, \hat W, \hat\sigma^2, \hat\delta_\epsilon, \hat\delta_x = \operatorname*{argmax}_{\mu, W, \sigma^2, \delta_\epsilon, \delta_x}\Big\{\log\big(\pi_{Y_N|\mu, W, \sigma^2, \delta_\epsilon, \delta_x}(y_N)\big)\Big\},$$
where the log-likelihood of the model is formulated as
$$\log L(\Psi, y_N) = \sum_{t=1}^N \log \pi_{Y_t|\Psi}(y_t) = \sum_{t=1}^N \log\Big(\int\!\!\int m(y_t, s_{\epsilon,t}, s_{x,t}; \Psi)\,ds_{\epsilon,t}\,ds_{x,t}\Big),$$
since, by using Lemma A.2 and Lemma A.4 in Appendix A, $\log \pi_{Y|\Psi}(y)$ is expressed as
$$\log \pi_{Y|\Psi}(y_t) = \log\Big(\int\!\!\int m(y_t, s_{\epsilon,t}, s_{x,t}; \Psi)\,ds_{\epsilon,t}\,ds_{x,t}\Big),\tag{9}$$
for the function $m: \mathbb{R}^d\times[0,1]^2\to\mathbb{R}$ defined in Theorem 1. Therefore, the corresponding log-likelihood function is not explicitly stated, as it is defined by two-dimensional integration problems. Consequently, it is not easy to find an explicit functional form for the weight functions which modify the system of estimating equations in the GSt PPCA model. However, we can still indirectly study whether this new formulation reduces the sensitivity of the estimators to the effect of outliers. We can achieve this by using an influence function, or more precisely its approximation, to assess the local robustness of an estimator as well as to determine its asymptotic variance.
We follow the definition of the influence function from [46]. Let $\Omega$ be an open convex subset of the parameter space. Let $F$ be a cumulative distribution function, parametrized by $\Psi\in\Omega$, of a $d$-dimensional random vector $Y\sim F$. We observe $N$ iid (independent and identically distributed) realisations of $Y$, denoted by $y_1, \dots, y_N$, that determine its empirical cumulative distribution function $\hat F_N$. Let $T$ be a mapping from the data space defined by $F$ to the parameter space $\Omega$ that is Fisher consistent, that is, $T(F) = \Psi$ and $T(\hat F_N) = \hat\Psi_N$. For instance, if we are interested in finding an estimator of the expectation of $Y$, the parameter of interest is $\Psi = \mu = \mathbb{E}_F[Y]$. Then $T(F) = \int_{\mathcal{Y}} y\,dF(y)$.

The influence function defined in [46] is a Gateaux derivative of an estimator $T$ at the measure $F$ in the direction of a local point $y\in\mathcal{Y}$, expressed by
$$IF(y, T, F) = \lim_{\epsilon\to 0}\frac{T\big((1-\epsilon)F + \epsilon\Delta_y\big) - T(F)}{\epsilon},$$
where $\Delta_y$ is a probability measure which puts all its mass at the point $y$. Practically, an influence function can be described as a measure of the effect of an infinitesimal contamination at the point $y$ on the estimator, standardized by the mass of the contamination. Furthermore, we assume that for the considered class of estimators the distribution function $F$ satisfies regularity conditions such as differentiability at $\Psi$, $\mathbb{E}_F[T(F)] < \infty$ and
$$\int IF(y, T, F)\,dF(y) = \mathbb{E}_F\big[IF(y, T, F)\big] = 0.$$
One can approximate $T((1-\epsilon)F + \epsilon\Delta_y)$ using its Taylor series expansion around $T(F)$, as shown in [58] and [46], that is,
$$T\big((1-\epsilon)F + \epsilon\Delta_y\big) = T(F) + \epsilon\,IF(y, T, F) + o_p(1),$$
where $o_p(1)$ denotes the convergence of the remainder to zero in probability. This representation can be seen from a more general perspective as the von Mises expansion, as in [59] and [60].
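The Gateaux-derivative definition can be illustrated numerically (our toy example, not the paper's code). For the mean functional, contaminating $F$ with mass $\epsilon$ at $y$ gives $T((1-\epsilon)F + \epsilon\Delta_y) = (1-\epsilon)\mu + \epsilon y$, so the difference quotient equals $y - \mu$ exactly, the textbook influence function of the mean:

```python
# Numerical influence function via epsilon-contamination, for the mean
# functional T(F) = E_F[Y] evaluated at a discrete (empirical) measure.
import numpy as np

def T_mean(points, weights):
    """The mean functional evaluated at a discrete measure."""
    return np.sum(weights[:, None] * points, axis=0)

y_sample = np.array([[0.0, 1.0], [2.0, -1.0], [4.0, 3.0]])  # empirical F
w = np.full(len(y_sample), 1.0 / len(y_sample))
mu = T_mean(y_sample, w)                                     # T(F) = [2, 1]

y = np.array([10.0, -5.0])                                   # contamination point
eps = 1e-6
points = np.vstack([y_sample, y])
w_eps = np.append((1 - eps) * w, eps)                        # (1-eps)F + eps*Delta_y

IF_num = (T_mean(points, w_eps) - mu) / eps
print(np.allclose(IF_num, y - mu))                           # True: IF(y) = y - mu
```

For estimators without closed forms, such as those of the GSt family, the same finite-$\epsilon$ recipe gives the approximation of the influence function used in the numerical study below.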
If we study the expansion of the functional $T$ at the empirical distribution $F_N$ around the true distribution $F$, we have
$$T(F_N) = T(F) + \int IF(y, T, F)\,d(F_N - F)(y) + o_p(1) = \int IF(y, T, F)\,dF_N(y) + o_p(1) = \frac{1}{N}\sum_{t=1}^N IF(y_t, T, F) + o_p(1).$$
The Cramer-Rao inequality relates the asymptotic efficiency to the influence function, which determines the asymptotic variance of the estimator defined by $T$:
$$\sqrt{N}\big(\hat\Psi_N - \Psi\big) \to_d \mathcal{N}\Big(0,\ \mathbb{E}_Y\big[IF(Y, T, F)^T IF(Y, T, F)\big]\Big). \tag{10}$$
The influence function can indicate some useful properties of an estimator, e.g., evaluation of the gross-error sensitivity, defined as
$$\gamma^*(T, F) = \sup_{y\in\mathcal{Y}}\Big|IF(y, T, F)\Big|, \tag{11}$$
as a measure of the maximum sensitivity of an estimator to local contamination, or the local-shift sensitivity, expressed as
$$\lambda^*(T, F) = \sup_{y_1, y_2\in\mathcal{Y}}\Big|\frac{IF(y_1, T, F) - IF(y_2, T, F)}{y_1 - y_2}\Big|, \tag{12}$$
which measures the effect of shifting a single observation to a different value in the estimation process.

The estimation via the maximum likelihood principle can be generalised, as defined in [46], [55] or [56]. The problem of finding an estimator according to some measure of fit $\rho: \mathcal{Y}\times\Omega\to\mathbb{R}$, which is differentiable and satisfies the Leibniz integral rule, can be specified as
$$\Psi^* = \operatorname*{argmin}_{\Psi\in\Omega}\int_{\mathcal{Y}}\rho(y, \Psi)\,dF(y),$$
and is found by solving the normal equations given by
$$\int_{\mathcal{Y}}\varphi(y, \Psi)\,dF(y) = 0,\qquad \text{for } \varphi(y, \Psi) = \frac{\partial\rho(y, \Psi)}{\partial\Psi}.$$
As noted in [46], if $\varphi$ is strictly monotone, the problem has a unique solution. Given this notation and conditions, the influence function of the parameters $\Psi$ can be expressed as
$$IF(y, \Psi, F) = -\varphi(y, \Psi)\Bigg(\frac{\partial\,\mathbb{E}_{Y|\Psi}\big[\varphi(Y, \Psi)\big]}{\partial\Psi}\Bigg)^{-1}.$$
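Both sensitivity measures (11) and (12) can be computed numerically once the influence function is evaluated on a grid. Our toy example below uses the mean functional in one dimension, where $IF(y) = y - \mu$, so the gross-error sensitivity is $\sup|y-\mu|$ over the evaluation range and the local-shift sensitivity equals 1 (unit slope):

```python
# Numerical gross-error (11) and local-shift (12) sensitivities for the mean
# functional on a bounded 1-d grid, where IF(y) = y - mu.
import numpy as np

mu = 0.5
ys = np.linspace(-3.0, 3.0, 601)           # bounded evaluation grid
IF = ys - mu                               # influence function of the mean

gamma_star = np.max(np.abs(IF))            # (11): sup |IF(y)|

# (12): max |IF(y1) - IF(y2)| / |y1 - y2| over all distinct grid pairs
d_if = np.abs(IF[:, None] - IF[None, :])
d_y = np.abs(ys[:, None] - ys[None, :])
mask = d_y > 0
lambda_star = np.max(d_if[mask] / d_y[mask])

print(round(gamma_star, 6))                # 3.5 = |-3 - 0.5|
print(round(lambda_star, 6))               # 1.0
```

The unbounded growth of $|y - \mu|$ is exactly why the sample mean has infinite gross-error sensitivity on an unbounded domain, and why the weighted estimating equations above are preferable.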
We remark that this formulation of an influence function implies that its expectation is zero, since
$$\mathbb{E}_{Y|\Psi}\big[IF(Y, T, F)\big] = -\mathbb{E}_{Y|\Psi}\big[\varphi(Y, \Psi)\big]\Bigg(\frac{\partial\,\mathbb{E}_{Y|\Psi}\big[\varphi(Y, \Psi)\big]}{\partial\Psi}\Bigg)^{-1} = 0.$$
In maximum log-likelihood estimation for independently distributed data, $\rho(y, \Psi) = \log\pi_{Y|\Psi}(y)$ and $\varphi(y, \Psi) = \frac{\partial \log\pi_{Y|\Psi}(y)}{\partial\Psi}$. The estimators of parameters in the GSt PPCA model family are MLE estimators for the logarithm of the probability density function defined by the two-dimensional integration over the function $m$, as shown in (9). Given the differentiability classes of the function $m$ shown in Lemma A.7, we can apply the Leibniz integral rule to swap the order of differentiation and integration, both for the function $m$ and for its first derivative, to specify the terms required to calculate the influence function of the PPCA model. The influence function of $\Psi$ is then expressed as
$$IF(y, T, F) = -\frac{\partial\log\pi_{Y|\Psi}(y)}{\partial\Psi}\Bigg(\mathbb{E}_{Y|\Psi}\bigg[\frac{\partial^2\log\pi_{Y|\Psi}(y)}{\partial\Psi^2}\bigg]\Bigg)^{-1},$$
where
$$\frac{\partial\log\pi_{Y|\Psi}(y)}{\partial\Psi} = \frac{\int\!\int \frac{\partial}{\partial\Psi}\, m(y_t, s_{\epsilon,t}, s_{x,t}; \Psi)\,ds_{\epsilon,t}\,ds_{x,t}}{\int\!\int m(y_t, s_{\epsilon,t}, s_{x,t}; \Psi)\,ds_{\epsilon,t}\,ds_{x,t}},\qquad
\frac{\partial^2\log\pi_{Y|\Psi}(y)}{\partial\Psi^2} = \frac{\partial}{\partial\Psi}\Bigg(\frac{\int\!\int \frac{\partial}{\partial\Psi}\, m(y_t, s_{\epsilon,t}, s_{x,t}; \Psi)\,ds_{\epsilon,t}\,ds_{x,t}}{\int\!\int m(y_t, s_{\epsilon,t}, s_{x,t}; \Psi)\,ds_{\epsilon,t}\,ds_{x,t}}\Bigg).$$
Consequently,
$$\mathbb{E}_{Y|\Psi}\Big[IF(Y, T, F)^T IF(Y, T, F)\Big] = \Bigg(\mathbb{E}_{Y|\Psi}\bigg[\frac{\partial^2\log\pi_{Y|\Psi}(y)}{\partial\Psi^2}\bigg]\Bigg)^{-1}\mathbb{E}_{Y|\Psi}\Bigg[\bigg(\frac{\partial\log\pi_{Y|\Psi}(y)}{\partial\Psi}\bigg)^T\frac{\partial\log\pi_{Y|\Psi}(y)}{\partial\Psi}\Bigg]\Bigg(\mathbb{E}_{Y|\Psi}\bigg[\frac{\partial^2\log\pi_{Y|\Psi}(y)}{\partial\Psi^2}\bigg]\Bigg)^{-1}.$$
Since $\mathbb{E}_{Y|\Psi}[IF(Y, T, F)] = 0$, the expression $\mathbb{E}_{Y|\Psi}[IF(Y, T, F)^T IF(Y, T, F)]$ represents the variance of the influence function under the distribution $F$ of $Y$.
Under the assumptions of the considered family of models, we may use Chebyshev's inequality to assess the probability of large values of an influence function,
$$P\Big(\big|IF(Y, T, F)\big| \geq a\Big) \leq \frac{\mathbb{E}_{Y|\Psi}\big[IF(Y, T, F)^2\big]}{a^2},$$
for some $a > 0$. Therefore, influence functions with lower variance are less likely to exceed the level $a$ and, consequently, the corresponding estimators are less likely to be sensitive to the effect of a local contamination.

In the classical theory of statistical inference, we assume knowledge of a model that characterises the data generating process. Based on this information, we derive estimators of parameters of interest and quantify their properties. However, robust theory argues that the perfect model is often not known, and even when it is known, it is an approximation of a reality that is given by a limited sample size. As argued in [46], robust theory considers the distribution of estimators not only under the true model but also under other probability distributions. This short study can be seen from the perspective that we have a set of realisations of a random process: we assume its distribution to follow some statistical model and derive estimators of its parameters.
The estimators are then just functions of the observed sample set, that is, the introduced functionals evaluated at its empirical distribution.

Given this interpretation, we illustrate an exercise that shows the behaviour of estimators under two scenarios:

S1 the true model that generates the data is consistent with the distributional assumption used to derive the estimator;

S2 the true model that generates the data is not consistent with the distributional assumption used to derive the estimator.

Therefore, scenarios (S1) and (S2) assume no misspecification and misspecification, respectively, of the model that characterizes an observation vector.

We derive the influence function for the parameters $\sigma^2$ and $W$ for three PPCA frameworks: the standard Gaussian PPCA of [61], the standard Student-t PPCA of [25], and the Grouped-t GSt PPCA, being Special Case 1 of the GSt PPCA family. In scenario (S1), we calculate influence functions on data sets that are consistent with the distributions of the PPCA models. In scenario (S2), we assume that the observation vector follows the Grouped-t GSt PPCA model and study what happens if the observed data reflects a more complex structure than assumed in the estimation. We show that fitting naive models such as Gaussian PPCA and Student-t PPCA, which do not have the structure of Grouped-t GSt PPCA or even Student-t GSt PPCA, loses efficiency and robustness in terms of the asymptotic variance and bias, as well as the sensitivity measures, the gross-error and local-shift sensitivities. We show the impact of the separation of the tail effect, and of grouped multiple-degree-of-freedom structures that define patterns of marginal heavy-tailed distributions, on the loss of robustness across the three PPCA models.
To examine numerically the robustness of the estimators, we conduct the following simulation study. We generate $M = 1000$ realisations of the $d$-dimensional random vector $Y\sim F$, denoted $y_1, \dots, y_M$. Depending on the scenario, the distribution $F$ refers to the model of the Gaussian PPCA, Student-t PPCA or Grouped-t GSt PPCA. We use stochastic approximations of the expectations $\mathbb{E}_{Y|\Psi}[\log\pi_{Y|\Psi}(Y)]$, $\mathbb{E}_{Y|\Psi}[\frac{\partial}{\partial\Psi}\log\pi_{Y|\Psi}(Y)]$ and $\mathbb{E}_{Y|\Psi}[\frac{\partial^2}{\partial\Psi^2}\log\pi_{Y|\Psi}(Y)]$ by the $M$ realisations of $Y$, formulated as combined Monte Carlo-quadrature approximations (two-dimensional quadrature for the inner integrals):
$$\mathbb{E}_{Y|\Psi}\big[\log\pi_{Y|\Psi}(Y)\big] \approx \frac{1}{M}\sum_{t=1}^M \log\Bigg(\int\!\!\int m(y_t, s_{\epsilon,t}, s_{x,t}; \Psi)\,ds_{\epsilon,t}\,ds_{x,t}\Bigg),$$
$$\mathbb{E}_{Y|\Psi}\Big[\frac{\partial}{\partial\Psi}\log\pi_{Y|\Psi}(Y)\Big] \approx \frac{1}{M}\sum_{t=1}^M\Bigg\{\frac{\int\!\int \frac{\partial}{\partial\Psi}\, m(y_t, s_{\epsilon,t}, s_{x,t}; \Psi)\,ds_{\epsilon,t}\,ds_{x,t}}{\int\!\int m(y_t, s_{\epsilon,t}, s_{x,t}; \Psi)\,ds_{\epsilon,t}\,ds_{x,t}}\Bigg\},$$
$$\mathbb{E}_{Y|\Psi}\Big[\frac{\partial^2}{\partial\Psi^2}\log\pi_{Y|\Psi}(Y)\Big] \approx \frac{1}{M}\sum_{t=1}^M\Bigg\{\frac{\partial}{\partial\Psi}\Bigg(\frac{\int\!\int \frac{\partial}{\partial\Psi}\, m(y_t, s_{\epsilon,t}, s_{x,t}; \Psi)\,ds_{\epsilon,t}\,ds_{x,t}}{\int\!\int m(y_t, s_{\epsilon,t}, s_{x,t}; \Psi)\,ds_{\epsilon,t}\,ds_{x,t}}\Bigg)\Bigg\}.$$
We assume as well that we know the true parameters of the PPCA models, and we calculate their influence functions under this assumption. The random vector $Y$ has dimensionality $d = 3$, the dimensionality of the latent vector $X$ is $k = 2$, $\mu = 0$, and the scale $\sigma^2$ and the $3\times 2$ projection matrix $W = (w_{ij})$ are fixed to prescribed numerical values.
For the standard Student-t PPCA, we examine its influence function across a grid of four values of the degrees of freedom $\nu$. On the other hand, the random vectors $\epsilon_t$ and $X_t$ under the Grouped-t GSt PPCA model have vectors of degrees of freedom $\nu_\epsilon$ and $\nu_x$ with components taking values in $\{4, 100\}$. This assumption allows us to have different assumptions on heavy tails per marginal and, due to the independence of $\epsilon_t$ and $X_t$, to separate the effect of heavy tails between the new representation and the perturbation. We remark that if all marginals of $\epsilon_t$ and $X_t$ have the same profiles of heavy tails, the Grouped-t GSt PPCA collapses to the Student-t GSt PPCA. Also, data that follows distributions implied by the Student-t PPCA or Grouped-t GSt PPCA models with degrees of freedom around 100 may be seen as analogous to the normally distributed case.

First, we examine the robustness of the PPCA models in terms of the asymptotic variance of their estimators. The study reveals how misspecification of the true distribution of an observation process impacts the asymptotic variance of the obtained estimators, that is, the precision of the estimation. We show that according to this criterion of robustness, the GSt PPCA model is the most efficient under both the correctly specified and misspecified scenarios, (S1) and (S2) respectively. Therefore, the efficiency aspect of the estimation is enhanced in the new class of techniques, as well as its sensitivity to perturbation, in comparison to the other studied baseline methods of Gaussian PPCA and Student-t PPCA.

We show that the efficiency of the estimator of $\sigma^2$, $\hat\sigma^2$, depends mostly on the assumptions on the distribution of the perturbation term. Next, we explain that the loss of robustness defined by the asymptotic variance of the estimator of $W$, $\hat W$, is not uniform across the elements of the projection matrix and is impacted by the marginal distribution assumptions of both $\epsilon_t$ and $X_t$.
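The inner two-dimensional integrals in the Monte Carlo-quadrature approximations above can be handled by a generic tensor-product quadrature rule on $[0,1]^2$. A sketch of ours with a placeholder integrand (not the paper's function $m$):

```python
# Tensor-product Gauss-Legendre quadrature on the unit square, the generic tool
# for integrals of the form \int_0^1 \int_0^1 f(s1, s2) ds1 ds2.
import numpy as np

def quad_unit_square(f, n=32):
    """Approximate the double integral of f over [0,1]^2 by Gauss-Legendre."""
    x, w = np.polynomial.legendre.leggauss(n)   # nodes/weights on [-1, 1]
    s = 0.5 * (x + 1.0)                          # map nodes to [0, 1]
    ws = 0.5 * w                                 # rescale weights
    S1, S2 = np.meshgrid(s, s, indexing="ij")
    W2 = np.outer(ws, ws)
    return np.sum(W2 * f(S1, S2))

# check on an integrand with a known answer:
# int int exp(-(s1+s2)) ds1 ds2 = (1 - e^{-1})^2
val = quad_unit_square(lambda a, b: np.exp(-(a + b)))
print(abs(val - (1 - np.exp(-1)) ** 2) < 1e-12)  # True
```

For smooth integrands on the unit square, such rules converge very quickly, which keeps the per-sample cost of the Monte Carlo outer loop manageable.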
When misspecification of the model is present in the estimation, and tail dependence and skewness structure are present in the data generating mechanism, this will impact the efficiency.

The asymptotic variances corresponding to scenario (S1) for Gaussian PPCA, Student-t PPCA and Grouped-t GSt PPCA are listed in Table 1. The models used for the estimators of the parameters $W$ and $\sigma^2$ are consistent with the assumptions about the distributions of the observation vectors. The table presents the asymptotic variances for Gaussian PPCA, or the median values of the asymptotic variances with the corresponding interquartile range across all considered degrees of freedom for Student-t PPCA and Grouped-t GSt PPCA. The interquartile range informs us about the dispersion of values across the combinations of the degrees of freedom. The Grouped-t GSt PPCA attains the lowest asymptotic variances, regardless of the degrees of freedom that characterize $Y$, seemingly uniformly across degrees of freedom and parameters; this outcome stems from the fact that both its median values and its interquartile ranges are the lowest.

Figure 1 shows the logarithm of the asymptotic variances for scenario (S2), when the observation data follows a Grouped-t GSt PPCA model with different cases for the degrees of freedom (x-axis and y-axis of the plots). The estimators of the parameters are derived under the Gaussian PPCA (a), Student-t PPCA (b) or Grouped-t GSt PPCA (c) models. The asymptotic variances are scaled per parameter to unify the colour scale. Labels of the y-axis and x-axis on all plots correspond to the multidimensional vectors of degrees of freedom $\nu_\epsilon$ and $\nu_x$ for $\epsilon_t$ and $X_t$, respectively. For instance, if $\nu_\epsilon = [4, 100, 4]$, the corresponding label on the y-axis is '4_100_4', and if $\nu_x = [100, 4]$, the label on the x-axis is '100_4'. Similarly to scenario (S1), the asymptotic variances under scenario (S2) are the lowest for the estimators derived under the Grouped-t GSt PPCA model.
If the elements of $\nu_\epsilon$ and $\nu_x$ are the same within each vector, the framework collapses to the Student-t GSt PPCA case. Hence, the robustness of the GSt PPCA class of methods is the highest according to this criterion. We remark on a few observations that might be of interest to the reader.

Figure 1: The logarithm of the asymptotic variances defined in (10) of the estimators for $\sigma^2$ and $W$ for Gaussian PPCA (a), Student-t PPCA (b) and Grouped-t GSt PPCA (c) under scenario (S2). The observation data follows the Student-t GSt PPCA model under different assumptions on the degrees of freedom $\nu_\epsilon$ (y-axis) and $\nu_x$ (x-axis). The column-wise order of the panels corresponds to the variances for different parameters. The row-wise order of the panels in panel (b) corresponds to the assumptions on the degree of freedom $\nu$ that the estimators are derived upon. Panel (c) presents results for Grouped-t GSt PPCA when the distributional assumptions used to derive the estimators and those characterising the observation data are consistent.

The asymptotic variance of $\hat\sigma^2$ decreases with an increasing number of light-tailed marginals in $\epsilon_t$. We remark that the heavy-tail profile of the first marginal of $\epsilon_t$ is important as well: if it is heavy-tailed, then the asymptotic variance of $\hat\sigma^2$ increases. The efficiency of $\hat W$ exhibits interesting behaviours when the sample data has marginal-specific assumptions on the distributions of $\epsilon_t$ and $X_t$. The asymptotic variance is element-specific for $\hat w_i$, $i = 1, \dots, 6$, and depends mostly on the heavy-tail profile of the marginal of $X_t$ onto which $Y_t$ is projected. However, its efficiency is also slightly impacted by the heavy-tailed distribution of the element-specific marginal of $\epsilon_t$ that corresponds to the element of $Y_t$ which is projected by $\hat w_i$.
For instance, the estimation of the first column of W has the highest precision when the first component of ν_x equals 4, that is, for the labels '4_100' and '4_4' on the x-axis, as these components of W project onto the first marginal of X_t. The same patterns of asymptotic variances, given the different combinations of values in ν_ε and ν_x that characterize the sample data, are observed for the estimators defined under the Gaussian PPCA and Student-t PPCA models. However, their asymptotic variances are significantly higher and more sensitive to the distribution of the error terms, especially when it is not captured by the model assumptions used for the estimation. The cases of Gaussian PPCA and Student-t PPCA with high degrees of freedom are the most illustrative here. For instance, consider the asymptotic variance of the element of W associated with the first marginal of ε_t: when that marginal is light-tailed, the precision of the parameter's estimation under these frameworks declines. In addition, when the Student-t PPCA model is assumed to be more heavy-tailed (the smallest considered values of ν), the corresponding estimator of σ loses efficiency when a combination of light- and heavy-tailed marginal profiles of ε_t and X_t is assumed. The framework also becomes less robust to data corrupted by a light-tailed perturbation. Lastly, we observe that the discrepancy between the asymptotic variances across the components of W is the highest under Grouped-t PPCA.

Table 1: The asymptotic variances defined in (10) of the estimators for σ and W for Gaussian PPCA, Student-t PPCA and Grouped-t GSt PPCA under scenario (S1), given their stochastic approximations. The measure is standardized per parameter. The values corresponding to Student-t PPCA and Grouped-t GSt PPCA reflect the median across all true models for the observation vector, which depend on the combinations of degrees of freedom. The values in the brackets represent the interquartile range.
The sensitivity of the estimators is another property that we discuss to compare the robustness of the three PPCA frameworks, using the notion of their influence functions. The sensitivity measures the maximum impact of outlying points on the estimators. We focus first on the gross-error sensitivity, defined in (11), which reflects the maximum impact of a single contamination on an estimator. It therefore conveys how much a perturbation of a single point in the data set decreases the information carried by the data set about the true parameter. Secondly, we study the local-shift sensitivity, defined in (12), which measures the effect of moving probability mass from one point of the domain of the random variable to another and its maximum effect on the estimator. It therefore reflects the maximum effect of deviations of points in the data on the estimation, standardized by the size of the deviation.

Given M realisations of the observation random vectors, we evaluate the influence functions point-wise. The gross-error sensitivity is calculated as the maximum absolute value of the evaluated influence functions for a generated data set. The local-shift sensitivity is calculated by measuring the pairwise L2 distances between realisations and, separately, between the corresponding influence functions; the maximum of the fractions defined in (12) is then obtained for the different cases of the data.

The numerical approximations of γ*(T, F) and λ*(T, F) under scenario (S1) for Gaussian PPCA, Student-t PPCA and Grouped-t GSt PPCA are listed in Table 2. The table presents the values for Gaussian PPCA, and the median values with the corresponding interquartile ranges across all considered degrees of freedom for Student-t PPCA and Grouped-t GSt PPCA. Both sensitivity measures confirm that the estimators under the Grouped-t GSt PPCA are the most robust.
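The point-wise recipe above can be mimicked in a toy setting, with simple location estimators in place of the PPCA estimators: evaluate an empirical sensitivity curve on a grid of contamination points, take its maximum absolute value as the gross-error sensitivity, and the maximum difference quotient as the local-shift sensitivity:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_t(df=4, size=200)     # one generated data set

def sensitivity_curve(estimator, sample, z_grid):
    """Empirical influence: rescaled effect of adding one contamination point z."""
    n = len(sample)
    base = estimator(sample)
    return np.array([(n + 1) * (estimator(np.append(sample, z)) - base)
                     for z in z_grid])

z = np.linspace(-10.0, 10.0, 101)      # grid of contamination points
sc_mean = sensitivity_curve(np.mean, x, z)
sc_median = sensitivity_curve(np.median, x, z)

def gross_error(sc):
    """gamma*: maximum absolute influence over the evaluated points."""
    return np.max(np.abs(sc))

def local_shift(sc, z_grid):
    """lambda*: maximum |IF(z_i) - IF(z_j)| / |z_i - z_j| over pairs of points."""
    d_if = np.abs(sc[:, None] - sc[None, :])
    d_z = np.abs(z_grid[:, None] - z_grid[None, :])
    mask = d_z > 0
    return np.max(d_if[mask] / d_z[mask])
```

For the mean the influence is unbounded (the gross-error sensitivity grows with the grid), while for the median it stays bounded; this is the qualitative behaviour that the tables below quantify for the PPCA estimators.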
Figure 2: The logarithm of the gross-error sensitivity from (11) of the estimators for σ and W for Gaussian PPCA (a), Student-t PPCA (b) and Grouped-t GSt PPCA (c) under scenario (S2), given a realisation of Y. The observation data follow a Student-t GSt PPCA model under different assumptions on the degrees of freedom ν_ε (y-axis) and ν_x (x-axis). The column-wise order of the panels corresponds to the values for the different parameters. The row-wise order of the panels in panel (b) corresponds to the assumptions on the degree of freedom ν that the estimators of σ and W are derived upon. Panel (c) presents results for Grouped-t GSt PPCA when the distributional assumptions used to derive the estimators and to characterise the observation data are consistent.

Figure 3: The logarithm of the local-shift sensitivity from (12) of the estimators for σ and W for Gaussian PPCA (a), Student-t PPCA (b) and Grouped-t GSt PPCA (c) under scenario (S2), given a realisation of Y. The observation data follow a Student-t GSt PPCA model under different assumptions on the degrees of freedom ν_ε (y-axis) and ν_x (x-axis). The column-wise order of the panels corresponds to the values for the different parameters. The row-wise order of the panels in panel (b) corresponds to the assumptions on the degree of freedom ν that the estimators of σ and W are derived upon. Panel (c) presents results for Grouped-t GSt PPCA when the distributional assumptions used to derive the estimators and to characterise the observation data are consistent.

Table 2: The gross-error sensitivity defined in (11) and the local-shift sensitivity defined in (12) of the estimators for σ and W for Gaussian PPCA, Student-t PPCA and Grouped-t GSt PPCA under scenario (S1), given M realisations of Y. The measures are standardised per parameter.
The values corresponding to Student-t PPCA and Grouped-t GSt PPCA reflect the median across all true models for the observation vector, which depend on the combinations of degrees of freedom. The values in the brackets represent the interquartile range.

The sensitivity of the estimators for the two measures under the misspecified case, scenario (S2), is illustrated in Figure 2 and Figure 3 for γ*(T, F) and λ*(T, F), respectively. The gross-error sensitivity exhibits patterns similar to those observed for the asymptotic variance. Given Grouped-t GSt PPCA, we observe a decline of the sensitivity of σ̂ to a single-point contamination when the distribution of the perturbation term is more light-tailed. However, when the estimator is derived under the Gaussian PPCA or Student-t PPCA models, a single-point contamination has the highest impact on the estimation of σ when the models do not capture the heavy-tailed distribution of ε_t. The sensitivity measure of the estimator under the Student-t PPCA model decreases when the assumed distribution is more heavy-tailed. This gain of robustness for the estimation of σ, which depends on ν, is more rapid than for the asymptotic variance. The gross-error sensitivity of Ŵ is element-specific and depends on the marginal-specific assumptions on the distributions of ε_t and X_t. We observe patterns similar to those of the asymptotic variance. Again, the estimation of the components of W under Gaussian PPCA and Student-t PPCA is more impacted by the distribution of ε_t, especially when the estimators have no flexibility to handle heavy-tailed data.
Also, the gross-error sensitivity of the estimators under the Student-t PPCA decreases when ν decreases and becomes more uniform across the different assumptions on the sample data.

The robustness of the estimators defined by the local-shift sensitivity, λ*(T, F), is illustrated in Figure 3. The sensitivity analysis by this measure yields patterns, especially for W, similar to those observed for the asymptotic variance. However, it is less impacted by the different assumptions on the distribution of ε_t. The Grouped-t GSt PPCA is significantly more robust according to this measure than the other PPCA frameworks. The main difference between the outcomes for the local-shift sensitivity and the other studied measures of robustness is the sensitivity of σ̂. We observe an almost uniform sensitivity of this estimator under Grouped-t GSt PPCA regardless of the data assumptions. When the estimator is derived under the Gaussian PPCA or Student-t PPCA models, it is least robust for data with a light-tailed perturbation. Also, we observe little change in the Student-t PPCA performance when the degrees of freedom ν change.

In the final part, we focus on the analysis of the bias of the estimation of σ and W for an increasing sample size. We narrow our study and examine the accuracy of the estimation for Student-t GSt PPCA (a special case of Grouped-t GSt PPCA), Student-t PPCA and Gaussian PPCA under the misspecified data case, scenario (S2), and increasing sample sizes starting from N = 100. We generate M = 50 replications of the sample for each distribution and sample-size assumption to numerically calculate the mean squared errors of the estimation, with the true values of the parameters specified as before. Figure 4 shows the change of the estimation accuracy for the PPCA frameworks over increasing sample size, measured by the mean squared error between σ and its estimate, and the mean squared error of W across all its elements.
We observe that an increasing sample size steadily improves the estimation via Student-t GSt PPCA (red line), especially of σ. The estimation via Student-t GSt PPCA is consistently highly accurate regardless of the degrees of freedom. When the distribution of ε_t is light-tailed, the accuracy of the estimators of Student-t PPCA is close to that of the more complex framework. For this data case, Gaussian PPCA estimates only σ well, and the discrepancy relative to the accuracy of the estimation via Student-t GSt PPCA grows with the sample size. On the other hand, when the distribution of X_t is light-tailed, the estimation via Student-t PPCA and Gaussian PPCA results in good estimates of W, however at the price of the estimators of σ. We remark that when the data follow a more complex distribution, such as the Grouped-t GSt PPCA model, the discrepancies in the estimation accuracy increase significantly and the standard methods struggle to provide accurate estimation.

Figure 4: The logarithm of the mean squared errors of the estimates of σ and W (across all its elements) for Gaussian PPCA, Student-t PPCA and Student-t GSt under scenario (S2) versus the sample size N. The column-wise order of the panels corresponds to the distributional assumption on Y_N, the degrees of freedom ν_ε of ε_t and ν_x of X_t, that define the Student-t GSt PPCA model of Y_N. The row-wise order corresponds to the logarithms of the mean squared errors per parameter. The colours of the lines correspond to the accuracy under the different PPCA model assumptions.

PCA or PPCA methods can be used to measure market concentration and the potential for diversification. They are often employed to identify highly concentrated assets or to reduce the complexity of large sets of financial instruments by transforming them into a new set of uncorrelated components.
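In the classical, non-probabilistic setting, the construction of uncorrelated components with decreasing variance reduces to an eigendecomposition of the sample covariance matrix. A minimal sketch on synthetic one-factor "returns" (an illustration of the mechanics only, not the GSt PPCA estimators):

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 1000, 6                          # observations, assets

# Synthetic returns driven by one common factor plus idiosyncratic noise
common = rng.standard_normal(T)
loadings = np.array([1.0, 0.9, 0.8, 0.7, 0.1, 0.05])
returns = np.outer(common, loadings) + 0.3 * rng.standard_normal((T, d))

# Eigendecomposition of the sample covariance; sort eigenvalues descending
cov = np.cov(returns, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

explained = eigvals / eigvals.sum()     # decreasing variance contributions
scores = returns @ eigvecs              # uncorrelated principal components
```

The first proportion in `explained` captures the common factor; assets loading mostly on the trailing components are candidates for exclusion in the risk-based selection described above.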
One example of an application of these components is a diversified risk-parity strategy that allocates portfolio weights to the original assets through the new representation. These feature extraction frameworks can be seen as a set of techniques revealing common factors, called principal components, in a way that best explains the variability in the original data. The transformation of the observation data into principal components is defined in such a way that the principal components have decreasing variances. The methods suggest how to lower the dimensionality of the original data set by excluding elements that are mostly described by the components with the least significant contribution to the overall variance, and therefore reduce the size of the investment universe of possible constituent assets in order to perform risk-based asset selection and weighting.

In the following part, we study the linear interactions between Bitcoin and 19 other altcoin crypto assets ranked highest on the list of top virtual currencies by market capitalisation over two separate periods: 2018, the so-called Initial Coin Offering (ICO) period in which most cryptocurrency projects were born, and 2019, the start of the lending markets, when the other altcoins started to be traded less frequently. The details of the considered assets are given in Table 3. The data for our study were collected from the Coin Metrics website (https://coinmetrics.io). We follow the categorisation of the assets from the CryptoSlate website (https://cryptoslate.com). We do not intend to present an optimal model that describes the dynamics of the studied data sets.
Our motivation is to emphasise the effects of the robustness of the family of GSt PPCA methods discussed in Section 5. We present the eigendecomposition of the covariance matrix of the set of crypto assets given its estimators defined by different PPCA models: the standard Gaussian PPCA of [61], the standard Student-t PPCA of [25], and three special cases of the family of GSt PPCA models introduced in Section 3: Grouped-t GSt PPCA as Special Case 1 from Section 3.1.1, Student-t GSt PPCA as Special Case 3, and Skew-t GSt PPCA as the simplified Special Case 2 discussed in Appendix B. We show that the analysis of the eigendecomposition of the covariance matrix of the observation sets can provide useful insights into distinguishable components of the variance that have economic interpretations. In the next section, we show that the flexibility of the PPCA frameworks from the GSt PPCA family results in a separation of the covariance matrix into components with a clearer economic interpretation than the other considered PPCA methods.

Table 3: The list of altcoin crypto assets from 01-01-2018 to 31-12-2019 with corresponding categories.

Ticker    Currency Name          Category
bat       Basic Attention Token  Advertising
powr      Power Ledger           Energy
bnb       Binance Coin           Exchange
omg       OmiseGO                Financial Service
xrp       XRP                    Financial Service
dash      Dash                   Governance
link      Chainlink              Interoperability
xmr       Monero                 Privacy
ada       Cardano                Smart Contracts
eos       EOS                    Smart Contracts
etc       Ethereum Classic       Smart Contracts
eth       Ethereum               Smart Contracts
lsk       Lisk                   Smart Contracts
neo       NEO                    Smart Contracts
xlm       Stellar                Smart Contracts
dai       Dai                    Stablecoin
usdt_eth  Tether (Ethereum)      Stablecoin
bch       Bitcoin Cash           Technology
btc       Bitcoin                Technology
ltc       Litecoin               Technology
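The returns studied below are standardized robustly with the Huber M-estimators of [62]. As a hedged sketch, a simplified joint location-scale iteration in the spirit of Huber's Proposal 2 (an illustration, not necessarily the exact estimator used in the paper) can be written as:

```python
import numpy as np
from math import erf, exp, pi, sqrt

def _phi(c):   # standard normal density
    return exp(-0.5 * c * c) / sqrt(2.0 * pi)

def _Phi(c):   # standard normal distribution function
    return 0.5 * (1.0 + erf(c / sqrt(2.0)))

def huber_standardize(x, c=1.345, tol=1e-8, max_iter=200):
    """Standardize x by a simplified joint Huber location/scale iteration.

    c = 1.345 is the usual tuning constant; beta makes the scale step
    Fisher-consistent at the Gaussian model. Starts from median and MAD.
    """
    x = np.asarray(x, dtype=float)
    beta = (2.0 * _Phi(c) - 1.0) - 2.0 * c * _phi(c) + 2.0 * c * c * (1.0 - _Phi(c))
    mu = float(np.median(x))
    sigma = max(float(np.median(np.abs(x - mu))) / 0.6745, 1e-12)
    for _ in range(max_iter):
        z = np.clip((x - mu) / sigma, -c, c)        # winsorized residuals
        mu_new = mu + sigma * z.mean()
        sigma_new = sigma * sqrt(np.mean(z ** 2) / beta)
        done = abs(mu_new - mu) < tol * sigma and abs(sigma_new - sigma) < tol * sigma
        mu, sigma = mu_new, sigma_new
        if done:
            break
    return (x - mu) / sigma, mu, sigma
```

Unlike standardization by the sample mean and standard deviation, the winsorized updates keep a few extreme returns from dominating the location and scale estimates.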
Figure 5: The standardized daily returns of the crypto currencies listed in Table 3.

To begin our study, we unify the magnitude of the assets' values over time by considering their standardized returns. We calculate standard daily returns, defined as the daily nominal change in price, referenced to the US Dollar stablecoin Tether (USDT) as numeraire. It was selected as the stablecoin with the highest market capitalization and utilization across all key exchanges. We divide the set of returns into subsets related to 2018 and 2019 and standardize them robustly, per currency exchange rate, by the Huber M-estimators of [62]. Figure 5 illustrates the daily returns of the currencies. The dashed vertical lines divide the set of returns into two separate periods, one corresponding to 2018 and the other to 2019. Figure 6 shows the marginal estimates of the skewness of the daily returns of the assets; it motivates us to include the model with skewness in our analysis.

Figure 8 illustrates the interactions between pairs of examined assets (off-diagonal panels) as well as histograms of the returns (diagonal panels) for the years 2018 (red) and 2019 (black), respectively. We observe weaker dependence between the stablecoins, USD Tether and Dai, and the other assets across the years. The panels in Figure 8 show a possibly increasing collinearity in 2019 between Dai and the three 'Technology' assets, Bitcoin, Bitcoin Cash and Litecoin. On the other hand, the dependence between USD Tether and the remaining assets stays weak. The weak collinearity between the stablecoins and the rest of the assets stems from their design: they have been created as a bridge between the highly volatile crypto currencies and the stability of fiat currencies, while still providing anonymity to their buyers. It is also worth pointing out the weakening dependence of Chainlink (link) on the other assets in 2019.
In 2018, we can still observe a moderate correlation between Chainlink and coins such as Basic Attention Token (bat), Cardano (ada), Dash (dash) and the assets from the 'Technology' category. It becomes significantly weaker in 2019, as shown in Figure 8.

Figure 6: The yearly sample estimates of univariate skewness for the returns of the crypto currencies listed in Table 3. The columns indicate the year of the sample. The estimates for the standardized returns are the same as for the non-standardized ones. The colours correspond to the different sample periods, 2018 (red) and 2019 (black).

We seek to compare the interpretation of the observed collinearity with the eigendecomposition of the covariance, which selects uncorrelated directions that explain the majority of the variance in the analysed data set of crypto assets. The optimal PPCA model choices and the resulting log-likelihoods for the PPCA frameworks are shown in Table 4. The overall results of the covariance decompositions have implications for the benefits of diversification, as they indicate that the majority of the altcoin crypto assets are driven by a common factor that is highly correlated with Bitcoin. This co-moving group of assets is characterized by the highest contribution to the overall variance. The identification of this collinearity can aid portfolio selection and management, as holding only one of these assets provides most of the diversification benefit and frees funds to be invested in other assets. On the other hand, the remaining principal directions indicate uncorrelated assets that can be used for risk-hedging purposes. The proportion of the market variance explained by the first component increases from 2018 to 2019, which suggests that Bitcoin returns in 2019 were the main driver of the altcoins and, consequently, of the overall variance in the market.
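The grid-based selection of the skewness parameter δ_x described below can be mimicked in a univariate toy analogue, with scipy's skew-normal standing in for the GSt model and a hypothetical candidate grid: each candidate skewness value is profiled over location and scale by maximum likelihood, and the value with the highest log-likelihood is kept.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Toy right-skewed data from a skew-normal with shape parameter a = 4
data = stats.skewnorm.rvs(a=4.0, size=2000, random_state=rng)

# Hypothetical symmetric candidate grid around zero (a = 0 means no skewness)
grid = [-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0]

def profile_loglik(a, x):
    """Log-likelihood at shape a, with location and scale fitted for that a."""
    _, loc, scale = stats.skewnorm.fit(x, fa=a)   # hold the shape fixed
    return stats.skewnorm.logpdf(x, a, loc=loc, scale=scale).sum()

best = max(grid, key=lambda a: profile_loglik(a, data))
```

If the best candidate is zero, the skewed model collapses to its symmetric counterpart, just as the Skew-t GSt PPCA simplifies to Student-t GSt PPCA below.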
We remark that the decomposition by the GSt PPCA family of methods is always characterized by a higher proportion of the first principal component in the overall variance. The observed estimates of sample skewness in Figure 6 motivate us to use a grid of candidate values per margin to select the skewness parameter δ_x for Skew-t GSt PPCA; the elements of the vector δ_x can take values in a symmetric grid of positive and negative values around zero. Only in 2018, for the non-robustly standardized returns, was the first component of X_t characterized by skewness, with the remaining components of δ_x equal to zero. Otherwise, the best model for Skew-t GSt PPCA assumes zero skewness and simplifies to the Student-t GSt PPCA model.

The estimates of the three eigenvectors corresponding to the highest eigenvalues given by each PPCA approach are illustrated in Figure 7 (a). The row-wise order of the panels corresponds to the different periods and standardisation methodologies. The labels of the y-axis correspond to the crypto assets, with category-specific font colours. The estimates of the principal directions are consistent with the interpretation of the linear interactions between the assets. In 2018, especially for the frameworks from the GSt PPCA family, the variance of the data set is decomposed into a principal direction that reflects the dependence between all crypto assets except the stablecoins. The remaining principal components are dominated by each of the stablecoins separately, USD Tether and Dai, respectively. Given the weak interaction between USD Tether and Dai in 2018, as in Figure 8, this is an expected finding. We therefore observe little collinearity between USD Tether and Dai, an observation that can be utilized to improve portfolio diversification.
Only Gaussian PPCA indicated, through its first component, a negative collinearity between the group of crypto assets correlated with Bitcoin and the stablecoin Dai. The contribution of the principal directions to the variance of the data set, measured by the eigenvalues of the covariance matrix, is illustrated in Figure 7 (b). The panels show the proportion between the first eigenvalue and the remaining ones. In 2018, we observe a higher contribution to the overall variance of the second principal direction, which represents the dynamics of Dai, especially for the Gaussian and Student-t PPCA. The variance explained by the third component, which characterizes Tether, is significantly lower. The methods from the GSt PPCA family suggest a higher proportion of the first principal component in the overall variance.

In 2019, all frameworks suggest higher loadings on the stablecoins in the first component than in 2018; the second principal direction concentrates solely on the dynamics of Chainlink, and the third component is dominated by the stablecoins. The frameworks from the GSt PPCA family suggest higher loadings on one of the stablecoins, Dai or USD Tether. Therefore, the stablecoins are recognized as collinear by all frameworks in 2019, and their correlation with the rest of the assets increases. The proportions of the eigenvalues illustrated in Figure 7 (b) suggest that the first component dominates the variance of the data set and the remaining ones have significantly lower contributions. Therefore, the share of the market variance explained by the stablecoins is reduced in 2019.

Figure 7: The yearly estimates of the eigenvectors of the covariance matrix of Y_t (a) and the proportions between the estimate of the first eigenvalue (d_1) and the remaining ones (d_2, d_3) (b), obtained via the PPCA algorithms (x-axis): Gaussian PPCA, Student-t PPCA, Student-t GSt PPCA, Grouped-t GSt PPCA and Skew-t GSt PPCA.
The column-wise order corresponds to the year of the sample set and the standardisation method applied to the returns. The observation process consists of the cryptocurrencies listed in Table 3, with category-specific assumptions on heavy tails where applicable.

The research presented in this work constitutes an important and novel contribution towards probabilistic feature extraction methods and their application to statistical modelling. We focused on developing a dimensionality reduction methodology which addresses a difficult but not uncommon situation in which the underlying observation data are not fully observed; that is, they both contain missing information and are corrupted by noise. We develop a framework of dimensionality reduction which adapts PPCA and extends the standard assumptions on the distribution of the observation data to heavy-tailed and skewed distributions. The method addresses a common situation in which the elements of an observation vector have non-trivial dependency structures. It is especially relevant to large data problems in which subsets of the observation vector represent different, complex information, often combined from various data sets. We therefore introduce a method which assumes many sources of corruption that are no longer identical across the dimensions of the observation vector. Crucially, the model improves the ability to capture tail dependence patterns in multivariate data analysis. The adaptation of the skew-t distribution in the PPCA framework adds flexibility to account for asymmetric distributions when they are relevant to an observation set. We derive an efficient Expectation-Maximisation algorithm for the estimation of the parameters in the new model. We assessed the robustness of the new class of PPCA methods by examining their estimation accuracy and various measures of sensitivity to corruption, defined by influence functions, in simulation studies.
The developed family of methods is characterized by the highest robustness in comparison to standard PPCA methods such as Gaussian PPCA or Student-t PPCA. If the sample data reveal characteristics captured by the GSt PPCA family, such as a separation of the tail effects or grouped multiple-degree-of-freedom structures that define patterns of marginal heavy-tailed distributions, we show a significant loss of accuracy and robustness for the simpler approaches and a high robustness of the GSt PPCA family. The class of GSt PPCA methods remains the most robust when the data follow simpler distributions. We illustrated the applicability and performance of the new class of methods in a real study in which we examine linear interactions and covariance decompositions on a data set of cryptocurrencies. We commented on the practical aspects of the exercise, such as the benefits of diversification, as the study identifies that the majority of the assets are driven by a common factor that is highly correlated with Bitcoin and has the highest contribution to the overall variance. This outcome can aid portfolio selection and management. Also, the remaining components of the decomposition reveal uncorrelated assets that can be used for risk-hedging purposes.

References

[1] H. Hotelling. Analysis of a complex of statistical variables into principal components.
Journal of Educational Psychology, 24(6):417–441, 1933. [2] I. T. Jolliffe.
Principal Component Analysis. Springer-Verlag New York, 2002. [3] René Vidal, Yi Ma, and S. S. Sastry.
Generalized Principal Component Analysis . Springer-Verlag New York,2016.[4] Christophe Croux and Gentiane Haesbroeck. Principal component analysis based on robust estimators of thecovariance or correlation matrix: influence functions and efficiencies.
Biometrika , 87(3):603–618, 2000.[5] F. De la Torre and M. J. Black. Robust principal component analysis for computer vision. In
Proceedings ofIEEE 8th International Conference on Computer Vision , pages 362–369, 2001.[6] Christophe Croux and Anne Ruiz-Gazen. High breakdown estimators for principal components: The projection-pursuit approach revisited.
Journal of Multivariate Analysis , 95(1):206–226, 2005.[7] Mia Hubert, Peter J Rousseeuw, and Karlien Vanden Branden. ROBPCA: A New Approach to Robust PrincipalComponent Analysis.
Technometrics , 47(1):64–79, 2005.[8] Christophe Croux, Peter Filzmoser, and M R Oliveira. Algorithms for Projection–Pursuit robust principal com-ponent analysis.
Chemometrics and Intelligent Laboratory Systems , 87(2):218–225, 2007.[9] JinHyeong Park, Zhenyue Zhang, Hongyuan Zha, and R. Kasturi. Local smoothing for manifold learning. In
Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition , vol-ume 2, pages 452–459, 2004.[10] D. Zhao, Z. Lin, and X. Tang. Laplacian pca and its applications. In
Proceedings of IEEE 11th InternationalConference on Computer Vision , pages 1–8, 2007.[11] Robert Tibshirani. Regression Shrinkage and Selection via the Lasso.
Journal of the Royal Statistical Society,Series B , 58(1):267–288, 1996.[12] Christophe Croux and Peter Filzmoser. Robust factorization of a data matrix. In
COMPSTAT, Proceedings inComputational Statistics , pages 245–250, 1998.[13] A. Ukkelberg and O. Borgen. Outlier detection by robust alternating regression.
Analytica Chimica Acta ,277(2):489–494, 1993.[14] Qifa Ke and Takeo Kanade. Robust L1 Norm Factorization in the Presence of Outliers and Missing Data by Al-ternative Convex Programming. In
Proceedings IEEE Conference on Computer Vision and Pattern Recognition ,pages 739–746, 2005.[15] Chris Ding, Ding Zhou, Xiaofeng He, and Hongyuan Zha. R1-PCA : Rotational Invariant L1-norm PrincipalComponent Analysis for Robust Subspace Factorization. In
Proceedings of the 23rd International Conference on Machine Learning, pages 281–288, 2006. [16] Thierry Bouwmans, Andrews Sobral, Sajid Javed, Soon Ki Jung, and El-Hadi Zahzah. Decomposition into low-rank plus additive matrices for background/foreground separation: A review for a comparative evaluation with a large-scale dataset.
Computer Science Review , 23:1–71, 2017.[17] John Wright, Allen Y. Yang, Arvind Ganesh, S. Shankar Sastry, and Yi Ma. Robust Face Recognition via SparseRepresentation.
IEEE Transactions on Pattern Analysis and Machine Intelligence , 31(2):210–227, 2009.[18] Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis?
Journal ofthe ACM , 58(3):1–37, 2011.[19] Z. Zhou, X. Li, J. Wright, E. Candès, and Y. Ma. Stable principal component pursuit. In
Proceedings of 2010IEEE International Symposium on Information Theory , pages 1518–1522, 2010.[20] Huan Xu, Constantine Caramanis, and Sujay Sanghavi. Robust PCA via Outlier Pursuit.
IEEE Transactions onInformation Theory , 58(5):3047–3064, 2012.[21] M. E. Tipping and C. M. Bishop. Probabilistic Principal Component Analysis.
Journal of the Royal StatisticalSociety, Series B , 61(3):622–661, 1999.[22] Sam Roweis. EM Algorithms for PCA and SPCA. In
Proceedings of Advances in Neural Information ProcessingSystems 10 , pages 626–632, 1998.[23] A. P. Dempster, M. N. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algo-rithm.
Journal of Royal Statistical Society, Series B , 39(1):1–38, 1977.[24] G. McLachlan and T. Krishnan.
The EM Algorithm and Extensions . Wiley, 1997.[25] D. de Ridder and V. Franc. Robust subspace mixture models using t -distributions. In
Proceedings of the BritishMachine Vision Conference 2003 , pages 319–328, 2003.[26] Zia Khan and Frank Dellaert. Robust Generative Subspace Modeling : The Subspace t Distribution. In
Proceed-ing of British Machine Vision Conference , pages 1–17, 2003.[27] C. Archambeau, N. Delannay, and M Verleysen. Robust Probabilistic Projections. In
Proceedings of the 23 rdInternational Conference on Machine Learning , pages 33–40, 2006.[28] Yi Fang and Myong K. Jeong. Robust Probabilistic Multivariate Calibration Model.
Technometrics , 50(3):305–316, 2008.[29] Tao Chen, Elaine Martin, and Gary Montague. Robust probabilistic PCA with missing data and contributionanalysis for outlier detection.
Computational Statistics and Data Analysis , 53(10):3706–3716, 2009.[30] J. Gao. Robust L1 Principal Component Analysis and Its Bayesian Variational Inference.
Neural Computation ,20(2):555–572, 2008.[31] Yue Guan and Jennifer Dy. Sparse probabilistic principal component analysis. In
Proceedings of the TwelthInternational Conference on Artificial Intelligence and Statistics , pages 185–192, 2009.[32] F. R. Bach and C. Archambeau. Sparse probabilistic projections. In
Proceedings of Advances in Neural Infor-mation Processing Systems 21 , pages 73–80, 2009.[33] Rajiv Khanna, Joydeep Ghosh, Russell Poldrack, and Oluwasanmi Koyejo. Sparse Submodular ProbabilisticPCA. In
Proceedings of the 18th International Conference on Artificial Intelligence and Statistics , pages 453–461, 2015.[34] Pengtao Xie and Eric Xing. Cauchy Principal Component Analysis.
Available at arXiv preprint arXiv:1412.6506 ,pages 1–14, 2015.[35] R. A. Redner and H. F. Walker. Mixture Densities, Maximum Likelihood and the EM Algorithm.
SIAM Review, 26(2):195–239, 1984. [36] D. Peel and G. J. McLachlan. Robust mixture modelling using the t distribution.
Statistics and Computing ,10:339–348, 2000.[37] C. Archambeau.
Probabilistic Models in Noisy Environments And their Application to a Visual Prosthesis for theBlind . Phd, Universite Catholique de Louvain, 2005.[38] Tao Chen, Julian Morris, and Elaine Martin. Probability Density Estimation via Infinite Gaussian Mixture Model: Application to Statistical Process Monitoring.
Journal of the Royal Statistical Society, Series C , 55(5):699–715,2006.[39] Stéphane Daul, Filip Lindskog, and Alexander McNeil. The Grouped t-copula with an Application to CreditRisk.
RISK, 16:73–76, 2003. [40] Xiaolin Luo and Pavel V. Shevchenko. The t copula with multiple parameters of degrees of freedom: bivariate characteristics and application to risk management.
Quantitative Finance , 10(9):1039–1054, 2010.[41] Xiaolin Luo and Pavel V. Shevchenko. Bayesian Model Choice of Grouped t-copula.
Methodology and Comput-ing in Applied Probability , 14(4):1097–1119, 2012.[42] O. Barndorff-Nielsen and P. Blaesild. Hyperbolic distributions and ramifications: Contributions to theory andapplication. In
Statistical Distributions in Scientific Work , pages 19–44, 1981.[43] S. Demarta and A. J. McNeil. The t Copula and Related Copulas.
International Statistical Review , 73(1):111–129, 2005.[44] K. Aas and I. H. Haff. The Generalised Hyperbolic Skew Student’s t-distribution.
Journal of Financial Econo-metrics , 4(2):275–309, 2006.[45] Christ Church. The Asymmetric t-Copula with Individual Degrees of Freedom. Master’s thesis, University ofOxford, 2012.[46] F. R. Hampel, E. M. Ronchetti, P. J. Rousseeuw, and Stahel W. A.
Robust Statistics: The Approach Based onInfluence Functions . Wiley, 1986.[47] Donald B Rubin and Dorothy T Thayer. EM algorithms for ML factor analysis.
Psychometrika , 47(1):69–76,1982.[48] C. F. J. Wu. On the Convergence properties of the EM Algorithm.
The Annals of Statistics , 11(1):95–103, 1983.[49] R. A. Boyles. On the Convergence of the EM Algorithm.
Journal of the Royal Statistical Societ, Series B ,45(1):44–50, 1983.[50] A. K. Gupta and D. K. Nagar.
Matrix Variate Distributions . Chapman & Hall, 1999.[51] Samuel Kotz and Saralees Nadarajah.
Multivariate t Distributions and Their Applications . Cambridge UniversityPress, 2004.[52] Dorota Toczydlowska.
Machine Learning Developments in Dependency Modelling and Feature Extraction . PhDthesis, University College London, 2019.[53] Dorota Toczydlowska and Gareth W. Peters. Financial Big Data Solutions for State Space Panel Regression inInterest Rate Dynamics.
Econometrics , 6(3), 2018.[54] R. J. A. Little and D. B Rubin.
Statistical Analysis with Missing Data . Wiley, 2002.[55] Ricardo Antonio Maronna. Robust M-Estimators of Multivariate Location and Scatter.
The Annals of Statistics ,4(1):51–67, 1976.[56] Peter J. Huber and Elvezio M. Ronchetti.
Robust Statistics . Wiley, Hoboken, 2009.[57] David E. Tyler. Statistical analysis for the angular central Gaussian distribution on the sphere.
Biometrika ,74(3):579—-589, 1987.[58] Frank R. Hampel. The influence curve and its role in robust estimation.
Journal of the American StatisticalAssociation , 69(346):383–393, 1974.[59] L. T. Fernholz. von Mises Calculus For Statistical Functionals . Lecture Notes in Statistics. Springer-Verlag NewYork, 1983.[60] A. W. van der Vaart.
Asymptotic Statistics . Cambridge University Press, 1998.[61] Michael E Tipping and Christopher M Bishop. Mixtures of Probabilistic Principal Component Analysers.
NeuralComputation , 11(2):443–482, 1999.[62] Peter J. Huber. Robust Estimation of a Location Parameter.
The Annals of Mathematical Statistics , 35(1):73–101,1964.[63] A K Gupta. Multivariate skew t-distribution.
Statistic , 37(4):359–363, 2003.[64] M Abramowitz and I A Stegun.
Handbook of Mathematical Functions with Formulas, Graphs, and MathematicalTables . Dover Publications, New York, 1974.[65] F. Faa di Bruno. On a New Formula of Differential Calculus.
The Quarterly Journal of Pure and AppliedMathematics , 1, 1857. 26
A Proofs of the EM Algorithm for Generalized Skew-t Probabilistic Principal Component Analysis
Lemma A.1.
Let $X$ be a $d$-dimensional random vector such that $X = [X_1, X_2]$ for a $d_1$-dimensional subvector $X_1$ and a $d_2$-dimensional subvector $X_2$, $d_1 + d_2 = d$. If $X$ follows a multivariate Gaussian distribution, that is
\[ X = [X_1, X_2] \sim \mathcal{N}\bigg( [\mu_1, \mu_2], \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \bigg), \]
then for $i, j = 1, 2$ and $i \neq j$ we have the following
\[ X_i \mid X_j \sim \mathcal{N}\big( \mu_i + (X_j - \mu_j)\Sigma_{jj}^{-1}\Sigma_{ji},\; \Sigma_{ii} - \Sigma_{ij}\Sigma_{jj}^{-1}\Sigma_{ji} \big) \]
and $X_i \sim \mathcal{N}(\mu_i, \Sigma_{ii})$, where $\mu_1$, $\mu_2$ are $d_1$- and $d_2$-dimensional real-valued vectors, respectively, $\Sigma_{11}$ and $\Sigma_{22}$ are $d_1 \times d_1$ and $d_2 \times d_2$ symmetric, positive-definite real-valued matrices, and $\Sigma_{12} = \Sigma_{21}^T$ is a $d_1 \times d_2$ real-valued matrix.

Proof. Please refer to Theorem 2.3.12 in [63].
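Lemma A.1 can be checked numerically in the bivariate case ($d_1 = d_2 = 1$), where the conditional moments reduce to scalar formulas. The sketch below (plain Python; all parameter values are illustrative) verifies that the bivariate Gaussian density equals the product of the conditional density of $X_1 \mid X_2$ and the marginal density of $X_2$:

```python
import math

def gauss_pdf(x, m, v):
    # univariate Gaussian density with mean m and variance v
    return math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2 * math.pi * v)

def bivariate_pdf(x1, x2, m1, m2, s11, s22, s12):
    # bivariate Gaussian density written via the 2x2 precision matrix
    det = s11 * s22 - s12 ** 2
    q = (s22 * (x1 - m1) ** 2
         - 2 * s12 * (x1 - m1) * (x2 - m2)
         + s11 * (x2 - m2) ** 2) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

# illustrative parameters of the joint distribution
m1, m2 = 0.5, -1.0
s11, s22, s12 = 2.0, 1.5, 0.7

# Lemma A.1: X1 | X2 = x2 is Gaussian with these moments
x1, x2 = 0.3, -0.2
cond_mean = m1 + (x2 - m2) / s22 * s12
cond_var = s11 - s12 ** 2 / s22

joint = bivariate_pdf(x1, x2, m1, m2, s11, s22, s12)
factored = gauss_pdf(x1, cond_mean, cond_var) * gauss_pdf(x2, m2, s22)
```

The identity holds at every point $(x_1, x_2)$, which is exactly the factorisation used repeatedly in the proofs below for $i = 1$, $j = 2$.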
Lemma A.2.
Let a $d$-dimensional observation vector $Y_t$ be modelled as in (1) under the Generalized Skew-t PPCA model defined in Section 3, with the latent processes $X_t$, $\epsilon_t$, $U_t$ and $V_t$ following the assumptions defined in (5) and (6). The joint probability function of the variables $Y_t$, $X_t$, $U_t$ and $V_t$ can be decomposed into the product of three functions
\[ \pi_{Y_t,X_t,U_t,V_t|\Psi}(y_t,x_t,u_t,v_t) = \pi_{X_t|Y_t,U_t,V_t,\Psi}(x_t)\,\pi_{Y_t|U_t,V_t,\Psi}(y_t)\,C(u_t,v_t;\Psi), \]
where $\pi_{X_t|Y_t,U_t,V_t,\Psi}(x_t)$ and $\pi_{Y_t|U_t,V_t,\Psi}(y_t)$ are the conditional probability functions of the $k$-dimensional random vector $X_t \mid Y_t, U_t, V_t \sim \mathcal{N}(\mu_{x,t}, \Sigma_{x,t})$ and the $d$-dimensional random vector $Y_t \mid U_t, V_t \sim \mathcal{N}(\mu_{y,t}, \Sigma_{y,t})$, respectively, with the first and second central moments of the distributions given by
\[ \mu_{x,t} = \big( (y_t - \mu - \delta_\epsilon D_{\epsilon,t}^{-1})\, D_{\epsilon,t} W + \sigma^2 \delta_x \big) M_t^{-1} \quad \text{and} \quad \Sigma_{x,t} = \sigma^2 M_t^{-1}, \]
\[ \mu_{y,t} = \mu + \delta_\epsilon D_{\epsilon,t}^{-1} + \sigma^2 \delta_x M_t^{-1} W^T D_{\epsilon,t} N_t^{-1} \quad \text{and} \quad \Sigma_{y,t} = \sigma^2 N_t^{-1}, \]
for $M_t = \sigma^2 D_{x,t} + W^T D_{\epsilon,t} W$ and $N_t = D_{\epsilon,t} - D_{\epsilon,t} W M_t^{-1} W^T D_{\epsilon,t}$. The function $C : \mathbb{R}^d_+ \times \mathbb{R}^k_+ \to \mathbb{R}$ is given by
\[ C(u_t,v_t;\Psi) = \pi_{U_t|\Psi}(u_t)\,\pi_{V_t|\Psi}(v_t)\,(\sigma^2)^{\frac{k}{2}}\, |D_{\epsilon,t}|^{\frac12}\, |D_{x,t}|^{\frac12}\, |M_t|^{-\frac12}\, |N_t|^{-\frac12} \]
\[ \times \exp\Big\{ -\frac12\, \delta_x \big( D_{x,t}^{-1} - \sigma^2 M_t^{-1} W^T D_{\epsilon,t} N_t^{-1} D_{\epsilon,t} W M_t^{-1} - \sigma^2 M_t^{-1} \big)\, \delta_x^T \Big\}. \]

Proof.
Using the chain rule of probability we obtain the following decomposition of the likelihood function
\[ \pi_{Y_t,X_t,U_t,V_t|\Psi}(y_t,x_t,u_t,v_t) = \pi_{Y_t|X_t,U_t,V_t,\Psi}(y_t)\,\pi_{X_t|U_t,V_t,\Psi}(x_t)\,\pi_{U_t|\Psi}(u_t)\,\pi_{V_t|\Psi}(v_t) \]
\[ = \pi_{U_t|\Psi}(u_t)\,\pi_{V_t|\Psi}(v_t)\,(2\pi)^{-\frac{d+k}{2}} (\sigma^2)^{-\frac{d}{2}}\, |D_{\epsilon,t}|^{\frac12}\, |D_{x,t}|^{\frac12} \exp\Big\{ -\frac12 (x_t-\mu_{x,t})\,\Sigma_{x,t}^{-1}\,(x_t-\mu_{x,t})^T \Big\} \]
\[ \times \exp\Big\{ -\frac12 \Big( \delta_x D_{x,t}^{-1}\delta_x^T + \frac{1}{\sigma^2}(y_t-\mu-\delta_\epsilon D_{\epsilon,t}^{-1})\, D_{\epsilon,t}\, (y_t-\mu-\delta_\epsilon D_{\epsilon,t}^{-1})^T - \mu_{x,t}\Sigma_{x,t}^{-1}\mu_{x,t}^T \Big) \Big\}, \quad (13) \]
where $\Sigma_{x,t} = \sigma^2 M_t^{-1}$, $\mu_{x,t} = \big( (y_t-\mu-\delta_\epsilon D_{\epsilon,t}^{-1})\, D_{\epsilon,t} W + \sigma^2 \delta_x \big) M_t^{-1}$ and $M_t = \sigma^2 D_{x,t} + W^T D_{\epsilon,t} W$. Let us denote
\[ \pi_{X_t|Y_t,U_t,V_t,\Psi}(x_t) = (2\pi)^{-\frac{k}{2}}\, |\Sigma_{x,t}|^{-\frac12} \exp\Big\{ -\frac12 (x_t-\mu_{x,t})\,\Sigma_{x,t}^{-1}\,(x_t-\mu_{x,t})^T \Big\}, \]
and remark that it is the probability function of a $k$-dimensional Gaussian random variable with mean vector $\mu_{x,t}$ and covariance matrix $\Sigma_{x,t}$. The probability function $\pi_{X_t|Y_t,U_t,V_t,\Psi}(x_t)$ contains all expressions with the vector $x_t$ present in the probability function $\pi_{Y_t,X_t,U_t,V_t|\Psi}(y_t,x_t,u_t,v_t)$. In the next step we show how to obtain the formula for the probability function $\pi_{Y_t|U_t,V_t,\Psi}(y_t)$ which, jointly with the function $\pi_{X_t|Y_t,U_t,V_t,\Psi}(x_t)$, contains all expressions with the vector $y_t$.

Using (13) and completing the square in $y_t$, we obtain the following formulation
\[ \pi_{Y_t,X_t,U_t,V_t|\Psi}(y_t,x_t,u_t,v_t) = \pi_{X_t|Y_t,U_t,V_t,\Psi}(x_t)\,\pi_{U_t|\Psi}(u_t)\,\pi_{V_t|\Psi}(v_t)\,(2\pi)^{-\frac{d}{2}} (\sigma^2)^{-\frac{d-k}{2}}\, |D_{\epsilon,t}|^{\frac12}\, |D_{x,t}|^{\frac12}\, |M_t|^{-\frac12} \]
\[ \times \exp\Big\{ -\frac{1}{2\sigma^2}\, \underbrace{(y_t - \mu - \delta_\epsilon D_{\epsilon,t}^{-1} - \sigma^2\delta_x M_t^{-1} W^T D_{\epsilon,t} N_t^{-1})}_{y_t - \mu_{y,t}}\, N_t\, \underbrace{(y_t - \mu - \delta_\epsilon D_{\epsilon,t}^{-1} - \sigma^2\delta_x M_t^{-1} W^T D_{\epsilon,t} N_t^{-1})^T}_{(y_t - \mu_{y,t})^T} \Big\} \]
\[ \times \exp\Big\{ -\frac12\, \delta_x \big( D_{x,t}^{-1} - \sigma^2 M_t^{-1} W^T D_{\epsilon,t} N_t^{-1} D_{\epsilon,t} W M_t^{-1} - \sigma^2 M_t^{-1} \big)\, \delta_x^T \Big\}, \]
where $N_t = D_{\epsilon,t} - D_{\epsilon,t} W M_t^{-1} W^T D_{\epsilon,t}$. Let us denote $\mu_{y,t} = \mu + \delta_\epsilon D_{\epsilon,t}^{-1} + \sigma^2 \delta_x M_t^{-1} W^T D_{\epsilon,t} N_t^{-1}$ and define
\[ \pi_{Y_t|U_t,V_t,\Psi}(y_t) = (2\pi)^{-\frac{d}{2}}\, |\Sigma_{y,t}|^{-\frac12} \exp\Big\{ -\frac12 (y_t-\mu_{y,t})\,\Sigma_{y,t}^{-1}\,(y_t-\mu_{y,t})^T \Big\}. \]
The probability function $\pi_{Y_t|U_t,V_t,\Psi}(y_t)$ is the density of the $d$-dimensional random variable which follows the Gaussian distribution with mean vector $\mu_{y,t}$ and covariance matrix $\Sigma_{y,t} = \sigma^2 N_t^{-1}$.

Using the definition of the density function we obtain that
\[ \pi_{Y_t,X_t,U_t,V_t|\Psi}(y_t,x_t,u_t,v_t) = \pi_{X_t|Y_t,U_t,V_t,\Psi}(x_t)\,\pi_{Y_t|U_t,V_t,\Psi}(y_t)\,\pi_{U_t|\Psi}(u_t)\,\pi_{V_t|\Psi}(v_t) \]
\[ \times (\sigma^2)^{\frac{k}{2}}\, |D_{\epsilon,t}|^{\frac12}\, |D_{x,t}|^{\frac12}\, |M_t|^{-\frac12}\, |N_t|^{-\frac12} \exp\Big\{ -\frac12\, \delta_x \big( D_{x,t}^{-1} - \sigma^2 M_t^{-1} W^T D_{\epsilon,t} N_t^{-1} D_{\epsilon,t} W M_t^{-1} - \sigma^2 M_t^{-1} \big)\, \delta_x^T \Big\}. \]
Therefore, denoting
\[ C(u_t,v_t;\Psi) = \pi_{U_t|\Psi}(u_t)\,\pi_{V_t|\Psi}(v_t)\,(\sigma^2)^{\frac{k}{2}}\, |D_{\epsilon,t}|^{\frac12}\, |D_{x,t}|^{\frac12}\, |M_t|^{-\frac12}\, |N_t|^{-\frac12} \]
\[ \times \exp\Big\{ -\frac12\, \delta_x \big( D_{x,t}^{-1} - \sigma^2 M_t^{-1} W^T D_{\epsilon,t} N_t^{-1} D_{\epsilon,t} W M_t^{-1} - \sigma^2 M_t^{-1} \big)\, \delta_x^T \Big\}, \]
we obtain the following decomposition of the joint probability function
\[ \pi_{Y_t,X_t,U_t,V_t|\Psi}(y_t,x_t,u_t,v_t) = \pi_{X_t|Y_t,U_t,V_t,\Psi}(x_t)\,\pi_{Y_t|U_t,V_t,\Psi}(y_t)\,C(u_t,v_t;\Psi). \]

Lemma A.3.
Let us recall the probability function $\pi_{X_t|Y_t,U_t,V_t,\Psi}(x_t)$ of the $k$-dimensional Gaussian random variable $X_t$ with the covariance matrix $\Sigma_{x,t}$ and the mean vector $\mu_{x,t}$ defined in Lemma A.2. Under the assumptions of the Generalized Skew-t PPCA specified in Section 3, the solution to the following integration problem
\[ \int_{\mathbb{R}^k} \log\big( \pi_{Y_t,X_t,U_t,V_t|\Psi^*}(y_t,x_t,u_t,v_t) \big)\, \pi_{X_t|Y_t,U_t,V_t,\Psi}(x_t)\, d x_t \]
is equal to the function $w : \mathbb{R}^d \times \mathbb{R}^d_+ \times \mathbb{R}^k_+ \to \mathbb{R}$ defined as following
\[ w(y_t,u_t,v_t;\Psi,\Psi^*) = \log\pi_{U_t|\Psi^*}(u_t) + \log\pi_{V_t|\Psi^*}(v_t) - \frac{k+d}{2}\log 2\pi - \frac{d}{2}\log\sigma^{*2} + \frac12\sum_{i=1}^d \log u_{it} + \frac12\sum_{j=1}^k \log v_{jt} - \frac12\, \delta^*_x D_{x,t}^{-1}\delta^{*T}_x \]
\[ - \frac{1}{2\sigma^{*2}} \big( y_t - \mu^* - \delta^*_\epsilon D_{\epsilon,t}^{-1} \big)\, D_{\epsilon,t}\, \big( y_t - \mu^* - \delta^*_\epsilon D_{\epsilon,t}^{-1} \big)^T - \frac12 \operatorname{Tr}\Big\{ \sigma^2 M_t^{-1} \Big( \frac{1}{\sigma^{*2}} W^{*T} D_{\epsilon,t} W^* + D_{x,t} \Big) \Big\} \]
\[ + \Big( \big( y_t - \mu - \delta_\epsilon D_{\epsilon,t}^{-1} \big) D_{\epsilon,t} W + \sigma^2 \delta_x \Big) M_t^{-1} \Big( \frac{1}{\sigma^{*2}} W^{*T} D_{\epsilon,t} \big( y_t - \mu^* - \delta^*_\epsilon D_{\epsilon,t}^{-1} \big)^T + \delta^{*T}_x \Big) \]
\[ - \frac12 \operatorname{Tr}\Big\{ M_t^{-1} \Big( \big( y_t - \mu - \delta_\epsilon D_{\epsilon,t}^{-1} \big) D_{\epsilon,t} W + \sigma^2 \delta_x \Big)^T \Big( \big( y_t - \mu - \delta_\epsilon D_{\epsilon,t}^{-1} \big) D_{\epsilon,t} W + \sigma^2 \delta_x \Big) M_t^{-1} \Big( \frac{1}{\sigma^{*2}} W^{*T} D_{\epsilon,t} W^* + D_{x,t} \Big) \Big\}, \]
where $M_t = \sigma^2 D_{x,t} + W^T D_{\epsilon,t} W$.

Proof. Let us recall the probability function $\pi_{X_t|Y_t,U_t,V_t,\Psi}(x_t)$ defined in Lemma A.2, which is the density of a $k$-dimensional Gaussian random variable $X_t$ with the covariance matrix $\Sigma_{x,t} = \sigma^2 M_t^{-1}$ and the mean vector $\mu_{x,t} = \big( (y_t - \mu - \delta_\epsilon D_{\epsilon,t}^{-1})\, D_{\epsilon,t} W + \sigma^2 \delta_x \big) M_t^{-1}$, where $M_t = \sigma^2 D_{x,t} + W^T D_{\epsilon,t} W$.
Hence, the solution to the integration problem is given by
\[ w(y_t,u_t,v_t;\Psi,\Psi^*) = \int_{\mathbb{R}^k} \log\big( \pi_{Y_t,X_t,U_t,V_t|\Psi^*}(y_t,x_t,u_t,v_t) \big)\, \pi_{X_t|Y_t,U_t,V_t,\Psi}(x_t)\, d x_t \]
\[ = \int_{\mathbb{R}^k} \Big( \log\pi_{Y_t|X_t,U_t,V_t,\Psi^*}(y_t) + \log\pi_{X_t|U_t,V_t,\Psi^*}(x_t) + \log\pi_{U_t|\Psi^*}(u_t) + \log\pi_{V_t|\Psi^*}(v_t) \Big)\, \pi_{X_t|Y_t,U_t,V_t,\Psi}(x_t)\, d x_t. \]
The integrand is a quadratic function of $x_t$, so the integral follows from the first and second non-central moments of $X_t \mid Y_t, U_t, V_t$, namely $\mathbb{E}[X_t] = \mu_{x,t}$ and $\mathbb{E}[X_t^T X_t] = \sigma^2 M_t^{-1} + \mu_{x,t}^T \mu_{x,t}$. Substituting these moments term by term yields the expression for $w(y_t,u_t,v_t;\Psi,\Psi^*)$ stated in the lemma.

Lemma A.4.
Let us consider the partition of the observation vector $Y_t = [Y^o_t, Y^m_t]$ into the subvectors with observed and missing entries, $Y^o_t$ and $Y^m_t$, respectively. The observation vector $Y_t$ is modelled as in (1) under the assumptions stated in Subsection 3.1. The conditional distribution of the random vector $Y_t \mid U_t, V_t, \Psi$, derived in Lemma A.2, is Gaussian with the mean vector $\mu_{y,t} = \mu + \delta_\epsilon D_{\epsilon,t}^{-1} + \sigma^2 \delta_x M_t^{-1} W^T D_{\epsilon,t} N_t^{-1}$ and the covariance matrix $\Sigma_{y,t} = \sigma^2 N_t^{-1}$, where $M_t = \sigma^2 D_{x,t} + W^T D_{\epsilon,t} W$ and $N_t = D_{\epsilon,t} - D_{\epsilon,t} W M_t^{-1} W^T D_{\epsilon,t}$. The conditional distribution $Y^m_t \mid Y^o_t, U_t, V_t, \Psi$ is also Gaussian, with the mean vector $\tilde\mu_{Y^m,t}$ and the covariance matrix $\tilde\Sigma_{Y^m,t}$ such that
\[ \tilde\mu_{Y^m,t} = \mu^m_{y,t} + \big( Y^o_t - \mu^o_{y,t} \big)\, \Sigma^{oo,-1}_{y,t}\, \Sigma^{om}_{y,t}, \qquad \tilde\Sigma_{Y^m,t} = \Sigma^{mm}_{y,t} - \Sigma^{mo}_{y,t}\, \Sigma^{oo,-1}_{y,t}\, \Sigma^{om}_{y,t}. \]
In addition, the marginal distribution of $Y^o_t \mid U_t, V_t, \Psi$ is Gaussian with the mean vector $\mu^o_{y,t}$ and the covariance matrix $\Sigma^{oo}_{y,t}$. The subvectors $\mu^o_{y,t}$ and $\mu^m_{y,t}$ contain the elements of the vector $\mu_{y,t}$ which correspond to the observed or missing entries of the observation vector $Y_t$, respectively, and are $d_o$- and $d_m$-dimensional. The square matrices $\Sigma^{oo}_{y,t}$ and $\Sigma^{mm}_{y,t}$ contain the elements of the matrix $\Sigma_{y,t}$ which correspond by rows and columns to the observed or missing entries of the observation vector $Y_t$, respectively, and are $d_o \times d_o$- and $d_m \times d_m$-dimensional. The non-square matrix $\Sigma^{mo}_{y,t} = \Sigma^{om\,T}_{y,t}$ contains the elements of the matrix $\Sigma_{y,t}$ which correspond by rows to the missing and by columns to the observed entries of the observation vector $Y_t$, and is $d_m \times d_o$-dimensional.

Proof. Please refer to Lemma A.2 for the derivation of the conditional distribution $Y_t \mid U_t, V_t, \Psi$ and apply the formulas for the standard conditional Gaussian distribution from Lemma A.1.

Lemma A.5.
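The conditional moments in Lemma A.4 reduce to scalar formulas when $d = 2$ with one observed and one missing coordinate. The sketch below (plain Python; the numeric values standing in for $\mu_{y,t}$ and $\Sigma_{y,t}$ are illustrative placeholders, not model outputs) computes $\tilde\mu_{Y^m,t}$ and $\tilde\Sigma_{Y^m,t}$ and assembles the first and second non-central moments of $Y_t$ that are used in Lemma A.5:

```python
# illustrative conditional-on-(U_t, V_t) moments of Y_t = [Y_o, Y_m]
mu_o, mu_m = 1.0, 2.0             # mean of the observed / missing part
s_oo, s_mm, s_om = 1.0, 0.5, 0.3  # blocks of the covariance matrix
y_o = 1.4                         # realised observed entry

# Lemma A.4: conditional moments of the missing entry given the observed one
mu_tilde = mu_m + (y_o - mu_o) / s_oo * s_om
sigma_tilde = s_mm - s_om * s_om / s_oo

# first and second non-central moments of Y_t (as needed in Lemma A.5):
# E[Y^T Y] = blockdiag(0, sigma_tilde) + E[Y]^T E[Y]
E_y = [y_o, mu_tilde]
E_yty = [[E_y[0] * E_y[0],               E_y[0] * E_y[1]],
         [E_y[1] * E_y[0], sigma_tilde + E_y[1] * E_y[1]]]
```

Only the block of $\mathbb{E}[Y_t^T Y_t]$ indexed by missing entries receives the correction $\tilde\Sigma_{Y^m,t}$, since the observed entries are fixed under the conditioning.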
Let us consider the partition of the observation vector $Y_t = [Y^o_t, Y^m_t]$ into the subvectors with observed and missing entries, $Y^o_t$ and $Y^m_t$, respectively. The observation vector $Y_t$ is modelled as in (1) under the assumptions of the Independent Generalized Skew-t PPCA stated in Subsection 3.1. Let us recall the function $w(y_t,u_t,v_t;\Psi,\Psi^*)$ defined in Lemma A.3 and the conditional distribution of the random vector $Y^m_t \mid Y^o_t, U_t, V_t, \Psi$ specified in Lemma A.4. Given realisations of the vector $Y_t$ at time points $t = 1, \ldots, N$ with observable components $y^o_t$, the function $v$ defined by the following integration problem
\[ v(y^o_t,u_t,v_t;\Psi,\Psi^*) = \int_{\mathbb{R}^{d_m}} w(y_t,u_t,v_t;\Psi,\Psi^*)\, \pi_{Y^m_t|Y^o_t,U_t,V_t,\Psi}(y^m_t)\, d y^m_t \quad (14) \]
can be expressed as
\[ v(y^o_t,u_t,v_t;\Psi,\Psi^*) = \log\pi_{U_t|\Psi^*}(u_t) + \log\pi_{V_t|\Psi^*}(v_t) - \frac{k+d}{2}\log 2\pi - \frac{d}{2}\log\sigma^{*2} + \frac12\sum_{i=1}^d \log u_{it} + \frac12\sum_{j=1}^k \log v_{jt} - \frac12\, \delta^*_x D_{x,t}^{-1}\delta^{*T}_x \]
\[ - \frac{1}{2\sigma^{*2}} \operatorname{Tr}\big\{ \mathbb{E}_{Y^m_t|Y^o_t,U_t,V_t,\Psi}[Y_t^T Y_t]\, D_{\epsilon,t} \big\} + \frac{1}{\sigma^{*2}}\, \mathbb{E}_{Y^m_t|Y^o_t,U_t,V_t,\Psi}[Y_t]\, \big( D_{\epsilon,t}\mu^{*T} + \delta^{*T}_\epsilon \big) - \frac{1}{2\sigma^{*2}}\, \mu^* D_{\epsilon,t}\mu^{*T} - \frac{1}{\sigma^{*2}}\, \mu^*\delta^{*T}_\epsilon - \frac{1}{2\sigma^{*2}}\, \delta^*_\epsilon D_{\epsilon,t}^{-1}\delta^{*T}_\epsilon \]
\[ + \frac{1}{\sigma^{*2}} \operatorname{Tr}\big\{ \mathbb{E}_{Y^m_t|Y^o_t,U_t,V_t,\Psi}[Y_t^T Y_t]\, D_{\epsilon,t} W M_t^{-1} W^{*T} D_{\epsilon,t} \big\} + \mathbb{E}_{Y^m_t|Y^o_t,U_t,V_t,\Psi}[Y_t]\, D_{\epsilon,t} W M_t^{-1} \Big( -\frac{1}{\sigma^{*2}} W^{*T} D_{\epsilon,t}\mu^{*T} - \frac{1}{\sigma^{*2}} W^{*T}\delta^{*T}_\epsilon + \delta^{*T}_x \Big) \]
\[ + \frac{1}{\sigma^{*2}} \big( -\mu D_{\epsilon,t} W - \delta_\epsilon W + \sigma^2\delta_x \big) M_t^{-1} W^{*T} D_{\epsilon,t}\, \mathbb{E}_{Y^m_t|Y^o_t,U_t,V_t,\Psi}[Y_t]^T + \big( -\mu D_{\epsilon,t} W - \delta_\epsilon W + \sigma^2\delta_x \big) M_t^{-1} \Big( -\frac{1}{\sigma^{*2}} W^{*T} D_{\epsilon,t}\mu^{*T} - \frac{1}{\sigma^{*2}} W^{*T}\delta^{*T}_\epsilon + \delta^{*T}_x \Big) \]
\[ - \frac12\operatorname{Tr}\Big\{ M_t^{-1} W^T D_{\epsilon,t}\, \mathbb{E}_{Y^m_t|Y^o_t,U_t,V_t,\Psi}[Y_t^T Y_t]\, D_{\epsilon,t} W M_t^{-1} \Big( \frac{1}{\sigma^{*2}} W^{*T} D_{\epsilon,t} W^* + D_{x,t} \Big) \Big\} \]
\[ - \operatorname{Tr}\Big\{ M_t^{-1} W^T D_{\epsilon,t}\, \mathbb{E}_{Y^m_t|Y^o_t,U_t,V_t,\Psi}[Y_t]^T \big( -\mu D_{\epsilon,t} W - \delta_\epsilon W + \sigma^2\delta_x \big) M_t^{-1} \Big( \frac{1}{\sigma^{*2}} W^{*T} D_{\epsilon,t} W^* + D_{x,t} \Big) \Big\} \]
\[ - \frac12\operatorname{Tr}\Big\{ M_t^{-1} \big( -\mu D_{\epsilon,t} W - \delta_\epsilon W + \sigma^2\delta_x \big)^T \big( -\mu D_{\epsilon,t} W - \delta_\epsilon W + \sigma^2\delta_x \big) M_t^{-1} \Big( \frac{1}{\sigma^{*2}} W^{*T} D_{\epsilon,t} W^* + D_{x,t} \Big) \Big\} - \frac{\sigma^2}{2}\operatorname{Tr}\Big\{ M_t^{-1} \Big( \frac{1}{\sigma^{*2}} W^{*T} D_{\epsilon,t} W^* + D_{x,t} \Big) \Big\}. \]
Using the mean $\tilde\mu_{Y^m,t}$ and the covariance matrix $\tilde\Sigma_{Y^m,t}$ of the random vector $Y^m_t \mid Y^o_t, U_t, V_t, \Psi$ from Lemma A.4, the first and second non-central moments of the vector $Y_t$ are the following
\[ \mathbb{E}_{Y^m_t|Y^o_t,U_t,V_t,\Psi}[Y_t] = [y^o_t, \tilde\mu_{Y^m,t}], \qquad \mathbb{E}_{Y^m_t|Y^o_t,U_t,V_t,\Psi}[Y_t^T Y_t] = \begin{bmatrix} 0 & 0 \\ 0 & \tilde\Sigma_{Y^m,t} \end{bmatrix} + \mathbb{E}_{Y^m_t|Y^o_t,U_t,V_t,\Psi}[Y_t]^T\, \mathbb{E}_{Y^m_t|Y^o_t,U_t,V_t,\Psi}[Y_t]. \]

Proof.
Let us recall that the observation vector $Y_t = [Y^o_t, Y^m_t]$ is partitioned into the $d_o$-dimensional subvector with observed entries, $Y^o_t$, and the $d_m$-dimensional subvector with unobserved entries, $Y^m_t$, such that $d_m = d - d_o$. Given the function $w(y_t,u_t,v_t;\Psi,\Psi^*)$ defined in Lemma A.3, the solution to the integration problem from (14) is derived as follows
\[ v(y^o_t,u_t,v_t;\Psi,\Psi^*) = \int_{\mathbb{R}^{d_m}} w(y_t,u_t,v_t;\Psi,\Psi^*)\, \pi_{Y^m_t|Y^o_t,U_t,V_t,\Psi}(y^m_t)\, d y^m_t. \]
The probability function $\pi_{Y^m_t|Y^o_t,U_t,V_t,\Psi}(y^m_t)$ is specified in Lemma A.4, as are the first and second non-central moments of the distribution $Y^m_t \mid Y^o_t, U_t, V_t, \Psi$. Consequently, we have the following solutions to the integration problems
\[ \int_{\mathbb{R}^{d_m}} y_t\, \pi_{Y^m_t|Y^o_t,U_t,V_t,\Psi}(y^m_t)\, d y^m_t = \mathbb{E}_{Y^m_t|Y^o_t,U_t,V_t,\Psi}[Y_t] = [y^o_t, \tilde\mu_{Y^m,t}], \]
\[ \int_{\mathbb{R}^{d_m}} y_t^T y_t\, \pi_{Y^m_t|Y^o_t,U_t,V_t,\Psi}(y^m_t)\, d y^m_t = \mathbb{E}_{Y^m_t|Y^o_t,U_t,V_t,\Psi}[Y_t^T Y_t] = \begin{bmatrix} 0 & 0 \\ 0 & \tilde\Sigma_{Y^m,t} \end{bmatrix} + \mathbb{E}_{Y^m_t|Y^o_t,U_t,V_t,\Psi}[Y_t]^T\, \mathbb{E}_{Y^m_t|Y^o_t,U_t,V_t,\Psi}[Y_t], \]
for $\tilde\mu_{Y^m,t} = \mu^m_{y,t} + (y^o_t - \mu^o_{y,t})\, \Sigma^{oo,-1}_{y,t}\, \Sigma^{om}_{y,t}$ and $\tilde\Sigma_{Y^m,t} = \Sigma^{mm}_{y,t} - \Sigma^{mo}_{y,t}\, \Sigma^{oo,-1}_{y,t}\, \Sigma^{om}_{y,t}$, where $\mu_{y,t}$ and $\Sigma_{y,t}$ are the central moments of the conditional distribution $Y_t \mid U_t, V_t, \Psi$ derived in Lemma A.2. Every $y_t$-dependent term of $w$ is at most quadratic in $y_t$; hence, replacing $y_t$ by $\mathbb{E}_{Y^m_t|Y^o_t,U_t,V_t,\Psi}[Y_t]$ and $y_t^T y_t$ by $\mathbb{E}_{Y^m_t|Y^o_t,U_t,V_t,\Psi}[Y_t^T Y_t]$ in the expansion of $w$ yields the expression for $v(y^o_t,u_t,v_t;\Psi,\Psi^*)$ stated in the lemma.

Lemma A.6.
The quantile function of the Chi-square variable with $\nu$ degrees of freedom, $\chi^{-1}_\nu(s) : [0,1) \to [0,\infty)$, belongs to the differentiability class $C^{[\nu/2]}$, that is, it has continuous derivatives of all orders not greater than $[\nu/2]$. The first derivative follows from the inverse-function rule,
\[ \frac{\partial \chi^{-1}_\nu(s)}{\partial s} = \frac{1}{\pi_\nu\big(\chi^{-1}_\nu(s)\big)}, \]
where $\pi_\nu$ denotes the Chi-square density with $\nu$ degrees of freedom, and higher-order derivatives follow recursively from the differential relation for the density,
\[ \frac{\partial^m \pi_\nu(x)}{\partial x^m} = 2^{-m} \sum_{j=0}^m \binom{m}{j} (-1)^{m-j}\, \pi_{\nu-2j}(x). \]
The operator $[c]$ for any number $c$ returns the closest integer which is not greater than $c$. Hence, the quantile function $\chi^{-1}_\nu(s)$ is continuous on the interval $[0,1)$ and has continuous derivatives up to order $[\nu/2]$.

Proof. Please refer to [64] and recall that the degrees of freedom $\nu$ belongs to the class of positive numbers and, consequently, $\nu - 2j$ is greater than $0$ for all $j < \nu/2$.
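For $\nu = 2$ the Chi-square quantile function has the closed form $\chi^{-1}_2(s) = -2\log(1-s)$, which makes the first-derivative case of Lemma A.6 easy to verify numerically through the inverse-function rule. A minimal sketch (plain Python; the evaluation point and step size are illustrative):

```python
import math

def chi2_quantile_nu2(s):
    # closed-form chi-square quantile for nu = 2 degrees of freedom
    return -2.0 * math.log(1.0 - s)

def chi2_pdf_nu2(x):
    # chi-square density for nu = 2 degrees of freedom
    return 0.5 * math.exp(-0.5 * x)

s = 0.37
h = 1e-6

# central finite difference of the quantile function at s
fd = (chi2_quantile_nu2(s + h) - chi2_quantile_nu2(s - h)) / (2 * h)

# inverse-function rule: d/ds F^{-1}(s) = 1 / f(F^{-1}(s))
analytic = 1.0 / chi2_pdf_nu2(chi2_quantile_nu2(s))
```

For this $\nu$ the derivative equals $2/(1-s)$, which is continuous on the whole half-open interval $[0,1)$ and diverges only as $s \to 1$.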
The function $m : \mathbb{R}^{d_o} \times [0,1)^2 \to \mathbb{R}$ defined in Theorem 1 is of class $C^{r_\epsilon, r_x}\big([0,1)^2\big)$ with respect to $s_{\epsilon,t}$ and $s_{x,t}$ for $r_\epsilon = \min_i [\nu^i_\epsilon/2]$ and $r_x = \min_j [\nu^j_x/2]$.

Proof. We show that the function $m : \mathbb{R}^{d_o} \times [0,1)^2 \to \mathbb{R}$,
\[ m(y_t, s_{\epsilon,t}, s_{x,t}; \Psi) := \pi_{Y_t|S_{\epsilon,t},S_{x,t},\Psi}(y_t)\, (\sigma^2)^{\frac{k}{2}}\, |D_{\epsilon,t}|^{\frac12}\, |D_{x,t}|^{\frac12}\, |M_t|^{-\frac12}\, |N_t|^{-\frac12}\, e^{-\frac12 \delta_x ( D_{x,t}^{-1} - \sigma^2 M_t^{-1} W^T D_{\epsilon,t} N_t^{-1} D_{\epsilon,t} W M_t^{-1} - \sigma^2 M_t^{-1} ) \delta_x^T}, \]
has continuous partial derivatives
\[ \frac{\partial^{i+j} m(y_t, s_{\epsilon,t}, s_{x,t}; \Psi)}{\partial s^i_{\epsilon,t}\, \partial s^j_{x,t}} \quad (15) \]
for $0 \le i \le r_\epsilon$ and $0 \le j \le r_x$. Let $m(y_t, s_\epsilon, s_x) := \tilde m(s_\epsilon, s_x)$. The function $\tilde m$ can be formulated as a composition of the functions $p$ and $g$, that is $\tilde m(s_\epsilon, s_x) = (p \circ g)(s_\epsilon, s_x)$, where

• $g : [0,1)^2 \to \mathbb{R}^d_+ \times \mathbb{R}^k_+$, such that $g(s_\epsilon, s_x) := \big( T_\epsilon(s_\epsilon), T_x(s_x) \big) = \big( g_1(s_\epsilon), \ldots, g_d(s_\epsilon), g_{d+1}(s_x), \ldots, g_{d+k}(s_x) \big)$, for $g_i : [0,1) \to \mathbb{R}_+$ being the following
\[ \big( g_1(s), \ldots, g_d(s) \big) = \bigg( \frac{\chi^{-1}_{\nu^1_\epsilon}(s)}{\nu^1_\epsilon}, \ldots, \frac{\chi^{-1}_{\nu^d_\epsilon}(s)}{\nu^d_\epsilon} \bigg) \quad \text{and} \quad \big( g_{d+1}(s), \ldots, g_{d+k}(s) \big) = \bigg( \frac{\chi^{-1}_{\nu^1_x}(s)}{\nu^1_x}, \ldots, \frac{\chi^{-1}_{\nu^k_x}(s)}{\nu^k_x} \bigg), \]
given the functions $T_\epsilon : [0,1) \to \mathbb{R}^d_+$ and $T_x : [0,1) \to \mathbb{R}^k_+$ defined in Theorem 1;

• $p : \mathbb{R}^d_+ \times \mathbb{R}^k_+ \to \mathbb{R}$, such that $p(t_\epsilon, t_x) = \prod_{l=1}^5 p_l(t_\epsilon, t_x)$ for $(t_\epsilon, t_x) \in \mathbb{R}^d_+ \times \mathbb{R}^k_+$ and $p_l : \mathbb{R}^d_+ \times \mathbb{R}^k_+ \to \mathbb{R}$ being the following
\[ p_1(t_\epsilon, t_x) = (\sigma^2)^{\frac{k}{2}} \bigg( \prod_{i=1}^d t_{\epsilon,i} \prod_{j=1}^k t_{x,j} \bigg)^{\frac12}, \qquad p_2(t_\epsilon, t_x) = \det\big( M(t_\epsilon, t_x) \big)^{-\frac12}, \qquad p_3(t_\epsilon, t_x) = \det\big( N(t_\epsilon, t_x) \big)^{-\frac12}, \]
\[ p_4(t_\epsilon, t_x) = \exp\Big\{ -\frac12\, \delta_x \big( \operatorname{diag}(t_x)^{-1} - \sigma^2 M(t_\epsilon,t_x)^{-1} W^T \operatorname{diag}(t_\epsilon)\, N(t_\epsilon,t_x)^{-1} \operatorname{diag}(t_\epsilon)\, W M(t_\epsilon,t_x)^{-1} - \sigma^2 M(t_\epsilon,t_x)^{-1} \big)\, \delta_x^T \Big\}, \]
\[ p_5(t_\epsilon, t_x) = (2\pi\sigma^2)^{-\frac{d}{2}} \det\big( N(t_\epsilon,t_x) \big)^{\frac12} \exp\Big\{ -\frac{1}{2\sigma^2}\, \big( y - u(t_\epsilon,t_x) \big)\, N(t_\epsilon,t_x)\, \big( y - u(t_\epsilon,t_x) \big)^T \Big\}, \]
where
\[ M(t_\epsilon, t_x)_{k \times k} = \sigma^2 \operatorname{diag}(t_x) + W^T \operatorname{diag}(t_\epsilon)\, W, \]
\[ N(t_\epsilon, t_x)_{d \times d} = \operatorname{diag}(t_\epsilon) - \operatorname{diag}(t_\epsilon)\, W M(t_\epsilon, t_x)^{-1} W^T \operatorname{diag}(t_\epsilon), \]
\[ u(t_\epsilon, t_x) = \mu + \delta_\epsilon \operatorname{diag}(t_\epsilon)^{-1} + \sigma^2 \delta_x M(t_\epsilon, t_x)^{-1} W^T \operatorname{diag}(t_\epsilon)\, N(t_\epsilon, t_x)^{-1}. \]
Given the above representation of $\tilde m$, the differentiability class of the function can be determined by specifying the minimum differentiability classes of $g$ and $p$, as can be seen by applying the multivariate chain rule (Faà di Bruno's formula) of [65] to the composition $p \circ g$: the partial derivatives in (15) exist and are continuous provided $g$ has continuous partial derivatives up to the orders $(i, j)$ and $p$ is sufficiently smooth.

We start by specifying the class of the functions $T_\epsilon$ and $T_x$ defined as
\[ T_\epsilon(s) := \bigg( \frac{\chi^{-1}_{\nu^1_\epsilon}(s)}{\nu^1_\epsilon}, \ldots, \frac{\chi^{-1}_{\nu^d_\epsilon}(s)}{\nu^d_\epsilon} \bigg)_{1 \times d} \quad \text{and} \quad T_x(s) := \bigg( \frac{\chi^{-1}_{\nu^1_x}(s)}{\nu^1_x}, \ldots, \frac{\chi^{-1}_{\nu^k_x}(s)}{\nu^k_x} \bigg)_{1 \times k}. \]
Using Lemma A.6, it is straightforward to show that $T_\epsilon(s)$ and $T_x(s)$ have continuous derivatives up to the orders $\min_i [\nu^i_\epsilon/2]$ and $\min_j [\nu^j_x/2]$, respectively. This stems from the fact that the differentiability class of a multidimensional function equals the minimum differentiability class of its marginals. Consequently, the function $g$ belongs to the class $C^{r_\epsilon, r_x}([0,1)^2)$ for $r_\epsilon = \min_i [\nu^i_\epsilon/2]$ and $r_x = \min_j [\nu^j_x/2]$ with respect to $s_\epsilon$ and $s_x$, respectively.

Secondly, we show that the function $p$ is infinitely differentiable under the model assumptions from Subsection 3.1. The function $p_1$ is the square root of a polynomial which is strictly positive on $\mathbb{R}^d_+ \times \mathbb{R}^k_+$ and hence infinitely differentiable there. The functions $p_l$ for $l = 2, \ldots, 5$ are compositions of rational and exponential functions of the entries of $(t_\epsilon, t_x)$ and, consequently, are infinitely differentiable except for regions where $\det\big( M(t_\epsilon,t_x) \big) = 0$ or $\det\big( N(t_\epsilon,t_x) \big) = 0$. However, given the distributional assumptions of the GSt PPCA model, the determinants of $M(t_\epsilon,t_x)$ and $N(t_\epsilon,t_x)$ are always greater than zero: as proven in Lemma A.2, these matrices represent covariance matrices of random vectors and hence are positive-definite by definition. It follows that the function $p = \prod_{l=1}^5 p_l$ is smooth. The above reasoning shows that the function $\tilde m$ belongs to the class $C^{r_\epsilon, r_x}\big([0,1)^2\big)$ for $r_\epsilon = \min_i [\nu^i_\epsilon/2]$ and $r_x = \min_j [\nu^j_x/2]$.

A.1 Proof of Theorem 1

Proof.
Let the observation vector $Y_t$ be partitioned into two subvectors, $Y_t = [Y^o_t, Y^m_t]$, with the observed and unobserved entries of $Y_t$, respectively. The joint probability function of the variables $Y_t$, $X_t$, $U_t$ and $V_t$ for the Generalized Skew-t PPCA model is specified in (7). Given the independence of the random vectors $Y_t$, $X_t$, $U_t$ and $V_t$ over their realisations, the E-step of the corresponding EM algorithm can be written as
\[ Q(\Psi, \Psi^*) = \mathbb{E}_{Y^m_N, X_N, U_N, V_N | Y^o_N, \Psi}\Big[ \log \pi_{Y_N, X_N, U_N, V_N|\Psi^*}(Y_N, X_N, U_N, V_N) \Big] \]
\[ = c \int_{\mathbb{R}^{N \times d}_+} \int_{\mathbb{R}^{N \times k}_+} \int_{\mathbb{R}^{N \times d_m}} \sum_{t=1}^N \bigg\{ \int_{\mathbb{R}^{N \times k}} \log\Big( \pi_{Y_t,X_t,U_t,V_t|\Psi^*}(y_t,x_t,u_t,v_t) \Big) \prod_{s=1}^N \pi_{Y_s,X_s,U_s,V_s|\Psi}(y_s,x_s,u_s,v_s)\, d x_N \bigg\}\, d y^m_N\, d u_N\, d v_N, \]
for $c = \pi_{Y^o_N|\Psi}(y^o_N)^{-1} > 0$. Using Lemma A.2, we can decompose the probability function $\pi_{Y_t,X_t,U_t,V_t|\Psi}(y_t,x_t,u_t,v_t)$ into the product of three functions
\[ \pi_{Y_t,X_t,U_t,V_t|\Psi}(y_t,x_t,u_t,v_t) = \pi_{X_t|Y_t,U_t,V_t,\Psi}(x_t)\,\pi_{Y_t|U_t,V_t,\Psi}(y_t)\,C(u_t,v_t;\Psi); \]
please refer to Lemma A.2 for the specification of the probability functions and of $C(u_t,v_t;\Psi) : \mathbb{R}^d_+ \times \mathbb{R}^k_+ \to \mathbb{R}$. Therefore, carrying out the integration over $x_N$, we can further simplify the formulation of the $Q$ function,
\[ Q(\Psi,\Psi^*) = c \int_{\mathbb{R}^{N \times d}_+} \int_{\mathbb{R}^{N \times k}_+} \int_{\mathbb{R}^{N \times d_m}} \bigg( \sum_{t=1}^N w(y_t,u_t,v_t;\Psi,\Psi^*) \bigg) \bigg( \prod_{s=1}^N \pi_{Y_s|U_s,V_s,\Psi}(y_s) \bigg)\, d y^m_N\, \bigg( \prod_{s=1}^N C(u_s,v_s;\Psi) \bigg)\, d u_N\, d v_N, \]
for the function $w : \mathbb{R}^d \times \mathbb{R}^d_+ \times \mathbb{R}^k_+ \to \mathbb{R}$ defined in Lemma A.3. We can follow similar steps to simplify the integration over the vector $y^m_N$, using the fact that the $y_t$ are mutually independent.
The next step is to specify the solutions to the integration problems
\[ \int_{\mathbb{R}^{d_m}} \pi_{Y_t|U_t,V_t,\Psi}(y_t)\, d y^m_t \quad \text{and} \quad \int_{\mathbb{R}^{d_m}} w(y_t,u_t,v_t;\Psi,\Psi^*)\, \pi_{Y_t|U_t,V_t,\Psi}(y_t)\, d y^m_t, \]
which rely on $\pi_{Y^m_t|Y^o_t,U_t,V_t,\Psi}(y^m_t)$. The probability function $\pi_{Y_t|U_t,V_t,\Psi}(y_t)$ is Gaussian, and conditional distributions of Gaussian random vectors are broadly known. We derive the probability functions in Lemma A.4 such that
\[ \pi_{Y_t|U_t,V_t,\Psi}(y_t) = \pi_{Y^m_t|Y^o_t,U_t,V_t,\Psi}(y^m_t)\, \pi_{Y^o_t|U_t,V_t,\Psi}(y^o_t), \]
and use them to obtain the formulation of the $Q$ function as following
\[ Q(\Psi,\Psi^*) = c \int_{\mathbb{R}^{N\times d}_+} \int_{\mathbb{R}^{N\times k}_+} \sum_{t=1}^N \underbrace{\bigg\{ \int_{\mathbb{R}^{d_m}} w(y_t,u_t,v_t;\Psi,\Psi^*)\, \pi_{Y^m_t|Y^o_t,U_t,V_t,\Psi}(y^m_t)\, d y^m_t \bigg\}}_{v(y^o_t,u_t,v_t;\Psi,\Psi^*)} \times \underbrace{\bigg( \prod_{s=1}^N \pi_{Y^o_s|U_s,V_s,\Psi}(y^o_s)\, C(u_s,v_s;\Psi) \bigg)}_{h(y^o_N,u_N,v_N;\Psi)}\, d u_N\, d v_N. \]
Given the solution to the integration problem $v(y^o_t,u_t,v_t;\Psi,\Psi^*)$ in Lemma A.5 and denoting
\[ h(y^o_t,u_t,v_t;\Psi) = \pi_{Y^o_t|U_t,V_t,\Psi}(y^o_t)\, \pi_{U_t|\Psi}(u_t)\, \pi_{V_t|\Psi}(v_t)\, (\sigma^2)^{\frac{k}{2}}\, |D_{\epsilon,t}|^{\frac12}\, |D_{x,t}|^{\frac12}\, |M_t|^{-\frac12}\, |N_t|^{-\frac12} \]
\[ \times \exp\Big\{ -\frac12\, \delta_x \big( D_{x,t}^{-1} - \sigma^2 M_t^{-1} W^T D_{\epsilon,t} N_t^{-1} D_{\epsilon,t} W M_t^{-1} - \sigma^2 M_t^{-1} \big)\, \delta_x^T \Big\}, \]
the E-step of the Generalized Skew-t PPCA is the following
\[ Q(\Psi,\Psi^*) = c \sum_{t=1}^N \bigg\{ \int_{\mathbb{R}^d_+} \int_{\mathbb{R}^k_+} v(y^o_t,u_t,v_t;\Psi,\Psi^*)\, h(y^o_t,u_t,v_t;\Psi)\, d u_t\, d v_t \prod_{s=1, s\neq t}^N \int_{\mathbb{R}^d_+} \int_{\mathbb{R}^k_+} h(y^o_s,u_s,v_s;\Psi)\, d u_s\, d v_s \bigg\}.
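After the change of variables carried out below, each of these factors reduces to a two-dimensional integral over the unit square, which can be evaluated with any standard bivariate quadrature. A minimal sketch using a tensor-product composite Simpson rule (the quadrature scheme and the polynomial test integrand are illustrative assumptions, not the rule prescribed by the paper):

```python
def simpson_weights(n):
    # composite Simpson weights on [0, 1] with n subintervals (n must be even)
    h = 1.0 / n
    w = [h / 3.0] * (n + 1)
    for i in range(1, n):
        w[i] *= 4.0 if i % 2 == 1 else 2.0
    return w

def integrate_unit_square(f, n=8):
    # tensor-product Simpson rule approximating the double integral of
    # f(s, t) over the unit square [0, 1] x [0, 1]
    w = simpson_weights(n)
    nodes = [i / n for i in range(n + 1)]
    return sum(w[i] * w[j] * f(nodes[i], nodes[j])
               for i in range(n + 1) for j in range(n + 1))

# polynomial integrand: exact value is (1/3) * (1/2) = 1/6
val = integrate_unit_square(lambda s, t: s * s * t)
```

Simpson's rule is exact for polynomials up to degree three, so the test integrand is recovered up to floating-point error; the smoothness of the actual integrands on the unit square is the subject of Lemma A.7.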
(16)

Let us recall the definition of the mutually independent multivariate Gamma random vectors $U_t$ and $V_t$ from (5) in Subsection 3.1 as transformations of the mutually independent univariate uniform random variables $S_{\epsilon,t}, S_{x,t} \sim U(0,1)$ by the quantile functions $T_\epsilon : [0,1) \to \mathbb{R}^d_+$ and $T_x : [0,1) \to \mathbb{R}^k_+$ such that
\[ T_\epsilon(s) := \bigg( \frac{\chi^{-1}_{\nu^1_\epsilon}(s)}{\nu^1_\epsilon}, \ldots, \frac{\chi^{-1}_{\nu^d_\epsilon}(s)}{\nu^d_\epsilon} \bigg)_{1 \times d} \quad \text{and} \quad T_x(s) := \bigg( \frac{\chi^{-1}_{\nu^1_x}(s)}{\nu^1_x}, \ldots, \frac{\chi^{-1}_{\nu^k_x}(s)}{\nu^k_x} \bigg)_{1 \times k}, \]
where $\chi^{-1}_\nu$ is the quantile function of a univariate Chi-square random variable with $\nu$ degrees of freedom. By the change of variables $S_{\epsilon,t} = T^{-1}_\epsilon(U_t)$ and $S_{x,t} = T^{-1}_x(V_t)$, we can reduce the $(d+k)$-dimensional integration problems from (16) to two-dimensional integrations on the unit square $[0,1)^2$.

Firstly, we remark that the corresponding density functions $\pi_{U_t|\Psi}(u_t)$ and $\pi_{V_t|\Psi}(v_t)$ can be expressed by the density functions of the random variables $S_{\epsilon,t}$ and $S_{x,t}$ defined in Subsection 3.1, that is
\[ \pi_{U_t|\Psi}(u_t) = \pi_{S_{\epsilon,t}|\Psi}(s_{\epsilon,t}) \bigg| \frac{\partial T_\epsilon(s_{\epsilon,t})}{\partial s_{\epsilon,t}} \bigg|^{-1} = \mathbb{1}_{[0,1)}(s_{\epsilon,t}) \bigg| \frac{\partial T_\epsilon(s_{\epsilon,t})}{\partial s_{\epsilon,t}} \bigg|^{-1}, \qquad \pi_{V_t|\Psi}(v_t) = \pi_{S_{x,t}|\Psi}(s_{x,t}) \bigg| \frac{\partial T_x(s_{x,t})}{\partial s_{x,t}} \bigg|^{-1} = \mathbb{1}_{[0,1)}(s_{x,t}) \bigg| \frac{\partial T_x(s_{x,t})}{\partial s_{x,t}} \bigg|^{-1}. \]
The distribution of $Y^o_t$ conditioned on $U_t$ and $V_t$ can also be expressed by $S_{\epsilon,t}$ and $S_{x,t}$ as $\pi_{Y^o_t|T_\epsilon(S_{\epsilon,t}),T_x(S_{x,t}),\Psi}(y^o_t) = \pi_{Y^o_t|S_{\epsilon,t},S_{x,t},\Psi}(y^o_t)$. Recall that the definitions of the matrices $D_{\epsilon,t}$ and $D_{x,t}$ under the transformations $U_t = T_\epsilon(S_{\epsilon,t})$ and $V_t = T_x(S_{x,t})$ become
00 0 χ − νd(cid:15) ( S(cid:15) ) νd(cid:15) d × d , D x,t = χ − ν x ( Sx,t ) ν x ...
00 0 χ − νkx ( Sx,t ) νkx k × k . Hence, the E-step of the EM algorithm for GSt PPCA from (16) can be expressed by the sum of two-dimensionalintegration problems Q (Ψ , Ψ ∗ ) = 1 π Y o N | Ψ ( y o N ) N (cid:88) t =1 (cid:26) I ( y ot ; Ψ , Ψ ∗ ) N (cid:89) s =1 ,s (cid:54) = t I ( y os ; Ψ) (cid:27) , for the functions I , I : R do → R I ( y ot ; Ψ , Ψ ∗ ) := (cid:90) (cid:90) v ( y ot , T (cid:15) ( s (cid:15),t ) , T x ( s x,t ); Ψ , Ψ ∗ ) m ( y ot , s (cid:15),t , s x,t ; Ψ) ds (cid:15),t ds x,t ,I ( y ot ; Ψ) := (cid:90) (cid:90) m ( y ot , s (cid:15),t , s x,t ; Ψ) ds (cid:15),t ds x,t , and the function m : R do × [0 , −→ R m ( y ot , s (cid:15),t , s x,t ; Ψ) = π Y ot | S(cid:15),t,Sx,t, Ψ ( y ot ) (cid:0) σ (cid:1) k (cid:12)(cid:12)(cid:12) D − x,t (cid:12)(cid:12)(cid:12) − (cid:12)(cid:12)(cid:12) M t (cid:12)(cid:12)(cid:12) − (cid:12)(cid:12)(cid:12) N t (cid:12)(cid:12)(cid:12) − e − δ x (cid:0) D − x,t − σ M − t W T N − t D (cid:15),t WM − t − σ M − t (cid:1) δ Tx . A.2 Proof of Theorem 2
Proof.
Let us define a function $H : \mathbb{R}^{(N-1)\times d_o} \longrightarrow \mathbb{R}$ which is a product of $N-1$ integration problems on $[0,1]^2$, represented by the function $I_2$ from Theorem 1 (the integral that does not depend on $\Psi^*$), given by
$$H\big(y^o_{1:N/t};\Psi\big) := \prod_{s=1,\, s\neq t}^{N} I_2\big(y^o_s;\Psi\big). \tag{17}$$
The set $y^o_{1:N/t}$ is an $(N-1)$-element sequence of $d_o$-dimensional vectors collecting the observed entries of $y_{1:N}$ without the vector $y^o_t$, that is,
$$y^o_{1:N/t} = \big\{ y^o_1, \ldots, y^o_{t-1}, y^o_{t+1}, \ldots, y^o_N \big\}.$$
The maximizers of the function $Q$ with respect to $\mu^*$, $\sigma^{*2}$, $\delta^*_\epsilon$ and $\delta^*_x$ have closed-form solutions defined by sequences of two-dimensional integration problems. For each parameter $\theta^* \in \{\mu^*, \delta^*_\epsilon, \delta^*_x, \sigma^{*2}\}$ the first-order condition takes the common form
$$\frac{\partial}{\partial \theta^*} Q(\Psi,\Psi^*) = 0 \iff \sum_{t=1}^{N} \left\{ H\big(y^o_{1:N/t};\Psi\big) \int_0^1\!\!\int_0^1 \frac{\partial}{\partial \theta^*}\, \tilde v\big(y^o_t, s_{\epsilon,t}, s_{x,t}; \Psi, \Psi^*\big)\, m\big(y^o_t, s_{\epsilon,t}, s_{x,t}; \Psi\big)\, ds_{\epsilon,t}\, ds_{x,t} \right\} = 0.$$
Solving these conditions yields updates of the form
$$\mu^* = \Big[ A_\cdot - A_\cdot\, \delta^*_\epsilon - A_\cdot\, W^{*\mathsf T} + \mu\, W A_\cdot\, W^{*\mathsf T} + \big(\delta_\epsilon W + \sigma^2 \delta_x\big) A_\cdot\, W^{*\mathsf T} \Big]\, A_\cdot^{-1},$$
$$\delta^*_\epsilon = \Big[ A_\cdot - A_\cdot\, \mu^* - A_\cdot\, W^{*\mathsf T} + \mu\, W A_\cdot\, W^{*\mathsf T} + \big(\delta_\epsilon W - \sigma^2 \delta_x\big) A_\cdot\, W^{*\mathsf T} \Big]\, A_\cdot^{-1}, \qquad \delta^*_x = \Big[ A_\cdot - \mu\, W A_\cdot - \big(\delta_\epsilon W - \sigma^2 \delta_x\big) A_\cdot \Big]\, A_\cdot^{-1},$$
while $\sigma^{*2}$ is an analogous, longer combination of the operators $A_\cdot$, of the traces $\operatorname{Tr}\{A_\cdot W^{*\mathsf T}\}$ and $\operatorname{Tr}\{A_\cdot W^{*\mathsf T} W^*\}$, and of the vectors $\mu^*$, $\delta^*_\epsilon$ and $\delta_x$, divided by $d\, A_\cdot$. Here each $A_\cdot$ stands for a distinct integral operator of the common form
$$A_\cdot\big(y^o_{1:N};\Psi\big) := \sum_{t=1}^{N} \left\{ H\big(y^o_{1:N/t};\Psi\big) \int_0^1\!\!\int_0^1 g\big(y^o_t, s_{\epsilon,t}, s_{x,t}\big)\, m\big(y^o_t, s_{\epsilon,t}, s_{x,t}; \Psi\big)\, ds_{\epsilon,t}\, ds_{x,t} \right\},$$
with an additional argument $\Psi^*$ whenever the kernel $g$ contains $W^*$ or $\mu^*$. Writing $\mathbb{E}_t[\,\cdot\,] := \mathbb{E}_{Y^m_t|Y^o_t, S_{\epsilon,t}, S_{x,t}, \Psi}[\,\cdot\,]$, the kernels $g$ entering the updates are:
- the scalar kernel $g = 1$;
- the matrix kernels $g = D_{\epsilon,t}$, $D_{\epsilon,t}^{-1}$, $D_{x,t}^{-1}$ and $M_t^{-1}$;
- $g = \mathbb{E}_t[Y_t]$, $\mathbb{E}_t[Y_t]\, D_{\epsilon,t}$, $\mathbb{E}_t[Y_t]\, D_{\epsilon,t} W^* M_t^{-1}$, $\mathbb{E}_t[Y_t]\, D_{\epsilon,t} W M_t^{-1}$, $\mathbb{E}_t[Y_t]\, D_{\epsilon,t} W^* M_t^{-1} W^{\mathsf T} D_{\epsilon,t}$ and $\mathbb{E}_t[Y_t]\, D_{\epsilon,t} W M_t^{-1} W^{*\mathsf T} D_{\epsilon,t}$;
- $g = D_{\epsilon,t} W M_t^{-1}$, $D_{\epsilon,t} W^* M_t^{-1}$ and $D_{\epsilon,t} W M_t^{-1} W^{*\mathsf T} D_{\epsilon,t}$;
- $g = D_{\epsilon,t}\, \mathbb{E}_t[Y_t^{\mathsf T} Y_t]\, D_{\epsilon,t} W M_t^{-1}$, $D_{\epsilon,t}\, \mu^{*\mathsf T}\, \mathbb{E}_t[Y_t]\, D_{\epsilon,t} W M_t^{-1}$ and $D_{\epsilon,t} \big(\mathbb{E}_t[Y_t] - \mu^*\big)^{\mathsf T} \big(\mu D_{\epsilon,t} W + \delta_\epsilon W - \sigma^2 \delta_x\big) M_t^{-1}$;
- $g = D_{\epsilon,t} W^* M_t^{-1} W^{\mathsf T} D_{\epsilon,t}\, \mathbb{E}_t[Y_t^{\mathsf T} Y_t]\, D_{\epsilon,t} W M_t^{-1}$ and $D_{\epsilon,t} W^* M_t^{-1} W^{\mathsf T} D_{\epsilon,t}\, \mathbb{E}_t[Y_t]^{\mathsf T} \big(\mu D_{\epsilon,t} W + \delta_\epsilon W - \sigma^2 \delta_x\big) M_t^{-1}$;
- $g = D_{\epsilon,t} W^* M_t^{-1} \big(\mu D_{\epsilon,t} W + \delta_\epsilon W - \sigma^2 \delta_x\big)^{\mathsf T}\, \mathbb{E}_t[Y_t]\, D_{\epsilon,t} W M_t^{-1}$ and $D_{\epsilon,t} W^* M_t^{-1} \big(\mu D_{\epsilon,t} W + \delta_\epsilon W - \sigma^2 \delta_x\big)^{\mathsf T} \big(\mu D_{\epsilon,t} W + \delta_\epsilon W - \sigma^2 \delta_x\big) M_t^{-1}$;
- the trace kernels $g = \operatorname{Tr}\big\{ \mathbb{E}_t[Y_t^{\mathsf T} Y_t]\, D_{\epsilon,t} \big\}$ and $g = \operatorname{Tr}\big\{ \mathbb{E}_t[Y_t^{\mathsf T} Y_t]\, D_{\epsilon,t} W M_t^{-1} W^{*\mathsf T} D_{\epsilon,t} \big\}$.
The maximizer that corresponds to the parameter $W^*$ is more difficult to obtain, since the partial derivative of the function $Q$ with respect to $W^*$ requires integrating matrix products which contain $W^*$ and cannot be simplified further. The first-order condition for $W^*$ is therefore a root-finding problem: denoting by $f : \mathbb{R}^{d\times k} \to \mathbb{R}^{d\times k}$ the map $f(W^*; \Psi, \Psi^*) := \frac{\partial Q(\Psi,\Psi^*)}{\partial W^*}$, which is a linear combination of the operators $A_\cdot$ above with coefficients involving $\mu$, $\mu^*$, $\delta_\epsilon$, $\delta^*_\epsilon$, $\delta_x$ and $\sigma^2$, the update of $W^*$ solves $f(W^*; \Psi, \Psi^*) = 0$. Note that $f$ contains elements of $\mu^*$ and $\delta^*_\epsilon$. In summary, the system of normal equations $\nabla_{\Psi^*} Q(\Psi,\Psi^*) = 0$ reduces to the closed-form updates of $\mu^*$, $\delta^*_\epsilon$, $\delta^*_x$ and $\sigma^{*2}$ above, coupled with the matrix root-finding problem $f(W^*; \Psi, \Psi^*) = 0$.

B Simplified Skew-t GSt PPCA
We calculate the steps of the EM algorithm for a simplified version of the Skew-t GSt PPCA model introduced in Subsection 3.1. The following algorithm is derived under the assumptions that there is no missing data and that skewness is present only in the representation of $X_t$, that is,
$$X_{t, 1\times k} = \delta_x V^{-1}_t + \sqrt{V^{-1}_t}\, Z_{x,t}, \qquad \epsilon_{t, 1\times d} = \sqrt{\sigma^2 U^{-1}_t}\, Z_{\epsilon,t}, \qquad Y_{t, 1\times d} = \mu + X_t W^{\mathsf T} + \epsilon_t, \tag{18}$$
for $U_t := T_\epsilon(s) := \chi^{-1}_{\nu_\epsilon}(s) / \nu_\epsilon$ and $V_t := T_x(s) := \chi^{-1}_{\nu_x}(s) / \nu_x$.
In the follow-up publication to this work, we show that the Skew-t GSt PPCA is subject to identification problems that arise from the joint estimation of $\mu$ and $\delta_x$ via the EM algorithm. To address this problem, we propose to exclude the estimation of both parameters from the EM algorithm. The new set of normal equations that defines the iterative maximisation of $Q$ does not include the steps that maximize the objective function with respect to $\mu$ or $\delta_x$.
We want to avoid the assumption that the observation process is zero-mean. Therefore, we argue that if there are no missing values, we can introduce a simple correction that provides good accuracy of the estimates of all parameters, $\mu$, $W$, $\sigma^2$ and $\delta_x$, when the parameters $W$ and $\sigma^2$ are specified by the EM algorithm that assumes no intercept term. We determine $\delta_x$ by a grid search, and the intercept term $\mu$ is obtained by an iterative adjustment. We use the fact that the first moment of the marginal distribution of $X_t$ equals
$$\mathbb{E}_{X_t|\Psi}\big[X_t\big] = \mathbb{E}_{V_t|\Psi}\Big[ \mathbb{E}_{X_t|V_t,\Psi}\big[X_t\big] \Big] = \frac{\nu_x}{\nu_x - 2}\, \delta_x,$$
and consequently
$$\mathbb{E}_{Y_t|\Psi}\big[Y_t\big] = \mu + \frac{\nu_x}{\nu_x - 2}\, \delta_x W^{\mathsf T}.$$
Therefore, we specify the update of $\mu$ over each iteration of the EM algorithm as
$$\mu^* = \frac{1}{N} \sum_{t=1}^{N} Y_t - \frac{\nu_x}{\nu_x - 2}\, \delta_x W^{\mathsf T},$$
for fixed $\delta_x$, and calculate the maximizers $W^*$ and $\sigma^{*2}$ given the centred realisations $\tilde Y_t = Y_t - \mu^*$. Instead of the sample average, one can use more robust estimators of the first moment.
The steps of the EM algorithm for the centred data (no $\mu$), with $\delta_x$ specified on a grid, assume the following stochastic representation of the observation process
$$\tilde Y_{t, 1\times d} = X_t W^{\mathsf T} + \epsilon_t, \tag{19}$$
and the corresponding conditional distribution of $\tilde Y_t \mid X_t, U_t, V_t, \Psi$,
$$\pi\big(\tilde y_t \mid x_t, u_t, v_t, \Psi\big) = \big(2\pi\big)^{-d/2} \big(\sigma^2\big)^{-d/2}\, U_t^{d/2} \exp\left\{ -\frac{1}{2\sigma^2} \Big( U_t\, \tilde y_t \tilde y_t^{\mathsf T} - 2\, U_t\, \tilde y_t W x_t^{\mathsf T} + U_t\, x_t W^{\mathsf T} W x_t^{\mathsf T} \Big) \right\}.$$

Algorithm 1: The EM algorithm for Skew-t GSt PPCA with the parameter $\delta_x$ specified on a grid of values and the adjustment for the intercept $\mu$.
Input: Define $\nu_\epsilon$, $\nu_x$ and the grid for $\delta^{\mathrm{grid}}_x$;
Input: Define initial parameters for the EM algorithm, $W^{(0)}$, $\sigma^{(0)2}$;
Calculate $\bar\mu = \frac{1}{N} \sum_{t=1}^{N} y_t$;
for $\delta^{\mathrm{grid}}_x \in \{a, \ldots, b\}$ do
  Initialize $W^{(0)}$, $\sigma^{(0)2}$;
  for $i = 1, \ldots, M$ do
    Specify $\mu^{(i)} = \bar\mu - \frac{\nu_x}{\nu_x - 2}\, \delta^{\mathrm{grid}}_x W^{(i-1)\mathsf T}$;
    Specify $\tilde Y^{(i)}_{N\times d} = Y_{N\times d} - \mu^{(i)}$;
    Calculate the maximizers $W^{(i)}$, $\sigma^{(i)2}$ as in Lemma B.2 for the centred data set $\tilde Y^{(i)}_{1:N}$;
Select the optimal $\delta^*_x$ with the highest log-likelihood for the sample $y_{1:N}$.

Lemma B.1.
Let the observation vector $\tilde Y_t$ and the latent variables $\epsilon_t$, $X_t$, $U_t$ and $V_t$ be modelled as in (18) and (19). Given $N$ realisations of the random vector $\tilde Y_t$, the E-step of the Expectation-Maximisation algorithm for the centred Skew-t GSt PPCA in the complete data setting is specified as
$$Q(\Psi,\Psi^*) = \frac{1}{\pi_{\tilde Y_{1:N}|\Psi}(\tilde y_{1:N})} \sum_{t=1}^{N} \left\{ I_1\big(\tilde y_t; \Psi, \Psi^*\big) \prod_{s=1,\, s\neq t}^{N} I_2\big(\tilde y_s; \Psi\big) \right\}.$$
The functions $I_1, I_2 : \mathbb{R}^{d} \to \mathbb{R}$ are defined as
$$I_1\big(\tilde y_t; \Psi, \Psi^*\big) := \int_0^1\!\!\int_0^1 \tilde w\big(\tilde y_t, s_{\epsilon,t}, s_{x,t}; \Psi, \Psi^*\big)\, m\big(\tilde y_t, s_{\epsilon,t}, s_{x,t}; \Psi\big)\, ds_{\epsilon,t}\, ds_{x,t}, \qquad I_2\big(\tilde y_t; \Psi\big) := \int_0^1\!\!\int_0^1 m\big(\tilde y_t, s_{\epsilon,t}, s_{x,t}; \Psi\big)\, ds_{\epsilon,t}\, ds_{x,t},$$
where $\tilde w : \mathbb{R}^{d} \times [0,1]^2 \longrightarrow \mathbb{R}$ is defined as $\tilde w\big(\tilde y_t, s_{\epsilon,t}, s_{x,t}; \Psi, \Psi^*\big) := w\big(\tilde y_t, T_\epsilon(s_{\epsilon,t}), T_x(s_{x,t}); \Psi, \Psi^*\big)$ for
$$w\big(\tilde y_t, u_t, v_t; \Psi, \Psi^*\big) = \int_{\mathbb{R}^{k}} \log \pi\big(\tilde y_t, x_t, u_t, v_t; \Psi^*\big)\, \pi\big(x_t \mid \tilde y_t, u_t, v_t; \Psi\big)\, dx_t,$$
with $\pi(x_t \mid \tilde y_t, u_t, v_t; \Psi)$ being the probability density function of the conditional random vector
$$X_t \mid \tilde Y_t, U_t, V_t, \Psi \sim \mathcal{N}\Big( \big( U_t \tilde Y_t W + \sigma^2 \delta_x \big) M^{-1}_t,\; \sigma^2 M^{-1}_t \Big), \qquad M_t = T_\epsilon(s_{\epsilon,t})\, W^{\mathsf T} W + \sigma^2\, T_x(s_{x,t})\, \mathbb{I}_k,$$
and the function $m : \mathbb{R}^{d} \times [0,1]^2 \longrightarrow \mathbb{R}$ is given by
$$m\big(\tilde y_t, s_{\epsilon,t}, s_{x,t}; \Psi\big) := \big(2\pi\big)^{-d/2} \big(\sigma^2\big)^{(k-d)/2}\, T_\epsilon(s_{\epsilon,t})^{d/2}\, T_x(s_{x,t})^{k/2}\, \big|M_t\big|^{-1/2} \exp\left\{ -\frac{1}{2\sigma^2}\, T_\epsilon(s_{\epsilon,t})\, \tilde y_t \tilde y_t^{\mathsf T} \right\} \times \exp\left\{ -\frac{1}{2\sigma^2} \Big( \sigma^2\, T_x(s_{x,t})\, \delta_x \delta_x^{\mathsf T} - \big( T_\epsilon(s_{\epsilon,t})\, \tilde y_t W + \sigma^2 \delta_x \big) M^{-1}_t \big( T_\epsilon(s_{\epsilon,t})\, \tilde y_t W + \sigma^2 \delta_x \big)^{\mathsf T} \Big) \right\}.$$

Proof.
The proof of Lemma B.1 follows the same steps as the proof of Theorem 1 in Subsection A.1 of Appendix A, but assuming no $\delta_\epsilon$ and no $\mu$.

Lemma B.2.
The solution to the system of equations $\nabla_{\Psi^*} Q(\Psi,\Psi^*) = 0$, which determines the maximizers of the function $Q$ from Lemma B.1 with respect to the parameters $\sigma^2$ and $W$, is given by explicit formulas defined by two-dimensional integration problems on the unit square $[0,1]^2$ as follows:
$$W^* = \Big[ A_\cdot + \sigma^2 A_\cdot \Big] \Big( \sigma^2 A_\cdot + A_\cdot \Big)^{-1}, \qquad \sigma^{*2} = \frac{1}{d\, A_\cdot} \left[ A_\cdot - \operatorname{Tr}\big\{ A_\cdot\, W^{*\mathsf T} \big\} - \sigma^2 A_\cdot\, \delta_x^{\mathsf T} + \operatorname{Tr}\Big\{ \big( \sigma^2 A_\cdot + A_\cdot \big) W^{*\mathsf T} W^* \Big\} \right],$$
where each $A_\cdot$ denotes one of the two-dimensional integral operators on $[0,1]^2$ defined in Subsection A.2 of Appendix A, with the exceptions
$$A_\cdot\big(y_{1:N};\Psi\big)_{d\times k} := \sum_{t=1}^{N} \left\{ H\big(y_{1:N/t};\Psi\big)\, y_t^{\mathsf T} \int_0^1\!\!\int_0^1 T_\epsilon(s_{\epsilon,t})\, \delta_x M^{-1}_t\, m\big(y_t, s_{\epsilon,t}, s_{x,t}; \Psi\big)\, ds_{\epsilon,t}\, ds_{x,t} \right\},$$
$$A_\cdot\big(y_{1:N};\Psi\big)_{k\times k} := \sum_{t=1}^{N} \left\{ H\big(y_{1:N/t};\Psi\big) \int_0^1\!\!\int_0^1 T_\epsilon(s_{\epsilon,t})\, M^{-1}_t \big( T_\epsilon(s_{\epsilon,t})\, y_t W + \sigma^2 \delta_x \big)^{\mathsf T} \big( T_\epsilon(s_{\epsilon,t})\, y_t W + \sigma^2 \delta_x \big) M^{-1}_t\, m\big(y_t, s_{\epsilon,t}, s_{x,t}; \Psi\big)\, ds_{\epsilon,t}\, ds_{x,t} \right\}.$$

Proof. The proof of Lemma B.2 follows the steps of the proof of Theorem 2 in Subsection A.2 of Appendix A, but assuming no $\delta_\epsilon$ and no $\mu$, and without the maximization of the function $Q$ with respect to $\delta_x$.

C Figures & Tables
Table 4: The model choices (selected degrees of freedom, where applicable) and the resulting log-likelihood ($\log L$) for the PPCA frameworks (the Gaussian PPCA, Student-t PPCA, Student-t GSt PPCA, Grouped-t GSt PPCA and Skew-t GSt PPCA) for standardized daily returns of the crypto assets listed in Table 3 over the two sample periods, 2018 and 2019.

Model | $\log L$ (2018) | $\log L$ (2019)
Gaussian PPCA | -5451.735 | -2743.588
Student-t PPCA | -10455.642 | -9989.970
Student-t GSt PPCA | -10385.006 | -10003.074
Grouped-t GSt PPCA | -11005.430 | -9841.640
Skew-t GSt PPCA | -10366.933 | -9987.246

The Student-t, Grouped-t and Skew-t variants additionally select degrees of freedom: the error degrees of freedom $\nu_\epsilon$ (for the Grouped-t GSt PPCA, one per sector group, $\nu_{\epsilon,\mathrm{Advertising}}$, $\nu_{\epsilon,\mathrm{Exchange}}$, $\nu_{\epsilon,\mathrm{Technology}}$, $\nu_{\epsilon,\mathrm{Energy}}$, $\nu_{\epsilon,\mathrm{SmartContracts}}$, $\nu_{\epsilon,\mathrm{Interoperability}}$, $\nu_{\epsilon,\mathrm{Governance}}$, $\nu_{\epsilon,\mathrm{Privacy}}$, $\nu_{\epsilon,\mathrm{FinancialService}}$ and $\nu_{\epsilon,\mathrm{Stablecoin}}$) and the factor degrees of freedom $\nu_x$.
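Throughout the appendices, every E-step quantity reduces, after the chi-square quantile transform, to a double integral over the unit square $[0,1]^2$. As a closing illustration of this computational pattern, the following minimal Python sketch applies the transform inside a midpoint-rule quadrature. It is an illustration, not the paper's implementation: the integrand `toy_m` is a hypothetical stand-in for the function $m$, and the chi-square quantile is written in closed form for the special case $\nu = 2$ (general $\nu$ needs a library routine such as `scipy.stats.chi2.ppf`).

```python
import math

def chi2_quantile_nu2(s):
    # Chi-square with 2 degrees of freedom has CDF F(x) = 1 - exp(-x/2),
    # so its quantile is F^{-1}(s) = -2*log(1 - s).  General nu requires
    # a numerical quantile routine.
    return -2.0 * math.log(1.0 - s)

def T(s, nu=2.0):
    # One coordinate of the transform T(s) = chi2_quantile(s) / nu that
    # maps a uniform draw s in (0,1) to a Gamma-type mixing weight.
    return chi2_quantile_nu2(s) / nu

def integrate_unit_square(f, n=200):
    # Midpoint rule on [0,1]^2; the E-step integrals I_1 and I_2 have
    # exactly this shape after the change of variables.
    h = 1.0 / n
    return sum(
        f((i + 0.5) * h, (j + 0.5) * h)
        for i in range(n) for j in range(n)
    ) * h * h

def toy_m(s_eps, s_x):
    # Hypothetical stand-in integrand: the product of the two mixing
    # weights; the true m would evaluate a Gaussian density here.
    return T(s_eps) * T(s_x)

# E[chi2_2 / 2] = 1, so the exact value of this toy integral is 1.
approx = integrate_unit_square(toy_m, n=400)
```

Since each coordinate of $T_\epsilon$ and $T_x$ is driven by a single uniform draw, the components within a group are comonotone, which is the mechanism behind the grouped degrees-of-freedom structure.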
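Algorithm 1 combines an outer grid search over $\delta_x$, an inner EM loop with the intercept adjustment $\mu^{(i)} = \bar\mu - \frac{\nu_x}{\nu_x-2}\delta_x W^{(i-1)\mathsf T}$, and a final selection by log-likelihood. Its driver logic can be sketched as below. This is a schematic under simplifying assumptions: it takes $k = 1$, so that $\delta_x$ is a scalar and $W$ a length-$d$ vector, and the callables `em_step` and `log_likelihood` are hypothetical placeholders for the Lemma B.2 updates and the model log-likelihood. Only the control flow mirrors the algorithm.

```python
def column_means(Y):
    # Sample mean of the observations y_t (the mu-bar of Algorithm 1).
    n, d = len(Y), len(Y[0])
    return [sum(row[j] for row in Y) / n for j in range(d)]

def fit_skew_t_ppca(Y, nu_x, delta_grid, em_step, log_likelihood,
                    W0, sigma2_0, n_iter=10):
    # Grid search over delta_x with an intercept adjustment inside each
    # EM iteration.  em_step(Y_tilde, W, sigma2) -> (W, sigma2) stands in
    # for the Lemma B.2 M-step on centred data, and log_likelihood
    # scores a fitted model; both are placeholders supplied by the caller.
    mu_bar = column_means(Y)
    scale = nu_x / (nu_x - 2.0)   # E[V^{-1}] for V = chi-square(nu_x)/nu_x
    best = None
    for delta_x in delta_grid:
        W, sigma2 = list(W0), sigma2_0   # re-initialise per grid point
        mu = mu_bar                      # defensive default if n_iter == 0
        for _ in range(n_iter):
            # mu^(i) = mu_bar - nu_x/(nu_x - 2) * delta_x * W^(i-1)^T
            mu = [m - scale * delta_x * w for m, w in zip(mu_bar, W)]
            # centre the data and run one M-step on it
            Y_tilde = [[y - m for y, m in zip(row, mu)] for row in Y]
            W, sigma2 = em_step(Y_tilde, W, sigma2)
        ll = log_likelihood(Y, mu, W, sigma2, delta_x)
        if best is None or ll > best[0]:
            best = (ll, delta_x, mu, W, sigma2)
    return best
```

In the paper's setting the inner maximizers come from the two-dimensional integral operators of Lemma B.2; here any M-step with the same signature can be plugged in.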