Bayesian Covariance Modelling of Large Tensor-Variate Data Sets & Inverse Non-parametric Learning of the Unknown Model Parameter Vector
Kangrui Wang Dalia Chakrabarty
Department of Mathematics, University of Leicester; kw202 @ le.ac.uk

Department of Mathematics, University of Leicester; dc252 @ le.ac.uk; Department of Statistics, University of Warwick; [email protected]

Abstract
We present a method for modelling the covariance structure of tensor-variate data, with the ulterior aim of learning an unknown model parameter vector using such data. We express the high-dimensional observable as a function of this sought model parameter vector, and attempt to learn such a high-dimensional function given training data, by modelling it as a realisation from a tensor-variate Gaussian Process (GP). The likelihood of the unknowns given training data is then tensor-normal. We choose vague priors on the unknown GP parameters (mean tensor and covariance matrices) and write the posterior probability density of these unknowns given the data. We perform posterior sampling using Random-Walk Metropolis-Hastings. Thereafter we learn the aforementioned unknown model parameter vector by performing posterior sampling in two different ways, given test and training data, using MCMC, to generate 95% HPD credible regions on each unknown. We make an application of this method to the learning of the location of the Sun in the Milky Way disk.
A cornerstone of scientific pursuit involves seeking the value of an unknown model parameter, given data that constitutes measurements of an observable. Such an exercise is of course relevant only when the observable is a function of the model parameter. Knowing the functional relationship between this observable and the model parameter, one can then inversely learn the value of the model parameter at which the data at hand is realised (Ramsay and Silverman, 2014; Hofmann, 2011; Tarantola, 2005). The Bayesian equivalent of this approach constitutes sampling from the posterior predictive density of the unknown parameter, given the data and the model for the aforementioned functional relationship. However, this very functional relationship (between the observable and the unknown parameter) may not be known a priori. The standard approach within supervised learning is to then train the model for this function using training data (Gelman et al., 1995; Neal, 1998), in order to subsequently predict the unknown parameter, given the data at hand. If the observable is high-dimensional (tensor-valued in general), this function is rendered high-dimensional (tensor-variate) as well, leading us to the inverse learning of the unknown model parameter in a high-dimensional situation (Alquier et al., 2011). Conventional inverse problem approaches are typically in low dimensions (Cavalier, 2008; Tarantola, 2005). However, as the procurement of high-dimensional data becomes more common in different scientific disciplines (Zhao et al., 2014; Xu et al., 2012), we will more commonly encounter the difficult task of inversely learning a model parameter vector, subsequent to the supervised learning of a high-dimensional function.

We present a new method to perform Bayesian inverse learning on an unknown model parameter vector, given test and training data. The core of our methodology lies in the nonparametric supervised learning of the tensor-variate function of the model parameter vector that gives the observable, using tensor-variate training data. We achieve this by treating such a function as sampled from a Gaussian Process of corresponding dimensionality, i.e. a tensor-variate Gaussian Process, the mean and covariance structure of which we learn. Such modelling in turn implies that the joint probability of a set of realisations of this function (which equivalently is a set of values of the observable) is the tensor-normal density.

Earlier, Hoff (2011) used Tucker decomposition to extract the covariance matrices of a tensor-variate Gaussian Process (GP) and proceeded to compute a maximum likelihood solution for the covariance matrices and the mean tensor of the GP. Zhao et al. (2014) introduced tensor kernels in order to compute the distance between two tensors for which (tensor) kernel parametrisation has been undertaken; subsequently, they applied this to a graph classification problem. Hou et al. (2015) solved a tensor-variate Gaussian Process-based regression problem using a local least squares method; they focus on the complexity of tensor GP regression for large data sets. However, in our work the aim is to learn an unknown model parameter vector; this is accomplished via the supervised learning of the functional relationship between itself and the tensor-shaped data, where the said function is modelled using a tensor-variate GP.
Both the supervised learning of this function, as well as the inverse learning of the unknown model parameter vector, are undertaken in a Bayesian framework, using MCMC-based inference (Random-Walk Metropolis-Hastings or RW; see Robert and Casella (2004)). To this effect, we do indeed learn the covariance structure of the tensor-variate GP, though our aim takes us beyond this exercise. In fact, we propose three different ways of learning the multiple covariance matrices of the GP, depending on the availability of information and feasibility constraints (which are motivated by the dimensions of such matrices). Furthermore, we undertake the learning of the sought model parameter vector by performing RW-based sampling from the posterior predictive density of the unknowns, given the test+training data and the learnt model of the GP (covariance and mean structure), as well as by performing such sampling from the joint posterior probability density of the model parameter and the GP parameters, given all data.

We implement this methodology to learn the location vector of the Sun in the disk of the Milky Way galaxy. The basis of this ambition lies in the fact that Galactic features affect stellar velocities, so that in principle, an observed set of stellar velocity vectors can be treated as a function of the vector of the Galactic feature parameters (see Chakrabarty et al. (2015)). However, the feature parameters of interest are related to the galactocentric solar location in known ways. Thus, equivalently, the observed set of stellar velocity vectors can be treated as a function of the solar location vector. Here the set of velocity vectors, i.e. the matrix of stellar velocity components, is that of a chosen number of stellar neighbours of the Sun. It is only after learning the said function that we can predict the unknown solar location, given the data. Here, training data comprises matrices, each row of which is the velocity vector of a stellar neighbour of the Sun, with each such matrix generated at a fiducial solar location (design point), in astronomical simulations. Thus the training data is a 3-tensor (a set of matrices, each generated at a design point). The test data is available as a matrix of velocity measurements of stellar neighbours of the Sun (Chakrabarty, 2007). We learn the aforementioned function by modelling it with a high-dimensional Gaussian Process, the parameters of which we learn using MCMC techniques. Thereafter, we perform inverse Bayesian learning of the unknown solar location vector.
Let the observable V be a (k−1)-th rank tensor, i.e. V ∈ R^{m_1 × m_2 × ... × m_{k−1}}, where m_j is a positive integer, j = 1, ..., k−1. We treat V as an unknown function of the model parameter S, where S ∈ R^d. Thus, we define V = ξ(S), where ξ(·) is this unknown function; by virtue of this equation, ξ(·) is a (k−1)-tensor-variate function itself. We are going to predict the value s^(test) of S at the new or test data v^(test). In order to do this, the unknown tensor-variate function ξ(·) needs to be learnt, given the training data D := {(s^(*)_1, v_1), ..., (s^(*)_n, v_n)}, where s^(*)_i is the i-th design point at which the value v_i of the observable is generated, i = 1, ..., n. Such supervised learning can be done using parametric regression techniques (such as fitting with splines/wavelets). In the conventional inverse problem approach, the ξ(·) learnt using such techniques will thereafter need to be inverted, and this inverse operated upon the test data, to yield the value s^(test) of S at which the test data is realised. The shortcoming of using the method of splines/wavelets is that the correlations between the components of the high-dimensional function ξ(·) are not properly learnt. Moreover, such parametric regression causes computational difficulties as the dimensionality of the observable increases.

This drives us to use Gaussian Process (GP) based methods: we treat the (k−1)-tensor-variate function as a realisation from a (k−1)-tensor-variate GP. Upon learning the parameters of this GP using the training data, we are then able to write the posterior predictive of S given the test+training data and the GP parameters. This is the standard supervised learning scheme that we are going to use here, except that in one case we sample from the posterior predictive of s^(test) given all data, and alternatively, perform posterior sampling from the joint posterior probability density of all unknowns, given all data (test and training). In each case, we extract the marginal posterior of each parameter given the data, using our MCMC-based inference scheme (see Section 3 of the supplementary material).

We treat ξ(·) as a realisation from a tensor-variate GP, where the rank of this tensor-variate process is k−1. It then follows that the observable V is also a realisation from this GP. As a result, the set of n realisations of this observable, i.e. the training data D = {v_1, ..., v_n}, follows a k-th rank tensor normal distribution with mean tensor M and k covariance matrices Σ_1, ..., Σ_k: {v_1, ..., v_n} ∼ TN_k(M, Σ_1, ..., Σ_k), so that the likelihood of the mean tensor M and the covariance matrices Σ_1, ..., Σ_k, given the training data, is a tensor-normal density:

p(D | M, Σ_1, ..., Σ_k) ∝ exp( −‖ (D − M) ×_1 A_1^{−1} ×_2 ... ×_k A_k^{−1} ‖² / 2 ),     (1)

where the covariance matrix Σ_p = A_p A_p^T, p = 1, ..., k, i.e. A_p is the unique square-root of the positive semi-definite covariance matrix Σ_p. The operator "×_p" represents the p-mode multiplication of a tensor with a matrix. The tensor-normal distribution is extensively discussed in the literature (Xu et al., 2012; Hoff, 2011).

If we were to propose to learn each element of each covariance matrix using MCMC (or at least the upper triangle of each such matrix, invoking the symmetry of covariance matrices), we would be committing to the learning of a very large number of parameters indeed. The computational complexity also increases rapidly as the number of observations increases.
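As a concrete illustration of the p-mode product and of the tensor-normal log-likelihood implied by Equation 1, the following minimal numpy sketch may help; the function names `mode_multiply` and `tensor_normal_loglik` are ours, and we use the Cholesky factor of each Σ_p as the square root A_p (any A_p with A_p A_p^T = Σ_p yields the same quadratic form).

```python
import numpy as np

def mode_multiply(T, M, p):
    """p-mode product T x_p M: contract mode p of tensor T with the rows of matrix M."""
    # tensordot contracts T's axis p with M's axis 1; moveaxis restores the new axis to position p
    return np.moveaxis(np.tensordot(T, M, axes=(p, 1)), -1, p)

def tensor_normal_loglik(D, Mean, Sigmas):
    """Log of the tensor-normal density of Eq. (1), including the normalising constant.

    D, Mean : arrays of shape (m_1, ..., m_k); Sigmas : list of the k covariance matrices."""
    R = D - Mean
    m = np.array(D.shape)
    logdet = 0.0
    for p, S in enumerate(Sigmas):
        A_inv = np.linalg.inv(np.linalg.cholesky(S))   # A_p^{-1}, with Sigma_p = A_p A_p^T
        R = mode_multiply(R, A_inv, p)
        logdet += (np.prod(m) / m[p]) * np.linalg.slogdet(S)[1]
    return -0.5 * (np.sum(R ** 2) + logdet + np.prod(m) * np.log(2 * np.pi))
```

The same expression, with one covariance matrix held at an empirical estimate, is re-used for the likelihoods written later in the paper.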
In order to render the task computationally tractable, one possibility is to use kernel parametrisation of the covariance matrices Σ_1, ..., Σ_k, and then learn the parameters of these kernels using MCMC.

Of the i-th covariance matrix, let the jp-th element be σ^(i)_{jp}. Then σ^(i)_{jp} bears information about the covariance amongst the slices of the data set achieved at the j-th and p-th input variables. The "input variable" referred to here is the variable in the input space, i.e. the model parameter S. We define σ^(i)_{jp} = K_i(s_j, s_p), where K_i(s_j, s_p) is the kernel function K_i(·,·) computed at the j-th and p-th input variables. Thus, the number of unknown parameters involved in the learning of the covariance function of the high-dimensional GP reduces to the number of hyper-parameters of the kernel function K_i(·,·), i = 1, ..., k. In other words, such kernel parametrisation does help reduce the number of unknowns that need to be inferred upon, using MCMC.

However, there are two situations in which we might opt out of practising kernel parametrisation. Firstly, such parametrisation may cause information loss that may not be acceptable; one may then resort to learning each element of the covariance matrix (Aston and Kirch, 2012). Another situation in which we avoid kernel parametrisation is the following. Let us consider the i-th covariance matrix that holds information about the covariance amongst slices of the training data achieved at distinct indices for i, i.e. along the i-th direction. We may not always be aware of the variable in the input space that takes different values at the different i-indices. In such situations, kernel parametrisation of elements of the covariance matrix is not possible, since such kernels need to be computed at pairwise different values of the input variable.

In such situations, we would opt to learn the elements of the covariance matrix directly using MCMC. However, as discussed above, this can be computationally daunting. If the computational task is rendered too time intensive, we will instead perform an empirical estimation of the i-th covariance matrix. An empirical estimate can be obtained by collapsing each high-dimensional data slice along all-but-one directions (other than the i-th direction), to achieve a vector in place of the original high-dimensional slice at the i-th index value. The vectors at the different i-indices then possess the compressed information from all the relevant dimensions of the data variable. The covariance matrix Σ_i is then approximated by an empirical estimate of the covariance amongst such vectors.

Indeed, such an empirical estimate of any covariance matrix may be easily generated, but it indulges in linearisation amongst the different dimensionalities of the observable. So when the Σ_i covariance matrix bears information about high-dimensional slices of the data at the different i-indices, such linearisation may cause loss of information about the covariance structure.

In summary, we model the covariance matrices as kernel-parametrised, or empirically estimated, or learnt directly using MCMC. A computational worry is the burden of inverting any of the covariance matrices; for a covariance matrix that is m_i × m_i, the computational order for matrix inversion is well known to be O(m_i^3) (Knuth, 1997).
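The collapse-based empirical estimate described above can be realised, for instance, as follows; the collapse itself is not fully specified in the text, so averaging over the discarded modes is our assumption, and the helper name `empirical_sigma_i` is ours.

```python
import numpy as np

def empirical_sigma_i(D, i):
    """Rough empirical estimate of Sigma_i: collapse each slice of the data tensor D,
    taken at a fixed index along mode i, over all remaining modes but one, and
    compute the covariance amongst the resulting vectors."""
    D = np.moveaxis(np.asarray(D, dtype=float), i, 0)        # bring mode i to the front
    m_i = D.shape[0]
    collapsed = D.reshape(m_i, D.shape[1], -1).mean(axis=2)  # one vector per i-index
    return np.cov(collapsed)                                  # m_i x m_i estimate of Sigma_i
```

As cautioned above, this linearisation across the collapsed modes can lose covariance information.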
In our application, when we implement kernel parametrisation, we choose to use the Squared Exponential (SQE) covariance kernel. However, other kinds of kernel functions can be used. In the application discussed in this paper, the data is continuous and we assume the covariance structure to be stationary. It is recalled that the SQE form can be expressed as

K(s_j, s_p) = a_{jp} exp( −(s_j − s_p)^T Q (s_j − s_p) ),     (2)

where s_j is the j-th value of the input vector and Q is a diagonal matrix, an element of which is the reciprocal of a correlation length scale. These correlation length scales are then unknown parameters that are learnt from the data using MCMC. As we see from Equation 2, for S ∈ R^d, Q is a d × d square diagonal matrix, so that there are d unknown correlation lengths to learn using MCMC for a given covariance matrix. Here a_{jp} is the amplitude of the covariance. However, we set a_{jp} exp( −(s_j − s_p)^T Q (s_j − s_p) ) ≡ A exp( −(s_j − s_p)^T Q′ (s_j − s_p) ), where A is a global scale such that all local amplitudes are < 1, and the diagonal matrix Q′ contains the reciprocals of the correlation length scales, modulated by such local amplitudes. The global scale A is subsumed as the scale of one of the covariance matrices of the tensor-normal distribution at hand; this is the matrix, the elements of (the upper/lower triangle of) which are learnt directly using MCMC, i.e. without resorting to any parametrisation or to any form of empirical estimation.

We write the likelihood using Equation 1. Using this likelihood and adequately chosen priors on the unknown parameters, we can write the posterior density of the unknown parameters given the training data, where the unknowns include parameters of the covariance matrices (if kernel parametrised) or elements of such a matrix itself (if being directly learnt using MCMC). Once this is achieved, we use the RW MCMC technique to sample from the joint posterior probability density of the unknown parameters of each covariance matrix and the mean tensor, given the training data. We generate the marginal posterior probability densities of each unknown parameter given training data, and identify the 95% Highest Probability Density (HPD) credible regions on each parameter. Thereafter, the prediction of s^(test) can be performed.

We treat the tensor-variate normal distribution as the sum of a mean function and a zero-mean normal distribution. The mean tensor is M ∈ R^{m_1 × m_2 × ... × m_k}. We work in this application by removing an estimate of the mean tensor from the non-zero-mean tensor-normal distribution. In this application, we have used the maximum likelihood estimate of the mean tensor.
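A minimal sketch of the SQE covariance of Equation 2, evaluated over a set of design points, might look as follows (assuming the local amplitudes have already been absorbed into a global scale, as described above); `sqe_covariance` is our name for this helper.

```python
import numpy as np

def sqe_covariance(S, q_diag, amplitude=1.0):
    """Covariance matrix under the SQE kernel of Eq. (2):
    sigma_jp = a * exp(-(s_j - s_p)^T Q (s_j - s_p)), with Q = diag(q_diag)."""
    S = np.asarray(S, dtype=float)            # n x d matrix of design points s_1, ..., s_n
    diff = S[:, None, :] - S[None, :, :]      # pairwise differences, shape (n, n, d)
    quad = np.einsum('jpd,d,jpd->jp', diff, q_diag, diff)
    return amplitude * np.exp(-quad)

# e.g. for d = 2 and reciprocal correlation lengths q11, q22:
# Sigma1 = sqe_covariance(design_points, q_diag=np.array([q11, q22]))
```

The q_diag entries are exactly the reciprocal correlation length scales that the RW chain learns.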
We are going to illustrate our method using an application on astronomical data. Following on from the introductory section, in this application the set of measurements that is invoked to allow for the learning of the location of the Sun in the Milky Way (MW) disk (assumed to be a two-dimensional object) happens to be the velocities of the stars that surround the Sun, as measured by the observer, i.e. us, seated at the Sun, since on galactic scales, our location on the MW disk is equivalent to the solar location. A sample of our neighbouring stars had their velocity vectors measured by the Hipparcos satellite (see Chakrabarty (2007) for details). Thus, this measured or test data is a matrix, each row of which is a star's velocity vector.

If a sample of stars is allowed to evolve under the influence of certain Galactic features, from a primordial time to the current, these features will drive stars of different velocities to different locations on the MW disk. Thus, the stars that end up in the neighbourhood of the Sun have velocity vectors as given by the test data matrix, because of the influence of the Galactic features on them. Thus, the matrix of velocities of stars in the solar neighbourhood is related to the parameters of these Galactic features. Since such feature parameters can be scaled to the galactocentric location vector of the Sun (discussed below in Section 4.2), we treat the matrix of stellar velocities V as functionally related to the solar or observer location vector S, i.e. V = ξ(S). Here, for V ∈ R^{m × 2} and S ∈ R^d, ξ : R^d → R^{m × 2}.

We learn this function ξ(·) using training data that comprises n pairs of values of a chosen solar location vector and the stellar velocity matrix generated at this solar location. Thus, the full training data is a 3-tensor comprising n matrices of dimensions m × 2, each of which is generated at a design point s_i, i = 1, ..., n. The training data is the output of astronomical simulations presented by Chakrabarty (2007). In this application, we will learn the covariance structure of the training data and predict the value of the solar/observer location parameter S at which the measured or test data is realised.

In Chakrabarty et al. (2015), the matrix of velocities was vectorised, so that the observable was then a vector. In our case, the observable is V, a matrix. By this process of vectorisation, Chakrabarty et al. (2015) miss out on the opportunity to learn the covariance amongst the columns of the velocity matrix (i.e. amongst the components of the velocity vector) as distinguished from the covariance amongst the rows (i.e. amongst the stars that are at distinct relative locations with respect to the observer). Our work allows for clear quantification of such covariances. More importantly, our work provides a clear methodology for learning, given high-dimensional data comprising measurements of a tensor-valued observable.

In our application, the location vector of the observer is 2-dimensional, i.e. d = 2, since the Milky Way disk is assumed to be 2-dimensional. Each stellar velocity vector is also 2-dimensional. Chakrabarty (2007) generated the training data by first placing a regular 2-dimensional polar grid on a chosen annulus in an astronomical model of the MW disk. At the centroid of each grid cell, an observer was placed. There were n grid cells, so there were n observers placed in this grid, such that the i-th observer measured the velocities of all the stars that landed in her grid cell at the end of a simulated evolution of a sample of stars, evolved in this model of the MW disk under the influence of the feature parameters that mark this MW model. We indexed these stars by their location with respect to the observer inside the grid cell, and took a stratified sample of m stars from this collection while maintaining the order by stellar location inside each grid cell; i = 1, ..., n.
Thus, each of the observers records a sheet of information that contains the 2-dimensional velocity vectors of m stars, i.e. the training data comprises n matrices, each of dimension m × 2, i.e. the training data is a 3-tensor. We call this tensor D (of dimension n × m × 2).
We realise that the i-th such velocity matrix or sheet is realised at the observer location s_i that is the i-th design point in our training data. We use n = 216 and m = 50. The test data measured by the Hipparcos satellite is then the 217-th sheet, except we are not aware of the value of S at which this sheet is realised. We clarify that in this polar grid, the observer location S is given by 2 coordinates: the first, S_1, tells us the radial distance between the Galactic centre and the observer, while the second coordinate, S_2, denotes the angular separation between a straight line that joins the Galactic centre to the observer, and a pre-fixed axis in the MW. This axis is chosen to be the long axis of an elongated bar of stars that rotates, pivoted at the Galactic centre, as per the astronomical model of the MW that was used to generate the training data.

As mentioned above, the maximum likelihood estimate of the mean tensor is removed from the data to allow us to work with a zero-mean tensor normal density that represents the likelihood.

Since the data is a 3-tensor (built of n observations of the 50 × 2-dimensional matrix-variate observable V), the likelihood is a 3-tensor normal distribution, with zero mean tensor (following the removal of the estimated mean) and 3 covariance matrices that measure:
– the amongst-observer-location covariance (Σ_1, 216 × 216),
– the amongst-stars-at-different-relative-position-w.r.t.-observer covariance (Σ_2, 50 × 50), and
– the amongst-velocity-component covariance (Σ_3, 2 × 2).

We perform kernel parametrisation of Σ_1, using the SQE kernel, such that the jp-th element of Σ_1 is [σ^(1)_{jp}] = exp( −(s_j − s_p)^T Q (s_j − s_p) ), j, p = 1, ..., 216. Since S is a 2-dimensional vector, Q is a 2 × 2 diagonal matrix, the diagonal elements q^(1)_{11} and q^(1)_{22} of which represent the reciprocals of the correlation length scales. Indeed, our model of the covariance function suggests the same correlation length scales between any two sampled functions, and this is a simplification. Thus, the learning of Σ_1 has been reduced to the learning of 2 correlation length scale parameters. Here the correlation length scales that form the elements of the diagonal matrix Q are amplitude-modulated, as discussed above in the paragraph following Equation 2.

The covariance matrix Σ_2 (50 × 50) bears information about covariance amongst stars that are at the same relative position w.r.t. the observer who observes them. There is no clear physical interpretation of what such a covariance means. We realise that we are not aware of any input variable in the training data at which the different (horizontal) sheets containing velocities of such stars are attained in the data. Therefore, we would need to learn the elements of this matrix directly using MCMC, which would however imply that (50 × 49)/2 + 50 = 1275 elements have to be directly learnt. The computational burden of this task being unacceptably daunting, we resort to performing an empirical estimate of this covariance matrix. Let [v^(b)_{st}] be the 216 × 2 matrix realised as the horizontal slice taken at the b-th star index in the training data tensor. Assume that the covariance matrix Σ_2 bears information about the covariance amongst such "horizontal slices" taken at different values of b. Let the bc-th element of Σ_2 be e_{bc}, with b = 1, ..., 50, c = 1, ..., 50. We can write the estimate of e_{bc} as

ê_{bc} = (1/2) Σ_{t=1}^{2} [ 1/(216 − 1) Σ_{s=1}^{216} ( v^(b)_{st} − v̄^(b)_t )( v^(c)_{st} − v̄^(c)_t ) ],

where v̄^(b)_t = (1/216) Σ_{s=1}^{216} v^(b)_{st} is the sample mean of the t-th column of the matrix [v^(b)_{st}].
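For concreteness, the estimate ê_bc above can be computed in one pass over the training tensor; the function name `empirical_sigma2` is ours, and the 1/(216−1) normalisation follows our reading of the expression above.

```python
import numpy as np

def empirical_sigma2(D_train):
    """Empirical estimate of the 50 x 50 amongst-stars covariance matrix Sigma_2.

    D_train has shape (216, 50, 2): design points x stars x velocity components.
    For star indices b and c, hat{e}_bc averages over the two velocity components
    the sample covariance, across the 216 design points, of the slices at b and c."""
    n, m, k = D_train.shape                      # 216, 50, 2
    centred = D_train - D_train.mean(axis=0)     # remove the per-star, per-component mean over design points
    # sum over design points s and components t, then normalise by (n - 1) and by the 2 components
    return np.einsum('sbt,sct->bc', centred, centred) / ((n - 1) * k)   # shape (50, 50)
```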
Σ_3 measures the covariance amongst the matrices or sheets obtained at the distinct components of the velocity vector. As there are only 2 such components, there are 2 such sheets. However, we are not aware of any input variable at which these sheets are realised. Therefore we need to learn the elements of this matrix directly from MCMC. As the covariance matrix is symmetric, we need to learn only 3 of the 4 elements. We are going to learn the two diagonal elements and the one distinct off-diagonal element of the Σ_3 matrix. The two diagonal elements will be learnt by our MCMC algorithm directly. However, the off-diagonal element σ^(3)_{12} can be written as σ^(3)_{12} = ρ sqrt( σ^(3)_{11} σ^(3)_{22} ), where ρ is the correlation amongst these two vertical sheets in D. Thus, instead of learning σ^(3)_{12} directly, we choose to learn the correlation parameter ρ, using our MCMC algorithm.

Thus, from the training data alone, we have 5 parameters of the covariance structure to learn: q^(1)_{11}, q^(1)_{22}, σ^(3)_{11}, ρ, σ^(3)_{22}, where these parameters are defined via

Q = diag( q^(1)_{11}, q^(1)_{22} );   Σ_3 = [ σ^(3)_{11}, σ^(3)_{12} ; σ^(3)_{12}, σ^(3)_{22} ];   ρ = σ^(3)_{12} / sqrt( σ^(3)_{11} σ^(3)_{22} ).

The likelihood of the training data given the GP parameters is then given as per Equation 1:

ℓ(D | q^(1)_{11}, q^(1)_{22}, σ^(3)_{11}, σ^(3)_{22}, ρ) = (2π)^{−m/2} ( ∏_{i=1}^{3} |Σ_i|^{−m/(2 m_i)} ) exp( −‖ (D − M̂) ×_1 A_1^{−1} ×_2 Â_2^{−1} ×_3 A_3^{−1} ‖² / 2 ),     (3)

where Σ_p = A_p A_p^T, p = 1, 3, M̂ is the empirical estimate of the mean tensor, and Σ̂_2 is the empirical estimate of the covariance matrix Σ_2, such that Σ̂_2 = Â_2 Â_2^T. Here m_1 = 216, m_2 = 50, m_3 = 2, and m = m_1 m_2 m_3.

This allows us to write the joint posterior probability density of the unknown parameters given the training data. We generate posterior samples from it using MCMC. To write this posterior, we impose non-informative priors π(·) on each of our unknowns (uniform on q^(1)_{11} and q^(1)_{22}, and Jeffreys on Σ_3). The posterior probability density of our unknown GP parameters, given the training data, is then

π( q^(1)_{11}, q^(1)_{22}, σ^(3)_{11}, σ^(3)_{22}, ρ | D ) ∝ ℓ( D | Σ_1, Σ_3 ) × π( q^(1)_{11} ) π( q^(1)_{22} ) π( Σ_3 ).     (4)

The results of our learning and estimation of the mean and covariance structure of the GP used to model this tensor-variate data are discussed below in Section 4.
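Putting the pieces together, a sketch of the unnormalised log-posterior of Equation 4, reusing `sqe_covariance` and `tensor_normal_loglik` from the earlier sketches, could read as below; the Jeffreys prior on the 2 × 2 matrix Σ_3 is taken as |Σ_3|^{−3/2}, which is our reading, and the function name `log_posterior_training` is ours.

```python
import numpy as np

def log_posterior_training(params, D_train, M_hat, Sigma2_hat, design_points):
    """Unnormalised log-posterior of Eq. (4): tensor-normal likelihood of Eq. (3)
    with Sigma_2 fixed at its empirical estimate, uniform priors on q11, q22,
    and a Jeffreys-type prior on Sigma_3."""
    q11, q22, s11, s22, rho = params
    if min(q11, q22, s11, s22) <= 0 or not -1.0 < rho < 1.0:
        return -np.inf                                   # proposal outside the support
    Sigma1 = sqe_covariance(design_points, np.array([q11, q22]))
    s12 = rho * np.sqrt(s11 * s22)
    Sigma3 = np.array([[s11, s12], [s12, s22]])
    loglik = tensor_normal_loglik(D_train, M_hat, [Sigma1, Sigma2_hat, Sigma3])
    log_prior = -1.5 * np.linalg.slogdet(Sigma3)[1]      # Jeffreys prior |Sigma_3|^{-3/2}
    return loglik + log_prior
```

This is the target density that the RW chain run on the training data alone samples from.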
After learning all GP parameters, we are going to predict the location of the Sun, s^(test), i.e. of the observer who observed the velocity matrix of nearby stars as in the test data. This test data v^(test) is a 50 × 2 matrix or slice, and includes measurements of the velocities of 50 stars that are neighbours of the Sun. We use two different methods for making inference on s^(test) = (s^(test)_1, s^(test)_2)^T, discussed below.

In one method, we learn the GP parameters and s^(test)_1 and s^(test)_2 simultaneously from the same MCMC chain, run using both training and test data. The tensor that includes both test and training data has dimensions 217 × 50 × 2. We call this augmented data D* = {v_1, ..., v_216, v^(test)}, to distinguish it from the training data D. This 217-th sheet of (test) data is realised at the unknown value s^(test) of S, and upon its addition, the updated covariance amongst the sheets generated at the different values of S is renamed Σ_1*, which is now rendered 217 × 217-dimensional. Then Σ_1* includes information about s^(test) via the SQE-based kernel parametrisation discussed in Section 2. The effect of the inclusion of the test data on the other covariance matrices is less; we refer to them as the (empirically estimated) Σ̂_2* and Σ_3*. The updated (empirically estimated) mean tensor is M̂*. The likelihood for the augmented data is

ℓ( D* | s^(test), Σ_1*, Σ_3* ) = (2π)^{−m/2} ( ∏_{i=1}^{3} |Σ_i*|^{−m/(2 m_i)} ) exp[ −‖ (D* − M̂*) ×_1 (A_1*)^{−1} ×_2 (Â_2*)^{−1} ×_3 (A_3*)^{−1} ‖² / 2 ],     (5)

where Â_2* is the square root of Σ̂_2*. Here m_1 = 217, m_2 = 50, m_3 = 2, and m = m_1 m_2 m_3. Here A_1* is the square root of Σ_1* and depends on s^(test).

The posterior of the unknowns given the test+training data is

π( s^(test)_1, s^(test)_2, Σ_1*, Σ_3* | D* ) ∝ ℓ( D* | s^(test)_1, s^(test)_2, Σ_1*, Σ_3* ) × π( s^(test)_1 ) π( s^(test)_2 ) π( q^(1*)_{22} ) π( q^(1*)_{11} ) π( Σ_3* ).     (6)

As discussed above, we use non-informative priors on all GP parameters and uniform priors on s^(test)_1 and s^(test)_2. So π( s^(test)_p ) = U( l_p, u_p ), p = 1, 2, where l_p and u_p are chosen depending on the spatial boundaries of the fixed area of the Milky Way disk that was used in the astronomical simulations of Chakrabarty (2007). Recalling that the observer is located in a two-dimensional polar grid, Chakrabarty (2007) set the lower boundary on the value of the angular position of the observer to 0 and the upper boundary to π/2 radians, i.e. 90 degrees, where the observer's angular coordinate is the angle made by the observer-Galactic centre line to the long axis of the elongated Galactic bar made of stars that rotates pivoted at the Galactic centre (discussed in Section 1). The observer's radial location is maintained within the interval [1.7, 2.3] in model units, where the model units for length are related to the galactic unit for length, as discussed in Section 4.2.

In the second method, we infer s^(test) by sampling from the posterior predictive of s^(test) given the test+training data and the modal values of the parameters q^(1)_{11}, q^(1)_{22}, σ^(3)_{11}, ρ, σ^(3)_{22} that were learnt using the training data. Thus, here Σ_1* = [(σ^(1*))_{jp}], j, p = 1, ..., 217, where (σ^(1*))_{jp} = exp( −(s_j − s_p)^T Q (s_j − s_p) ), with the unknown s_217 = s^(test), and the diagonal elements of the diagonal matrix Q given as the q^(1)_{11} and q^(1)_{22} learnt using training data alone. Similarly, Σ_3 is retained as learnt using the training data alone. The posterior predictive of s^(test) is

π( s^(test)_1, s^(test)_2 | D*, Σ_1*, Σ_3 ) ∝ ℓ( D* | s^(test)_1, s^(test)_2, Σ_1*, Σ_3 ) × π( s^(test)_1 ) π( s^(test)_2 ) × π( q^(1*)_{22} ) π( q^(1*)_{11} ) π( Σ_3 ),     (7)

where ℓ( D* | s^(test)_1, s^(test)_2, Σ_1*, Σ_3 ) is as given in Equation 5, with Σ_3* replaced by Σ_3. The priors on s^(test)_1 and s^(test)_2 are as discussed above. For all parameters, we use Normal proposal densities with experimentally chosen variances.
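To make explicit how s^(test) enters the augmented likelihood of Equations 5-7, note that only the last row and column of Σ_1* depend on it; a sketch, reusing `sqe_covariance` from above (the name `augmented_sigma1` is ours), is:

```python
import numpy as np

def augmented_sigma1(design_points, s_test, q_diag):
    """217 x 217 covariance Sigma_1^*, obtained by appending the unknown s_test to the
    216 design points and re-evaluating the SQE kernel of Eq. (2). Each proposed s_test
    inside the RW chain changes the likelihood of Eq. (5) only through the last row/column."""
    S_aug = np.vstack([design_points, np.asarray(s_test, dtype=float)[None, :]])
    return sqe_covariance(S_aug, q_diag)
```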
In Figure 1 of the supplementary section, we present the trace of the likelihood of the training data given the 5 unknown GP parameters, as well as the traces of the marginal posterior probability density of these unknowns q^(1)_{11}, q^(1)_{22}, σ^(3)_{11}, σ^(3)_{22}, ρ, given training data. The stationarity of the traces betrays the achievement of convergence of the chain. The marginal posterior probability densities of each unknown parameter given training data alone are displayed as histograms in Figure 1. 95% HPD credible regions computed on each learnt parameter given training data alone are displayed in Table 1 of the supplementary section.

Figure 1: Histogram representations of marginal posterior probability densities of the 5 sought GP parameters, generated by RW, given the training data.

We notice that the reciprocal correlation length scale q^(1)_{11} is an order of magnitude higher than q^(1)_{22}; correlation between values of the sampled function ξ(·), computed at 2 different s_1 values and the same s_2, then wanes more quickly than correlation between sampled functions computed at the same s_1 and different s_2 values. Here s = (s_1, s_2)^T and, given that S is the location of the observer who observes the velocities of her neighbouring stars on a two-dimensional polar grid, S_1 is interpreted as the radial coordinate of the observer's location in the Galaxy and S_2 as the observer's angular coordinate. Then it appears that the velocities measured by observers at different radial coordinates, but at the same angle, are correlated over shorter length scales than velocities measured by observers at the same radial coordinate, but different angles. This is understood to be due to the astro-dynamical influences of the Galactic features included by Chakrabarty (2007) in the simulation that generates the training data that we use here. This simulation incorporates the joint dynamical effect of the Galactic spiral arms and the Galactic bar (of stars) that rotate at different frequencies (as per the astronomical model responsible for the generation of our training data), pivoted at the centre of the Galaxy. An effect of this joint handiwork of the bar and the spiral arms is to generate distinctive stellar velocity distributions at different radial (i.e. along the S_1 direction) coordinates, at the same angle (s_2). On the other hand, the stellar velocity distributions are more similar at different S_2 values, at the same s_1. This pattern is borne out by Figure 9 of Chakrabarty (2004), in which the radial and angular variation of the standard deviations of these bivariate velocity distributions are plotted. Then it is understandable why the correlation length scales are shorter along the S_1 direction than along the S_2 direction.
Furthermore, for the correlation parameter ρ, physics suggests that the correlation should be zero between the two components of a velocity vector; these two components are, after all, the components of the velocity vector in a 2-dimensional orthogonal basis. However, the MCMC chain shows that there is a small (negative) correlation between the two components of the stellar velocity vector.

In the first method, we perform posterior sampling using RW from the joint posterior probability density of all parameters (GP parameters as well as the solar location vector), given test+training data. In Figure 2, we present the marginal posterior probability densities of the solar location coordinates s^(test)_1, s^(test)_2; of q^(1*)_{11} and q^(1*)_{22}, which get updated once the test data is added to augment the training data; and of the parameters σ^(3*)_{11}, σ^(3*)_{22} and ρ*. Traces of the likelihood and of the marginal posterior probabilities of all learnt parameters, given all data, are included in Figure 2 of the supplementary section. 95% HPD credible regions computed on each parameter in this inference scheme are displayed in Table 1 of the supplementary section.

Figure 2: Histogram representations of marginal densities of s^(test)_1 and s^(test)_2 and the 5 GP parameters, from an MCMC chain run using test and training data.

We notice that the value of the inverse correlation length q^(1*)_{11} has undergone substantial change with the introduction of the test data over the originally used training data, with the modal value increasing by a factor a little less than 2. However, q^(1*)_{22} is almost the same as q^(1)_{22}. Thus, the introduction of the test data has caused shorter correlation length scales along the S_1 direction to be imposed, i.e. two observers seated at two different radial coordinates will find their observed velocities correlated over even shorter length scales (where such velocities are part of the augmented data set) than when the training data alone is used. However, there is very little effect of the added information from the test data on q^(1)_{22}. This indicates that while the generation of the velocities of nearby stars at a given observer angular coordinate was done well in the astronomical simulations performed by Chakrabarty (2007), the simulations failed to adequately capture the generation of stellar velocities at a given radial location. At least, on a comparative note, the simulations were a better representative of the test data measured by the Hipparcos satellite when it came to the dependence of the velocities on the observer angular coordinate than on the observer radial coordinate.

The marginal distribution of s^(test)_1 indicates that the marginal is nearly bimodal, with modes at about 1.85 and 2 in model units. The distribution of s^(test)_2, on the other hand, is quite strongly skewed towards low values, though the probability mass in this marginal density falls sharply after about 0.4 radians, i.e. about 23 degrees. These values tally quite well with previous work (Chakrabarty et al., 2015). In that earlier work, using the training data that we use in this work (constructed using the astronomical model sp bar discussed by Chakrabarty et al. (2015)), the marginal distribution of s^(test)_1 was learnt to be bimodal, with modes at about 1.85 and 2, in model units; this is what we find in our inference scheme.
The distribution of s^(test)_2 found by Chakrabarty et al. (2015) is, however, more constricted, with a sharp mode at about 0.32 radians (i.e. about 20 degrees). We do notice a mode at about this value in our inference, but unlike in the results of Chakrabarty et al. (2015), we do not find the probability mass declining to low values beyond about 15 degrees. One possible reason for this lack of compatibility could be that in Chakrabarty et al. (2015), the matrix of velocities V was vectorised, so that the training data then resembled a matrix, rather than the 3-tensor that we know it to be. Such vectorisation could have led to some loss of correlation information, leading to the results of Chakrabarty et al. (2015).

When we predict s^(test) using test+training data, at the (modal values of the) GP parameters that are learnt from the training data, we generate samples from the posterior predictive of s^(test) (Equation 7) using RW. The marginal posterior predictive densities of s^(test)_1 and s^(test)_2 are shown in Figure 3. Traces of the likelihood and of the marginal posterior probabilities of the solar location parameters are shown in Figure 3 of the supplementary section.

Figure 3: Histograms representing marginal posterior predictive probability densities of s^(test)_1 and s^(test)_2, given test+training data and GP parameters that are learnt given training data alone.

Results from the marginal predictive densities have similarities with the results from the marginals obtained in the other inference scheme, shown in Figure 2. Firstly, the modes of s^(test)_1 at about 1.85 and 2 are again noticed in Figure 3. The posterior predictive of s^(test)_2 is noticed to bear high probability mass in the [0, 1] radian interval, though, as with the inference using the first method, here too there is a decline after about 0.4 radians. However, the small dip noticed in Figure 2 at about 0.3 radians is more pronounced here. In Chakrabarty et al. (2015), a secondary mode inward of about 0.3 radians was missed. Thus, the vectorisation-based approach used by Chakrabarty et al. (2015) is found to have missed correlation information at low angles, i.e. very close to the long axis of the Galactic bar.

The radial coordinate of the observer in the Milky Way, i.e. the solar radial location, is dealt with in model units, but will need to be scaled to the real galactic unit of distance, which is the kilo parsec (kpc). Now, from independent astronomical work, the radial location of the Sun is set as 8 kpc. Then our estimate of S_1 is to be scaled to 8 kpc, which gives 1 model unit of length to be 8 kpc divided by our estimate of S_1. Our main interest in learning the solar location is to find the frequency Ω_bar with which the Galactic bar is rotating, pivoted at the Galactic centre, loosely speaking. Here Ω_bar is obtained from the local circular speed v_c = 220 km/s and the above scaling of the model unit of length (see Chakrabarty (2007) for details). The solar angular location being measured as the angular distance from the long axis of the Galactic bar, our estimate of S_2 actually tells us the angular distance between the Sun-Galactic centre line and the long axis of the bar. These estimates are included in Table 2 of the supplementary section.
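The conversion from the sampled (s_1, s_2) to the Galactic feature parameters of Table 2 can be scripted as below; the relation Ω_bar = v_c × s_1 / 8 kpc encodes our reading of the scaling described above (one model unit of length = 8 kpc / s_1, with v_c = 220 km/s), so it should be checked against Chakrabarty (2007). The function name `galactic_features` is ours.

```python
import numpy as np

V_CIRC = 220.0      # km/s, local circular speed quoted in the text
R_SUN_KPC = 8.0     # kpc, adopted galactocentric radius of the Sun

def galactic_features(s1_model_units, s2_radians):
    """Convert a sampled solar location (s1 in model units, s2 in radians) into the
    bar rotational frequency Omega_bar (km/s/kpc) and the bar-Sun angle (degrees),
    assuming one model unit of length corresponds to R_SUN_KPC / s1 kpc."""
    model_unit_kpc = R_SUN_KPC / s1_model_units
    omega_bar = V_CIRC / model_unit_kpc          # = V_CIRC * s1 / R_SUN_KPC
    return omega_bar, np.degrees(s2_radians)
```

Applying this transformation to every (s_1, s_2) sample in a chain yields the samples from which the credible regions of Table 2 are then computed.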
Our aim here is to advance a general methodology that allows for covariance modelling of high-dimensional data sets, in order to then be able to perform inverse learning of the input variable, given new or test data, when such data become available. To this effect, we make a simple application of the method, and thereafter check our results against results obtained previously using the same data that we use. Our results compare favourably with previous work discussed in the literature (Chakrabarty et al., 2015).
Supplementary Section
Introduction
In this supplement, there are two sections. The first section presents some of the results of our inference, in 3 figures and 2 tables. The second section presents a schematic representation of the MCMC technique, namely the Random-Walk Metropolis-Hastings (RW), that we have used in our work to perform posterior sampling.
1. Results
In our application we model the functional relationship between the 2-dimensional location vector S of the observer and the observable, which is the 50 × 2 matrix V of 2-dimensional velocity vectors of 50 stars in the neighbourhood of a fiducial observer.

Figure 4 presents the trace of the likelihood of the training data given the unknown parameters of the GP that is used to model this functional relationship. Traces of the marginal posterior probability density of each of these GP parameters, given the training data, are also shown.

Figure 4: Traces of the likelihood and of the marginal posterior density of the GP parameters q^(1)_{11}, q^(1)_{22}, σ^(3)_{11}, σ^(3)_{22}, ρ, given training data.

Figure 5 presents the trace of the likelihood of the training+test data given the unknown parameters of the GP and the location s^(test) = (s^(test)_1, s^(test)_2)^T of the Sun at which the test data is realised. Traces of the marginal posterior probability density of each of these GP parameters and of the solar location, given the training+test data, are also shown. Details of this inference are discussed in Section 3.1 of the main paper.
Figure 5: Traces of the likelihood and of the marginal posterior density of the parameters s^(test)_1, s^(test)_2, q^(1)_{11}, q^(1)_{22}, σ^(3)_{11}, σ^(3)_{22}, ρ, given training and test data.

Figure 6 presents traces of the posterior predictive probability of the location parameters s^(test)_1 and s^(test)_2 of the Sun, given all data and the modal values of the parameters of the GP learnt using training data. Again, details of this inference are discussed in Section 3.1 of the main paper.

Figure 6: Traces of the posterior predictive density of the parameters s^(test)_1, s^(test)_2, given training+test data and the modal values of the five GP parameters listed above, that were learnt from the training data alone.
Table 1: 95% HPD credible regions on each learnt parameter, under the three inference schemes.

Parameter   | using only training data | sampling from posterior predictive | sampling from joint posterior
q^(1)_{11}  | [4572.4, 5373.2]         | -                                  | [7566.8, 8460.4]
q^(1)_{22}  | [82.50, 93.12]           | -                                  | [82.30, 93.44]
σ^(3)_{11}  | [0.9884, 1.0337]         | -                                  | [0.9848, 1.0310]
ρ           | [-0.0627, -0.0310]       | -                                  | [-0.0620, -0.0304]
σ^(3)_{22}  | [0.4087, 0.4270]         | -                                  | [0.4116, 0.4306]
s^(test)_1  | -                        | [1.7496, 2.0995]                   | [1.7547, 2.0816]
s^(test)_2  | -                        | [0.079, 0.7609]                    | [0.0393, 0.8165]

Table 2: 95% HPD credible regions on each Galactic feature parameter, derived from the solar location coordinates learnt using the two predictive inference schemes listed above, and as reported in a past paper for the same training and test data.

Inference scheme                | HPD for Ω_bar (km/s/kpc) | HPD for angular distance of bar to Sun (degrees)
from posterior predictive       | [48. , .73]              | [4. , . ]
from joint posterior            | [48. , . ]               | [ . , . ]
from Chakrabarty et al. (2015)  | [46. , .98]              | [17. , . ]

Table 1 summarises the 95% HPD credible regions of all learnt parameters, under the 3 different inference schemes, namely: learning the GP parameters from training data alone; learning the GP and solar location parameters by sampling from the joint posterior probability density of all parameters, given all data; and learning the solar location parameters by sampling from the posterior predictive of these, given all data and the modal values of the GP parameters learnt using training data alone.

Table 2 displays the Galactic feature parameters that derive from the learnt solar location parameters, under the different inference schemes, namely, sampling from the joint posterior probability of all parameters given all data, and sampling from the posterior predictive of the solar location coordinates given all data and GP parameters already learnt from training data alone. The derived Galactic feature parameters are the bar rotational frequency Ω_bar in the real astronomical units of km/s/kpc, and the angular distance between the bar and the Sun, in degrees. The table also includes results from Chakrabarty et al. (2015), the reference for which is in the main paper.
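The 95% HPD credible regions of Tables 1 and 2 can be extracted from MCMC samples with a helper such as the following (our construction, not from the paper); it returns the shortest interval containing 95% of the samples, which coincides with the HPD region only when the marginal is unimodal, so for the nearly bimodal s^(test)_1 marginal it is only an approximation.

```python
import numpy as np

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing `mass` of the posterior samples."""
    x = np.sort(np.asarray(samples, dtype=float))
    n_in = int(np.ceil(mass * len(x)))                 # number of samples inside the interval
    widths = x[n_in - 1:] - x[:len(x) - n_in + 1]      # widths of all candidate intervals
    j = int(np.argmin(widths))
    return x[j], x[j + n_in - 1]
```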
2. Random-Walk Metropolis-Hastings
In this section we discuss the MCMC technique used, to be precise, RW. We base our discussion on the inference that is made using the test+training data D*, on the unknown observer location coordinates s^(test)_1 and s^(test)_2 at which the test data is realised, the unknown correlation length scales q^(1*)_{11} and q^(1*)_{22} that enter the SQE parametrisation of the covariance matrix Σ_1*, the diagonal elements σ^(3*)_{11} and σ^(3*)_{22} of the covariance matrix Σ_3*, and the correlation ρ* that defines its off-diagonal elements. Here, the "*" superscript on the unknown parameters indicates that their values, as inferred using the augmented data D*, can differ from the values learnt with training data alone, which were marked with no asterisked superscript.

In the discussion of the inference scheme below, we refer to our 7 unknowns with the notation θ_1, ..., θ_7.

(1) Set the seed θ_i = θ^(0)_i, i = 1, ..., 7.

(2) In the t-th iteration, let the current value of θ_i be θ^(t−1)_i. Propose the value θ̃_i ∼ N(θ^(t−1)_i, σ_i²). Do this for each i = 1, ..., 7.

(3) Compute the acceptance ratio

α = π( θ̃_1, ..., θ̃_7 | D* ) / π( θ^(t−1)_1, ..., θ^(t−1)_7 | D* ).

Here the joint posterior π(· | D*) of the unknowns given the augmented data is given in Equation 6 of the main paper. Generate a uniform random number u ∼ U[0, 1]. If u ≤ α, accept the proposal: θ^(t)_i = θ̃_i ∀ i = 1, ..., 7. If u > α, set θ^(t)_i = θ^(t−1)_i ∀ i. Return to Step 2.
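A compact Python rendering of steps (1)-(3) is given below; the function and variable names are ours, `log_post` stands for the log of the unnormalised posterior of Equation 6, and `prop_sd` holds the experimentally chosen proposal standard deviations σ_i.

```python
import numpy as np

rng = np.random.default_rng(0)

def rw_metropolis(log_post, theta0, prop_sd, n_iter):
    """Random-Walk Metropolis-Hastings as in steps (1)-(3): all 7 unknowns are
    proposed jointly from independent Normals and accepted or rejected as a block."""
    theta = np.asarray(theta0, dtype=float)          # step (1): seed
    lp = log_post(theta)
    chain = np.empty((n_iter, theta.size))
    for t in range(n_iter):
        proposal = theta + prop_sd * rng.standard_normal(theta.size)   # step (2)
        lp_prop = log_post(proposal)
        # step (3): accept with probability min(1, alpha); compare on the log scale
        if np.log(rng.uniform()) <= lp_prop - lp:
            theta, lp = proposal, lp_prop
        chain[t] = theta
    return chain
```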
References

J. Ramsay and B. W. Silverman (eds.). Functional Data Analysis. Springer Series in Statistics. Springer, Philadelphia, 2014.

B. Hofmann. Ill-posedness and regularization of inverse problems: a review on mathematical methods. In H. Lubbig (ed.), The Inverse Problem. Symposium ad Memoriam H. v. Helmholtz, pages 45–66. Akademie-Verlag, Berlin; VCH, Weinheim, 2011.

A. Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM, Philadelphia, 2005.

A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis. Chapman and Hall, 1995. Second Edition.

R. M. Neal. Regression and classification using Gaussian process priors (with discussion). In J. M. Bernardo et al. (eds.), Bayesian Statistics 6, pages 475–501. Oxford University Press, 1998.

P. Alquier, E. Gautier, and G. Stoltz (eds.). Inverse Problems and High-Dimensional Estimation. Number 203 in Lecture Notes in Statistics. Springer-Verlag, Berlin Heidelberg, 2011.

L. Cavalier. Nonparametric statistical inverse problems. Inverse Problems, 24(3):034004, 2008.

Q. Zhao, G. Zhou, L. Zhang, and A. Cichocki. Tensor-variate Gaussian processes regression and its application to video surveillance. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 1265–1269. IEEE, 2014.

Z. Xu, F. Yan, and A. Qi. Infinite Tucker decomposition: Nonparametric Bayesian models for multiway data analysis. In Proceedings of the 29th International Conference on Machine Learning, 2012.

P. D. Hoff. Separable covariance arrays via the Tucker product, with applications to multivariate relational data. Bayesian Analysis, 6(2):179–196, 2011.

M. Hou, Y. Wang, and B. Chaib-draa. Online local Gaussian process for tensor-variate regression: Application to fast reconstruction of limb movements from brain signal. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 5490–5494. IEEE, 2015.

C. Robert and G. Casella. Monte Carlo Statistical Methods. Number 2 in Springer Texts in Statistics. Springer-Verlag, New York, 2004.

D. Chakrabarty, M. Biswas, and S. Bhattacharya. Bayesian nonparametric estimation of Milky Way parameters using matrix-variate data, in a new Gaussian Process based method. Electronic Journal of Statistics, 9(1):1378–1403, 2015.

D. Chakrabarty. Phase space around the solar neighbourhood. Astronomy & Astrophysics, 467:145, 2007.

J. A. D. Aston and C. Kirch. Evaluating stationarity via change-point alternatives with applications to fMRI data. Annals of Applied Statistics, 6(4):1906–1948, 2012.

D. Knuth. The Art of Computer Programming: Seminumerical Algorithms. Addison-Wesley Longman Publishing Co., Boston, MA, USA, 1997.

D. Chakrabarty. Patterns in the outer parts of galactic disks, 2004.