Health risk modelling by transforming a multi-dimensional unknown distribution to a multi-dimensional Gaussian
Varun Kapoor
MPI for Biophysical Chemistry, Göttingen
[email protected]
September 6, 2018
Abstract
The traditional approach to health risk modelling with multiple data sources proceeds via regression-based methods that assume a marginal distribution for the outcome variable. The data are collected for $N$ subjects over a $J$ time-period or from $J$ data sources. The response obtained from the $i$-th subject is $\vec{Y}_i = (Y_{i1}, \dots, Y_{iJ})$. For $N$ subjects we obtain a $J$-dimensional joint distribution. In this work we propose a novel approach: transforming any $J$-dimensional joint distribution to a $J$-dimensional Gaussian while keeping the Shannon entropy constant. This is in stark contrast to the traditional approach of assuming a marginal distribution for each $Y_{ij}$ and treating the $Y_{ij}$'s as independent observations. The transformation is implemented in our computer package ENTRA.

1 Introduction

Information about health outcomes in many epidemiological studies is obtained from multiple data sources or over a certain time period with multiple observations. The multiple data sources provide multiple measures of the same underlying variable, measured on a similar scale. As an example, $N$ adults are chosen for a high-blood-pressure study at the age of 18 and are asked about their diet, smoking and drinking habits, and their blood pressures are measured. The same subjects are monitored again over the course of time on the basis of the same variables. This illustrates the standard data-collecting exercise to measure health risk. Once such a data set is available we can begin the risk modelling, to ascertain the factors which contribute towards high blood pressure and those that contribute towards low blood pressure.

To put it mathematically, given $N$ subjects and $J$ data sources or time points, the data are collected as a $p$-dimensional vector of covariates, $\vec{X}_{ij} = (X_{ij1}, \dots, X_{ijp})$, where $1 \le i \le N$ and $1 \le j \le J$.
Given such a vector, the outcome is reported as a variable $Y_{ij}$ for the $i$-th subject and the $j$-th data source. We can then construct a $J$-dimensional vector $\vec{Y}_i = (Y_{i1}, \dots, Y_{iJ})$. In a $J$-dimensional space the above vector, for a given value of $i$, is a point. Since $1 \le i \le N$, the total number of points in the $J$-dimensional space equals $N$. These $N$ points follow a certain distribution which is a priori unknown. In the conventional analysis the outcome variables $Y_{ij}$ are treated as independent variables and nothing is assumed about the joint distribution in the $J$-dimensional space. The independence assumption is not correct, but as we will see in Section 2 this does not affect the statistical analysis that we intend to carry out.

We propose a novel analysis tool to carry out health risk modelling by transforming the $J$-dimensional, a priori unknown density to a Gaussian density whilst keeping the Shannon entropy constant. To do so we transform the $N$ $J$-dimensional vectors in a basis set consisting of divergence-free vector fields [3]. The condition that the basis set consist only of divergence-free vector fields enforces entropy conservation: entropy is conserved by volume-preserving maps, and our choice of basis set represents the flow of an incompressible fluid [1, 2] and hence is a volume-preserving map.

To determine the coefficients of these basis vectors such that the $J$-dimensional density becomes a Gaussian, the Karplus theorem is used [4, 5]. We will demonstrate in Section 4 how this theorem allows the basis coefficients to be determined in such a way that the $J$-dimensional density is transformed to a Gaussian.

The paper is organized as follows. In Section 2 the traditional approach to modelling health risk is reviewed and a novel approach is proposed. In Section 3 the construction of a basis set consisting of high-dimensional vector fields is shown.
In Section 4 the question of determining the coefficients of the basis set is settled. The algorithm and program structure are discussed in Section 5. In Section 6 a test case is computed to see how the transformation works in practice. Finally, we summarize the work done so far and make proposals for future work.

2 Traditional approach to health risk modelling

Consider a trial to model high blood pressure with $N$ subjects monitored over a $J$ time-period. The response obtained from the $i$-th subject can be written as a $J$-dimensional vector,

$$\vec{Y}_i = (Y_{i1}, \dots, Y_{iJ}). \quad (1)$$

Traditionally a joint distribution for the $N$ subjects is not specified. Instead a working generalized linear model (GLM) is used to describe the marginal distribution of $Y_{ij}$, as in Liang and Zeger (1986) [6]:

$$f(Y_{ij}) = \exp\left[\frac{Y_{ij}\theta_{ij} - a(\theta_{ij})}{\phi} + b(Y_{ij}, \theta_{ij})\right]. \quad (2)$$

If the outcome $Y_{ij}$ is a binary random variable, the parameters of the above exponential family are

$$\phi = 1, \quad a(\theta_{ij}) = \log\left[1 + e^{\theta_{ij}}\right], \quad \theta_{ij} = \log\left[\frac{\mu_{ij}}{1 - \mu_{ij}}\right], \quad b(Y_{ij}, \theta_{ij}) = 0. \quad (3)$$

The probability of a favourable outcome for binary $Y_{ij}$ can be modelled via a logit function,

$$\mathrm{Logit}\left(P[Y_{ij} = 0 \mid \vec{X}_{ij}]\right) = \vec{X}_{ij}\vec{\beta}, \quad (4)$$

where $Y_{ij} = 0$ represents the favourable outcome, i.e. low risk of high blood pressure, whereas $Y_{ij} = 1$ corresponds to high risk of high blood pressure.

Given Eq. (2), the log-likelihood function can be written as

$$\ln L = \sum_{i=1}^{N} \sum_{j=1}^{J} Y_{ij}\theta_{ij} - a(\theta_{ij}). \quad (5)$$

To determine the regression parameters $\vec{\beta}$ we differentiate the log-likelihood with respect to $\vec{\beta}$, which gives the following estimating equation for the $\vec{\beta}$'s:

$$\frac{\partial \ln L}{\partial \vec{\beta}} = \sum_{i=1}^{N} \sum_{j=1}^{J} \vec{X}_{ij}\left(Y_{ij} - \mu_{ij}\right) = 0. \quad (6)$$

This is the traditional approach to determining the regression parameters that help us evaluate the factors which contribute towards high or low health risks, given the data.
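For binary outcomes, Eq. (6) is exactly the score equation of logistic regression and can be solved by Newton-Raphson iteration. The following is a minimal, stdlib-only sketch of that solution; it is illustrative, not part of ENTRA, and the function names and synthetic-data setup are our own.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_logistic(X, y, iters=25):
    """Newton-Raphson solution of sum_i x_i (y_i - mu_i) = 0, cf. Eq. (6)."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        score = [0.0] * p
        H = [[0.0] * p for _ in range(p)]  # observed Fisher information
        for xi, yi in zip(X, y):
            mu = sigmoid(sum(b * x for b, x in zip(beta, xi)))
            w = mu * (1.0 - mu)
            for a in range(p):
                score[a] += xi[a] * (yi - mu)
                for b2 in range(p):
                    H[a][b2] += w * xi[a] * xi[b2]
        delta = solve(H, score)
        beta = [b + d for b, d in zip(beta, delta)]
    return beta
```

On synthetic data generated from a known $\vec{\beta}$, the fitted coefficients recover the truth up to sampling noise, which is the sense in which the estimates remain valid even under the working independence assumption.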
Equation (6) was derived assuming that the outcome $Y_{ij}$ is a binary random variable; however, a similar equation can be derived when the outcome $Y_{ij}$ is of any other type.

In this approach it is assumed that all the $Y_{ij}$'s are independent observations. This assumption, although not correct, yields estimates for $\vec{\beta}$ which are valid, but their variances are not. However, valid standard errors can be obtained using techniques such as the empirical variance estimator [7].

Having reviewed the traditional approach we now present our idea. As remarked earlier, the joint distribution of the $N$ subjects in the $J$-dimensional space is not specified, as it is a priori unknown. However, if the unknown distribution is transformed to a $J$-dimensional Gaussian, then a valid analysis tool can be developed. We have developed an algorithm and a program which perform this transformation in an entropy-conserving way, i.e. the Shannon entropy is preserved during the transformation.

Preserving the Shannon entropy requires transforming the high-dimensional vectors in a basis set consisting of divergence-free vector fields, since such a set represents a volume-preserving map and thereby preserves the Shannon entropy. Such a construction of an orthocomplete basis set is available in any dimension [3]. We use this mathematical construct to transform the $N$ $J$-dimensional vectors. To determine the basis coefficients such that the $J$-dimensional density is transformed to a Gaussian, the Karplus theorem discussed in Section 4 is invoked. Once the coefficients are determined, the $J$-dimensional joint distribution of the subjects is a Gaussian having the same Shannon entropy as the starting distribution. This has been implemented in ENTRA.

Once the joint distribution for the $N$ subjects is known, the log-likelihood function can be written for that distribution and the regression parameters determined thereby.
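The claim that a volume-preserving map leaves the Shannon (differential) entropy unchanged can be checked numerically on a toy example. The sketch below is illustrative only and not taken from ENTRA: it applies the volume-preserving shear $(x, y) \mapsto (x + cy, y)$, whose Jacobian determinant is 1, to a 2D Gaussian sample and estimates the entropy before and after via histogramming.

```python
import math
import random

def hist_entropy_2d(pts, w=0.4):
    """Plug-in estimate of the differential entropy (in nats) of a 2D sample,
    from a histogram with square bins of side w."""
    counts = {}
    for x, y in pts:
        key = (math.floor(x / w), math.floor(y / w))
        counts[key] = counts.get(key, 0) + 1
    n = len(pts)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h + math.log(w * w)  # discrete entropy plus log of the bin area

random.seed(0)
pts = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50000)]
# volume-preserving shear: Jacobian [[1, c], [0, 1]] has determinant 1
sheared = [(x + 0.8 * y, y) for x, y in pts]
h_before = hist_entropy_2d(pts)
h_after = hist_entropy_2d(sheared)
```

The two estimates agree up to histogram bias, and the first is close to the exact entropy $\ln(2\pi e)$ of a 2D standard Gaussian, illustrating why only volume-preserving transformations are admitted into the basis set.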
In the next section we show the construction of the divergence-free vector fields used in our program ENTRA.

3 J-dimensional divergence-free vector fields

A mathematically rigorous construction of divergence-free smooth vector fields in any dimension was provided in [3]. We use it to construct a basis set consisting of $J$-dimensional divergence-free vector fields. To do so we define the following $J \times J$ matrix-valued operator:

$$\hat{O} = -I_{J \times J}\,\nabla^2 + \nabla\nabla^T. \quad (7)$$

Here the first term contains the Laplacian operator in $J$ dimensions, $I$ being the $J \times J$ identity matrix, and the second term consists of the column and row vectors of the gradient operator in $J$ dimensions.

This operator acts on a smooth scalar function which we construct from a $J$-dimensional vector $\vec{x}$ as

$$\phi_l\left(\|\vec{x} - \vec{x}_l\|\right) = e^{-\|\vec{x} - \vec{x}_l\|^2/(2\sigma^2)}, \quad (8)$$

where $\|\cdot\|$ denotes the Euclidean distance between two $J$-dimensional vectors. $\sigma$ is chosen to be a fixed fraction of $\Delta x$, the spacing between the basis vectors. The centre $\vec{x}_l$ is chosen as $l\,\Delta x$, where $l$ runs over $(-L, L)$.

Now we define a matrix-valued function by applying the operator in Eq. (7) to the scalar field in Eq. (8):

$$\Phi_{l\sigma}(\vec{x})_{J \times J} = \left\{-I_{J \times J}\,\nabla^2 + \nabla\nabla^T\right\}\phi_l\left(\|\vec{x} - \vec{x}_l\|\right) \quad (9)$$
$$= \left\{\frac{J-1}{\sigma^2} - \frac{\|\vec{x} - \vec{x}_l\|^2}{\sigma^4}\right\} I_{J \times J}\, e^{-\|\vec{x} - \vec{x}_l\|^2/(2\sigma^2)} + \left[\frac{(\vec{x} - \vec{x}_l)(\vec{x} - \vec{x}_l)^T}{\sigma^4}\right] e^{-\|\vec{x} - \vec{x}_l\|^2/(2\sigma^2)}.$$

It was proven rigorously that the columns of the above matrix are divergence-free vector fields [3, 9]: from the results in [3, 9], $\nabla \cdot \vec{v}_l^{\,j} = 0$ for every column $\vec{v}_l^{\,j}$. For a given choice of centre we therefore obtain $J$ vector fields in $J$ dimensions.

For a given centre $\vec{x}_l$ there are $J$ mutually orthogonal $J$-dimensional basis vectors. Hence for each centre we have a complete basis set.
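To make Eq. (9) concrete, the following stdlib-only sketch (illustrative, not ENTRA code) evaluates the matrix-valued field for arbitrary $J$ and verifies by central finite differences that each column is divergence-free.

```python
import math
import random

def phi_matrix(x, xl, sigma, J):
    """Matrix-valued field of Eq. (9); its columns are divergence-free."""
    d = [xi - ci for xi, ci in zip(x, xl)]
    r2 = sum(di * di for di in d)
    g = math.exp(-r2 / (2 * sigma**2))
    # off-diagonal part (x - x_l)(x - x_l)^T / sigma^4 times the Gaussian
    M = [[(d[a] * d[b] / sigma**4) * g for b in range(J)] for a in range(J)]
    # diagonal part {(J-1)/sigma^2 - r^2/sigma^4} I times the Gaussian
    diag = ((J - 1) / sigma**2 - r2 / sigma**4) * g
    for a in range(J):
        M[a][a] += diag
    return M

def divergence(col, x, xl, sigma, J, h=1e-5):
    """Central-difference divergence of column `col` of phi_matrix at x."""
    s = 0.0
    for a in range(J):
        xp = list(x); xp[a] += h
        xm = list(x); xm[a] -= h
        s += (phi_matrix(xp, xl, sigma, J)[a][col]
              - phi_matrix(xm, xl, sigma, J)[a][col]) / (2 * h)
    return s
```

The numerical divergence of every column vanishes to within finite-difference error at any test point, which is the property that makes the basis entropy-conserving.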
By such a construction the basis vectors enforce the divergence-free condition strictly. Each vector has a unique coefficient $c_k$ attached to it, $1 \le k \le J$. In the next section we demonstrate this for a simple 2D case.

[Figure 1: Plot of the vector fields $\vec{v}_{c_1}$ and $\vec{v}_{c_2}$, respectively.]

To demonstrate what the vector field looks like, we take a simple 2D case and define the scalar function of Eq. (8), centred at the origin, as

$$\phi\left(\|\vec{x} - \vec{0}\|\right) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x^2 + y^2)/(2\sigma^2)}. \quad (10)$$

The matrix-valued function $\Phi_{\sigma}(\vec{x})_{2 \times 2}$ becomes

$$\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x^2 + y^2)/(2\sigma^2)} \begin{pmatrix} -\dfrac{y^2}{\sigma^4} + \dfrac{1}{\sigma^2} & \dfrac{xy}{\sigma^4} \\ \dfrac{xy}{\sigma^4} & -\dfrac{x^2}{\sigma^4} + \dfrac{1}{\sigma^2} \end{pmatrix}. \quad (11)$$

The vectors $\vec{v}_{c_1}$ and $\vec{v}_{c_2}$, defined from the columns of the above matrix as below, are then divergence-free:

$$\vec{v}_{c_1} = \frac{c_1}{\sqrt{2\pi}\,\sigma}\, e^{-(x^2 + y^2)/(2\sigma^2)} \begin{pmatrix} -\dfrac{y^2}{\sigma^4} + \dfrac{1}{\sigma^2} \\ \dfrac{xy}{\sigma^4} \end{pmatrix}, \quad (12)$$

$$\vec{v}_{c_2} = \frac{c_2}{\sqrt{2\pi}\,\sigma}\, e^{-(x^2 + y^2)/(2\sigma^2)} \begin{pmatrix} \dfrac{xy}{\sigma^4} \\ -\dfrac{x^2}{\sigma^4} + \dfrac{1}{\sigma^2} \end{pmatrix}. \quad (13)$$

Any linear combination of these divergence-free fields is also divergence-free. We plot these fields in Fig. 1 for $c_1 = c_2 = 1$. In this 2D case we have two mutually orthogonal divergence-free basis vectors which constitute the complete basis set in two dimensions. The result holds in general in any dimension [3, 9].

4 Karplus theorem

In the $J$-dimensional space of the outcome variable the $N$ subjects are represented by $N$ points. These $N$ points are arranged according to a certain density $\rho$. The Karplus theorem states that, for a given covariance, the Gaussian distribution maximizes entropy. Mathematically this can be written as

$$J[\rho] = S[\rho_G] - S[\rho] \ge 0. \quad (14)$$

Here $\rho_G$ is the Gaussian density. The covariance matrix is defined as

$$C = \left\langle (\vec{x} - \langle\vec{x}\rangle)(\vec{x} - \langle\vec{x}\rangle)^T \right\rangle. \quad (15)$$

Here the symbol $\langle\cdot\rangle$ denotes the ensemble average over the $N$ subjects, implying that $C$ is a $J \times J$ matrix.
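The inequality in Eq. (14) can be checked in closed form in one dimension: among all densities with a fixed variance, the Gaussian has the largest differential entropy. A small sketch (illustrative; the comparison densities are chosen here purely for convenience) compares the Gaussian against uniform and Laplace densities of equal variance, using their standard entropy formulas.

```python
import math

def gaussian_entropy(var):
    """Differential entropy (nats) of a 1D Gaussian with variance var."""
    return 0.5 * math.log(2 * math.pi * math.e * var)

def uniform_entropy(var):
    """Entropy of a uniform density with the same variance (width w, var = w^2/12)."""
    w = math.sqrt(12 * var)
    return math.log(w)

def laplace_entropy(var):
    """Entropy of a Laplace density with the same variance (var = 2 b^2)."""
    b = math.sqrt(var / 2)
    return 1 + math.log(2 * b)
```

For every variance, $J[\rho] = S[\rho_G] - S[\rho]$ is strictly positive for the non-Gaussian densities, and the gap is independent of the variance, which is why minimizing $J$ over entropy-preserving maps drives the density towards a Gaussian.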
$\vec{x}$ denotes a $J$-dimensional vector, i.e. a point in the configuration space. The equality sign in Eq. (14) holds only if the underlying density is a Gaussian. Now we introduce a transformation $f$ that preserves the entropy, i.e.

$$S[f(\rho)] = S[\rho]. \quad (16)$$

With this transformation we want to deform the density $\rho$ towards a Gaussian density; this then becomes the following minimization problem [8]:

$$f_{\min} = \min_{f \in G}\left(J[f(\rho)]\right). \quad (17)$$

Here $G$ is the group of all smooth entropy-preserving transforms. Since the transformation leaves the entropy unchanged, and as the entropy of a Gaussian density grows with the logarithm of the determinant of the covariance matrix, to solve the above minimization problem we have to minimize the determinant of the covariance matrix of the Gaussian density. Since by the Karplus theorem the covariance of $\rho_G$ is the same as that of $\rho$, we can use the covariance in Eq. (15) and write the above minimization problem as

$$f_{\min} = \min_{f \in G} \det C[f(\rho)]. \quad (18)$$

As a consequence of the above corollary, if we have a basis set consisting of divergence-free vector fields under which the $J$-dimensional vectors of the $N$ subjects are transformed, then the basis coefficients can be determined by minimizing the determinant of the covariance matrix of the transformed vectors with respect to the basis coefficients.

In the next section we describe how the basis set of divergence-free smooth vector fields is used in the transformation algorithm. This is implemented in the program package ENTRA, which I have developed.

5 Algorithm and program structure

Given $N$ subjects, the outcome variable for each subject is a $J$-dimensional vector, Eq. (1). We compute the covariance matrix of Eq. (15), which is a $J \times J$ matrix. The matrix is then diagonalized by an orthogonal transformation $T$. The covariance matrix $C$ can then be written as

$$C = T \Lambda T^T, \quad (19)$$

with $\Lambda$ being the eigenvalue matrix and the columns of $T$ being the eigenvectors of $C$.
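The corollary of Eq. (18) can be demonstrated on a toy 2D example (illustrative only, not the ENTRA implementation): a "banana-shaped" sample is straightened by the volume-preserving map $(x, y) \mapsto (x, y - cx^2)$, which is the unit-time flow of the divergence-free field $(0, -x^2)$, and the coefficient $c$ is found by minimizing the determinant of the sample covariance matrix, here by a simple grid search rather than by conjugate gradients.

```python
import random

def cov2(pts):
    """2x2 sample covariance matrix, cf. Eq. (15)."""
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    cxx = sum((p[0] - mx) ** 2 for p in pts) / n
    cyy = sum((p[1] - my) ** 2 for p in pts) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / n
    return [[cxx, cxy], [cxy, cyy]]

def det2(C):
    return C[0][0] * C[1][1] - C[0][1] * C[1][0]

def straighten(pts, c):
    """Volume-preserving map (x, y) -> (x, y - c x^2); unit Jacobian determinant."""
    return [(x, y - c * x * x) for x, y in pts]

random.seed(2)
# banana-shaped sample: y = x^2 plus noise
pts = [(x, x * x + random.gauss(0, 0.3))
       for x in (random.gauss(0, 1) for _ in range(4000))]
# grid search for the coefficient minimizing det C, cf. Eq. (18)
best_c = min((c / 50 for c in range(0, 101)),
             key=lambda c: det2(cov2(straighten(pts, c))))
```

The minimizer lands near $c = 1$, i.e. the transformation removes exactly the non-Gaussian curvature that was put into the sample, and the determinant of the covariance drops accordingly.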
We construct a $J \times N$-dimensional vector by appending all $N$ of the $J$-dimensional vectors, and label it the trajectory $\vec{Y}$. To transform the trajectory to principal coordinates, where the mean of the trajectory is centred at 0, we project the eigenvectors onto the trajectory to get the mean-centred trajectory

$$\vec{Y}' = T\left(\vec{Y} - \langle\vec{Y}\rangle\right). \quad (20)$$

From the trajectory $\vec{Y}'$ we choose two $J$-dimensional vectors $\vec{Y}'_k$ and $\vec{Y}'_m$ and transform them in the vector field shown before as

$$\vec{V}_k = \vec{Y}'_k + \sum_{l}^{L} \sum_{i}^{J} c_{il}\, \Phi_{il\sigma}(\vec{Y}'_k), \quad (21)$$

$$\vec{V}_m = \vec{Y}'_m + \sum_{l}^{L} \sum_{i}^{J} c_{il}\, \Phi_{il\sigma}(\vec{Y}'_m). \quad (22)$$

The coefficients are chosen by minimizing the determinant of the matrix $C$ with respect to the basis coefficients $c_{il}$. The minimization is performed via the conjugate-gradients method [10]. The process is repeated for all $N$ of the $J$-dimensional vectors. This in the end yields a trajectory $\vec{V}$ which has the least determinant of the covariance matrix, and whose underlying density has the same Shannon entropy as that of our starting system.

The classes that build up the core of the program are ENTRA, trajectory and grid. Objects of class trajectory represent real-valued arrays. The arrays are stored as high-dimensional vectors, and the dimensionality of the vectors is defined by the grid class. The grid class also defines the number and spacing of the basis vectors. The methods provided by the trajectory and grid classes allow the corresponding arrays to be initialized, manipulated, and stored to (loaded from) files. To start using the program, a few parameters have to be provided:

    long nsources = J;        // number of data sources
    long nsubjects = N;       // number of subjects
    double deltx;             // spacing between basis functions
    long ngpsx = L;           // number of basis functions
    long Max_iter;            // maximum iterations to find basis coefficients
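Steps (19)-(20) amount to a principal-axis transformation. In the 2D case the orthogonal matrix $T$ can be written down in closed form; the sketch below (stdlib-only, illustrative, not ENTRA code) mean-centres a sample and rotates it so that its covariance matrix becomes diagonal.

```python
import math
import random

def covariance(pts):
    """Means and 2x2 sample covariance, cf. Eq. (15)."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    cxx = sum((x - mx) ** 2 for x, _ in pts) / n
    cyy = sum((y - my) ** 2 for _, y in pts) / n
    cxy = sum((x - mx) * (y - my) for x, y in pts) / n
    return (mx, my), [[cxx, cxy], [cxy, cyy]]

def principal_coords(pts):
    """Diagonalize C = T Lambda T^T (Eq. (19)) and project the mean-centred
    points onto the eigenvectors (Eq. (20))."""
    (mx, my), C = covariance(pts)
    # rotation angle that zeroes the off-diagonal covariance entry
    theta = 0.5 * math.atan2(2 * C[0][1], C[0][0] - C[1][1])
    c, s = math.cos(theta), math.sin(theta)
    # rows are the (orthonormal) eigenvectors of C
    return [(c * (x - mx) + s * (y - my),
             -s * (x - mx) + c * (y - my)) for x, y in pts]
```

After the projection the sample has zero mean and a diagonal covariance matrix, which is the coordinate system in which the divergence-free transformation of Eqs. (21)-(22) is applied.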
6 Test case

The goal of this example is to generate a trajectory having a random underlying density and then transform it towards a trajectory having a Gaussian underlying density. To generate a random trajectory we use the utility
Random of Eigen to generate random matrices of any dimension. The parameters chosen for this test are as follows:

    long nsources = 10;
    long nsubjects = 1000;
    double deltx = 0.05;
    long ngpsx = 80;
    long Max_iter = 500;

Results of the transformation (after a single iteration cycle) are shown in Fig. 2. In this example we transform a 30-dimensional vector having 1000 configurations. The data lie along these 30 axes, which are labelled as $(X_1, X_2, \dots, X_{30})$. Here $X_p = Y_{ip}$, i.e. the $p$-th component of the $J$-dimensional vector. We plot two-dimensional subspaces of the original 30-dimensional space along different axes, as labelled in Fig. 2.

To prove that the transformation is indeed entropy-conserving, we computed the subspace entropy via histogramming for the 2D subspaces shown in Fig. 2, for both the original and the transformed subspaces. The Matlab script to estimate the entropy via histogramming for the plot in Fig. 2a begins as follows:

    load simulation_examplehist.dat
    X1 = simulation_examplehist(:,8);
    X2 = simulation_examplehist(:,9);
    X3 = simulation_examplehist(:,11);
    X4 = simulation_examplehist(:,12);
    X11 = [X1; X3];  X12 = [X2; X4];
    X = [X11, X12];
    plotmatrix(X);

Its output over three runs is:

    >> run simpleentropy
    InitialEntropy = 8.7636
    TransformedEntropy = 7.5625
    >> run simpleentropy
    InitialEntropy = 8.7770
    TransformedEntropy = 7.6632
    >> run simpleentropy
    InitialEntropy = 8.7754
    TransformedEntropy = 7.6483

We also plot the data along three axes, as seen in Fig. 3; in this plot we plot the data along one axis versus another, as labelled, together with a histogram along each axis.
We also plot the same plot for the transformed data, as seen in Fig. 4. We compute the entropy of the three-dimensional data using a Matlab code similar to the above; its result is shown here:

    >> run highdimentropy
    InitialEntropy = 12.8220
    TransformedEntropy = 7.5756

[Figure 3: 3D scatter and histogram plots along three axes. More details in the text.]

[Figure 4: 3D scatter and histogram plots for the original subspace along three axes. More details in the text.]

Summary and Outlook
The ENTRA package has been presented. The aim of this package is to transform high-dimensional data sets such as those found in epidemiology. With this transformation the underlying high-dimensional density function is transformed to a high-dimensional Gaussian, and owing to the convenient properties associated with a Gaussian distribution the subsequent data analysis can be accomplished more easily than before.

The following major points need to be addressed and will form the main body of the work to be done at the institute:

1) The appropriate choice of the basis-set vectors. The number of basis vectors depends on the data and needs to be appropriately estimated beforehand. How exactly that can be achieved needs to be determined.

2) Building an example with enough statistics to prove that entropy conservation is maintained to a high degree of accuracy.

3) Developing full file support for epidemiologists to enable them to load their data and obtain the transformed data.

4) Developing complete regression analysis to estimate conditional probabilities of the kind in Eq. (4) within the ecosystem of ENTRA.

References

[1] Y. Brenier. Topics on hydrodynamics and volume preserving maps. Handbook of Mathematical Fluid Dynamics II. North-Holland, Amsterdam, 2003, pages 55-86.

[2] A. Wehrl. General properties of entropy. Reviews of Modern Physics, 50(2):221-260, 1978.

[3] F. J. Narcowich and J. D. Ward. Generalized Hermite interpolation via matrix-valued conditionally positive definite functions. Mathematics of Computation, 63(208):661-687, 1994.

[4] I. Andricioaei and M. Karplus. On the calculation of entropy from covariance matrices of the atomic fluctuations. J. Chem. Phys., 115:6289-6292, 2001.

[5] M. Karplus and J. N. Kushick. Method for estimating the configurational entropy of macromolecules. Macromolecules, 14(2):325-332, 1981; J. Schlitter (1993). Estimation of absolute entropies of macromolecules using the covariance matrix.
Chemical Physics Letters, 617-621.

[6] K.-Y. Liang and S. L. Zeger. Longitudinal data analysis using generalized linear models. Biometrika, 1986;73.