Bayesian Regularization of Gaussian Graphical Models with Measurement Error
Michael Byrd, Linh Nghiem, and Monnie McGee

July 5, 2019
Abstract
We consider a framework for determining and estimating the conditional pairwise relationships of variables when the observed samples are contaminated with measurement error in high dimensional settings. Assuming the true underlying variables follow a multivariate Gaussian distribution, if no measurement error is present, this problem is often solved by estimating the precision matrix under sparsity constraints. However, when measurement error is present, not correcting for it leads to inconsistent estimates of the precision matrix and poor identification of relationships. We propose a new Bayesian methodology to correct for the measurement error from the observed samples. This Bayesian procedure utilizes a recent variant of the spike-and-slab Lasso to obtain a point estimate of the precision matrix, and corrects for the contamination via the recently proposed Imputation-Regularization Optimization procedure designed for missing data. Our method is shown to perform better than the naive method that ignores measurement error in both identification and estimation accuracy. To show the utility of the method, we apply the new method to establish a conditional gene network from a microarray dataset.

Introduction
A core problem in statistical inference is estimating the conditional relationships among random variables. Naturally, a full description of the underlying connections among numerous random variables is valuable information across many disciplines, such as in biology, where uncovering the relationships among hundreds of genes involved in a metabolic process is desired. In fact, under the assumption that the variables follow a multivariate Gaussian distribution, the inverse covariance matrix, known as the precision matrix, characterizes conditional dependence between two dimensions: if an element of the precision matrix is 0, then the two corresponding variables are conditionally independent; see [12] for a review. This setting, often referred to as a Gaussian graphical model, is where our analysis takes place.

Estimating the precision matrix is a difficult task when the number of observations n is much less than the dimension of the features d. A naive approach is to estimate the precision matrix by the inverse of the empirical covariance matrix; this estimate, however, is known to perform poorly and is ill-posed when n < d [9]. The common approach is to assume that the precision matrix is sparse [3]; that is, we assume the precision matrix's off-diagonal elements are mostly 0, so that most pairs of variables are conditionally independent. The sparsity assumption has led to different lines of research with regularized models to estimate the precision matrix. While one approach utilizes a sparse regression technique that estimates the precision matrix by iteratively regressing each variable on the remaining variables, for instance [10], we instead focus on the direct likelihood approach, which optimizes the full likelihood function with an element-wise penalty on the precision matrix; common examples are the graphical lasso [5], CLIME [1], and TIGER [14]. We utilize a recent Bayesian optimization procedure, called BAGUS, that relies on optimization performed by the EM algorithm and was shown to have desirable theoretical properties, including consistent estimation of the precision matrix and selection consistency of the conditional pairwise relationships [6].

There are many practical issues associated with Gaussian graphical models, such as hyperparameter tuning [24], missing data [13], and repeated trials [22], which practitioners need to adjust for a successful analysis. We address another practical issue involved with these models: measurement error. Measurement error occurs when the variables of interest are not observed directly; instead, the observations are the desired variables that have been additionally perturbed with noise from some measurement process. This happens when, for instance, an inaccurate device is used to measure some health metric. Measurement error models have been studied extensively for classical settings such as density deconvolution and regression [2], but, to our knowledge, have not yet been well studied in the context of Gaussian graphical models, especially in high dimensional settings.

We propose a Bayesian methodology to correct for measurement error in estimating a sparse precision matrix; our new method extends the optimization procedure of [6]. While directly incorporating the estimate of the uncontaminated variable is possible, we find the incorporation of the imputation-regularization technique of [13] to provide more desirable results.
Our procedure imputes the mismeasured random variables, then performs BAGUS on this imputation; these steps are performed for a small number of cycles, requiring more computation but giving better results than the naive estimator. We prove consistency of the estimated precision matrix under the imputation procedure, and illustrate the performance in a simulation study. We conclude with an application to a microarray dataset.

Gaussian Graphical Models with Measurement Error

Given a d-dimensional random vector $x = (x^1, \ldots, x^d)$, the conditional dependence of two variables $x^i$ and $x^j$, for any pair $(i, j)$ with $1 \le i < j \le d$, given all the remaining variables is of interest. This conditional dependence structure is usually represented by an undirected graph $G = (V, E)$, where $V = \{1, \ldots, d\}$ is the set of nodes and $E \subseteq V \times V = \{(1,1), (1,2), \ldots, (d,d)\}$ is the set of edges [12]. In this representation, the two variables $x^i$ and $x^j$ are conditionally independent if there is no edge between node i and node j.

If the vector x follows a multivariate normal distribution with mean 0 and covariance matrix $\Sigma_x$, $x \sim N_d(0, \Sigma_x)$, every edge corresponds to a non-zero entry in the precision matrix $\Omega_x = \Sigma_x^{-1}$; see [12]. The model in this scenario is often known as a Gaussian graphical model. In the high dimensional setting, the set of edges is usually assumed to be sparse, meaning that only a few pairs $(x^i, x^j)$ are conditionally dependent. In the Gaussian case, this assumption implies only a few off-diagonal entries of $\Omega_x$ are non-zero.

When measurement error is present, denote $U = (u_1, \ldots, u_n)^T$ as measurement errors that are independent from the data $X = (x_1, \ldots, x_n)^T$. For $i = 1, \ldots, n$, the amount of measurement error is drawn from another multivariate normal distribution with mean 0 and covariance matrix $\Sigma_u$, $u_i \sim N_d(0, \Sigma_u)$. We assume $\Sigma_u$ to be diagonal, and hence the amount of measurement error on each variable is uncorrelated. We make a common assumption that $\Sigma_u$ is known or estimable from ancillary data, such as replicate measurements. The contaminated variables $w = x + u$ in general have a different conditional dependence structure from that of x. Indeed, the covariance and precision matrix of w are given by $\Sigma_w = \Sigma_x + \Sigma_u$ and
$$\Omega_w = \Sigma_w^{-1} = (\Sigma_x + \Sigma_u)^{-1} = \Omega_x - \Omega_x (I + \Sigma_u \Omega_x)^{-1} \Sigma_u \Omega_x, \quad (1)$$
respectively; here, I denotes the $d \times d$ identity matrix. Equation (1) follows from the Kailath variant formula in [17]. Furthermore, equation (1) shows that $\Omega_w$ and $\Omega_x$ are equal only if the product $\Omega_x (I + \Sigma_u \Omega_x)^{-1} \Sigma_u \Omega_x$ is a zero matrix, which is generally not the case when the matrix $\Sigma_u$ is not zero.

Suppose the data consist of iid observations $w_1, \ldots, w_n$, where $w_i = x_i + u_i$, $i = 1, \ldots, n$, with $x_i \sim N_d(0, \Sigma_x)$ and $u_i \sim N_d(0, \Sigma_u)$. Here, $w_i = (w_i^1, \ldots, w_i^d)$, with the subscript and superscript denoting the observation and component, respectively. Denote W as the $n \times d$ matrix of observed data. The model is equivalent to the following hierarchical representation. First, the latent random variables $x_i$ are generated from a $N_d(0, \Sigma_x)$ distribution; then, conditioned on $x_i$ and $\Sigma_u$, we have $w_i | x_i, \Sigma_u \sim N_d(x_i, \Sigma_u)$ for each $i = 1, \ldots, n$. This forms an intuitive generative process, where first x is realized, then contaminated by measurement error u, and finally observed as w.
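Equation (1) is easy to verify numerically. The following minimal numpy sketch (our illustration, not code from the paper) simulates the hierarchical generative process for an arbitrary sparse positive definite $\Omega_x$ and checks the Kailath-variant identity:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200_000

# Sparse, positive definite precision matrix for the latent variables
# (tridiagonal, so only adjacent variables are conditionally dependent).
Omega_x = 2 * np.eye(d) + 0.5 * np.eye(d, k=1) + 0.5 * np.eye(d, k=-1)
Sigma_x = np.linalg.inv(Omega_x)

# Diagonal measurement error covariance with noise-to-signal ratio 0.25.
Sigma_u = np.diag(0.25 * np.diag(Sigma_x))

# Equation (1): Kailath-variant form of the contaminated precision matrix.
I = np.eye(d)
Omega_w = Omega_x - Omega_x @ np.linalg.inv(I + Sigma_u @ Omega_x) @ Sigma_u @ Omega_x
assert np.allclose(Omega_w, np.linalg.inv(Sigma_x + Sigma_u))

# The correction term destroys the zeros: Omega_w is dense even though
# Omega_x is tridiagonal, so the naive graph acquires spurious edges.
print(np.round(Omega_w, 3))

# Hierarchical generative process: x realized, contaminated by u, observed as w.
x = rng.multivariate_normal(np.zeros(d), Sigma_x, size=n)
w = x + rng.multivariate_normal(np.zeros(d), Sigma_u, size=n)
print(np.abs(np.cov(w.T) - (Sigma_x + Sigma_u)).max())  # near 0 for large n
```

The printed $\Omega_w$ has no exact zeros off the diagonal, illustrating why an uncorrected sparse fit to w recovers the wrong graph.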
The problem of interest is to estimate the precision matrix $\Omega_x$ in the high dimensional setting n < d. When no measurement error is present, i.e. the $x_i$ are directly observed, the sample covariance matrix $S = n^{-1} \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^\top$, with $\bar{x}$ being the sample mean, is a consistent estimator for $\Sigma_x$. However, it has rank at most n < d, so it cannot be inverted to estimate $\Omega_x$. When measurement error is present, we assume the covariance matrix of measurement error $\Sigma_u$ is known or estimable from replicates. A naive approach is first to estimate $\Sigma_x$ by $\tilde{\Sigma}_x = S_w - \Sigma_u$, where $S_w$ denotes the sample covariance of the contaminated data W, and then to invert $\tilde{\Sigma}_x$ to estimate $\Omega_x$. The main issue with this approach is that $\tilde{\Sigma}_x$ is generally not positive definite. This implies its inverse is also not positive definite, which is necessary for a consistent estimate of $\Omega_x$. Hence, a correction procedure to estimate $\Omega_x$ should not rely upon the sample covariance matrix $\tilde{\Sigma}_x$ directly. Furthermore, the procedure should also be able to incorporate sparsity constraints to recover the graphical model structure. These requirements are addressed by the procedure described in the next section.

The IRO-BAGUS Algorithm

In a recent work, [13] developed a methodology to efficiently handle high dimensional problems with missing data. Their solution is an EM-algorithm variant which alternates between two steps, an imputation step and a regularized optimization step; we refer to their algorithm as the IRO algorithm. Denote the missing data as Y and the observed data as X. Also denote the desired parameter to be estimated by θ, and begin with some initial guess $\theta^{(0)}$. During the t-th iteration, the IRO algorithm generates Y from the distribution given by the current estimate of θ, i.e. $Y \sim \pi(Y | X, \theta^{(t-1)})$. Then, using X and Y, it maximizes over θ, under regularization, using the full likelihood. [13] show that this procedure results in a consistent estimate $\theta^{(t)}$, and produces a Markov chain with a stationary distribution.

We make use of this framework for our current problem pertaining to mismeasured observations instead of missing values. The problems are naturally related in the sense that both generate values of the true process from some estimated underlying distribution. We return to the hierarchical structure of the problem, i.e. $w \sim N_d(x, \Sigma_u)$ and $x \sim N_d(0, \Sigma_x)$. The IRO algorithm proceeds iteratively between the two following steps, sketched in code below:

• Imputation step: At iteration t, draw $X^{(t)} = (x_1^{(t)}, \ldots, x_n^{(t)})$ from the posterior full conditional of X, using the current estimate $\Omega_x^{(t-1)}$. Specifically, for $i = 1, \ldots, n$, draw $x_i^{(t)} | w_i, \Omega_x^{(t-1)} \sim N_d(\Lambda^{-1} \Omega_u w_i, \Lambda^{-1})$, where $\Lambda = \Omega_x^{(t-1)} + \Omega_u$. Note that the posterior distribution of $x_i$ depends only on $w_i$ due to independence. This allows for easy generation of data from the true underlying distribution.

• Regularization step: Apply a regularization to the generated $X^{(t)}$ and obtain a new MAP estimate $\Omega_x^{(t)}$.

In this work, the regularization step is carried out based on a recent Bayesian methodology, called BAGUS. Hence, the whole algorithm is referred to as the IRO-BAGUS algorithm. The next subsections 3.1-3.3 outline prior specifications, the full model, and variable selection for BAGUS. After that, section 3.4 discusses consistency of the IRO-BAGUS estimate.
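A schematic sketch of the two-step cycle follows (our illustration; `regularize` is a placeholder for the BAGUS MAP step, which is not implemented here, and any sparse precision estimator could stand in):

```python
import numpy as np

def iro_cycle(W, Omega_u, Omega_x0, regularize, n_iter=50, rng=None):
    """Sketch of the IRO loop for measurement error correction.

    W: (n, d) contaminated observations; Omega_u: measurement error
    precision matrix; regularize: callable mapping an imputed (n, d)
    data matrix to a sparse MAP precision estimate (BAGUS in the paper).
    """
    if rng is None:
        rng = np.random.default_rng()
    Omega_x = Omega_x0
    estimates = []
    for t in range(n_iter):
        # Imputation step: x_i | w_i ~ N(Lambda^{-1} Omega_u w_i, Lambda^{-1})
        # with Lambda = Omega_x^{(t-1)} + Omega_u, independently across i.
        Lambda_inv = np.linalg.inv(Omega_x + Omega_u)
        means = W @ Omega_u.T @ Lambda_inv.T   # row i is Lambda^{-1} Omega_u w_i
        L = np.linalg.cholesky(Lambda_inv)
        X_t = means + rng.standard_normal(W.shape) @ L.T
        # Regularization step: new MAP estimate of Omega_x from imputed data.
        Omega_x = regularize(X_t)
        estimates.append(Omega_x)
    return estimates
```

The per-iteration estimates are later averaged after discarding a burn-in, as discussed in the consistency subsection below.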
Prior Specification

Denote the elements of $\Omega_x$ by $\omega_{ij}$. Recently, a non-convex, continuous relaxation of the spike-and-slab prior was created as a penalty for the standard lasso problem [20]. This prior was extended to the case of graphical models by [6], and is given by
$$\pi(\omega_{ij}) = \frac{\eta}{2 v_1} \exp\left\{-\frac{|\omega_{ij}|}{v_1}\right\} + \frac{1 - \eta}{2 v_0} \exp\left\{-\frac{|\omega_{ij}|}{v_0}\right\} \quad (2)$$
for the off-diagonal elements ($i \neq j$), where $0 < v_0 < v_1$ and $0 < \eta < 1$.
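A quick numerical sketch of the mixture density in (2) follows (ours; the values of $v_0$, $v_1$, and $\eta$ are illustrative, not tuned values from the paper):

```python
import numpy as np

def ss_lasso_prior(omega, v0=0.02, v1=1.0, eta=0.5):
    """Continuous spike-and-slab Lasso density of equation (2):
    a mixture of two mean-zero Laplace densities, a wide slab
    (scale v1, weight eta) and a narrow spike (scale v0, weight 1 - eta)."""
    slab = eta / (2 * v1) * np.exp(-np.abs(omega) / v1)
    spike = (1 - eta) / (2 * v0) * np.exp(-np.abs(omega) / v0)
    return slab + spike

# The density near zero is dominated by the spike; the tails by the slab.
print(ss_lasso_prior(0.0), ss_lasso_prior(2.0))
```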
This prior can be interpreted as a two-component mixture version of the spike-and-slab prior. The first component of the mixture has prior probability η and is associated with the slab component, i.e. $\omega_{ij} \neq 0$. Conversely, with prior probability 1 − η the element is from the spike component, suggesting $\omega_{ij} = 0$.

Traditionally, the spike-and-slab prior has a point mass component at 0 and some other continuous distribution for the slab component, representing the setting of unwanted terms exactly to 0. Here, both the spike and the slab components are distributed according to a Laplace distribution; both are centered at 0, but the spike is more tightly concentrated, with a smaller variance, than the slab. This relaxation of the spike-and-slab prior allows for efficient gradient-based algorithms, while still being theoretically sound, as shown in [19].

Shrinkage is not desired on the diagonal elements, so a weakly informative exponential prior is given instead, $\pi(\omega_{ii}) = \tau \exp\{-\tau \omega_{ii}\}$. Another consideration for the prior of $\Omega_x$ is to ensure the whole matrix is positive definite, denoted $\Omega_x \succ 0$, with bounded spectral norm $\|\Omega_x\|_2 \le B$; this assumption will be important going forward. The full prior distribution for $\Omega_x$ is then given by
$$\pi(\Omega_x) \propto \prod_{i < j} \pi(\omega_{ij}) \prod_{i} \pi(\omega_{ii})\, \mathbb{1}\{\Omega_x \succ 0,\ \|\Omega_x\|_2 \le B\}.$$

Consistency of the IRO-BAGUS Estimate

Theorem 1. Assuming $\|\Omega_x\|_2 \le B$, the estimate $\Omega_x^{(t)}$ is uniformly consistent for $\Omega_x$ when $\log(t) = O(n)$.

It can be seen that the nature of the IRO algorithm is similar to that of MCMC. Under a few more conditions, [13] note that the IRO results in a Markov chain with a stationary distribution, and hence the average of the maximization steps is a consistent estimate of the underlying parameters. Our final estimate is the average of the regularized optimization steps given by BAGUS from the imputed data at each iteration, removing a small number of the beginning iterations as burn-in. By averaging instead of taking only the final iteration, we make the analysis less variable. In this sense, the relationships that the correction procedure identifies are more likely to be true relationships, cutting down on the number of false positives.

Computation of $\Omega_x$

Here we consider some computational aspects of our proposed methodology. First, we focus on the optimization of $\Omega_x$. In our procedure, once X is generated, the objective function to be optimized is $L_{uc}$, as was shown in Equation (7); we note this is due to the conditional independence of W and $\Omega_x$ in the hierarchical structure of the contamination process. Optimizing $L_{uc}$ is difficult to do directly; therefore, the latent factors $r_{ij}$ from Section 3.3 are introduced into the process as in [6]. This allows an E-step similar to the spike-and-slab Lasso and an M-step similar to the graphical lasso.

The optimization seeks to find the MAP estimate of the posterior, which is proportional to
$$|\Omega_x|^{n/2} \exp\left\{-\frac{1}{2} \mathrm{tr}\left(X^T X\, \Omega_x\right)\right\} \prod_{i < j} \pi(\omega_{ij}) \prod_i \pi(\omega_{ii}).$$

Theorem 2. The estimated precision matrix $\hat{\Omega}_x = T^{-1} \sum_{t=1}^T \Omega_x^{(t)}$ is symmetric and positive definite if the initial value of $\Omega_x$ for BAGUS at each iteration t was also positive definite.

Proof. By Theorem 5 in [6], if the initial value used to optimize BAGUS is positive definite, then $\Omega_x^{(t)}$ is also positive definite. The set of positive definite matrices forms a cone, and hence the average will also be in this cone.

Hyperparameter Selection

There are four hyperparameters in BAGUS: $\eta$, $\tau$, $v_0$, and $v_1$. As with [6], we fix $\eta$ and $\tau$, which leaves the two hyperparameters $v_0$ and $v_1$ to tune. Again, we follow [6], who suggest a BIC-like criterion to select the best model from a grid of hyperparameters. This criterion is
$$\mathrm{BIC} = n\left(\mathrm{tr}(S \hat{\Omega}_x) - \log \det(\hat{\Omega}_x)\right) + \log(n) \times q,$$
where $\hat{\Omega}_x$ is the estimated precision matrix and q is the number of non-zero elements in the upper triangle of the estimated precision matrix. We use this in similar fashion for the IRO procedure, but instead we use the averaged $\Omega_x^{(t)}$ in the BIC calculation.
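A sketch of this criterion (our illustration; `S` stands for the sample covariance of the data used in the fit and `Omega_hat` for the estimated precision matrix):

```python
import numpy as np

def bic_ggm(S, Omega_hat, n):
    """BIC-like criterion for model selection over a hyperparameter grid:
    n * (tr(S Omega) - log det(Omega)) + log(n) * (# upper-triangle nonzeros)."""
    fit = np.trace(S @ Omega_hat) - np.linalg.slogdet(Omega_hat)[1]
    q = np.count_nonzero(np.triu(Omega_hat, k=1))
    return n * fit + np.log(n) * q
```

For the IRO procedure, the averaged precision estimate would be passed in as `Omega_hat`.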
Simulation

We investigate the performance of our methodology under several different settings. For each setting we generate $x_i$ following a d-variate Gaussian distribution with mean 0 and precision matrix $\Omega_x$ according to some graph structure; we refer to this as the true data. Then, the contaminated observation $w_i$ is generated as $w_i = x_i + u_i$, where $u_i \sim N_d(0, \Sigma_u)$, $i = 1, \ldots, n$. The measurement error covariance matrix $\Sigma_u$ is assumed to be a diagonal matrix, with element $[\Sigma_u]_{ii} = \gamma [\Sigma_x]_{ii}$, where $[\Sigma_x]_{ii}$ is the variance of dimension i. In other words, the constant γ controls the noise-to-signal ratio on each variable. For the purposes of simulation, we assume the amount of measurement error to be known.

To generate the true data we use the huge package [25]. We inspect two different types of graphs, referred to as hub and random; we expand on these below, where $\omega_{ij}$ denotes the (i, j) element of $\Omega_x$.

1. Hub: the variables are partitioned into d/20 groups, and $\omega_{ij} = \omega_{ji} = 1$ if i and j are in the same group; $\omega_{ij} = 0$ otherwise.
2. Random: for $1 \le i < j \le d$, $\omega_{ij} = 1$ with a small probability on the order of 1/d, and 0 otherwise.

We illustrate the structures in Figure 1.

Figure 1: Graphical representation for d = 100 of the hub (left) and random (right) structure, respectively. Note that the random graph is subject to change due to the randomness.

Each model was generated with n = 100 observations. We inspect each model for $d \in \{100, 200\}$ and $\gamma \in \{0.1, 0.25, 0.5\}$. The number of correction-imputation cycles was set to 50, with the first 20% discarded as burn-in; we note that we inspected 25 and 100 imputations with the same percentage of burn-in samples, with minimal differences in output. Each setting was replicated 50 times, and the final results are the average of these replicates. Hyperparameter tuning was done as described in Section 4.2. Because measurement error is often ignored in the context of Gaussian graphical models, our simulations also provide perspective on the negative effect that measurement error can have on model performance.

To inspect model performance, we examine both the estimated precision matrix and the variable selection ability of BAGUS on the true data (true), BAGUS on the contaminated data (naive), and our IRO-BAGUS methodology on the contaminated data (corrected). For each estimated precision matrix $\hat{\Omega}_x$, estimation error is measured by $\|\hat{\Omega}_x - \Omega_x\|_F$, and variable selection is evaluated by different metrics involving the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN): specificity (SPE), sensitivity (SEN), precision (PRE), accuracy (ACC), and Matthews correlation coefficient (MCC). These values are defined as
$$\mathrm{SPE} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}, \quad \mathrm{SEN} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \quad \mathrm{PRE} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \quad \mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{FP} + \mathrm{TN} + \mathrm{FN}},$$
$$\mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}}.$$
Additionally, we also report the area under the ROC curve (AUC), which gives insight into the amount of separation achieved by the classification.
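These confusion-matrix metrics are straightforward to compute; a minimal numpy sketch (ours, taking boolean arrays over the upper-triangle edges of the true and estimated graphs):

```python
import numpy as np

def selection_metrics(true_edges, est_edges):
    """Edge-selection metrics defined above; inputs are boolean arrays
    marking the upper-triangle edges of the true and estimated graphs."""
    tp = np.sum(true_edges & est_edges)
    tn = np.sum(~true_edges & ~est_edges)
    fp = np.sum(~true_edges & est_edges)
    fn = np.sum(true_edges & ~est_edges)
    mcc_den = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {
        "SEN": tp / (tp + fn),
        "SPE": tn / (tn + fp),
        "PRE": tp / (tp + fp),
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "MCC": (tp * tn - fp * fn) / mcc_den,
    }
```

The AUC, in contrast, is computed by sweeping the inclusion threshold over the estimated inclusion probabilities rather than using the fixed 0.5 cutoff.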
These different metrics give insight into the tradeoffs and gains of each setting.

Amt. ME   d    Model      SEN    SPE    PRE    ACC    MCC    FROB   AUC
0.1       100  True       1.00   0.65   0.85   0.99   0.73    5.11  0.95
               Naive      1.00   0.50   0.76   0.99   0.61    6.81  0.93
               Corrected  1.00   0.51   0.78   0.99   0.62    6.17  0.97
          200  True       1.00   0.67   0.77   0.99   0.71    7.36  0.94
               Naive      1.00   0.51   0.69   0.99   0.59    9.63  0.92
               Corrected  1.00   0.51   0.71   0.99   0.60    8.68  0.96
0.25      100  True       1.00   0.66   0.84   0.99   0.74    5.09  0.95
               Naive      0.99   0.38   0.60   0.98   0.47    8.54  0.90
               Corrected  1.00   0.36   0.68   0.98   0.49    7.71  0.94
          200  True       1.00   0.67   0.78   1.00   0.72    7.29  0.94
               Naive      1.00   0.40   0.52   0.99   0.45   12.10  0.90
               Corrected  1.00   0.37   0.62   0.99   0.48   10.68  0.95
0.5       100  True       1.00   0.66   0.85   0.99   0.74    5.03  0.95
               Naive      1.00   0.20   0.50   0.98   0.31    9.67  0.84
               Corrected  1.00   0.20   0.70   0.98   0.37    8.74  0.89
          200  True       1.00   0.68   0.77   0.99   0.72    7.36  0.94
               Naive      1.00   0.20   0.37   0.99   0.27   13.70  0.83
               Corrected  1.00   0.17   0.59   0.99   0.31   12.53  0.89

Table 1: Simulation results for the hub graph structure, as specified in Section 5.1. For each signal-to-noise ratio and d, the true, naive, and corrected models are shown for metrics defined in Section 5.1.

Table 1 and Table 2 present the results for the hub and random structures, respectively. To begin, we note the effect of increasing measurement error. This can be observed by examining the growing difference in the performance of the true and naive models when holding d fixed and increasing the amount of contamination. Focusing on the hub structure, a decrease in the quality of selection and estimation can be observed for each setting, which grows worse with more contamination. The selection accuracy metrics with respect to the prespecified 0.5 cut-off show drops in performance of around 50%. The estimated precision matrix from the naive model grows worse with measurement error, and is also about 50% worse when the signal-to-noise ratio is 0.5.

Amt. ME   d    Model      SEN    SPE    PRE    ACC    MCC    FROB   AUC
0.1       100  True       1.00   0.42   0.84   0.98   0.59    4.61  0.89
               Naive      1.00   0.32   0.80   0.98   0.50    5.34  0.88
               Corrected  1.00   0.32   0.80   0.98   0.50    5.04  0.91
          200  True       1.00   0.36   0.76   0.99   0.52    6.72  0.86
               Naive      1.00   0.30   0.66   0.99   0.44    7.61  0.85
               Corrected  1.00   0.28   0.68   0.99   0.43    7.25  0.90
0.25      100  True       1.00   0.45   0.86   0.98   0.61    4.66  0.90
               Naive      1.00   0.27   0.70   0.97   0.42    6.68  0.85
               Corrected  1.00   0.23   0.75   0.97   0.40    5.72  0.88
          200  True       1.00   0.37   0.75   0.99   0.52    6.77  0.86
               Naive      1.00   0.23   0.52   0.99   0.34    9.04  0.82
               Corrected  1.00   0.18   0.60   0.99   0.32    7.95  0.86
0.5       100  True       1.00   0.43   0.85   0.98   0.59    4.65  0.89
               Naive      1.00   0.14   0.55   0.97   0.26    7.71  0.79
               Corrected  1.00   0.09   0.67   0.97   0.24    6.49  0.79
          200  True       1.00   0.37   0.76   0.99   0.53    6.74  0.86
               Naive      1.00   0.12   0.39   0.98   0.21   10.42  0.77
               Corrected  1.00   0.06   0.56   0.99   0.18    8.92  0.78

Table 2: Simulation results for the random graph structure, as specified in Section 5.1. For each signal-to-noise ratio and d, the true, naive, and corrected models are shown for metrics defined in Section 5.1.

We now turn attention to the performance of the correction step. First, take note of the first five metrics, which are based on the confusion matrix for the 0.5 cutoff threshold. Averaging across the IRO iterations was expected to result in an analysis that favored identifying relationships that were more certain, which can be observed by inspection of the precision (PRE). The gains in precision are most notable as d grows larger and more pairwise relationships exist; when d = 200, we note nearly 10% and 50% performance gains in precision for signal-to-noise ratios of 0.25 and 0.5, respectively.
In both the hub and random structures, the naive and corrected models perform similarly in terms of sensitivity, specificity, accuracy, and MCC. It seems at first glance that the selection performance of the correction procedure, ignoring the precision, is comparable to the naive model, but these discrepancies can be attributed to the prespecified inclusion cut-off on the P matrix. In practice it can often be more reasonable to rank order the inclusion probabilities to identify relationships to further investigate in future experiments. With this in mind, we turn to the performance with respect to the AUC, where consistent improvements can be seen for the hub and random structures in almost all settings. The AUC quantifies the amount of separation found in the model across all thresholds, which helps justify that the correction step is making improvements in separating the classes for the true relationships, as AUC improvements are seen in all but the random graph with d = 200 and a signal-to-noise ratio of 0.5. This can be attributed to the measurement error in models whose structure is otherwise easily identified. In particular, in the random structure with the amount of measurement error being 0.5, the corrected model does not make substantial improvements in results over the naive model. We note the difficulty of this setting, as the random structure often performs worse than other structures in identification, and the contamination adds further noise. With a relatively small sample size, this noise is difficult to overcome.

Finally, we note the quality of the estimated precision matrix, as measured by the Frobenius norm of the difference. In every setting, for both the hub and random structures, the corrected model outperforms the naive model's estimate of the precision matrix. In the hub structure this improvement is often on the order of 15-20%, while in the random structure a 10% improvement is generally observed. If the intent of the analysis is to use the estimated precision matrix in downstream analysis, this can result in more refined results.

Application to Gene Expression Data

A common source of noise in analyses involving gene expression datasets is measurement error [18]. Gaussian graphical models are often used to inspect the relationships of different genes in varying experiments [11]. We illustrate our methodology using an Affymetrix microarray dataset containing 144 subjects with favorable histology Wilms tumors hybridized to the Affymetrix Human Genome U133A Array [8]. The data is publicly available on the GEO website, dataset GSE10320, uploaded 1/30/2009. A feature of Affymetrix data, and many other gene expression measurement platforms, is the use of multiple probes for each gene for each patient, giving replicate measurements for each patient's gene measurement. The replicates for each patient enable an estimate of the measurement error, where we again assume the amount of contamination is independent across genes.

We follow the preprocessing steps taken in [21] and [15], which used this study in the context of measurement error in variable selection for linear models. The process begins by processing the raw data with the Bayesian Gene Expression (BGX) package [23]. BGX creates a posterior distribution for the log-scale expression level of each gene in each sample. The study recorded measurements for 22283 different genes. To remove unnecessary computational burden, we reduced the number of genes by applying four different filters in the following order.
The first filter removes expression values that do not have a corresponding Entrez gene ID in the NCBI database [16]. The second filter removes expression values with low intensity by requiring at least 25% of samples to have intensities above 100 fluorescence units. The third filter removes expression values with low variability by requiring the interquartile range to be at least 0.6 on the log scale. The last filter removes expression values that have a noise-to-signal ratio greater than 0.5, which we discuss in more depth below. After filtering, there were 273 expression values remaining for the analysis.

Now, we discuss how we estimate the measurement error of each gene; a code sketch is given at the end of this section. We assume that the measurement error variance is constant across patients for a given gene. We also assume that the measurement error is independent for each gene, and need not be equal for each gene. Let $\hat{\mu}_j = (\hat{\mu}_{1j}, \ldots, \hat{\mu}_{nj})^T$ denote the estimated vector of the patients' expression levels for gene j. Further, let $\bar{\mu}_j = n^{-1} \sum_{i=1}^n \hat{\mu}_{ij}$ and $\hat{\sigma}_j^2 = n^{-1} \sum_{i=1}^n (\hat{\mu}_{ij} - \bar{\mu}_j)^2$ denote the mean and variance of each gene, respectively. For patient i, standardized measurements are given by $W_i = (W_{i1}, \ldots, W_{ip})$, calculated as $W_{ij} = \hat{\sigma}_j^{-1}(\hat{\mu}_{ij} - \bar{\mu}_j)$ for each $j = 1, \ldots, p$. Let $\mathrm{var}(\hat{\mu}_{ij})$ denote the posterior variance of the estimated distribution of patient i's gene j. These estimates are then combined as $\hat{\sigma}_{u,j}^2 = n^{-1} \sum_{i=1}^n \mathrm{var}(\hat{\mu}_{ij})$. The measurement error covariance matrix of the standardized data W is then estimated by the diagonal matrix $\hat{\Sigma}_u$, where $(\hat{\Sigma}_u)_{j,j} = \hat{\sigma}_{u,j}^2 / \hat{\sigma}_j^2$ for $j = 1, \ldots, p$ and the off-diagonal elements are 0. The fourth filter can now be formalized: genes are removed if $\hat{\sigma}_{u,j}^2 \ge 0.5\, \hat{\sigma}_j^2$, i.e. only genes with a noise-to-signal ratio less than 0.5 are kept for the analysis.

The original BAGUS algorithm and the corrected BAGUS algorithm were run for the remaining genes found after filtering. As with the simulations, the corrected BAGUS found fewer conditional pairwise relationships; for this data set, the original BAGUS and corrected BAGUS found 1045 and 552 conditional pairwise relationships, respectively. Of the 1045 naive pairwise relationships, 42% were also found among the corrected pairwise relationships; similarly, of the 552 corrected conditional pairwise relationships, 80% were found in the naive model. The large percentage overlap of relationships in the corrected model with relationships in the naive model suggests that most relationships in the corrected model are true relationships. Conversely, the small percentage overlap of relationships in the naive model with those in the corrected model suggests that the naive model is finding many false positive relationships. We illustrate the conditional pairwise dependencies of the genes in Figure 2. The naive analysis is shown on the left and the corrected on the right, where the green edges signify relationships found by both procedures and purple edges signify procedure-specific relationships.

Figure 2: The conditional pairwise relationships for each of the 273 genes remaining after filtering from the Wilms tumor study. Each edge represents a conditional pairwise dependency between two nodes. The left shows the naive analysis, not correcting for measurement error, and the right shows the corrected analysis, correcting for measurement error. Green edges signify edges found on both graphs, and purple signifies analysis-specific edges.
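The standardization and measurement error estimate described above can be sketched as follows (our illustration; `mu_hat` and `post_var` are hypothetical n × p arrays standing in for the BGX posterior means and variances):

```python
import numpy as np

def standardize_with_me(mu_hat, post_var):
    """Standardize gene expression and estimate the diagonal measurement
    error covariance, following the construction in this section.

    mu_hat: (n, p) posterior mean expression, patients by genes.
    post_var: (n, p) posterior variance of each patient/gene estimate.
    """
    mu_bar = mu_hat.mean(axis=0)               # per-gene mean
    sigma2 = mu_hat.var(axis=0)                # per-gene variance
    W = (mu_hat - mu_bar) / np.sqrt(sigma2)    # standardized data
    sigma2_u = post_var.mean(axis=0)           # per-gene ME variance
    ratio = sigma2_u / sigma2                  # noise-to-signal ratio
    keep = ratio < 0.5                         # the fourth filter
    Sigma_u_hat = np.diag(ratio[keep])         # ME covariance of kept W
    return W[:, keep], Sigma_u_hat
```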
Discussion

We proposed a correction methodology for Gaussian graphical models contaminated with additive measurement error. The core of the solution involves using the imputation-regularization algorithm to generate the true values of the underlying process together with a consistent estimate of the precision matrix. This provides a consistent, positive-definite estimate of the true precision matrix, which, as the simulations illustrate, removes many false positive pairwise relationships. Additionally, we show marked improvements in the AUC of the thresholded matrix, indicating better separation of the underlying relationships. From a practitioner's point of view, this allows for more reliable downstream analyses and further investigations.

To our knowledge, the novel imputation-regularization algorithm had yet to be used for problems pertaining to contaminated data. This provides an avenue of future research for a practical issue in high-dimensional problems, measurement error, which is starting to gain attention. Moreover, many practical issues still remain in the Gaussian graphical model context, such as the tuning of hyperparameters and the interpretation of the output from the Gibbs sampler-like IRO algorithm. Another potential avenue of research to pursue is the case when the amount of measurement error is unknown and not assumed independent. In this case, sparsity would need to be imposed on $\Omega_u$ in conjunction with $\Omega_x$, posing a challenging, but useful, computational procedure.

Appendix: Proof of Theorem 1

The proof for Theorem 1 in Section 3 is established here. Work done for the IRO algorithm laid the foundation for certain conditions to be met to establish consistency; see the appendix of [13]. We follow closely with their development, and prove the necessary conditions to establish consistency in our context of contaminated Gaussian graphical models. These conditions include two main parts: (1) the consistency of the regularization step, specifically the BAGUS procedure in our context, and (2) some technical conditions regarding the log-likelihood $\pi(X, W)$. To that end, Assumptions 1 and 2 below ensure the consistency of the BAGUS procedure, while Assumption 3 ensures the metric entropy of the log-likelihood does not grow too fast. Discussion of Assumptions 1 and 2 can be found in [6], while Assumption 3 has been commonly used in the literature on high-dimensional statistics; see Remark 1 in the appendix of [13].

Assumption 1. $\lambda_{\max}(\Omega_x) \le 1/k_1 < \infty$, where $\lambda_{\max}(\Omega_x)$ is the largest eigenvalue of $\Omega_x$ and $k_1$ is a constant such that $k_1 > 0$.

For the following assumption we need to define the following values. Let the column sparsity of $\Omega_x$ be denoted $b = \max_{i=1,\ldots,d} \sum_{j=1}^d \mathbb{1}(\omega_{ij} \neq 0)$. For an $m \times q$ matrix A, let $\|A\|_\infty = \max_{1 \le i \le m} \sum_{j=1}^q |a_{ij}|$ be the maximum absolute row sum. Define $M_{\Sigma_x} = \|\Sigma_x\|_\infty$ and $M_\Gamma = \|\Gamma_{s,s}^{-1}\|_\infty$, where $\Gamma = \Sigma_x \otimes \Sigma_x$ and $\Gamma_{s,s}$ denotes the subset of Γ with indices $s = \{(i, j) : \omega_{ij} \neq 0\}$.
Let $a_1 > a_2 > a_3 > 0$ and $k_2$ be defined such that $\sqrt{\log(d)/n} < a_3 < 1$ and $E(e^{t (x^{(j)})^2}) \le k_2$ for all $|t| \le a_3$ and $j = 1, \ldots, d$. We define $a_4 = a_1(2 + a_1 + a_2 - k_2)$ and $a_5 = \left(a_1 + 2 M_{\Sigma_x}(a_1 + a_2) M_\Gamma + 6(a_1 + a_2)\, b\, M_{\Sigma_x} M_\Gamma\right)/M_\Gamma$. Finally, define constants $\epsilon_1 > \epsilon_2 > 0$, where $\epsilon_2$ is small.

Assumption 2. For the previously defined constants, the following three statements hold:

1. The hyperparameters $v_0$, $v_1$, η, and τ satisfy
(a) $n v_0 = a_4 \sqrt{\frac{\log(d)}{n}}\,(1 - \epsilon_2)$,
(b) $n v_1 > a_5 \sqrt{\frac{\log(d)}{n}}$,
(c) $\frac{v_1 (1 - \eta)}{v_0\, \eta} \le d^{\epsilon_1}$,
(d) $\tau \le a_4\, n \sqrt{\frac{\log(d)}{n}}$.

2. For the bound $\|\Omega_x\|_2 < B$, we have that B satisfies
$$k_2 + 2b(a_1 + a_2) M_\Gamma \sqrt{\frac{\log(d)}{n}} < B < \sqrt{n}\, v_1.$$

3. For $M = \max\left\{ b(a_1 + a_2) M_\Gamma \max\{M_{\Sigma_x},\, M_\Gamma M_{\Sigma_x},\, k_2\},\ a_1 \epsilon_2 k_2 \right\}$, we have $\sqrt{n} \ge M \sqrt{\log(d)}$.

Assumption 3. The parameter space of $\Omega_x$, or an $L_2$-ball containing the space of $\Omega_x$, grows at a rate of $O(n^\alpha)$ for some $0 \le \alpha < 1$.

Under these assumptions, we show that the developed procedure to correct for measurement error satisfies the general conditions for the consistency of the IRO estimate. We state each condition and prove that it holds for our procedure.

Condition 1. $\log \pi(X, W | \Omega_x)$ is a continuous function of $\Omega_x$ for each $x, w \in \mathbb{R}^d$ and a measurable function of (X, W) for each $\Omega_x$.

Proof. We have the expansion
$$\log \pi(X, W | \Omega_x) = \log \pi(X | \Omega_x) + \log \pi(W | X, \Omega_u).$$
Hence, the log posterior is continuous for symmetric positive-definite $\Omega_x$ since $x \sim N_d(0, \Omega_x^{-1})$. The log posterior is also measurable in (X, W) due to properties of the Gaussian distribution.

Condition 2. Three conditions for the Glivenko-Cantelli theorem to hold:

1. There exists a function $m_n(X, W)$ such that $\sup_{\Omega_x, X} |\log \pi(X, W | \Omega_x)| \le m_n(X, W)$.

2. There exists $m_n^*(W)$ such that:
(a) $0 \le \int m_n(X, W)\, \pi(X | W, \Omega_x^{(t)})\, dX \le m_n^*(W)$ for all $\Omega_x^{(t)}$,
(b) $E[m_n^*(W)] < \infty$,
(c) $\sup_{n \in \mathbb{Z}^+} E[m_n^*(W)\, I(m_n^*(W) \ge \xi)] \to 0$ as $\xi \to \infty$.
Also, as $\xi \to \infty$,
$$\sup_{n \ge 1} \sup_{X, \Omega_x} \left| \int m_n(X, W)\, I(m_n(X, W) > \xi)\, \pi(X | W, \Omega_x) \right| \to 0.$$

3. Define $F_n = \left\{\int \log \pi(X, W | \Omega_x)\, \pi(X | W, \Omega_x^{(t)})\, dX\right\}$ and $G_{n,M} = \{q\, I\{m_n^*(W) \le M\} \mid q \in F_n\}$. Suppose that, for every $\epsilon$ and $M > 0$, the metric entropy
$$\log N(\epsilon, G_{n,M}, L_1(P_n)) = o(n),$$
where $P_n$ is the empirical measure of W and $N(\epsilon, G_{n,M}, L_1(P_n))$ is the covering number with respect to the $L_1(P_n)$-norm.

Proof. We begin with part (1). Note that
$$\log \pi(X, W | \Omega_x) = \sum_{i=1}^n \left[\log \pi(w_i | x_i, \Omega_u) + \log \pi(x_i | \Omega_x)\right] = -\frac{1}{2} \sum_{i=1}^n \left[(w_i - x_i)^T \Omega_u (w_i - x_i) + x_i^T \Omega_x x_i\right] + \frac{n}{2} \log \det(\Omega_x) + C,$$
where C contains constants not related to $(X, W, \Omega_x)$. Hence,
$$|\log \pi(X, W | \Omega_x)| \le \frac{1}{2} \sum_{i=1}^n \left[(w_i - x_i)^T \Omega_u (w_i - x_i) + K_1\, x_i^T x_i\right] + K_2 = \sum_{i=1}^n m(x_i, w_i) = m(X, W),$$
where $K_1$ and $K_2$ are constants depending on the upper bound B.

To prove part (2), note that
$$\tilde{m}(W, \Omega_x^{(t)}) = \int m(X, W)\, \pi(X | W, \Omega_x^{(t)})\, dX = \int \sum_{i=1}^n m(x_i, w_i) \left[\prod_{j=1}^n \pi(x_j | w_j, \Omega_x^{(t)})\right] dx_1 \cdots dx_n = \sum_{i=1}^n \int m(x_i, w_i)\, \pi(x_i | w_i, \Omega_x^{(t)})\, dx_i,$$
where the last equality follows from the conditional independence of each $x_i$.
Let $\Lambda^{(t)} = \Omega_x^{(t)} + \Omega_u$, and notice this is the sum of expectations of $m(x_i, w_i)$ with respect to Gaussian random variables following $N(\Lambda^{-1,(t)} \Omega_u w_i, \Lambda^{-1,(t)})$ for each $i = 1, \ldots, n$. Now,
$$E_{x_i | w_i, \Omega_x^{(t)}}[m(x_i, w_i)] = \frac{1}{2} w_i^T \Omega_u w_i + \frac{1}{2} \mathrm{tr}\left((\Omega_u + K_1 I_d)\, \Lambda^{-1,(t)}\right) - \underbrace{w_i^T \Omega_u \Lambda^{-1,(t)} \Omega_u w_i}_{\ge 0},$$
which, together with $\|\Lambda^{-1,(t)}\|_2 \le K_3$, implies
$$\tilde{m}(W, \Omega_x^{(t)}) \le \frac{1}{2} \sum_{i=1}^n w_i^T \Omega_u w_i + K_4 = m^*(W).$$
Marginally, $w_i \sim N_d(0, \Sigma_x + \Sigma_u)$, and hence $m^*(W)$ is a sum of scaled chi-square random variables. Conditions (b) and (c) easily follow from the properties of the chi-square distribution.

To prove part (3), we make use of Remark 1 found in the appendix of [13]. Since all elements in $\cup_{n \ge 1} F_n$ are uniformly Lipschitz, see [7], the metric entropy can be measured on the basis of the parameter space of $\Omega_x$. The functions in $G_{n,M}$ are bounded and the parameter space can be contained in an $L_2$-ball due to the continuity of $\log \pi(X, W | \Omega_x)$. By Assumption 3, then,
$$\log N(\epsilon, G_{n,M}, L_1(P_n)) = O(n^\alpha \log(d)).$$

Condition 3. Define $Z_{t,i} = \log \pi(x_i, w_i | \Omega_x) - \int \log \pi(x_i, w_i | \Omega_x)\, \pi(X | w_i, \Omega_x^{(t)})$. The $Z_{t,i}$ are subexponential random variables.

Proof. First, we note that
$$\log \pi(x_i, w_i | \Omega_x) = -\frac{1}{2}(w_i - x_i)^T \Omega_u (w_i - x_i) - \frac{1}{2} x_i^T \Omega_x x_i = -\frac{1}{2} x_i^T (\Omega_x + \Omega_u) x_i + x_i^T \Omega_u w_i + C_1,$$
where $C_1$ is a constant free of X. Also note that $\log \pi(w_i, X | \Omega_x) = \log \pi(w_i, x_i | \Omega_x) + \log \pi(X_{-i} | \Omega_x)$. The integral can then be shown to be
$$\int \log \pi(x_i, w_i | \Omega_x)\, \pi(X | w_i, \Omega_x^{(t)}) = \underbrace{\int \log \pi(w_i, x_i | \Omega_x)\, \pi(x_i | w_i, \Omega_x^{(t)})\, dx_i}_{=A}\ \underbrace{\int \pi(X_{-i} | \Omega_x^{(t)})\, dX_{-i}}_{=1} + \underbrace{\int \log \pi(X_{-i} | \Omega_x)\, \pi(X_{-i} | \Omega_x^{(t)})\, dX_{-i}}_{=C_2}\ \underbrace{\int \pi(x_i | w_i, \Omega_x^{(t)})\, dx_i}_{=1}.$$
The value of A is the expectation of $\log \pi(w_i, x_i | \Omega_x)$ with respect to the full conditional of X at iteration t, $x_i | w_i, \Omega_x^{(t)} \sim N_d(\Lambda^{-1,(t)} \Omega_u w_i, \Lambda^{-1,(t)})$, where $\Lambda^{(t)} = \Omega_x^{(t)} + \Omega_u$. This expectation is composed of two parts,
$$E_{x_i | w_i, \Omega_x^{(t)}}\left(x_i^T (\Omega_x + \Omega_u)\, x_i\right) = \mathrm{tr}\left((\Omega_x + \Omega_u)\, \Lambda^{-1,(t)}\right) + w_i^T \Omega_u \Lambda^{-1,(t)} (\Omega_x + \Omega_u) \Lambda^{-1,(t)} \Omega_u w_i$$
and
$$E_{x_i | w_i, \Omega_x^{(t)}}\left(x_i^T \Omega_u w_i\right) = w_i^T \Omega_u \Lambda^{-1,(t)} \Omega_u w_i.$$
Hence, $Z_{t,i}$ is
$$-\frac{1}{2} x_i^T (\Omega_x + \Omega_u)\, x_i + x_i^T \Omega_u w_i + \frac{1}{2} w_i^T \Omega_u \Lambda^{-1,(t)} (\Omega_x + \Omega_u) \Lambda^{-1,(t)} \Omega_u w_i - w_i^T \Omega_u \Lambda^{-1,(t)} \Omega_u w_i + C,$$
where $C = C_1 + C_2$ plus the trace term is free of $x_i$ and $w_i$; this is a sum of scaled chi-squared random variables and thus is subexponential.

Condition 4.