The covariance shift (C-SHIFT) algorithm for normalizing biological data
Evgenia Chunikhina, Paul Logan, Yevgeniy Kovchegov, Anatoly Yambartsev, Debashis Mondal, Andrey Morgun
Evgenia Chunikhina¹, Paul Logan², Yevgeniy Kovchegov³, Anatoly Yambartsev⁴, Debashis Mondal², Andrey Morgun⁵

¹ Computer Science, Willamette University, Salem, OR, USA; ² Statistics Department, Oregon State University, Corvallis, OR, USA; ³ Department of Mathematics, Oregon State University, Corvallis, OR, USA; ⁴ IME, University of São Paulo, São Paulo, Brazil (Email: [email protected]); ⁵ College of Pharmacy, Oregon State University, Corvallis, OR, USA
Abstract—Omics technologies are powerful tools for analyzing patterns in gene expression data for thousands of genes. Due to a number of systematic variations in experiments, the raw gene expression data are often obfuscated by undesirable technical noise. Various normalization techniques have been designed in an attempt to remove these non-biological errors prior to any statistical analysis. One of the reasons for normalizing data is the need to recover the covariance matrix used in gene network analysis. In this paper, we introduce a novel normalization technique, called the covariance shift (C-SHIFT) method. This normalization algorithm uses optimization techniques together with the blessing-of-dimensionality philosophy and an energy minimization hypothesis for covariance matrix recovery under additive noise (in biology, known as the bias). Thus, it is perfectly suited for the analysis of logarithmic gene expression data. Numerical experiments on synthetic data demonstrate the method's advantage over classical normalization techniques, namely rank, quantile, cyclic LOESS (locally estimated scatterplot smoothing), and MAD (median absolute deviation) normalization.
Introduction

Gene expression analysis plays an important role in genomic research. Several omics technologies, such as RNA-seq and microarrays, allow for the collection of massive amounts of simultaneous measurements of the gene expression levels of thousands to tens of thousands of genes. Analyzing different patterns of gene expression helps to gain insight into complex biological phenomena such as development, aging, the onset and progression of diseases, and cellular responses to drugs and treatments. Although new technologies are constantly developing, it is well known that all of them generate some technical noise which affects the measured gene expression levels [8], [20]. To extract accurate biological information it becomes necessary to normalize the data, filtering out or compensating for these non-biological errors. Normalization is therefore a crucial pre-processing step in gene expression data analysis. The gene expression data will vary significantly under different normalization methods, and the results of further data analysis (e.g., a gene expression network) will depend critically on the choice of normalization technique. A variety of normalization procedures have been used on gene expression data sets; see [3], [4], [12], [13], [15], [17], [19] and references therein for a review and comparison of current normalization strategies. In this paper we develop a novel normalization technique, called the covariance shift (C-SHIFT) method, and compare it to the following well-known normalization methods used in large-scale data analysis: rank, quantile, cyclic LOESS (locally estimated scatterplot smoothing), and MAD (median absolute deviation). See [1], [4], [14], [15] and references therein for more details on the above listed normalization methods.

Consider a situation where the gene expression data is subjected to multiplicative noise (a.k.a. bias). Specifically, let $X_n^{(i)}$ be the true gene expression, where the subscript $n$ stands for the $n$-th gene in the network and the superscript $i$ stands for the $i$-th measurement. The observed gene expression, denoted by $\widetilde X_n^{(i)}$, differs from $X_n^{(i)}$ because all gene expressions in the $i$-th measurement are distorted by an i.i.d. multiplicative noise variable $W^{(i)}$, i.e.,
$$\widetilde X_n^{(i)} = W^{(i)} X_n^{(i)}. \tag{1}$$
Here, both the observed and the true gene expressions are positive, i.e., $X_n^{(i)} > 0$, $\widetilde X_n^{(i)} > 0$, and $W^{(i)} > 0$; the variables $X_n^{(i)}$ are independent of the variable $W^{(i)}$.

In biology, the multiplicative noise $W^{(i)}$ is referred to as the bias. The bias is prompted by random events causing an error in the measurement of the total amount of RNA. Such random events are often related to different levels of tissue preservation in different samples, which leads to variability in RNA degradation and, consequently, to an RNA detection problem. Additionally, there are other technical reasons for an error in the measurement of the total amount of RNA in a given sample that may lead to a bias as in (1). All other noise (e.g., misreading parts of RNA) goes into the variable $X_n^{(i)}$.

The multiplicative noise in (1) implies a corresponding additive noise (bias) in the logarithmic gene expression data:
$$\widetilde Y_n^{(i)} = Y_n^{(i)} + V^{(i)}, \tag{2}$$
where we let $\widetilde Y_n^{(i)} := \log \widetilde X_n^{(i)}$, $Y_n^{(i)} := \log X_n^{(i)}$, and $V^{(i)} := \log W^{(i)}$. The bias, whether multiplicative as in (1) or additive as in (2), causes the correlations to be shifted away from $-1$.
Indeed, since the noise $W$ is independent of the expressions,
$$\mathrm{Cov}\big(\widetilde X_n, \widetilde X_m\big) = \mathbb{E}\big[W^2\big]\,\mathrm{Cov}(X_n, X_m) + \mathrm{Var}(W)\,\mathbb{E}[X_n]\,\mathbb{E}[X_m], \tag{3}$$
the correlation of the observed gene expressions equals
$$\mathrm{corr}\big(\widetilde X_n, \widetilde X_m\big) = \frac{\mathrm{Cov}(X_n, X_m) + \nu\,\mathbb{E}[X_n]\,\mathbb{E}[X_m]}{\sqrt{\big(\mathrm{Var}(X_n) + \nu\,\mathbb{E}[X_n]^2\big)\big(\mathrm{Var}(X_m) + \nu\,\mathbb{E}[X_m]^2\big)}}, \tag{4}$$
where $\nu := \mathrm{Var}(W)/\mathbb{E}[W^2] > 0$. Notice that if $\mathrm{Cov}(X_n, X_m)$ is negative, then adding the positive terms proportional to $\nu > 0$ in (4) yields $\mathrm{corr}(\widetilde X_n, \widetilde X_m) > \mathrm{corr}(X_n, X_m)$. In other words, all negative correlations $\mathrm{corr}(X_n, X_m)$ will either turn positive or become negative of smaller magnitude in the observed variables $\widetilde X_n$. While a multiplicative bias $W^{(i)} \approx 1$ (equivalently, an additive bias $V^{(i)} \approx 0$) may not appear critical, such biases are known to cause significant problems in gene correlation structure analyses. Specifically, this phenomenon is known to cause the disappearance of the large-magnitude negative correlations in the observed biological data $\widetilde X_n$, which hampers the ability to perform certain types of statistical data analysis, such as the false discovery rate (FDR) method.

The same is observed for the logarithmic data (2). Similarly to (3), the independent additive noise in (2) implies an increase of covariance,
$$\mathrm{Cov}\big(\widetilde Y_n, \widetilde Y_m\big) = \mathrm{Cov}(Y_n, Y_m) + \omega, \tag{5}$$
where $\omega = \mathrm{Var}(V) > 0$.
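A tiny simulation makes the covariance inflation in (5) and the correlation shift in (6) visible. This is an illustrative sketch under assumed parameter values (the variable names and the chosen bias variance are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 100_000                              # many samples so empirical values are close to (5)-(6)

# Two negatively correlated log-expressions Y_n, Y_m
cov_true = np.array([[1.0, -0.8],
                     [-0.8, 1.0]])
Y = rng.multivariate_normal([0.0, 0.0], cov_true, size=N)
V = rng.normal(0.0, np.sqrt(2.0), size=N)       # additive bias with omega = Var(V) = 2
Y_tilde = Y + V[:, None]                        # the same bias is added to both genes, cf. (2)

print(np.corrcoef(Y.T)[0, 1])           # about -0.8
print(np.corrcoef(Y_tilde.T)[0, 1])     # about (-0.8 + 2) / (1 + 2) = 0.4, cf. (6)
```

The large-magnitude negative correlation is turned into a moderate positive one by the common additive bias, which is exactly the phenomenon described above.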
Consequently, the correlations in the logarithmic data are
$$\mathrm{corr}\big(\widetilde Y_n, \widetilde Y_m\big) = \frac{\mathrm{Cov}(Y_n, Y_m) + \omega}{\sqrt{\big(\mathrm{Var}(Y_n) + \omega\big)\big(\mathrm{Var}(Y_m) + \omega\big)}}. \tag{6}$$
Here too, if $\mathrm{Cov}(Y_n, Y_m)$ is negative, adding $\omega > 0$ yields $\mathrm{corr}(\widetilde Y_n, \widetilde Y_m) > \mathrm{corr}(Y_n, Y_m)$. Thus, the phenomenon of disappearance of the large-magnitude negative correlations also applies to the logarithmic data $\widetilde Y_n$.

Denote by $\widehat{\mathrm{Cov}}$ the empirical covariances taken over $N$ subjects for each of the $\binom{M}{2}$ pairs of genes. Similarly, let $\widehat{\mathrm{Var}}$ denote the empirical variance. Then, equation (2) yields the observed empirical covariance
$$\widehat{\mathrm{Cov}}\big(\widetilde Y_n, \widetilde Y_m\big) = \widehat{\mathrm{Cov}}(Y_n, Y_m) - \widehat a_n - \widehat a_m + \widehat\omega \tag{7}$$
for all pairs of gene indices $n$ and $m$, where $\widehat a_n = -\widehat{\mathrm{Cov}}(Y_n, V)$ for all $n = 1, \dots, M$, and $\widehat\omega = \widehat{\mathrm{Var}}(V) > 0$. As is often the case, $\widehat\omega$ can be very large relative to the values of $\widehat a_n$, causing the disappearance of the large-magnitude negative correlations in empirical data.

The goal of the covariance shift (C-SHIFT) normalization method introduced here is the recovery of the true empirical covariances $\widehat{\mathrm{Cov}}(Y_n, Y_m)$ and the respective true empirical correlations in the case of logarithmic gene expression data, or in any other situation with additive noise as in (2).

Let $\widetilde C = \big(\widehat{\mathrm{Cov}}(\widetilde Y_n, \widetilde Y_m)\big)_{n,m}$ be the empirical covariance matrix of the observed data $\widetilde Y_n^{(i)}$, and let $C = \big(\widehat{\mathrm{Cov}}(Y_n, Y_m)\big)_{n,m}$ be the empirical covariance matrix of the cleaned data $Y_n^{(i)}$ (i.e., the true empirical covariance) that we wish to recover. Formula (7), rewritten in matrix form, states
$$C = \widetilde C + \widehat a\,\mathbf{1}^T + \mathbf{1}\,\widehat a^T - \widehat\omega\,\mathbf{1}\mathbf{1}^T, \tag{8}$$
where $\widehat a = (\widehat a_1, \dots, \widehat a_M)^T$ and $\mathbf{1}$ denotes the column vector of 1's, hence $\mathbf{1}\mathbf{1}^T$ is a square matrix of 1's.
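To make the identity (7)-(8) concrete, the following short sketch verifies it numerically on simulated logarithmic data. It is illustrative only (the names `a_hat`, `omega_hat`, `C_rec` are ours, not from the paper's code); the relation is an exact identity for empirical covariances, for any sample size:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 5, 50                            # genes x samples; (7) holds exactly for any N

Sigma = np.eye(M) + 0.5                 # some positive definite population covariance
Y = rng.multivariate_normal(np.zeros(M), Sigma, size=N).T   # true log-expressions, M x N
V = rng.normal(0.0, 2.0, size=N)        # common additive bias per sample, as in (2)
Y_tilde = Y + V                         # observed log-expressions

C = np.cov(Y)                           # true empirical covariance matrix
C_tilde = np.cov(Y_tilde)               # observed empirical covariance matrix

# empirical quantities from (7): a_hat_n = -Cov^(Y_n, V), omega_hat = Var^(V)
a_hat = -np.array([np.cov(Y[n], V)[0, 1] for n in range(M)])
omega_hat = np.var(V, ddof=1)

# matrix identity (8): C = C_tilde + a_hat 1^T + 1 a_hat^T - omega_hat 1 1^T
C_rec = C_tilde + a_hat[:, None] + a_hat[None, :] - omega_hat
print(np.max(np.abs(C_rec - C)))        # ~ 1e-15: exact up to floating point error
```

Of course, in practice $\widehat a$ and $\widehat\omega$ are unknown, since the bias $V$ is not observed; estimating them is precisely the task addressed next.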
Our goal here is to estimate $\widehat a$ and $\widehat\omega$ in (8), and thus recover the true empirical covariance matrix $C$. This will be done in Section I. We assume the dimension $M$ to be large. There will be two cases.

Case I: If $\det(\widetilde C) = 0$ (e.g., when $N < M$), we make a small perturbation of the diagonal entries of $\widetilde C$ (the variances), resulting in a new covariance matrix that is positive definite and whose smallest eigenvalue is still very close to zero. Next, we use energy minimization to estimate $\widehat a_n$ and $\widehat\omega$ in (8).

Case II: If $\widetilde C$ is positive definite, our approach exploits the phenomenon sometimes referred to as the curse of dimensionality [2], [16] and sometimes as the blessing of dimensionality [5], [7], [10], postulating that in higher dimensions almost all data points are located near extrema (i.e., in the outer shell); in this paper we will refer to the phenomenon as the blessing of dimensionality rather than the curse of dimensionality. In other words, for large $M$, we anticipate the smallest eigenvalue of $C$ to be near zero. As a rigorous bound, we observe that if some correlation $\mathrm{corr}(Y_n, Y_m)$ lies in the interval $[-1, \delta - 1]$, then the smallest eigenvalue of $C$ lies within the interval $\big[0,\ 2\delta \min\{\widehat{\mathrm{Var}}(Y_n), \widehat{\mathrm{Var}}(Y_m)\}\big]$. Thus, as in Case I, under the blessing of dimensionality assumption, we again use energy minimization for estimating $\widehat a_n$ and $\widehat\omega$.

The problem of improving the existing normalization methods and developing new ones is very important for scientists working with biological data. The fact that normalization alters the data-correlation structure was stated in Saccenti [19]; in addition, [19] gives a comprehensive overview of normalization methods. In Bolstad et al. [4] the authors compare three complete data normalization methods (cyclic LOESS, the contrast-based method, and quantile normalization), which make use of data from all arrays in an experiment, to two methods that make use of a baseline array. The comparison was done on two publicly available data sets, with the results favoring the complete data methods. For more on normalization methods, see [1], [6], [8], [9], [15], [18], [21].

I. Theoretical derivations
Proposition 1.
Suppose $B$ is a symmetric positive definite $M \times M$ matrix. Then
$$v^* := \max\big\{\, v \,:\, B - v\,\mathbf{1}\mathbf{1}^T \text{ is positive semidefinite} \,\big\} = \frac{1}{\mathbf{1}^T B^{-1}\mathbf{1}}.$$

Proof.

Observe that
$$x^T\big(B - v\,\mathbf{1}\mathbf{1}^T\big)x = x^T B x - v\Big(\sum_i x_i\Big)^2 \ge 0 \quad \text{for all } x \in \mathbb{R}^M$$
if and only if $v \le v^*$, where $v^*$ is the minimum of the ratio $x^T B x \big/ \big(\sum_i x_i\big)^2$, equivalently the minimum of $x^T B x$ under the condition $\sum_i x_i = 1$. Applying the Lagrange multipliers method, we obtain $2Bx = \lambda\mathbf{1}$, and therefore
$$v^* = \frac{x^T B x}{\big(\sum_i x_i\big)^2} = \frac{(\lambda/2)\,\mathbf{1}^T x}{\big(\mathbf{1}^T x\big)^2} = \frac{\lambda/2}{\mathbf{1}^T x} = \frac{1}{\mathbf{1}^T B^{-1}\mathbf{1}}, \qquad \text{since } x = \tfrac{\lambda}{2}\,B^{-1}\mathbf{1}.$$
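As a quick sanity check of Proposition 1 (an illustrative sketch, not part of the paper's code), one can verify numerically that subtracting $v^*\mathbf{1}\mathbf{1}^T$ with $v^* = 1/(\mathbf{1}^T B^{-1}\mathbf{1})$ drives the smallest eigenvalue of a positive definite matrix to zero while keeping it positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 6
G = rng.normal(size=(M, M))
B = G @ G.T + M * np.eye(M)          # a generic symmetric positive definite matrix

ones = np.ones(M)
v_star = 1.0 / (ones @ np.linalg.solve(B, ones))   # v* = 1 / (1^T B^{-1} 1)

shifted = B - v_star * np.outer(ones, ones)
print(np.linalg.eigvalsh(shifted).min())   # ~ 0: B - v* 11^T is singular but still PSD
```

Indeed, $B^{-1}\mathbf{1}$ lies in the null space of $B - v^*\mathbf{1}\mathbf{1}^T$, which is why the shifted matrix is exactly singular.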
Suppose the empirical covariance matrix $\widetilde C$ is positive definite, i.e., $\widetilde C$ is of full rank. Consider values of a column vector $\alpha = (\alpha_1, \dots, \alpha_M)^T$ such that $\widetilde C + \alpha\mathbf{1}^T + \mathbf{1}\alpha^T$ is positive definite. Then, by Proposition 1,
$$C_\alpha := \widetilde C + \alpha\mathbf{1}^T + \mathbf{1}\alpha^T - v(\alpha)\,\mathbf{1}\mathbf{1}^T \tag{9}$$
is positive semidefinite with $\det(C_\alpha) = 0$, where we let
$$v(\alpha) := \Big[\mathbf{1}^T\big(\widetilde C + \alpha\mathbf{1}^T + \mathbf{1}\alpha^T\big)^{-1}\mathbf{1}\Big]^{-1}. \tag{10}$$

Next, recall the quantities $\widehat a$ and $\widehat\omega$ in (8). If $\widetilde C$ is rank deficient, we perturb its diagonal entries by adding small positive (random or deterministic) values, and if $\widetilde C$ is positive definite, we assume that the blessing of dimensionality phenomenon holds. Thus, in either case, we work under the assumption that $\widetilde C$ is positive definite with its smallest eigenvalue located near zero. Then, Proposition 1 implies $\widehat\omega \approx v(\widehat a)$, where $v(\alpha)$ is as defined in (10). Therefore, letting $\alpha = \widehat a$ in (9), we will have $C_{\widehat a}$ approximating $C$.

Finally, for all $X \in \mathbb{R}^{M \times N}$, let $\|X\|_F$ denote the Frobenius norm of $X$, and let $E(X) = \|X\|_F^2$ be the energy function. Our next assumption states that $\widehat a$ can be estimated by the minimizer $\alpha^*$ of the energy function $E(C_\alpha)$, i.e., we estimate $\widehat a$ by
$$\alpha^* = \operatorname*{argmin}_\alpha \|C_\alpha\|_F^2.$$
The assumption is additionally justified by the observation that a random adjustment of the covariance via an additive noise (bias) as in (7) will result in an energy increase, i.e., $E(\widetilde C) > E(C)$. Hence, we use $C_{\alpha^*}$ to approximate $C_{\widehat a}$ and the desired true empirical covariance matrix $C$.
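A minimal sketch of the quantities (9)-(10) and of the energy objective, written for reference only (the function names `A_alpha`, `v_alpha`, `C_alpha`, and `energy` are ours, not from the paper's repository), could look as follows:

```python
import numpy as np

def A_alpha(C_tilde, alpha):
    """A_alpha = C_tilde + alpha 1^T + 1 alpha^T  (entrywise: A_ij = C_ij + alpha_i + alpha_j)."""
    return C_tilde + alpha[:, None] + alpha[None, :]

def v_alpha(C_tilde, alpha):
    """v(alpha) = 1 / (1^T A_alpha^{-1} 1), cf. (10)."""
    ones = np.ones(len(alpha))
    return 1.0 / (ones @ np.linalg.solve(A_alpha(C_tilde, alpha), ones))

def C_alpha(C_tilde, alpha):
    """C_alpha = A_alpha - v(alpha) 1 1^T, cf. (9); positive semidefinite with det = 0."""
    return A_alpha(C_tilde, alpha) - v_alpha(C_tilde, alpha)

def energy(C_tilde, alpha):
    """E(C_alpha) = ||C_alpha||_F^2, the objective minimized over alpha."""
    return np.sum(C_alpha(C_tilde, alpha) ** 2)
```

Subtracting the scalar `v_alpha(...)` from the matrix is equivalent to subtracting $v(\alpha)\mathbf{1}\mathbf{1}^T$, since the same value is removed from every entry.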
Lemma 1.

Suppose the empirical covariance matrix $\widetilde C$ is of full rank, and the quantities $C_\alpha$ and $v(\alpha)$ are as in (9) and (10). Then the gradient of the Frobenius norm squared is given by
$$\nabla \|C_\alpha\|_F^2 = 4M\alpha + 4\big[M^2 v(\alpha)^2 - c\,v(\alpha) - 2Ma\,v(\alpha)\big]\,A_\alpha^{-1}\mathbf{1} + 4\big[\widetilde C\mathbf{1} + a\mathbf{1} - Mv(\alpha)\mathbf{1}\big], \tag{11}$$
where $\|\cdot\|_F$ denotes the Frobenius norm, and we let
$$A_\alpha := \widetilde C + \alpha\mathbf{1}^T + \mathbf{1}\alpha^T, \qquad c := \mathbf{1}^T\widetilde C\mathbf{1}, \qquad a := \sum_{i=1}^M \alpha_i.$$
Proof.
By (9), we have
$$\|C_\alpha\|_F^2 = \|\widetilde C\|_F^2 + 2M\sum_{i=1}^M \alpha_i^2 + M^2 v(\alpha)^2 + 4\,\mathbf{1}^T\widetilde C\alpha + 2a^2 - 2c\,v(\alpha) - 4Ma\,v(\alpha). \tag{12}$$
Notice that
$$\frac{\partial}{\partial\alpha_i} A_\alpha = \bar e_i\mathbf{1}^T + \mathbf{1}\bar e_i^T \qquad \text{and} \qquad \frac{\partial}{\partial\alpha_i} A_\alpha^{-1} = -A_\alpha^{-1}\big(\bar e_i\mathbf{1}^T + \mathbf{1}\bar e_i^T\big)A_\alpha^{-1}, \tag{13}$$
where $\bar e_i$ is the $i$-th coordinate vector. Therefore, we have
$$\frac{\partial}{\partial\alpha_i} v(\alpha) = v(\alpha)^2\,\mathbf{1}^T A_\alpha^{-1}\big(\bar e_i\mathbf{1}^T + \mathbf{1}\bar e_i^T\big)A_\alpha^{-1}\mathbf{1} = 2v(\alpha)\,\mathbf{1}^T A_\alpha^{-1}\bar e_i, \tag{14}$$
implying
$$\nabla v(\alpha) = 2v(\alpha)\,A_\alpha^{-1}\mathbf{1}. \tag{15}$$
The gradient $\nabla\|C_\alpha\|_F^2$ in (11) is then found from equations (12) and (15).

First, observe that $C_\alpha$ is invariant under the addition of multiples of $\mathbf{1}$ to $\alpha$. Thus, without loss of generality, we may restrict the domain to a hyperplane $a = \mathrm{Const}$. Next, observe that $\mathbf{1}^T\nabla\|C_\alpha\|_F^2 = 0$, so that $a$ remains constant, i.e., throughout the algorithm the vector $\alpha$ remains on the same hyperplane $a = \mathrm{Const}$.
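Formulas like (11) are easy to get wrong in code, so a finite-difference check is worthwhile. The sketch below is illustrative (the helper names are ours, repeated here so the snippet is self-contained) and compares the closed-form gradient (11) with a numerical gradient of $\|C_\alpha\|_F^2$:

```python
import numpy as np

def A_alpha(Ct, al):
    return Ct + al[:, None] + al[None, :]

def v_alpha(Ct, al):
    ones = np.ones(len(al))
    return 1.0 / (ones @ np.linalg.solve(A_alpha(Ct, al), ones))

def energy(Ct, al):
    return np.sum((A_alpha(Ct, al) - v_alpha(Ct, al)) ** 2)

def grad_energy(Ct, al):
    """Closed-form gradient (11)."""
    M = len(al)
    ones = np.ones(M)
    v = v_alpha(Ct, al)
    a, c = al.sum(), ones @ Ct @ ones
    Ainv1 = np.linalg.solve(A_alpha(Ct, al), ones)
    return (4 * M * al
            + 4 * (M**2 * v**2 - c * v - 2 * M * a * v) * Ainv1
            + 4 * (Ct @ ones + a * ones - M * v * ones))

rng = np.random.default_rng(2)
M = 5
G = rng.normal(size=(M, 2 * M))
Ct = G @ G.T / (2 * M)                 # a full-rank "observed" covariance matrix
al = rng.normal(scale=0.1, size=M)

num = np.array([(energy(Ct, al + 1e-6 * np.eye(M)[i]) - energy(Ct, al - 1e-6 * np.eye(M)[i])) / 2e-6
                for i in range(M)])
print(np.max(np.abs(num - grad_energy(Ct, al))))   # ~ 0 up to finite-difference error
```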
Lemma 2.
Suppose the empirical covariance matrix $\widetilde C$ is of full rank, and the quantities $C_\alpha$, $v(\alpha)$, and $A_\alpha$ are as in (9), (10), and Lemma 1, respectively. Then the Hessian of $\|C_\alpha\|_F^2$, denoted by $H_\alpha := \mathrm{Hess}\big(\|C_\alpha\|_F^2\big)$, is expressed as follows:
$$\tfrac14 H_\alpha = M I + \mathbf{1}\mathbf{1}^T - 2Mv(\alpha)\big(A_\alpha^{-1}\mathbf{1}\mathbf{1}^T + \mathbf{1}\mathbf{1}^T A_\alpha^{-1}\big) + \big(3M^2 v(\alpha) - c - 2Ma\big)\,v(\alpha)\,A_\alpha^{-1}\mathbf{1}\mathbf{1}^T A_\alpha^{-1} - \big(M^2 v(\alpha) - c - 2Ma\big)\,A_\alpha^{-1}, \tag{16}$$
where $I$ is the identity matrix, $c = \mathbf{1}^T\widetilde C\mathbf{1}$, and $a = \sum_{i=1}^M \alpha_i$.

Proof.

By (11), we have
$$\tfrac14 H_\alpha = \nabla\big(\tfrac14\nabla\|C_\alpha\|_F^2\big)^T = M\,\nabla\alpha^T + \Big(\nabla\big[M^2 v(\alpha)^2 - c\,v(\alpha) - 2Ma\,v(\alpha)\big]\Big)\,\mathbf{1}^T A_\alpha^{-1} + \big[M^2 v(\alpha)^2 - c\,v(\alpha) - 2Ma\,v(\alpha)\big]\,\nabla\big(\mathbf{1}^T A_\alpha^{-1}\big) + \nabla\Big[\big(a - Mv(\alpha)\big)\mathbf{1}^T\Big], \tag{17}$$
where $\nabla = \big(\frac{\partial}{\partial\alpha_1}, \dots, \frac{\partial}{\partial\alpha_M}\big)^T$ is used as the column vector of the partial derivative operators. The summands in (17) are calculated as follows. First,
$$M\,\nabla\alpha^T = M I. \tag{18}$$
Next, (15) implies
$$\nabla\big[M^2 v(\alpha)^2 - c\,v(\alpha) - 2Ma\,v(\alpha)\big] = 2\big(2M^2 v(\alpha) - c - 2Ma\big)\,v(\alpha)\,A_\alpha^{-1}\mathbf{1} - 2Mv(\alpha)\,\mathbf{1}. \tag{19}$$
Equation (13) implies
$$\nabla\big(\mathbf{1}^T A_\alpha^{-1}\big) = \sum_{i=1}^M \bar e_i\,\frac{\partial}{\partial\alpha_i}\big(\mathbf{1}^T A_\alpha^{-1}\big) = -\sum_{i=1}^M \bar e_i\,\mathbf{1}^T A_\alpha^{-1}\big(\bar e_i\mathbf{1}^T + \mathbf{1}\bar e_i^T\big)A_\alpha^{-1} = -A_\alpha^{-1}\mathbf{1}\mathbf{1}^T A_\alpha^{-1} - \big(\mathbf{1}^T A_\alpha^{-1}\mathbf{1}\big)A_\alpha^{-1} = -A_\alpha^{-1}\mathbf{1}\mathbf{1}^T A_\alpha^{-1} - v(\alpha)^{-1}A_\alpha^{-1}. \tag{20}$$
Finally, (15) is used to derive
$$\nabla\Big[\big(a - Mv(\alpha)\big)\mathbf{1}^T\Big] = \mathbf{1}\mathbf{1}^T - 2Mv(\alpha)\,A_\alpha^{-1}\mathbf{1}\mathbf{1}^T. \tag{21}$$
Combining equations (18)-(21) and substituting them into (17), we obtain (16).
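As with the gradient, the Hessian formula (16) can be checked against finite differences of the closed-form gradient. The following is an illustrative sketch with our own helper names (not the paper's code):

```python
import numpy as np

def pieces(Ct, al):
    M = len(al); ones = np.ones(M)
    A = Ct + al[:, None] + al[None, :]
    Ainv1 = np.linalg.solve(A, ones)
    return M, ones, A, Ainv1, 1.0 / (ones @ Ainv1)

def grad_energy(Ct, al):
    M, ones, A, Ainv1, v = pieces(Ct, al)
    a, c = al.sum(), ones @ Ct @ ones
    return (4*M*al + 4*(M**2*v**2 - c*v - 2*M*a*v)*Ainv1
            + 4*(Ct @ ones + a*ones - M*v*ones))            # cf. (11)

def hess_energy(Ct, al):
    """Closed-form Hessian, i.e. 4 times the right-hand side of (16)."""
    M, ones, A, Ainv1, v = pieces(Ct, al)
    a, c = al.sum(), ones @ Ct @ ones
    J = np.outer(ones, ones)
    quarter = (M*np.eye(M) + J
               - 2*M*v*(np.outer(Ainv1, ones) + np.outer(ones, Ainv1))
               + (3*M**2*v - c - 2*M*a)*v*np.outer(Ainv1, Ainv1)
               - (M**2*v - c - 2*M*a)*np.linalg.inv(A))
    return 4*quarter

rng = np.random.default_rng(3)
M = 4
G = rng.normal(size=(M, 3*M)); Ct = G @ G.T / (3*M)
al = rng.normal(scale=0.1, size=M)
h = 1e-6
num = np.column_stack([(grad_energy(Ct, al + h*np.eye(M)[i]) - grad_energy(Ct, al - h*np.eye(M)[i])) / (2*h)
                       for i in range(M)])
print(np.max(np.abs(num - hess_energy(Ct, al))))   # ~ 0 up to finite-difference error
```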
Theorem 1.

Suppose the empirical covariance matrix $\widetilde C$ is of full rank, and the quantities $C_\alpha$ and $v(\alpha)$ are as in (9) and (10). Then the Frobenius norm squared $\|C_\alpha\|_F^2$ is convex; in particular, its Laplacian is non-negative,
$$\Delta\|C_\alpha\|_F^2 \ge 0 \qquad \text{for all } \alpha. \tag{22}$$

Proof.
We will use the notation of this section, such as $c := \mathbf{1}^T\widetilde C\mathbf{1}$ and $a := \sum_{i=1}^M \alpha_i$. Without loss of generality, we consider $\alpha$ on the hyperplane $a = 0$. The matrix $A_\alpha = \widetilde C + \alpha\mathbf{1}^T + \mathbf{1}\alpha^T$ is a positive definite symmetric matrix with eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_M > 0$. Let $\{v_i\}_{i=1,\dots,M}$ be the corresponding orthonormal basis of eigenvectors.

Equation (16) implies
$$\tfrac14\Delta\|C_\alpha\|_F^2 = \tfrac14\mathrm{Tr}\big(H_\alpha\big) = M^2\Big(1 - v(\alpha)\,\mathrm{Tr}\big(A_\alpha^{-1}\big)\Big) + c\Big(\mathrm{Tr}\big(A_\alpha^{-1}\big) - v(\alpha)\,\mathbf{1}^T A_\alpha^{-2}\mathbf{1}\Big) + 3M\Big(M v(\alpha)^2\,\mathbf{1}^T A_\alpha^{-2}\mathbf{1} - 1\Big). \tag{23}$$
The Laplacian in (23) is shown to be non-negative in the following three steps.

First, by the Cauchy-Bunyakovsky-Schwarz inequality, we have
$$M v(\alpha)^2\,\mathbf{1}^T A_\alpha^{-2}\mathbf{1} - 1 = v(\alpha)^2\Big(M\big\|A_\alpha^{-1}\mathbf{1}\big\|^2 - \big(\mathbf{1}^T A_\alpha^{-1}\mathbf{1}\big)^2\Big) \ge 0. \tag{24}$$

Next, observe that for $M \ge 2$, $Mx + (1-x)^2 \ge 1$ for all $x \in [0,1]$. Thus, for a given probability mass function $\{p_k\}_{k=1,\dots,M}$ such that $p_k < 1$ for all $k$, and a given index $i \in \{1, \dots, M\}$, Jensen's inequality implies
$$M p_i + \Big(\sum_{j:\,j\ne i}\lambda_j^{-1}p_j\Big)\Big(\sum_{j:\,j\ne i}\lambda_j p_j\Big) = M p_i + (1-p_i)^2\Big(\sum_{j:\,j\ne i}\lambda_j^{-1}q_j\Big)\Big(\sum_{j:\,j\ne i}\lambda_j q_j\Big) \ge M p_i + (1-p_i)^2 \ge 1, \tag{25}$$
where $q_j = p_j/(1-p_i)$ for all $j \ne i$. Multiplying (25) by $M\lambda_i^{-1}$ and summing over all $i$, we obtain
$$M^2\sum_i \lambda_i^{-1}p_i + M\sum_i \lambda_i^{-1}\Big(\sum_{j:\,j\ne i}\lambda_j^{-1}p_j\Big)\Big(\sum_{j:\,j\ne i}\lambda_j p_j\Big) \ge M\sum_i \lambda_i^{-1}. \tag{26}$$
Since $\sum_{j:\,j\ne i}\lambda_j p_j \le \sum_k \lambda_k p_k$, equation (26) implies
$$M^2\sum_i \lambda_i^{-1}p_i + M\sum_i \lambda_i^{-1}p_i\Big(\sum_{j:\,j\ne i}\lambda_j^{-1}\Big)\Big(\sum_k \lambda_k p_k\Big) \ge M\sum_i \lambda_i^{-1}, \tag{27}$$
which rewrites as
$$M^2\sum_i \lambda_i^{-1}p_i + M\Big(\sum_i \lambda_i^{-1}p_i\Big)\Big(\sum_j \lambda_j^{-1}\Big)\Big(\sum_k \lambda_k p_k\Big) \ge M\Big(\sum_i \lambda_i^{-2}p_i\Big)\Big(\sum_k \lambda_k p_k\Big) + M\sum_i \lambda_i^{-1}. \tag{28}$$

Finally, we let $p_i = \frac{1}{M}\big(\mathbf{1}^T v_i\big)^2$ and substitute the following expressions into (28):
$$\sum_i \lambda_i p_i = \tfrac{1}{M}\,\mathbf{1}^T A_\alpha\mathbf{1} = \tfrac{1}{M}\,\mathbf{1}^T\widetilde C\mathbf{1} = \tfrac{c}{M}\ \text{ as } a = 0, \qquad \sum_i \lambda_i^{-1}p_i = \tfrac{1}{M}\,\mathbf{1}^T A_\alpha^{-1}\mathbf{1} = \tfrac{1}{M v(\alpha)},$$
$$\sum_i \lambda_i^{-1} = \mathrm{Tr}\big(A_\alpha^{-1}\big), \qquad \sum_i \lambda_i^{-2}p_i = \tfrac{1}{M}\,\mathbf{1}^T A_\alpha^{-2}\mathbf{1}.$$
Consequently, after multiplying through by $Mv(\alpha) > 0$, (28) rewrites as
$$M^2\Big(1 - v(\alpha)\,\mathrm{Tr}\big(A_\alpha^{-1}\big)\Big) + c\Big(\mathrm{Tr}\big(A_\alpha^{-1}\big) - v(\alpha)\,\mathbf{1}^T A_\alpha^{-2}\mathbf{1}\Big) \ge 0. \tag{29}$$
Substituting (24) and (29) into (23), we obtain $\Delta\|C_\alpha\|_F^2 \ge 0$.

II. C-SHIFT algorithm and experiments
In this section we provide the C-SHIFT algorithm and evaluate its performance on synthetic data sets. Moreover, we compare the C-SHIFT algorithm with well-known and frequently used normalization methods: Quantile, Rank, LOESS, and median absolute deviation (MAD). Our empirical results demonstrate that the C-SHIFT algorithm outperforms the other methods.
A. C-SHIFT algorithm
The pseudocode for the C-SHIFT algorithm is given in Algorithm 1. Note that the algorithm takes into account both cases: when $\widetilde C$ has full rank and when $\widetilde C$ is rank deficient (i.e., $\widetilde C$ is positive semidefinite but not positive definite). When $\widetilde C$ is rank deficient, the rank of $\widetilde C + \alpha\mathbf{1}^T + \mathbf{1}\alpha^T$ may exceed the rank of $\widetilde C$ by no more than 2, and therefore this matrix may also be rank deficient. Therefore, to make $\widetilde C$ full rank, we add to it a diagonal matrix $\mathrm{diag}(f)$, where $f$ is a vector of i.i.d. random variables from $\mathrm{Unif}[0,1]$. To find the optimal $\alpha^* = \operatorname*{argmin}_\alpha \|C_\alpha\|_F^2$, we use the gradient and Hessian, provided in equations (11) and (16), in a trust-region algorithm that minimizes $\|C_\alpha\|_F^2$.
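A compact illustration of this optimization step is sketched below. This is our own illustrative code, not the released implementation in the GitHub repository cited in the Discussion; the helper names are ours, and for brevity we swap the trust-region solver with the exact Hessian (16) for SciPy's quasi-Newton L-BFGS-B method, supplying only the closed-form gradient (11):

```python
import numpy as np
from scipy.optimize import minimize

def c_shift(C_tilde, seed=0):
    """Illustrative sketch of Algorithm 1 (not the authors' released code).

    Estimates the true empirical covariance matrix from the observed covariance
    matrix C_tilde by minimizing ||C_alpha||_F^2 over alpha, cf. (9)-(11).
    """
    rng = np.random.default_rng(seed)
    M = C_tilde.shape[0]
    ones = np.ones(M)

    # Case I: perturb the diagonal if C_tilde is rank deficient.
    rank_deficient = np.linalg.matrix_rank(C_tilde) < M
    f = rng.uniform(0.0, 1.0, size=M) if rank_deficient else np.zeros(M)
    Ct = C_tilde + np.diag(f)

    def pieces(alpha):
        A = Ct + alpha[:, None] + alpha[None, :]
        Ainv1 = np.linalg.solve(A, ones)
        v = 1.0 / (ones @ Ainv1)
        return A, Ainv1, v

    def objective(alpha):
        A, _, v = pieces(alpha)
        return np.sum((A - v) ** 2)                  # ||C_alpha||_F^2

    def gradient(alpha):
        A, Ainv1, v = pieces(alpha)
        a, c = alpha.sum(), ones @ Ct @ ones
        return (4 * M * alpha
                + 4 * (M**2 * v**2 - c * v - 2 * M * a * v) * Ainv1
                + 4 * (Ct @ ones + a * ones - M * v * ones))   # cf. (11)

    res = minimize(objective, x0=np.zeros(M), jac=gradient, method="L-BFGS-B")
    alpha_star = res.x

    A, _, v = pieces(alpha_star)
    C_hat = A - v                                    # C_{alpha*}
    return C_hat - np.diag(f) if rank_deficient else C_hat
```

For example, `c_shift(np.cov(Y_tilde))` would return the normalized covariance matrix, whose entries can then be rescaled into correlations.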
B. Numerical experiments

In this subsection we conduct experiments on two synthetic data sets that we generate using the random covariance method (RCM) and the cascade method. We start by describing both methods.
1) Data generation:
[Figure 1: bar graph; x-axis "Magnitude of correlations", y-axis "Number of correlations"; legend: Empirical, Observed, C-SHIFT, Rank, Quantile, LOESS, MAD.]
Fig. 1. A bar graph of correlations for the RCM data set. On the x-axis we display the range of correlations, partitioned into intervals of length 0.1. The height of each bar describes the number of correlations that belong to the corresponding interval. Bars of different colors correspond to different correlation matrices, as indicated in the legend.

[Figure 2: six heat maps (Observed, Rank, Quantile, C-SHIFT, LOESS, MAD), each plotting correlations after bias and/or normalization against the true empirical correlations; both axes range from -1 to 1.]

Fig. 2. The heat maps for the RCM data set. Each heat map illustrates the transformation of the true empirical correlations $\mathrm{corr}(Y_n, Y_m)$ (horizontal axis) after adding bias and applying the corresponding normalization method. In the top left plot, the vertical axis represents the observed correlations $\mathrm{corr}(\widetilde Y_n, \widetilde Y_m)$. In the remaining five heat maps, the vertical coordinates represent the correlations after normalization. Going clockwise, these five heat maps are Rank, Quantile, MAD, LOESS, and C-SHIFT. The darker the color, the higher the density. The number on top of each heat map indicates the relative leftover error after normalization. Smaller numbers indicate better recovery performance.
[Figure 3: bar graph; x-axis "Magnitude of correlations", y-axis "Number of correlations"; legend: Empirical, Observed, C-SHIFT, Rank, Quantile, LOESS, MAD.]
Fig. 3. A bar graph of correlations for the Cascade data set. On the x-axis we display the range of correlations, partitioned into intervals of length 0.1. The height of each bar describes the number of correlations that belong to the corresponding interval. Bars of different colors correspond to different correlation matrices, as indicated in the legend.

[Figure 4: six heat maps (Observed, Rank, Quantile, C-SHIFT, LOESS, MAD), each plotting correlations after bias and/or normalization against the true empirical correlations; both axes range from -1 to 1.]

Fig. 4. The heat maps for the Cascade data set. Each heat map illustrates the transformation of the true empirical correlations $\mathrm{corr}(Y_n, Y_m)$ (horizontal axis) after adding bias and applying the corresponding normalization method. In the top left plot, the vertical axis represents the observed correlations $\mathrm{corr}(\widetilde Y_n, \widetilde Y_m)$. In the remaining five heat maps, the vertical coordinates represent the correlations after normalization. Going clockwise, these five heat maps are Rank, Quantile, MAD, LOESS, and C-SHIFT. The darker the color, the higher the density. The number on top of each heat map indicates the relative leftover error after normalization. Smaller numbers indicate better recovery performance.
Input: observed covariance matrix $\widetilde C$
Output: recovered empirical covariance matrix $C$

    if $\widetilde C$ is rank deficient then
        $f \leftarrow$ i.i.d. Unif[0,1]
        $\widetilde C \leftarrow \widetilde C + \mathrm{diag}(f)$
    end if
    $v(\alpha) \leftarrow \big[\mathbf{1}^T\big(\widetilde C + \alpha\mathbf{1}^T + \mathbf{1}\alpha^T\big)^{-1}\mathbf{1}\big]^{-1}$
    $C_\alpha \leftarrow \widetilde C + \alpha\mathbf{1}^T + \mathbf{1}\alpha^T - v(\alpha)\,\mathbf{1}\mathbf{1}^T$
    $\alpha^* \leftarrow \operatorname{argmin}_\alpha \|C_\alpha\|_F^2$
    $C \leftarrow C_{\alpha^*}$
    if $\widetilde C$ is rank deficient then
        $C \leftarrow C - \mathrm{diag}(f)$
    end if
    return $C$

Algorithm 1: C-SHIFT
a) Random covariance method (RCM): We generate a synthetic data set with $M$ genes and $N = 50$ measurements (samples) using the RCM. For that, we first generate an auxiliary matrix $H \in \mathbb{R}^{M \times m}$ ($m = 2$) whose entries are independent random variables, uniformly distributed over a fixed symmetric interval $I$. Next, we sample a diagonal matrix $D \in \mathbb{R}^{M \times M}$ with diagonal entries being i.i.d. exponential random variables with parameter $\lambda_D$, and let $\Sigma = HH^T + D$ be the population (parameter) covariance matrix. Then we generate the true empirical logarithmic data $Y^{(i)} = \big(Y_n^{(i)}\big) \sim \mathcal N(0, \Sigma)$ independently for each $i = 1, \dots, N$. Finally, we set the observed logarithmic data to be $\widetilde Y_n^{(i)} = Y_n^{(i)} + V^{(i)}$, where the $V^{(i)}$ are i.i.d. normal random variables (an illustrative sketch of this construction is given at the end of this data-generation subsection).

b) Cascade method: The cascade data sets were generated according to a directed acyclic weighted network $G = (V, E)$, a.k.a. a directed acyclic graph (DAG). The graph was randomly generated via a recurrent cascade model. The parent-offspring relation is represented by the direction of the edges $E = \{(u, v)\}$ of the graph $G$, i.e., $u$ is the parent vertex and $v$ is its offspring. For any vertex $v$, let $\mathrm{pa}(v)$ be the set of its parents, $\mathrm{pa}(v) = \{u \in V : (u, v) \in E\}$. Next, to each edge $(u, v) \in E$ an independent random weight $c_{uv}$ is assigned, with c.d.f.
$$p\,U_{[a_-,\,b_-]}(x) + (1 - p)\,U_{[a_+,\,b_+]}(x),$$
where the parameters $a_- < b_- \le 0$, $0 \le a_+ < b_+$, and $p \in (0, 1)$ are fixed, and $U_A(x)$ denotes the uniform c.d.f. on an interval $A$. We generated a random weighted DAG with the nodes $v \in V$ representing the genes. The random variables $\{Y_v\}_{v \in V}$, representing the logarithmic gene expressions, are generated as a noisy multiplicative cascade via the following structural linear recursive equations:
$$Y_v = \sum_{u \in \mathrm{pa}(v)} c_{uv}\, Y_u + \varepsilon_v,$$
where the recursion begins with the root value $Y_{\mathrm{root}} = y_0$ and proceeds from generation to generation. The noise variables $(\varepsilon_v,\ v \in V)$ are i.i.d. $\mathcal N(0, \sigma^2)$, sampled independently from the random weights $c_{uv}$. For the simulation of $(Y_v,\ v \in V)$, fixed values of the parameters $p$, $[a_-, b_-]$, $[a_+, b_+]$, $\sigma^2$, $y_0$, and $|V|$ were chosen, with the negative-weight interval $[a_-, b_-]$ a subinterval of $(-1, 0)$ and the positive-weight interval $[a_+, b_+]$ a subinterval of $(0, 1)$.
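A minimal sketch of the RCM construction followed by the additive bias of (2) is given below. It is illustrative only: the specific parameter values (the interval $[-1,1]$ for $H$, $\lambda_D = 1$, the bias distribution, and the choice of $M$) are our own assumptions, not the values used in the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(4)
M, N, m = 200, 50, 2                 # genes, samples, rank of the low-rank part (assumed values)

# Population covariance: Sigma = H H^T + D
H = rng.uniform(-1.0, 1.0, size=(M, m))          # assumed interval I = [-1, 1]
D = np.diag(rng.exponential(scale=1.0, size=M))  # assumed lambda_D = 1
Sigma = H @ H.T + D

# True empirical logarithmic data and observed (biased) data, cf. (2)
Y = rng.multivariate_normal(np.zeros(M), Sigma, size=N).T    # M x N
V = rng.normal(loc=-0.5, scale=1.0, size=N)                  # assumed bias distribution
Y_tilde = Y + V

C_true = np.cov(Y)          # target of the recovery
C_obs = np.cov(Y_tilde)     # input to C-SHIFT and to the other normalization methods
```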
2) Simulation results:
We generate two data sets (RCM and Cascade) using the methods described in Section II-B1. Each data set consists of a matrix with the empirical data $\big(Y_n^{(i)}\big) \in \mathbb{R}^{M \times N}$ and a matrix with the observed data $\big(\widetilde Y_n^{(i)}\big) \in \mathbb{R}^{M \times N}$. In both the RCM and the Cascade data sets, we let $N = 50$ measurements (samples). For each data set, we normalize the covariance matrix $\widetilde C$, obtained from the observed data, using the C-SHIFT, Rank, Quantile, LOESS, and MAD methods. We compare the performance of the algorithms using the results presented in Figures 1-4.

In Figures 1 and 3 we depict the bar graphs of correlations for the RCM and Cascade data sets, respectively. On the x-axis we display the range of correlations, partitioned into intervals of length 0.1; the height of each bar gives the number of correlations falling into the corresponding interval.

In Figures 2 and 4 we depict the heat maps for the RCM and Cascade data sets, respectively. Each heat map illustrates the transformation of the true empirical correlations $\mathrm{corr}(Y_n, Y_m)$ (horizontal axis) after adding bias and applying the corresponding normalization method. We consider 2,001,000 correlations corresponding to all pairs of genes. For each point, representing a pair of genes $(n, m)$, the horizontal coordinate equals the true empirical correlation $\mathrm{corr}(Y_n, Y_m)$ in all six plots. The vertical coordinate in the top left heat map is the correlation in the observed data, $\mathrm{corr}(\widetilde Y_n, \widetilde Y_m)$; importantly, it shows the rightward shift of correlations in the observed data. In the remaining five heat maps, the vertical coordinates represent the correlations after normalization. Going clockwise, these five heat maps are Rank, Quantile, MAD, LOESS, and C-SHIFT. The darker the color, the higher the density. Notice that the heat map for C-SHIFT is almost perfectly diagonal, which demonstrates how well C-SHIFT recovers the correlations. Thus, in addition to correctly recovering the numbers of correlations in each interval (as demonstrated in Figures 1 and 3), the proposed C-SHIFT algorithm also shifts the correlations back to the correct margins; hence its heat map is a diagonal line. The number on top of each heat map indicates the relative leftover error after normalization, i.e., the $\ell_2$-norm of the vector of differences between the horizontal and vertical coordinates, scaled by the Frobenius norm of the difference between the empirical and the observed correlation matrices. Thus, the top left heat map is assigned the value 1, and for each normalization method, the smaller the number, the better it recovers the empirical correlation matrix; any such number smaller than one is an improvement. The number for C-SHIFT is by far the smallest in each data set.
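The relative leftover error used in Figures 2 and 4 is straightforward to compute from the verbal description above; a hedged sketch (the helper name is ours) is:

```python
import numpy as np

def relative_leftover_error(R_true, R_normalized, R_observed):
    """l2-norm of the differences between normalized and true correlations,
    scaled by the Frobenius norm of (true - observed). The observed correlation
    matrix itself therefore scores exactly 1."""
    return np.linalg.norm(R_normalized - R_true) / np.linalg.norm(R_true - R_observed)
```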
III. Discussion

In systems biology, gene co-expression networks (GCN) are reconstructed from the correlations between genes. GCN recovery relies on removing the bias with a normalization method, and thus on improving the estimation of the correlations between pairs of genes. However, standard normalization techniques such as Rank, Quantile, LOESS, and MAD are known to be insufficient at recovering the true empirical correlations, while the C-SHIFT algorithm is specifically designed to recover them. The multiple experiments with synthetic data sets demonstrate the algorithm's superior performance in comparison to the standard normalization techniques.

Importantly, we notice that the C-SHIFT algorithm corrects the positive shift of covariances (and correlations) observed when $\widehat\omega = \widehat{\mathrm{Var}}(V)$ is larger than $\widehat a_n = -\widehat{\mathrm{Cov}}(Y_n, V)$ ($n = 1, \dots, M$) in (7). Hence, the assumption that $V$ is independent of $Y_n$ can be replaced with the weaker assumption that $\mathrm{Cov}(Y_n, V) \ll \mathrm{Var}(V)$. This will be explored in a follow-up publication.

An alternative version of the C-SHIFT algorithm is based on a trace minimization approach instead of energy minimization. In this alternative C-SHIFT algorithm, the positive semidefinite matrix $C_{\alpha^*}$ with
$$\alpha^* = \operatorname*{argmin}_\alpha \mathrm{Tr}\big(C_\alpha\big)$$
is used to approximate the true empirical covariance matrix $C$. The analogs of Lemmas 1 and 2 and of the convexity result in Theorem 1 are also established for $\mathrm{Tr}(C_\alpha)$ in the trace minimization approach; see [11]. Empirically, this alternative approach appears to produce the same $\alpha^*$ as the original C-SHIFT algorithm based on energy minimization presented in this paper, and therefore it recovers the empirical covariance $C$ with the same accuracy. Thus, the alternative, trace-minimizing C-SHIFT algorithm can be used instead of Algorithm 1. This approach will be analyzed in a follow-up paper; a brief sketch of the alternative objective is given below.

Finally, the C-SHIFT algorithm has been deposited on GitHub at https://github.com/prlogan/C-SHIFT
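For completeness, the trace objective itself is simple to write down: since $\mathrm{Tr}(\alpha\mathbf{1}^T + \mathbf{1}\alpha^T) = 2\sum_i\alpha_i$ and $\mathrm{Tr}\big(v(\alpha)\mathbf{1}\mathbf{1}^T\big) = Mv(\alpha)$, one has $\mathrm{Tr}(C_\alpha) = \mathrm{Tr}(\widetilde C) + 2\sum_i\alpha_i - Mv(\alpha)$. A hedged sketch of this alternative objective (our own code, not the implementation from [11] or the repository):

```python
import numpy as np

def trace_objective(C_tilde, alpha):
    """Tr(C_alpha) = Tr(C_tilde) + 2*sum(alpha) - M*v(alpha): the alternative
    (trace minimization) objective mentioned in the Discussion."""
    M = len(alpha)
    ones = np.ones(M)
    A = C_tilde + alpha[:, None] + alpha[None, :]
    v = 1.0 / (ones @ np.linalg.solve(A, ones))
    return np.trace(C_tilde) + 2.0 * alpha.sum() - M * v
```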
Acknowledgments

This research is supported by the FAPESP awards 2018/14952-7 and 2018/07826-5, and by the NSF award DMS-1412557.
References

[1] Dhammika Amaratunga and Javier Cabrera. Analysis of data from viral DNA microchips. Journal of the American Statistical Association, 96(456):1161-1170, 2001.
[2] Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629-681, 2017.
[3] Martin Bilban, Lukas K. Buehler, Steven Head, Gernot Desoye, and Vito Quaranta. Normalizing DNA microarray data. Current Issues in Molecular Biology, 4:57-64, 2002.
[4] Benjamin M. Bolstad, Rafael A. Irizarry, Magnus Åstrand, and Terence P. Speed. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19(2):185-193, 2003.
[5] David L. Donoho. High-dimensional data analysis: The curses and blessings of dimensionality. AMS Conference on Mathematical Challenges of the 21st Century, 2000.
[6] Jianqing Fan, Heng Peng, and Tao Huang. Semilinear high-dimensional model for normalization of microarray data: a theoretical analysis and partial consistency. Journal of the American Statistical Association, 100(471):781-796, 2005.
[7] Alexander N. Gorban and Ivan Yu. Tyukin. Blessing of dimensionality: mathematical foundations of the statistical physics of data. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 376(2118):20170237, 2018.
[8] Alexander J. Hartemink, David K. Gifford, Tommi S. Jaakkola, and Richard A. Young. Maximum-likelihood estimation of optimal scaling factors for expression array normalization. In Microarrays: Optical Technologies and Informatics, Vol. 4266, pages 132-140. International Society for Optics and Photonics, 2001.
[9] Jianhua Hu and Xuming He. Enhanced quantile normalization of microarray data to reduce loss of information in gene expression profiles. Biometrics, 63(1):50-59, 2007.
[10] Paul C. Kainen. Utilizing geometric anomalies of high dimension: When complexity makes computation easier. In Computer Intensive Methods in Control and Signal Processing, pages 283-294. Springer, 1997.
[11] Paul Logan. C-SHIFT, Quantile Theory, and Assessing Monotonicity. Ph.D. thesis, Oregon State University, 2020.
[12] Taesung Park, Sung-Gon Yi, Sung-Hyun Kang, SeungYeoun Lee, Yong-Sung Lee, and Richard Simon. Evaluation of normalization methods for microarray data. BMC Bioinformatics, 4(1):33, 2003.
[13] Sylvain Pradervand, Johann Weber, Jérôme Thomas, Manuel Bueno, Pratyaksha Wirapati, Karine Lefort, G. Paolo Dotto, and Keith Harshman. Impact of normalization on miRNA microarray expression profiling. RNA, 15(3):493-501, 2009.
[14] Xing Qiu, Hulin Wu, and Rui Hu. The impact of quantile and rank normalization procedures on the testing power of gene differential expression analysis. BMC Bioinformatics, 14(1):124, 2013.
[15] John Quackenbush. Microarray data normalization and transformation. Nature Genetics, 32(4):496-501, 2002.
[16] Miloš Radovanović, Alexandros Nanopoulos, and Mirjana Ivanović. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 11(Sep):2487-2531, 2010.
[17] Youlan Rao, Yoonkyung Lee, David Jarjoura, Amy S. Ruppert, Chang-gong Liu, Jason C. Hsu, and John P. Hagan. A comparison of normalization techniques for microRNA microarray data. Statistical Applications in Genetics and Molecular Biology, 7(1), 2008.
[18] Cavan Reilly, Changchun Wang, and Mark Rutherford. A method for normalizing microarrays using genes that are not differentially expressed. Journal of the American Statistical Association, 98(464):868-878, 2003.
[19] Edoardo Saccenti. Correlation patterns in experimental data are affected by normalization procedures: consequences for data analysis and network inference. Journal of Proteome Research, 16(2):619-634, 2017.
[20] Andreas Scherer. Batch effects and noise in microarray experiments: sources and solutions. John Wiley & Sons, 2009.
[21] Gordon K. Smyth and Terry Speed. Normalization of cDNA microarray data. Methods, 31(4):265-273, 2003.