A Fast and Flexible Method for the Segmentation of aCGH Data

Abstract

Motivation: Array Comparative Genomic Hybridization (aCGH) is used to scan the entire genome for variations in DNA copy number. A central task in the analysis of aCGH data is the segmentation into groups of probes sharing the same DNA copy number. Some well known segmentation methods suffer from very long running times, preventing interactive data analysis.

Results: We suggest a new segmentation method based on wavelet decomposition and thresholding, which detects significant breakpoints in the data. Our algorithm is over 1,000 times faster than leading approaches, with similar performance. Another key advantage of the proposed method is its simplicity and flexibility. Due to its intuitive structure it can be easily generalized to incorporate several types of side information. Here we consider two extensions which include side information indicating the reliability of each measurement, and compensating for a changing variability in the measurement noise. The resulting algorithm outperforms existing methods, both in terms of speed and performance, when applied on real high density CGH data.

Availability: Implementation is available under software tab at: this http URL

Contact: [email protected]


© The Author 2008. Gene Expression


Erez Ben-Yaacov, Yonina Eldar

Department of Electrical Engineering, Technion – Israel Institute of Technology, Haifa, Israel.

Preprint, Accepted for publication in Bioinformatics (Proceedings of ECCB08).


1 INTRODUCTION

Array Comparative Genomic Hybridization (aCGH) is used to scan the entire genome for variations in DNA copy number. DNA from test and reference cell populations is differentially labeled and hybridized on the array, and the log ratio between the two hybridization results is used to detect copy number variations. High density aCGH, spanning hundreds of thousands of probes, is a powerful tool in the research of cancer (Barrett et al., 2004) [...].

Olshen et al. (2004) suggested a circular binary segmentation (CBS) algorithm, based on recursively applying a statistical test to detect significant breakpoints in the data. Picard et al. (2005) developed a dynamic programming procedure to segment the data when the number of segments is known in advance, which is referred to as CGHseg. The actual number of segments in real data is determined by maximizing a penalized likelihood function. While other segmentation methods exist, such as Lipson et al. (2005, 2006), a comparison study (Lai et al., 2005) [...].

We present HaarSeg, a new segmentation method, based on well known wavelet denoising principles. HaarSeg identifies statistically significant breakpoints in the data, using the maxima of the Haar wavelet transform, and segments accordingly. HaarSeg is a fast method, over 1,000 times faster than CBS and CGHseg, enabling interactive data analysis, with a slight compromise in performance. Due to its simple and intuitive structure, it is also a flexible method, and therefore easy to extend. We show how HaarSeg can be generalized to use quality of measurement data, additional information which exists in some platforms, indicating the reliability of each measurement. The use of quality of measurement was first suggested in Lipson et al. (2005), and it is currently used in ADM2, a segmentation algorithm based on Lipson et al. (2005, 2006) and used for example in de Smith et al. (2007) and in Perry et al. (2008). Since ADM2 does not have a freely available implementation, we did not compare our performance to this segmentation algorithm. We also suggest an extension to compensate for the large variance in the log ratio measurements which occurs when one of the raw measurements has a very low value. Using these two generalizations, we show that HaarSeg outperforms existing methods, while remaining much faster.

The use of the Haar wavelet for microarray analysis is not new. Hsu et al. (2005) suggested applying standard wavelet denoising on microarrays, using the Haar wavelet. HaarSeg is different from that approach as it performs segmentation rather than smoothing of the data. To emphasize this difference we compare our results to Hsu et al. (2005), and show that HaarSeg outperforms this method as well.

The rest of the paper is organized as follows: The basic HaarSeg algorithm is discussed in Section 2.1. Generalizations including quality of measurement data and adaptation to non-stationary variance are presented in Section 2.2. In Sections 3.1 and 3.2 we provide simulation results, and finally, analysis of real CGH data is presented in Section 3.3.

* To whom correspondence should be addressed.

2 METHODS

Each measurement in aCGH data is the log ratio of two raw measurements, red and green, which we denote by $\log(R/G)$. Our signal, $y[n]$, is a set of $\log(R/G)$ measurements from a single chromosome, ordered according to their genomic coordinates. Alterations in the number of copies in the aCGH data occur in contiguous regions of the chromosome, often spanning multiple probes. We therefore consider the problem of recovering a piecewise constant signal $x[n]$ from its noisy measurements $y[n]$, which can be viewed as the segmentation of $y[n]$.

The Basic HaarSeg Algorithm

We suggest the following scheme, which is explained in detail in the next subsections:

• Apply the undecimated discrete wavelet transform (Mallat 1998) on the data, using the Haar wavelet.
• Select a set of detail subbands from the transform $\{L_{MIN}, L_{MIN}+1, \ldots, L_{MAX}\}$.
• Find the local maxima of the selected detail subbands.
• Threshold the maxima of each subband separately, using an FDR thresholding procedure.
• Unify selected maxima from all the subbands to create a list of significant breakpoints in the data.
• Reconstruct the segmentation result from the list of significant breakpoints.

The discrete wavelet transform (Mallat 1998) decomposes a given signal into an approximation subband and a set of detail subbands at different resolution scales. The approximation subband is a coarse or smooth version of the original signal, containing the scale coefficients. The detail subbands describe the higher frequencies of the signal, and are composed of the wavelet coefficients. Here we consider the undecimated discrete wavelet transform (UDWT), where each subband has the same number of coefficients. The UDWT is well suited for the task of data analysis, mainly due to its translation invariance property (Starck et al., 2005). The Haar UDWT wavelet coefficients at level $L$ are given by:

$$ w_L[n] = \frac{1}{\sqrt{2^{L+1}}} \left( \sum_{k=n}^{n+(2^L-1)} y[k] - \sum_{k=n-2^L}^{n-1} y[k] \right). \tag{1} $$
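As an illustration of equation (1), the following is a minimal Python sketch, not the authors' implementation. It uses cumulative sums so that each level costs O(N) regardless of the window length. The function name `haar_udwt` is hypothetical, and setting coefficients to zero where a full window does not fit is a simplifying assumption, since the text does not specify boundary handling.

```python
import numpy as np

def haar_udwt(y, level):
    """Undecimated Haar detail coefficients w_L[n], following equation (1).

    Assumption: positions where either window of 2**level probes would run
    past the ends of the signal are set to 0.
    """
    y = np.asarray(y, dtype=float)
    n_probes = y.size
    half = 2 ** level                              # probes on each side of n
    csum = np.concatenate(([0.0], np.cumsum(y)))   # csum[i] = sum(y[:i])
    w = np.zeros(n_probes)
    n = np.arange(half, n_probes - half + 1)       # positions with full windows
    if n.size:
        right = csum[n + half] - csum[n]           # sum of y[n .. n+half-1]
        left = csum[n] - csum[n - half]            # sum of y[n-half .. n-1]
        w[n] = (right - left) / np.sqrt(2.0 * half)
    return w
```

For a step signal such as `[0, 0, 0, 0, 1, 1, 1, 1]` at level 1, the coefficient of largest magnitude sits at n = 4, the first probe of the new segment.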

The wavelet coefficients $w_L[n]$ in (1) can be viewed as the difference between two averages. In places where no breakpoint occurred in the signal, we expect $w_L[n]$ to be zero, as it is the difference between two identical averages. When zero mean additive noise is present it will typically average out for large enough $L$, so that $w_L[n]$ will still be close to 0. In places where a breakpoint occurred, we expect a high absolute value of $w_L[n]$, as the two averages are different.

Let $z_L[k]$ denote the local maxima of the absolute values of $w_L[n]$:

$$ z_L[k] = \operatorname{localmax}\left( \left| w_L[n] \right| \right), \quad 1 \le k \le K, \tag{2} $$

where $K$ is the number of local maxima in $|w_L[n]|$. A coefficient is a local maximum if it is larger than its neighbors. We start by examining the two closest neighboring coefficients, and in case of equality we extend the neighborhood until we encounter a larger or smaller coefficient. High amplitude coefficients in $z_L[k]$ correspond to locations where abrupt changes occurred in $y[n]$, and low amplitude coefficients correspond to changes in $y[n]$ which were caused by noise. Finer detail subbands provide better localization of abrupt changes, but are more sensitive to noise.

Given a list of coefficients $z[k]$ from a specific subband $L$, we wish to keep just the larger ones, which in our case correspond to significant breakpoints in the data. To this end we consider the false discovery rate (FDR) thresholding procedure (Benjamini and Hochberg, 1995) [...], where $\sigma$ denotes the standard deviation of the noise. We select the maximum number of coefficients such that the estimated FDR is kept under a predefined level $q$, where $0 < q < 0.5$. To apply FDR thresholding we first sort $z[k]$ in descending order, such that:

$$ z_{(1)} \ge z_{(2)} \ge \ldots \ge z_{(i)} \ge \ldots \ge z_{(K)}. $$

For each measurement $z_{(i)}$ we calculate the two-sided p-value:

$$ p_{(i)} = 2 \left( 1 - \Phi\left( z_{(i)} / \sigma \right) \right), $$

where $\Phi$ is the normal CDF. Starting from $i = 1$, we then find the largest index $i$ for which

$$ p_{(i)} \le (i / K) \, q. $$
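The local-maximum search and the FDR rule described here can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names are hypothetical, plateaus of equal values are reported at their first index, a coefficient at the edge of the signal is compared only with its single neighbor, and the two-sided p-value is computed through the identity 2(1 − Φ(x)) = erfc(x/√2).

```python
import math
import numpy as np

def local_maxima(w):
    """Indices where |w| is a local maximum (a plateau yields its first index)."""
    a = np.abs(np.asarray(w, dtype=float))
    peaks = []
    i, n = 0, a.size
    while i < n:
        j = i
        while j + 1 < n and a[j + 1] == a[i]:   # extend over a run of equal values
            j += 1
        left_ok = i == 0 or a[i - 1] < a[i]
        right_ok = j == n - 1 or a[j + 1] < a[i]
        if left_ok and right_ok and a[i] > 0:
            peaks.append(i)
        i = j + 1
    return np.array(peaks, dtype=int)

def fdr_threshold(z, sigma, q):
    """Indices (into z) of the peak magnitudes kept by the FDR rule at level q."""
    z = np.asarray(z, dtype=float)
    K = z.size
    if K == 0:
        return np.array([], dtype=int)
    order = np.argsort(-z)                      # descending order of magnitude
    # two-sided p-value: 2 * (1 - Phi(z / sigma)) == erfc(z / (sigma * sqrt(2)))
    p = np.array([math.erfc(v / (sigma * math.sqrt(2.0))) for v in z[order]])
    ranks = np.arange(1, K + 1)
    passing = np.nonzero(p <= ranks * q / K)[0]
    if passing.size == 0:
        return np.array([], dtype=int)
    keep = passing[-1] + 1                      # largest i satisfying the rule
    return np.sort(order[:keep])
```

A typical use would pass `np.abs(w)[local_maxima(w)]` as `z`, together with an estimate of the noise standard deviation.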
Thresholding is obtained by keeping the $i$ largest coefficients, $z_{(1)}, \ldots, z_{(i)}$. Since in practice the standard deviation of the noise is unknown, we estimate it by using the robust median absolute deviation (MAD) estimator (Donoho 1995) on the finest detail subband $w[n]$:

$$ \hat{\sigma} = \operatorname{median}\left( \left| w[n] \right| \right) / 0.6745. \tag{3} $$

To reconstruct the signal $x[n]$ from the local maxima in each subband, we first need to unify maxima from all the selected detail subbands $\{L_{MIN}, L_{MIN}+1, \ldots, L_{MAX}\}$ into a single list of breakpoints. To take into account the possibility that the same breakpoint is detected at several levels with a slight offset, we use the following procedure. We first select all the significant coefficients detected at $L_{MIN}$, the finest detail level, and add them to the final list of breakpoints. We then add coefficients from level $L = L_{MIN} + 1$, provided that they are at least $2^{L-1} + 1$ measurements away from any breakpoint in the final list. This step is repeated for all remaining subbands $L = L_{MIN} + 2, \ldots, L_{MAX}$. At the end of this process we remain with a single list of significant breakpoints in $y[n]$. Given the list of breakpoints, we estimate the piecewise constant signal $x[n]$ by setting the value of the signal between two consecutive breakpoints to be the average of all probes in $y[n]$ over that interval.

Two parameters need to be selected properly for HaarSeg: (1) the set of detail subbands $\{L_{MIN}, L_{MIN}+1, \ldots, L_{MAX}\}$; (2) the FDR parameter $q$. The values of $L_{MIN}$ and $L_{MAX}$ are determined by the sampling resolution of our measurements. As $L_{MIN}$ increases, we are less sensitive to noise, but are also less likely to detect short segments in the data. As a general rule of thumb, if we expect a single segment in the data to span at least $k$ probes, then we choose:

$$ L_{MIN} = \lceil \log_2 k \rceil. $$

$L_{MAX}$ should be set large enough to reduce the sensitivity to noise, but small enough to avoid detection of slow, unimportant changes in the data, such as the genome-wide technical artifact described in Marioni et al. (2007). In all our experiments we used detail subbands {1, 2, 3, 4, 5}. The FDR parameter $0 < q < 0.5$ controls the false discovery rate of breakpoints in the data. Low values of $q$ will reduce the false-positives at the possible cost of increasing the false-negatives, and vice versa.

Let $N$ be the total number of measurements in $y[n]$. Calculating $w_L[n]$ in the case of Haar UDWT (1) can be performed in $O(N)$ operations regardless of the size of $L$, since it can be viewed as the difference between two running averages. FDR thresholding, applied to the transform maxima, has complexity $O(N \log N)$ as it requires sorting the data. Since the entire procedure is applied to a small finite set of detail subbands, the total complexity remains $O(N \log N)$.

Application to aCGH

We demonstrate the flexibility of HaarSeg by suggesting two extensions which are specific to aCGH. In Section 3 we show that these extensions lead to better segmentation on real aCGH data.

Quality of Measurement

Each raw measurement, red or green, is estimated from a set of pixels associated with the same probe on the array. The median is usually used to estimate the raw measurement from the set of pixels. Current array platforms often provide the user with a value of $\sigma[n]$, which is the empirical standard deviation of the pixels corresponding to $y[n]$. A high $\sigma[n]$ indicates a poor measurement. The use of this additional information in a segmentation algorithm was first suggested in Lipson et al. (2005). This quality measure can be easily incorporated into our framework as well. Two steps need adjustment: the calculation of the wavelet coefficients and the final signal reconstruction.

The coefficients $w_L[n]$ in (1) can be rewritten as the difference between two averages:

$$ w_L[n] = \sqrt{2^{L-1}} \left( \frac{1}{2^L} \sum_{k=n}^{n+2^L-1} y[k] - \frac{1}{2^L} \sum_{k=n-2^L}^{n-1} y[k] \right). $$

When each probe has a different variance, we suggest using the difference between two weighted averages for the calculation of $w_L[n]$:

$$ w_L[n] = \sqrt{2^{L-1}} \left( \frac{\sum_{k=n}^{n+2^L-1} y[k] / \sigma^2[k]}{\sum_{k=n}^{n+2^L-1} 1 / \sigma^2[k]} - \frac{\sum_{k=n-2^L}^{n-1} y[k] / \sigma^2[k]}{\sum_{k=n-2^L}^{n-1} 1 / \sigma^2[k]} \right). \tag{4} $$
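The weighted coefficients can be sketched along the same lines as the unweighted transform. The Python sketch below is illustrative only: the function name `weighted_haar_udwt` is hypothetical, the weights are taken to be inverse variances 1/σ²[k] (an assumption of this sketch), and coefficients without a full window on both sides are set to zero, which the text does not specify.

```python
import numpy as np

def weighted_haar_udwt(y, sigma, level):
    """Difference of two weighted averages, scaled by sqrt(2**(level - 1)).

    Assumptions: weights are 1 / sigma**2, and positions without a full
    window of 2**level probes on each side are set to 0.
    """
    y = np.asarray(y, dtype=float)
    wt = 1.0 / np.asarray(sigma, dtype=float) ** 2     # inverse-variance weights
    half = 2 ** level
    n_probes = y.size
    cy = np.concatenate(([0.0], np.cumsum(wt * y)))    # running weighted sums
    cw = np.concatenate(([0.0], np.cumsum(wt)))        # running weight totals
    w = np.zeros(n_probes)
    n = np.arange(half, n_probes - half + 1)
    if n.size:
        right = (cy[n + half] - cy[n]) / (cw[n + half] - cw[n])
        left = (cy[n] - cy[n - half]) / (cw[n] - cw[n - half])
        w[n] = np.sqrt(half / 2.0) * (right - left)
    return w
```

With constant `sigma` the result coincides with the unweighted coefficients, while a probe flagged with a very large `sigma` contributes almost nothing to its window average.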

Note that when $\sigma[n]$ is constant for all $n$, (4) reduces to the original definition of $w_L[n]$ in (1). To reconstruct the signal we use a weighted average instead, in order to estimate the signal values between two consecutive breakpoints:

$$ \hat{\mu} = \frac{\sum_{n} y[n] / \sigma^2[n]}{\sum_{n} 1 / \sigma^2[n]}, $$

where the sums run over the probes between the two breakpoints.

Non-stationary Variance

In real CGH data, we observed that while most of the $\log(R/G)$ measurements have similar variance, there are segments of measurements with larger variance. Typically the raw measurements in those segments, either red or green, have a very low value compared to the rest of the raw measurements. An example from real data is shown in Figure 4. Note that in the previous subsection we discussed the variance of pixels inside the same probe, while now we consider the variance between consecutive probes. The connection between low value of the raw measurements and large variance of the log ratio can be explained by sensitivity analysis of the log ratio function:

$$ \frac{\partial \log(R/G)}{\partial R} = \frac{1}{R}, \qquad \frac{\partial \log(R/G)}{\partial G} = -\frac{1}{G}. $$

Hence, if all the raw measurements are perturbed with the same additive noise, then raw measurements with lower values will result in larger variations of the log ratio signal. In the case of gene expression microarrays, several variance stabilization and normalization techniques have been suggested to cope with non-stationary variance. For example, see the review of Steinhoff and Vingron (2006).

In order to adjust HaarSeg to reduce the effect of the non-stationary variance, we suggest splitting the transform peaks into two groups: a group of high variance, containing peaks that correspond to low raw measurements, and a group of typical variance that corresponds to the remaining measurements. We adjust the FDR thresholding to use these two variances accordingly, by suggesting the following scheme:

• Create a binary mask $b[n]$ using a fixed threshold $T_{NSV}$. Values of "1" correspond to probes with low raw measurements:

$$ b[n] = \begin{cases} 1 & \text{if } \min(R[n], G[n]) < T_{NSV} \\ 0 & \text{else} \end{cases} $$

• For each detail subband $w_L[n]$, defined in (1), calculate a matching binary mask $b_L[n]$. True values in $b_L[n]$ indicate that at least half of the measurements used to calculate $w_L[n]$ were marked as high variance in $b[n]$:

$$ b_L[n] = \begin{cases} 1 & \text{if } \dfrac{1}{2^{L+1}} \sum_{k=n-2^L}^{n+2^L-1} b[k] \ge 0.5 \\ 0 & \text{else} \end{cases} $$

• We estimate two standard deviations from the finest detail subband, $w[n]$, by splitting it into two groups according to the mask $b[n]$ and using the estimator in (3) on each group: $\hat{\sigma}_{high}$ from the coefficients with $b[n] = 1$, and $\hat{\sigma}_{typical}$ from the coefficients with $b[n] = 0$.

• Update the transform peaks $z_L[k]$, defined in (2), such that all the peaks will have the same standard deviation:

$$ z'_L[k] = \begin{cases} z_L[k] / \hat{\sigma}_{high} & \text{if } b_L[k] = 1 \\ z_L[k] / \hat{\sigma}_{typical} & \text{else} \end{cases} $$

• Apply FDR thresholding on $z'_L[k]$, using a standard deviation of 1.

We set $T_{NSV}$ to a fixed value of 50 in our CGH analysis below.
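The scheme above can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the names `mad_sigma` and `nsv_rescale` are hypothetical, `raw_min` is assumed to hold min(R[n], G[n]) for each probe, windows are clipped at the ends of the chromosome, and when no probe is flagged as high variance the typical estimate is reused.

```python
import numpy as np

def mad_sigma(values):
    """Robust MAD estimate of the standard deviation, as in equation (3)."""
    return np.median(np.abs(values)) / 0.6745

def nsv_rescale(peak_idx, peak_val, w_fine, raw_min, level, t_nsv=50.0):
    """Rescale peak magnitudes so both variance groups have unit deviation.

    peak_idx, peak_val: positions and magnitudes of the peaks of one subband.
    w_fine: finest detail subband, used for the two sigma estimates.
    raw_min: per-probe minimum of the raw red and green values (assumed input).
    """
    peak_idx = np.asarray(peak_idx, dtype=int)
    peak_val = np.asarray(peak_val, dtype=float)
    w_fine = np.asarray(w_fine, dtype=float)
    b = (np.asarray(raw_min, dtype=float) < t_nsv).astype(float)   # mask b[n]
    half = 2 ** level
    csum = np.concatenate(([0.0], np.cumsum(b)))
    lo = np.clip(peak_idx - half, 0, b.size)       # window clipped at the edges
    hi = np.clip(peak_idx + half, 0, b.size)
    b_level = (csum[hi] - csum[lo]) / (hi - lo) >= 0.5   # at least half flagged
    sigma_typ = mad_sigma(w_fine[b == 0])
    high = w_fine[b == 1]
    sigma_high = mad_sigma(high) if high.size else sigma_typ   # fallback (assumption)
    return np.where(b_level, peak_val / sigma_high, peak_val / sigma_typ)
```

The rescaled peaks can then be passed to an FDR thresholding routine with a standard deviation of 1, as in the last step of the scheme.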

Determining Aberrant Intervals

In the segmentation process of CGH arrays there is a need to determine which segments are aberrant, and set the remaining segments to zero. As in CBS, CGHseg, and other segmentation methods, we approach this as a post-processing step. Several algorithms have been proposed for this task. A simple suggestion is to consider all segments with values outside $m$ times the standard deviation range to be aberrant (Hodgson et al., 2001), where $m$ is frequently set to 3. An iterative method based on non-parametric statistical tests called MergeLevels was suggested in Willenbrock et al. (2005). Tibshirani et al. (2008) used an FDR based approach. In our tests we used the simple method of considering all segments with values outside $m$ times the standard deviation range to be aberrant. To estimate the standard deviation, we calculate the difference between $y[n]$, the original signal, and $x[n]$, the segmentation result, and apply the robust MAD estimator:

$$ \hat{\sigma} = \operatorname{median}\left( \left| y[n] - x[n] \right| \right) / 0.6745. $$

Any other preferred method can be used instead, as this is simply a post-processing step.
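As a small illustration of this post-processing step, the Python sketch below first rebuilds the piecewise constant signal from a list of breakpoints (unweighted segment means) and then zeroes every segment whose value lies within m times the MAD-based standard deviation. The function names are hypothetical, and breakpoints are assumed to be the indices at which a new segment starts.

```python
import numpy as np

def segment_means(y, breakpoints):
    """Piecewise constant fit: mean of y between consecutive breakpoints."""
    y = np.asarray(y, dtype=float)
    edges = sorted({0, y.size, *(int(b) for b in breakpoints)})
    x = np.empty_like(y)
    for start, stop in zip(edges[:-1], edges[1:]):
        x[start:stop] = y[start:stop].mean()
    return x

def call_aberrant(y, x, m=3.0):
    """Keep segment values outside m * sigma, set the remaining ones to zero."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    sigma = np.median(np.abs(y - x)) / 0.6745      # MAD of the residuals
    return np.where(np.abs(x) > m * sigma, x, 0.0), sigma
```

For example, `call_aberrant(y, segment_means(y, [4, 8]))` returns the thresholded segmentation together with the noise estimate used for the threshold.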

3 RESULTS

We compared the performance of HaarSeg to CBS (Olshen et al., 2004), CGHseg (Picard et al., 2005), and the wavelet denoising method of Hsu et al. (2005), which we denote as Wave.

Simulated Data

In their comparison study, Willenbrock et al. (2005) created simulated CGH data using empirical distributions of segment length and copy number, taken from CBS segmentation results on real data. The noise model used in this simulation is additive i.i.d. Gaussian noise. The original simulation contained 500 arrays, where every array included 20 chromosomes of 100 probes each. In order to simulate chromosome sizes which are closer to current high density CGH arrays, we modified Willenbrock's simulation to produce 100 arrays, each containing a single chromosome of 10,000 probes. We used the exact same model and noise levels used to produce the original simulations. Since this simulation does not contain quality of measurement, or the original raw red and green measurements, we use only the basic HaarSeg algorithm, without any of the suggested extensions.

In order to compare results between HaarSeg and other algorithms, we computed the true positive rate and false discovery rate for all possible aberration thresholds, and plotted the receiver operating characteristic (ROC) curve for each segmentation algorithm. We computed the true positive rate (TPR) as the number of probes inside aberrations whose fitted values are above the threshold level divided by the number of probes inside aberrations. The false discovery rate (FDR) was calculated as the number of probes outside aberrations whose fitted values are above the threshold level divided by all the probes whose fitted values are above the threshold level.

The ROC curves and running times for HaarSeg, CBS, CGHseg and Wave appear in Figure 1. HaarSeg takes only 2 seconds to produce a result for all 100 arrays; this is over 1,500 times faster than CBS, and over 9,000 times faster than CGHseg, which was the slowest method. However, the speed gain of the basic HaarSeg algorithm comes at some cost in performance. HaarSeg performs slightly worse than CBS, about 1% worse in FDR and 1% worse in TPR. HaarSeg allows higher TPR than CGHseg, but at the cost of 1% in the FDR. Wave showed the worst ROC curve among the compared methods.

[Figure 1 plot. Vertical axis: True Positive Rate. Legend with running times: CBS (3,081 sec), Wave (10 sec), CGHseg (18,944 sec), HaarSeg (2 sec).]

Fig. 1. ROC curves of the tested algorithms, using the simulation model from Willenbrock et al. (2005).
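The TPR and FDR definitions used for these ROC curves can be written down directly. The following Python sketch is illustrative; the function name `tpr_fdr` is hypothetical, and comparing the absolute fitted value with the threshold (so that deletions count as well as gains) is an assumption of this sketch.

```python
import numpy as np

def tpr_fdr(fitted, truth_mask, threshold):
    """True positive rate and false discovery rate at one aberration threshold.

    fitted: segmentation result per probe.
    truth_mask: True for probes inside simulated aberrations.
    """
    called = np.abs(np.asarray(fitted, dtype=float)) > threshold
    truth = np.asarray(truth_mask, dtype=bool)
    true_pos = np.count_nonzero(called & truth)
    false_pos = np.count_nonzero(called & ~truth)
    tpr = true_pos / truth.sum() if truth.any() else float("nan")
    fdr = false_pos / called.sum() if called.any() else 0.0
    return tpr, fdr
```

Sweeping `threshold` over the range of fitted values and plotting the resulting pairs traces one curve per segmentation method.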

Simulated Data with Quality of Measurement

In order to test the performance gain when using our suggested extensions to HaarSeg, we created a simulation based on real data. We took 3 control self-self hybridization arrays, 236,404 probes each, from de Smith et al. (2007). These arrays contain quality of measurement and the raw red and green measurements. The true segmentation result of a self-self array is zero everywhere. We used the self-self arrays to create a simulation in the following manner: We reordered the self-self arrays and created 70 arrays of 10,000 probes each. For each array we created a mask of aberrant segments. Each segment was given a slightly different height, uniformly distributed between 0.1 and 0.2. To create the aberrant mask we used the empirical length distribution of CBS, taken from Willenbrock et al. (2005).

Figure 2 shows the ROC curve of TPR vs. FDR at various thresholds, and running times for all tested algorithms. We denote HaarSeg as the basic algorithm and W-HaarSeg as the algorithm with the quality of measurement and non-stationary variance extensions described in Section 2.2. W-HaarSeg and CBS achieve the best results, where W-HaarSeg is about 1,000 times faster than CBS.

Using the empirical length distribution of CBS biases the simulation towards CBS. Short segments of 2-4 probes rarely exist since CBS is not sensitive enough to detect such segments. We therefore repeated the experiment using the segment length distribution of W-HaarSeg, taken from segmentation results on the real data in de Smith et al. (2007). Since short segments are harder to detect, we increased the segment height to be uniformly distributed between the values of 0.15 and 0.25. Figure 3 shows the ROC curve and running times for this experiment. In this case, W-HaarSeg outperforms all the other tested methods. This demonstrates that W-HaarSeg is able to detect short segments, which CBS cannot, while keeping the false positive rate low.

[Figure 2 plot. Vertical axis: True Positive Rate. Legend with running times: CBS (2,228 sec), Wave (6.7 sec), CGHseg (13,116 sec), HaarSeg (1.4 sec), W-HaarSeg (2.2 sec).]

Fig. 2. ROC curves of the tested algorithms, using a simulation based on real self-self data, with segment length distributions of CBS.

[Figure 3 plot. Vertical axis: True Positive Rate. Legend with running times: CBS (2,232 sec), Wave (8 sec), CGHseg (13,081 sec), HaarSeg (1.5 sec), W-HaarSeg (2.2 sec).]

Fig. 3. ROC curves of the tested algorithms, using a simulation based on real self-self data, with segment length distributions of W-HaarSeg.

Real High Density CGH Data

In order to test performance on real high density arrays, we used data from de Smith et al. (2007), which enables performance evaluation to some extent, and contains quality of measurement side information. de Smith et al. (2007) compared samples from 50 healthy subjects to a reference sample in order to detect copy number variations in healthy individuals. This experiment also includes 3 control self-self hybridizations of the reference sample, used to estimate false positives. Since the reference sample was a female and the test samples were males, we excluded chromosome Y from all our tests, and compensated chromosome X by adding a constant, estimated as the mean of the median of all X chromosomes in the 50 arrays. No other normalization was applied to the data. Each array therefore contains 23 chromosomes and a total of 236,404 probes. Each chromosome contains between 2,000 and 18,000 probes.

To estimate the FDR, we divided the average number of aberrant probes in the 3 self-self arrays, which we expect to be zero in the ideal case, by the average number of aberrant probes in the 50 arrays. Estimating the false negative rate is not possible on real data, where the exact true answer is not known.

We tested the performance of both HaarSeg and W-HaarSeg, which is the HaarSeg algorithm with the quality of measurement and non-stationary variance extensions described in Section 2.2. We compared results to CBS, CGHseg and Wave. For all tested segmentation methods, we used the aberrant threshold from Section 2.3 (Determining Aberrant Intervals), setting $m$ to 3.

Table 1 shows the FDR estimate, the average number of active probes in the 50 arrays, and the time it took to segment all 53 arrays with each method. W-HaarSeg has the best false positive score, less than 1%, and CBS has the next best score, 4.3%. Compared to CBS, W-HaarSeg detects more active probes on average. This suggests that W-HaarSeg has a better false negative score, since it detects more probes, with a lower false positive estimate.
Both HaarSeg and W-HaarSeg excel at running times compared to CBS and CGHseg. HaarSeg and W-HaarSeg segment the entire data in less than one minute, while CBS takes 10 hours and CGHseg takes 66 hours to produce the segmentation result.

Table 1. Results for real data

Method       FDR       Avg. number of active probes   Run time
CBS          4.3 %     4603                           36,420 sec
CGHseg       10.0 %    5031                           237,600 sec
Wave         10.7 %    6284                           121 sec
HaarSeg      9.9 %     5317                           29 sec
W-HaarSeg    0.9 %     4782                           38 sec

Figure 4 demonstrates the non-stationary variance effect in a section from a self-self array. The correct segmentation result in this case is zero everywhere. Only W-HaarSeg achieves an exact zero result for this section. Figure 5 shows an example of segmentation results of a short possible deletion spanning 4 probes. The true answer is not known, but in this example CBS was the only method that did not detect the deletion, indicating that CBS is less sensitive in the detection of short segments. This example also demonstrates the difference between the results of Wave, where each measurement has a different value, and HaarSeg, where all measurements in the same segment share the same value.

Parameter Settings

We used R package DNAcopy version 1.12 for CBS, R package tilingArray version 1.16 (Huber et al., 2006) [...] Hsu et al. (2005). For HaarSeg, we used 5 detail subbands, L = {1,2,3,4,5}, and set q to 0.05 for the simulated data in Section 3.1, and q = 0.001 for the simulated data in Section 3.2 and for the real data in Section 3.3. Running times were calculated on an AMD Athlon 64 X2 with 2GB RAM.

[Figure 4 panels: measurements; red and green raw data; CBS; CGHseg; Wave; HaarSeg; W-HaarSeg.]

Fig. 4. Segmentation results of a section from chromosome 1 in a 'self-self' array, GSM215042, demonstrating the non-stationary variance effect. Graphs are in genomic coordinates. The correct result is zero at all the probes. Segmentation results are shown after applying the aberration threshold.

4 DISCUSSION

We presented HaarSeg, a new method for the segmentation of high density aCGH. Applied on both simulated and real data, our method is considerably faster, but with a slight performance penalty compared to leading approaches. We demonstrate the flexibility of our method by suggesting two extensions. First, we propose using quality of measurement. This additional information, when it exists, enables HaarSeg to better handle outlier measurements. Second, we suggest an extension to compensate for the large variance in part of the log ratio measurements, which occurs when at least one of the raw measurements has a very low value. This extension enables HaarSeg to avoid over-segmentation. Using both additions, HaarSeg outperforms existing algorithms.

It is interesting to note that each of the two suggested extensions contributes about the same performance gain to the final result. Applying just one of the extensions, either the quality of measurement or the non-stationary variance, will result in about half the total performance gain. These extensions do not change the low complexity of HaarSeg, and running times remain short. The importance of reasonable running times will become more and more evident as microarray size and resolution continue to grow rapidly.

While we showed application of our method to aCGH, where we seek to detect breakpoints in the data, our method can also be extended to detect other interesting features in microarray data. This is a subject for future research.

[Figure 5 panels: measurements; CBS; CGHseg; Wave; HaarSeg; W-HaarSeg.]

Fig. 5. Segmentation results of a possible deletion in chromosome 6, array GSM214509. Graphs are in genomic coordinates. Segmentation results are shown before applying the aberration threshold.

ACKNOWLEDGEMENTS

The authors would like to thank Prof. Zohar Yakhini for suggesting the use of quality of measurement, for many fruitful discussions regarding aCGH data, and for useful comments on the paper. The authors would also like to thank Anya Tsalenko and Adam J. de Smith for their assistance with the real data presented in this paper, and Prof. Eran Segal for first suggesting to them the microarray segmentation problem.

REFERENCES

Abramovich,F. and Benjamini,Y. (1996) Adaptive thresholding of wavelet coefficients. Comput. Stat. Data An., (4), 351-361.

Barrett,M. et al. (2004) Comparative genomic hybridization using oligonucleotide microarrays and total genomic DNA. PNAS, 101, 17765-17770.

Benjamini,Y. and Hochberg,Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc., Ser. B, 57.

Bredel,M. et al. (2005) High-resolution genome-wide mapping of genetic alterations in human glial brain tumors. Cancer Research.

Conrad,D.F. et al. (2006) A high-resolution survey of deletion polymorphism in the human genome. Nat. Genet., 75-81.

Donoho,D.L. (1995) De-Noising by Soft-Thresholding. IEEE Transactions on Information Theory, No. 3, 613-621.

de Smith,A.J. et al. (2007) Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex diseases. Human Molecular Genetics, No. 23, 2783-2794.

Hodgson,G. et al. (2001) Genome scanning with array CGH delineates regional alterations in mouse islet carcinomas. Nat. Genet., 459-464.

Hsu,L. et al. (2005) Denoising array-based comparative genomic hybridization data using wavelets. Biostatistics, 211-226.

Huber,W. et al. (2006) Transcript mapping with high-density oligonucleotide tiling arrays. Bioinformatics.

Lai,W.R. et al. (2005) Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics, 3763-3770.

Lipson,D. et al. (2005) Interval Scores for Quality Annotated CGH Data. IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS).

Lipson,D. et al. (2006) Efficient Calculation of Interval Scores for DNA Copy Number Data Analysis. Journal of Computational Biology, No. 2, 215-228.

Mallat,S. (1998) A wavelet tour of signal processing. Academic Press, London.

Marioni,J.C. et al. (2007) Breaking the waves: improved detection of copy number variation from microarray-based comparative genomic hybridization. Genome Biology, R228, doi:10.1186/gb-2007-8-10-r228

Olshen,A.B. et al. (2004) Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, 557-572.

Perry et al. (2008) The Fine-Scale and Complex Architecture of Human Copy-Number Variation. The American Journal of Human Genetics, doi:10.1016/j.ajhg.2007.12.010

Picard,F. et al. (2005) A statistical approach for array CGH data analysis. BMC Bioinformatics, 27.

Pinkel,D. and Albertson,D.G. (2005) Array comparative genomic hybridization and its applications in cancer. Nat. Genet., (suppl), S11-S17.

Redon et al. (2006) Global variation in copy number in the human genome. Nature, 444-454.

Starck,J.L. et al. (2005) Redundant Multiscale Transforms and their Application for Morphological Component Analysis. J. Advances in Imaging and Electron Physics, pp. 287-348.

Steinhoff,C. and Vingron,M. (2006) Normalization and quantification of differential expression in gene expression microarrays. Brief. Bioinformatics, 166-177.

Tibshirani,R. and Wang,P. (2008) Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics.

Willenbrock,H. and Fridlyand,J. (2005) A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics, 4084-4091.