A CUSUM approach to the detection of copy-number neutral loss of heterozygosity

Abstract

Several genetic alterations are involved in the genesis and development of cancers. Determining whether and how each genetic alteration contributes to cancer development is fundamental for a complete understanding of human cancer etiology. Loss of heterozygosity (LOH) is one such genetic phenomenon, linked to a variety of diseases and characterized by the change from heterozygosity (the presence of both alleles of a gene) to homozygosity (the presence of only one type of allele) at a particular DNA locus. Thus the identification of DNA regions where LOH has taken place is an important issue in the health sciences. In this article we formulate LOH detection as the identification of change-points in the parameters of a mixture model and present a detection algorithm based on the cumulative sums (CUSUM) method. We found that even under mild contamination our proposal is a fast and reliable method.


Murilo S. Pinheiro, Aluísio Pinheiro and Benilton S. de Carvalho, University of Campinas, Brazil

Address for correspondence:

Aluísio Pinheiro, Department of Statistics, IMECC, University of Campinas, Rua Sérgio Buarque de Holanda, 651, 13083-855, D. Barão Geraldo, Campinas, SP, Brazil.

E-mail: [email protected] . Phone: (+55)(19)3521-6080.

Fax: (+55)(19)3521-6080.


Key words:

Beta mixture; EM algorithm; Microarray data; change-point analysis; CUSUM

Several genetic alterations, such as single base substitution, translocation, copy-number alteration (CNA) and loss of heterozygosity (LOH), are involved in the genesis and development of cancers (Albertson et al., 2003; Beroukhim et al., 2010; Stratton et al., 2009). Determining whether and how genetic alterations contribute to cancer development is paramount for human cancer etiology. One of the most common genotyping tools for the identification of those altered regions is the single nucleotide polymorphism array (SNP array). With a resolution of up to one marker for every 100 bp, SNP arrays have greatly increased the ability of geneticists to explore the structure of DNA and its effect on health and disease. Although recent technology offers even greater resolution, the cost of SNP arrays makes them a widely used technology to this day.

Copy-number alteration is a type of structural variation in which a particular region of DNA has a number of copies that differs from the expected two in the diploid genome. These alterations can be of inherited origin or the consequence of somatic mutations occurring during the development of a tumor. Although copy-number variations not [...]

$$\mathrm{LRR} = \log_2\left(\frac{R_{\text{observed}}}{R_{\text{expected}}}\right),$$

where $R_{\text{observed}}$ is the sum of the observed measures for each possible allele and $R_{\text{expected}}$ is the expected sum value. The second one, the B allele frequency (BAF), is

$$\mathrm{BAF} = \begin{cases} 0 & \text{if } \theta < \theta_{AA}, \\ 0.5\,(\theta - \theta_{AA})/(\theta_{AB} - \theta_{AA}) & \text{if } \theta_{AA} \le \theta < \theta_{AB}, \\ 0.5 + 0.5\,(\theta - \theta_{AB})/(\theta_{BB} - \theta_{AB}) & \text{if } \theta_{AB} \le \theta < \theta_{BB}, \\ 1 & \text{if } \theta \ge \theta_{BB}, \end{cases}$$

where $\theta = \arctan(X_B/X_A)/(\pi/2)$ is a measure of relative allele frequency. It is clear from those definitions that LRR is related to the copy-number of a locus, while BAF is related to the proportion of one of the alleles.

We present a new method for detecting copy-number neutral LOH (CNNLOH) regions based exclusively on the BAF sequence. Segments of the DNA where LOH has taken place are detected in the BAF plot as the absence of a central band (or bands in the case of copy-number four), as illustrated in Figure 1.

Figure 1: BAF sequence in which the central region is affected by LOH. What characterizes the presence of LOH is the persistent absence of observations close to 0.5.

Several approaches have been proposed for the identification of CNA and LOH in paired and unpaired comparative genomic hybridization and SNP array data, such as agglomerative clustering (Wang et al., 2005), penalized likelihood (Picard et al., 2005), circular binary segmentation (Olshen et al., 2004), piecewise linear models (Muggeo and Adelfio, 2010) and hidden Markov models (HMM) (Yau, 2013; Wang [...]

We assume that $\{x_1, \dots, x_n\}$ is the BAF data resulting from a SNP array study and that we already know that the copy number in this sequence is two. We are interested in detecting regions of this sample where the BAF data appears to be lacking the central band. This means that differences within the upper and lower strips are irrelevant for our purposes. For this reason we perform the following transformation on the data:

$$y_i = 2 \times |x_i - 1/2|.$$

This transformed sample $\{y_1, \dots, y_n\}$ will be called transformed BAF (tBAF).

The tBAF data follows the pattern illustrated in Figure 2. For non-LOH regions there are clearly two bands in the tBAF plot, one running near 1 and another near 0. We adopt the convention of calling upper band the band that runs near 1 and lower band any band running below the upper one. The presence of LOH can be identified in the tBAF plot by the absence of an identifiable lower band.

Figure 2: BAF and tBAF plots for LOH and non-LOH regions (panels: BAF non-LOH, tBAF non-LOH, BAF with LOH, tBAF with LOH; horizontal axis: Index). Considering first the non-LOH region, we can see that after transformation the tBAF sequence is characterized by two observation strips, one running close to 1 and another close to 0. When a region affected by LOH is transformed there are very few observations close to 0, but the strip close to 1 is still present.

We propose adopting a mixture of two distributions as a model for the tBAF distribution. The first distribution describes the stochastic behavior of the upper band, whilst the second component should yield the lower band behavior. Take $f_0$ as the density associated to observations in the lower band and $f_1$ for the observations on the upper band. Our model for both non-LOH and LOH regions will be of the form

$$p(y \mid \xi) = (1 - \pi) f_1(y \mid \xi) + \pi f_0(y \mid \xi),$$

where $y$ is the tBAF vector of observations, $\xi$ is a vector containing all the parameters necessary for characterizing our distribution and $\pi$ is the probability of drawing an observation from $f_0$. The parameter $\pi$ can be interpreted as the probability of observing a heterozygous DNA locus and will be called here the homozygosity level. In some methods it is assumed known (Chen et al., 2013), but its availability for a particular combination of platform and population is not always guaranteed. For this reason we propose an estimation method that requires only the available sample.

[...] $\pi$ in our statistical model, i.e., we expect the density function describing LOH regions to have a much smaller coefficient associated to observing a value drawn from $f_0$. From this observation we can formulate two such models:

$$p_0(y \mid \xi) = \pi_0 f_0(y \mid \xi) + (1 - \pi_0) f_1(y \mid \xi),$$

$$p_1(y \mid \xi) = \pi_1 f_0(y \mid \xi) + (1 - \pi_1) f_1(y \mid \xi),$$

where $p_0$ is associated to non-LOH regions, $p_1$ to LOH regions and $\pi_0 > \pi_1$. In fact, since the LOH region is supposed to lack its lower band, we may assume that $\pi_1 = \delta \pi_0$ with $0 \le \delta < 1$.

For $f_1$, which describes the upper band, we choose the one inflated beta (OIB) distribution, whose density is given by

$$f_{\mathrm{OIB}}(y \mid \theta_1, \alpha) = \begin{cases} \theta_1, & \text{if } y = 1, \\ (1 - \theta_1)\,\alpha\, y^{\alpha - 1}, & \text{if } 0 < y < 1, \end{cases}$$

where $\theta_1 \in [0, 1]$ is the probability of observing a 1 and $\alpha \in (0, +\infty)$.

The choice of $f_0$, which describes the lower band, is the zero inflated beta (ZIB) distribution (Ospina and Ferrari, 2010), whose density is given by

$$f_{\mathrm{ZIB}}(y \mid \theta_0, \beta) = \begin{cases} \theta_0, & \text{if } y = 0, \\ (1 - \theta_0)\,\beta\,(1 - y)^{\beta - 1}, & \text{if } 0 < y < 1, \end{cases}$$

where $\theta_0 \in [0, 1]$ is the probability of observing a 0 and $\beta \in (0, +\infty)$.

The parameters of the proposed model are estimated by applying the Expectation Maximization (EM) algorithm to a set of observations where CNNLOH is absent. This set can be obtained from the application of the microarray technology to somatic tissue or by visually inspecting the tBAF sequence and identifying regions without LOH.
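The tBAF transformation and the two-component inflated beta mixture can be written down in a few lines. The following is only a minimal sketch: the function names and every numeric parameter value are hypothetical, chosen for illustration, and it assumes the convention that the zero inflated component (weight $\pi$) models the lower band and the one inflated component models the upper band.

```python
import math


def tbaf(x):
    """Fold a BAF value x in [0, 1] around 0.5: y = 2 * |x - 1/2|."""
    return 2.0 * abs(x - 0.5)


def oib_density(y, theta1, alpha):
    """One inflated beta: point mass theta1 at y = 1, density on (0, 1) otherwise."""
    if y == 1.0:
        return theta1
    if 0.0 < y < 1.0:
        return (1.0 - theta1) * alpha * y ** (alpha - 1.0)
    return 0.0


def zib_density(y, theta0, beta):
    """Zero inflated beta: point mass theta0 at y = 0, density on (0, 1) otherwise."""
    if y == 0.0:
        return theta0
    if 0.0 < y < 1.0:
        return (1.0 - theta0) * beta * (1.0 - y) ** (beta - 1.0)
    return 0.0


def mixture_density(y, pi, theta0, beta, theta1, alpha):
    """pi * f_ZIB(y) (lower band) + (1 - pi) * f_OIB(y) (upper band)."""
    return pi * zib_density(y, theta0, beta) + (1.0 - pi) * oib_density(y, theta1, alpha)


# Hypothetical parameter values, chosen only for illustration.
PARAMS = dict(theta0=0.05, beta=8.0, theta1=0.10, alpha=12.0)
PI_0 = 0.30            # weight of the lower band in non-LOH regions
DELTA = 0.01           # shrinkage factor, 0 <= DELTA < 1
PI_1 = DELTA * PI_0    # weight of the lower band in LOH regions


def log_p0(y):
    """Log-density of the non-LOH model at a tBAF value y."""
    return math.log(mixture_density(y, PI_0, **PARAMS))


def log_p1(y):
    """Log-density of the LOH model at a tBAF value y."""
    return math.log(mixture_density(y, PI_1, **PARAMS))
```

With these illustrative values, a tBAF observation close to 0 is far more likely under the non-LOH model than under the LOH model, while an observation close to 1 is slightly more likely under the LOH model. This asymmetry is what a likelihood-ratio detector exploits.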

As discussed in subsection 2.1, the detection of LOH regions in a tBAF sequence can be formulated in the framework of statistical process control (SPC), i.e., as the change from non-LOH (in control) to LOH (out of control), or vice versa. Despite this connection there has been little work on the application of this perspective to the problem of segmenting SNP array data, the work of Li et al. (2009) being the only example known to the authors.

There are a number of possible methods to perform SPC in a sequence of observations (Basseville et al., 1993). The CUSUM method, first proposed by Page (1954a), is one of the less commonly adopted (Hawkins, 1993). The reason is that CUSUM plots are especially designed to identify a change from a known parameter value (or distribution) to another known value (or distribution). This assumption of knowledge regarding the parameter values before and after the change is usually not reasonable in real applications (Basseville et al., 1993).

Although usually seen as a disadvantage, this characteristic of the CUSUM plot is taken here as an advantage. As shown in subsection 2.1, we are able to provide good approximations for the in-control and out-of-control distributions prior to the application of any SPC tool. The CUSUM is not only specially adapted to the situation of a change between known distributions but was also shown to be optimal (Moustakides, 1986) and asymptotically optimal (Lorden, 1971) if one sets the average delay [...]

We now provide a detailed account of the CUSUM algorithm for our application. For notational convenience we will always assume that a transition from $p_0$ to $p_1$ is to be detected.

We suppose it known that the current segment is of the non-LOH type. This means that we are working under the assumption that the observations are independent realizations of $p_0$. We say that $p_0$ is the assumed model. To detect a transition from $p_0$ to $p_1$ the CUSUM algorithm sequentially computes the instantaneous log-likelihood ratio

$$s_i = \log p_1(y_i) - \log p_0(y_i)$$

and its cumulative sums

$$S_0 = 0; \qquad S_i = \max\{0, S_{i-1} + s_i\}, \quad i \ge 1.$$

An alarm is raised at time $t$ if $S_t$ is the first cumulative sum to be greater than the alarm threshold $L_0$.

When the CUSUM detects the presence of a change-point, the next step is to estimate its location. We adopt a maximum likelihood approach to solve this problem. Suppose that the assumed model is $p_0$, that we began the cumulative sum at $t = i_0$ and that the alarm time was $t = i_1$. Our estimate of the change-point $\tau$ is

$$\hat{\tau} = \operatorname*{argmax}_{i_0 \le t \le i_1} \left\{ \sum_{j = i_0}^{t - 1} \log p_0(y_j) + \sum_{j = t}^{i_1} \log p_1(y_j) \right\}.$$

Once the change-point is estimated, all observations between the previous change-point and the newly discovered one are said to be realizations of $p_0$ (non-LOH) and the CUSUM algorithm restarts at the last change-point, now considering $p_1$ as the assumed model. When the assumed model is $p_1$ the CUSUM statistic changes to $s_i = \log p_0(y_i) - \log p_1(y_i)$ and the threshold to $L_1$. A similar procedure is then carried out until the next change-point is found, and we return to $p_0$ after identifying a LOH region.

The main issue with the CUSUM algorithm is the correct choice of the two alarm thresholds. Usually those thresholds are found based on the average run length functions $\mathrm{ARL}_{ij}(L_i)$ for a threshold value $L_i$, for $i, j \in \{0, 1\}$.
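The detection, localisation and restart steps described above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the thresholds are taken as given, and the two models are replaced by a toy stand-in in which each observation is only coded as lower band (0) or not (1).

```python
import math


def cusum_alarm(y, logp_assumed, logp_alt, threshold, start=0):
    """Return the first index t >= start with S_t > threshold, or None."""
    s = 0.0
    for i in range(start, len(y)):
        # instantaneous log-likelihood ratio of the alternative vs. the assumed model
        s = max(0.0, s + logp_alt(y[i]) - logp_assumed(y[i]))
        if s > threshold:
            return i
    return None


def changepoint_mle(y, logp_assumed, logp_alt, start, alarm):
    """Maximum likelihood change-point between `start` and `alarm` (inclusive)."""
    # t = start: every observation in [start, alarm] is attributed to the alternative
    loglik = sum(logp_alt(y[j]) for j in range(start, alarm + 1))
    best_t, best = start, loglik
    for t in range(start + 1, alarm + 1):
        # moving the change-point from t - 1 to t hands y[t - 1] to the assumed model
        loglik += logp_assumed(y[t - 1]) - logp_alt(y[t - 1])
        if loglik > best:
            best_t, best = t, loglik
    return best_t


def segment(y, logp0, logp1, L0, L1):
    """Alternate between the assumed models p0 and p1; return estimated change-points."""
    models = [(logp0, logp1, L0), (logp1, logp0, L1)]
    state, start, changes = 0, 0, []   # state 0: non-LOH assumed, state 1: LOH assumed
    while True:
        assumed, alt, threshold = models[state]
        alarm = cusum_alarm(y, assumed, alt, threshold, start)
        if alarm is None:
            return changes
        tau = changepoint_mle(y, assumed, alt, start, alarm)
        changes.append(tau)
        state, start = 1 - state, tau   # restart at the change-point, roles swapped


# Toy stand-in for the tBAF mixture (hypothetical): an observation is coded 0 when it
# falls in the lower band and 1 otherwise, with lower-band probability 0.3 (non-LOH)
# or 0.003 (LOH).
def band_logp(pi_lower):
    return lambda v: math.log(pi_lower if v == 0 else 1.0 - pi_lower)


logp0, logp1 = band_logp(0.3), band_logp(0.003)
non_loh = [0, 1, 1, 0, 1, 0, 1, 1, 0, 1] * 3
y = non_loh + [1] * 60 + non_loh          # LOH block occupies indices 30..89
print(segment(y, logp0, logp1, L0=5.0, L1=5.0))
```

On this toy sequence the sketch prints `[29, 90]`. The first estimate falls one position before the true boundary at index 30 because observation 29 lies in the upper band and is therefore compatible with both models.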
In $\mathrm{ARL}_{ij}$, the index $i$ is that of the assumed model and $j$ that of the observed model, i.e., if $i = 0$ and $j = 1$ the assumed model is $p_0$ and the observations are being generated from $p_1$. It follows that, for example, $\mathrm{ARL}_{01}$ is the expected time to call a change-point from $p_0$ to $p_1$ and $\mathrm{ARL}_{00}$ is the expected time to a false alarm when the assumed model is $p_0$.

Page (1954a) shows that the ARL function is the solution of an integral equation of the Fredholm type, but an analytical solution does not always exist. To overcome this difficulty numerical approximations have been suggested (Goel and Wu, 1971; Page, 1954b; Siegmund, 2013; Brook and Evans, 1972; Lucas, 1982). In the next subsection we propose a novel approach to the threshold selection.

Here we do not base threshold values on the previous approximations of the ARL function. The choice of a particular value of average run length is difficult and it has no theoretical meaning to geneticists performing the analysis. We propose a criterion for selecting a threshold based on a definition of segments with small length and a level of tolerance for segments with smaller lengths, as follows.

[...] $m$. This restriction of a minimum length can be integrated in the CUSUM method by imposing a condition on the probability of raising an alarm only after changes that persist for at least $m$ observations. Take $p_0$ as our supposed model and let $\{y_1, y_2, \dots\}$ be a sequence of independent realizations of $p_1$. Consider $S_i$, $i = 0, 1, \dots$, the sequence of sums resulting from an application of the CUSUM algorithm to $\{y_1, y_2, \dots\}$. Given a threshold $L$ and a probability $\alpha \in (0, 1)$, we require that

$$P(R_m < L) \ge 1 - \alpha \qquad (2.1)$$

where $R_m = \max_{1 \le i \le m} S_i$ and $\alpha$ is the predefined level of tolerance. This restriction can be interpreted as imposing the condition of only raising an early alarm when there is abundant evidence of a change-point.

For any two given values of $m$ and $\alpha$ there are infinitely many values of $L$ such that Equation 2.1 is valid.
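The criterion in Equation 2.1 can be explored by simulation. The sketch below simulates sequences of length $m$, records the maximum CUSUM value $R_m$ of each, and returns an empirical $(1 - \alpha)$ quantile as a candidate threshold. It rests on assumptions that should be checked against the actual model: the function names are hypothetical, the simulated observations are taken to come from the alternative model, and a toy lower-band/upper-band coding stands in for the fitted mixture. Since the alarm rule is $S_t > L$, "no alarm within $m$ observations" corresponds to $R_m \le L$ in the code.

```python
import math
import random


def max_cusum(y, logp_assumed, logp_alt):
    """R_m: the largest CUSUM value S_i reached over the sequence y (S_0 = 0)."""
    s = r = 0.0
    for v in y:
        s = max(0.0, s + logp_alt(v) - logp_assumed(v))
        r = max(r, s)
    return r


def simulated_threshold(sample_alt, logp_assumed, logp_alt, m, alpha,
                        n_sim=10000, seed=0):
    """Empirical (1 - alpha) quantile of R_m over n_sim simulated sequences of length m."""
    rng = random.Random(seed)
    r = sorted(
        max_cusum([sample_alt(rng) for _ in range(m)], logp_assumed, logp_alt)
        for _ in range(n_sim)
    )
    # at least (1 - alpha) * n_sim of the simulated values are <= r[k]
    k = math.ceil((1.0 - alpha) * n_sim) - 1
    return r[k]


# Toy stand-in for the fitted mixture (hypothetical): an observation is coded 0 when it
# falls in the lower band and 1 otherwise.
def band_logp(pi_lower):
    return lambda v: math.log(pi_lower if v == 0 else 1.0 - pi_lower)


def band_sampler(pi_lower):
    return lambda rng: 0 if rng.random() < pi_lower else 1


# Threshold used while the non-LOH model (lower-band probability 0.3) is assumed,
# simulating from the LOH model (lower-band probability 0.003).
L0 = simulated_threshold(band_sampler(0.003), band_logp(0.3), band_logp(0.003),
                         m=10, alpha=0.05, n_sim=2000)
```

Swapping the roles of the two models in the call gives the threshold used while the LOH model is assumed. Larger values of `m` yield larger thresholds, which is how the minimum segment length enters the procedure.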
One value of $L$ satisfying Equation 2.1 is the $(1 - \alpha)$th quantile of the distribution of $R_m$. This value of $L$ can be estimated by bootstrapping the distribution of $R_m$ from simulations of the estimated distribution $p_1$. The threshold for the assumed model $p_1$ can be found in a similar way.

In this section we present a set of computational simulations where we evaluated the ability of the presented segmentation method to correctly identify regions with and without LOH along a BAF sequence. In order for the study to be as close as possible to the situations encountered in practice we use the BAF samples available in the "acnr" R package (Pierre-Jean and Neuvial, 2016).

We simulate a set of sequences by randomly sampling with replacement the observations available in the "acnr" package. The structure of the sequences is always as follows: a first block is selected from the population with neutral copy number and without LOH, then a block of length $l$ is selected from the population with CNNLOH, and finally a last block is sampled from the population with neutral copy number and without LOH. We use in our study the values $l \in \{25, 50, 100\}$. In addition, we consider the cases where the purity of the sample assumes the values 1, 0.79 and 0.5. We use $\delta = 10^{-[\dots]}$ and $\alpha = 0.05$ as the parameters for the LOH model and the threshold calculation, respectively. We also consider $m \in \{10, 25, 50\}$.

In order to evaluate the segmentation quality, we note that the problem of detecting LOH regions can be seen as a classification problem with only two categories. In our case we call negative (0) the observations in regions without LOH and positive (1) the observations in regions with LOH. Then for each simulated sequence we have a base sequence $t_1, t_2, \dots$ of zeros and ones describing to which class each observation belongs. If $I_1, I_2, \dots$ is the sequence of zeros and ones indicating the classification obtained with the proposed method, we define

$$\mathrm{TP} = \sum_{i} \mathbb{1}\{t_i = 1, I_i = 1\}, \qquad \mathrm{FP} = \sum_{i} \mathbb{1}\{t_i = 0, I_i = 1\},$$

$$\mathrm{TN} = \sum_{i} \mathbb{1}\{t_i = 0, I_i = 0\}, \qquad \mathrm{FN} = \sum_{i} \mathbb{1}\{t_i = 1, I_i = 0\}.$$

We use the measures of sensitivity $= \mathrm{TP}/(\mathrm{TP} + \mathrm{FN})$ and of specificity $= \mathrm{TN}/(\mathrm{TN} + \mathrm{FP})$ to evaluate the performance of our method. Table 1 and Table 2 present the means of sensitivity and specificity, respectively, for 100 replicates of each simulated sequence.

Table 1: Sensitivity results. When $m$ is small and purity is greater than or close to 0.79 our method is able to detect a large percentage of the LOH regions. The method's ability to detect LOH regions decreases with the sample purity.

| Purity | l | m = 10 | m = 25 | m = 50 |
|---|---|---|---|---|
| 1 | 25 | 0.97 | 0.69 | 0 |
| 1 | 50 | 0.98 | 0.98 | 0.64 |
| 1 | 100 | 0.99 | 0.99 | 0.99 |
| 0.79 | 25 | 0.94 | 0.33 | 0.0000 |
| 0.79 | 50 | 0.97 | 0.97 | 0.06 |
| 0.79 | 100 | 0.99 | 0.9 | 0.90 |
| 0.5 | 25 | 0.04 | 0.01 | 0 |
| 0.5 | 50 | 0.05 | 0 | 0 |
| 0.5 | 100 | 0.07 | 0 | 0 |

Table 1 shows that our method correctly detected regions with LOH in cases with purity equal to one and $m < l$. When $m \ge l$ the segmentation quality is noticeably lower. As the purity decreases the segmentation quality also becomes worse. For the case of purity 0.5 the method was unable to correctly identify the LOH regions in the vast majority of cases. For purity 0.79 the situation is not as severe, with the case $m = 10$ yielding reasonably good results.

Table 2 shows that our proposed method does not overestimate the regions with LOH but misses some of them when the value of $m$ is close to or bigger than the region length or when the sample purity is close to 0.5.

Table 2: Specificity results. As one would expect, the smaller $m$ is the less specific the method is, since smaller values of $m$ make the segmentation more susceptible to false discoveries. In no instance is the false-discovery rate big enough to cause concern.

| Purity | l | m = 10 | m = 25 | m = 50 |
|---|---|---|---|---|
| 1 | 25 | 0.94 | 0.99 | 1 |
| 1 | 50 | 0.95 | 0.99 | 0.99 |
| 1 | 100 | 0.95 | 0.99 | 0.99 |
| 0.79 | 25 | 0.95 | 1 | 1 |
| 0.79 | 50 | 0.95 | 1 | 1 |
| 0.79 | 100 | 0.94 | 0.99 | 1 |
| 0.5 | 25 | 0.95 | 1 | 1 |
| 0.5 | 50 | 0.95 | 1 | 1 |
| 0.5 | 100 | 0.95 | 1 | 1 |

We will now apply the proposed method to a real data set. The data set we use consists of 482 benign and tumor samples from 259 men with prostate cancer studied in Ross-Adams et al. (2015).

We apply the oncoSNP segmentation procedure (Yau, 2013) to all tumoral samples [...] $\alpha = 0.[\dots]$, $\delta = 0.01$ and we used 10000 simulations to estimate the two threshold values. Table 3 presents the sensitivity and specificity results.

Table 3: Mean execution time, sensitivity and specificity of our method assuming the results of oncoSNP as the gold standard.

| | m = 25 | m = 50 | m = 100 | m = 150 |
|---|---|---|---|---|
| Execution time (s) | 46.6212 | 47.2890 | 46.2117 | 44.6742 |
| Sensitivity | 0.9517 | 0.9050 | 0.8269 | 0.7958 |
| Specificity | 0.7939 | 0.8706 | 0.9392 | 0.9573 |

It is clear that for all values of $m$ our method correctly detected most of the LOH regions pointed out by oncoSNP. As one would expect, smaller values of $m$ detect more oncoSNP segments than greater values of $m$ do. In terms of specificity the opposite is true: greater values of $m$ result in greater specificity. This is also not surprising.

In terms of execution time our procedure shows a great advantage in comparison to oncoSNP. The worst mean execution time of our procedure is 47.29 seconds. If we remove the time necessary for data loading, this time reduces to a maximum of 9.72 seconds. This is a very small execution time when compared to the minutes-long mean execution time of oncoSNP.

To explore the segmentation features we choose a small region of one of the segments and look at the regions of LOH detected by oncoSNP and by our method. We look at the segmentation made by three different parameter choices for our method: $m = 100$ and $\delta = 10^{-[\dots]}$, $m = 50$ and $\delta = 10^{-[\dots]}$, $m = 50$ and $\delta = 10^{-[\dots]}$. In all cases we adopt $\alpha = 0.05$ and 10000 simulations in the threshold estimation process. The segmentation results are presented in Figure 3, where the order of the panels (a), (b), (c) and (d) follows the order in which we presented the algorithms.

The first thing we note in Figure 3 is that oncoSNP identifies only two regions with LOH and that the regions detected by oncoSNP are also detected by the proposed method. The largest of these regions coincides with that identified by oncoSNP, and the larger number of regions found by our method accounts for specificity values between 0.70 and 0.90.

The effect of choosing $m$ is not difficult to interpret and is clearly shown in Figure 3. Smaller values of $m$ provide a segmentation with smaller identified regions. The effect of $\delta$ is more subtle and can be seen in Figure 3 as a reduction of the first identified segment in panels (c) and (d). This happens because $\delta$ decreases the method's tolerance for the existence of points close to 0.5 within regions declared to contain LOH. Note that in the largest region with identified LOH there is a near-half observation in all segmentations built by our method. This is because this observation is isolated within the segment and therefore its influence is not as strong as that in the first segment identified by the last two methods considered.

The segmentation method we propose is conceptually simple in addition to being easily implemented. A mixture of inflated betas models the BAF data, allowing a fast model estimation procedure with the help of the EM algorithm. We segment the data set with the CUSUM technique, which originated in statistical process control and has optimal characteristics, resulting in a fast and accurate segmentation.

The great advantages of the method are that the estimation procedure can be performed on a small portion of the data set, resulting in a faster execution time in comparison to the methods that use HMM-based approaches.
The segmentation features can also be adjusted by appropriately selecting the parameters $m$, associated to the minimal length of a segment, and $\delta$, related to the level of tolerance to near-half observations inside LOH regions. The method is robust to mild levels of contamination.

The fast performance of the proposed method is of great importance to the analysis of the most recent array technologies, given the large amount of data resulting from their application. For example, the most recent SNP arrays can produce more than 2 million observations and whole-genome shotgun sequencing produces approximately 3 billion observations, one for each base in the DNA chain. The sheer size of those numbers makes clear the need for fast methods in the analysis of genetics-related experiments. We believe that our proposed method is an advancement in this direction.

Acknowledgements

We want to thank FAPESP (grant 2013/00506-1) and CAPES for providing the resources for the realization of this project.

References

Albertson, D. G., Collins, C., McCormick, F., and Gray, J. W. (2003). Chromosome aberrations in solid tumors. Nature Genetics, (4), 369–376.

Basseville, M., Nikiforov, I. V., et al. (1993). Detection of abrupt changes: theory and application, volume 104. Prentice Hall, Englewood Cliffs.

Beroukhim, R., Lin, M., Park, Y., Hao, K., Zhao, X., Garraway, L. A., Fox, E. A., Hochberg, E. P., Mellinghoff, I. K., Hofer, M. D., et al. (2006). Inferring loss-of-heterozygosity from unpaired tumors using high-density oligonucleotide SNP arrays. PLoS Comput Biol, (5), e41.

Beroukhim, R., Mermel, C. H., Porter, D., Wei, G., Raychaudhuri, S., Donovan, J., Barretina, J., Boehm, J. S., Dobson, J., Urashima, M., et al. (2010). The landscape of somatic copy-number alteration across human cancers. Nature, (7283), 899–905.

Bignell, G. R., Huang, J., Greshock, J., Watt, S., Butler, A., West, S., Grigorova, M., Jones, K. W., Wei, W., Stratton, M. R., et al. (2004). High-resolution analysis of DNA copy number using oligonucleotide microarrays. Genome Research, (2), 287–295.

Brook and Evans (1972). [...] Biometrika, (3), 539–549.

Chen, G. K., Chang, X., Curtis, C., and Wang, K. (2013). Precise inference of copy number alterations in tumor samples from SNP arrays. Bioinformatics, (23), 2964–2970.

Goel, A. L. and Wu, S. (1971). Determination of ARL and a contour nomogram for CUSUM charts to control normal mean. Technometrics, (2), 221–230.

Ha, G., Roth, A., Lai, D., Bashashati, A., Ding, J., Goya, R., Giuliany, R., Rosner, J., Oloumi, A., Shumansky, K., et al. (2012). Integrative analysis of genome-wide loss of heterozygosity and monoallelic expression at nucleotide resolution reveals disrupted pathways in triple-negative breast cancer. Genome Research, (10), 1995–2007.

Hawkins, D. M. (1993). Cumulative sum control charting: an underutilized SPC tool. Quality Engineering, (3), 463–477.

Huang, J., Wei, W., Zhang, J., Liu, G., Bignell, G. R., Stratton, M. R., Futreal, P. A., Wooster, R., Jones, K. W., and Shapero, M. H. (2004). Whole genome DNA copy number changes identified by high density oligonucleotide arrays. Human Genomics, (4), 1.

Jänne, P. A., Li, C., Zhao, X., Girard, L., Chen, T.-H., Minna, J., Christiani, D. C., Johnson, B. E., and Meyerson, M. (2004). High-resolution single-nucleotide polymorphism array and clustering analysis of loss of heterozygosity in human lung cancer cell lines. Oncogene, (15), 2716–2726.

Li, W., Lee, A., and Gregersen, P. K. (2009). Copy-number-variation and copy-number-alteration region detection by cumulative plots. BMC Bioinformatics, (1), 1.

Lin, M., Wei, L.-J., Sellers, W. R., Lieberfarb, M., Wong, W. H., and Li, C. (2004). dChipSNP: significance curve and clustering of SNP-array-based loss-of-heterozygosity data. Bioinformatics, (8), 1233–1240.

Lorden, G. (1971). Procedures for reacting to a change in distribution. The Annals of Mathematical Statistics, pages 1897–1908.

Lucas, J. M. (1982). Combined Shewhart-CUSUM quality control schemes. Journal of Quality Technology, (2), 51–59.

Moustakides, G. V. (1986). Optimal stopping times for detecting changes in distributions. The Annals of Statistics, pages 1379–1387.

Muggeo, V. M. and Adelfio, G. (2010). Efficient change point detection for genomic sequences of continuous measurements. Bioinformatics, page btq647.

Olshen, A. B., Venkatraman, E., Lucito, R., and Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics, (4), 557–572.

Ospina, R. and Ferrari, S. L. (2010). Inflated beta distributions. Statistical Papers, (1), 111–126.

Page, E. (1954a). Continuous inspection schemes. Biometrika, (1/2), 100–115.

Page, E. (1954b). [...] Journal of the Royal Statistical Society. Series B (Methodological), pages 136–139.

Picard, F., Robin, S., Lavielle, M., Vaisse, C., and Daudin, J.-J. (2005). A statistical approach for array CGH data analysis. BMC Bioinformatics, (1), 1.

Pierre-Jean, M. and Neuvial, P. (2016). acnr: Annotated Copy-Number Regions. URL https://R-Forge.R-project.org/projects/jointseg/. R package version 0.2.5/r153.

Ross-Adams, H., Lamb, A., Dunning, M., Halim, S., Lindberg, J., Massie, C., Egevad, L., Russell, R., Ramos-Montoya, A., Vowler, S., et al. (2015). Integration of copy number and transcriptomics provides risk stratification in prostate cancer: a discovery and validation cohort study. EBioMedicine, (9), 1133–1144.

Siegmund, D. (2013). Sequential analysis: tests and confidence intervals. Springer Science & Business Media.

Stratton, M. R., Campbell, P. J., and Futreal, P. A. (2009). The cancer genome. Nature, (7239), 719–724.

Teh, M.-T., Blaydon, D., Chaplin, T., Foot, N. J., Skoulakis, S., Raghavan, M., Harwood, C. A., Proby, C. M., Philpott, M. P., Young, B. D., et al. (2005). Genomewide single nucleotide polymorphism microarray mapping in basal cell carcinomas unveils uniparental disomy as a key somatic event. Cancer Research, (19), 8597–8603.

Wang, K., Li, M., Hadley, D., Liu, R., Glessner, J., Grant, S. F., Hakonarson, H., and Bucan, M. (2007). PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Research, (11), 1665–1674.

Wang, P., Kim, Y., Pollack, J., Narasimhan, B., and Tibshirani, R. (2005). A method for calling gains and losses in array CGH data. Biostatistics, (1), 45–58.

Yau, C. (2013). OncoSNP-SEQ: a statistical approach for the identification of somatic copy number alterations from next-generation sequencing of cancer genomes. Bioinformatics, (19), 2482–2484.

Figure 3: [...] (b) $m = 100$ and $\delta = 10^{-[\dots]}$, (c) $m = 50$ and $\delta = 10^{-[\dots]}$ and (d) $m = 50$ and $\delta = 10^{-[\dots]}$. The red lines indicate a transition between regions.